Web Scraping with Playwright and Python in 2026: The Complete Guide

Published March 15, 2026 · 16 min read · Updated for Playwright 1.50+
Playwright Python Web Scraping JavaScript Rendering
TL;DR: Playwright is the most powerful browser automation tool for web scraping in 2026, but it comes with complexity, cost, and detection challenges. This guide covers everything from basic setup to advanced anti-detection, infinite scroll handling, and stealth mode. For production scraping at scale, a web scraping API handles rendering and anti-detection automatically at a fraction of the cost.

Modern websites are JavaScript-heavy. React, Next.js, Vue, Angular: over 70% of the top 10,000 websites rely on client-side rendering. Traditional HTTP scraping with requests + BeautifulSoup returns empty shells. You need a real browser.
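To see the "empty shell" problem concretely, here's a sketch using made-up SPA markup (not from any real site) and only the standard library. The server response contains just a mount point and a script bundle, so a naive HTTP scraper finds nothing to extract:

```python
from html.parser import HTMLParser

# What the server typically sends for a client-rendered page (illustrative
# markup): an empty mount point plus a script bundle. The product data is
# filled in later, in the browser, by JavaScript that never runs here.
SPA_RESPONSE = """
<html>
  <head><title>Products</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>
"""

class TextExtractor(HTMLParser):
    """Collect visible text the way a naive HTTP scraper would."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(SPA_RESPONSE)
print(parser.chunks)  # ['Products'] -- only the page title; no product data at all
```

A browser would execute `main.js` and populate `#root`; that's the gap Playwright closes.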

Enter Playwright: Microsoft's open-source browser automation library. It drives Chromium, Firefox, and WebKit with a single API, handles dynamic content natively, and has become the go-to tool for scraping JavaScript-rendered pages in 2026.

This guide covers everything you need to scrape with Playwright effectively, and helps you decide when an API is the smarter choice.

Why Playwright for Web Scraping?

Playwright has overtaken Selenium as the preferred browser automation tool for scraping. Here's why:

| Feature | Playwright | Selenium | Puppeteer |
|---|---|---|---|
| Speed | Fast (CDP protocol) | Slower (WebDriver) | Fast (CDP) |
| Browser Support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari | Chromium only |
| Auto-Waiting | Built-in | Manual waits needed | Basic |
| Language Support | Python, JS, Java, C# | Python, JS, Java, C#, Ruby | JavaScript only |
| Shadow DOM | Native support | Workarounds needed | Native support |
| Async API | First-class | Limited | First-class |
| Network Interception | Built-in routing | Via proxy | Built-in |
| iframes | Simple API | Switch context manually | contentFrame() |

Setup: Installing Playwright for Python

Get started in under 60 seconds:

# Install playwright
pip install playwright

# Download browser binaries (Chromium, Firefox, WebKit)
playwright install

# Or install just Chromium (smaller download)
playwright install chromium

Verify the installation:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # "Example Domain"
    browser.close()

Basic Web Scraping with Playwright

Let's scrape a JavaScript-rendered page. Here's a complete example that extracts product data from a dynamic e-commerce site:

from playwright.sync_api import sync_playwright
import json

def scrape_products(url: str) -> list[dict]:
    """Scrape product listings from a JS-rendered page."""
    products = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/122.0.0.0 Safari/537.36"
        )
        page = context.new_page()

        # Navigate and wait for product cards to render
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector(".product-card", timeout=10000)

        # Extract product data
        cards = page.query_selector_all(".product-card")
        for card in cards:
            name = card.query_selector(".product-name")
            price = card.query_selector(".product-price")
            rating = card.query_selector(".product-rating")

            products.append({
                "name": name.inner_text() if name else None,
                "price": price.inner_text() if price else None,
                "rating": rating.get_attribute("data-score") if rating else None,
                "url": card.query_selector("a").get_attribute("href") if card.query_selector("a") else None,
            })

        browser.close()

    return products

# Usage
products = scrape_products("https://example-shop.com/products")
print(json.dumps(products, indent=2))

Async Scraping for Better Performance

For scraping multiple pages concurrently, use Playwright's async API:

import asyncio
from playwright.async_api import async_playwright

async def scrape_page(context, url: str) -> dict:
    """Scrape a single page using a shared browser context."""
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_selector("h1", timeout=5000)

        title = await page.title()
        content = await page.inner_text("body")

        return {"url": url, "title": title, "length": len(content)}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        await page.close()

async def scrape_many(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Scrape multiple URLs with controlled concurrency."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()

        semaphore = asyncio.Semaphore(concurrency)

        async def limited_scrape(url):
            async with semaphore:
                return await scrape_page(context, url)

        results = await asyncio.gather(
            *[limited_scrape(url) for url in urls]
        )

        await browser.close()
    return results

# Scrape 20 pages, 5 at a time
urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
results = asyncio.run(scrape_many(urls, concurrency=5))
Performance tip: Use a shared browser context across pages so they share cookies and cache. Limit concurrency to 5-10 pages; each tab uses 100-300MB of RAM. For 1,000+ pages, use a web scraping API instead.

Handling JavaScript-Rendered Content

Waiting Strategies

The most common Playwright scraping mistake: not waiting for content to render. Here are the key waiting strategies:

# 1. Wait for network to be idle (all XHR/fetch complete)
await page.goto(url, wait_until="networkidle")

# 2. Wait for a specific element to appear
await page.wait_for_selector(".results-container", state="visible")

# 3. Wait for a specific element to have content
await page.wait_for_function(
    "document.querySelector('.results-count')?.textContent?.includes('results')"
)

# 4. Wait for a specific network request to complete
async with page.expect_response("**/api/products*") as response_info:
    await page.click("#load-more")
response = await response_info.value
data = await response.json()

# 5. Wait for navigation after a click
async with page.expect_navigation():
    await page.click("a.next-page")

Intercepting API Calls

Often the fastest approach: intercept the XHR/fetch calls that load data, and skip HTML parsing entirely:

import json
from playwright.sync_api import sync_playwright

def intercept_api_data(url: str) -> list[dict]:
    """Intercept API responses instead of parsing HTML."""
    captured_data = []

    def handle_response(response):
        if "/api/products" in response.url and response.status == 200:
            try:
                data = response.json()
                captured_data.append(data)
            except Exception:
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)

        page.goto(url, wait_until="networkidle")
        # Trigger pagination to capture more data
        for _ in range(5):
            next_btn = page.query_selector("button.load-more")
            if next_btn:
                next_btn.click()
                page.wait_for_timeout(2000)
            else:
                break  # No more pages to load

        browser.close()

    return captured_data
Pro tip: Intercepting API calls is 10x faster than parsing rendered HTML. Open DevTools Network tab on your target site, filter by XHR/Fetch, and look for JSON endpoints. Many SPAs have clean REST or GraphQL APIs behind the scenes.
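Once you've found the endpoint, you can often skip the browser entirely and page through the API with plain HTTP requests. Here's a sketch; the endpoint path, query parameters, and page size are all assumptions, so copy the real ones from the DevTools Network tab:

```python
# Hypothetical endpoint discovered in the Network tab -- the path and
# parameters below are placeholders, not a real API.
API_URL = "https://example-shop.com/api/products?page={page}&per_page={per_page}"

def paginated_urls(total_items: int, per_page: int = 48) -> list[str]:
    """Build direct API URLs covering every page of results."""
    pages = -(-total_items // per_page)  # ceiling division
    return [API_URL.format(page=p, per_page=per_page) for p in range(1, pages + 1)]

# 200 items at 48 per page -> 5 small JSON requests instead of 5 full page renders
urls = paginated_urls(total_items=200)
```

Each URL can then be fetched with `requests.get(url).json()`, no rendering, no selectors, no browser memory.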

Handling Infinite Scroll

Social media feeds, product listings, and news sites use infinite scroll. Here's how to handle it:

async def scrape_infinite_scroll(url: str, max_items: int = 200) -> list[str]:
    """Scroll to bottom repeatedly until max_items reached or no new content."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        items = []
        last_height = 0
        no_change_count = 0

        while len(items) < max_items and no_change_count < 3:
            # Scroll to bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # Wait for content to load

            # Check if page grew
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                no_change_count += 1
            else:
                no_change_count = 0
            last_height = new_height

            # Extract items
            items = await page.query_selector_all(".feed-item")

        # Extract data from all items
        results = []
        for item in items[:max_items]:
            text = await item.inner_text()
            results.append(text)

        await browser.close()
    return results
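One caveat with feeds: virtualized lists recycle DOM nodes, so consecutive scrolls can re-yield items you already captured. Deduplicating extracted text by a stable key while preserving order fixes that. A minimal helper (choosing the right key, e.g. a post ID, is site-specific):

```python
def dedupe_keep_order(items: list[str]) -> list[str]:
    """Drop duplicates while preserving first-seen order."""
    seen: set[str] = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

# Feed text captured across overlapping scrolls
feed = ["post-1", "post-2", "post-2", "post-3", "post-1"]
print(dedupe_keep_order(feed))  # ['post-1', 'post-2', 'post-3']
```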

Handling Authentication and Login

Many sites require login before scraping. Playwright makes this straightforward:

async def scrape_with_login(url: str, username: str, password: str):
    """Login and scrape authenticated content."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Navigate to login page
        await page.goto("https://example.com/login")

        # Fill in credentials
        await page.fill("#username", username)
        await page.fill("#password", password)
        await page.click("#login-button")

        # Wait for redirect after login
        await page.wait_for_url("**/dashboard**")

        # Save session for reuse (avoid logging in every time)
        await context.storage_state(path="auth_state.json")

        # Now scrape authenticated pages
        await page.goto(url)
        data = await page.inner_text(".protected-content")

        await browser.close()
        return data

# Reuse saved session in future runs
async def scrape_with_saved_session(url: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state="auth_state.json")
        page = await context.new_page()
        await page.goto(url)
        # Already logged in!
        data = await page.inner_text(".protected-content")
        await browser.close()
        return data
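Saved sessions expire, and session lifetimes vary by site, so it's worth checking the state file's age before trusting it. A small guard; the one-hour cutoff is an assumption to tune per site:

```python
import os
import time

def auth_state_is_fresh(path: str = "auth_state.json", max_age_s: int = 3600) -> bool:
    """Reuse a saved session only if the state file exists and is recent."""
    try:
        return (time.time() - os.path.getmtime(path)) < max_age_s
    except OSError:  # file missing or unreadable -> log in from scratch
        return False
```

Call it before choosing a path: if it returns True, load `storage_state="auth_state.json"`; otherwise run the login flow and re-save the state.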

Screenshots and PDF Generation

Playwright can capture visual snapshots, useful for monitoring, archiving, or visual comparison:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://example.com")

    # Full page screenshot
    page.screenshot(path="fullpage.png", full_page=True)

    # Element screenshot
    element = page.query_selector(".main-content")
    element.screenshot(path="content.png")

    # PDF (Chromium only)
    page.pdf(path="page.pdf", format="A4")

    browser.close()

Need Screenshots at Scale?

Mantis API renders screenshots in the cloud, with no browser infrastructure needed. One API call, instant PNG/PDF.

Try Mantis Free →

Stealth Mode: Avoiding Detection

Default Playwright is trivially detected. Anti-bot systems check for automation signatures:

# โŒ Default Playwright is detected instantly
# navigator.webdriver === true
# Missing Chrome plugins
# Automation-specific properties exposed

Use playwright-stealth to patch the most obvious fingerprints:

pip install playwright-stealth

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--no-sandbox",
        ]
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/122.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )

    page = context.new_page()
    stealth_sync(page)  # Apply stealth patches

    # Additional manual patches
    page.add_init_script("""
        // Override webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});

        // Fix chrome object
        window.chrome = { runtime: {}, loadTimes: function(){}, csi: function(){} };

        // Fix permissions
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);

        // Fix plugins
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
        });
    """)

    page.goto("https://bot-detection-test.example.com")
    browser.close()
Reality check: playwright-stealth defeats basic bot detection, but sophisticated systems like Cloudflare Turnstile, Akamai Bot Manager, and DataDome still catch stealth Playwright through TLS fingerprinting (JA3/JA4), HTTP/2 frame analysis, and behavioral scoring. There's no silver bullet for DIY anti-detection.

Proxy Rotation with Playwright

Route Playwright through rotating proxies to avoid IP bans:

from playwright.sync_api import sync_playwright

PROXIES = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy3.example.com:8080", "username": "user", "password": "pass"},
]

def scrape_with_proxy(url: str, proxy: dict) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=proxy,
        )
        page = browser.new_page()
        page.goto(url, timeout=30000)
        content = page.content()
        browser.close()
        return content

# Rotate through proxies
import random
for url in urls_to_scrape:
    proxy = random.choice(PROXIES)
    html = scrape_with_proxy(url, proxy)
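random.choice can hand you the same proxy several times in a row; round-robin rotation spreads requests evenly across the pool. A sketch using itertools.cycle (the proxy entries mirror the placeholder PROXIES list above):

```python
import itertools

# Placeholder pool -- same shape as the PROXIES list above
PROXY_POOL = [
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
    {"server": "http://proxy3.example.com:8080"},
]

# cycle() yields proxies in order, forever: 1, 2, 3, 1, 2, 3, ...
rotation = itertools.cycle(PROXY_POOL)

servers = [next(rotation)["server"] for _ in range(4)]
print(servers[0] == servers[3])  # True -- wrapped around after three proxies
```

For long runs, you may also want to drop proxies from the pool after repeated failures; that bookkeeping is easy to bolt onto this loop.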

Blocking Unnecessary Resources

Speed up scraping 2-5x by blocking images, fonts, and tracking scripts:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    # Block images, fonts, and tracking
    context.route("**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2,ttf}", lambda route: route.abort())
    context.route("**/*google-analytics*", lambda route: route.abort())
    context.route("**/*facebook.net*", lambda route: route.abort())
    context.route("**/*doubleclick.net*", lambda route: route.abort())

    page = context.new_page()
    page.goto("https://example.com")  # Loads 2-5x faster

    browser.close()

Playwright vs. Web Scraping API: When to Use Each

Playwright is powerful but comes with significant overhead. Here's an honest comparison:

| Factor | Playwright (DIY) | Mantis API |
|---|---|---|
| Setup Time | Hours to days | 5 minutes |
| JS Rendering | ✅ Full browser | ✅ Cloud rendering |
| Speed | 2-10 sec/page | <2 sec/page |
| RAM Usage | 200-500MB per tab | Zero (cloud) |
| Anti-Detection | DIY (stealth plugins) | Built-in |
| Proxy Management | DIY ($50-200/mo) | Included |
| CAPTCHA Handling | 3rd party ($20-100/mo) | Included |
| Scale | Limited by RAM/CPU | 100K+ pages/mo |
| Maintenance | Constant (browsers update, sites change) | Zero |
| Cost (5K pages/mo) | $150-600/mo | $29/mo |
| AI Data Extraction | Custom code | Built-in (GPT-4o) |

Use Playwright When:

- You're learning, prototyping, or debugging a scraper interactively
- You need complex multi-step interactions (logins, form flows) on a single site
- Your volume is low enough to run on hardware you already control

Use a Web Scraping API When:

- You're scraping thousands of pages per month or more
- You're fighting anti-bot systems, CAPTCHAs, or IP bans
- You'd rather not maintain browsers, proxies, and stealth patches yourself

The progression most teams follow: start with requests + BeautifulSoup → hit JS rendering walls → switch to Playwright → hit scale/detection walls → switch to an API. Skip the middle steps if you're building for production.

Complete Production Scraper Example

Here's a production-ready Playwright scraper with error handling, retries, and structured output:

import asyncio
import json
import random
from dataclasses import dataclass, asdict
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

@dataclass
class ScrapedPage:
    url: str
    title: str
    content: str
    links: list[str]
    status: str  # "success" | "error"
    error: str | None = None

async def scrape_with_retry(
    context, url: str, max_retries: int = 3
) -> ScrapedPage:
    """Scrape a page with retries and error handling."""
    for attempt in range(max_retries):
        page = await context.new_page()
        try:
            await stealth_async(page)

            # Random delay to appear human
            await page.wait_for_timeout(random.randint(1000, 3000))

            response = await page.goto(url, wait_until="domcontentloaded", timeout=30000)

            if response and response.status == 403:
                raise Exception(f"Blocked (403) on attempt {attempt + 1}")

            await page.wait_for_load_state("networkidle", timeout=10000)

            title = await page.title()
            content = await page.inner_text("body")
            links = await page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)"
            )

            return ScrapedPage(
                url=url, title=title, content=content[:5000],
                links=links[:50], status="success"
            )

        except Exception as e:
            if attempt == max_retries - 1:
                return ScrapedPage(
                    url=url, title="", content="", links=[],
                    status="error", error=str(e)
                )
            await asyncio.sleep(2 * (attempt + 1))  # Linear backoff between attempts
        finally:
            await page.close()

async def main():
    urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ]

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
        )

        # Block unnecessary resources
        await context.route(
            "**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2}",
            lambda route: route.abort()
        )

        semaphore = asyncio.Semaphore(3)

        async def limited(url):
            async with semaphore:
                return await scrape_with_retry(context, url)

        results = await asyncio.gather(*[limited(u) for u in urls])

        await browser.close()

    # Output results
    for r in results:
        print(json.dumps(asdict(r), indent=2))

asyncio.run(main())
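The retry helper above uses a linear delay between attempts; when many workers get blocked at once, exponential backoff with jitter recovers better because simultaneous retries don't re-collide. A sketch of the standard full-jitter formula:

```python
import random

def backoff_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 30000) -> int:
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.randint(0, min(cap_ms, base_ms * 2 ** attempt))

# Drop-in replacement for the fixed delay between attempts:
#   await asyncio.sleep(backoff_ms(attempt) / 1000)
for attempt in range(4):
    print(attempt, backoff_ms(attempt))  # upper bounds: 1000, 2000, 4000, 8000
```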

Or Skip the Complexity: Use Mantis API

Everything above (browser rendering, stealth mode, proxy rotation, CAPTCHA solving, retries) in a single API call:

import requests

# Scrape any JavaScript-rendered page
response = requests.post(
    "https://api.mantisapi.com/scrape",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "render_js": True,
        "wait_for": ".product-card",
        "extract": {
            "products": {
                "selector": ".product-card",
                "type": "list",
                "fields": {
                    "name": ".product-name",
                    "price": ".product-price",
                    "rating": {"selector": ".stars", "attr": "data-score"}
                }
            }
        }
    }
)

products = response.json()["data"]["products"]
# Clean, structured data, no browser management needed

# AI-powered extraction: no selectors needed
response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with name, price, rating, and availability",
        "schema": {
            "products": [{
                "name": "string",
                "price": "number",
                "rating": "number",
                "in_stock": "boolean"
            }]
        }
    }
)

# GPT-4o extracts structured data from any page layout
products = response.json()["data"]["products"]

Stop Managing Browsers

Mantis handles JS rendering, anti-detection, proxies, and AI extraction. Free tier: 100 requests/month.

Start Free →

Summary

Playwright is the most capable browser automation tool for web scraping in 2026. It handles JavaScript rendering, SPAs, authentication, and complex interactions that HTTP-based scrapers can't touch.

But capability comes with cost:

- Browser binaries to install and keep updated
- 200-500MB of RAM per open tab
- Stealth patches that break as detection systems evolve
- Proxy and CAPTCHA bills that grow with scale

For learning, prototyping, and complex single-site scrapers, Playwright is the right choice. For production scraping at scale, a web scraping API eliminates the complexity and costs less.

Further Reading