Web Scraping with Playwright and Python in 2026: The Complete Guide
Modern websites are JavaScript-heavy. React, Next.js, Vue, Angular: over 70% of the top 10,000 websites rely on client-side rendering. Traditional HTTP scraping with requests + BeautifulSoup returns empty shells. You need a real browser.
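To see the "empty shell" problem concretely, here's a minimal stdlib-only sketch (no network involved): parse an SPA-style HTML shell the way an HTTP scraper would, and find that the product markup simply isn't there before JavaScript runs.

```python
from html.parser import HTMLParser

# What the server actually sends for a typical SPA: a mount point and a script tag.
SPA_SHELL = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

class ProductFinder(HTMLParser):
    """Counts elements carrying a product-card class; none exist in the raw HTML."""
    def __init__(self):
        super().__init__()
        self.product_cards = 0

    def handle_starttag(self, tag, attrs):
        if ("class", "product-card") in attrs:
            self.product_cards += 1

finder = ProductFinder()
finder.feed(SPA_SHELL)
print(finder.product_cards)  # 0 -- the product markup only exists after JS renders it
```

The HTTP response parses fine; the data just isn't in it. That's why the rest of this guide uses a real browser.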
Enter Playwright: Microsoft's open-source browser automation library. It drives Chromium, Firefox, and WebKit with a single API, handles dynamic content natively, and has become the go-to tool for scraping JavaScript-rendered pages in 2026.
This guide covers everything you need to scrape with Playwright effectively, and helps you decide when an API is the smarter choice.
Why Playwright for Web Scraping?
Playwright has overtaken Selenium as the preferred browser automation tool for scraping. Here's why:
| Feature | Playwright | Selenium | Puppeteer |
|---|---|---|---|
| Speed | Fast (CDP protocol) | Slower (WebDriver) | Fast (CDP) |
| Browser Support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari | Chromium only |
| Auto-Waiting | Built-in | Manual waits needed | Basic |
| Language Support | Python, JS, Java, C# | Python, JS, Java, C#, Ruby | JavaScript only |
| Shadow DOM | Native support | Workarounds needed | Native support |
| Async API | First-class | Limited | First-class |
| Network Interception | Built-in routing | Via proxy | Built-in |
| iframes | Simple API | Switch context manually | contentFrame() |
Setup: Installing Playwright for Python
Get started in under 60 seconds:
# Install playwright
pip install playwright
# Download browser binaries (Chromium, Firefox, WebKit)
playwright install
# Or install just Chromium (smaller download)
playwright install chromium
Verify the installation:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # "Example Domain"
    browser.close()
Basic Web Scraping with Playwright
Let's scrape a JavaScript-rendered page. Here's a complete example that extracts product data from a dynamic e-commerce site:
from playwright.sync_api import sync_playwright
import json

def scrape_products(url: str) -> list[dict]:
    """Scrape product listings from a JS-rendered page."""
    products = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/122.0.0.0 Safari/537.36",
        )
        page = context.new_page()

        # Navigate and wait for product cards to render
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector(".product-card", timeout=10000)

        # Extract product data
        cards = page.query_selector_all(".product-card")
        for card in cards:
            name = card.query_selector(".product-name")
            price = card.query_selector(".product-price")
            rating = card.query_selector(".product-rating")
            link = card.query_selector("a")  # query once, reuse below
            products.append({
                "name": name.inner_text() if name else None,
                "price": price.inner_text() if price else None,
                "rating": rating.get_attribute("data-score") if rating else None,
                "url": link.get_attribute("href") if link else None,
            })
        browser.close()
    return products

# Usage
products = scrape_products("https://example-shop.com/products")
print(json.dumps(products, indent=2))
Async Scraping for Better Performance
For scraping multiple pages concurrently, use Playwright's async API:
import asyncio
from playwright.async_api import async_playwright

async def scrape_page(context, url: str) -> dict:
    """Scrape a single page using a shared browser context."""
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_selector("h1", timeout=5000)
        title = await page.title()
        content = await page.inner_text("body")
        return {"url": url, "title": title, "length": len(content)}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        await page.close()

async def scrape_many(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Scrape multiple URLs with controlled concurrency."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        semaphore = asyncio.Semaphore(concurrency)

        async def limited_scrape(url):
            async with semaphore:
                return await scrape_page(context, url)

        results = await asyncio.gather(
            *[limited_scrape(url) for url in urls]
        )
        await browser.close()
        return results

# Scrape 20 pages, 5 at a time
urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
results = asyncio.run(scrape_many(urls, concurrency=5))
Tip: reuse a single browser context across pages to share cookies and cache. Limit concurrency to 5-10 pages; each tab uses 100-300MB of RAM. For 1,000+ pages, use a web scraping API instead.
Handling JavaScript-Rendered Content
Waiting Strategies
The most common Playwright scraping mistake: not waiting for content to render. Here are the key waiting strategies:
# 1. Wait for the network to be idle (all XHR/fetch complete)
await page.goto(url, wait_until="networkidle")

# 2. Wait for a specific element to appear
await page.wait_for_selector(".results-container", state="visible")

# 3. Wait for a specific element to have content
await page.wait_for_function(
    "document.querySelector('.results-count')?.textContent?.includes('results')"
)

# 4. Wait for a specific network request to complete
async with page.expect_response("**/api/products*") as response_info:
    await page.click("#load-more")
response = await response_info.value
data = await response.json()

# 5. Wait for navigation after a click
async with page.expect_navigation():
    await page.click("a.next-page")
Intercepting API Calls
Often the fastest approach: intercept the XHR/fetch calls that load data, and skip HTML parsing entirely:
from playwright.sync_api import sync_playwright

def intercept_api_data(url: str) -> list[dict]:
    """Intercept API responses instead of parsing HTML."""
    captured_data = []

    def handle_response(response):
        if "/api/products" in response.url and response.status == 200:
            try:
                data = response.json()
                captured_data.append(data)
            except Exception:
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")

        # Trigger pagination to capture more data
        for _ in range(5):
            next_btn = page.query_selector("button.load-more")
            if next_btn:
                next_btn.click()
                page.wait_for_timeout(2000)
        browser.close()
    return captured_data
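Captured payloads are usually paginated envelopes rather than flat lists. A minimal merge helper, assuming a hypothetical `{"items": [...]}` response shape (swap the key for whatever the real API actually returns):

```python
def merge_captured(payloads: list[dict], key: str = "items") -> list[dict]:
    """Flatten a list of captured API payloads into one list of records."""
    merged = []
    for payload in payloads:
        merged.extend(payload.get(key, []))
    return merged

# Example with two captured "pages" of a paginated API
pages = [
    {"items": [{"id": 1}, {"id": 2}], "next": "/api/products?page=2"},
    {"items": [{"id": 3}], "next": None},
]
print(merge_captured(pages))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Pair it with `intercept_api_data` above: pass its return value straight into `merge_captured` once you know the envelope key.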
Handling Infinite Scroll
Social media feeds, product listings, and news sites use infinite scroll. Here's how to handle it:
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url: str, max_items: int = 200) -> list[str]:
    """Scroll to the bottom repeatedly until max_items reached or no new content."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        items = []
        last_height = 0
        no_change_count = 0

        while len(items) < max_items and no_change_count < 3:
            # Scroll to the bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # Wait for content to load

            # Check if the page grew
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                no_change_count += 1
            else:
                no_change_count = 0
                last_height = new_height

            # Re-query the rendered items
            items = await page.query_selector_all(".feed-item")

        # Extract text from the collected items
        results = []
        for item in items[:max_items]:
            text = await item.inner_text()
            results.append(text)

        await browser.close()
        return results
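Feeds often re-render earlier items as you scroll, so the same entry can be extracted more than once. A small order-preserving dedupe pass (assuming the extracted items are plain strings, as in the function above) keeps the results clean:

```python
def dedupe_keep_order(items: list[str]) -> list[str]:
    """Remove duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

print(dedupe_keep_order(["post A", "post B", "post A", "post C"]))
# ['post A', 'post B', 'post C']
```

Run it over the list returned by `scrape_infinite_scroll` before counting or storing results.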
Handling Authentication and Login
Many sites require login before scraping. Playwright makes this straightforward:
from playwright.async_api import async_playwright

async def scrape_with_login(url: str, username: str, password: str):
    """Log in and scrape authenticated content."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Navigate to the login page
        await page.goto("https://example.com/login")

        # Fill in credentials
        await page.fill("#username", username)
        await page.fill("#password", password)
        await page.click("#login-button")

        # Wait for the redirect after login
        await page.wait_for_url("**/dashboard**")

        # Save the session for reuse (avoid logging in every time)
        await context.storage_state(path="auth_state.json")

        # Now scrape authenticated pages
        await page.goto(url)
        data = await page.inner_text(".protected-content")
        await browser.close()
        return data

# Reuse the saved session in future runs
async def scrape_with_saved_session(url: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state="auth_state.json")
        page = await context.new_page()
        await page.goto(url)
        # Already logged in!
        data = await page.inner_text(".protected-content")
        await browser.close()
        return data
Screenshots and PDF Generation
Playwright can capture visual snapshots, which is useful for monitoring, archiving, or visual comparison:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://example.com")

    # Full-page screenshot
    page.screenshot(path="fullpage.png", full_page=True)

    # Element screenshot (guard against a missing element)
    element = page.query_selector(".main-content")
    if element:
        element.screenshot(path="content.png")

    # PDF (Chromium only)
    page.pdf(path="page.pdf", format="A4")
    browser.close()
Need Screenshots at Scale?
Mantis API renders screenshots in the cloud, with no browser infrastructure needed. One API call, instant PNG/PDF.
Try Mantis Free →
Stealth Mode: Avoiding Detection
Default Playwright is trivially detected. Anti-bot systems check for automation signatures:
# ❌ Default Playwright is detected instantly:
#    navigator.webdriver === true
#    Missing Chrome plugins
#    Automation-specific properties exposed
Use playwright-stealth to patch the most obvious fingerprints:
pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--no-sandbox",
        ],
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/122.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )
    page = context.new_page()
    stealth_sync(page)  # Apply stealth patches

    # Additional manual patches
    page.add_init_script("""
        // Override the webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});

        // Fake the chrome object
        window.chrome = { runtime: {}, loadTimes: function(){}, csi: function(){} };

        // Fix the permissions query
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);

        // Fake the plugins list
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
        });
    """)

    page.goto("https://bot-detection-test.example.com")
    browser.close()
playwright-stealth defeats basic bot detection, but sophisticated systems like Cloudflare Turnstile, Akamai Bot Manager, and DataDome still catch stealth Playwright through TLS fingerprinting (JA3/JA4), HTTP/2 frame analysis, and behavioral scoring. There's no silver bullet for DIY anti-detection.
Proxy Rotation with Playwright
Route Playwright through rotating proxies to avoid IP bans:
import random
from playwright.sync_api import sync_playwright

PROXIES = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy3.example.com:8080", "username": "user", "password": "pass"},
]

def scrape_with_proxy(url: str, proxy: dict) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=proxy,
        )
        page = browser.new_page()
        page.goto(url, timeout=30000)
        content = page.content()
        browser.close()
        return content

# Rotate through proxies
for url in urls_to_scrape:
    proxy = random.choice(PROXIES)
    html = scrape_with_proxy(url, proxy)
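random.choice can pick the same proxy several times in a row. For even load, itertools.cycle gives strict round-robin rotation; here's a self-contained sketch with simplified proxy dicts (credentials omitted):

```python
from itertools import cycle

# Simplified proxy entries; real ones would carry username/password as above
PROXIES = [
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
    {"server": "http://proxy3.example.com:8080"},
]

proxy_pool = cycle(PROXIES)

def next_proxy() -> dict:
    """Return the next proxy in strict round-robin order."""
    return next(proxy_pool)

# Each call advances the rotation: proxy1, proxy2, proxy3, proxy1, ...
servers = [next_proxy()["server"] for _ in range(4)]
print(servers[0], servers[3])  # both are proxy1
```

Swap `random.choice(PROXIES)` in the loop above for `next_proxy()` to get even distribution across the pool.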
Blocking Unnecessary Resources
Speed up scraping 2-5x by blocking images, fonts, and tracking scripts:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    # Block images, fonts, and tracking (no await -- this is the sync API)
    context.route("**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2,ttf}", lambda route: route.abort())
    context.route("**/*google-analytics*", lambda route: route.abort())
    context.route("**/*facebook.net*", lambda route: route.abort())
    context.route("**/*doubleclick.net*", lambda route: route.abort())

    page = context.new_page()
    page.goto("https://example.com")  # Loads 2-5x faster
    browser.close()
Playwright vs. Web Scraping API: When to Use Each
Playwright is powerful but comes with significant overhead. Here's an honest comparison:
| Factor | Playwright (DIY) | Mantis API |
|---|---|---|
| Setup Time | Hours to days | 5 minutes |
| JS Rendering | ✅ Full browser | ✅ Cloud rendering |
| Speed | 2-10 sec/page | <2 sec/page |
| RAM Usage | 200-500MB per tab | Zero (cloud) |
| Anti-Detection | DIY (stealth plugins) | Built-in |
| Proxy Management | DIY ($50-200/mo) | Included |
| CAPTCHA Handling | 3rd party ($20-100/mo) | Included |
| Scale | Limited by RAM/CPU | 100K+ pages/mo |
| Maintenance | Constant (browsers update, sites change) | Zero |
| Cost (5K pages/mo) | $150-600/mo | $29/mo |
| AI Data Extraction | Custom code | Built-in (GPT-4o) |
Use Playwright When:
- You need complex browser interactions (multi-step forms, drag-and-drop)
- You're scraping <100 pages/day from 1-2 sites
- You need custom JavaScript execution in the page context
- You're building a prototype or learning web scraping
Use a Web Scraping API When:
- You're scraping at scale (1,000+ pages/day)
- You need reliability (SLA, built-in retries, anti-detection)
- You want AI-powered data extraction (structured JSON from any page)
- You'd rather spend time on business logic than infrastructure
requests + BeautifulSoup → hit JS rendering walls → switch to Playwright → hit scale/detection walls → switch to an API. Skip the middle steps if you're building for production.
Complete Production Scraper Example
Here's a production-ready Playwright scraper with error handling, retries, and structured output:
import asyncio
import json
import random
from dataclasses import dataclass, asdict

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

@dataclass
class ScrapedPage:
    url: str
    title: str
    content: str
    links: list[str]
    status: str  # "success" | "error"
    error: str | None = None

async def scrape_with_retry(
    context, url: str, max_retries: int = 3
) -> ScrapedPage:
    """Scrape a page with retries and error handling."""
    for attempt in range(max_retries):
        page = await context.new_page()
        try:
            await stealth_async(page)
            # Random delay to appear human
            await page.wait_for_timeout(random.randint(1000, 3000))
            response = await page.goto(url, wait_until="domcontentloaded", timeout=30000)
            if response and response.status == 403:
                raise Exception(f"Blocked (403) on attempt {attempt + 1}")
            await page.wait_for_load_state("networkidle", timeout=10000)

            title = await page.title()
            content = await page.inner_text("body")
            links = await page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)"
            )
            return ScrapedPage(
                url=url, title=title, content=content[:5000],
                links=links[:50], status="success"
            )
        except Exception as e:
            if attempt == max_retries - 1:
                return ScrapedPage(
                    url=url, title="", content="", links=[],
                    status="error", error=str(e)
                )
            await page.wait_for_timeout(2000 * (attempt + 1))  # Backoff
        finally:
            await page.close()

async def main():
    urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ]
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
        )
        # Block unnecessary resources
        await context.route(
            "**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2}",
            lambda route: route.abort()
        )
        semaphore = asyncio.Semaphore(3)

        async def limited(url):
            async with semaphore:
                return await scrape_with_retry(context, url)

        results = await asyncio.gather(*[limited(u) for u in urls])
        await browser.close()

    # Output results
    for r in results:
        print(json.dumps(asdict(r), indent=2))

asyncio.run(main())
Or Skip the Complexity: Use Mantis API
Everything above (browser rendering, stealth mode, proxy rotation, CAPTCHA solving, retries) in a single API call:
import requests

# Scrape any JavaScript-rendered page
response = requests.post(
    "https://api.mantisapi.com/scrape",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "render_js": True,
        "wait_for": ".product-card",
        "extract": {
            "products": {
                "selector": ".product-card",
                "type": "list",
                "fields": {
                    "name": ".product-name",
                    "price": ".product-price",
                    "rating": {"selector": ".stars", "attr": "data-score"}
                }
            }
        }
    }
)
products = response.json()["data"]["products"]
# Clean, structured data; no browser management needed

# AI-powered extraction; no selectors needed
response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with name, price, rating, and availability",
        "schema": {
            "products": [{
                "name": "string",
                "price": "number",
                "rating": "number",
                "in_stock": "boolean"
            }]
        }
    }
)
# GPT-4o extracts structured data from any page layout
products = response.json()["data"]["products"]
Stop Managing Browsers
Mantis handles JS rendering, anti-detection, proxies, and AI extraction. Free tier: 100 requests/month.
Start Free →
Summary
Playwright is the most capable browser automation tool for web scraping in 2026. It handles JavaScript rendering, SPAs, authentication, and complex interactions that HTTP-based scrapers can't touch.
But capability comes with cost:
- Infrastructure: Each browser instance uses 200-500MB RAM
- Anti-detection: Stealth plugins help but don't defeat modern anti-bot systems
- Maintenance: Browser updates, site changes, and proxy rotation require constant attention
- Scale: Going from 100 to 10,000 pages/day requires significant infrastructure investment
For learning, prototyping, and complex single-site scrapers, Playwright is the right choice. For production scraping at scale, a web scraping API eliminates the complexity and costs less.