Web Scraping with Playwright and Python in 2026: The Complete Guide
Modern websites are JavaScript-heavy. React, Next.js, Vue, Angular: over 70% of the top 10,000 websites rely on client-side rendering. Traditional HTTP scraping with requests + BeautifulSoup returns empty shells. You need a real browser.
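To see the "empty shell" problem concretely, here's a minimal stdlib-only sketch (no network involved): parse an SPA-style HTML shell the way an HTTP scraper would, and find that the product markup simply isn't there before JavaScript runs.

```python
from html.parser import HTMLParser

# What the server actually sends for a typical SPA: a mount point and a script tag.
SPA_SHELL = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

class ProductFinder(HTMLParser):
    """Counts elements carrying a product-card class; none exist in the raw HTML."""
    def __init__(self):
        super().__init__()
        self.product_cards = 0

    def handle_starttag(self, tag, attrs):
        if ("class", "product-card") in attrs:
            self.product_cards += 1

finder = ProductFinder()
finder.feed(SPA_SHELL)
print(finder.product_cards)  # 0 -- the product markup only exists after JS renders it
```

The HTTP response parses fine; the data just isn't in it. That's why the rest of this guide uses a real browser.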
Enter Playwright: Microsoft's open-source browser automation library. It drives Chromium, Firefox, and WebKit with a single API, handles dynamic content natively, and has become the go-to tool for scraping JavaScript-rendered pages in 2026.
This guide covers everything you need to scrape with Playwright effectively, and helps you decide when an API is the smarter choice.
Why Playwright for Web Scraping?
Playwright has overtaken Selenium as the preferred browser automation tool for scraping. Here's why:
| Feature | Playwright | Selenium | Puppeteer |
|---|---|---|---|
| Speed | Fast (CDP protocol) | Slower (WebDriver) | Fast (CDP) |
| Browser Support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari | Chromium only |
| Auto-Waiting | Built-in | Manual waits needed | Basic |
| Language Support | Python, JS, Java, C# | Python, JS, Java, C#, Ruby | JavaScript only |
| Shadow DOM | Native support | Workarounds needed | Native support |
| Async API | First-class | Limited | First-class |
| Network Interception | Built-in routing | Via proxy | Built-in |
| iframes | Simple API | Switch context manually | contentFrame() |
Setup: Installing Playwright for Python
Get started in under 60 seconds:
# Install playwright
pip install playwright
# Download browser binaries (Chromium, Firefox, WebKit)
playwright install
# Or install just Chromium (smaller download)
playwright install chromium
Verify the installation:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # "Example Domain"
    browser.close()
Basic Web Scraping with Playwright
Let's scrape a JavaScript-rendered page. Here's a complete example that extracts product data from a dynamic e-commerce site:
from playwright.sync_api import sync_playwright
import json

def scrape_products(url: str) -> list[dict]:
    """Scrape product listings from a JS-rendered page."""
    products = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/122.0.0.0 Safari/537.36",
        )
        page = context.new_page()

        # Navigate and wait for product cards to render
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector(".product-card", timeout=10000)

        # Extract product data
        cards = page.query_selector_all(".product-card")
        for card in cards:
            name = card.query_selector(".product-name")
            price = card.query_selector(".product-price")
            rating = card.query_selector(".product-rating")
            link = card.query_selector("a")  # query once, reuse below
            products.append({
                "name": name.inner_text() if name else None,
                "price": price.inner_text() if price else None,
                "rating": rating.get_attribute("data-score") if rating else None,
                "url": link.get_attribute("href") if link else None,
            })
        browser.close()
    return products

# Usage
products = scrape_products("https://example-shop.com/products")
print(json.dumps(products, indent=2))
Async Scraping for Better Performance
For scraping multiple pages concurrently, use Playwright's async API:
import asyncio
from playwright.async_api import async_playwright

async def scrape_page(context, url: str) -> dict:
    """Scrape a single page using a shared browser context."""
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_selector("h1", timeout=5000)
        title = await page.title()
        content = await page.inner_text("body")
        return {"url": url, "title": title, "length": len(content)}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        await page.close()

async def scrape_many(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Scrape multiple URLs with controlled concurrency."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        semaphore = asyncio.Semaphore(concurrency)

        async def limited_scrape(url):
            async with semaphore:
                return await scrape_page(context, url)

        results = await asyncio.gather(
            *[limited_scrape(url) for url in urls]
        )
        await browser.close()
        return results

# Scrape 20 pages, 5 at a time
urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
results = asyncio.run(scrape_many(urls, concurrency=5))
Tip: reuse a single browser context across pages to share cookies and cache. Limit concurrency to 5-10 pages; each tab uses 100-300MB of RAM. For 1,000+ pages, use a web scraping API instead.
Handling JavaScript-Rendered Content
Waiting Strategies
The most common Playwright scraping mistake: not waiting for content to render. Here are the key waiting strategies:
# 1. Wait for the network to be idle (all XHR/fetch complete)
await page.goto(url, wait_until="networkidle")

# 2. Wait for a specific element to appear
await page.wait_for_selector(".results-container", state="visible")

# 3. Wait for a specific element to have content
await page.wait_for_function(
    "document.querySelector('.results-count')?.textContent?.includes('results')"
)

# 4. Wait for a specific network request to complete
async with page.expect_response("**/api/products*") as response_info:
    await page.click("#load-more")
response = await response_info.value
data = await response.json()

# 5. Wait for navigation after a click
async with page.expect_navigation():
    await page.click("a.next-page")
Intercepting API Calls
Often the fastest approach: intercept the XHR/fetch calls that load data, and skip HTML parsing entirely:
from playwright.sync_api import sync_playwright

def intercept_api_data(url: str) -> list[dict]:
    """Intercept API responses instead of parsing HTML."""
    captured_data = []

    def handle_response(response):
        if "/api/products" in response.url and response.status == 200:
            try:
                data = response.json()
                captured_data.append(data)
            except Exception:
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")

        # Trigger pagination to capture more data
        for _ in range(5):
            next_btn = page.query_selector("button.load-more")
            if next_btn:
                next_btn.click()
                page.wait_for_timeout(2000)
        browser.close()
    return captured_data
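Captured payloads are usually paginated envelopes rather than flat lists. A minimal merge helper, assuming a hypothetical `{"items": [...]}` response shape (swap the key for whatever the real API actually returns):

```python
def merge_captured(payloads: list[dict], key: str = "items") -> list[dict]:
    """Flatten a list of captured API payloads into one list of records."""
    merged = []
    for payload in payloads:
        merged.extend(payload.get(key, []))
    return merged

# Example with two captured "pages" of a paginated API
pages = [
    {"items": [{"id": 1}, {"id": 2}], "next": "/api/products?page=2"},
    {"items": [{"id": 3}], "next": None},
]
print(merge_captured(pages))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Pair it with `intercept_api_data` above: pass its return value straight into `merge_captured` once you know the envelope key.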
Handling Infinite Scroll
Social media feeds, product listings, and news sites use infinite scroll. Here's how to handle it:
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url: str, max_items: int = 200) -> list[str]:
    """Scroll to the bottom repeatedly until max_items reached or no new content."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        items = []
        last_height = 0
        no_change_count = 0

        while len(items) < max_items and no_change_count < 3:
            # Scroll to the bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # Wait for content to load

            # Check if the page grew
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                no_change_count += 1
            else:
                no_change_count = 0
                last_height = new_height

            # Re-query the rendered items
            items = await page.query_selector_all(".feed-item")

        # Extract text from the collected items
        results = []
        for item in items[:max_items]:
            text = await item.inner_text()
            results.append(text)

        await browser.close()
        return results
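Feeds often re-render earlier items as you scroll, so the same entry can be extracted more than once. A small order-preserving dedupe pass (assuming the extracted items are plain strings, as in the function above) keeps the results clean:

```python
def dedupe_keep_order(items: list[str]) -> list[str]:
    """Remove duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

print(dedupe_keep_order(["post A", "post B", "post A", "post C"]))
# ['post A', 'post B', 'post C']
```

Run it over the list returned by `scrape_infinite_scroll` before counting or storing results.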
Handling Authentication and Login
Many sites require login before scraping. Playwright makes this straightforward:
from playwright.async_api import async_playwright

async def scrape_with_login(url: str, username: str, password: str):
    """Log in and scrape authenticated content."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Navigate to the login page
        await page.goto("https://example.com/login")

        # Fill in credentials
        await page.fill("#username", username)
        await page.fill("#password", password)
        await page.click("#login-button")

        # Wait for the redirect after login
        await page.wait_for_url("**/dashboard**")

        # Save the session for reuse (avoid logging in every time)
        await context.storage_state(path="auth_state.json")

        # Now scrape authenticated pages
        await page.goto(url)
        data = await page.inner_text(".protected-content")
        await browser.close()
        return data

# Reuse the saved session in future runs
async def scrape_with_saved_session(url: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state="auth_state.json")
        page = await context.new_page()
        await page.goto(url)
        # Already logged in!
        data = await page.inner_text(".protected-content")
        await browser.close()
        return data
Screenshots and PDF Generation
Playwright can capture visual snapshots, which is useful for monitoring, archiving, or visual comparison:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://example.com")

    # Full-page screenshot
    page.screenshot(path="fullpage.png", full_page=True)

    # Element screenshot (guard against a missing element)
    element = page.query_selector(".main-content")
    if element:
        element.screenshot(path="content.png")

    # PDF (Chromium only)
    page.pdf(path="page.pdf", format="A4")
    browser.close()
Need Screenshots at Scale?
Mantis API renders screenshots in the cloud, with no browser infrastructure needed. One API call, instant PNG/PDF.
Try Mantis Free →
Stealth Mode: Avoiding Detection
Default Playwright is trivially detected. Anti-bot systems check for automation signatures:
# ❌ Default Playwright is detected instantly:
#    navigator.webdriver === true
#    Missing Chrome plugins
#    Automation-specific properties exposed
Use playwright-stealth to patch the most obvious fingerprints:
pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--no-sandbox",
        ],
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/122.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )
    page = context.new_page()
    stealth_sync(page)  # Apply stealth patches

    # Additional manual patches
    page.add_init_script("""
        // Override the webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});

        // Fake the chrome object
        window.chrome = { runtime: {}, loadTimes: function(){}, csi: function(){} };

        // Fix the permissions query
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);

        // Fake the plugins list
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
        });
    """)

    page.goto("https://bot-detection-test.example.com")
    browser.close()
playwright-stealth defeats basic bot detection, but sophisticated systems like Cloudflare Turnstile, Akamai Bot Manager, and DataDome still catch stealth Playwright through TLS fingerprinting (JA3/JA4), HTTP/2 frame analysis, and behavioral scoring. There's no silver bullet for DIY anti-detection.
Proxy Rotation with Playwright
Route Playwright through rotating proxies to avoid IP bans:
import random
from playwright.sync_api import sync_playwright

PROXIES = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy3.example.com:8080", "username": "user", "password": "pass"},
]

def scrape_with_proxy(url: str, proxy: dict) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=proxy,
        )
        page = browser.new_page()
        page.goto(url, timeout=30000)
        content = page.content()
        browser.close()
        return content

# Rotate through proxies
for url in urls_to_scrape:
    proxy = random.choice(PROXIES)
    html = scrape_with_proxy(url, proxy)
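random.choice can pick the same proxy several times in a row. For even load, itertools.cycle gives strict round-robin rotation; here's a self-contained sketch with simplified proxy dicts (credentials omitted):

```python
from itertools import cycle

# Simplified proxy entries; real ones would carry username/password as above
PROXIES = [
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
    {"server": "http://proxy3.example.com:8080"},
]

proxy_pool = cycle(PROXIES)

def next_proxy() -> dict:
    """Return the next proxy in strict round-robin order."""
    return next(proxy_pool)

# Each call advances the rotation: proxy1, proxy2, proxy3, proxy1, ...
servers = [next_proxy()["server"] for _ in range(4)]
print(servers[0], servers[3])  # both are proxy1
```

Swap `random.choice(PROXIES)` in the loop above for `next_proxy()` to get even distribution across the pool.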
Blocking Unnecessary Resources
Speed up scraping 2-5x by blocking images, fonts, and tracking scripts:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    # Block images, fonts, and tracking (no await -- this is the sync API)
    context.route("**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2,ttf}", lambda route: route.abort())
    context.route("**/*google-analytics*", lambda route: route.abort())
    context.route("**/*facebook.net*", lambda route: route.abort())
    context.route("**/*doubleclick.net*", lambda route: route.abort())

    page = context.new_page()
    page.goto("https://example.com")  # Loads 2-5x faster
    browser.close()
Playwright vs. Web Scraping API: When to Use Each
Playwright is powerful but comes with significant overhead. Here's an honest comparison:
| Factor | Playwright (DIY) | Mantis API |
|---|---|---|
| Setup Time | Hours to days | 5 minutes |
| JS Rendering | ✅ Full browser | ✅ Cloud rendering |
| Speed | 2-10 sec/page | <2 sec/page |
| RAM Usage | 200-500MB per tab | Zero (cloud) |
| Anti-Detection | DIY (stealth plugins) | Built-in |
| Proxy Management | DIY ($50-200/mo) | Included |
| CAPTCHA Handling | 3rd party ($20-100/mo) | Included |
| Scale | Limited by RAM/CPU | 100K+ pages/mo |
| Maintenance | Constant (browsers update, sites change) | Zero |
| Cost (5K pages/mo) | $150-600/mo | $29/mo |
| AI Data Extraction | Custom code | Built-in (GPT-4o) |
Use Playwright When:
- You need complex browser interactions (multi-step forms, drag-and-drop)
- You're scraping <100 pages/day from 1-2 sites
- You need custom JavaScript execution in the page context
- You're building a prototype or learning web scraping
Use a Web Scraping API When:
- You're scraping at scale (1,000+ pages/day)
- You need reliability (SLA, built-in retries, anti-detection)
- You want AI-powered data extraction (structured JSON from any page)
- You'd rather spend time on business logic than infrastructure
requests + BeautifulSoup → hit JS rendering walls → switch to Playwright → hit scale/detection walls → switch to an API. Skip the middle steps if you're building for production.
Complete Production Scraper Example
Here's a production-ready Playwright scraper with error handling, retries, and structured output:
import asyncio
import json
import random
from dataclasses import dataclass, asdict

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

@dataclass
class ScrapedPage:
    url: str
    title: str
    content: str
    links: list[str]
    status: str  # "success" | "error"
    error: str | None = None

async def scrape_with_retry(
    context, url: str, max_retries: int = 3
) -> ScrapedPage:
    """Scrape a page with retries and error handling."""
    for attempt in range(max_retries):
        page = await context.new_page()
        try:
            await stealth_async(page)
            # Random delay to appear human
            await page.wait_for_timeout(random.randint(1000, 3000))
            response = await page.goto(url, wait_until="domcontentloaded", timeout=30000)
            if response and response.status == 403:
                raise Exception(f"Blocked (403) on attempt {attempt + 1}")
            await page.wait_for_load_state("networkidle", timeout=10000)

            title = await page.title()
            content = await page.inner_text("body")
            links = await page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)"
            )
            return ScrapedPage(
                url=url, title=title, content=content[:5000],
                links=links[:50], status="success"
            )
        except Exception as e:
            if attempt == max_retries - 1:
                return ScrapedPage(
                    url=url, title="", content="", links=[],
                    status="error", error=str(e)
                )
            await page.wait_for_timeout(2000 * (attempt + 1))  # Backoff
        finally:
            await page.close()

async def main():
    urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ]
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
        )
        # Block unnecessary resources
        await context.route(
            "**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2}",
            lambda route: route.abort()
        )
        semaphore = asyncio.Semaphore(3)

        async def limited(url):
            async with semaphore:
                return await scrape_with_retry(context, url)

        results = await asyncio.gather(*[limited(u) for u in urls])
        await browser.close()

    # Output results
    for r in results:
        print(json.dumps(asdict(r), indent=2))

asyncio.run(main())
Or Skip the Complexity: Use Mantis API
Everything above (browser rendering, stealth mode, proxy rotation, CAPTCHA solving, retries) in a single API call:
import requests

# Scrape any JavaScript-rendered page
response = requests.post(
    "https://api.mantisapi.com/scrape",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "render_js": True,
        "wait_for": ".product-card",
        "extract": {
            "products": {
                "selector": ".product-card",
                "type": "list",
                "fields": {
                    "name": ".product-name",
                    "price": ".product-price",
                    "rating": {"selector": ".stars", "attr": "data-score"}
                }
            }
        }
    }
)
products = response.json()["data"]["products"]
# Clean, structured data; no browser management needed

# AI-powered extraction; no selectors needed
response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with name, price, rating, and availability",
        "schema": {
            "products": [{
                "name": "string",
                "price": "number",
                "rating": "number",
                "in_stock": "boolean"
            }]
        }
    }
)
# GPT-4o extracts structured data from any page layout
products = response.json()["data"]["products"]
Stop Managing Browsers
Mantis handles JS rendering, anti-detection, proxies, and AI extraction. Free tier: 100 requests/month.
Start Free →
Summary
Playwright is the most capable browser automation tool for web scraping in 2026. It handles JavaScript rendering, SPAs, authentication, and complex interactions that HTTP-based scrapers can't touch.
But capability comes with cost:
- Infrastructure: Each browser instance uses 200-500MB RAM
- Anti-detection: Stealth plugins help but don't defeat modern anti-bot systems
- Maintenance: Browser updates, site changes, and proxy rotation require constant attention
- Scale: Going from 100 to 10,000 pages/day requires significant infrastructure investment
For learning, prototyping, and complex single-site scrapers, Playwright is the right choice. For production scraping at scale, a web scraping API eliminates the complexity and costs less.