Web Scraping with Python in 2026: The Complete Guide
Web scraping with Python remains one of the most in-demand skills for developers. But in 2026, the landscape has changed dramatically. JavaScript-heavy sites, anti-bot measures, and the rise of AI agents have made traditional scraping harder — and smarter alternatives essential.
This guide covers everything: from basic scraping with BeautifulSoup to production-grade extraction with AI-powered APIs like WebPerception.
The Evolution of Python Web Scraping
The Old Way (2015–2022)
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for item in soup.select(".product-card"):
    products.append({
        "name": item.select_one(".title").text,
        "price": item.select_one(".price").text,
    })
```
This worked fine for static HTML pages. But it breaks when:
- The page renders with JavaScript (most modern sites)
- The site has anti-bot protection (Cloudflare, reCAPTCHA)
- The HTML structure changes (your selectors break)
- You need to scale beyond a few hundred requests
The Headless Browser Era (2018–2024)
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/products")

# Implicit wait: element lookups below retry for up to 10 seconds
driver.implicitly_wait(10)

products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
```
Selenium and Playwright solved the JavaScript problem but introduced new ones:
- Resource-heavy: Each browser instance uses 200–500MB RAM
- Slow: Pages take 3–10 seconds to fully render
- Fragile: Browser updates break your setup
- Expensive to scale: Running headless Chrome at scale requires serious infrastructure
The AI-Powered Era (2025+)
```python
import requests

response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with name, price, and rating",
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "rating": {"type": "number"}
                }
            }
        }
    }
)

products = response.json()["data"]
```
No selectors. No browser management. No breaking when HTML changes. Just describe what you want, and the API extracts it.
Method 1: BeautifulSoup (Basic Static Pages)
Best for: Simple HTML pages, learning, small projects.
Installation
```bash
pip install beautifulsoup4 requests
```
Basic Example
```python
import requests
from bs4 import BeautifulSoup

def scrape_articles(url):
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"
    })
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for article in soup.select("article"):
        title = article.select_one("h2")
        link = article.select_one("a")
        summary = article.select_one("p")
        articles.append({
            "title": title.text.strip() if title else None,
            "url": link["href"] if link else None,
            "summary": summary.text.strip() if summary else None,
        })
    return articles
```
Limitations
- ❌ No JavaScript rendering
- ❌ Breaks when HTML structure changes
- ❌ No built-in rate limiting or proxy rotation
- ❌ You handle all error handling, retries, and edge cases
Method 2: Playwright (JavaScript-Heavy Sites)
Best for: SPAs, sites requiring interaction, when you need browser automation.
Installation
```bash
pip install playwright
playwright install chromium
```
Basic Example
```python
from playwright.sync_api import sync_playwright

def scrape_spa(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Wait for content to load
        page.wait_for_selector(".product-list")

        products = page.evaluate("""
            () => Array.from(document.querySelectorAll('.product')).map(el => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
            }))
        """)

        browser.close()
        return products
```
Limitations
- ❌ Resource-heavy (200–500MB per browser)
- ❌ Slow (3–10s per page)
- ❌ Complex to deploy and scale
- ❌ Anti-bot detection catches headless browsers
Method 3: WebPerception API (Production-Grade)
Best for: Production applications, AI agents, structured data extraction, scaling.
Installation
```bash
pip install requests  # That's it.
```
Scrape Any Page
```python
import requests

API_KEY = "your_api_key"  # Get free at mantisapi.com
BASE_URL = "https://api.mantisapi.com"

def scrape(url):
    """Get clean, readable content from any URL."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url}
    )
    return response.json()

result = scrape("https://news.ycombinator.com")
print(result["content"])  # Clean markdown content
```
Extract Structured Data with AI
```python
def extract_products(url):
    """Extract structured product data — no selectors needed."""
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": url,
            "prompt": "Extract all products with name, price, rating, and availability",
            "schema": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "currency": {"type": "string"},
                        "rating": {"type": "number"},
                        "in_stock": {"type": "boolean"}
                    }
                }
            }
        }
    )
    return response.json()["data"]

# Works on ANY e-commerce site — no site-specific code
products = extract_products("https://example-store.com/laptops")
```
Take Screenshots
```python
def screenshot(url, full_page=False):
    """Capture a screenshot of any webpage."""
    response = requests.post(
        f"{BASE_URL}/screenshot",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url, "fullPage": full_page}
    )

    # Save the image
    with open("screenshot.png", "wb") as f:
        f.write(response.content)

screenshot("https://competitor.com/pricing", full_page=True)
```
Why WebPerception Wins for Production
| Feature | BeautifulSoup | Playwright | WebPerception API |
|---------|:---:|:---:|:---:|
| JavaScript rendering | ❌ | ✅ | ✅ |
| AI data extraction | ❌ | ❌ | ✅ |
| Anti-bot handling | ❌ | ⚠️ | ✅ |
| Scales to 100K+ pages | ❌ | ⚠️ | ✅ |
| Infrastructure needed | None | Heavy | None |
| Maintenance | High | High | Zero |
| Setup time | 5 min | 30 min | 2 min |
Common Web Scraping Patterns in Python
Pattern 1: Paginated Data
```python
def scrape_all_pages(base_url):
    all_items = []
    page = 1

    while True:
        result = requests.post(
            f"{BASE_URL}/extract",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": f"{base_url}?page={page}",
                "prompt": "Extract all items. Also tell me if there's a next page.",
                "schema": {
                    "type": "object",
                    "properties": {
                        "items": {"type": "array", "items": {"type": "object"}},
                        "has_next_page": {"type": "boolean"}
                    }
                }
            }
        ).json()

        all_items.extend(result["data"]["items"])
        if not result["data"]["has_next_page"]:
            break
        page += 1

    return all_items
```
Pattern 2: Monitoring & Alerts
```python
import time

def monitor_price(url, product_name, target_price):
    """Monitor a product page and alert when price drops."""
    while True:
        data = requests.post(
            f"{BASE_URL}/extract",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": url,
                "prompt": f"What is the current price of {product_name}?",
                "schema": {
                    "type": "object",
                    "properties": {
                        "price": {"type": "number"},
                        "currency": {"type": "string"}
                    }
                }
            }
        ).json()

        if data["data"]["price"] <= target_price:
            print(f"🚨 Price drop! {product_name} is now ${data['data']['price']}")
            break

        time.sleep(3600)  # Check every hour
```
Pattern 3: Competitive Intelligence
```python
def analyze_competitor(url):
    """Extract competitor information from their website."""
    return requests.post(
        f"{BASE_URL}/extract",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": url,
            "prompt": "Extract the company's pricing tiers, key features, and any mentioned customer count or metrics",
            "schema": {
                "type": "object",
                "properties": {
                    "pricing_tiers": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "string"},
                                "features": {"type": "array", "items": {"type": "string"}}
                            }
                        }
                    },
                    "key_metrics": {"type": "array", "items": {"type": "string"}}
                }
            }
        }
    ).json()["data"]
```
Best Practices for Web Scraping in Python
1. Respect robots.txt
Always check a site's robots.txt before scraping. WebPerception API handles this automatically.
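If you're scraping a site directly, Python's standard library can do the robots.txt check for you. A minimal sketch using `urllib.robotparser`; the rules, user agent, and URLs here are made-up examples (in real code you'd call `parser.set_url(...)` and `parser.read()` to fetch the live file):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether a user agent may fetch a URL, given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example robots.txt that blocks /private/ for all agents
rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed(rules, "MyBot/1.0", "https://example.com/products"))      # True
print(is_allowed(rules, "MyBot/1.0", "https://example.com/private/data"))  # False
```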
2. Rate Limit Your Requests
```python
import time

for url in urls:
    result = scrape(url)
    time.sleep(1)  # Be respectful
```
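A fixed `sleep(1)` also counts the time the request itself took, so it over-waits. A small limiter that enforces a minimum interval between calls is a common refinement; this is a sketch, not tied to any particular API:

```python
import time

class RateLimiter:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever portion of the interval hasn't already elapsed
        delay = self.min_interval - (time.monotonic() - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
# for url in urls:
#     limiter.wait()
#     result = scrape(url)
```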
3. Handle Errors Gracefully
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503],
    allowed_methods=["GET", "POST"],  # Retry's defaults exclude POST
)
session.mount("https://", HTTPAdapter(max_retries=retries))
```
4. Use Structured Schemas
When using WebPerception's /extract endpoint, always define a JSON schema. This ensures consistent, typed output you can pipe directly into your database or downstream systems.
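Even with a schema, it's worth validating responses before they reach your database: an extraction can occasionally come back with a missing or mistyped field. A minimal stdlib type check against the kind of property map used above; in production you might reach for the `jsonschema` package instead:

```python
# Map JSON Schema type names to Python types (bool passing as "number" is ignored here)
EXPECTED_TYPES = {"string": str, "number": (int, float), "boolean": bool}

def validate_record(record: dict, properties: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, spec in properties.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], EXPECTED_TYPES[spec["type"]]):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

schema_props = {"name": {"type": "string"}, "price": {"type": "number"}}
print(validate_record({"name": "Laptop", "price": 999.0}, schema_props))  # []
print(validate_record({"name": "Laptop", "price": "999"}, schema_props))  # ['wrong type for price: str']
```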
5. Cache Results
```python
import hashlib
import json
import os

def cached_scrape(url, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{cache_key}.json")

    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)

    result = scrape(url)
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result
```
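A cache that never expires will happily serve stale pages. One way to add a time-to-live is to check the cache file's age before trusting it; this freshness helper is plain stdlib and could replace the bare `os.path.exists` check in `cached_scrape`:

```python
import os
import time

def is_fresh(cache_file: str, max_age: float) -> bool:
    """True if the cache file exists and was written within `max_age` seconds."""
    return (
        os.path.exists(cache_file)
        and time.time() - os.path.getmtime(cache_file) < max_age
    )

# In cached_scrape, swap the existence check for:
# if is_fresh(cache_file, max_age=3600):  # one-hour TTL
#     ...
```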
When to Use What
| Scenario | Recommended Tool |
|----------|-----------------|
| Learning/hobby project | BeautifulSoup |
| Browser automation & testing | Playwright |
| Production data extraction | WebPerception API |
| AI agent building | WebPerception API |
| One-off simple scrape | BeautifulSoup |
| Scaling to thousands of pages | WebPerception API |
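When you do scale to thousands of pages, most of the wall-clock time is spent waiting on the network, so a thread pool helps whichever tool you pick. A hedged sketch: `fetch` stands in for any of the scrape functions above, and the worker count is just a starting point:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_many(urls, fetch, max_workers=8):
    """Fetch many URLs concurrently; returns {url: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc  # Keep going; inspect failures afterwards
    return results

# results = scrape_many(product_urls, scrape)
```

Storing exceptions instead of raising keeps one bad page from sinking a whole batch.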
Getting Started with WebPerception API
1. Sign up at mantisapi.com — 100 free API calls/month
2. Get your API key from the dashboard
3. Start scraping with the Python examples above
```bash
# Test it right now
curl -X POST https://api.mantisapi.com/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}'
```
No infrastructure. No browser management. No broken selectors. Just data.
---
WebPerception API by Mantis — web perception for AI agents. Start free →