Web Scraping with Python in 2026: The Complete Guide
Web scraping with Python remains one of the most in-demand skills for developers. But in 2026, the landscape has changed dramatically. JavaScript-heavy sites, anti-bot measures, and the rise of AI agents have made traditional scraping harder — and smarter alternatives essential.
This guide covers everything: from basic scraping with BeautifulSoup to production-grade extraction with AI-powered APIs like WebPerception.
The Evolution of Python Web Scraping
The Old Way (2015–2022)
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for item in soup.select(".product-card"):
    products.append({
        "name": item.select_one(".title").text,
        "price": item.select_one(".price").text,
    })
```
This worked fine for static HTML pages. But it breaks when:
- The page renders with JavaScript (most modern sites)
- The site has anti-bot protection (Cloudflare, reCAPTCHA)
- The HTML structure changes (your selectors break)
- You need to scale beyond a few hundred requests
The Headless Browser Era (2018–2024)
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/products")

# Implicit wait: element lookups below retry for up to 10 seconds
driver.implicitly_wait(10)

products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
```
Selenium and Playwright solved the JavaScript problem but introduced new ones:
- Resource-heavy: Each browser instance uses 200–500MB RAM
- Slow: Pages take 3–10 seconds to fully render
- Fragile: Browser updates break your setup
- Expensive to scale: Running headless Chrome at scale requires serious infrastructure
The AI-Powered Era (2025+)
```python
import requests

response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with name, price, and rating",
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "rating": {"type": "number"}
                }
            }
        }
    }
)

products = response.json()["data"]
```
No selectors. No browser management. No breaking when HTML changes. Just describe what you want, and the API extracts it.
Method 1: BeautifulSoup (Basic Static Pages)
Best for: Simple HTML pages, learning, small projects.
Installation
```bash
pip install beautifulsoup4 requests
```
Basic Example
```python
import requests
from bs4 import BeautifulSoup

def scrape_articles(url):
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"
    })
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for article in soup.select("article"):
        title = article.select_one("h2")
        link = article.select_one("a")
        summary = article.select_one("p")
        articles.append({
            "title": title.text.strip() if title else None,
            "url": link["href"] if link else None,
            "summary": summary.text.strip() if summary else None,
        })
    return articles
```
Limitations
- ❌ No JavaScript rendering
- ❌ Breaks when HTML structure changes
- ❌ No built-in rate limiting or proxy rotation
- ❌ You handle all error handling, retries, and edge cases
Method 2: Playwright (JavaScript-Heavy Sites)
Best for: SPAs, sites requiring interaction, when you need browser automation.
Installation
```bash
pip install playwright
playwright install chromium
```
Basic Example
```python
from playwright.sync_api import sync_playwright

def scrape_spa(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Wait for content to load
        page.wait_for_selector(".product-list")

        products = page.evaluate("""
            () => Array.from(document.querySelectorAll('.product')).map(el => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
            }))
        """)

        browser.close()
        return products
```
Limitations
- ❌ Resource-heavy (200–500MB per browser)
- ❌ Slow (3–10s per page)
- ❌ Complex to deploy and scale
- ❌ Anti-bot detection catches headless browsers
Method 3: WebPerception API (Production-Grade)
Best for: Production applications, AI agents, structured data extraction, scaling.
Installation
```bash
pip install requests  # That's it.
```
Scrape Any Page
```python
import requests

API_KEY = "your_api_key"  # Get free at mantisapi.com
BASE_URL = "https://api.mantisapi.com"

def scrape(url):
    """Get clean, readable content from any URL."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url}
    )
    return response.json()

result = scrape("https://news.ycombinator.com")
print(result["content"])  # Clean markdown content
```
Extract Structured Data with AI
```python
def extract_products(url):
    """Extract structured product data — no selectors needed."""
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": url,
            "prompt": "Extract all products with name, price, rating, and availability",
            "schema": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "currency": {"type": "string"},
                        "rating": {"type": "number"},
                        "in_stock": {"type": "boolean"}
                    }
                }
            }
        }
    )
    return response.json()["data"]

# Works on ANY e-commerce site — no site-specific code
products = extract_products("https://example-store.com/laptops")
```
Take Screenshots
```python
def screenshot(url, full_page=False):
    """Capture a screenshot of any webpage."""
    response = requests.post(
        f"{BASE_URL}/screenshot",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url, "fullPage": full_page}
    )

    # Save the image
    with open("screenshot.png", "wb") as f:
        f.write(response.content)

screenshot("https://competitor.com/pricing", full_page=True)
```
Why WebPerception Wins for Production
| Feature | BeautifulSoup | Playwright | WebPerception API |
|---------|:---:|:---:|:---:|
| JavaScript rendering | ❌ | ✅ | ✅ |
| AI data extraction | ❌ | ❌ | ✅ |
| Anti-bot handling | ❌ | ⚠️ | ✅ |
| Scales to 100K+ pages | ❌ | ⚠️ | ✅ |
| Infrastructure needed | None | Heavy | None |
| Maintenance | High | High | Zero |
| Setup time | 5 min | 30 min | 2 min |
Common Web Scraping Patterns in Python
Pattern 1: Paginated Data
```python
def scrape_all_pages(base_url):
    all_items = []
    page = 1

    while True:
        result = requests.post(
            f"{BASE_URL}/extract",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": f"{base_url}?page={page}",
                "prompt": "Extract all items. Also tell me if there's a next page.",
                "schema": {
                    "type": "object",
                    "properties": {
                        "items": {"type": "array", "items": {"type": "object"}},
                        "has_next_page": {"type": "boolean"}
                    }
                }
            }
        ).json()

        all_items.extend(result["data"]["items"])
        if not result["data"]["has_next_page"]:
            break
        page += 1

    return all_items
```
Pattern 2: Monitoring & Alerts
```python
import time

def monitor_price(url, product_name, target_price):
    """Monitor a product page and alert when price drops."""
    while True:
        data = requests.post(
            f"{BASE_URL}/extract",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "url": url,
                "prompt": f"What is the current price of {product_name}?",
                "schema": {
                    "type": "object",
                    "properties": {
                        "price": {"type": "number"},
                        "currency": {"type": "string"}
                    }
                }
            }
        ).json()

        if data["data"]["price"] <= target_price:
            print(f"🚨 Price drop! {product_name} is now ${data['data']['price']}")
            break

        time.sleep(3600)  # Check every hour
```
Pattern 3: Competitive Intelligence
```python
def analyze_competitor(url):
    """Extract competitor information from their website."""
    return requests.post(
        f"{BASE_URL}/extract",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "url": url,
            "prompt": "Extract the company's pricing tiers, key features, and any mentioned customer count or metrics",
            "schema": {
                "type": "object",
                "properties": {
                    "pricing_tiers": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "string"},
                                "features": {"type": "array", "items": {"type": "string"}}
                            }
                        }
                    },
                    "key_metrics": {"type": "array", "items": {"type": "string"}}
                }
            }
        }
    ).json()["data"]
```
Best Practices for Web Scraping in Python
1. Respect robots.txt
Always check a site's robots.txt before scraping. WebPerception API handles this automatically.
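If you're scraping a site directly, Python's standard library can do the robots.txt check for you. A minimal sketch using `urllib.robotparser`; the rules, user agent, and URLs here are made-up examples (in real code you'd call `parser.set_url(...)` and `parser.read()` to fetch the live file):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether a user agent may fetch a URL, given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example robots.txt that blocks /private/ for all agents
rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed(rules, "MyBot/1.0", "https://example.com/products"))      # True
print(is_allowed(rules, "MyBot/1.0", "https://example.com/private/data"))  # False
```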
2. Rate Limit Your Requests
```python
import time

for url in urls:
    result = scrape(url)
    time.sleep(1)  # Be respectful
```
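A fixed `sleep(1)` also counts the time the request itself took, so it over-waits. A small limiter that enforces a minimum interval between calls is a common refinement; this is a sketch, not tied to any particular API:

```python
import time

class RateLimiter:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever portion of the interval hasn't already elapsed
        delay = self.min_interval - (time.monotonic() - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
# for url in urls:
#     limiter.wait()
#     result = scrape(url)
```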
3. Handle Errors Gracefully
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503],
    allowed_methods=["GET", "POST"],  # Retry's defaults exclude POST
)
session.mount("https://", HTTPAdapter(max_retries=retries))
```
4. Use Structured Schemas
When using WebPerception's /extract endpoint, always define a JSON schema. This ensures consistent, typed output you can pipe directly into your database or downstream systems.
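Even with a schema, it's worth validating responses before they reach your database: an extraction can occasionally come back with a missing or mistyped field. A minimal stdlib type check against the kind of property map used above; in production you might reach for the `jsonschema` package instead:

```python
# Map JSON Schema type names to Python types (bool passing as "number" is ignored here)
EXPECTED_TYPES = {"string": str, "number": (int, float), "boolean": bool}

def validate_record(record: dict, properties: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, spec in properties.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], EXPECTED_TYPES[spec["type"]]):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

schema_props = {"name": {"type": "string"}, "price": {"type": "number"}}
print(validate_record({"name": "Laptop", "price": 999.0}, schema_props))  # []
print(validate_record({"name": "Laptop", "price": "999"}, schema_props))  # ['wrong type for price: str']
```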
5. Cache Results
```python
import hashlib
import json
import os

def cached_scrape(url, cache_dir="cache"):
    os.makedirs(cache_dir, exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{cache_key}.json")

    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)

    result = scrape(url)
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result
```
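A cache that never expires will happily serve stale pages. One way to add a time-to-live is to check the cache file's age before trusting it; this freshness helper is plain stdlib and could replace the bare `os.path.exists` check in `cached_scrape`:

```python
import os
import time

def is_fresh(cache_file: str, max_age: float) -> bool:
    """True if the cache file exists and was written within `max_age` seconds."""
    return (
        os.path.exists(cache_file)
        and time.time() - os.path.getmtime(cache_file) < max_age
    )

# In cached_scrape, swap the existence check for:
# if is_fresh(cache_file, max_age=3600):  # one-hour TTL
#     ...
```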
When to Use What
| Scenario | Recommended Tool |
|----------|-----------------|
| Learning/hobby project | BeautifulSoup |
| Browser automation & testing | Playwright |
| Production data extraction | WebPerception API |
| AI agent building | WebPerception API |
| One-off simple scrape | BeautifulSoup |
| Scaling to thousands of pages | WebPerception API |
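When you do scale to thousands of pages, most of the wall-clock time is spent waiting on the network, so a thread pool helps whichever tool you pick. A hedged sketch: `fetch` stands in for any of the scrape functions above, and the worker count is just a starting point:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_many(urls, fetch, max_workers=8):
    """Fetch many URLs concurrently; returns {url: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc  # Keep going; inspect failures afterwards
    return results

# results = scrape_many(product_urls, scrape)
```

Storing exceptions instead of raising keeps one bad page from sinking a whole batch.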
Getting Started with WebPerception API
1. Sign up at mantisapi.com — 100 free API calls/month
2. Get your API key from the dashboard
3. Start scraping with the Python examples above
```bash
# Test it right now
curl -X POST https://api.mantisapi.com/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}'
```
No infrastructure. No browser management. No broken selectors. Just data.
---
WebPerception API by Mantis — web perception for AI agents. Start free →