Web scraping is the automated extraction of data from websites. Instead of manually copying information, you write code that fetches web pages, parses the HTML, and extracts the data you need — product prices, article text, contact details, job listings, or any other structured information.
In 2026, web scraping powers everything from price monitoring and market research to AI agent data pipelines and competitive intelligence. And Python is by far the most popular language for it.
Python owns web scraping for good reason: a readable syntax that keeps scrapers short, a mature ecosystem (Requests, BeautifulSoup, Scrapy, and Playwright all treat Python as a first-class citizen), and a huge community that has already answered nearly every scraping problem you'll hit.
Here's every major tool in the Python scraping ecosystem and when to use each:
| Tool | Type | Best For | JS Support | Guide |
|---|---|---|---|---|
| Requests | HTTP client | Simple HTTP requests, APIs | ❌ | Full guide → |
| httpx | Async HTTP client | High-performance async scraping | ❌ | Full guide → |
| BeautifulSoup | HTML parser | Parsing HTML, extracting data | ❌ | Full guide → |
| Scrapy | Framework | Large-scale crawling | ❌ (plugins) | Full guide → |
| Playwright | Browser automation | JS-rendered pages, SPAs | ✅ | Full guide → |
| Selenium | Browser automation | Legacy browser testing + scraping | ✅ | Full guide → |
| Mantis API | Web scraping API | Production scraping, AI agents | ✅ | Full guide → |
Let's build a working scraper in under 20 lines. We'll use Requests to fetch a page and BeautifulSoup to parse it:
```python
import requests
from bs4 import BeautifulSoup

# 1. Fetch the page
url = "https://news.ycombinator.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers, timeout=10)

# 2. Parse the HTML
soup = BeautifulSoup(response.text, "lxml")

# 3. Extract data
stories = soup.select(".titleline > a")
for story in stories[:10]:
    print(story.text, "→", story["href"])
```
Install the dependencies:
```bash
pip install requests beautifulsoup4 lxml
```
That's it — a working scraper in 10 lines. For a deep dive into every Requests feature (sessions, auth, proxies, retries), see our complete Requests guide. For advanced BeautifulSoup techniques, see the BeautifulSoup guide.
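As a taste of what Sessions offer, here is a minimal sketch of automatic retries with exponential backoff, built from requests' `HTTPAdapter` and urllib3's `Retry`. The retry counts, status codes, and User-Agent value are illustrative, not recommendations:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures (429s and 5xx) up to 3 times, with exponential backoff
retry = Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)
session.headers.update({"User-Agent": "Mozilla/5.0"})

# session.get(...) now retries automatically on transient errors
```

Mounting the adapter on both schemes means every request through this session inherits the retry policy; nothing else in your scraping code changes.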
The scraper above won't work on modern SPAs (React, Angular, Vue) because those sites render content with JavaScript. Requests only downloads raw HTML — it doesn't execute JS.
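To see the problem concretely, here is a sketch of what Requests actually receives from a typical SPA. The markup is invented for illustration: an empty mount point plus a script tag, with the real data only appearing after `bundle.js` runs in a browser.

```python
from bs4 import BeautifulSoup

# A typical SPA response: an empty mount point plus a script tag
spa_html = """
<html><body>
  <div id="root"></div>
  <script src="/bundle.js"></script>
</body></html>
"""

soup = BeautifulSoup(spa_html, "html.parser")
root = soup.select_one("#root")
print(repr(root.text.strip()))  # '' (the product data was never in the HTML)
```

No amount of clever parsing helps here; the content simply is not in the document until JavaScript executes.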
Playwright is the modern solution. It runs a real browser (Chromium, Firefox, or WebKit) headlessly:
```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-app")

    # Wait for dynamic content to load
    page.wait_for_selector(".product-card")

    # Get rendered HTML and parse
    soup = BeautifulSoup(page.content(), "lxml")
    products = soup.select(".product-card")
    for product in products:
        name = product.select_one(".name").text
        price = product.select_one(".price").text
        print(f"{name}: {price}")

    browser.close()
```
```bash
pip install playwright
playwright install chromium
```
Playwright is faster and more reliable than Selenium for scraping in 2026. It supports auto-waiting, network interception, and runs all three browser engines. See our complete Playwright guide for stealth mode, infinite scroll handling, and concurrent scraping.
When you need to scrape thousands of pages, Scrapy is the gold standard. It's a complete crawling framework with built-in concurrency, request scheduling, middleware, and data pipelines:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css(".name::text").get(),
                "price": product.css(".price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }

        # Follow pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
```bash
pip install scrapy
scrapy runspider products_spider.py -o products.json
```
Scrapy handles concurrency (16 requests by default), retries, rate limiting, and exports out of the box. It's overkill for small scripts but essential for crawling entire sites. See our complete Scrapy guide for middleware, pipelines, and deployment.
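Those defaults are all tunable. Here is a sketch of the relevant knobs in a project's `settings.py`; the values are illustrative, not recommendations:

```python
# settings.py (throttling and politeness knobs, illustrative values)
CONCURRENT_REQUESTS = 16             # Scrapy's default level of parallelism
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap parallelism per target site
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
RETRY_TIMES = 2                      # retry failed requests twice before giving up
```

`AUTOTHROTTLE_ENABLED` is usually the highest-value switch: it adapts your request rate to the server's actual response times instead of a fixed delay.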
httpx is the modern replacement for Requests — it supports async/await, HTTP/2, and connection pooling. Perfect for scraping multiple pages concurrently:
```python
import asyncio
import httpx
from bs4 import BeautifulSoup

async def scrape_page(client, url):
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    title = soup.select_one("h1").text
    return {"url": url, "title": title}

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1, 51)]
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
        limits=httpx.Limits(max_connections=10),
    ) as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
        for r in results:
            print(r)

asyncio.run(main())
```
Because the requests run concurrently, httpx can fetch all 50 pages in roughly the time a sequential Requests loop would spend on a handful. See our complete httpx guide for HTTP/2 multiplexing, proxy rotation, and retry strategies.
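One retry pattern that pairs well with the async client above is a backoff wrapper. This is a hedged sketch: `fetch` stands in for any awaitable request function you supply (for example `client.get`), and the attempt counts and delays are illustrative:

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, attempts=3, base_delay=0.5):
    """Retry an async fetch with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # no attempts left, surface the error
            # back off exponentially, with jitter to avoid thundering herds
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Wrapping each call as `await fetch_with_retry(client.get, url)` keeps one transient failure from sinking an entire `asyncio.gather`.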
Websites fight scrapers with anti-bot systems (Cloudflare, DataDome, PerimeterX). Here's how to avoid detection:
```python
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/537.36 Chrome/131.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/131.0",
]

PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
]

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # pages to scrape

session = requests.Session()
for url in urls:
    # Rotate the User-Agent and proxy on every request
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    proxy = random.choice(PROXIES)
    response = session.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    # ... parse response ...
    time.sleep(random.uniform(1, 3))  # Random delay between requests
```
For a complete deep dive into anti-blocking strategies, CAPTCHA handling, and stealth techniques, see our guide to scraping without getting blocked.
Mantis handles proxy rotation, JavaScript rendering, and anti-blocking automatically. One API call, clean data back.
Try Mantis Free — 100 Calls/Month →

Building scraping infrastructure is a rabbit hole. Here's the real cost comparison:
| Component | DIY Cost (Monthly) | Mantis API |
|---|---|---|
| Proxy rotation | $50–500 | ✅ Included |
| Headless browsers | $100–300 | ✅ Included |
| CAPTCHA solving | $50–200 | ✅ Included |
| Anti-bot bypass | Engineering time | ✅ Included |
| Maintenance | Ongoing dev hours | ✅ Managed |
| Total | $200–1,000+ | From $29/mo |
Use a web scraping API when you're running production workloads, hitting sites with aggressive anti-bot protection, need JavaScript rendering without managing browser infrastructure, or would rather spend engineering time on your product than on proxies and CAPTCHAs.

Use DIY Python scraping when you're working at small scale, targeting sites without heavy bot protection, learning the fundamentals, or need full control over every request.
See our comparison of the best web scraping APIs for AI agents for a detailed breakdown.
| Criteria | Requests + BS4 | Scrapy | Playwright | httpx | Mantis API |
|---|---|---|---|---|---|
| Learning curve | ⭐ Easy | ⭐⭐⭐ Steep | ⭐⭐ Medium | ⭐⭐ Medium | ⭐ Easy |
| Speed | Medium | Fast | Slow | Very fast | Fast |
| JS support | ❌ | ❌ (plugin) | ✅ | ❌ | ✅ |
| Concurrency | Manual | Built-in | Limited | Built-in | Built-in |
| Anti-bot bypass | Manual | Manual | Stealth plugins | Manual | Automatic |
| Best for | Quick scripts | Large crawls | JS-heavy sites | Async scraping | Production / AI |
Mantis WebPerception API: scraping, screenshots, and AI extraction — one API call. Built for AI agents.
Start Free →

It depends on your needs. For simple HTML parsing, use BeautifulSoup with Requests. For large-scale crawling, use Scrapy. For JavaScript-rendered pages, use Playwright. For async high-performance scraping, use httpx. For production workloads without infrastructure overhead, use a web scraping API like Mantis.
Scraping publicly available data is generally legal in the US (hiQ v. LinkedIn, 2022). However, you should respect robots.txt, avoid scraping personal data without consent (GDPR/CCPA), and not overload servers with excessive requests. Always consult legal counsel for your specific use case.
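Checking robots.txt takes only a few lines with the standard library's `urllib.robotparser`. In this sketch the rules are supplied inline for illustration rather than fetched from a live site:

```python
from urllib.robotparser import RobotFileParser

# Parse an illustrative robots.txt instead of fetching one over the network
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "https://example.com/products"))   # True
print(rp.can_fetch("MyScraper", "https://example.com/private/x"))  # False
```

In real use you'd call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` to load the site's actual rules, then gate each request on `can_fetch`.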
Yes, but not with Requests or BeautifulSoup alone. You need a browser automation tool like Playwright or Selenium that runs a real browser to execute JavaScript. Alternatively, a web scraping API like Mantis handles JavaScript rendering server-side.
Use rotating User-Agent headers, add random delays between requests, rotate proxies, respect robots.txt, and use headless browsers with stealth plugins. For detailed techniques, see our complete anti-blocking guide.
They serve different purposes. BeautifulSoup is a parser — it extracts data from HTML. Scrapy is a complete crawling framework with concurrency, scheduling, and pipelines. Use BeautifulSoup for quick scripts; use Scrapy for large-scale production crawling.
Python libraries are free, but production scraping has hidden costs: proxy services ($50–500/month), headless browser infrastructure ($100–300/month), CAPTCHA solving ($1–3 per 1,000), and engineering time. A web scraping API like Mantis starts free (100 calls/month) with paid plans from $29/month — often cheaper than DIY infrastructure.
© 2026 Mantis · Web scraping, screenshots, and AI data extraction for agents and developers.