Scrapy is the most powerful open-source web scraping framework for Python. Built on the Twisted asynchronous networking library, Scrapy handles concurrent requests, automatic retries, request throttling, data pipelines, and export — all out of the box.
While libraries like BeautifulSoup only handle HTML parsing, Scrapy is a complete framework that manages the entire scraping workflow: making HTTP requests, parsing responses, extracting data, cleaning it, and storing it.
Key features:
- Asynchronous, concurrent requests built on Twisted
- Automatic retries and per-domain request throttling
- Middleware hooks for modifying requests and responses
- Item pipelines for cleaning and storing extracted data
- Built-in export to JSON, CSV, and other formats
# Create a virtual environment
python -m venv scraping-env
source scraping-env/bin/activate # Linux/macOS
# scraping-env\Scripts\activate # Windows
# Install Scrapy
pip install scrapy
# Verify installation
scrapy version
# Scrapy 2.11.x
# Generate project structure
scrapy startproject myproject
cd myproject
# Project structure:
# myproject/
# ├── scrapy.cfg
# └── myproject/
# ├── __init__.py
# ├── items.py
# ├── middlewares.py
# ├── pipelines.py
# ├── settings.py
# └── spiders/
# └── __init__.py
You can also run a single-file spider with scrapy runspider my_spider.py for quick scripts. But for production work, always use the project structure.
Create a file at myproject/spiders/quotes_spider.py:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Run it:
# Run spider and save output
scrapy crawl quotes -O quotes.json
# Run with logging level
scrapy crawl quotes -O quotes.csv -L INFO
That's it — Scrapy handles the HTTP requests, follows pagination, and exports structured data. Let's break down each component.
Scrapy supports both CSS selectors and XPath expressions. Use whichever you're more comfortable with — or mix them.
# Get text content
response.css("h1::text").get() # First match
response.css("p::text").getall() # All matches
# Get attributes
response.css("a::attr(href)").get() # href attribute
response.css("img::attr(src)").getall() # All image sources
# Nested selectors
response.css("div.product").css("span.price::text").get()
# Pseudo-selectors
response.css("li:nth-child(1)::text").get() # First list item
# XPath equivalents
response.xpath("//h1/text()").get()
response.xpath("//a/@href").getall()
# XPath advantages — text contains
response.xpath('//p[contains(text(), "price")]/text()').get()
# Select by position
response.xpath("//table/tr[position()>1]").getall() # Skip header row
# Parent/sibling traversal
response.xpath('//span[@class="price"]/parent::div').get()
response.css("div.item").xpath(".//span/text()").get()
Scrapy excels at crawling — following links across pages to build a complete dataset.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        """Parse the listing page — follow each product link."""
        for product_link in response.css("a.product-card::attr(href)"):
            yield response.follow(product_link, callback=self.parse_product)

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        """Parse individual product page."""
        yield {
            "url": response.url,
            "name": response.css("h1.product-title::text").get(default="").strip(),
            "price": response.css("span.price::text").get(default="").strip(),
            "description": response.css("div.description p::text").getall(),
            "sku": response.css("span.sku::text").get(),
            "in_stock": bool(response.css("span.in-stock")),
            "images": response.css("img.product-image::attr(src)").getall(),
        }
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteCrawler(CrawlSpider):
    name = "site_crawler"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    rules = (
        # Follow category links, don't scrape them
        Rule(LinkExtractor(allow=r"/category/")),
        # Follow and scrape product pages
        Rule(LinkExtractor(allow=r"/product/"), callback="parse_product"),
    )

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
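The allow patterns are ordinary regular expressions matched against each discovered URL. A small stdlib sketch (URLs invented) of which rule each link would hit:

```python
import re

# Same allow patterns as the CrawlSpider rules above
RULES = [
    (re.compile(r"/category/"), None),            # follow links, no callback
    (re.compile(r"/product/"), "parse_product"),  # follow and scrape
]

def classify(url):
    """Return the callback a URL would be routed to, if any."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return callback or "follow-only"
    return "ignored"

print(classify("https://example.com/category/shoes"))    # follow-only
print(classify("https://example.com/product/red-shoe"))  # parse_product
print(classify("https://example.com/about"))             # ignored
```

Links matching no rule are never followed, which keeps the crawl bounded.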
For structured, validated data, define Items and use Item Loaders to clean data during extraction.
# myproject/items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    sku = scrapy.Field()
    category = scrapy.Field()
    rating = scrapy.Field()
    reviews_count = scrapy.Field()
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
from myproject.items import ProductItem
import re

def clean_price(value):
    """Extract numeric price from string."""
    match = re.search(r"[\d,.]+", value)
    return float(match.group().replace(",", "")) if match else None

class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()

    name_in = MapCompose(str.strip)
    price_in = MapCompose(str.strip, clean_price)
    description_in = MapCompose(str.strip)
    description_out = Join("\n")

# In your spider:
class ProductSpider(scrapy.Spider):
    name = "products"

    def parse_product(self, response):
        loader = ProductLoader(response=response)
        loader.add_css("name", "h1.product-title::text")
        loader.add_css("price", "span.price::text")
        loader.add_css("description", "div.description p::text")
        loader.add_css("sku", "span.sku::text")
        loader.add_value("url", response.url)
        yield loader.load_item()
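Because clean_price is plain Python, you can sanity-check it on its own before wiring it into a loader (the input strings here are made up for illustration):

```python
import re

def clean_price(value):
    """Extract numeric price from string."""
    match = re.search(r"[\d,.]+", value)
    return float(match.group().replace(",", "")) if match else None

print(clean_price("$1,299.99"))  # 1299.99
print(clean_price("Price: 42"))  # 42.0
print(clean_price("sold out"))   # None
```

Testing processors in isolation like this is much faster than re-running a spider to debug extraction bugs.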
Scrapy is powerful but complex. Mantis API gives you scraped, structured data with a single HTTP request — no spiders, no pipelines, no infrastructure.
Pipelines process items after extraction — validate, clean, deduplicate, and store data.
# myproject/pipelines.py
import sqlite3

from scrapy.exceptions import DropItem

class ValidationPipeline:
    """Drop items missing required fields."""
    def process_item(self, item, spider):
        if not item.get("name"):
            raise DropItem(f"Missing name: {item}")
        if not item.get("price"):
            raise DropItem(f"Missing price: {item}")
        return item

class DuplicatesPipeline:
    """Filter duplicate items by URL."""
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate: {url}")
        self.seen_urls.add(url)
        return item

class SQLitePipeline:
    """Store items in an SQLite database."""
    def open_spider(self, spider):
        self.conn = sqlite3.connect("products.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                url TEXT PRIMARY KEY,
                name TEXT,
                price REAL,
                sku TEXT,
                description TEXT
            )
        """)

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?, ?)",
            (item.get("url"), item.get("name"), item.get("price"),
             item.get("sku"), item.get("description"))
        )
        return item
Enable pipelines in settings.py:
# myproject/settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 100,
    "myproject.pipelines.DuplicatesPipeline": 200,
    "myproject.pipelines.SQLitePipeline": 300,
}
# Lower number = higher priority (runs first)
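Conceptually, each item flows through the enabled pipelines in ascending priority order, and a dropped item never reaches later stages. A simplified pure-Python sketch of that flow (illustrative only, not Scrapy's real internals):

```python
class DropItem(Exception):
    """Raised to discard an item mid-pipeline."""

def validate(item):
    if not item.get("name"):
        raise DropItem("missing name")
    return item

_seen_urls = set()

def dedupe(item):
    if item["url"] in _seen_urls:
        raise DropItem(f"duplicate: {item['url']}")
    _seen_urls.add(item["url"])
    return item

# (priority, stage) pairs; lower priority runs first
PIPELINES = sorted([(200, dedupe), (100, validate)])

def process(items):
    kept = []
    for item in items:
        try:
            for _, stage in PIPELINES:
                item = stage(item)
            kept.append(item)
        except DropItem:
            continue  # dropped items skip all later stages
    return kept

result = process([
    {"url": "/a", "name": "Widget"},
    {"url": "/a", "name": "Widget"},  # duplicate -> dropped
    {"url": "/b", "name": ""},        # fails validation -> dropped
])
print(result)  # [{'url': '/a', 'name': 'Widget'}]
```

This is why validation runs at a lower number than storage: there's no point writing an item to the database if it's about to be dropped.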
Middleware intercepts requests and responses — essential for anti-bot evasion.
# myproject/middlewares.py
import random

class RandomUserAgentMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:123.0) Gecko/20100101 Firefox/123.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)

class ProxyRotationMiddleware:
    def __init__(self):
        self.proxies = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # Retry with a different proxy on failure
        request.meta["proxy"] = random.choice(self.proxies)
        return request
Enable middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # Disable default User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Enable custom middleware
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.ProxyRotationMiddleware": 410,
}
Key settings for production scraping:
# myproject/settings.py
# Concurrency
CONCURRENT_REQUESTS = 16 # Total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # Per domain limit
DOWNLOAD_DELAY = 1.0 # Seconds between requests (per domain)
RANDOMIZE_DOWNLOAD_DELAY = True # Randomize ±50%
# Retry
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Caching (avoid re-downloading during development)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours
HTTPCACHE_DIR = "httpcache"
# Respect robots.txt
ROBOTSTXT_OBEY = True
# Auto-throttle (adjusts delay based on server response time)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
# Logging
LOG_LEVEL = "INFO"
LOG_FILE = "scrapy.log"
# Export
FEEDS = {
    "output/%(name)s_%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}
Setting CONCURRENT_REQUESTS too high without proxies will get your IP blocked quickly. Start with 4-8 concurrent requests and increase gradually. Always enable AUTOTHROTTLE for production spiders.
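You can also keep conservative defaults in settings.py and override them per spider via the custom_settings class attribute. A sketch with illustrative values:

```python
import scrapy

class CarefulSpider(scrapy.Spider):
    name = "careful"
    # Per-spider overrides take precedence over settings.py
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "DOWNLOAD_DELAY": 2.0,
        "AUTOTHROTTLE_ENABLED": True,
    }
```

This lets one slow, cautious spider coexist with faster spiders in the same project.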
The Scrapy shell is an interactive environment for testing selectors — use it before writing spiders.
# Launch shell with a URL
scrapy shell "https://quotes.toscrape.com"
# Test selectors interactively
>>> response.css("div.quote span.text::text").getall()
>>> response.xpath("//small[@class='author']/text()").get()
>>> response.css("li.next a::attr(href)").get()
# View the response
>>> view(response) # Opens in browser
>>> response.status # HTTP status code
>>> response.headers # Response headers
Tip: install IPython for a nicer shell experience: pip install ipython
Scrapy downloads raw HTML — it doesn't execute JavaScript. For SPAs and dynamic content, integrate a headless browser.
# Install
pip install scrapy-playwright
playwright install chromium
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
import scrapy
from scrapy_playwright.page import PageMethod

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/spa-page",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for content to load
                    PageMethod("wait_for_selector", "div.results"),
                ],
            },
        )

    def parse(self, response):
        # response now contains JS-rendered HTML
        for item in response.css("div.result"):
            yield {
                "title": item.css("h3::text").get(),
                "description": item.css("p::text").get(),
            }
Managing headless browsers in Scrapy adds significant complexity. A web scraping API handles JavaScript rendering, proxy rotation, and anti-bot detection in a single request — no browser management required.
import scrapy
from urllib.parse import quote

class MantisSpider(scrapy.Spider):
    """Use Mantis API as a download backend for Scrapy."""
    name = "mantis_spider"
    api_key = "your_api_key"

    def start_requests(self):
        urls = ["https://example.com/page1", "https://example.com/page2"]
        for url in urls:
            # URL-encode the target so its query string doesn't break the API request
            yield scrapy.Request(
                f"https://api.mantisapi.com/v1/scrape?url={quote(url, safe='')}&render_js=true",
                headers={"Authorization": f"Bearer {self.api_key}"},
                cb_kwargs={"original_url": url},
            )

    def parse(self, response, original_url):
        data = response.json()
        yield {
            "url": original_url,
            "content": data.get("content"),
            "title": data.get("metadata", {}).get("title"),
        }
# Install scrapyd
pip install scrapyd scrapyd-client
# Start the daemon
scrapyd
# Deploy your project
scrapyd-deploy default -p myproject
# Schedule a spider
curl http://localhost:6800/schedule.json -d project=myproject -d spider=products
# Check status
curl http://localhost:6800/listjobs.json?project=myproject
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["scrapy", "crawl", "products", "-O", "/data/output.json"]
# Run with Docker
docker build -t my-scraper .
docker run -v $(pwd)/data:/data my-scraper
# Schedule with cron
# 0 6 * * * docker run --rm -v /data:/data my-scraper
How Scrapy compares to the other options for production scraping:
| Feature | Scrapy | BeautifulSoup | Selenium | Mantis API |
|---|---|---|---|---|
| Type | Framework | Parser library | Browser automation | API service |
| Concurrency | Built-in (async) | Manual (threads) | Manual (processes) | Built-in |
| JavaScript | Plugin required | ❌ No | ✅ Yes | ✅ Yes |
| Anti-bot | Manual middleware | Manual | Manual | ✅ Built-in |
| Proxies | Manual middleware | Manual | Manual | ✅ Built-in |
| Learning curve | High | Low | Medium | Low |
| Speed | ⚡ Very fast | Fast (parsing only) | Slow | Fast |
| Best for | Large-scale crawling | Quick scripts | JS-heavy sites | Any site, any scale |
| Infrastructure | Self-managed | None | Browser required | None (managed) |
| Cost (10K pages/mo) | $50-300 (servers + proxies) | $0-50 | $100-500 | $29-99 |
When to use Scrapy: Large-scale crawling projects with thousands of pages, when you need full control over the scraping pipeline, and when you have engineering resources to maintain infrastructure.
When to use an API instead: When you need data fast without managing spiders, proxies, and servers. When building AI agents that need web data on-demand. When you'd rather focus on using data than collecting it.
A few production tips:
- Install spidermon (pip install spidermon) for alerts on failures and data quality checks
- Enable HTTPCACHE to avoid re-downloading pages while testing selectors
- Use CLOSESPIDER_ITEMCOUNT to limit items during testing: scrapy crawl products -s CLOSESPIDER_ITEMCOUNT=10
- Use errback to handle request errors gracefully instead of crashing the spider

Love Scrapy's power but tired of managing proxies, browsers, and servers? Mantis API gives you the same data with a single API call. Free tier included.
Scrapy is a Python framework for large-scale web scraping and crawling. It provides built-in request scheduling, concurrency, middleware, data pipelines, and export formats. It's used for extracting structured data from websites — product catalogs, news articles, job listings, real estate data, and more.
They serve different purposes. BeautifulSoup is a parsing library — it only handles HTML parsing and needs Requests for HTTP. Scrapy is a complete framework with built-in concurrency, request scheduling, middleware, pipelines, and export. Use BeautifulSoup for quick one-off scripts; use Scrapy for large-scale crawling projects.
Not natively. Scrapy downloads raw HTML without executing JavaScript. For JS-rendered pages, use scrapy-playwright (recommended) or scrapy-splash to integrate a headless browser. Alternatively, a web scraping API like Mantis handles JavaScript rendering automatically.
Scrapy's asynchronous Twisted engine can handle hundreds of concurrent requests, scraping thousands of pages per minute. It's significantly faster than sequential tools like BeautifulSoup + Requests or Selenium.
Options include: Scrapyd (self-hosted daemon), Zyte Scrapy Cloud (managed hosting), Docker containers on any cloud provider, or cron jobs on a VPS. For production, you'll also need monitoring, proxy rotation, and error alerting.
Yes — Scrapy remains the most powerful Python scraping framework for large-scale projects. However, for AI agents and applications that need web data on-demand, a web scraping API is often more practical than managing your own spiders and infrastructure.