Scrapy is the most powerful open-source web scraping framework for Python. Built on the Twisted asynchronous networking library, Scrapy handles concurrent requests, automatic retries, request throttling, data pipelines, and export — all out of the box.
While libraries like BeautifulSoup only handle HTML parsing, Scrapy is a complete framework that manages the entire scraping workflow: making HTTP requests, parsing responses, extracting data, cleaning it, and storing it.
Key features:
- Asynchronous, concurrent requests built on Twisted
- Automatic retries and per-domain request throttling
- Middleware hooks for modifying requests and responses
- Item pipelines for cleaning and storing extracted data
- Built-in export to JSON, CSV, and other formats
# Create a virtual environment
python -m venv scraping-env
source scraping-env/bin/activate # Linux/macOS
# scraping-env\Scripts\activate # Windows
# Install Scrapy
pip install scrapy
# Verify installation
scrapy version
# Scrapy 2.11.x
# Generate project structure
scrapy startproject myproject
cd myproject
# Project structure:
# myproject/
# ├── scrapy.cfg
# └── myproject/
# ├── __init__.py
# ├── items.py
# ├── middlewares.py
# ├── pipelines.py
# ├── settings.py
# └── spiders/
# └── __init__.py
You can also run a single-file spider with scrapy runspider my_spider.py for quick scripts. But for production work, always use the project structure.
Create a file at myproject/spiders/quotes_spider.py:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Run it:
# Run spider and save output
scrapy crawl quotes -O quotes.json
# Run with logging level
scrapy crawl quotes -O quotes.csv -L INFO
That's it — Scrapy handles the HTTP requests, follows pagination, and exports structured data. Let's break down each component.
Scrapy supports both CSS selectors and XPath expressions. Use whichever you're more comfortable with — or mix them.
# Get text content
response.css("h1::text").get() # First match
response.css("p::text").getall() # All matches
# Get attributes
response.css("a::attr(href)").get() # href attribute
response.css("img::attr(src)").getall() # All image sources
# Nested selectors
response.css("div.product").css("span.price::text").get()
# Pseudo-selectors
response.css("li:nth-child(1)::text").get() # First list item
# XPath equivalents
response.xpath("//h1/text()").get()
response.xpath("//a/@href").getall()
# XPath advantages — text contains
response.xpath('//p[contains(text(), "price")]/text()').get()
# Select by position
response.xpath("//table/tr[position()>1]").getall() # Skip header row
# Parent/sibling traversal
response.xpath('//span[@class="price"]/parent::div').get()
response.css("div.item").xpath(".//span/text()").get()
Scrapy excels at crawling — following links across pages to build a complete dataset.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        """Parse the listing page — follow each product link."""
        for product_link in response.css("a.product-card::attr(href)"):
            yield response.follow(product_link, callback=self.parse_product)

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        """Parse individual product page."""
        yield {
            "url": response.url,
            "name": response.css("h1.product-title::text").get(default="").strip(),
            "price": response.css("span.price::text").get(default="").strip(),
            "description": response.css("div.description p::text").getall(),
            "sku": response.css("span.sku::text").get(),
            "in_stock": bool(response.css("span.in-stock")),
            "images": response.css("img.product-image::attr(src)").getall(),
        }
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteCrawler(CrawlSpider):
    name = "site_crawler"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    rules = (
        # Follow category links, don't scrape them
        Rule(LinkExtractor(allow=r"/category/")),
        # Follow and scrape product pages
        Rule(LinkExtractor(allow=r"/product/"), callback="parse_product"),
    )

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
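The allow patterns are ordinary regular expressions matched against each discovered URL. A small stdlib sketch (URLs invented) of which rule each link would hit:

```python
import re

# Same allow patterns as the CrawlSpider rules above
RULES = [
    (re.compile(r"/category/"), None),            # follow links, no callback
    (re.compile(r"/product/"), "parse_product"),  # follow and scrape
]

def classify(url):
    """Return the callback a URL would be routed to, if any."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return callback or "follow-only"
    return "ignored"

print(classify("https://example.com/category/shoes"))    # follow-only
print(classify("https://example.com/product/red-shoe"))  # parse_product
print(classify("https://example.com/about"))             # ignored
```

Links matching no rule are never followed, which keeps the crawl bounded.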
For structured, validated data, define Items and use Item Loaders to clean data during extraction.
# myproject/items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    sku = scrapy.Field()
    category = scrapy.Field()
    rating = scrapy.Field()
    reviews_count = scrapy.Field()
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
from myproject.items import ProductItem
import re

def clean_price(value):
    """Extract numeric price from string."""
    match = re.search(r"[\d,.]+", value)
    return float(match.group().replace(",", "")) if match else None

class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()

    name_in = MapCompose(str.strip)
    price_in = MapCompose(str.strip, clean_price)
    description_in = MapCompose(str.strip)
    description_out = Join("\n")

# In your spider:
class ProductSpider(scrapy.Spider):
    name = "products"

    def parse_product(self, response):
        loader = ProductLoader(response=response)
        loader.add_css("name", "h1.product-title::text")
        loader.add_css("price", "span.price::text")
        loader.add_css("description", "div.description p::text")
        loader.add_css("sku", "span.sku::text")
        loader.add_value("url", response.url)
        yield loader.load_item()
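Because clean_price is plain Python, you can sanity-check it on its own before wiring it into a loader (the input strings here are made up for illustration):

```python
import re

def clean_price(value):
    """Extract numeric price from string."""
    match = re.search(r"[\d,.]+", value)
    return float(match.group().replace(",", "")) if match else None

print(clean_price("$1,299.99"))  # 1299.99
print(clean_price("Price: 42"))  # 42.0
print(clean_price("sold out"))   # None
```

Testing processors in isolation like this is much faster than re-running a spider to debug extraction bugs.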
Scrapy is powerful but complex. Mantis API gives you scraped, structured data with a single HTTP request — no spiders, no pipelines, no infrastructure.
Pipelines process items after extraction — validate, clean, deduplicate, and store data.
# myproject/pipelines.py
import sqlite3

from scrapy.exceptions import DropItem

class ValidationPipeline:
    """Drop items missing required fields."""
    def process_item(self, item, spider):
        if not item.get("name"):
            raise DropItem(f"Missing name: {item}")
        if not item.get("price"):
            raise DropItem(f"Missing price: {item}")
        return item

class DuplicatesPipeline:
    """Filter duplicate items by URL."""
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate: {url}")
        self.seen_urls.add(url)
        return item

class SQLitePipeline:
    """Store items in an SQLite database."""
    def open_spider(self, spider):
        self.conn = sqlite3.connect("products.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                url TEXT PRIMARY KEY,
                name TEXT,
                price REAL,
                sku TEXT,
                description TEXT
            )
        """)

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?, ?)",
            (item.get("url"), item.get("name"), item.get("price"),
             item.get("sku"), item.get("description"))
        )
        return item
Enable pipelines in settings.py:
# myproject/settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 100,
    "myproject.pipelines.DuplicatesPipeline": 200,
    "myproject.pipelines.SQLitePipeline": 300,
}
# Lower number = higher priority (runs first)
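Conceptually, each item flows through the enabled pipelines in ascending priority order, and a dropped item never reaches later stages. A simplified pure-Python sketch of that flow (illustrative only, not Scrapy's real internals):

```python
class DropItem(Exception):
    """Raised to discard an item mid-pipeline."""

def validate(item):
    if not item.get("name"):
        raise DropItem("missing name")
    return item

_seen_urls = set()

def dedupe(item):
    if item["url"] in _seen_urls:
        raise DropItem(f"duplicate: {item['url']}")
    _seen_urls.add(item["url"])
    return item

# (priority, stage) pairs; lower priority runs first
PIPELINES = sorted([(200, dedupe), (100, validate)])

def process(items):
    kept = []
    for item in items:
        try:
            for _, stage in PIPELINES:
                item = stage(item)
            kept.append(item)
        except DropItem:
            continue  # dropped items skip all later stages
    return kept

result = process([
    {"url": "/a", "name": "Widget"},
    {"url": "/a", "name": "Widget"},  # duplicate -> dropped
    {"url": "/b", "name": ""},        # fails validation -> dropped
])
print(result)  # [{'url': '/a', 'name': 'Widget'}]
```

This is why validation runs at a lower number than storage: there's no point writing an item to the database if it's about to be dropped.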
Middleware intercepts requests and responses — essential for anti-bot evasion.
# myproject/middlewares.py
import random

class RandomUserAgentMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:123.0) Gecko/20100101 Firefox/123.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)

class ProxyRotationMiddleware:
    def __init__(self):
        self.proxies = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # Retry with a different proxy on failure
        request.meta["proxy"] = random.choice(self.proxies)
        return request
Enable middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # Disable default User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Enable custom middleware
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.ProxyRotationMiddleware": 410,
}
Key settings for production scraping:
# myproject/settings.py
# Concurrency
CONCURRENT_REQUESTS = 16 # Total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # Per domain limit
DOWNLOAD_DELAY = 1.0 # Seconds between requests (per domain)
RANDOMIZE_DOWNLOAD_DELAY = True # Randomize ±50%
# Retry
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Caching (avoid re-downloading during development)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours
HTTPCACHE_DIR = "httpcache"
# Respect robots.txt
ROBOTSTXT_OBEY = True
# Auto-throttle (adjusts delay based on server response time)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
# Logging
LOG_LEVEL = "INFO"
LOG_FILE = "scrapy.log"
# Export
FEEDS = {
    "output/%(name)s_%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}
Setting CONCURRENT_REQUESTS too high without proxies will get your IP blocked quickly. Start with 4-8 concurrent requests and increase gradually. Always enable AUTOTHROTTLE for production spiders.
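You can also keep conservative defaults in settings.py and override them per spider via the custom_settings class attribute. A sketch with illustrative values:

```python
import scrapy

class CarefulSpider(scrapy.Spider):
    name = "careful"
    # Per-spider overrides take precedence over settings.py
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "DOWNLOAD_DELAY": 2.0,
        "AUTOTHROTTLE_ENABLED": True,
    }
```

This lets one slow, cautious spider coexist with faster spiders in the same project.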
The Scrapy shell is an interactive environment for testing selectors — use it before writing spiders.
# Launch shell with a URL
scrapy shell "https://quotes.toscrape.com"
# Test selectors interactively
>>> response.css("div.quote span.text::text").getall()
>>> response.xpath("//small[@class='author']/text()").get()
>>> response.css("li.next a::attr(href)").get()
# View the response
>>> view(response) # Opens in browser
>>> response.status # HTTP status code
>>> response.headers # Response headers
Tip: install IPython for a nicer shell experience: pip install ipython
Scrapy downloads raw HTML — it doesn't execute JavaScript. For SPAs and dynamic content, integrate a headless browser.
# Install
pip install scrapy-playwright
playwright install chromium
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
import scrapy
from scrapy_playwright.page import PageMethod

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/spa-page",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for content to load
                    PageMethod("wait_for_selector", "div.results"),
                ],
            },
        )

    def parse(self, response):
        # response now contains JS-rendered HTML
        for item in response.css("div.result"):
            yield {
                "title": item.css("h3::text").get(),
                "description": item.css("p::text").get(),
            }
Managing headless browsers in Scrapy adds significant complexity. A web scraping API handles JavaScript rendering, proxy rotation, and anti-bot detection in a single request — no browser management required.
import scrapy
from urllib.parse import quote

class MantisSpider(scrapy.Spider):
    """Use Mantis API as a download backend for Scrapy."""
    name = "mantis_spider"
    api_key = "your_api_key"

    def start_requests(self):
        urls = ["https://example.com/page1", "https://example.com/page2"]
        for url in urls:
            # URL-encode the target so its query string doesn't break the API request
            yield scrapy.Request(
                f"https://api.mantisapi.com/v1/scrape?url={quote(url, safe='')}&render_js=true",
                headers={"Authorization": f"Bearer {self.api_key}"},
                cb_kwargs={"original_url": url},
            )

    def parse(self, response, original_url):
        data = response.json()
        yield {
            "url": original_url,
            "content": data.get("content"),
            "title": data.get("metadata", {}).get("title"),
        }
# Install scrapyd
pip install scrapyd scrapyd-client
# Start the daemon
scrapyd
# Deploy your project
scrapyd-deploy default -p myproject
# Schedule a spider
curl http://localhost:6800/schedule.json -d project=myproject -d spider=products
# Check status
curl http://localhost:6800/listjobs.json?project=myproject
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["scrapy", "crawl", "products", "-O", "/data/output.json"]
# Run with Docker
docker build -t my-scraper .
docker run -v $(pwd)/data:/data my-scraper
# Schedule with cron
# 0 6 * * * docker run --rm -v /data:/data my-scraper
How Scrapy compares to the other options for production scraping:
| Feature | Scrapy | BeautifulSoup | Selenium | Mantis API |
|---|---|---|---|---|
| Type | Framework | Parser library | Browser automation | API service |
| Concurrency | Built-in (async) | Manual (threads) | Manual (processes) | Built-in |
| JavaScript | Plugin required | ❌ No | ✅ Yes | ✅ Yes |
| Anti-bot | Manual middleware | Manual | Manual | ✅ Built-in |
| Proxies | Manual middleware | Manual | Manual | ✅ Built-in |
| Learning curve | High | Low | Medium | Low |
| Speed | ⚡ Very fast | Fast (parsing only) | Slow | Fast |
| Best for | Large-scale crawling | Quick scripts | JS-heavy sites | Any site, any scale |
| Infrastructure | Self-managed | None | Browser required | None (managed) |
| Cost (10K pages/mo) | $50-300 (servers + proxies) | $0-50 | $100-500 | $29-99 |
When to use Scrapy: Large-scale crawling projects with thousands of pages, when you need full control over the scraping pipeline, and when you have engineering resources to maintain infrastructure.
When to use an API instead: When you need data fast without managing spiders, proxies, and servers. When building AI agents that need web data on-demand. When you'd rather focus on using data than collecting it.
A few production tips:
- Install spidermon (pip install spidermon) for alerts on failures and data quality checks
- Enable HTTPCACHE to avoid re-downloading pages while testing selectors
- Use CLOSESPIDER_ITEMCOUNT to limit items during testing: scrapy crawl products -s CLOSESPIDER_ITEMCOUNT=10
- Use errback to handle request errors gracefully instead of crashing the spider

Love Scrapy's power but tired of managing proxies, browsers, and servers? Mantis API gives you the same data with a single API call. Free tier included.
Scrapy is a Python framework for large-scale web scraping and crawling. It provides built-in request scheduling, concurrency, middleware, data pipelines, and export formats. It's used for extracting structured data from websites — product catalogs, news articles, job listings, real estate data, and more.
They serve different purposes. BeautifulSoup is a parsing library — it only handles HTML parsing and needs Requests for HTTP. Scrapy is a complete framework with built-in concurrency, request scheduling, middleware, pipelines, and export. Use BeautifulSoup for quick one-off scripts; use Scrapy for large-scale crawling projects.
Not natively. Scrapy downloads raw HTML without executing JavaScript. For JS-rendered pages, use scrapy-playwright (recommended) or scrapy-splash to integrate a headless browser. Alternatively, a web scraping API like Mantis handles JavaScript rendering automatically.
Scrapy's asynchronous Twisted engine can handle hundreds of concurrent requests, scraping thousands of pages per minute. It's significantly faster than sequential tools like BeautifulSoup + Requests or Selenium.
Options include: Scrapyd (self-hosted daemon), Zyte Scrapy Cloud (managed hosting), Docker containers on any cloud provider, or cron jobs on a VPS. For production, you'll also need monitoring, proxy rotation, and error alerting.
Yes — Scrapy remains the most powerful Python scraping framework for large-scale projects. However, for AI agents and applications that need web data on-demand, a web scraping API is often more practical than managing your own spiders and infrastructure.