March 5, 2026 Web Scraping

| Aspect | Web Crawling | Web Scraping |
|--------|--------------|--------------|
| Goal | Discover URLs/pages | Extract data from pages |
| Scope | Broad (entire sites) | Narrow (specific pages) |
| Output | List of URLs | Structured data |
| Scale | Millions of pages | Hundreds to thousands |
| Complexity | Queue management, dedup, politeness | Parsing, selectors, rendering |
| Example tools | Scrapy, Apache Nutch | BeautifulSoup, Playwright |
| Common use | Search engines, site audits | Price monitoring, lead gen |

When You Need Both (Crawl + Scrape)

Most real-world projects need both. The pattern is simple:

1. Crawl to discover all the pages you care about
2. Scrape each discovered page to extract the data

```python
# Step 1: Crawl to find all product pages
product_urls = crawl_for_pattern(
    seed="https://store.example.com",
    pattern=r"/product/\d+"
)

# Step 2: Scrape each product page
products = [scrape_product(url) for url in product_urls]
```

This two-phase approach mirrors how search engines work: Googlebot crawls the web to discover pages, and Google's indexing systems then parse each page to extract the content used in search results.
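To make the pattern concrete, here is a minimal sketch of what `crawl_for_pattern` and `scrape_product` might look like using only the standard library. It is an illustration under assumptions, not a production crawler: the extra `fetch` parameter (any callable that maps a URL to an HTML string) is added here so the logic can run without a network, the crawl is limited to the seed's host, and the "scrape" step only pulls the page title.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TitleParser(HTMLParser):
    """Captures the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def crawl_for_pattern(seed, pattern, fetch, max_pages=1000):
    """Breadth-first crawl of the seed's host.

    Returns the sorted URLs whose path matches `pattern`.
    `fetch` is a hypothetical callable: URL in, HTML string out.
    """
    host = urlparse(seed).netloc
    regex = re.compile(pattern)
    queue = deque([seed])
    seen = {seed}  # dedup: never enqueue the same URL twice
    matches = set()
    fetched = 0

    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        if regex.search(urlparse(url).path):
            matches.add(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(matches)


def scrape_product(url, fetch):
    """Extract a tiny structured record (just the title) from one page."""
    parser = TitleParser()
    parser.feed(fetch(url))
    return {"url": url, "title": parser.title.strip()}
```

In real use, `fetch` would wrap an HTTP client and add politeness delays, and `scrape_product` would extract whichever fields you actually need.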

The Challenges of Building Your Own

Whether you're crawling or scraping, you'll hit the same walls:

Anti-bot protection: Cloudflare, DataDome, PerimeterX — modern sites actively block automated access. You need rotating proxies, browser fingerprint management, and CAPTCHA solving.

JavaScript rendering: Over 70% of websites require JavaScript execution to display content. Simple HTTP requests won't work — you need a headless browser (Playwright, Puppeteer), which means managing browser instances, memory, and CPU.
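Before paying the cost of a headless browser, it can help to check whether a plain HTTP response already contains the content. The sketch below is one rough heuristic, not an established rule: it flags a page as probably needing JavaScript when the HTML includes scripts but very little visible text. The function name and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TextStats(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_count = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a user would actually see
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def probably_needs_js(html, min_visible_chars=200):
    """Heuristic: scripts are present but almost no visible text is."""
    stats = TextStats()
    stats.feed(html)
    stats.close()
    return stats.script_count > 0 and stats.visible_chars < min_visible_chars
```

A page that returns `True` here is a candidate for rendering with Playwright or Puppeteer; everything else can stay on cheaper plain HTTP requests.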

Scale management: Crawling a 100K-page site means managing request queues, rate limiting, error handling, retries, and distributed coordination. Scraping 10K product pages means handling layout changes, missing elements, and data validation.
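Two of the pieces named above, retries and rate limiting, can be sketched in a few lines. This is a simplified single-process illustration, assuming a caller-supplied `fetch` callable; the names `fetch_with_retries` and `HostRateLimiter` are made up for this example, and a distributed crawler would need shared state instead of an in-memory dict.

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))


class HostRateLimiter:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, host):
        """Block until it is polite to hit `host` again, then record the request."""
        now = self._clock()
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[host] = now
```

The `clock` and `sleep` parameters exist so the timing logic can be tested without real waiting; in normal use the defaults are fine.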

Maintenance: Website structures change constantly. CSS selectors break. New anti-bot measures deploy. What worked yesterday fails tomorrow.
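One way to catch a broken selector early is to validate every scraped record and watch the failure rate. The sketch below assumes a hypothetical schema with two required fields and an arbitrary 20% alert threshold; both would need tuning for a real scraper.

```python
REQUIRED_FIELDS = ("product_name", "price")  # hypothetical schema for illustration


def validate_record(record, required=REQUIRED_FIELDS):
    """Return the list of required fields that are missing or blank."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def breakage_report(records, required=REQUIRED_FIELDS, max_failure_rate=0.2):
    """Summarize validation failures and flag a likely selector break."""
    missing_by_field = {}
    bad = 0
    for record in records:
        missing = validate_record(record, required)
        if missing:
            bad += 1
        for field in missing:
            missing_by_field[field] = missing_by_field.get(field, 0) + 1
    rate = bad / len(records) if records else 0.0
    return {
        "failure_rate": rate,
        "missing_by_field": missing_by_field,
        "likely_broken": rate > max_failure_rate,
    }
```

A sudden jump in `failure_rate` for a single field usually points at a layout change on the target site rather than at bad data.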

The Modern Approach: API-Based Scraping

Instead of building and maintaining crawlers and scrapers, modern teams use web scraping APIs that handle the infrastructure for you.

WebPerception API: Scrape Any Page in One Call

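```python
import requests

# Scrape raw HTML (like a crawler would see)
response = requests.post(
    "https://api.mantisapi.com/scrape",
    headers={"x-api-key": "your-api-key"},
    json={"url": "https://store.example.com/product/123"},
    timeout=30,  # avoid hanging indefinitely on a slow page
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body

html_content = response.json()["content"]
```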

Extract Structured Data (No Selectors Needed)

```python
# AI-powered extraction: no CSS selectors to break when layouts change
response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://store.example.com/product/123",
        "prompt": "Extract product name, price, rating, and description"
    },
    timeout=30,  # avoid hanging indefinitely on a slow page
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body

data = response.json()["data"]
# {"product_name": "Widget Pro", "price": "$29.99", "rating": "4.5/5", ...}
```
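The extracted values in the example above come back as display strings, so you will likely want to normalize them before storing or comparing them. Here is a small sketch that assumes the fields look like the sample output (`"$29.99"`, `"4.5/5"`) and use US-style number formatting; the helper names and the added `rating_scale` key are illustrative choices, not part of the API.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first decimal number from a price string like "$29.99"."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        raise ValueError(f"no numeric price in {text!r}")
    return Decimal(match.group().replace(",", ""))


def parse_rating(text):
    """Convert a "score/scale" string such as "4.5/5" into two floats."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*", text)
    if not match:
        raise ValueError(f"unrecognized rating format: {text!r}")
    return float(match.group(1)), float(match.group(2))


def normalize_product(data):
    """Return a copy of `data` with price and rating converted when present."""
    clean = dict(data)
    if "price" in clean:
        clean["price"] = parse_price(clean["price"])
    if "rating" in clean:
        clean["rating"], clean["rating_scale"] = parse_rating(clean["rating"])
    return clean
```

`Decimal` is used for prices to avoid floating-point rounding surprises when you later sum or compare them.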

Why This Changes Everything

| Challenge | DIY Crawl + Scrape | WebPerception API |
|-----------|-------------------|-------------------|
| Anti-bot bypass | You manage proxies + fingerprints | Built-in |
| JavaScript rendering | Run headless browsers | Cloud-rendered |
| Selector maintenance | Breaks when sites change | AI extraction adapts |
| Infrastructure | Servers, queues, monitoring | Single API call |
| Time to first data | Days to weeks | Minutes |
| Cost (10K pages/mo) | $50-200+ (servers + proxies) | From $29/mo (Starter plan, 5,000 API calls) |

When to Crawl, When to Scrape, When to Use an API

Build a crawler when:

- You need to index an entire website or discover new content
- You're building a search engine or site audit tool
- The site structure is unknown and you need to map it first

Build a scraper when:

- You know exactly which pages and data you need
- The site structure is stable and well-known
- You have the engineering resources to maintain it

Use WebPerception API when:

- You need data fast without infrastructure overhead
- Sites use anti-bot protection or heavy JavaScript
- You're building an AI agent that needs web perception
- You want AI-powered extraction instead of brittle CSS selectors
- You'd rather ship features than maintain scrapers

Getting Started

Many teams now skip building custom crawlers and scrapers entirely. With WebPerception API, you can go from zero to structured web data in under 5 minutes:

1. Get your free API key at mantisapi.com (100 calls/month free)
2. Make your first call using the /scrape or /extract endpoint
3. Build your pipeline: feed extracted data into your app, agent, or database

The free tier is enough to prototype. When you scale, plans start at $29/month for 5,000 API calls.
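As a rough idea of what the pipeline step could look like, the sketch below stores each extracted record in SQLite, keyed by URL. The `extract` argument is a placeholder for whatever function performs the extraction (for example, a thin wrapper around the /extract endpoint); the table layout and function names are assumptions made for this example.

```python
import json
import sqlite3


def init_db(conn):
    """Create the products table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "url TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )


def run_pipeline(urls, extract, conn):
    """Extract each URL and store the result as JSON.

    Returns (stored, failed) lists of URLs so failures can be retried later.
    """
    init_db(conn)
    stored, failed = [], []
    for url in urls:
        try:
            data = extract(url)
        except Exception:
            failed.append(url)
            continue
        # Re-running the pipeline overwrites the previous record for a URL
        conn.execute(
            "INSERT OR REPLACE INTO products (url, data) VALUES (?, ?)",
            (url, json.dumps(data)),
        )
        stored.append(url)
    conn.commit()
    return stored, failed
```

Calling `run_pipeline(product_urls, extract, sqlite3.connect("products.db"))` would persist results to a local file; swapping SQLite for your own database only changes the two SQL statements.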

---

Stop building infrastructure. Start extracting data. Try WebPerception API free →
