*March 6, 2026*
# Web Scraping Without Getting Blocked: The 2026 Playbook

You've built a scraper. It works perfectly, for about 10 minutes. Then: 403 Forbidden. CAPTCHA walls. IP bans. Your beautiful data pipeline grinds to a halt.

Getting blocked is the #1 frustration in web scraping. And in 2026, anti-bot systems are smarter than ever. Cloudflare, DataDome, PerimeterX, and custom WAFs use browser fingerprinting, behavioral analysis, and machine learning to detect and block scrapers.

But there are proven ways to scrape reliably without getting blocked. This guide covers every technique, from basics to advanced, plus a modern approach that sidesteps the entire problem.

## Why Scrapers Get Blocked

Before solving the problem, understand what triggers blocks:

### 1. Request Pattern Detection

- Too many requests per second from one IP
- Identical headers across all requests
- No cookies or session continuity
- Hitting pages in a non-human order (e.g., skipping the homepage)

### 2. Browser Fingerprinting

- Missing or inconsistent JavaScript execution
- No WebGL, Canvas, or AudioContext fingerprints
- Navigator properties that don't match the User-Agent
- Missing browser plugins and fonts

### 3. Behavioral Analysis

- No mouse movements or scroll events
- Instant page loads (no rendering time)
- Accessing robots.txt-disallowed paths
- No referrer headers

### 4. IP Reputation

- Known datacenter IP ranges (AWS, GCP, Azure)
- IPs flagged by threat intelligence feeds
- Tor exit nodes and known proxy IPs

## Technique 1: Request-Level Fixes

The basics. These won't beat sophisticated anti-bot systems alone, but they're table stakes.
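The fixes below are easier to tune if your scraper can first tell *when* it has been blocked. A minimal detection sketch — the status codes and body markers here are illustrative assumptions, not a standard list; adjust them per target site:

```python
# Hypothetical helper: classify a response as blocked or not.
# The statuses and markers below are illustrative; tune them per site.
BLOCK_STATUSES = {403, 407, 429, 503}
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")

def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if the response looks like an anti-bot block."""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Run every response through a check like this before parsing, so your pipeline can back off instead of hammering an endpoint that has already flagged it.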
### Rotate User-Agents

```python
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/537.36 Chrome/119.0.0.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0",
]

headers = {
    "User-Agent": random.choice(user_agents),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
}
```

### Add Delays Between Requests

```python
import random
import time

def polite_delay():
    """Random 2-5 second delay to mimic human browsing."""
    time.sleep(random.uniform(2, 5))
```

### Use Sessions (Cookie Persistence)

```python
import requests

session = requests.Session()

# First request establishes cookies
session.get("https://example.com")

# Subsequent requests carry cookies automatically
response = session.get("https://example.com/data")
```

**Verdict:** Works against basic protections. Fails against Cloudflare, DataDome, and any modern anti-bot system.

## Technique 2: Proxy Rotation

Distributing requests across many IP addresses is essential for scale.

### Types of Proxies

| Type | Cost | Detection Risk | Speed |
|------|------|----------------|-------|
| Datacenter | $1-5/GB | High | Fast |
| Residential | $5-15/GB | Low | Medium |
| Mobile | $15-30/GB | Very Low | Slow |
| ISP (Static Residential) | $2-5/IP/mo | Low | Fast |

### Implementation

```python
import itertools

import requests

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

proxy_pool = itertools.cycle(proxies)

def get_with_rotation(url):
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy})
```

### Smart Rotation

Don't just rotate randomly.
Use sticky sessions per domain:

```python
from collections import defaultdict
from urllib.parse import urlparse

import requests

# Each domain gets one proxy from the pool and keeps it
domain_proxies = defaultdict(lambda: next(proxy_pool))

def get_sticky(url):
    domain = urlparse(url).netloc
    proxy = domain_proxies[domain]
    return requests.get(url, proxies={"http": proxy, "https": proxy})
```

**Verdict:** Essential for scale. Residential proxies are expensive but effective. Still fails against fingerprinting.

## Technique 3: Headless Browsers

When sites require JavaScript rendering, headless browsers simulate a real browser.

### Playwright (Recommended over Selenium)

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
        locale="en-US",
    )
    page = context.new_page()
    page.goto("https://example.com")
    content = page.content()
    browser.close()
```

### Stealth Plugins

Standard headless browsers are trivially detectable. Use stealth plugins:

```python
# playwright-stealth patches common detection vectors
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)
    page.goto("https://nowsecure.nl")  # Anti-bot test site
```

### What Stealth Fixes

- `navigator.webdriver` property
- Chrome DevTools protocol detection
- Missing plugin arrays
- Inconsistent permissions API
- WebGL vendor/renderer strings

**Verdict:** Beats many anti-bot systems. But headless browsers are slow (2-5s per page), resource-heavy, and stealth patches are a constant arms race.
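Whatever mix of proxies and stealth you use, individual requests will still fail, so resilient scrapers wrap each fetch in retry logic: back off exponentially and rotate to the next proxy on every failed attempt. A minimal sketch — the injected `fetch` callable and the function names are illustrative, not any specific library's API:

```python
import itertools
import random
import time

def backoff_delays(attempts, base=2.0, cap=60.0):
    """Exponential delays with jitter: roughly 2s, 4s, 8s, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) * random.uniform(0.5, 1.0)
            for i in range(attempts)]

def fetch_with_retries(url, fetch, proxies, attempts=4, base=2.0):
    """Retry `url` up to `attempts` times, rotating proxies between tries.

    `fetch(url, proxy)` is an injected callable (e.g. a thin wrapper around
    requests.get) that returns a response on success and raises on a block.
    """
    pool = itertools.cycle(proxies)
    last_error = None
    for delay in backoff_delays(attempts, base=base):
        try:
            return fetch(url, next(pool))
        except Exception as exc:  # in real code, catch a narrower error type
            last_error = exc
            time.sleep(delay)
    raise last_error
```

Injecting `fetch` keeps the retry policy independent of the HTTP client, so the same wrapper works around plain `requests` calls or a headless-browser page load.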
## Technique 4: Browser Farms

For serious scale, run pools of real browser instances:

```python
# Using browserless.io or a similar service
import requests

API_KEY = "your-browserless-key"

def scrape_with_browser_farm(url):
    response = requests.post(
        f"https://chrome.browserless.io/content?token={API_KEY}",
        json={"url": url},
        headers={"Content-Type": "application/json"},
    )
    return response.text
```

**Verdict:** Expensive ($50-500/mo), complex to manage, and still requires proxy rotation for blocked sites.

## Technique 5: CAPTCHA Solving

When you hit CAPTCHAs, you have three options:

1. **CAPTCHA solving services** (2Captcha, Anti-Captcha): $1-3 per 1,000 CAPTCHAs
2. **AI-based solvers**: Faster but less reliable
3. **Avoid triggering them entirely**: The best strategy

```python
# 2Captcha integration example
import time

import requests

API_KEY = "your-2captcha-key"

def solve_recaptcha(site_key, page_url):
    # Submit the CAPTCHA
    resp = requests.post("http://2captcha.com/in.php", data={
        "key": API_KEY,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    captcha_id = resp.text.split("|")[1]

    # Poll for the solution (takes 20-60 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.get(
            f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}"
        )
        if "CAPCHA_NOT_READY" not in result.text:
            return result.text.split("|")[1]
    return None
```

**Verdict:** Adds latency and cost. A symptom of an arms race you're losing.

## The Modern Approach: API-Based Scraping

Here's the thing: all the techniques above are **workarounds** for a fundamental problem. You're trying to make a machine look like a human browsing the web. Anti-bot vendors spend millions making that harder every year.
The modern approach is to use a **web scraping API** that handles all of this for you:

```python
import requests

response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "your-api-key"},
    json={"url": "https://example.com/products"},
)

data = response.json()
print(data["content"])  # Clean, extracted content
```

### What WebPerception API Handles For You

- **Anti-bot bypass**: Rotating residential proxies, browser fingerprinting, CAPTCHA handling, all built in
- **JavaScript rendering**: Full browser execution without running your own headless browsers
- **AI-powered extraction**: Ask for structured data in natural language, with no CSS selectors to maintain

```python
# Extract structured data without writing selectors
import requests

response = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all product names, prices, and ratings as JSON",
    },
)

products = response.json()["data"]
# [{"name": "Widget Pro", "price": "$29.99", "rating": 4.8}, ...]
```

### Cost Comparison

| Approach | Monthly Cost (10K pages) | Maintenance | Reliability |
|----------|-------------------------|-------------|-------------|
| DIY (proxies + headless) | $200-500 | High | 70-85% |
| Browser farm service | $150-300 | Medium | 80-90% |
| WebPerception API (Starter) | $29 | Zero | 95%+ |

### When to Use Each

**DIY scraping** makes sense when:

- You're scraping a small number of simple, cooperative sites
- You need sub-second latency
- You're learning web scraping fundamentals

**API-based scraping** makes sense when:

- You're building production data pipelines
- You can't afford downtime from blocked scrapers
- You'd rather spend time on your product than on anti-bot cat-and-mouse
- You're building AI agents that need reliable web access

## Quick Reference: Anti-Block Checklist

Before you scrape, run through this checklist:

- [ ] Rotate User-Agents (pool of 10+)
- [ ] Add realistic headers (Accept, Accept-Language, Referer)
- [ ] Use sessions for cookie persistence
- [ ] Add random delays (2-5s between requests)
- [ ] Rotate IPs (residential proxies for sensitive sites)
- [ ] Use a headless browser with stealth patches for JS-heavy sites
- [ ] Respect robots.txt (ethical scraping = sustainable scraping)
- [ ] Monitor success rates and adjust when they drop
- [ ] Have a fallback plan (API-based scraping) for critical pipelines

## Conclusion

Web scraping in 2026 is an arms race, and anti-bot systems get smarter every quarter. You can play the game (rotating proxies, stealth browsers, CAPTCHA solvers), or you can step out of the game entirely.

For hobby projects and learning, DIY scraping teaches you invaluable skills. For production systems, agents, and anything where reliability matters, an API-based approach like [WebPerception API](https://mantisapi.com) saves you time, money, and headaches.

**Ready to stop getting blocked?** [Get your free API key](https://mantisapi.com): 100 requests/month, no credit card required.