How to Web Scrape Without Getting Blocked in 2026: The Complete Anti-Detection Guide
You write a perfect scraper. It works beautifully for 50 requests. Then: 403 Forbidden. Or worse: you get silently served fake data, your IP gets blacklisted, or you hit an infinite CAPTCHA loop.
Anti-bot detection in 2026 is more sophisticated than ever. Cloudflare's Turnstile, Akamai Bot Manager, PerimeterX (now HUMAN), and DataDome use machine learning, TLS fingerprinting, and behavioral analysis to catch scrapers within milliseconds. The days of just rotating user agents are long gone.
This guide covers every technique you need to scrape without getting blocked, and why most developers ultimately switch to an API that handles it all automatically.
Why Websites Block Scrapers
Before diving into solutions, understand what you're up against. Modern anti-bot systems detect scrapers through multiple signals simultaneously:
| Detection Method | What It Checks | Difficulty to Bypass |
|---|---|---|
| IP Reputation | Datacenter IP ranges, known proxy lists, IP request volume | Medium |
| TLS Fingerprinting | JA3/JA4 signatures: your HTTP client's TLS handshake pattern | Hard |
| Browser Fingerprinting | Canvas, WebGL, AudioContext, navigator properties, fonts | Hard |
| Behavioral Analysis | Mouse movement, scroll patterns, click timing, page dwell time | Very Hard |
| HTTP/2 Fingerprinting | Frame ordering, header priorities, SETTINGS frame parameters | Very Hard |
| CAPTCHAs | Cloudflare Turnstile, reCAPTCHA v3, hCaptcha | Medium ($$) |
| Rate Limiting | Requests per IP per minute/hour, concurrent connections | Easy |
| Honeypot Links | Hidden links (CSS display:none) that only bots follow | Easy |
The key insight: no single technique works alone. Modern anti-bot systems score requests across multiple signals and block based on a composite trust score. You need to pass all checks simultaneously.
Technique 1: Proxy Rotation
The foundation of any anti-detection strategy. Sending all requests from one IP is the fastest way to get blocked.
Types of Proxies
| Proxy Type | Cost | Trust Score | Best For |
|---|---|---|---|
| Datacenter | $1-5/GB | Low | Unprotected sites, APIs |
| Residential | $5-15/GB | High | Cloudflare, Akamai protected sites |
| Mobile (4G/5G) | $15-30/GB | Very High | Heavily protected sites, social media |
| ISP (Static Residential) | $3-8/IP/day | High | Session-based scraping, accounts |
Python: DIY Proxy Rotation
```python
import random
import requests

# Residential proxy pool (you'd get these from BrightData, Oxylabs, etc.)
proxies = [
    "http://user:pass@gate.proxy-provider.com:7777",
    "http://user:pass@gate.proxy-provider.com:7778",
    "http://user:pass@gate.proxy-provider.com:7779",
]

def scrape_with_proxy(url):
    if not proxies:
        # Without this guard, the recursive retry below would crash
        # with an IndexError once every proxy has failed
        raise RuntimeError("Proxy pool exhausted")
    proxy = random.choice(proxies)
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        return response
    except requests.exceptions.ProxyError:
        # Drop the failing proxy and retry with another
        proxies.remove(proxy)
        return scrape_with_proxy(url)
```
Technique 2: Request Headers & User-Agent Rotation
Default Python headers scream "bot." The requests library sends python-requests/2.31.0 as the user agent. Anti-bot systems flag this instantly.
```python
import random

# Realistic 2026 browser user agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def get_realistic_headers():
    ua = random.choice(USER_AGENTS)
    return {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

response = requests.get(url, headers=get_realistic_headers())
```
Critical: Keep your headers consistent within a session. A Chrome user agent with Firefox-style Accept headers is an instant red flag. Match the full header set to the browser you're impersonating.
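One way to honor that rule is to rotate whole browser profiles rather than individual headers, so the User-Agent and its companion headers always agree. A minimal sketch; the profile values below are illustrative examples, not a verified capture of real browser traffic:

```python
import random

# Each profile bundles a User-Agent with headers that plausibly match it.
# These values are illustrative assumptions, not an exhaustive set.
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/avif,image/webp,*/*;q=0.8",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
                      "Gecko/20100101 Firefox/125.0",
        # Firefox's Accept header differs from Chrome's
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
    },
]

def pick_header_profile():
    """Return a full, internally consistent header set."""
    profile = dict(random.choice(BROWSER_PROFILES))
    profile.update({
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    })
    return profile
```

Because the random choice happens at the profile level, you can never end up with a Chrome User-Agent paired with Firefox-style Accept headers.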
Technique 3: Rate Limiting & Human-Like Delays
Humans don't request 100 pages per second. Anti-bot systems track request timing patterns and flag anything too fast or too regular.
```python
import random
import time

def human_delay():
    """Simulate human browsing patterns."""
    # Base delay: 2-5 seconds (normal reading time)
    base = random.uniform(2, 5)
    # Occasionally pause longer (checking phone, reading content)
    if random.random() < 0.1:  # 10% chance
        base += random.uniform(5, 15)
    # Add small jitter to avoid exact patterns
    jitter = random.gauss(0, 0.5)
    delay = max(1, base + jitter)
    time.sleep(delay)

# Scraping loop with human-like timing
for url in urls:
    response = scrape_with_proxy(url)
    process(response)
    human_delay()
```
Rules of thumb:
- Minimum 2 seconds between requests to the same domain
- Randomize delays: uniform intervals are a bot signal
- Respect robots.txt crawl-delay directives
- Back off on 429/503 with exponential backoff plus jitter
- Limit concurrent connections to 2-3 per domain
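The 429/503 rule can be sketched as exponential backoff with full jitter; the `fetch` callable and the status codes checked here are illustrative, not tied to any particular client library:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(fetch, url, max_retries=5):
    """Call fetch(url); on a 429/503 response, sleep a jittered, growing delay and retry."""
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code not in (429, 503):
            return response
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```

The jitter matters as much as the doubling: if every worker retries after exactly 1, 2, 4 seconds, they all hammer the site in synchronized waves.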
Technique 4: TLS Fingerprint Matching
This is where most scrapers get caught in 2026. Your HTTP client's TLS handshake creates a unique fingerprint (JA3/JA4) that anti-bot systems check against known browser signatures.
Python's requests library has a distinctive TLS fingerprint that doesn't match any real browser. Solutions:
```python
# Option 1: Use curl_cffi (impersonates real browser TLS)
from curl_cffi import requests as curl_requests

response = curl_requests.get(
    "https://example.com",
    impersonate="chrome124",  # Matches Chrome 124's TLS fingerprint
    headers=get_realistic_headers(),
)

# Option 2: Use tls-client
import tls_client

session = tls_client.Session(
    client_identifier="chrome_124",
    random_tls_extension_order=True,
)
response = session.get("https://example.com")
```
Why this matters: Cloudflare and Akamai check TLS fingerprints before your request even reaches the origin server. If your JA3 hash doesn't match a known browser, you're blocked at the edge; no amount of header manipulation will help.
Technique 5: JavaScript Rendering
A large share of modern websites require JavaScript to render their content, and sites protected by Cloudflare Turnstile require JavaScript execution to pass the challenge.
```python
# Using Playwright for full browser rendering
from playwright.sync_api import sync_playwright

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-features=IsolateOrigins,site-per-process",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            locale="en-US",
        )
        page = context.new_page()
        # Remove automation indicators
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
            window.chrome = { runtime: {} };
        """)
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
```
Technique 6: CAPTCHA Solving
When all other detection methods fail, sites fall back to CAPTCHAs. In 2026, the main CAPTCHA types are:
- Cloudflare Turnstile: invisible challenge that checks the browser environment
- reCAPTCHA v3: score-based (0.0-1.0), no user interaction
- hCaptcha: image classification challenges
- Custom challenges: site-specific puzzles, math problems, etc.
```python
# Using the 2Captcha API for CAPTCHA solving
import time

import requests

def solve_captcha(site_key, page_url):
    # Submit the CAPTCHA to the solving service
    response = requests.post("http://2captcha.com/in.php", data={
        "key": "YOUR_API_KEY",
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    captcha_id = response.text.split("|")[1]
    # Poll for the solution (typically takes 20-60 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.get(
            f"http://2captcha.com/res.php?key=YOUR_API_KEY&action=get&id={captcha_id}"
        )
        if "CAPCHA_NOT_READY" not in result.text:
            return result.text.split("|")[1]
    raise Exception("CAPTCHA solving timed out")
```
Cost: CAPTCHA solving services charge $1-3 per 1,000 solves. If 10% of your requests trigger CAPTCHAs, scraping 10,000 pages costs $1-3 extra, but solving adds 20-60 seconds of latency to each affected request.
Technique 7: Session & Cookie Management
Anti-bot systems track sessions across requests. Getting a fresh Cloudflare cookie, then making 1,000 requests with it in 5 minutes, is suspicious.
```python
import requests

session = requests.Session()

# First request: establish a natural session
session.get("https://example.com", headers=get_realistic_headers())
human_delay()

# Browse naturally: homepage -> category -> product (don't jump straight to deep pages)
session.get("https://example.com/category", headers=get_realistic_headers())
human_delay()

# Now scrape the actual target
response = session.get("https://example.com/category/product-123", headers=get_realistic_headers())
```
Key rules:
- Warm up sessions: visit the homepage before deep pages
- Preserve cookies: anti-bot tokens in cookies validate subsequent requests
- Rotate sessions: create a new session every 50-100 requests
- Match session to IP: don't use the same session across different proxy IPs
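The rotation rule can be sketched as a small pool that retires a session after a fixed number of uses. The threshold and the structure below are assumptions about how you'd organize your own code, not a standard API:

```python
import requests

class RotatingSessionPool:
    """Hand out a requests.Session, replacing it after max_uses requests."""

    def __init__(self, max_uses=75, make_headers=None):
        self.max_uses = max_uses
        self.make_headers = make_headers or dict  # e.g. get_realistic_headers
        self._session = None
        self._uses = 0

    def get_session(self):
        if self._session is None or self._uses >= self.max_uses:
            # Fresh session: new cookie jar, new header profile
            self._session = requests.Session()
            self._session.headers.update(self.make_headers())
            self._uses = 0
        self._uses += 1
        return self._session
```

To follow the match-session-to-IP rule, pair each fresh session with a fresh proxy, so the two identities rotate together rather than drifting apart.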
Technique 8: Browser Fingerprint Randomization
Modern anti-bot systems create a unique fingerprint from dozens of browser properties. Even with headless browsers, default configurations are detectable.
Properties checked include:
- Canvas fingerprint: how your browser renders a hidden canvas element
- WebGL renderer: GPU and driver information
- AudioContext: audio processing fingerprint
- Navigator properties: platform, languages, hardware concurrency, device memory
- Screen resolution: must match common resolutions (1920×1080, 1366×768, etc.)
- Installed fonts: font enumeration reveals OS and installed software
- WebRTC: can leak your real IP even behind a proxy
Tools like Playwright Stealth and Puppeteer Stealth handle most of these, but sophisticated sites still detect them. This is an ongoing arms race.
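Randomization only helps if the randomized values stay mutually consistent; a Windows user agent reporting an Apple WebGL renderer is itself a red flag. A minimal sketch of picking whole coherent profiles rather than mixing independent random values (every value below is illustrative, not captured from a real browser):

```python
import random

# Coherent fingerprint bundles: every property agrees with the claimed OS.
# These specific values are illustrative assumptions.
FINGERPRINT_PROFILES = [
    {
        "platform": "Win32",
        "user_agent_os": "Windows NT 10.0; Win64; x64",
        "webgl_vendor": "Google Inc. (NVIDIA)",
        "screen": (1920, 1080),
        "hardware_concurrency": 8,
    },
    {
        "platform": "MacIntel",
        "user_agent_os": "Macintosh; Intel Mac OS X 14_4",
        "webgl_vendor": "Apple Inc.",
        "screen": (1440, 900),
        "hardware_concurrency": 10,
    },
]

def pick_fingerprint():
    """Choose one internally consistent profile instead of mixing random values."""
    return random.choice(FINGERPRINT_PROFILES)

def init_script(fp):
    """Build a browser init script (Playwright add_init_script style) from the profile."""
    return (
        f"Object.defineProperty(navigator, 'platform', {{get: () => '{fp['platform']}'}});\n"
        f"Object.defineProperty(navigator, 'hardwareConcurrency', "
        f"{{get: () => {fp['hardware_concurrency']}}});"
    )
```

The generated script would be passed to something like Playwright's `page.add_init_script()` from Technique 5, so the overrides apply before the page's own scripts run.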
Technique 9: Honeypot Detection
Honeypots are invisible links or form fields that only bots interact with. They're simple but effective:
```python
from bs4 import BeautifulSoup

def safe_extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Skip hidden links (honeypots)
        style = a.get("style", "")
        parent_style = a.parent.get("style", "") if a.parent else ""
        classes = " ".join(a.get("class", []))
        # Common honeypot indicators
        if any(indicator in style for indicator in ["display:none", "display: none", "visibility:hidden", "opacity:0"]):
            continue
        if any(indicator in parent_style for indicator in ["display:none", "display: none", "visibility:hidden"]):
            continue
        if "hidden" in classes or "trap" in classes or "honeypot" in classes:
            continue
        # Skip zero-size elements
        if "width:0" in style or "height:0" in style:
            continue
        links.append(a["href"])
    return links
```
Rule: Never follow links that aren't visible to real users. Parse the DOM carefully and check for CSS that hides elements.
Technique 10: Respect robots.txt (Seriously)
This isn't just about ethics; it's practical. Sites that see you ignoring robots.txt are more likely to escalate anti-bot measures against you specifically.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch("*", url)

# Check before scraping
if can_scrape("https://example.com/data"):
    response = scrape_with_proxy("https://example.com/data")
else:
    print("Blocked by robots.txt; skipping")
```
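The same RobotFileParser also exposes any Crawl-delay directive, so your rate limiter from Technique 3 can honor it. A small sketch, parsing from in-memory lines to stay self-contained (in production you'd fetch the file with set_url()/read() as above):

```python
from urllib.robotparser import RobotFileParser

def crawl_delay_for(robots_txt_lines, user_agent="*"):
    """Parse robots.txt content and return its Crawl-delay in seconds (None if unset)."""
    rp = RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.crawl_delay(user_agent)

# Example robots.txt content (illustrative)
example = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private",
]
```

If the site declares a Crawl-delay larger than your own minimum, use the site's value as the floor for your randomized delays.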
The Real Cost of DIY Anti-Detection
Let's add it up. For a production scraping operation doing 25,000 requests/month:
| Component | DIY Cost/Month | Mantis API |
|---|---|---|
| Residential Proxies (50GB) | $250-750 | ✓ Included |
| CAPTCHA Solving | $50-200 | ✓ Included |
| Headless Browser Infra | $50-200 | ✓ Included |
| Anti-detect Libraries | $0-100 | ✓ Included |
| Maintenance (10-20 hrs) | $500-1,000* | ✓ Zero |
| Total | $850-2,250 | $99/month |
*Developer time valued at $50/hr. Anti-bot systems update frequently, requiring ongoing maintenance.
The API Shortcut: Skip the Arms Race
Everything above (proxies, headers, fingerprints, CAPTCHAs, sessions) exists because you're trying to make an HTTP client look like a real browser. A web scraping API handles all of it behind a single endpoint.
One API Call Replaces 200 Lines of Anti-Detection Code
```python
import requests

# Everything above (proxies, headers, TLS fingerprinting, CAPTCHAs,
# JavaScript rendering, session management) in one call:
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/product/123",
        "render_js": True,   # Full browser rendering
        "extract": {         # AI-powered data extraction
            "title": "product name",
            "price": "current price",
            "rating": "customer rating",
            "reviews": "number of reviews",
        },
    },
)
data = response.json()
# Returns: {"title": "...", "price": "$29.99", "rating": "4.5", "reviews": "1,234"}
```
What Mantis handles automatically:
- ✓ Residential proxy rotation across 195+ countries
- ✓ TLS fingerprint matching (JA3/JA4)
- ✓ Full JavaScript rendering with stealth mode
- ✓ Automatic CAPTCHA solving
- ✓ Browser fingerprint randomization
- ✓ Session and cookie management
- ✓ Adaptive rate limiting
- ✓ AI-powered structured data extraction
Stop Fighting Anti-Bot Systems
Mantis API handles all anti-detection automatically. Get structured data from any website with a single API call.
Start Free: 100 requests/month

When to DIY vs. When to Use an API
DIY scraping makes sense when:
- You're scraping unprotected sites (no Cloudflare/Akamai)
- You need full control over the scraping logic
- You're scraping at very high volume (100K+ pages/day) where per-request pricing adds up
- You're doing academic research with specific methodology requirements
A scraping API makes sense when:
- Target sites use anti-bot protection (most commercial sites in 2026)
- You need structured data extraction, not raw HTML
- You're building an AI agent that needs reliable web access
- Your time is worth more than the API subscription
- You don't want to maintain anti-detection infrastructure
Putting It All Together: The Layered Approach
If you're going the DIY route, here's the recommended stack in order of importance:
1. Residential proxies: the foundation. Without clean IPs, nothing else matters.
2. TLS fingerprint matching: use curl_cffi or tls-client. This catches most developers off guard.
3. Realistic headers: full header sets that match your impersonated browser.
4. Human-like delays: 2-10 second randomized delays between requests.
5. Session management: warm up sessions, rotate every 50-100 requests.
6. JavaScript rendering: Playwright with stealth plugins for JS-heavy sites.
7. CAPTCHA solving: as a last resort when everything else fails.
8. Honeypot avoidance: parse the DOM carefully, skip hidden elements.
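If you do assemble the stack yourself, the layers above can be wired together as a skeleton loop. Here `fetch`, `is_blocked`, and `solve_challenge` are placeholders for your own implementations of the earlier techniques, and `delay_range` is a parameter added for illustration:

```python
import random
import time

def scrape_pipeline(urls, fetch, is_blocked, solve_challenge=None, delay_range=(2, 5)):
    """Layered scraping loop: delay, fetch, detect blocks, escalate or skip."""
    results = {}
    for url in urls:
        # Human-like delay between requests (Technique 3)
        time.sleep(random.uniform(*delay_range))
        # fetch() is assumed to handle proxies, TLS, and headers (Techniques 1, 2, 4)
        response = fetch(url)
        if is_blocked(response):
            if solve_challenge is not None:
                # Escalate to JS rendering / CAPTCHA solving (Techniques 5-6)
                response = solve_challenge(url)
                if is_blocked(response):
                    continue
            else:
                # Skip rather than hammer a site that has flagged us
                continue
        results[url] = response
    return results
```

The key design point is the escalation order: cheap layers (delays, headers, proxies) run on every request, while expensive ones (browsers, CAPTCHA solvers) fire only when a block is actually detected.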
Or, if you'd rather build your product than fight anti-bot systems: use an API.