How to Web Scrape Without Getting Blocked in 2026: The Complete Anti-Detection Guide

Published March 15, 2026 · 14 min read · Updated for 2026 anti-bot systems
Anti-Detection · Web Scraping · Proxies · Python
TL;DR: Getting blocked while web scraping is the #1 frustration for developers. This guide covers 10 proven anti-detection techniques, from proxy rotation to browser fingerprinting, with Python code examples. Or skip the arms race entirely: use a web scraping API that handles anti-bot detection for you at a fraction of the cost.

You write a perfect scraper. It works beautifully for 50 requests. Then: 403 Forbidden. Or worse: you get silently served fake data, your IP gets blacklisted, or you hit an infinite CAPTCHA loop.

Anti-bot detection in 2026 is more sophisticated than ever. Cloudflare's Turnstile, Akamai Bot Manager, PerimeterX (now HUMAN), and DataDome use machine learning, TLS fingerprinting, and behavioral analysis to catch scrapers within milliseconds. The days of just rotating user agents are long gone.

This guide covers every technique you need to scrape without getting blocked, and why most developers ultimately switch to an API that handles it all automatically.

Why Websites Block Scrapers

Before diving into solutions, understand what you're up against. Modern anti-bot systems detect scrapers through multiple signals simultaneously:

| Detection Method | What It Checks | Difficulty to Bypass |
| --- | --- | --- |
| IP Reputation | Datacenter IP ranges, known proxy lists, IP request volume | Medium |
| TLS Fingerprinting | JA3/JA4 signatures: your HTTP client's TLS handshake pattern | Hard |
| Browser Fingerprinting | Canvas, WebGL, AudioContext, navigator properties, fonts | Hard |
| Behavioral Analysis | Mouse movement, scroll patterns, click timing, page dwell time | Very Hard |
| HTTP/2 Fingerprinting | Frame ordering, header priorities, SETTINGS frame parameters | Very Hard |
| CAPTCHAs | Cloudflare Turnstile, reCAPTCHA v3, hCaptcha | Medium ($$) |
| Rate Limiting | Requests per IP per minute/hour, concurrent connections | Easy |
| Honeypot Links | Hidden links (CSS display:none) that only bots follow | Easy |

The key insight: no single technique works alone. Modern anti-bot systems score requests across multiple signals and block based on a composite trust score. You need to pass all checks simultaneously.
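To make the composite-score idea concrete, here is a deliberately simplified sketch. Real vendors use machine-learning models with proprietary signals, so the signal names, weights, and threshold below are invented purely for illustration:

```python
# Hypothetical composite bot scoring: each tripped signal adds a weighted
# penalty, and the request is blocked once the total crosses a threshold.
# The takeaway: one clean signal cannot offset several bad ones.

SIGNAL_WEIGHTS = {
    "datacenter_ip": 0.30,
    "tls_mismatch": 0.35,
    "missing_sec_fetch_headers": 0.15,
    "no_js_execution": 0.15,
    "regular_timing": 0.05,
}

BLOCK_THRESHOLD = 0.5

def trust_score(signals: dict) -> float:
    """Return a suspicion score from boolean detection signals."""
    return sum(SIGNAL_WEIGHTS[name] for name, tripped in signals.items() if tripped)

def is_blocked(signals: dict) -> bool:
    return trust_score(signals) >= BLOCK_THRESHOLD

# A residential IP alone does not save a request with a bad TLS
# fingerprint and default HTTP-client headers:
signals = {
    "datacenter_ip": False,
    "tls_mismatch": True,
    "missing_sec_fetch_headers": True,
    "no_js_execution": True,
    "regular_timing": False,
}
print(is_blocked(signals))  # True: 0.35 + 0.15 + 0.15 = 0.65 >= 0.5
```

This is why the techniques below only work in combination: fixing your headers while keeping a datacenter IP and a non-browser TLS handshake still leaves the composite score above the block threshold.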

Technique 1: Proxy Rotation

The foundation of any anti-detection strategy. Sending all requests from one IP is the fastest way to get blocked.

Types of Proxies

| Proxy Type | Cost | Trust Score | Best For |
| --- | --- | --- | --- |
| Datacenter | $1-5/GB | Low | Unprotected sites, APIs |
| Residential | $5-15/GB | High | Cloudflare, Akamai protected sites |
| Mobile (4G/5G) | $15-30/GB | Very High | Heavily protected sites, social media |
| ISP (Static Residential) | $3-8/IP/day | High | Session-based scraping, accounts |

Python: DIY Proxy Rotation

```python
import random
import requests

# Residential proxy pool (you'd get these from BrightData, Oxylabs, etc.)
PROXIES = [
    "http://user:pass@gate.proxy-provider.com:7777",
    "http://user:pass@gate.proxy-provider.com:7778",
    "http://user:pass@gate.proxy-provider.com:7779",
]

def scrape_with_proxy(url):
    # Work on a copy so a failing proxy is only skipped for this request,
    # and so the retry loop terminates once the pool is exhausted
    # (the original recursive version crashed when the pool ran empty).
    pool = list(PROXIES)
    while pool:
        proxy = random.choice(pool)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=30,
            )
        except requests.exceptions.ProxyError:
            pool.remove(proxy)  # rotate to another proxy on failure
    raise RuntimeError(f"all proxies failed for {url}")
```

💡 The hidden cost: Residential proxies from BrightData, Oxylabs, or Smartproxy cost $5-15 per GB. A typical scraping operation transferring 50GB/month spends $250-750/month on proxies alone, before CAPTCHA solving, infrastructure, or maintenance.

Technique 2: Request Headers & User-Agent Rotation

Default Python headers scream "bot." The requests library sends python-requests/2.31.0 as the user agent. Anti-bot systems flag this instantly.

```python
import random
import requests

# Realistic 2026 browser user agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def get_realistic_headers():
    ua = random.choice(USER_AGENTS)
    return {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

response = requests.get(url, headers=get_realistic_headers())
```

Critical: Keep your headers consistent within a session. A Chrome user agent with Firefox-style Accept headers is an instant red flag. Match the full header set to the browser you're impersonating.
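One way to enforce that consistency is to bundle each user agent with the header set that browser actually sends, and pick a whole profile at once rather than mixing and matching. The two profiles below are an illustrative sketch, not a complete inventory of either browser's headers:

```python
import random

# Each profile keeps a user agent together with headers that match it, so a
# session never mixes, say, a Firefox UA with Chrome-style header values.
# Values are illustrative.
BROWSER_PROFILES = [
    {  # Chrome on Windows
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    },
    {  # Firefox on Windows (note the different Accept-Language q-value)
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    },
]

def pick_profile() -> dict:
    """Choose one coherent header set; reuse it for the whole session."""
    return dict(random.choice(BROWSER_PROFILES))
```

Pick a profile once per session and pass it to every request in that session, rather than calling it per request.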

Technique 3: Rate Limiting & Human-Like Delays

Humans don't request 100 pages per second. Anti-bot systems track request timing patterns and flag anything too fast or too regular.

```python
import time
import random

def human_delay():
    """Simulate human browsing patterns."""
    # Base delay: 2-5 seconds (normal reading time)
    base = random.uniform(2, 5)

    # Occasionally pause longer (checking phone, reading content)
    if random.random() < 0.1:  # 10% chance
        base += random.uniform(5, 15)

    # Add small jitter to avoid exact patterns
    jitter = random.gauss(0, 0.5)

    delay = max(1, base + jitter)
    time.sleep(delay)

# Scraping loop with human-like timing
for url in urls:
    response = scrape_with_proxy(url)
    process(response)
    human_delay()
```

Rules of thumb:

- Wait roughly 2-5 seconds between page loads, with occasional longer pauses, like a reader who got distracted.
- Add random jitter so intervals are never machine-regular; perfectly even timing is itself a bot signal.
- When you see 429 (Too Many Requests) responses, slow down rather than retrying immediately.
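On top of randomized delays, it pays to back off exponentially when the server does push back with 429. A minimal stdlib sketch; `fetch` here is a placeholder for whatever client function you use, not a name from this article:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter:
    a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(fetch, url: str, max_attempts: int = 5):
    """Call fetch(url) (any callable returning an object with .status_code),
    sleeping progressively longer after each 429 before giving up."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```

Full jitter (a uniformly random delay up to the exponential cap) avoids the "thundering herd" effect where many retries land at the same instant and look synchronized.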

Technique 4: TLS Fingerprint Matching

This is where most scrapers get caught in 2026. Your HTTP client's TLS handshake creates a unique fingerprint (JA3/JA4) that anti-bot systems check against known browser signatures.

Python's requests library has a distinctive TLS fingerprint that doesn't match any real browser. Solutions:

```python
# Option 1: Use curl_cffi (impersonates real browser TLS)
from curl_cffi import requests as curl_requests

response = curl_requests.get(
    "https://example.com",
    impersonate="chrome124",  # Matches Chrome 124's TLS fingerprint
    headers=get_realistic_headers()
)

# Option 2: Use tls-client
import tls_client

session = tls_client.Session(
    client_identifier="chrome_124",
    random_tls_extension_order=True
)
response = session.get("https://example.com")
```

Why this matters: Cloudflare and Akamai check TLS fingerprints before your request even reaches the server. If your JA3 hash doesn't match a known browser, you're blocked at the edge; no amount of header manipulation will help.

Technique 5: JavaScript Rendering

Over 60% of modern websites require JavaScript to render content. Sites protected by Cloudflare Turnstile require JavaScript execution to pass the challenge.

```python
# Using Playwright for full browser rendering
from playwright.sync_api import sync_playwright

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-features=IsolateOrigins,site-per-process",
            ]
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            # Use a complete UA string that matches the launched browser;
            # a truncated or mismatched one is itself a red flag.
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
        )

        page = context.new_page()

        # Remove the most obvious automation indicators
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
            window.chrome = { runtime: {} };
        """)

        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
```

⚠️ Performance cost: Headless browsers use 200-500MB RAM per instance. Running 10 concurrent browsers requires a 4-8GB server ($20-80/month). At scale, browser-based scraping infrastructure costs add up fast.

Technique 6: CAPTCHA Solving

When all other detection methods fail, sites fall back to CAPTCHAs. In 2026, the main CAPTCHA types are:

- Cloudflare Turnstile: a mostly invisible browser challenge that verifies JavaScript execution and browser integrity
- reCAPTCHA v3: score-based and invisible, rating the whole session rather than showing a puzzle
- hCaptcha: image-selection challenges, popular with privacy-conscious sites

```python
# Using the 2Captcha API for CAPTCHA solving
import time
import requests

API_KEY = "YOUR_API_KEY"

def solve_captcha(site_key, page_url):
    # Submit the CAPTCHA to the solving service
    response = requests.post("http://2captcha.com/in.php", data={
        "key": API_KEY,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    # Successful submissions return "OK|<id>"; anything else is an error code
    if not response.text.startswith("OK|"):
        raise RuntimeError(f"2Captcha rejected the task: {response.text}")
    captcha_id = response.text.split("|")[1]

    # Poll for the solution (typically takes 20-60 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.get(
            "http://2captcha.com/res.php",
            params={"key": API_KEY, "action": "get", "id": captcha_id},
        )
        if result.text.startswith("OK|"):
            return result.text.split("|")[1]
        if result.text != "CAPCHA_NOT_READY":
            raise RuntimeError(f"2Captcha error: {result.text}")

    raise RuntimeError("CAPTCHA solving timed out")
```

Cost: CAPTCHA solving services charge $1-3 per 1,000 solves. If 10% of your requests trigger CAPTCHAs, scraping 10,000 pages costs $1-3 extra, but solving adds 20-60 seconds of latency per request.

Technique 7: Session & Cookie Management

Anti-bot systems track sessions across requests. Getting a fresh Cloudflare cookie, then making 1,000 requests with it in 5 minutes, is suspicious.

```python
import requests

session = requests.Session()

# First request: establish a natural session
session.get("https://example.com", headers=get_realistic_headers())
human_delay()

# Browse naturally: homepage → category → product (don't jump straight to deep pages)
session.get("https://example.com/category", headers=get_realistic_headers())
human_delay()

# Now scrape the actual target
response = session.get("https://example.com/category/product-123", headers=get_realistic_headers())
```

Key rules:

- Warm up every session with a natural entry path (homepage first) before requesting target pages.
- Keep the cookie, IP, and user agent bound together; Cloudflare's clearance cookie is tied to the fingerprint that earned it, and replaying it from a different IP or browser fails.
- Retire sessions after a bounded number of requests instead of reusing one indefinitely.
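Retiring sessions on a fixed budget is easy to enforce with a small wrapper. This is an illustrative sketch, not a library API: `make_session` stands in for whatever factory builds a fresh session (new cookies, new proxy) in your stack:

```python
class RotatingSession:
    """Hand out a session object, replacing it after max_requests uses.

    make_session is any zero-argument factory (for example, one that builds
    a requests.Session bound to a fresh proxy); this wrapper only tracks
    usage counts and triggers replacement.
    """

    def __init__(self, make_session, max_requests: int = 75):
        self._make_session = make_session
        self._max_requests = max_requests
        self._session = make_session()
        self._used = 0
        self.rotations = 0  # how many times we've swapped in a fresh session

    def get_session(self):
        if self._used >= self._max_requests:
            self._session = self._make_session()
            self._used = 0
            self.rotations += 1
        self._used += 1
        return self._session
```

Call `get_session()` before each request; the wrapper transparently swaps in a fresh session once the budget is spent.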

Technique 8: Browser Fingerprint Randomization

Modern anti-bot systems create a unique fingerprint from dozens of browser properties. Even with headless browsers, default configurations are detectable.

Properties checked include:

- Canvas and WebGL rendering output (hardware- and driver-dependent hashes)
- AudioContext processing signatures
- navigator properties: webdriver, plugins, languages, hardwareConcurrency, platform
- Installed fonts, screen resolution, color depth, and timezone

Tools like Playwright Stealth and Puppeteer Stealth handle most of these, but sophisticated sites still detect them. This is an ongoing arms race.
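If you experiment with your own randomization, generate one coherent profile per browser context rather than randomizing each property independently; mismatched combinations (a macOS user agent reporting Windows screen metrics, say) are themselves a detection signal. A hypothetical profile generator, with illustrative values:

```python
import random

# Coherent (platform, user agent, viewport) bundles. Values are
# illustrative, not an exhaustive or authoritative list.
PLATFORM_BUNDLES = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewports": [(1920, 1080), (1536, 864), (1366, 768)],
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        "viewports": [(1440, 900), (1680, 1050)],
    },
]

def random_fingerprint_profile() -> dict:
    """Pick one platform bundle and derive consistent context options from it."""
    bundle = random.choice(PLATFORM_BUNDLES)
    width, height = random.choice(bundle["viewports"])
    return {
        "user_agent": bundle["user_agent"],
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
        "timezone_id": random.choice(
            ["America/New_York", "America/Chicago", "America/Los_Angeles"]
        ),
    }
```

The returned dict matches keyword arguments accepted by Playwright's `browser.new_context(...)`, so one call per context keeps every property consistent: `browser.new_context(**random_fingerprint_profile())`.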

Technique 9: Honeypot Detection

Honeypots are invisible links or form fields that only bots interact with. They're simple but effective:

```python
from bs4 import BeautifulSoup

HIDDEN_INDICATORS = ("display:none", "visibility:hidden", "opacity:0")

def _looks_hidden(style: str) -> bool:
    # Normalize whitespace so "display: none" and "display:none" both match
    style = style.replace(" ", "")
    return any(indicator in style for indicator in HIDDEN_INDICATORS)

def safe_extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []

    for a in soup.find_all("a", href=True):
        style = a.get("style", "")
        parent_style = a.parent.get("style", "") if a.parent else ""
        classes = " ".join(a.get("class", []))

        # Skip links hidden by inline CSS on the link or its parent (honeypots)
        if _looks_hidden(style) or _looks_hidden(parent_style):
            continue
        # Skip class names that suggest a trap
        if any(marker in classes for marker in ("hidden", "trap", "honeypot")):
            continue
        # Skip zero-size elements
        compact = style.replace(" ", "")
        if "width:0" in compact or "height:0" in compact:
            continue

        links.append(a["href"])

    return links
```

Rule: Never follow links that aren't visible to real users. Parse the DOM carefully and check for CSS that hides elements.

Technique 10: Respect robots.txt (Seriously)

This isn't just about ethics โ€” it's practical. Sites that see you ignoring robots.txt are more likely to escalate anti-bot measures against you specifically.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses the site's robots.txt

    return rp.can_fetch("*", url)

# Check before scraping
if can_scrape("https://example.com/data"):
    response = scrape_with_proxy("https://example.com/data")
else:
    print("Blocked by robots.txt; skipping")
```

The Real Cost of DIY Anti-Detection

Let's add it up. For a production scraping operation doing 25,000 requests/month:

| Component | DIY Cost/Month | Mantis API |
| --- | --- | --- |
| Residential Proxies (50GB) | $250-750 | ✅ Included |
| CAPTCHA Solving | $50-200 | ✅ Included |
| Headless Browser Infra | $50-200 | ✅ Included |
| Anti-detect Libraries | $0-100 | ✅ Included |
| Maintenance (10-20 hrs) | $500-1,000* | ✅ Zero |
| Total (excluding maintenance) | $350-1,250+ | $99/month |

*Developer time valued at $50/hr. Anti-bot systems update frequently, requiring ongoing maintenance.

The real cost isn't money; it's time. Anti-bot systems evolve constantly. Cloudflare updated Turnstile 12 times in 2025 alone. Each update can break your scraper, requiring hours of debugging. A scraping API absorbs that maintenance cost across thousands of customers.

The API Shortcut: Skip the Arms Race

Everything above (proxies, headers, fingerprints, CAPTCHAs, sessions) exists because you're trying to make an HTTP client look like a real browser. A web scraping API handles all of it behind a single endpoint.

One API Call Replaces 200 Lines of Anti-Detection Code

```python
import requests

# Everything above (proxies, headers, TLS fingerprinting, CAPTCHAs,
# JavaScript rendering, session management) in one call:

response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/product/123",
        "render_js": True,           # Full browser rendering
        "extract": {                 # AI-powered data extraction
            "title": "product name",
            "price": "current price",
            "rating": "customer rating",
            "reviews": "number of reviews"
        }
    }
)

data = response.json()
# Returns: {"title": "...", "price": "$29.99", "rating": "4.5", "reviews": "1,234"}
```

What Mantis handles automatically:

- Residential proxy rotation and IP reputation
- TLS and browser fingerprints that match real browsers
- CAPTCHA solving (Turnstile, reCAPTCHA, hCaptcha)
- JavaScript rendering in managed headless browsers
- Session and cookie management

Stop Fighting Anti-Bot Systems

Mantis API handles all anti-detection automatically. Get structured data from any website with a single API call.

Start Free → 100 requests/month

When to DIY vs. When to Use an API

DIY scraping makes sense when:

- Your targets are unprotected sites or public APIs with no anti-bot layer in front of them
- Your volume is low enough that proxy and infrastructure costs stay negligible
- You need full control over every request, or you're learning how scraping works

A scraping API makes sense when:

- Your targets sit behind Cloudflare, Akamai, DataDome, or similar protection
- You're operating at a scale where DIY proxy, CAPTCHA, and maintenance costs exceed a subscription
- Your team's time is better spent on the product than on the anti-detection arms race

Putting It All Together: The Layered Approach

If you're going the DIY route, here's the recommended stack in order of importance:

  1. Residential proxies: the foundation. Without clean IPs, nothing else matters.
  2. TLS fingerprint matching: use curl_cffi or tls-client. This catches most developers off guard.
  3. Realistic headers: full header sets that match your impersonated browser.
  4. Human-like delays: 2-10 second randomized delays between requests.
  5. Session management: warm up sessions, rotate every 50-100 requests.
  6. JavaScript rendering: Playwright with stealth plugins for JS-heavy sites.
  7. CAPTCHA solving: as a last resort when everything else fails.
  8. Honeypot avoidance: parse the DOM carefully, skip hidden elements.

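One way to wire these layers together is an escalation ladder: try the cheapest client first and fall back to heavier tooling only when a request comes back blocked. The sketch below keeps the fetchers abstract (in practice: plain requests, then curl_cffi, then a Playwright browser); the names and blocked-status list are illustrative:

```python
def fetch_with_escalation(url: str, fetchers, blocked_statuses=(403, 429, 503)):
    """Try each (name, fetch) pair in order of cost; return on first success.

    Each fetch is a callable taking a URL and returning an object with a
    .status_code attribute. Returns (fetcher_name, response), or
    ("blocked", last_response) if every layer was rejected.
    """
    last = None
    for name, fetch in fetchers:
        response = fetch(url)
        if response.status_code not in blocked_statuses:
            return name, response
        last = response
    return "blocked", last
```

Tracking which layer each domain ends up needing also tells you where to spend money: a site that always succeeds at the plain-client layer never needs to pay browser-rendering costs.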
Or, if you'd rather build your product than fight anti-bot systems: use an API.
