How to Web Scrape Without Getting Blocked in 2026: The Complete Anti-Detection Guide
You write a perfect scraper. It works beautifully for 50 requests. Then: 403 Forbidden. Or worse: you get silently served fake data, your IP gets blacklisted, or you hit an infinite CAPTCHA loop.
Anti-bot detection in 2026 is more sophisticated than ever. Cloudflare's Turnstile, Akamai Bot Manager, PerimeterX (now HUMAN), and DataDome use machine learning, TLS fingerprinting, and behavioral analysis to catch scrapers within milliseconds. The days of just rotating user agents are long gone.
This guide covers every technique you need to scrape without getting blocked, and why most developers ultimately switch to an API that handles it all automatically.
Why Websites Block Scrapers
Before diving into solutions, understand what you're up against. Modern anti-bot systems detect scrapers through multiple signals simultaneously:
| Detection Method | What It Checks | Difficulty to Bypass |
|---|---|---|
| IP Reputation | Datacenter IP ranges, known proxy lists, IP request volume | Medium |
| TLS Fingerprinting | JA3/JA4 signatures: your HTTP client's TLS handshake pattern | Hard |
| Browser Fingerprinting | Canvas, WebGL, AudioContext, navigator properties, fonts | Hard |
| Behavioral Analysis | Mouse movement, scroll patterns, click timing, page dwell time | Very Hard |
| HTTP/2 Fingerprinting | Frame ordering, header priorities, SETTINGS frame parameters | Very Hard |
| CAPTCHAs | Cloudflare Turnstile, reCAPTCHA v3, hCaptcha | Medium ($$) |
| Rate Limiting | Requests per IP per minute/hour, concurrent connections | Easy |
| Honeypot Links | Hidden links (CSS display:none) that only bots follow | Easy |
The key insight: no single technique works alone. Modern anti-bot systems score requests across multiple signals and block based on a composite trust score. You need to pass all checks simultaneously.
Technique 1: Proxy Rotation
The foundation of any anti-detection strategy. Sending all requests from one IP is the fastest way to get blocked.
Types of Proxies
| Proxy Type | Cost | Trust Score | Best For |
|---|---|---|---|
| Datacenter | $1-5/GB | Low | Unprotected sites, APIs |
| Residential | $5-15/GB | High | Cloudflare, Akamai protected sites |
| Mobile (4G/5G) | $15-30/GB | Very High | Heavily protected sites, social media |
| ISP (Static Residential) | $3-8/IP/day | High | Session-based scraping, accounts |
Python: DIY Proxy Rotation
```python
import random
import requests

# Residential proxy pool (you'd get these from BrightData, Oxylabs, etc.)
proxies = [
    "http://user:pass@gate.proxy-provider.com:7777",
    "http://user:pass@gate.proxy-provider.com:7778",
    "http://user:pass@gate.proxy-provider.com:7779",
]

def scrape_with_proxy(url):
    if not proxies:
        # Without this guard, the recursive retry below would crash
        # with an IndexError once every proxy has failed
        raise RuntimeError("Proxy pool exhausted")
    proxy = random.choice(proxies)
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        return response
    except requests.exceptions.ProxyError:
        # Drop the failing proxy and retry with another
        proxies.remove(proxy)
        return scrape_with_proxy(url)
```
Technique 2: Request Headers & User-Agent Rotation
Default Python headers scream "bot." The requests library sends python-requests/2.31.0 as the user agent. Anti-bot systems flag this instantly.
```python
import random

# Realistic 2026 browser user agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def get_realistic_headers():
    ua = random.choice(USER_AGENTS)
    return {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

response = requests.get(url, headers=get_realistic_headers())
```
Critical: Keep your headers consistent within a session. A Chrome user agent with Firefox-style Accept headers is an instant red flag. Match the full header set to the browser you're impersonating.
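One way to honor that rule is to rotate whole browser profiles rather than individual headers, so the User-Agent and its companion headers always agree. A minimal sketch; the profile values below are illustrative examples, not a verified capture of real browser traffic:

```python
import random

# Each profile bundles a User-Agent with headers that plausibly match it.
# These values are illustrative assumptions, not an exhaustive set.
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/avif,image/webp,*/*;q=0.8",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
                      "Gecko/20100101 Firefox/125.0",
        # Firefox's Accept header differs from Chrome's
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
    },
]

def pick_header_profile():
    """Return a full, internally consistent header set."""
    profile = dict(random.choice(BROWSER_PROFILES))
    profile.update({
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    })
    return profile
```

Because the random choice happens at the profile level, you can never end up with a Chrome User-Agent paired with Firefox-style Accept headers.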
Technique 3: Rate Limiting & Human-Like Delays
Humans don't request 100 pages per second. Anti-bot systems track request timing patterns and flag anything too fast or too regular.
```python
import random
import time

def human_delay():
    """Simulate human browsing patterns."""
    # Base delay: 2-5 seconds (normal reading time)
    base = random.uniform(2, 5)
    # Occasionally pause longer (checking phone, reading content)
    if random.random() < 0.1:  # 10% chance
        base += random.uniform(5, 15)
    # Add small jitter to avoid exact patterns
    jitter = random.gauss(0, 0.5)
    delay = max(1, base + jitter)
    time.sleep(delay)

# Scraping loop with human-like timing
for url in urls:
    response = scrape_with_proxy(url)
    process(response)
    human_delay()
```
Rules of thumb:
- Minimum 2 seconds between requests to the same domain
- Randomize delays: uniform intervals are a bot signal
- Respect robots.txt crawl-delay directives
- Back off on 429/503 with exponential backoff plus jitter
- Limit concurrent connections to 2-3 per domain
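The 429/503 rule can be sketched as exponential backoff with full jitter; the `fetch` callable and the status codes checked here are illustrative, not tied to any particular client library:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(fetch, url, max_retries=5):
    """Call fetch(url); on a 429/503 response, sleep a jittered, growing delay and retry."""
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code not in (429, 503):
            return response
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```

The jitter matters as much as the doubling: if every worker retries after exactly 1, 2, 4 seconds, they all hammer the site in synchronized waves.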
Technique 4: TLS Fingerprint Matching
This is where most scrapers get caught in 2026. Your HTTP client's TLS handshake creates a unique fingerprint (JA3/JA4) that anti-bot systems check against known browser signatures.
Python's requests library has a distinctive TLS fingerprint that doesn't match any real browser. Solutions:
```python
# Option 1: Use curl_cffi (impersonates real browser TLS)
from curl_cffi import requests as curl_requests

response = curl_requests.get(
    "https://example.com",
    impersonate="chrome124",  # Matches Chrome 124's TLS fingerprint
    headers=get_realistic_headers(),
)

# Option 2: Use tls-client
import tls_client

session = tls_client.Session(
    client_identifier="chrome_124",
    random_tls_extension_order=True,
)
response = session.get("https://example.com")
```
Why this matters: Cloudflare and Akamai check TLS fingerprints before your request even reaches the origin server. If your JA3 hash doesn't match a known browser, you're blocked at the edge; no amount of header manipulation will help.
Technique 5: JavaScript Rendering
A large share of modern websites require JavaScript to render their content, and sites protected by Cloudflare Turnstile require JavaScript execution to pass the challenge.
```python
# Using Playwright for full browser rendering
from playwright.sync_api import sync_playwright

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-features=IsolateOrigins,site-per-process",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            locale="en-US",
        )
        page = context.new_page()
        # Remove automation indicators
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
            window.chrome = { runtime: {} };
        """)
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
```
Technique 6: CAPTCHA Solving
When all other detection methods fail, sites fall back to CAPTCHAs. In 2026, the main CAPTCHA types are:
- Cloudflare Turnstile: invisible challenge that checks the browser environment
- reCAPTCHA v3: score-based (0.0-1.0), no user interaction
- hCaptcha: image classification challenges
- Custom challenges: site-specific puzzles, math problems, etc.
```python
# Using the 2Captcha API for CAPTCHA solving
import time

import requests

def solve_captcha(site_key, page_url):
    # Submit the CAPTCHA to the solving service
    response = requests.post("http://2captcha.com/in.php", data={
        "key": "YOUR_API_KEY",
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    captcha_id = response.text.split("|")[1]
    # Poll for the solution (typically takes 20-60 seconds)
    for _ in range(30):
        time.sleep(5)
        result = requests.get(
            f"http://2captcha.com/res.php?key=YOUR_API_KEY&action=get&id={captcha_id}"
        )
        if "CAPCHA_NOT_READY" not in result.text:
            return result.text.split("|")[1]
    raise Exception("CAPTCHA solving timed out")
```
Cost: CAPTCHA solving services charge $1-3 per 1,000 solves. If 10% of your requests trigger CAPTCHAs, scraping 10,000 pages costs $1-3 extra, but solving adds 20-60 seconds of latency to each affected request.
Technique 7: Session & Cookie Management
Anti-bot systems track sessions across requests. Getting a fresh Cloudflare cookie, then making 1,000 requests with it in 5 minutes, is suspicious.
```python
import requests

session = requests.Session()

# First request: establish a natural session
session.get("https://example.com", headers=get_realistic_headers())
human_delay()

# Browse naturally: homepage -> category -> product (don't jump straight to deep pages)
session.get("https://example.com/category", headers=get_realistic_headers())
human_delay()

# Now scrape the actual target
response = session.get("https://example.com/category/product-123", headers=get_realistic_headers())
```
Key rules:
- Warm up sessions: visit the homepage before deep pages
- Preserve cookies: anti-bot tokens in cookies validate subsequent requests
- Rotate sessions: create a new session every 50-100 requests
- Match session to IP: don't use the same session across different proxy IPs
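The rotation rule can be sketched as a small pool that retires a session after a fixed number of uses. The threshold and the structure below are assumptions about how you'd organize your own code, not a standard API:

```python
import requests

class RotatingSessionPool:
    """Hand out a requests.Session, replacing it after max_uses requests."""

    def __init__(self, max_uses=75, make_headers=None):
        self.max_uses = max_uses
        self.make_headers = make_headers or dict  # e.g. get_realistic_headers
        self._session = None
        self._uses = 0

    def get_session(self):
        if self._session is None or self._uses >= self.max_uses:
            # Fresh session: new cookie jar, new header profile
            self._session = requests.Session()
            self._session.headers.update(self.make_headers())
            self._uses = 0
        self._uses += 1
        return self._session
```

To follow the match-session-to-IP rule, pair each fresh session with a fresh proxy, so the two identities rotate together rather than drifting apart.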
Technique 8: Browser Fingerprint Randomization
Modern anti-bot systems create a unique fingerprint from dozens of browser properties. Even with headless browsers, default configurations are detectable.
Properties checked include:
- Canvas fingerprint: how your browser renders a hidden canvas element
- WebGL renderer: GPU and driver information
- AudioContext: audio processing fingerprint
- Navigator properties: platform, languages, hardware concurrency, device memory
- Screen resolution: must match common resolutions (1920×1080, 1366×768, etc.)
- Installed fonts: font enumeration reveals OS and installed software
- WebRTC: can leak your real IP even behind a proxy
Tools like Playwright Stealth and Puppeteer Stealth handle most of these, but sophisticated sites still detect them. This is an ongoing arms race.
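Randomization only helps if the randomized values stay mutually consistent; a Windows user agent reporting an Apple WebGL renderer is itself a red flag. A minimal sketch of picking whole coherent profiles rather than mixing independent random values (every value below is illustrative, not captured from a real browser):

```python
import random

# Coherent fingerprint bundles: every property agrees with the claimed OS.
# These specific values are illustrative assumptions.
FINGERPRINT_PROFILES = [
    {
        "platform": "Win32",
        "user_agent_os": "Windows NT 10.0; Win64; x64",
        "webgl_vendor": "Google Inc. (NVIDIA)",
        "screen": (1920, 1080),
        "hardware_concurrency": 8,
    },
    {
        "platform": "MacIntel",
        "user_agent_os": "Macintosh; Intel Mac OS X 14_4",
        "webgl_vendor": "Apple Inc.",
        "screen": (1440, 900),
        "hardware_concurrency": 10,
    },
]

def pick_fingerprint():
    """Choose one internally consistent profile instead of mixing random values."""
    return random.choice(FINGERPRINT_PROFILES)

def init_script(fp):
    """Build a browser init script (Playwright add_init_script style) from the profile."""
    return (
        f"Object.defineProperty(navigator, 'platform', {{get: () => '{fp['platform']}'}});\n"
        f"Object.defineProperty(navigator, 'hardwareConcurrency', "
        f"{{get: () => {fp['hardware_concurrency']}}});"
    )
```

The generated script would be passed to something like Playwright's `page.add_init_script()` from Technique 5, so the overrides apply before the page's own scripts run.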
Technique 9: Honeypot Detection
Honeypots are invisible links or form fields that only bots interact with. They're simple but effective:
```python
from bs4 import BeautifulSoup

def safe_extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Skip hidden links (honeypots)
        style = a.get("style", "")
        parent_style = a.parent.get("style", "") if a.parent else ""
        classes = " ".join(a.get("class", []))
        # Common honeypot indicators
        if any(indicator in style for indicator in ["display:none", "display: none", "visibility:hidden", "opacity:0"]):
            continue
        if any(indicator in parent_style for indicator in ["display:none", "display: none", "visibility:hidden"]):
            continue
        if "hidden" in classes or "trap" in classes or "honeypot" in classes:
            continue
        # Skip zero-size elements
        if "width:0" in style or "height:0" in style:
            continue
        links.append(a["href"])
    return links
```
Rule: Never follow links that aren't visible to real users. Parse the DOM carefully and check for CSS that hides elements.
Technique 10: Respect robots.txt (Seriously)
This isn't just about ethics; it's practical. Sites that see you ignoring robots.txt are more likely to escalate anti-bot measures against you specifically.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch("*", url)

# Check before scraping
if can_scrape("https://example.com/data"):
    response = scrape_with_proxy("https://example.com/data")
else:
    print("Blocked by robots.txt; skipping")
```
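The same RobotFileParser also exposes any Crawl-delay directive, so your rate limiter from Technique 3 can honor it. A small sketch, parsing from in-memory lines to stay self-contained (in production you'd fetch the file with set_url()/read() as above):

```python
from urllib.robotparser import RobotFileParser

def crawl_delay_for(robots_txt_lines, user_agent="*"):
    """Parse robots.txt content and return its Crawl-delay in seconds (None if unset)."""
    rp = RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.crawl_delay(user_agent)

# Example robots.txt content (illustrative)
example = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private",
]
```

If the site declares a Crawl-delay larger than your own minimum, use the site's value as the floor for your randomized delays.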
The Real Cost of DIY Anti-Detection
Let's add it up. For a production scraping operation doing 25,000 requests/month:
| Component | DIY Cost/Month | Mantis API |
|---|---|---|
| Residential Proxies (50GB) | $250-750 | ✓ Included |
| CAPTCHA Solving | $50-200 | ✓ Included |
| Headless Browser Infra | $50-200 | ✓ Included |
| Anti-detect Libraries | $0-100 | ✓ Included |
| Maintenance (10-20 hrs) | $500-1,000* | ✓ Zero |
| Total | $850-2,250 | $99/month |
*Developer time valued at $50/hr. Anti-bot systems update frequently, requiring ongoing maintenance.
The API Shortcut: Skip the Arms Race
Everything above (proxies, headers, fingerprints, CAPTCHAs, sessions) exists because you're trying to make an HTTP client look like a real browser. A web scraping API handles all of it behind a single endpoint.
One API Call Replaces 200 Lines of Anti-Detection Code
```python
import requests

# Everything above (proxies, headers, TLS fingerprinting, CAPTCHAs,
# JavaScript rendering, session management) in one call:
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/product/123",
        "render_js": True,   # Full browser rendering
        "extract": {         # AI-powered data extraction
            "title": "product name",
            "price": "current price",
            "rating": "customer rating",
            "reviews": "number of reviews",
        },
    },
)
data = response.json()
# Returns: {"title": "...", "price": "$29.99", "rating": "4.5", "reviews": "1,234"}
```
What Mantis handles automatically:
- ✓ Residential proxy rotation across 195+ countries
- ✓ TLS fingerprint matching (JA3/JA4)
- ✓ Full JavaScript rendering with stealth mode
- ✓ Automatic CAPTCHA solving
- ✓ Browser fingerprint randomization
- ✓ Session and cookie management
- ✓ Adaptive rate limiting
- ✓ AI-powered structured data extraction
Stop Fighting Anti-Bot Systems
Mantis API handles all anti-detection automatically. Get structured data from any website with a single API call.
Start Free: 100 requests/month

When to DIY vs. When to Use an API
DIY scraping makes sense when:
- You're scraping unprotected sites (no Cloudflare/Akamai)
- You need full control over the scraping logic
- You're scraping at very high volume (100K+ pages/day) where per-request pricing adds up
- You're doing academic research with specific methodology requirements
A scraping API makes sense when:
- Target sites use anti-bot protection (most commercial sites in 2026)
- You need structured data extraction, not raw HTML
- You're building an AI agent that needs reliable web access
- Your time is worth more than the API subscription
- You don't want to maintain anti-detection infrastructure
Putting It All Together: The Layered Approach
If you're going the DIY route, here's the recommended stack in order of importance:
1. Residential proxies: the foundation. Without clean IPs, nothing else matters.
2. TLS fingerprint matching: use curl_cffi or tls-client. This catches most developers off guard.
3. Realistic headers: full header sets that match your impersonated browser.
4. Human-like delays: 2-10 second randomized delays between requests.
5. Session management: warm up sessions, rotate every 50-100 requests.
6. JavaScript rendering: Playwright with stealth plugins for JS-heavy sites.
7. CAPTCHA solving: as a last resort when everything else fails.
8. Honeypot avoidance: parse the DOM carefully, skip hidden elements.
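If you do assemble the stack yourself, the layers above can be wired together as a skeleton loop. Here `fetch`, `is_blocked`, and `solve_challenge` are placeholders for your own implementations of the earlier techniques, and `delay_range` is a parameter added for illustration:

```python
import random
import time

def scrape_pipeline(urls, fetch, is_blocked, solve_challenge=None, delay_range=(2, 5)):
    """Layered scraping loop: delay, fetch, detect blocks, escalate or skip."""
    results = {}
    for url in urls:
        # Human-like delay between requests (Technique 3)
        time.sleep(random.uniform(*delay_range))
        # fetch() is assumed to handle proxies, TLS, and headers (Techniques 1, 2, 4)
        response = fetch(url)
        if is_blocked(response):
            if solve_challenge is not None:
                # Escalate to JS rendering / CAPTCHA solving (Techniques 5-6)
                response = solve_challenge(url)
                if is_blocked(response):
                    continue
            else:
                # Skip rather than hammer a site that has flagged us
                continue
        results[url] = response
    return results
```

The key design point is the escalation order: cheap layers (delays, headers, proxies) run on every request, while expensive ones (browsers, CAPTCHA solvers) fire only when a block is actually detected.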
Or, if you'd rather build your product than fight anti-bot systems: use an API.