Web Scraping Without Getting Blocked: The 2026 Playbook
March 6, 2026 guide
# Web Scraping Without Getting Blocked: The 2026 Playbook
You've built a scraper. It works perfectly — for about 10 minutes. Then: 403 Forbidden. CAPTCHA walls. IP bans. Your beautiful data pipeline grinds to a halt.
Getting blocked is the #1 frustration in web scraping. And in 2026, anti-bot systems are smarter than ever. Cloudflare, DataDome, PerimeterX, and custom WAFs use browser fingerprinting, behavioral analysis, and machine learning to detect and block scrapers.
But there are proven ways to scrape reliably without getting blocked. This guide covers every technique — from basics to advanced — plus a modern approach that sidesteps the entire problem.
## Why Scrapers Get Blocked
Before solving the problem, understand what triggers blocks:
### 1. Request Pattern Detection
- Too many requests per second from one IP
- Identical headers across all requests
- No cookies or session continuity
- Hitting pages in non-human order (e.g., skipping the homepage)
### 2. Browser Fingerprinting
- Missing or inconsistent JavaScript execution
- No WebGL, Canvas, or AudioContext fingerprints
- Navigator properties that don't match the User-Agent
- Missing browser plugins and fonts
### 3. Behavioral Analysis
- No mouse movements or scroll events
- Instant page loads (no rendering time)
- Accessing robots.txt-disallowed paths
- No referrer headers
### 4. IP Reputation
- Known datacenter IP ranges (AWS, GCP, Azure)
- IPs flagged by threat intelligence feeds
- Tor exit nodes and known proxy IPs
## Technique 1: Request-Level Fixes
The basics. These won't beat sophisticated anti-bot systems alone, but they're table stakes.
### Rotate User-Agents
```python
import random
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/537.36 Chrome/119.0.0.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0",
]
headers = {
"User-Agent": random.choice(user_agents),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
}
```
### Add Delays Between Requests
```python
import time
import random
def polite_delay():
"""Random delay between 2-5 seconds — mimics human browsing."""
time.sleep(random.uniform(2, 5))
```
### Use Sessions (Cookie Persistence)
```python
import requests
session = requests.Session()
# First request establishes cookies
session.get("https://example.com")
# Subsequent requests carry cookies automatically
response = session.get("https://example.com/data")
```
**Verdict:** Works against basic protections. Fails against Cloudflare, DataDome, and any modern anti-bot system.
## Technique 2: Proxy Rotation
Distributing requests across many IP addresses is essential for scale.
### Types of Proxies
| Type | Cost | Detection Risk | Speed |
|------|------|----------------|-------|
| Datacenter | $1-5/GB | High | Fast |
| Residential | $5-15/GB | Low | Medium |
| Mobile | $15-30/GB | Very Low | Slow |
| ISP (Static Residential) | $2-5/IP/mo | Low | Fast |
### Implementation
```python
import itertools
proxies = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(proxies)
def get_with_rotation(url):
proxy = next(proxy_pool)
return requests.get(url, proxies={"http": proxy, "https": proxy})
```
### Smart Rotation
Don't just rotate randomly. Use sticky sessions per domain:
```python
from collections import defaultdict
domain_proxies = defaultdict(lambda: next(proxy_pool))
def get_sticky(url):
from urllib.parse import urlparse
domain = urlparse(url).netloc
proxy = domain_proxies[domain]
return requests.get(url, proxies={"http": proxy, "https": proxy})
```
**Verdict:** Essential for scale. Residential proxies are expensive but effective. Still fails against fingerprinting.
## Technique 3: Headless Browsers
When sites require JavaScript rendering, headless browsers simulate a real browser.
### Playwright (Recommended over Selenium)
```python
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
locale="en-US",
)
page = context.new_page()
page.goto("https://example.com")
content = page.content()
browser.close()
```
### Stealth Plugins
Standard headless browsers are trivially detectable. Use stealth plugins:
```python
# playwright-stealth patches common detection vectors
from playwright_stealth import stealth_sync
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
stealth_sync(page)
page.goto("https://nowsecure.nl") # Anti-bot test site
```
### What Stealth Fixes
- `navigator.webdriver` property
- Chrome DevTools protocol detection
- Missing plugin arrays
- Inconsistent permissions API
- WebGL vendor/renderer strings
**Verdict:** Beats many anti-bot systems. But headless browsers are slow (2-5s per page), resource-heavy, and stealth patches are a constant arms race.
## Technique 4: Browser Farms
For serious scale, run pools of real browser instances:
```python
# Using browserless.io or similar services
import requests
API_KEY = "your-browserless-key"
def scrape_with_browser_farm(url):
response = requests.post(
f"https://chrome.browserless.io/content?token={API_KEY}",
json={"url": url},
headers={"Content-Type": "application/json"}
)
return response.text
```
**Verdict:** Expensive ($50-500/mo), complex to manage, and still requires proxy rotation for blocked sites.
## Technique 5: CAPTCHA Solving
When you hit CAPTCHAs, you have three options:
1. **CAPTCHA solving services** (2Captcha, Anti-Captcha): $1-3 per 1,000 CAPTCHAs
2. **AI-based solvers**: Faster but less reliable
3. **Avoid triggering them entirely**: The best strategy
```python
# 2Captcha integration example
import requests
def solve_recaptcha(site_key, page_url):
# Submit CAPTCHA
resp = requests.post("http://2captcha.com/in.php", data={
"key": API_KEY,
"method": "userrecaptcha",
"googlekey": site_key,
"pageurl": page_url,
})
captcha_id = resp.text.split("|")[1]
# Poll for solution (takes 20-60 seconds)
for _ in range(30):
time.sleep(5)
result = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
if "CAPCHA_NOT_READY" not in result.text:
return result.text.split("|")[1]
return None
```
**Verdict:** Adds latency and cost. A symptom of an arms race you're losing.
## The Modern Approach: API-Based Scraping
Here's the thing: all the techniques above are **workarounds** for a fundamental problem. You're trying to make a machine look like a human browsing the web. Anti-bot vendors spend millions making that harder every year.
The modern approach is to use a **web scraping API** that handles all of this for you:
```python
import requests
response = requests.post(
"https://api.mantisapi.com/v1/scrape",
headers={"x-api-key": "your-api-key"},
json={"url": "https://example.com/products"}
)
data = response.json()
print(data["content"]) # Clean, extracted content
```
### What WebPerception API Handles For You
- **Anti-bot bypass**: Rotating residential proxies, browser fingerprinting, CAPTCHA handling — all built in
- **JavaScript rendering**: Full browser execution without running your own headless browsers
- **AI-powered extraction**: Ask for structured data in natural language — no CSS selectors to maintain
```python
# Extract structured data without writing selectors
response = requests.post(
"https://api.mantisapi.com/v1/extract",
headers={"x-api-key": "your-api-key"},
json={
"url": "https://example.com/products",
"prompt": "Extract all product names, prices, and ratings as JSON"
}
)
products = response.json()["data"]
# [{"name": "Widget Pro", "price": "$29.99", "rating": 4.8}, ...]
```
### Cost Comparison
| Approach | Monthly Cost (10K pages) | Maintenance | Reliability |
|----------|-------------------------|-------------|-------------|
| DIY (proxies + headless) | $200-500 | High | 70-85% |
| Browser farm service | $150-300 | Medium | 80-90% |
| WebPerception API (Starter) | $29 | Zero | 95%+ |
### When to Use Each
**DIY scraping** makes sense when:
- You're scraping a small number of simple, cooperative sites
- You need sub-second latency
- You're learning web scraping fundamentals
**API-based scraping** makes sense when:
- You're building production data pipelines
- You can't afford downtime from blocked scrapers
- You'd rather spend time on your product than on anti-bot cat-and-mouse
- You're building AI agents that need reliable web access
## Quick Reference: Anti-Block Checklist
Before you scrape, run through this checklist:
- [ ] Rotate User-Agents (pool of 10+)
- [ ] Add realistic headers (Accept, Accept-Language, Referer)
- [ ] Use sessions for cookie persistence
- [ ] Add random delays (2-5s between requests)
- [ ] Rotate IPs (residential proxies for sensitive sites)
- [ ] Use headless browser with stealth patches for JS-heavy sites
- [ ] Respect robots.txt (ethical scraping = sustainable scraping)
- [ ] Monitor success rates and adjust when they drop
- [ ] Have a fallback plan (API-based scraping) for critical pipelines
## Conclusion
Web scraping in 2026 is an arms race. Anti-bot systems get smarter every quarter. You can play the game — rotating proxies, stealth browsers, CAPTCHA solvers — or you can step out of the game entirely.
For hobby projects and learning, DIY scraping teaches you invaluable skills. For production systems, agents, and anything where reliability matters, an API-based approach like [WebPerception API](https://mantisapi.com) saves you time, money, and headaches.
**Ready to stop getting blocked?** [Get your free API key](https://mantisapi.com) — 100 requests/month, no credit card required.
Ready to try Mantis?
100 free API calls/month. No credit card required.
Get Your API Key →