Web Scraping with Python Requests in 2026: The Complete Guide
The Python Requests library is where most web scraping journeys begin — and for good reason. It's one of the most downloaded packages on PyPI, with an API so intuitive it practically reads like English. If you need to fetch web pages and extract data, Requests is your foundation.
This guide covers everything from basic GET requests to production-ready scraping patterns — including sessions, authentication, headers, proxies, concurrency, and when to graduate to something more powerful.
Table of Contents
- Installation & Setup
- Your First Scraping Request
- HTTP Methods: GET, POST, PUT, DELETE
- Request Headers & User-Agent Rotation
- Session Objects & Cookie Persistence
- Authentication: Basic, Token, OAuth
- Parsing HTML with BeautifulSoup
- Working with JSON APIs
- Scraping Paginated Pages
- Proxy Rotation for Anti-Detection
- Rate Limiting & Retry Logic
- Concurrent Scraping with ThreadPoolExecutor
- Production-Ready Scraper Class
- Requests vs httpx vs aiohttp vs API
- The API Shortcut: Skip HTTP Complexity
- FAQ
1. Installation & Setup
Install Requests and BeautifulSoup (for HTML parsing):
pip install requests beautifulsoup4 lxml
Verify your setup:
import requests
from bs4 import BeautifulSoup
print(requests.__version__) # 2.32.x
print("Ready to scrape!")
Tip: use a virtual environment (python -m venv venv) to keep your scraping dependencies isolated.
2. Your First Scraping Request
Fetching a web page is one line:
import requests
response = requests.get("https://example.com")
print(response.status_code) # 200
print(response.headers["content-type"]) # text/html; charset=UTF-8
print(response.text[:500]) # First 500 chars of HTML
The response object gives you everything:
- response.status_code — HTTP status (200, 404, 403, etc.)
- response.text — Response body as a string (decoded)
- response.content — Response body as bytes (for images, PDFs)
- response.headers — Response headers (dict-like)
- response.url — Final URL after redirects
- response.cookies — Cookies set by the server
- response.elapsed — Time the request took
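Since response.content holds the raw bytes, downloading an image or PDF is just a matter of writing those bytes to disk. A minimal sketch, with a placeholder URL:

```python
import requests

def download_file(url, path, timeout=10):
    """Download a binary resource (image, PDF, etc.) and write it to disk."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)  # raw bytes, not decoded text
    return len(response.content)

# download_file("https://example.com/logo.png", "logo.png")
```

For large files, pass stream=True to requests.get and write chunks from response.iter_content(chunk_size=8192) instead of holding the whole body in memory.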
Handling Errors Properly
Never assume a request succeeds. Always check:
import requests
response = requests.get("https://example.com")
# Option 1: Check status code
if response.status_code == 200:
html = response.text
else:
print(f"Failed: {response.status_code}")
# Option 2: Raise exception on HTTP errors (4xx, 5xx)
response.raise_for_status() # Raises requests.exceptions.HTTPError
Setting Timeouts (Critical!)
Without a timeout, your scraper can hang forever on unresponsive servers:
# Always set a timeout (seconds)
response = requests.get("https://example.com", timeout=10)
# Separate connect and read timeouts
response = requests.get("https://example.com", timeout=(5, 30))
# 5s to connect, 30s to read
Use timeout=10 as a sensible default.
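Note that a timeout surfaces as an exception (requests.exceptions.Timeout), not as a response object, so production code usually wraps the call. A minimal sketch of a helper that turns the common failure modes into a None return:

```python
import requests

def fetch(url, timeout=10):
    """GET a URL; return the response, or None on any request failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # 4xx/5xx -> HTTPError
        return response
    except requests.exceptions.Timeout:
        print(f"Timed out: {url}")
    except requests.exceptions.ConnectionError:
        print(f"Connection failed: {url}")
    except requests.exceptions.RequestException as e:
        # Base class for every exception Requests raises
        print(f"Request failed: {url} ({e})")
    return None
```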
3. HTTP Methods: GET, POST, PUT, DELETE
Most scraping uses GET, but sometimes you need POST (for forms, search queries, or API endpoints):
import requests
# GET — fetch a page
response = requests.get("https://api.example.com/products")
# POST — submit form data
response = requests.post("https://example.com/search", data={
"query": "web scraping",
"page": 1
})
# POST — submit JSON data (API endpoints)
response = requests.post("https://api.example.com/search", json={
"query": "web scraping",
"filters": {"category": "tools"}
})
# PUT — update a resource
response = requests.put("https://api.example.com/items/42", json={
"name": "Updated Item"
})
# DELETE — remove a resource
response = requests.delete("https://api.example.com/items/42")
4. Request Headers & User-Agent Rotation
Servers use headers to identify your client. The default Requests user-agent (python-requests/2.32.x) screams "bot" — and many sites block it immediately.
Setting Custom Headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
}
response = requests.get("https://example.com", headers=headers, timeout=10)
Rotating User-Agents
import random
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]
def get_random_headers():
return {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com/",
}
response = requests.get("https://example.com", headers=get_random_headers(), timeout=10)
5. Session Objects & Cookie Persistence
A Session object persists cookies, headers, and connection pools across requests — essential for scraping sites that require login or track sessions:
import requests
session = requests.Session()
# Set default headers for ALL requests in this session
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
})
# First request — server sets cookies
response = session.get("https://example.com", timeout=10)
# Subsequent requests automatically include cookies
response = session.get("https://example.com/dashboard", timeout=10)
# Check stored cookies
print(session.cookies.get_dict())
# {'session_id': 'abc123', 'csrf_token': 'xyz789'}
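You can also seed the jar yourself, for example to reuse a login cookie copied from your browser's DevTools. The cookie names and values below are placeholders:

```python
import requests

session = requests.Session()

# Set cookies manually before making any requests
session.cookies.set("session_id", "abc123", domain="example.com")
session.cookies.set("csrf_token", "xyz789", domain="example.com")

print(session.cookies.get_dict())
# {'session_id': 'abc123', 'csrf_token': 'xyz789'}
```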
Login with Session
session = requests.Session()
# Step 1: GET the login page (grab CSRF token)
login_page = session.get("https://example.com/login", timeout=10)
# Parse the CSRF token from the login form (the input name varies by site)
from bs4 import BeautifulSoup
soup = BeautifulSoup(login_page.text, "lxml")
csrf_token = soup.select_one("input[name='csrf_token']")["value"]
# Step 2: POST credentials
login_response = session.post("https://example.com/login", data={
"username": "your_user",
"password": "your_pass",
"csrf_token": csrf_token, # From step 1
}, timeout=10)
# Step 3: Access protected pages (cookies are automatic)
dashboard = session.get("https://example.com/dashboard", timeout=10)
print(dashboard.status_code) # 200 if login succeeded
Use a Session for any multi-step flow like this: cookies set by the server are not persisted across standalone requests.get() calls.
6. Authentication: Basic, Token, OAuth
Basic Authentication
from requests.auth import HTTPBasicAuth
response = requests.get(
"https://api.example.com/data",
auth=HTTPBasicAuth("username", "password"),
timeout=10
)
# Shorthand (tuple)
response = requests.get(
"https://api.example.com/data",
auth=("username", "password"),
timeout=10
)
Bearer Token (API Keys)
headers = {
"Authorization": "Bearer your_api_key_here",
"Content-Type": "application/json"
}
response = requests.get("https://api.example.com/data", headers=headers, timeout=10)
Custom Auth (AuthBase)
For custom schemes (for example, refreshing an OAuth 2.0 token before each request), subclass AuthBase; its __call__ hook runs on every request:
from requests.auth import AuthBase
class TokenAuth(AuthBase):
def __init__(self, token):
self.token = token
def __call__(self, r):
r.headers["Authorization"] = f"Bearer {self.token}"
return r
response = requests.get(
"https://api.example.com/data",
auth=TokenAuth("your_token"),
timeout=10
)
7. Parsing HTML with BeautifulSoup
Requests fetches the HTML. BeautifulSoup parses it. Together, they're the classic scraping duo:
import requests
from bs4 import BeautifulSoup
# Fetch the page
response = requests.get("https://news.ycombinator.com", timeout=10)
response.raise_for_status()
# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")
# Extract all story titles
stories = soup.select(".titleline > a")
for story in stories[:10]:
print(f"{story.text} → {story['href']}")
Extracting Structured Data
import requests
from bs4 import BeautifulSoup
import json
response = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(response.text, "lxml")
products = []
for card in soup.select(".product-card"):
product = {
"name": card.select_one(".product-name").text.strip(),
"price": card.select_one(".price").text.strip(),
"url": card.select_one("a")["href"],
"rating": card.select_one(".rating").text.strip() if card.select_one(".rating") else None,
}
products.append(product)
# Save as JSON
with open("products.json", "w") as f:
json.dump(products, f, indent=2)
print(f"Scraped {len(products)} products")
For a deep dive into BeautifulSoup, see our Complete BeautifulSoup Guide.
8. Working with JSON APIs
Many modern websites load data via JSON APIs behind the scenes. Scraping these directly is faster and more reliable than parsing HTML:
import requests
# Hit the API directly
response = requests.get("https://api.example.com/products", params={
"category": "electronics",
"page": 1,
"per_page": 50
}, timeout=10)
data = response.json() # Parse JSON response
for product in data["results"]:
print(f"{product['name']} — ${product['price']}")
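Requests URL-encodes params into the query string for you. If you want to see the final URL without sending anything (handy when debugging a reproduced API call), PreparedRequest can build it:

```python
from requests.models import PreparedRequest

p = PreparedRequest()
p.prepare_url("https://api.example.com/products", {
    "category": "electronics",
    "page": 1,
    "per_page": 50,
})
print(p.url)
# https://api.example.com/products?category=electronics&page=1&per_page=50
```

After a real request, response.request.url holds the URL that was actually sent.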
Finding Hidden APIs
Here's how to discover the API endpoints a website uses:
- Open Chrome DevTools (F12)
- Go to Network tab → filter by XHR/Fetch
- Interact with the page (search, paginate, filter)
- Look at the requests — copy the URL, headers, and payload
- Reproduce the request with Python Requests
# Reproduce a discovered API call
response = requests.get(
"https://example.com/api/v2/search",
params={"q": "laptop", "sort": "price_asc", "page": 1},
headers={
"User-Agent": "Mozilla/5.0 ...",
"X-Requested-With": "XMLHttpRequest", # Often required
"Referer": "https://example.com/search?q=laptop",
},
timeout=10
)
results = response.json()
9. Scraping Paginated Pages
URL-Based Pagination
import requests
from bs4 import BeautifulSoup
import time
all_products = []
for page in range(1, 51): # Pages 1-50
response = requests.get(
f"https://example.com/products?page={page}",
headers=get_random_headers(),
timeout=10
)
if response.status_code != 200:
print(f"Page {page} failed: {response.status_code}")
break
soup = BeautifulSoup(response.text, "lxml")
products = soup.select(".product-card")
if not products: # No more products — we've reached the end
break
for p in products:
all_products.append({
"name": p.select_one(".name").text.strip(),
"price": p.select_one(".price").text.strip(),
})
print(f"Page {page}: {len(products)} products")
time.sleep(1.5) # Be polite — don't hammer the server
print(f"Total: {len(all_products)} products")
API Pagination with Cursors
import requests
all_items = []
cursor = None
while True:
params = {"limit": 100}
if cursor:
params["cursor"] = cursor
response = requests.get(
"https://api.example.com/items",
params=params,
timeout=10
)
data = response.json()
all_items.extend(data["items"])
cursor = data.get("next_cursor")
if not cursor: # No more pages
break
print(f"Total items: {len(all_items)}")
10. Proxy Rotation for Anti-Detection
Using the same IP for thousands of requests will get you blocked. Proxy rotation distributes requests across multiple IPs:
import requests
import random
PROXIES = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080",
]
def get_with_proxy(url, **kwargs):
proxy = random.choice(PROXIES)
return requests.get(
url,
proxies={"http": proxy, "https": proxy},
timeout=10,
**kwargs
)
response = get_with_proxy("https://example.com")
Proxy with Session
session = requests.Session()
session.proxies = {
"http": "http://user:pass@proxy.example.com:8080",
"https": "http://user:pass@proxy.example.com:8080",
}
# All requests through this session use the proxy
response = session.get("https://example.com", timeout=10)
SOCKS5 Proxy (Tor)
# pip install requests[socks]
import requests
session = requests.Session()
session.proxies = {
"http": "socks5h://127.0.0.1:9050",
"https": "socks5h://127.0.0.1:9050",
}
response = session.get("https://check.torproject.org", timeout=30)
11. Rate Limiting & Retry Logic
Responsible scraping means not overloading servers. Here's a robust retry pattern:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time
def create_scraping_session(retries=3, backoff=0.5):
session = requests.Session()
retry_strategy = Retry(
total=retries,
backoff_factor=backoff, # 0.5s, 1s, 2s between retries
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET", "POST"],
)
adapter = HTTPAdapter(
max_retries=retry_strategy,
pool_connections=10,
pool_maxsize=10,
)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
# Usage
session = create_scraping_session()
response = session.get("https://example.com", timeout=10)
Respecting Rate Limits
import time
def rate_limited_get(session, url, delay=1.5, **kwargs):
"""GET with rate limiting."""
response = session.get(url, timeout=10, **kwargs)
# Check for rate limit response
if response.status_code == 429:
retry_after = int(response.headers.get("Retry-After", 60))
print(f"Rate limited. Waiting {retry_after}s...")
time.sleep(retry_after)
response = session.get(url, timeout=10, **kwargs)
time.sleep(delay) # Polite delay between requests
return response
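If you'd rather enforce the delay in one place instead of threading it through every call, a small throttle decorator works. This is a sketch, not part of Requests:

```python
import time
import functools

def throttle(min_interval=1.5):
    """Decorator: enforce a minimum delay between calls to the wrapped function."""
    def decorator(fn):
        last_call = [0.0]  # mutable cell the wrapper can update
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@throttle(min_interval=1.5)
def fetch_page(session, url):
    return session.get(url, timeout=10)
```

In a multithreaded scraper, guard last_call with a threading.Lock so workers don't race past the interval.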
12. Concurrent Scraping with ThreadPoolExecutor
Sequential scraping is slow. Use threads to scrape multiple pages simultaneously:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup
import time
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"
})
def scrape_page(url):
"""Scrape a single page and return extracted data."""
try:
response = session.get(url, timeout=15)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
title = soup.select_one("title").text.strip()
return {"url": url, "title": title, "status": "success"}
except Exception as e:
return {"url": url, "title": None, "status": f"error: {e}"}
# Generate URLs
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
# Scrape concurrently (10 threads)
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
futures = {executor.submit(scrape_page, url): url for url in urls}
for future in as_completed(futures):
result = future.result()
results.append(result)
if result["status"] == "success":
print(f"✓ {result['url']} — {result['title']}")
else:
print(f"✗ {result['url']} — {result['status']}")
print(f"\nScraped {len(results)} pages")
13. Production-Ready Scraper Class
Here's a complete, reusable scraper class with all best practices built in:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time
import json
import csv
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)
class WebScraper:
"""Production-ready web scraper with retry, rotation, and export."""
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
]
def __init__(self, delay=1.5, max_retries=3, timeout=15, max_workers=5):
self.delay = delay
self.timeout = timeout
self.max_workers = max_workers
self.session = self._create_session(max_retries)
self.results = []
def _create_session(self, max_retries):
session = requests.Session()
retry = Retry(
total=max_retries,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry, pool_maxsize=20)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
def _get_headers(self):
return {
"User-Agent": random.choice(self.USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
}
def get(self, url):
"""Fetch a URL with rotation and delay."""
response = self.session.get(
url, headers=self._get_headers(), timeout=self.timeout
)
response.raise_for_status()
time.sleep(self.delay + random.uniform(0, 0.5))
return response
def get_soup(self, url):
"""Fetch and parse a URL."""
response = self.get(url)
return BeautifulSoup(response.text, "lxml")
def scrape_urls(self, urls, parse_fn):
"""Scrape multiple URLs concurrently."""
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
futures = {executor.submit(self._safe_scrape, url, parse_fn): url for url in urls}
for future in as_completed(futures):
result = future.result()
if result:
self.results.append(result)
logger.info(f"Scraped {len(self.results)} items from {len(urls)} URLs")
return self.results
def _safe_scrape(self, url, parse_fn):
try:
soup = self.get_soup(url)
return parse_fn(soup, url)
except Exception as e:
logger.error(f"Failed: {url} — {e}")
return None
def to_json(self, filepath):
with open(filepath, "w") as f:
json.dump(self.results, f, indent=2)
logger.info(f"Saved {len(self.results)} items to {filepath}")
def to_csv(self, filepath):
if not self.results:
return
with open(filepath, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=self.results[0].keys())
writer.writeheader()
writer.writerows(self.results)
logger.info(f"Saved {len(self.results)} items to {filepath}")
# --- Usage Example ---
def parse_product(soup, url):
"""Custom parsing function for a product page."""
return {
"url": url,
"title": soup.select_one("h1").text.strip() if soup.select_one("h1") else "N/A",
"price": soup.select_one(".price").text.strip() if soup.select_one(".price") else "N/A",
"description": soup.select_one(".description").text.strip()[:200] if soup.select_one(".description") else "N/A",
}
scraper = WebScraper(delay=1.0, max_workers=5)
urls = [f"https://example.com/product/{i}" for i in range(1, 101)]
results = scraper.scrape_urls(urls, parse_product)
scraper.to_json("products.json")
scraper.to_csv("products.csv")
14. Requests vs httpx vs aiohttp vs API
| Feature | Requests | httpx | aiohttp | Mantis API |
|---|---|---|---|---|
| Async support | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| HTTP/2 | ❌ No | ✅ Yes | ❌ No | ✅ Yes |
| Connection pooling | ✅ Session | ✅ Client | ✅ Connector | ✅ Managed |
| JS rendering | ❌ No | ❌ No | ❌ No | ✅ Yes |
| Anti-bot bypass | ❌ Manual | ❌ Manual | ❌ Manual | ✅ Automatic |
| Proxy management | ❌ Manual | ❌ Manual | ❌ Manual | ✅ Built-in |
| Learning curve | ⭐ Easy | ⭐ Easy | ⭐⭐ Medium | ⭐ Easy |
| Best for | Simple projects | Async + HTTP/2 | High concurrency | Production at scale |
| Monthly cost | $0 + proxies ($100-500) | $0 + proxies ($100-500) | $0 + proxies ($100-500) | $29-299 all-inclusive |
- Need async? → httpx (drop-in replacement with async support)
- Need maximum concurrency? → aiohttp (purpose-built for async HTTP)
- Need JS rendering + anti-detection? → Mantis API (handles everything)
- Simple, synchronous scraping? → Stick with Requests
15. The API Shortcut: Skip HTTP Complexity
Building a production scraper with Requests means managing headers, proxies, retries, rate limits, CAPTCHAs, and anti-bot detection — all yourself. Or you can make one API call:
import requests
# DIY with Requests: 50+ lines of proxy rotation, header management, retry logic
# ...or...
# Mantis API: one call, done
response = requests.post(
"https://api.mantisapi.com/v1/scrape",
headers={"Authorization": "Bearer YOUR_API_KEY"},
json={
"url": "https://example.com/products",
"extract": {
"products": {
"selector": ".product-card",
"fields": {
"name": ".product-name",
"price": ".price",
"rating": ".rating"
}
}
}
}
)
data = response.json()
for product in data["products"]:
print(f"{product['name']} — {product['price']}")
Stop Managing HTTP Infrastructure
Mantis handles proxies, headers, retries, JS rendering, and anti-detection. You write one API call.
Start Free → 100 requests/month
When to Use Requests vs an API
- Use Requests when: Scraping simple, static sites with no anti-bot protection; learning web scraping; budget is $0; less than 1,000 pages/month
- Use an API when: Scraping at scale (10K+ pages/month); sites have anti-bot protection; you need JS rendering; you value your engineering time; production reliability matters
FAQ
Can Requests scrape JavaScript-rendered pages? No. Requests only fetches the raw HTML the server returns. For JavaScript-heavy pages, use Playwright or Selenium, or find the underlying JSON API (Section 8).
How do I avoid getting blocked? Set realistic headers, rotate user-agents and proxies, add delays between requests, and back off when you see 429 responses (Sections 4, 10, and 11).
Is Requests asynchronous? No. For async scraping, use httpx or aiohttp; for simple parallelism with Requests, use ThreadPoolExecutor (Section 12).
Next Steps
- Web Scraping with BeautifulSoup — Deep dive into HTML parsing
- Web Scraping with Scrapy — Full framework for large-scale crawling
- Web Scraping with Playwright — Handle JavaScript-rendered pages
- Web Scraping with Selenium — Browser automation for dynamic sites
- How to Scrape Without Getting Blocked — Anti-detection techniques
- Best Web Scraping APIs Comparison — Find the right tool for your needs