Web Scraping with Python Requests in 2026: The Complete Guide

Published March 16, 2026 · 20 min read · Updated for Requests 2.32+

The Python Requests library is where most web scraping journeys begin — and for good reason. It's one of the most downloaded Python packages on PyPI, with an API so intuitive it practically reads like English. If you need to fetch web pages and extract data, Requests is your foundation.

This guide covers everything from basic GET requests to production-ready scraping patterns — including sessions, authentication, headers, proxies, concurrency, and when to graduate to something more powerful.

Table of Contents

  1. Installation & Setup
  2. Your First Scraping Request
  3. HTTP Methods: GET, POST, PUT, DELETE
  4. Request Headers & User-Agent Rotation
  5. Session Objects & Cookie Persistence
  6. Authentication: Basic, Token, OAuth
  7. Parsing HTML with BeautifulSoup
  8. Working with JSON APIs
  9. Scraping Paginated Pages
  10. Proxy Rotation for Anti-Detection
  11. Rate Limiting & Retry Logic
  12. Concurrent Scraping with ThreadPoolExecutor
  13. Production-Ready Scraper Class
  14. Requests vs httpx vs aiohttp vs API
  15. The API Shortcut: Skip HTTP Complexity
  16. FAQ

1. Installation & Setup

Install Requests and BeautifulSoup (for HTML parsing):

pip install requests beautifulsoup4 lxml

Verify your setup:

import requests
from bs4 import BeautifulSoup

print(requests.__version__)  # 2.32.x
print("Ready to scrape!")
💡 Pro tip: Always use a virtual environment (python -m venv venv) to keep your scraping dependencies isolated.

2. Your First Scraping Request

Fetching a web page is one line:

import requests

response = requests.get("https://example.com")

print(response.status_code)   # 200
print(response.headers["content-type"])  # text/html; charset=UTF-8
print(response.text[:500])    # First 500 chars of HTML

The response object gives you everything:

response.status_code   # 200
response.headers       # Case-insensitive dict of response headers
response.text          # Decoded body as a str
response.content       # Raw body as bytes
response.json()        # Parsed JSON (raises ValueError on non-JSON)
response.url           # Final URL after any redirects
response.cookies       # Cookies the server set
response.encoding      # Encoding used to decode .text

Handling Errors Properly

Never assume a request succeeds. Always check:

import requests

response = requests.get("https://example.com")

# Option 1: Check status code
if response.status_code == 200:
    html = response.text
else:
    print(f"Failed: {response.status_code}")

# Option 2: Raise exception on HTTP errors (4xx, 5xx)
response.raise_for_status()  # Raises requests.exceptions.HTTPError

Setting Timeouts (Critical!)

Without a timeout, your scraper can hang forever on unresponsive servers:

# Always set a timeout (seconds)
response = requests.get("https://example.com", timeout=10)

# Separate connect and read timeouts
response = requests.get("https://example.com", timeout=(5, 30))
# 5s to connect, 30s to read
⚠️ Never scrape without timeouts. A single hung request can block your entire scraper. Use timeout=10 as a sensible default.
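In practice it's worth wrapping both concerns — timeouts and HTTP errors — in one helper. Here's a minimal sketch (the `fetch` name and its return-None-on-failure behavior are our own convention, not part of Requests):

```python
import requests

def fetch(url, timeout=10):
    """GET a URL; return the response, or None on any network failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # 4xx/5xx -> HTTPError
        return response
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
    except requests.exceptions.RequestException as exc:
        # Base class — also catches ConnectionError, HTTPError, TooManyRedirects
        print(f"Request failed: {url} — {exc}")
    return None
```

Because `Timeout` and friends all subclass `RequestException`, the second clause is a safe catch-all for anything network-related.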

3. HTTP Methods: GET, POST, PUT, DELETE

Most scraping uses GET, but sometimes you need POST (for forms, search queries, or API endpoints):

import requests

# GET — fetch a page
response = requests.get("https://api.example.com/products")

# POST — submit form data
response = requests.post("https://example.com/search", data={
    "query": "web scraping",
    "page": 1
})

# POST — submit JSON data (API endpoints)
response = requests.post("https://api.example.com/search", json={
    "query": "web scraping",
    "filters": {"category": "tools"}
})

# PUT — update a resource
response = requests.put("https://api.example.com/items/42", json={
    "name": "Updated Item"
})

# DELETE — remove a resource
response = requests.delete("https://api.example.com/items/42")
💡 Scraping tip: Many "AJAX-powered" sites load data via hidden API endpoints. Open your browser's Network tab (F12 → Network → XHR) and look for JSON responses. Hitting these APIs directly with Requests is faster than parsing HTML.

4. Request Headers & User-Agent Rotation

Servers use headers to identify your client. The default Requests user-agent (python-requests/2.32.x) screams "bot" — and many sites block it immediately.
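You can see exactly what a server sees by printing the default agent string yourself:

```python
import requests

# The User-Agent Requests sends when you don't override it
print(requests.utils.default_user_agent())  # e.g. python-requests/2.32.3
```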

Setting Custom Headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",  # "br" decodes only if the brotli package is installed
    "Referer": "https://www.google.com/",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}

response = requests.get("https://example.com", headers=headers, timeout=10)

Rotating User-Agents

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    }

response = requests.get("https://example.com", headers=get_random_headers(), timeout=10)

5. Session Objects & Cookie Persistence

A Session object persists cookies, headers, and connection pools across requests — essential for scraping sites that require login or track sessions:

import requests

session = requests.Session()

# Set default headers for ALL requests in this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

# First request — server sets cookies
response = session.get("https://example.com", timeout=10)

# Subsequent requests automatically include cookies
response = session.get("https://example.com/dashboard", timeout=10)

# Check stored cookies
print(session.cookies.get_dict())
# {'session_id': 'abc123', 'csrf_token': 'xyz789'}

Login with Session

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Step 1: GET the login page and grab the CSRF token
login_page = session.get("https://example.com/login", timeout=10)
soup = BeautifulSoup(login_page.text, "lxml")
# The hidden input's name varies by site — inspect the login form to find it
csrf_token = soup.select_one("input[name='csrf_token']")["value"]

# Step 2: POST credentials
login_response = session.post("https://example.com/login", data={
    "username": "your_user",
    "password": "your_pass",
    "csrf_token": csrf_token,  # From step 1
}, timeout=10)

# Step 3: Access protected pages (cookies are automatic)
dashboard = session.get("https://example.com/dashboard", timeout=10)
print(dashboard.status_code)  # 200 if login succeeded
💡 Performance benefit: Session objects reuse TCP connections via connection pooling. This makes sequential requests to the same host 2-5x faster than individual requests.get() calls.

6. Authentication: Basic, Token, OAuth

Basic Authentication

from requests.auth import HTTPBasicAuth

response = requests.get(
    "https://api.example.com/data",
    auth=HTTPBasicAuth("username", "password"),
    timeout=10
)

# Shorthand (tuple)
response = requests.get(
    "https://api.example.com/data",
    auth=("username", "password"),
    timeout=10
)

Bearer Token (API Keys)

headers = {
    "Authorization": "Bearer your_api_key_here",
    "Content-Type": "application/json"
}

response = requests.get("https://api.example.com/data", headers=headers, timeout=10)

Custom Auth (OAuth 2.0 Token Refresh)

from requests.auth import AuthBase

class TokenAuth(AuthBase):
    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers["Authorization"] = f"Bearer {self.token}"
        return r

response = requests.get(
    "https://api.example.com/data",
    auth=TokenAuth("your_token"),
    timeout=10
)

7. Parsing HTML with BeautifulSoup

Requests fetches the HTML. BeautifulSoup parses it. Together, they're the classic scraping duo:

import requests
from bs4 import BeautifulSoup

# Fetch the page
response = requests.get("https://news.ycombinator.com", timeout=10)
response.raise_for_status()

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")

# Extract all story titles
stories = soup.select(".titleline > a")
for story in stories[:10]:
    print(f"{story.text} → {story['href']}")

Extracting Structured Data

import requests
from bs4 import BeautifulSoup
import json

response = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(response.text, "lxml")

products = []
for card in soup.select(".product-card"):
    product = {
        "name": card.select_one(".product-name").text.strip(),
        "price": card.select_one(".price").text.strip(),
        "url": card.select_one("a")["href"],
        "rating": card.select_one(".rating").text.strip() if card.select_one(".rating") else None,
    }
    products.append(product)

# Save as JSON
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

print(f"Scraped {len(products)} products")
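Selectors are where scrapers usually break, so it pays to test your parsing against an inline HTML snippet before pointing it at a live page — no network required. The snippet below mirrors the hypothetical `.product-card` markup used above:

```python
from bs4 import BeautifulSoup

# Inline sample of the markup the scraper above expects
html = """
<div class="product-card">
  <span class="product-name">Widget</span>
  <span class="price">$9.99</span>
  <a href="/products/widget">Details</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # Stdlib parser, no lxml needed
card = soup.select_one(".product-card")

print(card.select_one(".product-name").text.strip())  # Widget
print(card.select_one(".price").text.strip())         # $9.99
print(card.select_one("a")["href"])                   # /products/widget
```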

For a deep dive into BeautifulSoup, see our Complete BeautifulSoup Guide.

8. Working with JSON APIs

Many modern websites load data via JSON APIs behind the scenes. Scraping these directly is faster and more reliable than parsing HTML:

import requests

# Hit the API directly
response = requests.get("https://api.example.com/products", params={
    "category": "electronics",
    "page": 1,
    "per_page": 50
}, timeout=10)

data = response.json()  # Parse JSON response

for product in data["results"]:
    print(f"{product['name']} — ${product['price']}")

Finding Hidden APIs

Here's how to discover the API endpoints a website uses:

  1. Open Chrome DevTools (F12)
  2. Go to Network tab → filter by XHR/Fetch
  3. Interact with the page (search, paginate, filter)
  4. Look at the requests — copy the URL, headers, and payload
  5. Reproduce the request with Python Requests
# Reproduce a discovered API call
response = requests.get(
    "https://example.com/api/v2/search",
    params={"q": "laptop", "sort": "price_asc", "page": 1},
    headers={
        "User-Agent": "Mozilla/5.0 ...",
        "X-Requested-With": "XMLHttpRequest",  # Often required
        "Referer": "https://example.com/search?q=laptop",
    },
    timeout=10
)

results = response.json()
💡 Always check for APIs first. Parsing JSON is 10x easier than parsing HTML. Many sites have undocumented APIs that return clean, structured data.

9. Scraping Paginated Pages

URL-Based Pagination

import requests
from bs4 import BeautifulSoup
import time

all_products = []

for page in range(1, 51):  # Pages 1-50
    response = requests.get(
        f"https://example.com/products?page={page}",
        headers=get_random_headers(),  # Defined in section 4
        timeout=10
    )

    if response.status_code != 200:
        print(f"Page {page} failed: {response.status_code}")
        break

    soup = BeautifulSoup(response.text, "lxml")
    products = soup.select(".product-card")

    if not products:  # No more products — we've reached the end
        break

    for p in products:
        all_products.append({
            "name": p.select_one(".name").text.strip(),
            "price": p.select_one(".price").text.strip(),
        })

    print(f"Page {page}: {len(products)} products")
    time.sleep(1.5)  # Be polite — don't hammer the server

print(f"Total: {len(all_products)} products")

API Pagination with Cursors

import requests

all_items = []
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor

    response = requests.get(
        "https://api.example.com/items",
        params=params,
        timeout=10
    )
    data = response.json()

    all_items.extend(data["items"])
    cursor = data.get("next_cursor")

    if not cursor:  # No more pages
        break

print(f"Total items: {len(all_items)}")

10. Proxy Rotation for Anti-Detection

Using the same IP for thousands of requests will get you blocked. Proxy rotation distributes requests across multiple IPs:

import requests
import random

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def get_with_proxy(url, **kwargs):
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
        **kwargs
    )

response = get_with_proxy("https://example.com")
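random.choice can hit the same proxy several times in a row. If you want an even spread across your pool, round-robin rotation with itertools.cycle is a simple alternative (a sketch — the proxy URLs and credentials are placeholders):

```python
from itertools import cycle

# Placeholder proxy list — same format as above
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

proxy_pool = cycle(PROXIES)  # Endless round-robin iterator

def next_proxy():
    """Return a proxies dict for the next proxy in rotation."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}
```

Each call to `next_proxy()` returns the next entry in order, wrapping around after the last one — every proxy gets equal traffic.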

Proxy with Session

session = requests.Session()
session.proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

# All requests through this session use the proxy
response = session.get("https://example.com", timeout=10)

SOCKS5 Proxy (Tor)

# pip install requests[socks]
import requests

session = requests.Session()
session.proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

response = session.get("https://check.torproject.org", timeout=30)
⚠️ Free proxies are unreliable and insecure. For production scraping, use paid residential proxies or skip proxy management entirely with a scraping API.

11. Rate Limiting & Retry Logic

Responsible scraping means not overloading servers. Here's a robust retry pattern:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time

def create_scraping_session(retries=3, backoff=0.5):
    session = requests.Session()

    retry_strategy = Retry(
        total=retries,
        backoff_factor=backoff,       # 0.5s, 1s, 2s between retries
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"],
    )

    adapter = HTTPAdapter(
        max_retries=retry_strategy,
        pool_connections=10,
        pool_maxsize=10,
    )

    session.mount("http://", adapter)
    session.mount("https://", adapter)

    return session

# Usage
session = create_scraping_session()
response = session.get("https://example.com", timeout=10)

Respecting Rate Limits

import time

def rate_limited_get(session, url, delay=1.5, **kwargs):
    """GET with rate limiting."""
    response = session.get(url, timeout=10, **kwargs)

    # Check for rate limit response
    if response.status_code == 429:
        # Retry-After is usually seconds, but some servers send an HTTP date
        retry_after = response.headers.get("Retry-After", "60")
        wait = int(retry_after) if retry_after.isdigit() else 60
        print(f"Rate limited. Waiting {wait}s...")
        time.sleep(wait)
        response = session.get(url, timeout=10, **kwargs)

    time.sleep(delay)  # Polite delay between requests
    return response
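For finer control than a fixed time.sleep, a small client-side limiter that enforces a minimum interval between calls works well with bursty code paths. A minimal sketch (our own helper, not part of Requests):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, interval):
        self.interval = interval  # Seconds between requests (0.5 -> 2 req/s)
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to honor the interval, then record the call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

# Usage: call limiter.wait() right before each session.get(...)
limiter = RateLimiter(1.5)
```

The first call goes through immediately; every later call pays only the remaining fraction of the interval, so you never wait longer than necessary.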

12. Concurrent Scraping with ThreadPoolExecutor

Sequential scraping is slow. Use threads to scrape multiple pages simultaneously:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup
import time

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"
})

def scrape_page(url):
    """Scrape a single page and return extracted data."""
    try:
        response = session.get(url, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        title = soup.select_one("title").text.strip()
        return {"url": url, "title": title, "status": "success"}
    except Exception as e:
        return {"url": url, "title": None, "status": f"error: {e}"}

# Generate URLs
urls = [f"https://example.com/page/{i}" for i in range(1, 101)]

# Scrape concurrently (10 threads)
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(scrape_page, url): url for url in urls}

    for future in as_completed(futures):
        result = future.result()
        results.append(result)
        if result["status"] == "success":
            print(f"✓ {result['url']} — {result['title']}")
        else:
            print(f"✗ {result['url']} — {result['status']}")

print(f"\nScraped {len(results)} pages")
💡 Thread count: Start with 5-10 threads. More threads = faster, but also more likely to trigger rate limits. For a single domain, 5 threads is usually the sweet spot.

13. Production-Ready Scraper Class

Here's a complete, reusable scraper class with all best practices built in:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time
import json
import csv
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

class WebScraper:
    """Production-ready web scraper with retry, rotation, and export."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]

    def __init__(self, delay=1.5, max_retries=3, timeout=15, max_workers=5):
        self.delay = delay
        self.timeout = timeout
        self.max_workers = max_workers
        self.session = self._create_session(max_retries)
        self.results = []

    def _create_session(self, max_retries):
        session = requests.Session()
        retry = Retry(
            total=max_retries,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        adapter = HTTPAdapter(max_retries=retry, pool_maxsize=20)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session

    def _get_headers(self):
        return {
            "User-Agent": random.choice(self.USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.google.com/",
        }

    def get(self, url):
        """Fetch a URL with rotation and delay."""
        response = self.session.get(
            url, headers=self._get_headers(), timeout=self.timeout
        )
        response.raise_for_status()
        time.sleep(self.delay + random.uniform(0, 0.5))
        return response

    def get_soup(self, url):
        """Fetch and parse a URL."""
        response = self.get(url)
        return BeautifulSoup(response.text, "lxml")

    def scrape_urls(self, urls, parse_fn):
        """Scrape multiple URLs concurrently."""
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = {executor.submit(self._safe_scrape, url, parse_fn): url for url in urls}
            for future in as_completed(futures):
                result = future.result()
                if result:
                    self.results.append(result)
        logger.info(f"Scraped {len(self.results)} items from {len(urls)} URLs")
        return self.results

    def _safe_scrape(self, url, parse_fn):
        try:
            soup = self.get_soup(url)
            return parse_fn(soup, url)
        except Exception as e:
            logger.error(f"Failed: {url} — {e}")
            return None

    def to_json(self, filepath):
        with open(filepath, "w") as f:
            json.dump(self.results, f, indent=2)
        logger.info(f"Saved {len(self.results)} items to {filepath}")

    def to_csv(self, filepath):
        if not self.results:
            return
        with open(filepath, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.results[0].keys())
            writer.writeheader()
            writer.writerows(self.results)
        logger.info(f"Saved {len(self.results)} items to {filepath}")


# --- Usage Example ---

def parse_product(soup, url):
    """Custom parsing function for a product page."""
    return {
        "url": url,
        "title": soup.select_one("h1").text.strip() if soup.select_one("h1") else "N/A",
        "price": soup.select_one(".price").text.strip() if soup.select_one(".price") else "N/A",
        "description": soup.select_one(".description").text.strip()[:200] if soup.select_one(".description") else "N/A",
    }

scraper = WebScraper(delay=1.0, max_workers=5)
urls = [f"https://example.com/product/{i}" for i in range(1, 101)]
results = scraper.scrape_urls(urls, parse_product)
scraper.to_json("products.json")
scraper.to_csv("products.csv")

14. Requests vs httpx vs aiohttp vs API

| Feature | Requests | httpx | aiohttp | Mantis API |
|---|---|---|---|---|
| Async support | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| HTTP/2 | ❌ No | ✅ Yes | ❌ No | ✅ Yes |
| Connection pooling | ✅ Session | ✅ Client | ✅ Connector | ✅ Managed |
| JS rendering | ❌ No | ❌ No | ❌ No | ✅ Yes |
| Anti-bot bypass | ❌ Manual | ❌ Manual | ❌ Manual | ✅ Automatic |
| Proxy management | ❌ Manual | ❌ Manual | ❌ Manual | ✅ Built-in |
| Learning curve | ⭐ Easy | ⭐ Easy | ⭐⭐ Medium | ⭐ Easy |
| Best for | Simple projects | Async + HTTP/2 | High concurrency | Production at scale |
| Monthly cost | $0 + proxies ($100-500) | $0 + proxies ($100-500) | $0 + proxies ($100-500) | $29-299 all-inclusive |
💡 When to switch from Requests: move to httpx when you need async or HTTP/2, to aiohttp when you need very high concurrency, and to a scraping API when JS rendering, anti-bot bypass, or managed proxies matter more than raw cost.

15. The API Shortcut: Skip HTTP Complexity

Building a production scraper with Requests means managing headers, proxies, retries, rate limits, CAPTCHAs, and anti-bot detection — all yourself. Or you can make one API call:

import requests

# DIY with Requests: 50+ lines of proxy rotation, header management, retry logic
# ...or...

# Mantis API: one call, done
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "extract": {
            "products": {
                "selector": ".product-card",
                "fields": {
                    "name": ".product-name",
                    "price": ".price",
                    "rating": ".rating"
                }
            }
        }
    }
)

data = response.json()
for product in data["products"]:
    print(f"{product['name']} — {product['price']}")

Stop Managing HTTP Infrastructure

Mantis handles proxies, headers, retries, JS rendering, and anti-detection. You write one API call.

Start Free → 100 requests/month

When to Use Requests vs an API

Stick with Requests when you're scraping a handful of static pages, hitting a documented JSON API, or learning the fundamentals. Reach for a scraping API when the target renders content with JavaScript, aggressively blocks datacenter IPs, or when you'd spend more on proxies and maintenance than on an API plan.

FAQ

Is the Requests library good for web scraping?
Yes — for static HTML pages and JSON APIs it's the simplest tool available. It can't execute JavaScript, so pair it with a headless browser or a scraping API for dynamic sites.

How do I avoid getting blocked while using Requests?
Set a realistic User-Agent (and rotate it), reuse a Session, add delays between requests, honor Retry-After headers, and rotate proxies for large jobs — all covered in sections 4, 5, 10, and 11.

Is Requests faster than httpx or aiohttp?
For sequential requests they're comparable. httpx and aiohttp pull ahead when you need async concurrency or HTTP/2; with Requests, use ThreadPoolExecutor (section 12) for parallelism.

Next Steps