How to Scrape a Website: The Complete Beginner's Guide for 2026
Web scraping — extracting data from websites — is one of the most useful skills you can learn as a developer. Whether you need to track prices, gather research data, monitor competitors, or feed data into an AI agent, knowing how to scrape a website opens up a world of possibilities.
But the web has changed dramatically. Modern sites use JavaScript rendering, anti-bot detection, CAPTCHAs, and dynamic content loading. The scraping techniques from 2020 often don't work anymore.
This guide covers every approach to scraping a website in 2026 — from the simplest to the most powerful — with working code examples you can copy and run today.
What Is Web Scraping?
Web scraping is the automated process of extracting data from websites. Instead of manually copying information from a browser, you write code (or use a tool) that:
1. Fetches the web page (like a browser would)
2. Parses the HTML to find the data you need
3. Extracts that data into a structured format (JSON, CSV, database)
Think of it as teaching a computer to read websites the way you do — but thousands of times faster.
Before You Scrape: Check These First
Before writing any code, always check:
- Is there an API? Many sites offer official APIs that are faster and more reliable than scraping. Check the site's developer docs first.
- robots.txt — Visit example.com/robots.txt to see the site's scraping policy. Respect Disallow rules.
- Terms of Service — Some sites explicitly prohibit scraping. Read the ToS.
- Rate limiting — Even if scraping is allowed, don't hammer the server. Add delays between requests.
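The robots.txt check is easy to automate with Python's built-in urllib.robotparser. A minimal sketch (the rules below are made up for illustration; in practice you would download the site's real robots.txt first):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check whether a user agent may fetch a path under these robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Hypothetical rules; normally fetched from https://example.com/robots.txt
rules = """User-agent: *
Disallow: /admin/
Allow: /blog/
"""

print(is_allowed(rules, "MyScraper", "/blog/post-1"))   # True
print(is_allowed(rules, "MyScraper", "/admin/users"))   # False
```

Run this check once per site before crawling, and skip any path where it returns False.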
Method 1: Basic HTTP Requests + HTML Parsing (Python)
The simplest approach. Good for static sites that don't use JavaScript rendering.
What you need
pip install requests beautifulsoup4
Example: Scrape article titles from a blog
import requests
from bs4 import BeautifulSoup
# Fetch the page
url = "https://example.com/blog"
response = requests.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})
# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# Extract article titles
articles = soup.find_all("h2", class_="post-title")
for article in articles:
    title = article.get_text(strip=True)
    link = article.find("a")["href"]
    print(f"{title} → {link}")
When this works
- Static HTML pages (content is in the initial HTML)
- Simple blogs, news sites, documentation pages
- Sites without JavaScript-rendered content
When this fails
- Single-page apps (React, Vue, Angular)
- Sites that load data via AJAX/fetch after page load
- Sites with anti-bot protection (Cloudflare, DataDome)
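A quick sanity check before reaching for a headless browser: fetch the raw HTML and see whether your target element is already there. If it isn't, the content is probably rendered by JavaScript. A stdlib-only sketch (the CSS class and sample HTML are invented for illustration):

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Records whether any tag in the document carries the target CSS class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.found = True

def has_class(html, css_class):
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found

# Static page: the content is in the initial HTML
static_html = '<html><body><h2 class="post-title">Hello</h2></body></html>'
# SPA shell: an empty root div that JavaScript fills in later
spa_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(has_class(static_html, "post-title"))  # True: requests + BeautifulSoup is enough
print(has_class(spa_html, "post-title"))     # False: you likely need a headless browser
```

In real use, feed the function the `response.text` from a plain requests.get and the selector class you plan to scrape.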
Method 2: Headless Browser Scraping (Playwright)
When a site renders content with JavaScript, you need a real browser. Playwright automates Chromium, Firefox, or WebKit (the engine behind Safari), and can run them headlessly.
What you need
pip install playwright
playwright install chromium
Example: Scrape a JavaScript-rendered page
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for content to load
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    # Extract product data
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()
Handling infinite scroll
# Scroll to load all items
previous_height = 0
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break
    previous_height = current_height
When this works
- JavaScript-rendered sites (SPAs)
- Sites requiring user interaction (clicking, scrolling)
- Pages with dynamic content loading
When this fails
- Sites with advanced anti-bot detection (they detect headless browsers)
- At scale (browsers use lots of memory and CPU)
- When you need to scrape thousands of pages quickly
Method 3: API-Based Scraping (WebPerception API)
The modern approach: let a cloud API handle the browser, anti-bot detection, and even data extraction for you. No infrastructure to manage.
Example: Scrape any page with one API call
import requests
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "wait_for": ".product-card",
        "render_js": True
    }
)
html = response.json()["html"]
Example: AI-powered data extraction
Instead of writing CSS selectors, describe what you want in plain English:
response = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with their name, price, rating, and availability",
        "schema": {
            "products": [{
                "name": "string",
                "price": "number",
                "rating": "number",
                "in_stock": "boolean"
            }]
        }
    }
)

products = response.json()["data"]["products"]
for p in products:
    print(f"{p['name']}: ${p['price']} ({'In stock' if p['in_stock'] else 'Out of stock'})")
When to use API-based scraping
- You need to scrape at scale (hundreds or thousands of pages)
- Sites have anti-bot protection
- You want structured data without writing CSS selectors
- You're building an AI agent that needs web access
- You don't want to manage browser infrastructure
Method 4: Scraping with CSS Selectors vs. XPath
Both CSS selectors and XPath let you target specific elements. Here's how they compare:
CSS Selectors
# BeautifulSoup
soup.select("div.product > h2.title")
soup.select("table tr:nth-child(2) td")
soup.select("[data-price]")
XPath
from lxml import html
tree = html.fromstring(page_content)
titles = tree.xpath("//div[@class='product']/h2[@class='title']/text()")
prices = tree.xpath("//span[contains(@class, 'price')]/text()")
Rule of thumb: CSS selectors are simpler and more readable. XPath is more powerful for complex queries (like selecting by text content or navigating up the DOM tree).
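To make "selecting by text content" concrete, here is a sketch using Python's built-in xml.etree.ElementTree, which supports a useful subset of XPath (lxml supports full XPath 1.0). The sample document is invented for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
  <product><name>Widget</name><price>9.99</price></product>
  <product><name>Gadget</name><price>24.50</price></product>
</catalog>
""")

# Select a product by the text of a child element: awkward in CSS, natural in XPath
widget = doc.find(".//product[name='Widget']")
print(widget.find("price").text)  # 9.99

# Positional predicates work too, much like CSS :nth-child
second = doc.find(".//product[2]/name")
print(second.text)  # Gadget
```

Note that ElementTree parses XML, not sloppy real-world HTML; for HTML pages, run the same XPath expressions through lxml.html instead.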
Handling Common Challenges
Pagination
import time

import requests
from bs4 import BeautifulSoup

page_num = 1
all_data = []
while True:
    url = f"https://example.com/products?page={page_num}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select(".product-card")
    if not items:
        break
    for item in items:
        all_data.append({
            "name": item.select_one(".name").text.strip(),
            "price": item.select_one(".price").text.strip()
        })
    page_num += 1
    time.sleep(1)  # Be polite — don't hammer the server
Rate Limiting and Retries
import time
from random import uniform

import requests

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
            if response.status_code == 429:  # Too many requests
                wait = 2 ** attempt + uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s...")
                time.sleep(wait)
        except requests.RequestException:
            time.sleep(2 ** attempt)
    return None
Handling Login-Required Pages
import requests

session = requests.Session()

# Log in
session.post("https://example.com/login", data={
    "username": "your_user",
    "password": "your_pass"
})

# Now scrape authenticated pages
response = session.get("https://example.com/dashboard")
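Many login forms also embed a hidden CSRF token that must be sent along with the credentials, or the POST is rejected. A stdlib sketch of pulling hidden fields out of the form HTML (the form markup and field names are invented; in a real flow you would GET the login page first, extract these fields, and merge them into the data dict above):

```python
from html.parser import HTMLParser

class HiddenInputFinder(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> fields."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

# Hypothetical login form, as fetched with session.get("https://example.com/login")
login_html = (
    '<form method="post">'
    '<input type="hidden" name="csrf_token" value="abc123">'
    '<input type="text" name="username">'
    '</form>'
)

finder = HiddenInputFinder()
finder.feed(login_html)
print(finder.fields)  # {'csrf_token': 'abc123'}
```

You would then POST `{**finder.fields, "username": ..., "password": ...}` so the token travels with the credentials.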
Saving Data
import json
import csv

# Save as JSON
with open("products.json", "w") as f:
    json.dump(all_data, f, indent=2)

# Save as CSV
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(all_data)
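For larger scrapes, a database beats flat files: you can resume interrupted runs, deduplicate, and query the results. A sketch using Python's built-in sqlite3 (the table name and sample rows are made up; swap ":memory:" for a file path like "products.db" to persist):

```python
import sqlite3

all_data = [
    {"name": "Widget", "price": "$9.99"},
    {"name": "Gadget", "price": "$24.50"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")

# Named-parameter style lets executemany consume the dicts directly
conn.executemany(
    "INSERT INTO products (name, price) VALUES (:name, :price)",
    all_data,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2
```

SQLite needs no server and ships with Python, so it is usually the lowest-friction step up from CSV.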
Choosing the Right Approach
| Scenario | Best Method |
|----------|-------------|
| Simple static pages | Requests + BeautifulSoup |
| JavaScript-rendered sites | Playwright or API |
| Anti-bot protected sites | API (WebPerception) |
| Scraping at scale (1000+ pages) | API (WebPerception) |
| Need structured data from messy HTML | API with AI extraction |
| One-off quick scrape | Requests + BeautifulSoup |
| Building an AI agent | API (WebPerception) |
The Modern Stack: API-First Scraping
In 2026, the smartest approach for most use cases is API-based scraping:
- No infrastructure — No browser to install, no proxies to manage, no CAPTCHAs to solve
- AI extraction — Describe what you want in English instead of writing fragile CSS selectors
- Scale instantly — From 1 to 100,000 pages without changing your code
- Anti-bot handled — The API handles Cloudflare, DataDome, and other protection
- Cost-effective — Pay per request instead of running servers 24/7
WebPerception API gives you all of this with a simple REST API. Start with 100 free requests per month — no credit card required.
Quick Start: Your First Scrape in 60 Seconds
import requests

# One API call = full page data
response = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://news.ycombinator.com",
        "prompt": "Extract the top 10 stories with title, URL, points, and comment count"
    }
)

stories = response.json()["data"]
for s in stories:
    print(f"[{s.get('points', 0)} pts] {s['title']}")
That's it. No HTML parsing, no CSS selectors, no browser setup. Just data.
What's Next?
Now that you know how to scrape a website, here are some next steps:
- Build your first AI agent → — Use web scraping as the perception layer for an autonomous agent
- WebPerception API Quickstart → — Get set up with the API in 5 minutes
- Best Web Scraping Tools in 2026 → — Compare all the tools side by side
---
Ready to start scraping? Get your free API key → — 100 requests/month, no credit card required.