How to Scrape a Website: The Complete Beginner's Guide for 2026

March 5, 2026 · Web Scraping

Web scraping — extracting data from websites — is one of the most useful skills you can learn as a developer. Whether you need to track prices, gather research data, monitor competitors, or feed data into an AI agent, knowing how to scrape a website opens up a world of possibilities.

But the web has changed dramatically. Modern sites use JavaScript rendering, anti-bot detection, CAPTCHAs, and dynamic content loading. The scraping techniques from 2020 often don't work anymore.

This guide covers every approach to scraping a website in 2026 — from the simplest to the most powerful — with working code examples you can copy and run today.

What Is Web Scraping?

Web scraping is the automated process of extracting data from websites. Instead of manually copying information from a browser, you write code (or use a tool) that:

Fetches the web page (like a browser would)

Parses the HTML to find the data you need

Extracts that data into a structured format (JSON, CSV, database)

Think of it as teaching a computer to read websites the way you do — but thousands of times faster.

Before You Scrape: Check These First

Before writing any code, always check:

Whether the site offers an official API (usually faster and more reliable than scraping)

The site's robots.txt file to see which paths it asks crawlers to avoid

The site's terms of service for restrictions on automated access

Whether the data includes personal information covered by privacy laws such as GDPR
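The robots.txt check is easy to automate with Python's standard-library robotparser. This sketch parses a sample robots.txt inline so it runs anywhere; against a real site you would call set_url() and read() instead:

```python
from urllib import robotparser

# Sample robots.txt content. For a live site, use:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, url) tells you whether the rules allow a path
print(rp.can_fetch("MyScraper/1.0", "https://example.com/blog"))       # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Note that robots.txt is advisory, not enforcement; respecting it is part of being a polite scraper.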

Method 1: Basic HTTP Requests + HTML Parsing (Python)

The simplest approach. Good for static sites that don't use JavaScript rendering.

What you need

pip install requests beautifulsoup4

Example: Scrape article titles from a blog

import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "https://example.com/blog"
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Extract article titles
articles = soup.find_all("h2", class_="post-title")
for article in articles:
    title = article.get_text(strip=True)
    link = article.find("a")["href"]
    print(f"{title} → {link}")

When this works

Static sites where the data you need is already in the initial HTML response: blogs, documentation pages, simple product listings.

When this fails

Sites that render content with JavaScript (the HTML you fetch comes back nearly empty) and sites with anti-bot protection that blocks plain HTTP clients.

Method 2: Headless Browser Scraping (Playwright)

When a site renders content with JavaScript, you need a real browser. Playwright automates Chromium, Firefox, or WebKit (the engine behind Safari) headlessly.

What you need

pip install playwright
playwright install chromium

Example: Scrape a JavaScript-rendered page

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    
    # Navigate and wait for content to load
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")
    
    # Extract product data
    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".name").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")
    
    browser.close()

Handling infinite scroll

# Scroll to load all items
previous_height = 0
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break
    previous_height = current_height

When this works

JavaScript-heavy sites, single-page applications, and dynamic content such as infinite scroll or lazy-loaded elements.

When this fails

Sites with strong anti-bot protection (Cloudflare, DataDome) that fingerprint headless browsers, and large jobs where running a browser per page becomes slow and expensive.

Method 3: API-Based Scraping (WebPerception API)

The modern approach: let a cloud API handle the browser, anti-bot detection, and even data extraction for you. No infrastructure to manage.

Example: Scrape any page with one API call

import requests

response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "wait_for": ".product-card",
        "render_js": True
    }
)

html = response.json()["html"]

Example: AI-powered data extraction

Instead of writing CSS selectors, describe what you want in plain English:

response = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "prompt": "Extract all products with their name, price, rating, and availability",
        "schema": {
            "products": [{
                "name": "string",
                "price": "number",
                "rating": "number",
                "in_stock": "boolean"
            }]
        }
    }
)

products = response.json()["data"]["products"]
for p in products:
    print(f"{p['name']}: ${p['price']} ({'In stock' if p['in_stock'] else 'Out of stock'})")

When to use API-based scraping

When target sites block your requests or headless browsers, when you need to scrape at scale without managing infrastructure, or when you'd rather describe the data you want than maintain brittle selectors.

Method 4: Scraping with CSS Selectors vs. XPath

Both CSS selectors and XPath let you target specific elements. Here's how they compare:

CSS Selectors

# BeautifulSoup
soup.select("div.product > h2.title")
soup.select("table tr:nth-child(2) td")
soup.select("[data-price]")

XPath

from lxml import html

tree = html.fromstring(page_content)
titles = tree.xpath("//div[@class='product']/h2[@class='title']/text()")
prices = tree.xpath("//span[contains(@class, 'price')]/text()")

Rule of thumb: CSS selectors are simpler and more readable. XPath is more powerful for complex queries (like selecting by text content or navigating up the DOM tree).
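To illustrate that rule of thumb, here is a small sketch (the HTML snippet is invented for the example) of two XPath queries that are awkward or impossible in pure CSS: selecting by exact text content, and navigating up the tree from a matched element:

```python
from lxml import html

doc = html.fromstring("""
<ul>
  <li><span class="label">Price</span><span class="value">$10</span></li>
  <li><span class="label">Rating</span><span class="value">4.5</span></li>
</ul>
""")

# Select by exact text content, then hop to the sibling element
price = doc.xpath("//span[text()='Price']/following-sibling::span/text()")[0]
print(price)  # $10

# Navigate UP the tree: find the <li> ancestor of the 'Rating' label,
# then query back down inside that row
row = doc.xpath("//span[text()='Rating']/ancestor::li")[0]
print(row.xpath(".//span[@class='value']/text()")[0])  # 4.5
```

CSS has no equivalent of `text()=` matching or the `ancestor::` axis, which is exactly when reaching for XPath pays off.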

Handling Common Challenges

Pagination

import time

import requests
from bs4 import BeautifulSoup

page_num = 1
all_data = []

while True:
    url = f"https://example.com/products?page={page_num}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    items = soup.select(".product-card")
    if not items:
        break
    
    for item in items:
        all_data.append({
            "name": item.select_one(".name").text.strip(),
            "price": item.select_one(".price").text.strip()
        })
    
    page_num += 1
    time.sleep(1)  # Be polite — don't hammer the server

Rate Limiting and Retries

import time
from random import uniform

import requests

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
            if response.status_code == 429:  # Too many requests
                wait = 2 ** attempt + uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s...")
                time.sleep(wait)
        except requests.RequestException:
            time.sleep(2 ** attempt)
    return None

Handling Login-Required Pages

session = requests.Session()

# Log in
session.post("https://example.com/login", data={
    "username": "your_user",
    "password": "your_pass"
})

# Now scrape authenticated pages
response = session.get("https://example.com/dashboard")
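Many real login forms also require a hidden CSRF token that must be submitted alongside the credentials. A hedged sketch of extracting one with BeautifulSoup (the form markup and the `csrf_token` field name here are assumptions; inspect the actual login page to find yours):

```python
from bs4 import BeautifulSoup

# In practice: login_html = session.get("https://example.com/login").text
login_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input name="username"><input name="password">
</form>
"""

# Pull the hidden token out of the form
soup = BeautifulSoup(login_html, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]
print(token)  # abc123

# Then include it in the login POST:
# session.post("https://example.com/login", data={
#     "username": "your_user", "password": "your_pass", "csrf_token": token
# })
```

If the login still fails, check the browser's network tab for extra fields or cookies the site expects.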

Saving Data

import json
import csv

# Save as JSON
with open("products.json", "w") as f:
    json.dump(all_data, f, indent=2)

# Save as CSV
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(all_data)

Choosing the Right Approach

| Scenario | Best Method |
|----------|-------------|
| Simple static pages | Requests + BeautifulSoup |
| JavaScript-rendered sites | Playwright or API |
| Anti-bot protected sites | API (WebPerception) |
| Scraping at scale (1000+ pages) | API (WebPerception) |
| Need structured data from messy HTML | API with AI extraction |
| One-off quick scrape | Requests + BeautifulSoup |
| Building an AI agent | API (WebPerception) |

The Modern Stack: API-First Scraping

In 2026, the smartest approach for most use cases is API-based scraping:

No infrastructure — No browser to install, no proxies to manage, no CAPTCHAs to solve

AI extraction — Describe what you want in English instead of writing fragile CSS selectors

Scale instantly — From 1 to 100,000 pages without changing your code

Anti-bot handled — The API handles Cloudflare, DataDome, and other protection

Cost-effective — Pay per request instead of running servers 24/7

WebPerception API gives you all of this with a simple REST API. Start with 100 free requests per month — no credit card required.

Quick Start: Your First Scrape in 60 Seconds

import requests

# One API call = full page data
response = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://news.ycombinator.com",
        "prompt": "Extract the top 10 stories with title, URL, points, and comment count"
    }
)

stories = response.json()["data"]
for s in stories:
    print(f"[{s.get('points', 0)} pts] {s['title']}")

That's it. No HTML parsing, no CSS selectors, no browser setup. Just data.

What's Next?

Now that you know how to scrape a website, pick the method that matches your target: requests and BeautifulSoup for static pages, Playwright for JavaScript-heavy sites, and an API when you need scale or anti-bot handling.

---

Ready to start scraping? Get your free API key → (100 requests/month, no credit card required).
