Web Scraping with BeautifulSoup and Python in 2026: The Complete Guide

Published March 16, 2026 · 18 min read · Updated for BeautifulSoup 4.13+

BeautifulSoup is the most popular HTML parsing library in Python — and for good reason. It's simple, forgiving with broken HTML, and perfect for extracting data from web pages. This guide covers everything from installation to production-ready scrapers.

Whether you're scraping product prices, extracting article content, or building a dataset for machine learning, BeautifulSoup combined with Python's requests library is the classic starting point. We'll also show you when (and why) you might want to graduate to an API.

Table of Contents

  1. Installation & Setup
  2. Your First BeautifulSoup Scraper
  3. Choosing a Parser: lxml vs html.parser vs html5lib
  4. Finding Elements: find(), find_all(), and CSS Selectors
  5. Navigating the DOM Tree
  6. Extracting Text, Attributes, and Links
  7. Scraping Tables into DataFrames
  8. Handling Pagination
  9. Handling Forms and Login
  10. Production-Ready Scraper with Error Handling
  11. Dealing with JavaScript-Rendered Pages
  12. BeautifulSoup vs Scrapy vs Playwright vs API
  13. The API Shortcut: Skip the Parsing Entirely
  14. FAQ

1. Installation & Setup

Install BeautifulSoup 4, the requests HTTP library, and the lxml parser (recommended for speed):

pip install beautifulsoup4 requests lxml

Verify your installation:

import bs4
print(bs4.__version__)  # 4.13.x
💡 Tip: Always install lxml alongside BeautifulSoup. It's 10-100x faster than the built-in html.parser and handles malformed HTML better.

2. Your First BeautifulSoup Scraper

Let's scrape a web page in 10 lines of Python:

import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "https://example.com"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")

# Extract the title
print(soup.title.string)
# → "Example Domain"

# Extract all links
for link in soup.find_all("a"):
    print(link.get("href"))

That's the core pattern: fetch → parse → extract. Everything else is refinement.

3. Choosing a Parser

| Parser      | Speed      | Install               | Broken HTML | Best For                           |
|-------------|------------|-----------------------|-------------|------------------------------------|
| lxml        | ⚡ Fastest | pip install lxml      | Good        | Production scraping (recommended)  |
| html.parser | Medium     | Built-in              | Decent      | Quick scripts, no dependencies     |
| html5lib    | 🐢 Slowest | pip install html5lib  | Perfect     | Extremely broken HTML              |
| lxml-xml    | ⚡ Fast    | pip install lxml      | N/A         | XML/RSS/Atom feeds                 |

# lxml (recommended)
soup = BeautifulSoup(html, "lxml")

# Built-in parser (no extra install)
soup = BeautifulSoup(html, "html.parser")

# html5lib (browser-like parsing)
soup = BeautifulSoup(html, "html5lib")

# XML mode
soup = BeautifulSoup(xml_data, "lxml-xml")
💡 Rule of thumb: Use lxml unless you have a specific reason not to. It's faster and more forgiving than html.parser.

4. Finding Elements

BeautifulSoup gives you multiple ways to find elements. Here are the most useful:

find() — First Match

# Find first <h1> tag
h1 = soup.find("h1")
print(h1.text)

# Find by class
product = soup.find("div", class_="product-card")

# Find by id
sidebar = soup.find("div", id="sidebar")

# Find by attribute
price = soup.find("span", attrs={"data-price": True})

find_all() — All Matches

# All paragraphs
paragraphs = soup.find_all("p")

# All links with a specific class
nav_links = soup.find_all("a", class_="nav-link")

# Multiple tags
headers = soup.find_all(["h1", "h2", "h3"])

# Limit results
first_5 = soup.find_all("li", limit=5)

# Using a function
def has_data_id(tag):
    return tag.has_attr("data-id")

elements = soup.find_all(has_data_id)

CSS Selectors — select() and select_one()

# CSS selector (returns list)
products = soup.select("div.product-card")

# Single element
title = soup.select_one("h1.page-title")

# Nested selectors
prices = soup.select("div.product-card > span.price")

# Attribute selectors
links = soup.select('a[href^="https://"]')

# nth-child
third_item = soup.select_one("ul.menu li:nth-child(3)")

# Multiple classes
featured = soup.select("div.product.featured")
💡 CSS selectors are often cleaner for complex queries. Use select() when you'd naturally write CSS, use find_all() when you need programmatic filtering.
5. Navigating the DOM Tree

# Parent
card = soup.find("span", class_="price")
container = card.parent

# Children (direct)
for child in container.children:
    print(child.name)

# Descendants (all nested)
for desc in container.descendants:
    if desc.name:
        print(desc.name)

# Siblings
next_sibling = card.find_next_sibling("span")
prev_sibling = card.find_previous_sibling("div")

# All next siblings
for sibling in card.find_next_siblings():
    print(sibling)

6. Extracting Text, Attributes, and Links

# Text content
element = soup.find("div", class_="description")
text = element.get_text()           # All text, including nested
text = element.get_text(strip=True) # Strip whitespace
text = element.get_text(" | ")      # Custom separator

# Single string (only works if tag has one child string)
name = soup.find("h1").string

# Attributes
link = soup.find("a")
href = link.get("href")        # Safe (returns None if missing)
href = link["href"]            # Raises KeyError if missing
classes = link.get("class")    # Returns list for class attribute
all_attrs = link.attrs         # Dict of all attributes

# Extract ALL links from a page
links = []
for a in soup.find_all("a", href=True):
    links.append({
        "text": a.get_text(strip=True),
        "url": a["href"]
    })

# Extract ALL images
images = []
for img in soup.find_all("img"):
    images.append({
        "src": img.get("src"),
        "alt": img.get("alt", ""),
        "width": img.get("width")
    })

7. Scraping Tables into DataFrames

One of the most common scraping tasks — pulling HTML tables into structured data:

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://example.com/data-table"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")

# Method 1: Manual extraction (more control)
table = soup.find("table", class_="data-table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]:  # Skip header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)

# Method 2: pandas shortcut (quick & dirty)
from io import StringIO  # pandas 2.1+ expects a file-like object, not a raw string
dfs = pd.read_html(StringIO(response.text))
df = dfs[0]  # First table on the page

print(df.head())
💡 Pro tip: pd.read_html() tries lxml first and falls back to BeautifulSoup + html5lib under the hood. Use it for quick table extraction. Use manual parsing when you need to handle complex structures (merged cells, nested tables, data attributes).

8. Handling Pagination

import requests
from bs4 import BeautifulSoup
import time

base_url = "https://example.com/products"
all_products = []

page = 1
while True:
    print(f"Scraping page {page}...")
    response = requests.get(
        f"{base_url}?page={page}",
        headers={"User-Agent": "Mozilla/5.0"}
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    # Extract products from this page
    products = soup.select("div.product-card")
    if not products:
        break  # No more products = last page

    for product in products:
        all_products.append({
            "name": product.select_one("h3").get_text(strip=True),
            "price": product.select_one(".price").get_text(strip=True),
            "url": product.select_one("a")["href"]
        })

    # Check for next page link
    next_link = soup.select_one("a.next-page")
    if not next_link:
        break

    page += 1
    time.sleep(2)  # Be polite — don't hammer the server

print(f"Scraped {len(all_products)} products from {page} pages")

Following "Next" Links

from urllib.parse import urljoin

url = "https://example.com/listings"

while url:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")

    # ... extract data ...

    # Follow next page link
    next_btn = soup.select_one('a[rel="next"]')
    url = urljoin(url, next_btn["href"]) if next_btn else None
    time.sleep(2)

9. Handling Forms and Login

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Step 1: Get the login page (grab CSRF token)
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "lxml")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

# Step 2: Submit login form
login_data = {
    "username": "your_username",
    "password": "your_password",
    "csrf_token": csrf_token
}
response = session.post("https://example.com/login", data=login_data)

# Step 3: Scrape authenticated pages
protected_page = session.get("https://example.com/dashboard")
soup = BeautifulSoup(protected_page.text, "lxml")
# Now you can scrape pages behind the login
⚠️ Important: Use requests.Session() to persist cookies across requests. Without it, your login state won't carry over.

10. Production-Ready Scraper with Error Handling

import requests
from bs4 import BeautifulSoup
import time
import json
import logging
from urllib.parse import urljoin

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductScraper:
    """Production-ready BeautifulSoup scraper with retries and rate limiting."""

    def __init__(self, base_url, delay=2, max_retries=3):
        self.base_url = base_url
        self.delay = delay
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def fetch(self, url):
        """Fetch URL with retries and exponential backoff."""
        for attempt in range(self.max_retries):
            try:
                response = self.session.get(url, timeout=30)
                response.raise_for_status()
                return response.text
            except requests.RequestException as e:
                wait = 2 ** attempt
                logger.warning(f"Attempt {attempt+1} failed for {url}: {e}. "
                               f"Retrying in {wait}s...")
                time.sleep(wait)
        logger.error(f"All {self.max_retries} attempts failed for {url}")
        return None

    def parse_product(self, card):
        """Extract product data from a card element."""
        try:
            return {
                "name": card.select_one("h3, .product-name")
                            .get_text(strip=True),
                "price": card.select_one(".price")
                             .get_text(strip=True),
                "url": urljoin(self.base_url,
                               card.select_one("a")["href"]),
                "rating": card.select_one(".rating")
                              .get_text(strip=True)
                          if card.select_one(".rating") else None,
            }
        except (AttributeError, KeyError) as e:
            logger.warning(f"Failed to parse product card: {e}")
            return None

    def scrape_all(self, max_pages=100):
        """Scrape all products with pagination."""
        all_products = []
        url = self.base_url

        for page in range(1, max_pages + 1):
            logger.info(f"Scraping page {page}: {url}")
            html = self.fetch(url)
            if not html:
                break

            soup = BeautifulSoup(html, "lxml")
            cards = soup.select("div.product-card, .product-item")

            if not cards:
                logger.info("No more products found. Done.")
                break

            for card in cards:
                product = self.parse_product(card)
                if product:
                    all_products.append(product)

            # Find next page
            next_link = soup.select_one(
                'a.next, a[rel="next"], .pagination a:last-child'
            )
            if not next_link or "disabled" in next_link.get("class", []):
                break

            url = urljoin(url, next_link["href"])
            time.sleep(self.delay)

        return all_products

    def save(self, products, filename="products.json"):
        """Save results to JSON."""
        with open(filename, "w") as f:
            json.dump(products, f, indent=2, ensure_ascii=False)
        logger.info(f"Saved {len(products)} products to {filename}")


# Usage
if __name__ == "__main__":
    scraper = ProductScraper("https://example.com/products")
    products = scraper.scrape_all()
    scraper.save(products)

11. Dealing with JavaScript-Rendered Pages

BeautifulSoup's biggest limitation: it can't execute JavaScript. If the data you need is loaded dynamically (React, Vue, Angular, AJAX), the HTML source won't contain it.

How to detect JS-rendered content

# If you see empty containers or loading spinners in the HTML:
soup = BeautifulSoup(response.text, "lxml")
products = soup.select(".product-card")
print(len(products))  # 0 — content is loaded by JavaScript!

Solution 1: Check for API endpoints

# Many SPAs load data from APIs. Check the Network tab in DevTools.
# Often you can call the API directly — much faster than scraping HTML.
api_url = "https://example.com/api/products?page=1&limit=50"
data = requests.get(api_url).json()

Solution 2: Use Playwright + BeautifulSoup

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-page")
    page.wait_for_selector(".product-card")

    # Get rendered HTML, parse with BeautifulSoup
    html = page.content()
    soup = BeautifulSoup(html, "lxml")

    products = soup.select(".product-card")
    print(f"Found {len(products)} products")  # Now it works!

    browser.close()

Solution 3: Use a Web Scraping API

# Skip all the complexity — let the API handle JS rendering
import requests

response = requests.get(
    "https://api.mantisapi.com/scrape",
    params={
        "url": "https://example.com/spa-page",
        "render_js": True,
        "extract": "product_name,price,rating"
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json()
# Structured data returned — no parsing needed

Skip the Browser Automation

Mantis API handles JavaScript rendering, anti-bot bypass, and structured data extraction in one API call. No Playwright. No proxies. No CAPTCHA solving.

Start Free → 100 Requests/Month

12. BeautifulSoup vs Scrapy vs Playwright vs API

| Feature         | BeautifulSoup            | Scrapy                   | Playwright               | Mantis API     |
|-----------------|--------------------------|--------------------------|--------------------------|----------------|
| Type            | Parser only              | Full framework           | Browser automation       | API service    |
| JS rendering    | ❌ No                    | ❌ Needs plugin          | ✅ Yes                   | ✅ Yes         |
| Speed           | Fast (parsing)           | Very fast                | Slow                     | Fast (~200ms)  |
| Concurrency     | Manual                   | Built-in                 | Manual                   | Built-in       |
| Anti-bot bypass | ❌ Manual                | ❌ Manual                | ⚠️ Partial               | ✅ Automatic   |
| Learning curve  | Easy                     | Medium                   | Medium                   | Easy           |
| Best for        | Quick scripts            | Large crawls             | JS-heavy sites           | Production apps|
| Monthly cost    | Free + proxies ($50-400) | Free + infra ($100-500)  | Free + infra ($150-600)  | $29-299        |

When to use BeautifulSoup: Quick scripts, static HTML, learning web scraping, small projects, parsing HTML fragments.

When to upgrade: When you need JavaScript rendering, anti-bot bypass, concurrent scraping at scale, or structured data extraction without writing parsers. See our detailed comparison: Scrapy vs BeautifulSoup vs API.

13. The API Shortcut

For production workloads, writing BeautifulSoup parsers for every website gets tedious. Each site has different HTML structure, and sites change their markup without warning — breaking your scrapers.

A web scraping API gives you:

# BeautifulSoup approach (50+ lines for a robust scraper)
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
# ... find elements, handle edge cases, parse text...

# Mantis API approach (5 lines)
response = requests.get(
    "https://api.mantisapi.com/scrape",
    params={"url": url, "extract": "title,price,description,reviews"},
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json()  # Structured, ready to use

From BeautifulSoup to Production in Minutes

Start with BeautifulSoup for learning and prototyping. When you're ready for production — anti-bot bypass, JS rendering, AI extraction — Mantis API is one endpoint away.

Try Free → 100 Requests/Month

Frequently Asked Questions

What is BeautifulSoup used for in Python?

BeautifulSoup is a Python library for parsing HTML and XML documents. It creates a parse tree that you can navigate, search, and extract data from. It's the go-to tool for web scraping tasks like extracting product prices, article content, table data, and links from websites.

Is BeautifulSoup better than Scrapy?

They serve different purposes. BeautifulSoup is a parser — it only handles HTML parsing. Scrapy is a complete framework with request scheduling, concurrency, middleware, and data pipelines. Use BeautifulSoup + Requests for simple projects. Use Scrapy for large-scale crawling. Use an API for production workloads.

Can BeautifulSoup handle JavaScript-rendered pages?

No. BeautifulSoup only parses static HTML. For JavaScript-rendered pages, you need Playwright or Selenium to render the page first, then pass the HTML to BeautifulSoup. Or use a web scraping API that handles rendering automatically.

Which parser should I use with BeautifulSoup?

Use lxml — it's the fastest and handles most HTML well. Use html.parser if you want zero dependencies. Use html5lib only for extremely broken HTML that other parsers can't handle.

How fast is BeautifulSoup?

With lxml, BeautifulSoup parses HTML in milliseconds. The bottleneck is rarely parsing — it's the HTTP requests, anti-bot detection, and JavaScript rendering. For high-throughput scraping, look at the anti-blocking guide or use an API.

Is web scraping with BeautifulSoup legal?

Scraping publicly available data is generally legal (see hiQ v. LinkedIn). However, respect robots.txt, avoid scraping personal data without consent (GDPR/CCPA), and don't overload servers. Always consult legal counsel for your specific use case. See our legal guide for details.

Next Steps

Now that you've mastered BeautifulSoup, explore these related guides: