Web Scraping with BeautifulSoup and Python in 2026: The Complete Guide
BeautifulSoup is the most popular HTML parsing library in Python — and for good reason. It's simple, forgiving with broken HTML, and perfect for extracting data from web pages. This guide covers everything from installation to production-ready scrapers.
Whether you're scraping product prices, extracting article content, or building a dataset for machine learning, BeautifulSoup combined with Python's requests library is the classic starting point. We'll also show you when (and why) you might want to graduate to an API.
Table of Contents
- Installation & Setup
- Your First BeautifulSoup Scraper
- Choosing a Parser: lxml vs html.parser vs html5lib
- Finding Elements: find(), find_all(), and CSS Selectors
- Navigating the DOM Tree
- Extracting Text, Attributes, and Links
- Scraping Tables into DataFrames
- Handling Pagination
- Handling Forms and Login
- Production-Ready Scraper with Error Handling
- Dealing with JavaScript-Rendered Pages
- BeautifulSoup vs Scrapy vs Playwright vs API
- The API Shortcut: Skip the Parsing Entirely
- FAQ
1. Installation & Setup
Install BeautifulSoup 4, the requests HTTP library, and the lxml parser (recommended for speed):
pip install beautifulsoup4 requests lxml
Verify your installation:
import bs4
print(bs4.__version__) # 4.13.x
Install lxml alongside BeautifulSoup. It's 10-100x faster than the built-in html.parser and handles malformed HTML better.
2. Your First BeautifulSoup Scraper
Let's scrape a web page in 10 lines of Python:
import requests
from bs4 import BeautifulSoup
# Fetch the page
url = "https://example.com"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()
# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")
# Extract the title
print(soup.title.string)
# → "Example Domain"
# Extract all links
for link in soup.find_all("a"):
print(link.get("href"))
That's the core pattern: fetch → parse → extract. Everything else is refinement.
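If you repeat this pattern across scripts, it's worth wrapping the fetch-and-parse step in a small helper. A minimal sketch (the get_soup name and defaults are just one way to do it):
import requests
from bs4 import BeautifulSoup

def get_soup(url, timeout=30):
    """Fetch a URL and return a parsed BeautifulSoup object."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")

soup = get_soup("https://example.com")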
3. Choosing a Parser
| Parser | Speed | Install | Broken HTML | Best For |
|---|---|---|---|---|
| lxml | ⚡ Fastest | pip install lxml | Good | Production scraping (recommended) |
| html.parser | Medium | Built-in | Decent | Quick scripts, no dependencies |
| html5lib | 🐢 Slowest | pip install html5lib | Perfect | Extremely broken HTML |
| lxml-xml | ⚡ Fast | pip install lxml | N/A | XML/RSS/Atom feeds |
# lxml (recommended)
soup = BeautifulSoup(html, "lxml")
# Built-in parser (no extra install)
soup = BeautifulSoup(html, "html.parser")
# html5lib (browser-like parsing)
soup = BeautifulSoup(html, "html5lib")
# XML mode
soup = BeautifulSoup(xml_data, "lxml-xml")
Use lxml unless you have a specific reason not to. It's faster and more forgiving than html.parser.
4. Finding Elements
BeautifulSoup gives you multiple ways to find elements. Here are the most useful:
find() — First Match
# Find first <h1> tag
h1 = soup.find("h1")
print(h1.text)
# Find by class
product = soup.find("div", class_="product-card")
# Find by id
sidebar = soup.find("div", id="sidebar")
# Find by attribute
price = soup.find("span", attrs={"data-price": True})
find_all() — All Matches
# All paragraphs
paragraphs = soup.find_all("p")
# All links with a specific class
nav_links = soup.find_all("a", class_="nav-link")
# Multiple tags
headers = soup.find_all(["h1", "h2", "h3"])
# Limit results
first_5 = soup.find_all("li", limit=5)
# Using a function
def has_data_id(tag):
return tag.has_attr("data-id")
elements = soup.find_all(has_data_id)
CSS Selectors — select() and select_one()
# CSS selector (returns list)
products = soup.select("div.product-card")
# Single element
title = soup.select_one("h1.page-title")
# Nested selectors
prices = soup.select("div.product-card > span.price")
# Attribute selectors
links = soup.select('a[href^="https://"]')
# nth-child
third_item = soup.select_one("ul.menu li:nth-child(3)")
# Multiple classes
featured = soup.select("div.product.featured")
Use select() when you'd naturally write CSS; use find_all() when you need programmatic filtering.
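For example, the same filter written both ways (the markup here is hypothetical):
# CSS selector: concise when the pattern is static
discounted = soup.select('span.price[data-discount]')

# find_all with a function: better when the filter needs real logic
discounted = soup.find_all(
    lambda tag: tag.name == "span"
    and "price" in tag.get("class", [])
    and tag.has_attr("data-discount")
)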
5. Navigating the DOM Tree
# Parent
card = soup.find("span", class_="price")
container = card.parent
# Children (direct)
for child in container.children:
print(child.name)
# Descendants (all nested)
for desc in container.descendants:
if desc.name:
print(desc.name)
# Siblings
next_sibling = card.find_next_sibling("span")
prev_sibling = card.find_previous_sibling("div")
# All next siblings
for sibling in card.find_next_siblings():
print(sibling)
6. Extracting Text, Attributes, and Links
# Text content
element = soup.find("div", class_="description")
text = element.get_text() # All text, including nested
text = element.get_text(strip=True) # Strip whitespace
text = element.get_text(" | ") # Custom separator
# Single string (only works if tag has one child string)
name = soup.find("h1").string
# Attributes
link = soup.find("a")
href = link.get("href") # Safe (returns None if missing)
href = link["href"] # Raises KeyError if missing
classes = link.get("class") # Returns list for class attribute
all_attrs = link.attrs # Dict of all attributes
# Extract ALL links from a page
links = []
for a in soup.find_all("a", href=True):
links.append({
"text": a.get_text(strip=True),
"url": a["href"]
})
# Extract ALL images
images = []
for img in soup.find_all("img"):
images.append({
"src": img.get("src"),
"alt": img.get("alt", ""),
"width": img.get("width")
})
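One subtlety worth showing: .string returns None as soon as a tag has more than one child, while get_text() concatenates everything. A quick self-contained example:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Hello <em>world</em></h1>", "lxml")
print(soup.h1.string)      # None, because <h1> has two children
print(soup.h1.get_text())  # "Hello world"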
7. Scraping Tables into DataFrames
One of the most common scraping tasks — pulling HTML tables into structured data:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://example.com/data-table"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")
# Method 1: Manual extraction (more control)
table = soup.find("table", class_="data-table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]: # Skip header row
cells = [td.get_text(strip=True) for td in tr.find_all("td")]
if cells:
rows.append(cells)
df = pd.DataFrame(rows, columns=headers)
# Method 2: pandas shortcut (quick & dirty)
from io import StringIO
dfs = pd.read_html(StringIO(response.text))  # newer pandas deprecates passing a raw HTML string
df = dfs[0] # First table on the page
print(df.head())
pd.read_html() can use BeautifulSoup under the hood (it tries lxml first and falls back to bs4 + html5lib). Use it for quick table extraction. Use manual parsing when you need to handle complex structures (merged cells, nested tables, data attributes), as in the sketch below.
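As an example of that manual parsing, here is one way to expand colspan cells so each row lines up with the headers (it reuses the table variable from Method 1; treat it as a starting point, not a complete solution):
def parse_row(tr):
    cells = []
    for cell in tr.find_all(["td", "th"]):
        text = cell.get_text(strip=True)
        span = int(cell.get("colspan", 1))
        cells.extend([text] * span)  # repeat merged cells so columns stay aligned
    return cells

rows = [parse_row(tr) for tr in table.find_all("tr")]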
8. Handling Pagination
import requests
from bs4 import BeautifulSoup
import time
base_url = "https://example.com/products"
all_products = []
page = 1
while True:
print(f"Scraping page {page}...")
response = requests.get(
f"{base_url}?page={page}",
headers={"User-Agent": "Mozilla/5.0"}
)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
# Extract products from this page
products = soup.select("div.product-card")
if not products:
break # No more products = last page
for product in products:
all_products.append({
"name": product.select_one("h3").get_text(strip=True),
"price": product.select_one(".price").get_text(strip=True),
"url": product.select_one("a")["href"]
})
# Check for next page link
next_link = soup.select_one("a.next-page")
if not next_link:
break
page += 1
time.sleep(2) # Be polite — don't hammer the server
print(f"Scraped {len(all_products)} products from {page} pages")
Following "Next" Links
from urllib.parse import urljoin
url = "https://example.com/listings"
while url:
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")
# ... extract data ...
# Follow next page link
next_btn = soup.select_one('a[rel="next"]')
url = urljoin(url, next_btn["href"]) if next_btn else None
time.sleep(2)
9. Handling Forms and Login
import requests
from bs4 import BeautifulSoup
session = requests.Session()
# Step 1: Get the login page (grab CSRF token)
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "lxml")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]
# Step 2: Submit login form
login_data = {
"username": "your_username",
"password": "your_password",
"csrf_token": csrf_token
}
response = session.post("https://example.com/login", data=login_data)
# Step 3: Scrape authenticated pages
protected_page = session.get("https://example.com/dashboard")
soup = BeautifulSoup(protected_page.text, "lxml")
# Now you can scrape pages behind the login
Use requests.Session() to persist cookies across requests. Without it, your login state won't carry over.
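It's also worth confirming the login actually worked before scraping. A minimal check, assuming the dashboard shows a logout link only when you're authenticated:
if soup.select_one('a[href*="logout"]'):
    print("Logged in")
else:
    print("Login failed: check credentials and any hidden form fields")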
10. Production-Ready Scraper with Error Handling
import requests
from bs4 import BeautifulSoup
import time
import json
import logging
from urllib.parse import urljoin
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ProductScraper:
"""Production-ready BeautifulSoup scraper with retries and rate limiting."""
def __init__(self, base_url, delay=2, max_retries=3):
self.base_url = base_url
self.delay = delay
self.max_retries = max_retries
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
})
def fetch(self, url):
"""Fetch URL with retries and exponential backoff."""
for attempt in range(self.max_retries):
try:
response = self.session.get(url, timeout=30)
response.raise_for_status()
return response.text
except requests.RequestException as e:
wait = 2 ** attempt
logger.warning(f"Attempt {attempt+1} failed for {url}: {e}. "
f"Retrying in {wait}s...")
time.sleep(wait)
logger.error(f"All {self.max_retries} attempts failed for {url}")
return None
def parse_product(self, card):
"""Extract product data from a card element."""
try:
return {
"name": card.select_one("h3, .product-name")
.get_text(strip=True),
"price": card.select_one(".price")
.get_text(strip=True),
"url": urljoin(self.base_url,
card.select_one("a")["href"]),
"rating": card.select_one(".rating")
.get_text(strip=True)
if card.select_one(".rating") else None,
}
except (AttributeError, KeyError) as e:
logger.warning(f"Failed to parse product card: {e}")
return None
def scrape_all(self, max_pages=100):
"""Scrape all products with pagination."""
all_products = []
url = self.base_url
for page in range(1, max_pages + 1):
logger.info(f"Scraping page {page}: {url}")
html = self.fetch(url)
if not html:
break
soup = BeautifulSoup(html, "lxml")
cards = soup.select("div.product-card, .product-item")
if not cards:
logger.info("No more products found. Done.")
break
for card in cards:
product = self.parse_product(card)
if product:
all_products.append(product)
# Find next page
next_link = soup.select_one(
'a.next, a[rel="next"], .pagination a:last-child'
)
if not next_link or "disabled" in next_link.get("class", []):
break
url = urljoin(url, next_link["href"])
time.sleep(self.delay)
return all_products
def save(self, products, filename="products.json"):
"""Save results to JSON."""
with open(filename, "w") as f:
json.dump(products, f, indent=2, ensure_ascii=False)
logger.info(f"Saved {len(products)} products to {filename}")
# Usage
if __name__ == "__main__":
scraper = ProductScraper("https://example.com/products")
products = scraper.scrape_all()
scraper.save(products)
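If you'd rather end up with a spreadsheet than JSON, a CSV variant of save() is only a few lines (this sketch assumes every product dict shares the same keys):
import csv

def save_csv(products, filename="products.csv"):
    if not products:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=products[0].keys())
        writer.writeheader()
        writer.writerows(products)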
11. Dealing with JavaScript-Rendered Pages
BeautifulSoup's biggest limitation: it can't execute JavaScript. If the data you need is loaded dynamically (React, Vue, Angular, AJAX), the HTML source won't contain it.
How to detect JS-rendered content
# If you see empty containers or loading spinners in the HTML:
soup = BeautifulSoup(response.text, "lxml")
products = soup.select(".product-card")
print(len(products)) # 0 — content is loaded by JavaScript!
Solution 1: Check for API endpoints
# Many SPAs load data from APIs. Check the Network tab in DevTools.
# Often you can call the API directly — much faster than scraping HTML.
api_url = "https://example.com/api/products?page=1&limit=50"
data = requests.get(api_url).json()
Solution 2: Use Playwright + BeautifulSoup
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com/spa-page")
page.wait_for_selector(".product-card")
# Get rendered HTML, parse with BeautifulSoup
html = page.content()
soup = BeautifulSoup(html, "lxml")
products = soup.select(".product-card")
print(f"Found {len(products)} products") # Now it works!
browser.close()
Solution 3: Use a Web Scraping API
# Skip all the complexity — let the API handle JS rendering
import requests
response = requests.get(
"https://api.mantisapi.com/scrape",
params={
"url": "https://example.com/spa-page",
"render_js": True,
"extract": "product_name,price,rating"
},
headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json()
# Structured data returned — no parsing needed
Skip the Browser Automation
Mantis API handles JavaScript rendering, anti-bot bypass, and structured data extraction in one API call. No Playwright. No proxies. No CAPTCHA solving.
Start Free → 100 Requests/Month
12. BeautifulSoup vs Scrapy vs Playwright vs API
| Feature | BeautifulSoup | Scrapy | Playwright | Mantis API |
|---|---|---|---|---|
| Type | Parser only | Full framework | Browser automation | API service |
| JS rendering | ❌ No | ❌ Needs plugin | ✅ Yes | ✅ Yes |
| Speed | Fast (parsing) | Very fast | Slow | Fast (~200ms) |
| Concurrency | Manual | Built-in | Manual | Built-in |
| Anti-bot bypass | ❌ Manual | ❌ Manual | ⚠️ Partial | ✅ Automatic |
| Learning curve | Easy | Medium | Medium | Easy |
| Best for | Quick scripts | Large crawls | JS-heavy sites | Production apps |
| Monthly cost | Free + proxies ($50-400) | Free + infra ($100-500) | Free + infra ($150-600) | $29-299 |
When to use BeautifulSoup: Quick scripts, static HTML, learning web scraping, small projects, parsing HTML fragments.
When to upgrade: When you need JavaScript rendering, anti-bot bypass, concurrent scraping at scale, or structured data extraction without writing parsers. See our detailed comparison: Scrapy vs BeautifulSoup vs API.
13. The API Shortcut
For production workloads, writing BeautifulSoup parsers for every website gets tedious. Each site has different HTML structure, and sites change their markup without warning — breaking your scrapers.
A web scraping API gives you:
- Structured data — JSON output, no parsing code needed
- JavaScript rendering — SPAs, React, Angular all work
- Anti-bot bypass — Cloudflare, Akamai, PerimeterX handled automatically
- AI extraction — Describe what you want in natural language, get structured data back
- No infrastructure — No proxies, no headless browsers, no CAPTCHA services
# BeautifulSoup approach (50+ lines for a robust scraper)
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
# ... find elements, handle edge cases, parse text...
# Mantis API approach (5 lines)
response = requests.get(
"https://api.mantisapi.com/scrape",
params={"url": url, "extract": "title,price,description,reviews"},
headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json() # Structured, ready to use
From BeautifulSoup to Production in Minutes
Start with BeautifulSoup for learning and prototyping. When you're ready for production — anti-bot bypass, JS rendering, AI extraction — Mantis API is one endpoint away.
Try Free → 100 Requests/Month
Frequently Asked Questions
What is BeautifulSoup used for in Python?
BeautifulSoup is a Python library for parsing HTML and XML documents. It creates a parse tree that you can navigate, search, and extract data from. It's the go-to tool for web scraping tasks like extracting product prices, article content, table data, and links from websites.
Is BeautifulSoup better than Scrapy?
They serve different purposes. BeautifulSoup is a parser — it only handles HTML parsing. Scrapy is a complete framework with request scheduling, concurrency, middleware, and data pipelines. Use BeautifulSoup + Requests for simple projects. Use Scrapy for large-scale crawling. Use an API for production workloads.
Can BeautifulSoup handle JavaScript-rendered pages?
No. BeautifulSoup only parses static HTML. For JavaScript-rendered pages, you need Playwright or Selenium to render the page first, then pass the HTML to BeautifulSoup. Or use a web scraping API that handles rendering automatically.
Which parser should I use with BeautifulSoup?
Use lxml — it's the fastest and handles most HTML well. Use html.parser if you want zero dependencies. Use html5lib only for extremely broken HTML that other parsers can't handle.
How fast is BeautifulSoup?
With lxml, BeautifulSoup parses HTML in milliseconds. The bottleneck is never parsing — it's the HTTP requests, anti-bot detection, and JavaScript rendering. For high-throughput scraping, look at the anti-blocking guide or use an API.
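If you want to measure this on your own pages, a rough timing sketch (it assumes you've saved a sample page as page.html):
import timeit
from bs4 import BeautifulSoup

html = open("page.html", encoding="utf-8").read()

for parser in ("lxml", "html.parser", "html5lib"):
    seconds = timeit.timeit(lambda: BeautifulSoup(html, parser), number=50)
    print(f"{parser:12} {seconds / 50 * 1000:.1f} ms per parse")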
Is web scraping with BeautifulSoup legal?
Scraping publicly available data is generally legal (see hiQ v. LinkedIn). However, respect robots.txt, avoid scraping personal data without consent (GDPR/CCPA), and don't overload servers. Always consult legal counsel for your specific use case. See our legal guide for details.
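Python's standard library can check robots.txt for you before you fetch anything (the user-agent string below is just a placeholder):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot/1.0", "https://example.com/products"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")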
Next Steps
Now that you've mastered BeautifulSoup, explore these related guides:
- Web Scraping with Python Requests — deep dive into the HTTP side
- Web Scraping with Playwright — for JavaScript-rendered sites
- Web Scraping with Selenium — the classic browser automation tool
- How to Scrape Without Getting Blocked — anti-detection techniques
- Best Web Scraping APIs Compared — when you're ready to scale
- Scrapy vs BeautifulSoup vs API — detailed comparison