Web Scraping with BeautifulSoup and Python in 2026: The Complete Guide
BeautifulSoup is the most popular HTML parsing library in Python — and for good reason. It's simple, forgiving with broken HTML, and perfect for extracting data from web pages. This guide covers everything from installation to production-ready scrapers.
Whether you're scraping product prices, extracting article content, or building a dataset for machine learning, BeautifulSoup combined with Python's requests library is the classic starting point. We'll also show you when (and why) you might want to graduate to an API.
Table of Contents
- Installation & Setup
- Your First BeautifulSoup Scraper
- Choosing a Parser: lxml vs html.parser vs html5lib
- Finding Elements: find(), find_all(), and CSS Selectors
- Navigating the DOM Tree
- Extracting Text, Attributes, and Links
- Scraping Tables into DataFrames
- Handling Pagination
- Handling Forms and Login
- Production-Ready Scraper with Error Handling
- Dealing with JavaScript-Rendered Pages
- BeautifulSoup vs Scrapy vs Playwright vs API
- The API Shortcut: Skip the Parsing Entirely
- FAQ
1. Installation & Setup
Install BeautifulSoup 4, the requests HTTP library, and the lxml parser (recommended for speed):
pip install beautifulsoup4 requests lxml
Verify your installation:
import bs4
print(bs4.__version__) # 4.13.x
Install lxml alongside BeautifulSoup. It's 10-100x faster than the built-in html.parser and handles malformed HTML better.
2. Your First BeautifulSoup Scraper
Let's scrape a web page in 10 lines of Python:
import requests
from bs4 import BeautifulSoup
# Fetch the page
url = "https://example.com"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()
# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")
# Extract the title
print(soup.title.string)
# → "Example Domain"
# Extract all links
for link in soup.find_all("a"):
print(link.get("href"))
That's the core pattern: fetch → parse → extract. Everything else is refinement.
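If you repeat this pattern across scripts, it's worth wrapping the fetch-and-parse step in a small helper. A minimal sketch (the get_soup name and defaults are just one way to do it):
import requests
from bs4 import BeautifulSoup

def get_soup(url, timeout=30):
    """Fetch a URL and return a parsed BeautifulSoup object."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")

soup = get_soup("https://example.com")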
3. Choosing a Parser
| Parser | Speed | Install | Broken HTML | Best For |
|---|---|---|---|---|
| lxml | ⚡ Fastest | pip install lxml | Good | Production scraping (recommended) |
| html.parser | Medium | Built-in | Decent | Quick scripts, no dependencies |
| html5lib | 🐢 Slowest | pip install html5lib | Perfect | Extremely broken HTML |
| lxml-xml | ⚡ Fast | pip install lxml | N/A | XML/RSS/Atom feeds |
# lxml (recommended)
soup = BeautifulSoup(html, "lxml")
# Built-in parser (no extra install)
soup = BeautifulSoup(html, "html.parser")
# html5lib (browser-like parsing)
soup = BeautifulSoup(html, "html5lib")
# XML mode
soup = BeautifulSoup(xml_data, "lxml-xml")
Use lxml unless you have a specific reason not to. It's faster and more forgiving than html.parser.
4. Finding Elements
BeautifulSoup gives you multiple ways to find elements. Here are the most useful:
find() — First Match
# Find first <h1> tag
h1 = soup.find("h1")
print(h1.text)
# Find by class
product = soup.find("div", class_="product-card")
# Find by id
sidebar = soup.find("div", id="sidebar")
# Find by attribute
price = soup.find("span", attrs={"data-price": True})
find_all() — All Matches
# All paragraphs
paragraphs = soup.find_all("p")
# All links with a specific class
nav_links = soup.find_all("a", class_="nav-link")
# Multiple tags
headers = soup.find_all(["h1", "h2", "h3"])
# Limit results
first_5 = soup.find_all("li", limit=5)
# Using a function
def has_data_id(tag):
return tag.has_attr("data-id")
elements = soup.find_all(has_data_id)
CSS Selectors — select() and select_one()
# CSS selector (returns list)
products = soup.select("div.product-card")
# Single element
title = soup.select_one("h1.page-title")
# Nested selectors
prices = soup.select("div.product-card > span.price")
# Attribute selectors
links = soup.select('a[href^="https://"]')
# nth-child
third_item = soup.select_one("ul.menu li:nth-child(3)")
# Multiple classes
featured = soup.select("div.product.featured")
Use select() when you'd naturally write CSS; use find_all() when you need programmatic filtering.
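For example, the same filter written both ways (the markup here is hypothetical):
# CSS selector: concise when the pattern is static
discounted = soup.select('span.price[data-discount]')

# find_all with a function: better when the filter needs real logic
discounted = soup.find_all(
    lambda tag: tag.name == "span"
    and "price" in tag.get("class", [])
    and tag.has_attr("data-discount")
)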
5. Navigating the DOM Tree
# Parent
card = soup.find("span", class_="price")
container = card.parent
# Children (direct)
for child in container.children:
print(child.name)
# Descendants (all nested)
for desc in container.descendants:
if desc.name:
print(desc.name)
# Siblings
next_sibling = card.find_next_sibling("span")
prev_sibling = card.find_previous_sibling("div")
# All next siblings
for sibling in card.find_next_siblings():
print(sibling)
6. Extracting Text, Attributes, and Links
# Text content
element = soup.find("div", class_="description")
text = element.get_text() # All text, including nested
text = element.get_text(strip=True) # Strip whitespace
text = element.get_text(" | ") # Custom separator
# Single string (only works if tag has one child string)
name = soup.find("h1").string
# Attributes
link = soup.find("a")
href = link.get("href") # Safe (returns None if missing)
href = link["href"] # Raises KeyError if missing
classes = link.get("class") # Returns list for class attribute
all_attrs = link.attrs # Dict of all attributes
# Extract ALL links from a page
links = []
for a in soup.find_all("a", href=True):
links.append({
"text": a.get_text(strip=True),
"url": a["href"]
})
# Extract ALL images
images = []
for img in soup.find_all("img"):
images.append({
"src": img.get("src"),
"alt": img.get("alt", ""),
"width": img.get("width")
})
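One subtlety worth showing: .string returns None as soon as a tag has more than one child, while get_text() concatenates everything. A quick self-contained example:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Hello <em>world</em></h1>", "lxml")
print(soup.h1.string)      # None, because <h1> has two children
print(soup.h1.get_text())  # "Hello world"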
7. Scraping Tables into DataFrames
One of the most common scraping tasks — pulling HTML tables into structured data:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://example.com/data-table"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")
# Method 1: Manual extraction (more control)
table = soup.find("table", class_="data-table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]: # Skip header row
cells = [td.get_text(strip=True) for td in tr.find_all("td")]
if cells:
rows.append(cells)
df = pd.DataFrame(rows, columns=headers)
# Method 2: pandas shortcut (quick & dirty)
from io import StringIO
dfs = pd.read_html(StringIO(response.text))  # newer pandas deprecates passing a raw HTML string
df = dfs[0] # First table on the page
print(df.head())
pd.read_html() can use BeautifulSoup under the hood (it tries lxml first and falls back to bs4 + html5lib). Use it for quick table extraction. Use manual parsing when you need to handle complex structures (merged cells, nested tables, data attributes), as in the sketch below.
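As an example of that manual parsing, here is one way to expand colspan cells so each row lines up with the headers (it reuses the table variable from Method 1; treat it as a starting point, not a complete solution):
def parse_row(tr):
    cells = []
    for cell in tr.find_all(["td", "th"]):
        text = cell.get_text(strip=True)
        span = int(cell.get("colspan", 1))
        cells.extend([text] * span)  # repeat merged cells so columns stay aligned
    return cells

rows = [parse_row(tr) for tr in table.find_all("tr")]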
8. Handling Pagination
import requests
from bs4 import BeautifulSoup
import time
base_url = "https://example.com/products"
all_products = []
page = 1
while True:
print(f"Scraping page {page}...")
response = requests.get(
f"{base_url}?page={page}",
headers={"User-Agent": "Mozilla/5.0"}
)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
# Extract products from this page
products = soup.select("div.product-card")
if not products:
break # No more products = last page
for product in products:
all_products.append({
"name": product.select_one("h3").get_text(strip=True),
"price": product.select_one(".price").get_text(strip=True),
"url": product.select_one("a")["href"]
})
# Check for next page link
next_link = soup.select_one("a.next-page")
if not next_link:
break
page += 1
time.sleep(2) # Be polite — don't hammer the server
print(f"Scraped {len(all_products)} products from {page} pages")
Following "Next" Links
from urllib.parse import urljoin
url = "https://example.com/listings"
while url:
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")
# ... extract data ...
# Follow next page link
next_btn = soup.select_one('a[rel="next"]')
url = urljoin(url, next_btn["href"]) if next_btn else None
time.sleep(2)
9. Handling Forms and Login
import requests
from bs4 import BeautifulSoup
session = requests.Session()
# Step 1: Get the login page (grab CSRF token)
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "lxml")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]
# Step 2: Submit login form
login_data = {
"username": "your_username",
"password": "your_password",
"csrf_token": csrf_token
}
response = session.post("https://example.com/login", data=login_data)
# Step 3: Scrape authenticated pages
protected_page = session.get("https://example.com/dashboard")
soup = BeautifulSoup(protected_page.text, "lxml")
# Now you can scrape pages behind the login
Use requests.Session() to persist cookies across requests. Without it, your login state won't carry over.
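It's also worth confirming the login actually worked before scraping. A minimal check, assuming the dashboard shows a logout link only when you're authenticated:
if soup.select_one('a[href*="logout"]'):
    print("Logged in")
else:
    print("Login failed: check credentials and any hidden form fields")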
10. Production-Ready Scraper with Error Handling
import requests
from bs4 import BeautifulSoup
import time
import json
import logging
from urllib.parse import urljoin
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ProductScraper:
"""Production-ready BeautifulSoup scraper with retries and rate limiting."""
def __init__(self, base_url, delay=2, max_retries=3):
self.base_url = base_url
self.delay = delay
self.max_retries = max_retries
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
})
def fetch(self, url):
"""Fetch URL with retries and exponential backoff."""
for attempt in range(self.max_retries):
try:
response = self.session.get(url, timeout=30)
response.raise_for_status()
return response.text
except requests.RequestException as e:
wait = 2 ** attempt
logger.warning(f"Attempt {attempt+1} failed for {url}: {e}. "
f"Retrying in {wait}s...")
time.sleep(wait)
logger.error(f"All {self.max_retries} attempts failed for {url}")
return None
def parse_product(self, card):
"""Extract product data from a card element."""
try:
return {
"name": card.select_one("h3, .product-name")
.get_text(strip=True),
"price": card.select_one(".price")
.get_text(strip=True),
"url": urljoin(self.base_url,
card.select_one("a")["href"]),
"rating": card.select_one(".rating")
.get_text(strip=True)
if card.select_one(".rating") else None,
}
except (AttributeError, KeyError) as e:
logger.warning(f"Failed to parse product card: {e}")
return None
def scrape_all(self, max_pages=100):
"""Scrape all products with pagination."""
all_products = []
url = self.base_url
for page in range(1, max_pages + 1):
logger.info(f"Scraping page {page}: {url}")
html = self.fetch(url)
if not html:
break
soup = BeautifulSoup(html, "lxml")
cards = soup.select("div.product-card, .product-item")
if not cards:
logger.info("No more products found. Done.")
break
for card in cards:
product = self.parse_product(card)
if product:
all_products.append(product)
# Find next page
next_link = soup.select_one(
'a.next, a[rel="next"], .pagination a:last-child'
)
if not next_link or "disabled" in next_link.get("class", []):
break
url = urljoin(url, next_link["href"])
time.sleep(self.delay)
return all_products
def save(self, products, filename="products.json"):
"""Save results to JSON."""
with open(filename, "w") as f:
json.dump(products, f, indent=2, ensure_ascii=False)
logger.info(f"Saved {len(products)} products to {filename}")
# Usage
if __name__ == "__main__":
scraper = ProductScraper("https://example.com/products")
products = scraper.scrape_all()
scraper.save(products)
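If you'd rather end up with a spreadsheet than JSON, a CSV variant of save() is only a few lines (this sketch assumes every product dict shares the same keys):
import csv

def save_csv(products, filename="products.csv"):
    if not products:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=products[0].keys())
        writer.writeheader()
        writer.writerows(products)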
11. Dealing with JavaScript-Rendered Pages
BeautifulSoup's biggest limitation: it can't execute JavaScript. If the data you need is loaded dynamically (React, Vue, Angular, AJAX), the HTML source won't contain it.
How to detect JS-rendered content
# If you see empty containers or loading spinners in the HTML:
soup = BeautifulSoup(response.text, "lxml")
products = soup.select(".product-card")
print(len(products)) # 0 — content is loaded by JavaScript!
Solution 1: Check for API endpoints
# Many SPAs load data from APIs. Check the Network tab in DevTools.
# Often you can call the API directly — much faster than scraping HTML.
api_url = "https://example.com/api/products?page=1&limit=50"
data = requests.get(api_url).json()
Solution 2: Use Playwright + BeautifulSoup
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com/spa-page")
page.wait_for_selector(".product-card")
# Get rendered HTML, parse with BeautifulSoup
html = page.content()
soup = BeautifulSoup(html, "lxml")
products = soup.select(".product-card")
print(f"Found {len(products)} products") # Now it works!
browser.close()
Solution 3: Use a Web Scraping API
# Skip all the complexity — let the API handle JS rendering
import requests
response = requests.get(
"https://api.mantisapi.com/scrape",
params={
"url": "https://example.com/spa-page",
"render_js": True,
"extract": "product_name,price,rating"
},
headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json()
# Structured data returned — no parsing needed
Skip the Browser Automation
Mantis API handles JavaScript rendering, anti-bot bypass, and structured data extraction in one API call. No Playwright. No proxies. No CAPTCHA solving.
Start Free → 100 Requests/Month
12. BeautifulSoup vs Scrapy vs Playwright vs API
| Feature | BeautifulSoup | Scrapy | Playwright | Mantis API |
|---|---|---|---|---|
| Type | Parser only | Full framework | Browser automation | API service |
| JS rendering | ❌ No | ❌ Needs plugin | ✅ Yes | ✅ Yes |
| Speed | Fast (parsing) | Very fast | Slow | Fast (~200ms) |
| Concurrency | Manual | Built-in | Manual | Built-in |
| Anti-bot bypass | ❌ Manual | ❌ Manual | ⚠️ Partial | ✅ Automatic |
| Learning curve | Easy | Medium | Medium | Easy |
| Best for | Quick scripts | Large crawls | JS-heavy sites | Production apps |
| Monthly cost | Free + proxies ($50-400) | Free + infra ($100-500) | Free + infra ($150-600) | $29-299 |
When to use BeautifulSoup: Quick scripts, static HTML, learning web scraping, small projects, parsing HTML fragments.
When to upgrade: When you need JavaScript rendering, anti-bot bypass, concurrent scraping at scale, or structured data extraction without writing parsers. See our detailed comparison: Scrapy vs BeautifulSoup vs API.
13. The API Shortcut
For production workloads, writing BeautifulSoup parsers for every website gets tedious. Each site has different HTML structure, and sites change their markup without warning — breaking your scrapers.
A web scraping API gives you:
- Structured data — JSON output, no parsing code needed
- JavaScript rendering — SPAs, React, Angular all work
- Anti-bot bypass — Cloudflare, Akamai, PerimeterX handled automatically
- AI extraction — Describe what you want in natural language, get structured data back
- No infrastructure — No proxies, no headless browsers, no CAPTCHA services
# BeautifulSoup approach (50+ lines for a robust scraper)
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
# ... find elements, handle edge cases, parse text...
# Mantis API approach (5 lines)
response = requests.get(
"https://api.mantisapi.com/scrape",
params={"url": url, "extract": "title,price,description,reviews"},
headers={"Authorization": "Bearer YOUR_API_KEY"}
)
data = response.json() # Structured, ready to use
From BeautifulSoup to Production in Minutes
Start with BeautifulSoup for learning and prototyping. When you're ready for production — anti-bot bypass, JS rendering, AI extraction — Mantis API is one endpoint away.
Try Free → 100 Requests/Month
Frequently Asked Questions
What is BeautifulSoup used for in Python?
BeautifulSoup is a Python library for parsing HTML and XML documents. It creates a parse tree that you can navigate, search, and extract data from. It's the go-to tool for web scraping tasks like extracting product prices, article content, table data, and links from websites.
Is BeautifulSoup better than Scrapy?
They serve different purposes. BeautifulSoup is a parser — it only handles HTML parsing. Scrapy is a complete framework with request scheduling, concurrency, middleware, and data pipelines. Use BeautifulSoup + Requests for simple projects. Use Scrapy for large-scale crawling. Use an API for production workloads.
Can BeautifulSoup handle JavaScript-rendered pages?
No. BeautifulSoup only parses static HTML. For JavaScript-rendered pages, you need Playwright or Selenium to render the page first, then pass the HTML to BeautifulSoup. Or use a web scraping API that handles rendering automatically.
Which parser should I use with BeautifulSoup?
Use lxml — it's the fastest and handles most HTML well. Use html.parser if you want zero dependencies. Use html5lib only for extremely broken HTML that other parsers can't handle.
How fast is BeautifulSoup?
With lxml, BeautifulSoup parses HTML in milliseconds. The bottleneck is never parsing — it's the HTTP requests, anti-bot detection, and JavaScript rendering. For high-throughput scraping, look at the anti-blocking guide or use an API.
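If you want to measure this on your own pages, a rough timing sketch (it assumes you've saved a sample page as page.html):
import timeit
from bs4 import BeautifulSoup

html = open("page.html", encoding="utf-8").read()

for parser in ("lxml", "html.parser", "html5lib"):
    seconds = timeit.timeit(lambda: BeautifulSoup(html, parser), number=50)
    print(f"{parser:12} {seconds / 50 * 1000:.1f} ms per parse")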
Is web scraping with BeautifulSoup legal?
Scraping publicly available data is generally legal (see hiQ v. LinkedIn). However, respect robots.txt, avoid scraping personal data without consent (GDPR/CCPA), and don't overload servers. Always consult legal counsel for your specific use case. See our legal guide for details.
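Python's standard library can check robots.txt for you before you fetch anything (the user-agent string below is just a placeholder):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot/1.0", "https://example.com/products"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")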
Next Steps
Now that you've mastered BeautifulSoup, explore these related guides:
- Web Scraping with Python Requests — deep dive into the HTTP side
- Web Scraping with Playwright — for JavaScript-rendered sites
- Web Scraping with Selenium — the classic browser automation tool
- How to Scrape Without Getting Blocked — anti-detection techniques
- Best Web Scraping APIs Compared — when you're ready to scale
- Scrapy vs BeautifulSoup vs API — detailed comparison