# How to Extract Data from Any Website with Python in 2026

March 6, 2026 · Tutorial


Extracting data from websites is one of the most common tasks in software development. Whether you're building a price tracker, gathering research data, or feeding information to an AI agent, you need a reliable way to pull structured data from web pages.

But in 2026, websites are more complex than ever. JavaScript-rendered content, anti-bot protections, and constantly changing layouts make traditional scraping fragile and time-consuming.

This guide covers every approach to website data extraction — from basic HTML parsing to AI-powered extraction that understands page content like a human would.

## The 4 Approaches to Extracting Website Data

| Approach | Best For | Handles JS? | Maintenance |
|----------|----------|-------------|-------------|
| HTML Parsing (BeautifulSoup) | Simple, static pages | No | High — breaks when HTML changes |
| Browser Automation (Playwright) | JavaScript-heavy sites | Yes | High — selectors break often |
| API-Based Scraping (WebPerception) | Any site, production use | Yes | Zero — AI adapts to changes |
| Official APIs | Sites that offer them | N/A | Low — but limited availability |

Let's walk through each one.

## Approach 1: HTML Parsing with BeautifulSoup

The classic approach. Fetch the HTML, parse it, extract what you need with CSS selectors.

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for card in soup.select(".product-card"):
    products.append({
        "name": card.select_one(".title").text.strip(),
        "price": card.select_one(".price").text.strip(),
        "url": card.select_one("a")["href"],
    })

print(products)
```

**Pros:**
- Simple and fast for static HTML
- No browser overhead
- Large ecosystem of tutorials

**Cons:**
- Breaks when the site changes its HTML structure
- Can't handle JavaScript-rendered content (SPAs, React, Vue)
- You need to write and maintain CSS selectors for every site
- Anti-bot systems often block raw HTTP requests

**When it works:** Simple pages with stable HTML structure that don't use JavaScript rendering.
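As noted in the cons, anti-bot systems often block raw HTTP requests, frequently because of the default `python-requests` User-Agent. On simpler sites, sending browser-like headers can help. This is a minimal sketch, not a guarantee: the header values are illustrative, and it will not defeat real anti-bot protection.

```python
import requests

# Illustrative browser-like headers. Many basic blockers key on the
# default "python-requests" User-Agent; this only helps on simple sites.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_html(url, timeout=10):
    """Fetch raw HTML with browser-like headers and a timeout."""
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    response.raise_for_status()  # surface 403/429 blocks immediately
    return response.text
```

If a site still returns 403s with realistic headers, it is using stronger protection and you'll need one of the other approaches below.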

## Approach 2: Browser Automation with Playwright

When sites render content with JavaScript, you need a real browser.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    products = page.evaluate("""
        () => Array.from(document.querySelectorAll('.product-card')).map(card => ({
            name: card.querySelector('.title')?.textContent?.trim(),
            price: card.querySelector('.price')?.textContent?.trim(),
            url: card.querySelector('a')?.href
        }))
    """)

    browser.close()

print(products)
```

**Pros:**
- Handles JavaScript-rendered content
- Can interact with pages (click, scroll, fill forms)
- Sees the page like a real user

**Cons:**
- Slow — launches a full browser for every request
- Memory-hungry — each browser instance uses 200-500MB RAM
- Selectors still break when layouts change
- Anti-bot systems detect headless browsers
- Requires infrastructure to run browsers at scale

**When it works:** JavaScript-heavy sites where you need to see rendered content, and you're running at low volume.

## Approach 3: AI-Powered Extraction with WebPerception API

The modern approach. Instead of writing brittle CSS selectors, describe what you want in plain English and let AI extract it.

```python
import requests

# Step 1: Scrape the page (handles JS rendering + anti-bot)
scrape = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={"url": "https://example.com/products"}
)
html_content = scrape.json()["content"]

# Step 2: Extract structured data with AI
extract = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "content": html_content,
        "prompt": "Extract all products with name, price, and URL",
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "url": {"type": "string"}
                }
            }
        }
    }
)

products = extract.json()["data"]
print(products)
```

**Pros:**
- No CSS selectors to write or maintain
- Handles JavaScript rendering automatically
- Built-in anti-bot handling (proxies, browser fingerprinting)
- AI adapts when layouts change — no broken scrapers
- Returns clean, structured JSON
- No browser infrastructure to manage

**Cons:**
- Costs money (though the free tier includes 100 requests/month)
- Requires an API key

**When it works:** Any website, any scale. Especially valuable when you're extracting from multiple sites or need production reliability.

## Approach 4: Official APIs

Some websites offer official APIs. Always check first — it's the most reliable approach when available.

```python
import requests

# Example: GitHub API
response = requests.get(
    "https://api.github.com/repos/facebook/react",
    headers={"Authorization": "token YOUR_TOKEN"}
)
repo = response.json()
print(f"Stars: {repo['stargazers_count']}")
```

**When it works:** When the site has an API that provides the data you need. Unfortunately, most sites don't offer comprehensive APIs.

## Real-World Example: Building a Price Tracker

Let's build something practical — a price tracker that monitors product prices across multiple e-commerce sites.

**The fragile way (BeautifulSoup):**

```python
# You need different selectors for every site
SELECTORS = {
    "amazon.com": {"price": "#priceblock_ourprice", "name": "#productTitle"},
    "bestbuy.com": {"price": ".priceView-hero-price span", "name": ".sku-title h1"},
    "walmart.com": {"price": "[itemprop='price']", "name": "[itemprop='name']"},
}

# And they break constantly...
```

**The reliable way (WebPerception API):**

```python
import requests

def get_price(url):
    """Extract product name and price from any e-commerce site."""
    # Scrape
    scrape = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={"x-api-key": "YOUR_API_KEY"},
        json={"url": url}
    )

    # Extract with AI — works on ANY site
    extract = requests.post(
        "https://api.mantisapi.com/v1/extract",
        headers={"x-api-key": "YOUR_API_KEY"},
        json={
            "content": scrape.json()["content"],
            "prompt": "Extract the main product name and current price",
            "schema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "price": {"type": "string"},
                    "currency": {"type": "string"}
                }
            }
        }
    )

    return extract.json()["data"]

# Works on ANY e-commerce site — no per-site selectors needed
urls = [
    "https://amazon.com/dp/B0EXAMPLE",
    "https://bestbuy.com/site/example",
    "https://walmart.com/ip/example"
]

for url in urls:
    result = get_price(url)
    print(f"{result['product_name']}: {result['price']} {result['currency']}")
```

The difference is dramatic: one universal function vs. maintaining selectors for every site.

## Extracting Specific Data Types

### Tables

```python
extract = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "content": html_content,
        "prompt": "Extract the comparison table with all columns and rows",
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "feature": {"type": "string"},
                    "plan_basic": {"type": "string"},
                    "plan_pro": {"type": "string"},
                    "plan_enterprise": {"type": "string"}
                }
            }
        }
    }
)
```

### Contact Information

```python
extract = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "content": html_content,
        "prompt": "Extract all contact information: emails, phone numbers, addresses",
        "schema": {
            "type": "object",
            "properties": {
                "emails": {"type": "array", "items": {"type": "string"}},
                "phones": {"type": "array", "items": {"type": "string"}},
                "address": {"type": "string"}
            }
        }
    }
)
```

### News Articles

```python
extract = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "content": html_content,
        "prompt": "Extract the article title, author, publication date, and full text",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "date": {"type": "string"},
                "text": {"type": "string"}
            }
        }
    }
)
```

## Handling Common Challenges

### JavaScript-Rendered Content

BeautifulSoup and raw HTTP requests can't see JavaScript-rendered content. If you view source and the data isn't there, you need either a browser or an API that renders JavaScript for you.
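One quick way to run that "view source" check programmatically: see whether a value you can read in the browser actually appears in the raw HTML. A small sketch with a hypothetical helper (`needs_js_rendering` is not part of any library):

```python
def needs_js_rendering(raw_html: str, expected_text: str) -> bool:
    """Heuristic: if text that is visible in the browser is absent from
    the raw HTML, the page is probably rendered client-side."""
    return expected_text not in raw_html

# A static page ships its data in the HTML itself
static_html = "<div class='price'>$19.99</div>"
# An SPA shell ships only a mount point; data arrives via JavaScript
spa_html = "<div id='root'></div><script src='/app.js'></script>"

print(needs_js_rendering(static_html, "$19.99"))  # False: parse it directly
print(needs_js_rendering(spa_html, "$19.99"))     # True: needs a browser or rendering API
```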

WebPerception API renders JavaScript automatically — no configuration needed.

### Pagination

Most websites spread data across multiple pages. Here's how to handle it:

```python
import requests

all_products = []
page = 1

while True:
    scrape = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={"x-api-key": "YOUR_API_KEY"},
        json={"url": f"https://example.com/products?page={page}"}
    )

    extract = requests.post(
        "https://api.mantisapi.com/v1/extract",
        headers={"x-api-key": "YOUR_API_KEY"},
        json={
            "content": scrape.json()["content"],
            "prompt": "Extract all products. Return empty array if no products found.",
            "schema": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"}
                    }
                }
            }
        }
    )

    products = extract.json()["data"]
    if not products:
        break

    all_products.extend(products)
    page += 1

print(f"Extracted {len(all_products)} products across {page - 1} pages")
```

### Rate Limiting

Always be respectful of the sites you're scraping:

```python
import time

for url in urls:
    result = get_price(url)
    time.sleep(1)  # Wait between requests
```

WebPerception API handles rate limiting on the infrastructure side, but you should still pace your requests to avoid overwhelming target sites.
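Beyond a fixed `time.sleep`, a common pacing pattern is exponential backoff with jitter when a target site starts returning 429 or 5xx responses. A minimal sketch (the helper names `backoff_delay` and `fetch_with_retries` are illustrative, not part of any library):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay before retry number `attempt`: 1s, 2s, 4s, ... capped at `cap`,
    plus up to 10% random jitter so clients don't retry in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)

def fetch_with_retries(do_request, max_retries=5):
    """Call `do_request()` until it succeeds, sleeping with backoff between
    failures. `do_request` should raise on a 429/5xx response."""
    for attempt in range(max_retries):
        try:
            return do_request()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

In practice you would wrap each scrape call, e.g. `fetch_with_retries(lambda: get_price(url))`, so transient blocks don't kill a long-running job.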

## Which Approach Should You Use?

**Choose BeautifulSoup if:**
- You're scraping one simple, static site
- The HTML structure rarely changes
- You're learning and want to understand the basics

**Choose Playwright if:**
- You need to interact with pages (login, click, scroll)
- You're building a one-off script for a specific site
- You don't mind maintaining browser infrastructure

**Choose WebPerception API if:**
- You're extracting data from multiple sites
- You need production reliability (no broken selectors)
- You want structured data without writing parsers
- You're building an AI agent that needs web data
- You don't want to manage browser infrastructure

**Choose Official APIs if:**
- The site offers one with the data you need
- You need the highest reliability and speed

## Getting Started with WebPerception API

1. Sign up at [mantisapi.com](https://mantisapi.com) — free tier includes 100 requests/month
2. Get your API key from the dashboard
3. Start extracting — two API calls: scrape + extract

```python
import requests

API_KEY = "YOUR_API_KEY"
HEADERS = {"x-api-key": API_KEY}

# Scrape any URL
scrape = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers=HEADERS,
    json={"url": "https://news.ycombinator.com"}
)

# Extract what you need
extract = requests.post(
    "https://api.mantisapi.com/v1/extract",
    headers=HEADERS,
    json={
        "content": scrape.json()["content"],
        "prompt": "Extract the top 10 stories with title, URL, points, and comment count",
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "url": {"type": "string"},
                    "points": {"type": "integer"},
                    "comments": {"type": "integer"}
                }
            }
        }
    }
)

for story in extract.json()["data"]:
    print(f"{story['points']}pts | {story['title']}")
```

No selectors. No browser setup. No maintenance. Just data.

---

Ready to extract data from any website? [Get your free API key](https://mantisapi.com) and start building in minutes.
