March 4, 2026 guide
# AI Data Extraction: How to Extract Structured Data from Any Website
Every developer has been there. You need data from a website — product prices, company info, review scores — and you end up writing fragile CSS selectors that break every time the site updates.
There's a better way. AI-powered data extraction lets you describe *what* you want in plain English, and get back structured JSON — no selectors, no parsing, no maintenance.
This guide shows you how with [WebPerception API](https://mantisapi.com).
## The Problem with Traditional Web Scraping
Traditional scraping looks like this:
```python
from bs4 import BeautifulSoup
import requests
html = requests.get("https://example.com/product").text
soup = BeautifulSoup(html, "html.parser")
# Fragile selectors that WILL break
price = soup.select_one(".price-box .special-price .price").text
title = soup.select_one("h1.product-title span").text
rating = soup.select_one(".rating-result span").text
```
Problems:
- **Breaks when HTML changes** — one class name update and your scraper dies
- **No JavaScript rendering** — SPAs and dynamic content are invisible
- **Manual for every site** — each website needs custom selectors
- **No semantic understanding** — the scraper doesn't "know" what a price is
## AI Data Extraction: The Modern Approach
With AI extraction, you describe what you want:
```python
import requests

response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/product",
        "prompt": "Extract: product name, price, rating, number of reviews, availability",
    },
)
data = response.json()
# {
#   "product_name": "Sony WH-1000XM5 Wireless Headphones",
#   "price": "$279.99",
#   "rating": "4.7/5",
#   "number_of_reviews": 12847,
#   "availability": "In Stock"
# }
```
**No selectors. No parsing. No maintenance.** The AI understands the page like a human would.
## How It Works
WebPerception API's `/extract` endpoint:
1. **Renders the full page** — JavaScript, dynamic content, everything
2. **Understands the content** — AI reads the page like a human
3. **Extracts what you asked for** — Returns structured JSON matching your prompt
4. **Handles edge cases** — Missing fields return `null`, not crashes
The AI model sees the rendered page content and extracts exactly the fields you specified. It works across different sites without any site-specific configuration.
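From the client's point of view, the four steps above collapse into a single HTTP call. Below is a minimal sketch of a reusable helper, named `extract` to match the shorthand used in the use-case snippets. The endpoint and `x-api-key` header come from the request shown earlier. The `http` and `timeout` parameters, the `raise_for_status()` call, and the `field` helper are assumptions added for illustration, not part of the documented API.

```python
API_URL = "https://api.mantisapi.com/extract"  # endpoint used throughout this guide


def extract(url, prompt, api_key="your-api-key", http=None, timeout=60):
    """Send one page URL and a plain-English prompt to /extract; return parsed JSON.

    `http` is any object with a requests-style .post(); it defaults to the
    `requests` module and exists mainly so the helper can be tested with a stub.
    """
    if http is None:
        import requests  # lazy import: only needed for real network calls
        http = requests
    response = http.post(
        API_URL,
        headers={"x-api-key": api_key},
        json={"url": url, "prompt": prompt},
        timeout=timeout,  # assumption: rendering plus inference may take a while
    )
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
    return response.json()


def field(data, name, default=None):
    """Read a field, treating a missing key and an explicit null the same way."""
    value = data.get(name)
    return default if value is None else value
```

Because missing fields come back as `null`, a call such as `field(data, "price", default="n/a")` avoids both a `KeyError` and a stray `None` leaking into later code.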
## Use Cases
### Product Data Extraction
```python
# extract(url, prompt) is shorthand for the POST /extract request shown earlier
# Works on ANY e-commerce site
result = extract(
    "https://amazon.com/dp/B0BX2L8PDS",
    "Extract: product name, current price, original price, discount percentage, "
    "rating, review count, bullet point features, availability, seller name",
)
```
### Job Listing Extraction
```python
result = extract(
    "https://company.com/careers/senior-engineer",
    "Extract: job title, location, salary range, required skills, "
    "experience required, benefits, team name, remote policy",
)
```
### Real Estate Data
```python
result = extract(
    "https://zillow.com/homedetails/123-main-st",
    "Extract: address, price, bedrooms, bathrooms, square footage, "
    "year built, lot size, property type, HOA fees, price history",
)
```
### Company Information
```python
result = extract(
    "https://techstartup.com/about",
    "Extract: company name, founding year, founders, headquarters, "
    "employee count, funding raised, investors, mission statement",
)
```
### Review Aggregation
```python
result = extract(
    "https://g2.com/products/some-tool/reviews",
    "Extract the first 5 reviews with: reviewer name, rating, date, "
    "pros, cons, and summary",
)
```
## Batch Extraction
Need data from multiple pages? Send the requests in parallel with a thread pool:
```python
import requests
from concurrent.futures import ThreadPoolExecutor

API_KEY = "your-api-key"
BASE_URL = "https://api.mantisapi.com"


def extract_product(url):
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"x-api-key": API_KEY},
        json={
            "url": url,
            "prompt": "Extract: product name, price, rating, review count",
        },
    )
    return {"url": url, "data": response.json()}


urls = [
    "https://amazon.com/dp/PRODUCT1",
    "https://amazon.com/dp/PRODUCT2",
    "https://amazon.com/dp/PRODUCT3",
]

# Extract in parallel
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(extract_product, urls))

for r in results:
    print(f"{r['url']}: {r['data']}")
```
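In a larger batch, one slow or failing page should not take down the whole run. Here is a minimal sketch of one way to isolate failures and retry with exponential backoff. `with_retries` and `extract_many` are hypothetical helpers, and `fetch` is any callable that takes a URL and returns parsed data (for example, a variant of `extract_product` that raises on HTTP errors). The retry counts and delays are illustrative defaults, not documented rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fetch, url, retries=2, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries + 1):
        try:
            return {"url": url, "data": fetch(url), "error": None}
        except Exception as exc:  # broad on purpose: one bad page must not stop the batch
            if attempt == retries:
                # Out of attempts: record the error instead of raising
                return {"url": url, "data": None, "error": str(exc)}
            sleep(base_delay * 2 ** attempt)


def extract_many(urls, fetch, max_workers=3, **retry_options):
    """Run fetch over all URLs in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(
            executor.map(lambda u: with_retries(fetch, u, **retry_options), urls)
        )
```

Each result carries either `data` or an `error` string, so the caller can log failures and re-queue just those URLs.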
## AI Extraction vs. CSS Selectors
| Aspect | AI Extraction | CSS Selectors |
|--------|:-:|:-:|
| Setup time | Minutes | Hours per site |
| Maintenance | Zero | Constant |
| Cross-site | Works everywhere | Site-specific |
| Dynamic content | ✅ | Usually ❌ |
| Semantic understanding | ✅ | ❌ |
| Structured output | JSON | Raw strings |
| Edge cases | Handled gracefully | Crashes |
## Tips for Better Extraction
### 1. Be Specific in Your Prompts
```python
# ❌ Vague
prompt = "Extract product info"

# ✅ Specific
prompt = (
    "Extract: product name, price in USD, star rating out of 5, number of reviews, "
    "whether it's in stock (true/false), and the first 3 bullet point features"
)
```
### 2. Specify Data Types
```python
prompt = (
    "Extract: price (number, no currency symbol), rating (decimal), "
    "review_count (integer), in_stock (boolean)"
)
```
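Even with types spelled out in the prompt, a model can still return `"$279.99"` where a number was requested, so it can be worth normalizing values on the client. The sketch below is one possible approach. The helper names are hypothetical, the field names mirror the prompt above, and the number parsing assumes US-style formatting (comma as thousands separator).

```python
import re


def to_number(value):
    """Best-effort float from values like "$1,279.99", "4.7/5", or 4.7; else None."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Assumption: commas are thousands separators; take the first number found
    match = re.search(r"-?\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def to_bool(value):
    """Map booleans and a few common stock phrases to True/False; else None."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"true", "yes", "in stock"}:
            return True
        if text in {"false", "no", "out of stock"}:
            return False
    return None


def normalize_product(data):
    """Coerce the four fields from the prompt above; missing fields stay None."""
    count = to_number(data.get("review_count"))
    return {
        "price": to_number(data.get("price")),
        "rating": to_number(data.get("rating")),
        "review_count": int(count) if count is not None else None,
        "in_stock": to_bool(data.get("in_stock")),
    }
```

Values that cannot be interpreted become `None` rather than raising, which matches the API's convention of returning `null` for missing fields.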
### 3. Handle Multiple Items
```python
"Extract ALL products on this page. For each product: name, price, rating, and URL"
```
### 4. Add Context
```python
prompt = (
    "This is a SaaS pricing page. Extract each pricing tier with: "
    "tier name, monthly price, annual price, features list, and any limits"
)
```
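These tips can be combined mechanically. As an illustration only, a small hypothetical `build_prompt` helper could assemble a prompt from a field-to-hint mapping, an optional context sentence, and a flag for multi-item pages. The exact wording it produces is an assumption, not a format the API requires.

```python
def build_prompt(fields, context=None, many=False):
    """Build an extraction prompt.

    fields:  dict mapping a field name to a type/format hint (None for no hint)
    context: optional sentence describing the page, placed before the request
    many:    ask for every item on the page instead of a single record
    """
    parts = [f"{name} ({hint})" if hint else name for name, hint in fields.items()]
    lead = "Extract ALL items on this page. For each item: " if many else "Extract: "
    prompt = lead + ", ".join(parts)
    if context:
        # Normalize the trailing period so the two sentences join cleanly
        return f"{context.rstrip('.')}. {prompt}"
    return prompt


print(build_prompt(
    {"tier name": None, "monthly price": "number, no currency symbol"},
    context="This is a SaaS pricing page.",
))
```

Keeping the field list in a dict also gives client code a single place to look up which keys to expect in the response.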
## Integration with AI Agents
AI data extraction is a natural fit for autonomous agents. Here's an agent setup using LangChain:
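```python
import requests
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

API_KEY = "your-api-key"

@tool
def extract_web_data(url: str, fields: str) -> dict:
    """Extract structured data from any webpage.

    Args:
        url: The webpage URL
        fields: Description of what data to extract
    """
    response = requests.post(
        "https://api.mantisapi.com/extract",
        headers={"x-api-key": API_KEY},
        json={"url": url, "prompt": f"Extract: {fields}"},
    )
    return response.json()

# Agent wiring: the model name and system prompt are placeholders, swap in your own
prompt = ChatPromptTemplate.from_messages([
    ("system", "You research websites with the extract_web_data tool."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])
tools = [extract_web_data]
agent = create_openai_tools_agent(ChatOpenAI(model="gpt-4o-mini"), tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)

# Now your agent can extract data from any site
agent_executor.invoke({
    "input": "Compare pricing between Notion, Coda, and Confluence. "
             "Visit each pricing page and extract their tiers."
})
```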
## Getting Started
WebPerception API makes AI data extraction simple:
1. **Sign up free** at [mantisapi.com](https://mantisapi.com) — 100 calls/month, no credit card
2. **Get your API key** from the dashboard
3. **Call `/extract`** with a URL and a description of what you want
4. **Get structured JSON** back — every time
Stop writing brittle scrapers. Start extracting data with AI. [Get your free API key →](https://mantisapi.com)
---
| Plan | Calls/Month | Price |
|------|-------------|-------|
| **Free** | 100 | $0 |
| **Starter** | 5,000 | $29/mo |
| **Pro** | 25,000 | $99/mo |
| **Scale** | 100,000 | $299/mo |
*Have a unique extraction use case? Email us at hello@mantisapi.com — we love hearing what developers build.*