Structured Data Extraction with AI: Turn Any Webpage into Clean JSON
You need product prices from an e-commerce site. Job listings from a career page. Contact info from a directory. The data is right there on the webpage, but getting it into a clean, structured format? That's where things fall apart.
Traditional scraping with CSS selectors and regex is brittle. One layout change and your entire pipeline breaks. AI-powered extraction changes the game: describe what you want in plain English, and get back clean JSON every time.
Why Traditional Extraction Fails
Consider extracting product data from an e-commerce page. The traditional approach:
```python
# Brittle CSS selector approach
price = soup.select_one('.price-box .special-price .price')
# What if the class changes to 'product-price'?
# What if there's a sale price AND a regular price?
# What if the price is inside a shadow DOM component?

title = soup.select_one('h1.product-name')
# What if it's an h2? What if there's no class?
```
Problems with this approach:
- Fragile selectors: A single class name change breaks everything
- Site-specific code: Every website needs custom extraction logic
- Dynamic content: JavaScript-rendered pages require headless browsers
- Unstructured variations: "$29.99", "29.99 USD", and "From $29" all need different parsing
- Maintenance burden: 50 scrapers means 50 things to fix when sites update
The AI Extraction Approach
AI extraction replaces brittle selectors with semantic understanding. Instead of telling the computer where the data is, you tell it what the data is:
```python
import requests

# AI extraction approach
response = requests.post("https://api.mantisapi.com/v1/extract", json={
    "url": "https://example.com/product/laptop-pro",
    "prompt": "Extract the product name, current price, original price, rating, and number of reviews",
    "schema": {
        "product_name": "string",
        "current_price": "number",
        "original_price": "number or null",
        "rating": "number",
        "review_count": "integer"
    }
}, headers={"Authorization": "Bearer sk_live_your_key"})
```
The result:
```json
{
  "product_name": "ProBook Laptop 16\" M4 Pro",
  "current_price": 1999.00,
  "original_price": 2499.00,
  "rating": 4.7,
  "review_count": 1283
}
```
The same code works on Amazon, Best Buy, Walmart, or any other e-commerce site. No selectors to update. No regex to maintain.
Python Implementation
Basic Extraction
```python
import requests
from pydantic import BaseModel
from typing import Optional

API_KEY = "sk_live_your_key"
API_BASE = "https://api.mantisapi.com/v1"

class Product(BaseModel):
    name: str
    price: float
    currency: str
    in_stock: bool
    description: str
    rating: Optional[float] = None
    review_count: Optional[int] = None

def extract_product(url: str) -> Product:
    """Extract structured product data from any e-commerce URL."""
    response = requests.post(f"{API_BASE}/extract", json={
        "url": url,
        "prompt": "Extract product details: name, price (as number), "
                  "currency code, stock availability, short description, "
                  "star rating, and number of reviews",
        "schema": Product.model_json_schema()
    }, headers={"Authorization": f"Bearer {API_KEY}"})
    data = response.json()
    return Product(**data["extracted"])

# Works on ANY e-commerce site
product = extract_product("https://store.example.com/widget-pro")
print(f"{product.name}: {product.price} {product.currency}")
print(f"Rating: {product.rating}/5 ({product.review_count} reviews)")
print(f"In stock: {product.in_stock}")
```
Batch Extraction
```python
import asyncio
import aiohttp

async def extract_many(urls: list[str], prompt: str, schema: dict) -> list[dict]:
    """Extract structured data from multiple URLs concurrently."""
    async with aiohttp.ClientSession() as session:
        tasks = [extract_one(session, url, prompt, schema) for url in urls]
        return await asyncio.gather(*tasks)

async def extract_one(session, url, prompt, schema):
    async with session.post(f"{API_BASE}/extract", json={
        "url": url, "prompt": prompt, "schema": schema
    }, headers={"Authorization": f"Bearer {API_KEY}"}) as resp:
        data = await resp.json()
        return {"url": url, "data": data.get("extracted", {})}

# Extract job listings from multiple career pages
urls = [
    "https://company-a.com/careers",
    "https://company-b.com/jobs",
    "https://company-c.com/openings",
]
results = asyncio.run(extract_many(
    urls,
    prompt="Extract all job listings with title, department, location, and salary range",
    schema={
        "jobs": [{
            "title": "string",
            "department": "string",
            "location": "string",
            "salary_range": "string or null"
        }]
    }
))
```
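Note that `asyncio.gather` fires every request at once. Against a rate-limited API you will likely want a concurrency cap; here is a generic sketch using `asyncio.Semaphore` (the limit of 5 is an arbitrary assumption, not a documented API limit):

```python
import asyncio

async def gather_limited(coros, limit: int = 5):
    """Run awaitables concurrently, but at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run(c) for c in coros))

# Usage with the batch extractor above (inside the session block):
# results = await gather_limited(
#     [extract_one(session, u, prompt, schema) for u in urls], limit=5)
```

The semaphore only admits `limit` coroutines into their request at a time, so bursts stay within whatever rate limit your plan enforces.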
Node.js / TypeScript Implementation
```typescript
import { z } from "zod";

const API_KEY = process.env.MANTIS_API_KEY!;

// Define your schema with Zod
const ProductSchema = z.object({
  name: z.string(),
  price: z.number(),
  currency: z.string(),
  in_stock: z.boolean(),
  description: z.string(),
  rating: z.number().nullable(),
  review_count: z.number().int().nullable(),
});

type Product = z.infer<typeof ProductSchema>;

async function extractProduct(url: string): Promise<Product> {
  const response = await fetch("https://api.mantisapi.com/v1/extract", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      prompt:
        "Extract product details: name, price as number, " +
        "currency code, stock status, short description, " +
        "star rating, review count",
      schema: {
        name: "string",
        price: "number",
        currency: "string",
        in_stock: "boolean",
        description: "string",
        rating: "number or null",
        review_count: "integer or null",
      },
    }),
  });
  const data = await response.json();
  return ProductSchema.parse(data.extracted);
}

// Extract from any site
const product = await extractProduct("https://store.example.com/item/123");
console.log(`${product.name}: $${product.price}`);
```
Real-World Use Cases
1. Product Catalog Aggregation
Build a product comparison engine by extracting structured data from multiple retailers. One extraction prompt works across all of them, with no per-site maintenance.
```python
prompt = """Extract all products on this page:
- Product name
- Price (number only, no currency symbol)
- Image URL
- Product URL
- Availability (in_stock boolean)
- Brand name"""
```
2. Job Board Scraping
Monitor career pages for new openings. AI understands job listings regardless of how each company formats them.
```python
prompt = """Extract all job listings:
- Job title
- Department
- Location (city, state/country)
- Employment type (full-time, part-time, contract)
- Posted date
- Salary range if listed"""
```
3. Real Estate Data
Extract property listings with prices, square footage, bedrooms, and amenities from any real estate site: Zillow, Realtor.com, local MLS sites, international portals.
4. News Article Parsing
Extract headline, author, publication date, summary, and key entities from any news article. Perfect for building media monitoring pipelines.
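Following the pattern from the earlier use cases, a news-article prompt and schema might look like this (the field names are illustrative, not a fixed API contract):

```python
prompt = """Extract the article:
- Headline
- Author name(s)
- Publication date (ISO 8601)
- One-sentence summary
- Key entities mentioned (people, companies, places)"""

schema = {
    "headline": "string",
    "authors": ["string"],       # array: articles can have multiple bylines
    "published_at": "string or null",
    "summary": "string",
    "entities": ["string"],
}
```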
Traditional vs AI Extraction
| Factor | CSS Selectors / Regex | AI Extraction |
|---|---|---|
| Setup time per site | Hours | Minutes |
| Maintenance | Constant | Near zero |
| Cross-site portability | None | Full |
| Handles layout changes | No | Yes |
| Unstructured text parsing | Regex hell | Native |
| Speed per page | ~100ms | ~2-3s |
| Cost per page | Compute only | ~$0.004 |
| Accuracy on messy HTML | Low | High |
When to use AI extraction: When you're scraping multiple sites, dealing with frequently changing layouts, or need to extract meaning (not just text) from pages.
When to use traditional: When you're scraping one stable site at massive scale (millions of pages) and need sub-second latency.
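The two approaches are not mutually exclusive. A common hybrid, sketched here with injectable functions so it stays generic: try the cheap selector path first and pay for AI extraction only when it comes up empty.

```python
def extract_with_fallback(html: str, fast_parse, ai_extract):
    """Try cheap selectors first; fall back to AI extraction on a miss."""
    try:
        result = fast_parse(html)
        if result is not None:
            return result, "selectors"
    except Exception:
        pass  # broken selector behaves like a miss
    return ai_extract(html), "ai"

# Stubs standing in for a real selector parser and a real API call:
data, method = extract_with_fallback(
    "<html>...</html>",
    fast_parse=lambda html: None,             # selectors found nothing
    ai_extract=lambda html: {"price": 9.99},  # AI picks up the slack
)
```

This keeps the per-page cost near zero while a site's layout is stable, and only spends AI calls on the pages where the selectors break.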
Schema Design Tips
- Be specific in your prompt: "Extract the sale price as a number without currency symbols" beats "get the price"
- Use nullable fields: Not every page will have every field. Use `null` defaults for optional data.
- Request arrays for lists: When a page has multiple items, define the schema as an array of objects
- Include type hints: "number", "boolean", "string", "integer" help the AI format output correctly
- Test with edge cases: Try pages with missing data, different languages, and unusual layouts
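Putting the tips together, a schema for a listings page might look like the following. The inline string-type convention follows the earlier examples; treat the exact field names as illustrative:

```python
schema = {
    "products": [{                            # array of objects: the page lists many items
        "name": "string",
        "sale_price": "number",               # explicit type hint, no currency symbol
        "original_price": "number or null",   # nullable: not every item is on sale
        "brand": "string or null",
        "in_stock": "boolean",
    }]
}
```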
Cost at Scale
| Use Case | Pages/Day | Monthly Cost (Pro) |
|---|---|---|
| Price monitoring (50 products) | 50 | $6 |
| Job board monitoring (20 sites) | 20 | $2.40 |
| News aggregation (100 articles) | 100 | $12 |
| Lead generation (200 companies) | 200 | $24 |
| Full catalog sync (500 products) | 500 | $60 |
All well within the Pro plan's 25,000 monthly calls at $99/month.
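The table follows from simple arithmetic, assuming the ~$0.004/page figure from the comparison table and a 30-day month:

```python
COST_PER_PAGE = 0.004  # approximate per-page rate from the comparison table

def monthly_cost(pages_per_day: int, days: int = 30) -> float:
    """Estimated monthly extraction spend, rounded to cents."""
    return round(pages_per_day * days * COST_PER_PAGE, 2)

monthly_cost(50)   # 6.0  -- price monitoring
monthly_cost(500)  # 60.0 -- full catalog sync
```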
Start Extracting Structured Data
100 API calls/month free. No credit card required. Clean JSON from any webpage in seconds.
Get Your API Key →