Structured Data Extraction with AI: Turn Any Webpage into Clean JSON
You need product prices from an e-commerce site. Job listings from a career page. Contact info from a directory. The data is right there on the webpage, but getting it into a clean, structured format? That's where things fall apart.
Traditional scraping with CSS selectors and regex is brittle. One layout change and your entire pipeline breaks. AI-powered extraction changes the game: describe what you want in plain English, and get back clean JSON every time.
Why Traditional Extraction Fails
Consider extracting product data from an e-commerce page. The traditional approach:
```python
# Brittle CSS selector approach
price = soup.select_one('.price-box .special-price .price')
# What if the class changes to 'product-price'?
# What if there's a sale price AND a regular price?
# What if the price is inside a shadow DOM component?

title = soup.select_one('h1.product-name')
# What if it's an h2? What if there's no class?
```
Problems with this approach:
- Fragile selectors: A single class name change breaks everything
- Site-specific code: Every website needs custom extraction logic
- Dynamic content: JavaScript-rendered pages require headless browsers
- Unstructured variations: "$29.99", "29.99 USD", and "From $29" all need different parsing
- Maintenance burden: 50 scrapers means 50 things to fix when sites update
The AI Extraction Approach
AI extraction replaces brittle selectors with semantic understanding. Instead of telling the computer where the data is, you tell it what the data is:
```python
import requests

# AI extraction approach
response = requests.post("https://api.mantisapi.com/v1/extract", json={
    "url": "https://example.com/product/laptop-pro",
    "prompt": "Extract the product name, current price, original price, rating, and number of reviews",
    "schema": {
        "product_name": "string",
        "current_price": "number",
        "original_price": "number or null",
        "rating": "number",
        "review_count": "integer"
    }
}, headers={"Authorization": "Bearer sk_live_your_key"})
```
The result:
```json
{
  "product_name": "ProBook Laptop 16\" M4 Pro",
  "current_price": 1999.00,
  "original_price": 2499.00,
  "rating": 4.7,
  "review_count": 1283
}
```
The same code works on Amazon, Best Buy, Walmart, or any other e-commerce site. No selectors to update. No regex to maintain.
Python Implementation
Basic Extraction
```python
import requests
from pydantic import BaseModel
from typing import Optional

API_KEY = "sk_live_your_key"
API_BASE = "https://api.mantisapi.com/v1"

class Product(BaseModel):
    name: str
    price: float
    currency: str
    in_stock: bool
    description: str
    rating: Optional[float] = None
    review_count: Optional[int] = None

def extract_product(url: str) -> Product:
    """Extract structured product data from any e-commerce URL."""
    response = requests.post(f"{API_BASE}/extract", json={
        "url": url,
        "prompt": "Extract product details: name, price (as number), "
                  "currency code, stock availability, short description, "
                  "star rating, and number of reviews",
        "schema": Product.model_json_schema()
    }, headers={"Authorization": f"Bearer {API_KEY}"})
    data = response.json()
    return Product(**data["extracted"])

# Works on ANY e-commerce site
product = extract_product("https://store.example.com/widget-pro")
print(f"{product.name}: {product.price} {product.currency}")
print(f"Rating: {product.rating}/5 ({product.review_count} reviews)")
print(f"In stock: {product.in_stock}")
```
Batch Extraction
```python
import asyncio
import aiohttp

async def extract_many(urls: list[str], prompt: str, schema: dict) -> list[dict]:
    """Extract structured data from multiple URLs concurrently."""
    async with aiohttp.ClientSession() as session:
        tasks = [extract_one(session, url, prompt, schema) for url in urls]
        return await asyncio.gather(*tasks)

async def extract_one(session, url, prompt, schema):
    async with session.post(f"{API_BASE}/extract", json={
        "url": url, "prompt": prompt, "schema": schema
    }, headers={"Authorization": f"Bearer {API_KEY}"}) as resp:
        data = await resp.json()
        return {"url": url, "data": data.get("extracted", {})}

# Extract job listings from multiple career pages
urls = [
    "https://company-a.com/careers",
    "https://company-b.com/jobs",
    "https://company-c.com/openings",
]
results = asyncio.run(extract_many(
    urls,
    prompt="Extract all job listings with title, department, location, and salary range",
    schema={
        "jobs": [{
            "title": "string",
            "department": "string",
            "location": "string",
            "salary_range": "string or null"
        }]
    }
))
```
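Note that `asyncio.gather` fires every request at once. Against a rate-limited API you will likely want a concurrency cap; here is a generic sketch using `asyncio.Semaphore` (the limit of 5 is an arbitrary assumption, not a documented API limit):

```python
import asyncio

async def gather_limited(coros, limit: int = 5):
    """Run awaitables concurrently, but at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run(c) for c in coros))

# Usage with the batch extractor above (inside the session block):
# results = await gather_limited(
#     [extract_one(session, u, prompt, schema) for u in urls], limit=5)
```

The semaphore only admits `limit` coroutines into their request at a time, so bursts stay within whatever rate limit your plan enforces.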
Node.js / TypeScript Implementation
```typescript
import { z } from "zod";

const API_KEY = process.env.MANTIS_API_KEY!;

// Define your schema with Zod
const ProductSchema = z.object({
  name: z.string(),
  price: z.number(),
  currency: z.string(),
  in_stock: z.boolean(),
  description: z.string(),
  rating: z.number().nullable(),
  review_count: z.number().int().nullable(),
});

type Product = z.infer<typeof ProductSchema>;

async function extractProduct(url: string): Promise<Product> {
  const response = await fetch("https://api.mantisapi.com/v1/extract", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      prompt:
        "Extract product details: name, price as number, " +
        "currency code, stock status, short description, " +
        "star rating, review count",
      schema: {
        name: "string",
        price: "number",
        currency: "string",
        in_stock: "boolean",
        description: "string",
        rating: "number or null",
        review_count: "integer or null",
      },
    }),
  });
  const data = await response.json();
  return ProductSchema.parse(data.extracted);
}

// Extract from any site
const product = await extractProduct("https://store.example.com/item/123");
console.log(`${product.name}: $${product.price}`);
```
Real-World Use Cases
1. Product Catalog Aggregation
Build a product comparison engine by extracting structured data from multiple retailers. One extraction prompt works across all of them, with no per-site maintenance.
```python
prompt = """Extract all products on this page:
- Product name
- Price (number only, no currency symbol)
- Image URL
- Product URL
- Availability (in_stock boolean)
- Brand name"""
```
2. Job Board Scraping
Monitor career pages for new openings. AI understands job listings regardless of how each company formats them.
```python
prompt = """Extract all job listings:
- Job title
- Department
- Location (city, state/country)
- Employment type (full-time, part-time, contract)
- Posted date
- Salary range if listed"""
```
3. Real Estate Data
Extract property listings with prices, square footage, bedrooms, and amenities from any real estate site: Zillow, Realtor.com, local MLS sites, international portals.
4. News Article Parsing
Extract headline, author, publication date, summary, and key entities from any news article. Perfect for building media monitoring pipelines.
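Following the pattern from the earlier use cases, a news-article prompt and schema might look like this (the field names are illustrative, not a fixed API contract):

```python
prompt = """Extract the article:
- Headline
- Author name(s)
- Publication date (ISO 8601)
- One-sentence summary
- Key entities mentioned (people, companies, places)"""

schema = {
    "headline": "string",
    "authors": ["string"],       # array: articles can have multiple bylines
    "published_at": "string or null",
    "summary": "string",
    "entities": ["string"],
}
```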
Traditional vs AI Extraction
| Factor | CSS Selectors / Regex | AI Extraction |
|---|---|---|
| Setup time per site | Hours | Minutes |
| Maintenance | Constant | Near zero |
| Cross-site portability | None | Full |
| Handles layout changes | No | Yes |
| Unstructured text parsing | Regex hell | Native |
| Speed per page | ~100ms | ~2-3s |
| Cost per page | Compute only | ~$0.004 |
| Accuracy on messy HTML | Low | High |
When to use AI extraction: When you're scraping multiple sites, dealing with frequently changing layouts, or need to extract meaning (not just text) from pages.
When to use traditional: When you're scraping one stable site at massive scale (millions of pages) and need sub-second latency.
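The two approaches are not mutually exclusive. A common hybrid, sketched here with injectable functions so it stays generic: try the cheap selector path first and pay for AI extraction only when it comes up empty.

```python
def extract_with_fallback(html: str, fast_parse, ai_extract):
    """Try cheap selectors first; fall back to AI extraction on a miss."""
    try:
        result = fast_parse(html)
        if result is not None:
            return result, "selectors"
    except Exception:
        pass  # broken selector behaves like a miss
    return ai_extract(html), "ai"

# Stubs standing in for a real selector parser and a real API call:
data, method = extract_with_fallback(
    "<html>...</html>",
    fast_parse=lambda html: None,             # selectors found nothing
    ai_extract=lambda html: {"price": 9.99},  # AI picks up the slack
)
```

This keeps the per-page cost near zero while a site's layout is stable, and only spends AI calls on the pages where the selectors break.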
Schema Design Tips
- Be specific in your prompt: "Extract the sale price as a number without currency symbols" beats "get the price"
- Use nullable fields: Not every page will have every field. Use `null` defaults for optional data.
- Request arrays for lists: When a page has multiple items, define the schema as an array of objects
- Include type hints: "number", "boolean", "string", "integer" help the AI format output correctly
- Test with edge cases: Try pages with missing data, different languages, and unusual layouts
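Putting the tips together, a schema for a listings page might look like the following. The inline string-type convention follows the earlier examples; treat the exact field names as illustrative:

```python
schema = {
    "products": [{                            # array of objects: the page lists many items
        "name": "string",
        "sale_price": "number",               # explicit type hint, no currency symbol
        "original_price": "number or null",   # nullable: not every item is on sale
        "brand": "string or null",
        "in_stock": "boolean",
    }]
}
```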
Cost at Scale
| Use Case | Pages/Day | Monthly Cost (Pro) |
|---|---|---|
| Price monitoring (50 products) | 50 | $6 |
| Job board monitoring (20 sites) | 20 | $2.40 |
| News aggregation (100 articles) | 100 | $12 |
| Lead generation (200 companies) | 200 | $24 |
| Full catalog sync (500 products) | 500 | $60 |
All well within the Pro plan's 25,000 monthly calls at $99/month.
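The table follows from simple arithmetic, assuming the ~$0.004/page figure from the comparison table and a 30-day month:

```python
COST_PER_PAGE = 0.004  # approximate per-page rate from the comparison table

def monthly_cost(pages_per_day: int, days: int = 30) -> float:
    """Estimated monthly extraction spend, rounded to cents."""
    return round(pages_per_day * days * COST_PER_PAGE, 2)

monthly_cost(50)   # 6.0  -- price monitoring
monthly_cost(500)  # 60.0 -- full catalog sync
```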
Start Extracting Structured Data
100 API calls/month free. No credit card required. Clean JSON from any webpage in seconds.
Get Your API Key →