AI Data Extraction: How to Extract Structured Data from Any Website

March 4, 2026 guide
Every developer has been there. You need data from a website (product prices, company info, review scores), and you end up writing fragile CSS selectors that break every time the site updates.

There's a better way. AI-powered data extraction lets you describe *what* you want in plain English and get back structured JSON: no selectors, no parsing, no maintenance. This guide shows you how with [WebPerception API](https://mantisapi.com).

## The Problem with Traditional Web Scraping

Traditional scraping looks like this:

```python
from bs4 import BeautifulSoup
import requests

html = requests.get("https://example.com/product").text
soup = BeautifulSoup(html, "html.parser")

# Fragile selectors that WILL break
price = soup.select_one(".price-box .special-price .price").text
title = soup.select_one("h1.product-title span").text
rating = soup.select_one(".rating-result span").text
```

Problems:

- **Breaks when HTML changes**: one class name update and your scraper dies
- **No JavaScript rendering**: SPAs and dynamic content are invisible
- **Manual for every site**: each website needs custom selectors
- **No semantic understanding**: the scraper doesn't "know" what a price is

## AI Data Extraction: The Modern Approach

With AI extraction, you describe what you want:

```python
import requests

response = requests.post(
    "https://api.mantisapi.com/extract",
    headers={"x-api-key": "your-api-key"},
    json={
        "url": "https://example.com/product",
        "prompt": "Extract: product name, price, rating, number of reviews, availability"
    }
)

data = response.json()
# {
#   "product_name": "Sony WH-1000XM5 Wireless Headphones",
#   "price": "$279.99",
#   "rating": "4.7/5",
#   "number_of_reviews": 12847,
#   "availability": "In Stock"
# }
```

**No selectors. No parsing. No maintenance.** The AI understands the page like a human would.

## How It Works

WebPerception API's `/extract` endpoint:

1. **Renders the full page**: JavaScript, dynamic content, everything
2. **Understands the content**: AI reads the page like a human
3. **Extracts what you asked for**: returns structured JSON matching your prompt
4. **Handles edge cases**: missing fields return `null` instead of crashing

The AI model sees the rendered page content and extracts the fields you specified. It works across different sites without any site-specific configuration.

## Use Cases

In the snippets below, `extract(url, prompt)` stands for a thin wrapper around the `/extract` request shown above.

### Product Data Extraction

```python
# The same prompt style works across e-commerce sites
result = extract("https://amazon.com/dp/B0BX2L8PDS",
    "Extract: product name, current price, original price, discount percentage, "
    "rating, review count, bullet point features, availability, seller name")
```

### Job Listing Extraction

```python
result = extract("https://company.com/careers/senior-engineer",
    "Extract: job title, location, salary range, required skills, "
    "experience required, benefits, team name, remote policy")
```

### Real Estate Data

```python
result = extract("https://zillow.com/homedetails/123-main-st",
    "Extract: address, price, bedrooms, bathrooms, square footage, "
    "year built, lot size, property type, HOA fees, price history")
```

### Company Information

```python
result = extract("https://techstartup.com/about",
    "Extract: company name, founding year, founders, headquarters, "
    "employee count, funding raised, investors, mission statement")
```

### Review Aggregation

```python
result = extract("https://g2.com/products/some-tool/reviews",
    "Extract the first 5 reviews with: reviewer name, rating, date, "
    "pros, cons, and summary")
```

## Batch Extraction

Need data from multiple pages? Run the requests in parallel:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

API_KEY = "your-api-key"
BASE_URL = "https://api.mantisapi.com"

def extract_product(url):
    """Call /extract for one URL and pair the result with its source URL."""
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"x-api-key": API_KEY},
        json={
            "url": url,
            "prompt": "Extract: product name, price, rating, review count"
        },
        timeout=60,  # avoid hanging a worker thread on a slow page
    )
    return {"url": url, "data": response.json()}

urls = [
    "https://amazon.com/dp/PRODUCT1",
    "https://amazon.com/dp/PRODUCT2",
    "https://amazon.com/dp/PRODUCT3",
]

# Extract in parallel; executor.map returns results in the same order as urls
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(extract_product, urls))

for r in results:
    print(f"{r['url']}: {r['data']}")
```

## AI Extraction vs. CSS Selectors

| Aspect | AI Extraction | CSS Selectors |
|--------|:-:|:-:|
| Setup time | Minutes | Hours per site |
| Maintenance | Zero | Constant |
| Cross-site | Works everywhere | Site-specific |
| Dynamic content | ✅ | Usually ❌ |
| Semantic understanding | ✅ | ❌ |
| Structured output | JSON | Raw strings |
| Edge cases | Handled gracefully | Crashes |

## Tips for Better Extraction

### 1. Be Specific in Your Prompts

```python
# ❌ Vague
"Extract product info"

# ✅ Specific
"Extract: product name, price in USD, star rating out of 5, number of reviews, whether it's in stock (true/false), and the first 3 bullet point features"
```

### 2. Specify Data Types

```python
"Extract: price (number, no currency symbol), rating (decimal), review_count (integer), in_stock (boolean)"
```

### 3. Handle Multiple Items

```python
"Extract ALL products on this page. For each product: name, price, rating, and URL"
```

### 4. Add Context

```python
"This is a SaaS pricing page. Extract each pricing tier with: tier name, monthly price, annual price, features list, and any limits"
```

## Integration with AI Agents

AI data extraction is a natural fit for autonomous agents. Here's an agent setup using LangChain:

```python
import requests
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

API_KEY = "your-api-key"

@tool
def extract_web_data(url: str, fields: str) -> dict:
    """Extract structured data from any webpage.

    Args:
        url: The webpage URL
        fields: Description of what data to extract
    """
    response = requests.post(
        "https://api.mantisapi.com/extract",
        headers={"x-api-key": API_KEY},
        json={"url": url, "prompt": f"Extract: {fields}"}
    )
    return response.json()

# Wire the tool into an agent. The model name and prompt wording are placeholders;
# this assumes a LangChain release that still ships AgentExecutor.
tools = [extract_web_data]
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a research assistant that extracts data from web pages."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])
agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)

# Now your agent can extract data from any site
agent_executor.invoke({
    "input": "Compare pricing between Notion, Coda, and Confluence. "
             "Visit each pricing page and extract their tiers."
})
```

## Getting Started

WebPerception API makes AI data extraction simple:

1. **Sign up free** at [mantisapi.com](https://mantisapi.com) (100 calls/month, no credit card)
2. **Get your API key** from the dashboard
3. **Call `/extract`** with a URL and a description of what you want
4. **Get structured JSON** back, every time

Stop writing brittle scrapers. Start extracting data with AI.

[Get your free API key →](https://mantisapi.com)

---

| Plan | Calls/Month | Price |
|------|-------------|-------|
| **Free** | 100 | $0 |
| **Starter** | 5,000 | $29/mo |
| **Pro** | 25,000 | $99/mo |
| **Scale** | 100,000 | $299/mo |

*Have a unique extraction use case? Email us at hello@mantisapi.com. We love hearing what developers build.*
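
One practical follow-up to the "Specify Data Types" tip: the sample response earlier in this guide returns `price` as `"$279.99"` and `rating` as `"4.7/5"`, so it can help to normalize fields on the client before storing them. The sketch below is a minimal illustration, not part of the WebPerception API. The helper names (`parse_price`, `parse_rating`, `normalize_product`) are hypothetical, and the field names simply mirror that sample response, so adjust them to whatever your prompt actually returns.

```python
import re
from typing import Any, Optional


def parse_price(value: Any) -> Optional[float]:
    """Turn values like "$1,279.99" or 279.99 into a float; None if missing."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Take the first run of digits (with optional thousands commas and decimals).
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(value))
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_rating(value: Any) -> Optional[float]:
    """Turn values like "4.7/5" or 4.7 into a float; None if missing."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # The first number is treated as the score; a trailing "/5" is ignored.
    match = re.search(r"\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else None


def normalize_product(raw: dict) -> dict:
    """Return a record with predictable types; absent fields stay None."""
    return {
        "product_name": raw.get("product_name"),
        "price": parse_price(raw.get("price")),
        "rating": parse_rating(raw.get("rating")),
        "number_of_reviews": raw.get("number_of_reviews"),
        "availability": raw.get("availability"),
    }
```

Passing the sample response through `normalize_product` leaves the name, review count, and availability untouched while converting the price and rating to plain numbers. An empty response yields a record whose fields are all `None`, which matches the "missing fields return `null`" behavior described above.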
