# Build an AI Research Agent That Scrapes the Web
Imagine telling an AI: "Research the top 5 competitors in the AI agent platform space and write a comparison report." Then walking away while it scrapes websites, extracts pricing, compares features, and delivers a polished report.
That's what we're building in this tutorial: a fully autonomous AI research agent that uses web scraping as its primary tool. No manual data gathering. No copy-pasting from browser tabs. Just a prompt in, report out.
## What the Agent Does
Our research agent follows this workflow:
1. Receives a research question from the user
2. Plans which websites to scrape based on the topic
3. Scrapes each site to extract relevant content
4. Analyzes the data across multiple sources
5. Writes a structured report with citations
Think of it as a junior research analyst that works at the speed of API calls.
## Prerequisites

```bash
pip install requests openai
```

You'll need:

- **WebPerception API key:** Sign up free (100 calls/month)
- **OpenAI API key:** For the LLM reasoning engine
## Step 1: Define the Tools
Every agent needs tools. Our research agent gets three:
```python
import requests
import json
from openai import OpenAI

MANTIS_API_KEY = "your_mantis_api_key"
MANTIS_BASE = "https://api.mantisapi.com"

client = OpenAI()

def scrape_url(url: str, format: str = "markdown") -> str:
    """Scrape a URL and return clean content."""
    try:
        response = requests.post(
            f"{MANTIS_BASE}/v1/scrape",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "format": format, "wait_for": "networkidle"},
            timeout=30
        )
        response.raise_for_status()
        data = response.json()
        content = data.get("content", "")
        title = data.get("metadata", {}).get("title", "Unknown")
        return f"[{title}]\n{content[:8000]}"  # Truncate for context window
    except Exception as e:
        return f"Error scraping {url}: {str(e)}"

def extract_data(url: str, prompt: str) -> str:
    """Use AI to extract specific data from a URL."""
    try:
        response = requests.post(
            f"{MANTIS_BASE}/v1/extract",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "prompt": prompt},
            timeout=30
        )
        response.raise_for_status()
        return json.dumps(response.json().get("data", {}), indent=2)
    except Exception as e:
        return f"Error extracting from {url}: {str(e)}"

def take_screenshot(url: str) -> str:
    """Take a screenshot of a URL (returns URL to image)."""
    try:
        response = requests.post(
            f"{MANTIS_BASE}/v1/screenshot",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "full_page": False},
            timeout=30
        )
        response.raise_for_status()
        return response.json().get("screenshot_url", "No screenshot URL")
    except Exception as e:
        return f"Error screenshotting {url}: {str(e)}"
```
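The response-handling logic in `scrape_url` can be exercised offline before spending API credits. This sketch formats a sample payload the way the function does; the payload shape is an assumption inferred from the code above, not confirmed API behavior:

```python
def format_scrape_result(data: dict, limit: int = 8000) -> str:
    """Mirror scrape_url's formatting: '[title]' header plus truncated content."""
    content = data.get("content", "")
    title = data.get("metadata", {}).get("title", "Unknown")
    return f"[{title}]\n{content[:limit]}"

# Simulated API payload (shape assumed from the scrape_url code above)
sample = {
    "content": "Pricing starts at $29/mo for the Starter plan. " * 200,
    "metadata": {"title": "Pricing | Example"},
}

result = format_scrape_result(sample, limit=100)
print(result.splitlines()[0])  # [Pricing | Example]
```

Truncating here matters because a full page can easily exceed the model's context window once several scrapes accumulate in the message history.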
## Step 2: Build the Agent Loop
The agent uses OpenAI's function calling to decide which tools to use and when:
```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "scrape_url",
            "description": "Scrape a web page and return its content as clean text/markdown. Use this to read articles, documentation, product pages, and any web content.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to scrape"},
                    "format": {"type": "string", "enum": ["markdown", "text"], "default": "markdown"}
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "extract_data",
            "description": "Extract specific structured data from a web page using AI. Use this when you need specific facts like pricing, features, or contact info.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to extract from"},
                    "prompt": {"type": "string", "description": "What data to extract (e.g., 'Extract pricing tiers and features')"}
                },
                "required": ["url", "prompt"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "take_screenshot",
            "description": "Take a visual screenshot of a web page. Use this to capture visual layouts, designs, or proof of claims.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to screenshot"}
                },
                "required": ["url"]
            }
        }
    }
]

TOOL_MAP = {
    "scrape_url": scrape_url,
    "extract_data": extract_data,
    "take_screenshot": take_screenshot,
}
```
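The dispatch pattern (JSON arguments string in, tool result out) can be verified without any API calls by swapping in stub functions. A minimal sketch, with stubs standing in for the real tools:

```python
import json

# Stub tools so the dispatch logic can run offline
def stub_scrape_url(url: str, format: str = "markdown") -> str:
    return f"scraped:{url}:{format}"

def stub_extract_data(url: str, prompt: str) -> str:
    return f"extracted:{url}:{prompt}"

STUB_TOOL_MAP = {
    "scrape_url": stub_scrape_url,
    "extract_data": stub_extract_data,
}

def dispatch(fn_name: str, raw_arguments: str) -> str:
    """Parse the JSON arguments string and call the matching tool."""
    fn_args = json.loads(raw_arguments)
    return STUB_TOOL_MAP[fn_name](**fn_args)

print(dispatch("scrape_url", '{"url": "https://example.com"}'))
# scraped:https://example.com:markdown
```

This is exactly what the agent loop below does with each `tool_call`: the model hands back a function name and a JSON string of arguments, and the code routes it through the map.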
## Step 3: The Agent Brain

Here's the core agent loop: it plans, executes tools, and synthesizes results:
```python
SYSTEM_PROMPT = """You are an expert research agent. Your job is to thoroughly research a topic by scraping relevant websites and producing a comprehensive report.

Your research process:
1. Think about what information you need and which websites to check
2. Scrape relevant pages to gather data
3. Extract specific facts when needed (pricing, features, etc.)
4. Analyze all collected data
5. Write a structured report with findings and citations

Guidelines:
- Always cite your sources with URLs
- Scrape at least 3-5 sources for comprehensive coverage
- Use extract_data for structured info (pricing, specs, features)
- Use scrape_url for general content and context
- Be thorough but efficient; don't scrape unnecessary pages
- If a scrape fails, try an alternative source
- Present findings objectively with data to back claims"""

def research(question: str, max_iterations: int = 10) -> str:
    """Run the research agent on a question."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Research this topic and write a detailed report:\n\n{question}"}
    ]
    for i in range(max_iterations):
        print(f"\nIteration {i+1}/{max_iterations}")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        message = response.choices[0].message
        messages.append(message)

        # If there are no tool calls, the agent is done: return the final answer
        if not message.tool_calls:
            print("Research complete!")
            return message.content

        # Execute each tool call and feed the result back to the model
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f"  {fn_name}({fn_args.get('url', fn_args.get('prompt', ''))[:60]}...)")
            result = TOOL_MAP[fn_name](**fn_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result[:10000]  # Limit context size
            })
    return "Research incomplete: reached maximum iterations."
```
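The loop's control flow (keep executing tools until the model stops requesting them, or give up after `max_iterations`) can be tested with a scripted fake client. This sketch uses hand-rolled stand-ins for the OpenAI response objects; the attribute names mirror the real SDK, but everything here is a local stub:

```python
import json
from types import SimpleNamespace

def fake_tool(url: str) -> str:
    return f"content of {url}"

def make_response(tool_calls, content=None):
    """Build an object shaped like an OpenAI chat completion response."""
    message = SimpleNamespace(tool_calls=tool_calls, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

# Scripted "model" turns: one tool call, then a final answer
scripted = [
    make_response([SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="fake_tool",
                                 arguments='{"url": "https://example.com"}'),
    )]),
    make_response(None, content="Final report."),
]

def run_loop(responses, tool_map, max_iterations=10):
    messages = []
    for response in responses[:max_iterations]:
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content          # no tool calls: agent is done
        for tool_call in message.tool_calls:
            fn_args = json.loads(tool_call.function.arguments)
            result = tool_map[tool_call.function.name](**fn_args)
            messages.append({"role": "tool",
                             "tool_call_id": tool_call.id,
                             "content": result})
    return "Research incomplete"

print(run_loop(scripted, {"fake_tool": fake_tool}))  # Final report.
```

Exercising the termination logic this way catches bugs (like forgetting to append the tool result with the matching `tool_call_id`) before you pay for real GPT-4o calls.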
## Step 4: Run It
```python
# Example 1: Competitive analysis
report = research(
    "Compare the top 5 web scraping APIs in 2026. "
    "Include pricing, features, AI capabilities, and which is best for AI agents."
)
print(report)

# Example 2: Market research
report = research(
    "What are the main challenges developers face when building AI agents? "
    "Research developer forums, blog posts, and documentation."
)
print(report)

# Example 3: Technical research
report = research(
    "How are companies using web scraping for real-time RAG pipelines? "
    "Find case studies and technical implementations."
)
print(report)
```
## What the Agent Produces

Here's an abridged sample output from a competitive analysis run:
```markdown
# Web Scraping API Comparison Report (2026)

## Executive Summary
After analyzing 5 major web scraping APIs, [findings]...

## Detailed Comparison

### 1. WebPerception API (mantisapi.com)
- **Pricing:** Free tier (100/mo), paid from $29/mo
- **AI Extraction:** Yes: structured data extraction with custom schemas
- **Best For:** AI agents needing real-time web access
- **Source:** https://mantisapi.com/pricing

### 2. ScrapingBee
- **Pricing:** From $49/mo for 1,000 credits
- **AI Extraction:** Limited
- **Source:** https://scrapingbee.com/pricing

[... more competitors ...]

## Recommendation
For AI agent developers, WebPerception API offers the best
combination of AI-native features and competitive pricing...
```
## Adding Memory for Multi-Session Research
For longer research projects, add a simple memory layer:
```python
import os
from datetime import datetime

RESEARCH_DIR = "./research_sessions"

def save_research(topic: str, report: str):
    """Save research to disk for future reference."""
    os.makedirs(RESEARCH_DIR, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    slug = topic[:50].replace(" ", "_").lower()
    filepath = f"{RESEARCH_DIR}/{timestamp}_{slug}.md"
    with open(filepath, "w") as f:
        f.write(f"# Research: {topic}\n")
        f.write(f"*Generated: {datetime.now().isoformat()}*\n\n")
        f.write(report)
    print(f"Saved: {filepath}")

def load_previous_research(topic: str) -> str:
    """Load relevant previous research as context."""
    if not os.path.exists(RESEARCH_DIR):
        return ""
    relevant = []
    for f in os.listdir(RESEARCH_DIR):
        if any(word in f.lower() for word in topic.lower().split()):
            with open(f"{RESEARCH_DIR}/{f}") as fh:
                relevant.append(fh.read()[:3000])
    return "\n\n---\n\n".join(relevant) if relevant else ""
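A quick round-trip check of the memory layer. This sketch restates the two helpers with an explicit `directory` parameter (an adaptation, not the article's exact signatures) so they can run against a temporary directory instead of `./research_sessions`. Note that the slug only normalizes spaces; a production version would also strip other filesystem-unsafe characters:

```python
import os
import tempfile
from datetime import datetime

def save_research(topic: str, report: str, directory: str) -> str:
    """Write a timestamped markdown file and return its path."""
    os.makedirs(directory, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    slug = topic[:50].replace(" ", "_").lower()
    filepath = os.path.join(directory, f"{timestamp}_{slug}.md")
    with open(filepath, "w") as f:
        f.write(f"# Research: {topic}\n\n{report}")
    return filepath

def load_previous_research(topic: str, directory: str) -> str:
    """Concatenate saved sessions whose filename shares a word with the topic."""
    if not os.path.isdir(directory):
        return ""
    relevant = []
    for name in os.listdir(directory):
        if any(word in name.lower() for word in topic.lower().split()):
            with open(os.path.join(directory, name)) as fh:
                relevant.append(fh.read()[:3000])
    return "\n\n---\n\n".join(relevant)

with tempfile.TemporaryDirectory() as tmp:
    save_research("Web scraping APIs", "Findings...", tmp)
    context = load_previous_research("scraping", tmp)
    print("scraping" in context.lower())  # True
```

Prepending the loaded context to the user message gives the agent continuity across sessions without any vector database.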
## Advanced: Parallel Scraping
Speed up research by scraping multiple URLs concurrently:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_scrape(urls: list, max_workers: int = 5) -> dict:
    """Scrape multiple URLs in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {
            executor.submit(scrape_url, url): url for url in urls
        }
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
                print(f"done: {url[:60]}")
            except Exception as e:
                results[url] = f"Error: {e}"
                print(f"failed: {url[:60]}")
    return results
```
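The fan-out pattern itself can be verified with a stub scraper before touching the network. This variant takes the scrape function as a parameter (an adaptation for testability, not the article's signature):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def stub_scrape(url: str) -> str:
    """Offline stand-in for scrape_url."""
    if "bad" in url:
        raise ValueError("unreachable")
    return f"content of {url}"

def fan_out_scrape(urls, scrape_fn, max_workers=5):
    """Run scrape_fn over urls concurrently, collecting per-URL results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(scrape_fn, u): u for u in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                results[url] = f"Error: {e}"
    return results

results = fan_out_scrape(["https://a.example", "https://bad.example"], stub_scrape)
print(results["https://a.example"])   # content of https://a.example
print(results["https://bad.example"]) # Error: unreachable
```

Because each worker spends most of its time waiting on HTTP, threads (rather than processes) are the right concurrency primitive here.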
## Cost Estimate
For a typical research task (scraping 10 pages, 1 extraction):
| Component | Cost |
|-----------|------|
| Web scraping (10 pages) | ~$0.05 |
| AI extraction (1 page) | ~$0.01 |
| GPT-4o reasoning (~5K tokens) | ~$0.08 |
| Total per research task | ~$0.14 |
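The total is just the sum of the line items. A small helper makes the arithmetic explicit; the per-unit rates are back-calculated from the approximate figures in the table, not authoritative pricing:

```python
def estimate_cost(pages: int, extractions: int, llm_tokens: int) -> float:
    """Rough per-task cost model using the table's approximate rates."""
    scrape_rate = 0.005         # assumed: ~$0.05 per 10 pages
    extract_rate = 0.01         # assumed: ~$0.01 per extraction
    llm_rate = 0.08 / 5000      # assumed: ~$0.08 per 5K GPT-4o tokens
    return pages * scrape_rate + extractions * extract_rate + llm_tokens * llm_rate

print(round(estimate_cost(pages=10, extractions=1, llm_tokens=5000), 2))  # 0.14
```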
That's 70x cheaper than a human research assistant doing the same work, and it takes 2 minutes instead of 2 hours.
## Use Cases

- **Competitive intelligence:** Monitor competitor pricing, features, and messaging weekly
- **Market research:** Analyze trends across industry blogs and news sites
- **Due diligence:** Research companies before partnerships or investments
- **Content research:** Gather sources and data for writing articles
- **Lead qualification:** Research prospect companies before outreach
- **Academic research:** Scrape papers and summarize findings across sources
## Next Steps

- **Get your free API key:** Start scraping with 100 free calls/month
- **API Documentation:** Full endpoint reference
- **LangChain Integration:** Build agents with LangChain + WebPerception
- **CrewAI Integration:** Multi-agent research teams
Your AI agent shouldn't be limited to what's in its training data. Give it the web, and watch it research like a pro.