# Build an AI Research Agent That Scrapes the Web
Imagine telling an AI: "Research the top 5 competitors in the AI agent platform space and write a comparison report." Then walking away while it scrapes websites, extracts pricing, compares features, and delivers a polished report.
That's what we're building in this tutorial: a fully autonomous AI research agent that uses web scraping as its primary tool. No manual data gathering. No copy-pasting from browser tabs. Just a prompt in, report out.
## What the Agent Does
Our research agent follows this workflow:
1. Receives a research question from the user
2. Plans which websites to scrape based on the topic
3. Scrapes each site to extract relevant content
4. Analyzes the data across multiple sources
5. Writes a structured report with citations
Think of it as a junior research analyst that works at the speed of API calls.
## Prerequisites

```bash
pip install requests openai
```

You'll need:

- **WebPerception API key:** Sign up free (100 calls/month)
- **OpenAI API key:** For the LLM reasoning engine
## Step 1: Define the Tools
Every agent needs tools. Our research agent gets three:
```python
import requests
import json
from openai import OpenAI

MANTIS_API_KEY = "your_mantis_api_key"
MANTIS_BASE = "https://api.mantisapi.com"

client = OpenAI()

def scrape_url(url: str, format: str = "markdown") -> str:
    """Scrape a URL and return clean content."""
    try:
        response = requests.post(
            f"{MANTIS_BASE}/v1/scrape",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "format": format, "wait_for": "networkidle"},
            timeout=30
        )
        response.raise_for_status()
        data = response.json()
        content = data.get("content", "")
        title = data.get("metadata", {}).get("title", "Unknown")
        return f"[{title}]\n{content[:8000]}"  # Truncate for context window
    except Exception as e:
        return f"Error scraping {url}: {str(e)}"

def extract_data(url: str, prompt: str) -> str:
    """Use AI to extract specific data from a URL."""
    try:
        response = requests.post(
            f"{MANTIS_BASE}/v1/extract",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "prompt": prompt},
            timeout=30
        )
        response.raise_for_status()
        return json.dumps(response.json().get("data", {}), indent=2)
    except Exception as e:
        return f"Error extracting from {url}: {str(e)}"

def take_screenshot(url: str) -> str:
    """Take a screenshot of a URL (returns URL to image)."""
    try:
        response = requests.post(
            f"{MANTIS_BASE}/v1/screenshot",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "full_page": False},
            timeout=30
        )
        response.raise_for_status()
        return response.json().get("screenshot_url", "No screenshot URL")
    except Exception as e:
        return f"Error screenshotting {url}: {str(e)}"
```
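The response-handling logic in `scrape_url` can be exercised offline before spending API credits. This sketch formats a sample payload the way the function does; the payload shape is an assumption inferred from the code above, not confirmed API behavior:

```python
def format_scrape_result(data: dict, limit: int = 8000) -> str:
    """Mirror scrape_url's formatting: '[title]' header plus truncated content."""
    content = data.get("content", "")
    title = data.get("metadata", {}).get("title", "Unknown")
    return f"[{title}]\n{content[:limit]}"

# Simulated API payload (shape assumed from the scrape_url code above)
sample = {
    "content": "Pricing starts at $29/mo for the Starter plan. " * 200,
    "metadata": {"title": "Pricing | Example"},
}

result = format_scrape_result(sample, limit=100)
print(result.splitlines()[0])  # [Pricing | Example]
```

Truncating here matters because a full page can easily exceed the model's context window once several scrapes accumulate in the message history.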
## Step 2: Build the Agent Loop
The agent uses OpenAI's function calling to decide which tools to use and when:
```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "scrape_url",
            "description": "Scrape a web page and return its content as clean text/markdown. Use this to read articles, documentation, product pages, and any web content.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to scrape"},
                    "format": {"type": "string", "enum": ["markdown", "text"], "default": "markdown"}
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "extract_data",
            "description": "Extract specific structured data from a web page using AI. Use this when you need specific facts like pricing, features, or contact info.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to extract from"},
                    "prompt": {"type": "string", "description": "What data to extract (e.g., 'Extract pricing tiers and features')"}
                },
                "required": ["url", "prompt"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "take_screenshot",
            "description": "Take a visual screenshot of a web page. Use this to capture visual layouts, designs, or proof of claims.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to screenshot"}
                },
                "required": ["url"]
            }
        }
    }
]

TOOL_MAP = {
    "scrape_url": scrape_url,
    "extract_data": extract_data,
    "take_screenshot": take_screenshot,
}
```
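The dispatch pattern (JSON arguments string in, tool result out) can be verified without any API calls by swapping in stub functions. A minimal sketch, with stubs standing in for the real tools:

```python
import json

# Stub tools so the dispatch logic can run offline
def stub_scrape_url(url: str, format: str = "markdown") -> str:
    return f"scraped:{url}:{format}"

def stub_extract_data(url: str, prompt: str) -> str:
    return f"extracted:{url}:{prompt}"

STUB_TOOL_MAP = {
    "scrape_url": stub_scrape_url,
    "extract_data": stub_extract_data,
}

def dispatch(fn_name: str, raw_arguments: str) -> str:
    """Parse the JSON arguments string and call the matching tool."""
    fn_args = json.loads(raw_arguments)
    return STUB_TOOL_MAP[fn_name](**fn_args)

print(dispatch("scrape_url", '{"url": "https://example.com"}'))
# scraped:https://example.com:markdown
```

This is exactly what the agent loop below does with each `tool_call`: the model hands back a function name and a JSON string of arguments, and the code routes it through the map.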
## Step 3: The Agent Brain

Here's the core agent loop: it plans, executes tools, and synthesizes results:
```python
SYSTEM_PROMPT = """You are an expert research agent. Your job is to thoroughly research a topic by scraping relevant websites and producing a comprehensive report.

Your research process:
1. Think about what information you need and which websites to check
2. Scrape relevant pages to gather data
3. Extract specific facts when needed (pricing, features, etc.)
4. Analyze all collected data
5. Write a structured report with findings and citations

Guidelines:
- Always cite your sources with URLs
- Scrape at least 3-5 sources for comprehensive coverage
- Use extract_data for structured info (pricing, specs, features)
- Use scrape_url for general content and context
- Be thorough but efficient; don't scrape unnecessary pages
- If a scrape fails, try an alternative source
- Present findings objectively with data to back claims"""

def research(question: str, max_iterations: int = 10) -> str:
    """Run the research agent on a question."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Research this topic and write a detailed report:\n\n{question}"}
    ]
    for i in range(max_iterations):
        print(f"\nIteration {i+1}/{max_iterations}")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        message = response.choices[0].message
        messages.append(message)

        # If there are no tool calls, the agent is done: return the final answer
        if not message.tool_calls:
            print("Research complete!")
            return message.content

        # Execute each tool call and feed the result back to the model
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f"  {fn_name}({fn_args.get('url', fn_args.get('prompt', ''))[:60]}...)")
            result = TOOL_MAP[fn_name](**fn_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result[:10000]  # Limit context size
            })
    return "Research incomplete: reached maximum iterations."
```
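The loop's control flow (keep executing tools until the model stops requesting them, or give up after `max_iterations`) can be tested with a scripted fake client. This sketch uses hand-rolled stand-ins for the OpenAI response objects; the attribute names mirror the real SDK, but everything here is a local stub:

```python
import json
from types import SimpleNamespace

def fake_tool(url: str) -> str:
    return f"content of {url}"

def make_response(tool_calls, content=None):
    """Build an object shaped like an OpenAI chat completion response."""
    message = SimpleNamespace(tool_calls=tool_calls, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

# Scripted "model" turns: one tool call, then a final answer
scripted = [
    make_response([SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="fake_tool",
                                 arguments='{"url": "https://example.com"}'),
    )]),
    make_response(None, content="Final report."),
]

def run_loop(responses, tool_map, max_iterations=10):
    messages = []
    for response in responses[:max_iterations]:
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content          # no tool calls: agent is done
        for tool_call in message.tool_calls:
            fn_args = json.loads(tool_call.function.arguments)
            result = tool_map[tool_call.function.name](**fn_args)
            messages.append({"role": "tool",
                             "tool_call_id": tool_call.id,
                             "content": result})
    return "Research incomplete"

print(run_loop(scripted, {"fake_tool": fake_tool}))  # Final report.
```

Exercising the termination logic this way catches bugs (like forgetting to append the tool result with the matching `tool_call_id`) before you pay for real GPT-4o calls.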
## Step 4: Run It
```python
# Example 1: Competitive analysis
report = research(
    "Compare the top 5 web scraping APIs in 2026. "
    "Include pricing, features, AI capabilities, and which is best for AI agents."
)
print(report)

# Example 2: Market research
report = research(
    "What are the main challenges developers face when building AI agents? "
    "Research developer forums, blog posts, and documentation."
)
print(report)

# Example 3: Technical research
report = research(
    "How are companies using web scraping for real-time RAG pipelines? "
    "Find case studies and technical implementations."
)
print(report)
```
## What the Agent Produces

Here's an abridged sample output from a competitive analysis run:
```markdown
# Web Scraping API Comparison Report (2026)

## Executive Summary
After analyzing 5 major web scraping APIs, [findings]...

## Detailed Comparison

### 1. WebPerception API (mantisapi.com)
- **Pricing:** Free tier (100/mo), paid from $29/mo
- **AI Extraction:** Yes: structured data extraction with custom schemas
- **Best For:** AI agents needing real-time web access
- **Source:** https://mantisapi.com/pricing

### 2. ScrapingBee
- **Pricing:** From $49/mo for 1,000 credits
- **AI Extraction:** Limited
- **Source:** https://scrapingbee.com/pricing

[... more competitors ...]

## Recommendation
For AI agent developers, WebPerception API offers the best
combination of AI-native features and competitive pricing...
```
## Adding Memory for Multi-Session Research
For longer research projects, add a simple memory layer:
```python
import os
from datetime import datetime

RESEARCH_DIR = "./research_sessions"

def save_research(topic: str, report: str):
    """Save research to disk for future reference."""
    os.makedirs(RESEARCH_DIR, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    slug = topic[:50].replace(" ", "_").lower()
    filepath = f"{RESEARCH_DIR}/{timestamp}_{slug}.md"
    with open(filepath, "w") as f:
        f.write(f"# Research: {topic}\n")
        f.write(f"*Generated: {datetime.now().isoformat()}*\n\n")
        f.write(report)
    print(f"Saved: {filepath}")

def load_previous_research(topic: str) -> str:
    """Load relevant previous research as context."""
    if not os.path.exists(RESEARCH_DIR):
        return ""
    relevant = []
    for f in os.listdir(RESEARCH_DIR):
        if any(word in f.lower() for word in topic.lower().split()):
            with open(f"{RESEARCH_DIR}/{f}") as fh:
                relevant.append(fh.read()[:3000])
    return "\n\n---\n\n".join(relevant) if relevant else ""
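A quick round-trip check of the memory layer. This sketch restates the two helpers with an explicit `directory` parameter (an adaptation, not the article's exact signatures) so they can run against a temporary directory instead of `./research_sessions`. Note that the slug only normalizes spaces; a production version would also strip other filesystem-unsafe characters:

```python
import os
import tempfile
from datetime import datetime

def save_research(topic: str, report: str, directory: str) -> str:
    """Write a timestamped markdown file and return its path."""
    os.makedirs(directory, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    slug = topic[:50].replace(" ", "_").lower()
    filepath = os.path.join(directory, f"{timestamp}_{slug}.md")
    with open(filepath, "w") as f:
        f.write(f"# Research: {topic}\n\n{report}")
    return filepath

def load_previous_research(topic: str, directory: str) -> str:
    """Concatenate saved sessions whose filename shares a word with the topic."""
    if not os.path.isdir(directory):
        return ""
    relevant = []
    for name in os.listdir(directory):
        if any(word in name.lower() for word in topic.lower().split()):
            with open(os.path.join(directory, name)) as fh:
                relevant.append(fh.read()[:3000])
    return "\n\n---\n\n".join(relevant)

with tempfile.TemporaryDirectory() as tmp:
    save_research("Web scraping APIs", "Findings...", tmp)
    context = load_previous_research("scraping", tmp)
    print("scraping" in context.lower())  # True
```

Prepending the loaded context to the user message gives the agent continuity across sessions without any vector database.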
## Advanced: Parallel Scraping
Speed up research by scraping multiple URLs concurrently:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_scrape(urls: list, max_workers: int = 5) -> dict:
    """Scrape multiple URLs in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {
            executor.submit(scrape_url, url): url for url in urls
        }
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
                print(f"done: {url[:60]}")
            except Exception as e:
                results[url] = f"Error: {e}"
                print(f"failed: {url[:60]}")
    return results
```
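The fan-out pattern itself can be verified with a stub scraper before touching the network. This variant takes the scrape function as a parameter (an adaptation for testability, not the article's signature):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def stub_scrape(url: str) -> str:
    """Offline stand-in for scrape_url."""
    if "bad" in url:
        raise ValueError("unreachable")
    return f"content of {url}"

def fan_out_scrape(urls, scrape_fn, max_workers=5):
    """Run scrape_fn over urls concurrently, collecting per-URL results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(scrape_fn, u): u for u in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                results[url] = f"Error: {e}"
    return results

results = fan_out_scrape(["https://a.example", "https://bad.example"], stub_scrape)
print(results["https://a.example"])   # content of https://a.example
print(results["https://bad.example"]) # Error: unreachable
```

Because each worker spends most of its time waiting on HTTP, threads (rather than processes) are the right concurrency primitive here.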
## Cost Estimate
For a typical research task (scraping 10 pages, 1 extraction):
| Component | Cost |
|-----------|------|
| Web scraping (10 pages) | ~$0.05 |
| AI extraction (1 page) | ~$0.01 |
| GPT-4o reasoning (~5K tokens) | ~$0.08 |
| Total per research task | ~$0.14 |
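The total is just the sum of the line items. A small helper makes the arithmetic explicit; the per-unit rates are back-calculated from the approximate figures in the table, not authoritative pricing:

```python
def estimate_cost(pages: int, extractions: int, llm_tokens: int) -> float:
    """Rough per-task cost model using the table's approximate rates."""
    scrape_rate = 0.005         # assumed: ~$0.05 per 10 pages
    extract_rate = 0.01         # assumed: ~$0.01 per extraction
    llm_rate = 0.08 / 5000      # assumed: ~$0.08 per 5K GPT-4o tokens
    return pages * scrape_rate + extractions * extract_rate + llm_tokens * llm_rate

print(round(estimate_cost(pages=10, extractions=1, llm_tokens=5000), 2))  # 0.14
```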
That's 70x cheaper than a human research assistant doing the same work, and it takes 2 minutes instead of 2 hours.
## Use Cases

- **Competitive intelligence:** Monitor competitor pricing, features, and messaging weekly
- **Market research:** Analyze trends across industry blogs and news sites
- **Due diligence:** Research companies before partnerships or investments
- **Content research:** Gather sources and data for writing articles
- **Lead qualification:** Research prospect companies before outreach
- **Academic research:** Scrape papers and summarize findings across sources
## Next Steps

- **Get your free API key:** Start scraping with 100 free calls/month
- **API Documentation:** Full endpoint reference
- **LangChain Integration:** Build agents with LangChain + WebPerception
- **CrewAI Integration:** Multi-agent research teams
Your AI agent shouldn't be limited to what's in its training data. Give it the web, and watch it research like a pro.