CrewAI Web Scraping: How to Give Your AI Crew Real-Time Web Access in 2026
CrewAI is one of the most popular frameworks for building multi-agent systems. But out of the box, your crew is blind to the web. Agents can reason, plan, and collaborate — but they can't see what's happening online right now.
That's a problem. Most real-world agent tasks require fresh web data: market research, competitive analysis, lead generation, content creation, price monitoring.
This guide shows you how to give your CrewAI agents powerful web scraping capabilities — from basic page reading to structured AI extraction.
The Problem: CrewAI Agents Can't See the Web
By default, CrewAI agents work with whatever context you give them. They can use tools, but the built-in web tools are limited:
- SerperDevTool — search results only, no page content
- ScrapeWebsiteTool — basic HTML fetch, breaks on JavaScript-heavy sites
- WebsiteSearchTool — RAG over a single page, no structured extraction
For production use cases, you need:
- JavaScript rendering (most modern sites load their content client-side)
- Anti-bot bypass (Cloudflare, PerimeterX, etc.)
- Structured data extraction (not raw HTML dumps)
- Reliable uptime at scale
Solution: WebPerception API as a CrewAI Tool
WebPerception API handles all the hard parts — rendering, anti-bot, and extraction — so your crew focuses on reasoning and decision-making.
Step 1: Create Custom Tools
```python
from crewai.tools import tool
import requests

MANTIS_API_KEY = "your_api_key"
MANTIS_BASE = "https://api.mantisapi.com"

@tool("Scrape Web Page")
def scrape_webpage(url: str) -> str:
    """Scrape a web page and return its content as clean markdown.
    Use this to read articles, documentation, blog posts, or any web page."""
    response = requests.get(f"{MANTIS_BASE}/scrape", params={
        'url': url,
        'api_key': MANTIS_API_KEY
    })
    if response.ok:
        return response.json().get('content', 'No content extracted')
    return f"Error: {response.status_code}"
```
```python
@tool("Extract Structured Data")
def extract_data(url: str, fields: str) -> str:
    """Extract specific structured data from a web page.

    Args:
        url: The web page URL
        fields: Comma-separated list of fields to extract
            (e.g., "product_name,price,rating")
    """
    schema = {field.strip(): "string" for field in fields.split(",")}
    response = requests.post(f"{MANTIS_BASE}/extract", json={
        'url': url,
        'api_key': MANTIS_API_KEY,
        'schema': schema
    })
    if response.ok:
        return str(response.json())
    return f"Error: {response.status_code}"
```
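The fields-to-schema step inside `extract_data` is plain Python and easy to sanity-check on its own. A standalone sketch (`fields_to_schema` is a hypothetical helper, not part of any API), with a guard for empty entries added:

```python
def fields_to_schema(fields: str) -> dict:
    """Turn a comma-separated field list into a flat string-typed schema,
    ignoring empty entries and stray whitespace."""
    return {field.strip(): "string" for field in fields.split(",") if field.strip()}

print(fields_to_schema("product_name, price ,rating"))
# {'product_name': 'string', 'price': 'string', 'rating': 'string'}
```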
```python
@tool("Take Screenshot")
def screenshot_page(url: str) -> str:
    """Capture a screenshot of a web page. Returns the screenshot URL."""
    response = requests.get(f"{MANTIS_BASE}/screenshot", params={
        'url': url,
        'api_key': MANTIS_API_KEY
    })
    if response.ok:
        return response.json().get('screenshot_url', 'No URL returned')
    return f"Error: {response.status_code}"
```
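Hardcoding the key is fine for a quick test, but in any shared project you'll want it out of source control. A minimal sketch that reads it from the environment instead (the variable name `MANTIS_API_KEY` is an assumption, match whatever your deployment uses):

```python
import os

# Read the API key from the environment instead of hardcoding it in source.
MANTIS_API_KEY = os.environ.get("MANTIS_API_KEY", "")
if not MANTIS_API_KEY:
    print("Warning: MANTIS_API_KEY is not set; API calls will fail")
```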
Step 2: Build Your Research Crew
```python
from crewai import Agent, Task, Crew, Process

# Market Research Agent
researcher = Agent(
    role="Market Research Analyst",
    goal="Gather comprehensive market intelligence from the web",
    backstory="You're an expert at finding and analyzing market data online.",
    tools=[scrape_webpage, extract_data],
    verbose=True
)

# Competitive Analyst
analyst = Agent(
    role="Competitive Intelligence Analyst",
    goal="Analyze competitors and identify strategic opportunities",
    backstory="You specialize in competitive analysis and strategic insights.",
    tools=[scrape_webpage, extract_data, screenshot_page],
    verbose=True
)

# Report Writer
writer = Agent(
    role="Business Intelligence Writer",
    goal="Synthesize research into clear, actionable reports",
    backstory="You turn raw research into strategic recommendations.",
    verbose=True
)

# Define tasks
research_task = Task(
    description="""Research the top 5 competitors in the {industry} space.
    For each competitor, scrape their website and extract:
    - Company name and tagline
    - Pricing information
    - Key features
    - Target audience""",
    expected_output="Detailed competitor profiles with pricing and features",
    agent=researcher
)

analysis_task = Task(
    description="""Analyze the competitor data and identify:
    - Market gaps and opportunities
    - Pricing trends
    - Feature comparison matrix
    - Strategic recommendations""",
    expected_output="Strategic analysis with actionable insights",
    agent=analyst
)

report_task = Task(
    description="""Write a comprehensive competitive intelligence report
    based on the research and analysis. Include an executive summary,
    detailed findings, and strategic recommendations.""",
    expected_output="Professional competitive intelligence report",
    agent=writer
)

# Assemble the crew
research_crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=[research_task, analysis_task, report_task],
    process=Process.sequential,
    verbose=True
)

# Run it
result = research_crew.kickoff(inputs={'industry': 'web scraping APIs'})
print(result)
```
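Note the `{industry}` placeholder in the task description: CrewAI interpolates `kickoff` inputs into task descriptions using Python-style formatting, so the same crew definition can be reused across industries. The substitution behaves like `str.format`:

```python
description = "Research the top 5 competitors in the {industry} space."

# The same crew can be kicked off with different inputs.
for industry in ("web scraping APIs", "vector databases"):
    print(description.format(industry=industry))
```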
Real-World Use Cases
1. Lead Generation Crew
```python
lead_finder = Agent(
    role="Lead Generation Specialist",
    goal="Find and qualify potential customers from the web",
    tools=[scrape_webpage, extract_data],
    backstory="Expert at finding companies that match our ICP."
)

lead_task = Task(
    description="""Find 20 companies that might need a web scraping API.
    Search for companies building AI agents, data pipelines, or price monitoring tools.
    For each company, extract: name, website, what they do, potential use case for our API.""",
    expected_output="List of 20 qualified leads with contact info and use cases",
    agent=lead_finder
)
```
2. Content Research Crew
```python
content_researcher = Agent(
    role="Content Research Specialist",
    goal="Research topics thoroughly by reading multiple web sources",
    tools=[scrape_webpage],
    backstory="You research topics by reading real articles, not generating from memory."
)

content_writer = Agent(
    role="SEO Content Writer",
    goal="Write high-quality, SEO-optimized blog posts based on real research",
    backstory="You write accurate, well-researched content that ranks."
)

research = Task(
    description="""Research "{topic}" by scraping the top 10 articles on this topic.
    Summarize key points, statistics, and expert opinions.""",
    expected_output="Comprehensive research summary with citations",
    agent=content_researcher
)

writing = Task(
    description="""Write a 2000-word blog post on "{topic}" using the research.
    Include real statistics and insights from the sources. Optimize for SEO.""",
    expected_output="Complete, SEO-optimized blog post",
    agent=content_writer
)
```
3. Price Monitoring Crew
```python
price_monitor = Agent(
    role="Price Monitoring Agent",
    goal="Track competitor prices across multiple websites",
    tools=[extract_data],
    backstory="You monitor prices and alert on significant changes."
)

price_task = Task(
    description="""Check the current prices for {product_category} on these competitor sites:
    {competitor_urls}
    Extract: product name, current price, original price (if on sale), currency.
    Flag any prices that have changed more than 10% from last check.""",
    expected_output="Price comparison table with change alerts",
    agent=price_monitor
)
```
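The 10% threshold in the task can also be enforced deterministically in code rather than left to the agent's judgment. A sketch (`flag_changes` and the sample prices are illustrative, not part of any API):

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100

def flag_changes(prev: dict, curr: dict, threshold: float = 10.0) -> dict:
    """Return products whose price moved more than `threshold` percent
    since the previous check, mapped to the rounded percentage change."""
    return {
        name: round(pct_change(prev[name], price), 1)
        for name, price in curr.items()
        if name in prev and abs(pct_change(prev[name], price)) > threshold
    }

previous = {"widget": 100.0, "gadget": 50.0}
current = {"widget": 115.0, "gadget": 52.0}
print(flag_changes(previous, current))  # {'widget': 15.0}
```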
Best Practices
1. Rate Limiting
Don't let your crew overwhelm the API. Add delays between scraping calls:
```python
import time

@tool("Scrape Web Page (Rate Limited)")
def scrape_webpage_safe(url: str) -> str:
    """Scrape a web page with rate limiting."""
    time.sleep(1)  # 1 second between requests
    response = requests.get(f"{MANTIS_BASE}/scrape", params={
        'url': url,
        'api_key': MANTIS_API_KEY
    })
    if response.ok:
        return response.json().get('content', 'Error')
    return f"Error: {response.status_code}"
```
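A fixed `sleep(1)` penalizes every call, even when the crew naturally spaces its requests out. If that matters, a minimal interval-based limiter only sleeps when calls arrive too quickly (a sketch, not part of CrewAI or the API):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive calls,
    sleeping only when the previous call was too recent."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
# Call limiter.wait() at the top of each scraping tool, before the HTTP request.
```

Back-to-back tool calls then pace themselves automatically, while calls that were already a second apart proceed immediately.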
2. Error Handling
Web scraping can fail. Make your tools resilient:
```python
@tool("Scrape Web Page (Resilient)")
def scrape_with_retry(url: str) -> str:
    """Scrape a web page with automatic retry on failure."""
    for attempt in range(3):
        try:
            response = requests.get(
                f"{MANTIS_BASE}/scrape",
                params={'url': url, 'api_key': MANTIS_API_KEY},
                timeout=30
            )
            if response.ok:
                return response.json().get('content', 'No content')
            if response.status_code == 429:
                # Rate limited: back off progressively before retrying
                time.sleep(5 * (attempt + 1))
                continue
        except requests.exceptions.Timeout:
            continue
    return "Failed to scrape after 3 attempts"
```
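The same retry pattern generalizes to any tool. A small decorator sketch (`retry` is a hypothetical helper; the flaky function simulates transient failures instead of making real HTTP calls):

```python
import time

def retry(times=3, base_delay=0.01):
    """Retry decorator with linear backoff; re-raises after the last attempt."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
                    time.sleep(base_delay * (attempt + 1))
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky():
    # Fails twice, then succeeds -- simulating a transient network error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

print(flaky())  # succeeds on the third attempt: ok
```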
3. Schema Design for Extract
Be specific with your extraction schemas. Vague schemas get vague results:
```python
# Bad — too vague
schema = {"info": "string"}

# Good — specific and typed
schema = {
    "company_name": "string",
    "founded_year": "integer",
    "employee_count": "string",
    "headquarters": "string",
    "annual_revenue": "string",
    "key_products": ["string"]
}
```
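Extraction returns whatever the page actually contains, so it's worth checking results against the schema before acting on them. A sketch (`missing_fields` is an illustrative helper, not part of the API):

```python
def missing_fields(record: dict, schema: dict) -> list:
    """Names of schema fields that are absent or empty in an extracted record."""
    return [key for key in schema if not record.get(key)]

schema = {"company_name": "string", "founded_year": "integer"}
record = {"company_name": "Acme", "founded_year": None}
print(missing_fields(record, schema))  # ['founded_year']
```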
CrewAI Web Tools Comparison
| Tool | JS Rendering | Anti-Bot | Structured Extract | Production Ready |
|------|-------------|----------|-------------------|-----------------|
| ScrapeWebsiteTool | ❌ | ❌ | ❌ | ❌ |
| SerperDevTool | N/A | N/A | Search only | ✅ |
| WebPerception API | ✅ | ✅ | ✅ | ✅ |
| DIY (Playwright) | ✅ | ❌ | ❌ | Requires infra |
| DIY (LLM + HTML) | ❌ | ❌ | ✅ | Expensive |
Getting Started
1. Install CrewAI: `pip install crewai`
2. Get a WebPerception API key: sign up at mantisapi.com — 100 free calls/month
3. Copy the tool definitions above into your project
4. Build your first crew with web access
5. Start small — scrape one page, extract one schema, then scale up
Your crew has been working with blindfolds on. Time to take them off.
---
Give your AI agents eyes on the web. WebPerception API provides scraping, screenshots, and AI data extraction — perfect for CrewAI tools. Start free today.