AutoGen Web Scraping: How to Build Web-Enabled Multi-Agent Systems in 2026
Microsoft's AutoGen framework makes it easy to build multi-agent systems where AI agents collaborate through conversation. But conversations alone don't solve real-world problems — agents need access to real-time web data.
Whether your agents need to research topics, monitor competitors, extract product data, or gather market intelligence, they need web scraping capabilities. This guide shows you exactly how to add production-grade web access to your AutoGen agents.
Why AutoGen Agents Need Web Access
AutoGen agents are powerful conversationalists and reasoners. But they're limited to their training data and whatever context you provide. For real-world tasks, agents need to:
- Research in real-time — not from training data that's months old
- Monitor websites — track prices, inventory, content changes
- Extract structured data — pull specific fields from web pages
- Verify information — fact-check against live sources
- Gather competitive intelligence — analyze competitor pages
Without web access, your agents are reasoning in a vacuum.
Setting Up Web Scraping Tools in AutoGen
Basic Setup
```python
import autogen
import requests

# Configuration
MANTIS_API_KEY = "your_api_key"
MANTIS_BASE = "https://api.mantisapi.com"

config_list = [{"model": "gpt-4", "api_key": "your_openai_key"}]

llm_config = {
    "config_list": config_list,
    "timeout": 120,
}
```
Define Web Scraping Functions
```python
def scrape_webpage(url: str) -> str:
    """Scrape a web page and return its content as clean markdown."""
    try:
        response = requests.get(f"{MANTIS_BASE}/scrape", params={
            'url': url,
            'api_key': MANTIS_API_KEY
        }, timeout=30)
        if response.ok:
            data = response.json()
            return data.get('content', 'No content extracted')
        return f"Error scraping {url}: HTTP {response.status_code}"
    except Exception as e:
        return f"Error: {str(e)}"

def extract_structured_data(url: str, fields: list[str]) -> dict:
    """Extract specific structured data from a web page.

    Args:
        url: The web page to extract from
        fields: List of field names to extract
    """
    try:
        schema = {field: "string" for field in fields}
        response = requests.post(f"{MANTIS_BASE}/extract", json={
            'url': url,
            'api_key': MANTIS_API_KEY,
            'schema': schema
        }, timeout=30)
        if response.ok:
            return response.json()
        return {"error": f"HTTP {response.status_code}"}
    except Exception as e:
        return {"error": str(e)}

def take_screenshot(url: str) -> str:
    """Capture a screenshot of a web page."""
    try:
        response = requests.get(f"{MANTIS_BASE}/screenshot", params={
            'url': url,
            'api_key': MANTIS_API_KEY
        }, timeout=30)
        if response.ok:
            return response.json().get('screenshot_url', 'No URL')
        return f"Error: HTTP {response.status_code}"
    except Exception as e:
        return f"Error: {str(e)}"
```
Register Tools with AutoGen Agents
```python
# Create the assistant agent
assistant = autogen.AssistantAgent(
    name="web_researcher",
    system_message="""You are a web research specialist. You can scrape web pages,
    extract structured data, and take screenshots. Use these tools to gather
    information from the web when needed. Always cite your sources.""",
    llm_config=llm_config,
)

# Create the user proxy agent
user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "output"},
)

# Register the functions
@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Scrape a web page and return clean markdown content")
def scrape(url: str) -> str:
    return scrape_webpage(url)

@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Extract specific fields from a web page as structured data")
def extract(url: str, fields: list[str]) -> dict:
    return extract_structured_data(url, fields)

@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Take a screenshot of a web page")
def screenshot(url: str) -> str:
    return take_screenshot(url)
```
Run Your Web-Enabled Agent
```python
# Research task
user_proxy.initiate_chat(
    assistant,
    message="""Research the top 5 web scraping APIs available in 2026.
    For each one:
    1. Visit their website
    2. Extract their pricing, key features, and target audience
    3. Create a comparison table
    Start with ScrapingBee, Bright Data, ScraperAPI, Apify, and WebPerception API."""
)
```
Multi-Agent Patterns
Pattern 1: Researcher + Analyst
```python
researcher = autogen.AssistantAgent(
    name="researcher",
    system_message="""You are a web researcher. Your job is to scrape websites
    and gather raw information. Report findings factually without analysis.""",
    llm_config=llm_config,
)

analyst = autogen.AssistantAgent(
    name="analyst",
    system_message="""You are a business analyst. You receive research data
    and provide strategic insights, identify patterns, and make recommendations.
    You don't scrape websites — you analyze what the researcher finds.""",
    llm_config=llm_config,
)

user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=15,
)

# Register scraping tools only for the researcher
@user_proxy.register_for_execution()
@researcher.register_for_llm(description="Scrape a web page")
def scrape_for_research(url: str) -> str:
    return scrape_webpage(url)

@user_proxy.register_for_execution()
@researcher.register_for_llm(description="Extract structured data from a page")
def extract_for_research(url: str, fields: list[str]) -> dict:
    return extract_structured_data(url, fields)

# Group chat for collaboration
groupchat = autogen.GroupChat(
    agents=[user_proxy, researcher, analyst],
    messages=[],
    max_round=20,
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="Research and analyze the competitive landscape for AI-powered web scraping tools."
)
```
Pattern 2: Monitor + Alert Agent
```python
monitor = autogen.AssistantAgent(
    name="price_monitor",
    system_message="""You monitor product prices on specified websites.
    Extract current prices and compare with previous checks.
    Alert if any price changes more than 10%.""",
    llm_config=llm_config,
)

# Register extraction tool
@user_proxy.register_for_execution()
@monitor.register_for_llm(description="Extract pricing data from a product page")
def check_price(url: str) -> dict:
    return extract_structured_data(url, [
        "product_name", "current_price", "original_price",
        "currency", "availability"
    ])

user_proxy.initiate_chat(
    monitor,
    message="""Check prices on these product pages and report the current state:
    1. https://example.com/product-a
    2. https://example.com/product-b
    3. https://example.com/product-c"""
)
```
Pattern 3: Content Creation Pipeline
```python
researcher = autogen.AssistantAgent(
    name="content_researcher",
    system_message="""Research topics by scraping relevant web pages.
    Gather facts, statistics, expert quotes, and examples.""",
    llm_config=llm_config,
)

writer = autogen.AssistantAgent(
    name="content_writer",
    system_message="""Write engaging, well-structured blog posts based on
    research provided. Use real data and cite sources.""",
    llm_config=llm_config,
)

editor = autogen.AssistantAgent(
    name="editor",
    system_message="""Review blog posts for accuracy, clarity, SEO optimization,
    and engagement. Suggest improvements.""",
    llm_config=llm_config,
)
```
Best Practices for AutoGen Web Scraping
1. Limit Agent Autonomy
Set max_consecutive_auto_reply to prevent runaway scraping:
```python
user_proxy = autogen.UserProxyAgent(
    name="user",
    max_consecutive_auto_reply=10,  # Stop after 10 auto-replies
)
```
2. Add Caching
Don't scrape the same page twice in one session:
```python
_cache = {}

def scrape_cached(url: str) -> str:
    if url in _cache:
        return _cache[url]
    result = scrape_webpage(url)
    _cache[url] = result
    return result
```
3. Handle Failures Gracefully
Web scraping can fail. Always return useful error messages:
```python
def scrape_safe(url: str) -> str:
    result = scrape_webpage(url)
    if result.startswith("Error"):
        return f"Could not access {url}. The page may be down or blocking scrapers. Try a different URL or approach."
    return result
```
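Transient failures such as rate limits and timeouts often succeed on retry. A sketch of a retry wrapper with exponential backoff; it relies on the "Error"-prefix convention of scrape_webpage above, and the fetch and sleep functions are injectable, so the wrapper itself is not tied to any particular API:

```python
import time
from typing import Callable

def scrape_with_retry(url: str, fetch: Callable[[str], str],
                      attempts: int = 3, base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry fetch(url) with exponential backoff on error results.

    Treats any return value starting with "Error" as a failure,
    matching the convention used by scrape_webpage in this guide.
    """
    result = ""
    for attempt in range(attempts):
        result = fetch(url)
        if not result.startswith("Error"):
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return result  # last error message, so the agent sees why it failed
```

Returning the final error string (rather than raising) keeps the agent conversation going instead of crashing the tool call.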
4. Cost Control
Track API usage to avoid surprise bills:
```python
call_count = 0
MAX_CALLS = 50

def scrape_with_budget(url: str) -> str:
    global call_count
    call_count += 1
    if call_count > MAX_CALLS:
        return "Budget exceeded — no more scraping calls allowed this session."
    return scrape_webpage(url)
```
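The global counter above works for a single session; a small tracker class is easier to reset and to share across several tool wrappers. A sketch with hypothetical names (pass scrape_webpage from earlier in this guide as `fetch`):

```python
class ScrapeBudget:
    """Track scraping calls against a per-session limit."""

    def __init__(self, max_calls: int = 50):
        self.max_calls = max_calls
        self.calls = 0

    def allow(self) -> bool:
        """Consume one call if budget remains; False once exhausted."""
        if self.calls >= self.max_calls:
            return False
        self.calls += 1
        return True

budget = ScrapeBudget(max_calls=50)

def scrape_budgeted(url: str, fetch) -> str:
    if not budget.allow():
        return "Budget exceeded: no more scraping calls this session."
    return fetch(url)
```

Resetting between sessions is then just budget.calls = 0, and the same instance can guard the extract and screenshot tools too.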
AutoGen vs CrewAI: Web Scraping Comparison
| Feature | AutoGen | CrewAI |
|---------|---------|--------|
| Tool registration | register_for_llm / register_for_execution decorators | @tool decorator |
| Multi-agent chat | Native group chat | Sequential/hierarchical |
| Function calling | Built-in | Built-in |
| Code execution | Built-in sandbox | Via tools |
| Best for | Conversational agents | Task-oriented crews |
Both work great with WebPerception API. Choose AutoGen for conversational research flows, CrewAI for structured task pipelines.
Getting Started
1. Install AutoGen: pip install pyautogen
2. Get a WebPerception API key: mantisapi.com (100 free calls/month)
3. Copy the function definitions from this guide
4. Register them with your agents using the decorator pattern
5. Start with a simple task: research one topic, extract one page
Your AutoGen agents just got eyes on the entire web.
---
Power up your AutoGen agents with web access. WebPerception API handles JavaScript rendering, anti-bot bypass, and AI data extraction — so your agents focus on reasoning. Start free today.