AutoGen Web Scraping: How to Build Web-Enabled Multi-Agent Systems in 2026
Microsoft's AutoGen framework makes it easy to build multi-agent systems where AI agents collaborate through conversation. But conversations alone don't solve real-world problems — agents need access to real-time web data.
Whether your agents need to research topics, monitor competitors, extract product data, or gather market intelligence, they need web scraping capabilities. This guide shows you exactly how to add production-grade web access to your AutoGen agents.
Why AutoGen Agents Need Web Access
AutoGen agents are powerful conversationalists and reasoners. But they're limited to their training data and whatever context you provide. For real-world tasks, agents need to:
- Research in real-time — not from training data that's months old
- Monitor websites — track prices, inventory, content changes
- Extract structured data — pull specific fields from web pages
- Verify information — fact-check against live sources
- Gather competitive intelligence — analyze competitor pages
Without web access, your agents are reasoning in a vacuum.
Setting Up Web Scraping Tools in AutoGen
Basic Setup
```python
import autogen
import requests

# Configuration
MANTIS_API_KEY = "your_api_key"
MANTIS_BASE = "https://api.mantisapi.com"

config_list = [{"model": "gpt-4", "api_key": "your_openai_key"}]

llm_config = {
    "config_list": config_list,
    "timeout": 120,
}
```
Define Web Scraping Functions
```python
def scrape_webpage(url: str) -> str:
    """Scrape a web page and return its content as clean markdown."""
    try:
        response = requests.get(f"{MANTIS_BASE}/scrape", params={
            'url': url,
            'api_key': MANTIS_API_KEY
        }, timeout=30)
        if response.ok:
            data = response.json()
            return data.get('content', 'No content extracted')
        return f"Error scraping {url}: HTTP {response.status_code}"
    except Exception as e:
        return f"Error: {str(e)}"

def extract_structured_data(url: str, fields: list[str]) -> dict:
    """Extract specific structured data from a web page.

    Args:
        url: The web page to extract from
        fields: List of field names to extract
    """
    try:
        schema = {field: "string" for field in fields}
        response = requests.post(f"{MANTIS_BASE}/extract", json={
            'url': url,
            'api_key': MANTIS_API_KEY,
            'schema': schema
        }, timeout=30)
        if response.ok:
            return response.json()
        return {"error": f"HTTP {response.status_code}"}
    except Exception as e:
        return {"error": str(e)}

def take_screenshot(url: str) -> str:
    """Capture a screenshot of a web page."""
    try:
        response = requests.get(f"{MANTIS_BASE}/screenshot", params={
            'url': url,
            'api_key': MANTIS_API_KEY
        }, timeout=30)
        if response.ok:
            return response.json().get('screenshot_url', 'No URL')
        return f"Error: HTTP {response.status_code}"
    except Exception as e:
        return f"Error: {str(e)}"
```
Register Tools with AutoGen Agents
```python
# Create the assistant agent
assistant = autogen.AssistantAgent(
    name="web_researcher",
    system_message="""You are a web research specialist. You can scrape web pages,
    extract structured data, and take screenshots. Use these tools to gather
    information from the web when needed. Always cite your sources.""",
    llm_config=llm_config,
)

# Create the user proxy agent
user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "output"},
)

# Register the functions
@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Scrape a web page and return clean markdown content")
def scrape(url: str) -> str:
    return scrape_webpage(url)

@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Extract specific fields from a web page as structured data")
def extract(url: str, fields: list[str]) -> dict:
    return extract_structured_data(url, fields)

@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Take a screenshot of a web page")
def screenshot(url: str) -> str:
    return take_screenshot(url)
```
Run Your Web-Enabled Agent
```python
# Research task
user_proxy.initiate_chat(
    assistant,
    message="""Research the top 5 web scraping APIs available in 2026.
    For each one:
    1. Visit their website
    2. Extract their pricing, key features, and target audience
    3. Create a comparison table
    Start with ScrapingBee, Bright Data, ScraperAPI, Apify, and WebPerception API."""
)
```
Multi-Agent Patterns
Pattern 1: Researcher + Analyst
```python
researcher = autogen.AssistantAgent(
    name="researcher",
    system_message="""You are a web researcher. Your job is to scrape websites
    and gather raw information. Report findings factually without analysis.""",
    llm_config=llm_config,
)

analyst = autogen.AssistantAgent(
    name="analyst",
    system_message="""You are a business analyst. You receive research data
    and provide strategic insights, identify patterns, and make recommendations.
    You don't scrape websites — you analyze what the researcher finds.""",
    llm_config=llm_config,
)

user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=15,
)

# Register scraping tools only for the researcher
@user_proxy.register_for_execution()
@researcher.register_for_llm(description="Scrape a web page")
def scrape_for_research(url: str) -> str:
    return scrape_webpage(url)

@user_proxy.register_for_execution()
@researcher.register_for_llm(description="Extract structured data from a page")
def extract_for_research(url: str, fields: list[str]) -> dict:
    return extract_structured_data(url, fields)

# Group chat for collaboration
groupchat = autogen.GroupChat(
    agents=[user_proxy, researcher, analyst],
    messages=[],
    max_round=20,
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="Research and analyze the competitive landscape for AI-powered web scraping tools."
)
```
Pattern 2: Monitor + Alert Agent
```python
monitor = autogen.AssistantAgent(
    name="price_monitor",
    system_message="""You monitor product prices on specified websites.
    Extract current prices and compare with previous checks.
    Alert if any price changes more than 10%.""",
    llm_config=llm_config,
)

# Register extraction tool
@user_proxy.register_for_execution()
@monitor.register_for_llm(description="Extract pricing data from a product page")
def check_price(url: str) -> dict:
    return extract_structured_data(url, [
        "product_name", "current_price", "original_price",
        "currency", "availability"
    ])

user_proxy.initiate_chat(
    monitor,
    message="""Check prices on these product pages and report the current state:
    1. https://example.com/product-a
    2. https://example.com/product-b
    3. https://example.com/product-c"""
)
```
Pattern 3: Content Creation Pipeline
```python
researcher = autogen.AssistantAgent(
    name="content_researcher",
    system_message="""Research topics by scraping relevant web pages.
    Gather facts, statistics, expert quotes, and examples.""",
    llm_config=llm_config,
)

writer = autogen.AssistantAgent(
    name="content_writer",
    system_message="""Write engaging, well-structured blog posts based on
    research provided. Use real data and cite sources.""",
    llm_config=llm_config,
)

editor = autogen.AssistantAgent(
    name="editor",
    system_message="""Review blog posts for accuracy, clarity, SEO optimization,
    and engagement. Suggest improvements.""",
    llm_config=llm_config,
)
```
Best Practices for AutoGen Web Scraping
1. Limit Agent Autonomy
Set max_consecutive_auto_reply to prevent runaway scraping:
```python
user_proxy = autogen.UserProxyAgent(
    name="user",
    max_consecutive_auto_reply=10,  # Stop after 10 auto-replies
)
```
2. Add Caching
Don't scrape the same page twice in one session:
```python
_cache = {}

def scrape_cached(url: str) -> str:
    if url in _cache:
        return _cache[url]
    result = scrape_webpage(url)
    _cache[url] = result
    return result
```
3. Handle Failures Gracefully
Web scraping can fail. Always return useful error messages:
```python
def scrape_safe(url: str) -> str:
    result = scrape_webpage(url)
    if result.startswith("Error"):
        return f"Could not access {url}. The page may be down or blocking scrapers. Try a different URL or approach."
    return result
```
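Transient failures such as rate limits and timeouts often succeed on retry. A sketch of a retry wrapper with exponential backoff; it relies on the "Error"-prefix convention of scrape_webpage above, and the fetch and sleep functions are injectable, so the wrapper itself is not tied to any particular API:

```python
import time
from typing import Callable

def scrape_with_retry(url: str, fetch: Callable[[str], str],
                      attempts: int = 3, base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry fetch(url) with exponential backoff on error results.

    Treats any return value starting with "Error" as a failure,
    matching the convention used by scrape_webpage in this guide.
    """
    result = ""
    for attempt in range(attempts):
        result = fetch(url)
        if not result.startswith("Error"):
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return result  # last error message, so the agent sees why it failed
```

Returning the final error string (rather than raising) keeps the agent conversation going instead of crashing the tool call.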
4. Cost Control
Track API usage to avoid surprise bills:
```python
call_count = 0
MAX_CALLS = 50

def scrape_with_budget(url: str) -> str:
    global call_count
    call_count += 1
    if call_count > MAX_CALLS:
        return "Budget exceeded — no more scraping calls allowed this session."
    return scrape_webpage(url)
```
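The global counter above works for a single session; a small tracker class is easier to reset and to share across several tool wrappers. A sketch with hypothetical names (pass scrape_webpage from earlier in this guide as `fetch`):

```python
class ScrapeBudget:
    """Track scraping calls against a per-session limit."""

    def __init__(self, max_calls: int = 50):
        self.max_calls = max_calls
        self.calls = 0

    def allow(self) -> bool:
        """Consume one call if budget remains; False once exhausted."""
        if self.calls >= self.max_calls:
            return False
        self.calls += 1
        return True

budget = ScrapeBudget(max_calls=50)

def scrape_budgeted(url: str, fetch) -> str:
    if not budget.allow():
        return "Budget exceeded: no more scraping calls this session."
    return fetch(url)
```

Resetting between sessions is then just budget.calls = 0, and the same instance can guard the extract and screenshot tools too.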
AutoGen vs CrewAI: Web Scraping Comparison
| Feature | AutoGen | CrewAI |
|---------|---------|--------|
| Tool registration | register_for_llm / register_for_execution decorators | @tool decorator |
| Multi-agent chat | Native group chat | Sequential/hierarchical |
| Function calling | Built-in | Built-in |
| Code execution | Built-in sandbox | Via tools |
| Best for | Conversational agents | Task-oriented crews |
Both work great with WebPerception API. Choose AutoGen for conversational research flows, CrewAI for structured task pipelines.
Getting Started
1. Install AutoGen: pip install pyautogen
2. Get a WebPerception API key: mantisapi.com (100 free calls/month)
3. Copy the function definitions from this guide
4. Register them with your agents using the decorator pattern
5. Start with a simple task: research one topic, extract one page
Your AutoGen agents just got eyes on the entire web.
---
Power up your AutoGen agents with web access. WebPerception API handles JavaScript rendering, anti-bot bypass, and AI data extraction — so your agents focus on reasoning. Start free today.