Build a Web Scraping Agent with HuggingFace Smolagents

March 9, 2026 · 11 min read · By Mantis Team

Smolagents is HuggingFace's lightweight agent framework: minimal abstraction, maximum control. Unlike heavier frameworks, Smolagents gives you two powerful agent types: CodeAgent (writes and executes Python) and ToolCallingAgent (uses structured tool calls). Both are well suited to web scraping.

In this tutorial, you'll build web scraping agents using Smolagents with the Mantis WebPerception API. You'll learn custom tool creation, both agent types, multi-agent orchestration, and production patterns.

Why Smolagents?

- Tiny codebase (~1,000 lines), so there is little framework to fight
- CodeAgent writes and executes real Python instead of only emitting JSON tool calls
- Built-in sandboxed code execution
- Native support for HuggingFace Hub models, plus OpenAI-compatible APIs
- Low learning curve compared to LangChain or CrewAI

Prerequisites

pip install smolagents requests openai

You'll need:

- Python 3.10 or newer
- A Mantis API key, exported as the MANTIS_API_KEY environment variable
- Model access: a HuggingFace token for HfApiModel, or an OpenAI API key for OpenAIServerModel

Step 1: Create Custom Scraping Tools

Smolagents tools subclass Tool, declare a name, description, and inputs schema, and implement a forward method:

from smolagents import Tool
import requests
import os

class WebScrapeTool(Tool):
    name = "web_scrape"
    description = """Scrapes a webpage and returns its content as markdown.
    Use this to read the text content of any URL."""
    inputs = {
        "url": {
            "type": "string",
            "description": "The URL to scrape"
        }
    }
    output_type = "string"

    def __init__(self):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]

    def forward(self, url: str) -> str:
        response = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url, "format": "markdown"}
        )
        response.raise_for_status()
        content = response.json()["content"]
        # Trim to avoid context overflow
        return content[:10000]


class WebExtractTool(Tool):
    name = "web_extract"
    description = """Extracts structured data from a webpage using AI.
    Provide a URL and a description of what to extract."""
    inputs = {
        "url": {
            "type": "string",
            "description": "The URL to extract data from"
        },
        "prompt": {
            "type": "string",
            "description": "What data to extract (e.g., 'product name, price, and rating')"
        }
    }
    output_type = "string"

    def __init__(self):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]

    def forward(self, url: str, prompt: str) -> str:
        response = requests.post(
            "https://api.mantisapi.com/v1/extract",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url, "prompt": prompt}
        )
        response.raise_for_status()
        import json
        return json.dumps(response.json()["data"], indent=2)


class WebScreenshotTool(Tool):
    name = "web_screenshot"
    description = """Takes a screenshot of a webpage and returns the image URL.
    Useful for visual inspection or capturing page state."""
    inputs = {
        "url": {
            "type": "string",
            "description": "The URL to screenshot"
        }
    }
    output_type = "string"

    def __init__(self):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]

    def forward(self, url: str) -> str:
        response = requests.post(
            "https://api.mantisapi.com/v1/screenshot",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url}
        )
        response.raise_for_status()
        return response.json()["screenshot_url"]
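Before wiring these tools into an agent, it helps to unit-test the scraping logic without touching the network. A minimal sketch: the transport is injected as a parameter so a stub can stand in for the real Mantis API (the scrape function and FakeResponse class here are illustrative, not part of smolagents).

```python
# Unit-testing scrape logic without the network. The transport (normally
# requests.post) is injected so a stub can replace the real Mantis API.
# Endpoint and response shape mirror the WebScrapeTool above.
def scrape(url: str, post, api_key: str = "test-key") -> str:
    resp = post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={"url": url, "format": "markdown"},
    )
    resp.raise_for_status()
    return resp.json()["content"][:10000]

class FakeResponse:
    """Stub standing in for requests.Response in tests."""
    def raise_for_status(self):
        pass
    def json(self):
        return {"content": "# Example Domain"}

result = scrape("https://example.com", post=lambda *args, **kwargs: FakeResponse())
print(result)  # → # Example Domain
```

The same injection pattern works inside a Tool subclass: accept the transport in __init__ and default it to requests.post.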

Step 2: Build a CodeAgent

The CodeAgent is Smolagents' most powerful agent type. It writes and executes Python code to solve tasks, calling your tools as functions:

from smolagents import CodeAgent, HfApiModel

# Use any HuggingFace model or OpenAI
model = HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct")
# Or: from smolagents import OpenAIServerModel; model = OpenAIServerModel("gpt-4o-mini")

# Create the agent with web scraping tools
agent = CodeAgent(
    tools=[WebScrapeTool(), WebExtractTool(), WebScreenshotTool()],
    model=model,
    max_steps=5
)

# Run a research task
result = agent.run(
    "Scrape the Python 3.13 what's new page and list the top 5 new features"
)
print(result)

🔥 How CodeAgent works: Instead of making tool calls via JSON, the CodeAgent writes actual Python code that calls your tools. It can use loops, conditionals, variables, and any Python logic, making it far more capable for complex scraping tasks.

What the CodeAgent Generates

For the task above, the CodeAgent might generate and execute code like this:

# Generated by CodeAgent:
content = web_scrape(url="https://docs.python.org/3/whatsnew/3.13.html")

# Parse the content to find features
lines = content.split("\n")
features = []
for line in lines:
    if line.startswith("## ") or line.startswith("### "):
        features.append(line.strip("# ").strip())

# Return top 5
top_features = features[:5]
final_answer(top_features)
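The parsing step the agent generated is ordinary Python, so you can sanity-check it in isolation. A standalone sketch with stand-in markdown (the sample headings are illustrative):

```python
# Standalone version of the heading-parsing step, runnable without an agent.
# `sample` is a trimmed stand-in for the scraped markdown content.
sample = """# What's New In Python 3.13
## New Features
### A better interactive interpreter
### Free-threaded CPython
### A JIT compiler
## Improved error messages
## Other Language Changes
"""

features = []
for line in sample.split("\n"):
    # Collect level-2 and level-3 markdown headings as "features"
    if line.startswith("## ") or line.startswith("### "):
        features.append(line.strip("# ").strip())

print(features[:5])
```

This is exactly the kind of intermediate logic a CodeAgent can write, run, and iterate on between tool calls.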

Step 3: Build a ToolCallingAgent

For more predictable behavior, use the ToolCallingAgent, which makes structured tool calls instead of writing code:

from smolagents import ToolCallingAgent, OpenAIServerModel

model = OpenAIServerModel("gpt-4o-mini")

agent = ToolCallingAgent(
    tools=[WebScrapeTool(), WebExtractTool(), WebScreenshotTool()],
    model=model,
    max_steps=5
)

# Extract structured data
result = agent.run(
    "Go to https://github.com/huggingface/smolagents and extract "
    "the number of stars, the main programming language, and the description"
)
print(result)
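Under the hood, the model emits a structured call rather than code. The exact wire format is handled internally by smolagents; the OpenAI-style shape below is an assumption, shown only to contrast structured calls with generated code:

```python
import json

# Illustrative tool-call payload for the GitHub task above. The shape is an
# assumption (OpenAI-style); smolagents parses the model's call and dispatches
# it to the matching tool by name.
tool_call = {
    "name": "web_extract",
    "arguments": {
        "url": "https://github.com/huggingface/smolagents",
        "prompt": "number of stars, main programming language, description",
    },
}

# Round-trip through JSON, as the model/agent boundary does
parsed = json.loads(json.dumps(tool_call))
print(parsed["name"])  # → web_extract
```

Because each step is a single validated call, ToolCallingAgent runs are easier to log and audit than free-form generated code.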

Step 4: Multi-Agent Web Research Team

Smolagents supports multi-agent orchestration through ManagedAgent:

from smolagents import CodeAgent, ToolCallingAgent, ManagedAgent, HfApiModel

model = HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct")

# Agent 1: Web Scraper - gathers raw data
scraper_agent = ToolCallingAgent(
    tools=[WebScrapeTool(), WebScreenshotTool()],
    model=model,
    max_steps=3
)
managed_scraper = ManagedAgent(
    agent=scraper_agent,
    name="web_scraper",
    description="Scrapes web pages and returns their content. Give it a URL."
)

# Agent 2: Data Extractor - structures the data
extractor_agent = ToolCallingAgent(
    tools=[WebExtractTool()],
    model=model,
    max_steps=3
)
managed_extractor = ManagedAgent(
    agent=extractor_agent,
    name="data_extractor",
    description="Extracts structured data from web pages. Give it a URL and what to extract."
)

# Manager Agent: Orchestrates the team
manager = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[managed_scraper, managed_extractor],
    max_steps=8
)

# Run a complex research task
result = manager.run("""
    Research the top 3 Python web scraping libraries.
    For each library:
    1. Scrape their GitHub page for stars and description
    2. Extract their key features from their documentation
    Compile a comparison report.
""")
print(result)

Step 5: Production Patterns

Error Handling and Retries

class RobustWebScrapeTool(Tool):
    name = "web_scrape"
    description = "Scrapes a webpage with automatic retries."
    inputs = {"url": {"type": "string", "description": "URL to scrape"}}
    output_type = "string"

    def __init__(self, max_retries=3):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]
        self.max_retries = max_retries

    def forward(self, url: str) -> str:
        import time

        for attempt in range(self.max_retries):
            try:
                response = requests.post(
                    "https://api.mantisapi.com/v1/scrape",
                    headers={
                        "Authorization": f"Bearer {self.api_key}",
                        "Content-Type": "application/json"
                    },
                    json={"url": url, "format": "markdown"},
                    timeout=30
                )
                response.raise_for_status()
                return response.json()["content"][:10000]
            except requests.exceptions.RequestException as e:
                if attempt == self.max_retries - 1:
                    return f"Error scraping {url}: {str(e)}"
                time.sleep(2 ** attempt)  # Exponential backoff
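The retry loop above sleeps 2 ** attempt seconds between attempts: 1s, 2s, 4s. A common production extension (not in the tool above) adds random jitter so many workers don't retry in lockstep; backoff_delays here is a hypothetical helper for illustration:

```python
import random

# Exponential backoff schedule matching the tool above (base ** attempt),
# with optional jitter. `backoff_delays` is a hypothetical helper, not
# part of smolagents or the Mantis SDK.
def backoff_delays(max_retries: int, base: float = 2.0, jitter: bool = False) -> list:
    delays = []
    for attempt in range(max_retries):
        delay = base ** attempt
        if jitter:
            delay += random.uniform(0, 1)  # spread retries across workers
        delays.append(delay)
    return delays

print(backoff_delays(3))  # → [1.0, 2.0, 4.0]
```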

Caching Results

import hashlib
import json
from pathlib import Path

class CachedWebScrapeTool(Tool):
    name = "web_scrape"
    description = "Scrapes a webpage with local caching."
    inputs = {"url": {"type": "string", "description": "URL to scrape"}}
    output_type = "string"

    def __init__(self, cache_dir="./scrape_cache"):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def forward(self, url: str) -> str:
        # Check cache first
        cache_key = hashlib.md5(url.encode()).hexdigest()
        cache_file = self.cache_dir / f"{cache_key}.json"

        if cache_file.exists():
            cached = json.loads(cache_file.read_text())
            return cached["content"]

        # Fetch from API
        response = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url, "format": "markdown"}
        )
        response.raise_for_status()
        content = response.json()["content"][:10000]

        # Cache the result
        cache_file.write_text(json.dumps({
            "url": url, "content": content
        }))

        return content
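The cache key scheme used above is worth a closer look: an MD5 hex digest of the URL gives a deterministic, filename-safe key (MD5 is fine here because this is a cache key, not a security boundary):

```python
import hashlib

# Same key derivation as CachedWebScrapeTool above: MD5 hex digest of the URL.
def cache_key(url: str) -> str:
    return hashlib.md5(url.encode()).hexdigest()

key = cache_key("https://example.com")
print(len(key))                               # → 32 hex characters
print(key == cache_key("https://example.com"))  # → True: stable across calls
```

Because the key is stable across runs, the cache survives restarts; add a timestamp to the cached JSON if you need expiry.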

Real-World Use Cases

1. Automated Content Curation

agent = CodeAgent(
    tools=[WebScrapeTool(), WebExtractTool()],
    model=model,
    max_steps=8
)

newsletter = agent.run("""
    Create a weekly AI newsletter by:
    1. Scraping these news sources: HuggingFace blog, OpenAI blog, Anthropic blog
    2. Extract the latest 3 articles from each
    3. Summarize each article in 2 sentences
    4. Format as a newsletter with sections by source
""")

2. Price Monitoring Bot

agent = ToolCallingAgent(
    tools=[WebExtractTool()],
    model=model,
    max_steps=5
)

prices = agent.run("""
    Extract the current pricing for these SaaS products:
    1. https://vercel.com/pricing
    2. https://railway.app/pricing
    3. https://render.com/pricing

    For each: get the plan names, monthly prices, and key limits.
    Compare them and recommend the best value for a small startup.
""")

3. Job Board Aggregator

agent = CodeAgent(
    tools=[WebScrapeTool(), WebExtractTool()],
    model=model,
    max_steps=10
)

jobs = agent.run("""
    Search for remote Python developer jobs:
    1. Scrape the HuggingFace careers page
    2. Extract all open engineering positions
    3. For each: title, location, team, and requirements
    4. Filter for remote-friendly positions
    5. Sort by relevance to "ML engineer" role
""")

Smolagents vs Other Frameworks

Feature          | Smolagents          | LangChain          | CrewAI
Lines of code    | ~1,000              | ~100K+             | ~15K+
Code generation  | ✅ CodeAgent         | ❌                 | ❌
Tool calls       | ✅ ToolCallingAgent  | ✅                 | ✅
Multi-agent      | ✅ ManagedAgent      | ✅ (via LangGraph)  | ✅
HF Hub models    | ✅ Native            | ⚠️ Via wrapper     | ⚠️ Via wrapper
Sandbox          | ✅ Built-in          | ❌                 | ❌
Learning curve   | Low                 | High               | Medium

Cost Optimization

Operation  | Mantis Credits | Typical Use
Scrape     | 1 credit       | Page content for agent processing
Extract    | 1 credit       | Structured data from any page
Screenshot | 1 credit       | Visual capture and verification
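With flat per-operation pricing, estimating a run's cost is simple arithmetic. A back-of-envelope sketch based on the table above (estimate is a hypothetical helper, not part of any SDK):

```python
# Credit estimator assuming the flat per-operation costs in the table above.
COSTS = {"scrape": 1, "extract": 1, "screenshot": 1}

def estimate(ops: dict) -> int:
    """ops maps operation name -> number of calls; returns total credits."""
    return sum(COSTS[name] * count for name, count in ops.items())

# Newsletter example: scrape 3 blog index pages, then one extract per source
print(estimate({"scrape": 3, "extract": 3}))  # → 6
```

Multiply by your agent's max_steps to get a worst-case budget per run.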

Tips:

- Cache repeated scrapes locally (see CachedWebScrapeTool above) so reruns cost zero credits
- Trim scraped content before it reaches the model to keep token costs down
- Keep max_steps conservative; each extra step can trigger extra API calls
- Use web_extract when you only need specific fields, rather than scraping a full page and parsing it with the model

Build Your Smolagents Scraping Bot

Get a free API key and start building lightweight, powerful web scraping agents in minutes.

Get Free API Key →

What's Next