Smolagents is HuggingFace's lightweight agent framework, built for minimal abstraction and maximum control. Unlike heavier frameworks, Smolagents gives you two powerful agent types: CodeAgent (writes and executes Python) and ToolCallingAgent (uses structured tool calls). Both are well suited to web scraping.
In this tutorial, you'll build web scraping agents using Smolagents with the Mantis WebPerception API. You'll learn custom tool creation, both agent types, multi-agent orchestration, and production patterns.
```bash
pip install smolagents requests
```
You'll need:

- A Mantis API key (the tools below read it from the `MANTIS_API_KEY` environment variable)
- Python 3 with the packages above installed
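Since every tool in this tutorial reads the key from the environment, export it before running any code (the key value below is a placeholder):

```shell
# Make the Mantis API key available to the tools (placeholder value shown)
export MANTIS_API_KEY="your-api-key-here"
```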
Smolagents tools are simple Python classes: subclass `Tool`, declare a name, description, and input schema, and implement a `forward` method:
```python
from smolagents import Tool
import requests
import os
import json


class WebScrapeTool(Tool):
    name = "web_scrape"
    description = """Scrapes a webpage and returns its content as markdown.
    Use this to read the text content of any URL."""
    inputs = {
        "url": {
            "type": "string",
            "description": "The URL to scrape"
        }
    }
    output_type = "string"

    def __init__(self):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]

    def forward(self, url: str) -> str:
        response = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url, "format": "markdown"}
        )
        response.raise_for_status()
        content = response.json()["content"]
        # Trim to avoid context overflow
        return content[:10000]


class WebExtractTool(Tool):
    name = "web_extract"
    description = """Extracts structured data from a webpage using AI.
    Provide a URL and a description of what to extract."""
    inputs = {
        "url": {
            "type": "string",
            "description": "The URL to extract data from"
        },
        "prompt": {
            "type": "string",
            "description": "What data to extract (e.g., 'product name, price, and rating')"
        }
    }
    output_type = "string"

    def __init__(self):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]

    def forward(self, url: str, prompt: str) -> str:
        response = requests.post(
            "https://api.mantisapi.com/v1/extract",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url, "prompt": prompt}
        )
        response.raise_for_status()
        return json.dumps(response.json()["data"], indent=2)


class WebScreenshotTool(Tool):
    name = "web_screenshot"
    description = """Takes a screenshot of a webpage and returns the image URL.
    Useful for visual inspection or capturing page state."""
    inputs = {
        "url": {
            "type": "string",
            "description": "The URL to screenshot"
        }
    }
    output_type = "string"

    def __init__(self):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]

    def forward(self, url: str) -> str:
        response = requests.post(
            "https://api.mantisapi.com/v1/screenshot",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url}
        )
        response.raise_for_status()
        return response.json()["screenshot_url"]
```
The CodeAgent is Smolagents' most powerful agent type. It writes and executes Python code to solve tasks, calling your tools as functions:
```python
from smolagents import CodeAgent, HfApiModel

# Use any HuggingFace model or an OpenAI-compatible one
model = HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct")
# Or: from smolagents import OpenAIServerModel
#     model = OpenAIServerModel("gpt-4o-mini")

# Create the agent with web scraping tools
agent = CodeAgent(
    tools=[WebScrapeTool(), WebExtractTool(), WebScreenshotTool()],
    model=model,
    max_steps=5
)

# Run a research task
result = agent.run(
    "Scrape the Python 3.13 what's new page and list the top 5 new features"
)
print(result)
```
How CodeAgent works: Instead of making tool calls via JSON, the CodeAgent writes actual Python code that calls your tools. It can use loops, conditionals, variables, and any other Python logic, making it vastly more capable for complex scraping tasks.
For the task above, the CodeAgent might generate and execute code like this:
```python
# Generated by CodeAgent:
content = web_scrape(url="https://docs.python.org/3/whatsnew/3.13.html")

# Parse the content to find features
lines = content.split("\n")
features = []
for line in lines:
    if line.startswith("## ") or line.startswith("### "):
        features.append(line.strip("# ").strip())

# Return top 5
top_features = features[:5]
final_answer(top_features)
```
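The heading-parsing step in generated code like this is plain Python, so you can sanity-check the approach offline on a small markdown sample (the sample text here is made up for illustration):

```python
# Hypothetical markdown sample standing in for scraped page content
content = """# What's New In Python 3.13
## Interpreter improvements
Some prose here.
### Free-threaded CPython
More prose.
## Typing changes
"""

# Collect second- and third-level headings, stripping the '#' markers
features = []
for line in content.split("\n"):
    if line.startswith("## ") or line.startswith("### "):
        features.append(line.strip("# ").strip())

print(features[:5])  # ['Interpreter improvements', 'Free-threaded CPython', 'Typing changes']
```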
For more predictable behavior, use the ToolCallingAgent, which makes structured tool calls instead of writing code:
```python
from smolagents import ToolCallingAgent, OpenAIServerModel

model = OpenAIServerModel("gpt-4o-mini")

agent = ToolCallingAgent(
    tools=[WebScrapeTool(), WebExtractTool(), WebScreenshotTool()],
    model=model,
    max_steps=5
)

# Extract structured data
result = agent.run(
    "Go to https://github.com/huggingface/smolagents and extract "
    "the number of stars, the main programming language, and the description"
)
print(result)
```
Smolagents supports multi-agent orchestration through ManagedAgent:
```python
from smolagents import CodeAgent, ToolCallingAgent, ManagedAgent, HfApiModel

model = HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct")

# Agent 1: web scraper - gathers raw data
scraper_agent = ToolCallingAgent(
    tools=[WebScrapeTool(), WebScreenshotTool()],
    model=model,
    max_steps=3
)
managed_scraper = ManagedAgent(
    agent=scraper_agent,
    name="web_scraper",
    description="Scrapes web pages and returns their content. Give it a URL."
)

# Agent 2: data extractor - structures the data
extractor_agent = ToolCallingAgent(
    tools=[WebExtractTool()],
    model=model,
    max_steps=3
)
managed_extractor = ManagedAgent(
    agent=extractor_agent,
    name="data_extractor",
    description="Extracts structured data from web pages. Give it a URL and what to extract."
)

# Manager agent: orchestrates the team
manager = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[managed_scraper, managed_extractor],
    max_steps=8
)

# Run a complex research task
result = manager.run("""
Research the top 3 Python web scraping libraries.
For each library:
1. Scrape their GitHub page for stars and description
2. Extract their key features from their documentation
Compile a comparison report.
""")
print(result)
```
```python
import time


class RobustWebScrapeTool(Tool):
    name = "web_scrape"
    description = "Scrapes a webpage with automatic retries."
    inputs = {"url": {"type": "string", "description": "URL to scrape"}}
    output_type = "string"

    def __init__(self, max_retries=3):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]
        self.max_retries = max_retries

    def forward(self, url: str) -> str:
        for attempt in range(self.max_retries):
            try:
                response = requests.post(
                    "https://api.mantisapi.com/v1/scrape",
                    headers={
                        "Authorization": f"Bearer {self.api_key}",
                        "Content-Type": "application/json"
                    },
                    json={"url": url, "format": "markdown"},
                    timeout=30
                )
                response.raise_for_status()
                return response.json()["content"][:10000]
            except requests.exceptions.RequestException as e:
                if attempt == self.max_retries - 1:
                    return f"Error scraping {url}: {str(e)}"
                time.sleep(2 ** attempt)  # Exponential backoff
```
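The `2 ** attempt` expression doubles the wait between retries, so transient failures get progressively more breathing room. A quick check of the schedule for the default three attempts (no final sleep happens in practice, since the last attempt returns instead of retrying):

```python
max_retries = 3

# Delay before retrying after attempt 0, 1, 2 (exponential backoff)
delays = [2 ** attempt for attempt in range(max_retries)]
print(delays)  # [1, 2, 4]
```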
```python
import hashlib
import json
from pathlib import Path


class CachedWebScrapeTool(Tool):
    name = "web_scrape"
    description = "Scrapes a webpage with local caching."
    inputs = {"url": {"type": "string", "description": "URL to scrape"}}
    output_type = "string"

    def __init__(self, cache_dir="./scrape_cache"):
        super().__init__()
        self.api_key = os.environ["MANTIS_API_KEY"]
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def forward(self, url: str) -> str:
        # Check cache first
        cache_key = hashlib.md5(url.encode()).hexdigest()
        cache_file = self.cache_dir / f"{cache_key}.json"
        if cache_file.exists():
            cached = json.loads(cache_file.read_text())
            return cached["content"]
        # Fetch from API
        response = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={"url": url, "format": "markdown"}
        )
        response.raise_for_status()
        content = response.json()["content"][:10000]
        # Cache the result
        cache_file.write_text(json.dumps({
            "url": url, "content": content
        }))
        return content
```
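The cache key is simply the MD5 hex digest of the URL, so identical URLs always map to the same cache file while different URLs get distinct files. You can verify the key function in isolation:

```python
import hashlib


def cache_key(url: str) -> str:
    # Deterministic 32-character hex key derived from the URL
    return hashlib.md5(url.encode()).hexdigest()


k1 = cache_key("https://example.com/page")
k2 = cache_key("https://example.com/page")
k3 = cache_key("https://example.com/other")
print(k1 == k2, k1 != k3, len(k1))  # True True 32
```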
```python
agent = CodeAgent(
    tools=[WebScrapeTool(), WebExtractTool()],
    model=model,
    max_steps=8
)

newsletter = agent.run("""
Create a weekly AI newsletter by:
1. Scraping these news sources: HuggingFace blog, OpenAI blog, Anthropic blog
2. Extracting the latest 3 articles from each
3. Summarizing each article in 2 sentences
4. Formatting the result as a newsletter with sections by source
""")
```
```python
agent = ToolCallingAgent(
    tools=[WebExtractTool()],
    model=model,
    max_steps=5
)

prices = agent.run("""
Extract the current pricing for these SaaS products:
1. https://vercel.com/pricing
2. https://railway.app/pricing
3. https://render.com/pricing
For each: get the plan names, monthly prices, and key limits.
Compare them and recommend the best value for a small startup.
""")
```
```python
agent = CodeAgent(
    tools=[WebScrapeTool(), WebExtractTool()],
    model=model,
    max_steps=10
)

jobs = agent.run("""
Search for remote Python developer jobs:
1. Scrape the HuggingFace careers page
2. Extract all open engineering positions
3. For each: title, location, team, and requirements
4. Filter for remote-friendly positions
5. Sort by relevance to "ML engineer" role
""")
```
| Feature | Smolagents | LangChain | CrewAI |
|---|---|---|---|
| Lines of code | ~1,000 | ~100K+ | ~15K+ |
| Code generation | ✅ CodeAgent | ❌ | ❌ |
| Tool calls | ✅ ToolCallingAgent | ✅ | ✅ |
| Multi-agent | ✅ ManagedAgent | ✅ (via LangGraph) | ✅ |
| HF Hub models | ✅ Native | ⚠️ Via wrapper | ⚠️ Via wrapper |
| Sandbox | ✅ Built-in | ❌ | ❌ |
| Learning curve | Low | High | Medium |
| Operation | Mantis Credits | Typical Use |
|---|---|---|
| Scrape | 1 credit | Page content for agent processing |
| Extract | 1 credit | Structured data from any page |
| Screenshot | 1 credit | Visual capture and verification |
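Because every operation costs one credit, you can budget a task up front. For the newsletter example above (three sources, three articles each), assuming one scrape per source and one extract per article, a rough estimate looks like:

```python
# Credit costs from the pricing table above
scrape_cost = 1   # credits per scrape
extract_cost = 1  # credits per extract

sources = 3             # HuggingFace, OpenAI, Anthropic blogs
articles_per_source = 3

# One scrape per source plus one extract per article
total = sources * scrape_cost + sources * articles_per_source * extract_cost
print(total)  # 12 credits
```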
Tips:

- Set `max_steps` (e.g. `max_steps=5`) to limit agent loops and API calls
- Cache repeated scrapes with the `CachedWebScrapeTool` pattern above

Get a free API key and start building lightweight, powerful web scraping agents in minutes.

Get Free API Key →