Build a Web Scraping Agent with Google ADK (Agent Development Kit)
Google's Agent Development Kit (ADK) is an open-source framework for building AI agents that can use tools, collaborate in teams, and run in production. If you're in the Google/Gemini ecosystem, ADK is the natural choice, and giving those agents web scraping capabilities makes them dramatically more useful.
In this guide, you'll build an ADK agent that can:
- Scrape any URL and get clean, structured content
- Screenshot webpages for visual analysis
- Extract structured data with AI (prices, contacts, product specs)
- Orchestrate multi-agent teams for complex research tasks
All powered by the WebPerception API.
What is Google ADK?
Google ADK is an open-source, code-first Python framework for building AI agents. Released in 2025, it's designed to work seamlessly with Gemini models but also supports other LLMs.
Why ADK stands out:
- Multi-agent orchestration: built-in support for agent teams with delegation
- Flexible tool system: register Python functions as tools with automatic schema generation
- Streaming support: real-time streaming of agent responses
- Session management: built-in conversation state and memory
- Google Cloud integration: deploy to Vertex AI Agent Engine with one command
Prerequisites
- Python 3.10+
- WebPerception API key: get one free (100 calls/month)
- Google ADK installed

```bash
pip install google-adk requests
```
Step 1: Define Your Web Scraping Tools
ADK uses Python functions with type hints as tools. The framework automatically generates the tool schema from your function signatures and docstrings.
```python
# tools.py
import requests
import json

MANTIS_API_KEY = "your_api_key_here"  # Use env vars in production
BASE_URL = "https://api.mantisapi.com/v1"


def scrape_url(url: str, format: str = "markdown") -> str:
    """Scrape a webpage and return its content.

    Args:
        url: The URL to scrape
        format: Output format - 'markdown', 'html', or 'text'

    Returns:
        The scraped content of the webpage
    """
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={"url": url, "format": format},
    )
    data = response.json()
    return data.get("content", f"Error: {data.get('error', 'Unknown error')}")


def screenshot_url(url: str, full_page: bool = False) -> str:
    """Take a screenshot of a webpage.

    Args:
        url: The URL to screenshot
        full_page: Whether to capture the full page or just the viewport

    Returns:
        URL of the screenshot image
    """
    response = requests.post(
        f"{BASE_URL}/screenshot",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={"url": url, "full_page": full_page},
    )
    data = response.json()
    return data.get("screenshot_url", f"Error: {data.get('error', 'Unknown error')}")


def extract_data(url: str, schema: str) -> str:
    """Extract structured data from a webpage using AI.

    Args:
        url: The URL to extract data from
        schema: JSON schema describing what data to extract.
            Example: '{"name": "string", "price": "number", "rating": "number"}'

    Returns:
        JSON string of extracted data matching the schema
    """
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={"url": url, "schema": json.loads(schema)},
    )
    data = response.json()
    return json.dumps(data.get("extracted", data), indent=2)
```
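ADK builds each tool's schema from the function signature and docstring, which is why the type hints above matter. The gist of that kind of introspection can be sketched with the standard library; this is an illustration of the idea, not ADK's actual implementation:

```python
import inspect
from typing import get_type_hints


def tool_schema(fn):
    """Sketch: derive a JSON-schema-like dict from a Python function."""
    hints = get_type_hints(fn)
    hints.pop("return", None)  # only parameters go into the schema
    type_map = {str: "string", int: "integer", float: "number", bool: "boolean"}
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip().split("\n")[0],
        "parameters": {
            name: {"type": type_map.get(tp, "object")} for name, tp in hints.items()
        },
    }


def scrape_url(url: str, format: str = "markdown") -> str:
    """Scrape a webpage and return its content."""
    ...


schema = tool_schema(scrape_url)
# schema["parameters"] == {"url": {"type": "string"}, "format": {"type": "string"}}
```

This is also why untyped parameters or missing docstrings degrade tool quality: the model only sees what the schema carries.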
Step 2: Create Your ADK Agent
```python
# agent.py
from google.adk.agents import Agent

from tools import scrape_url, screenshot_url, extract_data

# Create the web research agent
web_agent = Agent(
    name="web_researcher",
    model="gemini-2.0-flash",
    description="An AI agent that can scrape websites, take screenshots, "
                "and extract structured data.",
    instruction="""You are a web research agent. You can:
1. Scrape any URL to read its content
2. Take screenshots of webpages
3. Extract structured data from pages using AI

When asked to research something:
- Start by scraping the most relevant pages
- Extract specific data points when needed
- Take screenshots when visual context helps
- Summarize findings clearly with sources

Always cite your sources with URLs.""",
    tools=[scrape_url, screenshot_url, extract_data],
)
```
That's it. ADK handles tool schema generation, execution, and response parsing automatically.
Step 3: Run Your Agent
```python
# run.py
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

from agent import web_agent

# Set up session and runner
session_service = InMemorySessionService()
runner = Runner(
    agent=web_agent,
    app_name="web_research_app",
    session_service=session_service,
)

# Create a session
session = session_service.create_session(
    app_name="web_research_app",
    user_id="user_1",
)

# Run the agent
message = types.Content(
    role="user",
    parts=[types.Part(text="Research the top 3 AI agent frameworks "
                           "in 2026. Compare features and pricing.")],
)

for event in runner.run(
    user_id="user_1",
    session_id=session.id,
    new_message=message,
):
    if event.is_final_response():
        print(event.content.parts[0].text)
```
Step 4: Multi-Agent Research Team
ADK's killer feature is multi-agent orchestration. Let's build a research team:
```python
# team.py
from google.adk.agents import Agent

from tools import scrape_url, screenshot_url, extract_data

# Specialist: Web scraper
scraper_agent = Agent(
    name="scraper",
    model="gemini-2.0-flash",
    description="Scrapes websites and extracts raw content.",
    instruction="You scrape URLs and return clean content. "
                "Focus on accuracy and completeness.",
    tools=[scrape_url, screenshot_url],
)

# Specialist: Data extractor
extractor_agent = Agent(
    name="extractor",
    model="gemini-2.0-flash",
    description="Extracts structured data from web pages.",
    instruction="You extract specific data points from web pages "
                "into structured formats. Always validate the data.",
    tools=[extract_data],
)

# Orchestrator: Research lead
research_lead = Agent(
    name="research_lead",
    model="gemini-2.0-flash",
    description="Coordinates web research by delegating to specialists.",
    instruction="""You are a research team lead. You coordinate by:
1. Breaking down research questions into sub-tasks
2. Delegating scraping to the scraper agent
3. Delegating data extraction to the extractor agent
4. Synthesizing findings into a comprehensive report

Always provide a final summary with key findings.""",
    sub_agents=[scraper_agent, extractor_agent],
)
```
The orchestrator agent automatically decides when to delegate to sub-agents based on the task.
Step 5: Real-World Use Cases
Competitor Price Monitoring
```python
price_monitor = Agent(
    name="price_monitor",
    model="gemini-2.0-flash",
    description="Monitors competitor pricing across the web.",
    instruction="""You monitor competitor pricing. When given a product:
1. Search for the product on competitor websites
2. Extract current prices using the extract_data tool
3. Compare prices and highlight the best deals
4. Flag any price changes from previous checks""",
    tools=[scrape_url, extract_data],
)
```
Lead Generation
```python
lead_generator = Agent(
    name="lead_gen",
    model="gemini-2.0-flash",
    description="Finds and qualifies business leads from the web.",
    instruction="""You find and qualify leads. For each company:
1. Scrape their website for key information
2. Extract company details (size, industry, tech stack)
3. Check for hiring pages (indicates growth)
4. Rate lead quality based on ICP fit""",
    tools=[scrape_url, extract_data],
)
```
Content Research Pipeline
```python
content_researcher = Agent(
    name="content_researcher",
    model="gemini-2.0-flash",
    description="Researches topics for content creation.",
    instruction="""You research topics for blog posts. For each topic:
1. Scrape top-ranking articles
2. Identify common themes, gaps, and unique angles
3. Extract key statistics and data points
4. Create an outline that covers the topic better""",
    tools=[scrape_url, extract_data],
)
```
ADK + WebPerception: Why It Works
| Feature | ADK Provides | WebPerception Provides |
|---|---|---|
| Agent orchestration | ✅ Multi-agent teams | ❌ |
| Tool system | ✅ Auto-schema from Python | ❌ |
| Web scraping | ❌ | ✅ Any URL, JS-rendered |
| Screenshots | ❌ | ✅ Full-page captures |
| AI extraction | ❌ | ✅ Structured data |
| Session management | ✅ Built-in state | ❌ |
| Deployment | ✅ Vertex AI | ✅ Cloud API, no infra |
Together: You get production-ready AI agents with real-time web access, deployed on Google Cloud, with zero browser infrastructure to manage.
Deploying to Vertex AI
ADK agents deploy with a single `adk deploy` command, targeting Cloud Run or Vertex AI Agent Engine:

```bash
# Install the Vertex AI SDK
pip install google-cloud-aiplatform

# Deploy your agent
adk deploy cloud_run \
  --project=your-gcp-project \
  --region=us-central1 \
  --app_name=web_research_app \
  --agent_name=web_researcher
```
Your agent runs as a managed service with auto-scaling, monitoring, and API endpoints, with no servers to manage.
Error Handling Best Practices
```python
import requests
from requests.exceptions import RequestException

from tools import BASE_URL, MANTIS_API_KEY


def scrape_url_safe(url: str, format: str = "markdown") -> str:
    """Scrape a webpage with error handling."""
    try:
        response = requests.post(
            f"{BASE_URL}/scrape",
            headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
            json={"url": url, "format": format},
            timeout=30,
        )
        if response.status_code == 429:
            return "Rate limited. Please wait before retrying."
        response.raise_for_status()
        data = response.json()
        return data.get("content", "No content returned")
    except RequestException as e:
        return f"Failed to scrape {url}: {e}"
```
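For transient failures like timeouts and 429s, a retry wrapper with exponential backoff pairs well with a safe scraper like the one above. A generic sketch (the attempt count and delay values are arbitrary choices):

```python
import time


def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff on exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error to the agent
            time.sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...


# Usage (hypothetical):
# content = with_retries(lambda: scrape_url_safe("https://example.com"))
```

Keeping the retry logic outside the tool function means the agent's tool schema stays simple while flaky network conditions are handled uniformly.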
Cost Optimization
WebPerception API uses simple per-call pricing:
| Plan | Calls/Month | Cost per Call |
|---|---|---|
| Free | 100 | $0.00 |
| Starter | 5,000 | $0.0058 |
| Pro | 25,000 | $0.0040 |
| Scale | 100,000 | $0.0030 |
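At full utilization, the listed per-call rates imply the following monthly totals; this is just arithmetic on the numbers in the table above:

```python
# (calls per month, cost per call) from the pricing table
plans = {
    "Starter": (5_000, 0.0058),
    "Pro": (25_000, 0.0040),
    "Scale": (100_000, 0.0030),
}

for name, (calls, rate) in plans.items():
    print(f"{name}: {calls} calls x ${rate} = ${calls * rate:.2f}/month")
# Starter: $29.00, Pro: $100.00, Scale: $300.00
```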
Tips for keeping costs low:
- Cache results for URLs that don't change frequently
- Use `format="text"` when you don't need HTML structure
- Batch related scraping tasks in a single agent run
- Set reasonable timeouts to avoid wasted calls on unresponsive sites
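The caching tip is straightforward to implement in front of any of the tool functions. A minimal time-based cache sketch, where the 1-hour TTL is an arbitrary choice and `fetch` stands in for a real call such as `scrape_url`:

```python
import time

_cache: dict = {}  # key -> (value, stored_at)


def cached(key: str, fetch, ttl: float = 3600.0) -> str:
    """Return a cached value for key, calling fetch() only when stale."""
    now = time.time()
    if key in _cache:
        value, stored_at = _cache[key]
        if now - stored_at < ttl:
            return value  # still fresh: no API call spent
    value = fetch()
    _cache[key] = (value, now)
    return value


# Usage (hypothetical): cached(url, lambda: scrape_url(url))
```

For production you'd likely swap the in-memory dict for Redis or similar so the cache survives restarts and is shared across agent instances.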
Give Your ADK Agents Web Superpowers
Start scraping, screenshotting, and extracting data in minutes. Free tier included.
Get Your Free API Key →

Next Steps
- WebPerception API Quickstart: get your API key in 30 seconds
- Google ADK Documentation: deep dive into ADK features
- AI Agent Tool Use Patterns: 7 architectures for agent tool use
- MCP Web Scraping Server: an alternative, build an MCP server instead