Build a Web Scraping Agent with LlamaIndex
LlamaIndex is the leading framework for building RAG (Retrieval-Augmented Generation) applications. But what happens when your knowledge base needs live web data? That's where web scraping tools come in.
This guide shows you how to build a LlamaIndex agent that can scrape websites, take screenshots, and extract structured data on demand using the WebPerception API.
Why LlamaIndex + Web Scraping?
LlamaIndex excels at connecting LLMs to your data. Adding web scraping tools means your agents can:
- Augment RAG with live data – Don't just search your index; fetch fresh content from the web
- Build research agents – Multi-step web research with automatic source citation
- Monitor and update – Keep your knowledge base current by scraping sources on a schedule
- Extract structured data – Turn any webpage into typed Pydantic models
Installation
```shell
pip install llama-index llama-index-llms-openai requests
```
Get your WebPerception API key at mantisapi.com.
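The snippets below read the key from the `MANTIS_API_KEY` environment variable. Assuming a bash or zsh shell, you can export it like this (the value is a placeholder for your real key):

```shell
# Make the key available to the Python snippets below (placeholder value)
export MANTIS_API_KEY="your-api-key-here"
```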
Define Web Scraping Tools
```python
import os

import requests
from llama_index.core.tools import FunctionTool

MANTIS_API_KEY = os.environ["MANTIS_API_KEY"]
MANTIS_BASE = "https://api.mantisapi.com/v1"

HEADERS = {
    "Authorization": f"Bearer {MANTIS_API_KEY}",
    "Content-Type": "application/json",
}

def scrape_url(url: str, format: str = "markdown") -> str:
    """Scrape a webpage and return its content as clean text or markdown.

    Args:
        url: The URL to scrape
        format: Output format – 'text' or 'markdown' (default: markdown)
    """
    resp = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers=HEADERS,
        json={"url": url, "format": format},
    )
    resp.raise_for_status()
    return resp.json()["content"]

def screenshot_url(url: str, full_page: bool = False) -> dict:
    """Take a screenshot of a webpage and return the image URL.

    Args:
        url: The URL to screenshot
        full_page: Whether to capture the entire page (default: False)
    """
    resp = requests.post(
        f"{MANTIS_BASE}/screenshot",
        headers=HEADERS,
        json={"url": url, "full_page": full_page},
    )
    resp.raise_for_status()
    data = resp.json()
    return {"image_url": data["url"], "width": data["width"], "height": data["height"]}

def extract_data(url: str, prompt: str) -> dict:
    """Extract structured data from a webpage using AI.

    Args:
        url: The URL to extract data from
        prompt: Description of what data to extract
    """
    resp = requests.post(
        f"{MANTIS_BASE}/extract",
        headers=HEADERS,
        json={"url": url, "prompt": prompt},
    )
    resp.raise_for_status()
    return resp.json()

# Wrap as LlamaIndex FunctionTools
scrape_tool = FunctionTool.from_defaults(fn=scrape_url)
screenshot_tool = FunctionTool.from_defaults(fn=screenshot_url)
extract_tool = FunctionTool.from_defaults(fn=extract_data)
```
Create a ReAct Agent
```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", temperature=0)

agent = ReActAgent.from_tools(
    tools=[scrape_tool, screenshot_tool, extract_tool],
    llm=llm,
    verbose=True,
    max_iterations=10,
    system_prompt="""You are a web research agent. You can scrape websites,
take screenshots, and extract structured data. When researching a topic:
1. Scrape relevant pages for information
2. Extract structured data when needed
3. Synthesize findings into a clear summary
Always cite your sources with URLs.""",
)

# Run the agent
response = agent.chat(
    "What are the latest features announced by OpenAI this week?"
)
print(response)
```
Combine Web Scraping with RAG
The real power comes from combining your existing knowledge index with live web tools:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.tools import QueryEngineTool

# Load your existing documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a query engine tool from your index
query_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="knowledge_base",
    description="Search the internal knowledge base for company docs and guides",
)

# Create an agent with BOTH internal search and web scraping
agent = ReActAgent.from_tools(
    tools=[query_tool, scrape_tool, extract_tool],
    llm=llm,
    verbose=True,
    system_prompt="""You are a research assistant with access to:
1. An internal knowledge base (use for company-specific questions)
2. Web scraping tools (use for live data, competitors, market research)
Always try the knowledge base first. Use web scraping for
current information or topics not in the knowledge base.""",
)

response = agent.chat(
    "How does our pricing compare to ScrapingBee's current pricing?"
)
```
Structured Data Extraction with Pydantic
```python
from typing import List, Optional

from pydantic import BaseModel

class CompanyProfile(BaseModel):
    name: str
    description: str
    pricing_tiers: List[str]
    key_features: List[str]
    founded_year: Optional[int] = None  # default None so the field is truly optional

def extract_company_profile(url: str) -> CompanyProfile:
    """Extract a structured company profile from a website.

    Args:
        url: The company's website URL
    """
    data = extract_data(
        url=url,
        prompt="Extract: company name, description, pricing tier names, "
        "key features (list), and founding year if available",
    )
    return CompanyProfile(**data["extracted"])

company_tool = FunctionTool.from_defaults(fn=extract_company_profile)
```
Multi-Agent Web Research Team
```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# Researcher agent – finds and scrapes sources
researcher = ReActAgent.from_tools(
    tools=[scrape_tool, extract_tool],
    llm=llm,
    system_prompt="You are a web researcher. Scrape URLs and extract key data points.",
)

def research(query: str) -> str:
    """Research a topic by scraping relevant web pages."""
    return str(researcher.chat(query))

research_tool = FunctionTool.from_defaults(fn=research)

# Analyst agent – synthesizes research into insights
analyst = ReActAgent.from_tools(
    tools=[research_tool],
    llm=llm,
    system_prompt="""You are a market analyst. Use the research tool to
gather data, then synthesize it into actionable insights with
specific numbers and recommendations.""",
)

response = analyst.chat(
    "Analyze the web scraping API market: top 5 players, pricing, and trends"
)
```
Real-World Use Cases
1. Content Research Pipeline
Build an agent that researches a topic, scrapes top-ranking articles, and generates a comprehensive content brief with keyword opportunities.
2. Competitor Monitoring
Schedule your agent to scrape competitor pricing pages weekly. Extract structured pricing data and flag changes automatically.
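The change-flagging step can be as simple as diffing two snapshots of extracted pricing. A minimal sketch, assuming each snapshot is a dict mapping tier name to price string (the tier names and prices below are invented example data):

```python
def pricing_changes(old: dict, new: dict) -> list:
    """Compare two pricing snapshots and describe every tier that changed."""
    changes = []
    for tier in sorted(set(old) | set(new)):
        if old.get(tier) != new.get(tier):
            changes.append(f"{tier}: {old.get(tier)} -> {new.get(tier)}")
    return changes

# Example with invented data
last_week = {"Hobby": "$29", "Pro": "$99"}
this_week = {"Hobby": "$29", "Pro": "$119", "Enterprise": "custom"}
print(pricing_changes(last_week, this_week))
```

A tier missing from one snapshot shows up as `None` on that side, so added and removed tiers are flagged too.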
3. Lead Enrichment
Given a company URL, extract contact info, tech stack, employee count, and recent news. Feed enriched data into your CRM.
4. Knowledge Base Updater
Periodically scrape documentation sites and add new content to your VectorStoreIndex, keeping your RAG pipeline current.
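The updater boils down to: scrape each source, attach provenance metadata, and feed the results into the index. A framework-agnostic sketch, where the `scrape` argument stands in for `scrape_url` above; in a real pipeline you would wrap each dict in a llama_index `Document` and insert it into your `VectorStoreIndex`:

```python
from datetime import datetime, timezone

def refresh_sources(urls, scrape):
    """Scrape each URL and return document dicts ready for indexing."""
    docs = []
    for url in urls:
        docs.append({
            "text": scrape(url),
            "metadata": {
                "source": url,
                "fetched_at": datetime.now(timezone.utc).isoformat(),
            },
        })
    return docs

# Stubbed example – no network needed
docs = refresh_sources(["https://example.com/docs"], scrape=lambda u: f"content of {u}")
print(docs[0]["metadata"]["source"])
```

Recording `fetched_at` per document lets a later run skip sources scraped recently.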
Best Practices
- Set `max_iterations` – Prevent runaway agents that scrape endlessly. 5-10 is usually enough.
- Use verbose mode during development – See the agent's reasoning chain and tool calls.
- Cache with LlamaIndex storage – Use `StorageContext` to persist scraped data between runs.
- Combine tools strategically – Use `scrape_url` for full content, `extract_data` when you need specific fields.
- Handle errors gracefully – Wrap tool functions in try/except and return descriptive error messages.
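For the last point, a decorator keeps the try/except out of every tool body. A sketch that converts any exception into a string the agent can read and react to (the decorator name is ours, not part of LlamaIndex):

```python
import functools

def errors_as_text(fn):
    """Turn exceptions from a tool function into descriptive strings for the LLM."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            return f"Tool '{fn.__name__}' failed: {type(exc).__name__}: {exc}"
    return wrapper

@errors_as_text
def broken_scrape(url: str) -> str:
    raise ConnectionError("connection refused")

print(broken_scrape("https://example.com"))
```

Returning the error as text, rather than raising, lets a ReAct agent observe the failure and try a different URL or tool instead of crashing the run.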
Cost Optimization
| Operation | WebPerception Cost | Typical Use |
|---|---|---|
| Scrape | 1 credit | Full content extraction |
| Screenshot | 1 credit | Visual verification |
| AI Extract | 2 credits | Structured data |
Start with the Free tier (100 calls/month) for development and testing.
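With those rates, the cost of an agent run is easy to budget up front; for example, a run that scrapes five pages and performs two AI extracts costs 5 × 1 + 2 × 2 = 9 credits:

```python
# Credit rates from the table above
CREDITS = {"scrape": 1, "screenshot": 1, "extract": 2}

def run_cost(scrapes=0, screenshots=0, extracts=0):
    """Total credits for one agent run."""
    return (scrapes * CREDITS["scrape"]
            + screenshots * CREDITS["screenshot"]
            + extracts * CREDITS["extract"])

print(run_cost(scrapes=5, extracts=2))  # 9
```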
Give Your LlamaIndex Agents Web Access
100 free API calls/month. Full documentation. No credit card required.
Get Your API Key →
Next Steps
- WebPerception API Quickstart – Get started in 5 minutes
- All Framework Integrations – LangChain, CrewAI, OpenAI, and more
- AI Agent Tool Use Patterns – 7 architectures that work
- Build an MCP Server – Universal tool protocol