Build a Web Scraping Agent with LlamaIndex

March 9, 2026 · 11 min read · LlamaIndex · Python · RAG

LlamaIndex is the leading framework for building RAG (Retrieval-Augmented Generation) applications. But what happens when your knowledge base needs live web data? That's where web scraping tools come in.

This guide shows you how to build a LlamaIndex agent that can scrape websites, take screenshots, and extract structured data on demand using the WebPerception API.

Why LlamaIndex + Web Scraping?

LlamaIndex excels at connecting LLMs to your data. Adding web scraping tools means your agents can:

- Pull live web content into answers instead of relying on a stale index
- Capture screenshots for visual verification
- Extract structured data (pricing, profiles, product details) on demand
- Keep your RAG knowledge base current with freshly scraped pages

Installation

pip install llama-index llama-index-llms-openai requests

Get your WebPerception API key at mantisapi.com.

Define Web Scraping Tools

import os
import requests
from llama_index.core.tools import FunctionTool

MANTIS_API_KEY = os.environ["MANTIS_API_KEY"]
MANTIS_BASE = "https://api.mantisapi.com/v1"
HEADERS = {
    "Authorization": f"Bearer {MANTIS_API_KEY}",
    "Content-Type": "application/json",
}

def scrape_url(url: str, format: str = "markdown") -> str:
    """Scrape a webpage and return its content as clean text or markdown.

    Args:
        url: The URL to scrape
        format: Output format, either 'text' or 'markdown' (default: markdown)
    """
    resp = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers=HEADERS,
        json={"url": url, "format": format},
        timeout=30,  # don't let a slow page hang the agent loop
    )
    resp.raise_for_status()
    return resp.json()["content"]

def screenshot_url(url: str, full_page: bool = False) -> dict:
    """Take a screenshot of a webpage and return the image URL.

    Args:
        url: The URL to screenshot
        full_page: Whether to capture the entire page (default: False)
    """
    resp = requests.post(
        f"{MANTIS_BASE}/screenshot",
        headers=HEADERS,
        json={"url": url, "full_page": full_page},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return {"image_url": data["url"], "width": data["width"], "height": data["height"]}

def extract_data(url: str, prompt: str) -> dict:
    """Extract structured data from a webpage using AI.

    Args:
        url: The URL to extract data from
        prompt: Description of what data to extract
    """
    resp = requests.post(
        f"{MANTIS_BASE}/extract",
        headers=HEADERS,
        json={"url": url, "prompt": prompt},
        timeout=60,  # AI extraction can take longer than a plain scrape
    )
    resp.raise_for_status()
    return resp.json()

# Wrap as LlamaIndex FunctionTools
scrape_tool = FunctionTool.from_defaults(fn=scrape_url)
screenshot_tool = FunctionTool.from_defaults(fn=screenshot_url)
extract_tool = FunctionTool.from_defaults(fn=extract_data)
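Before handing these tools to an agent, it's worth making their failures survivable: an unhandled exception inside a tool call can derail the whole reasoning loop. Here's a minimal sketch of a wrapper (a hypothetical helper, not part of LlamaIndex or the WebPerception API) that turns exceptions into descriptive strings the LLM can read and react to:

```python
import functools

def safe_tool(fn):
    """Wrap a tool function so exceptions become descriptive strings
    instead of crashing the agent loop (hypothetical helper)."""
    @functools.wraps(fn)  # preserves __name__/__doc__, which FunctionTool reads
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Return the failure as text so the LLM can retry or adjust
            return f"Tool '{fn.__name__}' failed: {type(exc).__name__}: {exc}"
    return wrapper
```

You would then wrap before building each FunctionTool, e.g. `FunctionTool.from_defaults(fn=safe_tool(scrape_url))`. Using `functools.wraps` matters here because `FunctionTool.from_defaults` derives the tool's name and description from the function's `__name__` and docstring.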

Create a ReAct Agent

from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o", temperature=0)

agent = ReActAgent.from_tools(
    tools=[scrape_tool, screenshot_tool, extract_tool],
    llm=llm,
    verbose=True,
    max_iterations=10,
    system_prompt="""You are a web research agent. You can scrape websites,
    take screenshots, and extract structured data. When researching a topic:
    1. Scrape relevant pages for information
    2. Extract structured data when needed
    3. Synthesize findings into a clear summary
    Always cite your sources with URLs.""",
)

# Run the agent
response = agent.chat(
    "What are the latest features announced by OpenAI this week?"
)
print(response)

Combine Web Scraping with RAG

The real power comes from combining your existing knowledge index with live web tools:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load your existing documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a query engine tool from your index
from llama_index.core.tools import QueryEngineTool

query_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="knowledge_base",
    description="Search the internal knowledge base for company docs and guides",
)

# Create an agent with BOTH internal search and web scraping
agent = ReActAgent.from_tools(
    tools=[query_tool, scrape_tool, extract_tool],
    llm=llm,
    verbose=True,
    system_prompt="""You are a research assistant with access to:
    1. An internal knowledge base (use for company-specific questions)
    2. Web scraping tools (use for live data, competitors, market research)

    Always try the knowledge base first. Use web scraping for
    current information or topics not in the knowledge base.""",
)

response = agent.chat(
    "How does our pricing compare to ScrapingBee's current pricing?"
)

Structured Data Extraction with Pydantic

from pydantic import BaseModel
from typing import List, Optional

class CompanyProfile(BaseModel):
    name: str
    description: str
    pricing_tiers: List[str]
    key_features: List[str]
    founded_year: Optional[int]

def extract_company_profile(url: str) -> CompanyProfile:
    """Extract a structured company profile from a website.

    Args:
        url: The company's website URL
    """
    data = extract_data(
        url=url,
        prompt="Extract: company name, description, pricing tier names, "
               "key features (list), and founding year if available"
    )
    return CompanyProfile(**data["extracted"])

company_tool = FunctionTool.from_defaults(fn=extract_company_profile)
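One gotcha with the pattern above: if the extraction comes back incomplete, `CompanyProfile(**data["extracted"])` fails with a cryptic validation error mid-agent-run. A small guard (hypothetical helper, assuming the `/extract` response shape shown earlier with an "extracted" key) produces a readable failure instead:

```python
REQUIRED_FIELDS = {"name", "description", "pricing_tiers", "key_features"}

def validate_extracted(payload: dict) -> dict:
    """Return the 'extracted' dict if every required field is present,
    raising early with a readable message otherwise."""
    extracted = payload.get("extracted", {})
    missing = REQUIRED_FIELDS - extracted.keys()
    if missing:
        raise ValueError(f"extraction incomplete, missing: {sorted(missing)}")
    extracted.setdefault("founded_year", None)  # Optional in the model
    return extracted
```

Then construct the model as `CompanyProfile(**validate_extracted(data))`, so the agent sees "extraction incomplete, missing: ['pricing_tiers']" rather than a stack of Pydantic errors.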

Multi-Agent Web Research Team

from llama_index.core.agent import ReActAgent

# Researcher agent - finds and scrapes sources
researcher = ReActAgent.from_tools(
    tools=[scrape_tool, extract_tool],
    llm=llm,
    system_prompt="You are a web researcher. Scrape URLs and extract key data points.",
)

# Analyst agent - synthesizes research into insights
from llama_index.core.tools import FunctionTool

def research(query: str) -> str:
    """Research a topic by scraping relevant web pages."""
    return str(researcher.chat(query))

research_tool = FunctionTool.from_defaults(fn=research)

analyst = ReActAgent.from_tools(
    tools=[research_tool],
    llm=llm,
    system_prompt="""You are a market analyst. Use the research tool to
    gather data, then synthesize it into actionable insights with
    specific numbers and recommendations.""",
)

response = analyst.chat(
    "Analyze the web scraping API market: top 5 players, pricing, and trends"
)

Real-World Use Cases

1. Content Research Pipeline

Build an agent that researches a topic, scrapes top-ranking articles, and generates a comprehensive content brief with keyword opportunities.

2. Competitor Monitoring

Schedule your agent to scrape competitor pricing pages weekly. Extract structured pricing data and flag changes automatically.
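The "flag changes" step is plain Python once extraction is done. A sketch, assuming you use extract_data to pull each competitor's pricing into a `{tier: price}` dict and keep last week's snapshot around (the function name and dict shape are illustrative, not part of any API):

```python
def diff_pricing(old: dict, new: dict) -> list:
    """Compare two pricing snapshots ({tier: price}) and describe changes."""
    changes = []
    for tier in sorted(set(old) | set(new)):
        if tier not in old:
            changes.append(f"new tier '{tier}' at {new[tier]}")
        elif tier not in new:
            changes.append(f"tier '{tier}' removed")
        elif old[tier] != new[tier]:
            changes.append(f"'{tier}' changed: {old[tier]} -> {new[tier]}")
    return changes
```

An empty list means no changes, which makes it trivial to decide whether to send an alert.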

3. Lead Enrichment

Given a company URL, extract contact info, tech stack, employee count, and recent news. Feed enriched data into your CRM.

4. Knowledge Base Updater

Periodically scrape documentation sites and add new content to your VectorStoreIndex, keeping your RAG pipeline current.

Best Practices

  1. Set max_iterations - Prevent runaway agents that scrape endlessly. 5-10 is usually enough.
  2. Use verbose mode during development - See the agent's reasoning chain and tool calls.
  3. Cache with LlamaIndex storage - Use StorageContext to persist scraped data between runs.
  4. Combine tools strategically - Use scrape_url for full content, extract_data when you need specific fields.
  5. Handle errors gracefully - Wrap tool functions in try/except and return descriptive error messages.
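For practice 3, StorageContext is the full-featured route; when all you need is "don't re-scrape the same URL twice", a file cache is enough. A minimal sketch (hypothetical helper; `fetch` would be the `scrape_url` function defined earlier):

```python
import hashlib
import json
import pathlib

def cached_fetch(url: str, fetch, cache_dir: str = ".scrape_cache") -> str:
    """Return cached content for a URL, calling fetch(url) only on a miss."""
    directory = pathlib.Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length filename
    key = hashlib.sha256(url.encode()).hexdigest()
    path = directory / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["content"]
    content = fetch(url)
    path.write_text(json.dumps({"url": url, "content": content}))
    return content
```

Each cache hit saves one API credit; add a timestamp to the stored JSON if you want entries to expire.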

Cost Optimization

Operation     WebPerception Cost    Typical Use
Scrape        1 credit              Full content extraction
Screenshot    1 credit              Visual verification
AI Extract    2 credits             Structured data

Start with the Free tier (100 calls/month) for development and testing.
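To check whether a workflow fits your tier before scheduling it, the per-credit figures in the table above make the arithmetic a one-liner (the helper below is illustrative, not part of the API):

```python
# Per-operation credit costs, taken from the table above
CREDIT_COSTS = {"scrape": 1, "screenshot": 1, "extract": 2}

def estimate_monthly_credits(ops_per_run: dict, runs_per_month: int) -> int:
    """Estimate monthly credit usage for a recurring agent workflow.

    ops_per_run maps an operation name to how many times one run uses it,
    e.g. {"scrape": 3, "extract": 1} for three scrapes plus one extraction.
    """
    per_run = sum(CREDIT_COSTS[op] * count for op, count in ops_per_run.items())
    return per_run * runs_per_month
```

For example, a weekly competitor check that scrapes three pages and runs one extraction costs (3 × 1 + 1 × 2) × 4 = 20 credits a month.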

Give Your LlamaIndex Agents Web Access

100 free API calls/month. Full documentation. No credit card required.

Get Your API Key →

Next Steps