# LangChain Web Scraping: The Complete Guide for 2026
If you're building LangChain agents, you've probably hit this wall: your agent needs real-time web data, but LangChain's built-in tools for web access are... limited.
Maybe you tried WebBaseLoader and got a mess of HTML. Or you used requests and spent more time parsing than building. Sound familiar?
This guide shows you how to give your LangChain agents reliable, structured web access using WebPerception API — purpose-built for AI agents.
## Why LangChain Agents Need Better Web Access
LangChain agents are powerful orchestrators. They reason, plan, and execute multi-step tasks. But they're only as good as their tools.
The web is the world's largest data source, and most LangChain web tools fall short:
| Tool | Problem |
|------|---------|
| WebBaseLoader | Returns raw HTML, no JS rendering |
| requests + BeautifulSoup | Manual parsing, breaks on dynamic sites |
| SeleniumLoader | Heavy, slow, requires browser infrastructure |
| PlaywrightLoader | Same issues — needs headless browser management |
What agents actually need:
- Clean, structured data (not raw HTML)
- JavaScript rendering (most modern sites need it)
- AI-powered extraction (get exactly the fields you need)
- Screenshots (visual verification of what the page looks like)
- Reliability (no browser crashes, no infrastructure to manage)
That's exactly what WebPerception API provides — as a simple REST API your LangChain agent can call.
## Setup: WebPerception as a LangChain Tool
### Step 1: Get Your API Key
Sign up at mantisapi.com — the free tier gives you 100 API calls/month, no credit card required.
### Step 2: Install Dependencies

```bash
pip install langchain langchain-openai requests
```
### Step 3: Create the WebPerception Tool
```python
import json

import requests
from langchain.tools import tool

WEBPERCEPTION_API_KEY = "your-api-key"
BASE_URL = "https://api.mantisapi.com"

@tool
def scrape_webpage(url: str) -> str:
    """Scrape a webpage and return clean, readable content.

    Use this when you need to read the content of any web page."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"x-api-key": WEBPERCEPTION_API_KEY},
        json={"url": url},
    )
    data = response.json()
    return data.get("content", "Failed to scrape page")

@tool
def extract_data(url: str, fields: str) -> str:
    """Extract specific structured data from a webpage.

    Use this when you need specific information like prices, names, dates, etc.

    Args:
        url: The webpage URL to extract data from
        fields: Comma-separated list of fields to extract (e.g., "price, title, rating")
    """
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"x-api-key": WEBPERCEPTION_API_KEY},
        json={
            "url": url,
            "prompt": f"Extract the following fields: {fields}",
        },
    )
    # Serialize the JSON so the tool returns a string, as its signature declares
    return json.dumps(response.json())

@tool
def take_screenshot(url: str) -> str:
    """Take a screenshot of a webpage. Returns the screenshot URL.

    Use this when you need to visually verify what a page looks like."""
    response = requests.post(
        f"{BASE_URL}/screenshot",
        headers={"x-api-key": WEBPERCEPTION_API_KEY},
        json={"url": url},
    )
    data = response.json()
    return data.get("screenshotUrl", "Failed to take screenshot")
```
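Hardcoding the key is fine for a quick experiment, but anything you commit to version control should read it from the environment instead. A minimal sketch; the `WEBPERCEPTION_API_KEY` variable name is just a convention here, not something the API requires:

```python
import os

def load_api_key() -> str:
    """Read the WebPerception API key from the environment, failing loudly if unset."""
    key = os.environ.get("WEBPERCEPTION_API_KEY")
    if not key:
        raise RuntimeError(
            "Set WEBPERCEPTION_API_KEY before starting the agent, e.g. "
            "export WEBPERCEPTION_API_KEY=your-api-key"
        )
    return key
```

Call `load_api_key()` once at startup and pass the result into your request headers.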
### Step 4: Build Your Agent
```python
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define tools
tools = [scrape_webpage, extract_data, take_screenshot]

# Create the agent
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant with access to web scraping tools. "
               "Use scrape_webpage to read pages, extract_data for structured info, "
               "and take_screenshot for visual verification."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_tools_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
```
### Step 5: Run It

```python
# Research a competitor
result = agent_executor.invoke({
    "input": "Go to stripe.com/pricing and extract their pricing tiers with prices and features"
})
print(result["output"])
```
Your agent will:

1. Call `extract_data` with the URL and fields
2. WebPerception renders the JavaScript and loads the full page
3. AI extracts exactly the structured data your agent asked for
4. You get back clean, structured JSON, with no parsing needed
## Real-World Use Cases
### 1. Competitive Intelligence Agent
Monitor competitor pricing and features automatically:
```python
@tool
def monitor_competitor(competitor_url: str) -> dict:
    """Monitor a competitor's pricing page for changes."""
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"x-api-key": WEBPERCEPTION_API_KEY},
        json={
            "url": competitor_url,
            "prompt": "Extract all pricing tiers with: tier name, monthly price, "
                      "annual price, key features, and any limits or quotas"
        }
    )
    return response.json()

# Use in a chain
competitors = [
    "https://scrapingbee.com/pricing",
    "https://apify.com/pricing",
    "https://brightdata.com/pricing"
]

for url in competitors:
    result = agent_executor.invoke({
        "input": f"Extract pricing from {url} and summarize the tiers"
    })
```
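Monitoring only pays off when you can tell what changed between runs. A hypothetical helper that diffs two extraction snapshots; the flat `{tier: price}` shape is an assumption for illustration, so adapt it to whatever JSON your extraction prompt actually returns:

```python
def diff_pricing(old: dict, new: dict) -> dict:
    """Compare two pricing snapshots; report added, removed, and changed tiers."""
    changes = {"added": {}, "removed": {}, "changed": {}}
    for tier, price in new.items():
        if tier not in old:
            changes["added"][tier] = price
        elif old[tier] != price:
            changes["changed"][tier] = {"before": old[tier], "after": price}
    for tier, price in old.items():
        if tier not in new:
            changes["removed"][tier] = price
    return changes
```

Persist each `monitor_competitor` result, diff it against the previous snapshot, and alert only when one of the three buckets is non-empty.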
### 2. Content Research Agent
Research topics and summarize sources:
```python
result = agent_executor.invoke({
    "input": """Research the topic 'AI agents in enterprise'.
    Scrape these 3 articles and create a summary with key insights:
    - https://techcrunch.com/tag/ai-agents/
    - https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights
    - https://hbr.org/topic/subject/ai-and-machine-learning
    """
})
```
### 3. Lead Generation Agent
Extract contact info from business directories:
```python
result = agent_executor.invoke({
    "input": """Go to this YC company directory page and extract:
    company name, description, website, and founding year
    for all companies listed: https://www.ycombinator.com/companies"""
})
```
### 4. E-commerce Price Tracker
```python
result = agent_executor.invoke({
    "input": """Check the price of 'Sony WH-1000XM5' on these sites and tell me
    which has the best deal:
    - https://www.amazon.com/dp/B0BX2L8PDS
    - https://www.bestbuy.com/site/sony-wh-1000xm5/6505727.p
    """
})
```
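If you want the comparison done in code rather than in the LLM's prose answer, prices extracted as strings ("$348.00", "299,99 €") need normalizing first. A deliberately naive sketch; it ignores currency conversion and thousands separators, and both helper names are ours, not part of any API:

```python
import re

def parse_price(raw: str) -> float:
    """Pull the first number out of a price string like '$348.00' or '299,99'."""
    match = re.search(r"(\d+(?:[.,]\d{1,2})?)", raw.replace(",", "."))
    if not match:
        raise ValueError(f"No price found in {raw!r}")
    return float(match.group(1))

def best_deal(prices: dict[str, str]) -> tuple[str, float]:
    """Return the (site, price) pair with the lowest parsed price."""
    parsed = {site: parse_price(raw) for site, raw in prices.items()}
    site = min(parsed, key=parsed.get)
    return site, parsed[site]
```

Feed it the per-site strings your extraction returns and it hands back the cheapest source.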
## Advanced: LangChain + WebPerception in LCEL
For more control, use LangChain Expression Language (LCEL):
```python
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

def scrape_all(inputs: dict) -> dict:
    """Scrape each URL and bundle the results into one context string."""
    scraped = []
    for url in inputs["urls"]:
        content = scrape_webpage.invoke(url)
        scraped.append(f"Source: {url}\n{content}")
    return {"question": inputs["question"], "context": "\n\n---\n\n".join(scraped)}

def build_prompt(inputs: dict) -> str:
    """Turn the question and scraped context into a single prompt."""
    return (
        f"Based on the following sources, answer this question: "
        f"{inputs['question']}\n\n{inputs['context']}"
    )

# Compose the steps with LCEL's pipe operator
llm = ChatOpenAI(model="gpt-4o")
web_research_chain = RunnableLambda(scrape_all) | RunnableLambda(build_prompt) | llm

# Usage
answer = web_research_chain.invoke({
    "urls": ["https://docs.langchain.com", "https://python.langchain.com"],
    "question": "What are the latest LangChain features for agent development?"
}).content
```
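Scraping the URLs one at a time leaves latency on the table, since each scrape is a blocking HTTP call. A `ThreadPoolExecutor` sketch that fetches them concurrently with no extra dependencies; it assumes any `scrape` callable with the shape of `scrape_webpage.invoke`:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_parallel(urls: list[str], scrape, max_workers: int = 5) -> list[str]:
    """Scrape every URL concurrently, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map dispatches to worker threads but yields results in input order
        contents = list(pool.map(scrape, urls))
    return [f"Source: {url}\n{content}" for url, content in zip(urls, contents)]
```

Swap the sequential loop for `scrape_parallel(inputs["urls"], scrape_webpage.invoke)` and keep the rest of the chain unchanged.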
## WebPerception vs. Other LangChain Web Tools
| Feature | WebPerception API | WebBaseLoader | Selenium | Playwright |
|---------|:-:|:-:|:-:|:-:|
| JavaScript rendering | ✅ | ❌ | ✅ | ✅ |
| AI data extraction | ✅ | ❌ | ❌ | ❌ |
| Screenshots | ✅ | ❌ | ✅ | ✅ |
| No infrastructure | ✅ | ✅ | ❌ | ❌ |
| Speed | Fast | Fast | Slow | Medium |
| Reliability | 99.9% | Varies | Low | Medium |
| Setup time | 2 min | 1 min | 30+ min | 15+ min |
## Pricing
WebPerception API has a generous free tier — perfect for development and testing:
| Plan | Calls/Month | Price |
|------|-------------|-------|
| Free | 100 | $0 |
| Starter | 5,000 | $29/mo |
| Pro | 25,000 | $99/mo |
| Scale | 100,000 | $299/mo |
Overage: $0.005 per additional call.
Most LangChain projects start on Free and upgrade when they go to production. Sign up here →
## Common Patterns
### Error Handling
```python
@tool
def safe_scrape(url: str) -> str:
    """Scrape a webpage with error handling."""
    try:
        response = requests.post(
            f"{BASE_URL}/scrape",
            headers={"x-api-key": WEBPERCEPTION_API_KEY},
            json={"url": url},
            timeout=30
        )
        response.raise_for_status()
        data = response.json()
        return data.get("content", "No content extracted")
    except requests.exceptions.Timeout:
        return "Error: Request timed out. The page may be slow to load."
    except requests.exceptions.HTTPError as e:
        return f"Error: HTTP {e.response.status_code}. Check the URL and try again."
    except requests.exceptions.RequestException as e:
        # Catch-all for connection errors and other transport failures
        return f"Error: Request failed ({e}). Check your network and try again."
```
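Transient failures (timeouts, 429s, 5xx responses) often succeed on a second attempt. A generic retry wrapper with exponential backoff; the attempt count and delays are arbitrary starting points, not WebPerception recommendations:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the original error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Use it as `with_retries(lambda: safe_scrape.invoke(url))` around any tool call you expect to flake.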
### Caching Results
```python
from functools import lru_cache

@lru_cache(maxsize=100)
def cached_scrape(url: str) -> str:
    """Cache scrape results to save API calls."""
    return scrape_webpage.invoke(url)
```
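Note that `lru_cache` never expires, so within a long-running process a page scraped once stays stale forever. A tiny time-based alternative; the 60-second default TTL is an arbitrary choice, and the class is a sketch rather than anything the API ships:

```python
import time

class TTLCache:
    """Cache values for a fixed number of seconds, then refetch."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def get_or_fetch(self, url: str, fetch) -> str:
        """Return a cached value if still fresh, else call fetch(url) and cache it."""
        now = time.time()
        if url in self._store:
            fetched_at, value = self._store[url]
            if now - fetched_at < self.ttl:
                return value
        value = fetch(url)
        self._store[url] = (now, value)
        return value
```

Usage: `cache = TTLCache(300); content = cache.get_or_fetch(url, scrape_webpage.invoke)`.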
### Rate Limiting
```python
import time

class RateLimitedScraper:
    def __init__(self, calls_per_second=2):
        self.min_interval = 1.0 / calls_per_second
        self.last_call = 0.0

    def scrape(self, url: str) -> str:
        # Wait just long enough to respect the configured rate
        elapsed = time.time() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.time()
        return scrape_webpage.invoke(url)
```
## Getting Started
1. Sign up at mantisapi.com (free, no credit card)
2. Copy your API key from the dashboard
3. Install dependencies: `pip install langchain langchain-openai requests`
4. Copy the tool definitions from this guide
5. Build your agent; you'll have web-capable LangChain agents in under 5 minutes
Your LangChain agents are about to get a whole lot smarter. Get your free API key →
---
Building something cool with LangChain + WebPerception? We'd love to feature you. Reach out at hello@mantisapi.com.