DSPy (Declarative Self-improving Language Programs) from Stanford is redefining how developers build LLM-powered applications. Instead of writing fragile prompts, you write signatures and modules, and DSPy's optimizers automatically tune them for your specific task.
In this tutorial, you'll learn how to build web scraping modules that combine DSPy's programmatic approach with the Mantis WebPerception API for real-time web data access. The result: optimizable, composable, production-ready scraping pipelines.
Traditional agent frameworks rely on hand-crafted prompts that break when you change models or data formats. DSPy takes a fundamentally different approach: you declare inputs and outputs, and the framework compiles the prompts for you.
For web scraping, this means you can define what you want to extract, let DSPy figure out how to prompt the model, and then optimize the pipeline on your actual data.
```bash
pip install dspy-ai requests
```
You'll need:

- A Mantis API key, exported as the `MANTIS_API_KEY` environment variable
- An API key for your LLM provider (the examples below use OpenAI's `gpt-4o-mini`)
First, create a Python wrapper for the Mantis API that DSPy modules can call:
```python
import os

import requests


class WebPerception:
    """Client for the Mantis WebPerception API."""

    BASE_URL = "https://api.mantisapi.com/v1"

    def __init__(self):
        self.api_key = os.environ["MANTIS_API_KEY"]
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

    def scrape(self, url: str, format: str = "markdown") -> str:
        """Scrape a URL and return its content."""
        response = requests.post(
            f"{self.BASE_URL}/scrape",
            headers=self.headers,
            json={"url": url, "format": format},
        )
        response.raise_for_status()
        return response.json()["content"]

    def extract(self, url: str, prompt: str, schema: dict | None = None) -> dict:
        """Extract structured data from a URL using AI."""
        payload = {"url": url, "prompt": prompt}
        if schema:
            payload["schema"] = schema
        response = requests.post(
            f"{self.BASE_URL}/extract",
            headers=self.headers,
            json=payload,
        )
        response.raise_for_status()
        return response.json()["data"]

    def screenshot(self, url: str) -> str:
        """Take a screenshot and return the image URL."""
        response = requests.post(
            f"{self.BASE_URL}/screenshot",
            headers=self.headers,
            json={"url": url},
        )
        response.raise_for_status()
        return response.json()["screenshot_url"]


web = WebPerception()
```
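The modules below repeatedly trim scraped content to fit the model's context window. A small helper along these lines (hypothetical, not part of the Mantis client) cuts at a paragraph boundary instead of mid-sentence:

```python
def trim_for_context(text: str, max_chars: int = 8000) -> str:
    """Trim text to max_chars, preferring the last paragraph break."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    boundary = cut.rfind("\n\n")
    # Fall back to a hard cut when no paragraph break is near the limit
    return cut[:boundary] if boundary > max_chars // 2 else cut
```

A hard slice like `raw_content[:8000]` also works; this variant just avoids handing the model a half-finished sentence at the cut point.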
DSPy signatures declare what your module does: inputs in, outputs out. No prompt writing required:
```python
import dspy


class ScrapeAndSummarize(dspy.Signature):
    """Scrape a webpage and produce a structured summary."""

    url: str = dspy.InputField(desc="URL to scrape")
    raw_content: str = dspy.InputField(desc="Raw scraped content from the page")
    summary: str = dspy.OutputField(desc="Concise summary of the page content")
    key_facts: list[str] = dspy.OutputField(desc="List of key facts extracted")
    sentiment: str = dspy.OutputField(desc="Overall sentiment: positive, negative, or neutral")


class CompareProducts(dspy.Signature):
    """Compare two products based on scraped data."""

    product_a_data: str = dspy.InputField(desc="Scraped data for product A")
    product_b_data: str = dspy.InputField(desc="Scraped data for product B")
    comparison: str = dspy.OutputField(desc="Detailed comparison of both products")
    winner: str = dspy.OutputField(desc="Which product is better and why")
    scores: dict = dspy.OutputField(desc="Scores for each product on key dimensions")
```
Modules are the core building blocks. They use signatures and can call external tools like the WebPerception API:
```python
class WebResearcher(dspy.Module):
    """A module that scrapes URLs and produces research summaries."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.summarize = dspy.ChainOfThought(ScrapeAndSummarize)

    def forward(self, url: str):
        # Step 1: Scrape the page using WebPerception API
        raw_content = self.web.scrape(url, format="markdown")
        # Step 2: Let DSPy summarize and extract facts
        result = self.summarize(
            url=url,
            raw_content=raw_content[:8000],  # Trim to fit context
        )
        return dspy.Prediction(
            summary=result.summary,
            key_facts=result.key_facts,
            sentiment=result.sentiment,
            raw_content=raw_content,
        )


class ProductComparator(dspy.Module):
    """Compare two products by scraping their pages."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.compare = dspy.ChainOfThought(CompareProducts)

    def forward(self, url_a: str, url_b: str):
        # Scrape both product pages
        data_a = self.web.scrape(url_a, format="markdown")
        data_b = self.web.scrape(url_b, format="markdown")
        # Compare using DSPy
        result = self.compare(
            product_a_data=data_a[:6000],
            product_b_data=data_b[:6000],
        )
        return dspy.Prediction(
            comparison=result.comparison,
            winner=result.winner,
            scores=result.scores,
        )
```
```python
# Configure DSPy with your preferred LLM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Use the WebResearcher module
researcher = WebResearcher()
result = researcher(url="https://docs.python.org/3/whatsnew/3.13.html")
print(f"Summary: {result.summary}")
print(f"Key Facts: {result.key_facts}")
print(f"Sentiment: {result.sentiment}")

# Use the ProductComparator
comparator = ProductComparator()
comparison = comparator(
    url_a="https://github.com/features/copilot",
    url_b="https://cursor.com",
)
print(f"Winner: {comparison.winner}")
```
This is where DSPy truly shines. You can automatically optimize your scraping pipeline using real examples:
```python
# Define training examples
trainset = [
    dspy.Example(
        url="https://example.com/product-1",
        raw_content="Product X: $29/mo, 10K API calls...",
        summary="Product X is a mid-tier API service at $29/month...",
        key_facts=["$29/month", "10K API calls", "99.9% uptime"],
        sentiment="positive",
    ).with_inputs("url", "raw_content"),
    # Add 5-10 more examples...
]
```
```python
# Define a metric
def quality_metric(example, prediction, trace=None):
    # Check that summary mentions key pricing info
    has_pricing = any(
        fact in prediction.summary
        for fact in example.key_facts[:2]
    )
    # Check sentiment accuracy
    correct_sentiment = prediction.sentiment == example.sentiment
    return has_pricing and correct_sentiment
```
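Because DSPy metrics are plain Python functions, you can sanity-check them without a single LLM call. A quick offline check, using stand-in objects in place of `dspy.Example` and the module's prediction (both behave as simple attribute bags here):

```python
from types import SimpleNamespace


def quality_metric(example, prediction, trace=None):
    # Summary must mention one of the first two key facts,
    # and the predicted sentiment must match the label
    has_pricing = any(f in prediction.summary for f in example.key_facts[:2])
    return has_pricing and prediction.sentiment == example.sentiment


example = SimpleNamespace(
    key_facts=["$29/month", "10K API calls"], sentiment="positive"
)
good = SimpleNamespace(
    summary="Product X costs $29/month for 10K API calls.", sentiment="positive"
)
bad = SimpleNamespace(summary="A nice product.", sentiment="neutral")

print(quality_metric(example, good))  # True
print(quality_metric(example, bad))   # False
```

Catching a miscalibrated metric this way is much cheaper than discovering it after an optimization run.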
```python
# Compile with BootstrapFewShot
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=quality_metric, max_bootstrapped_demos=4)
optimized_researcher = optimizer.compile(
    WebResearcher(),
    trainset=trainset,
)

# Now use the optimized module; it generates better prompts automatically
result = optimized_researcher(url="https://example.com/new-product")
```
💡 Why this matters: Traditional scraping agents use static prompts that may work for one page but fail for another. DSPy's optimizer tests different prompt strategies against your actual data and finds the best one automatically. Think of it as "fine-tuning without fine-tuning."
Chain multiple modules together for complex research workflows:
```python
class ResearchCollector(dspy.Signature):
    """Collect and identify relevant URLs for a research topic."""

    topic: str = dspy.InputField(desc="Research topic")
    urls: list[str] = dspy.InputField(desc="Candidate URLs from search")
    selected_urls: list[str] = dspy.OutputField(desc="Top 3 most relevant URLs")
    reasoning: str = dspy.OutputField(desc="Why these URLs were selected")


class ResearchSynthesizer(dspy.Signature):
    """Synthesize multiple research summaries into a report."""

    topic: str = dspy.InputField(desc="Research topic")
    summaries: list[str] = dspy.InputField(desc="Individual page summaries")
    report: str = dspy.OutputField(desc="Comprehensive research report")
    confidence: float = dspy.OutputField(desc="Confidence score 0-1")


class MultiSourceResearcher(dspy.Module):
    """Research a topic by scraping and synthesizing multiple sources."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.collector = dspy.ChainOfThought(ResearchCollector)
        self.researcher = WebResearcher()
        self.synthesizer = dspy.ChainOfThought(ResearchSynthesizer)

    def forward(self, topic: str, candidate_urls: list[str]):
        # Step 1: Select the best URLs
        selection = self.collector(
            topic=topic,
            urls=candidate_urls,
        )
        # Step 2: Scrape and summarize each URL
        summaries = []
        for url in selection.selected_urls[:3]:
            try:
                result = self.researcher(url=url)
                summaries.append(result.summary)
            except Exception as e:
                summaries.append(f"Failed to scrape {url}: {e}")
        # Step 3: Synthesize into a report
        report = self.synthesizer(
            topic=topic,
            summaries=summaries,
        )
        return dspy.Prediction(
            report=report.report,
            confidence=report.confidence,
            sources=selection.selected_urls,
            individual_summaries=summaries,
        )
```
```python
class ValidatedResearcher(dspy.Module):
    """Web researcher with built-in quality assertions."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.summarize = dspy.ChainOfThought(ScrapeAndSummarize)

    def forward(self, url: str):
        raw_content = self.web.scrape(url, format="markdown")
        result = self.summarize(
            url=url,
            raw_content=raw_content[:8000],
        )
        # DSPy assertions: checked at runtime; a failure triggers
        # backtracking and a self-refinement retry
        dspy.Assert(
            len(result.summary) > 50,
            "Summary must be at least 50 characters",
        )
        dspy.Assert(
            len(result.key_facts) >= 2,
            "Must extract at least 2 key facts",
        )
        dspy.Assert(
            result.sentiment in ["positive", "negative", "neutral"],
            "Sentiment must be one of: positive, negative, neutral",
        )
        return dspy.Prediction(
            summary=result.summary,
            key_facts=result.key_facts,
            sentiment=result.sentiment,
        )
```
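The same constraints are easy to mirror as a plain predicate for unit tests that never touch the DSPy runtime; a sketch using a stand-in prediction object:

```python
from types import SimpleNamespace

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def is_valid_research(pred) -> bool:
    # Mirrors the dspy.Assert constraints above, for offline testing
    return (len(pred.summary) > 50
            and len(pred.key_facts) >= 2
            and pred.sentiment in ALLOWED_SENTIMENTS)


ok = SimpleNamespace(
    summary="Python 3.13 adds a new interactive interpreter, "
            "an experimental free-threaded mode, and a JIT compiler.",
    key_facts=["new REPL", "free-threaded mode"],
    sentiment="positive",
)
print(is_valid_research(ok))  # True
```

Keeping one source of truth for the constraints (shared by the assertions and your tests) avoids the two drifting apart.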
Monitor competitor pricing, features, and positioning. DSPy's optimizer ensures consistent extraction across different page layouts:
```python
class CompetitorTracker(dspy.Signature):
    """Extract competitor intelligence from a product page."""

    page_content: str = dspy.InputField()
    company_name: str = dspy.OutputField()
    pricing_tiers: list[dict] = dspy.OutputField()
    key_features: list[str] = dspy.OutputField()
    target_audience: str = dspy.OutputField()
    positioning: str = dspy.OutputField()


tracker = dspy.ChainOfThought(CompetitorTracker)
content = web.scrape("https://competitor.com/pricing")
intel = tracker(page_content=content[:8000])
```
Scrape prospect websites and automatically qualify them:
```python
class LeadQualifier(dspy.Signature):
    """Qualify a sales lead based on their company website."""

    website_content: str = dspy.InputField()
    company_size: str = dspy.OutputField(desc="small, medium, or enterprise")
    industry: str = dspy.OutputField()
    tech_stack: list[str] = dspy.OutputField()
    icp_match: bool = dspy.OutputField(desc="Does this match our ideal customer?")
    score: int = dspy.OutputField(desc="Lead score 1-100")
```
Track changes in documentation or news pages, with DSPy ensuring consistent change detection:
```python
class ChangeDetector(dspy.Signature):
    """Detect meaningful changes between two versions of a page."""

    old_content: str = dspy.InputField()
    new_content: str = dspy.InputField()
    has_meaningful_change: bool = dspy.OutputField()
    changes: list[str] = dspy.OutputField(desc="List of meaningful changes")
    severity: str = dspy.OutputField(desc="low, medium, or high")
```
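Running `ChangeDetector` on every poll spends credits and LLM tokens even when a page has not changed at all. A cheap pre-filter (a sketch, independent of DSPy and Mantis) hashes whitespace-normalized content and only invokes the module when the fingerprint differs:

```python
import hashlib


def content_fingerprint(text: str) -> str:
    # Normalize whitespace so cosmetic reflows don't trigger the detector
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


old = "Pricing: $29/mo.\nUpdated daily."
new_same = "Pricing: $29/mo.  \nUpdated   daily."
new_diff = "Pricing: $39/mo.\nUpdated daily."

print(content_fingerprint(old) == content_fingerprint(new_same))  # True
print(content_fingerprint(old) == content_fingerprint(new_diff))  # False
```

Only when the fingerprints differ would you call the LLM-backed detector to classify whether the change is meaningful.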
| Operation | Mantis Credits | Typical Use |
|---|---|---|
| Scrape (markdown) | 1 credit | Page content for DSPy modules |
| Extract (AI) | 1 credit | Structured data extraction |
| Screenshot | 1 credit | Visual verification |
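Given the per-operation costs above, you can budget a pipeline before running it; a back-of-the-envelope helper (credit values taken from the table, everything else assumed):

```python
# Credits per operation, from the pricing table above
CREDITS = {"scrape": 1, "extract": 1, "screenshot": 1}


def estimate_credits(scrapes: int = 0, extracts: int = 0, screenshots: int = 0) -> int:
    return (scrapes * CREDITS["scrape"]
            + extracts * CREDITS["extract"]
            + screenshots * CREDITS["screenshot"])


# MultiSourceResearcher scrapes up to 3 selected URLs per topic
print(estimate_credits(scrapes=3))        # 3
# At 100 topics per day, that's 300 credits/day
print(estimate_credits(scrapes=3) * 100)  # 300
```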
Tips for cost-efficient DSPy pipelines:
- Use `gpt-4o-mini` during development; upgrade for production
- Trim `raw_content` to 8K chars, enough for most extractions

Get your free API key and start building optimizable web scraping pipelines in minutes.
Get Free API Key →