Build Web Scraping Modules with DSPy

March 9, 2026 · 12 min read · By Mantis Team

DSPy (Declarative Self-improving Language Programs) from Stanford is redefining how developers build LLM-powered applications. Instead of writing fragile prompts, you write signatures and modules, and DSPy's optimizers automatically tune them for your specific task.

In this tutorial, you'll learn how to build web scraping modules that combine DSPy's programmatic approach with the Mantis WebPerception API for real-time web data access. The result: optimizable, composable, production-ready scraping pipelines.

Why DSPy for Web Scraping?

Traditional agent frameworks rely on hand-crafted prompts that break when you change models or data formats. DSPy takes a fundamentally different approach:

- Signatures declare what a module takes in and produces, not how to prompt for it.
- Modules compose those signatures into pipelines you call like ordinary Python.
- Optimizers tune the underlying prompts and few-shot demos against your own data and metric.

For web scraping, this means you can define what you want to extract, let DSPy figure out how to prompt the model, and then optimize the pipeline on your actual data.

Prerequisites

pip install dspy-ai requests

You'll need:

- A Mantis API key, set as the MANTIS_API_KEY environment variable
- An LLM provider key for DSPy (this tutorial uses OpenAI's gpt-4o-mini)
- Python 3.10+ (the code below uses built-in generic types like list[str])

Step 1: Set Up the WebPerception Tool

First, create a Python wrapper for the Mantis API that DSPy modules can call:

import requests
import os

class WebPerception:
    """Client for the Mantis WebPerception API."""

    BASE_URL = "https://api.mantisapi.com/v1"

    def __init__(self):
        self.api_key = os.environ["MANTIS_API_KEY"]
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def scrape(self, url: str, format: str = "markdown") -> str:
        """Scrape a URL and return its content."""
        response = requests.post(
            f"{self.BASE_URL}/scrape",
            headers=self.headers,
            json={"url": url, "format": format}
        )
        response.raise_for_status()
        return response.json()["content"]

    def extract(self, url: str, prompt: str, schema: dict | None = None) -> dict:
        """Extract structured data from a URL using AI."""
        payload = {"url": url, "prompt": prompt}
        if schema:
            payload["schema"] = schema
        response = requests.post(
            f"{self.BASE_URL}/extract",
            headers=self.headers,
            json=payload
        )
        response.raise_for_status()
        return response.json()["data"]

    def screenshot(self, url: str) -> str:
        """Take a screenshot and return the image URL."""
        response = requests.post(
            f"{self.BASE_URL}/screenshot",
            headers=self.headers,
            json={"url": url}
        )
        response.raise_for_status()
        return response.json()["screenshot_url"]

web = WebPerception()
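Scraping endpoints can fail transiently (timeouts, rate limits), and a DSPy optimizer may call your module many times in a row. A small retry helper hardens the client; `with_retries` is a hypothetical name, not part of the Mantis SDK or DSPy:

```python
import time

def with_retries(fn, attempts=3, backoff=0.5):
    """Call fn(); on exception, retry with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * (2 ** i))

# Hypothetical usage with the client defined above:
# content = with_retries(lambda: web.scrape("https://example.com"))
```

Wrapping the call site rather than the client keeps the helper reusable for scrape, extract, and screenshot alike.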

Step 2: Define DSPy Signatures

DSPy signatures declare what your module does: inputs in, outputs out. No prompt writing required:

import dspy

class ScrapeAndSummarize(dspy.Signature):
    """Scrape a webpage and produce a structured summary."""
    url: str = dspy.InputField(desc="URL to scrape")
    raw_content: str = dspy.InputField(desc="Raw scraped content from the page")
    summary: str = dspy.OutputField(desc="Concise summary of the page content")
    key_facts: list[str] = dspy.OutputField(desc="List of key facts extracted")
    sentiment: str = dspy.OutputField(desc="Overall sentiment: positive, negative, or neutral")

class CompareProducts(dspy.Signature):
    """Compare two products based on scraped data."""
    product_a_data: str = dspy.InputField(desc="Scraped data for product A")
    product_b_data: str = dspy.InputField(desc="Scraped data for product B")
    comparison: str = dspy.OutputField(desc="Detailed comparison of both products")
    winner: str = dspy.OutputField(desc="Which product is better and why")
    scores: dict = dspy.OutputField(desc="Scores for each product on key dimensions")

Step 3: Build DSPy Modules

Modules are the core building blocks. They use signatures and can call external tools like the WebPerception API:

class WebResearcher(dspy.Module):
    """A module that scrapes URLs and produces research summaries."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.summarize = dspy.ChainOfThought(ScrapeAndSummarize)

    def forward(self, url: str):
        # Step 1: Scrape the page using WebPerception API
        raw_content = self.web.scrape(url, format="markdown")

        # Step 2: Let DSPy summarize and extract facts
        result = self.summarize(
            url=url,
            raw_content=raw_content[:8000]  # Trim to fit context
        )

        return dspy.Prediction(
            summary=result.summary,
            key_facts=result.key_facts,
            sentiment=result.sentiment,
            raw_content=raw_content
        )

Product Comparison Module

class ProductComparator(dspy.Module):
    """Compare two products by scraping their pages."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.compare = dspy.ChainOfThought(CompareProducts)

    def forward(self, url_a: str, url_b: str):
        # Scrape both product pages
        data_a = self.web.scrape(url_a, format="markdown")
        data_b = self.web.scrape(url_b, format="markdown")

        # Compare using DSPy
        result = self.compare(
            product_a_data=data_a[:6000],
            product_b_data=data_b[:6000]
        )

        return dspy.Prediction(
            comparison=result.comparison,
            winner=result.winner,
            scores=result.scores
        )

Step 4: Configure DSPy and Run

# Configure DSPy with your preferred LLM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Use the WebResearcher module
researcher = WebResearcher()
result = researcher(url="https://docs.python.org/3/whatsnew/3.13.html")

print(f"Summary: {result.summary}")
print(f"Key Facts: {result.key_facts}")
print(f"Sentiment: {result.sentiment}")

# Use the ProductComparator
comparator = ProductComparator()
comparison = comparator(
    url_a="https://github.com/features/copilot",
    url_b="https://cursor.com"
)
print(f"Winner: {comparison.winner}")

Step 5: Optimize with DSPy Compilers

This is where DSPy truly shines. You can automatically optimize your scraping pipeline using real examples:

# Define training examples
trainset = [
    dspy.Example(
        url="https://example.com/product-1",
        raw_content="Product X: $29/mo, 10K API calls...",
        summary="Product X is a mid-tier API service at $29/month...",
        key_facts=["$29/month", "10K API calls", "99.9% uptime"],
        sentiment="positive"
    ).with_inputs("url", "raw_content"),
    # Add 5-10 more examples...
]

# Define a metric
def quality_metric(example, prediction, trace=None):
    # Check that summary mentions key pricing info
    has_pricing = any(
        fact in prediction.summary
        for fact in example.key_facts[:2]
    )
    # Check sentiment accuracy
    correct_sentiment = prediction.sentiment == example.sentiment
    return has_pricing and correct_sentiment

# Compile with BootstrapFewShot
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=quality_metric, max_bootstrapped_demos=4)
optimized_researcher = optimizer.compile(
    WebResearcher(),
    trainset=trainset
)

# Now use the optimized module; it generates better prompts automatically
result = optimized_researcher(url="https://example.com/new-product")
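A metric drives the whole optimization, so it's worth sanity-checking it without spending an LLM call. The sketch below repeats quality_metric so the snippet is self-contained, and feeds it mock objects (SimpleNamespace stands in for dspy.Example and dspy.Prediction, which both expose attribute access):

```python
from types import SimpleNamespace

def quality_metric(example, prediction, trace=None):
    # Check that summary mentions key pricing info
    has_pricing = any(
        fact in prediction.summary
        for fact in example.key_facts[:2]
    )
    # Check sentiment accuracy
    return has_pricing and prediction.sentiment == example.sentiment

# Mock inputs mirroring the trainset fields above
example = SimpleNamespace(
    key_facts=["$29/month", "10K API calls"],
    sentiment="positive",
)
prediction = SimpleNamespace(
    summary="Product X costs $29/month and includes 10K API calls.",
    sentiment="positive",
)
```

If the metric misfires on hand-built cases like this, the optimizer will happily optimize toward the wrong target.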

💡 Why this matters: traditional scraping agents use static prompts that may work for one page but fail for another. DSPy's optimizer tests different prompt strategies against your actual data and automatically keeps the best one. Think of it as "fine-tuning without fine-tuning."

Step 6: Build a Multi-Step Research Pipeline

Chain multiple modules together for complex research workflows:

class ResearchCollector(dspy.Signature):
    """Collect and identify relevant URLs for a research topic."""
    topic: str = dspy.InputField(desc="Research topic")
    urls: list[str] = dspy.InputField(desc="Candidate URLs from search")
    selected_urls: list[str] = dspy.OutputField(desc="Top 3 most relevant URLs")
    reasoning: str = dspy.OutputField(desc="Why these URLs were selected")

class ResearchSynthesizer(dspy.Signature):
    """Synthesize multiple research summaries into a report."""
    topic: str = dspy.InputField(desc="Research topic")
    summaries: list[str] = dspy.InputField(desc="Individual page summaries")
    report: str = dspy.OutputField(desc="Comprehensive research report")
    confidence: float = dspy.OutputField(desc="Confidence score 0-1")

class MultiSourceResearcher(dspy.Module):
    """Research a topic by scraping and synthesizing multiple sources."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.collector = dspy.ChainOfThought(ResearchCollector)
        self.researcher = WebResearcher()
        self.synthesizer = dspy.ChainOfThought(ResearchSynthesizer)

    def forward(self, topic: str, candidate_urls: list[str]):
        # Step 1: Select the best URLs
        selection = self.collector(
            topic=topic,
            urls=candidate_urls
        )

        # Step 2: Scrape and summarize each URL
        summaries = []
        for url in selection.selected_urls[:3]:
            try:
                result = self.researcher(url=url)
                summaries.append(result.summary)
            except Exception as e:
                summaries.append(f"Failed to scrape {url}: {e}")

        # Step 3: Synthesize into a report
        report = self.synthesizer(
            topic=topic,
            summaries=summaries
        )

        return dspy.Prediction(
            report=report.report,
            confidence=report.confidence,
            sources=selection.selected_urls,
            individual_summaries=summaries
        )

Step 7: Add Assertions for Quality Control

class ValidatedResearcher(dspy.Module):
    """Web researcher with built-in quality assertions."""

    def __init__(self):
        super().__init__()
        self.web = WebPerception()
        self.summarize = dspy.ChainOfThought(ScrapeAndSummarize)

    def forward(self, url: str):
        raw_content = self.web.scrape(url, format="markdown")

        result = self.summarize(
            url=url,
            raw_content=raw_content[:8000]
        )

        # DSPy assertions: on failure, DSPy backtracks and retries at runtime
        dspy.Assert(
            len(result.summary) > 50,
            "Summary must be at least 50 characters"
        )
        dspy.Assert(
            len(result.key_facts) >= 2,
            "Must extract at least 2 key facts"
        )
        dspy.Assert(
            result.sentiment in ["positive", "negative", "neutral"],
            "Sentiment must be one of: positive, negative, neutral"
        )

        return dspy.Prediction(
            summary=result.summary,
            key_facts=result.key_facts,
            sentiment=result.sentiment
        )
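Note that dspy.Assert has changed across DSPy releases (older 2.x versions required activating it via an assertion transform, and later releases reworked the API), so check your installed version's docs. A framework-free fallback that you can call from any forward() works everywhere; `validate_research` is a hypothetical helper mirroring the three checks above:

```python
def validate_research(summary: str, key_facts: list, sentiment: str) -> list:
    """Return a list of quality problems; an empty list means the output passes."""
    problems = []
    if len(summary) <= 50:
        problems.append("summary too short")
    if len(key_facts) < 2:
        problems.append("fewer than 2 key facts")
    if sentiment not in ("positive", "negative", "neutral"):
        problems.append(f"unexpected sentiment: {sentiment!r}")
    return problems

# Hypothetical usage inside forward(), assuming `result` as above:
# problems = validate_research(result.summary, result.key_facts, result.sentiment)
# if problems:
#     raise ValueError(f"Low-quality extraction: {problems}")
```

Raising a plain exception loses DSPy's automatic backtracking, but it keeps quality gates working regardless of framework version.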

Real-World Use Cases

1. Competitive Intelligence Pipeline

Monitor competitor pricing, features, and positioning. DSPy's optimizer ensures consistent extraction across different page layouts:

class CompetitorTracker(dspy.Signature):
    """Extract competitor intelligence from a product page."""
    page_content: str = dspy.InputField()
    company_name: str = dspy.OutputField()
    pricing_tiers: list[dict] = dspy.OutputField()
    key_features: list[str] = dspy.OutputField()
    target_audience: str = dspy.OutputField()
    positioning: str = dspy.OutputField()

tracker = dspy.ChainOfThought(CompetitorTracker)
content = web.scrape("https://competitor.com/pricing")
intel = tracker(page_content=content[:8000])
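Competitive intelligence is most useful over time, so it helps to persist each extraction with a timestamp for later diffing. A minimal append-only log, assuming `intel` behaves like a dict of the fields above (`record_intel` and the file path are hypothetical):

```python
import json
import datetime

def record_intel(intel: dict, path: str = "competitor_log.jsonl") -> None:
    """Append a timestamped snapshot so pricing changes can be diffed later."""
    row = {
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        **intel,
    }
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")
```

JSON Lines keeps each run independent: one scrape, one line, no schema migrations.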

2. Lead Qualification

Scrape prospect websites and automatically qualify them:

class LeadQualifier(dspy.Signature):
    """Qualify a sales lead based on their company website."""
    website_content: str = dspy.InputField()
    company_size: str = dspy.OutputField(desc="small, medium, or enterprise")
    industry: str = dspy.OutputField()
    tech_stack: list[str] = dspy.OutputField()
    icp_match: bool = dspy.OutputField(desc="Does this match our ideal customer?")
    score: int = dspy.OutputField(desc="Lead score 1-100")
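LLM-returned fields don't always arrive as clean integers: a model may emit "85", "85%", or an out-of-range number for the score field. A defensive coercion step before storing leads keeps downstream code simple; `coerce_score` is a hypothetical helper, not a DSPy feature:

```python
def coerce_score(raw) -> int:
    """Clamp a model-returned lead score into the 1-100 range."""
    try:
        score = int(float(str(raw).strip().rstrip("%")))
    except ValueError:
        return 1  # unparseable output: treat as lowest-quality lead
    return max(1, min(100, score))
```

The same pattern applies to any numeric OutputField you plan to sort or filter on.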

3. Content Monitoring

Track changes in documentation or news pages, with DSPy ensuring consistent change detection:

class ChangeDetector(dspy.Signature):
    """Detect meaningful changes between two versions of a page."""
    old_content: str = dspy.InputField()
    new_content: str = dspy.InputField()
    has_meaningful_change: bool = dspy.OutputField()
    changes: list[str] = dspy.OutputField(desc="List of meaningful changes")
    severity: str = dspy.OutputField(desc="low, medium, or high")
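Running ChangeDetector on every poll wastes LLM calls when pages are usually unchanged. A cheap textual pre-filter can gate the LLM: only invoke the detector when raw similarity drops below a threshold. This sketch uses the standard library's difflib; `looks_changed` and the 0.98 threshold are assumptions to tune:

```python
import difflib

def looks_changed(old: str, new: str, threshold: float = 0.98) -> bool:
    """Cheap pre-filter: True when the pages differ enough to warrant an LLM call."""
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    return ratio < threshold

# Hypothetical usage with the scraper and detector above:
# new_content = web.scrape(url)
# if looks_changed(old_content, new_content):
#     result = dspy.ChainOfThought(ChangeDetector)(
#         old_content=old_content, new_content=new_content
#     )
```

False positives just cost one extra LLM call; set the threshold high so meaningful edits are never skipped.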

Cost Optimization

| Operation | Mantis Credits | Typical Use |
| --- | --- | --- |
| Scrape (markdown) | 1 credit | Page content for DSPy modules |
| Extract (AI) | 1 credit | Structured data extraction |
| Screenshot | 1 credit | Visual verification |

Tips for cost-efficient DSPy pipelines:

- Scrape each URL once and reuse the content; the optimizer may call your module many times.
- Trim scraped content before sending it to the LLM (the modules above pass at most 8,000 characters).
- Request markdown format, which is compact and LLM-friendly, instead of raw HTML.
- Use a small, inexpensive model such as gpt-4o-mini during optimization runs.
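Re-scraping the same URL during an optimization run burns credits for identical content. A thin in-memory cache around any scrape-style callable avoids that; `CachedScraper` is a hypothetical helper, not part of the Mantis SDK:

```python
class CachedScraper:
    """Wrap a scrape-like callable with an in-memory cache keyed by (url, format)."""

    def __init__(self, scrape_fn):
        self.scrape_fn = scrape_fn
        self.cache = {}
        self.misses = 0  # each miss corresponds to one billed API call

    def scrape(self, url, format="markdown"):
        key = (url, format)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.scrape_fn(url, format)
        return self.cache[key]

# Hypothetical usage with the client from Step 1:
# cached = CachedScraper(web.scrape)
# content = cached.scrape("https://example.com")
```

For long optimization runs, the same idea extends naturally to an on-disk cache so repeated sessions also stay cheap.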

Ready to Build DSPy Scraping Modules?

Get your free API key and start building optimizable web scraping pipelines in minutes.


What's Next