Web Scraping for Lead Generation: How AI Agents Find and Qualify Prospects in 2026
Your sales team spends 60% of their time researching prospects instead of selling. Meanwhile, your competitors are deploying AI agents that scrape company websites, LinkedIn profiles, job boards, and press releases, automatically building qualified lead lists while your reps are still Googling.
In this guide, you'll build an AI-powered lead generation agent that finds prospects, extracts company data, qualifies them against your ICP (Ideal Customer Profile), and outputs enriched, sales-ready lead lists. All with Python and the Mantis WebPerception API.
Why Traditional Lead Gen Tools Fall Short
Tools like ZoomInfo, Apollo, and Lusha give you databases. But databases are:
- Stale: contact data decays at roughly 30% per year
- Generic: the same leads your competitors are buying
- Expensive: $10K+/year for decent data
- Rigid: you can't define custom qualification criteria
AI agents that scrape the live web solve all four problems. They find fresh data, discover prospects no database has indexed, cost a fraction of lead databases, and qualify leads using your exact ICP criteria.
The AI Lead Gen Architecture
Here's how a production lead generation agent works:
1. Source Discovery: find companies matching your target profile from directories, job boards, forums, and funding announcements
2. Data Extraction: scrape each company's website for key signals (tech stack, team size, funding, pain points)
3. Enrichment: pull additional data from LinkedIn, Crunchbase, GitHub, and G2 reviews
4. Qualification: score each lead against your ICP using an LLM
5. Output: structured JSON/CSV with qualified leads, ready for your CRM
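Before wiring in real scraping and LLM calls, it helps to see the five stages as a composed pipeline. The sketch below uses stubbed stages with made-up data; the function names (`discover`, `enrich`, `qualify`, `run_pipeline`) are illustrative placeholders, not part of any API:

```python
# Minimal pipeline skeleton: each stage is a plain function, so stages
# can be swapped out or unit-tested independently. All bodies are stubs.

def discover(sources):
    """Stage 1: return candidate companies from a list of source URLs."""
    return [{"company_name": f"Co{i}", "website": s} for i, s in enumerate(sources)]

def enrich(company):
    """Stages 2-3: attach extra signals to a company record."""
    return {**company, "tech_stack_signals": ["python"]}

def qualify(company):
    """Stage 4: attach a score; the real version calls an LLM."""
    return {**company, "score": 75 if "python" in company["tech_stack_signals"] else 20}

def run_pipeline(sources, min_score=50):
    """Stage 5: run all stages and keep leads above a score threshold."""
    leads = [qualify(enrich(c)) for c in discover(sources)]
    return [lead for lead in leads if lead["score"] >= min_score]
```

Each stage takes and returns plain dicts, which makes it easy to swap a stub for a real implementation one stage at a time.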
Step 1: Set Up the Scraping Foundation
```python
import requests
import json
from openai import OpenAI

MANTIS_API_KEY = "your-mantis-api-key"
MANTIS_BASE = "https://api.mantisapi.com/v1"
openai = OpenAI()

def scrape_page(url):
    """Scrape a webpage and get clean text content."""
    resp = requests.post(f"{MANTIS_BASE}/scrape", json={
        "url": url,
        "render_js": True,
        "wait_for": "networkidle"
    }, headers={"Authorization": f"Bearer {MANTIS_API_KEY}"}, timeout=60)
    resp.raise_for_status()
    return resp.json()

def extract_data(url, schema):
    """Use AI to extract structured data from a webpage."""
    resp = requests.post(f"{MANTIS_BASE}/extract", json={
        "url": url,
        "schema": schema,
        "render_js": True
    }, headers={"Authorization": f"Bearer {MANTIS_API_KEY}"}, timeout=60)
    resp.raise_for_status()
    return resp.json()
```
Step 2: Build the Company Discovery Agent
The first job is finding companies to evaluate. Your agent can scrape multiple sources:
```python
COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "website": {"type": "string"},
        "description": {"type": "string"},
        "industry": {"type": "string"},
        "location": {"type": "string"},
        "employee_count": {"type": "string"},
        "founded_year": {"type": "string"}
    }
}

def discover_companies_from_directory(directory_url):
    """Extract company listings from industry directories."""
    result = extract_data(directory_url, {
        "type": "array",
        "items": COMPANY_SCHEMA,
        "description": "All companies listed on this page with their details"
    })
    return result.get("data", [])

def discover_from_job_boards(search_url):
    """Find companies hiring for roles that signal they need your product."""
    result = extract_data(search_url, {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "company_name": {"type": "string"},
                "job_title": {"type": "string"},
                "description_snippet": {"type": "string"},
                "company_url": {"type": "string"}
            }
        },
        "description": "Companies hiring for these roles"
    })
    return result.get("data", [])

# Example sources: a YC directory page and a Wellfound role listing
# (hiring data engineers is a signal the team works with web data)
sources = [
    "https://www.ycombinator.com/companies?batch=W26&industry=B2B",
    "https://wellfound.com/role/data-engineer",
]

all_companies = []
for source in sources:
    companies = discover_companies_from_directory(source)
    all_companies.extend(companies)
    print(f"Found {len(companies)} companies from {source}")
```
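Directories and job boards overlap heavily, so it pays to normalize URLs and deduplicate by domain before spending API calls on enrichment. A small helper sketch (the normalization rules here are assumptions; adjust them for your sources):

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Reduce a URL to a bare domain: strip scheme, 'www.', path, and case."""
    if not url:
        return ""
    netloc = urlparse(url if "//" in url else f"https://{url}").netloc
    return netloc.lower().removeprefix("www.")

def dedupe_companies(companies):
    """Keep the first record seen for each distinct domain."""
    seen, unique = set(), []
    for c in companies:
        key = normalize_domain(c.get("website", ""))
        if key and key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```

Running `all_companies = dedupe_companies(all_companies)` here keeps the downstream enrichment and qualification steps from paying twice for the same company.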
Step 3: Deep Company Enrichment
Once you have a list of companies, scrape their websites for qualification signals:
```python
ENRICHMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "what_they_do": {"type": "string"},
        "target_customers": {"type": "string"},
        "tech_stack_signals": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Technologies mentioned (languages, frameworks, APIs, tools)"
        },
        "team_size_signals": {"type": "string"},
        "funding_stage": {"type": "string"},
        "pricing_model": {"type": "string"},
        "pain_points": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Problems they might have that our product solves"
        },
        "key_contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            },
            "description": "Leadership team members visible on the site"
        }
    }
}

def enrich_company(website_url):
    """Deep-scrape a company website for qualification data."""
    # Scrape main page
    main_data = extract_data(website_url, ENRICHMENT_SCHEMA)

    # Also check /about, /team, /pricing pages
    enrichment_pages = ["/about", "/team", "/pricing", "/customers"]
    for page in enrichment_pages:
        try:
            page_data = extract_data(
                f"{website_url.rstrip('/')}{page}",
                ENRICHMENT_SCHEMA
            )
            # Merge additional signals
            if page_data.get("data"):
                for key, value in page_data["data"].items():
                    if value and not main_data.get("data", {}).get(key):
                        main_data.setdefault("data", {})[key] = value
        except Exception:
            continue

    return main_data.get("data", {})

# Enrich each discovered company
enriched_leads = []
for company in all_companies[:20]:  # Process top 20
    if company.get("website"):
        print(f"Enriching: {company['company_name']}...")
        enriched = enrich_company(company["website"])
        enriched["source_url"] = company["website"]
        enriched_leads.append(enriched)
```
Step 4: AI-Powered Lead Qualification
This is where the magic happens. Instead of simple filters, use an LLM to evaluate each lead against your ICP:
```python
ICP_CRITERIA = """
Our Ideal Customer Profile:
- B2B SaaS companies building AI-powered products
- 10-200 employees (Series A to Series C)
- Engineering team uses Python or Node.js
- Currently scraping the web OR building agents that need web data
- Pain points: maintaining scrapers, handling anti-bot measures, scaling data extraction
- Budget: $100-$500/mo for developer tools
- Decision makers: CTO, VP Engineering, Lead Developer
"""

def qualify_lead(enriched_data):
    """Use GPT-4o to score and qualify a lead against ICP."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a sales qualification expert.
Score this lead against our ICP and provide actionable intelligence.

{ICP_CRITERIA}

Return JSON with:
- score: 1-100 (how well they match our ICP)
- tier: "hot" (80+), "warm" (50-79), "cold" (<50)
- reasons: list of why they match or don't
- talking_points: personalized outreach angles
- objections: likely objections and how to handle them
- recommended_plan: which pricing plan fits them
- urgency: "high", "medium", or "low" (how urgently they need our product)
"""},
            {"role": "user", "content": f"Qualify this lead:\n{json.dumps(enriched_data, indent=2)}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Qualify all enriched leads
qualified_leads = []
for lead in enriched_leads:
    qualification = qualify_lead(lead)
    lead["qualification"] = qualification
    qualified_leads.append(lead)
    print(f"  {lead.get('company_name', 'Unknown')}: "
          f"Score {qualification['score']} ({qualification['tier']})")

# Sort by score
qualified_leads.sort(key=lambda x: x["qualification"]["score"], reverse=True)
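Even with `response_format` set to JSON mode, an LLM can return a payload with missing keys or the wrong types, so it's worth normalizing the qualification dict before sorting or filtering on it. A defensive sketch (the default values and the score-derived tiers are assumptions matching the prompt above):

```python
def normalize_qualification(q):
    """Coerce an LLM qualification dict into a predictable shape."""
    try:
        score = int(q.get("score", 0))
    except (TypeError, ValueError):
        score = 0
    score = max(0, min(100, score))  # clamp into the 0-100 range
    # Derive the tier from the score rather than trusting the model's label
    tier = "hot" if score >= 80 else "warm" if score >= 50 else "cold"
    return {
        "score": score,
        "tier": tier,
        "reasons": list(q.get("reasons") or []),
        "talking_points": list(q.get("talking_points") or []),
        "objections": list(q.get("objections") or []),
        "recommended_plan": q.get("recommended_plan") or "",
        "urgency": q.get("urgency") if q.get("urgency") in ("high", "medium", "low") else "low",
    }
```

Wrapping each `qualify_lead` result in `normalize_qualification` means a single malformed response degrades to a "cold" lead instead of crashing the sort.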
Step 5: Export Sales-Ready Leads
```python
import csv

def export_to_csv(leads, filename="qualified_leads.csv"):
    """Export qualified leads to a CSV ready for CRM import."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            "Company", "Website", "Score", "Tier", "Industry",
            "Team Size", "Funding", "Key Contact", "Contact Role",
            "Talking Points", "Recommended Plan", "Urgency"
        ])
        for lead in leads:
            q = lead.get("qualification", {})
            contacts = lead.get("key_contacts", [{}])
            primary_contact = contacts[0] if contacts else {}
            writer.writerow([
                lead.get("company_name", ""),
                lead.get("source_url", ""),
                q.get("score", ""),
                q.get("tier", ""),
                lead.get("industry", ""),
                lead.get("team_size_signals", ""),
                lead.get("funding_stage", ""),
                primary_contact.get("name", ""),
                primary_contact.get("role", ""),
                " | ".join(q.get("talking_points", [])),
                q.get("recommended_plan", ""),
                q.get("urgency", "")
            ])
    print(f"Exported {len(leads)} leads to {filename}")

# Export hot and warm leads
hot_warm = [l for l in qualified_leads if l["qualification"]["tier"] in ("hot", "warm")]
export_to_csv(hot_warm)
```
Step 6: Automated Pipeline with Scheduling
Run your lead gen agent on a schedule to continuously discover new prospects:
```python
import schedule
import time

# Reuse the discovery sources from Step 2; extend this list over time
LEAD_SOURCES = sources

def daily_lead_gen():
    """Run the full lead generation pipeline."""
    print(f"Starting lead gen run: {time.strftime('%Y-%m-%d %H:%M')}")

    # 1. Discover from multiple sources
    companies = []
    for source in LEAD_SOURCES:
        companies.extend(discover_companies_from_directory(source))

    # 2. Deduplicate by website
    seen = set()
    unique = []
    for c in companies:
        key = c.get("website", "").lower().rstrip("/")
        if key and key not in seen:
            seen.add(key)
            unique.append(c)

    # 3. Enrich & qualify
    for company in unique:
        enriched = enrich_company(company["website"])
        qualification = qualify_lead(enriched)
        if qualification["tier"] in ("hot", "warm"):
            # save_lead() is your own persistence hook (database or CRM)
            save_lead(enriched, qualification)
            print(f"  HOT/WARM: {company['company_name']} (score: {qualification['score']})")

    print(f"Pipeline complete. Processed {len(unique)} companies.")

# Run every morning at 6 AM
schedule.every().day.at("06:00").do(daily_lead_gen)

while True:
    schedule.run_pending()
    time.sleep(60)
```
Lead Sources That Work
| Source Type | Example | Signal |
|---|---|---|
| Startup directories | Y Combinator, Product Hunt | New companies with funding |
| Job boards | Indeed, LinkedIn Jobs | Hiring for roles that signal need |
| Funding announcements | Crunchbase, TechCrunch | Companies with budget |
| GitHub | Popular repos in your space | Engineering teams using relevant tech |
| Review sites | G2, Capterra | Companies evaluating competitor tools |
| Industry forums | Reddit, Hacker News | People discussing problems you solve |
| Conference attendee lists | Event websites | Active in your industry |
Traditional Lead Gen vs AI Agent Lead Gen
| Dimension | Traditional (ZoomInfo, Apollo) | AI Agent + Web Scraping |
|---|---|---|
| Data freshness | Weeks to months old | Real-time, scraped live |
| Cost | $10K-$50K/year | $99-$299/month (API costs) |
| Custom qualification | Basic filters only | LLM scores against your exact ICP |
| Unique leads | Same database as competitors | Discovers leads no one else has |
| Enrichment depth | Name, email, phone | Tech stack, pain points, talking points |
| Setup time | Minutes | Hours (one-time) |
| Maintenance | None | Low (API handles scraping complexity) |
Best Practices
Respect Rate Limits and Ethics
- Don't scrape personal data without a legitimate business purpose
- Respect robots.txt and terms of service
- Use reasonable delays between requests
- Focus on publicly available business information
- Comply with GDPR, CCPA, and other privacy regulations
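One concrete way to enforce "reasonable delays between requests" is a minimal throttle that sleeps until a minimum interval has passed since the previous call. A sketch (the 2-second default is an assumption; tune it per site):

```python
import time

class Throttle:
    """Block until at least `min_interval` seconds since the previous call."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep off whatever remains of the minimum interval, then stamp time
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each `scrape_page(url)` keeps request spacing honest even when the surrounding loop is fast.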
Optimize for Quality Over Quantity
- Better to have 20 qualified leads than 2,000 cold ones
- Invest time in refining your ICP criteria โ it's the most important variable
- A/B test different qualification prompts to improve lead scoring accuracy
- Track which lead sources produce the highest conversion rates
Keep Costs Low
- Cache enrichment results โ company data doesn't change daily
- Use the Mantis Free tier (100 calls/month) for testing
- Only enrich companies that pass initial filters
- Batch API calls where possible
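The caching bullet above can be as simple as a JSON file keyed by URL with a time-to-live. A sketch (the cache filename and the 7-day TTL are assumptions):

```python
import json
import os
import time

CACHE_FILE = "enrichment_cache.json"
CACHE_TTL = 7 * 24 * 3600  # re-scrape after 7 days

def _load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def cached_enrich(url, enrich_fn):
    """Return cached enrichment for `url` if still fresh, else call `enrich_fn`."""
    cache = _load_cache()
    entry = cache.get(url)
    if entry and time.time() - entry["ts"] < CACHE_TTL:
        return entry["data"]
    data = enrich_fn(url)
    cache[url] = {"ts": time.time(), "data": data}
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return data
```

Used as `cached_enrich(company["website"], enrich_company)`, this means re-running the pipeline within the TTL window costs zero enrichment API calls for companies you've already seen.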
Start Finding Leads with AI
The Mantis WebPerception API gives your lead gen agent the power to scrape, screenshot, and extract structured data from any website. Start free: 100 API calls/month, no credit card required.
Get Your API Key →

What's Next
You've built the foundation of an AI-powered lead generation system. From here, you can:
- Add email finding: scrape company websites for contact patterns (first.last@company.com)
- Build outreach sequences: use the talking points from qualification to write personalized emails
- Create a dashboard: visualize your pipeline with Streamlit or Retool
- Integrate with your CRM: push qualified leads directly into Salesforce, HubSpot, or Pipedrive
- Add monitoring: re-scrape leads periodically to catch changes (new funding, new hires, new products)
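For the email-finding item in the list above, a regex pass over scraped page text is often enough to recover published addresses and make a rough guess at the company's naming pattern. A minimal sketch (the pattern classification is a naive heuristic, not a guarantee):

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def find_emails(text, domain=None):
    """Return unique emails found in `text`, optionally filtered to one domain."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    if domain:
        emails = [e for e in emails if e.lower().endswith("@" + domain.lower())]
    return emails

def guess_pattern(email):
    """Naively classify the local part of an address as a naming pattern."""
    local = email.split("@")[0]
    if "." in local:
        return "first.last"
    if len(local) <= 2:
        return "initials"
    return "first"
```

Feeding `scrape_page` output for a contact page through `find_emails(text, domain="acme.com")` yields candidates you can cross-check against the `key_contacts` names extracted during enrichment.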
The agent developers who win in 2026 aren't the ones with the biggest lead databases. They're the ones whose agents find fresh, qualified leads while everyone else is buying stale data.
Related reading: