Web Scraping for Lead Generation: How AI Agents Find and Qualify Prospects in 2026

March 10, 2026 · 12 min read · Lead Generation · AI Agents

Your sales team spends 60% of their time researching prospects instead of selling. Meanwhile, your competitors are deploying AI agents that scrape company websites, LinkedIn profiles, job boards, and press releases, automatically building qualified lead lists while your reps are still Googling.

In this guide, you'll build an AI-powered lead generation agent that finds prospects, extracts company data, qualifies them against your ICP (Ideal Customer Profile), and outputs enriched, sales-ready lead lists. All with Python and the Mantis WebPerception API.

Why Traditional Lead Gen Tools Fall Short

Tools like ZoomInfo, Apollo, and Lusha give you databases. But databases are:

  1. Stale: records are weeks or months old by the time you see them
  2. Shared: your competitors are buying the exact same lists
  3. Expensive: contracts routinely run $10K-$50K per year
  4. Generic: basic filters can't capture your actual ICP criteria

AI agents that scrape the live web solve all four problems. They find fresh data, discover prospects no database has indexed, cost a fraction of what lead databases charge, and qualify leads using your exact ICP criteria.

The AI Lead Gen Architecture

Here's how a production lead generation agent works:

  1. Source Discovery: find companies matching your target profile from directories, job boards, forums, and funding announcements
  2. Data Extraction: scrape each company's website for key signals (tech stack, team size, funding, pain points)
  3. Enrichment: pull additional data from LinkedIn, Crunchbase, GitHub, and G2 reviews
  4. Qualification: score each lead against your ICP using an LLM
  5. Output: structured JSON/CSV with qualified leads ready for your CRM
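One way to keep the pipeline manageable is to define the record that flows through these stages up front. A minimal sketch (the field names here are illustrative, not part of any API):

```python
from dataclasses import dataclass, field

@dataclass
class Lead:
    """A prospect as it moves through the pipeline stages."""
    company_name: str
    website: str
    enrichment: dict = field(default_factory=dict)     # filled in by stage 3
    qualification: dict = field(default_factory=dict)  # filled in by stage 4

    @property
    def is_qualified(self) -> bool:
        # Sales-ready once it scores 50+ ("warm" or better)
        return self.qualification.get("score", 0) >= 50

lead = Lead("Acme AI", "https://acme.ai")
lead.qualification = {"score": 82, "tier": "hot"}
print(lead.is_qualified)  # True
```

Stages 3 and 4 below fill in the two dicts; stage 5 serializes the whole record.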

Step 1: Set Up the Scraping Foundation

import requests
import json
from openai import OpenAI

MANTIS_API_KEY = "your-mantis-api-key"
MANTIS_BASE = "https://api.mantisapi.com/v1"
openai = OpenAI()

def scrape_page(url):
    """Scrape a webpage and get clean text content."""
    resp = requests.post(f"{MANTIS_BASE}/scrape", json={
        "url": url,
        "render_js": True,
        "wait_for": "networkidle"
    }, headers={"Authorization": f"Bearer {MANTIS_API_KEY}"}, timeout=60)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json()

def extract_data(url, schema):
    """Use AI to extract structured data from a webpage."""
    resp = requests.post(f"{MANTIS_BASE}/extract", json={
        "url": url,
        "schema": schema,
        "render_js": True
    }, headers={"Authorization": f"Bearer {MANTIS_API_KEY}"}, timeout=60)
    resp.raise_for_status()
    return resp.json()
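Scraping APIs occasionally return transient failures (timeouts, 429s), so production calls deserve retries. A small sketch of exponential backoff you could wrap around either helper (the delay values are arbitrary defaults, not anything the Mantis API prescribes):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff.

    `sleep` is injectable so the behavior is easy to test without waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller handle it
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage: result = with_retries(lambda: scrape_page("https://example.com"))
```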

Step 2: Build the Company Discovery Agent

The first job is finding companies to evaluate. Your agent can scrape multiple sources:

COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "website": {"type": "string"},
        "description": {"type": "string"},
        "industry": {"type": "string"},
        "location": {"type": "string"},
        "employee_count": {"type": "string"},
        "founded_year": {"type": "string"}
    }
}

def discover_companies_from_directory(directory_url):
    """Extract company listings from industry directories."""
    result = extract_data(directory_url, {
        "type": "array",
        "items": COMPANY_SCHEMA,
        "description": "All companies listed on this page with their details"
    })
    return result.get("data", [])

def discover_from_job_boards(search_url):
    """Find companies hiring for roles that signal they need your product."""
    result = extract_data(search_url, {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "company_name": {"type": "string"},
                "job_title": {"type": "string"},
                "description_snippet": {"type": "string"},
                "company_url": {"type": "string"}
            }
        },
        "description": "Companies hiring for these roles"
    })
    return result.get("data", [])

# Example: a startup directory, plus a job board search.
# Companies hiring data/AI engineers are a signal they're building AI products.
directory_sources = [
    "https://www.ycombinator.com/companies?batch=W26&industry=B2B",
]
job_board_sources = [
    "https://wellfound.com/role/data-engineer",
]

all_companies = []
for source in directory_sources:
    companies = discover_companies_from_directory(source)
    all_companies.extend(companies)
    print(f"Found {len(companies)} companies from {source}")

for source in job_board_sources:
    postings = discover_from_job_boards(source)
    # Job postings use a different schema; map them onto the company shape
    all_companies.extend(
        {"company_name": p.get("company_name", ""), "website": p.get("company_url", "")}
        for p in postings
    )
    print(f"Found {len(postings)} hiring companies from {source}")
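Because these sources overlap, the same company often shows up more than once. Before paying to enrich each one, deduplicate on a normalized domain rather than the raw URL. A sketch (the normalization rules are a reasonable default, not exhaustive):

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Reduce a website URL to a canonical domain for deduplication."""
    if not url:
        return ""
    if "://" not in url:
        url = "https://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def dedupe_companies(companies):
    """Keep the first entry per normalized domain."""
    seen, unique = set(), []
    for c in companies:
        key = normalize_domain(c.get("website", ""))
        if key and key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```

With this, `https://www.acme.ai/` and `http://acme.ai` count as the same company.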

Step 3: Deep Company Enrichment

Once you have a list of companies, scrape their websites for qualification signals:

ENRICHMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "what_they_do": {"type": "string"},
        "target_customers": {"type": "string"},
        "tech_stack_signals": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Technologies mentioned (languages, frameworks, APIs, tools)"
        },
        "team_size_signals": {"type": "string"},
        "funding_stage": {"type": "string"},
        "pricing_model": {"type": "string"},
        "pain_points": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Problems they might have that our product solves"
        },
        "key_contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            },
            "description": "Leadership team members visible on the site"
        }
    }
}

def enrich_company(website_url):
    """Deep-scrape a company website for qualification data."""
    # Scrape main page
    main_data = extract_data(website_url, ENRICHMENT_SCHEMA)

    # Also check /about, /team, /pricing pages
    enrichment_pages = ["/about", "/team", "/pricing", "/customers"]
    for page in enrichment_pages:
        try:
            page_data = extract_data(
                f"{website_url.rstrip('/')}{page}",
                ENRICHMENT_SCHEMA
            )
            # Merge additional signals
            if page_data.get("data"):
                for key, value in page_data["data"].items():
                    if value and not main_data.get("data", {}).get(key):
                        main_data.setdefault("data", {})[key] = value
        except Exception:
            continue

    return main_data.get("data", {})

# Enrich each discovered company
enriched_leads = []
for company in all_companies[:20]:  # process the first 20 discovered
    if company.get("website"):
        print(f"Enriching: {company.get('company_name', company['website'])}...")
        enriched = enrich_company(company["website"])
        enriched["source_url"] = company["website"]
        enriched_leads.append(enriched)

Step 4: AI-Powered Lead Qualification

This is the step static databases can't replicate. Instead of simple filters, use an LLM to evaluate each lead against your ICP:

ICP_CRITERIA = """
Our Ideal Customer Profile:
- B2B SaaS companies building AI-powered products
- 10-200 employees (Series A to Series C)
- Engineering team uses Python or Node.js
- Currently scraping the web OR building agents that need web data
- Pain points: maintaining scrapers, handling anti-bot measures, scaling data extraction
- Budget: $100-$500/mo for developer tools
- Decision makers: CTO, VP Engineering, Lead Developer
"""

def qualify_lead(enriched_data):
    """Use GPT-4o to score and qualify a lead against ICP."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a sales qualification expert.
Score this lead against our ICP and provide actionable intelligence.

{ICP_CRITERIA}

Return JSON with:
- score: 1-100 (how well they match our ICP)
- tier: "hot" (80+), "warm" (50-79), "cold" (<50)
- reasons: list of why they match or don't
- talking_points: personalized outreach angles
- objections: likely objections and how to handle them
- recommended_plan: which pricing plan fits them
- urgency: "high", "medium", "low" (how urgently they need our product)
"""},
            {"role": "user", "content": f"Qualify this lead:\n{json.dumps(enriched_data, indent=2)}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Qualify all enriched leads
qualified_leads = []
for lead in enriched_leads:
    qualification = qualify_lead(lead)
    lead["qualification"] = qualification
    qualified_leads.append(lead)
    print(f"  {lead.get('company_name', 'Unknown')}: "
          f"Score {qualification['score']} ({qualification['tier']})")

# Sort by score
qualified_leads.sort(key=lambda x: x["qualification"]["score"], reverse=True)
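LLMs occasionally return a tier that disagrees with the score, or a score outside 1-100, so it's worth normalizing each qualification before sorting on it. A defensive sketch using the thresholds from the prompt above:

```python
def normalize_qualification(q):
    """Clamp the score to 1-100 and recompute the tier from it,
    so the tier always matches the thresholds in the prompt."""
    score = max(1, min(100, int(q.get("score", 1))))
    if score >= 80:
        tier = "hot"
    elif score >= 50:
        tier = "warm"
    else:
        tier = "cold"
    return {**q, "score": score, "tier": tier}
```

Run each `qualify_lead` result through this before appending it to `qualified_leads`.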

Step 5: Export Sales-Ready Leads

import csv

def export_to_csv(leads, filename="qualified_leads.csv"):
    """Export qualified leads to a CSV ready for CRM import."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            "Company", "Website", "Score", "Tier", "Industry",
            "Team Size", "Funding", "Key Contact", "Contact Role",
            "Talking Points", "Recommended Plan", "Urgency"
        ])
        for lead in leads:
            q = lead.get("qualification", {})
            contacts = lead.get("key_contacts", [{}])
            primary_contact = contacts[0] if contacts else {}
            writer.writerow([
                lead.get("company_name", ""),
                lead.get("source_url", ""),
                q.get("score", ""),
                q.get("tier", ""),
                lead.get("industry", ""),
                lead.get("team_size_signals", ""),
                lead.get("funding_stage", ""),
                primary_contact.get("name", ""),
                primary_contact.get("role", ""),
                " | ".join(q.get("talking_points", [])),
                q.get("recommended_plan", ""),
                q.get("urgency", "")
            ])
    print(f"Exported {len(leads)} leads to {filename}")

# Export hot and warm leads
hot_warm = [l for l in qualified_leads if l["qualification"]["tier"] in ("hot", "warm")]
export_to_csv(hot_warm)

Step 6: Automated Pipeline with Scheduling

Run your lead gen agent on a schedule to continuously discover new prospects:

import schedule
import time

# Directory URLs to scan on each run (reuse the sources from Step 2)
LEAD_SOURCES = [
    "https://www.ycombinator.com/companies?batch=W26&industry=B2B",
]

def daily_lead_gen():
    """Run the full lead generation pipeline."""
    print(f"Starting lead gen run: {time.strftime('%Y-%m-%d %H:%M')}")

    # 1. Discover from multiple sources
    companies = []
    for source in LEAD_SOURCES:
        companies.extend(discover_companies_from_directory(source))

    # 2. Deduplicate
    seen = set()
    unique = []
    for c in companies:
        key = c.get("website", "").lower().strip("/")
        if key and key not in seen:
            seen.add(key)
            unique.append(c)

    # 3. Enrich & qualify
    for company in unique:
        enriched = enrich_company(company["website"])
        qualification = qualify_lead(enriched)
        if qualification["tier"] in ("hot", "warm"):
            # Save to database or CRM (implement save_lead for your stack)
            save_lead(enriched, qualification)
            print(f"  HOT/WARM: {company['company_name']} (score: {qualification['score']})")

    print(f"Pipeline complete. Processed {len(unique)} companies.")

# Run every morning at 6 AM
schedule.every().day.at("06:00").do(daily_lead_gen)

while True:
    schedule.run_pending()
    time.sleep(60)
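A scheduled pipeline like this can hit the same sites day after day, so space your requests out. A minimal throttle to drop into the enrichment loop (the one-second default is an assumption; check each site's terms and your API plan's limits):

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls.

    `clock` and `sleep` are injectable so the logic is testable without waiting.
    """
    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()

# Usage inside the pipeline loop:
# throttle = Throttle(min_interval=2.0)
# for company in unique:
#     throttle.wait()
#     enriched = enrich_company(company["website"])
```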

Lead Sources That Work

| Source Type | Example | Signal |
| --- | --- | --- |
| Startup directories | Y Combinator, Product Hunt | New companies with funding |
| Job boards | Indeed, LinkedIn Jobs | Hiring for roles that signal need |
| Funding announcements | Crunchbase, TechCrunch | Companies with budget |
| GitHub | Popular repos in your space | Engineering teams using relevant tech |
| Review sites | G2, Capterra | Companies evaluating competitor tools |
| Industry forums | Reddit, Hacker News | People discussing problems you solve |
| Conference attendee lists | Event websites | Active in your industry |

Traditional Lead Gen vs AI Agent Lead Gen

| Dimension | Traditional (ZoomInfo, Apollo) | AI Agent + Web Scraping |
| --- | --- | --- |
| Data freshness | Weeks to months old | Real-time, scraped live |
| Cost | $10K-$50K/year | $99-$299/month (API costs) |
| Custom qualification | Basic filters only | LLM scores against your exact ICP |
| Unique leads | Same database as competitors | Discovers leads no one else has |
| Enrichment depth | Name, email, phone | Tech stack, pain points, talking points |
| Setup time | Minutes | Hours (one-time) |
| Maintenance | None | Low (API handles scraping complexity) |

Best Practices

Respect Rate Limits and Ethics

  - Check each site's robots.txt and terms of service before scraping it
  - Throttle requests so you never hammer a single domain
  - Scrape only public business information; skip personal data you don't need

Optimize for Quality Over Quantity

  - A list of 50 well-qualified leads beats 5,000 unscored names
  - Tighten your ICP criteria as you learn which leads actually convert
  - Re-qualify older leads before outreach; signals change fast

Keep Costs Low

  - Deduplicate before enriching so you never pay to scrape a company twice
  - Cache scrape results; most company pages change slowly
  - Only send leads that pass a cheap pre-filter to the LLM for scoring
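To sanity-check spend before scaling up, a back-of-the-envelope cost model helps. A sketch (the per-call prices are placeholders; substitute your actual API and LLM pricing):

```python
def estimate_monthly_cost(companies_per_day, pages_per_company,
                          scrape_cost=0.002, llm_cost_per_lead=0.01, days=30):
    """Rough monthly cost: extraction calls plus one LLM qualification per lead."""
    scrape_calls = companies_per_day * pages_per_company * days
    qualifications = companies_per_day * days
    return round(scrape_calls * scrape_cost
                 + qualifications * llm_cost_per_lead, 2)

# e.g. 50 companies/day at 5 pages each:
# 7,500 scrape calls + 1,500 qualifications per month
print(estimate_monthly_cost(50, 5))  # 30.0
```

Trimming `pages_per_company` (say, skipping /customers when /about already answered the question) is usually the biggest lever.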

Start Finding Leads with AI

The Mantis WebPerception API gives your lead gen agent the power to scrape, screenshot, and extract structured data from any website. Start free: 100 API calls/month, no credit card required.

Get Your API Key →

What's Next

You've built the foundation of an AI-powered lead generation system. From here, you can:

  - Push qualified leads straight into your CRM through its API
  - Generate personalized outreach from the talking points your agent already produces
  - Monitor existing leads for trigger events like funding rounds or new job postings

The agent developers who win in 2026 aren't the ones with the biggest lead databases. They're the ones whose agents find fresh, qualified leads while everyone else is buying stale data.
