Web Scraping for Lead Generation: How AI Agents Find and Qualify Prospects in 2026
Your sales team spends 60% of their time researching prospects instead of selling. Meanwhile, your competitors are deploying AI agents that scrape company websites, LinkedIn profiles, job boards, and press releases, automatically building qualified lead lists while your reps are still Googling.
In this guide, you'll build an AI-powered lead generation agent that finds prospects, extracts company data, qualifies them against your ICP (Ideal Customer Profile), and outputs enriched, sales-ready lead lists. All with Python and the Mantis WebPerception API.
Why Traditional Lead Gen Tools Fall Short
Tools like ZoomInfo, Apollo, and Lusha give you databases. But databases are:
- Stale: contact data decays at roughly 30% per year
- Generic: the same leads your competitors are buying
- Expensive: $10K+/year for decent data
- Rigid: you can't define custom qualification criteria
AI agents that scrape the live web solve all four problems. They find fresh data, discover prospects no database has indexed, cost a fraction of lead databases, and qualify leads using your exact ICP criteria.
The AI Lead Gen Architecture
Here's how a production lead generation agent works:
1. Source Discovery: find companies matching your target profile from directories, job boards, forums, and funding announcements
2. Data Extraction: scrape each company's website for key signals (tech stack, team size, funding, pain points)
3. Enrichment: pull additional data from LinkedIn, Crunchbase, GitHub, and G2 reviews
4. Qualification: score each lead against your ICP using an LLM
5. Output: structured JSON/CSV with qualified leads, ready for your CRM
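Before wiring in real scraping and LLM calls, it helps to see the five stages as a composed pipeline. The sketch below uses stubbed stages with made-up data; the function names (`discover`, `enrich`, `qualify`, `run_pipeline`) are illustrative placeholders, not part of any API:

```python
# Minimal pipeline skeleton: each stage is a plain function, so stages
# can be swapped out or unit-tested independently. All bodies are stubs.

def discover(sources):
    """Stage 1: return candidate companies from a list of source URLs."""
    return [{"company_name": f"Co{i}", "website": s} for i, s in enumerate(sources)]

def enrich(company):
    """Stages 2-3: attach extra signals to a company record."""
    return {**company, "tech_stack_signals": ["python"]}

def qualify(company):
    """Stage 4: attach a score; the real version calls an LLM."""
    return {**company, "score": 75 if "python" in company["tech_stack_signals"] else 20}

def run_pipeline(sources, min_score=50):
    """Stage 5: run all stages and keep leads above a score threshold."""
    leads = [qualify(enrich(c)) for c in discover(sources)]
    return [lead for lead in leads if lead["score"] >= min_score]
```

Each stage takes and returns plain dicts, which makes it easy to swap a stub for a real implementation one stage at a time.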
Step 1: Set Up the Scraping Foundation
```python
import requests
import json
from openai import OpenAI

MANTIS_API_KEY = "your-mantis-api-key"
MANTIS_BASE = "https://api.mantisapi.com/v1"
openai = OpenAI()

def scrape_page(url):
    """Scrape a webpage and get clean text content."""
    resp = requests.post(f"{MANTIS_BASE}/scrape", json={
        "url": url,
        "render_js": True,
        "wait_for": "networkidle"
    }, headers={"Authorization": f"Bearer {MANTIS_API_KEY}"}, timeout=60)
    resp.raise_for_status()
    return resp.json()

def extract_data(url, schema):
    """Use AI to extract structured data from a webpage."""
    resp = requests.post(f"{MANTIS_BASE}/extract", json={
        "url": url,
        "schema": schema,
        "render_js": True
    }, headers={"Authorization": f"Bearer {MANTIS_API_KEY}"}, timeout=60)
    resp.raise_for_status()
    return resp.json()
```
Step 2: Build the Company Discovery Agent
The first job is finding companies to evaluate. Your agent can scrape multiple sources:
```python
COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "website": {"type": "string"},
        "description": {"type": "string"},
        "industry": {"type": "string"},
        "location": {"type": "string"},
        "employee_count": {"type": "string"},
        "founded_year": {"type": "string"}
    }
}

def discover_companies_from_directory(directory_url):
    """Extract company listings from industry directories."""
    result = extract_data(directory_url, {
        "type": "array",
        "items": COMPANY_SCHEMA,
        "description": "All companies listed on this page with their details"
    })
    return result.get("data", [])

def discover_from_job_boards(search_url):
    """Find companies hiring for roles that signal they need your product."""
    result = extract_data(search_url, {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "company_name": {"type": "string"},
                "job_title": {"type": "string"},
                "description_snippet": {"type": "string"},
                "company_url": {"type": "string"}
            }
        },
        "description": "Companies hiring for these roles"
    })
    return result.get("data", [])

# Example sources: a YC directory page and a Wellfound role listing
# (hiring data engineers is a signal the team works with web data)
sources = [
    "https://www.ycombinator.com/companies?batch=W26&industry=B2B",
    "https://wellfound.com/role/data-engineer",
]

all_companies = []
for source in sources:
    companies = discover_companies_from_directory(source)
    all_companies.extend(companies)
    print(f"Found {len(companies)} companies from {source}")
```
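Directories and job boards overlap heavily, so it pays to normalize URLs and deduplicate by domain before spending API calls on enrichment. A small helper sketch (the normalization rules here are assumptions; adjust them for your sources):

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Reduce a URL to a bare domain: strip scheme, 'www.', path, and case."""
    if not url:
        return ""
    netloc = urlparse(url if "//" in url else f"https://{url}").netloc
    return netloc.lower().removeprefix("www.")

def dedupe_companies(companies):
    """Keep the first record seen for each distinct domain."""
    seen, unique = set(), []
    for c in companies:
        key = normalize_domain(c.get("website", ""))
        if key and key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```

Running `all_companies = dedupe_companies(all_companies)` here keeps the downstream enrichment and qualification steps from paying twice for the same company.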
Step 3: Deep Company Enrichment
Once you have a list of companies, scrape their websites for qualification signals:
```python
ENRICHMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "what_they_do": {"type": "string"},
        "target_customers": {"type": "string"},
        "tech_stack_signals": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Technologies mentioned (languages, frameworks, APIs, tools)"
        },
        "team_size_signals": {"type": "string"},
        "funding_stage": {"type": "string"},
        "pricing_model": {"type": "string"},
        "pain_points": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Problems they might have that our product solves"
        },
        "key_contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            },
            "description": "Leadership team members visible on the site"
        }
    }
}

def enrich_company(website_url):
    """Deep-scrape a company website for qualification data."""
    # Scrape main page
    main_data = extract_data(website_url, ENRICHMENT_SCHEMA)

    # Also check /about, /team, /pricing pages
    enrichment_pages = ["/about", "/team", "/pricing", "/customers"]
    for page in enrichment_pages:
        try:
            page_data = extract_data(
                f"{website_url.rstrip('/')}{page}",
                ENRICHMENT_SCHEMA
            )
            # Merge additional signals
            if page_data.get("data"):
                for key, value in page_data["data"].items():
                    if value and not main_data.get("data", {}).get(key):
                        main_data.setdefault("data", {})[key] = value
        except Exception:
            continue

    return main_data.get("data", {})

# Enrich each discovered company
enriched_leads = []
for company in all_companies[:20]:  # Process top 20
    if company.get("website"):
        print(f"Enriching: {company['company_name']}...")
        enriched = enrich_company(company["website"])
        enriched["source_url"] = company["website"]
        enriched_leads.append(enriched)
```
Step 4: AI-Powered Lead Qualification
This is where the magic happens. Instead of simple filters, use an LLM to evaluate each lead against your ICP:
```python
ICP_CRITERIA = """
Our Ideal Customer Profile:
- B2B SaaS companies building AI-powered products
- 10-200 employees (Series A to Series C)
- Engineering team uses Python or Node.js
- Currently scraping the web OR building agents that need web data
- Pain points: maintaining scrapers, handling anti-bot measures, scaling data extraction
- Budget: $100-$500/mo for developer tools
- Decision makers: CTO, VP Engineering, Lead Developer
"""

def qualify_lead(enriched_data):
    """Use GPT-4o to score and qualify a lead against ICP."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a sales qualification expert.
Score this lead against our ICP and provide actionable intelligence.

{ICP_CRITERIA}

Return JSON with:
- score: 1-100 (how well they match our ICP)
- tier: "hot" (80+), "warm" (50-79), "cold" (<50)
- reasons: list of why they match or don't
- talking_points: personalized outreach angles
- objections: likely objections and how to handle them
- recommended_plan: which pricing plan fits them
- urgency: "high", "medium", or "low" (how urgently they need our product)
"""},
            {"role": "user", "content": f"Qualify this lead:\n{json.dumps(enriched_data, indent=2)}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Qualify all enriched leads
qualified_leads = []
for lead in enriched_leads:
    qualification = qualify_lead(lead)
    lead["qualification"] = qualification
    qualified_leads.append(lead)
    print(f"  {lead.get('company_name', 'Unknown')}: "
          f"Score {qualification['score']} ({qualification['tier']})")

# Sort by score
qualified_leads.sort(key=lambda x: x["qualification"]["score"], reverse=True)
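Even with `response_format` set to JSON mode, an LLM can return a payload with missing keys or the wrong types, so it's worth normalizing the qualification dict before sorting or filtering on it. A defensive sketch (the default values and the score-derived tiers are assumptions matching the prompt above):

```python
def normalize_qualification(q):
    """Coerce an LLM qualification dict into a predictable shape."""
    try:
        score = int(q.get("score", 0))
    except (TypeError, ValueError):
        score = 0
    score = max(0, min(100, score))  # clamp into the 0-100 range
    # Derive the tier from the score rather than trusting the model's label
    tier = "hot" if score >= 80 else "warm" if score >= 50 else "cold"
    return {
        "score": score,
        "tier": tier,
        "reasons": list(q.get("reasons") or []),
        "talking_points": list(q.get("talking_points") or []),
        "objections": list(q.get("objections") or []),
        "recommended_plan": q.get("recommended_plan") or "",
        "urgency": q.get("urgency") if q.get("urgency") in ("high", "medium", "low") else "low",
    }
```

Wrapping each `qualify_lead` result in `normalize_qualification` means a single malformed response degrades to a "cold" lead instead of crashing the sort.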
Step 5: Export Sales-Ready Leads
```python
import csv

def export_to_csv(leads, filename="qualified_leads.csv"):
    """Export qualified leads to a CSV ready for CRM import."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            "Company", "Website", "Score", "Tier", "Industry",
            "Team Size", "Funding", "Key Contact", "Contact Role",
            "Talking Points", "Recommended Plan", "Urgency"
        ])
        for lead in leads:
            q = lead.get("qualification", {})
            contacts = lead.get("key_contacts", [{}])
            primary_contact = contacts[0] if contacts else {}
            writer.writerow([
                lead.get("company_name", ""),
                lead.get("source_url", ""),
                q.get("score", ""),
                q.get("tier", ""),
                lead.get("industry", ""),
                lead.get("team_size_signals", ""),
                lead.get("funding_stage", ""),
                primary_contact.get("name", ""),
                primary_contact.get("role", ""),
                " | ".join(q.get("talking_points", [])),
                q.get("recommended_plan", ""),
                q.get("urgency", "")
            ])
    print(f"Exported {len(leads)} leads to {filename}")

# Export hot and warm leads
hot_warm = [l for l in qualified_leads if l["qualification"]["tier"] in ("hot", "warm")]
export_to_csv(hot_warm)
```
Step 6: Automated Pipeline with Scheduling
Run your lead gen agent on a schedule to continuously discover new prospects:
```python
import schedule
import time

# Reuse the discovery sources from Step 2; extend this list over time
LEAD_SOURCES = sources

def daily_lead_gen():
    """Run the full lead generation pipeline."""
    print(f"Starting lead gen run: {time.strftime('%Y-%m-%d %H:%M')}")

    # 1. Discover from multiple sources
    companies = []
    for source in LEAD_SOURCES:
        companies.extend(discover_companies_from_directory(source))

    # 2. Deduplicate by website
    seen = set()
    unique = []
    for c in companies:
        key = c.get("website", "").lower().rstrip("/")
        if key and key not in seen:
            seen.add(key)
            unique.append(c)

    # 3. Enrich & qualify
    for company in unique:
        enriched = enrich_company(company["website"])
        qualification = qualify_lead(enriched)
        if qualification["tier"] in ("hot", "warm"):
            # save_lead() is your own persistence hook (database or CRM)
            save_lead(enriched, qualification)
            print(f"  HOT/WARM: {company['company_name']} (score: {qualification['score']})")

    print(f"Pipeline complete. Processed {len(unique)} companies.")

# Run every morning at 6 AM
schedule.every().day.at("06:00").do(daily_lead_gen)

while True:
    schedule.run_pending()
    time.sleep(60)
```
Lead Sources That Work
| Source Type | Example | Signal |
|---|---|---|
| Startup directories | Y Combinator, Product Hunt | New companies with funding |
| Job boards | Indeed, LinkedIn Jobs | Hiring for roles that signal need |
| Funding announcements | Crunchbase, TechCrunch | Companies with budget |
| GitHub | Popular repos in your space | Engineering teams using relevant tech |
| Review sites | G2, Capterra | Companies evaluating competitor tools |
| Industry forums | Reddit, Hacker News | People discussing problems you solve |
| Conference attendee lists | Event websites | Active in your industry |
Traditional Lead Gen vs AI Agent Lead Gen
| Dimension | Traditional (ZoomInfo, Apollo) | AI Agent + Web Scraping |
|---|---|---|
| Data freshness | Weeks to months old | Real-time, scraped live |
| Cost | $10K-$50K/year | $99-$299/month (API costs) |
| Custom qualification | Basic filters only | LLM scores against your exact ICP |
| Unique leads | Same database as competitors | Discovers leads no one else has |
| Enrichment depth | Name, email, phone | Tech stack, pain points, talking points |
| Setup time | Minutes | Hours (one-time) |
| Maintenance | None | Low (API handles scraping complexity) |
Best Practices
Respect Rate Limits and Ethics
- Don't scrape personal data without a legitimate business purpose
- Respect robots.txt and terms of service
- Use reasonable delays between requests
- Focus on publicly available business information
- Comply with GDPR, CCPA, and other privacy regulations
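One concrete way to enforce "reasonable delays between requests" is a minimal throttle that sleeps until a minimum interval has passed since the previous call. A sketch (the 2-second default is an assumption; tune it per site):

```python
import time

class Throttle:
    """Block until at least `min_interval` seconds since the previous call."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep off whatever remains of the minimum interval, then stamp time
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each `scrape_page(url)` keeps request spacing honest even when the surrounding loop is fast.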
Optimize for Quality Over Quantity
- Better to have 20 qualified leads than 2,000 cold ones
- Invest time in refining your ICP criteria โ it's the most important variable
- A/B test different qualification prompts to improve lead scoring accuracy
- Track which lead sources produce the highest conversion rates
Keep Costs Low
- Cache enrichment results โ company data doesn't change daily
- Use the Mantis Free tier (100 calls/month) for testing
- Only enrich companies that pass initial filters
- Batch API calls where possible
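The caching bullet above can be as simple as a JSON file keyed by URL with a time-to-live. A sketch (the cache filename and the 7-day TTL are assumptions):

```python
import json
import os
import time

CACHE_FILE = "enrichment_cache.json"
CACHE_TTL = 7 * 24 * 3600  # re-scrape after 7 days

def _load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def cached_enrich(url, enrich_fn):
    """Return cached enrichment for `url` if still fresh, else call `enrich_fn`."""
    cache = _load_cache()
    entry = cache.get(url)
    if entry and time.time() - entry["ts"] < CACHE_TTL:
        return entry["data"]
    data = enrich_fn(url)
    cache[url] = {"ts": time.time(), "data": data}
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return data
```

Used as `cached_enrich(company["website"], enrich_company)`, this means re-running the pipeline within the TTL window costs zero enrichment API calls for companies you've already seen.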
Start Finding Leads with AI
The Mantis WebPerception API gives your lead gen agent the power to scrape, screenshot, and extract structured data from any website. Start free: 100 API calls/month, no credit card required.
Get Your API Key →

What's Next
You've built the foundation of an AI-powered lead generation system. From here, you can:
- Add email finding: scrape company websites for contact patterns (first.last@company.com)
- Build outreach sequences: use the talking points from qualification to write personalized emails
- Create a dashboard: visualize your pipeline with Streamlit or Retool
- Integrate with your CRM: push qualified leads directly into Salesforce, HubSpot, or Pipedrive
- Add monitoring: re-scrape leads periodically to catch changes (new funding, new hires, new products)
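For the email-finding item in the list above, a regex pass over scraped page text is often enough to recover published addresses and make a rough guess at the company's naming pattern. A minimal sketch (the pattern classification is a naive heuristic, not a guarantee):

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def find_emails(text, domain=None):
    """Return unique emails found in `text`, optionally filtered to one domain."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    if domain:
        emails = [e for e in emails if e.lower().endswith("@" + domain.lower())]
    return emails

def guess_pattern(email):
    """Naively classify the local part of an address as a naming pattern."""
    local = email.split("@")[0]
    if "." in local:
        return "first.last"
    if len(local) <= 2:
        return "initials"
    return "first"
```

Feeding `scrape_page` output for a contact page through `find_emails(text, domain="acme.com")` yields candidates you can cross-check against the `key_contacts` names extracted during enrichment.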
The agent developers who win in 2026 aren't the ones with the biggest lead databases. They're the ones whose agents find fresh, qualified leads while everyone else is buying stale data.
Related reading: