Web Scraping for Recruitment: How AI Agents Find, Screen & Engage Top Talent in 2026
Recruiting teams spend $3,000–$10,000 per hire on sourcing tools, job board subscriptions, and agency fees. Most of that money goes to platforms that give you the same candidates everyone else sees.
What if an AI agent could scrape job boards, company pages, GitHub profiles, and conference speaker lists, then automatically screen candidates against your requirements and draft personalized outreach? That's not hypothetical. You can build it today.
In this guide, you'll build a complete AI-powered recruitment pipeline that sources candidates from the open web, scores them with LLM-powered analysis, and surfaces the best matches – all for a fraction of what traditional sourcing tools cost.
Why Traditional Recruiting Tools Fall Short
LinkedIn Recruiter costs $8,000–$12,000/year per seat. Job board subscriptions run $300–$500/month each. And you're still limited to people who opted into those platforms.
The best candidates are often passive: they're not on job boards. They're writing blog posts, contributing to open source, speaking at conferences, and publishing research. Web scraping lets you find them where they actually are.
| Capability | Traditional Tools | AI Agent Approach |
|---|---|---|
| Source coverage | 1-2 platforms (LinkedIn, Indeed) | Entire open web |
| Passive candidates | Limited to platform users | GitHub, blogs, conferences, papers |
| Screening | Keyword matching | LLM-powered contextual analysis |
| Personalization | Template mail merge | AI-crafted based on candidate's work |
| Cost per seat | $500–$1,000/mo | ~$29–$99/mo |
| Customization | Fixed features | Fully programmable |
Architecture: The AI Recruitment Pipeline
Our system follows six steps:
- Source Discovery: scrape job boards, GitHub, blogs, and conference sites for candidate profiles
- Profile Extraction: AI extracts structured candidate data from any page format
- Enrichment: cross-reference multiple sources per candidate
- AI Screening: an LLM scores candidates against your job requirements
- Outreach Drafting: generate personalized messages referencing their actual work
- Pipeline Management: store everything in SQLite, track status, alert recruiters
Step 1: Define Your Ideal Candidate Profile
```python
from pydantic import BaseModel
from typing import Optional
from enum import Enum

class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"
    PRINCIPAL = "principal"

class JobRequirements(BaseModel):
    title: str
    required_skills: list[str]
    preferred_skills: list[str]
    min_experience_years: int
    target_seniority: list[SeniorityLevel]
    location_preferences: list[str]  # "remote", "US", "Europe", etc.
    industry_preferences: list[str]

class CandidateProfile(BaseModel):
    name: str
    current_role: Optional[str] = None
    current_company: Optional[str] = None
    skills: list[str]
    experience_years: Optional[int] = None
    location: Optional[str] = None
    github_url: Optional[str] = None
    linkedin_url: Optional[str] = None
    blog_url: Optional[str] = None
    notable_projects: list[str] = []
    publications: list[str] = []
    source_url: str
    source_type: str  # "github", "blog", "conference", "job_board"

# Example: hiring a senior Python backend engineer
job = JobRequirements(
    title="Senior Backend Engineer",
    required_skills=["Python", "FastAPI", "PostgreSQL", "AWS"],
    preferred_skills=["Kubernetes", "Redis", "GraphQL", "ML"],
    min_experience_years=5,
    target_seniority=[SeniorityLevel.SENIOR, SeniorityLevel.STAFF],
    location_preferences=["remote", "US"],
    industry_preferences=["SaaS", "fintech", "developer tools"]
)
```
Step 2: Source Candidates from the Open Web
```python
import requests

MANTIS_API_KEY = "your-api-key"
BASE_URL = "https://api.mantisapi.com/v1"

def scrape_page(url: str) -> dict:
    """Scrape any page with Mantis WebPerception API."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"x-api-key": MANTIS_API_KEY},
        json={"url": url, "render_js": True}
    )
    return response.json()

def extract_candidates(url: str, source_type: str) -> list[dict]:
    """Extract candidate profiles from any page using AI."""
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"x-api-key": MANTIS_API_KEY},
        json={
            "url": url,
            "schema": {
                "candidates": [{
                    "name": "string",
                    "role": "string",
                    "company": "string",
                    "skills": ["string"],
                    "location": "string",
                    "profile_url": "string",
                    "bio": "string"
                }]
            },
            "prompt": f"Extract all people/candidates visible on this page. "
                      f"Source type: {source_type}. Include all available details."
        }
    )
    return response.json().get("data", {}).get("candidates", [])

# Source from multiple channels
sources = [
    # GitHub trending developers
    {"url": "https://github.com/trending/python?since=monthly", "type": "github"},
    # Conference speaker pages
    {"url": "https://pycon.org/speakers/", "type": "conference"},
    # Tech blog author pages
    {"url": "https://dev.to/top/week?tag=python", "type": "blog"},
]

all_candidates = []
for source in sources:
    candidates = extract_candidates(source["url"], source["type"])
    for c in candidates:
        c["source_url"] = source["url"]
        c["source_type"] = source["type"]
    all_candidates.extend(candidates)
    print(f"Found {len(candidates)} candidates from {source['type']}")
```
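Any scraping API call can fail transiently (timeouts, rate limits), so it's worth wrapping the sourcing calls in a retry. This is a minimal sketch; `with_retries` is a hypothetical helper I'm introducing here, not part of the Mantis API.

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Backoff doubles each attempt; jitter keeps parallel workers out of lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage with the sourcing loop above (assumed wiring):
# candidates = with_retries(lambda: extract_candidates(source["url"], source["type"]))
```

Keeping the retry logic in one helper means the sourcing, enrichment, and screening calls can all share the same failure policy.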
Step 3: Enrich Candidate Profiles
```python
def enrich_candidate(candidate: dict) -> dict:
    """Cross-reference candidate across multiple sources."""
    enriched = {**candidate}
    # If we have a GitHub URL, scrape their profile
    if candidate.get("profile_url") and "github.com" in candidate["profile_url"]:
        github_data = requests.post(
            f"{BASE_URL}/extract",
            headers={"x-api-key": MANTIS_API_KEY},
            json={
                "url": candidate["profile_url"],
                "schema": {
                    "name": "string",
                    "bio": "string",
                    "location": "string",
                    "company": "string",
                    "repos_count": "number",
                    "followers": "number",
                    "top_languages": ["string"],
                    "pinned_repos": [{"name": "string", "description": "string", "stars": "number"}],
                    "contribution_streak": "string"
                },
                "prompt": "Extract this GitHub user's complete profile information."
            }
        ).json().get("data", {})
        enriched["github_profile"] = github_data
        enriched["skills"] = list(set(
            enriched.get("skills", []) +
            github_data.get("top_languages", [])
        ))
    return enriched

# Enrich all candidates
enriched_candidates = [enrich_candidate(c) for c in all_candidates]
print(f"Enriched {len(enriched_candidates)} candidate profiles")
```
Step 4: AI-Powered Candidate Screening
```python
from openai import OpenAI
import json

client = OpenAI()

def screen_candidate(candidate: dict, job: JobRequirements) -> dict:
    """Use LLM to score and evaluate a candidate against job requirements."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are an expert technical recruiter. Evaluate the candidate
against the job requirements. Be thorough but fair. Consider:
- Skill match (required vs preferred)
- Experience level and seniority signals
- Quality of work (projects, contributions, publications)
- Cultural/industry fit
- Red flags or concerns
Return JSON with:
- score (1-100)
- rating: "STRONG_MATCH" | "GOOD_MATCH" | "PARTIAL_MATCH" | "WEAK_MATCH"
- skill_match: {matched: [], missing: [], bonus: []}
- strengths: [str]
- concerns: [str]
- outreach_angle: str (what to reference in personalized outreach)
- summary: str (2-3 sentence assessment)"""
        }, {
            "role": "user",
            "content": f"""JOB REQUIREMENTS:
{json.dumps(job.model_dump(), indent=2)}

CANDIDATE PROFILE:
{json.dumps(candidate, indent=2)}

Evaluate this candidate."""
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Screen all candidates
screened = []
for candidate in enriched_candidates:
    evaluation = screen_candidate(candidate, job)
    screened.append({
        "candidate": candidate,
        "evaluation": evaluation
    })

# Sort by score
screened.sort(key=lambda x: x["evaluation"].get("score", 0), reverse=True)

# Show top candidates
for s in screened[:10]:
    e = s["evaluation"]
    c = s["candidate"]
    print(f"{e.get('score', 0)}/100 | {e.get('rating')} | {c.get('name')} | {c.get('role')} @ {c.get('company')}")
    print(f"  Strengths: {', '.join(e.get('strengths', []))}")
    print(f"  Outreach angle: {e.get('outreach_angle')}")
    print()
```
Step 5: Generate Personalized Outreach
```python
def draft_outreach(candidate: dict, evaluation: dict, job: JobRequirements) -> str:
    """Generate personalized outreach message based on candidate's actual work."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are a recruiter writing a personalized outreach message.
Rules:
- Reference specific work the candidate has done (repos, posts, talks)
- Keep it under 150 words
- Be genuine, not salesy
- Mention 1-2 specific things that impressed you
- Don't use generic phrases like "I came across your profile"
- End with a low-pressure ask (15-min chat, not "apply now")"""
        }, {
            "role": "user",
            "content": f"""CANDIDATE: {json.dumps(candidate, indent=2)}

EVALUATION: {json.dumps(evaluation, indent=2)}

JOB: {job.title} – {', '.join(job.required_skills)}

OUTREACH ANGLE: {evaluation.get('outreach_angle', '')}

Draft a personalized outreach message."""
        }]
    )
    return response.choices[0].message.content

# Generate outreach for top candidates
for s in screened[:5]:
    if s["evaluation"].get("rating") in ["STRONG_MATCH", "GOOD_MATCH"]:
        message = draft_outreach(s["candidate"], s["evaluation"], job)
        s["outreach_message"] = message
        print(f"--- Outreach for {s['candidate'].get('name')} ---")
        print(message)
        print()
```
Step 6: Pipeline Management with SQLite
```python
import sqlite3
from datetime import datetime

def init_db():
    conn = sqlite3.connect("recruitment.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS candidates (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        current_role TEXT,
        current_company TEXT,
        skills TEXT,
        location TEXT,
        source_url TEXT,
        source_type TEXT,
        profile_data TEXT,
        score INTEGER,
        rating TEXT,
        evaluation TEXT,
        outreach_message TEXT,
        status TEXT DEFAULT 'sourced',
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE(name, source_url)
    )""")
    conn.execute("""CREATE TABLE IF NOT EXISTS pipeline_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        candidate_id INTEGER,
        event_type TEXT,
        details TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (candidate_id) REFERENCES candidates(id)
    )""")
    conn.commit()
    return conn

def save_candidate(conn, candidate: dict, evaluation: dict, outreach: str = None):
    """Save or update a candidate in the pipeline."""
    conn.execute("""INSERT OR REPLACE INTO candidates
        (name, current_role, current_company, skills, location,
         source_url, source_type, profile_data, score, rating,
         evaluation, outreach_message, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (candidate.get("name"), candidate.get("role"),
         candidate.get("company"), json.dumps(candidate.get("skills", [])),
         candidate.get("location"), candidate.get("source_url"),
         candidate.get("source_type"), json.dumps(candidate),
         evaluation.get("score"), evaluation.get("rating"),
         json.dumps(evaluation), outreach,
         datetime.now().isoformat()))
    conn.commit()

def get_pipeline_stats(conn) -> dict:
    """Get recruitment pipeline statistics."""
    stats = {}
    for status in ["sourced", "screened", "contacted", "responded", "interviewing", "offered", "hired"]:
        count = conn.execute(
            "SELECT COUNT(*) FROM candidates WHERE status = ?", (status,)
        ).fetchone()[0]
        stats[status] = count
    stats["avg_score"] = conn.execute(
        "SELECT AVG(score) FROM candidates WHERE score IS NOT NULL"
    ).fetchone()[0] or 0
    stats["strong_matches"] = conn.execute(
        "SELECT COUNT(*) FROM candidates WHERE rating = 'STRONG_MATCH'"
    ).fetchone()[0]
    return stats

# Save all screened candidates
conn = init_db()
for s in screened:
    save_candidate(conn, s["candidate"], s["evaluation"],
                   s.get("outreach_message"))

stats = get_pipeline_stats(conn)
print(f"Pipeline: {stats}")
print(f"Strong matches: {stats['strong_matches']}")
print(f"Avg score: {stats['avg_score']:.0f}/100")
```
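The `pipeline_events` table above is created but never written to. A small helper closes that gap by moving a candidate through the statuses that `get_pipeline_stats` tracks and logging each transition. This is a sketch against the schema defined above; `update_status` is a name introduced here, not a library function.

```python
import sqlite3
from datetime import datetime

def update_status(conn: sqlite3.Connection, candidate_id: int,
                  new_status: str, details: str = "") -> None:
    """Advance a candidate's pipeline status and log the transition as an event."""
    conn.execute(
        "UPDATE candidates SET status = ?, updated_at = ? WHERE id = ?",
        (new_status, datetime.now().isoformat(), candidate_id),
    )
    conn.execute(
        "INSERT INTO pipeline_events (candidate_id, event_type, details) "
        "VALUES (?, ?, ?)",
        (candidate_id, f"status:{new_status}", details),
    )
    conn.commit()

# Example: record that outreach went out to candidate #1
# update_status(conn, 1, "contacted", "Sent GitHub-based outreach email")
```

Logging transitions as events (rather than only overwriting `status`) preserves the history you need for funnel metrics like time-in-stage.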
Automating the Full Pipeline
```python
import asyncio

async def run_recruitment_cycle(job: JobRequirements, sources: list[dict]):
    """Run a complete recruitment sourcing cycle."""
    print(f"Starting recruitment cycle for: {job.title}")
    print(f"Scanning {len(sources)} sources...")
    conn = init_db()
    new_candidates = 0
    strong_matches = 0
    top_score = 0
    for source in sources:
        try:
            candidates = extract_candidates(source["url"], source["type"])
            for candidate in candidates:
                candidate["source_url"] = source["url"]
                candidate["source_type"] = source["type"]
                # Enrich
                enriched = enrich_candidate(candidate)
                # Screen
                evaluation = screen_candidate(enriched, job)
                top_score = max(top_score, evaluation.get("score", 0))
                # Only draft outreach for good+ matches
                outreach = None
                if evaluation.get("rating") in ["STRONG_MATCH", "GOOD_MATCH"]:
                    outreach = draft_outreach(enriched, evaluation, job)
                    strong_matches += 1
                # Save to pipeline
                save_candidate(conn, enriched, evaluation, outreach)
                new_candidates += 1
        except Exception as e:
            print(f"Error processing {source['url']}: {e}")
            continue
    # Summary: total only the per-status counts, not the avg_score/strong_matches extras
    stats = get_pipeline_stats(conn)
    pipeline_total = sum(
        stats[s] for s in ["sourced", "screened", "contacted", "responded",
                           "interviewing", "offered", "hired"]
    )
    summary = f"""Recruitment Cycle Complete:
  • Sources scanned: {len(sources)}
  • New candidates: {new_candidates}
  • Strong/Good matches: {strong_matches}
  • Pipeline total: {pipeline_total}
  • Top score: {top_score}/100"""
    print(summary)
    return summary

# Run daily
# asyncio.run(run_recruitment_cycle(job, sources))
```
Cost Comparison
| Tool/Approach | Monthly Cost | Candidates/mo | Cost/Candidate |
|---|---|---|---|
| LinkedIn Recruiter | $800–$1,000/seat | 50–100 | $8–$20 |
| Job board bundle (Indeed + ZipRecruiter) | $500–$800 | 30–80 | $6–$27 |
| Recruiting agency (retained) | $5,000–$15,000 | 5–15 | $333–$3,000 |
| AI Agent + Mantis API | $29–$99 | 100–500+ | $0.06–$0.99 |
The AI agent approach doesn't just save money: it finds candidates that traditional platforms miss entirely – passive candidates who aren't on job boards but are actively building impressive things in the open.
Use Cases by Company Type
1. Startups (1โ50 employees)
Startups can't afford $10K/year LinkedIn Recruiter seats or agency fees. An AI sourcing agent levels the playing field: scan GitHub, Hacker News, and conference sites to find engineers who'd never see your Indeed posting.
2. Recruiting Agencies
Agencies filling multiple roles can run parallel sourcing agents for each position. AI screening reduces time-to-shortlist from days to hours. Personalized outreach at scale increases response rates 3–5x over template emails.
3. Enterprise Talent Acquisition
Large companies hiring 100+ engineers/year can automate the top-of-funnel completely. The AI agent handles sourcing and initial screening; human recruiters focus on relationship building and closing.
4. Technical Recruiting Firms
Specialized firms can build domain-specific sourcing: scraping research papers for ML engineers, open-source contributions for infrastructure roles, or regulatory publications for compliance hires.
Best Practices
- Respect privacy and terms of service. Only scrape publicly available information. Never access private data or circumvent access controls.
- Be transparent in outreach. Tell candidates how you found them ("I noticed your FastAPI middleware library on GitHub").
- Deduplicate aggressively. The same person appears on GitHub, blogs, and conferences. Match by name + company to avoid duplicate outreach.
- Review AI screening decisions. Use LLM scores as a sorting tool, not a final filter. Human judgment still matters for culture fit and potential.
- Comply with local regulations. GDPR, CCPA, and local employment laws apply to candidate data. Store only what you need, delete what you don't.
- Rate limit your scraping. Don't hammer any single source. Spread sourcing across days and sources.
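The deduplication advice above can be sketched as a normalized name-plus-company merge. Note this is a heuristic: distinct people can share a name and employer, so treat merged records as candidates for human review. The `dedupe_key` and `dedupe` helpers are illustrative names, not part of any library.

```python
import re

def dedupe_key(candidate: dict) -> str:
    """Normalized name + company key for matching the same person across sources."""
    name = re.sub(r"\s+", " ", candidate.get("name", "")).strip().lower()
    company = (candidate.get("company") or "").strip().lower()
    return f"{name}|{company}"

def dedupe(candidates: list[dict]) -> list[dict]:
    """Keep the first record per key, merging skill lists from later duplicates."""
    seen: dict[str, dict] = {}
    for c in candidates:
        key = dedupe_key(c)
        if key in seen:
            # Merge skills so no signal from either source is lost
            merged = set(seen[key].get("skills", [])) | set(c.get("skills", []))
            seen[key]["skills"] = sorted(merged)
        else:
            seen[key] = c
    return list(seen.values())
```

Running `dedupe(all_candidates)` before screening avoids paying for duplicate LLM evaluations and, more importantly, avoids messaging the same person twice.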
Build Your AI Recruitment Pipeline
Mantis WebPerception API handles the scraping so you can focus on building the intelligence layer. Extract structured candidate data from any website with one API call.
Start Free – 100 calls/month

What's Next
You now have a complete AI-powered recruitment pipeline that sources, screens, and drafts outreach for candidates automatically. To take it further:
- The Complete Guide to Web Scraping with AI – master the fundamentals
- Structured Data Extraction with AI – improve your candidate profile extraction
- Web Scraping for Lead Generation – a similar pipeline pattern for sales leads
- Automate Website Monitoring – track new job postings and candidate activity
- API vs DIY Scraping – why building your own scraper is harder than it looks