Web Scraping for Recruitment: How AI Agents Find, Screen & Engage Top Talent in 2026

Published March 10, 2026 · 12 min read · By Mantis Team

Recruiting teams spend $3,000–$10,000 per hire on sourcing tools, job board subscriptions, and agency fees. Most of that money goes to platforms that give you the same candidates everyone else sees.

What if an AI agent could scrape job boards, company pages, GitHub profiles, and conference speaker lists — then automatically screen candidates against your requirements and draft personalized outreach? That's not hypothetical. You can build it today.

In this guide, you'll build a complete AI-powered recruitment pipeline that sources candidates from the open web, scores them with LLM-powered analysis, and surfaces the best matches — all for a fraction of what traditional sourcing tools cost.

Why Traditional Recruiting Tools Fall Short

LinkedIn Recruiter costs $8,000–$12,000/year per seat. Job board subscriptions run $300–$500/month each. And you're still limited to people who opted into those platforms.

The best candidates are often passive — they're not on job boards. They're writing blog posts, contributing to open source, speaking at conferences, and publishing research. Web scraping lets you find them where they actually are.

Capability         | Traditional Tools                | AI Agent Approach
Source coverage    | 1–2 platforms (LinkedIn, Indeed) | Entire open web
Passive candidates | Limited to platform users        | GitHub, blogs, conferences, papers
Screening          | Keyword matching                 | LLM-powered contextual analysis
Personalization    | Template mail merge              | AI-crafted based on candidate's work
Cost per seat      | $500–$1,000/mo                   | ~$29–$99/mo
Customization      | Fixed features                   | Fully programmable

Architecture: The AI Recruitment Pipeline

Our system follows six steps:

  1. Source Discovery — Scrape job boards, GitHub, blogs, and conference sites for candidate profiles
  2. Profile Extraction — AI extracts structured candidate data from any page format
  3. Enrichment — Cross-reference multiple sources per candidate
  4. AI Screening — LLM scores candidates against your job requirements
  5. Outreach Drafting — Generate personalized messages referencing their actual work
  6. Pipeline Management — Store everything in SQLite, track status, alert recruiters

Step 1: Define Your Ideal Candidate Profile

from pydantic import BaseModel
from typing import Optional
from enum import Enum

class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"
    PRINCIPAL = "principal"

class JobRequirements(BaseModel):
    title: str
    required_skills: list[str]
    preferred_skills: list[str]
    min_experience_years: int
    target_seniority: list[SeniorityLevel]
    location_preferences: list[str]  # "remote", "US", "Europe", etc.
    industry_preferences: list[str]

class CandidateProfile(BaseModel):
    name: str
    current_role: Optional[str] = None
    current_company: Optional[str] = None
    skills: list[str]
    experience_years: Optional[int] = None
    location: Optional[str] = None
    github_url: Optional[str] = None
    linkedin_url: Optional[str] = None
    blog_url: Optional[str] = None
    notable_projects: list[str] = []
    publications: list[str] = []
    source_url: str
    source_type: str  # "github", "blog", "conference", "job_board"

# Example: hiring a senior Python backend engineer
job = JobRequirements(
    title="Senior Backend Engineer",
    required_skills=["Python", "FastAPI", "PostgreSQL", "AWS"],
    preferred_skills=["Kubernetes", "Redis", "GraphQL", "ML"],
    min_experience_years=5,
    target_seniority=[SeniorityLevel.SENIOR, SeniorityLevel.STAFF],
    location_preferences=["remote", "US"],
    industry_preferences=["SaaS", "fintech", "developer tools"]
)
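Screening every scraped profile with an LLM gets expensive, so it can help to run a cheap deterministic skill check first and only send plausible candidates to Step 4. The helper below is an illustrative sketch, not part of the pipeline above: it compares a candidate's skill list against the required and preferred skills case-insensitively.

```python
def skill_overlap(candidate_skills: list[str],
                  required: list[str],
                  preferred: list[str]) -> dict:
    """Case-insensitive skill comparison used as a cheap pre-filter
    before LLM screening. Returns matched/missing required skills
    plus any preferred skills the candidate brings as a bonus."""
    have = {s.strip().lower() for s in candidate_skills}
    return {
        "matched": [s for s in required if s.lower() in have],
        "missing": [s for s in required if s.lower() not in have],
        "bonus": [s for s in preferred if s.lower() in have],
    }

# Example: a candidate listing "python", "AWS", "Redis" against the job above
overlap = skill_overlap(
    ["python", "AWS", "Redis"],
    required=["Python", "FastAPI", "PostgreSQL", "AWS"],
    preferred=["Kubernetes", "Redis", "GraphQL", "ML"],
)
# matched: Python, AWS; missing: FastAPI, PostgreSQL; bonus: Redis
```

A simple policy is to skip LLM screening entirely when more than half the required skills are missing.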

Step 2: Source Candidates from the Open Web

import requests

MANTIS_API_KEY = "your-api-key"
BASE_URL = "https://api.mantisapi.com/v1"

def scrape_page(url: str) -> dict:
    """Scrape any page with Mantis WebPerception API."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers={"x-api-key": MANTIS_API_KEY},
        json={"url": url, "render_js": True},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

def extract_candidates(url: str, source_type: str) -> list[dict]:
    """Extract candidate profiles from any page using AI."""
    response = requests.post(
        f"{BASE_URL}/extract",
        headers={"x-api-key": MANTIS_API_KEY},
        json={
            "url": url,
            "schema": {
                "candidates": [{
                    "name": "string",
                    "role": "string",
                    "company": "string",
                    "skills": ["string"],
                    "location": "string",
                    "profile_url": "string",
                    "bio": "string"
                }]
            },
            "prompt": f"Extract all people/candidates visible on this page. "
                      f"Source type: {source_type}. Include all available details."
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json().get("data", {}).get("candidates", [])

# Source from multiple channels
sources = [
    # GitHub trending developers
    {"url": "https://github.com/trending/python?since=monthly", "type": "github"},
    # Conference speaker pages
    {"url": "https://pycon.org/speakers/", "type": "conference"},
    # Tech blog author pages
    {"url": "https://dev.to/top/week?tag=python", "type": "blog"},
]

all_candidates = []
for source in sources:
    candidates = extract_candidates(source["url"], source["type"])
    for c in candidates:
        c["source_url"] = source["url"]
        c["source_type"] = source["type"]
    all_candidates.extend(candidates)
    print(f"Found {len(candidates)} candidates from {source['type']}")
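The same person often shows up in several sources (a PyCon speaker with a trending GitHub repo, for example), so it's worth deduplicating before enrichment. A minimal sketch, keying on profile_url when present and a normalized name otherwise:

```python
def dedupe_candidates(candidates: list[dict]) -> list[dict]:
    """Merge duplicate candidates found across sources.

    Keys on profile_url when available, otherwise on a normalized
    name. Later records fill in fields the first sighting lacked,
    so cross-source information accumulates on one record.
    """
    merged: dict[str, dict] = {}
    for c in candidates:
        key = (c.get("profile_url") or c.get("name") or "").strip().lower()
        if not key:
            continue  # skip records with no usable identity
        if key in merged:
            for field, value in c.items():
                if not merged[key].get(field):
                    merged[key][field] = value
        else:
            merged[key] = dict(c)
    return list(merged.values())
```

Running `dedupe_candidates(all_candidates)` before Step 3 avoids paying to enrich and screen the same profile twice.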

Step 3: Enrich Candidate Profiles

def enrich_candidate(candidate: dict) -> dict:
    """Cross-reference candidate across multiple sources."""
    enriched = {**candidate}

    # If we have a GitHub URL, scrape their profile
    if candidate.get("profile_url") and "github.com" in candidate["profile_url"]:
        response = requests.post(
            f"{BASE_URL}/extract",
            headers={"x-api-key": MANTIS_API_KEY},
            json={
                "url": candidate["profile_url"],
                "schema": {
                    "name": "string",
                    "bio": "string",
                    "location": "string",
                    "company": "string",
                    "repos_count": "number",
                    "followers": "number",
                    "top_languages": ["string"],
                    "pinned_repos": [{"name": "string", "description": "string", "stars": "number"}],
                    "contribution_streak": "string"
                },
                "prompt": "Extract this GitHub user's complete profile information."
            },
            timeout=120,
        )
        response.raise_for_status()
        github_data = response.json().get("data", {})

        enriched["github_profile"] = github_data
        enriched["skills"] = list(set(
            enriched.get("skills", []) +
            github_data.get("top_languages", [])
        ))

    return enriched

# Enrich all candidates
enriched_candidates = [enrich_candidate(c) for c in all_candidates]
print(f"Enriched {len(enriched_candidates)} candidate profiles")
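Enrichment fires one request per candidate at third-party sites, so some client-side politeness is in order. A minimal throttle sketch (illustrative; tune the interval to each site's tolerance):

```python
import time

class Throttle:
    """Enforce a minimum interval between outbound requests so
    enrichment doesn't hammer any single site."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to honor the minimum interval,
        # then record this call's timestamp.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage sketch: call throttle.wait() before each enrich_candidate()
throttle = Throttle(min_interval=2.0)
```

Calling `throttle.wait()` at the top of the enrichment loop caps the request rate without any extra infrastructure.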

Step 4: AI-Powered Candidate Screening

from openai import OpenAI
import json

client = OpenAI()

def screen_candidate(candidate: dict, job: JobRequirements) -> dict:
    """Use LLM to score and evaluate a candidate against job requirements."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are an expert technical recruiter. Evaluate the candidate
against the job requirements. Be thorough but fair. Consider:
- Skill match (required vs preferred)
- Experience level and seniority signals
- Quality of work (projects, contributions, publications)
- Cultural/industry fit
- Red flags or concerns

Return JSON with:
- score (1-100)
- rating: "STRONG_MATCH" | "GOOD_MATCH" | "PARTIAL_MATCH" | "WEAK_MATCH"
- skill_match: {matched: [], missing: [], bonus: []}
- strengths: [str]
- concerns: [str]
- outreach_angle: str (what to reference in personalized outreach)
- summary: str (2-3 sentence assessment)"""
        }, {
            "role": "user",
            "content": f"""JOB REQUIREMENTS:
{json.dumps(job.model_dump(), indent=2)}

CANDIDATE PROFILE:
{json.dumps(candidate, indent=2)}

Evaluate this candidate."""
        }],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

# Screen all candidates
screened = []
for candidate in enriched_candidates:
    evaluation = screen_candidate(candidate, job)
    screened.append({
        "candidate": candidate,
        "evaluation": evaluation
    })

# Sort by score
screened.sort(key=lambda x: x["evaluation"].get("score", 0), reverse=True)

# Show top candidates
for s in screened[:10]:
    e = s["evaluation"]
    c = s["candidate"]
    print(f"{e.get('score', 0)}/100 | {e.get('rating')} | {c.get('name')} | {c.get('role')} @ {c.get('company')}")
    print(f"  Strengths: {', '.join(e.get('strengths', []))}")
    print(f"  Outreach angle: {e.get('outreach_angle')}")
    print()

Step 5: Generate Personalized Outreach

def draft_outreach(candidate: dict, evaluation: dict, job: JobRequirements) -> str:
    """Generate personalized outreach message based on candidate's actual work."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are a recruiter writing a personalized outreach message.
Rules:
- Reference specific work the candidate has done (repos, posts, talks)
- Keep it under 150 words
- Be genuine, not salesy
- Mention 1-2 specific things that impressed you
- Don't use generic phrases like "I came across your profile"
- End with a low-pressure ask (15-min chat, not "apply now")"""
        }, {
            "role": "user",
            "content": f"""CANDIDATE: {json.dumps(candidate, indent=2)}
EVALUATION: {json.dumps(evaluation, indent=2)}
JOB: {job.title} — {', '.join(job.required_skills)}
OUTREACH ANGLE: {evaluation.get('outreach_angle', '')}

Draft a personalized outreach message."""
        }]
    )

    return response.choices[0].message.content

# Generate outreach for top candidates
for s in screened[:5]:
    if s["evaluation"].get("rating") in ["STRONG_MATCH", "GOOD_MATCH"]:
        message = draft_outreach(s["candidate"], s["evaluation"], job)
        s["outreach_message"] = message
        print(f"--- Outreach for {s['candidate'].get('name')} ---")
        print(message)
        print()

Step 6: Pipeline Management with SQLite

import sqlite3
from datetime import datetime

def init_db():
    conn = sqlite3.connect("recruitment.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS candidates (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        current_role TEXT,
        current_company TEXT,
        skills TEXT,
        location TEXT,
        source_url TEXT,
        source_type TEXT,
        profile_data TEXT,
        score INTEGER,
        rating TEXT,
        evaluation TEXT,
        outreach_message TEXT,
        status TEXT DEFAULT 'sourced',
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE(name, source_url)
    )""")

    conn.execute("""CREATE TABLE IF NOT EXISTS pipeline_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        candidate_id INTEGER,
        event_type TEXT,
        details TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (candidate_id) REFERENCES candidates(id)
    )""")

    conn.commit()
    return conn

def save_candidate(conn, candidate: dict, evaluation: dict, outreach: str | None = None):
    """Insert a candidate, or refresh an existing one without resetting its status."""
    # INSERT OR REPLACE would delete and recreate the row, wiping the
    # status column back to 'sourced'. An upsert preserves pipeline state.
    conn.execute("""INSERT INTO candidates
        (name, current_role, current_company, skills, location,
         source_url, source_type, profile_data, score, rating,
         evaluation, outreach_message, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(name, source_url) DO UPDATE SET
            profile_data = excluded.profile_data,
            score = excluded.score,
            rating = excluded.rating,
            evaluation = excluded.evaluation,
            outreach_message = excluded.outreach_message,
            updated_at = excluded.updated_at""",
        (candidate.get("name"), candidate.get("role"),
         candidate.get("company"), json.dumps(candidate.get("skills", [])),
         candidate.get("location"), candidate.get("source_url"),
         candidate.get("source_type"), json.dumps(candidate),
         evaluation.get("score"), evaluation.get("rating"),
         json.dumps(evaluation), outreach,
         datetime.now().isoformat()))
    conn.commit()

def get_pipeline_stats(conn) -> dict:
    """Get recruitment pipeline statistics."""
    stats = {}
    for status in ["sourced", "screened", "contacted", "responded", "interviewing", "offered", "hired"]:
        count = conn.execute(
            "SELECT COUNT(*) FROM candidates WHERE status = ?", (status,)
        ).fetchone()[0]
        stats[status] = count

    stats["avg_score"] = conn.execute(
        "SELECT AVG(score) FROM candidates WHERE score IS NOT NULL"
    ).fetchone()[0] or 0

    stats["strong_matches"] = conn.execute(
        "SELECT COUNT(*) FROM candidates WHERE rating = 'STRONG_MATCH'"
    ).fetchone()[0]

    return stats

# Save all screened candidates
conn = init_db()
for s in screened:
    save_candidate(conn, s["candidate"], s["evaluation"],
                   s.get("outreach_message"))

stats = get_pipeline_stats(conn)
print(f"Pipeline: {stats}")
print(f"Strong matches: {stats['strong_matches']}")
print(f"Avg score: {stats['avg_score']:.0f}/100")
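The schema above defines a pipeline_events table, but nothing writes to it yet. A small helper (an illustrative addition) can move candidates between stages and keep an audit trail at the same time:

```python
import sqlite3
from datetime import datetime

def update_status(conn: sqlite3.Connection, candidate_id: int,
                  new_status: str, details: str = ""):
    """Advance a candidate to a new pipeline stage and log the
    transition in pipeline_events for an audit trail."""
    conn.execute(
        "UPDATE candidates SET status = ?, updated_at = ? WHERE id = ?",
        (new_status, datetime.now().isoformat(), candidate_id))
    conn.execute(
        "INSERT INTO pipeline_events (candidate_id, event_type, details) "
        "VALUES (?, ?, ?)",
        (candidate_id, f"status:{new_status}", details))
    conn.commit()

# Usage: update_status(conn, candidate_id=1, new_status="contacted",
#                      details="Sent personalized outreach via email")
```

Because every transition lands in pipeline_events, you can later compute stage-to-stage conversion rates with plain SQL.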

Automating the Full Pipeline

def run_recruitment_cycle(job: JobRequirements, sources: list[dict]):
    """Run a complete recruitment sourcing cycle."""
    """Run a complete recruitment sourcing cycle."""
    print(f"Starting recruitment cycle for: {job.title}")
    print(f"Scanning {len(sources)} sources...")

    conn = init_db()
    new_candidates = 0
    strong_matches = 0

    for source in sources:
        try:
            candidates = extract_candidates(source["url"], source["type"])
            for candidate in candidates:
                candidate["source_url"] = source["url"]
                candidate["source_type"] = source["type"]

                # Enrich
                enriched = enrich_candidate(candidate)

                # Screen
                evaluation = screen_candidate(enriched, job)

                # Only draft outreach for good+ matches
                outreach = None
                if evaluation.get("rating") in ["STRONG_MATCH", "GOOD_MATCH"]:
                    outreach = draft_outreach(enriched, evaluation, job)
                    strong_matches += 1

                # Save to pipeline
                save_candidate(conn, enriched, evaluation, outreach)
                new_candidates += 1

        except Exception as e:
            print(f"Error processing {source['url']}: {e}")
            continue

    # Summary
    stats = get_pipeline_stats(conn)
    pipeline_total = sum(stats[s] for s in
                         ["sourced", "screened", "contacted", "responded",
                          "interviewing", "offered", "hired"])
    summary = f"""Recruitment Cycle Complete:
• Sources scanned: {len(sources)}
• New candidates: {new_candidates}
• Strong/Good matches: {strong_matches}
• Pipeline total: {pipeline_total}
• Average score: {stats['avg_score']:.0f}/100"""

    print(summary)
    return summary

# Run daily
# run_recruitment_cycle(job, sources)

💡 Pro tip: Schedule your sourcing agent to run weekly on different source pools. Monday: GitHub trending. Wednesday: conference speaker pages. Friday: new blog posts and tech publications. This keeps your pipeline fresh without hitting any single source too hard.

Cost Comparison

Tool/Approach                            | Monthly Cost     | Candidates/mo | Cost/Candidate
LinkedIn Recruiter                       | $800–$1,000/seat | 50–100        | $8–$20
Job board bundle (Indeed + ZipRecruiter) | $500–$800        | 30–80         | $6–$27
Recruiting agency (retained)             | $5,000–$15,000   | 5–15          | $333–$3,000
AI Agent + Mantis API                    | $29–$99          | 100–500+      | $0.06–$0.99

The AI agent approach doesn't just save money — it surfaces candidates that traditional platforms miss entirely: passive candidates who aren't on job boards but are actively building impressive things in the open.

Use Cases by Company Type

1. Startups (1–50 employees)

Startups can't afford $10K/year LinkedIn Recruiter seats or agency fees. An AI sourcing agent levels the playing field — scan GitHub, Hacker News, and conference sites to find engineers who'd never see your Indeed posting.

2. Recruiting Agencies

Agencies filling multiple roles can run parallel sourcing agents for each position. AI screening reduces time-to-shortlist from days to hours, and personalized outreach at scale can increase response rates 3–5x over template emails.

3. Enterprise Talent Acquisition

Large companies hiring 100+ engineers/year can automate the top-of-funnel completely. The AI agent handles sourcing and initial screening; human recruiters focus on relationship building and closing.

4. Technical Recruiting Firms

Specialized firms can build domain-specific sourcing — scraping research papers for ML engineers, open-source contributions for infrastructure roles, or regulatory publications for compliance hires.

Build Your AI Recruitment Pipeline

Mantis WebPerception API handles the scraping so you can focus on building the intelligence layer. Extract structured candidate data from any website with one API call.

Start Free — 100 calls/month

What's Next

You now have a complete AI-powered recruitment pipeline that sources, screens, and drafts outreach for candidates automatically. To take it further: