Web Scraping for Venture Capital & Startup Intelligence: How AI Agents Track Deals, Valuations & Market Signals in 2026
Global venture capital investment exceeded $300 billion in 2025, funding over 30,000 startups across every sector from AI to climate tech. Yet the information asymmetry in venture capital remains staggering. The best deals go to the investors who see signals earliest: a hiring surge on LinkedIn, a spike in GitHub stars, a Form D filing before the press release, a founder's second company quietly incorporating in Delaware.
Traditional VC intelligence platforms like PitchBook ($20K-100K/yr) and CB Insights ($50K-100K/yr) provide curated databases, but they're backward-looking by design: they report deals after they close, not before. The real competitive edge comes from real-time signal detection: scraping job postings, monitoring product launches, tracking web traffic patterns, and aggregating the digital exhaust that every startup produces.
In this guide, we'll build an AI-powered venture capital intelligence system that monitors startup ecosystems in real time, detects investable signals before they hit databases, maps competitive landscapes automatically, and generates investment memos, all using Python, web scraping, and AI agents powered by the Mantis WebPerception API.
Why Venture Capital Needs Real-Time Web Intelligence
The venture capital industry runs on information advantages. The VC who discovers a breakout startup 3 months before their Series A gets the best terms. The fund that spots a market trend before consensus captures outsized returns. Yet most VCs still rely on:
- Warm introductions: deal flow limited to existing networks
- Conference circuits: expensive and time-consuming
- Database subscriptions: PitchBook and Crunchbase report deals after they happen
- Manual research: associates spend 20+ hours per deal on due diligence
An AI agent with web scraping capabilities can monitor thousands of signals simultaneously: new SEC Form D filings (startups raising money), Y Combinator batch announcements, ProductHunt launches gaining traction, GitHub repositories exploding in popularity, LinkedIn job postings signaling growth, and web traffic patterns indicating product-market fit.
Step 1: Startup Discovery & Deal Flow Pipeline
The first challenge is finding startups before they appear in databases. We'll scrape multiple discovery channels to build a continuous deal flow pipeline.
SEC EDGAR Form D Filings
Every U.S. startup raising money under Regulation D must file a Form D with the SEC, often weeks before any press announcement. This is one of the most underutilized signals in venture capital.
import mantis
from pydantic import BaseModel
from typing import Optional
from datetime import datetime
client = mantis.Client(api_key="your-mantis-api-key")
class StartupProfile(BaseModel):
name: str
domain: Optional[str]
description: Optional[str]
founded_date: Optional[str]
headquarters: Optional[str]
employee_count: Optional[int]
total_funding: Optional[float]
last_funding_round: Optional[str]
last_funding_amount: Optional[float]
investors: list[str] = []
sector: Optional[str]
tech_stack: list[str] = []
growth_signals: list[str] = []
discovery_source: str
discovered_at: datetime
class FormDFiling(BaseModel):
company_name: str
cik: str
filing_date: str
offering_amount: Optional[float]
amount_sold: Optional[float]
investor_count: Optional[int]
is_first_sale: bool
industry_group: Optional[str]
state: Optional[str]
executives: list[str] = []
# Scrape recent Form D filings from SEC EDGAR
def discover_from_sec_filings():
"""Monitor SEC EDGAR for new Form D filings โ early funding signals."""
result = client.extract(
url="https://efts.sec.gov/LATEST/search-index?q=%22Form+D%22&dateRange=custom&startdt=2026-03-01&enddt=2026-03-14&forms=D",
schema=list[FormDFiling],
prompt="""Extract all Form D filings. For each filing, get:
- Company name and CIK number
- Filing date
- Total offering amount and amount already sold
- Number of investors
- Whether this is the first sale (new raise vs amendment)
- Industry group classification
- State of incorporation
- Names of executives/directors listed"""
)
return result
# Discover startups from ProductHunt
def discover_from_producthunt():
"""Track ProductHunt launches gaining significant traction."""
result = client.extract(
url="https://www.producthunt.com/",
schema=list[StartupProfile],
prompt="""Extract top-launched products from today. For each:
- Product/company name and website
- Description of what they do
- Upvote count and comment count
- Maker information
Focus on B2B/developer tools and AI products."""
)
return [p for p in result if any(
kw in (p.description or "").lower()
for kw in ["api", "ai", "developer", "saas", "b2b", "platform"]
)]
Y Combinator & Accelerator Tracking
Accelerator batches are goldmines for deal flow. Tracking YC, Techstars, and other top accelerators gives you visibility into the highest-potential startups 3-6 months before they raise their seed rounds.
# Track Y Combinator batch companies
def discover_from_yc():
"""Monitor YC's directory for new batch companies."""
result = client.extract(
url="https://www.ycombinator.com/companies?batch=W2026",
schema=list[StartupProfile],
prompt="""Extract all companies from this YC batch. For each:
- Company name, website, one-line description
- Batch (e.g., W2026)
- Sector/vertical
- Team size
- Location
Focus on companies with clear B2B or platform plays."""
)
return result
# Track GitHub trending repositories for developer tools
def discover_from_github_trending():
"""Find open-source projects gaining traction โ future VC-backed companies."""
result = client.extract(
url="https://github.com/trending?since=weekly&spoken_language_code=en",
schema=list[dict],
prompt="""Extract trending repositories. For each:
- Repository name, owner, description
- Stars gained this week, total stars, forks
- Primary programming language
- Whether it appears to be a startup/company project vs personal
Focus on developer tools, AI/ML frameworks, infrastructure."""
)
return result
Step 2: Funding Round Tracking & Valuation Intelligence
Once you've identified interesting startups, track their fundraising activity in real time from primary sources rather than waiting on database updates.
class FundingRound(BaseModel):
company_name: str
round_type: str # pre-seed, seed, series_a, series_b, etc.
amount_raised: Optional[float]
valuation: Optional[float]
lead_investor: Optional[str]
participating_investors: list[str] = []
date_announced: Optional[str]
date_filed: Optional[str] # SEC filing date (often earlier)
source: str # sec_edgar, press_release, crunchbase, linkedin
use_of_funds: Optional[str]
pre_money_valuation: Optional[float]
dilution_estimate: Optional[float]
class InvestorProfile(BaseModel):
name: str
type: str # vc_fund, angel, corporate, accelerator
aum: Optional[float]
focus_sectors: list[str] = []
focus_stages: list[str] = []
recent_investments: list[str] = []
portfolio_size: Optional[int]
notable_exits: list[str] = []
co_investment_frequency: dict = {} # investor_name -> count
def track_funding_rounds(company_name: str, domain: str):
"""Multi-source funding round tracking for a specific company."""
# Source 1: SEC EDGAR Form D amendments
sec_data = client.extract(
url=f"https://efts.sec.gov/LATEST/search-index?q=%22{company_name}%22&forms=D,D/A",
schema=list[FundingRound],
prompt=f"""Find all Form D filings for {company_name}.
Extract offering amounts, amendment history (shows multiple rounds),
investor counts, and executive changes between filings."""
)
# Source 2: Press releases and news
news_data = client.extract(
url=f"https://www.google.com/search?q=%22{company_name}%22+%22raises%22+OR+%22funding%22+OR+%22series%22&tbs=qdr:m",
schema=list[FundingRound],
prompt=f"""Find funding announcements for {company_name}.
Extract round type, amount, valuation, lead investor,
participating investors, and stated use of funds."""
)
# Source 3: LinkedIn hiring signals (proxy for recent raise)
hiring_data = client.extract(
url=f"https://www.linkedin.com/company/{domain.replace('.com','')}/jobs/",
schema=dict,
prompt=f"""Count open positions for {company_name}. Categorize by:
- Engineering roles (indicates product investment)
- Sales/marketing roles (indicates go-to-market push)
- Executive hires (indicates scaling)
A surge in hiring often follows a funding round by 2-4 weeks."""
)
return {
"sec_filings": sec_data,
"news": news_data,
"hiring_signals": hiring_data
}
Step 3: Growth Signal Detection Engine
The most valuable VC intelligence isn't about deals that already happened; it's about detecting growth signals before they become obvious. We'll build a multi-signal monitoring engine.
class MarketSignal(BaseModel):
company_name: str
signal_type: str # hiring_surge, traffic_spike, github_growth,
# patent_filing, product_launch, exec_hire,
# partnership, award, media_mention
signal_strength: float # 0-1 normalized
description: str
detected_at: datetime
data_points: dict = {} # raw metrics
historical_context: Optional[str] # how this compares to baseline
class GrowthScorecard(BaseModel):
company_name: str
overall_score: float # 0-100
signal_breakdown: dict # signal_type -> score
trajectory: str # accelerating, steady, decelerating, stalled
comparable_companies: list[str] # similar-stage companies
investment_thesis: Optional[str]
def detect_growth_signals(company: StartupProfile) -> list[MarketSignal]:
    """Multi-channel growth signal detection for a startup."""
    signals: list[MarketSignal] = []
    if not company.domain:
        return signals  # most signals key off the domain; skip companies without a website
# Signal 1: Web traffic trends (SimilarWeb / alternative)
traffic = client.extract(
url=f"https://www.similarweb.com/website/{company.domain}/",
schema=dict,
prompt=f"""Extract web traffic data for {company.domain}:
- Monthly visits (last 3 months)
- Month-over-month growth rate
- Traffic sources breakdown (direct, search, referral, social)
- Average visit duration, pages per visit, bounce rate
- Top referring sites and keywords
- Geographic distribution of traffic"""
)
if traffic.get("mom_growth", 0) > 0.2: # 20%+ MoM growth
signals.append(MarketSignal(
company_name=company.name,
signal_type="traffic_spike",
signal_strength=min(traffic["mom_growth"], 1.0),
description=f"Web traffic growing {traffic['mom_growth']*100:.0f}% MoM",
detected_at=datetime.now(),
data_points=traffic
))
# Signal 2: GitHub repository activity
if company.domain:
github_org = company.domain.replace(".com", "").replace(".io", "")
github_data = client.extract(
url=f"https://github.com/{github_org}",
schema=dict,
prompt=f"""Extract GitHub organization metrics:
- Total public repositories
- Total stars across all repos
- Stars gained in last 30 days
- Total contributors
- Commit frequency (daily/weekly)
- Most active repositories
- Recent releases or major version bumps"""
)
star_velocity = github_data.get("stars_last_30d", 0)
if star_velocity > 500:
signals.append(MarketSignal(
company_name=company.name,
signal_type="github_growth",
signal_strength=min(star_velocity / 5000, 1.0),
description=f"GitHub stars +{star_velocity} in 30 days",
detected_at=datetime.now(),
data_points=github_data
))
# Signal 3: Job posting velocity
jobs = client.extract(
url=f"https://www.linkedin.com/company/{company.domain.split('.')[0]}/jobs/",
schema=dict,
prompt=f"""Count and categorize all open positions:
- Total open roles
- Engineering/technical roles
- Sales/BD/marketing roles
- Senior/executive hires (VP+, C-suite)
- New roles posted in last 2 weeks vs older
- Locations (remote, specific offices, new markets)"""
)
total_roles = jobs.get("total_open_roles", 0)
new_roles = jobs.get("roles_last_2_weeks", 0)
if new_roles > 10 or (total_roles > 20 and new_roles / max(total_roles, 1) > 0.3):
signals.append(MarketSignal(
company_name=company.name,
signal_type="hiring_surge",
signal_strength=min(new_roles / 30, 1.0),
description=f"{new_roles} new roles posted in 2 weeks ({total_roles} total open)",
detected_at=datetime.now(),
data_points=jobs
))
# Signal 4: Patent filings (USPTO)
patents = client.extract(
url=f"https://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=AN%2F%22{company.name}%22&d=PTXT",
schema=list[dict],
prompt=f"""Find recent patent applications/grants for {company.name}:
- Patent title and number
- Filing date and grant date
- Technology classification
- Abstract summary
Patent activity signals R&D investment and potential moats."""
)
if len(patents) > 0:
signals.append(MarketSignal(
company_name=company.name,
signal_type="patent_filing",
signal_strength=min(len(patents) / 10, 1.0),
description=f"{len(patents)} patent applications detected",
detected_at=datetime.now(),
data_points={"patents": patents}
))
return signals
Step 4: Competitive Landscape Mapping
Every investment thesis requires understanding the competitive landscape. We'll automate competitive analysis that would take an associate days to compile manually.
class CompetitiveMap(BaseModel):
target_company: str
sector: str
total_funding_in_sector: Optional[float]
competitors: list[dict] # name, funding, stage, differentiator
market_concentration: str # fragmented, consolidating, dominated
moat_analysis: dict # network_effects, switching_costs, data_moat, etc.
white_space: list[str] # underserved segments
recent_exits: list[dict] # acquisitions, IPOs in sector
market_size_estimate: Optional[str]
class PortfolioCompany(BaseModel):
name: str
domain: str
sector: str
stage: str
last_valuation: Optional[float]
current_arr: Optional[float] # estimated from hiring/traffic
burn_rate_estimate: Optional[str] # based on team size
runway_estimate: Optional[str]
health_score: float # 0-100
risk_flags: list[str] = []
growth_trajectory: str
next_milestone: Optional[str]
last_checked: datetime
def map_competitive_landscape(company: StartupProfile):
"""Build comprehensive competitive landscape for a startup's sector."""
# Step 1: Identify competitors via search
competitors_raw = client.extract(
url=f"https://www.google.com/search?q={company.name}+competitors+alternatives+2026",
schema=list[str],
prompt=f"""Identify direct and indirect competitors to {company.name}
    ({company.description}). List company names only. Include:
- Direct competitors (same product, same market)
- Adjacent competitors (different product, same customer)
- Emerging competitors (startups in stealth or early stage)"""
)
# Step 2: Enrich each competitor
competitor_profiles = []
for comp_name in competitors_raw[:10]: # Top 10 competitors
profile = client.extract(
url=f"https://www.google.com/search?q=%22{comp_name}%22+funding+employees+revenue",
schema=dict,
prompt=f"""Extract competitive intelligence for {comp_name}:
- Total funding raised and last round
- Estimated employee count
- Key product features and differentiators
- Target customer segment
- Pricing model (if available)
- Notable customers or partnerships
- Founded date and headquarters"""
)
competitor_profiles.append({"name": comp_name, **profile})
# Step 3: Analyze sector M&A and exits
exits = client.extract(
url=f"https://www.google.com/search?q={company.sector}+startup+acquisition+OR+IPO+2025+2026",
schema=list[dict],
prompt="""Find recent exits (acquisitions and IPOs) in this sector:
- Company name and acquirer (or IPO exchange)
- Deal value and multiple (revenue or ARR multiple)
- Date of transaction
- Strategic rationale"""
)
# Step 4: AI-powered moat analysis
moat = analyze_competitive_moats(company, competitor_profiles)
return CompetitiveMap(
target_company=company.name,
sector=company.sector or "Unknown",
competitors=competitor_profiles,
recent_exits=exits,
moat_analysis=moat,
white_space=identify_white_space(competitor_profiles),
market_concentration=assess_concentration(competitor_profiles)
)
def analyze_competitive_moats(company, competitors):
    """Score competitive moats across key dimensions.

    Placeholder: each value below describes the heuristic to implement;
    swap in numeric scores once you settle on a rubric."""
return {
"network_effects": "Score based on multi-sided platform dynamics",
"switching_costs": "Score based on integration depth, data lock-in",
"data_moat": "Score based on proprietary data accumulation",
"brand": "Score based on NPS, organic traffic, community size",
"economies_of_scale": "Score based on marginal cost structure",
"regulatory": "Score based on licensing, compliance barriers"
}
Step 5: Portfolio Monitoring & Risk Detection
For VCs with existing portfolios, continuous monitoring catches problems early, before a founder calls to say they're running out of runway.
class PortfolioAlert(BaseModel):
company_name: str
alert_type: str # runway_risk, key_person_departure, competitor_threat,
# negative_sentiment, traffic_decline, hiring_freeze
severity: str # critical, warning, info
description: str
recommended_action: str
detected_at: datetime
def monitor_portfolio(portfolio: list[PortfolioCompany]) -> list[PortfolioAlert]:
"""Continuous portfolio monitoring across multiple risk dimensions."""
alerts = []
for company in portfolio:
# Risk 1: Key person departure (LinkedIn monitoring)
exec_changes = client.extract(
url=f"https://www.linkedin.com/company/{company.domain.split('.')[0]}/people/",
schema=list[dict],
prompt=f"""Check leadership team at {company.name}. Look for:
- Any C-suite or VP departures in last 90 days
- New executive hires (positive signal)
- Founder role changes (CEO to Chairman = yellow flag)
- Significant engineer departures (3+ senior in 30 days)"""
)
for change in exec_changes:
if change.get("type") == "departure" and change.get("level") in ["c_suite", "vp"]:
alerts.append(PortfolioAlert(
company_name=company.name,
alert_type="key_person_departure",
severity="critical" if change["level"] == "c_suite" else "warning",
description=f"{change.get('name', 'Executive')} ({change.get('title', 'Unknown')}) departed",
recommended_action="Schedule call with CEO within 48 hours. Assess impact on roadmap and team morale.",
detected_at=datetime.now()
))
# Risk 2: Traffic decline (product-market fit weakening)
traffic = client.extract(
url=f"https://www.similarweb.com/website/{company.domain}/",
schema=dict,
prompt=f"""Get traffic trend for {company.domain}:
- Monthly visits last 3 months
- Month-over-month change
- Bounce rate trend
Flag if traffic is declining more than 15% MoM."""
)
if traffic.get("mom_change", 0) < -0.15:
alerts.append(PortfolioAlert(
company_name=company.name,
alert_type="traffic_decline",
severity="warning",
description=f"Traffic down {abs(traffic['mom_change'])*100:.0f}% MoM",
recommended_action="Review product metrics in next board meeting. Check if seasonal or structural.",
detected_at=datetime.now()
))
# Risk 3: Competitive threat (new well-funded competitor)
new_competitors = client.extract(
url=f"https://www.google.com/search?q={company.sector}+startup+raises+OR+funding+OR+series&tbs=qdr:m",
schema=list[dict],
prompt=f"""Find startups in {company.sector} that raised significant
funding in the last 30 days. Extract company name, amount raised,
and what they do. Flag any that are direct competitors to {company.name}."""
)
for comp in new_competitors:
if comp.get("is_direct_competitor") and comp.get("amount_raised", 0) > 10_000_000:
alerts.append(PortfolioAlert(
company_name=company.name,
alert_type="competitor_threat",
severity="warning",
description=f"Competitor {comp['name']} raised ${comp['amount_raised']/1e6:.0f}M",
recommended_action=f"Evaluate competitive positioning. Consider accelerating {company.name}'s next round.",
detected_at=datetime.now()
))
return alerts
Step 6: AI-Powered Investment Memos
The culmination of all this data: automated investment memo generation that synthesizes deal flow, growth signals, competitive analysis, and market context into actionable intelligence.
class InvestmentBrief(BaseModel):
company_name: str
generated_at: datetime
executive_summary: str
deal_score: float # 0-100
deal_score_breakdown: dict
market_opportunity: str
competitive_position: str
growth_metrics: dict
risk_factors: list[str]
comparable_deals: list[dict] # similar companies, their valuations
recommendation: str # strong_pass, pass, investigate, interested, strong_interest
key_questions: list[str] # questions for founder meeting
next_steps: list[str]
def generate_investment_brief(
company: StartupProfile,
signals: list[MarketSignal],
landscape: CompetitiveMap,
funding: list[FundingRound]
) -> InvestmentBrief:
"""Generate AI-powered investment memo from aggregated intelligence."""
# Deal scoring engine
scores = {
"market_size": score_market(landscape),
"team_quality": score_team(company),
"traction": score_traction(signals),
"competitive_moat": score_moat(landscape.moat_analysis),
"timing": score_timing(signals, landscape),
"capital_efficiency": score_efficiency(funding, signals)
}
# Weights reflect empirical VC return drivers
weights = {
"market_size": 0.25,
"team_quality": 0.25,
"traction": 0.20,
"competitive_moat": 0.15,
"timing": 0.10,
"capital_efficiency": 0.05
}
overall_score = sum(scores[k] * weights[k] for k in scores)
# Determine recommendation
if overall_score >= 80:
recommendation = "strong_interest"
elif overall_score >= 65:
recommendation = "interested"
elif overall_score >= 50:
recommendation = "investigate"
elif overall_score >= 35:
recommendation = "pass"
else:
recommendation = "strong_pass"
# Generate key questions for founder meeting
questions = generate_diligence_questions(company, signals, landscape)
# Find comparable deals for valuation benchmarking
comps = find_comparable_deals(company, landscape)
return InvestmentBrief(
company_name=company.name,
generated_at=datetime.now(),
executive_summary=f"{company.name} is a {company.sector} startup "
f"with {len(signals)} active growth signals. "
f"Deal score: {overall_score:.0f}/100 ({recommendation}).",
deal_score=overall_score,
deal_score_breakdown=scores,
market_opportunity=landscape.market_size_estimate or "TBD",
competitive_position=landscape.market_concentration,
growth_metrics={s.signal_type: s.signal_strength for s in signals},
risk_factors=identify_risks(company, signals, landscape),
comparable_deals=comps,
recommendation=recommendation,
key_questions=questions,
next_steps=determine_next_steps(recommendation)
)
def score_traction(signals: list[MarketSignal]) -> float:
"""Score traction based on growth signals."""
if not signals:
return 20.0
signal_weights = {
"traffic_spike": 25,
"github_growth": 20,
"hiring_surge": 20,
"product_launch": 15,
"patent_filing": 10,
"media_mention": 5,
"partnership": 15,
"exec_hire": 10
}
total = sum(
signal_weights.get(s.signal_type, 5) * s.signal_strength
for s in signals
)
return min(total, 100)
Build Your VC Intelligence Pipeline with Mantis
The Mantis WebPerception API handles JavaScript rendering, anti-bot bypasses, and structured data extraction, so your AI agent can focus on finding the next unicorn.
Cost Comparison: Traditional VC Intelligence vs AI Agent
| Platform | Annual Cost | Key Limitations |
|---|---|---|
| PitchBook | $20,000-$100,000/yr | Backward-looking database; reports deals after close |
| CB Insights | $50,000-$100,000/yr | Strong analytics but expensive; limited real-time signals |
| Crunchbase Pro | $588/yr ($49/mo) | Good for basic data; limited API, no custom signals |
| Diffbot | $10,000-$50,000/yr | Knowledge graph API; requires significant integration work |
| Harmonic.ai | $15,000-$60,000/yr | VC-focused but limited to their signal taxonomy |
| AI Agent + Mantis | $348-$3,588/yr | Real-time signals, fully customizable, but requires setup |
Real-World Use Cases
1. Early-Stage VCs (Pre-Seed to Series A)
Challenge: Seeing enough deals early. Most startups at this stage don't appear in databases until after they've already raised.
AI agent solution: Monitor SEC Form D filings daily, track YC/Techstars batch announcements, scan ProductHunt and Hacker News for breakout products, detect GitHub repositories gaining 100+ stars/week in your focus sectors. Alert when a company in your thesis area shows 3+ simultaneous growth signals.
Impact: See deals 4-8 weeks before they hit PitchBook. First-mover advantage on term sheets.
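The "3+ simultaneous growth signals" rule above can be sketched as a simple filter. The `Signal` record here is a hypothetical stand-in for the `MarketSignal` objects produced elsewhere in this guide, and both thresholds are assumptions to tune:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    company: str
    signal_type: str
    strength: float  # 0-1 normalized, as in MarketSignal

def thesis_matches(signals: list[Signal], min_signals: int = 3,
                   min_strength: float = 0.3) -> list[str]:
    """Return companies showing at least min_signals strong signals at once."""
    counts: dict[str, int] = {}
    for s in signals:
        if s.strength >= min_strength:
            counts[s.company] = counts.get(s.company, 0) + 1
    return [c for c, n in counts.items() if n >= min_signals]

signals = [
    Signal("Acme AI", "hiring_surge", 0.8),
    Signal("Acme AI", "github_growth", 0.6),
    Signal("Acme AI", "traffic_spike", 0.5),
    Signal("Other Co", "media_mention", 0.9),
]
print(thesis_matches(signals))  # ['Acme AI']
```

Lowering `min_signals` widens the funnel; raising `min_strength` trades recall for precision.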
2. Growth Equity & Private Equity
Challenge: Identifying companies ready to scale from $10M to $100M ARR. Need accurate growth metrics without relying on self-reported data.
AI agent solution: Track web traffic (SimilarWeb), hiring velocity (LinkedIn), G2/Capterra review volume and sentiment, App Store rankings, job posting analysis for sales-to-engineering ratio (indicator of GTM maturity). Build ARR estimation models from proxy signals.
Impact: Independent verification of growth claims. Catch companies inflating metrics before due diligence.
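One way to build the ARR estimation model mentioned above is a weighted blend of proxy metrics. Every coefficient below is an illustrative assumption, not an industry benchmark; calibrate against companies whose revenue you actually know before trusting the output:

```python
def estimate_arr(employees: int, monthly_visits: int, review_count: int) -> float:
    """Rough ARR estimate (USD) blended from three public proxies.
    All coefficients are illustrative placeholders."""
    # Proxy 1: revenue per employee (assumed ~$200K for scaling SaaS)
    arr_from_headcount = employees * 200_000
    # Proxy 2: traffic funnel (assumed 1% visitor-to-seat, $50/mo per seat)
    arr_from_traffic = monthly_visits * 0.01 * 50 * 12
    # Proxy 3: review volume (assumed 50 customers per review, $5K ACV)
    arr_from_reviews = review_count * 50 * 5_000
    # Weighted blend; the weights are assumptions too
    return (0.5 * arr_from_headcount
            + 0.3 * arr_from_traffic
            + 0.2 * arr_from_reviews)

print(f"${estimate_arr(120, 400_000, 85) / 1e6:.1f}M estimated ARR")
```

Treat the result as a sanity check against founder claims, not a substitute for diligence.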
3. Corporate Venture Capital (CVC)
Challenge: Finding startups that are strategic fits for the parent company, whether potential acquisition targets, technology partners, or ecosystem plays.
AI agent solution: Monitor patent filings in adjacent technology areas, track startups hiring engineers with expertise in your tech stack, detect companies integrating with your APIs or platform, map the competitive landscape around your product roadmap gaps.
Impact: Build-vs-buy decisions backed by real-time market intelligence. Identify acquisition targets before bankers bring them to market.
4. Angel Investors & Scout Networks
Challenge: Limited time and resources for deal sourcing. Need high-signal, low-noise deal flow without expensive subscriptions.
AI agent solution: Daily digest of Form D filings in your sectors, ProductHunt launches above 500 upvotes, GitHub repos that crossed 1K stars, Twitter/X threads about new startups going viral. Simple scoring: founder pedigree × market size × traction signals.
Impact: Institutional-quality deal flow on an angel budget. $29/month vs $50K+/year for database access.
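The "founder pedigree × market size × traction" scoring can be a literal product, which has the useful property that a zero in any dimension zeroes the whole deal. The 0-1 inputs and 0-100 output scale are conventions assumed here:

```python
def angel_score(pedigree: float, market: float, traction: float) -> float:
    """Multiplicative deal score: inputs in [0, 1], output 0-100.
    A zero in any dimension zeroes the whole score."""
    for value in (pedigree, market, traction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each dimension must be scored in [0, 1]")
    return pedigree * market * traction * 100

# Second-time founder (0.9), large market (0.8), early traction (0.5)
print(f"{angel_score(0.9, 0.8, 0.5):.1f}")  # 36.0
```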
Building the Complete Pipeline
async def run_daily_vc_intelligence():
"""Complete daily VC intelligence pipeline."""
# 1. Discover new startups from multiple channels
print("๐ Scanning deal flow sources...")
sec_deals = discover_from_sec_filings()
ph_deals = discover_from_producthunt()
yc_deals = discover_from_yc()
github_deals = discover_from_github_trending()
    # Note: sec_deals (FormDFiling) and github_deals (dicts) use different
    # shapes - map them into StartupProfile before running signal detection
    all_discoveries = sec_deals + ph_deals + yc_deals + github_deals
print(f" Found {len(all_discoveries)} new startups")
# 2. Score and filter for relevance
scored = []
for startup in all_discoveries:
signals = detect_growth_signals(startup)
if len(signals) >= 2: # Multi-signal filter
scored.append((startup, signals))
scored.sort(key=lambda x: sum(s.signal_strength for s in x[1]), reverse=True)
top_deals = scored[:10]
# 3. Deep-dive on top prospects
briefs = []
for startup, signals in top_deals:
landscape = map_competitive_landscape(startup)
funding = track_funding_rounds(startup.name, startup.domain or "")
brief = generate_investment_brief(startup, signals, landscape, funding.get("sec_filings", []))
briefs.append(brief)
    # 4. Portfolio monitoring - get_active_portfolio() is your fund's own
    # holdings loader, assumed to return list[PortfolioCompany]
    portfolio_alerts = monitor_portfolio(get_active_portfolio())
# 5. Generate daily digest
print(f"\n๐ Daily VC Intelligence Brief โ {datetime.now().strftime('%B %d, %Y')}")
print(f"{'='*60}")
print(f"\n๐ New Discoveries: {len(all_discoveries)}")
print(f"๐ฏ Multi-Signal Matches: {len(scored)}")
print(f"๐ Investment Briefs Generated: {len(briefs)}")
print(f"๐จ Portfolio Alerts: {len(portfolio_alerts)}")
for brief in briefs:
emoji = {"strong_interest": "๐ข", "interested": "๐ก",
"investigate": "๐ต", "pass": "โช", "strong_pass": "๐ด"}
print(f"\n{emoji.get(brief.recommendation, 'โช')} {brief.company_name}")
print(f" Score: {brief.deal_score:.0f}/100 โ {brief.recommendation}")
print(f" {brief.executive_summary}")
for alert in portfolio_alerts:
severity_icon = {"critical": "๐ด", "warning": "๐ก", "info": "๐ต"}
print(f"\n{severity_icon.get(alert.severity, 'โช')} [{alert.company_name}] {alert.alert_type}")
print(f" {alert.description}")
print(f" โ {alert.recommended_action}")
if __name__ == "__main__":
import asyncio
asyncio.run(run_daily_vc_intelligence())
Advanced: Deal Scoring Calibration
The real power of an AI-powered deal scoring system is that it can be calibrated against outcomes. Track which signals predicted successful investments and which were false positives, then adjust weights over time.
def calibrate_scoring_model(historical_deals: list[dict]):
"""Calibrate deal scoring weights against actual investment outcomes."""
# historical_deals format:
# {signals: [...], invested: bool, outcome: "unicorn"|"good"|"mediocre"|"loss"|"write_off"}
from sklearn.linear_model import LogisticRegression
import numpy as np
feature_names = ["traffic_spike", "github_growth", "hiring_surge",
"patent_filing", "product_launch", "exec_hire",
"partnership", "media_mention"]
X = []
y = []
for deal in historical_deals:
features = [
max((s["signal_strength"] for s in deal["signals"]
if s["signal_type"] == feat), default=0)
for feat in feature_names
]
X.append(features)
y.append(1 if deal["outcome"] in ["unicorn", "good"] else 0)
model = LogisticRegression()
model.fit(np.array(X), np.array(y))
# Extract calibrated weights
calibrated_weights = dict(zip(feature_names, model.coef_[0]))
print("Calibrated signal weights (higher = more predictive of success):")
for name, weight in sorted(calibrated_weights.items(), key=lambda x: -x[1]):
print(f" {name}: {weight:.3f}")
return calibrated_weights
Key Data Sources for VC Intelligence
- SEC EDGAR: Form D filings (funding rounds), 13F filings (institutional holdings), S-1/S-1A (pre-IPO)
- LinkedIn: hiring velocity, team composition, executive changes, employee growth
- GitHub: star velocity, contributor growth, commit frequency, release cadence
- ProductHunt: launch traction, upvotes, maker profiles, community reception
- SimilarWeb / web traffic: monthly visits, growth trajectory, traffic source mix
- App Store / Google Play: download rankings, review velocity, rating trends
- Patent databases (USPTO, Google Patents): R&D investment signals
- News & press releases: funding announcements, partnerships, product launches
- Y Combinator / accelerators: batch listings, demo day presentations
- G2 / Capterra: review volume, sentiment, competitive positioning
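To operationalize these sources, it helps to assign each one a polling cadence. The hours below are illustrative defaults I'm assuming, not recommendations from any listed provider; tune them against your API budget and rate limits:

```python
# Polling cadence per source, in hours - values are illustrative defaults
SOURCE_CADENCE_HOURS = {
    "sec_edgar_form_d": 24,        # EDGAR indexes filings daily
    "accelerator_directories": 168,
    "producthunt": 24,
    "github_trending": 24,
    "linkedin_jobs": 48,
    "web_traffic": 168,            # traffic panels update slowly
    "app_stores": 24,
    "patents": 168,
    "news": 6,                     # funding news moves fastest
    "review_sites": 72,
}

def due_sources(hours_since_last: dict[str, float]) -> list[str]:
    """Sources whose interval has elapsed; never-polled sources are always due."""
    return [src for src, cadence in SOURCE_CADENCE_HOURS.items()
            if hours_since_last.get(src, float("inf")) >= cadence]

print(due_sources({"news": 7, "sec_edgar_form_d": 3}))
```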
Getting Started
Building a VC intelligence agent with Mantis takes about a day of setup:
- Start with Form D monitoring: the highest-signal, most underutilized data source. Filter by your focus sectors and geographies.
- Add growth signal detection: LinkedIn hiring + web traffic + GitHub stars gives you a multi-signal view of startup momentum.
- Build competitive landscape automation: map the competitive landscape for every company that passes your signal threshold.
- Layer in portfolio monitoring: track your existing investments for risk signals and growth metrics.
- Generate daily briefs: synthesize everything into actionable intelligence your team reviews each morning.
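The daily brief in the last step needs a scheduler. Cron or a task queue is the production answer; for a self-contained sketch, the stdlib is enough (the commented loop assumes the `run_daily_vc_intelligence` pipeline defined earlier):

```python
import datetime

def seconds_until_next_run(now: datetime.datetime, hour: int = 6) -> float:
    """Seconds until the next daily run at `hour` local time."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)  # today's slot passed; run tomorrow
    return (target - now).total_seconds()

# Minimal scheduler loop (prefer cron or a task queue in production):
# while True:
#     time.sleep(seconds_until_next_run(datetime.datetime.now()))
#     asyncio.run(run_daily_vc_intelligence())

print(seconds_until_next_run(datetime.datetime(2026, 3, 14, 4, 30)))  # 5400.0
```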
Ready to Build Your VC Intelligence Agent?
Start with 100 free API calls per month. Scrape SEC filings, monitor startups, detect growth signals, all through one API.
Get Your Free API Key