Web Scraping for Healthcare & Pharma: How AI Agents Track Drug Prices, Clinical Trials & Medical Data in 2026
The healthcare and pharmaceutical industries generate massive amounts of publicly available data: drug pricing on GoodRx and pharmacy sites, clinical trial registrations on ClinicalTrials.gov, FDA approval filings, medical research on PubMed, and competitive intelligence from pharma company press releases. Yet most healthcare organizations still track this data manually or pay $10,000–$100,000+ per year for specialized data providers.
AI agents powered by web scraping APIs can automate healthcare data collection, extract structured information from unstructured medical pages, and deliver real-time intelligence at a fraction of the cost. In this guide, you'll build a complete healthcare data intelligence system using Python, the Mantis WebPerception API, and GPT-4o.
Why Healthcare & Pharma Teams Need Web Scraping
Healthcare is uniquely data-rich but tool-poor when it comes to automated intelligence:
- Drug pricing changes constantly: retail prices, insurance formularies, GoodRx coupons, and international pricing all shift daily
- Clinical trials are public but hard to track: ClinicalTrials.gov has 500,000+ studies but no built-in alerting
- FDA filings signal market moves: new drug approvals, label changes, and safety alerts impact billions in revenue
- Competitor intelligence is scattered: pipeline updates, partnership announcements, and earnings calls are spread across dozens of sites
- Medical literature grows exponentially: PubMed adds 1M+ articles per year; no human can keep up
What You'll Build
By the end of this guide, you'll have an AI-powered healthcare intelligence agent that:
- Scrapes drug pricing from pharmacy comparison sites
- Monitors clinical trials for new registrations and status changes
- Tracks FDA filings for approvals, warnings, and label changes
- Extracts structured data using AI (drug names, dosages, trial phases, endpoints)
- Analyzes trends with GPT-4o for strategic insights
- Alerts your team via Slack when significant changes occur
Step 1: Set Up Your Healthcare Data Pipeline
First, install the required packages and define your data models:
pip install requests pydantic openai python-dotenv
import os
import json
import requests
import sqlite3
from datetime import datetime, timedelta
from pydantic import BaseModel, Field
from typing import Optional, List
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

MANTIS_API_KEY = os.getenv("MANTIS_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MANTIS_BASE = "https://api.mantisapi.com/v1"

# --- Data Models ---
class DrugPrice(BaseModel):
    """Structured drug pricing data."""
    drug_name: str = Field(description="Generic or brand name")
    brand_name: Optional[str] = Field(default=None, description="Brand name if generic provided")
    dosage: str = Field(description="e.g., '10mg', '500mg/5ml'")
    form: str = Field(description="tablet, capsule, injection, etc.")
    quantity: int = Field(description="Number of units")
    pharmacy: str = Field(description="Pharmacy name")
    retail_price: float = Field(description="Retail price in USD")
    coupon_price: Optional[float] = Field(default=None, description="Discounted/coupon price")
    source_url: str
    scraped_at: str

class ClinicalTrial(BaseModel):
    """Structured clinical trial data."""
    nct_id: str = Field(description="ClinicalTrials.gov identifier (e.g., NCT05123456)")
    title: str
    sponsor: str
    phase: str = Field(description="Phase 1, 2, 3, 4, or N/A")
    status: str = Field(description="Recruiting, Active, Completed, etc.")
    condition: str = Field(description="Disease or condition being studied")
    intervention: str = Field(description="Drug or treatment being tested")
    enrollment: Optional[int] = Field(default=None, description="Target enrollment")
    start_date: Optional[str] = None
    completion_date: Optional[str] = None
    primary_endpoint: Optional[str] = None
    source_url: str
    scraped_at: str

class FDAFiling(BaseModel):
    """Structured FDA filing/approval data."""
    filing_type: str = Field(description="NDA, BLA, sNDA, ANDA, safety alert, etc.")
    drug_name: str
    company: str
    indication: str = Field(description="Approved indication or therapeutic area")
    decision: str = Field(description="Approved, Complete Response, Tentative Approval, etc.")
    decision_date: str
    summary: str = Field(description="Brief summary of the filing/decision")
    source_url: str
    scraped_at: str
Step 2: Scrape Drug Pricing Data
Drug prices vary wildly by pharmacy, location, and available coupons. Here's how to scrape and compare prices across sources:
def scrape_drug_prices(drug_name: str, dosage: str, quantity: int = 30) -> List[DrugPrice]:
    """Scrape drug prices from pharmacy comparison sites."""
    # Use Mantis to scrape pricing pages
    search_url = f"https://www.goodrx.com/{drug_name.lower().replace(' ', '-')}"
    response = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": search_url,
            "render_js": True,
            "wait_for": 3000,
            "extract": {
                "schema": DrugPrice.model_json_schema(),
                "prompt": f"""Extract all pharmacy prices for {drug_name} {dosage},
                quantity {quantity}. Include retail and coupon prices for each pharmacy.
                Return as a list of DrugPrice objects.""",
                "multiple": True
            }
        },
        timeout=60,  # avoid hanging indefinitely on a slow render
    )
    data = response.json()
    prices = []
    if data.get("extracted"):
        for item in data["extracted"]:
            item["scraped_at"] = datetime.now().isoformat()
            item["source_url"] = search_url
            try:
                prices.append(DrugPrice(**item))
            except Exception:
                continue  # skip records that fail schema validation
    return prices

def track_price_changes(drug_name: str, prices: List[DrugPrice], db_path: str = "healthcare.db"):
    """Store prices and detect changes over time."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS drug_prices (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            drug_name TEXT,
            dosage TEXT,
            form TEXT,
            quantity INTEGER,
            pharmacy TEXT,
            retail_price REAL,
            coupon_price REAL,
            source_url TEXT,
            scraped_at TEXT
        )
    """)
    changes = []
    for price in prices:
        # Check the previous price for the same drug, pharmacy, dosage, and quantity
        # (matching on quantity avoids comparing a 30-count price with a 90-count price)
        cursor.execute("""
            SELECT retail_price, coupon_price FROM drug_prices
            WHERE drug_name = ? AND pharmacy = ? AND dosage = ? AND quantity = ?
            ORDER BY scraped_at DESC LIMIT 1
        """, (price.drug_name, price.pharmacy, price.dosage, price.quantity))
        prev = cursor.fetchone()
        if prev:
            prev_retail, prev_coupon = prev
            # Flag retail price moves larger than 5%
            if prev_retail and abs(price.retail_price - prev_retail) / prev_retail > 0.05:
                pct = ((price.retail_price - prev_retail) / prev_retail) * 100
                changes.append({
                    "drug": price.drug_name,
                    "pharmacy": price.pharmacy,
                    "old_price": prev_retail,
                    "new_price": price.retail_price,
                    "change_pct": round(pct, 1),
                    "direction": "increase" if pct > 0 else "decrease"
                })
        # Insert new price
        cursor.execute("""
            INSERT INTO drug_prices
            (drug_name, dosage, form, quantity, pharmacy, retail_price, coupon_price, source_url, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (price.drug_name, price.dosage, price.form, price.quantity,
              price.pharmacy, price.retail_price, price.coupon_price,
              price.source_url, price.scraped_at))
    conn.commit()
    conn.close()
    return changes
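Step 2 stores prices and flags changes, but the comparison across pharmacies is left implicit. As a minimal sketch, assuming each scraped record has been reduced to a plain dict with `pharmacy`, `retail_price`, and `coupon_price` keys, the cheapest source could be found as follows. The helper names `effective_price` and `rank_pharmacies` are hypothetical and the sample data is made up.

```python
from typing import Dict, List, Optional


def effective_price(record: Dict) -> float:
    """Return the coupon price when one exists, otherwise the retail price."""
    coupon: Optional[float] = record.get("coupon_price")
    return coupon if coupon is not None else record["retail_price"]


def rank_pharmacies(records: List[Dict]) -> List[Dict]:
    """Sort records from cheapest to most expensive, with savings vs. the priciest."""
    ranked = sorted(records, key=effective_price)
    if not ranked:
        return []
    highest = effective_price(ranked[-1])
    result = []
    for record in ranked:
        price = effective_price(record)
        savings = round((highest - price) / highest * 100, 1) if highest else 0.0
        result.append({
            "pharmacy": record["pharmacy"],
            "price": price,
            "savings_vs_highest_pct": savings,
        })
    return result


# Made-up sample data for illustration only
sample = [
    {"pharmacy": "Pharmacy A", "retail_price": 120.0, "coupon_price": 45.0},
    {"pharmacy": "Pharmacy B", "retail_price": 100.0, "coupon_price": None},
    {"pharmacy": "Pharmacy C", "retail_price": 80.0, "coupon_price": 60.0},
]
ranking = rank_pharmacies(sample)
for row in ranking:
    print(row)
```

With the `DrugPrice` models from Step 1, the input could be built with `[p.model_dump() for p in prices]`.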
Step 3: Monitor Clinical Trials
ClinicalTrials.gov is the world's largest registry of clinical studies. Tracking new trials and status changes gives pharma teams early intelligence on competitor pipelines:
def scrape_clinical_trials(
    condition: str,
    phase: Optional[str] = None,
    status: str = "recruiting"
) -> List[ClinicalTrial]:
    """Scrape clinical trial data for a condition."""
    search_url = (
        f"https://clinicaltrials.gov/search?"
        f"cond={condition.replace(' ', '+')}"
        f"&aggFilters=status:{status}"
    )
    if phase:
        search_url += f",phase:{phase}"
    response = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": search_url,
            "render_js": True,
            "wait_for": 5000,
            "extract": {
                "schema": ClinicalTrial.model_json_schema(),
                "prompt": """Extract all clinical trials listed on this page.
                For each trial, capture the NCT ID, title, sponsor, phase, status,
                condition, intervention, enrollment target, dates, and primary endpoint.
                Return as a list.""",
                "multiple": True
            }
        },
        timeout=60,  # avoid hanging indefinitely on a slow render
    )
    data = response.json()
    trials = []
    if data.get("extracted"):
        for item in data["extracted"]:
            item["scraped_at"] = datetime.now().isoformat()
            item["source_url"] = search_url
            try:
                trials.append(ClinicalTrial(**item))
            except Exception:
                continue  # skip records that fail schema validation
    return trials

def detect_trial_changes(trials: List[ClinicalTrial], db_path: str = "healthcare.db"):
    """Track trial status changes and new registrations."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS clinical_trials (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            nct_id TEXT,
            title TEXT,
            sponsor TEXT,
            phase TEXT,
            status TEXT,
            condition TEXT,
            intervention TEXT,
            enrollment INTEGER,
            start_date TEXT,
            completion_date TEXT,
            primary_endpoint TEXT,
            source_url TEXT,
            scraped_at TEXT
        )
    """)
    alerts = []
    for trial in trials:
        # Check if this trial exists
        cursor.execute("""
            SELECT status, enrollment FROM clinical_trials
            WHERE nct_id = ? ORDER BY scraped_at DESC LIMIT 1
        """, (trial.nct_id,))
        prev = cursor.fetchone()
        if not prev:
            alerts.append({
                "type": "NEW_TRIAL",
                "nct_id": trial.nct_id,
                "title": trial.title,
                "sponsor": trial.sponsor,
                "phase": trial.phase,
                "condition": trial.condition,
                "intervention": trial.intervention
            })
        else:
            prev_status, prev_enrollment = prev
            if prev_status != trial.status:
                alerts.append({
                    "type": "STATUS_CHANGE",
                    "nct_id": trial.nct_id,
                    "title": trial.title,
                    "old_status": prev_status,
                    "new_status": trial.status,
                    "sponsor": trial.sponsor
                })
        # Insert record
        cursor.execute("""
            INSERT INTO clinical_trials
            (nct_id, title, sponsor, phase, status, condition, intervention,
             enrollment, start_date, completion_date, primary_endpoint, source_url, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (trial.nct_id, trial.title, trial.sponsor, trial.phase,
              trial.status, trial.condition, trial.intervention,
              trial.enrollment, trial.start_date, trial.completion_date,
              trial.primary_endpoint, trial.source_url, trial.scraped_at))
    conn.commit()
    conn.close()
    return alerts
Step 4: Track FDA Filings and Approvals
FDA decisions move markets. A new drug approval can add billions to a company's market cap overnight. Here's how to monitor FDA activity automatically:
def scrape_fda_approvals(days_back: int = 7) -> List[FDAFiling]:
    """Scrape recent FDA drug approvals and filings."""
    fda_url = "https://www.fda.gov/drugs/drug-approvals-and-databases/drug-trials-snapshots"
    response = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": fda_url,
            "render_js": True,
            "wait_for": 3000,
            "extract": {
                "schema": FDAFiling.model_json_schema(),
                "prompt": f"""Extract all FDA drug approvals and filings from this page.
                For each, capture: filing type (NDA, BLA, sNDA, etc.), drug name,
                company/sponsor, approved indication, decision (approved/rejected/CRL),
                decision date, and a brief summary. Focus on entries from the last {days_back} days.
                Return as a list.""",
                "multiple": True
            }
        },
        timeout=60,  # avoid hanging indefinitely on a slow render
    )
    data = response.json()
    filings = []
    if data.get("extracted"):
        for item in data["extracted"]:
            item["scraped_at"] = datetime.now().isoformat()
            item["source_url"] = fda_url
            try:
                filings.append(FDAFiling(**item))
            except Exception:
                continue  # skip records that fail schema validation
    return filings

def analyze_fda_impact(filing: FDAFiling) -> dict:
    """Use GPT-4o to analyze the market impact of an FDA decision."""
    client = OpenAI(api_key=OPENAI_API_KEY)
    prompt = f"""Analyze this FDA decision for market and competitive impact:

Filing Type: {filing.filing_type}
Drug: {filing.drug_name}
Company: {filing.company}
Indication: {filing.indication}
Decision: {filing.decision}
Date: {filing.decision_date}
Summary: {filing.summary}

Provide:
1. IMPACT_LEVEL: HIGH / MEDIUM / LOW
2. MARKET_SIZE: Estimated market size for this indication
3. COMPETITORS: Key competing drugs/companies
4. IMPLICATIONS: What this means for the company and competitors
5. NEXT_STEPS: What to watch for next (Phase 3 data, label expansion, generics, etc.)

Return as JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a pharmaceutical industry analyst. Provide concise, data-driven analysis."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
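The monitoring loop in Step 6 relies on `analyze_fda_impact` returning an `IMPACT_LEVEL` key with one of three values, but a model's JSON output is not guaranteed to match the requested keys or casing. A minimal defensive sketch follows, assuming a hypothetical helper named `normalize_impact` and a conservative fallback of `LOW` for anything unrecognized. Both choices are illustrative, not part of the OpenAI SDK.

```python
import json
from typing import Any, Dict

VALID_LEVELS = {"HIGH", "MEDIUM", "LOW"}


def normalize_impact(raw_json: str) -> Dict[str, Any]:
    """Parse model output and coerce IMPACT_LEVEL into HIGH / MEDIUM / LOW."""
    try:
        parsed = json.loads(raw_json)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Accept keys regardless of casing, e.g. "impact_level" or "Impact_Level"
    result = {str(key).upper(): value for key, value in parsed.items()}
    level = str(result.get("IMPACT_LEVEL", "")).strip().upper()
    # Assumption: treat missing or unknown levels as LOW rather than raising
    result["IMPACT_LEVEL"] = level if level in VALID_LEVELS else "LOW"
    return result


print(normalize_impact('{"impact_level": "high", "competitors": ["Drug X"]}'))
print(normalize_impact("not valid json"))
```

Whether an unparseable analysis should fall back to `LOW` or trigger a manual review depends on how costly a missed alert is for your team.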
Step 5: AI-Powered Healthcare Intelligence Analysis
The real power comes from connecting all data sources and having GPT-4o analyze patterns across drug pricing, clinical trials, and FDA activity:
def generate_healthcare_intelligence_report(
    drug_prices: List[dict],
    trial_alerts: List[dict],
    fda_filings: List[FDAFiling],
    therapeutic_area: str
) -> dict:
    """Generate a comprehensive healthcare intelligence report."""
    client = OpenAI(api_key=OPENAI_API_KEY)
    prompt = f"""Generate a healthcare intelligence report for: {therapeutic_area}

DRUG PRICING CHANGES:
{json.dumps(drug_prices, indent=2)}

CLINICAL TRIAL ALERTS:
{json.dumps(trial_alerts, indent=2)}

RECENT FDA ACTIVITY:
{json.dumps([f.model_dump() for f in fda_filings], indent=2)}

Provide a strategic intelligence report with:
1. EXECUTIVE_SUMMARY: 2-3 sentence overview of key developments
2. PRICING_TRENDS: What's happening with drug prices in this space
3. PIPELINE_ACTIVITY: Notable clinical trial developments
4. REGULATORY_SIGNALS: What FDA activity signals for the market
5. COMPETITIVE_LANDSCAPE: Winners and losers this period
6. RISKS: Potential risks or disruptions to monitor
7. OPPORTUNITIES: Actionable opportunities identified
8. WATCH_LIST: Top 5 things to monitor next week

Return as JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a senior healthcare strategy analyst.
            Provide actionable intelligence, not just summaries. Connect dots between
            pricing, trials, and regulatory activity. Flag non-obvious implications."""},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
Step 6: Automated Alerts and Reporting
Set up automated monitoring with Slack alerts for critical healthcare intelligence:
def send_healthcare_alert(alert_type: str, data: dict, webhook_url: str):
    """Send healthcare intelligence alerts to Slack."""
    emoji_map = {
        "NEW_TRIAL": "🧪",
        "STATUS_CHANGE": "🔄",
        "PRICE_INCREASE": "📈",
        "PRICE_DECREASE": "📉",
        "FDA_APPROVAL": "✅",
        "FDA_REJECTION": "❌",
        "SAFETY_ALERT": "⚠️"
    }
    emoji = emoji_map.get(alert_type, "📋")
    blocks = [
        {
            "type": "header",
            "text": {"type": "plain_text", "text": f"{emoji} Healthcare Alert: {alert_type}"}
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": format_alert_message(alert_type, data)}
        }
    ]
    requests.post(webhook_url, json={"blocks": blocks}, timeout=10)

def run_healthcare_monitor(
    drugs: List[str],
    conditions: List[str],
    webhook_url: str
):
    """Run one healthcare intelligence scan (schedule it to repeat)."""
    print(f"[{datetime.now()}] Starting healthcare intelligence scan...")
    all_price_changes = []
    all_trial_alerts = []

    # 1. Check drug prices
    for drug in drugs:
        prices = scrape_drug_prices(drug, dosage="all")
        changes = track_price_changes(drug, prices)
        all_price_changes.extend(changes)
        for change in changes:
            alert_type = "PRICE_INCREASE" if change["direction"] == "increase" else "PRICE_DECREASE"
            if abs(change["change_pct"]) > 10:  # Only alert on >10% changes
                send_healthcare_alert(alert_type, change, webhook_url)

    # 2. Monitor clinical trials
    for condition in conditions:
        trials = scrape_clinical_trials(condition)
        alerts = detect_trial_changes(trials)
        all_trial_alerts.extend(alerts)
        for alert in alerts:
            send_healthcare_alert(alert["type"], alert, webhook_url)

    # 3. Check FDA filings
    fda_filings = scrape_fda_approvals(days_back=1)
    for filing in fda_filings:
        impact = analyze_fda_impact(filing)
        if impact.get("IMPACT_LEVEL") in ("HIGH", "MEDIUM"):
            send_healthcare_alert("FDA_APPROVAL", {
                "filing": filing.model_dump(),
                "analysis": impact
            }, webhook_url)

    # 4. Generate weekly report (if Monday)
    if datetime.now().weekday() == 0:
        report = generate_healthcare_intelligence_report(
            all_price_changes, all_trial_alerts,
            fda_filings, therapeutic_area="oncology"
        )
        send_healthcare_alert("WEEKLY_REPORT", report, webhook_url)

    print(f"[{datetime.now()}] Scan complete. "
          f"Prices: {len(all_price_changes)} changes, "
          f"Trials: {len(all_trial_alerts)} alerts, "
          f"FDA: {len(fda_filings)} filings")

# --- Run the monitor ---
if __name__ == "__main__":
    run_healthcare_monitor(
        drugs=["ozempic", "keytruda", "humira", "eliquis", "jardiance"],
        conditions=["type 2 diabetes", "non-small cell lung cancer", "rheumatoid arthritis"],
        webhook_url=os.getenv("SLACK_WEBHOOK_URL")
    )
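Note that `send_healthcare_alert` calls `format_alert_message`, which this guide never defines, so the script raises a `NameError` on the first alert. One possible implementation is sketched below. It assumes the dictionary keys produced by `track_price_changes` and `detect_trial_changes` above, falls back to pretty-printed JSON for FDA alerts and weekly reports, and truncates output to stay under Slack's 3,000-character limit for section text. Define it before the `__main__` block runs.

```python
import json
from typing import Any, Dict

SLACK_SECTION_LIMIT = 2900  # margin under Slack's 3000-character section text limit


def format_alert_message(alert_type: str, data: Dict[str, Any]) -> str:
    """Render an alert payload as Slack mrkdwn text."""
    if alert_type in ("PRICE_INCREASE", "PRICE_DECREASE"):
        text = (
            f"*{data['drug']}* at {data['pharmacy']}: "
            f"${data['old_price']:.2f} -> ${data['new_price']:.2f} "
            f"({data['change_pct']:+.1f}%)"
        )
    elif alert_type == "NEW_TRIAL":
        text = (
            f"*{data['title']}* ({data['nct_id']})\n"
            f"Sponsor: {data['sponsor']} | Phase: {data['phase']}\n"
            f"Condition: {data['condition']} | Intervention: {data['intervention']}"
        )
    elif alert_type == "STATUS_CHANGE":
        text = (
            f"*{data['title']}* ({data['nct_id']})\n"
            f"Status: {data['old_status']} -> {data['new_status']} "
            f"(sponsor: {data['sponsor']})"
        )
    else:
        # FDA alerts and weekly reports are nested dicts; show them as JSON
        text = json.dumps(data, indent=2, default=str)
    return text[:SLACK_SECTION_LIMIT]


# Made-up values for illustration only
example = format_alert_message("PRICE_INCREASE", {
    "drug": "drugx", "pharmacy": "Pharmacy A",
    "old_price": 100.0, "new_price": 112.5,
    "change_pct": 12.5, "direction": "increase",
})
print(example)
```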
Cost Comparison: Traditional vs. AI Agent Approach
| Solution | Monthly Cost | Coverage | Customization |
|---|---|---|---|
| IQVIA / Evaluate Pharma | $5,000–$50,000 | Comprehensive | Limited |
| Clarivate / Cortellis | $3,000–$25,000 | Clinical trials focus | Some customization |
| GoodRx Pro / Definitive HC | $500–$5,000 | Pricing data | Limited |
| Manual research team | $8,000–$15,000 | Varies | Fully custom |
| AI Agent + Mantis API | $29–$299 | Fully customizable | Complete control |
Enterprise pharma data platforms like IQVIA charge tens of thousands per month for access. An AI agent approach lets you build exactly what you need, whether that's drug pricing intelligence, clinical trial monitoring, or competitive analysis, at a tiny fraction of the cost.
Use Cases by Healthcare Segment
1. Pharmaceutical Companies
Track competitor pipeline activity, monitor drug pricing across pharmacies, and get early alerts on FDA decisions that affect your therapeutic area. An AI agent can monitor 50+ competitor drugs simultaneously and alert you within minutes of a filing or price change.
2. Health Tech / Digital Health Startups
Build real-time drug pricing APIs, clinical trial matching platforms, or treatment cost comparison tools. Mantis provides the data extraction layer; you build the product on top.
3. Payers & PBMs (Pharmacy Benefit Managers)
Monitor drug price changes across retail pharmacies, track generic entry dates, and analyze formulary positioning of competing drugs. Automated intelligence helps negotiate better pricing with manufacturers.
4. Healthcare Consulting & Advisory
Deliver real-time market intelligence to pharma clients. Build automated competitive landscape reports that update weekly, tracking pipeline changes, pricing shifts, and regulatory developments across an entire therapeutic area.
Compliance and Ethical Considerations
Important: Healthcare data scraping requires extra care. Always respect robots.txt and each site's terms of service, and never scrape protected health information (PHI). The techniques in this guide focus on publicly available data: published drug prices, registered clinical trials, and public FDA filings. Never use web scraping to access patient data or circumvent paywalls on medical literature.
Best practices for healthcare data collection:
- Only scrape public sources: FDA.gov, ClinicalTrials.gov, published pharmacy pricing
- Respect rate limits: government sites have strict crawling policies
- No PHI ever: never collect, store, or process patient health information
- Verify data accuracy: always cross-reference critical data points before making decisions
- Document your sources: maintain audit trails for regulatory compliance
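The robots.txt and rate-limit practices above can be enforced in code before any request is sent. The following is a rough sketch using only the Python standard library. The `is_allowed` helper, the `RateLimiter` class, the user agent string, and the example rules are all illustrative assumptions; real sites publish their own robots.txt, which you would download first.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self._last_call = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


# Illustrative rules only, not taken from any real site
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

limiter = RateLimiter(min_interval_seconds=0.05)
for url in ["https://example.org/public/page", "https://example.org/private/data"]:
    limiter.wait()
    print(url, "allowed" if is_allowed(EXAMPLE_ROBOTS, "my-research-bot", url) else "blocked")
```

Calling `limiter.wait()` before each scrape request keeps a single process under the chosen request rate; a multi-worker setup would need a shared limiter.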
Start Building Healthcare Intelligence Today
Mantis WebPerception API handles JavaScript rendering, anti-bot protection, and AI-powered data extraction, so you can focus on building healthcare insights, not fighting with web scraping infrastructure.
Next Steps
You now have a complete healthcare intelligence system. To expand it:
- Add more data sources: patent databases (Google Patents), medical conferences (ASCO, AHA), pharmacy benefit formularies
- Build specialty trackers: focus on oncology, rare diseases, or biosimilars for deeper intelligence
- Create dashboards: use Streamlit or Retool to visualize pricing trends, pipeline maps, and competitive landscapes
- Schedule with cron: run daily pricing checks, weekly trial scans, and real-time FDA monitoring
- Alert escalation: route high-impact alerts to executives, routine data to analysts
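As one way to approach the alert escalation idea, the sketch below routes each alert to a destination based on an impact level. The channel names, the `impact_level` key, and both helper functions are hypothetical placeholders; in practice each channel would map to its own Slack webhook URL.

```python
from typing import Dict, List

# Hypothetical channel names; replace with your own channels or webhooks
ROUTES: Dict[str, str] = {
    "HIGH": "#exec-alerts",
    "MEDIUM": "#pharma-intel",
    "LOW": "#analyst-feed",
}


def route_alert(alert: Dict) -> str:
    """Pick a destination channel from the alert's impact level."""
    level = str(alert.get("impact_level", "LOW")).upper()
    # Unknown levels go to the analyst feed rather than being dropped
    return ROUTES.get(level, ROUTES["LOW"])


def group_by_channel(alerts: List[Dict]) -> Dict[str, List[Dict]]:
    """Bucket alerts by destination so each channel gets one batch."""
    grouped: Dict[str, List[Dict]] = {}
    for alert in alerts:
        grouped.setdefault(route_alert(alert), []).append(alert)
    return grouped


# Made-up alerts for illustration only
alerts = [
    {"type": "FDA_APPROVAL", "impact_level": "HIGH"},
    {"type": "PRICE_INCREASE", "impact_level": "low"},
    {"type": "NEW_TRIAL"},
]
for channel, batch in group_by_channel(alerts).items():
    print(channel, [a["type"] for a in batch])
```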