Web Scraping for Healthcare & Pharma: How AI Agents Track Drug Prices, Clinical Trials & Medical Data in 2026

Published: March 11, 2026 · 15 min read · By the Mantis Team

The healthcare and pharmaceutical industries generate massive amounts of publicly available data: drug pricing on GoodRx and pharmacy sites, clinical trial registrations on ClinicalTrials.gov, FDA approval filings, medical research on PubMed, and competitive intelligence from pharma company press releases. Yet most healthcare organizations still track this data manually or pay $10,000–$100,000+ per year for specialized data providers.

AI agents powered by web scraping APIs can automate healthcare data collection, extract structured information from unstructured medical pages, and deliver real-time intelligence at a fraction of the cost. In this guide, you'll build a complete healthcare data intelligence system using Python, the Mantis WebPerception API, and GPT-4o.

Why Healthcare & Pharma Teams Need Web Scraping

Healthcare is uniquely data-rich but tool-poor when it comes to automated intelligence:

What You'll Build

By the end of this guide, you'll have an AI-powered healthcare intelligence agent that:

  1. Scrapes drug pricing from pharmacy comparison sites
  2. Monitors clinical trials for new registrations and status changes
  3. Tracks FDA filings for approvals, warnings, and label changes
  4. Extracts structured data using AI (drug names, dosages, trial phases, endpoints)
  5. Analyzes trends with GPT-4o for strategic insights
  6. Alerts your team via Slack when significant changes occur

Step 1: Set Up Your Healthcare Data Pipeline

First, install the required packages and define your data models:

pip install requests pydantic openai python-dotenv

import os
import json
import requests
import sqlite3
from datetime import datetime, timedelta
from pydantic import BaseModel, Field
from typing import Optional, List
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

MANTIS_API_KEY = os.getenv("MANTIS_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MANTIS_BASE = "https://api.mantisapi.com/v1"


# --- Data Models ---

class DrugPrice(BaseModel):
    """Structured drug pricing data."""
    drug_name: str = Field(description="Generic or brand name")
    brand_name: Optional[str] = Field(default=None, description="Brand name if generic provided")
    dosage: str = Field(description="e.g., '10mg', '500mg/5ml'")
    form: str = Field(description="tablet, capsule, injection, etc.")
    quantity: int = Field(description="Number of units")
    pharmacy: str = Field(description="Pharmacy name")
    retail_price: float = Field(description="Retail price in USD")
    coupon_price: Optional[float] = Field(default=None, description="Discounted/coupon price")
    source_url: str
    scraped_at: str


class ClinicalTrial(BaseModel):
    """Structured clinical trial data."""
    nct_id: str = Field(description="ClinicalTrials.gov identifier (e.g., NCT05123456)")
    title: str
    sponsor: str
    phase: str = Field(description="Phase 1, 2, 3, 4, or N/A")
    status: str = Field(description="Recruiting, Active, Completed, etc.")
    condition: str = Field(description="Disease or condition being studied")
    intervention: str = Field(description="Drug or treatment being tested")
    enrollment: Optional[int] = Field(default=None, description="Target enrollment")
    start_date: Optional[str] = None
    completion_date: Optional[str] = None
    primary_endpoint: Optional[str] = None
    source_url: str
    scraped_at: str


class FDAFiling(BaseModel):
    """Structured FDA filing/approval data."""
    filing_type: str = Field(description="NDA, BLA, sNDA, ANDA, safety alert, etc.")
    drug_name: str
    company: str
    indication: str = Field(description="Approved indication or therapeutic area")
    decision: str = Field(description="Approved, Complete Response, Tentative Approval, etc.")
    decision_date: str
    summary: str = Field(description="Brief summary of the filing/decision")
    source_url: str
    scraped_at: str
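
The nct_id field above expects identifiers such as NCT05123456. Because the values come from AI extraction, a quick format check before storing them can catch malformed IDs early. The following is a minimal sketch, assuming the registry format of "NCT" followed by eight digits; normalize_nct_id is a hypothetical helper name and is not part of the Mantis API or the code above.

```python
import re
from typing import Optional

# Assumption: registry IDs are "NCT" followed by exactly eight digits.
_NCT_PATTERN = re.compile(r"NCT[0-9]{8}")


def normalize_nct_id(raw: str) -> Optional[str]:
    """Return the ID in canonical upper-case form, or None if it is malformed."""
    candidate = raw.strip().upper()
    return candidate if _NCT_PATTERN.fullmatch(candidate) else None


if __name__ == "__main__":
    print(normalize_nct_id(" nct05123456 "))  # well-formed after trimming
    print(normalize_nct_id("NCT123"))         # too short, rejected
```

A check like this could run just before a ClinicalTrial record is constructed, so that records with unusable IDs are skipped rather than stored.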

Step 2: Scrape Drug Pricing Data

Drug prices vary wildly by pharmacy, location, and available coupons. Here's how to scrape and compare prices across sources:

def scrape_drug_prices(drug_name: str, dosage: str, quantity: int = 30) -> List[DrugPrice]:
    """Scrape drug prices from pharmacy comparison sites."""

    # Use Mantis to scrape pricing pages
    search_url = f"https://www.goodrx.com/{drug_name.lower().replace(' ', '-')}"

    response = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": search_url,
            "render_js": True,
            "wait_for": 3000,
            "extract": {
                "schema": DrugPrice.model_json_schema(),
                "prompt": f"""Extract all pharmacy prices for {drug_name} {dosage}, 
                quantity {quantity}. Include retail and coupon prices for each pharmacy.
                Return as a list of DrugPrice objects.""",
                "multiple": True
            }
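        },
        timeout=60,  # avoid hanging indefinitely on a slow render
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body

    data = response.json()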
    prices = []

    if data.get("extracted"):
        for item in data["extracted"]:
            item["scraped_at"] = datetime.now().isoformat()
            item["source_url"] = search_url
            try:
                prices.append(DrugPrice(**item))
            except Exception:
                continue

    return prices


def track_price_changes(drug_name: str, prices: List[DrugPrice], db_path: str = "healthcare.db"):
    """Store prices and detect changes over time."""

    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS drug_prices (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            drug_name TEXT,
            dosage TEXT,
            form TEXT,
            quantity INTEGER,
            pharmacy TEXT,
            retail_price REAL,
            coupon_price REAL,
            source_url TEXT,
            scraped_at TEXT
        )
    """)

    changes = []

    for price in prices:
        # Check previous price
        cursor.execute("""
            SELECT retail_price, coupon_price FROM drug_prices
            WHERE drug_name = ? AND pharmacy = ? AND dosage = ?
            ORDER BY scraped_at DESC LIMIT 1
        """, (price.drug_name, price.pharmacy, price.dosage))

        prev = cursor.fetchone()

        if prev:
            prev_retail, prev_coupon = prev
            if prev_retail and abs(price.retail_price - prev_retail) / prev_retail > 0.05:
                pct = ((price.retail_price - prev_retail) / prev_retail) * 100
                changes.append({
                    "drug": price.drug_name,
                    "pharmacy": price.pharmacy,
                    "old_price": prev_retail,
                    "new_price": price.retail_price,
                    "change_pct": round(pct, 1),
                    "direction": "increase" if pct > 0 else "decrease"
                })

        # Insert new price
        cursor.execute("""
            INSERT INTO drug_prices
            (drug_name, dosage, form, quantity, pharmacy, retail_price, coupon_price, source_url, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (price.drug_name, price.dosage, price.form, price.quantity,
              price.pharmacy, price.retail_price, price.coupon_price,
              price.source_url, price.scraped_at))

    conn.commit()
    conn.close()

    return changes
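
The 5% threshold in track_price_changes is a relative change measured against the previous retail price. As a worked example with hypothetical prices, a move from $100.00 to $106.00 is +6.0% and gets recorded, while a move from $100.00 to $104.00 is +4.0% and is ignored. The sketch below isolates that arithmetic in a standalone function; price_change is an illustrative name, not part of the code above.

```python
from typing import Optional


def price_change(old: float, new: float, threshold: float = 0.05) -> Optional[dict]:
    """Return a change record if the relative move exceeds the threshold, else None."""
    if not old:  # no usable baseline (None or 0), so no percentage can be computed
        return None
    fraction = (new - old) / old
    if abs(fraction) <= threshold:
        return None
    return {
        "change_pct": round(fraction * 100, 1),
        "direction": "increase" if fraction > 0 else "decrease",
    }
```

Keeping the calculation in a pure function like this makes the threshold easy to unit test and to tune per drug class without touching the database code.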

Step 3: Monitor Clinical Trials

ClinicalTrials.gov is the world's largest registry of clinical studies. Tracking new trials and status changes gives pharma teams early intelligence on competitor pipelines:

def scrape_clinical_trials(
    condition: str,
    phase: Optional[str] = None,
    status: str = "recruiting"
) -> List[ClinicalTrial]:
    """Scrape clinical trial data for a condition."""

    search_url = (
        f"https://clinicaltrials.gov/search?"
        f"cond={condition.replace(' ', '+')}"
        f"&aggFilters=status:{status}"
    )

    if phase:
        search_url += f",phase:{phase}"

    response = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": search_url,
            "render_js": True,
            "wait_for": 5000,
            "extract": {
                "schema": ClinicalTrial.model_json_schema(),
                "prompt": f"""Extract all clinical trials listed on this page.
                For each trial, capture the NCT ID, title, sponsor, phase, status,
                condition, intervention, enrollment target, dates, and primary endpoint.
                Return as a list.""",
                "multiple": True
            }
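        },
        timeout=60,  # avoid hanging indefinitely on a slow render
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body

    data = response.json()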
    trials = []

    if data.get("extracted"):
        for item in data["extracted"]:
            item["scraped_at"] = datetime.now().isoformat()
            item["source_url"] = search_url
            try:
                trials.append(ClinicalTrial(**item))
            except Exception:
                continue

    return trials


def detect_trial_changes(trials: List[ClinicalTrial], db_path: str = "healthcare.db"):
    """Track trial status changes and new registrations."""

    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS clinical_trials (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            nct_id TEXT,
            title TEXT,
            sponsor TEXT,
            phase TEXT,
            status TEXT,
            condition TEXT,
            intervention TEXT,
            enrollment INTEGER,
            start_date TEXT,
            completion_date TEXT,
            primary_endpoint TEXT,
            source_url TEXT,
            scraped_at TEXT
        )
    """)

    alerts = []

    for trial in trials:
        # Check if this trial exists
        cursor.execute("""
            SELECT status, enrollment FROM clinical_trials
            WHERE nct_id = ? ORDER BY scraped_at DESC LIMIT 1
        """, (trial.nct_id,))

        prev = cursor.fetchone()

        if not prev:
            alerts.append({
                "type": "NEW_TRIAL",
                "nct_id": trial.nct_id,
                "title": trial.title,
                "sponsor": trial.sponsor,
                "phase": trial.phase,
                "condition": trial.condition,
                "intervention": trial.intervention
            })
        else:
            prev_status, prev_enrollment = prev
            if prev_status != trial.status:
                alerts.append({
                    "type": "STATUS_CHANGE",
                    "nct_id": trial.nct_id,
                    "title": trial.title,
                    "old_status": prev_status,
                    "new_status": trial.status,
                    "sponsor": trial.sponsor
                })

        # Insert record
        cursor.execute("""
            INSERT INTO clinical_trials
            (nct_id, title, sponsor, phase, status, condition, intervention,
             enrollment, start_date, completion_date, primary_endpoint, source_url, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (trial.nct_id, trial.title, trial.sponsor, trial.phase,
              trial.status, trial.condition, trial.intervention,
              trial.enrollment, trial.start_date, trial.completion_date,
              trial.primary_endpoint, trial.source_url, trial.scraped_at))

    conn.commit()
    conn.close()

    return alerts
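
One gap worth noting in the approach above: because the search is filtered by status, a trial that leaves that status (for example, one that stops recruiting) may simply drop out of the results, so no STATUS_CHANGE alert fires for it. A minimal sketch of one way to catch this, assuming you keep a {nct_id: status} snapshot from the previous scan; the function name and the NO_LONGER_LISTED alert type are hypothetical additions, not part of the code above.

```python
from typing import Dict, List


def diff_snapshots(previous: Dict[str, str], current: Dict[str, str]) -> List[dict]:
    """Compare two {nct_id: status} snapshots from consecutive scans."""
    alerts = []
    for nct_id, status in current.items():
        if nct_id not in previous:
            alerts.append({"type": "NEW_TRIAL", "nct_id": nct_id})
        elif previous[nct_id] != status:
            alerts.append({"type": "STATUS_CHANGE", "nct_id": nct_id,
                           "old_status": previous[nct_id], "new_status": status})
    # Trials seen last time but absent now have likely left the filtered status.
    for nct_id in sorted(previous.keys() - current.keys()):
        alerts.append({"type": "NO_LONGER_LISTED", "nct_id": nct_id,
                       "old_status": previous[nct_id]})
    return alerts
```

A trial flagged as NO_LONGER_LISTED could then be looked up individually to find its new status.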

Step 4: Track FDA Filings and Approvals

FDA decisions move markets. A new drug approval can add billions to a company's market cap overnight. Here's how to monitor FDA activity automatically:

def scrape_fda_approvals(days_back: int = 7) -> List[FDAFiling]:
    """Scrape recent FDA drug approvals and filings."""

    fda_url = "https://www.fda.gov/drugs/drug-approvals-and-databases/drug-trials-snapshots"

    response = requests.post(
        f"{MANTIS_BASE}/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": fda_url,
            "render_js": True,
            "wait_for": 3000,
            "extract": {
                "schema": FDAFiling.model_json_schema(),
                "prompt": f"""Extract all FDA drug approvals and filings from this page.
                For each, capture: filing type (NDA, BLA, sNDA, etc.), drug name,
                company/sponsor, approved indication, decision (approved/rejected/CRL),
                decision date, and a brief summary. Focus on entries from the last {days_back} days.
                Return as a list.""",
                "multiple": True
            }
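        },
        timeout=60,  # avoid hanging indefinitely on a slow render
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body

    data = response.json()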
    filings = []

    if data.get("extracted"):
        for item in data["extracted"]:
            item["scraped_at"] = datetime.now().isoformat()
            item["source_url"] = fda_url
            try:
                filings.append(FDAFiling(**item))
            except Exception:
                continue

    return filings


def analyze_fda_impact(filing: FDAFiling) -> dict:
    """Use GPT-4o to analyze the market impact of an FDA decision."""

    client = OpenAI(api_key=OPENAI_API_KEY)

    prompt = f"""Analyze this FDA decision for market and competitive impact:

Filing Type: {filing.filing_type}
Drug: {filing.drug_name}
Company: {filing.company}
Indication: {filing.indication}
Decision: {filing.decision}
Date: {filing.decision_date}
Summary: {filing.summary}

Provide:
1. IMPACT_LEVEL: HIGH / MEDIUM / LOW
2. MARKET_SIZE: Estimated market size for this indication
3. COMPETITORS: Key competing drugs/companies
4. IMPLICATIONS: What this means for the company and competitors
5. NEXT_STEPS: What to watch for next (Phase 3 data, label expansion, generics, etc.)

Return as JSON."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a pharmaceutical industry analyst. Provide concise, data-driven analysis."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)
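
Requesting a JSON object from the model yields parseable JSON in the normal case, but it does not guarantee the key names or the values you asked for. Since the monitoring loop later branches on IMPACT_LEVEL, it can help to normalize that field defensively. A minimal sketch, assuming the prompt's key names; parse_impact and the "UNKNOWN" fallback are illustrative choices, not part of the OpenAI SDK.

```python
import json

VALID_LEVELS = {"HIGH", "MEDIUM", "LOW"}


def parse_impact(raw: str) -> dict:
    """Parse the model's JSON reply and normalize IMPACT_LEVEL.

    Falls back to "UNKNOWN" when the reply is not valid JSON, is not a
    JSON object, or carries an unexpected level.
    """
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):  # TypeError covers a None reply
        return {"IMPACT_LEVEL": "UNKNOWN"}
    if not isinstance(parsed, dict):
        return {"IMPACT_LEVEL": "UNKNOWN"}
    level = str(parsed.get("IMPACT_LEVEL", "")).strip().upper()
    parsed["IMPACT_LEVEL"] = level if level in VALID_LEVELS else "UNKNOWN"
    return parsed
```

With a helper like this, a reply of "high" or " High " is treated the same as "HIGH", and a malformed reply is skipped instead of crashing the scan.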

Step 5: AI-Powered Healthcare Intelligence Analysis

The real power comes from connecting all data sources and having GPT-4o analyze patterns across drug pricing, clinical trials, and FDA activity:

def generate_healthcare_intelligence_report(
    drug_prices: List[dict],
    trial_alerts: List[dict],
    fda_filings: List[FDAFiling],
    therapeutic_area: str
) -> dict:
    """Generate a comprehensive healthcare intelligence report."""

    client = OpenAI(api_key=OPENAI_API_KEY)

    prompt = f"""Generate a healthcare intelligence report for: {therapeutic_area}

DRUG PRICING CHANGES:
{json.dumps(drug_prices, indent=2)}

CLINICAL TRIAL ALERTS:
{json.dumps(trial_alerts, indent=2)}

RECENT FDA ACTIVITY:
{json.dumps([f.model_dump() for f in fda_filings], indent=2)}

Provide a strategic intelligence report with:

1. EXECUTIVE_SUMMARY: 2-3 sentence overview of key developments
2. PRICING_TRENDS: What's happening with drug prices in this space
3. PIPELINE_ACTIVITY: Notable clinical trial developments
4. REGULATORY_SIGNALS: What FDA activity signals for the market
5. COMPETITIVE_LANDSCAPE: Winners and losers this period
6. RISKS: Potential risks or disruptions to monitor
7. OPPORTUNITIES: Actionable opportunities identified
8. WATCH_LIST: Top 5 things to monitor next week

Return as JSON."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a senior healthcare strategy analyst. 
            Provide actionable intelligence, not just summaries. Connect dots between 
            pricing, trials, and regulatory activity. Flag non-obvious implications."""},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)
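
The report prompt embeds every price change, trial alert, and filing as JSON, so a busy week can produce a very long prompt. A minimal sketch of capping each list before it is interpolated, assuming the most important records come first; cap_for_prompt and its default limits are hypothetical and should be tuned to your model's context window.

```python
import json
from typing import List


def cap_for_prompt(items: List[dict], max_items: int = 50, max_chars: int = 20000) -> str:
    """Serialize at most max_items records, dropping records from the end
    until the JSON text fits within max_chars."""
    kept = items[:max_items]
    text = json.dumps(kept, indent=2, default=str)
    while kept and len(text) > max_chars:
        kept = kept[:-1]
        text = json.dumps(kept, indent=2, default=str)
    return text
```

In the report function, a call such as cap_for_prompt(trial_alerts) could stand in for the direct json.dumps call. The output is always valid JSON because whole records are dropped rather than the text being cut mid-record.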

Step 6: Automated Alerts and Reporting

Set up automated monitoring with Slack alerts for critical healthcare intelligence:

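def format_alert_message(alert_type: str, data: dict) -> str:
    """Render alert data as Slack mrkdwn, one '*key:* value' line per field."""
    lines = []
    for key, value in data.items():
        if isinstance(value, (dict, list)):
            value = json.dumps(value, default=str)  # flatten nested data
        lines.append(f"*{key}:* {value}")
    return "\n".join(lines)[:3000]  # Slack limits section text to 3,000 characters


def send_healthcare_alert(alert_type: str, data: dict, webhook_url: str):
    """Send healthcare intelligence alerts to Slack."""

    emoji_map = {
        "NEW_TRIAL": "🧪",
        "STATUS_CHANGE": "🔄",
        "PRICE_INCREASE": "📈",
        "PRICE_DECREASE": "📉",
        "FDA_APPROVAL": "✅",
        "FDA_REJECTION": "❌",
        "SAFETY_ALERT": "⚠️"
    }

    emoji = emoji_map.get(alert_type, "💊")

    blocks = [
        {
            "type": "header",
            "text": {"type": "plain_text", "text": f"{emoji} Healthcare Alert: {alert_type}"}
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": format_alert_message(alert_type, data)}
        }
    ]

    requests.post(webhook_url, json={"blocks": blocks}, timeout=10)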


def run_healthcare_monitor(
    drugs: List[str],
    conditions: List[str],
    webhook_url: str
):
    """Main monitoring loop for healthcare intelligence."""

    print(f"[{datetime.now()}] Starting healthcare intelligence scan...")

    all_price_changes = []
    all_trial_alerts = []

    # 1. Check drug prices
    for drug in drugs:
        prices = scrape_drug_prices(drug, dosage="all")
        changes = track_price_changes(drug, prices)
        all_price_changes.extend(changes)

        for change in changes:
            alert_type = "PRICE_INCREASE" if change["direction"] == "increase" else "PRICE_DECREASE"
            if abs(change["change_pct"]) > 10:  # Only alert on >10% changes
                send_healthcare_alert(alert_type, change, webhook_url)

    # 2. Monitor clinical trials
    for condition in conditions:
        trials = scrape_clinical_trials(condition)
        alerts = detect_trial_changes(trials)
        all_trial_alerts.extend(alerts)

        for alert in alerts:
            send_healthcare_alert(alert["type"], alert, webhook_url)

    # 3. Check FDA filings
    fda_filings = scrape_fda_approvals(days_back=1)
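    for filing in fda_filings:
        impact = analyze_fda_impact(filing)
        if impact.get("IMPACT_LEVEL") in ("HIGH", "MEDIUM"):
            # Label the alert by the actual decision instead of assuming approval
            decision = filing.decision.strip().lower()
            is_approval = decision.startswith(("approv", "tentative approv"))
            send_healthcare_alert("FDA_APPROVAL" if is_approval else "FDA_REJECTION", {
                "filing": filing.model_dump(),
                "analysis": impact
            }, webhook_url)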

    # 4. Generate weekly report (if Monday)
    if datetime.now().weekday() == 0:
        report = generate_healthcare_intelligence_report(
            all_price_changes, all_trial_alerts,
            fda_filings, therapeutic_area="oncology"
        )
        send_healthcare_alert("WEEKLY_REPORT", report, webhook_url)

    print(f"[{datetime.now()}] Scan complete. "
          f"Prices: {len(all_price_changes)} changes, "
          f"Trials: {len(all_trial_alerts)} alerts, "
          f"FDA: {len(fda_filings)} filings")


# --- Run the monitor ---
if __name__ == "__main__":
    run_healthcare_monitor(
        drugs=["ozempic", "keytruda", "humira", "eliquis", "jardiance"],
        conditions=["type 2 diabetes", "non-small cell lung cancer", "rheumatoid arthritis"],
        webhook_url=os.getenv("SLACK_WEBHOOK_URL")
    )
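
The script above performs a single scan each time it is run. To monitor continuously, it needs to be triggered on a schedule, for example by cron or by a small loop. A minimal sketch of the timing arithmetic for a daily run, assuming naive local datetimes (so daylight saving shifts are not handled); seconds_until_next_run is a hypothetical helper.

```python
from datetime import datetime, timedelta


def seconds_until_next_run(now: datetime, hour: int = 6) -> float:
    """Seconds from `now` until the next occurrence of hour:00 local time."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return (target - now).total_seconds()


# Illustrative loop (commented out so the sketch stays import-safe):
# import time
# while True:
#     time.sleep(seconds_until_next_run(datetime.now()))
#     run_healthcare_monitor(drugs=[...], conditions=[...], webhook_url=...)
```

A scheduler such as cron is usually the more robust choice in production, since it survives process restarts; the loop form is mainly convenient for local experiments.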

Cost Comparison: Traditional vs. AI Agent Approach

Solution                   | Monthly Cost    | Coverage              | Customization
IQVIA / Evaluate Pharma    | $5,000–$50,000  | Comprehensive         | Limited
Clarivate / Cortellis      | $3,000–$25,000  | Clinical trials focus | Some customization
GoodRx Pro / Definitive HC | $500–$5,000     | Pricing data          | Limited
Manual research team       | $8,000–$15,000  | Varies                | Fully custom
AI Agent + Mantis API      | $29–$299        | Fully customizable    | Complete control

Enterprise pharma data platforms like IQVIA charge tens of thousands per month for access. An AI agent approach lets you build exactly what you need, whether that's drug pricing intelligence, clinical trial monitoring, or competitive analysis, at a tiny fraction of the cost.

Use Cases by Healthcare Segment

1. Pharmaceutical Companies

Track competitor pipeline activity, monitor drug pricing across pharmacies, and get early alerts on FDA decisions that affect your therapeutic area. An AI agent can monitor 50+ competitor drugs simultaneously and alert you within minutes of a filing or price change.

2. Health Tech / Digital Health Startups

Build real-time drug pricing APIs, clinical trial matching platforms, or treatment cost comparison tools. Mantis provides the data extraction layer; you build the product on top.

3. Payers & PBMs (Pharmacy Benefit Managers)

Monitor drug price changes across retail pharmacies, track generic entry dates, and analyze formulary positioning of competing drugs. Automated intelligence helps negotiate better pricing with manufacturers.

4. Healthcare Consulting & Advisory

Deliver real-time market intelligence to pharma clients. Build automated competitive landscape reports that update weekly, tracking pipeline changes, pricing shifts, and regulatory developments across an entire therapeutic area.

Compliance and Ethical Considerations

Important: Healthcare data scraping requires extra care. Always respect robots.txt, terms of service, and never scrape protected health information (PHI). The techniques in this guide focus on publicly available data: published drug prices, registered clinical trials, and public FDA filings. Never use web scraping to access patient data or circumvent paywalls on medical literature.
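
Respecting robots.txt can be checked programmatically before any page is requested. A minimal sketch using Python's standard urllib.robotparser module; the sample rules, the user agent string, and the example.com URLs are hypothetical and for illustration only.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "my-research-bot", "https://example.com/drugs"))
    print(is_allowed(SAMPLE_ROBOTS, "my-research-bot", "https://example.com/private/page"))
```

To check a live site instead of a string, RobotFileParser also offers set_url() followed by read(), which downloads the site's robots.txt before can_fetch() is called. A robots.txt check does not replace reading a site's terms of service.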

Best practices for healthcare data collection:

Start Building Healthcare Intelligence Today

Mantis WebPerception API handles JavaScript rendering, anti-bot protection, and AI-powered data extraction, so you can focus on building healthcare insights, not fighting with web scraping infrastructure.

Get Your API Key (Free Tier Available)

Next Steps

You now have a complete healthcare intelligence system. To expand it: