Table of Contents
- Why Scrape Instagram Data?
- What Data Can You Extract?
- Method 1: Python + Requests (Public API Endpoints)
- Method 2: Playwright (Headless Browser)
- Method 3: Node.js + Puppeteer
- Method 4: Web Scraping API (Easiest)
- Beating Instagram's Anti-Bot Detection
- Instagram Graph API vs Scraping
- Method Comparison
- Real-World Use Cases
- Legal Considerations
- FAQ
Why Scrape Instagram Data?
Instagram has over 2 billion monthly active users, making it one of the richest sources of social data on the internet. Businesses, researchers, and developers scrape Instagram for:
- Influencer analytics – Evaluate engagement rates, follower growth, and content performance before sponsorship deals
- Brand monitoring – Track mentions, hashtags, and competitor activity across Instagram in real time
- Market research – Discover trending products, aesthetics, and consumer preferences through visual content analysis
- Competitive intelligence – Monitor competitor posting strategies, engagement metrics, and audience growth
- Lead generation – Find business accounts in specific niches along with the contact info (email, website) listed in their bios
- Content curation – Aggregate user-generated content (UGC) for marketing campaigns, with proper attribution
- AI agent social intelligence – Give AI assistants the ability to research brands, people, and trends on Instagram
- Academic research – Study social media behavior, visual trends, and platform dynamics at scale

Instagram's official API is extremely limited: it only works with accounts you own or that authorize your app. For public data at scale, scraping is often the only viable option.
What Data Can You Extract?
Instagram profiles and posts contain rich data, though access depends on privacy settings:
| Data Point | Public Profiles | Private Profiles |
|---|---|---|
| Username & Full Name | ✓ | ✓ |
| Bio & External URL | ✓ | ✓ |
| Profile Picture (HD) | ✓ | ✓ |
| Follower / Following Count | ✓ | ✓ |
| Post Count | ✓ | ✓ |
| Verified Badge | ✓ | ✓ |
| Business Category | ✓ | ✓ |
| Recent Posts (12-50) | ✓ | ✗ |
| Post Captions | ✓ | ✗ |
| Like / Comment Counts | ✓ | ✗ |
| Post Images / Videos | ✓ | ✗ |
| Reels | ✓ | ✗ |
| Tagged Locations | ✓ | ✗ |
| Hashtags Used | ✓ | ✗ |
| Comment Text | ✓ (limited) | ✗ |
Method 1: Python + Requests (Public API Endpoints)
Instagram serves profile data through internal GraphQL endpoints that return JSON. These endpoints are undocumented and change frequently, but they're faster than full browser rendering.
Install Dependencies
```bash
pip install requests
```
Public Profile Scraper
```python
# instagram_scraper.py
import requests
import json
import time
import random

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.9",
    "X-IG-App-ID": "936619743392459",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.instagram.com/",
}

def scrape_instagram_profile(username: str) -> dict:
    """Scrape public Instagram profile data via the web API."""
    url = "https://www.instagram.com/api/v1/users/web_profile_info/"
    params = {"username": username}
    resp = requests.get(url, headers=HEADERS, params=params, timeout=15)
    if resp.status_code == 404:
        return {"error": f"User '{username}' not found"}
    if resp.status_code != 200:
        return {"error": f"Request failed: {resp.status_code}"}

    data = resp.json()
    user = data.get("data", {}).get("user", {})
    if not user:
        return {"error": "Could not parse user data"}

    # Extract recent posts
    posts = []
    edges = user.get("edge_owner_to_timeline_media", {}).get("edges", [])
    for edge in edges[:12]:
        node = edge.get("node", {})
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        posts.append({
            "id": node.get("shortcode"),
            "url": f"https://www.instagram.com/p/{node.get('shortcode')}/",
            "image": node.get("display_url"),
            "caption": caption_edges[0]["node"]["text"] if caption_edges else None,
            "likes": node.get("edge_liked_by", {}).get("count", 0),
            "comments": node.get("edge_media_to_comment", {}).get("count", 0),
            "timestamp": node.get("taken_at_timestamp"),
            "is_video": node.get("is_video", False),
            "video_views": node.get("video_view_count"),
        })

    return {
        "username": user.get("username"),
        "full_name": user.get("full_name"),
        "bio": user.get("biography"),
        "external_url": user.get("external_url"),
        "followers": user.get("edge_followed_by", {}).get("count", 0),
        "following": user.get("edge_follow", {}).get("count", 0),
        "posts_count": user.get("edge_owner_to_timeline_media", {}).get("count", 0),
        "is_verified": user.get("is_verified", False),
        "is_business": user.get("is_business_account", False),
        "business_category": user.get("category_name"),
        "profile_pic_hd": user.get("profile_pic_url_hd"),
        "is_private": user.get("is_private", False),
        "recent_posts": posts,
    }

# Example usage
profile = scrape_instagram_profile("natgeo")
print(json.dumps(profile, indent=2))
print(f"\nFollowers: {profile.get('followers', 0):,}")
print(f"Posts: {profile.get('posts_count', 0):,}")
```
Instagram's internal API endpoints change frequently without notice. The web_profile_info endpoint works as of early 2026, but Meta may modify or remove it at any time. Always have a fallback strategy – browser-based scraping or a managed API.
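That fallback strategy can be as simple as trying each scraper in order until one returns usable data. The sketch below uses placeholder scraper functions to stand in for the real methods in this guide; the function names are illustrative, not part of any library:

```python
def scrape_with_fallbacks(username, scrapers):
    """Try each (name, callable) pair in order; return the first
    result that isn't an error dict, tagged with which method won."""
    errors = {}
    for name, fn in scrapers:
        try:
            result = fn(username)
        except Exception as exc:
            errors[name] = str(exc)
            continue
        if result and "error" not in result:
            return {"method": name, **result}
        errors[name] = (result or {}).get("error", "empty result")
    return {"error": "all methods failed", "details": errors}

# Placeholder scrapers standing in for Methods 1, 2, and 4:
def api_endpoint(u):
    return {"error": "endpoint changed"}  # simulate a dead endpoint

def headless_browser(u):
    return {"username": u, "followers": 284_000_000}

print(scrape_with_fallbacks("natgeo", [
    ("api_endpoint", api_endpoint),
    ("headless_browser", headless_browser),
]))
```

The dispatcher records why each method failed, which makes it easy to alert when the primary endpoint breaks rather than silently falling back forever.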
Hashtag Explorer
```python
# hashtag_scraper.py
import json

import requests

from instagram_scraper import HEADERS  # reuse the headers from Method 1

def scrape_hashtag(tag: str) -> dict:
    """Scrape top posts from an Instagram hashtag page."""
    url = f"https://www.instagram.com/explore/tags/{tag}/"
    params = {"__a": "1", "__d": "dis"}
    resp = requests.get(url, headers=HEADERS, params=params, timeout=15)
    if resp.status_code != 200:
        return {"error": f"Failed to fetch #{tag}"}
    try:
        data = resp.json()
    except json.JSONDecodeError:
        return {"error": "Instagram returned HTML – likely blocked"}
    hashtag_data = (
        data.get("graphql", {}).get("hashtag", {})
        or data.get("data", {}).get("hashtag", {})
    )
    if not hashtag_data:
        return {"error": "Could not parse hashtag data"}
    top_posts = []
    edges = hashtag_data.get("edge_hashtag_to_top_posts", {}).get("edges", [])
    for edge in edges[:9]:
        node = edge["node"]
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        top_posts.append({
            "shortcode": node.get("shortcode"),
            "likes": node.get("edge_liked_by", {}).get("count", 0),
            "comments": node.get("edge_media_to_comment", {}).get("count", 0),
            "caption": (
                caption_edges[0]["node"]["text"][:200]
                if caption_edges else None
            ),
            "is_video": node.get("is_video", False),
        })
    return {
        "hashtag": tag,
        "post_count": hashtag_data.get("edge_hashtag_to_media", {}).get("count", 0),
        "top_posts": top_posts,
    }

result = scrape_hashtag("webdevelopment")
if "error" not in result:
    print(f"#{result['hashtag']}: {result.get('post_count', 0):,} posts")
else:
    print(result["error"])
```
Method 2: Playwright (Headless Browser)
Instagram is a JavaScript-heavy single-page app. Playwright renders the full page, handles login walls and infinite scroll, and gives you access to all visible content – including stories, reels, and dynamically loaded posts.
Install
```bash
pip install playwright playwright-stealth
playwright install chromium
```
Full-Render Instagram Scraper
```python
# playwright_instagram.py
import asyncio
import json

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_instagram_profile(username: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
        )
        page = await context.new_page()
        await stealth_async(page)

        # Block media for speed
        await page.route("**/*.{mp4,webm,ogg}", lambda route: route.abort())

        url = f"https://www.instagram.com/{username}/"
        await page.goto(url, wait_until="networkidle")

        # Dismiss login popup if it appears
        try:
            close_btn = page.locator(
                '[aria-label="Close"], button:has-text("Not Now")'
            )
            await close_btn.first.click(timeout=3000)
        except Exception:
            pass

        await page.wait_for_timeout(2000)

        # Extract profile data from the rendered page
        profile = await page.evaluate("""() => {
            const getMeta = (prop) => {
                const el = document.querySelector(`meta[property="${prop}"]`);
                return el ? el.content : null;
            };
            // Parse follower counts from header
            const stats = document.querySelectorAll('header section ul li');
            const parseCount = (el) => {
                if (!el) return 0;
                const text = el.textContent.replace(/,/g, '');
                const match = text.match(/([\d.]+)\s*(K|M|B)?/i);
                if (!match) return 0;
                let num = parseFloat(match[1]);
                const suffix = (match[2] || '').toUpperCase();
                if (suffix === 'K') num *= 1000;
                if (suffix === 'M') num *= 1000000;
                if (suffix === 'B') num *= 1000000000;
                return Math.round(num);
            };
            // Get post thumbnails
            const posts = [];
            const articles = document.querySelectorAll(
                'article img[srcset], main img[srcset]'
            );
            articles.forEach((img, i) => {
                if (i < 12) {
                    posts.push({ image: img.src, alt: img.alt || '' });
                }
            });
            return {
                title: getMeta('og:title'),
                description: getMeta('og:description'),
                image: getMeta('og:image'),
                posts_count: stats[0] ? parseCount(stats[0]) : 0,
                followers: stats[1] ? parseCount(stats[1]) : 0,
                following: stats[2] ? parseCount(stats[2]) : 0,
                posts: posts,
            };
        }""")

        # Parse bio from the meta description
        desc = profile.get("description", "") or ""
        bio_parts = desc.split("on Instagram: ")
        bio = bio_parts[1] if len(bio_parts) > 1 else ""
        if bio.startswith('"') and bio.endswith('"'):
            bio = bio[1:-1]

        profile["username"] = username
        profile["bio"] = bio
        profile["url"] = url

        await browser.close()
        return profile

# Run it
data = asyncio.run(scrape_instagram_profile("natgeo"))
print(json.dumps(data, indent=2))
```
Scroll & Collect Posts
```python
# scroll_posts.py
import asyncio
import random

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_all_posts(username: str, max_posts: int = 50) -> list:
    """Scroll through a profile and collect post links."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/125.0.0.0"
            ),
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()
        await stealth_async(page)
        await page.goto(
            f"https://www.instagram.com/{username}/",
            wait_until="networkidle",
        )
        # Dismiss popups
        try:
            await page.click('button:has-text("Not Now")', timeout=3000)
        except Exception:
            pass

        posts = set()
        prev_count = 0
        scroll_attempts = 0
        while len(posts) < max_posts and scroll_attempts < 20:
            # Collect post links currently in the DOM
            links = await page.evaluate("""() => {
                return [...document.querySelectorAll('a[href*="/p/"]')]
                    .map(a => a.href);
            }""")
            posts.update(links)
            if len(posts) == prev_count:
                scroll_attempts += 1
            else:
                scroll_attempts = 0
                prev_count = len(posts)
            # Scroll down to trigger lazy loading
            await page.evaluate("window.scrollBy(0, window.innerHeight * 2)")
            await page.wait_for_timeout(random.randint(1500, 3000))

        await browser.close()
        return list(posts)[:max_posts]
```
Instagram shows more data to logged-in users. You can export cookies from a real browser session (use a burner account) and load them into Playwright with context.add_cookies(). This unlocks additional post data and higher rate limits – but it also increases the risk of account suspension.
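The cookie hand-off needs a small translation step, since browser-extension exports don't quite match Playwright's cookie schema. The helper below is a hypothetical converter, assuming the common export format where expiry is stored under an `expirationDate` key:

```python
def to_playwright_cookies(exported: list) -> list:
    """Convert browser-extension cookie exports into the shape
    context.add_cookies() expects: name, value, domain, path,
    plus optional expires/secure/httpOnly."""
    cookies = []
    for c in exported:
        cookie = {
            "name": c["name"],
            "value": c["value"],
            "domain": c.get("domain", ".instagram.com"),
            "path": c.get("path", "/"),
        }
        if "expirationDate" in c:  # extension exports use this key
            cookie["expires"] = int(c["expirationDate"])
        if c.get("secure"):
            cookie["secure"] = True
        if c.get("httpOnly"):
            cookie["httpOnly"] = True
        cookies.append(cookie)
    return cookies

# Usage inside an async Playwright session:
# with open("cookies.json") as f:
#     await context.add_cookies(to_playwright_cookies(json.load(f)))
```

The sessionid cookie is the one that actually authenticates you; treat the exported file like a password.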
Method 3: Node.js + Puppeteer
Puppeteer provides headless Chrome control from Node.js – ideal for building Instagram scraping into backend services, serverless functions, or data pipelines.
Install
```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```
Profile Scraper with Stealth
```javascript
// instagram-scraper.mjs
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

async function scrapeProfile(username) {
  const browser = await puppeteer.launch({
    headless: "new",
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });

  // Block heavy resources
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    const type = req.resourceType();
    if (["video", "font"].includes(type)) {
      req.abort();
    } else {
      req.continue();
    }
  });

  // Intercept the GraphQL response
  let profileData = null;
  page.on("response", async (resp) => {
    const url = resp.url();
    if (url.includes("web_profile_info") || url.includes("graphql/query")) {
      try {
        const json = await resp.json();
        if (json?.data?.user) {
          profileData = json.data.user;
        }
      } catch (e) {
        // Not JSON
      }
    }
  });

  await page.goto(`https://www.instagram.com/${username}/`, {
    waitUntil: "networkidle0",
    timeout: 30000,
  });

  // Dismiss login modal (Puppeteer's ::-p-text selector, v19+)
  try {
    const notNow = await page.waitForSelector("::-p-text(Not Now)", {
      timeout: 3000,
    });
    if (notNow) await notNow.click();
  } catch (e) {}

  await new Promise((r) => setTimeout(r, 3000));

  if (!profileData) {
    // Fallback: parse from the page's meta tags
    profileData = await page.evaluate(() => {
      const desc =
        document.querySelector('meta[property="og:description"]')?.content ||
        "";
      const match = desc.match(
        /([\d,.KMB]+) Followers, ([\d,.KMB]+) Following, ([\d,.KMB]+) Posts/i
      );
      return {
        username:
          document
            .querySelector('meta[property="og:title"]')
            ?.content?.split("(")[0]
            .trim() || "",
        description: desc,
        followers_text: match?.[1] || "0",
        following_text: match?.[2] || "0",
        posts_text: match?.[3] || "0",
      };
    });
  }

  const result = {
    username: profileData.username || username,
    full_name: profileData.full_name || null,
    bio: profileData.biography || null,
    external_url: profileData.external_url || null,
    followers:
      profileData.edge_followed_by?.count || profileData.follower_count || 0,
    following:
      profileData.edge_follow?.count || profileData.following_count || 0,
    posts_count:
      profileData.edge_owner_to_timeline_media?.count ||
      profileData.media_count ||
      0,
    is_verified: profileData.is_verified || false,
    is_private: profileData.is_private || false,
    is_business: profileData.is_business_account || false,
    category: profileData.category_name || null,
    profile_pic: profileData.profile_pic_url_hd || null,
  };

  await browser.close();
  return result;
}

// Batch scrape with rate limiting
async function scrapeMultiple(usernames, delayMs = 8000) {
  const results = [];
  for (const username of usernames) {
    try {
      const profile = await scrapeProfile(username);
      results.push(profile);
      console.log(
        `✓ @${profile.username} – ` +
          `${profile.followers.toLocaleString()} followers`
      );
    } catch (err) {
      console.error(`✗ @${username}: ${err.message}`);
      results.push({ username, error: err.message });
    }
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return results;
}

// Usage
const profiles = await scrapeMultiple(["natgeo", "nasa", "nike"]);
console.log(JSON.stringify(profiles, null, 2));
```
Method 4: Web Scraping API (Easiest)
The most reliable approach for production Instagram scraping. A web scraping API handles proxy rotation, login walls, browser rendering, and anti-bot detection – you send a URL and get structured data back.
Using the Mantis API
```python
# One API call – structured Instagram data
import requests

resp = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://www.instagram.com/natgeo/",
        "extract": {
            "username": "Instagram username",
            "full_name": "display name",
            "bio": "profile biography text",
            "followers": "follower count as integer",
            "following": "following count as integer",
            "posts_count": "total number of posts",
            "is_verified": "whether account is verified",
            "is_business": "whether it's a business account",
            "category": "business category if applicable",
            "external_url": "website URL from bio",
            "recent_posts": (
                "array of recent posts with: "
                "caption, likes, comments, image_url, "
                "is_video, timestamp"
            ),
        },
        "render_js": True,
    },
)
profile = resp.json()
print(f"@{profile.get('username')} – "
      f"{profile.get('followers', 0):,} followers")
```
Skip the Login Walls & Blocks
Mantis handles Instagram's anti-bot detection, login popups, proxy rotation, and JavaScript rendering – so you don't have to.
Node.js with Mantis
```javascript
// mantis-instagram.mjs
const resp = await fetch("https://api.mantisapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://www.instagram.com/nike/",
    extract: {
      username: "Instagram handle",
      followers: "follower count as number",
      bio: "profile bio text",
      recent_posts:
        "array of last 12 posts with caption, " +
        "likes, comments, is_video",
    },
    render_js: true,
  }),
});
const profile = await resp.json();
console.log(profile);
```
Beating Instagram's Anti-Bot Detection
Instagram (Meta) has some of the most sophisticated anti-scraping defenses of any social platform. Here's what you're up against:
Instagram's Defense Layers
| Defense | What It Does | Countermeasure |
|---|---|---|
| Login Wall | Redirects to login after a few page views | Cookie rotation, session management |
| Rate Limiting | 429 errors after rapid requests | Rotating residential proxies, long delays |
| Checkpoint Challenges | Phone/email verification prompts | Avoid logged-in scraping, use APIs |
| Device ID Tracking | Fingerprints devices across sessions | Fresh browser contexts, randomized fingerprints |
| IP Reputation | Blocks datacenter IPs and known proxy ranges | Residential or mobile proxies only |
| Browser Fingerprinting | Detects automation via WebDriver, plugins | Stealth plugins (playwright-stealth, puppeteer-extra) |
| API Endpoint Changes | Moves/renames internal API endpoints | Monitor changes, maintain multiple fallbacks |
| GraphQL Query Hash Rotation | Changes query hashes for GraphQL endpoints | Extract hashes from page source dynamically |
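The last row – extracting GraphQL query hashes from the page source – can be sketched with a regex over Instagram's bundled JavaScript. The key names (`queryId`, `query_hash`) below reflect patterns that have historically appeared in the bundles, but treat them as assumptions that need re-verifying against the current page source:

```python
import re

def extract_query_hashes(page_source: str) -> list:
    """Pull 32-character hex query hashes out of inline JS,
    preserving order and de-duplicating."""
    pattern = r'(?:queryId|query_hash)\s*:\s*"([0-9a-f]{32})"'
    return list(dict.fromkeys(re.findall(pattern, page_source)))

# Toy example with a fabricated hash:
html = (
    'a({queryId:"472f257a40c653c64c666ce877d59d2b"});'
    'b({query_hash:"472f257a40c653c64c666ce877d59d2b"})'
)
print(extract_query_hashes(html))  # -> ['472f257a40c653c64c666ce877d59d2b']
```

In practice you would fetch the profile page HTML, locate the linked JS bundles, and run this over each bundle, refreshing the hashes whenever a query starts returning errors.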
Essential Anti-Detection Techniques
```python
# instagram_stealth.py
import random
import time

import requests

PROXY_POOL = [
    # Use RESIDENTIAL proxies only –
    # datacenter IPs are instantly blocked
    "http://user:pass@res-proxy1.example.com:8080",
    "http://user:pass@res-proxy2.example.com:8080",
]

def instagram_delay():
    """Instagram requires longer delays than most sites."""
    base = random.uniform(5, 15)
    # Occasionally take a longer break
    if random.random() < 0.1:
        base += random.uniform(30, 60)
    time.sleep(base)

def is_rate_limited(response) -> bool:
    """Detect Instagram rate limiting."""
    if response.status_code == 429:
        return True
    if response.status_code == 401:
        return True
    if "checkpoint_required" in response.text:
        return True
    if "login" in response.url and "instagram.com" in response.url:
        return True
    return False

def get_fresh_session():
    """Create a session with a residential proxy."""
    session = requests.Session()
    proxy = random.choice(PROXY_POOL)
    session.proxies = {"http": proxy, "https": proxy}
    session.headers.update({
        "User-Agent": random.choice([
            "Mozilla/5.0 (iPhone; CPU iPhone OS 17_5 like Mac OS X) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) "
            "Version/17.5 Mobile/15E148 Safari/604.1",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 Chrome/125.0.0.0 Safari/537.36",
        ]),
        "Accept-Language": "en-US,en;q=0.9",
        "X-IG-App-ID": "936619743392459",
    })
    return session
```
Instagram blocks virtually all datacenter IP ranges. You must use residential or mobile proxies for any meaningful scraping. This is the #1 reason DIY Instagram scrapers fail. A managed API like Mantis handles proxy infrastructure for you.
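When a request does get rate-limited, sleep with exponential backoff plus jitter before retrying on a fresh session. A minimal sketch – the base and cap values here are illustrative, not tuned recommendations:

```python
import random

def backoff_delay(attempt: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# attempt 0 -> up to 10s, attempt 3 -> up to 80s, attempt 5+ -> capped at 300s
for attempt in range(6):
    print(f"attempt {attempt}: sleep up to {min(300.0, 10.0 * 2 ** attempt)}s, "
          f"e.g. {backoff_delay(attempt):.1f}s")
```

Full jitter (randomizing over the whole window rather than adding a small offset) spreads retries out so a fleet of scrapers doesn't hammer Instagram in synchronized waves.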
Instagram Graph API vs Scraping
Meta offers the Instagram Graph API; its other official option, the Basic Display API, was deprecated in December 2024. Here's how they compare:
| Feature | Graph API | Basic Display (Deprecated) | Web Scraping | Mantis API |
|---|---|---|---|---|
| Access | Business/Creator accounts only | Shut down Dec 2024 | Any public profile | Any public profile |
| Requires App Review | Yes (Meta approval) | N/A | No | No |
| Rate Limits | 200 calls/user/hour | N/A | Depends on proxies | Based on plan |
| Discover/Search Profiles | ✗ Only owned accounts | N/A | ✓ Any public profile | ✓ Any public profile |
| Follower Count | ✓ Own account only | N/A | ✓ Any public profile | ✓ Any public profile |
| Post Content | ✓ Own posts only | N/A | ✓ Any public posts | ✓ Any public posts |
| Competitor Data | ✗ | N/A | ✓ | ✓ |
| Hashtag Search | Limited (30 unique/7 days) | N/A | ✓ Unlimited | ✓ Unlimited |
| Reliability | High (official) | N/A | Medium (endpoints change) | High (maintained) |
| Cost | Free (but limited) | N/A | $100-500+/mo (proxies) | $0-299/mo |
Instagram's official API is designed for managing your own account – not for discovering or analyzing other accounts. If you need competitor data, influencer analytics, or market research across multiple profiles, scraping or a web scraping API is your only option.
Method Comparison
| Criteria | Python + Requests | Playwright | Node.js + Puppeteer | Mantis API |
|---|---|---|---|---|
| Setup Time | 5 min | 10 min | 10 min | 2 min |
| JS Rendering | ✗ (API endpoints) | ✓ | ✓ | ✓ |
| Anti-Detection | Low (easily blocked) | Good (with stealth) | Good (with stealth) | Built-in |
| Speed | Fast | Slow | Slow | Medium |
| Maintenance | Very High (endpoints change) | High | High | None |
| Scale | Low | Low-Medium | Low-Medium | High |
| Cost (5K profiles/mo) | $100-300 (proxies) | $200-500 (proxies + compute) | $200-500 (proxies + compute) | $99 (Pro plan) |
| Best For | Quick prototypes | Rich data extraction | Backend services | Production |
Real-World Use Cases
1. Influencer Analytics Dashboard
Build a tool that evaluates influencer accounts โ engagement rate, posting frequency, audience quality โ before sponsorship deals.
```python
# influencer_analytics.py
import requests

MANTIS_KEY = "YOUR_API_KEY"

def analyze_influencer(username: str) -> dict:
    """Calculate engagement metrics for an influencer."""
    resp = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {MANTIS_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": f"https://www.instagram.com/{username}/",
            "extract": {
                "username": "Instagram username",
                "followers": "follower count as integer",
                "following": "following count as integer",
                "posts_count": "total posts as integer",
                "is_verified": "boolean",
                "bio": "biography text",
                "recent_posts": (
                    "array of last 12 posts with: "
                    "likes (integer), comments (integer), "
                    "caption (string), is_video (boolean)"
                ),
            },
            "render_js": True,
        },
    )
    data = resp.json()
    posts = data.get("recent_posts", [])
    followers = data.get("followers", 1)

    if posts and followers > 0:
        total_engagement = sum(
            p.get("likes", 0) + p.get("comments", 0) for p in posts
        )
        avg_engagement = total_engagement / len(posts)
        engagement_rate = (avg_engagement / followers) * 100
        avg_likes = sum(p.get("likes", 0) for p in posts) / len(posts)
        avg_comments = sum(p.get("comments", 0) for p in posts) / len(posts)
        video_ratio = sum(1 for p in posts if p.get("is_video")) / len(posts)
    else:
        engagement_rate = 0
        avg_likes = 0
        avg_comments = 0
        video_ratio = 0

    # Engagement rate benchmarks
    if engagement_rate > 6:
        tier = "Excellent"
    elif engagement_rate > 3:
        tier = "Good"
    elif engagement_rate > 1:
        tier = "Average"
    else:
        tier = "Below Average"

    # Follower/following ratio (health signal)
    ff_ratio = followers / max(data.get("following", 1), 1)

    return {
        "username": username,
        "followers": followers,
        "engagement_rate": round(engagement_rate, 2),
        "engagement_tier": tier,
        "avg_likes": round(avg_likes),
        "avg_comments": round(avg_comments),
        "video_ratio": round(video_ratio * 100, 1),
        "ff_ratio": round(ff_ratio, 1),
        "verified": data.get("is_verified", False),
        "estimated_post_value": (
            f"${round(followers * engagement_rate / 100 * 0.05, 2)}"
        ),
    }

# Evaluate multiple influencers
influencers = ["natgeo", "nike", "airbnb"]
for username in influencers:
    result = analyze_influencer(username)
    print(f"\n@{result['username']}:")
    print(f"  Followers: {result['followers']:,}")
    print(f"  Engagement: {result['engagement_rate']}% "
          f"({result['engagement_tier']})")
    print(f"  Avg Likes: {result['avg_likes']:,}")
    print(f"  Video Mix: {result['video_ratio']}%")
    print(f"  Est. Post Value: {result['estimated_post_value']}")
```
2. Brand Mention Monitor
Track when your brand is mentioned in Instagram posts and hashtags – essential for reputation management and identifying UGC opportunities.
```python
# brand_monitor.py
import json
from datetime import datetime, timezone

import requests

MANTIS_KEY = "YOUR_API_KEY"
BRAND_HASHTAGS = ["mantisapi", "mantis_api", "webscrapingapi"]
COMPETITOR_ACCOUNTS = ["scrapingbee", "brightdata", "apify_official"]

def monitor_hashtag(tag: str) -> dict:
    """Check a hashtag for recent brand mentions."""
    resp = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {MANTIS_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": f"https://www.instagram.com/explore/tags/{tag}/",
            "extract": {
                "post_count": "total posts with this hashtag",
                "top_posts": (
                    "array of top 9 posts with: "
                    "author, caption, likes, comments, post_url"
                ),
            },
            "render_js": True,
        },
    )
    return {"hashtag": tag, **resp.json()}

def monitor_competitor(username: str) -> dict:
    """Check a competitor's recent posts."""
    resp = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {MANTIS_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": f"https://www.instagram.com/{username}/",
            "extract": {
                "followers": "follower count",
                "recent_posts": "last 6 posts: caption, likes, comments",
            },
            "render_js": True,
        },
    )
    return {"competitor": username, **resp.json()}

# Daily monitoring run
now = datetime.now(timezone.utc)
report = {
    "timestamp": now.isoformat(),
    "hashtags": [monitor_hashtag(tag) for tag in BRAND_HASHTAGS],
    "competitors": [
        monitor_competitor(acc) for acc in COMPETITOR_ACCOUNTS
    ],
}

# Save the daily report
date = now.strftime("%Y-%m-%d")
with open(f"instagram_report_{date}.json", "w") as f:
    json.dump(report, f, indent=2)
print(f"Report saved: instagram_report_{date}.json")
```
3. AI Agent Social Intelligence
Give an AI agent the ability to research brands, influencers, and trends on Instagram – a key capability for marketing AI assistants.
```python
# agent_instagram.py – LangChain tool
import requests
from langchain.tools import tool

MANTIS_KEY = "YOUR_API_KEY"

@tool
def research_instagram_account(username: str) -> str:
    """Research an Instagram account. Returns profile info,
    engagement metrics, and recent post activity."""
    resp = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {MANTIS_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": f"https://www.instagram.com/{username}/",
            "extract": {
                "full_name": "display name",
                "bio": "biography",
                "followers": "follower count",
                "following": "following count",
                "posts_count": "total posts",
                "is_verified": "verified status",
                "category": "business category",
                "website": "external URL",
                "recent_posts": (
                    "last 6 posts: caption (first 100 chars), "
                    "likes, comments, is_video"
                ),
            },
            "render_js": True,
        },
    )
    data = resp.json()
    posts = data.get("recent_posts", [])
    followers = data.get("followers", 0)

    if posts and followers:
        avg_eng = sum(
            p.get("likes", 0) + p.get("comments", 0) for p in posts
        ) / len(posts)
        eng_rate = round((avg_eng / followers) * 100, 2)
    else:
        eng_rate = "N/A"

    result = f"Instagram Profile: @{username}\n"
    result += f"Name: {data.get('full_name', 'N/A')}\n"
    result += f"Bio: {data.get('bio', 'N/A')}\n"
    result += (
        f"Followers: {data.get('followers', 0):,} | "
        f"Following: {data.get('following', 0):,}\n"
    )
    result += f"Posts: {data.get('posts_count', 0):,}\n"
    result += f"Verified: {data.get('is_verified', False)}\n"
    result += f"Category: {data.get('category', 'N/A')}\n"
    result += f"Website: {data.get('website', 'N/A')}\n"
    result += f"Engagement Rate: {eng_rate}%\n\n"

    if posts:
        result += "Recent Posts:\n"
        for i, p in enumerate(posts[:3], 1):
            caption = (
                p.get("caption", "")[:80] + "..."
                if p.get("caption") else "[no caption]"
            )
            result += (
                f"  {i}. {caption}\n"
                f"     ❤️ {p.get('likes', 0):,} | "
                f"💬 {p.get('comments', 0):,}\n"
            )
    return result

# Use in a LangChain agent
# agent = create_agent(
#     tools=[research_instagram_account], ...
# )
```
Legal Considerations
Instagram scraping carries more legal risk than most platforms due to Meta's aggressive enforcement. Key considerations:
- Meta's Terms of Service – Explicitly prohibit automated data collection. Meta has filed lawsuits against multiple scraping companies (including a $500M+ judgment against Voyager Labs in 2023)
- hiQ Labs v. LinkedIn (2022) – The Ninth Circuit ruled that scraping public data doesn't violate the CFAA, but this case involved LinkedIn, not Meta, and Meta's ToS are more aggressive
- Van Buren v. United States (2021) – The Supreme Court narrowed the CFAA's "exceeds authorized access" clause, which supports scraping public data but doesn't address contract/ToS claims
- Meta v. Voyager Labs (2023) – Meta won a massive judgment against a company that created fake accounts to scrape Instagram. Creating fake accounts crosses clear legal lines
- GDPR (EU) – Instagram profiles contain personal data (names, photos, locations). Scraping EU user data for commercial purposes without consent likely violates the GDPR
- CCPA (California) – Similar obligations apply to California residents' personal data
- Photos & Videos – Visual content is copyrighted by its creator. Scraping metadata is different from republishing content
Only scrape publicly visible profile data (don't create fake accounts). Don't store or republish photos/videos. Respect private accounts. Don't scrape personal data of EU residents for commercial use without a legal basis. Use rate limiting to avoid disrupting the service. Consult a lawyer for commercial use cases. Consider a web scraping API that handles compliance considerations.
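The "use rate limiting" advice above can be made concrete with a simple token-bucket throttle. The class below is a generic sketch; the capacity and refill values in the usage line are illustrative, not recommended limits for Instagram:

```python
import time

class Throttle:
    """Token bucket: allow at most `rate` requests per `per` seconds.
    The clock is injectable so the bucket can be tested deterministically."""

    def __init__(self, rate: int, per: float, clock=time.monotonic):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_per_sec = rate / per
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill tokens based on time elapsed, then spend one if available
        now = self.clock()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Illustrative: at most 10 profile requests per minute
throttle = Throttle(rate=10, per=60)
allowed = sum(throttle.allow() for _ in range(15))
print(f"{allowed} of 15 burst requests allowed")
```

Wrap each scrape call in `if throttle.allow():` and sleep otherwise; the bucket smooths bursts without needing a background scheduler.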
Production-Ready Instagram Scraping
Stop fighting login walls, proxy blocks, and endpoint changes. Mantis extracts structured Instagram data with a single API call.
Frequently Asked Questions
Is it legal to scrape Instagram data?
Scraping publicly available Instagram data is in a legal gray area. While hiQ v. LinkedIn supports scraping public data, Meta aggressively enforces its ToS and has sued scrapers. Avoid creating fake accounts, respect private profiles, and consult legal counsel for commercial use.
How do I scrape Instagram without getting blocked?
Use rotating residential proxies (datacenter IPs are instantly blocked), headless browsers with stealth plugins, delays of 5-20 seconds between requests, and fresh browser contexts. Or use a web scraping API like Mantis that handles anti-blocking automatically.
Can I use Instagram's official API instead of scraping?
The Instagram Graph API only works with accounts you own or that authorize your app, requires Meta app review, and doesn't support discovering or analyzing other public profiles. The Basic Display API was deprecated in December 2024.
What data can I extract from Instagram?
From public profiles: username, full name, bio, follower/following counts, post count, profile picture, recent posts with captions/likes/comments, reels, tagged locations, and hashtags. Private accounts only show basic profile info.
What Python library is best for scraping Instagram?
Playwright with stealth plugins is most reliable since Instagram requires JS rendering. For quick prototypes, the web_profile_info API endpoint works but changes frequently. For production, use a managed API like Mantis.
How many Instagram profiles can I scrape per day?
Without proxies: 20-50 before blocking. With residential proxies: 500-2,000. With Mantis API: up to 100,000/month on the Scale plan without managing any infrastructure.
Related Guides
- How to Scrape Google Search Results in 2026
- How to Scrape Amazon Product Data in 2026
- How to Scrape LinkedIn Profiles & Jobs in 2026
- How to Scrape Twitter (X) Data in 2026
- Web Scraping with Python: Complete Guide
- Web Scraping with Node.js: Complete Guide
- Anti-Blocking Web Scraping Guide
- Best Web Scraping APIs in 2026