Facebook remains the world's largest social network, with over 3 billion monthly active users. Despite declining engagement among younger demographics, it is still the dominant platform for many businesses and communities.
But since the Cambridge Analytica scandal in 2018, Meta has locked down data access more aggressively than any other major platform. The Graph API is nearly useless for third-party data extraction, making web scraping the only viable approach for competitive intelligence and market research.
After Cambridge Analytica, Meta gutted the Graph API. Here's what you can and can't do with the official API:
| What You Want | Graph API Support | Reality |
|---|---|---|
| Search public pages | ❌ Removed | Page search API deprecated in 2019 |
| Read other pages' posts | ❌ Removed | Only your own pages via Page Access Token |
| Read group posts | ❌ Removed | Only groups your app is installed in |
| User profiles | ❌ Removed | Only the authenticated user's own data |
| Comments on others' posts | ❌ Removed | Only comments on your own page's posts |
| Your own page insights | ✅ Available | Analytics for pages you manage |
| Post to your page | ✅ Available | Publishing to pages you manage |
| Ad Library | ✅ Available | Public ad transparency data |
Here are four approaches to extract data from Facebook, from lightweight HTML parsing to production-ready API solutions:
Facebook's mobile basic site (mbasic.facebook.com) serves simple HTML without JavaScript — perfect for lightweight scraping. It's designed for low-bandwidth connections and older devices, which means minimal anti-bot protections compared to the main site.
pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import time
import re
import json
from datetime import datetime
class FacebookScraper:
"""Scrape Facebook using mbasic.facebook.com (simplified HTML)."""
BASE_URL = "https://mbasic.facebook.com"
def __init__(self):
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/16.6 Mobile/15E148 Safari/604.1",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
})
self.last_request = 0
def _rate_limit(self, delay=8):
"""Respect rate limits — Facebook is aggressive about blocking."""
elapsed = time.time() - self.last_request
if elapsed < delay:
time.sleep(delay - elapsed)
self.last_request = time.time()
def get_page_info(self, page_name):
"""Fetch basic info about a public Facebook page."""
self._rate_limit()
url = f"{self.BASE_URL}/{page_name}"
resp = self.session.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Extract page name from title
title = soup.find("title")
page_title = title.text.strip() if title else page_name
# Extract profile info
profile_info = {}
info_section = soup.find("div", {"id": "pages_mbasic_header_top"})
if info_section:
profile_info["name"] = info_section.get_text(strip=True)
# Look for likes/followers count
likes_elem = soup.find(string=re.compile(r"[\d,.]+ (likes?|followers?)", re.I))
if likes_elem:
profile_info["engagement_text"] = likes_elem.strip()
return {
"page_name": page_name,
"title": page_title,
"url": f"https://facebook.com/{page_name}",
"info": profile_info,
}
def get_page_posts(self, page_name, max_posts=20):
"""Scrape posts from a public Facebook page."""
self._rate_limit()
url = f"{self.BASE_URL}/{page_name}"
resp = self.session.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
posts = []
# mbasic posts are in article tags or specific div structures
post_elements = soup.find_all("div", {"role": "article"})
if not post_elements:
# Fallback: look for story containers
post_elements = soup.find_all("div", class_=re.compile("story"))
for elem in post_elements[:max_posts]:
post = self._parse_post(elem)
if post and post.get("text"):
posts.append(post)
return posts
def _parse_post(self, elem):
"""Parse a single post element from mbasic HTML."""
post = {}
# Extract post text
text_parts = []
for p in elem.find_all(["p", "span"]):
text = p.get_text(strip=True)
if text and len(text) > 10:
text_parts.append(text)
post["text"] = " ".join(text_parts)[:1000] if text_parts else ""
# Extract links
links = []
for a in elem.find_all("a", href=True):
href = a["href"]
if "facebook.com" not in href and href.startswith("http"):
links.append(href)
post["links"] = links
# Extract timestamp (mbasic uses abbr tags for timestamps)
time_elem = elem.find("abbr")
if time_elem:
post["timestamp"] = time_elem.get_text(strip=True)
# Extract engagement (likes, comments, shares text)
footer = elem.find("footer") or elem.find(
"div", class_=re.compile("footer|action")
)
if footer:
footer_text = footer.get_text(" ", strip=True)
# Extract like count
likes_match = re.search(r"([\d,.]+)\s*(likes?|reactions?)", footer_text, re.I)
if likes_match:
post["likes"] = likes_match.group(1)
comments_match = re.search(r"([\d,.]+)\s*comments?", footer_text, re.I)
if comments_match:
post["comments_count"] = comments_match.group(1)
shares_match = re.search(r"([\d,.]+)\s*shares?", footer_text, re.I)
if shares_match:
post["shares"] = shares_match.group(1)
# Extract image if present
img = elem.find("img", src=re.compile("scontent"))
if img:
post["image_url"] = img.get("src", "")
return post
def get_page_reviews(self, page_name, max_reviews=20):
"""Scrape reviews from a Facebook business page."""
self._rate_limit()
url = f"{self.BASE_URL}/{page_name}/reviews"
resp = self.session.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
reviews = []
review_elements = soup.find_all("div", {"role": "article"})
for elem in review_elements[:max_reviews]:
review = {}
# Extract reviewer name
author = elem.find("strong") or elem.find("h3")
if author:
review["author"] = author.get_text(strip=True)
# Extract review text
text = elem.get_text(" ", strip=True)
review["text"] = text[:500]
# Look for star ratings
stars = elem.find_all("img", alt=re.compile("star", re.I))
if stars:
review["rating"] = len(stars)
if review.get("text"):
reviews.append(review)
return reviews
def search_pages(self, query, max_results=10):
"""Search for Facebook pages (limited on mbasic)."""
self._rate_limit()
url = f"{self.BASE_URL}/search/pages/"
params = {"q": query}
resp = self.session.get(url, params=params)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
results = []
for link in soup.find_all("a", href=re.compile("/pages/")):
name = link.get_text(strip=True)
href = link["href"]
if name and len(name) > 2:
results.append({
"name": name,
"url": f"https://facebook.com{href}",
})
return results[:max_results]
# Usage
scraper = FacebookScraper()
# Get page info
page_info = scraper.get_page_info("TechCrunch")
print(f"Page: {page_info['title']}")
print(f"URL: {page_info['url']}")
# Get recent posts
posts = scraper.get_page_posts("TechCrunch", max_posts=5)
for i, post in enumerate(posts, 1):
print(f"\n--- Post {i} ---")
print(f"Text: {post['text'][:150]}...")
if post.get("likes"):
print(f"Likes: {post['likes']}")
if post.get("timestamp"):
print(f"Time: {post['timestamp']}")
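The fixed delay in `_rate_limit` produces evenly spaced requests. One possible variation, sketched below, randomizes the wait between a minimum and maximum so the timing looks less mechanical. The class name `JitteredRateLimiter`, its parameters, and the 8 to 15 second defaults are assumptions for illustration, not part of the scraper above or of any library.

```python
import random
import time


class JitteredRateLimiter:
    """Wait a randomized interval between requests (hypothetical helper)."""

    def __init__(self, min_delay=8.0, max_delay=15.0,
                 clock=time.monotonic, sleep=time.sleep):
        if min_delay > max_delay:
            raise ValueError("min_delay must not exceed max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock      # injectable so the timing can be tested
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until a random delay has passed since the last call.

        Returns the number of seconds actually slept.
        """
        delay = random.uniform(self.min_delay, self.max_delay)
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < delay:
                slept = delay - elapsed
                self._sleep(slept)
        self._last_request = self._clock()
        return slept
```

A scraper could call `limiter.wait()` before each `session.get(...)`. The first call returns immediately, matching the behavior of the original `_rate_limit`.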
def scrape_all_posts(scraper, page_name, max_posts=50):
    """Scrape multiple pages of Facebook posts using mbasic pagination."""
    all_posts = []
    url = f"{scraper.BASE_URL}/{page_name}"
    while len(all_posts) < max_posts:
        scraper._rate_limit(delay=10)  # Extra cautious with Facebook
        resp = scraper.session.get(url)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Parse posts on this page
        post_elements = soup.find_all("div", {"role": "article"})
        if not post_elements:
            post_elements = soup.find_all("div", class_=re.compile("story"))
        new_posts = 0
        for elem in post_elements:
            post = scraper._parse_post(elem)
            if post and post.get("text"):
                all_posts.append(post)
                new_posts += 1
        if new_posts == 0:
            break
        print(f"Fetched {len(all_posts)} posts so far...")
        # Find "See more posts" link for pagination
        more_link = soup.find("a", string=re.compile("See More|Show More|Older", re.I))
        if more_link and more_link.get("href"):
            next_url = more_link["href"]
            if not next_url.startswith("http"):
                url = f"{scraper.BASE_URL}{next_url}"
            else:
                url = next_url
        else:
            break
    return all_posts[:max_posts]

# Scrape 50 posts from a page
all_posts = scrape_all_posts(scraper, "TechCrunch", max_posts=50)
print(f"\nTotal posts scraped: {len(all_posts)}")
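The mbasic scraper returns like, comment, and share counts as raw strings such as `1,234` or `1.2K`. A minimal sketch of a normalizer follows. The helper name `parse_count` is hypothetical. It assumes that a number without a suffix uses `,` or `.` as thousands separators, and that a number with a `K`/`M`/`B` suffix uses them as a decimal mark. Other locales may format counts differently.

```python
import re

_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
_COUNT_RE = re.compile(r"(\d+(?:[.,]\d+)*)\s*([kmb])?\b", re.I)


def parse_count(text):
    """Convert '1,234', '1.2K' or '3M likes' to an int; None if no number."""
    if not text:
        return None
    match = _COUNT_RE.search(text)
    if not match:
        return None
    number = match.group(1)
    suffix = (match.group(2) or "").lower()
    if suffix:
        # Assumption: with a suffix, ',' or '.' is a decimal mark ('1.2K').
        try:
            value = float(number.replace(",", ".")) * _SUFFIXES[suffix]
        except ValueError:
            return None  # e.g. '1,234.5K', which this sketch does not handle
        return int(round(value))
    # Assumption: without a suffix, ',' and '.' are thousands separators.
    return int(re.sub(r"[.,]", "", number))
```

Applied to the scraper output, `parse_count(post.get("likes"))` yields a number that can be summed or averaged directly.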
For the full Facebook experience — JavaScript rendering, infinite scroll, and dynamic content loading — Playwright with stealth configuration is your best option.
pip install playwright playwright-stealth
playwright install chromium
import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import json
import re
class FacebookPlaywrightScraper:
"""Scrape Facebook using Playwright for full JS rendering."""
async def scrape_page_posts(self, page_name, max_posts=20, scroll_count=5):
"""Scrape posts from a public Facebook page with full rendering."""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
]
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/121.0.0.0 Safari/537.36",
locale="en-US",
)
page = await context.new_page()
await stealth_async(page)
# Navigate to the Facebook page
url = f"https://www.facebook.com/{page_name}"
await page.goto(url, wait_until="networkidle", timeout=30000)
# Handle cookie consent / login popup
try:
close_btn = page.locator('[aria-label="Close"]').first
if await close_btn.is_visible(timeout=3000):
await close_btn.click()
except Exception:
pass
# Dismiss login modal if it appears
try:
not_now = page.locator('text="Not Now"').first
if await not_now.is_visible(timeout=3000):
await not_now.click()
except Exception:
pass
# Scroll to load more posts
for i in range(scroll_count):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2 + (i * 0.5)) # Progressive delay
# Extract posts from the page
posts = await page.evaluate("""() => {
const posts = [];
const postElements = document.querySelectorAll(
'[role="article"], [data-ad-preview="message"]'
);
postElements.forEach(elem => {
const textElem = elem.querySelector(
'[data-ad-preview="message"], [data-ad-comet-preview="message"]'
);
const text = textElem ? textElem.innerText : '';
// Extract engagement metrics
const likeSpan = elem.querySelector(
'[aria-label*="reaction"], [aria-label*="like"]'
);
const likes = likeSpan ? likeSpan.getAttribute('aria-label') : '';
const commentLink = elem.querySelector('a[href*="comment"]');
const comments = commentLink ? commentLink.innerText : '';
const shareSpan = elem.querySelector(
'span[class*="share"], a[href*="shares"]'
);
const shares = shareSpan ? shareSpan.innerText : '';
// Extract image
const img = elem.querySelector(
'img[src*="scontent"], img[data-visualcompletion]'
);
const imageUrl = img ? img.src : '';
// Extract timestamp
const timeLink = elem.querySelector('a[href*="/posts/"] span');
const timestamp = timeLink ? timeLink.innerText : '';
if (text && text.length > 10) {
posts.push({
text: text.slice(0, 1000),
likes,
comments,
shares,
imageUrl,
timestamp,
});
}
});
return posts;
}""")
await browser.close()
return posts[:max_posts]
async def scrape_ad_library(self, page_name, country="US"):
"""Scrape Facebook Ad Library for a page's active ads."""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 Chrome/121.0.0.0 Safari/537.36",
)
page = await context.new_page()
# Facebook Ad Library is publicly accessible
url = (
f"https://www.facebook.com/ads/library/"
f"?active_status=active&ad_type=all"
f"&country={country}&q={page_name}"
)
await page.goto(url, wait_until="networkidle", timeout=30000)
await asyncio.sleep(3)
# Scroll to load ads
for _ in range(3):
await page.evaluate(
"window.scrollTo(0, document.body.scrollHeight)"
)
await asyncio.sleep(2)
# Extract ad data
ads = await page.evaluate("""() => {
const ads = [];
const adCards = document.querySelectorAll(
'[class*="ad-card"], div[role="article"]'
);
adCards.forEach(card => {
const text = card.innerText;
const img = card.querySelector('img[src*="scontent"]');
const link = card.querySelector('a[href*="facebook.com"]');
ads.push({
text: text.slice(0, 500),
imageUrl: img ? img.src : '',
link: link ? link.href : '',
});
});
return ads;
}""")
await browser.close()
return ads
# Usage
async def main():
scraper = FacebookPlaywrightScraper()
# Scrape page posts
print("--- Scraping Facebook Page Posts ---")
posts = await scraper.scrape_page_posts("TechCrunch", max_posts=10)
for i, post in enumerate(posts, 1):
print(f"\nPost {i}: {post['text'][:120]}...")
if post.get("likes"):
print(f" Reactions: {post['likes']}")
# Scrape Ad Library
print("\n--- Scraping Ad Library ---")
ads = await scraper.scrape_ad_library("TechCrunch")
print(f"Found {len(ads)} active ads")
asyncio.run(main())
Node.js with Puppeteer and the stealth plugin is excellent for intercepting Facebook's internal GraphQL API calls — giving you structured JSON data directly from Facebook's backend.
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
class FacebookPuppeteerScraper {
async scrapePagePosts(pageName, maxPosts = 20, scrollCount = 5) {
const browser = await puppeteer.launch({
headless: 'new',
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-blink-features=AutomationControlled',
],
});
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
// Intercept GraphQL responses for structured data
const graphqlResponses = [];
page.on('response', async (response) => {
const url = response.url();
if (url.includes('graphql')) {
try {
const json = await response.json();
graphqlResponses.push(json);
} catch (e) {
// Not JSON, skip
}
}
});
// Navigate to page
const url = `https://www.facebook.com/${pageName}`;
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
// Handle popups
try {
const closeBtn = await page.$('[aria-label="Close"]');
if (closeBtn) await closeBtn.click();
} catch (e) {}
try {
const notNow = await page.$x('//div[text()="Not Now" or text()="Not now"]');
if (notNow.length) await notNow[0].click();
} catch (e) {}
// Scroll to load posts
for (let i = 0; i < scrollCount; i++) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await new Promise(r => setTimeout(r, 2000 + i * 500));
}
// Extract posts from DOM
const posts = await page.evaluate((max) => {
const results = [];
const articles = document.querySelectorAll('[role="article"]');
articles.forEach(article => {
if (results.length >= max) return;
const textEl = article.querySelector(
'[data-ad-preview="message"], [data-ad-comet-preview="message"]'
);
const text = textEl ? textEl.innerText : '';
const reactionLabel = article.querySelector(
'[aria-label*="reaction"], [aria-label*="like"]'
);
const reactions = reactionLabel
? reactionLabel.getAttribute('aria-label')
: '';
const commentEl = article.querySelector('a[href*="comment"]');
const comments = commentEl ? commentEl.innerText : '';
const timeEl = article.querySelector('a[href*="/posts/"] span');
const timestamp = timeEl ? timeEl.innerText : '';
const imgEl = article.querySelector('img[src*="scontent"]');
const imageUrl = imgEl ? imgEl.src : '';
if (text && text.length > 10) {
results.push({
text: text.slice(0, 1000),
reactions,
comments,
timestamp,
imageUrl,
});
}
});
return results;
}, maxPosts);
// Extract structured data from intercepted GraphQL responses
const structuredPosts = this.parseGraphQLPosts(graphqlResponses);
await browser.close();
// Merge DOM posts with GraphQL data
return {
domPosts: posts,
graphqlPosts: structuredPosts,
totalGraphQLResponses: graphqlResponses.length,
};
}
parseGraphQLPosts(responses) {
const posts = [];
for (const resp of responses) {
try {
// Facebook's GraphQL responses have varying structures
const data = resp?.data;
if (!data) continue;
// Look for timeline feed data
const edges = this.findNestedKey(data, 'edges') || [];
for (const edge of edges) {
const node = edge?.node;
if (!node) continue;
const story = node?.comet_sections?.content?.story;
if (!story) continue;
const message = story?.message?.text || '';
const createdTime = node?.created_time;
const feedbackSummary = node?.comet_sections?.feedback?.story?.feedback_context;
if (message) {
posts.push({
text: message,
createdTime,
reactions: feedbackSummary?.reaction_count?.count || 0,
comments: feedbackSummary?.comment_count?.total_count || 0,
shares: feedbackSummary?.share_count?.count || 0,
});
}
}
} catch (e) {
// GraphQL structure varies, skip unparseable responses
}
}
return posts;
}
findNestedKey(obj, key) {
if (!obj || typeof obj !== 'object') return null;
if (obj[key]) return obj[key];
for (const k of Object.keys(obj)) {
const result = this.findNestedKey(obj[k], key);
if (result) return result;
}
return null;
}
}
// Usage
(async () => {
const scraper = new FacebookPuppeteerScraper();
console.log('--- Scraping Facebook Page ---');
const result = await scraper.scrapePagePosts('TechCrunch', 10);
console.log(`\nDOM Posts: ${result.domPosts.length}`);
result.domPosts.slice(0, 3).forEach((post, i) => {
console.log(`\n Post ${i + 1}: ${post.text.slice(0, 100)}...`);
if (post.reactions) console.log(` Reactions: ${post.reactions}`);
});
console.log(`\nGraphQL Posts: ${result.graphqlPosts.length}`);
result.graphqlPosts.slice(0, 3).forEach((post, i) => {
console.log(`\n Post ${i + 1}: ${post.text.slice(0, 100)}...`);
console.log(` Reactions: ${post.reactions} | Comments: ${post.comments} | Shares: ${post.shares}`);
});
console.log(`\nTotal GraphQL responses intercepted: ${result.totalGraphQLResponses}`);
})();
For production applications, Mantis provides the most reliable way to extract Facebook data. One API call handles rendering, login wall bypassing, proxy rotation, and structured data extraction — without maintaining browser infrastructure.
import requests

# Scrape a Facebook page
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://www.facebook.com/TechCrunch",
        "render_js": True,
        "wait_for": "[role='article']",
        "scroll_count": 3,
        "extract": {
            "posts": {
                "_selector": "[role='article']",
                "_type": "list",
                "text": "[data-ad-preview='message']",
                "reactions": "[aria-label*='reaction']::attr(aria-label)",
                "comments": "a[href*='comment']",
                "timestamp": "a[href*='/posts/'] span",
            }
        }
    }
)
data = response.json()
for post in data["extracted"]["posts"][:10]:
    print(f"Post: {post['text'][:100]}...")
    print(f" Reactions: {post['reactions']}")
    print(f" Comments: {post['comments']}")
    print()
# Scrape competitor ads from Facebook Ad Library
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://www.facebook.com/ads/library/?active_status=active"
               "&ad_type=all&country=US&q=competitor_name",
        "render_js": True,
        "scroll_count": 5,
        "extract": {
            "ads": {
                "_selector": "[role='article'], [class*='ad-card']",
                "_type": "list",
                "creative_text": "div[class*='_7jyr']",
                "started": "span:has-text('Started running')",
                "platform": "span:has-text('Facebook'), span:has-text('Instagram')",
            }
        }
    }
)
data = response.json()
print(f"Found {len(data['extracted']['ads'])} active ads")
for ad in data["extracted"]["ads"][:5]:
    print(f" Ad: {ad['creative_text'][:100]}...")
    print(f" Started: {ad['started']} | Platforms: {ad['platform']}")
    print()
Extract page posts, ad library data, and engagement metrics with a single API call. No browser infrastructure, no proxy management, no login walls.
Facebook has the most sophisticated anti-scraping defenses of any major platform. Understanding them is critical:
Facebook aggressively pushes login modals on every page. Even public pages show a login overlay after a few seconds of browsing. The mbasic.facebook.com version is less aggressive, but the main site requires constant popup dismissal. This is Meta's primary anti-scraping strategy — force authentication to track and control access.
Facebook collects extensive browser fingerprints: canvas rendering, WebGL hashes, audio context, installed fonts, screen resolution, timezone, language, and dozens of other signals. Headless browsers have detectable fingerprint anomalies — use stealth plugins (playwright-stealth, puppeteer-extra-plugin-stealth) to mitigate this.
Facebook maintains an extensive IP reputation database. Datacenter IPs (AWS, GCP, Azure, DigitalOcean) are typically pre-blocked or severely throttled. Residential proxies work better but must be rotated carefully — Facebook tracks behavioral patterns per IP and flags unusual access patterns.
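Careful rotation can be sketched as a round-robin pool that also enforces a rest period per proxy, so no single IP is reused too quickly. The `ProxyPool` class, the 60-second default cooldown, and the example proxy addresses are all assumptions for illustration. Real residential proxy providers usually expose their own rotation endpoints.

```python
import itertools
import time


class ProxyPool:
    """Round-robin proxy rotation with a per-proxy cooldown (illustrative)."""

    def __init__(self, proxies, cooldown=60.0, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._cooldown = cooldown
        self._clock = clock              # injectable for testing
        self._last_used = {}             # proxy -> time of last use
        self._cycle = itertools.cycle(self._proxies)

    def acquire(self):
        """Return the next proxy that is off cooldown, or None if all rest."""
        now = self._clock()
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            last = self._last_used.get(proxy)
            if last is None or now - last >= self._cooldown:
                self._last_used[proxy] = now
                return proxy
        return None
```

With `requests`, the returned value can be passed as `proxies={"http": proxy, "https": proxy}`. When `acquire()` returns `None`, the caller should wait rather than reuse a proxy that is still resting.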
Suspicious sessions trigger "checkpoint" challenges — CAPTCHA, phone verification, photo identification, or "Is this you?" prompts. These are nearly impossible to automate and effectively block any scraping session that triggers them.
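Since a checkpoint or login redirect effectively ends a session, it helps to detect one early and stop instead of parsing the wrong page. The sketch below is a heuristic only. It assumes that blocked requests end up on a URL path starting with `/checkpoint` or `/login`, or on a page containing a login form with `email` and `pass` fields. The function name `classify_response` is hypothetical, and the markers may change.

```python
from urllib.parse import urlparse


def classify_response(final_url, html=""):
    """Return 'checkpoint', 'login_wall', or 'ok' for a fetched page.

    final_url is the URL after redirects (e.g. resp.url with requests).
    """
    path = urlparse(final_url).path.lower()
    if path.startswith("/checkpoint"):
        return "checkpoint"
    if path.startswith("/login"):
        return "login_wall"
    # Fallback heuristic: a login form rendered in place of the content.
    if 'name="email"' in html and 'name="pass"' in html:
        return "login_wall"
    return "ok"
```

A scraper could call this after every fetch and pause or switch identity when the result is not `"ok"`. Because the path check is prefix-based, a public page whose name begins with "login" would be misclassified, so treat the result as a hint.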
Facebook's internal API uses GraphQL with hashed query identifiers (doc_id). These hashes change frequently during deployments, breaking any scraper that relies on intercepting specific GraphQL queries. You need to dynamically discover the current hashes or rely on DOM-based extraction.
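When the response shape shifts between deployments, one way to reduce breakage is to search the JSON for a key anywhere in the tree instead of following a fixed path. The sketch below assumes that post text lives under a `message` object with a `text` field, as in the Puppeteer parser in this guide. The helper names `find_all` and `extract_messages` are illustrative.

```python
def find_all(obj, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, value in obj.items():
            if k == key:
                yield value
            yield from find_all(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_all(item, key)


def extract_messages(payload):
    """Collect non-empty message texts regardless of where they are nested."""
    return [
        message["text"]
        for message in find_all(payload, "message")
        if isinstance(message, dict) and message.get("text")
    ]
```

Unlike a first-match lookup, this collects every occurrence, so it keeps working if the feed is wrapped in an extra layer. It can also pick up unrelated `message` fields, which is the trade-off for not depending on a fixed path.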
Facebook obfuscates CSS class names with randomly generated strings (e.g., x1lliihq x6ikm8r x10wlt62). These classes change between deployments, making CSS selector-based scraping fragile. Use semantic selectors like [role="article"] and [aria-label] instead.
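To illustrate selecting by semantics rather than class names, here is a small sketch that collects `aria-label` values inside each `role="article"` element and ignores `class` attributes entirely. It uses only the standard library's `html.parser` to stay dependency-free. The class `ArticleLabelParser` and the sample markup in the usage note are made up for the example.

```python
from html.parser import HTMLParser


class ArticleLabelParser(HTMLParser):
    """Collect aria-label values found inside role="article" elements."""

    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.articles = []   # one list of labels per top-level article
        self._depth = 0      # nesting depth inside the current article

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self._depth == 0:
            if attr_map.get("role") != "article":
                return       # ignore everything outside an article
            self.articles.append([])
            self._depth = 1
        elif tag not in self.VOID_TAGS:
            self._depth += 1
        label = attr_map.get("aria-label")
        if label:
            self.articles[-1].append(label)

    def handle_endtag(self, tag):
        if self._depth and tag not in self.VOID_TAGS:
            self._depth -= 1


def labels_by_article(html):
    """Return a list of aria-label lists, one per top-level article."""
    parser = ArticleLabelParser()
    parser.feed(html)
    parser.close()
    return parser.articles
```

For markup such as `<div role="article" class="x1lliihq"><span aria-label="12 reactions"></span></div>`, the result does not depend on the class string, so it is unaffected when the obfuscated names change. Nested articles (for example, comments inside a post) are folded into their parent in this sketch.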
| Data Type | Fields | Auth Required? |
|---|---|---|
| Public Pages | Name, category, about, followers, likes, posts, contact info, hours, location | No (with mbasic) |
| Page Posts | Text, images, videos, reactions, comments count, shares, timestamp, links | Partial |
| Comments | Text, author name, reactions, replies, timestamp | Partial |
| Public Groups | Name, description, member count, post previews (limited without login) | Yes (most content) |
| Events | Title, date, location, description, attendance count, organizer | Partial |
| Marketplace | Listing title, price, location, images, seller info, description | Yes |
| Ad Library | Ad creative, start date, platforms, estimated spend, page name, status | No (public) |
| Reviews | Rating, text, author, date | Partial |
Track mentions, sentiment, and engagement across competitor and brand Facebook pages. Facebook reviews and page comments are brutally honest — making them valuable for understanding real customer sentiment.
import requests
from bs4 import BeautifulSoup
import time
import re
from collections import Counter
class BrandReputationMonitor:
"""Monitor brand reputation across Facebook pages."""
def __init__(self):
self.scraper = FacebookScraper() # From Method 1
def monitor_brand(self, brand_pages):
"""Monitor multiple brand pages for reputation signals."""
report = {}
for page_name, brand_name in brand_pages.items():
print(f"Monitoring: {brand_name} ({page_name})...")
# Get page info
info = self.scraper.get_page_info(page_name)
# Get recent posts
posts = self.scraper.get_page_posts(page_name, max_posts=10)
# Get reviews if available
reviews = self.scraper.get_page_reviews(page_name, max_reviews=10)
# Analyze sentiment
all_text = " ".join([p.get("text", "") for p in posts])
# Leading space keeps the last post and the first review from fusing into one word
all_text += " " + " ".join([r.get("text", "") for r in reviews])
sentiment = self._analyze_sentiment(all_text)
# Calculate engagement metrics
total_likes = 0
total_comments = 0
for post in posts:
try:
likes_str = post.get("likes", "0").replace(",", "")
total_likes += int(re.search(r"\d+", likes_str).group()) if likes_str else 0
except (ValueError, AttributeError):
pass
try:
comments_str = post.get("comments_count", "0").replace(",", "")
total_comments += int(re.search(r"\d+", comments_str).group()) if comments_str else 0
except (ValueError, AttributeError):
pass
report[brand_name] = {
"page_name": page_name,
"title": info.get("title", ""),
"total_posts_analyzed": len(posts),
"total_reviews_analyzed": len(reviews),
"avg_likes_per_post": total_likes // max(len(posts), 1),
"avg_comments_per_post": total_comments // max(len(posts), 1),
"sentiment": sentiment,
"avg_review_rating": self._avg_rating(reviews),
"top_post": posts[0]["text"][:200] if posts else "N/A",
}
return report
def _analyze_sentiment(self, text):
"""Simple keyword-based sentiment analysis."""
text_lower = text.lower()
positive = {"love", "great", "amazing", "best", "awesome", "excellent",
"fantastic", "recommend", "perfect", "helpful", "wonderful"}
negative = {"hate", "terrible", "worst", "awful", "scam", "avoid",
"disappointed", "horrible", "broken", "waste", "fraud"}
pos = sum(1 for w in positive if w in text_lower)
neg = sum(1 for w in negative if w in text_lower)
if pos > neg:
return "positive"
elif neg > pos:
return "negative"
return "neutral"
def _avg_rating(self, reviews):
"""Calculate average review rating."""
ratings = [r["rating"] for r in reviews if r.get("rating")]
return round(sum(ratings) / len(ratings), 1) if ratings else "N/A"
# Monitor competitor brands
monitor = BrandReputationMonitor()
report = monitor.monitor_brand({
"TechCrunch": "TechCrunch",
"TheVerge": "The Verge",
"Wired": "WIRED",
})
for brand, data in report.items():
print(f"\n📊 {brand}:")
print(f" Posts analyzed: {data['total_posts_analyzed']}")
print(f" Avg likes/post: {data['avg_likes_per_post']}")
print(f" Avg comments/post: {data['avg_comments_per_post']}")
print(f" Sentiment: {data['sentiment']}")
print(f" Avg review rating: {data['avg_review_rating']}")
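The keyword check in `_analyze_sentiment` tests substrings, so "best" also matches "bestseller" and "waste" matches "wasteland". A possible refinement, sketched below, splits the text into tokens and matches whole words. The function name `keyword_sentiment` and its return shape are assumptions. Unlike the original, it counts every occurrence rather than each keyword once.

```python
import re

POSITIVE = {"love", "great", "amazing", "best", "awesome", "excellent",
            "fantastic", "recommend", "perfect", "helpful", "wonderful"}
NEGATIVE = {"hate", "terrible", "worst", "awful", "scam", "avoid",
            "disappointed", "horrible", "broken", "waste", "fraud"}


def keyword_sentiment(text):
    """Whole-word keyword sentiment; returns a label plus hit counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for token in tokens if token in POSITIVE)
    neg = sum(1 for token in tokens if token in NEGATIVE)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return {"label": label, "positive_hits": pos, "negative_hits": neg}
```

This is still a crude heuristic: it ignores negation ("not great") and inflected forms ("loved"), so it is best treated as a rough signal.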
Compare engagement metrics, posting frequency, and content strategy across competitor Facebook pages.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
class CompetitorPageAnalyzer {
async analyzeCompetitors(pageNames) {
const results = {};
for (const pageName of pageNames) {
console.log(`Analyzing: ${pageName}...`);
await new Promise(r => setTimeout(r, 10000)); // Rate limit
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox'],
});
try {
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(`https://www.facebook.com/${pageName}`, {
waitUntil: 'networkidle2',
timeout: 30000,
});
// Close popups
try {
await page.click('[aria-label="Close"]', { timeout: 3000 });
} catch (e) {}
// Scroll for posts
for (let i = 0; i < 3; i++) {
await page.evaluate(() =>
window.scrollTo(0, document.body.scrollHeight)
);
await new Promise(r => setTimeout(r, 2000));
}
// Extract page metrics
const metrics = await page.evaluate(() => {
const articles = document.querySelectorAll('[role="article"]');
const postCount = articles.length;
let totalReactions = 0;
let totalComments = 0;
const postTexts = [];
articles.forEach(article => {
// Count reactions
const reactionEl = article.querySelector(
'[aria-label*="reaction"], [aria-label*="like"]'
);
if (reactionEl) {
const label = reactionEl.getAttribute('aria-label') || '';
const match = label.match(/([\d,.]+)/);
if (match) totalReactions += parseInt(match[1].replace(/,/g, '')) || 0;
}
// Count comments
const commentEl = article.querySelector('a[href*="comment"]');
if (commentEl) {
const match = commentEl.innerText.match(/(\d+)/);
if (match) totalComments += parseInt(match[1]) || 0;
}
// Get post text
const textEl = article.querySelector(
'[data-ad-preview="message"]'
);
if (textEl) postTexts.push(textEl.innerText.slice(0, 200));
});
// Get follower count
const followerEl = document.querySelector(
'a[href*="followers"] span, [class*="follower"]'
);
const followers = followerEl ? followerEl.innerText : 'N/A';
return {
postCount,
totalReactions,
totalComments,
avgReactionsPerPost: postCount > 0
? Math.round(totalReactions / postCount)
: 0,
avgCommentsPerPost: postCount > 0
? Math.round(totalComments / postCount)
: 0,
followers,
samplePosts: postTexts.slice(0, 3),
};
});
results[pageName] = metrics;
} catch (error) {
results[pageName] = { error: error.message };
} finally {
await browser.close();
}
}
return results;
}
}
// Analyze competitors
(async () => {
const analyzer = new CompetitorPageAnalyzer();
const results = await analyzer.analyzeCompetitors([
'ScrapingBee',
'Apify',
'BrightData',
]);
for (const [page, data] of Object.entries(results)) {
if (data.error) {
console.log(`\n❌ ${page}: ${data.error}`);
continue;
}
console.log(`\n📊 ${page}:`);
console.log(` Followers: ${data.followers}`);
console.log(` Posts visible: ${data.postCount}`);
console.log(` Avg reactions/post: ${data.avgReactionsPerPost}`);
console.log(` Avg comments/post: ${data.avgCommentsPerPost}`);
console.log(` Sample: ${data.samplePosts[0]?.slice(0, 80) || 'N/A'}...`);
}
})();
Build an AI agent tool that extracts and analyzes Facebook page data for automated market intelligence — perfect for AI agents that need to understand brand presence and social engagement.
import requests
import json
from datetime import datetime
class FacebookIntelligenceTool:
"""AI agent tool for Facebook social intelligence."""
def __init__(self, mantis_api_key):
self.api_key = mantis_api_key
self.base_url = "https://api.mantisapi.com/v1/scrape"
def get_page_intelligence(self, page_url):
"""Extract comprehensive intelligence from a Facebook page.
Designed as an AI agent tool — returns structured data
that agents can reason about.
"""
# Scrape the page
response = requests.post(
self.base_url,
headers={"x-api-key": self.api_key},
json={
"url": page_url,
"render_js": True,
"wait_for": "[role='article']",
"scroll_count": 3,
"extract": {
"page_name": "h1, [role='heading']",
"category": "a[href*='/pages/category/']",
"posts": {
"_selector": "[role='article']",
"_type": "list",
"text": "[data-ad-preview='message']",
"reactions": "[aria-label*='reaction']::attr(aria-label)",
"comments": "a[href*='comment']",
"timestamp": "a[href*='/posts/'] span",
}
}
}
)
data = response.json()
extracted = data.get("extracted", {})
# Analyze content strategy
posts = extracted.get("posts", [])
analysis = {
"page_name": extracted.get("page_name", "Unknown"),
"category": extracted.get("category", "Unknown"),
"posts_analyzed": len(posts),
"content_types": self._classify_content(posts),
"posting_patterns": self._analyze_patterns(posts),
"engagement_summary": self._summarize_engagement(posts),
"top_performing": self._top_posts(posts, 3),
"extracted_at": datetime.utcnow().isoformat(),
}
return analysis
def get_ad_intelligence(self, brand_name, country="US"):
"""Extract competitor ad intelligence from Facebook Ad Library."""
response = requests.post(
self.base_url,
headers={"x-api-key": self.api_key},
json={
"url": f"https://www.facebook.com/ads/library/"
f"?active_status=active&ad_type=all"
f"&country={country}&q={brand_name}",
"render_js": True,
"scroll_count": 5,
"extract": {
"ads": {
"_selector": "[role='article']",
"_type": "list",
"text": "div",
"started": "span:has-text('Started')",
"platforms": "span:has-text('Facebook'), span:has-text('Instagram')",
}
}
}
)
data = response.json()
ads = data.get("extracted", {}).get("ads", [])
return {
"brand": brand_name,
"country": country,
"active_ads": len(ads),
"ad_samples": [
{"text": ad.get("text", "")[:200], "started": ad.get("started", "")}
for ad in ads[:5]
],
"extracted_at": datetime.utcnow().isoformat(),
}
def _classify_content(self, posts):
"""Classify posts by content type."""
types = {"text_only": 0, "with_link": 0, "with_media": 0}
for post in posts:
text = post.get("text", "")
if "http" in text or "www." in text:
types["with_link"] += 1
elif post.get("imageUrl"):
types["with_media"] += 1
else:
types["text_only"] += 1
return types
def _analyze_patterns(self, posts):
"""Analyze posting patterns."""
return {
"total_posts_visible": len(posts),
"has_regular_cadence": len(posts) >= 5,
}
def _summarize_engagement(self, posts):
"""Summarize engagement across posts."""
import re
total_reactions = 0
for post in posts:
label = post.get("reactions", "")
# Match the leading integer part only, so labels like "1.2K" cannot crash int()
match = re.search(r"\d[\d,]*", str(label))
if match:
total_reactions += int(match.group(0).replace(",", ""))
return {
"total_reactions": total_reactions,
"avg_reactions_per_post": total_reactions // max(len(posts), 1),
}
def _top_posts(self, posts, n=3):
"""Return top performing posts by engagement."""
# Sort by reaction count (approximate from labels)
import re
scored = []
for post in posts:
label = post.get("reactions", "")
# Match the leading integer part only, so labels like "1.2K" cannot crash int()
match = re.search(r"\d[\d,]*", str(label))
score = int(match.group(0).replace(",", "")) if match else 0
scored.append({"text": post.get("text", "")[:200], "reactions": score})
scored.sort(key=lambda x: x["reactions"], reverse=True)
return scored[:n]
# Usage as an AI agent tool
tool = FacebookIntelligenceTool("YOUR_API_KEY")
# Get page intelligence
intel = tool.get_page_intelligence("https://www.facebook.com/TechCrunch")
print(json.dumps(intel, indent=2))
# Get ad intelligence
ad_intel = tool.get_ad_intelligence("ScrapingBee")
print(f"\n{ad_intel['brand']}: {ad_intel['active_ads']} active ads")
for ad in ad_intel["ad_samples"]:
print(f" - {ad['text'][:80]}...")
| Feature | Facebook Graph API | DIY Scraping | Mantis API |
|---|---|---|---|
| Cost | Free (own pages only) | Server + residential proxy costs | $29/mo (5K requests) |
| Access scope | Only pages you own/manage | Any public page | Any public page |
| Competitor data | ❌ Not available | ✅ Public pages | ✅ Public pages |
| Ad Library | ✅ Ad Library API | ✅ Web scraping | ✅ Included |
| Setup time | Hours (app review required) | Days (stealth config, proxies) | Minutes |
| Login wall handling | N/A (API) | You manage | Included |
| Anti-bot handling | N/A (API) | You manage | Included |
| Proxy management | N/A | Residential required ($$$) | Included |
| JS rendering | N/A (API) | Browser required | Included |
| Maintenance | API version updates | High (DOM changes, fingerprinting) | Zero |
| Reliability | High (limited scope) | Low-Medium | High |
Mantis handles login walls, residential proxies, fingerprinting, and JavaScript rendering. Get structured Facebook data with a single API call.
Facebook scraping carries more legal risk than most platforms due to Meta's aggressive enforcement posture. Here's what you need to know:
Meta is uniquely aggressive among tech companies in enforcing against scraping:
Disclaimer: This article is for educational purposes only. Web scraping may violate Facebook's Terms of Service. Meta has actively pursued legal action against scrapers. Always ensure your scraping activities comply with applicable laws and regulations in your jurisdiction.
See the structured FAQ data above for common questions about scraping Facebook. Key points:
mbasic.facebook.com is the easiest scraping target: simple HTML, no JS required.