How to Scrape Facebook Data in 2026: Pages, Posts & Groups

Published March 31, 2026 · 19 min read

Why Scrape Facebook?

Facebook remains the world's largest social network with over 3 billion monthly active users. Despite declining engagement among younger demographics, it's still the dominant platform for:

But since the Cambridge Analytica scandal in 2018, Meta has locked down data access more aggressively than any other major platform. The Graph API is nearly useless for third-party data extraction, making web scraping the only viable approach for competitive intelligence and market research.

The Facebook Graph API Problem

After Cambridge Analytica, Meta gutted the Graph API. Here's what you can't do with the official API:

What You Want | Graph API Support | Reality
Search public pages | ❌ Removed | Page search API deprecated in 2019
Read other pages' posts | ❌ Removed | Only your own pages via Page Access Token
Read group posts | ❌ Removed | Only groups your app is installed in
User profiles | ❌ Removed | Only the authenticated user's own data
Comments on others' posts | ❌ Removed | Only comments on your own page's posts
Your own page insights | ✅ Available | Analytics for pages you manage
Post to your page | ✅ Available | Publishing to pages you manage
Ad Library | ✅ Available | Public ad transparency data
Bottom line: The Graph API only lets you manage your own pages. For competitive analysis, market research, or monitoring any page you don't own, web scraping is the only option.

4 Methods to Scrape Facebook

Here are four approaches to extract data from Facebook, from lightweight HTML parsing to production-ready API solutions:

  1. Python + Requests (mbasic.facebook.com) — Parse the simplified mobile HTML, no JS rendering needed
  2. Playwright headless browser — Full browser automation for the main Facebook site
  3. Node.js + Puppeteer — Stealth browser scraping with GraphQL API interception
  4. Mantis Web Scraping API — One-call solution, handles anti-bot, production-ready

Method 1: Python + Requests (mbasic.facebook.com)

Facebook's mobile basic site (mbasic.facebook.com) serves simple HTML without JavaScript — perfect for lightweight scraping. It's designed for low-bandwidth connections and older devices, which means minimal anti-bot protections compared to the main site.

Setup

pip install requests beautifulsoup4

Scrape a Public Facebook Page

import requests
from bs4 import BeautifulSoup
import time
import re
import json
from datetime import datetime

class FacebookScraper:
    """Scrape Facebook using mbasic.facebook.com (simplified HTML)."""

    BASE_URL = "https://mbasic.facebook.com"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) "
                          "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                          "Version/16.6 Mobile/15E148 Safari/604.1",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        })
        self.last_request = 0

    def _rate_limit(self, delay=8):
        """Respect rate limits — Facebook is aggressive about blocking."""
        elapsed = time.time() - self.last_request
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request = time.time()

    def get_page_info(self, page_name):
        """Fetch basic info about a public Facebook page."""
        self._rate_limit()

        url = f"{self.BASE_URL}/{page_name}"
        resp = self.session.get(url, timeout=30)  # Without a timeout, requests can block indefinitely
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        # Extract page name from title
        title = soup.find("title")
        page_title = title.text.strip() if title else page_name

        # Extract profile info
        profile_info = {}
        info_section = soup.find("div", {"id": "pages_mbasic_header_top"})
        if info_section:
            profile_info["name"] = info_section.get_text(strip=True)

        # Look for likes/followers count
        likes_elem = soup.find(string=re.compile(r"[\d,.]+ (likes?|followers?)", re.I))
        if likes_elem:
            profile_info["engagement_text"] = likes_elem.strip()

        return {
            "page_name": page_name,
            "title": page_title,
            "url": f"https://facebook.com/{page_name}",
            "info": profile_info,
        }

    def get_page_posts(self, page_name, max_posts=20):
        """Scrape posts from a public Facebook page."""
        self._rate_limit()

        url = f"{self.BASE_URL}/{page_name}"
        resp = self.session.get(url, timeout=30)  # Without a timeout, requests can block indefinitely
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        posts = []
        # Posts are usually divs with role="article"; the exact markup varies
        post_elements = soup.find_all("div", {"role": "article"})

        if not post_elements:
            # Fallback: look for story containers
            post_elements = soup.find_all("div", class_=re.compile("story"))

        for elem in post_elements[:max_posts]:
            post = self._parse_post(elem)
            if post and post.get("text"):
                posts.append(post)

        return posts

    def _parse_post(self, elem):
        """Parse a single post element from mbasic HTML."""
        post = {}

        # Extract post text
        text_parts = []
        for p in elem.find_all(["p", "span"]):
            text = p.get_text(strip=True)
            if text and len(text) > 10:
                text_parts.append(text)

        post["text"] = " ".join(text_parts)[:1000] if text_parts else ""

        # Extract links
        links = []
        for a in elem.find_all("a", href=True):
            href = a["href"]
            if "facebook.com" not in href and href.startswith("http"):
                links.append(href)
        post["links"] = links

        # Extract timestamp (mbasic uses abbr tags for timestamps)
        time_elem = elem.find("abbr")
        if time_elem:
            post["timestamp"] = time_elem.get_text(strip=True)

        # Extract engagement (likes, comments, shares text)
        footer = elem.find("footer") or elem.find(
            "div", class_=re.compile("footer|action")
        )
        if footer:
            footer_text = footer.get_text(" ", strip=True)
            # Extract like count
            likes_match = re.search(r"([\d,.]+)\s*(likes?|reactions?)", footer_text, re.I)
            if likes_match:
                post["likes"] = likes_match.group(1)

            comments_match = re.search(r"([\d,.]+)\s*comments?", footer_text, re.I)
            if comments_match:
                post["comments_count"] = comments_match.group(1)

            shares_match = re.search(r"([\d,.]+)\s*shares?", footer_text, re.I)
            if shares_match:
                post["shares"] = shares_match.group(1)

        # Extract image if present
        img = elem.find("img", src=re.compile("scontent"))
        if img:
            post["image_url"] = img.get("src", "")

        return post

    def get_page_reviews(self, page_name, max_reviews=20):
        """Scrape reviews from a Facebook business page."""
        self._rate_limit()

        url = f"{self.BASE_URL}/{page_name}/reviews"
        resp = self.session.get(url, timeout=30)  # Without a timeout, requests can block indefinitely
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        reviews = []
        review_elements = soup.find_all("div", {"role": "article"})

        for elem in review_elements[:max_reviews]:
            review = {}
            # Extract reviewer name
            author = elem.find("strong") or elem.find("h3")
            if author:
                review["author"] = author.get_text(strip=True)

            # Extract review text
            text = elem.get_text(" ", strip=True)
            review["text"] = text[:500]

            # Look for star ratings
            stars = elem.find_all("img", alt=re.compile("star", re.I))
            if stars:
                review["rating"] = len(stars)

            if review.get("text"):
                reviews.append(review)

        return reviews

    def search_pages(self, query, max_results=10):
        """Search for Facebook pages (limited on mbasic)."""
        self._rate_limit()

        url = f"{self.BASE_URL}/search/pages/"
        params = {"q": query}
        resp = self.session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        results = []
        for link in soup.find_all("a", href=re.compile("/pages/")):
            name = link.get_text(strip=True)
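            href = link["href"]
            # hrefs may be relative ("/pages/...") or already absolute
            full_url = href if href.startswith("http") else f"https://facebook.com{href}"
            if name and len(name) > 2:
                results.append({
                    "name": name,
                    "url": full_url,
                })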

        return results[:max_results]


# Usage
scraper = FacebookScraper()

# Get page info
page_info = scraper.get_page_info("TechCrunch")
print(f"Page: {page_info['title']}")
print(f"URL: {page_info['url']}")

# Get recent posts
posts = scraper.get_page_posts("TechCrunch", max_posts=5)
for i, post in enumerate(posts, 1):
    print(f"\n--- Post {i} ---")
    print(f"Text: {post['text'][:150]}...")
    if post.get("likes"):
        print(f"Likes: {post['likes']}")
    if post.get("timestamp"):
        print(f"Time: {post['timestamp']}")
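
The scraper above stores likes, comments, and shares as raw strings such as "1,234". Before aggregating them it helps to normalize them to integers. The helper below is a minimal sketch: `parse_count` is a hypothetical name, and the assumption that abbreviated counts use K/M/B suffixes is not something the page markup guarantees.

```python
import re

# Multipliers for abbreviated counts (assumed notation, e.g. "1.2K").
_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(raw):
    """Convert a scraped count string such as '1,234' or '1.2K' to an int.

    Returns None when the string contains no number.
    """
    if not raw:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([kmb])?\b", raw.strip(), re.I)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return int(round(number * _SUFFIXES.get(suffix, 1)))
```

Applied to the scraper's output, `parse_count(post.get("likes"))` yields a number that can be summed or averaged, or `None` when the post had no like count.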

Pagination: Scrape Multiple Pages of Posts

def scrape_all_posts(scraper, page_name, max_posts=50):
    """Scrape multiple pages of Facebook posts using mbasic pagination."""
    all_posts = []
    url = f"{scraper.BASE_URL}/{page_name}"

    while len(all_posts) < max_posts:
        scraper._rate_limit(delay=10)  # Extra cautious with Facebook

        resp = scraper.session.get(url, timeout=30)  # Without a timeout, requests can block indefinitely
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        # Parse posts on this page
        post_elements = soup.find_all("div", {"role": "article"})
        if not post_elements:
            post_elements = soup.find_all("div", class_=re.compile("story"))

        new_posts = 0
        for elem in post_elements:
            post = scraper._parse_post(elem)
            if post and post.get("text"):
                all_posts.append(post)
                new_posts += 1

        if new_posts == 0:
            break

        print(f"Fetched {len(all_posts)} posts so far...")

        # Find "See more posts" link for pagination
        more_link = soup.find("a", string=re.compile("See More|Show More|Older", re.I))
        if more_link and more_link.get("href"):
            next_url = more_link["href"]
            if not next_url.startswith("http"):
                url = f"{scraper.BASE_URL}{next_url}"
            else:
                url = next_url
        else:
            break

    return all_posts[:max_posts]


# Scrape 50 posts from a page
all_posts = scrape_all_posts(scraper, "TechCrunch", max_posts=50)
print(f"\nTotal posts scraped: {len(all_posts)}")
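
The pagination loop builds the next URL by string concatenation and has no guard against a link that points back to a page already fetched. The standard library's `urllib.parse.urljoin` resolves relative and absolute hrefs in one call, and a small visited set prevents loops. This is a sketch under those assumptions: `next_page_url` is a hypothetical helper and the sample hrefs are invented.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href, seen):
    """Return the absolute next-page URL, or None if it was already visited."""
    candidate = urljoin(current_url, href)
    if candidate in seen:
        return None
    seen.add(candidate)
    return candidate


seen = set()
first = next_page_url(
    "https://mbasic.facebook.com/TechCrunch", "/TechCrunch?cursor=abc", seen
)
repeat = next_page_url(
    "https://mbasic.facebook.com/TechCrunch", "/TechCrunch?cursor=abc", seen
)
```

Here `first` is the resolved absolute URL and `repeat` is `None`, which the loop can treat as the signal to stop.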

Method 2: Playwright Headless Browser

For the full Facebook experience — JavaScript rendering, infinite scroll, and dynamic content loading — Playwright with stealth configuration is your best option.

Setup

pip install playwright playwright-stealth
playwright install chromium

Scrape Facebook Page with Full Rendering

import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import json
import re

class FacebookPlaywrightScraper:
    """Scrape Facebook using Playwright for full JS rendering."""

    async def scrape_page_posts(self, page_name, max_posts=20, scroll_count=5):
        """Scrape posts from a public Facebook page with full rendering."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                headless=True,
                args=[
                    "--disable-blink-features=AutomationControlled",
                    "--no-sandbox",
                ]
            )

            context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/121.0.0.0 Safari/537.36",
                locale="en-US",
            )

            page = await context.new_page()
            await stealth_async(page)

            # Navigate to the Facebook page
            url = f"https://www.facebook.com/{page_name}"
            await page.goto(url, wait_until="networkidle", timeout=30000)

            # Handle cookie consent / login popup
            try:
                close_btn = page.locator('[aria-label="Close"]').first
                if await close_btn.is_visible(timeout=3000):
                    await close_btn.click()
            except Exception:
                pass

            # Dismiss login modal if it appears
            try:
                not_now = page.locator('text="Not Now"').first
                if await not_now.is_visible(timeout=3000):
                    await not_now.click()
            except Exception:
                pass

            # Scroll to load more posts
            for i in range(scroll_count):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await asyncio.sleep(2 + (i * 0.5))  # Progressive delay

            # Extract posts from the page
            posts = await page.evaluate("""() => {
                const posts = [];
                const postElements = document.querySelectorAll(
                    '[role="article"], [data-ad-preview="message"]'
                );

                postElements.forEach(elem => {
                    const textElem = elem.querySelector(
                        '[data-ad-preview="message"], [data-ad-comet-preview="message"]'
                    );
                    const text = textElem ? textElem.innerText : '';

                    // Extract engagement metrics
                    const likeSpan = elem.querySelector(
                        '[aria-label*="reaction"], [aria-label*="like"]'
                    );
                    const likes = likeSpan ? likeSpan.getAttribute('aria-label') : '';

                    const commentLink = elem.querySelector('a[href*="comment"]');
                    const comments = commentLink ? commentLink.innerText : '';

                    const shareSpan = elem.querySelector(
                        'span[class*="share"], a[href*="shares"]'
                    );
                    const shares = shareSpan ? shareSpan.innerText : '';

                    // Extract image
                    const img = elem.querySelector(
                        'img[src*="scontent"], img[data-visualcompletion]'
                    );
                    const imageUrl = img ? img.src : '';

                    // Extract timestamp
                    const timeLink = elem.querySelector('a[href*="/posts/"] span');
                    const timestamp = timeLink ? timeLink.innerText : '';

                    if (text && text.length > 10) {
                        posts.push({
                            text: text.slice(0, 1000),
                            likes,
                            comments,
                            shares,
                            imageUrl,
                            timestamp,
                        });
                    }
                });

                return posts;
            }""")

            await browser.close()
            return posts[:max_posts]

    async def scrape_ad_library(self, page_name, country="US"):
        """Scrape Facebook Ad Library for a page's active ads."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/121.0.0.0 Safari/537.36",
            )

            page = await context.new_page()

            # Facebook Ad Library is publicly accessible
            url = (
                f"https://www.facebook.com/ads/library/"
                f"?active_status=active&ad_type=all"
                f"&country={country}&q={page_name}"
            )
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await asyncio.sleep(3)

            # Scroll to load ads
            for _ in range(3):
                await page.evaluate(
                    "window.scrollTo(0, document.body.scrollHeight)"
                )
                await asyncio.sleep(2)

            # Extract ad data
            ads = await page.evaluate("""() => {
                const ads = [];
                const adCards = document.querySelectorAll(
                    '[class*="ad-card"], div[role="article"]'
                );

                adCards.forEach(card => {
                    const text = card.innerText;
                    const img = card.querySelector('img[src*="scontent"]');
                    const link = card.querySelector('a[href*="facebook.com"]');

                    ads.push({
                        text: text.slice(0, 500),
                        imageUrl: img ? img.src : '',
                        link: link ? link.href : '',
                    });
                });

                return ads;
            }""")

            await browser.close()
            return ads


# Usage
async def main():
    scraper = FacebookPlaywrightScraper()

    # Scrape page posts
    print("--- Scraping Facebook Page Posts ---")
    posts = await scraper.scrape_page_posts("TechCrunch", max_posts=10)
    for i, post in enumerate(posts, 1):
        print(f"\nPost {i}: {post['text'][:120]}...")
        if post.get("likes"):
            print(f"  Reactions: {post['likes']}")

    # Scrape Ad Library
    print("\n--- Scraping Ad Library ---")
    ads = await scraper.scrape_ad_library("TechCrunch")
    print(f"Found {len(ads)} active ads")


asyncio.run(main())

Method 3: Node.js + Puppeteer

Node.js with Puppeteer and the stealth plugin is excellent for intercepting Facebook's internal GraphQL API calls — giving you structured JSON data directly from Facebook's backend.

Setup

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

Scrape with GraphQL API Interception

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

class FacebookPuppeteerScraper {
  async scrapePagePosts(pageName, maxPosts = 20, scrollCount = 5) {
    const browser = await puppeteer.launch({
      headless: 'new',
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-blink-features=AutomationControlled',
      ],
    });

    const page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });

    // Intercept GraphQL responses for structured data
    const graphqlResponses = [];
    page.on('response', async (response) => {
      const url = response.url();
      if (url.includes('graphql')) {
        try {
          const json = await response.json();
          graphqlResponses.push(json);
        } catch (e) {
          // Not JSON, skip
        }
      }
    });

    // Navigate to page
    const url = `https://www.facebook.com/${pageName}`;
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    // Handle popups
    try {
      const closeBtn = await page.$('[aria-label="Close"]');
      if (closeBtn) await closeBtn.click();
    } catch (e) {}

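    try {
      // page.$x() is not available in newer Puppeteer releases; the xpath/ selector prefix replaces it
      const notNow = await page.$$('xpath/.//div[text()="Not Now" or text()="Not now"]');
      if (notNow.length) await notNow[0].click();
    } catch (e) {}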

    // Scroll to load posts
    for (let i = 0; i < scrollCount; i++) {
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      await new Promise(r => setTimeout(r, 2000 + i * 500));
    }

    // Extract posts from DOM
    const posts = await page.evaluate((max) => {
      const results = [];
      const articles = document.querySelectorAll('[role="article"]');

      articles.forEach(article => {
        if (results.length >= max) return;

        const textEl = article.querySelector(
          '[data-ad-preview="message"], [data-ad-comet-preview="message"]'
        );
        const text = textEl ? textEl.innerText : '';

        const reactionLabel = article.querySelector(
          '[aria-label*="reaction"], [aria-label*="like"]'
        );
        const reactions = reactionLabel
          ? reactionLabel.getAttribute('aria-label')
          : '';

        const commentEl = article.querySelector('a[href*="comment"]');
        const comments = commentEl ? commentEl.innerText : '';

        const timeEl = article.querySelector('a[href*="/posts/"] span');
        const timestamp = timeEl ? timeEl.innerText : '';

        const imgEl = article.querySelector('img[src*="scontent"]');
        const imageUrl = imgEl ? imgEl.src : '';

        if (text && text.length > 10) {
          results.push({
            text: text.slice(0, 1000),
            reactions,
            comments,
            timestamp,
            imageUrl,
          });
        }
      });

      return results;
    }, maxPosts);

    // Extract structured data from intercepted GraphQL responses
    const structuredPosts = this.parseGraphQLPosts(graphqlResponses);

    await browser.close();

    // Merge DOM posts with GraphQL data
    return {
      domPosts: posts,
      graphqlPosts: structuredPosts,
      totalGraphQLResponses: graphqlResponses.length,
    };
  }

  parseGraphQLPosts(responses) {
    const posts = [];

    for (const resp of responses) {
      try {
        // Facebook's GraphQL responses have varying structures
        const data = resp?.data;
        if (!data) continue;

        // Look for timeline feed data
        const edges = this.findNestedKey(data, 'edges') || [];
        for (const edge of edges) {
          const node = edge?.node;
          if (!node) continue;

          const story = node?.comet_sections?.content?.story;
          if (!story) continue;

          const message = story?.message?.text || '';
          const createdTime = node?.created_time;
          const feedbackSummary = node?.comet_sections?.feedback?.story?.feedback_context;

          if (message) {
            posts.push({
              text: message,
              createdTime,
              reactions: feedbackSummary?.reaction_count?.count || 0,
              comments: feedbackSummary?.comment_count?.total_count || 0,
              shares: feedbackSummary?.share_count?.count || 0,
            });
          }
        }
      } catch (e) {
        // GraphQL structure varies, skip unparseable responses
      }
    }

    return posts;
  }

  findNestedKey(obj, key) {
    if (!obj || typeof obj !== 'object') return null;
    if (obj[key]) return obj[key];
    for (const k of Object.keys(obj)) {
      const result = this.findNestedKey(obj[k], key);
      if (result) return result;
    }
    return null;
  }
}

// Usage
(async () => {
  const scraper = new FacebookPuppeteerScraper();

  console.log('--- Scraping Facebook Page ---');
  const result = await scraper.scrapePagePosts('TechCrunch', 10);

  console.log(`\nDOM Posts: ${result.domPosts.length}`);
  result.domPosts.slice(0, 3).forEach((post, i) => {
    console.log(`\n  Post ${i + 1}: ${post.text.slice(0, 100)}...`);
    if (post.reactions) console.log(`  Reactions: ${post.reactions}`);
  });

  console.log(`\nGraphQL Posts: ${result.graphqlPosts.length}`);
  result.graphqlPosts.slice(0, 3).forEach((post, i) => {
    console.log(`\n  Post ${i + 1}: ${post.text.slice(0, 100)}...`);
    console.log(`  Reactions: ${post.reactions} | Comments: ${post.comments} | Shares: ${post.shares}`);
  });

  console.log(`\nTotal GraphQL responses intercepted: ${result.totalGraphQLResponses}`);
})();
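
`findNestedKey` returns only the first `edges` array it encounters, so posts held in any later array are missed. A variant that collects every match can be sketched as follows. It is written in Python, like the Method 1 code; `collect_key` is a hypothetical helper and the sample payload is invented, not a real Facebook response.

```python
def collect_key(obj, key):
    """Recursively gather every value stored under `key` in nested JSON data."""
    found = []
    if isinstance(obj, dict):
        for k, value in obj.items():
            if k == key:
                found.append(value)
            # Keep descending: matches can be nested inside other matches.
            found.extend(collect_key(value, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(collect_key(item, key))
    return found


# Invented payload with two separate "edges" arrays.
sample = {
    "data": {
        "timeline": {"edges": [{"node": {"id": "1"}}]},
        "pinned": {"edges": [{"node": {"id": "2"}}]},
    }
}
all_edges = [edge for edges in collect_key(sample, "edges") for edge in edges]
```

The same recursion ports directly to JavaScript by accumulating into an array instead of returning on the first hit.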

Method 4: Mantis Web Scraping API

For production applications, Mantis provides the most reliable way to extract Facebook data. One API call handles rendering, login wall bypassing, proxy rotation, and structured data extraction — without maintaining browser infrastructure.

import requests

# Scrape a Facebook page
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://www.facebook.com/TechCrunch",
        "render_js": True,
        "wait_for": "[role='article']",
        "scroll_count": 3,
        "extract": {
            "posts": {
                "_selector": "[role='article']",
                "_type": "list",
                "text": "[data-ad-preview='message']",
                "reactions": "[aria-label*='reaction']::attr(aria-label)",
                "comments": "a[href*='comment']",
                "timestamp": "a[href*='/posts/'] span",
            }
        }
    }
)

data = response.json()
for post in data["extracted"]["posts"][:10]:
    print(f"Post: {post['text'][:100]}...")
    print(f"  Reactions: {post['reactions']}")
    print(f"  Comments: {post['comments']}")
    print()

Scrape Facebook Ad Library with Mantis

# Scrape competitor ads from Facebook Ad Library
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://www.facebook.com/ads/library/?active_status=active"
              "&ad_type=all&country=US&q=competitor_name",
        "render_js": True,
        "scroll_count": 5,
        "extract": {
            "ads": {
                "_selector": "[role='article'], [class*='ad-card']",
                "_type": "list",
                "creative_text": "div[class*='_7jyr']",
                "started": "span:has-text('Started running')",
                "platform": "span:has-text('Facebook'), span:has-text('Instagram')",
            }
        }
    }
)

data = response.json()
print(f"Found {len(data['extracted']['ads'])} active ads")
for ad in data["extracted"]["ads"][:5]:
    print(f"  Ad: {ad['creative_text'][:100]}...")
    print(f"  Started: {ad['started']} | Platforms: {ad['platform']}")
    print()

Why Mantis for Facebook?

Scrape Facebook Without Getting Blocked

Extract page posts, ad library data, and engagement metrics with a single API call. No browser infrastructure, no proxy management, no login walls.


Facebook Anti-Bot Defenses

Facebook has some of the most sophisticated anti-scraping defenses of any major platform. Understanding them is critical:

1. Login Walls

Facebook aggressively pushes login modals on every page. Even public pages show a login overlay after a few seconds of browsing. The mbasic.facebook.com version is less aggressive, but the main site requires constant popup dismissal. This is Meta's primary anti-scraping strategy — force authentication to track and control access.

2. Device & Browser Fingerprinting

Facebook collects extensive browser fingerprints: canvas rendering, WebGL hashes, audio context, installed fonts, screen resolution, timezone, language, and dozens of other signals. Headless browsers have detectable fingerprint anomalies — use stealth plugins (playwright-stealth, puppeteer-extra-plugin-stealth) to mitigate this.

3. IP Reputation & Rate Limiting

Facebook maintains an extensive IP reputation database. Datacenter IPs (AWS, GCP, Azure, DigitalOcean) are typically pre-blocked or severely throttled. Residential proxies work better but must be rotated carefully — Facebook tracks behavioral patterns per IP and flags unusual access patterns.
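
One common way to space out retries after a throttled or blocked response is exponential backoff with random jitter. The sketch below only computes the delays. The base of 8 seconds mirrors the default in `_rate_limit`; the cap and the "full jitter" strategy are assumptions for illustration, not values Facebook documents.

```python
import random


def backoff_delays(retries, base=8.0, cap=300.0, rng=None):
    """Yield one delay per retry, uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


# A seeded generator makes the sequence reproducible for testing.
delays = list(backoff_delays(5, rng=random.Random(42)))
```

In a scraper, each value would be passed to `time.sleep()` before the next attempt.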

4. Checkpoint Challenges

Suspicious sessions trigger "checkpoint" challenges — CAPTCHA, phone verification, photo identification, or "Is this you?" prompts. These are nearly impossible to automate and effectively block any scraping session that triggers them.
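
Because a checkpoint or login wall ends a session's usefulness, it is worth detecting one early and stopping rather than continuing to send requests. The sketch below classifies a response by its final URL, which `requests` exposes as `resp.url` after redirects. The `/login` and `/checkpoint` path prefixes are assumptions about where Facebook redirects, and `classify_response` is a hypothetical helper.

```python
from urllib.parse import urlparse


def classify_response(final_url):
    """Label a response by the path it was redirected to.

    Assumes (not guaranteed) that login walls land under /login and
    checkpoint challenges under /checkpoint.
    """
    path = urlparse(final_url).path.lower()
    if path.startswith("/checkpoint"):
        return "checkpoint"
    if path.startswith("/login"):
        return "login_wall"
    return "ok"
```

A scraper could call `classify_response(resp.url)` after each request and abort or switch sessions on anything other than `"ok"`.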

5. GraphQL Hash Rotation

Facebook's internal API uses GraphQL with hashed query identifiers (doc_id). These hashes change frequently during deployments, breaking any scraper that relies on intercepting specific GraphQL queries. You need to dynamically discover the current hashes or rely on DOM-based extraction.
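
One way to discover the current identifiers is to read them from the request bodies the browser sends instead of hard-coding them. The sketch assumes the GraphQL POST body is form-encoded and carries `doc_id` and `fb_api_req_friendly_name` fields. That layout is an assumption that may change, `extract_doc_id` is a hypothetical helper, and the sample body is invented.

```python
from urllib.parse import parse_qs


def extract_doc_id(post_body):
    """Return (friendly_name, doc_id) from a form-encoded GraphQL request body."""
    fields = parse_qs(post_body)
    doc_id = fields.get("doc_id", [None])[0]
    name = fields.get("fb_api_req_friendly_name", [None])[0]
    return name, doc_id


# Invented example body; real requests carry many more fields.
sample_body = (
    "fb_api_req_friendly_name=ExampleFeedQuery"
    "&doc_id=1234567890&variables=%7B%7D"
)
name, doc_id = extract_doc_id(sample_body)
```

With Playwright's Python API the body of an intercepted request is available as `request.post_data`, so the helper can be called from a request listener to build a fresh name-to-id mapping on each run.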

6. Dynamic CSS Classes

Facebook obfuscates CSS class names with randomly generated strings (e.g., x1lliihq x6ikm8r x10wlt62). These classes change between deployments, making CSS selector-based scraping fragile. Use semantic selectors like [role="article"] and [aria-label] instead.
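
To illustrate why semantic attributes are the sturdier anchor, the sketch below finds posts and reaction labels without touching a single class name. It uses only the standard library's `html.parser` so it runs without dependencies; the HTML snippet is invented and `ArticleLabelParser` is a hypothetical class name.

```python
from html.parser import HTMLParser


class ArticleLabelParser(HTMLParser):
    """Count role="article" elements and collect aria-label values."""

    def __init__(self):
        super().__init__()
        self.article_count = 0
        self.aria_labels = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("role") == "article":
            self.article_count += 1
        if attr_map.get("aria-label"):
            self.aria_labels.append(attr_map["aria-label"])


# Invented markup: the class names are noise, the attributes carry meaning.
html = (
    '<div class="x1lliihq x6ikm8r" role="article">'
    '<span class="x10wlt62" aria-label="12 reactions"></span>'
    "</div>"
)
parser = ArticleLabelParser()
parser.feed(html)
```

With BeautifulSoup the equivalent lookup is `soup.find_all(attrs={"role": "article"})`; either way the class names can change without breaking the extraction.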

What Data Can You Extract?

Data Type | Fields | Auth Required?
Public Pages | Name, category, about, followers, likes, posts, contact info, hours, location | No (with mbasic)
Page Posts | Text, images, videos, reactions, comments count, shares, timestamp, links | Partial
Comments | Text, author name, reactions, replies, timestamp | Partial
Public Groups | Name, description, member count, post previews (limited without login) | Yes (most content)
Events | Title, date, location, description, attendance count, organizer | Partial
Marketplace | Listing title, price, location, images, seller info, description | Yes
Ad Library | Ad creative, start date, platforms, estimated spend, page name, status | No (public)
Reviews | Rating, text, author, date | Partial

3 Real-World Use Cases

Use Case 1: Brand Reputation Monitor

Track mentions, sentiment, and engagement across competitor and brand Facebook pages. Facebook reviews and page comments are brutally honest — making them valuable for understanding real customer sentiment.

import requests
from bs4 import BeautifulSoup
import time
import re
from collections import Counter

class BrandReputationMonitor:
    """Monitor brand reputation across Facebook pages."""

    def __init__(self):
        self.scraper = FacebookScraper()  # From Method 1

    def monitor_brand(self, brand_pages):
        """Monitor multiple brand pages for reputation signals."""
        report = {}

        for page_name, brand_name in brand_pages.items():
            print(f"Monitoring: {brand_name} ({page_name})...")

            # Get page info
            info = self.scraper.get_page_info(page_name)

            # Get recent posts
            posts = self.scraper.get_page_posts(page_name, max_posts=10)

            # Get reviews if available
            reviews = self.scraper.get_page_reviews(page_name, max_reviews=10)

            # Analyze sentiment
            # Join posts and reviews in one pass so adjacent words never run together
            all_text = " ".join(
                [p.get("text", "") for p in posts]
                + [r.get("text", "") for r in reviews]
            )
            sentiment = self._analyze_sentiment(all_text)

            # Calculate engagement metrics
            total_likes = 0
            total_comments = 0
            for post in posts:
                try:
                    likes_str = post.get("likes", "0").replace(",", "")
                    total_likes += int(re.search(r"\d+", likes_str).group()) if likes_str else 0
                except (ValueError, AttributeError):
                    pass
                try:
                    comments_str = post.get("comments_count", "0").replace(",", "")
                    total_comments += int(re.search(r"\d+", comments_str).group()) if comments_str else 0
                except (ValueError, AttributeError):
                    pass

            report[brand_name] = {
                "page_name": page_name,
                "title": info.get("title", ""),
                "total_posts_analyzed": len(posts),
                "total_reviews_analyzed": len(reviews),
                "avg_likes_per_post": total_likes // max(len(posts), 1),
                "avg_comments_per_post": total_comments // max(len(posts), 1),
                "sentiment": sentiment,
                "avg_review_rating": self._avg_rating(reviews),
                # First post returned (usually the most recent), not ranked by engagement
                "top_post": (posts[0].get("text") or "")[:200] if posts else "N/A",
            }

        return report

    def _analyze_sentiment(self, text):
        """Simple keyword-based sentiment analysis."""
        # Match whole words only, so "whatever" is not counted as "hate".
        # Trade-off: inflected forms such as "recommended" no longer match.
        words = re.findall(r"[a-z']+", text.lower())
        positive = {"love", "great", "amazing", "best", "awesome", "excellent",
                     "fantastic", "recommend", "perfect", "helpful", "wonderful"}
        negative = {"hate", "terrible", "worst", "awful", "scam", "avoid",
                     "disappointed", "horrible", "broken", "waste", "fraud"}

        # Count every occurrence, not just whether each keyword appears once
        pos = sum(1 for w in words if w in positive)
        neg = sum(1 for w in words if w in negative)

        if pos > neg:
            return "positive"
        elif neg > pos:
            return "negative"
        return "neutral"

    def _avg_rating(self, reviews):
        """Calculate average review rating."""
        ratings = [r["rating"] for r in reviews if r.get("rating")]
        return round(sum(ratings) / len(ratings), 1) if ratings else "N/A"


# Monitor competitor brands
monitor = BrandReputationMonitor()
report = monitor.monitor_brand({
    "TechCrunch": "TechCrunch",
    "TheVerge": "The Verge",
    "Wired": "WIRED",
})

for brand, data in report.items():
    print(f"\n📊 {brand}:")
    print(f"   Posts analyzed: {data['total_posts_analyzed']}")
    print(f"   Avg likes/post: {data['avg_likes_per_post']}")
    print(f"   Avg comments/post: {data['avg_comments_per_post']}")
    print(f"   Sentiment: {data['sentiment']}")
    print(f"   Avg review rating: {data['avg_review_rating']}")
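
The monitoring loop above requests page info, posts, and reviews for each brand back to back, with no pause between pages. A minimal sketch of a throttle that enforces a minimum gap plus random jitter is shown here. The class name `PoliteThrottle`, its defaults, and the injectable `clock`/`sleep` parameters are assumptions for illustration, not part of `FacebookScraper`.

```python
import random
import time


class PoliteThrottle:
    """Enforce a minimum gap between requests, plus random jitter."""

    def __init__(self, min_interval=5.0, jitter=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between calls (assumed default)
        self.jitter = jitter              # extra random delay, 0..jitter seconds
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous call, if any

    def wait(self):
        """Sleep until the gap since the previous call is long enough.

        Returns the number of seconds actually slept.
        """
        delay = 0.0
        if self._last is not None:
            target = self.min_interval + random.uniform(0, self.jitter)
            elapsed = self._clock() - self._last
            delay = max(0.0, target - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `throttle.wait()` at the top of the `for page_name, brand_name in brand_pages.items()` loop would space the pages out; the first call returns immediately, and later calls sleep only for whatever part of the interval has not already elapsed.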

Use Case 2: Competitor Page Analyzer

Compare engagement metrics, posting frequency, and content strategy across competitor Facebook pages.

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

class CompetitorPageAnalyzer {
  async analyzeCompetitors(pageNames) {
    const results = {};

    for (const pageName of pageNames) {
      console.log(`Analyzing: ${pageName}...`);
      await new Promise(r => setTimeout(r, 10000)); // Rate limit

      const browser = await puppeteer.launch({
        headless: true, // recent Puppeteer releases use the new headless mode for `true`
        args: ['--no-sandbox'],
      });

      try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1920, height: 1080 });

        await page.goto(`https://www.facebook.com/${pageName}`, {
          waitUntil: 'networkidle2',
          timeout: 30000,
        });

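        // Close the popup if one appears. Puppeteer's page.click() has no
        // timeout option, so wait for the button explicitly (up to 3s).
        try {
          const closeBtn = await page.waitForSelector('[aria-label="Close"]', {
            timeout: 3000,
          });
          await closeBtn.click();
        } catch (e) {
          // No popup appeared; continue
        }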

        // Scroll for posts
        for (let i = 0; i < 3; i++) {
          await page.evaluate(() =>
            window.scrollTo(0, document.body.scrollHeight)
          );
          await new Promise(r => setTimeout(r, 2000));
        }

        // Extract page metrics
        const metrics = await page.evaluate(() => {
          const articles = document.querySelectorAll('[role="article"]');
          const postCount = articles.length;

          let totalReactions = 0;
          let totalComments = 0;
          const postTexts = [];

          articles.forEach(article => {
            // Count reactions. Reads the leading integer only, so abbreviated
            // labels such as "1.2K" are undercounted.
            const reactionEl = article.querySelector(
              '[aria-label*="reaction"], [aria-label*="like"]'
            );
            if (reactionEl) {
              const label = reactionEl.getAttribute('aria-label') || '';
              const match = label.match(/\d[\d,]*/);
              if (match) totalReactions += parseInt(match[0].replace(/,/g, ''), 10) || 0;
            }

            // Count comments, allowing thousands separators ("1,234 comments")
            const commentEl = article.querySelector('a[href*="comment"]');
            if (commentEl) {
              const match = commentEl.innerText.match(/\d[\d,]*/);
              if (match) totalComments += parseInt(match[0].replace(/,/g, ''), 10) || 0;
            }

            // Get post text
            const textEl = article.querySelector(
              '[data-ad-preview="message"]'
            );
            if (textEl) postTexts.push(textEl.innerText.slice(0, 200));
          });

          // Get follower count
          const followerEl = document.querySelector(
            'a[href*="followers"] span, [class*="follower"]'
          );
          const followers = followerEl ? followerEl.innerText : 'N/A';

          return {
            postCount,
            totalReactions,
            totalComments,
            avgReactionsPerPost: postCount > 0
              ? Math.round(totalReactions / postCount)
              : 0,
            avgCommentsPerPost: postCount > 0
              ? Math.round(totalComments / postCount)
              : 0,
            followers,
            samplePosts: postTexts.slice(0, 3),
          };
        });

        results[pageName] = metrics;
      } catch (error) {
        results[pageName] = { error: error.message };
      } finally {
        await browser.close();
      }
    }

    return results;
  }
}

// Analyze competitors
(async () => {
  const analyzer = new CompetitorPageAnalyzer();
  const results = await analyzer.analyzeCompetitors([
    'ScrapingBee',
    'Apify',
    'BrightData',
  ]);

  for (const [page, data] of Object.entries(results)) {
    if (data.error) {
      console.log(`\n❌ ${page}: ${data.error}`);
      continue;
    }
    console.log(`\n📊 ${page}:`);
    console.log(`   Followers: ${data.followers}`);
    console.log(`   Posts visible: ${data.postCount}`);
    console.log(`   Avg reactions/post: ${data.avgReactionsPerPost}`);
    console.log(`   Avg comments/post: ${data.avgCommentsPerPost}`);
    console.log(`   Sample: ${data.samplePosts[0]?.slice(0, 80) || 'N/A'}...`);
  }
})();

Use Case 3: AI Agent Social Intelligence

Build an AI agent tool that extracts and analyzes Facebook page data for automated market intelligence — perfect for AI agents that need to understand brand presence and social engagement.

import requests
import json
from datetime import datetime

class FacebookIntelligenceTool:
    """AI agent tool for Facebook social intelligence."""

    def __init__(self, mantis_api_key):
        self.api_key = mantis_api_key
        self.base_url = "https://api.mantisapi.com/v1/scrape"

    def get_page_intelligence(self, page_url):
        """Extract comprehensive intelligence from a Facebook page.

        Designed as an AI agent tool — returns structured data
        that agents can reason about.
        """
        # Scrape the page
        response = requests.post(
            self.base_url,
            headers={"x-api-key": self.api_key},
            json={
                "url": page_url,
                "render_js": True,
                "wait_for": "[role='article']",
                "scroll_count": 3,
                "extract": {
                    "page_name": "h1, [role='heading']",
                    "category": "a[href*='/pages/category/']",
                    "posts": {
                        "_selector": "[role='article']",
                        "_type": "list",
                        "text": "[data-ad-preview='message']",
                        "reactions": "[aria-label*='reaction']::attr(aria-label)",
                        "comments": "a[href*='comment']",
                        "timestamp": "a[href*='/posts/'] span",
                    }
                }
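            },
            timeout=60,  # fail fast instead of hanging on a slow render
        )
        response.raise_for_status()  # raise on HTTP errors rather than parsing an error body

        data = response.json()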
        extracted = data.get("extracted", {})

        # Analyze content strategy
        posts = extracted.get("posts", [])
        analysis = {
            "page_name": extracted.get("page_name", "Unknown"),
            "category": extracted.get("category", "Unknown"),
            "posts_analyzed": len(posts),
            "content_types": self._classify_content(posts),
            "posting_patterns": self._analyze_patterns(posts),
            "engagement_summary": self._summarize_engagement(posts),
            "top_performing": self._top_posts(posts, 3),
            "extracted_at": datetime.utcnow().isoformat(),
        }

        return analysis

    def get_ad_intelligence(self, brand_name, country="US"):
        """Extract competitor ad intelligence from Facebook Ad Library."""
        from urllib.parse import urlencode

        # URL-encode the query so brand names with spaces or "&" survive intact
        query = urlencode({
            "active_status": "active",
            "ad_type": "all",
            "country": country,
            "q": brand_name,
        })
        response = requests.post(
            self.base_url,
            headers={"x-api-key": self.api_key},
            json={
                "url": f"https://www.facebook.com/ads/library/?{query}",
                "render_js": True,
                "scroll_count": 5,
                "extract": {
                    "ads": {
                        "_selector": "[role='article']",
                        "_type": "list",
                        "text": "div",
                        "started": "span:has-text('Started')",
                        "platforms": "span:has-text('Facebook'), span:has-text('Instagram')",
                    }
                }
            },
            timeout=60,  # fail fast instead of hanging on a slow render
        )
        response.raise_for_status()  # raise on HTTP errors rather than parsing an error body

        data = response.json()
        ads = data.get("extracted", {}).get("ads", [])

        return {
            "brand": brand_name,
            "country": country,
            "active_ads": len(ads),
            "ad_samples": [
                {"text": ad.get("text", "")[:200], "started": ad.get("started", "")}
                for ad in ads[:5]
            ],
            "extracted_at": datetime.utcnow().isoformat(),
        }

    def _classify_content(self, posts):
        """Classify posts by content type."""
        types = {"text_only": 0, "with_link": 0, "with_media": 0}
        for post in posts:
            text = post.get("text") or ""
            if "http" in text or "www." in text:
                types["with_link"] += 1
            # Note: the extract schema in get_page_intelligence has no "imageUrl"
            # field, so this branch only fires if you add one there.
            elif post.get("imageUrl"):
                types["with_media"] += 1
            else:
                types["text_only"] += 1
        return types

    def _analyze_patterns(self, posts):
        """Analyze posting patterns."""
        return {
            "total_posts_visible": len(posts),
            # Rough proxy only: timestamps are not parsed, so this checks
            # post volume rather than actual posting intervals.
            "has_regular_cadence": len(posts) >= 5,
        }

    @staticmethod
    def _reaction_count(post):
        """Parse the leading integer from a post's reactions label.

        Returns 0 when no number is present. Abbreviated labels such as
        "1.2K" are read as 1 rather than raising ValueError.
        """
        import re
        match = re.search(r"\d[\d,]*", str(post.get("reactions") or ""))
        return int(match.group().replace(",", "")) if match else 0

    def _summarize_engagement(self, posts):
        """Summarize engagement across posts."""
        total_reactions = sum(self._reaction_count(post) for post in posts)

        return {
            "total_reactions": total_reactions,
            "avg_reactions_per_post": total_reactions // max(len(posts), 1),
        }

    def _top_posts(self, posts, n=3):
        """Return top performing posts by engagement."""
        # Score each post by its (approximate) reaction count
        scored = [
            {
                "text": (post.get("text") or "")[:200],
                "reactions": self._reaction_count(post),
            }
            for post in posts
        ]

        scored.sort(key=lambda x: x["reactions"], reverse=True)
        return scored[:n]


# Usage as an AI agent tool
tool = FacebookIntelligenceTool("YOUR_API_KEY")

# Get page intelligence
intel = tool.get_page_intelligence("https://www.facebook.com/TechCrunch")
print(json.dumps(intel, indent=2))

# Get ad intelligence
ad_intel = tool.get_ad_intelligence("ScrapingBee")
print(f"\n{ad_intel['brand']}: {ad_intel['active_ads']} active ads")
for ad in ad_intel["ad_samples"]:
    print(f"  - {ad['text'][:80]}...")
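
Facebook often abbreviates large counts (for example "1.2K" or "4.5M"), which a plain digit regex cannot convert to a real number. A minimal sketch of a normalizer follows, assuming English-locale labels with `K`/`M`/`B` suffixes and commas as thousands separators. The helper name `parse_count` is hypothetical and is not part of the classes above.

```python
import re

# Multipliers for abbreviated counts (assumed English-locale suffixes)
_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

# A number with optional thousands separators and decimal part, followed by
# an optional suffix. The trailing \b stops "12 months" being read as 12M.
_COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([kmb])?\b", re.IGNORECASE)


def parse_count(label):
    """Convert a label such as '1.2K reactions' or '3,456 likes' to an int.

    Returns 0 when the label contains no number.
    """
    match = _COUNT_RE.search(str(label or ""))
    if not match:
        return 0
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    # round() guards against float artifacts such as 2.3 * 1000 = 2299.99...
    return int(round(number * _SUFFIXES.get(suffix, 1)))
```

Swapping this in wherever a reactions or comments label is parsed would keep averages from being skewed by popular posts, whose counts are the ones most likely to be abbreviated.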

Graph API vs Scraping vs Mantis

| Feature | Facebook Graph API | DIY Scraping | Mantis API |
| --- | --- | --- | --- |
| Cost | Free (own pages only) | Server + residential proxy costs | $29/mo (5K requests) |
| Access scope | Only pages you own/manage | Any public page | Any public page |
| Competitor data | ❌ Not available | ✅ Public pages | ✅ Public pages |
| Ad Library | ✅ Ad Library API | ✅ Web scraping | ✅ Included |
| Setup time | Hours (app review required) | Days (stealth config, proxies) | Minutes |
| Login wall handling | N/A (API) | You manage | Included |
| Anti-bot handling | N/A (API) | You manage | Included |
| Proxy management | N/A | Residential required ($$$) | Included |
| JS rendering | N/A (API) | Browser required | Included |
| Maintenance | API version updates | High (DOM changes, fingerprinting) | Zero |
| Reliability | High (limited scope) | Low-Medium | High |

Extract Facebook Data at Scale — Without the Headaches

Mantis handles login walls, residential proxies, fingerprinting, and JavaScript rendering. Get structured Facebook data with a single API call.


Facebook scraping carries more legal risk than most platforms due to Meta's aggressive enforcement posture. Here's what you need to know:

Key Legal Precedents

Meta's Enforcement Stance

Meta is uniquely aggressive among tech companies in enforcing against scraping:

Best Practices

  1. Only scrape public data — Never access private profiles, closed groups, or authenticated-only content
  2. Minimize data collection — Only collect what you actually need; avoid bulk data hoarding
  3. Don't store personal data unnecessarily — User names, profile info, and comments have GDPR/CCPA implications
  4. Use the Ad Library API when possible — It's public and explicitly intended for transparency research
  5. Respect rate limits — Don't overwhelm Facebook's servers
  6. Don't republish raw content — Analysis and aggregation are safer than republishing posts verbatim
  7. Get legal advice — For any commercial use of Facebook data, consult legal counsel in your jurisdiction
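
Practices 2 and 3 can be applied at the point where records are saved. The sketch below keeps only an allow-list of fields and replaces the author name with a keyed hash. The field names in `ALLOWED_FIELDS`, the `author` key, and both helper names are assumptions for illustration. Pseudonymized data may still count as personal data, so this reduces exposure rather than removing legal obligations.

```python
import hashlib
import hmac

# Fields worth keeping for aggregate analysis (hypothetical names)
ALLOWED_FIELDS = {"text", "timestamp", "likes", "comments_count"}


def pseudonymize(value, secret_key):
    """Return a short keyed hash so records can be grouped without storing the name."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def minimize_record(raw, secret_key):
    """Drop fields outside the allow-list and pseudonymize the author, if present."""
    record = {k: v for k, v in raw.items() if k in ALLOWED_FIELDS}
    author = raw.get("author")
    if author:
        record["author_id"] = pseudonymize(author, secret_key)
    return record
```

Using HMAC with a secret key, rather than a bare SHA-256, makes it harder to reverse the pseudonym by hashing a list of known names; the key should be stored separately from the scraped data.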
Disclaimer: This article is for educational purposes only. Web scraping may violate Facebook's Terms of Service. Meta has actively pursued legal action against scrapers. Always ensure your scraping activities comply with applicable laws and regulations in your jurisdiction.

FAQ

See the structured FAQ data above for common questions about scraping Facebook. Key points:

Next Steps

Now that you know how to scrape Facebook, explore more scraping guides: