Is it legal to scrape Facebook?

Scraping publicly available Facebook data may be legal under the hiQ v. LinkedIn precedent, but Meta actively enforces against scraping. In Meta v. Bright Data, Meta sued a data scraping company; in January 2024 the court granted summary judgment for Bright Data on the contract claim, finding that Meta's terms did not bar logged-out scraping of public data. Facebook's Terms of Service explicitly prohibit automated data collection. Always scrape only public data, respect rate limits, and consult legal counsel for commercial use.

Can I use the Facebook Graph API instead of scraping?

The Facebook Graph API is extremely limited for data extraction. After the Cambridge Analytica scandal, Meta restricted API access severely. You can only access data from pages you own or manage, and public page discovery is not available. For competitive analysis, market research, or monitoring public pages you don't own, web scraping or a scraping API like Mantis is the only practical option.

What is the easiest way to scrape Facebook?

The easiest method is scraping mbasic.facebook.com — Facebook's mobile-basic version that serves simple HTML without JavaScript. It's lightweight and parseable with Python + BeautifulSoup. For JavaScript-heavy pages, use Playwright or Puppeteer with stealth plugins. For production use, a web scraping API like Mantis handles anti-bot measures and login walls automatically.

How do I scrape Facebook without getting blocked?

Facebook has aggressive anti-bot defenses. Key strategies: use residential proxies (datacenter IPs are blocked), rotate User-Agents, add random delays between requests (5-15 seconds), use stealth browser automation (playwright-stealth or puppeteer-extra-plugin-stealth), avoid login when possible by using mbasic.facebook.com for public pages, and limit request volume. A managed API like Mantis handles all of this automatically.

Can I scrape Facebook groups?

Public Facebook groups can be scraped, but most groups require login to view content. Scraping behind a login increases legal and ToS risk significantly. Public group listings and basic metadata are accessible without authentication, but post content typically requires a logged-in session. Consider whether the data you need is available from public pages instead.

Can I scrape Facebook with Python?

Yes. The most effective approaches are: (1) Requests + BeautifulSoup with mbasic.facebook.com for lightweight HTML parsing of public pages; (2) Playwright with stealth for JavaScript-rendered content and infinite scroll; (3) A scraping API like Mantis for production reliability. Avoid using the Graph API for data you don't own — it's too restricted.

How to Scrape Facebook Data in 2026: Pages, Posts & Groups

Published March 31, 2026 · 19 min read

Why Scrape Facebook?
The Facebook Graph API Problem
4 Methods to Scrape Facebook
Method 1: Python + Requests (mbasic.facebook.com)
Method 2: Playwright Headless Browser
Method 3: Node.js + Puppeteer
Method 4: Mantis Web Scraping API
Facebook Anti-Bot Defenses
What Data Can You Extract?
3 Real-World Use Cases
Graph API vs Scraping vs Mantis
Legal Considerations
FAQ

Why Scrape Facebook?

Facebook remains the world's largest social network with over 3 billion monthly active users. Despite declining engagement among younger demographics, it's still the dominant platform for:

Brand monitoring — Track what customers say about your brand on public pages and groups
Competitive intelligence — Monitor competitor pages, ad campaigns, and engagement metrics
Market research — Understand audience sentiment, trending topics, and community discussions
Lead generation — Find businesses and decision-makers through public pages and groups
Event tracking — Monitor public events, attendance, and community activity
Ad intelligence — Track competitors' ad creatives, targeting, and messaging through the Ad Library
Local business data — Extract business listings, reviews, and contact information

But since the Cambridge Analytica scandal in 2018, Meta has locked down data access more aggressively than any other major platform. The Graph API is nearly useless for third-party data extraction, making web scraping the only viable approach for competitive intelligence and market research.

The Facebook Graph API Problem

After Cambridge Analytica, Meta gutted the Graph API. Here's what you can't do with the official API:

What You Want	Graph API Support	Reality
Search public pages	❌ Removed	Page search API deprecated in 2019
Read other pages' posts	❌ Removed	Only your own pages via Page Access Token
Read group posts	❌ Removed	Only groups your app is installed in
User profiles	❌ Removed	Only the authenticated user's own data
Comments on others' posts	❌ Removed	Only comments on your own page's posts
Your own page insights	✅ Available	Analytics for pages you manage
Post to your page	✅ Available	Publishing to pages you manage
Ad Library	✅ Available	Public ad transparency data

Bottom line: The Graph API only lets you manage your own pages. For competitive analysis, market research, or monitoring any page you don't own, web scraping is the only option.
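For the rows marked as available, a request for posts on a page you manage might be built roughly as in the sketch below. It assumes you already hold a Page Access Token. The version string (`v19.0`), the placeholder page ID, the field list, and the helper name `build_own_page_posts_url` are illustrative assumptions, so check the current Graph API reference before relying on them.

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"


def build_own_page_posts_url(page_id, access_token, version="v19.0", limit=10):
    """Build a Graph API URL for posts on a page you manage.

    The version default and field list are assumptions for illustration.
    """
    params = {
        "fields": "message,created_time,permalink_url",
        "limit": limit,
        "access_token": access_token,
    }
    return f"{GRAPH_BASE}/{version}/{page_id}/posts?{urlencode(params)}"


# Placeholder values; a real call needs a valid Page Access Token.
url = build_own_page_posts_url("1234567890", "PAGE_ACCESS_TOKEN")
print(url)
```

The helper only builds the URL; sending it (for example with `requests.get(url, timeout=30)`) returns data only for pages the token is authorized to manage.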

4 Methods to Scrape Facebook

Here are four approaches to extract data from Facebook, from lightweight HTML parsing to production-ready API solutions:

Python + Requests (mbasic.facebook.com) — Parse the simplified mobile HTML, no JS rendering needed
Playwright headless browser — Full browser automation for the main Facebook site
Node.js + Puppeteer — Stealth browser scraping with GraphQL API interception
Mantis Web Scraping API — One-call solution, handles anti-bot, production-ready

Method 1: Python + Requests (mbasic.facebook.com)

Facebook's mobile basic site (mbasic.facebook.com) serves simple HTML without JavaScript — perfect for lightweight scraping. It's designed for low-bandwidth connections and older devices, which means minimal anti-bot protections compared to the main site.

Setup

pip install requests beautifulsoup4

Scrape a Public Facebook Page

import requests
from bs4 import BeautifulSoup
import time
import re

class FacebookScraper:
    """Scrape Facebook using mbasic.facebook.com (simplified HTML)."""

    BASE_URL = "https://mbasic.facebook.com"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) "
                          "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                          "Version/16.6 Mobile/15E148 Safari/604.1",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        })
        self.last_request = 0

    def _rate_limit(self, delay=8):
        """Respect rate limits — Facebook is aggressive about blocking."""
        elapsed = time.time() - self.last_request
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request = time.time()

    def get_page_info(self, page_name):
        """Fetch basic info about a public Facebook page."""
        self._rate_limit()

        url = f"{self.BASE_URL}/{page_name}"
        resp = self.session.get(url, timeout=30)  # avoid hanging on a stalled connection
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        # Extract page name from title
        title = soup.find("title")
        page_title = title.text.strip() if title else page_name

        # Extract profile info
        profile_info = {}
        info_section = soup.find("div", {"id": "pages_mbasic_header_top"})
        if info_section:
            profile_info["name"] = info_section.get_text(strip=True)

        # Look for likes/followers count
        likes_elem = soup.find(string=re.compile(r"[\d,.]+ (likes?|followers?)", re.I))
        if likes_elem:
            profile_info["engagement_text"] = likes_elem.strip()

        return {
            "page_name": page_name,
            "title": page_title,
            "url": f"https://facebook.com/{page_name}",
            "info": profile_info,
        }

    def get_page_posts(self, page_name, max_posts=20):
        """Scrape posts from a public Facebook page."""
        self._rate_limit()

        url = f"{self.BASE_URL}/{page_name}"
        resp = self.session.get(url, timeout=30)  # avoid hanging on a stalled connection
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        posts = []
        # mbasic posts are in article tags or specific div structures
        post_elements = soup.find_all("div", {"role": "article"})

        if not post_elements:
            # Fallback: look for story containers
            post_elements = soup.find_all("div", class_=re.compile("story"))

        for elem in post_elements[:max_posts]:
            post = self._parse_post(elem)
            if post and post.get("text"):
                posts.append(post)

        return posts

    def _parse_post(self, elem):
        """Parse a single post element from mbasic HTML."""
        post = {}

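        # Extract post text. Nested <p>/<span> tags can repeat the same
        # string, so skip fragments that were already collected.
        text_parts = []
        for p in elem.find_all(["p", "span"]):
            text = p.get_text(strip=True)
            if text and len(text) > 10 and text not in text_parts:
                text_parts.append(text)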

        post["text"] = " ".join(text_parts)[:1000] if text_parts else ""

        # Extract links
        links = []
        for a in elem.find_all("a", href=True):
            href = a["href"]
            if "facebook.com" not in href and href.startswith("http"):
                links.append(href)
        post["links"] = links

        # Extract timestamp (mbasic uses abbr tags for timestamps)
        time_elem = elem.find("abbr")
        if time_elem:
            post["timestamp"] = time_elem.get_text(strip=True)

        # Extract engagement (likes, comments, shares text)
        footer = elem.find("footer") or elem.find(
            "div", class_=re.compile("footer|action")
        )
        if footer:
            footer_text = footer.get_text(" ", strip=True)
            # Extract like count
            likes_match = re.search(r"([\d,.]+)\s*(likes?|reactions?)", footer_text, re.I)
            if likes_match:
                post["likes"] = likes_match.group(1)

            comments_match = re.search(r"([\d,.]+)\s*comments?", footer_text, re.I)
            if comments_match:
                post["comments_count"] = comments_match.group(1)

            shares_match = re.search(r"([\d,.]+)\s*shares?", footer_text, re.I)
            if shares_match:
                post["shares"] = shares_match.group(1)

        # Extract image if present
        img = elem.find("img", src=re.compile("scontent"))
        if img:
            post["image_url"] = img.get("src", "")

        return post

    def get_page_reviews(self, page_name, max_reviews=20):
        """Scrape reviews from a Facebook business page."""
        self._rate_limit()

        url = f"{self.BASE_URL}/{page_name}/reviews"
        resp = self.session.get(url, timeout=30)  # avoid hanging on a stalled connection
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        reviews = []
        review_elements = soup.find_all("div", {"role": "article"})

        for elem in review_elements[:max_reviews]:
            review = {}
            # Extract reviewer name
            author = elem.find("strong") or elem.find("h3")
            if author:
                review["author"] = author.get_text(strip=True)

            # Extract review text
            text = elem.get_text(" ", strip=True)
            review["text"] = text[:500]

            # Look for star ratings
            stars = elem.find_all("img", alt=re.compile("star", re.I))
            if stars:
                review["rating"] = len(stars)

            if review.get("text"):
                reviews.append(review)

        return reviews

    def search_pages(self, query, max_results=10):
        """Search for Facebook pages (limited on mbasic)."""
        self._rate_limit()

        url = f"{self.BASE_URL}/search/pages/"
        params = {"q": query}
        resp = self.session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        results = []
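        for link in soup.find_all("a", href=re.compile("/pages/")):
            name = link.get_text(strip=True)
            href = link["href"]
            if name and len(name) > 2:
                # hrefs may be relative or absolute; only prefix relative ones
                full_url = href if href.startswith("http") else f"https://facebook.com{href}"
                results.append({
                    "name": name,
                    "url": full_url,
                })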

        return results[:max_results]


# Usage
scraper = FacebookScraper()

# Get page info
page_info = scraper.get_page_info("TechCrunch")
print(f"Page: {page_info['title']}")
print(f"URL: {page_info['url']}")

# Get recent posts
posts = scraper.get_page_posts("TechCrunch", max_posts=5)
for i, post in enumerate(posts, 1):
    print(f"\n--- Post {i} ---")
    print(f"Text: {post['text'][:150]}...")
    if post.get("likes"):
        print(f"Likes: {post['likes']}")
    if post.get("timestamp"):
        print(f"Time: {post['timestamp']}")
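The scraper above stores likes, comments, and shares as raw strings such as "1,234". For sorting or aggregation you will probably want integers. The helper below is a minimal sketch of that normalization; `parse_count` is a hypothetical name, and the handling of abbreviated forms like "1.2K" is an assumption about how counts may be displayed, not something the regexes above capture.

```python
import re

_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(raw):
    """Convert a display count such as '1,234' or '1.2K' to an int.

    Returns None when the string does not look like a count.
    """
    match = re.fullmatch(r"\s*([\d,.]+)\s*([kmb])?\s*", raw, re.I)
    if not match:
        return None
    number, suffix = match.group(1), match.group(2)
    try:
        if suffix:
            # Abbreviated form: '.' is a decimal point, ',' a thousands separator.
            value = float(number.replace(",", "")) * _SUFFIXES[suffix.lower()]
            return int(round(value))
        # Plain form: counts are whole numbers, so strip both separators.
        digits = re.sub(r"[,.]", "", number)
        return int(digits) if digits else None
    except ValueError:
        return None


print(parse_count("1,234"), parse_count("1.2K"), parse_count("n/a"))
```

Returning None instead of raising keeps a single malformed footer from aborting a long scraping run.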

Pagination: Scrape Multiple Pages of Posts

def scrape_all_posts(scraper, page_name, max_posts=50):
    """Scrape multiple pages of Facebook posts using mbasic pagination."""
    all_posts = []
    url = f"{scraper.BASE_URL}/{page_name}"

    while len(all_posts) < max_posts:
        scraper._rate_limit(delay=10)  # Extra cautious with Facebook

        resp = scraper.session.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        # Parse posts on this page
        post_elements = soup.find_all("div", {"role": "article"})
        if not post_elements:
            post_elements = soup.find_all("div", class_=re.compile("story"))

        new_posts = 0
        for elem in post_elements:
            post = scraper._parse_post(elem)
            if post and post.get("text"):
                all_posts.append(post)
                new_posts += 1

        if new_posts == 0:
            break

        print(f"Fetched {len(all_posts)} posts so far...")

        # Find "See more posts" link for pagination
        more_link = soup.find("a", string=re.compile("See More|Show More|Older", re.I))
        if more_link and more_link.get("href"):
            next_url = more_link["href"]
            if not next_url.startswith("http"):
                url = f"{scraper.BASE_URL}{next_url}"
            else:
                url = next_url
        else:
            break

    return all_posts[:max_posts]


# Scrape 50 posts from a page
all_posts = scrape_all_posts(scraper, "TechCrunch", max_posts=50)
print(f"\nTotal posts scraped: {len(all_posts)}")
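Two details of the pagination loop are worth isolating: resolving the "next page" href against the current URL, and dropping posts that repeat across pages. The sketch below shows one way to do both with the standard library. The function names and the `cursor` query parameter in the example are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, href):
    """Resolve a pagination href (relative or absolute) against the current page."""
    return urljoin(current_url, href)


def dedupe_posts(posts):
    """Drop posts whose text was already seen, keeping first occurrences."""
    seen = set()
    unique = []
    for post in posts:
        key = post.get("text", "").strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


# Hypothetical hrefs as they might appear in a "See More" link.
relative = resolve_next_url(
    "https://mbasic.facebook.com/TechCrunch", "/TechCrunch?cursor=abc"
)
absolute = resolve_next_url(
    "https://mbasic.facebook.com/TechCrunch",
    "https://mbasic.facebook.com/TechCrunch?cursor=def",
)
unique_posts = dedupe_posts([{"text": "Hello"}, {"text": "Hello"}, {"text": "World"}])
print(relative, absolute, len(unique_posts))
```

`urljoin` handles relative and absolute hrefs in one call, which removes the need for a manual `startswith("http")` branch.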

Method 2: Playwright Headless Browser

For the full Facebook experience — JavaScript rendering, infinite scroll, and dynamic content loading — Playwright with stealth configuration is your best option.

Setup

pip install playwright "playwright-stealth<2"
playwright install chromium

Scrape Facebook Page with Full Rendering

import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async  # 1.x API; 2.x exposes a Stealth class instead

class FacebookPlaywrightScraper:
    """Scrape Facebook using Playwright for full JS rendering."""

    async def scrape_page_posts(self, page_name, max_posts=20, scroll_count=5):
        """Scrape posts from a public Facebook page with full rendering."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(
                headless=True,
                args=[
                    "--disable-blink-features=AutomationControlled",
                    "--no-sandbox",
                ]
            )

            context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/121.0.0.0 Safari/537.36",
                locale="en-US",
            )

            page = await context.new_page()
            await stealth_async(page)

            # Navigate to the Facebook page
            url = f"https://www.facebook.com/{page_name}"
            await page.goto(url, wait_until="networkidle", timeout=30000)

            # Handle cookie consent / login popup.
            # is_visible() returns immediately, so wait for the element explicitly.
            try:
                close_btn = page.locator('[aria-label="Close"]').first
                await close_btn.wait_for(state="visible", timeout=3000)
                await close_btn.click()
            except Exception:
                pass  # No popup appeared within the timeout

            # Dismiss login modal if it appears
            try:
                not_now = page.locator('text="Not Now"').first
                await not_now.wait_for(state="visible", timeout=3000)
                await not_now.click()
            except Exception:
                pass  # No login modal appeared within the timeout

            # Scroll to load more posts
            for i in range(scroll_count):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await asyncio.sleep(2 + (i * 0.5))  # Progressive delay

            # Extract posts from the page
            posts = await page.evaluate("""() => {
                const posts = [];
                const postElements = document.querySelectorAll(
                    '[role="article"], [data-ad-preview="message"]'
                );

                postElements.forEach(elem => {
                    const textElem = elem.querySelector(
                        '[data-ad-preview="message"], [data-ad-comet-preview="message"]'
                    );
                    const text = textElem ? textElem.innerText : '';

                    // Extract engagement metrics
                    const likeSpan = elem.querySelector(
                        '[aria-label*="reaction"], [aria-label*="like"]'
                    );
                    const likes = likeSpan ? likeSpan.getAttribute('aria-label') : '';

                    const commentLink = elem.querySelector('a[href*="comment"]');
                    const comments = commentLink ? commentLink.innerText : '';

                    const shareSpan = elem.querySelector(
                        'span[class*="share"], a[href*="shares"]'
                    );
                    const shares = shareSpan ? shareSpan.innerText : '';

                    // Extract image
                    const img = elem.querySelector(
                        'img[src*="scontent"], img[data-visualcompletion]'
                    );
                    const imageUrl = img ? img.src : '';

                    // Extract timestamp
                    const timeLink = elem.querySelector('a[href*="/posts/"] span');
                    const timestamp = timeLink ? timeLink.innerText : '';

                    if (text && text.length > 10) {
                        posts.push({
                            text: text.slice(0, 1000),
                            likes,
                            comments,
                            shares,
                            imageUrl,
                            timestamp,
                        });
                    }
                });

                return posts;
            }""")

            await browser.close()
            return posts[:max_posts]

    async def scrape_ad_library(self, page_name, country="US"):
        """Scrape Facebook Ad Library for a page's active ads."""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/121.0.0.0 Safari/537.36",
            )

            page = await context.new_page()

            # Facebook Ad Library is publicly accessible
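            # URL-encode the query so names with spaces or symbols survive
            from urllib.parse import quote
            url = (
                f"https://www.facebook.com/ads/library/"
                f"?active_status=active&ad_type=all"
                f"&country={country}&q={quote(page_name)}"
            )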
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await asyncio.sleep(3)

            # Scroll to load ads
            for _ in range(3):
                await page.evaluate(
                    "window.scrollTo(0, document.body.scrollHeight)"
                )
                await asyncio.sleep(2)

            # Extract ad data
            ads = await page.evaluate("""() => {
                const ads = [];
                const adCards = document.querySelectorAll(
                    '[class*="ad-card"], div[role="article"]'
                );

                adCards.forEach(card => {
                    const text = card.innerText;
                    const img = card.querySelector('img[src*="scontent"]');
                    const link = card.querySelector('a[href*="facebook.com"]');

                    ads.push({
                        text: text.slice(0, 500),
                        imageUrl: img ? img.src : '',
                        link: link ? link.href : '',
                    });
                });

                return ads;
            }""")

            await browser.close()
            return ads


# Usage
async def main():
    scraper = FacebookPlaywrightScraper()

    # Scrape page posts
    print("--- Scraping Facebook Page Posts ---")
    posts = await scraper.scrape_page_posts("TechCrunch", max_posts=10)
    for i, post in enumerate(posts, 1):
        print(f"\nPost {i}: {post['text'][:120]}...")
        if post.get("likes"):
            print(f"  Reactions: {post['likes']}")

    # Scrape Ad Library
    print("\n--- Scraping Ad Library ---")
    ads = await scraper.scrape_ad_library("TechCrunch")
    print(f"Found {len(ads)} active ads")


asyncio.run(main())

Method 3: Node.js + Puppeteer

Node.js with Puppeteer and the stealth plugin is excellent for intercepting Facebook's internal GraphQL API calls — giving you structured JSON data directly from Facebook's backend.

Setup

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

Scrape with GraphQL API Interception

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

class FacebookPuppeteerScraper {
  async scrapePagePosts(pageName, maxPosts = 20, scrollCount = 5) {
    const browser = await puppeteer.launch({
      headless: 'new',
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-blink-features=AutomationControlled',
      ],
    });

    const page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });

    // Intercept GraphQL responses for structured data
    const graphqlResponses = [];
    page.on('response', async (response) => {
      const url = response.url();
      if (url.includes('graphql') || url.includes('api/graphql')) {
        try {
          const json = await response.json();
          graphqlResponses.push(json);
        } catch (e) {
          // Not JSON, skip
        }
      }
    });

    // Navigate to page
    const url = `https://www.facebook.com/${pageName}`;
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    // Handle popups
    try {
      const closeBtn = await page.$('[aria-label="Close"]');
      if (closeBtn) await closeBtn.click();
    } catch (e) {}

    try {
      // page.$x was removed in recent Puppeteer releases; use the xpath/ prefix
      const notNow = await page.$$('xpath/.//div[text()="Not Now" or text()="Not now"]');
      if (notNow.length) await notNow[0].click();
    } catch (e) {}

    // Scroll to load posts
    for (let i = 0; i < scrollCount; i++) {
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      await new Promise(r => setTimeout(r, 2000 + i * 500));
    }

    // Extract posts from DOM
    const posts = await page.evaluate((max) => {
      const results = [];
      const articles = document.querySelectorAll('[role="article"]');

      articles.forEach(article => {
        if (results.length >= max) return;

        const textEl = article.querySelector(
          '[data-ad-preview="message"], [data-ad-comet-preview="message"]'
        );
        const text = textEl ? textEl.innerText : '';

        const reactionLabel = article.querySelector(
          '[aria-label*="reaction"], [aria-label*="like"]'
        );
        const reactions = reactionLabel
          ? reactionLabel.getAttribute('aria-label')
          : '';

        const commentEl = article.querySelector('a[href*="comment"]');
        const comments = commentEl ? commentEl.innerText : '';

        const timeEl = article.querySelector('a[href*="/posts/"] span');
        const timestamp = timeEl ? timeEl.innerText : '';

        const imgEl = article.querySelector('img[src*="scontent"]');
        const imageUrl = imgEl ? imgEl.src : '';

        if (text && text.length > 10) {
          results.push({
            text: text.slice(0, 1000),
            reactions,
            comments,
            timestamp,
            imageUrl,
          });
        }
      });

      return results;
    }, maxPosts);

    // Extract structured data from intercepted GraphQL responses
    const structuredPosts = this.parseGraphQLPosts(graphqlResponses);

    await browser.close();

    // Merge DOM posts with GraphQL data
    return {
      domPosts: posts,
      graphqlPosts: structuredPosts,
      totalGraphQLResponses: graphqlResponses.length,
    };
  }

  parseGraphQLPosts(responses) {
    const posts = [];

    for (const resp of responses) {
      try {
        // Facebook's GraphQL responses have varying structures
        const data = resp?.data;
        if (!data) continue;

        // Look for timeline feed data
        const edges = this.findNestedKey(data, 'edges') || [];
        for (const edge of edges) {
          const node = edge?.node;
          if (!node) continue;

          const story = node?.comet_sections?.content?.story;
          if (!story) continue;

          const message = story?.message?.text || '';
          const createdTime = node?.created_time;
          const feedbackSummary = node?.comet_sections?.feedback?.story?.feedback_context;

          if (message) {
            posts.push({
              text: message,
              createdTime,
              reactions: feedbackSummary?.reaction_count?.count || 0,
              comments: feedbackSummary?.comment_count?.total_count || 0,
              shares: feedbackSummary?.share_count?.count || 0,
            });
          }
        }
      } catch (e) {
        // GraphQL structure varies, skip unparseable responses
      }
    }

    return posts;
  }

  findNestedKey(obj, key) {
    if (!obj || typeof obj !== 'object') return null;
    if (obj[key]) return obj[key];
    for (const k of Object.keys(obj)) {
      const result = this.findNestedKey(obj[k], key);
      if (result) return result;
    }
    return null;
  }
}

// Usage
(async () => {
  const scraper = new FacebookPuppeteerScraper();

  console.log('--- Scraping Facebook Page ---');
  const result = await scraper.scrapePagePosts('TechCrunch', 10);

  console.log(`\nDOM Posts: ${result.domPosts.length}`);
  result.domPosts.slice(0, 3).forEach((post, i) => {
    console.log(`\n  Post ${i + 1}: ${post.text.slice(0, 100)}...`);
    if (post.reactions) console.log(`  Reactions: ${post.reactions}`);
  });

  console.log(`\nGraphQL Posts: ${result.graphqlPosts.length}`);
  result.graphqlPosts.slice(0, 3).forEach((post, i) => {
    console.log(`\n  Post ${i + 1}: ${post.text.slice(0, 100)}...`);
    console.log(`  Reactions: ${post.reactions} | Comments: ${post.comments} | Shares: ${post.shares}`);
  });

  console.log(`\nTotal GraphQL responses intercepted: ${result.totalGraphQLResponses}`);
})();
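One limitation of `findNestedKey` above is that it returns only the first `edges` array it finds, so a response carrying several feeds yields just one of them. If you post-process saved responses in Python, a generator that yields every match is a small change. The sketch below assumes the responses were dumped to JSON; the `sample` structure is invented for illustration and does not reflect Facebook's actual schema.

```python
def iter_key(obj, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # Keep descending, since matches can also be nested inside matches.
            yield from iter_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_key(item, key)


# Invented structure with two separate "edges" arrays.
sample = {
    "data": {
        "page": {
            "timeline": {"edges": [{"node": {"id": "1"}}]},
            "pinned": {"edges": [{"node": {"id": "2"}}]},
        }
    }
}

all_edges = [edge for edges in iter_key(sample, "edges") for edge in edges]
node_ids = [edge["node"]["id"] for edge in all_edges]
print(node_ids)
```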

Method 4: Mantis Web Scraping API

For production applications, Mantis provides the most reliable way to extract Facebook data. One API call handles rendering, login wall bypassing, proxy rotation, and structured data extraction — without maintaining browser infrastructure.

import requests

# Scrape a Facebook page
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://www.facebook.com/TechCrunch",
        "render_js": True,
        "wait_for": "[role='article']",
        "scroll_count": 3,
        "extract": {
            "posts": {
                "_selector": "[role='article']",
                "_type": "list",
                "text": "[data-ad-preview='message']",
                "reactions": "[aria-label*='reaction']::attr(aria-label)",
                "comments": "a[href*='comment']",
                "timestamp": "a[href*='/posts/'] span",
            }
        }
    }
)

data = response.json()
for post in data["extracted"]["posts"][:10]:
    print(f"Post: {post['text'][:100]}...")
    print(f"  Reactions: {post['reactions']}")
    print(f"  Comments: {post['comments']}")
    print()

Scrape Facebook Ad Library with Mantis

# Scrape competitor ads from Facebook Ad Library
response = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://www.facebook.com/ads/library/?active_status=active"
              "&ad_type=all&country=US&q=competitor_name",
        "render_js": True,
        "scroll_count": 5,
        "extract": {
            "ads": {
                "_selector": "[role='article'], [class*='ad-card']",
                "_type": "list",
                "creative_text": "div[class*='_7jyr']",
                "started": "span:has-text('Started running')",
                "platform": "span:has-text('Facebook'), span:has-text('Instagram')",
            }
        }
    }
)

data = response.json()
print(f"Found {len(data['extracted']['ads'])} active ads")
for ad in data["extracted"]["ads"][:5]:
    print(f"  Ad: {ad['creative_text'][:100]}...")
    print(f"  Started: {ad['started']} | Platforms: {ad['platform']}")
    print()

Why Mantis for Facebook?

Login wall handling — Facebook aggressively pushes login modals; Mantis handles dismissal and navigation
Residential proxies — Facebook blocks datacenter IPs; Mantis uses residential proxy infrastructure
JavaScript rendering — Facebook is a React SPA; Mantis renders it fully with Chromium
Structured extraction — Get clean JSON from Facebook's complex DOM
Anti-fingerprinting — Mantis evades Facebook's browser fingerprinting detection


Facebook Anti-Bot Defenses

Facebook has some of the most sophisticated anti-scraping defenses of any major platform. Understanding them is critical:

1. Login Walls

Facebook aggressively pushes login modals on every page. Even public pages show a login overlay after a few seconds of browsing. The mbasic.facebook.com version is less aggressive, but the main site requires constant popup dismissal. This is Meta's primary anti-scraping strategy — force authentication to track and control access.

2. Device & Browser Fingerprinting

Facebook collects extensive browser fingerprints: canvas rendering, WebGL hashes, audio context, installed fonts, screen resolution, timezone, language, and dozens of other signals. Headless browsers have detectable fingerprint anomalies — use stealth plugins (playwright-stealth, puppeteer-extra-plugin-stealth) to mitigate this.

3. IP Reputation & Rate Limiting

Facebook maintains an extensive IP reputation database. Datacenter IPs (AWS, GCP, Azure, DigitalOcean) are typically pre-blocked or severely throttled. Residential proxies work better but must be rotated carefully — Facebook tracks behavioral patterns per IP and flags unusual access patterns.
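Because behavior is tracked per IP, evenly spaced requests are themselves a detectable pattern. The sketch below shows randomized pauses in the 5-15 second range, plus exponential backoff with jitter after a failed request. The backoff part is a common general-purpose pattern and an assumption here, not something Facebook documents, and both function names are hypothetical.

```python
import random


def jittered_delay(min_s=5.0, max_s=15.0, rng=random):
    """Pick a random pause in seconds between min_s and max_s."""
    return rng.uniform(min_s, max_s)


def backoff_delay(attempt, base_s=10.0, cap_s=300.0, rng=random):
    """Exponential backoff with full jitter for retry number `attempt` (0-based)."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0, ceiling)


# A seeded generator makes the example repeatable; omit it in real use.
rng = random.Random(42)
pause = jittered_delay(rng=rng)
retry_pause = backoff_delay(attempt=3, rng=rng)
print(f"pause {pause:.1f}s, retry after up to {retry_pause:.1f}s")
```

In a scraper, `time.sleep(jittered_delay())` would replace a fixed delay, and `backoff_delay(attempt)` would apply after a blocked or throttled response.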

4. Checkpoint Challenges

Suspicious sessions trigger "checkpoint" challenges — CAPTCHA, phone verification, photo identification, or "Is this you?" prompts. These are nearly impossible to automate and effectively block any scraping session that triggers them.
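Since a checkpoint effectively ends a session, it helps to detect one early and stop rather than keep sending requests. The sketch below checks the final URL after redirects and the HTTP status. The path prefixes `/checkpoint` and `/login` and the status codes treated as blocks are assumptions about how a blocked session might surface, and `looks_blocked` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumed path prefixes that indicate a challenge or login redirect.
BLOCK_PATH_MARKERS = ("/checkpoint", "/login")
# Assumed status codes that indicate throttling or refusal.
BLOCK_STATUS_CODES = (401, 403, 429)


def looks_blocked(final_url, status_code=200):
    """Return True when a response looks like a checkpoint, login wall, or throttle."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    path = urlparse(final_url).path.lower()
    return any(path.startswith(marker) for marker in BLOCK_PATH_MARKERS)


print(looks_blocked("https://www.facebook.com/checkpoint/?next=%2FTechCrunch"))
print(looks_blocked("https://www.facebook.com/TechCrunch"))
```

With `requests`, the inputs would be `resp.url` and `resp.status_code`; a True result is a signal to pause the session or switch identity rather than retry immediately.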

5. GraphQL Hash Rotation

Facebook's internal API uses GraphQL with hashed query identifiers (doc_id). These hashes change frequently during deployments, breaking any scraper that relies on intercepting specific GraphQL queries. You need to dynamically discover the current hashes or rely on DOM-based extraction.
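
One way to discover hashes dynamically is to read them from the requests the page itself sends instead of hardcoding them. The sketch below assumes the intercepted request body is form-encoded and carries `doc_id` alongside a query-name field called `fb_api_req_friendly_name`; that field name and the sample query name are assumptions, and `read_doc_id` and `DocIdRegistry` are hypothetical helpers.

```python
from urllib.parse import parse_qs


def read_doc_id(post_body):
    """Return (query_name, doc_id) from a form-encoded request body, or None."""
    fields = parse_qs(post_body)
    if "doc_id" not in fields:
        return None
    # Assumed field carrying a human-readable query name.
    name = fields.get("fb_api_req_friendly_name", ["unknown"])[0]
    return name, fields["doc_id"][0]


class DocIdRegistry:
    """Remember the most recently observed doc_id for each query name."""

    def __init__(self):
        self._ids = {}

    def observe(self, post_body):
        """Record a request body; return True if the hash is new or has changed."""
        found = read_doc_id(post_body)
        if found is None:
            return False
        name, doc_id = found
        changed = self._ids.get(name) != doc_id
        self._ids[name] = doc_id
        return changed

    def get(self, name):
        """Return the latest doc_id seen for a query name, or None."""
        return self._ids.get(name)
```

Feed `observe` the body string your browser automation exposes for each intercepted GraphQL request; a `True` return after a deployment tells you a stored hash went stale.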

6. Dynamic CSS Classes

Facebook obfuscates CSS class names with randomly generated strings (e.g., x1lliihq x6ikm8r x10wlt62). These classes change between deployments, making CSS selector-based scraping fragile. Use semantic selectors like [role="article"] and [aria-label] instead.
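
A minimal sketch of selecting by role rather than by class name, using only the standard library's html.parser so it runs without extra installs (with BeautifulSoup, the equivalent is a CSS attribute selector such as `soup.select('[role="article"]')`). The function name and the sample markup in the usage note are made up for illustration.

```python
from html.parser import HTMLParser

# Tags with no closing tag; they must not affect the nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class _ArticleParser(HTMLParser):
    """Collect the text of every element marked role="article"."""

    def __init__(self):
        super().__init__()
        self.articles = []  # finished article texts
        self._depth = 0     # nesting depth inside the current article (0 = outside)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("role") == "article":
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.articles.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._chunks.append(data.strip())


def extract_articles(html):
    """Return the text of each top-level role="article" element, ignoring classes."""
    parser = _ArticleParser()
    parser.feed(html)
    return parser.articles
```

For example, `extract_articles('<div class="x1lliihq"><div role="article" class="x6ikm8r"><p>Hi</p></div></div>')` returns the post text no matter what the class attributes contain. Nested articles are folded into their outermost article's text.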

What Data Can You Extract?

Data Type	Fields	Auth Required?
Public Pages	Name, category, about, followers, likes, posts, contact info, hours, location	No (with mbasic)
Page Posts	Text, images, videos, reactions, comments count, shares, timestamp, links	Partial
Comments	Text, author name, reactions, replies, timestamp	Partial
Public Groups	Name, description, member count, post previews (limited without login)	Yes (most content)
Events	Title, date, location, description, attendance count, organizer	Partial
Marketplace	Listing title, price, location, images, seller info, description	Yes
Ad Library	Ad creative, start date, platforms, estimated spend, page name, status	No (public)
Reviews	Rating, text, author, date	Partial

3 Real-World Use Cases

Use Case 1: Brand Reputation Monitor

Track mentions, sentiment, and engagement across competitor and brand Facebook pages. Facebook reviews and page comments are brutally honest — making them valuable for understanding real customer sentiment.

import requests
from bs4 import BeautifulSoup
import time
import re
from collections import Counter

class BrandReputationMonitor:
    """Monitor brand reputation across Facebook pages."""

    def __init__(self):
        self.scraper = FacebookScraper()  # From Method 1

    def monitor_brand(self, brand_pages):
        """Monitor multiple brand pages for reputation signals."""
        report = {}

        for page_name, brand_name in brand_pages.items():
            print(f"Monitoring: {brand_name} ({page_name})...")

            # Get page info
            info = self.scraper.get_page_info(page_name)

            # Get recent posts
            posts = self.scraper.get_page_posts(page_name, max_posts=10)

            # Get reviews if available
            reviews = self.scraper.get_page_reviews(page_name, max_reviews=10)

            # Analyze sentiment
            all_text = " ".join([p.get("text", "") for p in posts])
            # Leading space keeps the last post and the first review from fusing into one word
            all_text += " " + " ".join([r.get("text", "") for r in reviews])
            sentiment = self._analyze_sentiment(all_text)

            # Calculate engagement metrics
            total_likes = 0
            total_comments = 0
            for post in posts:
                try:
                    likes_str = post.get("likes", "0").replace(",", "")
                    total_likes += int(re.search(r"\d+", likes_str).group()) if likes_str else 0
                except (ValueError, AttributeError):
                    pass
                try:
                    comments_str = post.get("comments_count", "0").replace(",", "")
                    total_comments += int(re.search(r"\d+", comments_str).group()) if comments_str else 0
                except (ValueError, AttributeError):
                    pass

            report[brand_name] = {
                "page_name": page_name,
                "title": info.get("title", ""),
                "total_posts_analyzed": len(posts),
                "total_reviews_analyzed": len(reviews),
                "avg_likes_per_post": total_likes // max(len(posts), 1),
                "avg_comments_per_post": total_comments // max(len(posts), 1),
                "sentiment": sentiment,
                "avg_review_rating": self._avg_rating(reviews),
                # First post returned by the scraper, not necessarily the most engaged
                "top_post": posts[0].get("text", "")[:200] if posts else "N/A",
            }

        return report

    def _analyze_sentiment(self, text):
        """Simple keyword-based sentiment analysis."""
        # Count whole words, so "best" does not fire on "bestseller" and
        # repeated complaints weigh more than a single compliment.
        words = Counter(re.findall(r"[a-z']+", text.lower()))
        positive = {"love", "great", "amazing", "best", "awesome", "excellent",
                     "fantastic", "recommend", "perfect", "helpful", "wonderful"}
        negative = {"hate", "terrible", "worst", "awful", "scam", "avoid",
                     "disappointed", "horrible", "broken", "waste", "fraud"}

        pos = sum(words[w] for w in positive)
        neg = sum(words[w] for w in negative)

        if pos > neg:
            return "positive"
        elif neg > pos:
            return "negative"
        return "neutral"

    def _avg_rating(self, reviews):
        """Calculate average review rating."""
        ratings = [r["rating"] for r in reviews if r.get("rating")]
        return round(sum(ratings) / len(ratings), 1) if ratings else "N/A"


# Monitor competitor brands
monitor = BrandReputationMonitor()
report = monitor.monitor_brand({
    "TechCrunch": "TechCrunch",
    "TheVerge": "The Verge",
    "Wired": "WIRED",
})

for brand, data in report.items():
    print(f"\n📊 {brand}:")
    print(f"   Posts analyzed: {data['total_posts_analyzed']}")
    print(f"   Avg likes/post: {data['avg_likes_per_post']}")
    print(f"   Avg comments/post: {data['avg_comments_per_post']}")
    print(f"   Sentiment: {data['sentiment']}")
    print(f"   Avg review rating: {data['avg_review_rating']}")
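
The engagement loop above keeps only the first run of digits, so a label such as "1.2K" is counted as 1. A hedged sketch of a more forgiving parser follows; it assumes English-style labels with comma thousands separators and K/M/B suffixes, and `parse_count` is a hypothetical helper, not part of the scraper from Method 1.

```python
import re

_SUFFIX = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
# A number with optional thousands commas and decimals, then an optional K/M/B suffix.
_COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([kmb])?\b", re.IGNORECASE)


def parse_count(label):
    """Convert labels like "3,456 likes" or "1.2K" to an int; return 0 if no number is found."""
    if not label:
        return 0
    match = _COUNT_RE.search(str(label))
    if not match:
        return 0
    number = float(match.group(1).replace(",", ""))
    multiplier = _SUFFIX.get((match.group(2) or "").lower(), 1)
    return int(round(number * multiplier))
```

Swapping this in for the two `re.search(r"\d+", ...)` expressions would let `total_likes` and `total_comments` reflect abbreviated counts.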

Use Case 2: Competitor Page Analyzer

Compare engagement metrics, posting frequency, and content strategy across competitor Facebook pages.

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

class CompetitorPageAnalyzer {
  async analyzeCompetitors(pageNames) {
    const results = {};

    for (const pageName of pageNames) {
      console.log(`Analyzing: ${pageName}...`);
      await new Promise(r => setTimeout(r, 10000)); // Rate limit

      const browser = await puppeteer.launch({
        headless: 'new',
        args: ['--no-sandbox'],
      });

      try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1920, height: 1080 });

        await page.goto(`https://www.facebook.com/${pageName}`, {
          waitUntil: 'networkidle2',
          timeout: 30000,
        });

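        // Close popups. page.click() has no timeout option in Puppeteer,
        // so wait for the button first, then click it if it appears.
        try {
          const closeBtn = await page.waitForSelector('[aria-label="Close"]', {
            timeout: 3000,
          });
          await closeBtn.click();
        } catch (e) {}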

        // Scroll for posts
        for (let i = 0; i < 3; i++) {
          await page.evaluate(() =>
            window.scrollTo(0, document.body.scrollHeight)
          );
          await new Promise(r => setTimeout(r, 2000));
        }

        // Extract page metrics
        const metrics = await page.evaluate(() => {
          const articles = document.querySelectorAll('[role="article"]');
          const postCount = articles.length;

          let totalReactions = 0;
          let totalComments = 0;
          const postTexts = [];

          // Parse "3,456" or "1.2K" into a number; returns 0 when no count is found
          const parseCount = (text) => {
            const match = (text || '').match(/(\d[\d,]*(?:\.\d+)?)\s*([KkMm])?\b/);
            if (!match) return 0;
            const scale = { k: 1e3, m: 1e6 }[(match[2] || '').toLowerCase()] || 1;
            return Math.round(parseFloat(match[1].replace(/,/g, '')) * scale);
          };

          articles.forEach(article => {
            // Count reactions
            const reactionEl = article.querySelector(
              '[aria-label*="reaction"], [aria-label*="like"]'
            );
            if (reactionEl) {
              totalReactions += parseCount(reactionEl.getAttribute('aria-label'));
            }

            // Count comments
            const commentEl = article.querySelector('a[href*="comment"]');
            if (commentEl) {
              totalComments += parseCount(commentEl.innerText);
            }

            // Get post text
            const textEl = article.querySelector(
              '[data-ad-preview="message"]'
            );
            if (textEl) postTexts.push(textEl.innerText.slice(0, 200));
          });

          // Get follower count
          const followerEl = document.querySelector(
            'a[href*="followers"] span, [class*="follower"]'
          );
          const followers = followerEl ? followerEl.innerText : 'N/A';

          return {
            postCount,
            totalReactions,
            totalComments,
            avgReactionsPerPost: postCount > 0
              ? Math.round(totalReactions / postCount)
              : 0,
            avgCommentsPerPost: postCount > 0
              ? Math.round(totalComments / postCount)
              : 0,
            followers,
            samplePosts: postTexts.slice(0, 3),
          };
        });

        results[pageName] = metrics;
      } catch (error) {
        results[pageName] = { error: error.message };
      } finally {
        await browser.close();
      }
    }

    return results;
  }
}

// Analyze competitors
(async () => {
  const analyzer = new CompetitorPageAnalyzer();
  const results = await analyzer.analyzeCompetitors([
    'ScrapingBee',
    'Apify',
    'BrightData',
  ]);

  for (const [page, data] of Object.entries(results)) {
    if (data.error) {
      console.log(`\n❌ ${page}: ${data.error}`);
      continue;
    }
    console.log(`\n📊 ${page}:`);
    console.log(`   Followers: ${data.followers}`);
    console.log(`   Posts visible: ${data.postCount}`);
    console.log(`   Avg reactions/post: ${data.avgReactionsPerPost}`);
    console.log(`   Avg comments/post: ${data.avgCommentsPerPost}`);
    console.log(`   Sample: ${data.samplePosts[0]?.slice(0, 80) || 'N/A'}...`);
  }
})();

Use Case 3: AI Agent Social Intelligence

Build an AI agent tool that extracts and analyzes Facebook page data for automated market intelligence — perfect for AI agents that need to understand brand presence and social engagement.

import requests
import json
from datetime import datetime

class FacebookIntelligenceTool:
    """AI agent tool for Facebook social intelligence."""

    def __init__(self, mantis_api_key):
        self.api_key = mantis_api_key
        self.base_url = "https://api.mantisapi.com/v1/scrape"

    def get_page_intelligence(self, page_url):
        """Extract comprehensive intelligence from a Facebook page.

        Designed as an AI agent tool — returns structured data
        that agents can reason about.
        """
        # Scrape the page
        response = requests.post(
            self.base_url,
            headers={"x-api-key": self.api_key},
            json={
                "url": page_url,
                "render_js": True,
                "wait_for": "[role='article']",
                "scroll_count": 3,
                "extract": {
                    "page_name": "h1, [role='heading']",
                    "category": "a[href*='/pages/category/']",
                    "posts": {
                        "_selector": "[role='article']",
                        "_type": "list",
                        "text": "[data-ad-preview='message']",
                        "reactions": "[aria-label*='reaction']::attr(aria-label)",
                        "comments": "a[href*='comment']",
                        "timestamp": "a[href*='/posts/'] span",
                    }
                }
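            },
            timeout=60,  # rendering and scrolling can be slow; fail instead of hanging
        )
        response.raise_for_status()  # surface HTTP errors before parsing the body

        data = response.json()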
        extracted = data.get("extracted", {})

        # Analyze content strategy
        posts = extracted.get("posts", [])
        analysis = {
            "page_name": extracted.get("page_name", "Unknown"),
            "category": extracted.get("category", "Unknown"),
            "posts_analyzed": len(posts),
            "content_types": self._classify_content(posts),
            "posting_patterns": self._analyze_patterns(posts),
            "engagement_summary": self._summarize_engagement(posts),
            "top_performing": self._top_posts(posts, 3),
            "extracted_at": datetime.utcnow().isoformat(),
        }

        return analysis

    def get_ad_intelligence(self, brand_name, country="US"):
        """Extract competitor ad intelligence from Facebook Ad Library."""
        from urllib.parse import quote_plus  # URL-encode multi-word brand names

        response = requests.post(
            self.base_url,
            headers={"x-api-key": self.api_key},
            json={
                "url": f"https://www.facebook.com/ads/library/"
                       f"?active_status=active&ad_type=all"
                       f"&country={country}&q={quote_plus(brand_name)}",
                "render_js": True,
                "scroll_count": 5,
                "extract": {
                    "ads": {
                        "_selector": "[role='article']",
                        "_type": "list",
                        "text": "div",
                        "started": "span:has-text('Started')",
                        "platforms": "span:has-text('Facebook'), span:has-text('Instagram')",
                    }
                }
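            },
            timeout=60,  # rendering and scrolling can be slow; fail instead of hanging
        )
        response.raise_for_status()  # surface HTTP errors before parsing the body

        data = response.json()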
        ads = data.get("extracted", {}).get("ads", [])

        return {
            "brand": brand_name,
            "country": country,
            "active_ads": len(ads),
            "ad_samples": [
                {"text": ad.get("text", "")[:200], "started": ad.get("started", "")}
                for ad in ads[:5]
            ],
            "extracted_at": datetime.utcnow().isoformat(),
        }

    def _classify_content(self, posts):
        """Classify posts by content type."""
        types = {"text_only": 0, "with_link": 0, "with_media": 0}
        for post in posts:
            text = post.get("text", "")
            if "http" in text or "www." in text:
                types["with_link"] += 1
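            # NOTE: the extract schema above has no "imageUrl" field, so this
            # branch only fires once you add one to the "posts" selectors.
            elif post.get("imageUrl"):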
                types["with_media"] += 1
            else:
                types["text_only"] += 1
        return types

    def _analyze_patterns(self, posts):
        """Analyze posting patterns."""
        return {
            "total_posts_visible": len(posts),
            "has_regular_cadence": len(posts) >= 5,
        }

    def _summarize_engagement(self, posts):
        """Summarize engagement across posts."""
        import re
        total_reactions = 0
        for post in posts:
            label = post.get("reactions", "")
            # Match the integer part only, so "1.2K" counts as 1 instead of crashing int()
            match = re.search(r"\d[\d,]*", str(label))
            if match:
                total_reactions += int(match.group().replace(",", ""))

        return {
            "total_reactions": total_reactions,
            "avg_reactions_per_post": total_reactions // max(len(posts), 1),
        }

    def _top_posts(self, posts, n=3):
        """Return top performing posts by engagement."""
        # Sort by reaction count (approximate from labels)
        import re
        scored = []
        for post in posts:
            label = post.get("reactions", "")
            # Match the integer part only, so "1.2K" scores 1 instead of crashing int()
            match = re.search(r"\d[\d,]*", str(label))
            score = int(match.group().replace(",", "")) if match else 0
            scored.append({"text": post.get("text", "")[:200], "reactions": score})

        scored.sort(key=lambda x: x["reactions"], reverse=True)
        return scored[:n]


# Usage as an AI agent tool
tool = FacebookIntelligenceTool("YOUR_API_KEY")

# Get page intelligence
intel = tool.get_page_intelligence("https://www.facebook.com/TechCrunch")
print(json.dumps(intel, indent=2))

# Get ad intelligence
ad_intel = tool.get_ad_intelligence("ScrapingBee")
print(f"\n{ad_intel['brand']}: {ad_intel['active_ads']} active ads")
for ad in ad_intel["ad_samples"]:
    print(f"  - {ad['text'][:80]}...")

Graph API vs Scraping vs Mantis

Feature	Facebook Graph API	DIY Scraping	Mantis API
Cost	Free (own pages only)	Server + residential proxy costs	$29/mo (5K requests)
Access scope	Only pages you own/manage	Any public page	Any public page
Competitor data	❌ Not available	✅ Public pages	✅ Public pages
Ad Library	✅ Ad Library API	✅ Web scraping	✅ Included
Setup time	Hours (app review required)	Days (stealth config, proxies)	Minutes
Login wall handling	N/A (API)	You manage	Included
Anti-bot handling	N/A (API)	You manage	Included
Proxy management	N/A	Residential required ($$$)	Included
JS rendering	N/A (API)	Browser required	Included
Maintenance	API version updates	High (DOM changes, fingerprinting)	Zero
Reliability	High (limited scope)	Low-Medium	High
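
To put the Mantis column in concrete terms, the arithmetic below works out the per-request price from the plan figures in the table and checks whether a monitoring schedule fits the quota. It assumes one API request per page check and a 30-day month; both are illustrative assumptions, and the function names are hypothetical.

```python
PLAN_PRICE_USD = 29    # monthly price from the comparison table
PLAN_REQUESTS = 5_000  # requests included per month


def cost_per_request(price=PLAN_PRICE_USD, included=PLAN_REQUESTS):
    """Effective price of one request when the full quota is used."""
    return price / included


def monthly_requests(pages, checks_per_day, requests_per_check=1, days=30):
    """Requests needed to check every page a fixed number of times per day."""
    return pages * checks_per_day * requests_per_check * days


def fits_plan(pages, checks_per_day, included=PLAN_REQUESTS, **kwargs):
    """True if the monitoring schedule stays within the included quota."""
    return monthly_requests(pages, checks_per_day, **kwargs) <= included
```

At these figures a request costs a little under six tenths of a cent, and checking 50 pages three times a day uses 4,500 of the 5,000 included requests.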

Extract Facebook Data at Scale — Without the Headaches

Mantis handles login walls, residential proxies, fingerprinting, and JavaScript rendering. Get structured Facebook data with a single API call.


Legal Considerations

Facebook scraping carries more legal risk than most platforms due to Meta's aggressive enforcement posture. Here's what you need to know:

Key Legal Precedents

Meta v. Bright Data (2024): Meta sued Bright Data for scraping Facebook and Instagram. In January 2024 a federal court in California granted summary judgment for Bright Data, finding that Meta's terms did not prohibit logged-out scraping of public data, and Meta dropped the case soon after. The ruling is narrow: it covers public data collected without logging in, and Meta's willingness to litigate still signals real risk for anyone scraping behind a login.
hiQ Labs v. LinkedIn (2022): The Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA. The case involved LinkedIn, but its principle, that accessing public web data isn't "unauthorized access", has broader implications. hiQ later lost on breach-of-contract grounds, so a favorable CFAA ruling does not neutralize a ToS claim.
Van Buren v. United States (2021): The Supreme Court narrowed the CFAA, ruling that "exceeding authorized access" means accessing data you're not entitled to see, not violating terms of service.
Cambridge Analytica (2018): While this involved API misuse (not scraping), it triggered Meta's data lockdown and made regulators scrutinize all forms of Facebook data collection.

Meta's Enforcement Stance

Meta is uniquely aggressive among tech companies in enforcing against scraping:

Dedicated internal team for detecting and blocking scrapers
Active litigation against scraping companies
Cease-and-desist letters to individuals and businesses
Technical measures (fingerprinting, IP blocking, rate limiting) that are among the most sophisticated in the industry

Best Practices

Only scrape public data — Never access private profiles, closed groups, or authenticated-only content
Minimize data collection — Only collect what you actually need; avoid bulk data hoarding
Don't store personal data unnecessarily — User names, profile info, and comments have GDPR/CCPA implications
Use the Ad Library API when possible — It's public and explicitly intended for transparency research
Respect rate limits — Don't overwhelm Facebook's servers
Don't republish raw content — Analysis and aggregation are safer than republishing posts verbatim
Get legal advice — For any commercial use of Facebook data, consult legal counsel in your jurisdiction
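
As one way to act on the data-minimization points above, the sketch below keeps only the fields an analysis needs and replaces the author name with a keyed hash before anything is stored. The field names follow the Comments row of the data table; the function names, the 16-character truncation, and the key handling are assumptions for illustration. Keyed hashing is pseudonymization, not anonymization, so the output may still count as personal data under GDPR.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return a stable, keyed token for a name so the raw value is never stored."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()[:16]


def minimize_comment(comment, secret_key, keep=("text", "timestamp")):
    """Drop every field not listed in keep and swap the author for a token."""
    slim = {field: comment[field] for field in keep if field in comment}
    if comment.get("author"):
        slim["author_id"] = pseudonymize(comment["author"], secret_key)
    return slim
```

The same author always maps to the same token under one key, so per-author counts still work, while rotating or deleting the key breaks the link back to the name.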

Disclaimer: This article is for educational purposes only. Web scraping may violate Facebook's Terms of Service. Meta has actively pursued legal action against scrapers. Always ensure your scraping activities comply with applicable laws and regulations in your jurisdiction.

FAQ

See the structured FAQ data above for common questions about scraping Facebook. Key points:

The Graph API is nearly useless for third-party data — only your own pages are accessible
mbasic.facebook.com is the easiest scraping target — simple HTML, no JS required
Facebook's anti-bot defenses are among the most aggressive of any major social platform
Residential proxies are essential — datacenter IPs are pre-blocked
The Ad Library is publicly accessible and the safest Facebook data to scrape
Meta actively litigates against scraping companies — proceed with caution
For production use, a managed API like Mantis handles login walls, proxies, and fingerprinting automatically

Next Steps

Now that you know how to scrape Facebook, explore more scraping guides:
