Web Scraping with Puppeteer in 2026: The Complete Guide

Published March 16, 2026 · 22 min read · Updated for Puppeteer 23.x

Puppeteer is Google's official Node.js library for controlling Chrome and Chromium browsers. For JavaScript developers, it's the go-to tool for scraping dynamic, JavaScript-heavy websites that simple HTTP requests can't handle — SPAs built with React, Angular, or Vue that render content client-side.

This guide covers everything from basic page scraping to production-ready patterns — including stealth mode, network interception, infinite scroll, concurrency with puppeteer-cluster, and when to skip the browser entirely and use an API.

Table of Contents

  1. Installation & Setup
  2. Your First Puppeteer Scrape
  3. Selecting Elements & Extracting Data
  4. Waiting Strategies
  5. Page Interaction: Clicks, Forms & Navigation
  6. Screenshots & PDFs
  7. Handling Infinite Scroll
  8. Network Interception
  9. Cookies & Session Management
  10. Stealth Mode: Avoiding Detection
  11. Proxy Rotation
  12. Concurrency with puppeteer-cluster
  13. Production-Ready Scraper
  14. Puppeteer vs Playwright vs Selenium vs Mantis API
  15. When to Use Puppeteer vs an API

1. Installation & Setup

Puppeteer downloads a matching Chrome for Testing build at install time, so you don't need to install Chrome separately:

# Install Puppeteer (downloads Chrome for Testing)
npm install puppeteer

# Or install puppeteer-core (no bundled browser — bring your own)
npm install puppeteer-core
Tip: Use puppeteer for development and local scraping. Use puppeteer-core with a custom Chrome path for Docker/Lambda deployments where you control the browser binary.
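With puppeteer-core you must supply `executablePath` yourself. A minimal sketch of a path resolver, assuming typical install locations (the paths below are common defaults, not guaranteed; verify them on your own hosts):

```javascript
// Hypothetical helper: map a platform to a common Chrome binary path.
// These are typical install locations, not guaranteed on every machine.
function chromePathFor(platform = process.platform) {
  const paths = {
    darwin: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    linux: '/usr/bin/google-chrome',
    win32: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
  };
  const path = paths[platform];
  if (!path) throw new Error(`No default Chrome path for platform: ${platform}`);
  return path;
}
```

Pass the result to puppeteer-core: `puppeteer.launch({ executablePath: chromePathFor() })`.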

Create your first scraper file:

// scraper.js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({
  headless: true,        // Run without visible browser window
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',  // Prevent /dev/shm issues in Docker
  ]
});

const page = await browser.newPage();
await page.goto('https://example.com');

const title = await page.title();
console.log('Page title:', title);

await browser.close();
Note: Puppeteer 23.x uses the new headless mode by default. The old headless mode (headless: 'shell') is faster but more detectable. Use headless: true (new headless) for scraping to reduce bot detection.

2. Your First Puppeteer Scrape

Let's scrape product data from a page. Puppeteer evaluates JavaScript directly in the browser context:

import puppeteer from 'puppeteer';

async function scrapeProducts() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set a realistic viewport
  await page.setViewport({ width: 1920, height: 1080 });

  // Set a real User-Agent
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
  );

  await page.goto('https://books.toscrape.com/', {
    waitUntil: 'networkidle2',   // Wait until network is quiet
    timeout: 30000
  });

  // Extract all book titles and prices
  const books = await page.evaluate(() => {
    const items = document.querySelectorAll('article.product_pod');
    return Array.from(items).map(item => ({
      title: item.querySelector('h3 a').getAttribute('title'),
      price: item.querySelector('.price_color').textContent,
      inStock: item.querySelector('.instock') !== null
    }));
  });

  console.log(`Found ${books.length} books`);
  console.log(books.slice(0, 3));

  await browser.close();
  return books;
}

scrapeProducts();
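The `price` field comes back as a raw string such as `'£51.77'`. A small normalizer, assuming a single currency symbol and optional thousands separators:

```javascript
// Convert a scraped price string (e.g. '£51.77' or '$1,299.00') to a number.
// Returns null when no numeric value can be extracted.
function parsePrice(raw) {
  if (typeof raw !== 'string') return null;
  const match = raw.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}
```

Run it over the scraped array before storing: `books.map(b => ({ ...b, price: parsePrice(b.price) }))`.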

3. Selecting Elements & Extracting Data

Puppeteer provides several methods for selecting and extracting data from the DOM:

page.$eval — Single Element

// Get text content of a single element
const heading = await page.$eval('h1', el => el.textContent);

// Get an attribute
const link = await page.$eval('a.main-link', el => el.href);

// Get inner HTML
const html = await page.$eval('.content', el => el.innerHTML);

page.$$eval — Multiple Elements

// Get all links on the page
const links = await page.$$eval('a', elements =>
  elements.map(el => ({
    text: el.textContent.trim(),
    href: el.href
  }))
);

// Get all prices
const prices = await page.$$eval('.price', elements =>
  elements.map(el => parseFloat(el.textContent.replace('$', '')))
);

page.evaluate — Full DOM Access

// Complex extraction logic
const data = await page.evaluate(() => {
  const rows = document.querySelectorAll('table tbody tr');
  return Array.from(rows).map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent.trim(),
      value: cells[1]?.textContent.trim(),
      date: cells[2]?.textContent.trim()
    };
  });
});

Using XPath

// Select elements by XPath
const elements = await page.$$('xpath/.//div[@class="result"]');
for (const el of elements) {
  const text = await el.evaluate(node => node.textContent);
  console.log(text);
}

4. Waiting Strategies

Proper waiting is the #1 factor in reliable Puppeteer scraping. Race conditions cause most scraper failures:

// Wait for a specific selector to appear in the DOM
await page.waitForSelector('.product-list', { timeout: 10000 });

// Wait for selector to be visible (not just in DOM)
await page.waitForSelector('.modal', { visible: true });

// Wait for selector to disappear
await page.waitForSelector('.loading-spinner', { hidden: true });

// Wait for navigation to complete
await Promise.all([
  page.click('a.next-page'),
  page.waitForNavigation({ waitUntil: 'networkidle2' })
]);

// Wait for network to be idle (no requests for 500ms)
await page.goto(url, { waitUntil: 'networkidle0' }); // 0 connections
await page.goto(url, { waitUntil: 'networkidle2' }); // ≤2 connections

// Wait for a function to return true
await page.waitForFunction(
  () => document.querySelectorAll('.item').length >= 20,
  { timeout: 15000, polling: 500 }
);

// Custom delay (use sparingly)
await new Promise(r => setTimeout(r, 2000));
Tip: Prefer waitForSelector and waitForFunction over fixed delays. They're faster and more reliable. Use networkidle2 over networkidle0 for most sites — analytics scripts keep connections open indefinitely.
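The same poll-until-true idea is useful on the Node side too, for example waiting on data collected by a response listener. A sketch that mirrors `waitForFunction`'s `timeout` and `polling` options:

```javascript
// Poll an async (or sync) predicate until it returns truthy, or time out.
async function waitFor(predicate, { timeout = 15000, polling = 500 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(r => setTimeout(r, polling));
  }
  throw new Error(`waitFor timed out after ${timeout}ms`);
}
```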

5. Page Interaction: Clicks, Forms & Navigation

Puppeteer can simulate any user interaction — essential for scraping behind logins or multi-step flows:

Clicking Elements

// Click a button
await page.click('button.load-more');

// Click and wait for navigation
await Promise.all([
  page.click('a.next'),
  page.waitForNavigation({ waitUntil: 'networkidle2' })
]);

// Click at specific coordinates
await page.mouse.click(100, 200);

// Double-click
await page.click('.item', { clickCount: 2 });

Typing & Form Submission

// Type into an input field (simulates keystrokes)
await page.type('#search-input', 'web scraping API', { delay: 50 });

// Clear and type (select all, then type)
await page.click('#email', { clickCount: 3 });
await page.type('#email', 'user@example.com');

// Submit a form
await page.type('#username', 'myuser');
await page.type('#password', 'mypass');
await Promise.all([
  page.click('button[type="submit"]'),
  page.waitForNavigation()
]);

// Select from dropdown
await page.select('#country', 'US');

// Upload a file
const input = await page.$('input[type="file"]');
await input.uploadFile('/path/to/file.pdf');

Keyboard Shortcuts

// Press Enter
await page.keyboard.press('Enter');

// Keyboard shortcut (Ctrl+A to select all)
await page.keyboard.down('Control');
await page.keyboard.press('a');
await page.keyboard.up('Control');

6. Screenshots & PDFs

// Full page screenshot
await page.screenshot({
  path: 'fullpage.png',
  fullPage: true
});

// Specific element screenshot
const element = await page.$('.product-card');
await element.screenshot({ path: 'product.png' });

// Custom viewport screenshot
await page.setViewport({ width: 1440, height: 900 });
await page.screenshot({
  path: 'desktop.png',
  type: 'jpeg',
  quality: 85
});

// Generate PDF (only works in headless mode)
await page.pdf({
  path: 'page.pdf',
  format: 'A4',
  printBackground: true,
  margin: { top: '1cm', bottom: '1cm' }
});

Need Screenshots at Scale?

Mantis API captures pixel-perfect screenshots of any URL — no browser management required. One API call, instant results.

Try Mantis Free →

7. Handling Infinite Scroll

Many modern sites use infinite scroll instead of pagination. Here's a reliable pattern:

async function scrapeInfiniteScroll(page, maxScrolls = 50) {
  let items = [];
  let previousHeight = 0;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    // Scroll to bottom
    const currentHeight = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });

    // Wait for new content to load
    await new Promise(r => setTimeout(r, 2000));

    // Check if we've reached the end
    if (currentHeight === previousHeight) {
      console.log('No more content to load');
      break;
    }

    previousHeight = currentHeight;
    scrollCount++;
    console.log(`Scroll ${scrollCount}: height = ${currentHeight}`);
  }

  // Extract all loaded items
  items = await page.$$eval('.item', elements =>
    elements.map(el => ({
      title: el.querySelector('.title')?.textContent.trim(),
      url: el.querySelector('a')?.href
    }))
  );

  return items;
}
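Incremental extraction during scrolling often collects the same item twice as the site re-renders. A helper that dedupes by `url`, falling back to `title` (field names assumed from the example above):

```javascript
// Remove duplicate items, keyed by url (or title when url is absent).
// Items with neither key are dropped, since they can't be deduplicated.
function dedupeItems(items) {
  const seen = new Set();
  return items.filter(item => {
    const key = item.url ?? item.title;
    if (key == null || seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```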

8. Network Interception

Puppeteer's network interception is a superpower for scraping. Block unnecessary resources, capture API responses, and modify requests:

Block Images & CSS (Faster Scraping)

await page.setRequestInterception(true);

page.on('request', request => {
  const resourceType = request.resourceType();
  if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});

Capture API Responses

// Intercept JSON API calls — often easier than scraping the DOM
const apiData = [];

page.on('response', async response => {
  const url = response.url();
  if (url.includes('/api/products') && response.status() === 200) {
    try {
      const json = await response.json();
      apiData.push(...json.results);
      console.log(`Captured ${json.results.length} items from API`);
    } catch (e) {
      // Not JSON, skip
    }
  }
});

await page.goto('https://example.com/products');
await page.waitForNetworkIdle();

console.log(`Total items from API: ${apiData.length}`);
Pro Tip: Many SPAs fetch data from internal JSON APIs. Intercepting these API responses is often faster and more reliable than parsing the rendered DOM. Open DevTools Network tab to find them.
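Paginated endpoints frequently return overlapping records across pages. A sketch of an accumulator that merges captured payloads and dedupes on an `id` field (an assumed payload shape; adjust the key to the API you intercept):

```javascript
// Merge captured API result arrays, skipping records whose id was already seen.
function makeAccumulator(idKey = 'id') {
  const seen = new Set();
  const items = [];
  return {
    // Add a batch of records; returns the running total of unique items.
    add(records) {
      for (const rec of records ?? []) {
        const id = rec?.[idKey];
        if (id == null || seen.has(id)) continue;
        seen.add(id);
        items.push(rec);
      }
      return items.length;
    },
    items,
  };
}
```

Inside the response handler, replace `apiData.push(...json.results)` with `acc.add(json.results)`.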

Modify Request Headers

await page.setRequestInterception(true);

page.on('request', request => {
  request.continue({
    headers: {
      ...request.headers(),
      'Accept-Language': 'en-US,en;q=0.9',
      'Referer': 'https://www.google.com/'
    }
  });
});

9. Cookies & Session Management

Persist login sessions across scraping runs by saving and restoring cookies:

import fs from 'fs/promises';

// Save cookies after login
async function saveCookies(page, filePath) {
  const cookies = await page.cookies();
  await fs.writeFile(filePath, JSON.stringify(cookies, null, 2));
  console.log(`Saved ${cookies.length} cookies`);
}

// Restore cookies before scraping
async function loadCookies(page, filePath) {
  try {
    const data = await fs.readFile(filePath, 'utf-8');
    const cookies = JSON.parse(data);
    await page.setCookie(...cookies);
    console.log(`Loaded ${cookies.length} cookies`);
  } catch {
    console.log('No saved cookies found');
  }
}

// Usage
const page = await browser.newPage();
await loadCookies(page, 'cookies.json');
await page.goto('https://example.com/dashboard');

// Check if still logged in
const isLoggedIn = await page.$('.user-avatar') !== null;
if (!isLoggedIn) {
  // Perform login...
  await page.type('#email', 'user@example.com');
  await page.type('#password', 'password');
  await page.click('#login-btn');
  await page.waitForNavigation();
  await saveCookies(page, 'cookies.json');
}
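Saved cookies go stale between runs. Before calling `page.setCookie`, it helps to drop entries that have already expired; Puppeteer cookies carry `expires` as epoch seconds, with `-1` marking a session cookie:

```javascript
// Keep only cookies that are session cookies or not yet expired.
// `expires` is epoch seconds; -1 (or absence) marks a session cookie.
function freshCookies(cookies, nowMs = Date.now()) {
  const nowSec = nowMs / 1000;
  return cookies.filter(c =>
    c.expires == null || c.expires === -1 || c.expires > nowSec
  );
}
```

In `loadCookies`, call `page.setCookie(...freshCookies(cookies))` instead of spreading the raw array.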

10. Stealth Mode: Avoiding Detection

Vanilla Puppeteer is trivially detected by anti-bot systems. The puppeteer-extra-plugin-stealth package patches most fingerprinting vectors:

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Apply stealth plugin
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-blink-features=AutomationControlled',
    '--window-size=1920,1080'
  ]
});

const page = await browser.newPage();

// Additional stealth measures
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
);

// Override WebGL vendor/renderer
await page.evaluateOnNewDocument(() => {
  const getParameter = WebGLRenderingContext.prototype.getParameter;
  WebGLRenderingContext.prototype.getParameter = function(parameter) {
    if (parameter === 37445) return 'Intel Inc.';
    if (parameter === 37446) return 'Intel Iris OpenGL Engine';
    return getParameter.call(this, parameter);
  };
});
Warning: Stealth plugins help with basic detection, but sophisticated anti-bot systems (Cloudflare Turnstile, DataDome, PerimeterX) use behavioral analysis, TLS fingerprinting, and ML classifiers that stealth plugins can't bypass. For guaranteed access to protected sites, use a scraping API that handles anti-bot challenges.
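Varying fingerprint details between sessions also helps. A sketch that picks a coherent user-agent and viewport pair at random; the profiles below are illustrative examples, not a current or exhaustive list:

```javascript
// Pick a matching user-agent + viewport pair at random.
// These example profiles are illustrative — keep yours current and realistic.
const PROFILES = [
  {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1920, height: 1080 },
  },
  {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1440, height: 900 },
  },
];

function randomProfile(profiles = PROFILES) {
  return profiles[Math.floor(Math.random() * profiles.length)];
}
```

Apply a fresh profile per session: `const p = randomProfile(); await page.setUserAgent(p.userAgent); await page.setViewport(p.viewport);`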

11. Proxy Rotation

Rotate IP addresses to avoid rate limiting and IP bans:

// Launch with a proxy
const browser = await puppeteer.launch({
  headless: true,
  args: ['--proxy-server=http://proxy-host:8080']
});

// Authenticate with proxy
const page = await browser.newPage();
await page.authenticate({
  username: 'proxy_user',
  password: 'proxy_pass'
});

// Rotate proxies across multiple browsers
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];

async function scrapeWithProxy(url, proxy) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`]
  });
  const page = await browser.newPage();

  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    const data = await page.evaluate(() => document.body.innerText);
    return data;
  } finally {
    await browser.close();
  }
}

// Use a random proxy for each request
const proxy = proxies[Math.floor(Math.random() * proxies.length)];
const result = await scrapeWithProxy('https://example.com', proxy);
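Random selection can hit the same proxy several times in a row. A round-robin rotator spreads requests evenly across the pool:

```javascript
// Cycle through proxies in order, wrapping around at the end of the list.
function makeProxyRotator(proxies) {
  if (!proxies.length) throw new Error('proxy list is empty');
  let i = 0;
  return () => proxies[i++ % proxies.length];
}
```

Usage: `const nextProxy = makeProxyRotator(proxies);` then `scrapeWithProxy(url, nextProxy())` per request.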

12. Concurrency with puppeteer-cluster

For scraping hundreds or thousands of URLs, puppeteer-cluster manages browser instances, concurrency, retries, and error handling:

import { Cluster } from 'puppeteer-cluster';

async function scrapeAtScale(urls) {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // One browser, multiple incognito contexts
    maxConcurrency: 5,                        // 5 parallel pages
    timeout: 30000,
    retryLimit: 2,
    puppeteerOptions: {
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    },
    monitor: true  // Print cluster stats
  });

  const results = [];

  // Define the scraping task
  await cluster.task(async ({ page, data: url }) => {
    await page.setViewport({ width: 1920, height: 1080 });
    await page.goto(url, { waitUntil: 'networkidle2' });

    const pageData = await page.evaluate(() => ({
      title: document.title,
      description: document.querySelector('meta[name="description"]')
        ?.getAttribute('content') || '',
      h1: document.querySelector('h1')?.textContent || '',
      links: document.querySelectorAll('a').length
    }));

    results.push({ url, ...pageData });
    console.log(`Scraped: ${url}`);
  });

  // Queue all URLs
  for (const url of urls) {
    cluster.queue(url);
  }

  // Wait for all tasks to complete
  await cluster.idle();
  await cluster.close();

  return results;
}

// Usage
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  // ... hundreds more
];

const results = await scrapeAtScale(urls);
console.log(`Scraped ${results.length} pages`);
Concurrency modes:

  - Cluster.CONCURRENCY_PAGE — one browser, one page per worker; cookies and cache are shared across jobs
  - Cluster.CONCURRENCY_CONTEXT — one browser, one incognito context per worker; isolated sessions with low overhead (used above)
  - Cluster.CONCURRENCY_BROWSER — one browser per worker; strongest isolation, highest memory cost
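If puppeteer-cluster is more than you need, a few lines of plain JavaScript can cap parallelism. A minimal promise pool sketch (no retries or monitoring, unlike the cluster):

```javascript
// Run an async worker over `items` with at most `limit` tasks in flight.
// Results are returned in the same order as the input items.
async function promisePool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++;               // claim the next index (single-threaded, no race)
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}
```

Usage: `await promisePool(urls, 5, url => scrapeOne(browser, url))`, where `scrapeOne` opens and closes its own page.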

13. Production-Ready Scraper

Here's a battle-tested scraper class with retries, error handling, rate limiting, and data export:

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import fs from 'fs/promises';

puppeteer.use(StealthPlugin());

class ProductionScraper {
  constructor(options = {}) {
    this.maxRetries = options.maxRetries || 3;
    this.delay = options.delay || 1500;
    this.timeout = options.timeout || 30000;
    this.browser = null;
    this.results = [];
    this.errors = [];
  }

  async init() {
    this.browser = await puppeteer.launch({
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-blink-features=AutomationControlled'
      ]
    });
  }

  async scrapePage(url, extractFn, retries = 0) {
    const page = await this.browser.newPage();

    try {
      // Block heavy resources
      await page.setRequestInterception(true);
      page.on('request', req => {
        if (['image', 'font', 'media'].includes(req.resourceType())) {
          req.abort();
        } else {
          req.continue();
        }
      });

      await page.setViewport({ width: 1920, height: 1080 });
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: this.timeout
      });

      const data = await extractFn(page);
      this.results.push({ url, data, scrapedAt: new Date().toISOString() });
      return data;

    } catch (error) {
      if (retries < this.maxRetries) {
        console.log(`Retry ${retries + 1}/${this.maxRetries} for ${url}`);
        await new Promise(r => setTimeout(r, this.delay * (retries + 1)));
        return this.scrapePage(url, extractFn, retries + 1);
      }
      this.errors.push({ url, error: error.message });
      console.error(`Failed after ${this.maxRetries} retries: ${url}`);
    } finally {
      await page.close();
    }
  }

  async scrapeMany(urls, extractFn) {
    for (const url of urls) {
      await this.scrapePage(url, extractFn);
      // Rate limiting
      await new Promise(r =>
        setTimeout(r, this.delay + Math.random() * 1000)
      );
    }
  }

  async export(filePath) {
    const output = {
      scrapedAt: new Date().toISOString(),
      totalScraped: this.results.length,
      totalErrors: this.errors.length,
      results: this.results,
      errors: this.errors
    };
    await fs.writeFile(filePath, JSON.stringify(output, null, 2));
    console.log(`Exported ${this.results.length} results to ${filePath}`);
  }

  async close() {
    if (this.browser) await this.browser.close();
  }
}

// Usage
const scraper = new ProductionScraper({ delay: 2000, maxRetries: 3 });
await scraper.init();

await scraper.scrapeMany(
  ['https://books.toscrape.com/catalogue/page-1.html'],
  async (page) => {
    return page.$$eval('article.product_pod', items =>
      items.map(item => ({
        title: item.querySelector('h3 a').getAttribute('title'),
        price: item.querySelector('.price_color').textContent,
      }))
    );
  }
);

await scraper.export('results.json');
await scraper.close();
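The class above backs off linearly via `this.delay * (retries + 1)`. Exponential backoff with full jitter is gentler on rate limiters; a drop-in delay calculator you could swap into `scrapePage`:

```javascript
// Exponential backoff with full jitter: base * 2^attempt, capped, then a
// random wait in [0, capped) so concurrent retries don't synchronize.
function backoffMs(attempt, { base = 1000, cap = 30000 } = {}) {
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}
```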

From Puppeteer to Production in Minutes

Mantis API handles Chrome, proxies, anti-bot bypasses, and scaling — so you don't have to. One API call replaces hundreds of lines of Puppeteer code.

Start Free — 100 Calls/Month →

14. Puppeteer vs Playwright vs Selenium vs Mantis API

| Feature | Puppeteer | Playwright | Selenium | Mantis API |
| --- | --- | --- | --- | --- |
| Language | JavaScript/TypeScript | JS, Python, Java, C# | All major languages | Any (REST API) |
| Browser support | Chrome/Chromium, Firefox (experimental) | Chrome, Firefox, WebKit | All browsers | Managed Chrome |
| JS rendering | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Auto-waiting | Manual | Built-in (excellent) | Manual | Automatic |
| Stealth mode | Via plugin (good) | Via plugin (good) | Limited | Built-in (best) |
| Proxy support | Launch args | Per-context | Via capabilities | Built-in rotation |
| Concurrency | puppeteer-cluster | Native contexts | Selenium Grid | Unlimited (cloud) |
| Network interception | Excellent (CDP) | Excellent (native) | Limited | N/A |
| Infrastructure | Self-managed | Self-managed | Self-managed | Fully managed |
| Cost (10K pages/mo) | $150–500 (servers) | $150–500 (servers) | $200–800 (servers) | $29 (Starter plan) |
| Best for | JS devs, Chrome-focused scraping | Cross-browser, modern projects | Legacy, multi-browser testing | Production scraping at scale |

15. When to Use Puppeteer vs an API

Use Puppeteer when:

  - You need fine-grained browser control: clicks, form flows, network interception
  - You scrape a modest number of pages and can run your own infrastructure
  - Your targets don't sit behind aggressive anti-bot protection

Use an API when:

  - Targets use Cloudflare, DataDome, or similar systems that stealth plugins can't bypass
  - You need thousands of pages per day without managing browsers, proxies, and retries
  - One HTTP call is preferable to maintaining hundreds of lines of scraper code


Next Steps