Web Scraping with Puppeteer in 2026: The Complete Guide

Published March 16, 2026 · 22 min read · Updated for Puppeteer 23.x

Puppeteer is Google's official Node.js library for controlling Chrome and Chromium browsers. For JavaScript developers, it's the go-to tool for scraping dynamic, JavaScript-heavy websites that simple HTTP requests can't handle — SPAs built with React, Angular, or Vue that render content client-side.

This guide covers everything from basic page scraping to production-ready patterns — including stealth mode, network interception, infinite scroll, concurrency with puppeteer-cluster, and when to skip the browser entirely and use an API.

Table of Contents

  1. Installation & Setup
  2. Your First Puppeteer Scrape
  3. Selecting Elements & Extracting Data
  4. Waiting Strategies
  5. Page Interaction: Clicks, Forms & Navigation
  6. Screenshots & PDFs
  7. Handling Infinite Scroll
  8. Network Interception
  9. Cookies & Session Management
  10. Stealth Mode: Avoiding Detection
  11. Proxy Rotation
  12. Concurrency with puppeteer-cluster
  13. Production-Ready Scraper
  14. Puppeteer vs Playwright vs Selenium vs Mantis API
  15. When to Use Puppeteer vs an API

1. Installation & Setup

Puppeteer downloads a matching Chrome for Testing build at install time, so you don't need to install Chrome separately:

# Install Puppeteer (downloads Chrome for Testing)
npm install puppeteer

# Or install puppeteer-core (no bundled browser — bring your own)
npm install puppeteer-core
Tip: Use puppeteer for development and local scraping. Use puppeteer-core with a custom Chrome path for Docker/Lambda deployments where you control the browser binary.
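With puppeteer-core you must supply `executablePath` yourself. A minimal sketch of a path resolver, assuming typical install locations (the paths below are common defaults, not guaranteed; verify them on your own hosts):

```javascript
// Hypothetical helper: map a platform to a common Chrome binary path.
// These are typical install locations, not guaranteed on every machine.
function chromePathFor(platform = process.platform) {
  const paths = {
    darwin: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    linux: '/usr/bin/google-chrome',
    win32: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
  };
  const path = paths[platform];
  if (!path) throw new Error(`No default Chrome path for platform: ${platform}`);
  return path;
}
```

Pass the result to puppeteer-core: `puppeteer.launch({ executablePath: chromePathFor() })`.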

Create your first scraper file:

// scraper.js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({
  headless: true,        // Run without visible browser window
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',  // Prevent /dev/shm issues in Docker
  ]
});

const page = await browser.newPage();
await page.goto('https://example.com');

const title = await page.title();
console.log('Page title:', title);

await browser.close();
Note: Puppeteer 23.x uses the new headless mode by default. The old headless mode (headless: 'shell') is faster but more detectable. Use headless: true (new headless) for scraping to reduce bot detection.

2. Your First Puppeteer Scrape

Let's scrape product data from a page. Puppeteer evaluates JavaScript directly in the browser context:

import puppeteer from 'puppeteer';

async function scrapeProducts() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set a realistic viewport
  await page.setViewport({ width: 1920, height: 1080 });

  // Set a real User-Agent
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
  );

  await page.goto('https://books.toscrape.com/', {
    waitUntil: 'networkidle2',   // Wait until network is quiet
    timeout: 30000
  });

  // Extract all book titles and prices
  const books = await page.evaluate(() => {
    const items = document.querySelectorAll('article.product_pod');
    return Array.from(items).map(item => ({
      title: item.querySelector('h3 a').getAttribute('title'),
      price: item.querySelector('.price_color').textContent,
      inStock: item.querySelector('.instock') !== null
    }));
  });

  console.log(`Found ${books.length} books`);
  console.log(books.slice(0, 3));

  await browser.close();
  return books;
}

scrapeProducts();
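The `price` field comes back as a raw string such as `'£51.77'`. A small normalizer, assuming a single currency symbol and optional thousands separators:

```javascript
// Convert a scraped price string (e.g. '£51.77' or '$1,299.00') to a number.
// Returns null when no numeric value can be extracted.
function parsePrice(raw) {
  if (typeof raw !== 'string') return null;
  const match = raw.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}
```

Run it over the scraped array before storing: `books.map(b => ({ ...b, price: parsePrice(b.price) }))`.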

3. Selecting Elements & Extracting Data

Puppeteer provides several methods for selecting and extracting data from the DOM:

page.$eval — Single Element

// Get text content of a single element
const heading = await page.$eval('h1', el => el.textContent);

// Get an attribute
const link = await page.$eval('a.main-link', el => el.href);

// Get inner HTML
const html = await page.$eval('.content', el => el.innerHTML);

page.$$eval — Multiple Elements

// Get all links on the page
const links = await page.$$eval('a', elements =>
  elements.map(el => ({
    text: el.textContent.trim(),
    href: el.href
  }))
);

// Get all prices
const prices = await page.$$eval('.price', elements =>
  elements.map(el => parseFloat(el.textContent.replace('$', '')))
);

page.evaluate — Full DOM Access

// Complex extraction logic
const data = await page.evaluate(() => {
  const rows = document.querySelectorAll('table tbody tr');
  return Array.from(rows).map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent.trim(),
      value: cells[1]?.textContent.trim(),
      date: cells[2]?.textContent.trim()
    };
  });
});

Using XPath

// Select elements by XPath
const elements = await page.$$('xpath/.//div[@class="result"]');
for (const el of elements) {
  const text = await el.evaluate(node => node.textContent);
  console.log(text);
}

4. Waiting Strategies

Proper waiting is the #1 factor in reliable Puppeteer scraping. Race conditions cause most scraper failures:

// Wait for a specific selector to appear in the DOM
await page.waitForSelector('.product-list', { timeout: 10000 });

// Wait for selector to be visible (not just in DOM)
await page.waitForSelector('.modal', { visible: true });

// Wait for selector to disappear
await page.waitForSelector('.loading-spinner', { hidden: true });

// Wait for navigation to complete
await Promise.all([
  page.click('a.next-page'),
  page.waitForNavigation({ waitUntil: 'networkidle2' })
]);

// Wait for network to be idle (no requests for 500ms)
await page.goto(url, { waitUntil: 'networkidle0' }); // 0 connections
await page.goto(url, { waitUntil: 'networkidle2' }); // ≤2 connections

// Wait for a function to return true
await page.waitForFunction(
  () => document.querySelectorAll('.item').length >= 20,
  { timeout: 15000, polling: 500 }
);

// Custom delay (use sparingly)
await new Promise(r => setTimeout(r, 2000));
Tip: Prefer waitForSelector and waitForFunction over fixed delays. They're faster and more reliable. Use networkidle2 over networkidle0 for most sites — analytics scripts keep connections open indefinitely.
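The same poll-until-true idea is useful on the Node side too, for example waiting on data collected by a response listener. A sketch that mirrors `waitForFunction`'s `timeout` and `polling` options:

```javascript
// Poll an async (or sync) predicate until it returns truthy, or time out.
async function waitFor(predicate, { timeout = 15000, polling = 500 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(r => setTimeout(r, polling));
  }
  throw new Error(`waitFor timed out after ${timeout}ms`);
}
```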

5. Page Interaction: Clicks, Forms & Navigation

Puppeteer can simulate any user interaction — essential for scraping behind logins or multi-step flows:

Clicking Elements

// Click a button
await page.click('button.load-more');

// Click and wait for navigation
await Promise.all([
  page.click('a.next'),
  page.waitForNavigation({ waitUntil: 'networkidle2' })
]);

// Click at specific coordinates
await page.mouse.click(100, 200);

// Double-click
await page.click('.item', { clickCount: 2 });

Typing & Form Submission

// Type into an input field (simulates keystrokes)
await page.type('#search-input', 'web scraping API', { delay: 50 });

// Clear and type (select all, then type)
await page.click('#email', { clickCount: 3 });
await page.type('#email', 'user@example.com');

// Submit a form
await page.type('#username', 'myuser');
await page.type('#password', 'mypass');
await Promise.all([
  page.click('button[type="submit"]'),
  page.waitForNavigation()
]);

// Select from dropdown
await page.select('#country', 'US');

// Upload a file
const input = await page.$('input[type="file"]');
await input.uploadFile('/path/to/file.pdf');

Keyboard Shortcuts

// Press Enter
await page.keyboard.press('Enter');

// Keyboard shortcut (Ctrl+A to select all)
await page.keyboard.down('Control');
await page.keyboard.press('a');
await page.keyboard.up('Control');

6. Screenshots & PDFs

// Full page screenshot
await page.screenshot({
  path: 'fullpage.png',
  fullPage: true
});

// Specific element screenshot
const element = await page.$('.product-card');
await element.screenshot({ path: 'product.png' });

// Custom viewport screenshot
await page.setViewport({ width: 1440, height: 900 });
await page.screenshot({
  path: 'desktop.png',
  type: 'jpeg',
  quality: 85
});

// Generate PDF (only works in headless mode)
await page.pdf({
  path: 'page.pdf',
  format: 'A4',
  printBackground: true,
  margin: { top: '1cm', bottom: '1cm' }
});

Need Screenshots at Scale?

Mantis API captures pixel-perfect screenshots of any URL — no browser management required. One API call, instant results.

Try Mantis Free →

7. Handling Infinite Scroll

Many modern sites use infinite scroll instead of pagination. Here's a reliable pattern:

async function scrapeInfiniteScroll(page, maxScrolls = 50) {
  let items = [];
  let previousHeight = 0;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    // Scroll to bottom
    const currentHeight = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });

    // Wait for new content to load
    await new Promise(r => setTimeout(r, 2000));

    // Check if we've reached the end
    if (currentHeight === previousHeight) {
      console.log('No more content to load');
      break;
    }

    previousHeight = currentHeight;
    scrollCount++;
    console.log(`Scroll ${scrollCount}: height = ${currentHeight}`);
  }

  // Extract all loaded items
  items = await page.$$eval('.item', elements =>
    elements.map(el => ({
      title: el.querySelector('.title')?.textContent.trim(),
      url: el.querySelector('a')?.href
    }))
  );

  return items;
}
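Incremental extraction during scrolling often collects the same item twice as the site re-renders. A helper that dedupes by `url`, falling back to `title` (field names assumed from the example above):

```javascript
// Remove duplicate items, keyed by url (or title when url is absent).
// Items with neither key are dropped, since they can't be deduplicated.
function dedupeItems(items) {
  const seen = new Set();
  return items.filter(item => {
    const key = item.url ?? item.title;
    if (key == null || seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```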

8. Network Interception

Puppeteer's network interception is a superpower for scraping. Block unnecessary resources, capture API responses, and modify requests:

Block Images & CSS (Faster Scraping)

await page.setRequestInterception(true);

page.on('request', request => {
  const resourceType = request.resourceType();
  if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});

Capture API Responses

// Intercept JSON API calls — often easier than scraping the DOM
const apiData = [];

page.on('response', async response => {
  const url = response.url();
  if (url.includes('/api/products') && response.status() === 200) {
    try {
      const json = await response.json();
      apiData.push(...json.results);
      console.log(`Captured ${json.results.length} items from API`);
    } catch (e) {
      // Not JSON, skip
    }
  }
});

await page.goto('https://example.com/products');
await page.waitForNetworkIdle();

console.log(`Total items from API: ${apiData.length}`);
Pro Tip: Many SPAs fetch data from internal JSON APIs. Intercepting these API responses is often faster and more reliable than parsing the rendered DOM. Open DevTools Network tab to find them.
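Paginated endpoints frequently return overlapping records across pages. A sketch of an accumulator that merges captured payloads and dedupes on an `id` field (an assumed payload shape; adjust the key to the API you intercept):

```javascript
// Merge captured API result arrays, skipping records whose id was already seen.
function makeAccumulator(idKey = 'id') {
  const seen = new Set();
  const items = [];
  return {
    // Add a batch of records; returns the running total of unique items.
    add(records) {
      for (const rec of records ?? []) {
        const id = rec?.[idKey];
        if (id == null || seen.has(id)) continue;
        seen.add(id);
        items.push(rec);
      }
      return items.length;
    },
    items,
  };
}
```

Inside the response handler, replace `apiData.push(...json.results)` with `acc.add(json.results)`.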

Modify Request Headers

await page.setRequestInterception(true);

page.on('request', request => {
  request.continue({
    headers: {
      ...request.headers(),
      'Accept-Language': 'en-US,en;q=0.9',
      'Referer': 'https://www.google.com/'
    }
  });
});

9. Cookies & Session Management

Persist login sessions across scraping runs by saving and restoring cookies:

import fs from 'fs/promises';

// Save cookies after login
async function saveCookies(page, filePath) {
  const cookies = await page.cookies();
  await fs.writeFile(filePath, JSON.stringify(cookies, null, 2));
  console.log(`Saved ${cookies.length} cookies`);
}

// Restore cookies before scraping
async function loadCookies(page, filePath) {
  try {
    const data = await fs.readFile(filePath, 'utf-8');
    const cookies = JSON.parse(data);
    await page.setCookie(...cookies);
    console.log(`Loaded ${cookies.length} cookies`);
  } catch {
    console.log('No saved cookies found');
  }
}

// Usage
const page = await browser.newPage();
await loadCookies(page, 'cookies.json');
await page.goto('https://example.com/dashboard');

// Check if still logged in
const isLoggedIn = await page.$('.user-avatar') !== null;
if (!isLoggedIn) {
  // Perform login...
  await page.type('#email', 'user@example.com');
  await page.type('#password', 'password');
  await page.click('#login-btn');
  await page.waitForNavigation();
  await saveCookies(page, 'cookies.json');
}
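Saved cookies go stale between runs. Before calling `page.setCookie`, it helps to drop entries that have already expired; Puppeteer cookies carry `expires` as epoch seconds, with `-1` marking a session cookie:

```javascript
// Keep only cookies that are session cookies or not yet expired.
// `expires` is epoch seconds; -1 (or absence) marks a session cookie.
function freshCookies(cookies, nowMs = Date.now()) {
  const nowSec = nowMs / 1000;
  return cookies.filter(c =>
    c.expires == null || c.expires === -1 || c.expires > nowSec
  );
}
```

In `loadCookies`, call `page.setCookie(...freshCookies(cookies))` instead of spreading the raw array.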

10. Stealth Mode: Avoiding Detection

Vanilla Puppeteer is trivially detected by anti-bot systems. The puppeteer-extra-plugin-stealth package patches most fingerprinting vectors:

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Apply stealth plugin
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-blink-features=AutomationControlled',
    '--window-size=1920,1080'
  ]
});

const page = await browser.newPage();

// Additional stealth measures
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
);

// Override WebGL vendor/renderer
await page.evaluateOnNewDocument(() => {
  const getParameter = WebGLRenderingContext.prototype.getParameter;
  WebGLRenderingContext.prototype.getParameter = function(parameter) {
    if (parameter === 37445) return 'Intel Inc.';
    if (parameter === 37446) return 'Intel Iris OpenGL Engine';
    return getParameter.call(this, parameter);
  };
});
Warning: Stealth plugins help with basic detection, but sophisticated anti-bot systems (Cloudflare Turnstile, DataDome, PerimeterX) use behavioral analysis, TLS fingerprinting, and ML classifiers that stealth plugins can't bypass. For guaranteed access to protected sites, use a scraping API that handles anti-bot challenges.
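Varying fingerprint details between sessions also helps. A sketch that picks a coherent user-agent and viewport pair at random; the profiles below are illustrative examples, not a current or exhaustive list:

```javascript
// Pick a matching user-agent + viewport pair at random.
// These example profiles are illustrative — keep yours current and realistic.
const PROFILES = [
  {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1920, height: 1080 },
  },
  {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1440, height: 900 },
  },
];

function randomProfile(profiles = PROFILES) {
  return profiles[Math.floor(Math.random() * profiles.length)];
}
```

Apply a fresh profile per session: `const p = randomProfile(); await page.setUserAgent(p.userAgent); await page.setViewport(p.viewport);`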

11. Proxy Rotation

Rotate IP addresses to avoid rate limiting and IP bans:

// Launch with a proxy
const browser = await puppeteer.launch({
  headless: true,
  args: ['--proxy-server=http://proxy-host:8080']
});

// Authenticate with proxy
const page = await browser.newPage();
await page.authenticate({
  username: 'proxy_user',
  password: 'proxy_pass'
});

// Rotate proxies across multiple browsers
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];

async function scrapeWithProxy(url, proxy) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`]
  });
  const page = await browser.newPage();

  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    const data = await page.evaluate(() => document.body.innerText);
    return data;
  } finally {
    await browser.close();
  }
}

// Use a random proxy for each request
const proxy = proxies[Math.floor(Math.random() * proxies.length)];
const result = await scrapeWithProxy('https://example.com', proxy);
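Random selection can hit the same proxy several times in a row. A round-robin rotator spreads requests evenly across the pool:

```javascript
// Cycle through proxies in order, wrapping around at the end of the list.
function makeProxyRotator(proxies) {
  if (!proxies.length) throw new Error('proxy list is empty');
  let i = 0;
  return () => proxies[i++ % proxies.length];
}
```

Usage: `const nextProxy = makeProxyRotator(proxies);` then `scrapeWithProxy(url, nextProxy())` per request.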

12. Concurrency with puppeteer-cluster

For scraping hundreds or thousands of URLs, puppeteer-cluster manages browser instances, concurrency, retries, and error handling:

import { Cluster } from 'puppeteer-cluster';

async function scrapeAtScale(urls) {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // One browser, multiple incognito contexts
    maxConcurrency: 5,                        // 5 parallel pages
    timeout: 30000,
    retryLimit: 2,
    puppeteerOptions: {
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    },
    monitor: true  // Print cluster stats
  });

  const results = [];

  // Define the scraping task
  await cluster.task(async ({ page, data: url }) => {
    await page.setViewport({ width: 1920, height: 1080 });
    await page.goto(url, { waitUntil: 'networkidle2' });

    const pageData = await page.evaluate(() => ({
      title: document.title,
      description: document.querySelector('meta[name="description"]')
        ?.getAttribute('content') || '',
      h1: document.querySelector('h1')?.textContent || '',
      links: document.querySelectorAll('a').length
    }));

    results.push({ url, ...pageData });
    console.log(`Scraped: ${url}`);
  });

  // Queue all URLs
  for (const url of urls) {
    cluster.queue(url);
  }

  // Wait for all tasks to complete
  await cluster.idle();
  await cluster.close();

  return results;
}

// Usage
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  // ... hundreds more
];

const results = await scrapeAtScale(urls);
console.log(`Scraped ${results.length} pages`);
Concurrency modes:

  - Cluster.CONCURRENCY_PAGE — one browser, one page per worker; cookies and cache are shared across jobs
  - Cluster.CONCURRENCY_CONTEXT — one browser, one incognito context per worker; isolated sessions with low overhead (used above)
  - Cluster.CONCURRENCY_BROWSER — one browser per worker; strongest isolation, highest memory cost
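If puppeteer-cluster is more than you need, a few lines of plain JavaScript can cap parallelism. A minimal promise pool sketch (no retries or monitoring, unlike the cluster):

```javascript
// Run an async worker over `items` with at most `limit` tasks in flight.
// Results are returned in the same order as the input items.
async function promisePool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++;               // claim the next index (single-threaded, no race)
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}
```

Usage: `await promisePool(urls, 5, url => scrapeOne(browser, url))`, where `scrapeOne` opens and closes its own page.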

13. Production-Ready Scraper

Here's a battle-tested scraper class with retries, error handling, rate limiting, and data export:

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import fs from 'fs/promises';

puppeteer.use(StealthPlugin());

class ProductionScraper {
  constructor(options = {}) {
    this.maxRetries = options.maxRetries || 3;
    this.delay = options.delay || 1500;
    this.timeout = options.timeout || 30000;
    this.browser = null;
    this.results = [];
    this.errors = [];
  }

  async init() {
    this.browser = await puppeteer.launch({
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-blink-features=AutomationControlled'
      ]
    });
  }

  async scrapePage(url, extractFn, retries = 0) {
    const page = await this.browser.newPage();

    try {
      // Block heavy resources
      await page.setRequestInterception(true);
      page.on('request', req => {
        if (['image', 'font', 'media'].includes(req.resourceType())) {
          req.abort();
        } else {
          req.continue();
        }
      });

      await page.setViewport({ width: 1920, height: 1080 });
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: this.timeout
      });

      const data = await extractFn(page);
      this.results.push({ url, data, scrapedAt: new Date().toISOString() });
      return data;

    } catch (error) {
      if (retries < this.maxRetries) {
        console.log(`Retry ${retries + 1}/${this.maxRetries} for ${url}`);
        await new Promise(r => setTimeout(r, this.delay * (retries + 1)));
        return this.scrapePage(url, extractFn, retries + 1);
      }
      this.errors.push({ url, error: error.message });
      console.error(`Failed after ${this.maxRetries} retries: ${url}`);
    } finally {
      await page.close();
    }
  }

  async scrapeMany(urls, extractFn) {
    for (const url of urls) {
      await this.scrapePage(url, extractFn);
      // Rate limiting
      await new Promise(r =>
        setTimeout(r, this.delay + Math.random() * 1000)
      );
    }
  }

  async export(filePath) {
    const output = {
      scrapedAt: new Date().toISOString(),
      totalScraped: this.results.length,
      totalErrors: this.errors.length,
      results: this.results,
      errors: this.errors
    };
    await fs.writeFile(filePath, JSON.stringify(output, null, 2));
    console.log(`Exported ${this.results.length} results to ${filePath}`);
  }

  async close() {
    if (this.browser) await this.browser.close();
  }
}

// Usage
const scraper = new ProductionScraper({ delay: 2000, maxRetries: 3 });
await scraper.init();

await scraper.scrapeMany(
  ['https://books.toscrape.com/catalogue/page-1.html'],
  async (page) => {
    return page.$$eval('article.product_pod', items =>
      items.map(item => ({
        title: item.querySelector('h3 a').getAttribute('title'),
        price: item.querySelector('.price_color').textContent,
      }))
    );
  }
);

await scraper.export('results.json');
await scraper.close();
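The class above backs off linearly via `this.delay * (retries + 1)`. Exponential backoff with full jitter is gentler on rate limiters; a drop-in delay calculator you could swap into `scrapePage`:

```javascript
// Exponential backoff with full jitter: base * 2^attempt, capped, then a
// random wait in [0, capped) so concurrent retries don't synchronize.
function backoffMs(attempt, { base = 1000, cap = 30000 } = {}) {
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}
```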

From Puppeteer to Production in Minutes

Mantis API handles Chrome, proxies, anti-bot bypasses, and scaling — so you don't have to. One API call replaces hundreds of lines of Puppeteer code.

Start Free — 100 Calls/Month →

14. Puppeteer vs Playwright vs Selenium vs Mantis API

| Feature | Puppeteer | Playwright | Selenium | Mantis API |
| --- | --- | --- | --- | --- |
| Language | JavaScript/TypeScript | JS, Python, Java, C# | All major languages | Any (REST API) |
| Browser support | Chrome/Chromium, Firefox (experimental) | Chrome, Firefox, WebKit | All browsers | Managed Chrome |
| JS rendering | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Auto-waiting | Manual | Built-in (excellent) | Manual | Automatic |
| Stealth mode | Via plugin (good) | Via plugin (good) | Limited | Built-in (best) |
| Proxy support | Launch args | Per-context | Via capabilities | Built-in rotation |
| Concurrency | puppeteer-cluster | Native contexts | Selenium Grid | Unlimited (cloud) |
| Network interception | Excellent (CDP) | Excellent (native) | Limited | N/A |
| Infrastructure | Self-managed | Self-managed | Self-managed | Fully managed |
| Cost (10K pages/mo) | $150–500 (servers) | $150–500 (servers) | $200–800 (servers) | $29 (Starter plan) |
| Best for | JS devs, Chrome-focused scraping | Cross-browser, modern projects | Legacy, multi-browser testing | Production scraping at scale |

15. When to Use Puppeteer vs an API

Use Puppeteer when:

  - You need fine-grained browser control: clicks, form flows, network interception
  - You scrape a modest number of pages and can run your own infrastructure
  - Your targets don't sit behind aggressive anti-bot protection

Use an API when:

  - Targets use Cloudflare, DataDome, or similar systems that stealth plugins can't bypass
  - You need thousands of pages per day without managing browsers, proxies, and retries
  - One HTTP call is preferable to maintaining hundreds of lines of scraper code


Next Steps