Web Scraping with Puppeteer in 2026: The Complete Guide
Puppeteer is the Chrome team's official Node.js library for controlling Chrome and Chromium (with experimental Firefox support). For JavaScript developers, it's the go-to tool for scraping dynamic, JavaScript-heavy websites that simple HTTP requests can't handle — SPAs built with React, Angular, or Vue that render content client-side.
This guide covers everything from basic page scraping to production-ready patterns — including stealth mode, network interception, infinite scroll, concurrency with puppeteer-cluster, and when to skip the browser entirely and use an API.
Table of Contents
- Installation & Setup
- Your First Puppeteer Scrape
- Selecting Elements & Extracting Data
- Waiting Strategies
- Page Interaction: Clicks, Forms & Navigation
- Screenshots & PDFs
- Handling Infinite Scroll
- Network Interception
- Cookies & Session Management
- Stealth Mode: Avoiding Detection
- Proxy Rotation
- Concurrency with puppeteer-cluster
- Production-Ready Scraper
- Puppeteer vs Playwright vs Selenium vs Mantis API
- When to Use Puppeteer vs an API
1. Installation & Setup
Puppeteer ships with a bundled Chromium binary, so you don't need to install Chrome separately:
# Install Puppeteer (includes Chromium)
npm install puppeteer
# Or install puppeteer-core (no bundled browser — bring your own)
npm install puppeteer-core
Use puppeteer for development and local scraping. Use puppeteer-core with a custom Chrome path for Docker/Lambda deployments where you control the browser binary.
Create your first scraper file:
// scraper.js — uses top-level await, so set "type": "module" in package.json (or name it scraper.mjs)
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({
headless: true, // Run without visible browser window
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage', // Prevent /dev/shm issues in Docker
]
});
const page = await browser.newPage();
await page.goto('https://example.com');
const title = await page.title();
console.log('Page title:', title);
await browser.close();
The old headless shell mode (headless: 'shell') is faster but more detectable. Use headless: true (the new headless mode) for scraping to reduce bot detection.
2. Your First Puppeteer Scrape
Let's scrape product data from a page. Puppeteer evaluates JavaScript directly in the browser context:
import puppeteer from 'puppeteer';
async function scrapeProducts() {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Set a realistic viewport
await page.setViewport({ width: 1920, height: 1080 });
// Set a real User-Agent
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
'(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
);
await page.goto('https://books.toscrape.com/', {
waitUntil: 'networkidle2', // Wait until network is quiet
timeout: 30000
});
// Extract all book titles and prices
const books = await page.evaluate(() => {
const items = document.querySelectorAll('article.product_pod');
return Array.from(items).map(item => ({
title: item.querySelector('h3 a').getAttribute('title'),
price: item.querySelector('.price_color').textContent,
inStock: item.querySelector('.instock') !== null
}));
});
console.log(`Found ${books.length} books`);
console.log(books.slice(0, 3));
await browser.close();
return books;
}
scrapeProducts();
3. Selecting Elements & Extracting Data
Puppeteer provides several methods for selecting and extracting data from the DOM:
page.$eval — Single Element
// Get text content of a single element
const heading = await page.$eval('h1', el => el.textContent);
// Get an attribute
const link = await page.$eval('a.main-link', el => el.href);
// Get inner HTML
const html = await page.$eval('.content', el => el.innerHTML);
page.$$eval — Multiple Elements
// Get all links on the page
const links = await page.$$eval('a', elements =>
elements.map(el => ({
text: el.textContent.trim(),
href: el.href
}))
);
// Get all prices
const prices = await page.$$eval('.price', elements =>
elements.map(el => parseFloat(el.textContent.replace('$', '')))
);
page.evaluate — Full DOM Access
// Complex extraction logic
const data = await page.evaluate(() => {
const rows = document.querySelectorAll('table tbody tr');
return Array.from(rows).map(row => {
const cells = row.querySelectorAll('td');
return {
name: cells[0]?.textContent.trim(),
value: cells[1]?.textContent.trim(),
date: cells[2]?.textContent.trim()
};
});
});
Using XPath
// Select elements by XPath
const elements = await page.$$('xpath/.//div[@class="result"]');
for (const el of elements) {
const text = await el.evaluate(node => node.textContent);
console.log(text);
}
4. Waiting Strategies
Proper waiting is the #1 factor in reliable Puppeteer scraping. Race conditions cause most scraper failures:
// Wait for a specific selector to appear in the DOM
await page.waitForSelector('.product-list', { timeout: 10000 });
// Wait for selector to be visible (not just in DOM)
await page.waitForSelector('.modal', { visible: true });
// Wait for selector to disappear
await page.waitForSelector('.loading-spinner', { hidden: true });
// Wait for navigation to complete
await Promise.all([
page.click('a.next-page'),
page.waitForNavigation({ waitUntil: 'networkidle2' })
]);
// Wait for network to be idle (no requests for 500ms)
await page.goto(url, { waitUntil: 'networkidle0' }); // 0 connections
await page.goto(url, { waitUntil: 'networkidle2' }); // ≤2 connections
// Wait for a function to return true
await page.waitForFunction(
() => document.querySelectorAll('.item').length >= 20,
{ timeout: 15000, polling: 500 }
);
// Custom delay (use sparingly)
await new Promise(r => setTimeout(r, 2000));
Prefer waitForSelector and waitForFunction over fixed delays; they're faster and more reliable. Use networkidle2 over networkidle0 for most sites, since analytics scripts can keep connections open indefinitely.
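waitForFunction polls inside the page; the same polling pattern is also useful on the Node side, for example while waiting for data collected by a response listener. A minimal generic helper (a sketch, not part of Puppeteer's API):

```javascript
// waitFor: poll a predicate until it returns truthy, or reject on timeout.
function waitFor(predicate, { timeout = 10000, interval = 200 } = {}) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    const timer = setInterval(() => {
      if (predicate()) {
        clearInterval(timer);
        resolve(true);
      } else if (Date.now() > deadline) {
        clearInterval(timer);
        reject(new Error(`waitFor timed out after ${timeout}ms`));
      }
    }, interval);
  });
}
```

For example, `await waitFor(() => apiData.length >= 20)` after wiring up a `page.on('response')` listener.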
5. Page Interaction: Clicks, Forms & Navigation
Puppeteer can simulate any user interaction — essential for scraping behind logins or multi-step flows:
Clicking Elements
// Click a button
await page.click('button.load-more');
// Click and wait for navigation
await Promise.all([
page.click('a.next'),
page.waitForNavigation({ waitUntil: 'networkidle2' })
]);
// Click at specific coordinates
await page.mouse.click(100, 200);
// Double-click
await page.click('.item', { clickCount: 2 });
Typing & Form Submission
// Type into an input field (simulates keystrokes)
await page.type('#search-input', 'web scraping API', { delay: 50 });
// Clear and type (select all, then type)
await page.click('#email', { clickCount: 3 });
await page.type('#email', 'user@example.com');
// Submit a form
await page.type('#username', 'myuser');
await page.type('#password', 'mypass');
await Promise.all([
page.click('button[type="submit"]'),
page.waitForNavigation()
]);
// Select from dropdown
await page.select('#country', 'US');
// Upload a file
const input = await page.$('input[type="file"]');
await input.uploadFile('/path/to/file.pdf');
Keyboard Shortcuts
// Press Enter
await page.keyboard.press('Enter');
// Keyboard shortcut (Ctrl+A to select all)
await page.keyboard.down('Control');
await page.keyboard.press('a');
await page.keyboard.up('Control');
6. Screenshots & PDFs
// Full page screenshot
await page.screenshot({
path: 'fullpage.png',
fullPage: true
});
// Specific element screenshot
const element = await page.$('.product-card');
await element.screenshot({ path: 'product.png' });
// Custom viewport screenshot
await page.setViewport({ width: 1440, height: 900 });
await page.screenshot({
path: 'desktop.png',
type: 'jpeg',
quality: 85
});
// Generate PDF (only works in headless mode)
await page.pdf({
path: 'page.pdf',
format: 'A4',
printBackground: true,
margin: { top: '1cm', bottom: '1cm' }
});
Need Screenshots at Scale?
Mantis API captures pixel-perfect screenshots of any URL — no browser management required. One API call, instant results.
Try Mantis Free →
7. Handling Infinite Scroll
Many modern sites use infinite scroll instead of pagination. Here's a reliable pattern:
async function scrapeInfiniteScroll(page, maxScrolls = 50) {
let items = [];
let previousHeight = 0;
let scrollCount = 0;
while (scrollCount < maxScrolls) {
// Scroll to bottom
const currentHeight = await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
return document.body.scrollHeight;
});
// Wait for new content to load
await new Promise(r => setTimeout(r, 2000));
// Check if we've reached the end
if (currentHeight === previousHeight) {
console.log('No more content to load');
break;
}
previousHeight = currentHeight;
scrollCount++;
console.log(`Scroll ${scrollCount}: height = ${currentHeight}`);
}
// Extract all loaded items
items = await page.$$eval('.item', elements =>
elements.map(el => ({
title: el.querySelector('.title')?.textContent.trim(),
url: el.querySelector('a')?.href
}))
);
return items;
}
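The pattern above extracts everything once at the end. If you instead extract inside the loop (useful when the site virtualizes its list and removes off-screen nodes), the same item will be collected on several scrolls, so you need to dedupe. A small helper for that variant (a sketch; key on whichever field is unique, here the URL):

```javascript
// dedupeBy: remove duplicate items by a key function, keeping the first occurrence.
function dedupeBy(items, keyFn) {
  const seen = new Set();
  return items.filter(item => {
    const key = keyFn(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Inside the scroll loop: `items = dedupeBy(items.concat(batch), item => item.url);`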
8. Network Interception
Puppeteer's network interception is a superpower for scraping. Block unnecessary resources, capture API responses, and modify requests:
Block Images & CSS (Faster Scraping)
await page.setRequestInterception(true);
page.on('request', request => {
const resourceType = request.resourceType();
if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
request.abort();
} else {
request.continue();
}
});
Capture API Responses
// Intercept JSON API calls — often easier than scraping the DOM
const apiData = [];
page.on('response', async response => {
const url = response.url();
if (url.includes('/api/products') && response.status() === 200) {
try {
const json = await response.json();
apiData.push(...json.results);
console.log(`Captured ${json.results.length} items from API`);
} catch (e) {
// Not JSON, skip
}
}
});
await page.goto('https://example.com/products');
await page.waitForNetworkIdle();
console.log(`Total items from API: ${apiData.length}`);
Modify Request Headers
await page.setRequestInterception(true);
page.on('request', request => {
request.continue({
headers: {
...request.headers(),
'Accept-Language': 'en-US,en;q=0.9',
'Referer': 'https://www.google.com/'
}
});
});
9. Cookies & Session Management
Persist login sessions across scraping runs by saving and restoring cookies:
import fs from 'fs/promises';
// Save cookies after login
async function saveCookies(page, filePath) {
const cookies = await page.cookies();
await fs.writeFile(filePath, JSON.stringify(cookies, null, 2));
console.log(`Saved ${cookies.length} cookies`);
}
// Restore cookies before scraping
async function loadCookies(page, filePath) {
try {
const data = await fs.readFile(filePath, 'utf-8');
const cookies = JSON.parse(data);
await page.setCookie(...cookies);
console.log(`Loaded ${cookies.length} cookies`);
} catch {
console.log('No saved cookies found');
}
}
// Usage
const page = await browser.newPage();
await loadCookies(page, 'cookies.json');
await page.goto('https://example.com/dashboard');
// Check if still logged in
const isLoggedIn = await page.$('.user-avatar') !== null;
if (!isLoggedIn) {
// Perform login...
await page.type('#email', 'user@example.com');
await page.type('#password', 'password');
await page.click('#login-btn');
await page.waitForNavigation();
await saveCookies(page, 'cookies.json');
}
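Saved cookies go stale; filtering out expired ones before calling page.setCookie avoids confusing half-logged-in states. Puppeteer cookie objects carry an `expires` field in epoch seconds, with `-1` for session cookies. A minimal sketch:

```javascript
// dropExpiredCookies: keep session cookies (expires === -1) and cookies
// whose expiry (epoch seconds) is still in the future.
function dropExpiredCookies(cookies, nowMs = Date.now()) {
  const nowSec = nowMs / 1000;
  return cookies.filter(c => c.expires === -1 || c.expires > nowSec);
}
```

Usage: `await page.setCookie(...dropExpiredCookies(cookies));`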
10. Stealth Mode: Avoiding Detection
Vanilla Puppeteer is trivially detected by anti-bot systems. The puppeteer-extra-plugin-stealth package patches most fingerprinting vectors:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
// Apply stealth plugin
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-blink-features=AutomationControlled',
'--window-size=1920,1080'
]
});
const page = await browser.newPage();
// Additional stealth measures
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
'(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
);
// Override WebGL vendor/renderer
await page.evaluateOnNewDocument(() => {
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
if (parameter === 37445) return 'Intel Inc.';              // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return getParameter.call(this, parameter);
};
});
11. Proxy Rotation
Rotate IP addresses to avoid rate limiting and IP bans:
// Launch with a proxy
const browser = await puppeteer.launch({
headless: true,
args: ['--proxy-server=http://proxy-host:8080']
});
// Authenticate with proxy
const page = await browser.newPage();
await page.authenticate({
username: 'proxy_user',
password: 'proxy_pass'
});
// Rotate proxies across multiple browsers
const proxies = [
'http://user:pass@proxy1.example.com:8080',
'http://user:pass@proxy2.example.com:8080',
'http://user:pass@proxy3.example.com:8080',
];
async function scrapeWithProxy(url, proxy) {
const browser = await puppeteer.launch({
headless: true,
args: [`--proxy-server=${proxy}`]
});
const page = await browser.newPage();
try {
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
const data = await page.evaluate(() => document.body.innerText);
return data;
} finally {
await browser.close();
}
}
// Use a random proxy for each request
const proxy = proxies[Math.floor(Math.random() * proxies.length)];
const result = await scrapeWithProxy('https://example.com', proxy);
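Random selection can hit the same proxy several times in a row; a round-robin rotator spreads load evenly and makes it easy to drop proxies that start failing. A minimal sketch:

```javascript
// ProxyRotator: cycle through proxies round-robin; remove() bans a bad one.
class ProxyRotator {
  constructor(proxies) {
    this.proxies = [...proxies];
    this.index = 0;
  }
  next() {
    if (this.proxies.length === 0) throw new Error('No proxies left');
    const proxy = this.proxies[this.index % this.proxies.length];
    this.index++;
    return proxy;
  }
  remove(proxy) {
    this.proxies = this.proxies.filter(p => p !== proxy);
  }
}
```

Usage: `const result = await scrapeWithProxy(url, rotator.next());`, calling `rotator.remove(proxy)` when a proxy fails repeatedly.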
12. Concurrency with puppeteer-cluster
For scraping hundreds or thousands of URLs, puppeteer-cluster manages browser instances, concurrency, retries, and error handling:
import { Cluster } from 'puppeteer-cluster';
async function scrapeAtScale(urls) {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT, // One browser, multiple incognito contexts
maxConcurrency: 5, // 5 parallel pages
timeout: 30000,
retryLimit: 2,
puppeteerOptions: {
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox']
},
monitor: true // Print cluster stats
});
const results = [];
// Define the scraping task
await cluster.task(async ({ page, data: url }) => {
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(url, { waitUntil: 'networkidle2' });
const pageData = await page.evaluate(() => ({
title: document.title,
description: document.querySelector('meta[name="description"]')
?.getAttribute('content') || '',
h1: document.querySelector('h1')?.textContent || '',
links: document.querySelectorAll('a').length
}));
results.push({ url, ...pageData });
console.log(`Scraped: ${url}`);
});
// Queue all URLs
for (const url of urls) {
cluster.queue(url);
}
// Wait for all tasks to complete
await cluster.idle();
await cluster.close();
return results;
}
// Usage
const urls = [
'https://example.com/page1',
'https://example.com/page2',
// ... hundreds more
];
const results = await scrapeAtScale(urls);
console.log(`Scraped ${results.length} pages`);
puppeteer-cluster concurrency modes:
- CONCURRENCY_PAGE — one browser, pages share cookies/cache (fastest, least isolation)
- CONCURRENCY_CONTEXT — one browser, separate incognito contexts (good balance)
- CONCURRENCY_BROWSER — separate browser per task (most isolation, most memory)
13. Production-Ready Scraper
Here's a battle-tested scraper class with retries, error handling, rate limiting, and data export:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import fs from 'fs/promises';
puppeteer.use(StealthPlugin());
class ProductionScraper {
constructor(options = {}) {
this.maxRetries = options.maxRetries || 3;
this.delay = options.delay || 1500;
this.timeout = options.timeout || 30000;
this.browser = null;
this.results = [];
this.errors = [];
}
async init() {
this.browser = await puppeteer.launch({
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-blink-features=AutomationControlled'
]
});
}
async scrapePage(url, extractFn, retries = 0) {
const page = await this.browser.newPage();
try {
// Block heavy resources
await page.setRequestInterception(true);
page.on('request', req => {
if (['image', 'font', 'media'].includes(req.resourceType())) {
req.abort();
} else {
req.continue();
}
});
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(url, {
waitUntil: 'networkidle2',
timeout: this.timeout
});
const data = await extractFn(page);
this.results.push({ url, data, scrapedAt: new Date().toISOString() });
return data;
} catch (error) {
if (retries < this.maxRetries) {
console.log(`Retry ${retries + 1}/${this.maxRetries} for ${url}`);
await new Promise(r => setTimeout(r, this.delay * (retries + 1)));
return this.scrapePage(url, extractFn, retries + 1);
}
this.errors.push({ url, error: error.message });
console.error(`Failed after ${this.maxRetries} retries: ${url}`);
} finally {
await page.close();
}
}
async scrapeMany(urls, extractFn) {
for (const url of urls) {
await this.scrapePage(url, extractFn);
// Rate limiting
await new Promise(r =>
setTimeout(r, this.delay + Math.random() * 1000)
);
}
}
async export(filePath) {
const output = {
scrapedAt: new Date().toISOString(),
totalScraped: this.results.length,
totalErrors: this.errors.length,
results: this.results,
errors: this.errors
};
await fs.writeFile(filePath, JSON.stringify(output, null, 2));
console.log(`Exported ${this.results.length} results to ${filePath}`);
}
async close() {
if (this.browser) await this.browser.close();
}
}
// Usage
const scraper = new ProductionScraper({ delay: 2000, maxRetries: 3 });
await scraper.init();
await scraper.scrapeMany(
['https://books.toscrape.com/catalogue/page-1.html'],
async (page) => {
return page.$$eval('article.product_pod', items =>
items.map(item => ({
title: item.querySelector('h3 a').getAttribute('title'),
price: item.querySelector('.price_color').textContent,
}))
);
}
);
await scraper.export('results.json');
await scraper.close();
From Puppeteer to Production in Minutes
Mantis API handles Chrome, proxies, anti-bot bypasses, and scaling — so you don't have to. One API call replaces hundreds of lines of Puppeteer code.
Start Free — 100 Calls/Month →
14. Puppeteer vs Playwright vs Selenium vs Mantis API
| Feature | Puppeteer | Playwright | Selenium | Mantis API |
|---|---|---|---|---|
| Language | JavaScript/TypeScript | JS, Python, Java, C# | All major languages | Any (REST API) |
| Browser support | Chrome/Chromium, Firefox (experimental) | Chrome, Firefox, WebKit | All browsers | Managed Chrome |
| JS rendering | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Auto-waiting | Manual | Built-in (excellent) | Manual | Automatic |
| Stealth mode | Via plugin (good) | Via plugin (good) | Limited | Built-in (best) |
| Proxy support | Launch args | Per-context | Via capabilities | Built-in rotation |
| Concurrency | puppeteer-cluster | Native contexts | Selenium Grid | Unlimited (cloud) |
| Network interception | Excellent (CDP) | Excellent (native) | Limited | N/A |
| Infrastructure | Self-managed | Self-managed | Self-managed | Fully managed |
| Cost (10K pages/mo) | $150–500 (servers) | $150–500 (servers) | $200–800 (servers) | $29 (Starter plan) |
| Best for | JS devs, Chrome-focused scraping | Cross-browser, modern projects | Legacy, multi-browser testing | Production scraping at scale |
15. When to Use Puppeteer vs an API
Use Puppeteer when:
- You're a JavaScript/TypeScript developer
- You need fine-grained browser control (clicks, forms, complex flows)
- You're scraping a small number of sites you know well
- You need network interception or request modification
- Budget is $0 and volume is under 1,000 pages/month
Use an API when:
- You need to scrape at scale (10K+ pages/month)
- Sites use anti-bot protection (Cloudflare, DataDome, etc.)
- You want structured data extraction without writing selectors
- You don't want to manage browsers, proxies, and infrastructure
- Reliability and uptime matter for your business
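The API route usually reduces to a single HTTP request. The sketch below targets a hypothetical endpoint; the URL and parameter names are illustrative only (not Mantis's actual API), so check your provider's docs:

```javascript
// buildScrapeUrl: compose a request URL for a hypothetical scraping API.
// The endpoint and parameter names are assumptions for illustration.
function buildScrapeUrl(apiBase, apiKey, targetUrl, opts = {}) {
  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    render_js: String(opts.renderJs ?? true)
  });
  return `${apiBase}?${params}`;
}

// Usage with the built-in fetch (Node 18+):
// const res = await fetch(buildScrapeUrl('https://api.example.com/scrape', KEY, target));
// const html = await res.text();
```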
Next Steps
- Web Scraping with Playwright — Cross-browser alternative to Puppeteer
- Web Scraping with Selenium — Browser automation for Python developers
- How to Scrape Without Getting Blocked — Anti-detection techniques
- Best Web Scraping APIs Comparison — Find the right tool for your needs
- Web Scraping with BeautifulSoup — HTML parsing fundamentals
- Web Scraping with Scrapy — Full framework for large-scale crawling