Puppeteer Web Scraping: The Complete Guide for 2026

March 6, 2026 · Tutorial
Puppeteer is Google's official Node.js library for controlling headless Chrome. It's one of the most popular tools for web scraping — and for good reason. It renders JavaScript, handles SPAs, and gives you full browser control. But in 2026, is Puppeteer still the best choice for web scraping? This guide covers everything: setup, common patterns, advanced techniques, and when you should consider an API instead.

## What Is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. Originally built by the Chrome DevTools team at Google, it's designed for:

- **Browser automation** — clicking, typing, navigating
- **Screenshot and PDF generation** — headless rendering
- **Web scraping** — extracting data from JavaScript-heavy sites
- **Testing** — end-to-end browser testing

Unlike HTTP-based scrapers (like Axios + Cheerio), Puppeteer runs a real browser. That means it can scrape sites that rely on JavaScript to render content — React apps, SPAs, infinite scroll pages, and more.

## Getting Started

### Installation

```bash
npm install puppeteer
```

This installs Puppeteer along with a compatible version of Chromium (~170MB). If you want to use your own Chrome installation:

```bash
npm install puppeteer-core
```

### Your First Scraper

```javascript
const puppeteer = require('puppeteer');

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const title = await page.$eval('h1', el => el.textContent);
  const links = await page.$$eval('a', anchors =>
    anchors.map(a => ({ text: a.textContent, href: a.href }))
  );

  console.log('Title:', title);
  console.log('Links:', links);

  await browser.close();
}

scrape();
```

This launches a headless Chrome instance, navigates to a page, extracts data, and closes the browser.

## Common Scraping Patterns

### Waiting for Dynamic Content

Many modern websites load content asynchronously.
You need to wait for the data to appear:

```javascript
// Wait for a specific selector
await page.waitForSelector('.product-card');

// Wait for navigation after a click
await Promise.all([
  page.waitForNavigation(),
  page.click('.next-page')
]);

// Wait for the network to be idle
await page.goto(url, { waitUntil: 'networkidle0' });
```

### Handling Pagination

```javascript
async function scrapeAllPages(baseUrl) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const allData = [];
  let currentPage = 1;
  let hasNext = true;

  while (hasNext) {
    await page.goto(`${baseUrl}?page=${currentPage}`);
    await page.waitForSelector('.item');

    const items = await page.$$eval('.item', els =>
      els.map(el => ({
        title: el.querySelector('h2')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
      }))
    );
    allData.push(...items);

    hasNext = await page.$('.next-page:not(.disabled)') !== null;
    currentPage++;
  }

  await browser.close();
  return allData;
}
```

### Infinite Scroll

```javascript
async function scrapeInfiniteScroll(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  let previousHeight = 0;
  while (true) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // page.waitForTimeout() was removed in recent Puppeteer versions;
    // use a plain timeout to give new content time to load
    await new Promise(resolve => setTimeout(resolve, 2000));

    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
  }

  const items = await page.$$eval('.feed-item', els =>
    els.map(el => el.textContent.trim())
  );

  await browser.close();
  return items;
}
```

### Intercepting Network Requests

One of Puppeteer's most powerful features — intercept API calls directly:

```javascript
async function interceptAPI(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const apiResponses = [];

  page.on('response', async response => {
    if (response.url().includes('/api/products')) {
      try {
        const data = await response.json();
        apiResponses.push(data);
      } catch {
        // Ignore responses that aren't valid JSON
      }
    }
  });

  await page.goto(url, { waitUntil: 'networkidle0' });
  await browser.close();
  return apiResponses;
}
```

### Taking Screenshots

```javascript
// Full page screenshot
await page.screenshot({ path: 'page.png', fullPage: true });

// Element screenshot
const element = await page.$('.hero-section');
await element.screenshot({ path: 'hero.png' });

// With custom viewport
await page.setViewport({ width: 1920, height: 1080 });
await page.screenshot({ path: 'desktop.png' });
```

## Advanced Techniques

### Stealth Mode

Websites detect Puppeteer through various browser fingerprints. The `puppeteer-extra-plugin-stealth` plugin patches these:

```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch();
// Now harder to detect as a bot
```

### Custom Headers and User Agents

```javascript
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...');
await page.setExtraHTTPHeaders({
  'Accept-Language': 'en-US,en;q=0.9',
});
```

### Proxy Support

```javascript
const browser = await puppeteer.launch({
  args: ['--proxy-server=http://proxy.example.com:8080']
});

// With authentication
await page.authenticate({ username: 'proxy_user', password: 'proxy_pass' });
```

### Blocking Unnecessary Resources

Speed up scraping by blocking images, fonts, and stylesheets:

```javascript
await page.setRequestInterception(true);
page.on('request', request => {
  const blocked = ['image', 'stylesheet', 'font', 'media'];
  if (blocked.includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});
```

### Running Multiple Pages in Parallel

```javascript
async function scrapeUrls(urls, concurrency = 5) {
  const browser = await puppeteer.launch();
  const results = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const promises = batch.map(async url => {
      const page = await browser.newPage();
      try {
        await page.goto(url, { timeout: 30000 });
        const data = await page.$eval('body', el => el.textContent);
        return { url, data };
      } catch (err) {
        return { url, error: err.message };
      } finally {
        await page.close();
      }
    });
    results.push(...await Promise.all(promises));
  }

  await browser.close();
  return results;
}
```

## The Challenges of Puppeteer Scraping

While Puppeteer is powerful, it comes with real production challenges:

### 1. Resource Hungry

Each browser instance consumes 200-500MB of RAM. Scraping at scale means managing dozens of Chrome processes — that's expensive infrastructure.

### 2. Anti-Bot Detection

Even with stealth plugins, sophisticated anti-bot systems (Cloudflare, DataDome, PerimeterX) detect and block headless Chrome. The arms race is constant.

### 3. Maintenance Burden

Selectors break when websites redesign. You need monitoring, alerting, and constant maintenance to keep scrapers running.

### 4. Speed

Launching a browser, loading all resources, waiting for JavaScript — it's slow. A simple page might take 3-5 seconds. An API call takes 1-2 seconds.

### 5. Infrastructure Complexity

Running headless Chrome in production requires Docker containers, process management, crash recovery, and proxy rotation. It's an ops headache.

## Puppeteer vs WebPerception API

What if you could get Puppeteer's capabilities — JavaScript rendering, screenshots, data extraction — without managing browsers?
| Feature | Puppeteer (DIY) | WebPerception API |
|---------|-----------------|-------------------|
| JavaScript rendering | ✅ Full browser | ✅ Cloud-rendered |
| Anti-bot handling | ❌ You manage it | ✅ Built-in |
| Infrastructure | ❌ You host Chrome | ✅ Serverless |
| AI data extraction | ❌ CSS selectors only | ✅ Natural language queries |
| Screenshots | ✅ Manual setup | ✅ One API call |
| Setup time | Hours to days | Minutes |
| Cost at scale | High (compute + proxies) | Predictable per-call pricing |
| Maintenance | Constant | Zero |

### WebPerception API Example

Here's what Puppeteer scraping looks like vs a single API call:

**Puppeteer (30+ lines):**

```javascript
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/products');
await page.waitForSelector('.product');

const products = await page.$$eval('.product', els =>
  els.map(el => ({
    name: el.querySelector('.name')?.textContent,
    price: el.querySelector('.price')?.textContent,
  }))
);

await browser.close();
```

**WebPerception API (5 lines):**

```javascript
const response = await fetch('https://api.mantisapi.com/extract', {
  method: 'POST',
  headers: { 'x-api-key': 'YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract all product names and prices'
  })
});
const data = await response.json();
```

No browser to manage. No selectors to break. No infrastructure to maintain. And the AI extraction adapts automatically when the website changes its layout.
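In production you would typically wrap a call like the one above with retries, since any network API can fail transiently. A minimal exponential-backoff sketch — the wrapper itself is generic JavaScript and assumes nothing about the API beyond "a thrown error means retry":

```javascript
// Retry an async operation with exponential backoff.
// Usable around fetch() calls such as the /extract example above.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === retries) break;
      // Backoff doubles each attempt: 500ms, 1000ms, 2000ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

Usage with the earlier example would look like `await withRetry(() => fetch(url, options).then(r => { if (!r.ok) throw new Error(r.status); return r.json(); }))`.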
## When to Use Puppeteer vs an API

**Use Puppeteer when:**

- You need complex multi-step browser automation (login → navigate → click → scrape)
- You're building browser testing tools
- You need fine-grained control over every browser action
- You're scraping a small number of pages infrequently

**Use WebPerception API when:**

- You need reliable, production-grade scraping
- You want AI-powered data extraction without writing selectors
- You're building an AI agent that needs web perception
- You need screenshots at scale
- You want zero infrastructure management
- You're scraping many pages or need high reliability

## Getting Started with WebPerception API

Ready to simplify your web scraping? WebPerception API gives you everything Puppeteer does — rendering, screenshots, data extraction — without the infrastructure headache.

**Free tier:** 100 API calls/month, no credit card required.

```bash
# Scrape any page (JavaScript rendered)
curl -X POST https://api.mantisapi.com/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# AI-powered data extraction
curl -X POST https://api.mantisapi.com/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/pricing", "prompt": "Extract all plan names, prices, and features"}'
```

[Get your free API key →](https://mantisapi.com)

## Conclusion

Puppeteer remains a powerful browser automation tool. For testing and simple automation tasks, it's excellent. But for production web scraping in 2026, the complexity of managing headless browsers, fighting anti-bot systems, and maintaining CSS selectors is a losing battle.

APIs like WebPerception handle the hard parts — rendering, anti-bot, infrastructure — so you can focus on what matters: using the data. Start with the [free tier](https://mantisapi.com) and see the difference.

Ready to try Mantis?

100 free API calls/month. No credit card required.

Get Your API Key →