Puppeteer Web Scraping: The Complete Guide for 2026
*March 6, 2026 · Tutorial*
Puppeteer is Google's official Node.js library for controlling headless Chrome. It's one of the most popular tools for web scraping — and for good reason. It renders JavaScript, handles SPAs, and gives you full browser control.
But in 2026, is Puppeteer still the best choice for web scraping? This guide covers everything: setup, common patterns, advanced techniques, and when you should consider an API instead.
## What Is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. Originally built by the Chrome DevTools team at Google, it's designed for:
- **Browser automation** — clicking, typing, navigating
- **Screenshot and PDF generation** — headless rendering
- **Web scraping** — extracting data from JavaScript-heavy sites
- **Testing** — end-to-end browser testing
Unlike HTTP-based scrapers (like Axios + Cheerio), Puppeteer runs a real browser. That means it can scrape sites that rely on JavaScript to render content — React apps, SPAs, infinite scroll pages, and more.
## Getting Started
### Installation
```bash
npm install puppeteer
```
This installs Puppeteer along with a compatible browser build (Chrome for Testing), a sizable download. If you'd rather drive a Chrome installation you already have, use the lighter `puppeteer-core` package instead:
```bash
npm install puppeteer-core
```
### Your First Scraper
```javascript
const puppeteer = require('puppeteer');

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const title = await page.$eval('h1', el => el.textContent);
  const links = await page.$$eval('a', anchors =>
    anchors.map(a => ({ text: a.textContent, href: a.href }))
  );

  console.log('Title:', title);
  console.log('Links:', links);

  await browser.close();
}

scrape();
```
This launches a headless Chrome instance, navigates to a page, extracts data, and closes the browser.
## Common Scraping Patterns
### Waiting for Dynamic Content
Many modern websites load content asynchronously. You need to wait for the data to appear:
```javascript
// Wait for a specific selector
await page.waitForSelector('.product-card');

// Wait for navigation after a click
await Promise.all([
  page.waitForNavigation(),
  page.click('.next-page'),
]);

// Wait for network to be idle
await page.goto(url, { waitUntil: 'networkidle0' });
```
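Sometimes none of the built-in waits fit and you need to poll for an arbitrary condition. Here's a small framework-free sketch; `waitFor` and its defaults are our own naming, not part of Puppeteer's API:

```javascript
// Poll an async predicate until it returns truthy or the timeout elapses.
async function waitFor(predicate, { timeout = 10000, interval = 250 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error(`waitFor: condition not met within ${timeout}ms`);
}
```

With Puppeteer you might pass `() => page.$$eval('.item', els => els.length >= 10)` to wait until a list has finished populating, something `waitForSelector` alone can't express.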
### Handling Pagination
```javascript
async function scrapeAllPages(baseUrl) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const allData = [];
  let currentPage = 1;
  let hasNext = true;

  while (hasNext) {
    await page.goto(`${baseUrl}?page=${currentPage}`);
    await page.waitForSelector('.item');

    const items = await page.$$eval('.item', els =>
      els.map(el => ({
        title: el.querySelector('h2')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
      }))
    );
    allData.push(...items);

    hasNext = await page.$('.next-page:not(.disabled)') !== null;
    currentPage++;
  }

  await browser.close();
  return allData;
}
```
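The loop above can be factored into a generic driver that knows nothing about Puppeteer: you inject a `fetchPage` function (the name is ours) that returns `{ items, hasNext }`. That keeps the pagination logic trivially testable and adds a safety cap against endless "next" links:

```javascript
// Generic pagination driver: keeps requesting pages until fetchPage
// reports there is no next page, or the safety cap is reached.
async function paginate(fetchPage, { maxPages = 100 } = {}) {
  const allItems = [];
  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    const { items, hasNext } = await fetchPage(pageNum);
    allItems.push(...items);
    if (!hasNext) break;
  }
  return allItems;
}
```

With Puppeteer, `fetchPage` would `goto` the numbered URL, wait for `.item`, and run the same `$$eval` shown above.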
### Infinite Scroll
```javascript
async function scrapeInfiniteScroll(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  let previousHeight = 0;
  while (true) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // page.waitForTimeout() was removed in modern Puppeteer; sleep manually
    await new Promise(resolve => setTimeout(resolve, 2000));
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
  }

  const items = await page.$$eval('.feed-item', els =>
    els.map(el => el.textContent.trim())
  );

  await browser.close();
  return items;
}
```
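The stop condition (scroll height stops growing) is worth isolating. This sketch takes the measurement and scroll actions as injected functions, so the loop can be tested without a browser; the `scrollUntilStable` name and the `maxRounds` cap (which guards against feeds that grow forever) are our own additions:

```javascript
// Scroll repeatedly until the measured height stops changing.
// Returns the number of rounds it took to stabilize.
async function scrollUntilStable(getHeight, scrollToBottom, {
  maxRounds = 50,
  settleMs = 2000,
} = {}) {
  let previousHeight = -1;
  for (let round = 0; round < maxRounds; round++) {
    await scrollToBottom();
    await new Promise(resolve => setTimeout(resolve, settleMs));
    const height = await getHeight();
    if (height === previousHeight) return round; // page stopped growing
    previousHeight = height;
  }
  return maxRounds; // gave up: page kept growing
}
```

With Puppeteer, `getHeight` would be `() => page.evaluate(() => document.body.scrollHeight)` and `scrollToBottom` the `window.scrollTo` call shown above.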
### Intercepting Network Requests
One of Puppeteer's most powerful features — intercept API calls directly:
```javascript
async function interceptAPI(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const apiResponses = [];

  page.on('response', async response => {
    if (response.url().includes('/api/products')) {
      try {
        apiResponses.push(await response.json());
      } catch {
        // body unavailable (e.g. a redirect) or not valid JSON
      }
    }
  });

  await page.goto(url, { waitUntil: 'networkidle0' });
  await browser.close();
  return apiResponses;
}
```
### Taking Screenshots
```javascript
// Full page screenshot
await page.screenshot({ path: 'page.png', fullPage: true });

// Element screenshot (guard against a missing element)
const element = await page.$('.hero-section');
if (element) await element.screenshot({ path: 'hero.png' });

// With a custom viewport
await page.setViewport({ width: 1920, height: 1080 });
await page.screenshot({ path: 'desktop.png' });
```
## Advanced Techniques
### Stealth Mode
Websites detect Puppeteer through various browser fingerprints. The `puppeteer-extra-plugin-stealth` plugin patches these:
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch();
// Now harder to detect as a bot
```
### Custom Headers and User Agents
```javascript
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...');
await page.setExtraHTTPHeaders({
  'Accept-Language': 'en-US,en;q=0.9',
});
```
### Proxy Support
```javascript
const browser = await puppeteer.launch({
  args: ['--proxy-server=http://proxy.example.com:8080'],
});

// With authentication
await page.authenticate({
  username: 'proxy_user',
  password: 'proxy_pass',
});
```
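It can help to derive both the launch flag and the credentials from a single proxy URL, the way most proxy providers hand them out. A sketch; the `proxyConfig` helper is our own, while `--proxy-server` and `page.authenticate()` are the real Puppeteer/Chrome features:

```javascript
// Split a proxy URL like http://user:pass@host:8080 into the Chrome
// launch flag and the credentials object for page.authenticate().
function proxyConfig(proxyUrl) {
  const url = new URL(proxyUrl);
  const arg = `--proxy-server=${url.protocol}//${url.host}`;
  const credentials = url.username
    ? {
        username: decodeURIComponent(url.username),
        password: decodeURIComponent(url.password),
      }
    : null;
  return { arg, credentials };
}
```

Usage: `const { arg, credentials } = proxyConfig(process.env.PROXY_URL);` then pass `args: [arg]` to `launch()` and, if `credentials` is non-null, call `await page.authenticate(credentials)`.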
### Blocking Unnecessary Resources
Speed up scraping by blocking images, fonts, and stylesheets:
```javascript
await page.setRequestInterception(true);
page.on('request', request => {
  const blocked = ['image', 'stylesheet', 'font', 'media'];
  if (blocked.includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});
```
```
### Running Multiple Pages in Parallel
```javascript
async function scrapeUrls(urls, concurrency = 5) {
  const browser = await puppeteer.launch();
  const results = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const promises = batch.map(async url => {
      const page = await browser.newPage();
      try {
        await page.goto(url, { timeout: 30000 });
        const data = await page.$eval('body', el => el.textContent);
        return { url, data };
      } catch (err) {
        return { url, error: err.message };
      } finally {
        await page.close();
      }
    });
    results.push(...await Promise.all(promises));
  }

  await browser.close();
  return results;
}
```
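One caveat with batching: each batch waits for its slowest page before the next batch starts, so one slow URL idles the other slots. A worker-pool variant keeps every slot busy. This sketch is browser-agnostic (you supply the per-item `worker` function, a name of our choosing), so the same pool works for Puppeteer pages or plain fetches:

```javascript
// Run worker(item) over all items with at most `concurrency` in flight.
// Results keep the input order; failures are captured, not thrown.
async function runPool(items, worker, concurrency = 5) {
  const results = new Array(items.length);
  let next = 0;

  async function runSlot() {
    // Each slot pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = await worker(items[i]);
      } catch (err) {
        results[i] = { error: err.message };
      }
    }
  }

  const slots = Array.from(
    { length: Math.min(concurrency, items.length) },
    runSlot
  );
  await Promise.all(slots);
  return results;
}
```

For Puppeteer, `worker` would open a page, `goto` the URL, extract, and close the page in a `finally`, exactly as in the batch version above.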
## The Challenges of Puppeteer Scraping
While Puppeteer is powerful, it comes with real production challenges:
### 1. Resource Hungry
Each browser instance consumes 200-500MB of RAM. Scraping at scale means managing dozens of Chrome processes — that's expensive infrastructure.
### 2. Anti-Bot Detection
Even with stealth plugins, sophisticated anti-bot systems (Cloudflare, DataDome, PerimeterX) detect and block headless Chrome. The arms race is constant.
### 3. Maintenance Burden
Selectors break when websites redesign. You need monitoring, alerting, and constant maintenance to keep scrapers running.
### 4. Speed
Launching a browser, loading every resource, and waiting for JavaScript is slow: a single page can take 3-5 seconds. A comparable scraping-API call typically returns in 1-2 seconds.
### 5. Infrastructure Complexity
Running headless Chrome in production requires Docker containers, process management, crash recovery, and proxy rotation. It's an ops headache.
## Puppeteer vs WebPerception API
What if you could get Puppeteer's capabilities — JavaScript rendering, screenshots, data extraction — without managing browsers?
| Feature | Puppeteer (DIY) | WebPerception API |
|---------|-----------------|-------------------|
| JavaScript rendering | ✅ Full browser | ✅ Cloud-rendered |
| Anti-bot handling | ❌ You manage it | ✅ Built-in |
| Infrastructure | ❌ You host Chrome | ✅ Serverless |
| AI data extraction | ❌ CSS selectors only | ✅ Natural language queries |
| Screenshots | ✅ Manual setup | ✅ One API call |
| Setup time | Hours to days | Minutes |
| Cost at scale | High (compute + proxies) | Predictable per-call pricing |
| Maintenance | Constant | Zero |
### WebPerception API Example
Here's what Puppeteer scraping looks like vs a single API call:
**Puppeteer (browser lifecycle plus hand-written selectors):**
```javascript
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/products');
await page.waitForSelector('.product');

const products = await page.$$eval('.product', els =>
  els.map(el => ({
    name: el.querySelector('.name')?.textContent,
    price: el.querySelector('.price')?.textContent,
  }))
);

await browser.close();
```
**WebPerception API (a single request):**
```javascript
const response = await fetch('https://api.mantisapi.com/extract', {
  method: 'POST',
  headers: {
    'x-api-key': 'YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract all product names and prices',
  }),
});
const data = await response.json();
```
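Even with an API, transient network failures happen, so a retry wrapper with exponential backoff is cheap insurance. A sketch under our own naming (`withRetry` is not part of any library here); the call being retried is injectable, which also makes the retry logic testable:

```javascript
// Retry an async call with exponential backoff between attempts.
// `retries` is the number of re-attempts after the first try.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // 500ms, 1s, 2s, ... doubling each attempt
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Wrapped around the extract call above: `const data = await withRetry(() => fetch(url, opts).then(r => { if (!r.ok) throw new Error(\`HTTP ${r.status}\`); return r.json(); }));`.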
No browser to manage. No selectors to break. No infrastructure to maintain. And the AI extraction adapts automatically when the website changes its layout.
## When to Use Puppeteer vs an API
**Use Puppeteer when:**
- You need complex multi-step browser automation (login → navigate → click → scrape)
- You're building browser testing tools
- You need fine-grained control over every browser action
- You're scraping a small number of pages infrequently
**Use WebPerception API when:**
- You need reliable, production-grade scraping
- You want AI-powered data extraction without writing selectors
- You're building an AI agent that needs web perception
- You need screenshots at scale
- You want zero infrastructure management
- You're scraping many pages or need high reliability
## Getting Started with WebPerception API
Ready to simplify your web scraping? WebPerception API gives you everything Puppeteer does — rendering, screenshots, data extraction — without the infrastructure headache.
**Free tier:** 100 API calls/month, no credit card required.
```bash
# Scrape any page (JavaScript rendered)
curl -X POST https://api.mantisapi.com/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# AI-powered data extraction
curl -X POST https://api.mantisapi.com/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/pricing", "prompt": "Extract all plan names, prices, and features"}'
```
[Get your free API key →](https://mantisapi.com)
## Conclusion
Puppeteer remains a powerful browser automation tool. For testing and simple automation tasks, it's excellent. But for production web scraping in 2026, the complexity of managing headless browsers, fighting anti-bot systems, and maintaining CSS selectors is a losing battle.
APIs like WebPerception handle the hard parts — rendering, anti-bot, infrastructure — so you can focus on what matters: using the data. Start with the [free tier](https://mantisapi.com) and see the difference.