Web Scraping with JavaScript and Node.js: The Complete Guide for 2026
JavaScript isn't just for building websites anymore. It's one of the most popular languages for web scraping — and with Node.js, you can build scrapers that handle everything from static HTML to JavaScript-heavy single-page apps.
But scraping in 2026 is harder than it used to be. Anti-bot systems, CAPTCHAs, and dynamic rendering make DIY scraping a constant battle. This guide covers every approach — from simple HTML parsing to headless browsers to API-based scraping — so you can pick the right tool for your project.
Why JavaScript for Web Scraping?
- Same language as the web. You're scraping websites built with JavaScript, using JavaScript. DOM manipulation feels natural.
- Massive ecosystem. npm has libraries for everything: HTTP clients, HTML parsers, headless browsers, proxies.
- Async by default. Node.js handles concurrent requests effortlessly with async/await.
- Full-stack capability. Build your scraper and your data pipeline in the same language.
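That async advantage is easy to demonstrate. As an illustration (the helper below is hypothetical, not from any library), here's a small worker pool that runs scraping tasks with a fixed concurrency cap:

```javascript
// Illustrative helper: run `fn` over `items` with at most `limit` tasks in flight.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index; safe because JS is single-threaded
      results[i] = await fn(items[i], i);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}

// Usage: fetch 50 pages, 5 at a time.
// const pages = await mapWithConcurrency(urls, 5, url => fetch(url).then(r => r.text()));
```

Python or Java would need a thread pool or an async framework for this; in Node, it's a dozen lines of plain language features.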
The 4 Approaches to Web Scraping in JavaScript
| Approach | Best For | Handles JS? | Speed | Complexity |
|---|---|---|---|---|
| Cheerio + Axios | Static HTML pages | ❌ | ⚡ Fast | Low |
| Puppeteer | Chrome-rendered pages | ✅ | 🐢 Slow | Medium |
| Playwright | Cross-browser, complex SPAs | ✅ | 🐢 Slow | Medium |
| WebPerception API | Production scraping at scale | ✅ | ⚡ Fast | Very Low |
Approach 1: Cheerio + Axios (Static Pages)
The lightest option. Fetch raw HTML and parse it with jQuery-like syntax.
Setup
npm install axios cheerio
Example: Scrape Product Listings
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeProducts(url) {
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  const products = [];
  $('.product-card').each((i, el) => {
    products.push({
      name: $(el).find('.title').text().trim(),
      price: $(el).find('.price').text().trim(),
      url: $(el).find('a').attr('href'),
    });
  });

  return products;
}

scrapeProducts('https://example.com/products')
  .then(console.log)
  .catch(console.error);
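Scraped fields arrive as raw strings, and `href` attributes are often relative. Two small cleanup helpers (hypothetical names, plain Node, no extra packages) handle the common cases:

```javascript
// Turn a scraped price string like "$1,299.99" into a number (null if unparseable).
function parsePrice(text) {
  const cleaned = String(text).replace(/[^0-9.,-]/g, '').replace(/,/g, '');
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

// Resolve a possibly-relative href against the page URL.
function absoluteUrl(href, base) {
  try {
    return new URL(href, base).toString();
  } catch {
    return null; // missing or malformed href
  }
}
```

Doing this normalization at scrape time keeps messy strings out of your data pipeline.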
When Cheerio Works
- Server-rendered HTML (most blogs, news sites, e-commerce)
- Pages where the data is in the initial HTML response
- High-speed scraping where you need thousands of pages fast
When Cheerio Fails
- Single-page apps (React, Vue, Angular) — the HTML is empty until JavaScript runs
- Pages behind login walls that require cookie management
- Sites with anti-bot protection (Cloudflare, DataDome)
Approach 2: Puppeteer (Headless Chrome)
When pages need JavaScript to render, you need a real browser. Puppeteer controls Chrome programmatically.
Setup
npm install puppeteer
Example: Scrape a JavaScript-Rendered Page
const puppeteer = require('puppeteer');

async function scrapeSPA(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  const data = await page.evaluate(() => {
    const items = document.querySelectorAll('.listing-item');
    return Array.from(items).map(item => ({
      title: item.querySelector('h2')?.textContent?.trim(),
      price: item.querySelector('.price')?.textContent?.trim(),
      description: item.querySelector('.desc')?.textContent?.trim(),
    }));
  });

  await browser.close();
  return data;
}
Handling Infinite Scroll
async function scrapeInfiniteScroll(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  let previousHeight;
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await new Promise(r => setTimeout(r, 2000));
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) break;
  }

  const data = await page.evaluate(() => {
    // Extract all loaded items
    return Array.from(document.querySelectorAll('.item'))
      .map(el => el.textContent.trim());
  });

  await browser.close();
  return data;
}
Puppeteer Challenges
- Memory hungry. Each browser instance uses 100-300MB RAM.
- Slow. Launching Chrome, loading pages, waiting for JS — it adds up.
- Detection. Many sites detect Puppeteer via `navigator.webdriver` and other signals.
- Infrastructure. Running headless Chrome in production requires careful resource management.
Approach 3: Playwright (Cross-Browser)
Playwright is Microsoft's answer to Puppeteer. It supports Chromium, Firefox, and WebKit, with a more modern API.
Setup
npm install playwright
Example: Scrape with Auto-Waiting
const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.waitForSelector('.results-loaded');

  const results = await page.$$eval('.result-card', cards =>
    cards.map(card => ({
      title: card.querySelector('h3')?.textContent?.trim(),
      link: card.querySelector('a')?.href,
      snippet: card.querySelector('.snippet')?.textContent?.trim(),
    }))
  );

  await browser.close();
  return results;
}
Playwright vs Puppeteer
| Feature | Puppeteer | Playwright |
|---|---|---|
| Browsers | Primarily Chromium | Chromium, Firefox, WebKit |
| Auto-waiting | Manual | Built-in |
| API design | Older | Newer, more ergonomic |
| Parallelism | Page-level | Context-level (lighter) |
| Maintained by | Google | Microsoft |
Both have the same fundamental limitations: they're slow, resource-heavy, and detectable.
Approach 4: WebPerception API (Production-Grade)
All three approaches above share the same problems at scale:
- Anti-bot arms race. You're constantly updating your scraper to bypass new protections.
- Infrastructure costs. Running headless browsers in production is expensive.
- Maintenance burden. Selectors break when sites change. Someone has to fix them.
The WebPerception API eliminates all of this. One API call replaces your entire scraping infrastructure.
Setup
npm install node-fetch # or use built-in fetch in Node 18+
Example: Scrape Any Page
const response = await fetch('https://api.mantisapi.com/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    render_js: true,
  }),
});

const { content, metadata } = await response.json();
// content = fully rendered HTML, ready for parsing
Example: AI-Powered Data Extraction
Skip the selectors entirely. Tell the API what you want in plain English:
const response = await fetch('https://api.mantisapi.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract all products with name, price, rating, and availability',
    schema: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          price: { type: 'number' },
          rating: { type: 'number' },
          availability: { type: 'string' },
        },
      },
    },
  }),
});

const { data } = await response.json();
// data = structured JSON matching your schema
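Even with a schema, it's wise to validate the response before trusting it downstream; pages and extraction output both change. A minimal defensive filter (hypothetical helper, not part of the API):

```javascript
// Keep only items that carry the fields downstream code relies on.
function validProducts(items) {
  if (!Array.isArray(items)) return [];
  return items.filter(p =>
    p && typeof p.name === 'string' && typeof p.price === 'number'
  );
}

// const products = validProducts(data);
```

Silently dropping malformed items (or logging them) beats crashing mid-pipeline on one bad record.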
Example: Take Screenshots
const response = await fetch('https://api.mantisapi.com/v1/screenshot', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    full_page: true,
    format: 'png',
  }),
});

// response.buffer() is node-fetch-only; arrayBuffer() also works with built-in fetch
const screenshot = Buffer.from(await response.arrayBuffer());
Why Use an API?
| DIY Scraping | WebPerception API |
|---|---|
| Manage browser infrastructure | One HTTP call |
| Fight anti-bot systems | Handled automatically |
| Fix broken selectors | AI extracts data by intent |
| Scale = more servers | Scale = more API calls |
| Hours of maintenance/week | Zero maintenance |
Pricing
- Free: 100 calls/month (perfect for testing)
- Starter: $29/month — 5,000 calls
- Pro: $99/month — 25,000 calls
- Scale: $299/month — 100,000 calls
Start free at mantisapi.com.
Common Patterns
Handling Pagination
// DIY with Cheerio
async function scrapeAllPages(baseUrl) {
  let page = 1;
  let allResults = [];

  while (true) {
    const { data } = await axios.get(`${baseUrl}?page=${page}`);
    const $ = cheerio.load(data);
    const items = $('.item').map((i, el) => $(el).text().trim()).get();
    if (items.length === 0) break;
    allResults.push(...items);
    page++;
  }

  return allResults;
}

// With WebPerception API — just pass each URL
async function scrapeAllPagesAPI(baseUrl, totalPages) {
  const results = await Promise.all(
    Array.from({ length: totalPages }, (_, i) =>
      fetch('https://api.mantisapi.com/v1/extract', {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          url: `${baseUrl}?page=${i + 1}`,
          prompt: 'Extract all product listings',
        }),
      }).then(r => r.json())
    )
  );

  return results.flatMap(r => r.data);
}
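One caveat with the `Promise.all` version: it fires every request at once, which can trip rate limits on your side or the API's. A batching helper (illustrative, not a library function) processes URLs in fixed-size groups instead:

```javascript
// Run `fn` over `items` in sequential batches of `size`.
async function inBatches(items, size, fn) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    // Each batch runs concurrently; batches themselves run one after another.
    out.push(...await Promise.all(items.slice(i, i + size).map(fn)));
  }
  return out;
}

// Usage: scrape 40 pages, 5 at a time.
// const results = await inBatches(pageUrls, 5, url => scrapePage(url));
```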
Rate Limiting
function rateLimit(fn, delayMs) {
  let nextSlot = 0; // earliest time the next call may start

  return async (...args) => {
    const now = Date.now();
    const wait = Math.max(0, nextSlot - now);
    // Reserve a slot immediately, so concurrent callers stay spaced out too.
    nextSlot = Math.max(now, nextSlot) + delayMs;
    await new Promise(r => setTimeout(r, wait));
    return fn(...args);
  };
}

const scrapePage = rateLimit(async (url) => {
  // your scraping logic
}, 1000); // at most 1 request per second
Error Handling & Retries
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch('https://api.mantisapi.com/v1/scrape', {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ url, render_js: true }),
      });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.json();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      await new Promise(r => setTimeout(r, 1000 * attempt)); // linear backoff: 1s, 2s, 3s
    }
  }
}
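The delay in that loop grows linearly (1s, 2s, 3s). For heavier workloads, exponential backoff with jitter spreads retries out and keeps many clients from retrying in lockstep. A sketch of the delay calculation (hypothetical helper):

```javascript
// Delay before retry `attempt` (1-based): exponential curve, capped at `maxMs`,
// with "equal jitter" — half fixed, half random.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  const exp = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  return exp / 2 + Math.random() * (exp / 2);
}

// In the retry loop:
// await new Promise(r => setTimeout(r, backoffDelay(attempt)));
```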
Which Approach Should You Use?
Choose Cheerio + Axios if:
- You're scraping static HTML pages
- Speed matters more than complexity
- You're comfortable writing CSS selectors
Choose Puppeteer/Playwright if:
- Pages require JavaScript rendering
- You need to interact with the page (click, scroll, type)
- You're scraping a small number of pages and can manage the infrastructure
Choose WebPerception API if:
- You're building a production application
- You don't want to manage browser infrastructure
- You need to handle anti-bot protection automatically
- You want AI-powered data extraction instead of brittle selectors
- You're building an AI agent that needs web perception
Building a Web Scraper for AI Agents
If you're building an AI agent that needs to read the web, WebPerception is the natural choice. Here's how to integrate it as a tool:
// LangChain-style tool definition
const webPerceptionTool = {
  name: 'web_perception',
  description: 'Fetch and extract structured data from any webpage',
  async execute({ url, query }) {
    const response = await fetch('https://api.mantisapi.com/v1/extract', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.MANTIS_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ url, prompt: query }),
    });
    return response.json();
  },
};
Your agent can now perceive any webpage — no browser infrastructure, no broken selectors, no anti-bot headaches.
Conclusion
JavaScript has excellent tools for web scraping, from lightweight HTML parsing with Cheerio to full browser automation with Puppeteer and Playwright. But in 2026, the smartest approach for production applications is to let an API handle the hard parts.
The WebPerception API gives you rendered HTML, AI-powered extraction, and screenshot capabilities — all without managing a single browser instance.
Building an AI agent? Read our guide on how to build your first AI agent and learn how WebPerception fits into the agent stack.