JavaScript is the language of the web — and Node.js makes it the language of web scraping too. Here's every major tool in the Node.js scraping ecosystem:
| Tool | Type | Best For | JS Rendering | Guide |
|---|---|---|---|---|
| Cheerio | HTML parser | Fast HTML parsing (jQuery-style) | ❌ | Full guide → |
| Puppeteer | Browser automation | Headless Chrome, screenshots | ✅ | Full guide → |
| Playwright | Browser automation | Multi-browser, modern API | ✅ | Playwright guide → |
| Axios | HTTP client | Simple HTTP requests | ❌ | — |
| node-fetch | HTTP client | Fetch API for Node.js | ❌ | — |
| Got | HTTP client | Advanced HTTP (retries, streams) | ❌ | — |
| Crawlee | Framework | Large-scale crawling | ✅ (via Puppeteer/Playwright) | — |
| Mantis API | Web scraping API | Production scraping, AI agents | ✅ | Full guide → |
Let's build a working scraper in under 20 lines using Axios to fetch pages and Cheerio to parse HTML:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeHN() {
  // 1. Fetch the page
  const { data } = await axios.get('https://news.ycombinator.com', {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' }
  });

  // 2. Parse the HTML
  const $ = cheerio.load(data);

  // 3. Extract data
  $('.titleline > a').slice(0, 10).each((i, el) => {
    console.log($(el).text(), '→', $(el).attr('href'));
  });
}

scrapeHN();
```
```bash
npm install axios cheerio
node scraper.js
```
That's it — a working scraper in 15 lines. For the complete jQuery-style API, DOM traversal, table scraping, and pagination, see our complete Cheerio guide.
Cheerio is the Node.js equivalent of Python's BeautifulSoup. It implements a subset of jQuery for fast, memory-efficient HTML parsing — no browser needed:
```javascript
const cheerio = require('cheerio');

const html = `
<div class="products">
  <div class="product">
    <h2 class="name">Widget Pro</h2>
    <span class="price">$49.99</span>
    <a href="/products/widget-pro">Details</a>
  </div>
  <div class="product">
    <h2 class="name">Gadget Max</h2>
    <span class="price">$79.99</span>
    <a href="/products/gadget-max">Details</a>
  </div>
</div>
`;

const $ = cheerio.load(html);

// CSS selectors — just like jQuery
$('.product').each((i, el) => {
  const name = $(el).find('.name').text();
  const price = $(el).find('.price').text();
  const url = $(el).find('a').attr('href');
  console.log({ name, price, url });
});

// DOM traversal
$('.product').first().next().find('.name').text(); // "Gadget Max"
$('.name').parent().attr('class'); // "product"
```
Cheerio is 10-20x faster than browser-based scraping because it only parses HTML — no DOM rendering, no JavaScript execution. Use it whenever the page content is in the raw HTML. See the complete Cheerio guide for tables, pagination, and production patterns.
When pages render content with JavaScript (React, Angular, Vue), you need a real browser. Puppeteer controls headless Chrome:
```javascript
const puppeteer = require('puppeteer');

async function scrapeSPA() {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  // Set a realistic User-Agent
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');

  await page.goto('https://example.com/spa-app', {
    waitUntil: 'networkidle2'
  });

  // Wait for dynamic content
  await page.waitForSelector('.product-card');

  // Extract data from the rendered page
  const products = await page.$$eval('.product-card', cards =>
    cards.map(card => ({
      name: card.querySelector('.name').textContent,
      price: card.querySelector('.price').textContent,
    }))
  );
  console.log(products);

  // Take a screenshot
  await page.screenshot({ path: 'products.png', fullPage: true });

  await browser.close();
}

scrapeSPA();
```
```bash
npm install puppeteer
node scraper.js
```
Puppeteer excels at screenshots, PDF generation, and form interaction. For stealth mode, proxy rotation, network interception, and concurrent scraping with puppeteer-cluster, see our complete Puppeteer guide.
Playwright is the newer alternative to Puppeteer, created at Microsoft by engineers from the original Puppeteer team. It supports Chromium, Firefox, and WebKit, with a more modern API:
```javascript
const { chromium } = require('playwright');

async function scrapeWithPlaywright() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Intercept network requests — register before navigating
  // so the rules apply to the initial page load
  await page.route('**/*.{png,jpg,gif}', route => route.abort());

  // Capture API responses — also register before navigating
  page.on('response', async response => {
    if (response.url().includes('/api/products')) {
      const json = await response.json();
      console.log('API data:', json);
    }
  });

  await page.goto('https://example.com/products');

  // Playwright auto-waits for elements
  const products = await page.locator('.product-card').all();
  for (const product of products) {
    const name = await product.locator('.name').textContent();
    const price = await product.locator('.price').textContent();
    console.log({ name, price });
  }

  await browser.close();
}

scrapeWithPlaywright();
```
```bash
npm install playwright
npx playwright install chromium
```
Node.js was built for concurrency. Here are the key patterns for scraping many pages at once:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Process URLs in batches of N
async function scrapeBatch(urls, batchSize = 5) {
  const results = [];

  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);

    const batchResults = await Promise.all(
      batch.map(async (url) => {
        try {
          const { data } = await axios.get(url, {
            headers: { 'User-Agent': 'Mozilla/5.0' },
            timeout: 10000
          });
          const $ = cheerio.load(data);
          return { url, title: $('h1').text(), status: 'ok' };
        } catch (err) {
          return { url, error: err.message, status: 'error' };
        }
      })
    );

    results.push(...batchResults);

    // Rate limit: wait 1 second between batches
    if (i + batchSize < urls.length) {
      await new Promise(r => setTimeout(r, 1000));
    }
  }

  return results;
}

// Usage
const urls = Array.from({ length: 50 }, (_, i) => `https://example.com/page/${i + 1}`);
scrapeBatch(urls, 5).then(console.log);
```
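One caveat with fixed batches: every batch waits for its slowest request before the next one starts. A worker-pool variant keeps exactly N requests in flight at all times. Here's a minimal sketch using only built-in promises — `scrapeOne` is a hypothetical stand-in for your per-URL axios/cheerio logic:

```javascript
// Minimal promise pool: keeps up to `limit` tasks running at once.
async function promisePool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function run() {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: no await between read and increment)
      results[i] = await worker(items[i], i);
    }
  }

  // Start `limit` workers that pull from the shared queue
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}

// Usage with a stand-in task — replace with real request logic
async function scrapeOne(url) {
  return { url, status: 'ok' };
}

promisePool(['https://example.com/1', 'https://example.com/2'], 2, scrapeOne)
  .then(console.log);
```

As soon as any worker finishes a URL, it pulls the next one, so throughput isn't gated by the slowest request in a batch.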
```javascript
const { Cluster } = require('puppeteer-cluster');

async function scrapeWithCluster() {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4,
    puppeteerOptions: { headless: 'new' }
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    console.log(`${url} → ${title}`);
  });

  for (let i = 1; i <= 50; i++) {
    cluster.queue(`https://example.com/page/${i}`);
  }

  await cluster.idle();
  await cluster.close();
}

scrapeWithCluster();
```
For large-scale structured crawling, Crawlee (by Apify) is the most complete Node.js framework:
```javascript
const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestsPerMinute: 60,

  async requestHandler({ $, request, enqueueLinks }) {
    const title = $('h1').text();
    const price = $('.price').text();
    console.log({ url: request.url, title, price });

    // Auto-discover and follow links
    await enqueueLinks({
      selector: 'a.next-page',
    });
  },
});

crawler.run(['https://example.com/products']);
```
Crawlee handles retries, request queues, data storage, proxy rotation, and both HTTP and browser-based crawling. It's what you reach for when a simple script isn't enough.
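If you're not ready for a full framework, the single most valuable of those features — automatic retries — can be approximated in a few lines. A sketch of retry with exponential backoff (the failing task below is a stand-in for a real request):

```javascript
// Retry an async function with exponential backoff: base, 2×base, 4×base, ...
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err; // out of attempts — propagate
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

// Usage: a task that fails twice, then succeeds
let calls = 0;
withRetry(async () => {
  calls++;
  if (calls < 3) throw new Error('transient');
  return 'ok';
}, { retries: 3, baseDelayMs: 10 }).then(console.log); // logs "ok"
```

Production frameworks layer more on top (per-status-code policies, request queues that persist across restarts), but the backoff loop is the core of it.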
The same anti-bot systems that block Python scrapers block Node.js scrapers. Here's how to stay under the radar:
```javascript
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Referer': 'https://www.google.com/',
  'Connection': 'keep-alive',
};
```
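A single static User-Agent is itself a fingerprint once you send thousands of requests. A common refinement is rotating through a small pool of real browser UA strings per request — a minimal sketch (the strings below are examples; keep yours current with real browser releases):

```javascript
// Pool of realistic User-Agent strings (examples — refresh periodically)
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
];

// Build request headers with a randomly chosen User-Agent
function randomHeaders() {
  return {
    'User-Agent': userAgents[Math.floor(Math.random() * userAgents.length)],
    'Accept-Language': 'en-US,en;q=0.9',
  };
}

console.log(randomHeaders()['User-Agent']);
```

Pass `randomHeaders()` as the `headers` option on each request instead of reusing one object.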
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  // Now passes most bot detection tests
  await browser.close();
})();
```
```javascript
const axios = require('axios');
const puppeteer = require('puppeteer');
// Note: https-proxy-agent v7+ uses a named export
const { HttpsProxyAgent } = require('https-proxy-agent');

const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];

(async () => {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  const url = 'https://example.com';

  // With Axios
  const { data } = await axios.get(url, {
    httpsAgent: new HttpsProxyAgent(proxy)
  });

  // With Puppeteer
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxy}`]
  });
  await browser.close();
})();
```
For a comprehensive deep dive into anti-blocking for all languages, see our guide to scraping without getting blocked.
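Beyond headers and proxies, request timing matters: requests fired at perfectly regular intervals look robotic. A jittered delay between requests is a simple, widely used countermeasure — a sketch:

```javascript
// Sleep for a random duration between minMs and maxMs
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage: pause 500–1500ms between requests
(async () => {
  const start = Date.now();
  await randomDelay(500, 1500);
  console.log(`waited ${Date.now() - start}ms`);
})();
```

Drop a `await randomDelay(...)` call into any request loop (like the batch scraper above) to replace its fixed one-second pause.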
Mantis handles proxy rotation, JavaScript rendering, and anti-blocking automatically. One API call, clean data back.
Building and maintaining scraping infrastructure is expensive. Here's the real cost:
| Component | DIY Cost (Monthly) | Mantis API |
|---|---|---|
| Proxy rotation | $50–500 | ✅ Included |
| Headless browsers | $100–300 | ✅ Included |
| CAPTCHA solving | $50–200 | ✅ Included |
| Anti-bot bypass | Engineering time | ✅ Included |
| Maintenance | Ongoing dev hours | ✅ Managed |
| Total | $200–1,000+ | From $29/mo |
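The table's totals follow directly from the component figures. As a sanity check, here's the arithmetic at the low end of each range (illustrative only, not a pricing calculator):

```javascript
// Low-end monthly DIY costs from the table above
const diy = {
  proxies: 50,   // low end of $50–500
  browsers: 100, // low end of $100–300
  captcha: 50,   // low end of $50–200
};

const diyTotal = Object.values(diy).reduce((a, b) => a + b, 0);
const apiTotal = 29; // Mantis entry plan from the table

console.log(`DIY (low end): $${diyTotal}/mo vs API: $${apiTotal}/mo`);
// DIY (low end): $200/mo vs API: $29/mo
```

And that low end excludes the two rows priced in engineering time, which usually dominate.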
Use a web scraping API when you're fighting anti-bot systems, need JavaScript rendering at scale, or don't want to own proxy and browser infrastructure. With Mantis, a scrape is a single request:
```javascript
const axios = require('axios');

const API_KEY = process.env.MANTIS_API_KEY;

(async () => {
  const response = await axios.post('https://api.mantisapi.com/v1/scrape', {
    url: 'https://example.com/products',
    render_js: true,
    extract: {
      products: '.product-card',
      fields: {
        name: '.name',
        price: '.price'
      }
    }
  }, {
    headers: { 'Authorization': `Bearer ${API_KEY}` }
  });

  console.log(response.data.products);
  // [{ name: "Widget Pro", price: "$49.99" }, ...]
})();
```
One API call replaces Puppeteer + proxies + stealth plugins + error handling. See our API comparison guide for details.
| Criteria | Axios + Cheerio | Puppeteer | Playwright | Crawlee | Mantis API |
|---|---|---|---|---|---|
| Learning curve | ⭐ Easy | ⭐⭐ Medium | ⭐⭐ Medium | ⭐⭐ Medium | ⭐ Easy |
| Speed | Very fast | Slow | Slow | Fast | Fast |
| JS rendering | ❌ | ✅ | ✅ | ✅ (plugin) | ✅ |
| Concurrency | Promise.all | puppeteer-cluster | Manual | Built-in | Built-in |
| Anti-bot bypass | Manual | Stealth plugin | Manual | Built-in | Automatic |
| Best for | Quick scripts | Screenshots, PDFs | JS-heavy sites | Large crawls | Production / AI |
The eternal debate. Here's a fair comparison:
| Factor | JavaScript / Node.js | Python |
|---|---|---|
| Browser automation | ⭐⭐⭐ (Puppeteer/Playwright were built here) | ⭐⭐ (good bindings) |
| HTML parsing | ⭐⭐ Cheerio | ⭐⭐⭐ BeautifulSoup, lxml |
| Crawling frameworks | ⭐⭐ Crawlee | ⭐⭐⭐ Scrapy (mature) |
| Async/concurrency | ⭐⭐⭐ Native event loop | ⭐⭐ asyncio (added later) |
| Data processing | ⭐ Limited | ⭐⭐⭐ pandas, NumPy |
| Community/tutorials | ⭐⭐ Growing | ⭐⭐⭐ Dominant |
| AI/agent integration | ⭐⭐ Vercel AI SDK | ⭐⭐⭐ LangChain, CrewAI |
Choose JavaScript when: Your stack is already JS, you need browser automation, or you want native async concurrency.
Choose Python when: You need Scrapy-level crawling, data science integration, or access to the larger scraping community.
Choose Mantis API when: You don't want to worry about language-specific infrastructure at all.
See our Python scraping guide for the Python side of the comparison.
Mantis WebPerception API: scraping, screenshots, and AI extraction — one API call. Works with any language.
**Is Node.js good for web scraping?**
Yes. Node.js is one of the best platforms for web scraping. Cheerio parses HTML fast, while Puppeteer and Playwright automate full browsers for JavaScript-rendered pages. Node's async-first design makes it naturally suited for concurrent scraping.
**Which Node.js library should I use for web scraping?**
For static HTML pages, use Cheerio with Axios — it's fast and lightweight. For JavaScript-rendered pages (SPAs, React, Angular), use Puppeteer or Playwright. For production workloads at scale, a web scraping API like Mantis handles everything automatically.
**Is Puppeteer or Playwright better for scraping?**
Playwright is generally better for scraping in 2026. It supports Chromium, Firefox, and WebKit, has built-in auto-waiting, better network interception, and more reliable selectors. Puppeteer is Chromium-only but has a larger community. For new projects, we recommend Playwright.
**Is Node.js or Python better for web scraping?**
Python has more scraping libraries and a larger community. JavaScript/Node.js excels at browser automation since Puppeteer and Playwright were built for it. If you already write JavaScript, Node.js is excellent. If you need large-scale crawling frameworks, Python (Scrapy) has the edge.
**How do I scrape JavaScript-rendered pages with Node.js?**
Use Puppeteer or Playwright to launch a headless browser, navigate to the page, wait for content to render, then extract the data. You can also feed the rendered HTML into Cheerio for fast parsing. Alternatively, use a web scraping API like Mantis that handles JavaScript rendering server-side.
**How much does web scraping with Node.js cost?**
Node.js libraries are free, but production scraping has hidden costs: proxy services ($50–500/month), headless browser servers ($100–300/month), CAPTCHA solving ($1–3 per 1,000), and maintenance time. A web scraping API like Mantis starts free (100 calls/month) with paid plans from $29/month for 5,000 calls.
© 2026 Mantis · Web scraping, screenshots, and AI data extraction for agents and developers.