Web Scraping with Cheerio and Node.js in 2026: The Complete Guide
Cheerio is the go-to HTML parsing library for Node.js developers — fast, lightweight, and built around the jQuery API you already know. If you're coming from a JavaScript background, Cheerio is the most natural way to scrape and extract data from web pages.
Think of Cheerio as jQuery for the server. It parses HTML into a traversable data structure and gives you a familiar $() API to select elements, extract text, and manipulate the DOM — all without running a browser. This guide takes you from zero to production-ready scraper.
Table of Contents
- Installation & Setup
- Your First Cheerio Scraper
- CSS Selectors: The Core of Cheerio
- DOM Traversal: parent, children, siblings
- Extracting Text, Attributes, and HTML
- Scraping Tables and Lists
- Handling Pagination
- Concurrent Scraping with Promise.all
- Error Handling and Retries
- Production-Ready Scraper Class
- When Pages Need JavaScript
- Cheerio vs Puppeteer vs Playwright vs API
- The API Shortcut: Skip the Parsing
- FAQ
1. Installation & Setup
Initialize a Node.js project and install Cheerio with a modern HTTP client:
mkdir my-scraper && cd my-scraper
npm init -y
npm install cheerio axios
We'll use axios for HTTP requests and cheerio for HTML parsing. You can also use node-fetch or the built-in fetch (Node 18+).
Verify your installation:
import * as cheerio from 'cheerio';
const html = '<h1>Hello, Cheerio!</h1>';
const $ = cheerio.load(html);
console.log($('h1').text()); // "Hello, Cheerio!"
This guide uses ESM import syntax, so add "type": "module" to your package.json (or name your files .mjs). In CommonJS projects, const cheerio = require('cheerio') also works with current Cheerio releases, or you can use a dynamic import().
2. Your First Cheerio Scraper
Let's scrape article titles from Hacker News — a classic first target:
import * as cheerio from 'cheerio';
import axios from 'axios';
async function scrapeHackerNews() {
// 1. Fetch the HTML
const { data: html } = await axios.get('https://news.ycombinator.com', {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
});
// 2. Load into Cheerio
const $ = cheerio.load(html);
// 3. Extract data
const stories = [];
$('.titleline > a').each((i, el) => {
stories.push({
rank: i + 1,
title: $(el).text(),
url: $(el).attr('href')
});
});
console.log(`Found ${stories.length} stories`);
stories.slice(0, 5).forEach(s =>
console.log(`${s.rank}. ${s.title}`)
);
return stories;
}
scrapeHackerNews();
That's the core pattern: fetch HTML → load into Cheerio → select elements → extract data. Everything else builds on this.
3. CSS Selectors: The Core of Cheerio
Cheerio supports all standard CSS selectors — the same ones you use in browser DevTools and jQuery:
Basic Selectors
// By tag
$('h1') // All h1 elements
$('p') // All paragraphs
// By class
$('.product-card') // Elements with class "product-card"
$('.price.sale') // Elements with BOTH classes
// By ID
$('#main-content') // Element with id "main-content"
// By attribute
$('a[href]') // All links with href attribute
$('img[alt="logo"]') // Images with alt="logo"
$('a[href^="https"]') // Links starting with "https"
$('a[href$=".pdf"]') // Links ending with ".pdf"
$('a[href*="mantis"]') // Links containing "mantis"
Combinators
// Descendant (any depth)
$('div.products .price') // .price anywhere inside div.products
// Direct child
$('ul > li') // Only direct li children of ul
// Adjacent sibling
$('h2 + p') // First p immediately after h2
// General sibling
$('h2 ~ p') // All p siblings after h2
Pseudo-selectors
$('tr:first-child') // First tr in each group
$('tr:last-child') // Last tr
$('li:nth-child(2)') // Second li in each group
$('li:nth-child(odd)') // Odd-numbered list items
$('td:not(.hidden)') // td without class "hidden"
$('p:contains("price")') // p elements containing "price" text
Tip: test any selector in your browser's DevTools console with document.querySelectorAll('.your-selector') to confirm it matches before writing scraper code.
4. DOM Traversal: parent, children, siblings
Sometimes CSS selectors alone aren't enough. Cheerio gives you jQuery-style traversal methods:
// Parent & ancestors
$('.price').parent() // Direct parent
$('.price').closest('.product-card') // Nearest ancestor matching selector
$('.price').parents('div') // All div ancestors
// Children
$('.product-card').children() // All direct children
$('.product-card').children('.title') // Direct children matching selector
$('.product-card').find('.price') // Descendants (any depth) matching selector
// Siblings
$('.active').next() // Next sibling
$('.active').prev() // Previous sibling
$('.active').nextAll('li') // All following li siblings
$('.active').siblings() // All siblings
Chaining Methods
// Chain traversals like jQuery
const prices = $('table.products')
.find('tbody tr')
.not('.out-of-stock')
.find('td.price')
.map((i, el) => $(el).text().trim())
.get(); // .get() converts Cheerio object to regular array
5. Extracting Text, Attributes, and HTML
Text Extraction
// Get text content
$('h1').text() // Text of first match
$('.description').text() // All text, concatenated
$('.description').first().text() // Explicitly first element
// Trim whitespace (common need)
$('.price').text().trim()
// Get text from multiple elements
const titles = $('h2').map((i, el) => $(el).text().trim()).get();
console.log(titles); // ['Title 1', 'Title 2', ...]
Attribute Extraction
// Get attributes
$('a').attr('href') // href of first link
$('img').attr('src') // src of first image
$('img').attr('alt') // alt text
$('input').attr('value') // input value
// Get all links
const links = $('a[href]').map((i, el) => ({
text: $(el).text().trim(),
url: $(el).attr('href')
})).get();
// Get data attributes
$('.product').attr('data-id')
$('.product').data('id') // Shorthand for data-* attributes
HTML Extraction
// Get inner HTML
$('.content').html() // Inner HTML of first match
// Get outer HTML (the element itself + its content)
$.html($('.content')) // Outer HTML
// Get all HTML
$.html() // Entire parsed document
6. Scraping Tables and Lists
Tables are one of the most common scraping targets. Here's how to extract tabular data:
function scrapeTable($, tableSelector) {
const headers = [];
const rows = [];
// Extract headers
$(`${tableSelector} thead th`).each((i, el) => {
headers.push($(el).text().trim());
});
// Extract rows
$(`${tableSelector} tbody tr`).each((i, tr) => {
const row = {};
$(tr).find('td').each((j, td) => {
row[headers[j] || `col_${j}`] = $(td).text().trim();
});
rows.push(row);
});
return { headers, rows };
}
// Usage
const { headers, rows } = scrapeTable($, 'table.pricing');
console.log(headers); // ['Plan', 'Price', 'API Calls']
console.log(rows); // [{Plan: 'Free', Price: '$0', ...}, ...]
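The heart of scrapeTable is zipping header names with cell values. Isolated from Cheerio, with already-extracted cell arrays standing in for parsed td text, the mapping looks like this:

```javascript
// Pair each cell with its header; fall back to col_N when a row
// has more cells than the table has headers
function zipRow(headers, cells) {
  const row = {};
  cells.forEach((cell, j) => {
    row[headers[j] || `col_${j}`] = cell;
  });
  return row;
}

const headers = ['Plan', 'Price'];
console.log(zipRow(headers, ['Free', '$0']));
// { Plan: 'Free', Price: '$0' }
console.log(zipRow(headers, ['Pro', '$49', 'extra']));
// { Plan: 'Pro', Price: '$49', col_2: 'extra' }
```

The col_N fallback matters on real sites, where colspan tricks and malformed markup often leave rows and headers misaligned.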
Scraping Nested Lists
function scrapeList($, list) {
// `list` is a ul/ol element or selector; returns nested { text, children } items
return $(list).children('li').map((i, li) => {
const $li = $(li);
const nested = $li.children('ul, ol');
// The item's own text, with any nested list's text removed
const text = $li.clone().children('ul, ol').remove().end().text().trim();
return nested.length
? { text, children: scrapeList($, nested.first()) }
: { text };
}).get();
}
// Usage: scrapeList($, 'ul.menu')
7. Handling Pagination
Most websites spread data across multiple pages. Here's how to handle common pagination patterns:
Next-Page Links
async function scrapeAllPages(startUrl) {
const allItems = [];
let url = startUrl;
while (url) {
console.log(`Scraping: ${url}`);
const { data: html } = await axios.get(url, {
headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' }
});
const $ = cheerio.load(html);
// Extract items from current page
$('.product-card').each((i, el) => {
allItems.push({
title: $(el).find('.title').text().trim(),
price: $(el).find('.price').text().trim(),
url: $(el).find('a').attr('href')
});
});
// Find next page link
const nextLink = $('a.next-page').attr('href');
url = nextLink ? new URL(nextLink, url).href : null;
// Be polite — wait between requests
if (url) await sleep(1000);
}
console.log(`Total items: ${allItems.length}`);
return allItems;
}
const sleep = (ms) => new Promise(r => setTimeout(r, ms));
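One detail worth calling out in the loop above: next-page links are often relative (for example /products?page=2), which is why the code resolves them against the current page with the WHATWG URL constructor:

```javascript
// new URL(relative, base) resolves relative hrefs the way a browser would
const base = 'https://example.com/products?page=1';

console.log(new URL('/products?page=2', base).href);
// https://example.com/products?page=2
console.log(new URL('?page=3', base).href);
// https://example.com/products?page=3
console.log(new URL('https://other.example/next', base).href);
// https://other.example/next (absolute URLs pass through untouched)
```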
Page Number Pagination
async function scrapePages(baseUrl, totalPages) {
const allItems = [];
for (let page = 1; page <= totalPages; page++) {
const url = `${baseUrl}?page=${page}`;
console.log(`Page ${page}/${totalPages}: ${url}`);
const { data: html } = await axios.get(url);
const $ = cheerio.load(html);
$('.item').each((i, el) => {
allItems.push({
name: $(el).find('.name').text().trim(),
price: $(el).find('.price').text().trim()
});
});
await sleep(1000 + Math.random() * 1000); // Random delay
}
return allItems;
}
8. Concurrent Scraping with Promise.all
Node.js excels at concurrent I/O. Scrape multiple pages simultaneously while respecting rate limits:
async function scrapeConcurrent(urls, concurrency = 5) {
const results = [];
// Process in batches
for (let i = 0; i < urls.length; i += concurrency) {
const batch = urls.slice(i, i + concurrency);
console.log(`Batch ${Math.floor(i / concurrency) + 1}: ${batch.length} URLs`);
const batchResults = await Promise.allSettled(
batch.map(async (url) => {
const { data: html } = await axios.get(url, {
headers: { 'User-Agent': 'Mozilla/5.0' },
timeout: 10000
});
const $ = cheerio.load(html);
return {
url,
title: $('h1').text().trim(),
description: $('meta[name="description"]').attr('content') || ''
};
})
);
// Collect successful results
for (const result of batchResults) {
if (result.status === 'fulfilled') {
results.push(result.value);
} else {
console.error(`Failed: ${result.reason.message}`);
}
}
// Delay between batches
if (i + concurrency < urls.length) {
await sleep(2000);
}
}
return results;
}
// Usage
const urls = Array.from({ length: 50 }, (_, i) =>
`https://example.com/products?page=${i + 1}`
);
const data = await scrapeConcurrent(urls, 5);
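One tradeoff with batching: each batch waits for its slowest request before the next batch starts. An alternative is a small worker pool that keeps exactly N tasks in flight at all times. Here's a minimal sketch; the setTimeout tasks are placeholders for your per-URL scrape functions:

```javascript
// Run async task functions with at most `concurrency` in flight at once
async function runPool(tasks, concurrency) {
  const results = new Array(tasks.length);
  let next = 0;
  // Each worker claims the next unstarted task until none remain
  const worker = async () => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(concurrency, tasks.length) }, worker)
  );
  return results;
}

// Demo with fake "requests"
const tasks = [1, 2, 3, 4, 5].map(
  (n) => () => new Promise((r) => setTimeout(() => r(n * 10), 10))
);

(async () => {
  console.log(await runPool(tasks, 2)); // [10, 20, 30, 40, 50]
})();
```

Libraries like p-limit implement the same idea with more polish, but the pattern itself fits in a dozen lines.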
9. Error Handling and Retries
Production scrapers need robust error handling. Network failures, rate limits, and changed HTML are inevitable:
async function fetchWithRetry(url, options = {}, maxRetries = 3) {
const { delay = 1000, backoffMultiplier = 2 } = options;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const response = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html,application/xhtml+xml',
'Accept-Language': 'en-US,en;q=0.9'
},
timeout: 15000,
validateStatus: (status) => status < 500 // Don't retry on 4xx
});
if (response.status === 429) {
// Rate limited — back off significantly
const retryAfter = parseInt(response.headers['retry-after'] || '60');
console.log(`Rate limited. Waiting ${retryAfter}s...`);
await sleep(retryAfter * 1000);
continue;
}
return response;
} catch (error) {
console.error(`Attempt ${attempt}/${maxRetries} failed: ${error.message}`);
if (attempt === maxRetries) throw error;
await sleep(delay * Math.pow(backoffMultiplier, attempt - 1));
}
}
// Only reached if every attempt was rate limited (429)
throw new Error(`Request failed after ${maxRetries} attempts: ${url}`);
}
// Usage
const response = await fetchWithRetry('https://example.com/data');
const $ = cheerio.load(response.data);
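With the defaults above (delay = 1000, backoffMultiplier = 2), the wait before retry n is delay * multiplier^(n-1). Laid out as a helper, purely for illustration:

```javascript
// Exponential backoff schedule: the wait doubles after each failed attempt
function backoffDelays(baseMs, multiplier, maxRetries) {
  return Array.from(
    { length: maxRetries - 1 }, // no wait after the final attempt
    (_, i) => baseMs * Math.pow(multiplier, i)
  );
}

console.log(backoffDelays(1000, 2, 3)); // [1000, 2000]
console.log(backoffDelays(500, 3, 4));  // [500, 1500, 4500]
```

Adding random jitter on top of these delays (as the pagination example does) helps avoid retry storms when many workers fail at once.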
10. Production-Ready Scraper Class
Here's a complete, reusable scraper class with logging, retries, concurrency control, and data export:
import * as cheerio from 'cheerio';
import axios from 'axios';
import { writeFile } from 'fs/promises';
class CheerioScraper {
constructor(options = {}) {
this.concurrency = options.concurrency || 5;
this.delay = options.delay || 1000;
this.maxRetries = options.maxRetries || 3;
this.timeout = options.timeout || 15000;
this.results = [];
this.errors = [];
this.userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0'
];
}
randomUA() {
return this.userAgents[Math.floor(Math.random() * this.userAgents.length)];
}
async fetch(url) {
for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
try {
const { data } = await axios.get(url, {
headers: {
'User-Agent': this.randomUA(),
'Accept': 'text/html,application/xhtml+xml',
'Accept-Language': 'en-US,en;q=0.9'
},
timeout: this.timeout
});
return cheerio.load(data);
} catch (err) {
if (attempt === this.maxRetries) {
this.errors.push({ url, error: err.message });
return null;
}
await this.sleep(this.delay * Math.pow(2, attempt - 1));
}
}
}
async scrapeUrls(urls, extractor) {
for (let i = 0; i < urls.length; i += this.concurrency) {
const batch = urls.slice(i, i + this.concurrency);
const batchNum = Math.floor(i / this.concurrency) + 1;
const totalBatches = Math.ceil(urls.length / this.concurrency);
console.log(`Batch ${batchNum}/${totalBatches} (${batch.length} URLs)`);
const promises = batch.map(async (url) => {
const $ = await this.fetch(url);
if ($) {
const data = extractor($, url);
// Extractors may return one record or an array of records
if (Array.isArray(data)) this.results.push(...data);
else if (data) this.results.push(data);
}
});
await Promise.allSettled(promises);
if (i + this.concurrency < urls.length) {
await this.sleep(this.delay + Math.random() * 1000);
}
}
console.log(`Done: ${this.results.length} results, ${this.errors.length} errors`);
return this.results;
}
async exportJSON(filename) {
await writeFile(filename, JSON.stringify(this.results, null, 2));
console.log(`Exported ${this.results.length} items to ${filename}`);
}
async exportCSV(filename) {
if (!this.results.length) return;
const headers = Object.keys(this.results[0]);
const csv = [
headers.join(','),
...this.results.map(row =>
headers.map(h => `"${String(row[h] || '').replace(/"/g, '""')}"`).join(',')
)
].join('\n');
await writeFile(filename, csv);
console.log(`Exported ${this.results.length} items to ${filename}`);
}
sleep(ms) {
return new Promise(r => setTimeout(r, ms));
}
}
// Usage example: scrape product listings
const scraper = new CheerioScraper({ concurrency: 3, delay: 1500 });
const urls = Array.from({ length: 20 }, (_, i) =>
`https://example.com/products?page=${i + 1}`
);
await scraper.scrapeUrls(urls, ($, url) => {
const products = [];
$('.product-card').each((i, el) => {
products.push({
name: $(el).find('.name').text().trim(),
price: $(el).find('.price').text().trim(),
rating: $(el).find('.rating').attr('data-score'),
url: $(el).find('a').attr('href'),
source: url
});
});
return products;
});
await scraper.exportJSON('products.json');
await scraper.exportCSV('products.csv');
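The quoting in exportCSV follows the RFC 4180 convention: wrap every field in double quotes and double any embedded quotes. Isolated as a helper (the field values are illustrative):

```javascript
// RFC 4180-style field quoting: wrap in quotes, double embedded quotes
function csvField(value) {
  return `"${String(value ?? '').replace(/"/g, '""')}"`;
}

function csvLine(values) {
  return values.map(csvField).join(',');
}

console.log(csvLine(['Acme "Pro" Widget', '$49,99', '']));
// "Acme ""Pro"" Widget","$49,99",""
```

Quoting every field unconditionally is slightly verbose but means embedded commas, quotes, and newlines never corrupt the output.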
🚀 Need Data at Scale? Skip the Infrastructure
Building scrapers is fun — maintaining proxy rotation, handling CAPTCHAs, and managing rate limits is not. Mantis API handles all of that for you.
Get 100 Free API Calls →
11. When Pages Need JavaScript
Cheerio's biggest limitation: it cannot execute JavaScript. If a page loads content dynamically (React, Angular, Vue, infinite scroll), you won't see that content in Cheerio.
How to Detect JavaScript-Rendered Content
// Quick check: compare what Cheerio sees vs what browser sees
async function detectJSContent(url) {
const { data: html } = await axios.get(url);
const $ = cheerio.load(html);
// If the body is nearly empty or has a root div with no content,
// the page likely uses JavaScript rendering
const bodyText = $('body').text().trim();
const rootDiv = $('#root, #app, #__next').html();
console.log(`Body text length: ${bodyText.length}`);
console.log(`Root div content: ${rootDiv ? rootDiv.length : 'N/A'} chars`);
if (bodyText.length < 100 || (rootDiv && rootDiv.length < 50)) {
console.log('⚠️ This page likely requires JavaScript rendering');
return true;
}
return false;
}
Option 1: Check for Hidden APIs
Many SPAs fetch data from JSON APIs. Intercept these in DevTools (Network tab → XHR/Fetch) and call them directly — much faster than browser automation:
// Instead of scraping the rendered page, call the API directly
const { data } = await axios.get('https://example.com/api/products', {
headers: { 'Accept': 'application/json' },
params: { page: 1, limit: 50 }
});
// data is already structured — no parsing needed!
console.log(data.products);
Option 2: Puppeteer + Cheerio Combo
When you must render JavaScript, use Puppeteer to render the page, then hand the HTML to Cheerio for fast parsing:
import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';
async function scrapeJSPage(url) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
// Get rendered HTML and parse with Cheerio (much faster than Puppeteer's $ methods)
const html = await page.content();
await browser.close();
const $ = cheerio.load(html);
// Now use Cheerio as normal — much faster than page.evaluate()
return $('.product').map((i, el) => ({
name: $(el).find('.name').text().trim(),
price: $(el).find('.price').text().trim()
})).get();
}
Option 3: Use Mantis API
The simplest solution — Mantis handles JavaScript rendering, anti-bot detection, and proxy rotation automatically:
import axios from 'axios';
// One API call replaces Puppeteer + Cheerio + proxy rotation + CAPTCHA handling
const { data } = await axios.post('https://api.mantisapi.com/extract', {
url: 'https://example.com/products',
selectors: {
products: {
selector: '.product-card',
type: 'list',
fields: {
name: '.name',
price: '.price',
rating: { selector: '.rating', attr: 'data-score' }
}
}
}
}, {
headers: { 'x-api-key': 'YOUR_API_KEY' }
});
console.log(data.products); // Clean, structured data
12. Cheerio vs Puppeteer vs Playwright vs API
| Feature | Cheerio | Puppeteer | Playwright | Mantis API |
|---|---|---|---|---|
| Language | Node.js | Node.js | Node.js/Python/Java | Any (REST API) |
| JavaScript rendering | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Speed (per page) | ~50-200ms | ~2-10s | ~2-8s | ~200ms |
| Memory usage | ~20MB | ~300MB+ | ~300MB+ | ~0 (server-side) |
| Anti-bot bypass | ❌ Manual | ⚠️ With plugins | ⚠️ With plugins | ✅ Built-in |
| Proxy rotation | ❌ Manual | ⚠️ Manual config | ⚠️ Manual config | ✅ Built-in |
| Screenshots | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Learning curve | Low (jQuery) | Medium | Medium | Low (REST API) |
| Best for | Static HTML pages | JS-heavy sites | Cross-browser testing | Production at scale |
| Cost at scale | $50-200/mo (servers) | $200-800/mo (infra) | $200-800/mo (infra) | $29-299/mo |
13. The API Shortcut: Skip the Parsing
Here's the truth about web scraping at scale: the hard part isn't parsing HTML. Cheerio handles that beautifully. The hard parts are:
- Anti-bot detection — CAPTCHAs, fingerprinting, IP bans
- JavaScript rendering — SPAs, dynamic content, infinite scroll
- Proxy rotation — Residential proxies cost $5-15/GB
- Infrastructure — Headless browsers eat RAM and CPU
- Maintenance — Sites change layouts, selectors break
A web scraping API handles all of this. Here's the cost comparison:
| Component | DIY (Cheerio + Puppeteer) | Mantis API |
|---|---|---|
| Proxy rotation | $100-500/mo | Included |
| Headless browser servers | $50-200/mo | Included |
| CAPTCHA solving | $50-200/mo | Included |
| Engineering time | $$$ | None |
| Total | $200-900/mo | $29-299/mo |
// Cheerio: ~50 lines of code, manual proxy rotation, manual error handling
// Mantis API: 5 lines of code, everything handled
const { data } = await axios.get('https://api.mantisapi.com/screenshot', {
params: { url: 'https://example.com', format: 'png' },
headers: { 'x-api-key': 'YOUR_API_KEY' }
});
// Done. Screenshot captured. JavaScript rendered. Anti-bot bypassed.
📣 From Cheerio to Production in Minutes
Prototype with Cheerio. Ship with Mantis. Get 100 free API calls/month — no credit card required.
Start Free →
14. FAQ
The structured FAQ data on this page covers common Cheerio web scraping questions: typical use cases, Cheerio vs Puppeteer, JavaScript rendering limitations, speed benchmarks, Cheerio vs JSDOM, and legal considerations.
What's Next?
You now have everything you need to build production-grade scrapers with Cheerio and Node.js. Here are your next steps:
- For JavaScript-heavy pages: Read our Puppeteer Complete Guide
- For Python developers: Check our BeautifulSoup Guide or Scrapy Guide
- Getting blocked? Read How to Scrape Without Getting Blocked
- Comparing tools? See our Best Web Scraping APIs for AI Agents
- Ready to scale? Try Mantis API free — 100 calls/month, no credit card