Web Scraping with Cheerio and Node.js in 2026: The Complete Guide

Published March 16, 2026 · 20 min read · Updated for Cheerio 1.0+

Cheerio is the go-to HTML parsing library for Node.js developers — fast, lightweight, and built around the jQuery API you already know. If you're coming from a JavaScript background, Cheerio is the most natural way to scrape and extract data from web pages.

Think of Cheerio as jQuery for the server. It parses HTML into a traversable data structure and gives you a familiar $() API to select elements, extract text, and manipulate the DOM — all without running a browser. This guide takes you from zero to production-ready scraper.

Table of Contents

  1. Installation & Setup
  2. Your First Cheerio Scraper
  3. CSS Selectors: The Core of Cheerio
  4. DOM Traversal: parent, children, siblings
  5. Extracting Text, Attributes, and HTML
  6. Scraping Tables and Lists
  7. Handling Pagination
  8. Concurrent Scraping with Promise.all
  9. Error Handling and Retries
  10. Production-Ready Scraper Class
  11. When Pages Need JavaScript
  12. Cheerio vs Puppeteer vs Playwright vs API
  13. The API Shortcut: Skip the Parsing
  14. FAQ

1. Installation & Setup

Initialize a Node.js project and install Cheerio with a modern HTTP client:

mkdir my-scraper && cd my-scraper
npm init -y
npm install cheerio axios

We'll use axios for HTTP requests and cheerio for HTML parsing. You can also use node-fetch or the built-in fetch (Node 18+).
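If you'd rather avoid dependencies, the same request works with the built-in fetch. A minimal sketch (the User-Agent string is just a placeholder example):

```javascript
// Fetch a page's HTML using Node's built-in fetch (Node 18+).
// Throws on non-2xx responses so callers can retry or skip.
async function fetchHtml(url) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)' }
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}
```

Pair it with cheerio.load(await fetchHtml(url)) exactly as you would with the axios response body.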

Verify your installation:

import * as cheerio from 'cheerio';

const html = '<h1>Hello, Cheerio!</h1>';
const $ = cheerio.load(html);
console.log($('h1').text()); // "Hello, Cheerio!"
💡 ESM vs CommonJS: Cheerio 1.0+ is published as a dual package, so both import * as cheerio from 'cheerio' (ESM) and const cheerio = require('cheerio') (CommonJS) work. For ESM projects, add "type": "module" to your package.json or use the .mjs extension.

2. Your First Cheerio Scraper

Let's scrape article titles from Hacker News — a classic first target:

import * as cheerio from 'cheerio';
import axios from 'axios';

async function scrapeHackerNews() {
  // 1. Fetch the HTML
  const { data: html } = await axios.get('https://news.ycombinator.com', {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  });

  // 2. Load into Cheerio
  const $ = cheerio.load(html);

  // 3. Extract data
  const stories = [];
  $('.titleline > a').each((i, el) => {
    stories.push({
      rank: i + 1,
      title: $(el).text(),
      url: $(el).attr('href')
    });
  });

  console.log(`Found ${stories.length} stories`);
  stories.slice(0, 5).forEach(s =>
    console.log(`${s.rank}. ${s.title}`)
  );

  return stories;
}

scrapeHackerNews();

That's the core pattern: fetch HTML → load into Cheerio → select elements → extract data. Everything else builds on this.

3. CSS Selectors: The Core of Cheerio

Cheerio supports all standard CSS selectors — the same ones you use in browser DevTools and jQuery:

Basic Selectors

// By tag
$('h1')                    // All h1 elements
$('p')                     // All paragraphs

// By class
$('.product-card')         // Elements with class "product-card"
$('.price.sale')           // Elements with BOTH classes

// By ID
$('#main-content')         // Element with id "main-content"

// By attribute
$('a[href]')               // All links with href attribute
$('img[alt="logo"]')       // Images with alt="logo"
$('a[href^="https"]')      // Links starting with "https"
$('a[href$=".pdf"]')       // Links ending with ".pdf"
$('a[href*="mantis"]')     // Links containing "mantis"

Combinators

// Descendant (any depth)
$('div.products .price')   // .price anywhere inside div.products

// Direct child
$('ul > li')               // Only direct li children of ul

// Adjacent sibling
$('h2 + p')                // First p immediately after h2

// General sibling
$('h2 ~ p')                // All p siblings after h2

Pseudo-selectors

$('tr:first-child')        // First tr in each group
$('tr:last-child')         // Last tr
$('li:nth-child(2)')       // Second li in each group
$('li:nth-child(odd)')     // Odd-numbered list items
$('td:not(.hidden)')       // td without class "hidden"
$('p:contains("price")')   // p elements containing "price" text
💡 Pro tip: Use your browser's DevTools to test selectors. Right-click an element → Inspect → in the Console, run document.querySelectorAll('.your-selector') to verify before coding.

4. DOM Traversal: parent, children, siblings

Sometimes CSS selectors alone aren't enough. Cheerio gives you jQuery-style traversal methods:

// Parent & ancestors
$('.price').parent()                  // Direct parent
$('.price').closest('.product-card')  // Nearest ancestor matching selector
$('.price').parents('div')            // All div ancestors

// Children
$('.product-card').children()         // All direct children
$('.product-card').children('.title') // Direct children matching selector
$('.product-card').find('.price')     // Descendants (any depth) matching selector

// Siblings
$('.active').next()                   // Next sibling
$('.active').prev()                   // Previous sibling
$('.active').nextAll('li')            // All following li siblings
$('.active').siblings()               // All siblings

Chaining Methods

// Chain traversals like jQuery
const prices = $('table.products')
  .find('tbody tr')
  .not('.out-of-stock')
  .find('td.price')
  .map((i, el) => $(el).text().trim())
  .get(); // .get() converts Cheerio object to regular array

5. Extracting Text, Attributes, and HTML

Text Extraction

// Get text content
$('h1').text()                        // Combined text of ALL matched h1s
$('.description').text()              // All text, concatenated
$('.description').first().text()      // Explicitly first element

// Trim whitespace (common need)
$('.price').text().trim()

// Get text from multiple elements
const titles = $('h2').map((i, el) => $(el).text().trim()).get();
console.log(titles); // ['Title 1', 'Title 2', ...]
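On elements with nested markup, .text() often comes back with stray newlines and indentation. A small normalizer (plain string handling, no Cheerio required) cleans it up:

```javascript
// Collapse runs of whitespace (newlines, tabs, repeated spaces)
// into single spaces and trim the ends.
function cleanText(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// cleanText('  Cheerio\n   rocks ') → 'Cheerio rocks'
```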

Attribute Extraction

// Get attributes
$('a').attr('href')                   // href of first link
$('img').attr('src')                  // src of first image
$('img').attr('alt')                  // alt text
$('input').attr('value')              // input value

// Get all links
const links = $('a[href]').map((i, el) => ({
  text: $(el).text().trim(),
  url: $(el).attr('href')
})).get();

// Get data attributes
$('.product').attr('data-id')
$('.product').data('id')              // Shorthand for data-* attributes

HTML Extraction

// Get inner HTML
$('.content').html()                  // Inner HTML of first match

// Get outer HTML (the element itself + its content)
$.html($('.content'))                 // Outer HTML

// Get all HTML
$.html()                              // Entire parsed document

6. Scraping Tables and Lists

Tables are one of the most common scraping targets. Here's how to extract tabular data:

function scrapeTable($, tableSelector) {
  const headers = [];
  const rows = [];

  // Extract headers
  $(`${tableSelector} thead th`).each((i, el) => {
    headers.push($(el).text().trim());
  });

  // Extract rows
  $(`${tableSelector} tbody tr`).each((i, tr) => {
    const row = {};
    $(tr).find('td').each((j, td) => {
      row[headers[j] || `col_${j}`] = $(td).text().trim();
    });
    rows.push(row);
  });

  return { headers, rows };
}

// Usage
const { headers, rows } = scrapeTable($, 'table.pricing');
console.log(headers); // ['Plan', 'Price', 'API Calls']
console.log(rows);    // [{Plan: 'Free', Price: '$0', ...}, ...]
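Everything Cheerio extracts is a string, so numeric columns like Price usually need a conversion pass before analysis. A small parser, sketched here (it assumes dot-decimal formatting):

```javascript
// Convert price text like "$1,299.99" to a number; returns null
// when no numeric value can be recovered (e.g. "Free", "N/A").
function parsePrice(text) {
  const cleaned = String(text).replace(/[^0-9.\-]/g, '');
  const value = parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}
```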

Scraping Nested Lists

function scrapeList($, items) {
  return $(items).map((i, el) => {
    const $el = $(el);
    const nested = $el.children('ul, ol').first().children('li');

    if (nested.length) {
      return {
        text: $el.contents().first().text().trim(),
        children: scrapeList($, nested)
      };
    }
    return { text: $el.text().trim() };
  }).get();
}

// Usage: pass a selector for the top-level items
// const tree = scrapeList($, 'ul.menu > li');

7. Handling Pagination

Most websites spread data across multiple pages. Here's how to handle common pagination patterns:

Next-Page Links

async function scrapeAllPages(startUrl) {
  const allItems = [];
  let url = startUrl;

  while (url) {
    console.log(`Scraping: ${url}`);
    const { data: html } = await axios.get(url, {
      headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' }
    });
    const $ = cheerio.load(html);

    // Extract items from current page
    $('.product-card').each((i, el) => {
      allItems.push({
        title: $(el).find('.title').text().trim(),
        price: $(el).find('.price').text().trim(),
        url: $(el).find('a').attr('href')
      });
    });

    // Find next page link
    const nextLink = $('a.next-page').attr('href');
    url = nextLink ? new URL(nextLink, url).href : null;

    // Be polite — wait between requests
    if (url) await sleep(1000);
  }

  console.log(`Total items: ${allItems.length}`);
  return allItems;
}

const sleep = (ms) => new Promise(r => setTimeout(r, ms));
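One detail worth calling out in the loop above: href values are often relative ("/products?page=2" or even "page2.html"), so they must be resolved against the page they were found on. The WHATWG URL constructor does this for free:

```javascript
// Resolve a possibly-relative link against the page URL it came from.
// Absolute URLs pass through unchanged.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// resolveLink('/products?page=2', 'https://shop.example/products?page=1')
//   → 'https://shop.example/products?page=2'
```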

Page Number Pagination

async function scrapePages(baseUrl, totalPages) {
  const allItems = [];

  for (let page = 1; page <= totalPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    console.log(`Page ${page}/${totalPages}: ${url}`);

    const { data: html } = await axios.get(url);
    const $ = cheerio.load(html);

    $('.item').each((i, el) => {
      allItems.push({
        name: $(el).find('.name').text().trim(),
        price: $(el).find('.price').text().trim()
      });
    });

    await sleep(1000 + Math.random() * 1000); // Random delay
  }

  return allItems;
}

8. Concurrent Scraping with Promise.all

Node.js excels at concurrent I/O. Scrape multiple pages simultaneously while respecting rate limits:

async function scrapeConcurrent(urls, concurrency = 5) {
  const results = [];

  // Process in batches
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    console.log(`Batch ${Math.floor(i / concurrency) + 1}: ${batch.length} URLs`);

    const batchResults = await Promise.allSettled(
      batch.map(async (url) => {
        const { data: html } = await axios.get(url, {
          headers: { 'User-Agent': 'Mozilla/5.0' },
          timeout: 10000
        });
        const $ = cheerio.load(html);
        return {
          url,
          title: $('h1').text().trim(),
          description: $('meta[name="description"]').attr('content') || ''
        };
      })
    );

    // Collect successful results
    for (const result of batchResults) {
      if (result.status === 'fulfilled') {
        results.push(result.value);
      } else {
        console.error(`Failed: ${result.reason.message}`);
      }
    }

    // Delay between batches
    if (i + concurrency < urls.length) {
      await sleep(2000);
    }
  }

  return results;
}

// Usage
const urls = Array.from({ length: 50 }, (_, i) =>
  `https://example.com/products?page=${i + 1}`
);
const data = await scrapeConcurrent(urls, 5);
⚠️ Be respectful: Don't blast servers with hundreds of concurrent requests. Use a concurrency limit (3-10), add delays between batches, and check robots.txt. Getting your IP blocked helps nobody.
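One drawback of fixed batches: each batch waits for its slowest request before the next one starts. A small worker pool keeps exactly N requests in flight the whole time. A dependency-free sketch (libraries like p-limit implement the same idea):

```javascript
// Run async task functions with at most `limit` in flight at once.
// Results come back in input order; failures are captured, not thrown.
async function runPool(tasks, limit = 5) {
  const results = new Array(tasks.length);
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claiming an index is synchronous, so this is safe in single-threaded JS
      results[i] = await tasks[i]().catch(err => ({ error: err.message }));
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```

Each task is a zero-argument function, e.g. urls.map(url => () => fetchAndParse(url)).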

9. Error Handling and Retries

Production scrapers need robust error handling. Network failures, rate limits, and changed HTML are inevitable:

async function fetchWithRetry(url, options = {}, maxRetries = 3) {
  const { delay = 1000, backoffMultiplier = 2 } = options;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await axios.get(url, {
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          'Accept': 'text/html,application/xhtml+xml',
          'Accept-Language': 'en-US,en;q=0.9'
        },
        timeout: 15000,
        validateStatus: (status) => status < 500 // Resolve 4xx (so 429 is handled below); only 5xx/network errors throw and retry
      });

      if (response.status === 429) {
        // Rate limited — back off significantly
        const retryAfter = parseInt(response.headers['retry-after'] || '60');
        console.log(`Rate limited. Waiting ${retryAfter}s...`);
        await sleep(retryAfter * 1000);
        continue;
      }

      return response;
    } catch (error) {
      console.error(`Attempt ${attempt}/${maxRetries} failed: ${error.message}`);
      if (attempt === maxRetries) throw error;
      await sleep(delay * Math.pow(backoffMultiplier, attempt - 1));
    }
  }
}

// Usage
const response = await fetchWithRetry('https://example.com/data');
const $ = cheerio.load(response.data);
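One refinement worth considering: with the fixed schedule above, many workers that fail at the same moment will also retry at the same moment. Randomized "full jitter" backoff spreads those retries out. A common pattern, sketched here:

```javascript
// Exponential backoff with full jitter: wait a random amount of time
// between 0 and an exponentially growing cap (bounded by `max` ms).
function backoffDelay(attempt, base = 1000, max = 30000) {
  const cap = Math.min(max, base * 2 ** (attempt - 1));
  return Math.floor(Math.random() * cap);
}

// In fetchWithRetry, the fixed sleep could become:
// await sleep(backoffDelay(attempt));
```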

10. Production-Ready Scraper Class

Here's a complete, reusable scraper class with logging, retries, concurrency control, and data export:

import * as cheerio from 'cheerio';
import axios from 'axios';
import { writeFile } from 'fs/promises';

class CheerioScraper {
  constructor(options = {}) {
    this.concurrency = options.concurrency || 5;
    this.delay = options.delay || 1000;
    this.maxRetries = options.maxRetries || 3;
    this.timeout = options.timeout || 15000;
    this.results = [];
    this.errors = [];
    this.userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0',
      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0'
    ];
  }

  randomUA() {
    return this.userAgents[Math.floor(Math.random() * this.userAgents.length)];
  }

  async fetch(url) {
    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        const { data } = await axios.get(url, {
          headers: {
            'User-Agent': this.randomUA(),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9'
          },
          timeout: this.timeout
        });
        return cheerio.load(data);
      } catch (err) {
        if (attempt === this.maxRetries) {
          this.errors.push({ url, error: err.message });
          return null;
        }
        await this.sleep(this.delay * Math.pow(2, attempt - 1));
      }
    }
  }

  async scrapeUrls(urls, extractor) {
    for (let i = 0; i < urls.length; i += this.concurrency) {
      const batch = urls.slice(i, i + this.concurrency);
      const batchNum = Math.floor(i / this.concurrency) + 1;
      const totalBatches = Math.ceil(urls.length / this.concurrency);
      console.log(`Batch ${batchNum}/${totalBatches} (${batch.length} URLs)`);

      const promises = batch.map(async (url) => {
        const $ = await this.fetch(url);
        if ($) {
          const data = extractor($, url);
          if (data) this.results.push(data);
        }
      });
      await Promise.allSettled(promises);

      if (i + this.concurrency < urls.length) {
        await this.sleep(this.delay + Math.random() * 1000);
      }
    }

    console.log(`Done: ${this.results.length} results, ${this.errors.length} errors`);
    return this.results;
  }

  async exportJSON(filename) {
    await writeFile(filename, JSON.stringify(this.results, null, 2));
    console.log(`Exported ${this.results.length} items to ${filename}`);
  }

  async exportCSV(filename) {
    if (!this.results.length) return;
    const headers = Object.keys(this.results[0]);
    const csv = [
      headers.join(','),
      ...this.results.map(row =>
        headers.map(h => `"${String(row[h] || '').replace(/"/g, '""')}"`).join(',')
      )
    ].join('\n');
    await writeFile(filename, csv);
    console.log(`Exported ${this.results.length} items to ${filename}`);
  }

  sleep(ms) {
    return new Promise(r => setTimeout(r, ms));
  }
}

// Usage example: scrape product listings
const scraper = new CheerioScraper({ concurrency: 3, delay: 1500 });

const urls = Array.from({ length: 20 }, (_, i) =>
  `https://example.com/products?page=${i + 1}`
);

await scraper.scrapeUrls(urls, ($, url) => {
  const products = [];
  $('.product-card').each((i, el) => {
    products.push({
      name: $(el).find('.name').text().trim(),
      price: $(el).find('.price').text().trim(),
      rating: $(el).find('.rating').attr('data-score'),
      url: $(el).find('a').attr('href'),
      source: url
    });
  });
  return products;
});

await scraper.exportJSON('products.json');
await scraper.exportCSV('products.csv');

🚀 Need Data at Scale? Skip the Infrastructure

Building scrapers is fun — maintaining proxy rotation, handling CAPTCHAs, and managing rate limits is not. Mantis API handles all of that for you.

Get 100 Free API Calls →

11. When Pages Need JavaScript

Cheerio's biggest limitation: it cannot execute JavaScript. If a page loads content dynamically (React, Angular, Vue, infinite scroll), you won't see that content in Cheerio.

How to Detect JavaScript-Rendered Content

// Quick check: compare what Cheerio sees vs what browser sees
async function detectJSContent(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  // If the body is nearly empty or has a root div with no content,
  // the page likely uses JavaScript rendering
  const bodyText = $('body').text().trim();
  const rootDiv = $('#root, #app, #__next').html();

  console.log(`Body text length: ${bodyText.length}`);
  console.log(`Root div content: ${rootDiv ? rootDiv.length : 'N/A'} chars`);

  if (bodyText.length < 100 || (rootDiv && rootDiv.length < 50)) {
    console.log('⚠️ This page likely requires JavaScript rendering');
    return true;
  }
  return false;
}

Option 1: Check for Hidden APIs

Many SPAs fetch data from JSON APIs. Intercept these in DevTools (Network tab → XHR/Fetch) and call them directly — much faster than browser automation:

// Instead of scraping the rendered page, call the API directly
const { data } = await axios.get('https://example.com/api/products', {
  headers: { 'Accept': 'application/json' },
  params: { page: 1, limit: 50 }
});

// data is already structured — no parsing needed!
console.log(data.products);

Option 2: Puppeteer + Cheerio Combo

When you must render JavaScript, use Puppeteer to render the page, then hand the HTML to Cheerio for fast parsing:

import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';

async function scrapeJSPage(url) {
  const browser = await puppeteer.launch(); // headless by default in modern Puppeteer
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Get rendered HTML and parse with Cheerio (much faster than Puppeteer's $ methods)
  const html = await page.content();
  await browser.close();

  const $ = cheerio.load(html);
  // Now use Cheerio as normal — much faster than page.evaluate()
  return $('.product').map((i, el) => ({
    name: $(el).find('.name').text().trim(),
    price: $(el).find('.price').text().trim()
  })).get();
}

Option 3: Use Mantis API

The simplest solution — Mantis handles JavaScript rendering, anti-bot detection, and proxy rotation automatically:

import axios from 'axios';

// One API call replaces Puppeteer + Cheerio + proxy rotation + CAPTCHA handling
const { data } = await axios.post('https://api.mantisapi.com/extract', {
  url: 'https://example.com/products',
  selectors: {
    products: {
      selector: '.product-card',
      type: 'list',
      fields: {
        name: '.name',
        price: '.price',
        rating: { selector: '.rating', attr: 'data-score' }
      }
    }
  }
}, {
  headers: { 'x-api-key': 'YOUR_API_KEY' }
});

console.log(data.products); // Clean, structured data

12. Cheerio vs Puppeteer vs Playwright vs API

| Feature | Cheerio | Puppeteer | Playwright | Mantis API |
| --- | --- | --- | --- | --- |
| Language | Node.js | Node.js | Node.js/Python/Java | Any (REST API) |
| JavaScript rendering | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Speed (per page) | ~50-200ms | ~2-10s | ~2-8s | ~200ms |
| Memory usage | ~20MB | ~300MB+ | ~300MB+ | ~0 (server-side) |
| Anti-bot bypass | ❌ Manual | ⚠️ With plugins | ⚠️ With plugins | ✅ Built-in |
| Proxy rotation | ❌ Manual | ⚠️ Manual config | ⚠️ Manual config | ✅ Built-in |
| Screenshots | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Learning curve | Low (jQuery) | Medium | Medium | Low (REST API) |
| Best for | Static HTML pages | JS-heavy sites | Cross-browser testing | Production at scale |
| Cost at scale | $50-200/mo (servers) | $200-800/mo (infra) | $200-800/mo (infra) | $29-299/mo |
💡 Recommendation: Start with Cheerio for static pages — it's fast and simple. When you hit JavaScript-rendered pages or anti-bot measures, switch to Puppeteer or Mantis API. For production workloads that need reliability, an API eliminates the infrastructure headache.

13. The API Shortcut: Skip the Parsing

Here's the truth about web scraping at scale: the hard part isn't parsing HTML. Cheerio handles that beautifully. The hard parts are:

  - Proxy rotation to avoid IP blocks
  - CAPTCHA solving and anti-bot detection
  - JavaScript rendering for dynamic pages
  - Rate limits, retries, and error handling
  - Maintaining selectors as sites change their HTML

A web scraping API handles all of this. Here's the cost comparison:

| Component | DIY (Cheerio + Puppeteer) | Mantis API |
| --- | --- | --- |
| Proxy rotation | $100-500/mo | Included |
| Headless browser servers | $50-200/mo | Included |
| CAPTCHA solving | $50-200/mo | Included |
| Engineering time | $$$ | None |
| Total | $200-900/mo | $29-299/mo |
// Cheerio: ~50 lines of code, manual proxy rotation, manual error handling
// Mantis API: 5 lines of code, everything handled

const { data } = await axios.get('https://api.mantisapi.com/screenshot', {
  params: { url: 'https://example.com', format: 'png' },
  headers: { 'x-api-key': 'YOUR_API_KEY' }
});
// Done. Screenshot captured. JavaScript rendered. Anti-bot bypassed.

📣 From Cheerio to Production in Minutes

Prototype with Cheerio. Ship with Mantis. Get 100 free API calls/month — no credit card required.

Start Free →

14. FAQ

Common Cheerio web scraping questions cover typical use cases, Cheerio vs Puppeteer, JavaScript rendering limitations, speed benchmarks, Cheerio vs JSDOM, and legal considerations; the sections above address each of these in detail.

What's Next?

You now have everything you need to build production-grade scrapers with Cheerio and Node.js. Here are your next steps: