Web Scraping with JavaScript & Node.js in 2026: The Ultimate Guide

Updated March 27, 2026 · 22 min read · By the Mantis Team

📑 Table of Contents

  1. Why Node.js for Web Scraping?
  2. The Node.js Web Scraping Stack
  3. Quick Start: Your First Node.js Scraper
  4. Parsing HTML with Cheerio
  5. Browser Automation with Puppeteer
  6. Modern Scraping with Playwright
  7. Concurrent Scraping Patterns
  8. Avoiding Blocks: Headers, Proxies & Stealth
  9. When to Use a Web Scraping API Instead
  10. Choosing the Right Tool
  11. JavaScript vs Python for Scraping
  12. FAQ

1. Why Node.js for Web Scraping?

JavaScript is the language of the web — and Node.js makes it the language of web scraping too. Here's why developers choose Node.js for scraping in 2026:

- Native async concurrency: Node's event loop was built for I/O-heavy work like fetching many pages at once.
- First-class browser automation: Puppeteer and Playwright were built for JavaScript first.
- Same language as the pages you scrape: selectors, DOM APIs, and in-page scripts are all JavaScript.
- A huge ecosystem: npm has a mature tool for every layer of the scraping stack.

2. The Node.js Web Scraping Stack

Here's every major tool in the Node.js scraping ecosystem:

| Tool | Type | Best For | JS Rendering | Guide |
|---|---|---|---|---|
| Cheerio | HTML parser | Fast HTML parsing (jQuery-style) | ❌ | Full guide → |
| Puppeteer | Browser automation | Headless Chrome, screenshots | ✅ | Full guide → |
| Playwright | Browser automation | Multi-browser, modern API | ✅ | Playwright guide → |
| Axios | HTTP client | Simple HTTP requests | ❌ | |
| node-fetch | HTTP client | Fetch API for Node.js | ❌ | |
| Got | HTTP client | Advanced HTTP (retries, streams) | ❌ | |
| Crawlee | Framework | Large-scale crawling | ✅ (via Puppeteer/Playwright) | |
| Mantis API | Web scraping API | Production scraping, AI agents | ✅ | Full guide → |
💡 Tip: The most common combo is Axios + Cheerio for static pages (like Python's Requests + BeautifulSoup), and Puppeteer or Playwright for JavaScript-rendered sites.

3. Quick Start: Your First Node.js Scraper

Let's build a working scraper in under 20 lines using Axios to fetch pages and Cheerio to parse HTML:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeHN() {
  // 1. Fetch the page
  const { data } = await axios.get('https://news.ycombinator.com', {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' }
  });

  // 2. Parse the HTML
  const $ = cheerio.load(data);

  // 3. Extract data
  $('.titleline > a').slice(0, 10).each((i, el) => {
    console.log($(el).text(), '→', $(el).attr('href'));
  });
}

scrapeHN();
Install the dependencies, then run the script:

npm install axios cheerio
node scraper.js

That's it — a working scraper in 15 lines. For the complete jQuery-style API, DOM traversal, table scraping, and pagination, see our complete Cheerio guide.

4. Parsing HTML with Cheerio

Cheerio is the Node.js equivalent of Python's BeautifulSoup. It implements a subset of jQuery for fast, memory-efficient HTML parsing — no browser needed:

const cheerio = require('cheerio');

const html = `
  <div class="products">
    <div class="product">
      <h2 class="name">Widget Pro</h2>
      <span class="price">$49.99</span>
      <a href="/products/widget-pro">Details</a>
    </div>
    <div class="product">
      <h2 class="name">Gadget Max</h2>
      <span class="price">$79.99</span>
      <a href="/products/gadget-max">Details</a>
    </div>
  </div>
`;

const $ = cheerio.load(html);

// CSS selectors — just like jQuery
$('.product').each((i, el) => {
  const name = $(el).find('.name').text();
  const price = $(el).find('.price').text();
  const url = $(el).find('a').attr('href');
  console.log({ name, price, url });
});

// DOM traversal
$('.product').first().next().find('.name').text(); // "Gadget Max"
$('.name').parent().attr('class'); // "product"

Cheerio is 10-20x faster than browser-based scraping because it only parses HTML — no DOM rendering, no JavaScript execution. Use it whenever the page content is in the raw HTML. See the complete Cheerio guide for tables, pagination, and production patterns.

5. Browser Automation with Puppeteer

When pages render content with JavaScript (React, Angular, Vue), you need a real browser. Puppeteer controls headless Chrome:

const puppeteer = require('puppeteer');

async function scrapeSPA() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set a realistic User-Agent
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');

  await page.goto('https://example.com/spa-app', {
    waitUntil: 'networkidle2'
  });

  // Wait for dynamic content
  await page.waitForSelector('.product-card');

  // Extract data from the rendered page
  const products = await page.$$eval('.product-card', cards =>
    cards.map(card => ({
      name: card.querySelector('.name').textContent,
      price: card.querySelector('.price').textContent,
    }))
  );

  console.log(products);

  // Take a screenshot
  await page.screenshot({ path: 'products.png', fullPage: true });

  await browser.close();
}

scrapeSPA();
Install Puppeteer (it downloads a compatible Chrome build), then run:

npm install puppeteer
node scraper.js

Puppeteer excels at screenshots, PDF generation, and form interaction. For stealth mode, proxy rotation, network interception, and concurrent scraping with puppeteer-cluster, see our complete Puppeteer guide.

6. Modern Scraping with Playwright

Playwright is the newer alternative to Puppeteer, created by the same team at Microsoft. It supports Chromium, Firefox, and WebKit, with a more modern API:

const { chromium } = require('playwright');

async function scrapeWithPlaywright() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Block images to speed up page loads
  await page.route('**/*.{png,jpg,gif}', route => route.abort());

  // Capture API responses as they arrive (register before navigating,
  // or responses during page load will be missed)
  page.on('response', async response => {
    if (response.url().includes('/api/products')) {
      const json = await response.json();
      console.log('API data:', json);
    }
  });

  await page.goto('https://example.com/products');

  // Playwright auto-waits for elements
  const products = await page.locator('.product-card').all();

  for (const product of products) {
    const name = await product.locator('.name').textContent();
    const price = await product.locator('.price').textContent();
    console.log({ name, price });
  }

  await browser.close();
}

scrapeWithPlaywright();
Install Playwright and its Chromium build, then run:

npm install playwright
npx playwright install chromium
💡 Playwright vs Puppeteer: Playwright has auto-waiting (fewer flaky selectors), multi-browser support, and better network interception. Puppeteer has a larger ecosystem and community. For new projects in 2026, we recommend Playwright.

7. Concurrent Scraping Patterns

Node.js was built for concurrency. Here are the key patterns for scraping many pages at once:

Promise.all with Rate Limiting

const axios = require('axios');
const cheerio = require('cheerio');

// Process URLs in batches of N
async function scrapeBatch(urls, batchSize = 5) {
  const results = [];

  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);

    const batchResults = await Promise.all(
      batch.map(async (url) => {
        try {
          const { data } = await axios.get(url, {
            headers: { 'User-Agent': 'Mozilla/5.0' },
            timeout: 10000
          });
          const $ = cheerio.load(data);
          return { url, title: $('h1').text(), status: 'ok' };
        } catch (err) {
          return { url, error: err.message, status: 'error' };
        }
      })
    );

    results.push(...batchResults);

    // Rate limit: wait 1 second between batches
    if (i + batchSize < urls.length) {
      await new Promise(r => setTimeout(r, 1000));
    }
  }

  return results;
}

// Usage
const urls = Array.from({ length: 50 }, (_, i) => `https://example.com/page/${i + 1}`);
scrapeBatch(urls, 5).then(console.log);
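One drawback of fixed batches: each batch waits for its slowest request before the next one starts. An alternative is a sliding-window pool that starts a new request as soon as any slot frees up. Here's a minimal, dependency-free sketch (`pool` is a hand-rolled helper, not a library import; the same idea ships as the `p-limit` package on npm):

```javascript
// A sliding-window pool: keeps up to `limit` tasks in flight at once,
// starting a new one the moment any task finishes.
async function pool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unclaimed index.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Launch `limit` runners and wait for all of them to drain the list.
  await Promise.all(Array.from({ length: limit }, runner));
  return results;
}

// Usage: process five items with at most 3 in flight.
pool([1, 2, 3, 4, 5], 3, async (n) => n * n).then(console.log);
// → [1, 4, 9, 16, 25]
```

Swap the squaring worker for the Axios + Cheerio fetcher from the batching example above and you get steady, bounded concurrency instead of stop-and-go batches.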

Puppeteer Cluster for Browser-Based Concurrency

const { Cluster } = require('puppeteer-cluster');

async function scrapeWithCluster() {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4,
    puppeteerOptions: { headless: true }
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    console.log(`${url} → ${title}`);
  });

  for (let i = 1; i <= 50; i++) {
    cluster.queue(`https://example.com/page/${i}`);
  }

  await cluster.idle();
  await cluster.close();
}

scrapeWithCluster();

Crawlee — The Scrapy of Node.js

For large-scale structured crawling, Crawlee (by Apify) is the most complete Node.js framework:

const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestsPerMinute: 60,

  async requestHandler({ $, request, enqueueLinks }) {
    const title = $('h1').text();
    const price = $('.price').text();
    console.log({ url: request.url, title, price });

    // Auto-discover and follow links
    await enqueueLinks({
      selector: 'a.next-page',
    });
  },
});

crawler.run(['https://example.com/products']);

Crawlee handles retries, request queues, data storage, proxy rotation, and both HTTP and browser-based crawling. It's what you reach for when a simple script isn't enough.
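Crawlee's automatic retries are one of its biggest wins over hand-rolled scripts. If you do stay with plain Axios, the core pattern is worth knowing: exponential backoff with jitter. A minimal sketch (`withRetry` is our own helper here, not an Axios or Crawlee API):

```javascript
// Retry a flaky async operation with exponential backoff plus jitter.
// Delays grow as base, 2x base, 4x base, ... with up to one extra
// base-delay of random jitter to avoid thundering-herd retries.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts, give up
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

// Usage with Axios (sketch):
// const { data } = await withRetry(() => axios.get(url, { timeout: 10000 }));
```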

8. Avoiding Blocks: Headers, Proxies & Stealth

The same anti-bot systems that block Python scrapers block Node.js scrapers. Here's how to stay under the radar:

Essential Headers

const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Referer': 'https://www.google.com/',
  'Connection': 'keep-alive',
};
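A single static User-Agent repeated across thousands of requests is itself a fingerprint. A simple mitigation is to rotate through a small set of realistic UA strings per request. A sketch (the strings below are illustrative examples, not a vetted or current list):

```javascript
// A small pool of realistic desktop User-Agent strings (illustrative).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36',
];

// Build a fresh header set with a randomly chosen User-Agent.
function randomHeaders() {
  return {
    'User-Agent': USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)],
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
  };
}

// Usage with Axios (sketch):
// const { data } = await axios.get(url, { headers: randomHeaders() });
```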

Puppeteer Stealth

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
// Now passes most bot detection tests

Proxy Rotation

const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];

// With Axios
const { HttpsProxyAgent } = require('https-proxy-agent');
const proxy = proxies[Math.floor(Math.random() * proxies.length)];
const { data } = await axios.get(url, {
  httpsAgent: new HttpsProxyAgent(proxy)
});

// With Puppeteer
const browser = await puppeteer.launch({
  args: [`--proxy-server=${proxy}`]
});
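Random selection can hit the same proxy several times in a row; round-robin rotation spreads requests evenly across the pool. A dependency-free sketch (`makeProxyRotator` is our own helper, and the proxy URLs are placeholders):

```javascript
// Cycle through proxies in order, wrapping around at the end.
function makeProxyRotator(proxies) {
  let i = 0;
  return () => proxies[i++ % proxies.length];
}

const nextProxy = makeProxyRotator([
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
]);

nextProxy(); // proxy1
nextProxy(); // proxy2
nextProxy(); // proxy3
nextProxy(); // wraps back to proxy1
```

Call `nextProxy()` wherever the snippets above pick a random proxy, and each request moves to the next proxy in the pool.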

For a comprehensive deep dive into anti-blocking for all languages, see our guide to scraping without getting blocked.

🛡️ Tired of Fighting Anti-Bot Systems?

Mantis handles proxy rotation, JavaScript rendering, and anti-blocking automatically. One API call, clean data back.

Try Mantis Free — 100 Calls/Month →

9. When to Use a Web Scraping API Instead

Building and maintaining scraping infrastructure is expensive. Here's the real cost:

| Component | DIY Cost (Monthly) | Mantis API |
|---|---|---|
| Proxy rotation | $50–500 | ✅ Included |
| Headless browsers | $100–300 | ✅ Included |
| CAPTCHA solving | $50–200 | ✅ Included |
| Anti-bot bypass | Engineering time | ✅ Included |
| Maintenance | Ongoing dev hours | ✅ Managed |
| Total | $200–1,000+ | From $29/mo |

Use a web scraping API when:

- The target sites use aggressive anti-bot protection you'd otherwise fight by hand.
- You need JavaScript rendering at scale without running a browser fleet.
- You don't want to maintain proxy pools, CAPTCHA solving, and stealth plugins.
- Engineering time matters more than infrastructure control.

Mantis API with Node.js

const axios = require('axios');

const response = await axios.post('https://api.mantisapi.com/v1/scrape', {
  url: 'https://example.com/products',
  render_js: true,
  extract: {
    products: '.product-card',
    fields: {
      name: '.name',
      price: '.price'
    }
  }
}, {
  headers: { 'Authorization': `Bearer ${API_KEY}` }
});

console.log(response.data.products);
// [{ name: "Widget Pro", price: "$49.99" }, ...]

One API call replaces Puppeteer + proxies + stealth plugins + error handling. See our API comparison guide for details.

10. Choosing the Right Tool

Decision flowchart:

📌 Does the page need JavaScript to render?
├─ No → Is it a small project?
│    ├─ Yes → Axios + Cheerio (guide)
│    └─ No → Need structured crawling?
│         ├─ Yes → Crawlee (CheerioCrawler)
│         └─ No → Axios + Cheerio with batching
└─ Yes → Is it for production / at scale?
     ├─ Yes → Mantis API (pricing)
     └─ No → Playwright or Puppeteer (guide)

Quick Comparison

| Criteria | Axios + Cheerio | Puppeteer | Playwright | Crawlee | Mantis API |
|---|---|---|---|---|---|
| Learning curve | ⭐ Easy | ⭐⭐ Medium | ⭐⭐ Medium | ⭐⭐ Medium | ⭐ Easy |
| Speed | Very fast | Slow | Slow | Fast | Fast |
| JS rendering | ❌ | ✅ | ✅ | ✅ (plugin) | ✅ |
| Concurrency | Promise.all | puppeteer-cluster | Manual | Built-in | Built-in |
| Anti-bot bypass | Manual | Stealth plugin | Manual | Built-in | Automatic |
| Best for | Quick scripts | Screenshots, PDFs | JS-heavy sites | Large crawls | Production / AI |

11. JavaScript vs Python for Web Scraping

The eternal debate. Here's a fair comparison:

| Factor | JavaScript / Node.js | Python |
|---|---|---|
| Browser automation | ⭐⭐⭐ (Puppeteer/Playwright were built here) | ⭐⭐ (good bindings) |
| HTML parsing | ⭐⭐ Cheerio | ⭐⭐⭐ BeautifulSoup, lxml |
| Crawling frameworks | ⭐⭐ Crawlee | ⭐⭐⭐ Scrapy (mature) |
| Async/concurrency | ⭐⭐⭐ Native event loop | ⭐⭐ asyncio (added later) |
| Data processing | ⭐ Limited | ⭐⭐⭐ pandas, NumPy |
| Community/tutorials | ⭐⭐ Growing | ⭐⭐⭐ Dominant |
| AI/agent integration | ⭐⭐ Vercel AI SDK | ⭐⭐⭐ LangChain, CrewAI |

Choose JavaScript when: Your stack is already JS, you need browser automation, or you want native async concurrency.

Choose Python when: You need Scrapy-level crawling, data science integration, or access to the larger scraping community.

Choose Mantis API when: You don't want to worry about language-specific infrastructure at all.

See our Python scraping guide for the Python side of the comparison.

🚀 Need Data at Scale? Skip the Infrastructure.

Mantis WebPerception API: scraping, screenshots, and AI extraction — one API call. Works with any language.

Start Free →

12. Frequently Asked Questions

Can you web scrape with JavaScript?

Yes. Node.js is one of the best platforms for web scraping. Cheerio parses HTML fast, while Puppeteer and Playwright automate full browsers for JavaScript-rendered pages. Node's async-first design makes it naturally suited for concurrent scraping.

What is the best Node.js library for web scraping?

For static HTML pages, use Cheerio with Axios — it's fast and lightweight. For JavaScript-rendered pages (SPAs, React, Angular), use Puppeteer or Playwright. For production workloads at scale, a web scraping API like Mantis handles everything automatically.

Is Puppeteer or Playwright better for web scraping?

Playwright is generally better for scraping in 2026. It supports Chromium, Firefox, and WebKit, has built-in auto-waiting, better network interception, and more reliable selectors. Puppeteer is Chromium-only but has a larger community. For new projects, we recommend Playwright.

Is JavaScript better than Python for web scraping?

Python has more scraping libraries and a larger community. JavaScript/Node.js excels at browser automation since Puppeteer and Playwright were built for it. If you already write JavaScript, Node.js is excellent. If you need large-scale crawling frameworks, Python (Scrapy) has the edge.

How do I scrape a JavaScript-rendered website with Node.js?

Use Puppeteer or Playwright to launch a headless browser, navigate to the page, wait for content to render, then extract the data. You can also feed the rendered HTML into Cheerio for fast parsing. Alternatively, use a web scraping API like Mantis that handles JavaScript rendering server-side.

How much does web scraping with Node.js cost?

Node.js libraries are free, but production scraping has hidden costs: proxy services ($50–500/month), headless browser servers ($100–300/month), CAPTCHA solving ($1–3 per 1,000), and maintenance time. A web scraping API like Mantis starts free (100 calls/month) with paid plans from $29/month for 5,000 calls.



© 2026 Mantis · Web scraping, screenshots, and AI data extraction for agents and developers.