Web Scraping with JavaScript and Node.js: The Complete Guide for 2026


JavaScript isn't just for building websites anymore. It's one of the most popular languages for web scraping — and with Node.js, you can build scrapers that handle everything from static HTML to JavaScript-heavy single-page apps.

But scraping in 2026 is harder than it used to be. Anti-bot systems, CAPTCHAs, and dynamic rendering make DIY scraping a constant battle. This guide covers every approach — from simple HTML parsing to headless browsers to API-based scraping — so you can pick the right tool for your project.

Why JavaScript for Web Scraping?

The web runs on JavaScript, so the language you scrape with is the same language the pages themselves use: you can reuse your DOM knowledge and CSS selectors, and even execute code inside the page. On top of that, Node.js offers mature tooling at every level, from lightweight HTML parsers to full browser automation, and its async I/O model suits firing many HTTP requests concurrently. The sections below compare the main options.

The 4 Approaches to Web Scraping in JavaScript

| Approach | Best For | Handles JS? | Speed | Complexity |
|---|---|---|---|---|
| Cheerio + Axios | Static HTML pages | ❌ | ⚡ Fast | Low |
| Puppeteer | Chrome-rendered pages | ✅ | 🐢 Slow | Medium |
| Playwright | Cross-browser, complex SPAs | ✅ | 🐢 Slow | Medium |
| WebPerception API | Production scraping at scale | ✅ | ⚡ Fast | Very Low |

Approach 1: Cheerio + Axios (Static Pages)

The lightest option. Fetch raw HTML and parse it with jQuery-like syntax.

Setup

npm install axios cheerio

Example: Scrape Product Listings

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeProducts(url) {
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  const products = [];
  $('.product-card').each((i, el) => {
    products.push({
      name: $(el).find('.title').text().trim(),
      price: $(el).find('.price').text().trim(),
      url: $(el).find('a').attr('href'),
    });
  });

  return products;
}

scrapeProducts('https://example.com/products')
  .then(console.log);
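One gotcha in the snippet above: `attr('href')` often returns a relative path like `/p/42`. Node's built-in `URL` class resolves it against the page URL. A small helper sketch (`absolutize` is my own name, not a library function):

```javascript
// Resolve a possibly-relative href against the page it was scraped from.
// Returns null for missing hrefs instead of throwing.
function absolutize(href, pageUrl) {
  if (!href) return null;
  return new URL(href, pageUrl).href;
}

// Inside the .each() loop above, you'd use:
//   url: absolutize($(el).find('a').attr('href'), url),
```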

When Cheerio Works

- Server-rendered pages where the data is already in the initial HTML
- High-volume jobs where speed and low memory use matter

When Cheerio Fails

- Pages that render content with client-side JavaScript (React, Vue, and other SPAs): Cheerio only sees the empty app shell
- Pages that require interaction such as clicking, scrolling, or logging in
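A quick way to tell which case you're in is to fetch the raw HTML and check whether the content is actually there. A heuristic sketch (the function and patterns are my own, not a library API) that flags the empty app-shell markup typical of client-rendered pages:

```javascript
// Heuristic: a page that ships an empty mount point ("app shell")
// almost certainly renders its content with client-side JavaScript.
function looksJsRendered(html) {
  const emptyRoot = /<div[^>]*id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i;
  const noscriptWarning = /<noscript>[^<]*enable JavaScript/i;
  return emptyRoot.test(html) || noscriptWarning.test(html);
}
```

If this returns true for your target page, skip straight to a browser-based approach.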

Approach 2: Puppeteer (Headless Chrome)

When pages need JavaScript to render, you need a real browser. Puppeteer controls Chrome programmatically.

Setup

npm install puppeteer

Example: Scrape a JavaScript-Rendered Page

const puppeteer = require('puppeteer');

async function scrapeSPA(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  const data = await page.evaluate(() => {
    const items = document.querySelectorAll('.listing-item');
    return Array.from(items).map(item => ({
      title: item.querySelector('h2')?.textContent?.trim(),
      price: item.querySelector('.price')?.textContent?.trim(),
      description: item.querySelector('.desc')?.textContent?.trim(),
    }));
  });

  await browser.close();
  return data;
}

Handling Infinite Scroll

async function scrapeInfiniteScroll(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  let previousHeight;
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await new Promise(r => setTimeout(r, 2000));
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) break;
  }

  const data = await page.evaluate(() => {
    // Extract all loaded items
    return Array.from(document.querySelectorAll('.item'))
      .map(el => el.textContent.trim());
  });

  await browser.close();
  return data;
}

Puppeteer Challenges

- Slow: every page load spins up a full Chrome rendering pipeline
- Resource-heavy: each browser instance consumes hundreds of MB of RAM
- Detectable: anti-bot systems fingerprint headless Chrome and block it
- Brittle: waits, timeouts, and selectors need constant tuning as sites change

Approach 3: Playwright (Cross-Browser)

Playwright is Microsoft's answer to Puppeteer. It supports Chrome, Firefox, and WebKit, with a more modern API.

Setup

npm install playwright

Example: Scrape with Auto-Waiting

const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url);
  await page.waitForSelector('.results-loaded');

  const results = await page.$$eval('.result-card', cards =>
    cards.map(card => ({
      title: card.querySelector('h3')?.textContent?.trim(),
      link: card.querySelector('a')?.href,
      snippet: card.querySelector('.snippet')?.textContent?.trim(),
    }))
  );

  await browser.close();
  return results;
}

Playwright vs Puppeteer

| Feature | Puppeteer | Playwright |
|---|---|---|
| Browsers | Chromium (experimental Firefox) | Chromium, Firefox, WebKit |
| Auto-waiting | Manual waits | Built-in |
| API design | Promise-based | Promise-based, with built-in locators |
| Parallelism | Page-level | Context-level (lighter) |
| Maintained by | Google | Microsoft |

Both have the same fundamental limitations: they're slow, resource-heavy, and detectable.

Approach 4: WebPerception API (Production-Grade)

All three approaches above share the same problems at scale:

- Headless browsers are slow and memory-hungry, so more throughput means more servers
- Anti-bot systems detect automated browsers, pulling you into a proxy-and-fingerprint arms race
- CSS selectors silently break whenever the target site changes its markup
- CAPTCHAs and dynamic rendering demand constant hands-on maintenance

The WebPerception API eliminates all of this. One API call replaces your entire scraping infrastructure.

Setup

npm install node-fetch  # or use built-in fetch in Node 18+

Example: Scrape Any Page

const response = await fetch('https://api.mantisapi.com/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    render_js: true,
  }),
});

const { content, metadata } = await response.json();
// content = fully rendered HTML, ready for parsing

Example: AI-Powered Data Extraction

Skip the selectors entirely. Tell the API what you want in plain English:

const response = await fetch('https://api.mantisapi.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    prompt: 'Extract all products with name, price, rating, and availability',
    schema: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          price: { type: 'number' },
          rating: { type: 'number' },
          availability: { type: 'string' },
        },
      },
    },
  }),
});

const { data } = await response.json();
// data = structured JSON matching your schema
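Because the extraction result arrives as plain JSON, it's worth a cheap sanity check before it flows into the rest of your pipeline. A minimal sketch, not a full JSON Schema validator (reach for a library like Ajv for that); `validRows` is my own helper name:

```javascript
// Keep only rows that have the fields downstream code depends on.
// `required` maps field name -> expected typeof result.
function validRows(rows, required) {
  if (!Array.isArray(rows)) return [];
  return rows.filter(row =>
    row && Object.entries(required).every(
      ([key, type]) => typeof row[key] === type
    )
  );
}
```

For the schema above you'd call `validRows(data, { name: 'string', price: 'number' })` and log or re-queue anything that gets filtered out.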

Example: Take Screenshots

const response = await fetch('https://api.mantisapi.com/v1/screenshot', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    full_page: true,
    format: 'png',
  }),
});

const screenshot = Buffer.from(await response.arrayBuffer()); // works with built-in fetch and node-fetch v3

Why Use an API?

| DIY Scraping | WebPerception API |
|---|---|
| Manage browser infrastructure | One HTTP call |
| Fight anti-bot systems | Handled automatically |
| Fix broken selectors | AI extracts data by intent |
| Scale = more servers | Scale = more API calls |
| Hours of maintenance/week | Zero maintenance |

Pricing

Start free at mantisapi.com.

Common Patterns

Handling Pagination

// DIY with Cheerio
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeAllPages(baseUrl) {
  let page = 1;
  let allResults = [];

  while (true) {
    const { data } = await axios.get(`${baseUrl}?page=${page}`);
    const $ = cheerio.load(data);
    const items = $('.item').map((i, el) => $(el).text().trim()).get();

    if (items.length === 0) break;
    allResults.push(...items);
    page++;
  }

  return allResults;
}

// With WebPerception API — just pass each URL
async function scrapeAllPagesAPI(baseUrl, totalPages) {
  const results = await Promise.all(
    Array.from({ length: totalPages }, (_, i) =>
      fetch('https://api.mantisapi.com/v1/extract', {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          url: `${baseUrl}?page=${i + 1}`,
          prompt: 'Extract all product listings',
        }),
      }).then(r => r.json())
    )
  );

  return results.flatMap(r => r.data);
}
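One caveat: `Promise.all` in `scrapeAllPagesAPI` fires every request at once, which can trip rate limits on large page counts. A small concurrency pool caps in-flight requests while preserving result order (a common pattern; `mapWithConcurrency` is my own name):

```javascript
// Run fn over items with at most `limit` promises in flight.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // safe: increments happen synchronously on one thread
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

Swap it in for `Promise.all` and pass a limit of, say, 5 to stay well under most per-IP throttles.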

Rate Limiting

function rateLimit(fn, delayMs) {
  let lastCall = 0;
  return async (...args) => {
    const now = Date.now();
    const wait = Math.max(0, delayMs - (now - lastCall));
    await new Promise(r => setTimeout(r, wait));
    lastCall = Date.now();
    return fn(...args);
  };
}

const scrapePage = rateLimit(async (url) => {
  // your scraping logic
}, 1000); // 1 request per second
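Note that this wrapper assumes sequential callers: if several requests start concurrently, they all read the same `lastCall` and fire together. A queue-based variant (a sketch; `rateLimitQueue` is my own name) chains callers so the spacing holds even under concurrency:

```javascript
// Chain every call onto the previous one, inserting delayMs between them,
// so concurrent callers are spaced out instead of racing.
function rateLimitQueue(fn, delayMs) {
  let tail = Promise.resolve();
  return (...args) => {
    const run = tail.then(() => fn(...args));
    // The next caller waits for this call to settle, then the delay.
    // The catch keeps one failed call from stalling the whole queue.
    tail = run.catch(() => {}).then(
      () => new Promise(r => setTimeout(r, delayMs))
    );
    return run;
  };
}
```

Errors still propagate to each individual caller through the returned promise.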

Error Handling & Retries

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch('https://api.mantisapi.com/v1/scrape', {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ url, render_js: true }),
      });

      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.json();
    } catch (err) {
      if (attempt === maxRetries) throw err;
      await new Promise(r => setTimeout(r, 1000 * 2 ** (attempt - 1))); // exponential backoff: 1s, 2s, 4s
    }
  }
}
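One refinement: when many clients fail at the same moment, identical backoff schedules make them all retry in synchronized waves. Adding random jitter spreads the retries out. A "full jitter" sketch of the delay calculation (`backoffMs` is my own name):

```javascript
// Full-jitter exponential backoff: pick a random delay in
// [0, base * 2^(attempt-1)), capped at maxMs.
function backoffMs(attempt, base = 1000, maxMs = 30000) {
  const ceiling = Math.min(maxMs, base * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}
```

In the retry loop above you'd replace the fixed delay with `backoffMs(attempt)`.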

Which Approach Should You Use?

Choose Cheerio + Axios if:

- You're scraping static HTML pages
- Speed matters more than complexity
- You're comfortable writing CSS selectors

Choose Puppeteer/Playwright if:

- Pages require JavaScript rendering
- You need to interact with the page (click, scroll, type)
- You're scraping a small number of pages and can manage the infrastructure

Choose WebPerception API if:

- You're building a production application
- You don't want to manage browser infrastructure
- You need to handle anti-bot protection automatically
- You want AI-powered data extraction instead of brittle selectors
- You're building an AI agent that needs web perception

Building a Web Scraper for AI Agents

If you're building an AI agent that needs to read the web, WebPerception is the natural choice. Here's how to integrate it as a tool:

// LangChain-style tool definition
const webPerceptionTool = {
  name: 'web_perception',
  description: 'Fetch and extract structured data from any webpage',
  async execute({ url, query }) {
    const response = await fetch('https://api.mantisapi.com/v1/extract', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.MANTIS_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ url, prompt: query }),
    });
    return response.json();
  },
};

Your agent can now perceive any webpage — no browser infrastructure, no broken selectors, no anti-bot headaches.

Conclusion

JavaScript has excellent tools for web scraping, from lightweight HTML parsing with Cheerio to full browser automation with Puppeteer and Playwright. But in 2026, the smartest approach for production applications is to let an API handle the hard parts.

The WebPerception API gives you rendered HTML, AI-powered extraction, and screenshot capabilities — all without managing a single browser instance.

Get your free API key →


Building an AI agent? Read our guide on how to build your first AI agent and learn how WebPerception fits into the agent stack.

Ready to try Mantis?

100 free API calls/month. No credit card required.

Get Your API Key →