🔱 Mantis

How to Scrape Amazon Product Data in 2026

Extract prices, reviews, ratings, and product details using Python, Node.js, and API-based approaches, with production-ready code.

Why Scrape Amazon Product Data?

Amazon is the world's largest online marketplace, with over 350 million products and 300 million active customers. That product data powers some of the most valuable business intelligence in e-commerce.

Whether you're building a price tracker, a product research tool, or an AI shopping agent, scraping Amazon is a foundational data capability.

What Data Can You Extract?

Amazon product pages contain rich structured data across multiple sections:

| Data Point | Location | CSS Selector Hint |
|---|---|---|
| Product Title | Top of page | #productTitle |
| Price | Buy box | .a-price .a-offscreen |
| List Price | Buy box (strikethrough) | .basisPrice .a-offscreen |
| Rating | Below title | #acrPopover |
| Review Count | Below title | #acrCustomerReviewText |
| Images | Left gallery | #imgTagWrapperId img |
| Bullet Features | Feature section | #feature-bullets li |
| ASIN | Product details | th:contains("ASIN")+td |
| BSR (Best Seller Rank) | Product details | #SalesRank |
| Availability | Buy box | #availability |
| Seller | Buy box | #sellerProfileTriggerId |
| Category Breadcrumbs | Top of page | #wayfinding-breadcrumbs_feature_div |
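The ASIN in the table above also appears in the product URL itself, which is often the most reliable place to read it from. A small helper (illustrative; the regex and function name are our own, covering the two most common URL shapes):

```python
import re

# Matches /dp/ASIN and /gp/product/ASIN; ASINs are 10 uppercase
# alphanumeric characters.
ASIN_RE = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})")

def extract_asin(url: str):
    """Return the ASIN embedded in an Amazon product URL, or None."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None
```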

Method 1: Python + BeautifulSoup

The simplest approach for scraping individual Amazon product pages. Works well for small-scale data collection and prototyping.

Install Dependencies

pip install requests beautifulsoup4 lxml

Basic Product Scraper

# amazon_scraper.py
import requests
from bs4 import BeautifulSoup
import json
import time
import random

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/125.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://www.google.com/",
}

def scrape_amazon_product(url: str) -> dict:
    """Scrape product data from an Amazon product page."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    def text(selector):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else None

    return {
        "title": text("#productTitle"),
        "price": text(".a-price .a-offscreen"),
        "list_price": text(".basisPrice .a-offscreen"),
        "rating": text("#acrPopover .a-icon-alt"),
        "review_count": text("#acrCustomerReviewText"),
        "availability": text("#availability span"),
        "features": [
            li.get_text(strip=True)
            for li in soup.select("#feature-bullets li span.a-list-item")
        ],
        "images": [
            img.get("src")
            for img in soup.select("#altImages img")
            if img.get("src") and "sprite" not in img["src"]
        ],
        "asin": url.split("/dp/")[1].split("/")[0] if "/dp/" in url else None,
        "url": url,
    }

# Example usage
product = scrape_amazon_product(
    "https://www.amazon.com/dp/B0CHX3QBCH"
)
print(json.dumps(product, indent=2))
⚠️ Important

Amazon changes their HTML structure frequently. CSS selectors that work today may break tomorrow. Always test your selectors and build in error handling. For production use, consider an API-based approach that maintains selectors for you.
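Part of that error handling is normalizing what you scrape: prices arrive as strings like "$1,299.00", and a broken selector yields None. A defensive parser (a sketch; the function name is our own) keeps one missing field from crashing the pipeline:

```python
import re

def parse_price(raw):
    """Convert a scraped price string like '$1,299.00' to a float.

    Returns None for missing input or input with no digits, so a
    broken selector degrades to a null field instead of an exception.
    """
    if not raw:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None
```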

Scraping Search Results

# search_scraper.py
import requests
from bs4 import BeautifulSoup
import random
import time

from amazon_scraper import HEADERS  # reuse the headers defined earlier

def scrape_amazon_search(keyword: str, pages: int = 3) -> list:
    """Scrape Amazon search results for a keyword."""
    products = []

    for page in range(1, pages + 1):
        url = (
            f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}"
            f"&page={page}"
        )
        resp = requests.get(url, headers=HEADERS, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")

        for item in soup.select('[data-component-type="s-search-result"]'):
            title_el = item.select_one("h2 a span")
            price_el = item.select_one(".a-price .a-offscreen")
            rating_el = item.select_one(".a-icon-alt")
            reviews_el = item.select_one(
                '[aria-label*="stars"] + span'
            )
            link_el = item.select_one("h2 a")

            products.append({
                "title": title_el.text.strip() if title_el else None,
                "price": price_el.text.strip() if price_el else None,
                "rating": rating_el.text.strip() if rating_el else None,
                "reviews": reviews_el.text.strip() if reviews_el else None,
                "url": (
                    "https://www.amazon.com" + link_el["href"]
                    if link_el else None
                ),
                "asin": item.get("data-asin"),
            })

        # Random delay between pages
        time.sleep(random.uniform(3, 8))

    return products

results = scrape_amazon_search("wireless earbuds", pages=2)
print(f"Found {len(results)} products")
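Search results repeat across pages, especially sponsored listings, so it helps to de-duplicate by ASIN before storing them. A sketch using the csv module (the field names match the dicts built above; the function name is our own):

```python
import csv

FIELDS = ["asin", "title", "price", "rating", "reviews", "url"]

def save_results(products, path="amazon_results.csv"):
    """De-duplicate search results by ASIN and write them to CSV.

    Returns the number of unique rows written.
    """
    seen = set()
    unique = []
    for product in products:
        asin = product.get("asin")
        if asin and asin in seen:
            continue
        seen.add(asin)
        unique.append({k: product.get(k) for k in FIELDS})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(unique)
    return len(unique)
```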

Method 2: Playwright (Headless Browser)

Amazon heavily relies on JavaScript for dynamic content: lazy-loaded images, price updates, variant selectors, and review widgets. Playwright renders the full page like a real browser, giving you access to all dynamic content.

Install

pip install playwright
playwright install chromium

Full-Render Amazon Scraper

# playwright_amazon.py
import asyncio
from playwright.async_api import async_playwright
import json

async def scrape_amazon_product(asin: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
        )

        page = await context.new_page()

        # Block unnecessary resources for speed
        await page.route("**/*.{png,jpg,jpeg,gif,svg,ico}", 
                         lambda route: route.abort())
        await page.route("**/ads/**", lambda route: route.abort())

        url = f"https://www.amazon.com/dp/{asin}"
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_timeout(2000)

        product = await page.evaluate("""() => {
            const text = (sel) => {
                const el = document.querySelector(sel);
                return el ? el.textContent.trim() : null;
            };
            return {
                title: text('#productTitle'),
                price: text('.a-price .a-offscreen'),
                list_price: text('.basisPrice .a-offscreen'),
                rating: text('#acrPopover .a-icon-alt'),
                review_count: text('#acrCustomerReviewText'),
                availability: text('#availability span'),
                features: [...document.querySelectorAll(
                    '#feature-bullets li span.a-list-item'
                )].map(el => el.textContent.trim()).filter(Boolean),
                description: text('#productDescription p'),
                seller: text('#sellerProfileTriggerId'),
            };
        }""")

        # Extract all high-res images
        images = await page.evaluate("""() => {
            const imgs = document.querySelectorAll(
                '#altImages .a-button-thumbnail img'
            );
            return [...imgs]
                .map(img => img.src)
                .filter(src => src && !src.includes('sprite'))
                .map(src => src.replace(/\._.*_\./, '.'));
        }""")

        product["images"] = images
        product["asin"] = asin
        product["url"] = url

        await browser.close()
        return product

# Run it
data = asyncio.run(scrape_amazon_product("B0CHX3QBCH"))
print(json.dumps(data, indent=2))
💡 Pro Tip: Stealth Mode

Install playwright-stealth to bypass Amazon's bot detection. It patches common browser fingerprint checks like navigator.webdriver, Chrome plugin arrays, and WebGL rendering differences.

Method 3: Node.js + Cheerio

Lightweight and fast, ideal for scraping Amazon at moderate scale from a Node.js backend or serverless function.

Install

npm install cheerio node-fetch

Product Scraper

// amazon-scraper.mjs
import fetch from "node-fetch";
import * as cheerio from "cheerio";

const HEADERS = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
    "AppleWebKit/537.36 Chrome/125.0.0.0 Safari/537.36",
  "Accept-Language": "en-US,en;q=0.9",
  Accept: "text/html",
};

async function scrapeProduct(asin) {
  const url = `https://www.amazon.com/dp/${asin}`;
  const resp = await fetch(url, { headers: HEADERS });
  const html = await resp.text();
  const $ = cheerio.load(html);

  const text = (sel) => $(sel).first().text().trim() || null;

  return {
    title: text("#productTitle"),
    price: text(".a-price .a-offscreen"),
    listPrice: text(".basisPrice .a-offscreen"),
    rating: text("#acrPopover .a-icon-alt"),
    reviewCount: text("#acrCustomerReviewText"),
    availability: text("#availability span"),
    features: $("#feature-bullets li span.a-list-item")
      .map((_, el) => $(el).text().trim())
      .get()
      .filter(Boolean),
    asin,
    url,
  };
}

// Batch scrape with rate limiting
async function scrapeMultiple(asins, delayMs = 5000) {
  const results = [];
  for (const asin of asins) {
    try {
      const product = await scrapeProduct(asin);
      results.push(product);
      console.log(`✓ ${product.title?.slice(0, 50)}`);
    } catch (err) {
      console.error(`✗ ${asin}: ${err.message}`);
    }
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return results;
}

// Usage
const products = await scrapeMultiple([
  "B0CHX3QBCH",
  "B0BSHF7WHW",
  "B09V3KXJPB",
]);
console.log(JSON.stringify(products, null, 2));

Method 4: Web Scraping API (Easiest)

The most reliable approach for production. A web scraping API handles proxies, CAPTCHAs, browser rendering, and selector maintenance: you just send a URL and get structured data back.

Using the Mantis API

# One API call: structured Amazon data
import requests

resp = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://www.amazon.com/dp/B0CHX3QBCH",
        "extract": {
            "title": "product title",
            "price": "current price",
            "original_price": "list/original price",
            "rating": "star rating",
            "review_count": "number of reviews",
            "features": "bullet point features (array)",
            "availability": "in stock status",
            "seller": "seller name",
            "images": "product image URLs (array)",
            "description": "product description",
        },
        "render_js": True,
    },
)

product = resp.json()
print(product)

Skip the Proxy Headaches

Mantis handles Amazon's anti-bot detection, proxy rotation, CAPTCHA solving, and JavaScript rendering, so you don't have to.


Node.js with Mantis

// mantis-amazon.mjs
const resp = await fetch("https://api.mantisapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://www.amazon.com/dp/B0CHX3QBCH",
    extract: {
      title: "product title",
      price: "current price",
      rating: "star rating out of 5",
      review_count: "total number of reviews",
      features: "key product features (array)",
    },
    render_js: true,
  }),
});

const product = await resp.json();
console.log(product);

Beating Amazon's Anti-Bot Detection

Amazon has some of the most aggressive anti-scraping measures on the web. Here's what you're up against and how to handle it:

Amazon's Defense Layers

| Defense | What It Does | Countermeasure |
|---|---|---|
| IP Rate Limiting | Blocks IPs making too many requests | Rotating residential proxies |
| CAPTCHA Challenges | Serves CAPTCHA on suspicious requests | CAPTCHA solving services or API |
| Browser Fingerprinting | Detects headless browsers via JS | Stealth plugins, real browser profiles |
| Behavioral Analysis | Detects non-human browsing patterns | Random delays, scroll simulation |
| Session Tracking | Correlates requests across sessions | Fresh sessions, cookie rotation |
| Dynamic Selectors | Changes CSS class names periodically | Semantic selectors, AI extraction |

Essential Anti-Detection Techniques

# anti_detection.py
import random
import time

import requests

PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) "
    "Gecko/20100101 Firefox/126.0",
]

def get_session():
    """Create a requests session with random proxy and UA."""
    session = requests.Session()
    session.proxies = {
        "http": random.choice(PROXY_POOL),
        "https": random.choice(PROXY_POOL),
    }
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml",
    })
    return session

def polite_delay():
    """Random delay to mimic human browsing."""
    time.sleep(random.uniform(4, 12))

def handle_captcha(response):
    """Detect and handle Amazon CAPTCHAs."""
    if "captcha" in response.text.lower() or response.status_code == 503:
        print("⚠️ CAPTCHA detected, rotating proxy")
        return True
    return False
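When handle_captcha() fires, rotating the proxy alone is rarely enough; repeated failures should also slow the crawler down. Exponential backoff with jitter is the usual pattern (a sketch; the base and cap values are arbitrary defaults, not Amazon-specific thresholds):

```python
import random

def backoff_delay(attempt, base=5.0, cap=120.0):
    """Seconds to wait before retry number `attempt` (0-indexed).

    Doubles the window each attempt and picks a random point inside
    it, capped so a long outage never produces absurd sleep times.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Used as `time.sleep(backoff_delay(attempt))` inside a retry loop that also calls get_session() to pick up a fresh proxy and User-Agent.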

Amazon PA-API vs Scraping

Amazon offers the Product Advertising API (PA-API 5.0) as an official data source. Here's how it compares:

| Feature | PA-API 5.0 | Web Scraping | Mantis API |
|---|---|---|---|
| Setup Difficulty | Medium (Associates account required) | High (proxies, CAPTCHAs, selectors) | Low (API key) |
| Rate Limit | 1 req/sec (scales with sales) | Depends on proxy pool | Based on plan (up to 100K/mo) |
| Data Coverage | Basic product info, prices, images | Everything visible on the page | Everything visible on the page |
| Reviews | Rating + count only | Full review text + individual ratings | Full review text + individual ratings |
| Q&A Content | Not available | Full Q&A text | Full Q&A text |
| Seller Details | Limited | Full seller info | Full seller info |
| BSR History | Current rank only | Current rank (track over time) | Current rank (track over time) |
| Reliability | Very high (official) | Breaks when Amazon changes HTML | High (maintained selectors) |
| Cost | Free (requires qualifying sales) | Proxy costs ($50-500+/mo) | $0-299/mo |
| Legal Risk | None (authorized) | ToS violation risk | API handles compliance |
💡 When to Use Each

PA-API: You're an Amazon affiliate and need basic product data (prices, images, ratings). Scraping/Mantis API: You need full review text, Q&A, seller data, BSR tracking, or data not available through PA-API.

Method Comparison

| Criteria | Python + BS4 | Playwright | Node.js + Cheerio | Mantis API |
|---|---|---|---|---|
| Setup Time | 5 min | 10 min | 5 min | 2 min |
| JS Rendering | ❌ | ✅ | ❌ | ✅ |
| Anti-Detection | Basic | Good (with stealth) | Basic | Built-in |
| Speed | Fast | Slow (browser overhead) | Fast | Medium |
| Maintenance | High (selectors break) | High (selectors break) | High (selectors break) | None |
| Scale | Low-Medium | Low | Medium | High |
| Cost (10K pages/mo) | $50-200 (proxies) | $100-300 (proxies + compute) | $50-200 (proxies) | $99 (Pro plan) |
| Best For | Prototyping | Dynamic content | Serverless / APIs | Production |

Real-World Use Cases

1. Price Tracker Bot

Monitor product prices and alert when they drop below a threshold, perfect for deal sites, purchasing agents, or personal shopping bots.

# price_tracker.py
import requests
import json
from datetime import datetime

MANTIS_KEY = "YOUR_API_KEY"
WATCHLIST = [
    {"asin": "B0CHX3QBCH", "target_price": 249.99},
    {"asin": "B0BSHF7WHW", "target_price": 89.99},
    {"asin": "B09V3KXJPB", "target_price": 349.00},
]

def check_prices():
    alerts = []
    for item in WATCHLIST:
        resp = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {MANTIS_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": f"https://www.amazon.com/dp/{item['asin']}",
                "extract": {
                    "title": "product title",
                    "price": "current price as number",
                },
                "render_js": True,
            },
        )
        data = resp.json()
        price_str = data.get("price") or "0"
        price = float(price_str.replace("$", "").replace(",", "") or 0)

        # Log price history
        with open(f"prices_{item['asin']}.jsonl", "a") as f:
            f.write(json.dumps({
                "asin": item["asin"],
                "price": price,
                "timestamp": datetime.utcnow().isoformat(),
            }) + "\n")

        if price <= item["target_price"] and price > 0:
            alerts.append({
                "title": data.get("title", item["asin"]),
                "price": price,
                "target": item["target_price"],
                "url": f"https://www.amazon.com/dp/{item['asin']}",
            })

    return alerts

# Run on a schedule (cron, Lambda, etc.)
alerts = check_prices()
for alert in alerts:
    print(f"🚨 PRICE DROP: {alert['title']}")
    print(f"   ${alert['price']} (target: ${alert['target']})")
    print(f"   {alert['url']}\n")
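The JSONL log written above becomes useful once you summarize it, for example to show "lowest price on record" alongside an alert. A small reader for that log format (a sketch; the function name is our own):

```python
import json

def summarize_history(path):
    """Summarize a prices_<ASIN>.jsonl log: min, max, latest, count."""
    prices = []
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            prices.append(entry["price"])
    if not prices:
        return None
    return {
        "min": min(prices),
        "max": max(prices),
        "latest": prices[-1],
        "samples": len(prices),
    }
```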

2. Product Comparison Engine

Build a comparison tool that aggregates data across multiple products, useful for review sites, affiliate content, or internal procurement tools.

# comparison_engine.py
import requests
import json

MANTIS_KEY = "YOUR_API_KEY"

def compare_products(asins: list) -> dict:
    """Compare multiple Amazon products side by side."""
    products = []
    for asin in asins:
        resp = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {MANTIS_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": f"https://www.amazon.com/dp/{asin}",
                "extract": {
                    "title": "product title",
                    "price": "current price",
                    "rating": "star rating as number",
                    "review_count": "number of reviews as integer",
                    "features": "top 5 key features (array)",
                    "availability": "stock status",
                },
                "render_js": True,
            },
        )
        data = resp.json()
        data["asin"] = asin
        products.append(data)

    # Rank by value (rating * reviews / price)
    for p in products:
        try:
            price = float(
                p.get("price", "0").replace("$", "").replace(",", "")
            )
            rating = float(p.get("rating", "0"))
            reviews = int(
                p.get("review_count", "0").replace(",", "")
            )
            p["value_score"] = round(
                (rating * reviews) / max(price, 1), 2
            )
        except (ValueError, TypeError, AttributeError):
            p["value_score"] = 0

    products.sort(key=lambda x: x["value_score"], reverse=True)
    return {"comparison": products, "winner": products[0]["asin"]}

result = compare_products([
    "B0CHX3QBCH", "B0BSHF7WHW", "B09V3KXJPB"
])
print(json.dumps(result, indent=2))

3. AI Agent Shopping Assistant

Give an AI agent the ability to search Amazon and recommend products, a core building block for e-commerce AI assistants.

# agent_shopping.py - LangChain tool for Amazon search
from langchain.tools import tool
import requests

MANTIS_KEY = "YOUR_API_KEY"

@tool
def search_amazon(query: str) -> str:
    """Search Amazon for products and return top results
    with prices, ratings, and links."""
    resp = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {MANTIS_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": (
                f"https://www.amazon.com/s?k="
                f"{query.replace(' ', '+')}"
            ),
            "extract": {
                "products": (
                    "array of top 5 products with: "
                    "title, price, rating, review_count, url, asin"
                ),
            },
            "render_js": True,
        },
    )
    data = resp.json()
    products = data.get("products", [])

    if not products:
        return f"No products found for '{query}'"

    result = f"Top Amazon results for '{query}':\n\n"
    for i, p in enumerate(products, 1):
        result += (
            f"{i}. {p.get('title', 'N/A')}\n"
            f"   Price: {p.get('price', 'N/A')} | "
            f"Rating: {p.get('rating', 'N/A')} "
            f"({p.get('review_count', '0')} reviews)\n"
            f"   https://www.amazon.com/dp/"
            f"{p.get('asin', '')}\n\n"
        )
    return result

# Use in a LangChain agent
# agent = create_agent(tools=[search_amazon], ...)

Is It Legal to Scrape Amazon?

Amazon scraping exists in a legal gray area; the key precedents (hiQ v. LinkedIn, Van Buren v. United States) are covered in the FAQ below.

⚖️ Best Practices for Legal Safety

Only scrape publicly available data. Respect rate limits. Don't circumvent explicit access controls. Don't scrape personal data. Consult legal counsel for commercial use cases. Consider using Amazon's official PA-API for basic product data, and a web scraping API for data PA-API doesn't cover.

Production-Ready Amazon Scraping

Stop fighting proxies, CAPTCHAs, and broken selectors. Mantis extracts structured Amazon data with a single API call.


Frequently Asked Questions

Is it legal to scrape Amazon product data?

Scraping publicly available Amazon product pages is in a legal gray area. The Van Buren v. United States (2021) decision narrowed the CFAA, and hiQ v. LinkedIn affirmed that scraping public data isn't a federal crime. However, Amazon's ToS prohibit automated access. For commercial use, consider an API-based approach.

How do I scrape Amazon without getting blocked?

Use rotating residential proxies, randomize delays (3-15 seconds), rotate User-Agent strings, handle CAPTCHAs, and use headless browsers with stealth plugins. Or use a web scraping API like Mantis that handles all anti-blocking automatically.

What Python library is best for scraping Amazon?

For prototyping: requests + BeautifulSoup. For JS-rendered content: Playwright with stealth plugins. For production: a web scraping API that maintains selectors and handles anti-detection.

Can I use Amazon's official API instead of scraping?

Yes: the PA-API 5.0 provides basic product data (prices, images, ratings) but requires an Associates account with qualifying sales and doesn't include full reviews, Q&A, or seller details.

What data can I extract from Amazon product pages?

Product title, price, rating, review count, images, bullet features, description, ASIN, BSR, category, seller info, availability, Q&A content, and individual reviews with ratings.

How many Amazon pages can I scrape per day?

Without proxies: roughly 20-50 pages before you're blocked. With rotating residential proxies: 5,000-10,000. With the Mantis API: up to 100,000/month on the Scale plan.
