Table of Contents
- Why Scrape Amazon Product Data?
- What Data Can You Extract?
- Method 1: Python + BeautifulSoup
- Method 2: Playwright (Headless Browser)
- Method 3: Node.js + Cheerio
- Method 4: Web Scraping API (Easiest)
- Beating Amazon's Anti-Bot Detection
- Amazon PA-API vs Scraping
- Method Comparison
- Real-World Use Cases
- Legal Considerations
- FAQ
Why Scrape Amazon Product Data?
Amazon is the world's largest online marketplace, with over 350 million products and 300 million active customers. That product data powers some of the most valuable business intelligence in e-commerce:
- Price monitoring – Track competitor prices in real time and adjust your pricing strategy automatically
- Market research – Discover trending products, market gaps, and demand signals before competitors
- Review analysis – Aggregate customer sentiment across thousands of reviews to inform product development
- Competitive intelligence – Monitor competitor listings, BSR rankings, and new product launches
- Dropshipping & arbitrage – Find price discrepancies between Amazon and other marketplaces
- AI agent shopping tools – Give AI assistants the ability to search, compare, and recommend products
- Investment research – Track product trends and brand performance as market indicators
Whether you're building a price tracker, a product research tool, or an AI shopping agent, scraping Amazon is a foundational data capability.
What Data Can You Extract?
Amazon product pages contain rich structured data across multiple sections:
| Data Point | Location | CSS Selector Hint |
|---|---|---|
| Product Title | Top of page | #productTitle |
| Price | Buy box | .a-price .a-offscreen |
| List Price | Buy box (strikethrough) | .basisPrice .a-offscreen |
| Rating | Below title | #acrPopover |
| Review Count | Below title | #acrCustomerReviewText |
| Images | Left gallery | #imgTagWrapperId img |
| Bullet Features | Feature section | #feature-bullets li |
| ASIN | Product details | th:contains("ASIN")+td |
| BSR (Best Seller Rank) | Product details | #SalesRank |
| Availability | Buy box | #availability |
| Seller | Buy box | #sellerProfileTriggerId |
| Category Breadcrumbs | Top of page | #wayfinding-breadcrumbs_feature_div |
Method 1: Python + BeautifulSoup
The simplest approach for scraping individual Amazon product pages. Works well for small-scale data collection and prototyping.
Install Dependencies
pip install requests beautifulsoup4 lxml
Basic Product Scraper
# amazon_scraper.py
import requests
from bs4 import BeautifulSoup
import json
import time    # used by the search scraper below
import random  # used by the search scraper below

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://www.google.com/",
}

def scrape_amazon_product(url: str) -> dict:
    """Scrape product data from an Amazon product page."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    def text(selector):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else None

    # Pull the ASIN from the URL, dropping any trailing path or query string
    asin = None
    if "/dp/" in url:
        asin = url.split("/dp/")[1].split("/")[0].split("?")[0]

    return {
        "title": text("#productTitle"),
        "price": text(".a-price .a-offscreen"),
        "list_price": text(".basisPrice .a-offscreen"),
        "rating": text("#acrPopover .a-icon-alt"),
        "review_count": text("#acrCustomerReviewText"),
        "availability": text("#availability span"),
        "features": [
            li.get_text(strip=True)
            for li in soup.select("#feature-bullets li span.a-list-item")
        ],
        "images": [
            img.get("src")
            for img in soup.select("#altImages img")
            if img.get("src") and "sprite" not in img["src"]
        ],
        "asin": asin,
        "url": url,
    }

# Example usage
product = scrape_amazon_product(
    "https://www.amazon.com/dp/B0CHX3QBCH"
)
print(json.dumps(product, indent=2))
Amazon changes their HTML structure frequently. CSS selectors that work today may break tomorrow. Always test your selectors and build in error handling. For production use, consider an API-based approach that maintains selectors for you.
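One way to soften selector rot is to validate that required fields actually came back and retry before trusting a result: an empty title or price usually means a changed selector, a CAPTCHA interstitial, or a partially rendered page. A minimal sketch (function names are ours; pass the scraper above as `scrape_fn`):

```python
import time

REQUIRED_FIELDS = ("title", "price")

def missing_fields(data: dict, required=REQUIRED_FIELDS) -> list:
    """Names of required fields that came back empty or absent."""
    return [f for f in required if not data.get(f)]

def scrape_with_retries(scrape_fn, url, attempts=3, backoff=5.0):
    """Call scrape_fn(url), retrying when required fields are missing."""
    missing = list(REQUIRED_FIELDS)
    for attempt in range(1, attempts + 1):
        data = scrape_fn(url)
        missing = missing_fields(data)
        if not missing:
            return data
        time.sleep(backoff * attempt)  # linear backoff between attempts
    raise RuntimeError(
        f"fields still missing after {attempts} attempts: {missing}"
    )
```

A failed validation is also a useful monitoring signal: log it, and you'll know the day Amazon changes its markup instead of silently collecting nulls.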
Scraping Search Results
# search_scraper.py
# Extends amazon_scraper.py above: reuses requests, BeautifulSoup,
# HEADERS, time, and random from that file.

def scrape_amazon_search(keyword: str, pages: int = 3) -> list:
    """Scrape Amazon search results for a keyword."""
    products = []
    for page in range(1, pages + 1):
        url = (
            f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}"
            f"&page={page}"
        )
        resp = requests.get(url, headers=HEADERS, timeout=15)
        soup = BeautifulSoup(resp.text, "lxml")
        for item in soup.select('[data-component-type="s-search-result"]'):
            title_el = item.select_one("h2 a span")
            price_el = item.select_one(".a-price .a-offscreen")
            rating_el = item.select_one(".a-icon-alt")
            reviews_el = item.select_one(
                '[aria-label*="stars"] + span'
            )
            link_el = item.select_one("h2 a")
            products.append({
                "title": title_el.text.strip() if title_el else None,
                "price": price_el.text.strip() if price_el else None,
                "rating": rating_el.text.strip() if rating_el else None,
                "reviews": reviews_el.text.strip() if reviews_el else None,
                "url": (
                    "https://www.amazon.com" + link_el["href"]
                    if link_el else None
                ),
                "asin": item.get("data-asin"),
            })
        # Random delay between pages
        time.sleep(random.uniform(3, 8))
    return products

results = scrape_amazon_search("wireless earbuds", pages=2)
print(f"Found {len(results)} products")
Method 2: Playwright (Headless Browser)
Amazon relies heavily on JavaScript for dynamic content: lazy-loaded images, price updates, variant selectors, and review widgets. Playwright renders the full page like a real browser, giving you access to all dynamic content.
Install
pip install playwright
playwright install chromium
Full-Render Amazon Scraper
# playwright_amazon.py
import asyncio
import json
from playwright.async_api import async_playwright

async def scrape_amazon_product(asin: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
        )
        page = await context.new_page()

        # Block unnecessary resources for speed
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,ico}",
            lambda route: route.abort(),
        )
        await page.route("**/ads/**", lambda route: route.abort())

        url = f"https://www.amazon.com/dp/{asin}"
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_timeout(2000)

        product = await page.evaluate("""() => {
            const text = (sel) => {
                const el = document.querySelector(sel);
                return el ? el.textContent.trim() : null;
            };
            return {
                title: text('#productTitle'),
                price: text('.a-price .a-offscreen'),
                list_price: text('.basisPrice .a-offscreen'),
                rating: text('#acrPopover .a-icon-alt'),
                review_count: text('#acrCustomerReviewText'),
                availability: text('#availability span'),
                features: [...document.querySelectorAll(
                    '#feature-bullets li span.a-list-item'
                )].map(el => el.textContent.trim()).filter(Boolean),
                description: text('#productDescription p'),
                seller: text('#sellerProfileTriggerId'),
            };
        }""")

        # Extract all high-res images
        images = await page.evaluate("""() => {
            const imgs = document.querySelectorAll(
                '#altImages .a-button-thumbnail img'
            );
            return [...imgs]
                .map(img => img.src)
                .filter(src => src && !src.includes('sprite'))
                .map(src => src.replace(/\._.*_\./, '.'));
        }""")
        product["images"] = images
        product["asin"] = asin
        product["url"] = url

        await browser.close()
        return product

# Run it
data = asyncio.run(scrape_amazon_product("B0CHX3QBCH"))
print(json.dumps(data, indent=2))
Installing playwright-stealth can help you get past Amazon's bot detection. It patches common browser fingerprint checks such as navigator.webdriver, Chrome plugin arrays, and WebGL rendering differences, though no stealth plugin is foolproof against Amazon's full detection stack.
Method 3: Node.js + Cheerio
Lightweight and fast: ideal for scraping Amazon at moderate scale from a Node.js backend or serverless function.
Install
npm install cheerio node-fetch
Product Scraper
// amazon-scraper.mjs
import fetch from "node-fetch";
import * as cheerio from "cheerio";

const HEADERS = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
    "AppleWebKit/537.36 Chrome/125.0.0.0 Safari/537.36",
  "Accept-Language": "en-US,en;q=0.9",
  Accept: "text/html",
};

async function scrapeProduct(asin) {
  const url = `https://www.amazon.com/dp/${asin}`;
  const resp = await fetch(url, { headers: HEADERS });
  const html = await resp.text();
  const $ = cheerio.load(html);
  const text = (sel) => $(sel).first().text().trim() || null;

  return {
    title: text("#productTitle"),
    price: text(".a-price .a-offscreen"),
    listPrice: text(".basisPrice .a-offscreen"),
    rating: text("#acrPopover .a-icon-alt"),
    reviewCount: text("#acrCustomerReviewText"),
    availability: text("#availability span"),
    features: $("#feature-bullets li span.a-list-item")
      .map((_, el) => $(el).text().trim())
      .get()
      .filter(Boolean),
    asin,
    url,
  };
}

// Batch scrape with rate limiting
async function scrapeMultiple(asins, delayMs = 5000) {
  const results = [];
  for (const asin of asins) {
    try {
      const product = await scrapeProduct(asin);
      results.push(product);
      console.log(`✓ ${product.title?.slice(0, 50)}`);
    } catch (err) {
      console.error(`✗ ${asin}: ${err.message}`);
    }
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return results;
}

// Usage
const products = await scrapeMultiple([
  "B0CHX3QBCH",
  "B0BSHF7WHW",
  "B09V3KXJPB",
]);
console.log(JSON.stringify(products, null, 2));
Method 4: Web Scraping API (Easiest)
The most reliable approach for production. A web scraping API handles proxies, CAPTCHAs, browser rendering, and selector maintenance; you just send a URL and get structured data back.
Using the Mantis API
# One API call – structured Amazon data
import requests

resp = requests.post(
    "https://api.mantisapi.com/v1/scrape",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://www.amazon.com/dp/B0CHX3QBCH",
        "extract": {
            "title": "product title",
            "price": "current price",
            "original_price": "list/original price",
            "rating": "star rating",
            "review_count": "number of reviews",
            "features": "bullet point features (array)",
            "availability": "in stock status",
            "seller": "seller name",
            "images": "product image URLs (array)",
            "description": "product description",
        },
        "render_js": True,
    },
)
product = resp.json()
print(product)
Skip the Proxy Headaches
Mantis handles Amazon's anti-bot detection, proxy rotation, CAPTCHA solving, and JavaScript rendering, so you don't have to.
Node.js with Mantis
// mantis-amazon.mjs
const resp = await fetch("https://api.mantisapi.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://www.amazon.com/dp/B0CHX3QBCH",
    extract: {
      title: "product title",
      price: "current price",
      rating: "star rating out of 5",
      review_count: "total number of reviews",
      features: "key product features (array)",
    },
    render_js: true,
  }),
});

const product = await resp.json();
console.log(product);
Beating Amazon's Anti-Bot Detection
Amazon has some of the most aggressive anti-scraping measures on the web. Here's what you're up against and how to handle it:
Amazon's Defense Layers
| Defense | What It Does | Countermeasure |
|---|---|---|
| IP Rate Limiting | Blocks IPs making too many requests | Rotating residential proxies |
| CAPTCHA Challenges | Serves CAPTCHA on suspicious requests | CAPTCHA solving services or API |
| Browser Fingerprinting | Detects headless browsers via JS | Stealth plugins, real browser profiles |
| Behavioral Analysis | Detects non-human browsing patterns | Random delays, scroll simulation |
| Session Tracking | Correlates requests across sessions | Fresh sessions, cookie rotation |
| Dynamic Selectors | Changes CSS class names periodically | Semantic selectors, AI extraction |
Essential Anti-Detection Techniques
# anti_detection.py
import random
import time

import requests

PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) "
    "Gecko/20100101 Firefox/126.0",
]

def get_session():
    """Create a requests session with random proxy and UA."""
    session = requests.Session()
    session.proxies = {
        "http": random.choice(PROXY_POOL),
        "https": random.choice(PROXY_POOL),
    }
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml",
    })
    return session

def polite_delay():
    """Random delay to mimic human browsing."""
    time.sleep(random.uniform(4, 12))

def handle_captcha(response):
    """Detect Amazon CAPTCHA/block pages."""
    if "captcha" in response.text.lower() or response.status_code == 503:
        print("⚠️ CAPTCHA detected – rotating proxy")
        return True
    return False
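These helpers compose into a simple fetch loop: make a request, check for a block, and if blocked, back off and come back with a fresh identity. A sketch with the request and block check injected as callables, so it slots in over `get_session().get` and `handle_captcha` above and stays easy to test (the function name is ours):

```python
import random
import time

def fetch_with_rotation(fetch, url, is_blocked, max_attempts=4,
                        base_delay=2.0, jitter=2.0):
    """Fetch url, rotating identity when a block is detected.

    fetch(url) performs one request with a fresh session/proxy
    (e.g. lambda u: get_session().get(u, timeout=15));
    is_blocked(resp) returns True on a CAPTCHA or block page
    (e.g. handle_captcha).
    """
    for attempt in range(1, max_attempts + 1):
        resp = fetch(url)
        if not is_blocked(resp):
            return resp
        # Blocked: exponential backoff with jitter, then retry.
        # fetch() builds a new session each call, so the retry
        # arrives from a different proxy/UA combination.
        time.sleep(min(60, base_delay ** attempt) + random.uniform(0, jitter))
    raise RuntimeError(f"still blocked after {max_attempts} attempts: {url}")
```

Capping the backoff at 60 seconds keeps a long run from stalling indefinitely while still giving Amazon's rate limiter time to cool off.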
Amazon PA-API vs Scraping
Amazon offers the Product Advertising API (PA-API 5.0) as an official data source. Here's how it compares:
| Feature | PA-API 5.0 | Web Scraping | Mantis API |
|---|---|---|---|
| Setup Difficulty | Medium (Associates account required) | High (proxies, CAPTCHAs, selectors) | Low (API key) |
| Rate Limit | 1 req/sec (scales with sales) | Depends on proxy pool | Based on plan (up to 100K/mo) |
| Data Coverage | Basic product info, prices, images | Everything visible on the page | Everything visible on the page |
| Reviews | Rating + count only | Full review text + individual ratings | Full review text + individual ratings |
| Q&A Content | Not available | Full Q&A text | Full Q&A text |
| Seller Details | Limited | Full seller info | Full seller info |
| BSR History | Current rank only | Current rank (track over time) | Current rank (track over time) |
| Reliability | Very high (official) | Breaks when Amazon changes HTML | High (maintained selectors) |
| Cost | Free (requires qualifying sales) | Proxy costs ($50-500+/mo) | $0-299/mo |
| Legal Risk | None (authorized) | ToS violation risk | API handles compliance |
Use PA-API when you're an Amazon affiliate and need basic product data (prices, images, ratings). Use scraping or the Mantis API when you need full review text, Q&A, seller data, BSR tracking, or anything else PA-API doesn't expose.
Method Comparison
| Criteria | Python + BS4 | Playwright | Node.js + Cheerio | Mantis API |
|---|---|---|---|---|
| Setup Time | 5 min | 10 min | 5 min | 2 min |
| JS Rendering | ❌ | ✅ | ❌ | ✅ |
| Anti-Detection | Basic | Good (with stealth) | Basic | Built-in |
| Speed | Fast | Slow (browser overhead) | Fast | Medium |
| Maintenance | High (selectors break) | High (selectors break) | High (selectors break) | None |
| Scale | Low-Medium | Low | Medium | High |
| Cost (10K pages/mo) | $50-200 (proxies) | $100-300 (proxies + compute) | $50-200 (proxies) | $99 (Pro plan) |
| Best For | Prototyping | Dynamic content | Serverless / APIs | Production |
Real-World Use Cases
1. Price Tracker Bot
Monitor product prices and alert when they drop below a threshold: perfect for deal sites, purchasing agents, or personal shopping bots.
# price_tracker.py
import requests
import json
from datetime import datetime

MANTIS_KEY = "YOUR_API_KEY"

WATCHLIST = [
    {"asin": "B0CHX3QBCH", "target_price": 249.99},
    {"asin": "B0BSHF7WHW", "target_price": 89.99},
    {"asin": "B09V3KXJPB", "target_price": 349.00},
]

def check_prices():
    alerts = []
    for item in WATCHLIST:
        resp = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {MANTIS_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": f"https://www.amazon.com/dp/{item['asin']}",
                "extract": {
                    "title": "product title",
                    "price": "current price as number",
                },
                "render_js": True,
            },
        )
        data = resp.json()
        price = float(
            str(data.get("price", "0")).replace("$", "").replace(",", "")
        )

        # Log price history
        with open(f"prices_{item['asin']}.jsonl", "a") as f:
            f.write(json.dumps({
                "asin": item["asin"],
                "price": price,
                "timestamp": datetime.utcnow().isoformat(),
            }) + "\n")

        if price <= item["target_price"] and price > 0:
            alerts.append({
                "title": data.get("title", item["asin"]),
                "price": price,
                "target": item["target_price"],
                "url": f"https://www.amazon.com/dp/{item['asin']}",
            })
    return alerts

# Run on a schedule (cron, Lambda, etc.)
alerts = check_prices()
for alert in alerts:
    print(f"🚨 PRICE DROP: {alert['title']}")
    print(f"   ${alert['price']} (target: ${alert['target']})")
    print(f"   {alert['url']}\n")
2. Product Comparison Engine
Build a comparison tool that aggregates data across multiple products: useful for review sites, affiliate content, or internal procurement tools.
# comparison_engine.py
import requests
import json

MANTIS_KEY = "YOUR_API_KEY"

def compare_products(asins: list) -> dict:
    """Compare multiple Amazon products side by side."""
    products = []
    for asin in asins:
        resp = requests.post(
            "https://api.mantisapi.com/v1/scrape",
            headers={
                "Authorization": f"Bearer {MANTIS_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": f"https://www.amazon.com/dp/{asin}",
                "extract": {
                    "title": "product title",
                    "price": "current price",
                    "rating": "star rating as number",
                    "review_count": "number of reviews as integer",
                    "features": "top 5 key features (array)",
                    "availability": "stock status",
                },
                "render_js": True,
            },
        )
        data = resp.json()
        data["asin"] = asin
        products.append(data)

    # Rank by value (rating * reviews / price)
    for p in products:
        try:
            price = float(
                str(p.get("price", "0")).replace("$", "").replace(",", "")
            )
            rating = float(p.get("rating", "0"))
            reviews = int(
                str(p.get("review_count", "0")).replace(",", "")
            )
            p["value_score"] = round(
                (rating * reviews) / max(price, 1), 2
            )
        except (ValueError, TypeError):
            p["value_score"] = 0

    products.sort(key=lambda x: x["value_score"], reverse=True)
    return {"comparison": products, "winner": products[0]["asin"]}

result = compare_products([
    "B0CHX3QBCH", "B0BSHF7WHW", "B09V3KXJPB"
])
print(json.dumps(result, indent=2))
3. AI Agent Shopping Assistant
Give an AI agent the ability to search Amazon and recommend products: a core building block for e-commerce AI assistants.
# agent_shopping.py – LangChain tool for Amazon search
from langchain.tools import tool
import requests

MANTIS_KEY = "YOUR_API_KEY"

@tool
def search_amazon(query: str) -> str:
    """Search Amazon for products and return top results
    with prices, ratings, and links."""
    resp = requests.post(
        "https://api.mantisapi.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {MANTIS_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": (
                f"https://www.amazon.com/s?k="
                f"{query.replace(' ', '+')}"
            ),
            "extract": {
                "products": (
                    "array of top 5 products with: "
                    "title, price, rating, review_count, url, asin"
                ),
            },
            "render_js": True,
        },
    )
    data = resp.json()
    products = data.get("products", [])
    if not products:
        return f"No products found for '{query}'"

    result = f"Top Amazon results for '{query}':\n\n"
    for i, p in enumerate(products, 1):
        result += (
            f"{i}. {p.get('title', 'N/A')}\n"
            f"   Price: {p.get('price', 'N/A')} | "
            f"Rating: {p.get('rating', 'N/A')} "
            f"({p.get('review_count', '0')} reviews)\n"
            f"   https://www.amazon.com/dp/"
            f"{p.get('asin', '')}\n\n"
        )
    return result

# Use in a LangChain agent
# agent = create_agent(tools=[search_amazon], ...)
Legal Considerations
Amazon scraping exists in a legal gray area. Key precedents and considerations:
- Van Buren v. United States (2021) – The Supreme Court narrowed the CFAA, ruling that accessing publicly available data (even against ToS) isn't "exceeding authorized access" under federal law
- hiQ Labs v. LinkedIn (2022) – The Ninth Circuit ruled that scraping publicly accessible data doesn't violate the CFAA, strengthening the case for scraping public product listings
- Amazon's Terms of Service – Explicitly prohibit scraping and automated access. Violating ToS is a contract issue, not criminal, but Amazon can pursue civil action
- Robots.txt – Amazon's robots.txt disallows many paths. While not legally binding, respecting it demonstrates good faith
- GDPR/CCPA – Product data is generally not personal data, but review scraping may involve personal information (reviewer names, profiles)
- Rate & Volume – Excessive scraping that degrades service could be considered tortious interference or trespass to chattels
Only scrape publicly available data. Respect rate limits. Don't circumvent explicit access controls. Don't scrape personal data. Consult legal counsel for commercial use cases. Consider using Amazon's official PA-API for basic product data, and a web scraping API for data PA-API doesn't cover.
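"Respect rate limits" can be enforced mechanically rather than left to discipline. A minimal token-bucket limiter sketch (the class name is ours, not from any library):

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per `per` seconds, absorbing short bursts."""

    def __init__(self, rate: int, per: float):
        self.capacity = rate
        self.tokens = float(rate)
        self.per = per
        self.updated = time.monotonic()

    def acquire(self) -> None:
        """Block until a request token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time
            refill = (now - self.updated) * self.capacity / self.per
            self.tokens = min(self.capacity, self.tokens + refill)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue
            time.sleep((1 - self.tokens) * self.per / self.capacity)

# e.g. one request every two seconds on average
limiter = TokenBucket(rate=1, per=2.0)
```

Call `limiter.acquire()` before each request: an initial burst up to `rate` goes through immediately, after which the loop throttles to the sustained rate regardless of how fast the surrounding code runs.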
Production-Ready Amazon Scraping
Stop fighting proxies, CAPTCHAs, and broken selectors. Mantis extracts structured Amazon data with a single API call.
Frequently Asked Questions
Is it legal to scrape Amazon product data?
Scraping publicly available Amazon product pages is in a legal gray area. The Van Buren v. United States (2021) decision narrowed the CFAA, and hiQ v. LinkedIn affirmed that scraping public data isn't a federal crime. However, Amazon's ToS prohibit automated access. For commercial use, consider an API-based approach.
How do I scrape Amazon without getting blocked?
Use rotating residential proxies, randomize delays (3-15 seconds), rotate User-Agent strings, handle CAPTCHAs, and use headless browsers with stealth plugins. Or use a web scraping API like Mantis that handles all anti-blocking automatically.
What Python library is best for scraping Amazon?
For prototyping: requests + BeautifulSoup. For JS-rendered content: Playwright with stealth plugins. For production: a web scraping API that maintains selectors and handles anti-detection.
Can I use Amazon's official API instead of scraping?
Yes โ the PA-API 5.0 provides basic product data (prices, images, ratings) but requires an Associates account with qualifying sales and doesn't include full reviews, Q&A, or seller details.
What data can I extract from Amazon product pages?
Product title, price, rating, review count, images, bullet features, description, ASIN, BSR, category, seller info, availability, Q&A content, and individual reviews with ratings.
How many Amazon pages can I scrape per day?
Without proxies: 20-50 before blocking. With rotating residential proxies: 5,000-10,000. With Mantis API: up to 100,000/month on the Scale plan.