Web Scraping with Selenium and Python in 2026: The Complete Guide
Selenium is one of the most widely used browser automation frameworks in the world. With over 30,000 GitHub stars and millions of active users, it's the tool most developers learn first for web scraping. This guide covers everything you need to scrape the modern web with Selenium and Python in 2026, from basic setup to production-ready anti-detection.
We'll cover setup, headless Chrome, explicit waits, pagination, login handling, proxy rotation, stealth techniques, and when it makes sense to switch to a web scraping API instead.
Table of Contents
- Setting Up Selenium with Python
- Basic Web Scraping
- Headless Chrome Configuration
- Waiting Strategies
- Handling Pagination
- Infinite Scroll Pages
- Login and Authentication
- Anti-Detection and Stealth
- Proxy Rotation
- Screenshots and PDFs
- Production-Ready Scraper
- Selenium vs Playwright vs APIs
- Cost Analysis
- FAQ
1. Setting Up Selenium with Python
Installation
# Install Selenium and webdriver-manager
pip install selenium webdriver-manager
# Or with all common extras
pip install selenium webdriver-manager beautifulsoup4 lxml
As of Selenium 4.6, you no longer need to manually download ChromeDriver: Selenium Manager handles it automatically. But webdriver-manager gives you more control:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
# Selenium 4.x setup
options = Options()
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
# Navigate to a page
driver.get("https://example.com")
print(driver.title)
# Always clean up
driver.quit()
Selenium Manager (Built-in, No Extra Dependencies)
from selenium import webdriver
# Selenium 4.6+ handles driver management automatically
driver = webdriver.Chrome()
driver.get("https://example.com")
print(driver.title)
driver.quit()
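The "always clean up" rule is easy to forget when an exception interrupts a scrape. One way to enforce it is a small context manager that calls quit() no matter how the block exits. The sketch below is illustrative only: `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` stands in for `webdriver.Chrome` so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory, *args, **kwargs):
    """Create a driver via `factory` and always call quit() on exit."""
    driver = factory(*args, **kwargs)
    try:
        yield driver
    finally:
        # Runs on normal exit and when the block raises
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome (no browser needed)."""

    def __init__(self):
        self.closed = False
        self.title = "Example Domain"

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    print(driver.title)
```

With real Selenium, the same helper would be used as `with managed_driver(webdriver.Chrome) as driver:`, assuming `webdriver` is imported as in the snippets above.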
2. Basic Web Scraping
Selenium finds elements using locators. The recommended approach in Selenium 4 is the By class:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com")
# Find elements by CSS selector
quotes = driver.find_elements(By.CSS_SELECTOR, "div.quote")
for quote in quotes:
    text = quote.find_element(By.CSS_SELECTOR, "span.text").text
    author = quote.find_element(By.CSS_SELECTOR, "small.author").text
    tags = [tag.text for tag in quote.find_elements(By.CSS_SELECTOR, "a.tag")]
    print(f"{text}\n - {author} | Tags: {', '.join(tags)}\n")
driver.quit()
Common Locator Strategies
# By ID
element = driver.find_element(By.ID, "search-input")
# By class name
elements = driver.find_elements(By.CLASS_NAME, "product-card")
# By CSS selector (most flexible)
element = driver.find_element(By.CSS_SELECTOR, "div.results > a.link")
# By XPath (for complex traversals)
element = driver.find_element(By.XPATH, "//div[@data-testid='price']/span")
# By link text
element = driver.find_element(By.LINK_TEXT, "Next Page")
# By partial link text
element = driver.find_element(By.PARTIAL_LINK_TEXT, "Next")
3. Headless Chrome Configuration
For scraping, you almost always want headless mode, which gives you no visible browser window, faster execution, and lower resource usage:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# Core headless settings
options.add_argument("--headless=new") # New headless mode (Chrome 109+)
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
# Performance optimizations
options.add_argument("--disable-gpu")
options.add_argument("--disable-extensions")
options.add_argument("--disable-infobars")
options.add_argument("--window-size=1920,1080")
# Reduce memory usage by skipping image loading
# ("--disable-images" is not a documented Chrome switch; the Blink setting below is what takes effect)
options.add_argument("--blink-settings=imagesEnabled=false")
# Set a realistic user agent
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(f"Page title: {driver.title}")
driver.quit()
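When several scrapers share these settings, it can help to build the flag list in one place. The following is a minimal sketch, not part of Selenium: `build_chrome_args` and its parameters are hypothetical names, and it only assembles strings that you would then pass to `options.add_argument`.

```python
def build_chrome_args(headless=True, window_size=(1920, 1080),
                      block_images=False, user_agent=None):
    """Return a list of Chrome command-line flags for a scraping session."""
    args = []
    if headless:
        args.append("--headless=new")
    # Commonly needed inside containers and CI environments
    args += ["--no-sandbox", "--disable-dev-shm-usage"]
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    if block_images:
        args.append("--blink-settings=imagesEnabled=false")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


for flag in build_chrome_args(block_images=True):
    print(flag)
```

Applying the result is a one-liner per flag: `for flag in build_chrome_args(): options.add_argument(flag)`.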
4. Waiting Strategies
Modern websites load content dynamically. Avoid fixed time.sleep() calls whenever you can wait on a concrete condition; Selenium's built-in waits return as soon as the condition is met:
Explicit Waits (Recommended)
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# `options` is the Options object configured in the headless section
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/dynamic-page")
# Wait up to 10 seconds for an element to appear
wait = WebDriverWait(driver, 10)
# Wait for element to be present in DOM
element = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.results"))
)
# Wait for element to be clickable
button = wait.until(
    EC.element_to_be_clickable((By.ID, "load-more"))
)
# Wait for text to appear in element
wait.until(
    EC.text_to_be_present_in_element((By.ID, "status"), "Complete")
)
# Wait for element to disappear (loading spinner)
wait.until(
    EC.invisibility_of_element_located((By.CSS_SELECTOR, ".spinner"))
)
# Custom wait condition
def results_loaded(driver):
    items = driver.find_elements(By.CSS_SELECTOR, ".result-item")
    return len(items) > 0
wait.until(results_loaded)
Implicit Waits (Simpler but Less Control)
# Sets a default wait time for all find_element calls
driver.implicitly_wait(10) # Wait up to 10 seconds
# Now all find_element calls will wait up to 10 seconds
# before throwing NoSuchElementException
element = driver.find_element(By.CSS_SELECTOR, "div.results")
Best practice: Use explicit waits for specific conditions, and avoid mixing implicit and explicit waits (it can cause unpredictable timeout behavior).
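To see why an explicit wait beats a fixed sleep, it helps to look at what such a wait does conceptually: it polls a condition until it is truthy or a deadline passes (WebDriverWait polls every 0.5 seconds by default). The sketch below is a simplified illustration of that idea, not Selenium's actual implementation; `wait_until` and `WaitTimeout` are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout}s")
        sleep(poll)


# Example: a condition that becomes true on the third poll
attempts = {"count": 0}


def results_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(results_ready, timeout=5, poll=0.01)
```

The loop returns as soon as the condition holds, so a page that loads in 300 ms costs about 300 ms rather than a full fixed pause.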
5. Handling Pagination
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def scrape_paginated_site(base_url, max_pages=10):
    # `options` is the Options object configured in the headless section
    driver = webdriver.Chrome(options=options)
    wait = WebDriverWait(driver, 10)
    all_items = []
    try:
        driver.get(base_url)
        for page in range(max_pages):
            # Wait for results to load
            wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
            )
            # Extract data from current page
            cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
            for card in cards:
                item = {
                    "name": card.find_element(By.CSS_SELECTOR, "h3").text,
                    "price": card.find_element(By.CSS_SELECTOR, ".price").text,
                    "url": card.find_element(By.CSS_SELECTOR, "a").get_attribute("href"),
                }
                all_items.append(item)
            print(f"Page {page + 1}: scraped {len(cards)} items")
            # Try to click "Next" button
            try:
                next_btn = driver.find_element(By.CSS_SELECTOR, "a.next-page")
                # get_attribute returns None when the attribute is absent
                if "disabled" in (next_btn.get_attribute("class") or ""):
                    break
                next_btn.click()
                # Wait for the old cards to be replaced by the next page
                wait.until(EC.staleness_of(cards[0]))
            except (NoSuchElementException, TimeoutException):
                break  # No more pages
    finally:
        driver.quit()
    return all_items
items = scrape_paginated_site("https://example.com/products")
print(f"Total items scraped: {len(items)}")
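Paginated listings sometimes repeat an item on consecutive pages, for example when the catalog changes while you crawl. That is an assumption about the target site, not something Selenium reports. A small post-processing step can drop repeats by URL; `dedupe_items` below is a hypothetical helper operating on the list of dicts returned by the scraper above.

```python
def dedupe_items(items, key="url"):
    """Return items with duplicates removed, keeping first occurrences.

    Items that lack `key` are kept, since they cannot be compared.
    """
    seen = set()
    unique = []
    for item in items:
        value = item.get(key)
        if value is not None:
            if value in seen:
                continue  # Already collected on an earlier page
            seen.add(value)
        unique.append(item)
    return unique


scraped = [
    {"name": "Widget", "price": "$10", "url": "https://example.com/p/1"},
    {"name": "Gadget", "price": "$15", "url": "https://example.com/p/2"},
    {"name": "Widget", "price": "$10", "url": "https://example.com/p/1"},
]
print(len(dedupe_items(scraped)))
```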
6. Infinite Scroll Pages
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
def scrape_infinite_scroll(url, max_scrolls=20, scroll_pause=2):
    # `options` is the Options object configured in the headless section
    driver = webdriver.Chrome(options=options)
    items = set()
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        for i in range(max_scrolls):
            # Scroll to bottom
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # A fixed pause is a pragmatic choice here: there is no single element to wait on
            time.sleep(scroll_pause)
            # Collect items
            elements = driver.find_elements(By.CSS_SELECTOR, ".feed-item")
            for el in elements:
                items.add(el.text)
            # Check if we've reached the bottom
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                print(f"Reached bottom after {i + 1} scrolls")
                break
            last_height = new_height
            print(f"Scroll {i + 1}: {len(items)} unique items")
    finally:
        # Quit the browser even if a scroll or lookup fails
        driver.quit()
    return list(items)
results = scrape_infinite_scroll("https://example.com/feed")
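The height comparison above stops at the first check where the page has not grown, which can end the crawl early if a batch loads slowly. One possible refinement is to require several unchanged checks in a row. The sketch below separates that stop logic from the browser so it can be exercised on its own; `scroll_until_stable`, `patience`, and `FakePage` are hypothetical names, and with Selenium the two callables would wrap `driver.execute_script` calls.

```python
def scroll_until_stable(get_height, scroll_once, max_scrolls=20, patience=2):
    """Scroll until the height is unchanged for `patience` checks in a row.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    unchanged = 0
    scrolls = 0
    for _ in range(max_scrolls):
        scroll_once()  # With Selenium: scroll to bottom, then pause
        scrolls += 1
        new_height = get_height()
        if new_height == last_height:
            unchanged += 1
            if unchanged >= patience:
                break
        else:
            unchanged = 0
            last_height = new_height
    return scrolls


class FakePage:
    """Hypothetical page whose height grows on each scroll until content runs out."""

    def __init__(self, heights):
        self._heights = heights
        self._index = 0

    def height(self):
        return self._heights[min(self._index, len(self._heights) - 1)]

    def scroll(self):
        self._index += 1


page = FakePage([1000, 2000, 3000])
print(scroll_until_stable(page.height, page.scroll))
```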
7. Login and Authentication
import os
import pickle
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class AuthenticatedScraper:
    def __init__(self):
        # `options` is the Options object configured in the headless section
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, 10)
        self.cookies_file = "cookies.pkl"
    def login(self, username, password):
        """Login and save cookies for session reuse."""
        # Check for saved cookies first
        if os.path.exists(self.cookies_file):
            # Cookies can only be added for the domain that is currently loaded
            self.driver.get("https://example.com")
            with open(self.cookies_file, "rb") as f:
                cookies = pickle.load(f)
            for cookie in cookies:
                self.driver.add_cookie(cookie)
            self.driver.refresh()
            # Verify we're logged in
            try:
                self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, ".user-menu"))
                )
                print("Restored session from cookies")
                return True
            except TimeoutException:
                pass  # Cookies expired, login fresh
        # Fresh login
        self.driver.get("https://example.com/login")
        username_field = self.wait.until(
            EC.presence_of_element_located((By.NAME, "username"))
        )
        username_field.clear()
        username_field.send_keys(username)
        password_field = self.driver.find_element(By.NAME, "password")
        password_field.clear()
        password_field.send_keys(password)
        # Click login button
        login_btn = self.driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
        login_btn.click()
        # Wait for login to complete
        self.wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".user-menu"))
        )
        # Save cookies
        with open(self.cookies_file, "wb") as f:
            pickle.dump(self.driver.get_cookies(), f)
        print("Login successful, cookies saved")
        return True
    def scrape_protected_page(self, url):
        """Scrape a page that requires authentication."""
        self.driver.get(url)
        self.wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".content"))
        )
        return self.driver.find_element(By.CSS_SELECTOR, ".content").text
    def close(self):
        self.driver.quit()
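The class above stores cookies with pickle, which works but can execute arbitrary code if the file is ever tampered with. Because `get_cookies()` returns plain dictionaries, JSON is a reasonable alternative. The sketch below is one possible approach, not the article's method: `save_cookies` and `load_cookies` are hypothetical helpers, and the `expiry` field is assumed to hold a Unix timestamp when present.

```python
import json
import time
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts (as returned by get_cookies()) to JSON."""
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def load_cookies(path, now=None):
    """Load cookies from JSON, dropping any whose `expiry` has passed.

    Returns an empty list when the file does not exist.
    """
    now = time.time() if now is None else now
    file = Path(path)
    if not file.exists():
        return []
    cookies = json.loads(file.read_text(encoding="utf-8"))
    # Session cookies carry no expiry, so they are always kept
    return [c for c in cookies if c.get("expiry") is None or c["expiry"] > now]
```

Loaded cookies can then be passed to `driver.add_cookie` one by one, exactly as in the `login` method, and an empty result signals that a fresh login is needed.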
8. Anti-Detection and Stealth
Default Selenium is trivially detected by anti-bot systems. Here's how to reduce your detection footprint:
Using undetected-chromedriver
# pip install undetected-chromedriver
import undetected_chromedriver as uc
# Automatically patches ChromeDriver to avoid detection
driver = uc.Chrome(headless=True, version_main=122)
driver.get("https://nowsecure.nl") # Anti-bot test site
# Check if we passed
print(driver.title)
driver.quit()
Manual Stealth Configuration
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")
# Key anti-detection flags
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Realistic user agent
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
# Remove navigator.webdriver flag
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5]
        });
        window.chrome = { runtime: {} };
    """
})
driver.get("https://bot.sannysoft.com")
driver.save_screenshot("stealth_test.png")
driver.quit()
Human-Like Behavior
import random
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
def human_like_delay(min_sec=0.5, max_sec=2.0):
    """Random delay to mimic human behavior."""
    time.sleep(random.uniform(min_sec, max_sec))
def human_like_scroll(driver):
    """Scroll like a human: in uneven steps rather than one jump to the bottom."""
    total_height = driver.execute_script("return document.body.scrollHeight")
    viewport = driver.execute_script("return window.innerHeight")
    current = 0
    while current < total_height:
        # max() guards against viewports shorter than 200px
        scroll_amount = random.randint(200, max(200, viewport))
        current += scroll_amount
        driver.execute_script(f"window.scrollTo(0, {current});")
        time.sleep(random.uniform(0.3, 1.2))
def human_like_type(element, text):
    """Type text character by character with random delays."""
    for char in text:
        element.send_keys(char)
        time.sleep(random.uniform(0.05, 0.15))
def random_mouse_movement(driver):
    """Move the mouse to random positions near the middle of the viewport."""
    width = driver.execute_script("return window.innerWidth")
    height = driver.execute_script("return window.innerHeight")
    body = driver.find_element(By.TAG_NAME, "body")
    actions = ActionChains(driver)
    for _ in range(random.randint(2, 5)):
        # Recent Selenium 4 releases measure offsets from the element's center,
        # so keep them within a quarter of the viewport to stay in bounds
        x = random.randint(-width // 4, width // 4)
        y = random.randint(-height // 4, height // 4)
        actions.move_to_element_with_offset(body, x, y)
        actions.pause(random.uniform(0.1, 0.5))
    actions.perform()
9. Proxy Rotation
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# Chrome's --proxy-server flag ignores embedded credentials (user:pass@host),
# so list unauthenticated or IP-allowlisted endpoints here; for authenticated
# proxies, see the Selenium Wire example that follows
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
def create_driver_with_proxy(proxy_url=None):
    """Create a Chrome driver with proxy support."""
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--disable-blink-features=AutomationControlled")
    if proxy_url:
        options.add_argument(f"--proxy-server={proxy_url}")
    return webdriver.Chrome(options=options)
# Pick a proxy per browser session; the --proxy-server flag is fixed once Chrome starts
proxy = random.choice(PROXIES)
driver = create_driver_with_proxy(proxy)
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)
driver.quit()
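Picking a proxy with `random.choice` gives no protection against repeatedly landing on a dead endpoint. A common alternative is round-robin rotation that skips proxies after repeated failures. The sketch below shows one way to do that in plain Python; `ProxyPool`, `max_failures`, and the report methods are hypothetical names, not part of Selenium.

```python
from itertools import cycle


class ProxyPool:
    """Round-robin proxy rotation that skips proxies with too many failures."""

    def __init__(self, proxies, max_failures=3):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures
        self._size = len(proxies)
        self._cycle = cycle(list(proxies))

    def next(self):
        """Return the next healthy proxy, or raise if none are left."""
        for _ in range(self._size):
            proxy = next(self._cycle)
            if self._failures[proxy] < self._max_failures:
                return proxy
        raise RuntimeError("no healthy proxies left")

    def report_failure(self, proxy):
        self._failures[proxy] += 1

    def report_success(self, proxy):
        # A success resets the counter so transient errors are forgiven
        self._failures[proxy] = 0


pool = ProxyPool([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])
print(pool.next())
```

Each value returned by `pool.next()` would be passed to `create_driver_with_proxy`, with `report_failure` called when the page load times out or is blocked.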
Using Selenium Wire for Advanced Proxy Control
# pip install selenium-wire
# Note: the selenium-wire repository has been archived, so verify it works with your Selenium version
from selenium.webdriver.common.by import By
from seleniumwire import webdriver
proxy_options = {
    "proxy": {
        "http": "http://user:pass@proxy.example.com:8080",
        "https": "http://user:pass@proxy.example.com:8080",
    }
}
driver = webdriver.Chrome(seleniumwire_options=proxy_options)
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)
# Selenium Wire also lets you intercept/modify requests
for request in driver.requests:
    if request.response:
        print(f"{request.url} -> {request.response.status_code}")
driver.quit()
10. Screenshots and PDFs
import base64
from selenium.webdriver.common.by import By
# Viewport screenshot (captures only the visible area)
driver.save_screenshot("page.png")
# Element screenshot
element = driver.find_element(By.CSS_SELECTOR, ".chart-container")
element.screenshot("chart.png")
# Full page screenshot (resizes the window to the page; intended for headless mode)
def full_page_screenshot(driver, filename):
    """Capture full page by adjusting window size."""
    total_height = driver.execute_script("return document.body.scrollHeight")
    total_width = driver.execute_script("return document.body.scrollWidth")
    driver.set_window_size(total_width, total_height)
    driver.save_screenshot(filename)
    driver.set_window_size(1920, 1080)  # Reset
# Save as PDF (Chrome headless)
def save_as_pdf(driver, filename):
    """Save page as PDF using Chrome DevTools Protocol."""
    result = driver.execute_cdp_cmd("Page.printToPDF", {
        "printBackground": True,
        "preferCSSPageSize": True,
    })
    # The PDF bytes come back base64-encoded in result["data"]
    with open(filename, "wb") as f:
        f.write(base64.b64decode(result["data"]))
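Since `Page.printToPDF` returns the document as a base64 string, a quick sanity check before writing to disk can catch empty or malformed responses. The helper below is a hypothetical sketch: `decode_pdf_payload` is not a Selenium function, and it only verifies that the decoded bytes begin with the standard `%PDF-` header.

```python
import base64


def decode_pdf_payload(result):
    """Decode the base64 `data` field and confirm it looks like a PDF."""
    data = base64.b64decode(result["data"])
    if not data.startswith(b"%PDF-"):
        raise ValueError("payload does not look like a PDF")
    return data


# Simulated response in the same shape as the CDP result
fake_result = {"data": base64.b64encode(b"%PDF-1.7\n% demo bytes").decode("ascii")}
pdf_bytes = decode_pdf_payload(fake_result)
print(len(pdf_bytes))
```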
11. Production-Ready Scraper
Here's a complete, production-ready scraper with retries, error handling, and data export:
import json
import csv
import time
import random
import logging
from dataclasses import dataclass, asdict
from typing import List, Optional
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import (
    TimeoutException, StaleElementReferenceException,
    NoSuchElementException, WebDriverException
)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Product:
    name: str
    price: str
    url: str
    rating: Optional[str] = None
    reviews: Optional[int] = None
class ProductScraper:
    def __init__(self, headless=True, max_retries=3):
        self.max_retries = max_retries
        self.options = Options()
        if headless:
            self.options.add_argument("--headless=new")
        self.options.add_argument("--no-sandbox")
        self.options.add_argument("--disable-dev-shm-usage")
        self.options.add_argument("--disable-blink-features=AutomationControlled")
        self.options.add_argument("--window-size=1920,1080")
        self.options.add_argument(
            "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
        )
        self.driver = None
        self.wait = None
    def start(self):
        self.driver = webdriver.Chrome(options=self.options)
        self.wait = WebDriverWait(self.driver, 15)
        logger.info("Browser started")
    def stop(self):
        if self.driver:
            self.driver.quit()
            logger.info("Browser stopped")
    def _retry(self, func, *args, **kwargs):
        """Retry a function with exponential backoff."""
        for attempt in range(self.max_retries):
            try:
                return func(*args, **kwargs)
            except (TimeoutException, WebDriverException) as e:
                # Re-raise on the final attempt instead of sleeping first
                if attempt == self.max_retries - 1:
                    raise
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                logger.warning(
                    f"Attempt {attempt + 1}/{self.max_retries} failed: {e}. "
                    f"Retrying in {wait_time:.1f}s"
                )
                time.sleep(wait_time)
    def scrape_page(self, url) -> List[Product]:
        """Scrape all products from a single page."""
        self.driver.get(url)
        self.wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
        )
        products = []
        cards = self.driver.find_elements(By.CSS_SELECTOR, ".product-card")
        for card in cards:
            try:
                product = Product(
                    name=card.find_element(By.CSS_SELECTOR, "h3").text.strip(),
                    price=card.find_element(By.CSS_SELECTOR, ".price").text.strip(),
                    url=card.find_element(By.CSS_SELECTOR, "a").get_attribute("href"),
                )
                # Rating is optional, so a missing element is not an error
                try:
                    product.rating = card.find_element(
                        By.CSS_SELECTOR, ".rating"
                    ).text.strip()
                except NoSuchElementException:
                    pass
                products.append(product)
            except StaleElementReferenceException:
                logger.warning("Stale element, skipping card")
                continue
        return products
    def scrape_all_pages(self, start_url, max_pages=50) -> List[Product]:
        """Scrape products across multiple pages."""
        all_products = []
        url = start_url
        for page_num in range(1, max_pages + 1):
            logger.info(f"Scraping page {page_num}: {url}")
            products = self._retry(self.scrape_page, url)
            all_products.extend(products)
            logger.info(f"  Found {len(products)} products (total: {len(all_products)})")
            # Find next page
            try:
                next_link = self.driver.find_element(By.CSS_SELECTOR, "a.next-page")
                url = next_link.get_attribute("href")
                time.sleep(random.uniform(1, 3))  # Polite delay
            except NoSuchElementException:
                logger.info("No more pages")
                break
        return all_products
    @staticmethod
    def export_json(products: List[Product], filename: str):
        with open(filename, "w", encoding="utf-8") as f:
            json.dump([asdict(p) for p in products], f, indent=2)
        logger.info(f"Exported {len(products)} products to {filename}")
    @staticmethod
    def export_csv(products: List[Product], filename: str):
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "price", "url", "rating", "reviews"])
            writer.writeheader()
            writer.writerows(asdict(p) for p in products)
        logger.info(f"Exported {len(products)} products to {filename}")
# Usage
if __name__ == "__main__":
    scraper = ProductScraper(headless=True)
    try:
        scraper.start()
        products = scraper.scrape_all_pages("https://example.com/products")
        scraper.export_json(products, "products.json")
        scraper.export_csv(products, "products.csv")
        print(f"\nScraped {len(products)} products successfully!")
    finally:
        scraper.stop()
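The retry-with-backoff idea in `_retry` can also be expressed as a standalone decorator, which makes it reusable and easy to test without a browser. This is a minimal sketch under stated assumptions: `retry`, `base_delay`, and the injectable `sleep` parameter are hypothetical names, and the delay formula mirrors the one used in the class (exponential growth plus up to one second of jitter).

```python
import random
import time
from functools import wraps


def retry(max_retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function with exponential backoff and jitter."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Give up on the final attempt without sleeping
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    sleep(delay)
        return wrapper
    return decorator


@retry(max_retries=3, base_delay=0.01, exceptions=(ConnectionError,))
def fetch_once():
    """Placeholder for a page fetch; always succeeds in this demo."""
    return "page content"


print(fetch_once())
```

With Selenium, `exceptions` would typically be `(TimeoutException, WebDriverException)`, matching the class above.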
12. Selenium vs Playwright vs Web Scraping APIs
| Feature | Selenium | Playwright | Mantis API |
|---|---|---|---|
| Setup complexity | Medium (driver management) | Low (auto-install) | None (HTTP calls) |
| Speed per page | 3-15 seconds | 1-8 seconds | 1-5 seconds |
| Memory per instance | 300-600 MB | 200-500 MB | 0 (serverless) |
| Anti-detection | Manual (undetected-chromedriver) | Manual (stealth plugin) | Built-in |
| Proxy management | Manual or selenium-wire | Built-in | Built-in |
| JavaScript rendering | Yes (full browser) | Yes (full browser) | Yes (cloud rendering) |
| Auto-waiting | Explicit waits required | Built-in auto-wait | N/A (returns when ready) |
| Community size | Largest (30K+ stars) | Growing (65K+ stars) | API-based |
| Language support | Python, Java, JS, C#, Ruby | Python, JS, Java, .NET | Any (HTTP/REST) |
| Scaling | Hard (infrastructure) | Hard (infrastructure) | Easy (API calls) |
| AI data extraction | No | No | Yes (built-in) |
| Best for | Legacy projects, broad language support | New projects, performance | Production at scale |
13. Cost Analysis: DIY Selenium vs. Mantis API
| Cost Component | DIY Selenium | Mantis API |
|---|---|---|
| Compute (cloud VMs for browsers) | $150-500/mo | $0 |
| Residential proxies | $50-200/mo | $0 (included) |
| CAPTCHA solving | $20-100/mo | $0 (handled) |
| ChromeDriver management | $0 (time cost) | $0 |
| Developer time (maintenance) | 10-20 hrs/mo | ~0 |
| Total monthly cost | $200-800 + time | $29-299 |
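The DIY total can be checked by summing the component ranges from the table. The short sketch below does that arithmetic and also estimates how many browser instances fit on one VM, using the upper end of the 300-600 MB per-instance figure quoted earlier. The function names and the 1 GB reserve for the operating system are assumptions made for illustration.

```python
# Monthly cost ranges in USD, taken from the DIY Selenium column
DIY_COSTS = {
    "compute": (150, 500),
    "residential_proxies": (50, 200),
    "captcha_solving": (20, 100),
}


def total_range(costs):
    """Sum the low and high ends of each cost component."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


def max_browsers(ram_mb, per_instance_mb=600, reserve_mb=1024):
    """Estimate concurrent browsers for a VM, leaving `reserve_mb` for the OS (assumed)."""
    return max(0, (ram_mb - reserve_mb) // per_instance_mb)


print(total_range(DIY_COSTS))   # (220, 800)
print(max_browsers(8192))       # 11
```

The components sum to $220-800 per month, which the table rounds to $200-800, and an 8 GB VM holds roughly 11 browsers under these assumptions.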
Stop Managing Browsers, Start Shipping Data
Mantis API handles rendering, proxies, anti-detection, and AI extraction. One API call replaces 200+ lines of Selenium code.
14. Frequently Asked Questions
Is Selenium good for web scraping in 2026?
Selenium remains a solid choice for web scraping in 2026, especially for JavaScript-heavy sites that require full browser rendering. It has the largest community of any browser automation tool, extensive documentation, and supports Chrome, Firefox, Edge, and Safari. However, it's slower than HTTP-based scraping and newer tools like Playwright offer better performance. For large-scale scraping, a web scraping API like Mantis is more cost-effective and reliable.
Is Selenium or Playwright better for web scraping?
Playwright is generally faster and has better built-in features (auto-waiting, network interception, multi-browser from one API). Selenium has a much larger community, more tutorials, better IDE integration, and longer track record. For new projects in 2026, Playwright is the better technical choice, but Selenium is perfectly capable and many developers prefer it for its familiarity and ecosystem.
Can websites detect Selenium scraping?
Yes. Default Selenium instances are easily detected through the navigator.webdriver property, ChromeDriver-specific JavaScript variables, missing browser plugins, and WebDriver protocol fingerprints. Tools like undetected-chromedriver help bypass basic detection, but sophisticated anti-bot systems can still detect Selenium through behavioral analysis, TLS fingerprinting, and HTTP/2 characteristics.
How much does Selenium web scraping cost to run?
Running Selenium at scale costs $200-800/month: cloud compute for headless browsers ($150-500), residential proxies ($50-200), and CAPTCHA solving ($20-100). Each browser instance uses 300-600MB RAM. A web scraping API like Mantis costs $29-299/month and handles everything automatically.
How do I make Selenium undetectable?
Use undetected-chromedriver, set realistic window sizes and user agents, disable automation flags, add random delays between actions, rotate proxies, and handle cookies properly. For advanced anti-bot systems, you may also need to spoof WebGL, Canvas, and AudioContext fingerprints. Web scraping APIs handle all of this automatically.
Should I use Selenium or an API for web scraping?
Use Selenium for complex browser interactions on a small number of sites. Use an API like Mantis when you need scale, reliability, or cost efficiency. Most teams start with Selenium and switch to APIs as they scale beyond a few hundred pages per day.
Conclusion
Selenium remains one of the most popular tools for web scraping in 2026. Its massive community, multi-language support, and battle-tested reliability make it a solid choice, especially if you're already familiar with it from testing.
For small to medium scraping projects (up to a few hundred pages per day), Selenium with undetected-chromedriver and residential proxies can handle most sites. But as you scale, the infrastructure cost and maintenance burden grow quickly.
That's where a web scraping API shines: one API call replaces hundreds of lines of Selenium code, and you no longer have to worry about driver updates, proxy rotation, or anti-detection.