Web Scraping with PHP in 2026: The Complete Guide

Published March 28, 2026 · 12 min read

Table of Contents

- Why PHP for Web Scraping?
- The PHP Scraping Tool Stack
- Quick Start: cURL + DOMDocument
- Guzzle: The Modern HTTP Client
- Symfony DomCrawler: jQuery-Like Parsing
- Goutte: All-in-One Scraping
- Scraping JavaScript-Rendered Pages
- Concurrent Scraping with Guzzle Promises
- Anti-Blocking Techniques
- Web Scraping in Laravel
- When to Use a Web Scraping API
- PHP vs Python vs Node.js for Web Scraping
- Frequently Asked Questions

Why PHP for Web Scraping?

PHP powers over 75% of all websites with a known server-side language. If you're working in a PHP codebase — Laravel, Symfony, WordPress, or any custom application — web scraping in PHP means no context switching, no polyglot complexity, and direct integration with your existing stack.

PHP's web scraping strengths:

- Built-in HTTP and parsing tools (cURL, DOMDocument) — zero dependencies required
- Production-grade Composer libraries: Guzzle, Symfony DomCrawler, Panther
- Concurrent requests via Guzzle's promise-based request pool
- First-class framework integration with Laravel and Symfony, including queued scraping jobs

The PHP Scraping Tool Stack

| Tool | Purpose | Best For |
|---|---|---|
| cURL (built-in) | HTTP requests | Simple scripts, no dependencies |
| Guzzle | HTTP client | Production scraping, async requests |
| DOMDocument (built-in) | HTML/XML parsing | Basic parsing without Composer |
| Symfony DomCrawler | CSS/XPath parsing | Complex extraction, jQuery-like API |
| Goutte | HTTP + parsing combined | Quick scraping projects |
| Symfony Panther | Headless browser | JavaScript-rendered pages |
| chrome-php/chrome | Chrome DevTools Protocol | Full browser control |

Quick Start: cURL + DOMDocument

No Composer required. This works with any PHP installation:

<?php
// Fetch the page
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => 'https://example.com/products',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTPHEADER     => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        'Accept: text/html,application/xhtml+xml',
    ],
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($httpCode !== 200) {
    die("Request failed with status: $httpCode");
}

// Parse with DOMDocument
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Extract all product titles
$titles = $xpath->query('//h2[@class="product-title"]');
foreach ($titles as $title) {
    echo $title->textContent . "\n";
}

This is PHP scraping at its simplest — zero dependencies, runs anywhere. For anything beyond simple scripts, use Guzzle and DomCrawler.

Guzzle: The Modern HTTP Client

Guzzle is the standard HTTP client for PHP. It handles cookies, redirects, retries, and concurrent requests out of the box:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;

$client = new Client([
    'base_uri' => 'https://example.com',
    'timeout'  => 30,
    'headers'  => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        'Accept'     => 'text/html,application/xhtml+xml',
    ],
    'cookies' => true,  // Enable cookie jar
]);

// Simple GET request
$response = $client->get('/products');
$html = (string) $response->getBody();

// POST with form data (login, search)
$response = $client->post('/search', [
    RequestOptions::FORM_PARAMS => [
        'query' => 'web scraping',
        'page'  => 1,
    ],
]);

// Handle JSON APIs
$response = $client->get('/api/products', [
    RequestOptions::QUERY => ['category' => 'electronics'],
]);
$data = json_decode((string) $response->getBody(), true);

Retry with Exponential Backoff

<?php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
use GuzzleHttp\Exception\ConnectException;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    function (int $retries, RequestInterface $request, ?ResponseInterface $response, ?\Throwable $e) {
        if ($retries >= 3) return false;
        if ($e instanceof ConnectException) return true;
        if ($response && $response->getStatusCode() >= 500) return true;
        if ($response && $response->getStatusCode() === 429) return true;
        return false;
    },
    function (int $retries) {
        // Guzzle passes $retries starting at 1, so delays run 1s, 2s, 4s
        return 1000 * 2 ** ($retries - 1);
    }
));
));

$client = new Client(['handler' => $stack]);

Symfony DomCrawler: jQuery-Like Parsing

DomCrawler gives you CSS selectors and XPath in a clean, chainable API:

<?php
require 'vendor/autoload.php';
// composer require symfony/dom-crawler symfony/css-selector

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://example.com/products');
$crawler = new Crawler($html);

// CSS selectors (like jQuery)
$products = $crawler->filter('.product-card')->each(function (Crawler $node) {
    return [
        'title' => $node->filter('h2.title')->text(''),
        'price' => $node->filter('.price')->text(''),
        'link'  => $node->filter('a')->attr('href'),
        'image' => $node->filter('img')->attr('src'),
    ];
});

// XPath for complex queries
$reviews = $crawler->filterXPath('//div[@data-rating > 4]')->each(function (Crawler $node) {
    return $node->text();
});

// Extract table data
$rows = $crawler->filter('table.data tbody tr')->each(function (Crawler $row) {
    $cells = $row->filter('td')->each(fn(Crawler $cell) => trim($cell->text()));
    return $cells;
});

print_r($products);
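One detail the extraction above glosses over: scraped href and src values are often relative. Here is a minimal resolver sketch (not part of the original example — it covers the common cases, and a production crawler may prefer a dedicated URL library):

```php
<?php
// Resolve a scraped link against the page URL it came from.
// Handles absolute, protocol-relative, root-relative, and path-relative hrefs.
function resolveUrl(string $base, string $href): string
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href; // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (str_starts_with($href, '//')) {
        return $parts['scheme'] . ':' . $href; // protocol-relative
    }
    if (str_starts_with($href, '/')) {
        return $origin . $href; // root-relative
    }
    // Path-relative: replace the last segment of the base path
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolveUrl('https://example.com/products/list', 'item/42');
// https://example.com/products/item/42
```

Run each extracted 'link' and 'image' value through this before storing or requesting it.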

Goutte: All-in-One Scraping

Goutte combines Guzzle and DomCrawler into a single package with built-in link clicking and form submission. One caveat: fabpot/goutte was archived in 2023 and is now a thin wrapper around Symfony BrowserKit's HttpBrowser, which exposes the same API — for new projects, depend on symfony/browser-kit and symfony/http-client directly:

<?php
// composer require fabpot/goutte
use Goutte\Client;

$client = new Client();

// Navigate and scrape
$crawler = $client->request('GET', 'https://example.com/products');

// Click links (follows the link, returns new page)
$detailPage = $client->click($crawler->filter('a.product-link')->link());

// Submit forms
$crawler = $client->request('GET', 'https://example.com/login');
$form = $crawler->filter('form#login')->form([
    'username' => 'user@example.com',
    'password' => 'secret',
]);
$client->submit($form);

// Now scrape authenticated pages
$dashboard = $client->request('GET', 'https://example.com/dashboard');
$data = $dashboard->filter('.metric-value')->each(fn($node) => $node->text());

Scraping JavaScript-Rendered Pages

For SPAs and dynamic content, use Symfony Panther or chrome-php:

Symfony Panther (Headless Chrome)

<?php
// composer require symfony/panther
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();

$crawler = $client->request('GET', 'https://example.com/spa');

// Wait for JavaScript to render
$client->waitFor('.dynamic-content', 10); // Wait up to 10 seconds

// Now extract the rendered content
$items = $crawler->filter('.dynamic-content .item')->each(function ($node) {
    return [
        'title' => $node->filter('.title')->text(''),
        'data'  => $node->filter('.data')->text(''),
    ];
});

// Take a screenshot
$client->takeScreenshot('page.png');

$client->quit();

chrome-php (Low-Level Chrome Control)

<?php
// composer require chrome-php/chrome
use HeadlessChromium\BrowserFactory;

$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser([
    'headless'      => true,
    'windowSize'    => [1920, 1080],
    'noSandbox'     => true,
]);

$page = $browser->createPage();
$page->navigate('https://example.com/app')->waitForNavigation();

// Execute JavaScript
$result = $page->evaluate('document.title')->getReturnValue();

// Get full HTML after JS rendering
$html = $page->getHtml();

// Screenshot
$page->screenshot()->saveToFile('screenshot.png');

$browser->close();

💡 Tip: Headless browsers are resource-intensive. For most scraping tasks, a web scraping API like Mantis handles JavaScript rendering server-side — your PHP code just makes simple HTTP requests.

Concurrent Scraping with Guzzle Promises

Scrape multiple pages simultaneously using Guzzle's async capabilities:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['timeout' => 30]);
$results = [];

// Generate requests for 50 pages
$requests = function () {
    for ($page = 1; $page <= 50; $page++) {
        yield new Request('GET', "https://example.com/products?page=$page");
    }
};

$pool = new Pool($client, $requests(), [
    'concurrency' => 5,  // 5 concurrent requests
    'fulfilled' => function ($response, $index) use (&$results) {
        $crawler = new Crawler((string) $response->getBody());
        $products = $crawler->filter('.product')->each(function ($node) {
            return [
                'title' => $node->filter('.title')->text(''),
                'price' => $node->filter('.price')->text(''),
            ];
        });
        $results = array_merge($results, $products);
        echo "Page " . ($index + 1) . ": " . count($products) . " products\n";
    },
    'rejected' => function ($reason, $index) {
        echo "Page " . ($index + 1) . " failed: " . $reason->getMessage() . "\n";
    },
]);

$pool->promise()->wait();
echo "Total products: " . count($results) . "\n";

// Export to CSV
$fp = fopen('products.csv', 'w');
fputcsv($fp, ['Title', 'Price']);
foreach ($results as $product) {
    fputcsv($fp, $product);
}
fclose($fp);

Anti-Blocking Techniques

Avoid getting blocked when scraping at scale:

User-Agent Rotation

<?php
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

$client = new Client([
    'headers' => [
        'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
]);

// Pick a fresh User-Agent per request — setting it once on the client
// means every request shares the same identity
$response = $client->get('https://example.com/products', [
    'headers' => ['User-Agent' => $userAgents[array_rand($userAgents)]],
]);

Proxy Rotation

<?php
$proxies = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
];

$client = new Client();

// Rotate proxies per request instead of fixing one for the client's lifetime
$response = $client->get('https://example.com/products', [
    'proxy' => $proxies[array_rand($proxies)],
]);

Rate Limiting

<?php
function scrapePage(Client $client, string $url): string
{
    static $lastRequest = 0;

    // Minimum 2 seconds between requests
    $elapsed = microtime(true) - $lastRequest;
    if ($elapsed < 2.0) {
        usleep((int)((2.0 - $elapsed) * 1_000_000));
    }

    $response = $client->get($url);
    $lastRequest = microtime(true);

    return (string) $response->getBody();
}
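The fixed-delay throttle above never allows bursts. A token bucket keeps the same long-run rate while letting a few requests through back-to-back (a sketch in the same spirit — the class name and parameters are illustrative, not from a library):

```php
<?php
// Token bucket: refills at $ratePerSec, holds at most $capacity tokens.
// Each request spends one token; an empty bucket means the caller waits.
class TokenBucket
{
    private float $tokens;
    private float $lastRefill;

    public function __construct(
        private float $ratePerSec,
        private float $capacity,
    ) {
        $this->tokens     = $capacity;
        $this->lastRefill = microtime(true);
    }

    // Returns the number of seconds the caller should sleep before proceeding.
    public function reserve(): float
    {
        $now = microtime(true);
        $this->tokens = min(
            $this->capacity,
            $this->tokens + ($now - $this->lastRefill) * $this->ratePerSec
        );
        $this->lastRefill = $now;

        if ($this->tokens >= 1.0) {
            $this->tokens -= 1.0;
            return 0.0;
        }
        $wait = (1.0 - $this->tokens) / $this->ratePerSec;
        $this->tokens -= 1.0; // goes negative; the debt is repaid by refill
        return $wait;
    }
}

// ~1 request every 2 seconds on average, but bursts of 3 are allowed
$bucket = new TokenBucket(ratePerSec: 0.5, capacity: 3.0);
```

Before each request, call `usleep((int)($bucket->reserve() * 1_000_000));` in place of the fixed two-second gap.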

For production scraping where anti-blocking matters, a managed API handles all of this for you. See the complete anti-blocking guide.

Web Scraping in Laravel

Laravel makes scraping clean with its HTTP client (Guzzle wrapper) and job system:

<?php
// app/Jobs/ScrapeProducts.php
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;
use App\Models\Product;

class ScrapeProducts implements ShouldQueue
{
    use Queueable;

    public function __construct(private string $url) {}

    public function handle(): void
    {
        $response = Http::withHeaders([
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        ])->retry(3, 1000)->get($this->url);

        $crawler = new Crawler($response->body());

        $crawler->filter('.product-card')->each(function (Crawler $node) {
            Product::updateOrCreate(
                ['sku' => $node->filter('.sku')->text('')],
                [
                    'title' => $node->filter('.title')->text(''),
                    'price' => (float) str_replace('$', '', $node->filter('.price')->text('0')),
                ]
            );
        });
    }
}

// Dispatch from a controller or command
// ScrapeProducts::dispatch('https://example.com/products');

When to Use a Web Scraping API

Building a production scraper in PHP means maintaining:

- Proxy pools and rotation logic
- Headless Chrome instances for JavaScript rendering
- User-Agent and header rotation to avoid blocks
- Retry, rate-limiting, and monitoring code
- Parsers that break whenever the target site changes its markup

A web scraping API like Mantis handles all of this with a single HTTP call:

<?php
// Using Mantis API — one line replaces hundreds of lines of scraping code
$response = Http::withHeaders([
    'Authorization' => 'Bearer YOUR_API_KEY',
])->post('https://api.mantisapi.com/v1/scrape', [
    'url' => 'https://example.com/products',
    'render_js' => true,
    'extract' => [
        'products' => [
            'selector' => '.product-card',
            'fields' => [
                'title' => '.title',
                'price' => '.price',
            ],
        ],
    ],
]);

$products = $response->json('data.products');

| Approach | Setup Time | Maintenance | JS Rendering | Anti-Blocking | Cost (10K pages/mo) |
|---|---|---|---|---|---|
| DIY PHP + Guzzle | Days | Ongoing | Need headless browser | Build yourself | $50-200 (proxies + servers) |
| Mantis API | Minutes | Zero | Built-in | Built-in | $29/mo |

Skip the Infrastructure. Start Scraping.

Mantis handles proxies, JavaScript rendering, and anti-blocking — so you can focus on your data.


PHP vs Python vs Node.js for Web Scraping

| Feature | PHP | Python | Node.js |
|---|---|---|---|
| HTTP Client | Guzzle / cURL | Requests / httpx | Axios / node-fetch |
| HTML Parser | DomCrawler / DOMDocument | BeautifulSoup / lxml | Cheerio |
| Headless Browser | Panther / chrome-php | Playwright / Selenium | Puppeteer / Playwright |
| Async Support | Guzzle Promises / Fibers | asyncio / httpx | Native async/await |
| Concurrency | Pool (5-10 concurrent) | asyncio.gather | Promise.all |
| Web Framework Integration | ⭐⭐⭐ Laravel/Symfony | ⭐⭐ Django/Flask | ⭐⭐ Express |
| Scraping Ecosystem | ⭐⭐ Good | ⭐⭐⭐ Best | ⭐⭐ Good |
| Learning Curve | Low (web devs) | Low | Low (JS devs) |
| Best For | PHP codebases, WordPress | Standalone scrapers | JS-heavy sites |

Bottom line: Use the language your project already uses. If you're in a PHP codebase, scrape with PHP. The tools are mature and battle-tested. For new standalone projects, Python has the largest scraping ecosystem. For any language, a web scraping API eliminates the complexity entirely.

Frequently Asked Questions

Is PHP good for web scraping?

Yes. PHP has mature scraping tools — cURL and DOMDocument are built-in, and libraries like Guzzle and Symfony DomCrawler are production-grade. If you're already working in PHP, there's no need to switch languages for scraping.

What's the best PHP library for web scraping?

Guzzle (HTTP client) + Symfony DomCrawler (HTML parsing) is the most popular production stack. For quick projects, Goutte combines both into a single package. For JavaScript-rendered pages, use Symfony Panther.

Can PHP scrape JavaScript-rendered pages?

Yes — use Symfony Panther or chrome-php/chrome for headless browser control. Or use a web scraping API that handles rendering server-side.

How fast is PHP for web scraping?

With Guzzle's concurrent request pool, PHP can scrape hundreds of pages per minute. PHP 8.x with JIT compilation handles HTML parsing efficiently. The bottleneck is always network I/O, not PHP performance.
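As a back-of-envelope illustration (assumed numbers, not a benchmark):

```php
<?php
// Rough throughput estimate: with the Pool example's concurrency of 5 and an
// assumed 1-second average response time, each worker fetches ~60 pages/min.
$concurrency    = 5;    // parallel requests, as in the Pool example
$secondsPerPage = 1.0;  // assumed average fetch time
$pagesPerMinute = (60 / $secondsPerPage) * $concurrency;
echo $pagesPerMinute . "\n"; // 300
```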

Is web scraping legal?

Web scraping publicly available data is generally legal, but always check the website's Terms of Service and robots.txt. Respect rate limits and don't scrape personal data without consent. See our legal compliance guide.
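Checking robots.txt before crawling can be scripted in a few lines. This is a deliberately simplified sketch (exact prefix matching for the * group only — real parsers also handle Allow precedence, wildcards, and per-bot groups):

```php
<?php
// Return true if $path is disallowed for all bots ("User-agent: *")
// by the given robots.txt contents.
function isDisallowed(string $robotsTxt, string $path): bool
{
    $inStarGroup = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '') continue;
        [$field, $value] = array_map('trim', explode(':', $line, 2) + [1 => '']);
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $inStarGroup = ($value === '*');
        } elseif ($inStarGroup && $field === 'disallow' && $value !== '') {
            if (str_starts_with($path, $value)) return true;
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /admin\nDisallow: /private/\n";
var_dump(isDisallowed($robots, '/admin/users'));  // true
var_dump(isDisallowed($robots, '/products'));     // false
```

Fetch the file once per host before crawling, e.g. `$robotsTxt = (string) $client->get('/robots.txt')->getBody();`.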

Should I use a web scraping API instead of building my own?

If you need to scrape at scale, handle JavaScript rendering, or avoid blocks — yes. A web scraping API like Mantis handles the infrastructure so you can focus on data extraction. It's especially cost-effective compared to maintaining your own proxy infrastructure.