PHP powers over 75% of all websites whose server-side language is known. If you're working in a PHP codebase — Laravel, Symfony, WordPress, or any custom application — web scraping in PHP means no context switching, no polyglot complexity, and direct integration with your existing stack.
PHP's web scraping toolbox at a glance:
| Tool | Purpose | Best For |
|---|---|---|
| cURL (built-in) | HTTP requests | Simple scripts, no dependencies |
| Guzzle | HTTP client | Production scraping, async requests |
| DOMDocument (built-in) | HTML/XML parsing | Basic parsing without Composer |
| Symfony DomCrawler | CSS/XPath parsing | Complex extraction, jQuery-like API |
| Goutte | HTTP + parsing combined | Quick scraping projects |
| Symfony Panther | Headless browser | JavaScript-rendered pages |
| chrome-php/chrome | Chrome DevTools Protocol | Full browser control |
No Composer required. This works with any PHP installation:
```php
<?php
// Fetch the page
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://example.com/products',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTPHEADER => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        'Accept: text/html,application/xhtml+xml',
    ],
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($html === false || $httpCode !== 200) {
    die("Request failed with status: $httpCode");
}

// Parse with DOMDocument (suppress warnings from real-world HTML)
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Extract all product titles
$titles = $xpath->query('//h2[@class="product-title"]');
foreach ($titles as $title) {
    echo $title->textContent . "\n";
}
```
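The same built-in tools handle multi-field extraction: pass a context node to `DOMXPath` and use relative queries scoped to each product card. A minimal sketch, with inline HTML standing in for a fetched page:

```php
<?php
// Sketch: multi-field extraction with DOMXPath relative queries.
// The inline HTML below stands in for a fetched page.
$html = <<<'HTML'
<div class="product"><h2 class="product-title">Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2 class="product-title">Gadget</h2><span class="price">$19.99</span></div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$products = [];
foreach ($xpath->query('//div[@class="product"]') as $card) {
    // Passing $card as the context node scopes the query to this card;
    // the leading "." makes the path relative to it
    $products[] = [
        'title' => $xpath->evaluate('string(.//h2[@class="product-title"])', $card),
        'price' => $xpath->evaluate('string(.//span[@class="price"])', $card),
    ];
}

print_r($products);
```

`DOMXPath::evaluate()` with a `string(...)` expression returns a plain string, which avoids a second loop over node lists.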
This is PHP scraping at its simplest — zero dependencies, runs anywhere. For anything beyond simple scripts, use Guzzle and DomCrawler.
Guzzle is the standard HTTP client for PHP. It handles cookies, redirects, retries, and concurrent requests out of the box:
```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;

$client = new Client([
    'base_uri' => 'https://example.com',
    'timeout' => 30,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        'Accept' => 'text/html,application/xhtml+xml',
    ],
    'cookies' => true, // Enable the cookie jar
]);

// Simple GET request
$response = $client->get('/products');
$html = (string) $response->getBody();

// POST with form data (login, search)
$response = $client->post('/search', [
    RequestOptions::FORM_PARAMS => [
        'query' => 'web scraping',
        'page' => 1,
    ],
]);

// Handle JSON APIs
$response = $client->get('/api/products', [
    RequestOptions::QUERY => ['category' => 'electronics'],
]);
$data = json_decode((string) $response->getBody(), true);
```
For transient failures (connection errors, 5xx responses, 429 rate limits), add Guzzle's retry middleware with exponential backoff:

```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: should this request be retried?
    function (int $retries, RequestInterface $request, ?ResponseInterface $response, ?\Throwable $e) {
        if ($retries >= 3) return false;
        if ($e instanceof ConnectException) return true;
        if ($response && $response->getStatusCode() >= 500) return true;
        if ($response && $response->getStatusCode() === 429) return true;
        return false;
    },
    // Delay in milliseconds; $retries is 1 on the first retry
    function (int $retries) {
        return 1000 * 2 ** ($retries - 1); // 1s, 2s, 4s
    }
));

$client = new Client(['handler' => $stack]);
```
DomCrawler gives you CSS selectors and XPath in a clean, chainable API:
```php
<?php
require 'vendor/autoload.php';
// composer require symfony/dom-crawler symfony/css-selector

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://example.com/products');
$crawler = new Crawler($html);

// CSS selectors (like jQuery)
$products = $crawler->filter('.product-card')->each(function (Crawler $node) {
    return [
        'title' => $node->filter('h2.title')->text(''),
        'price' => $node->filter('.price')->text(''),
        'link' => $node->filter('a')->attr('href'),
        'image' => $node->filter('img')->attr('src'),
    ];
});

// XPath for complex queries
$reviews = $crawler->filterXPath('//div[@data-rating > 4]')->each(function (Crawler $node) {
    return $node->text();
});

// Extract table data
$rows = $crawler->filter('table.data tbody tr')->each(function (Crawler $row) {
    return $row->filter('td')->each(fn (Crawler $cell) => trim($cell->text()));
});

print_r($products);
```
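One gotcha with `attr('href')`: scraped links are often relative to the page they came from. A minimal resolver covering the common cases (absolute, protocol-relative, root-relative, path-relative) can be sketched in pure PHP; a full RFC 3986 implementation would also handle `../` segments and fragments:

```php
<?php
// Sketch: resolving a scraped href against the page's base URL.
function resolveUrl(string $base, string $href): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (str_starts_with($href, '//')) {
        return $parts['scheme'] . ':' . $href; // protocol-relative
    }
    if (str_starts_with($href, '/')) {
        return $origin . $href; // root-relative
    }
    // Path-relative: replace the last segment of the base path
    $dir = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
    return $origin . $dir . $href;
}

echo resolveUrl('https://example.com/shop/index.html', '/cart') . PHP_EOL;   // https://example.com/cart
echo resolveUrl('https://example.com/shop/index.html', 'item/42') . PHP_EOL; // https://example.com/shop/item/42
```

Resolving every link to an absolute URL before queueing it keeps a multi-page crawl from silently requesting the wrong host.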
Goutte combines Guzzle and DomCrawler into a single package with built-in link clicking and form submission. Note that Goutte is now archived; its maintainer recommends Symfony BrowserKit's `HttpBrowser`, which exposes the same API:
```php
<?php
// composer require fabpot/goutte
use Goutte\Client;

$client = new Client();

// Navigate and scrape
$crawler = $client->request('GET', 'https://example.com/products');

// Click links (follows the link, returns the new page)
$detailPage = $client->click($crawler->filter('a.product-link')->link());

// Submit forms
$crawler = $client->request('GET', 'https://example.com/login');
$form = $crawler->filter('form#login')->form([
    'username' => 'user@example.com',
    'password' => 'secret',
]);
$client->submit($form);

// Now scrape authenticated pages
$dashboard = $client->request('GET', 'https://example.com/dashboard');
$data = $dashboard->filter('.metric-value')->each(fn ($node) => $node->text());
```
For SPAs and dynamic content, use Symfony Panther or chrome-php:
```php
<?php
// composer require symfony/panther
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com/spa');

// Wait for JavaScript to render
$client->waitFor('.dynamic-content', 10); // Wait up to 10 seconds

// Now extract the rendered content
$items = $crawler->filter('.dynamic-content .item')->each(function ($node) {
    return [
        'title' => $node->filter('.title')->text(''),
        'data' => $node->filter('.data')->text(''),
    ];
});

// Take a screenshot
$client->takeScreenshot('page.png');

$client->quit();
```
chrome-php drives Chrome directly over the DevTools Protocol for finer-grained control:

```php
<?php
// composer require chrome-php/chrome
use HeadlessChromium\BrowserFactory;

$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser([
    'headless' => true,
    'windowSize' => [1920, 1080],
    'noSandbox' => true,
]);

$page = $browser->createPage();
$page->navigate('https://example.com/app')->waitForNavigation();

// Execute JavaScript in the page context
$result = $page->evaluate('document.title')->getReturnValue();

// Get the full HTML after JS rendering
$html = $page->getHtml();

// Screenshot
$page->screenshot()->saveToFile('screenshot.png');

$browser->close();
```
Scrape multiple pages simultaneously using Guzzle's async capabilities:
```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['timeout' => 30]);
$results = [];

// Generate requests for 50 pages
$requests = function () {
    for ($page = 1; $page <= 50; $page++) {
        yield new Request('GET', "https://example.com/products?page=$page");
    }
};

$pool = new Pool($client, $requests(), [
    'concurrency' => 5, // 5 requests in flight at once
    'fulfilled' => function ($response, $index) use (&$results) {
        $crawler = new Crawler((string) $response->getBody());
        $products = $crawler->filter('.product')->each(function ($node) {
            return [
                'title' => $node->filter('.title')->text(''),
                'price' => $node->filter('.price')->text(''),
            ];
        });
        $results = array_merge($results, $products);
        echo "Page " . ($index + 1) . ": " . count($products) . " products\n";
    },
    'rejected' => function ($reason, $index) {
        echo "Page " . ($index + 1) . " failed: " . $reason->getMessage() . "\n";
    },
]);

$pool->promise()->wait();
echo "Total products: " . count($results) . "\n";

// Export to CSV
$fp = fopen('products.csv', 'w');
fputcsv($fp, ['Title', 'Price']);
foreach ($results as $product) {
    fputcsv($fp, $product);
}
fclose($fp);
```
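Prices scraped as text (`$1,299.99`) need normalizing before they're useful in a CSV or database. A minimal sketch, assuming US-style formatting with a comma as the thousands separator:

```php
<?php
// Sketch: normalizing scraped price strings before export.
// Strips currency symbols and thousands separators; returns null
// for unparseable values ("Free", "Call for price").
function parsePrice(string $raw): ?float
{
    // Keep only digits, dot, comma, and minus
    $clean = preg_replace('/[^\d.,-]/', '', trim($raw));
    if ($clean === '') {
        return null;
    }
    // Treat commas as thousands separators: "1,299.99" -> "1299.99"
    $clean = str_replace(',', '', $clean);
    return is_numeric($clean) ? (float) $clean : null;
}

var_dump(parsePrice('$1,299.99')); // float(1299.99)
var_dump(parsePrice('Free'));      // NULL
```

European formats ("1.299,99") would need the separators swapped first; detecting that per-site is usually easier than guessing per-value.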
Avoid getting blocked when scraping at scale. Start by rotating realistic browser headers:
```php
<?php
use GuzzleHttp\Client;

$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

$client = new Client([
    'headers' => [
        'User-Agent' => $userAgents[array_rand($userAgents)],
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
]);
```
Rotate proxies so requests come from different IPs:

```php
<?php
use GuzzleHttp\Client;

$proxies = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
];

// Picks one proxy per client; for per-request rotation,
// pass the 'proxy' option to individual requests instead
$client = new Client([
    'proxy' => $proxies[array_rand($proxies)],
]);
```
And throttle your request rate so you don't hammer the server:

```php
<?php
use GuzzleHttp\Client;

function scrapePage(Client $client, string $url): string
{
    static $lastRequest = 0;

    // Enforce a minimum of 2 seconds between requests
    $elapsed = microtime(true) - $lastRequest;
    if ($elapsed < 2.0) {
        usleep((int) ((2.0 - $elapsed) * 1_000_000));
    }

    $response = $client->get($url);
    $lastRequest = microtime(true);

    return (string) $response->getBody();
}
```
For production scraping where anti-blocking matters, a managed API handles all of this for you. See the complete anti-blocking guide.
Laravel makes scraping clean with its HTTP client (Guzzle wrapper) and job system:
```php
<?php
// app/Jobs/ScrapeProducts.php
namespace App\Jobs;

use App\Models\Product;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\Http;
use Symfony\Component\DomCrawler\Crawler;

class ScrapeProducts implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(private string $url) {}

    public function handle(): void
    {
        $response = Http::withHeaders([
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        ])->retry(3, 1000)->get($this->url);

        $crawler = new Crawler($response->body());

        $crawler->filter('.product-card')->each(function (Crawler $node) {
            Product::updateOrCreate(
                ['sku' => $node->filter('.sku')->text('')],
                [
                    'title' => $node->filter('.title')->text(''),
                    'price' => (float) str_replace('$', '', $node->filter('.price')->text('0')),
                ]
            );
        });
    }
}

// Dispatch from a controller or command:
// ScrapeProducts::dispatch('https://example.com/products');
```
Building a production scraper in PHP means maintaining proxy rotation, user-agent management, retry logic, rate limiting, and headless browser infrastructure yourself. A web scraping API like Mantis handles all of this with a single HTTP call:
```php
<?php
use Illuminate\Support\Facades\Http; // or any HTTP client

// Using Mantis API — one line replaces hundreds of lines of scraping code
$response = Http::withHeaders([
    'Authorization' => 'Bearer YOUR_API_KEY',
])->post('https://api.mantisapi.com/v1/scrape', [
    'url' => 'https://example.com/products',
    'render_js' => true,
    'extract' => [
        'products' => [
            'selector' => '.product-card',
            'fields' => [
                'title' => '.title',
                'price' => '.price',
            ],
        ],
    ],
]);

$products = $response->json('data.products');
```
| Approach | Setup Time | Maintenance | JS Rendering | Anti-Blocking | Cost (10K pages/mo) |
|---|---|---|---|---|---|
| DIY PHP + Guzzle | Days | Ongoing | Need headless browser | Build yourself | $50-200 (proxies + servers) |
| Mantis API | Minutes | Zero | Built-in | Built-in | $29/mo |
Mantis handles proxies, JavaScript rendering, and anti-blocking — so you can focus on your data.
| Feature | PHP | Python | Node.js |
|---|---|---|---|
| HTTP Client | Guzzle / cURL | Requests / httpx | Axios / node-fetch |
| HTML Parser | DomCrawler / DOMDocument | BeautifulSoup / lxml | Cheerio |
| Headless Browser | Panther / chrome-php | Playwright / Selenium | Puppeteer / Playwright |
| Async Support | Guzzle Promises / Fibers | asyncio / httpx | Native async/await |
| Concurrency | Pool (5-10 concurrent) | asyncio.gather | Promise.all |
| Web Framework Integration | ⭐⭐⭐ Laravel/Symfony | ⭐⭐ Django/Flask | ⭐⭐ Express |
| Scraping Ecosystem | ⭐⭐ Good | ⭐⭐⭐ Best | ⭐⭐ Good |
| Learning Curve | Low (web devs) | Low | Low (JS devs) |
| Best For | PHP codebases, WordPress | Standalone scrapers | JS-heavy sites |
Bottom line: Use the language your project already uses. If you're in a PHP codebase, scrape with PHP. The tools are mature and battle-tested. For new standalone projects, Python has the largest scraping ecosystem. For any language, a web scraping API eliminates the complexity entirely.
**Is PHP good for web scraping?**

Yes. PHP has mature scraping tools — cURL and DOMDocument are built-in, and libraries like Guzzle and Symfony DomCrawler are production-grade. If you're already working in PHP, there's no need to switch languages for scraping.

**What's the best PHP library for web scraping?**

Guzzle (HTTP client) + Symfony DomCrawler (HTML parsing) is the most popular production stack. For quick projects, Goutte combines both into a single package. For JavaScript-rendered pages, use Symfony Panther.

**Can PHP scrape JavaScript-rendered pages?**

Yes — use Symfony Panther or chrome-php/chrome for headless browser control. Or use a web scraping API that handles rendering server-side.

**Is PHP fast enough for web scraping?**

With Guzzle's concurrent request pool, PHP can scrape hundreds of pages per minute. PHP 8.x with JIT compilation handles HTML parsing efficiently. The bottleneck is almost always network I/O, not PHP performance.
**Is web scraping legal?**

Web scraping publicly available data is generally legal, but always check the website's Terms of Service and robots.txt. Respect rate limits and don't scrape personal data without consent. See our legal compliance guide.
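A minimal robots.txt check can be sketched in pure PHP. This only honors `Disallow` rules in the `User-agent: *` group; a real crawler should use a full parser that also handles `Allow`, wildcards, and per-agent groups:

```php
<?php
// Sketch: is $path disallowed by the "User-agent: *" group of robots.txt?
function isDisallowed(string $robotsTxt, string $path): bool
{
    $inStarGroup = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '') {
            continue;
        }
        if (stripos($line, 'User-agent:') === 0) {
            // Track whether we're inside the wildcard group
            $inStarGroup = trim(substr($line, 11)) === '*';
        } elseif ($inStarGroup && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && str_starts_with($path, $rule)) {
                return true;
            }
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /admin/\nDisallow: /private\n";
var_dump(isDisallowed($robots, '/admin/users')); // bool(true)
var_dump(isDisallowed($robots, '/products'));    // bool(false)
```

Fetch the file once per host (`https://example.com/robots.txt`), cache it, and check each path before queueing a request.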
**Do I need a web scraping API?**

If you need to scrape at scale, handle JavaScript rendering, or avoid blocks — yes. A web scraping API like Mantis handles the infrastructure so you can focus on data extraction. It's especially cost-effective compared to maintaining your own proxy infrastructure.