Build blazing-fast web scrapers with Go's most popular scraping framework. From basic collectors to distributed scraping at scale.
Go is one of the fastest languages for web scraping. Its built-in concurrency model (goroutines), compiled speed, and tiny memory footprint make it perfect for scraping millions of pages. And Colly, Go's premier scraping framework, makes it elegant too.
In this guide, you'll learn everything about web scraping with Go and Colly in 2026: from basic collectors to distributed, production-grade scrapers that process thousands of pages per minute.
Go brings unique advantages to web scraping that Python and Node.js can't match:
- go build produces one binary with zero dependencies. Copy it anywhere, it runs.
- net/http and encoding/json are production-grade out of the box.

Make sure you have Go 1.21+ installed, then create a new project:
# Create project directory
mkdir my-scraper && cd my-scraper
# Initialize Go module
go mod init my-scraper
# Install Colly v2
go get github.com/gocolly/colly/v2
# Optional: install goquery for advanced HTML parsing
go get github.com/PuerkitoBio/goquery
Your project structure:
my-scraper/
├── go.mod
├── go.sum
└── main.go
Colly uses a collector pattern with callbacks. You create a collector, attach callbacks for different events, then start scraping:
package main
import (
"fmt"
"log"
"github.com/gocolly/colly/v2"
)
func main() {
// Create a new collector
c := colly.NewCollector(
// Restrict domains to scrape
colly.AllowedDomains("quotes.toscrape.com"),
)
// Called when an HTML element matching the selector is found
c.OnHTML(".quote", func(e *colly.HTMLElement) {
quote := e.ChildText(".text")
author := e.ChildText(".author")
fmt.Printf("\"%s\" - %s\n", quote, author)
})
// Called before a request is made
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting:", r.URL.String())
})
// Called if an error occurs during the request
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error on %s: %v", r.Request.URL, err)
})
// Start scraping
err := c.Visit("https://quotes.toscrape.com/")
if err != nil {
log.Fatal(err)
}
}
Run it:
go run main.go
That's it: a working scraper in ~30 lines. Colly handles HTTP requests, HTML parsing, and error handling automatically.
Colly uses goquery under the hood, giving you jQuery-style CSS selectors:
// Extract text content
title := e.ChildText("h1")
// Extract an attribute
href := e.ChildAttr("a", "href")
imgSrc := e.ChildAttr("img", "src")
// Extract multiple items
e.ForEach("li.item", func(i int, el *colly.HTMLElement) {
name := el.ChildText(".name")
price := el.ChildText(".price")
link := el.ChildAttr("a", "href")
fmt.Printf("%d. %s - %s (%s)\n", i+1, name, price, link)
})
// Get the raw HTML of an element
html, _ := e.DOM.Html()
// Use goquery directly for complex selections
e.DOM.Find("table tr").Each(func(i int, s *goquery.Selection) {
cells := s.Find("td")
col1 := cells.Eq(0).Text()
col2 := cells.Eq(1).Text()
fmt.Printf("Row %d: %s | %s\n", i, col1, col2)
})
| Selector | Matches | Example |
|---|---|---|
| div | Tag name | All <div> elements |
| .class | Class | .product-card |
| #id | ID | #main-content |
| div.class | Tag + class | span.price |
| div > p | Direct child | article > h2 |
| div p | Descendant | .card .title |
| [attr=val] | Attribute | [data-type="premium"] |
| a[href^="https"] | Starts with | External links |
| :nth-child(n) | Position | tr:nth-child(2) |
Colly's power comes from its callback system. Each callback fires at a different stage of the scraping lifecycle:
c := colly.NewCollector()
// 1. Before making a request
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("Accept-Language", "en-US,en;q=0.9")
fmt.Println("->", r.URL)
})
// 2. When the server responds (raw response)
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Status: %d, Size: %d bytes\n", r.StatusCode, len(r.Body))
})
// 3. When an HTML element is found (most used)
c.OnHTML("h1", func(e *colly.HTMLElement) {
fmt.Println("Title:", e.Text)
})
// 4. When an XML element is found (for RSS, sitemap, etc.)
c.OnXML("//item/title", func(e *colly.XMLElement) {
fmt.Println("Feed item:", e.Text)
})
// 5. When scraping is finished for a page
c.OnScraped(func(r *colly.Response) {
fmt.Println("Done:", r.Request.URL)
})
// 6. When an error occurs
c.OnError(func(r *colly.Response, err error) {
fmt.Printf("Error %d on %s: %v\n", r.StatusCode, r.Request.URL, err)
})
Callbacks are called in order: OnRequest → OnResponse → OnHTML/OnXML → OnScraped. If an error occurs, OnError is called instead of OnResponse.
Colly makes following links trivial โ just call Visit() inside an OnHTML callback:
c := colly.NewCollector(
colly.AllowedDomains("quotes.toscrape.com"),
colly.MaxDepth(3), // Limit crawl depth
)
// Scrape quotes from each page
c.OnHTML(".quote", func(e *colly.HTMLElement) {
fmt.Printf("\"%s\" - %s\n", e.ChildText(".text"), e.ChildText(".author"))
})
// Follow pagination links
c.OnHTML("li.next a[href]", func(e *colly.HTMLElement) {
nextPage := e.Attr("href")
fmt.Println("Following ->", nextPage)
e.Request.Visit(nextPage) // Relative URLs resolved automatically
})
c.Visit("https://quotes.toscrape.com/")
Colly automatically deduplicates URLs: it won't visit the same page twice (unless you set colly.AllowURLRevisit()).
c := colly.NewCollector(
colly.AllowedDomains("example.com"),
colly.MaxDepth(5),
)
// Follow all internal links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
e.Request.Visit(link)
})
// Process each page
c.OnRequest(func(r *colly.Request) {
fmt.Println("Crawling:", r.URL)
})
c.Visit("https://example.com/")
This is where Go and Colly truly shine. Colly has built-in concurrency and rate limiting, with no external libraries needed:
c := colly.NewCollector(
colly.Async(true), // Enable asynchronous scraping
)
// Rate limiting rules
c.Limit(&colly.LimitRule{
// Match all domains
DomainGlob: "*",
// Max 5 concurrent requests per domain
Parallelism: 5,
// Wait 1 second between requests
Delay: 1 * time.Second,
// Add random delay up to 500ms
RandomDelay: 500 * time.Millisecond,
})
c.OnHTML(".product", func(e *colly.HTMLElement) {
fmt.Println(e.ChildText(".name"))
})
// Queue up multiple URLs
urls := []string{
"https://example.com/page/1",
"https://example.com/page/2",
"https://example.com/page/3",
// ... hundreds more
}
for _, url := range urls {
c.Visit(url)
}
// Wait for all async requests to finish
c.Wait()
// Different rules for different domains
c.Limit(&colly.LimitRule{
DomainGlob: "*.fast-site.com",
Parallelism: 10,
Delay: 200 * time.Millisecond,
})
c.Limit(&colly.LimitRule{
DomainGlob: "*.slow-site.com",
Parallelism: 2,
Delay: 2 * time.Second,
})
c := colly.NewCollector()
// Set default headers
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
r.Headers.Set("Accept", "text/html,application/xhtml+xml")
r.Headers.Set("Accept-Language", "en-US,en;q=0.9")
r.Headers.Set("Referer", "https://www.google.com/")
})
// Cookies are handled automatically per-domain
// To set custom cookies:
c.SetCookies("https://example.com", []*http.Cookie{
{Name: "session_id", Value: "abc123"},
{Name: "consent", Value: "accepted"},
})
c := colly.NewCollector()
// Step 1: Login
c.OnHTML("form[action='/login']", func(e *colly.HTMLElement) {
// Extract CSRF token
csrfToken := e.ChildAttr("input[name='csrf']", "value")
// Submit login form
e.Request.Post("https://example.com/login", map[string]string{
"username": "myuser",
"password": "mypass",
"csrf": csrfToken,
})
})
// Step 2: After login, cookies are stored automatically
// Subsequent requests will include the session cookie
c.OnResponse(func(r *colly.Response) {
if r.Request.URL.Path == "/login" {
// Login successful, now scrape protected pages
c.Visit("https://example.com/dashboard")
}
})
c.Visit("https://example.com/login")
Colly supports proxy rotation out of the box with round-robin or custom proxy switching:
import "github.com/gocolly/colly/v2/proxy"
c := colly.NewCollector()
// Round-robin proxy rotation
proxySwitcher, err := proxy.RoundRobinProxySwitcher(
"http://proxy1.example.com:8080",
"http://proxy2.example.com:8080",
"http://proxy3.example.com:8080",
"socks5://proxy4.example.com:1080",
)
if err != nil {
log.Fatal(err)
}
c.SetProxyFunc(proxySwitcher)
// Custom proxy function โ rotate based on request count
var requestCount int32
c.SetProxyFunc(func(r *http.Request) (*url.URL, error) {
proxies := []string{
"http://us-proxy.example.com:8080",
"http://eu-proxy.example.com:8080",
"http://asia-proxy.example.com:8080",
}
idx := atomic.AddInt32(&requestCount, 1) % int32(len(proxies))
return url.Parse(proxies[idx])
})
Colly can cache responses to disk, reducing redundant requests during development:
c := colly.NewCollector(
colly.CacheDir("./cache"), // Cache responses to disk
)
// Responses are cached by URL; revisiting returns the cached version
// Delete ./cache to force re-fetching
import (
"encoding/csv"
"encoding/json"
"os"
)
type Product struct {
Name string `json:"name"`
Price string `json:"price"`
URL string `json:"url"`
}
var products []Product
c.OnHTML(".product-card", func(e *colly.HTMLElement) {
products = append(products, Product{
Name: e.ChildText(".name"),
Price: e.ChildText(".price"),
URL: e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
})
})
c.OnScraped(func(r *colly.Response) {
// Export to JSON
jsonFile, _ := os.Create("products.json")
defer jsonFile.Close()
json.NewEncoder(jsonFile).Encode(products)
// Export to CSV
csvFile, _ := os.Create("products.csv")
defer csvFile.Close()
w := csv.NewWriter(csvFile)
w.Write([]string{"Name", "Price", "URL"})
for _, p := range products {
w.Write([]string{p.Name, p.Price, p.URL})
}
w.Flush()
})
Colly is an HTTP-based scraper: it doesn't execute JavaScript. For JS-heavy sites, you have three options:
Many "JavaScript-rendered" sites actually load data from JSON APIs. Check the browser's Network tab:
// If the site loads data from an API endpoint
c.OnResponse(func(r *colly.Response) {
var data struct {
Products []struct {
Name string `json:"name"`
Price float64 `json:"price"`
} `json:"products"`
}
json.Unmarshal(r.Body, &data)
for _, p := range data.Products {
fmt.Printf("%s: $%.2f\n", p.Name, p.Price)
}
})
c.Visit("https://api.example.com/products?page=1")
import "github.com/chromedp/chromedp"
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
var htmlContent string
err := chromedp.Run(ctx,
chromedp.Navigate("https://spa-site.com/products"),
chromedp.WaitVisible(".product-list"),
chromedp.OuterHTML("html", &htmlContent),
)
// Now parse htmlContent with goquery or feed it to Colly
Skip the infrastructure entirely. Mantis handles JavaScript rendering, proxy rotation, and anti-bot evasion:
// One API call replaces hundreds of lines of scraping code
resp, err := http.Get("https://api.mantisapi.com/v1/scrape?url=https://spa-site.com/products&render_js=true&api_key=YOUR_KEY")
var result struct {
HTML string `json:"html"`
Metadata map[string]string `json:"metadata"`
}
json.NewDecoder(resp.Body).Decode(&result)
Mantis API handles JavaScript rendering, proxy rotation, and anti-bot evasion, so you can focus on the data, not the scraping mechanics.
Get 100 Free API Calls →

Here's a complete, production-grade scraper with all best practices:
package main
import (
"encoding/json"
"fmt"
"log"
"os"
"strings"
"sync"
"time"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/proxy"
)
type ScrapedItem struct {
Title string `json:"title"`
URL string `json:"url"`
Price string `json:"price,omitempty"`
Description string `json:"description,omitempty"`
Tags []string `json:"tags,omitempty"`
ScrapedAt string `json:"scraped_at"`
}
func main() {
var (
items []ScrapedItem
mu sync.Mutex
stats struct {
Pages int
Items int
Errors int
}
)
c := colly.NewCollector(
colly.AllowedDomains("example.com", "www.example.com"),
colly.MaxDepth(5),
colly.Async(true),
colly.CacheDir("./cache"),
)
// Rate limiting
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 4,
Delay: 1 * time.Second,
RandomDelay: 500 * time.Millisecond,
})
// Rotate User-Agents
userAgents := []string{
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}
var uaIdx int
c.OnRequest(func(r *colly.Request) {
// The collector is async, so guard the shared counter with the mutex
mu.Lock()
idx := uaIdx % len(userAgents)
uaIdx++
mu.Unlock()
r.Headers.Set("User-Agent", userAgents[idx])
r.Headers.Set("Accept-Language", "en-US,en;q=0.9")
})
// Extract items
c.OnHTML(".product-card", func(e *colly.HTMLElement) {
item := ScrapedItem{
Title: strings.TrimSpace(e.ChildText("h2")),
URL: e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
Price: strings.TrimSpace(e.ChildText(".price")),
Description: strings.TrimSpace(e.ChildText(".description")),
ScrapedAt: time.Now().UTC().Format(time.RFC3339),
}
e.ForEach(".tag", func(_ int, el *colly.HTMLElement) {
item.Tags = append(item.Tags, el.Text)
})
mu.Lock()
items = append(items, item)
stats.Items++
mu.Unlock()
})
// Follow pagination
c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
c.OnResponse(func(r *colly.Response) {
mu.Lock()
stats.Pages++
mu.Unlock()
})
c.OnError(func(r *colly.Response, err error) {
mu.Lock()
stats.Errors++
mu.Unlock()
log.Printf("Error [%d] %s: %v", r.StatusCode, r.Request.URL, err)
})
// Start scraping
start := time.Now()
c.Visit("https://example.com/products")
c.Wait()
// Export results
f, err := os.Create("results.json")
if err != nil {
log.Fatal(err)
}
defer f.Close()
enc := json.NewEncoder(f)
enc.SetIndent("", " ")
enc.Encode(items)
elapsed := time.Since(start)
fmt.Printf("\n=== Scraping Complete ===\n")
fmt.Printf("Pages: %d\n", stats.Pages)
fmt.Printf("Items: %d\n", stats.Items)
fmt.Printf("Errors: %d\n", stats.Errors)
fmt.Printf("Duration: %s\n", elapsed.Round(time.Millisecond))
fmt.Printf("Speed: %.1f pages/sec\n", float64(stats.Pages)/elapsed.Seconds())
}
| Feature | Go + Colly | Python + Scrapy | Node.js + Puppeteer | Mantis API |
|---|---|---|---|---|
| Speed | Fastest | Fast | Slow (browser) | Fast |
| Concurrency | Built-in (goroutines) | Built-in (Twisted) | Limited | Handled by API |
| Memory | ~5-20 MB | ~50-200 MB | ~200-500 MB | N/A |
| JavaScript | ✗ (needs chromedp) | ✗ (needs Splash) | ✓ Native | ✓ Built-in |
| Anti-bot | Manual | Manual | Stealth plugin | ✓ Built-in |
| Proxy rotation | Built-in | Manual/middleware | Manual | ✓ Built-in |
| Deployment | Single binary | Python env + deps | Node + Chromium | HTTP calls |
| Learning curve | Moderate | Moderate | Easy | Easiest |
| Best for | High-performance crawls | Large-scale projects | JS-heavy sites | Any site, any scale |
Choose Colly when: You need raw speed, efficient resource usage, easy deployment, and are comfortable with Go. Perfect for infrastructure teams, microservices, and high-volume data pipelines.
Building a production scraper with Colly means managing proxies, rotating user agents, handling CAPTCHAs, and fighting anti-bot systems. Or you can make one API call:
package main
import (
"encoding/json"
"fmt"
"io"
"net/http"
"net/url"
)
func main() {
// One API call; Mantis handles the rest
target := url.QueryEscape("https://example.com/products")
apiURL := fmt.Sprintf("https://api.mantisapi.com/v1/scrape?url=%s&render_js=true", target)
req, _ := http.NewRequest("GET", apiURL, nil)
req.Header.Set("X-API-Key", "your-api-key")
resp, err := http.DefaultClient.Do(req)
if err != nil {
panic(err)
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
var result map[string]interface{}
json.Unmarshal(body, &result)
fmt.Println(result["html"])
}
| You Build (DIY) | Mantis Handles |
|---|---|
| Proxy infrastructure ($200-1000/mo) | ✓ Built-in proxy rotation |
| Anti-bot evasion code | ✓ Automatic anti-detection |
| JavaScript rendering (chromedp setup) | ✓ Full JS rendering |
| CAPTCHA solving integration | ✓ Handled automatically |
| User-agent rotation | ✓ Realistic browser headers |
| Error handling & retries | ✓ Built-in reliability |
| Maintenance & monitoring | ✓ Managed infrastructure |
Stop building scraping infrastructure. Mantis gives you clean data from any website, with Go, Python, Node.js, or any language that speaks HTTP.
Start Free - 100 Calls/Month →

Yes, Go is one of the best languages for web scraping. Its goroutines handle thousands of concurrent requests with minimal memory, its compiled speed makes HTML parsing 5-10x faster than Python, and single-binary deployment makes it trivial to run scrapers anywhere. Colly adds an elegant API on top.
Colly is Go's most popular web scraping framework. It provides a callback-based API for making HTTP requests, parsing HTML with CSS selectors, following links, managing cookies, rate limiting, proxy rotation, and caching. Think of it as "Scrapy for Go": production-ready and battle-tested.
Colly in Go is typically 5-10x faster than Python scraping libraries. In benchmarks, Colly processes 1,000+ pages per minute with 4 concurrent workers, while Scrapy manages 200-400. Go's compiled nature and lightweight goroutines give it a significant edge for high-volume scraping.
Colly alone doesn't execute JavaScript. For JS-heavy sites, use chromedp (Go's headless Chrome library) alongside Colly, find hidden API endpoints that serve JSON data, or use a scraping API like Mantis that handles JavaScript rendering automatically.
Colly has built-in rate limiting via LimitRule: set Delay (wait between requests), RandomDelay (jitter), Parallelism (concurrent requests per domain), and DomainGlob (pattern matching). This makes it easy to scrape responsibly.
Use Colly when you need maximum control and performance, and can manage proxy infrastructure. Use a web scraping API when you want to skip infrastructure management, need anti-bot evasion, or want JavaScript rendering without running headless browsers.