| Use case | DIY Crawler | WebPerception |
|----------|-------------|---------------|
| Billion-page crawl | ✅ | ❌ (cost) |
| Curated domain list | ⚠️ | ✅ |
| JavaScript-heavy sites | ❌ | ✅ |
| Structured extraction | ❌ | ✅ |
| Quality over quantity | ⚠️ | ✅ |
| Real-time data feeds | ❌ | ✅ |
Step 2: Text Extraction and Cleaning
Raw HTML to clean text is harder than it sounds:
```python
from bs4 import BeautifulSoup
import re

def extract_text(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove non-content elements
    for tag in soup(['script', 'style', 'nav', 'header',
                     'footer', 'aside', 'form', 'button']):
        tag.decompose()

    # Get text
    text = soup.get_text(separator='\n')

    # Clean up
    lines = [line.strip() for line in text.splitlines()]
    text = '\n'.join(line for line in lines if line)

    # Remove boilerplate patterns
    text = remove_boilerplate(text)
    return text

def remove_boilerplate(text):
    """Remove common web boilerplate text."""
    patterns = [
        r'cookie.*?accept',
        r'subscribe.*?newsletter',
        r'all rights reserved',
        r'privacy policy.*?terms',
        r'©\s*\d{4}',
    ]
    for pattern in patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    return text
```
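If you can't take a bs4 dependency, the same skip-the-chrome idea can be sketched with the standard library's `html.parser`. This is illustrative, not a drop-in replacement — real-world HTML is messier than any small parser handles:

```python
from html.parser import HTMLParser

# Same non-content tags as the bs4 version above
SKIP_TAGS = {'script', 'style', 'nav', 'header',
             'footer', 'aside', 'form', 'button'}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skip-tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skip-tag
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    return '\n'.join(parser.chunks)
```

The depth counter matters: skip-tags can nest (a `<form>` inside a `<footer>`), and a simple boolean flag would re-enable text capture too early.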
Step 3: Quality Filtering
Not all web text is training-worthy. Filter aggressively:
```python
import langdetect
from collections import Counter

def quality_score(text):
    """Score text quality for LLM training (0-1)."""
    scores = {}

    # Length check
    word_count = len(text.split())
    scores['length'] = min(word_count / 500, 1.0)  # Prefer 500+ words

    # Language consistency
    try:
        lang = langdetect.detect(text)
        scores['language'] = 1.0 if lang == 'en' else 0.0
    except Exception:  # langdetect raises on empty or undetectable text
        scores['language'] = 0.0

    # Repetition check
    words = text.lower().split()
    if words:
        word_freq = Counter(words)
        most_common_ratio = word_freq.most_common(1)[0][1] / len(words)
        scores['repetition'] = 1.0 - min(most_common_ratio * 5, 1.0)
    else:
        scores['repetition'] = 0.0

    # Special character ratio
    alpha_chars = sum(1 for c in text if c.isalpha())
    total_chars = len(text)
    scores['alpha_ratio'] = alpha_chars / total_chars if total_chars > 0 else 0

    # Average score
    return sum(scores.values()) / len(scores)

def filter_documents(documents, threshold=0.6):
    """Keep only high-quality documents."""
    return [doc for doc in documents if quality_score(doc['text']) >= threshold]
```
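To get a feel for how these heuristics separate good text from spam, here is the repetition check alone (the other sub-scores and the langdetect dependency are dropped so the snippet runs standalone) applied to natural prose versus keyword stuffing:

```python
from collections import Counter

def repetition_score(text):
    # Same heuristic as above: penalize text where one token dominates.
    words = text.lower().split()
    if not words:
        return 0.0
    most_common_ratio = Counter(words).most_common(1)[0][1] / len(words)
    return 1.0 - min(most_common_ratio * 5, 1.0)

natural = "the quick brown fox jumps over a lazy dog near the river bank"
spammy = "buy now buy now buy now buy now buy now"

print(repetition_score(natural))  # ~0.23: "the" appears twice in 13 words
print(repetition_score(spammy))   # 0.0: "buy" is half of all tokens
```

The `* 5` multiplier means any token covering 20%+ of a document zeroes out this sub-score — aggressive on purpose, since SEO spam and scraped navigation text are exactly that repetitive.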
Step 4: Deduplication
Duplicate content destroys model quality. Use MinHash LSH for efficient dedup:
```python
from datasketch import MinHash, MinHashLSH

def create_minhash(text, num_perm=128):
    """Create MinHash signature for a document."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    # Use 5-grams for better duplicate detection
    for i in range(len(words) - 4):
        ngram = ' '.join(words[i:i+5])
        m.update(ngram.encode('utf-8'))
    return m

def deduplicate(documents, threshold=0.8):
    """Remove near-duplicate documents using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_docs = []
    for i, doc in enumerate(documents):
        mh = create_minhash(doc['text'])
        # Check for duplicates
        result = lsh.query(mh)
        if not result:
            lsh.insert(str(i), mh)
            unique_docs.append(doc)
    return unique_docs
```
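MinHash approximates the Jaccard similarity between two documents' shingle sets; computing it exactly with the stdlib makes the 0.8 threshold concrete. This is fine for spot-checking pairs, but exact pairwise comparison is O(n²) over the corpus — which is precisely the cost LSH avoids:

```python
def shingles(text, n=5):
    """Set of word n-grams (5-grams, as in create_minhash above)."""
    words = text.lower().split()
    return {' '.join(words[i:i+n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: |intersection| / |union| of shingle sets."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the cat sat on the mat and looked at the moon all night long"
doc2 = "the cat sat on the mat and looked at the moon all night"
doc3 = "completely unrelated text about web scraping pipelines and data quality"

print(jaccard(doc1, doc2))  # 0.9 — near-duplicate, caught at threshold 0.8
print(jaccard(doc1, doc3))  # 0.0 — no shared 5-grams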
Step 5: PII Removal
Training on personal data is a legal and ethical minefield:
```python
import re

def remove_pii(text):
    """Remove personally identifiable information."""
    # Email addresses
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    # Phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # Social Security Numbers
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # IP addresses
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '[IP]', text)
    # Credit card numbers
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CC]', text)
    return text
```
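Run against a synthetic example, the masking looks like this (two of the same substitutions are repeated here so the snippet is self-contained; all sample data is made up):

```python
import re

def mask_pii(text):
    # Email and phone patterns from remove_pii above
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    return text

sample = "Contact jane.doe@example.com or call 555-867-5309 for details."
print(mask_pii(sample))
# Contact [EMAIL] or call [PHONE] for details.
```

Regex masking catches formatted identifiers, not PII in general — names, addresses, and free-text details need NER-based tools on top of this.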
Legal and Ethical Considerations in 2026
This is the most important section. The legal landscape has shifted dramatically:
Copyright
- NYT v. OpenAI and similar lawsuits have established that scraping copyrighted content for training may require licensing
- Fair use is still debated — courts are split on whether AI training is transformative
- Opt-out mechanisms — Many sites now use ai.txt or robots.txt directives to block AI training crawlers
Best Practices
- Respect robots.txt — If a site blocks your crawler, don't crawl it
- Check ai.txt — The emerging standard for AI training opt-out
- Avoid copyrighted content — News articles, books, academic papers behind paywalls
- Focus on permissive sources — Wikipedia, government data, open-source docs, Creative Commons content
- Document your sources — Keep provenance records for every document
- PII removal — Mandatory, not optional
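Respecting robots.txt is easy to automate with the standard library's `urllib.robotparser`. A minimal pre-crawl check (the bot name, URLs, and robots.txt content below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# RobotFileParser.read() can fetch robots.txt over HTTP; here we parse
# inline text so the example stays offline.
robots_txt = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Our hypothetical training crawler is blocked entirely...
print(rp.can_fetch("ExampleTrainingBot", "https://example.com/articles/"))  # False
# ...while a generic agent may fetch public pages, but not /private/
print(rp.can_fetch("GenericBot", "https://example.com/articles/"))   # True
print(rp.can_fetch("GenericBot", "https://example.com/private/x"))   # False
```

Run this check per-domain before fetching, and cache the parsed result — re-downloading robots.txt for every page is its own form of rudeness.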
Safe Data Sources
- Wikipedia — CC BY-SA license, massive multilingual corpus
- Common Crawl — Freely available web crawl archive (note the underlying pages keep their original copyrights)
- Government sites — US federal works are public domain by default
- Stack Overflow — CC BY-SA (with attribution requirements)
- arXiv — Open access research papers
- GitHub — Open source code (check licenses)
- Creative Commons content
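"Document your sources" can be as simple as attaching a provenance record to every document at collection time. A minimal sketch — the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Provenance:
    """Minimal provenance record for one collected document."""
    url: str
    license: str           # e.g. "CC BY-SA 4.0", "public domain"
    collected_at: str      # ISO 8601 timestamp
    robots_allowed: bool   # result of the robots.txt check at crawl time

record = Provenance(
    url="https://en.wikipedia.org/wiki/Artificial_intelligence",
    license="CC BY-SA 4.0",
    collected_at=datetime.now(timezone.utc).isoformat(),
    robots_allowed=True,
)
print(asdict(record)["license"])  # CC BY-SA 4.0
```

Stored alongside each text, records like this let you later remove everything from a source that changes its terms — without reprocessing the whole corpus.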
The WebPerception Advantage for Data Collection
For targeted, high-quality data collection:
```python
from datetime import datetime

# Collect data from a curated list of sources
sources = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://docs.python.org/3/tutorial/",
    # ... curated list of high-quality, permissive sources
]

dataset = []
for url in sources:
    result = extract_clean_text(url)  # Using WebPerception
    if quality_score(result['content']) > 0.7:
        clean_text = remove_pii(result['content'])
        dataset.append({
            "source": url,
            "text": clean_text,
            "title": result['title'],
            "collected_at": datetime.now().isoformat()
        })
```
WebPerception handles JavaScript rendering and anti-bot evasion, so you get cleaner text from more sources — including SPAs, documentation sites, and dynamic content that basic crawlers miss.
Conclusion
Web scraping for LLM training data in 2026 requires:
- Quality over quantity — Better to train on 1B clean tokens than 100B dirty ones
- Legal compliance — The regulatory landscape is evolving fast
- Technical rigor — Dedup, quality filtering, and PII removal are non-negotiable
- The right tools — DIY crawlers for scale, APIs like WebPerception for quality
Whether you're fine-tuning a model or building a RAG pipeline, clean web data is the foundation. Get it right, and everything downstream improves.
Need clean web data fast? Try the WebPerception API — 100 free API calls/month.