| Use case | DIY Crawler | WebPerception |
|----------|-------------|---------------|
| Billion-page crawl | ✅ | ❌ (cost) |
| Curated domain list | ⚠️ | ✅ |
| JavaScript-heavy sites | ❌ | ✅ |
| Structured extraction | ❌ | ✅ |
| Quality over quantity | ⚠️ | ✅ |
| Real-time data feeds | ❌ | ✅ |
Step 2: Text Extraction and Cleaning
Raw HTML to clean text is harder than it sounds:
```python
from bs4 import BeautifulSoup
import re

def extract_text(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove non-content elements
    for tag in soup(['script', 'style', 'nav', 'header',
                     'footer', 'aside', 'form', 'button']):
        tag.decompose()

    # Get text
    text = soup.get_text(separator='\n')

    # Clean up
    lines = [line.strip() for line in text.splitlines()]
    text = '\n'.join(line for line in lines if line)

    # Remove boilerplate patterns
    text = remove_boilerplate(text)
    return text

def remove_boilerplate(text):
    """Remove common web boilerplate text."""
    patterns = [
        r'cookie.*?accept',
        r'subscribe.*?newsletter',
        r'all rights reserved',
        r'privacy policy.*?terms',
        r'©\s*\d{4}',
    ]
    for pattern in patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    return text
```
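If you can't take a bs4 dependency, the same skip-the-chrome idea can be sketched with the standard library's `html.parser`. This is illustrative, not a drop-in replacement — real-world HTML is messier than any small parser handles:

```python
from html.parser import HTMLParser

# Same non-content tags as the bs4 version above
SKIP_TAGS = {'script', 'style', 'nav', 'header',
             'footer', 'aside', 'form', 'button'}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skip-tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skip-tag
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    return '\n'.join(parser.chunks)
```

The depth counter matters: skip-tags can nest (a `<form>` inside a `<footer>`), and a simple boolean flag would re-enable text capture too early.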
Step 3: Quality Filtering
Not all web text is training-worthy. Filter aggressively:
```python
import langdetect
from collections import Counter

def quality_score(text):
    """Score text quality for LLM training (0-1)."""
    scores = {}

    # Length check
    word_count = len(text.split())
    scores['length'] = min(word_count / 500, 1.0)  # Prefer 500+ words

    # Language consistency
    try:
        lang = langdetect.detect(text)
        scores['language'] = 1.0 if lang == 'en' else 0.0
    except Exception:  # langdetect raises on empty or undetectable text
        scores['language'] = 0.0

    # Repetition check
    words = text.lower().split()
    if words:
        word_freq = Counter(words)
        most_common_ratio = word_freq.most_common(1)[0][1] / len(words)
        scores['repetition'] = 1.0 - min(most_common_ratio * 5, 1.0)
    else:
        scores['repetition'] = 0.0

    # Special character ratio
    alpha_chars = sum(1 for c in text if c.isalpha())
    total_chars = len(text)
    scores['alpha_ratio'] = alpha_chars / total_chars if total_chars > 0 else 0

    # Average score
    return sum(scores.values()) / len(scores)

def filter_documents(documents, threshold=0.6):
    """Keep only high-quality documents."""
    return [doc for doc in documents if quality_score(doc['text']) >= threshold]
```
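To get a feel for how these heuristics separate good text from spam, here is the repetition check alone (the other sub-scores and the langdetect dependency are dropped so the snippet runs standalone) applied to natural prose versus keyword stuffing:

```python
from collections import Counter

def repetition_score(text):
    # Same heuristic as above: penalize text where one token dominates.
    words = text.lower().split()
    if not words:
        return 0.0
    most_common_ratio = Counter(words).most_common(1)[0][1] / len(words)
    return 1.0 - min(most_common_ratio * 5, 1.0)

natural = "the quick brown fox jumps over a lazy dog near the river bank"
spammy = "buy now buy now buy now buy now buy now"

print(repetition_score(natural))  # ~0.23: "the" appears twice in 13 words
print(repetition_score(spammy))   # 0.0: "buy" is half of all tokens
```

The `* 5` multiplier means any token covering 20%+ of a document zeroes out this sub-score — aggressive on purpose, since SEO spam and scraped navigation text are exactly that repetitive.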
Step 4: Deduplication
Duplicate content destroys model quality. Use MinHash LSH for efficient dedup:
```python
from datasketch import MinHash, MinHashLSH

def create_minhash(text, num_perm=128):
    """Create MinHash signature for a document."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    # Use 5-grams for better duplicate detection
    for i in range(len(words) - 4):
        ngram = ' '.join(words[i:i+5])
        m.update(ngram.encode('utf-8'))
    return m

def deduplicate(documents, threshold=0.8):
    """Remove near-duplicate documents using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_docs = []
    for i, doc in enumerate(documents):
        mh = create_minhash(doc['text'])
        # Check for duplicates
        result = lsh.query(mh)
        if not result:
            lsh.insert(str(i), mh)
            unique_docs.append(doc)
    return unique_docs
```
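MinHash approximates the Jaccard similarity between two documents' shingle sets; computing it exactly with the stdlib makes the 0.8 threshold concrete. This is fine for spot-checking pairs, but exact pairwise comparison is O(n²) over the corpus — which is precisely the cost LSH avoids:

```python
def shingles(text, n=5):
    """Set of word n-grams (5-grams, as in create_minhash above)."""
    words = text.lower().split()
    return {' '.join(words[i:i+n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: |intersection| / |union| of shingle sets."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the cat sat on the mat and looked at the moon all night long"
doc2 = "the cat sat on the mat and looked at the moon all night"
doc3 = "completely unrelated text about web scraping pipelines and data quality"

print(jaccard(doc1, doc2))  # 0.9 — near-duplicate, caught at threshold 0.8
print(jaccard(doc1, doc3))  # 0.0 — no shared 5-grams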
Step 5: PII Removal
Training on personal data is a legal and ethical minefield:
```python
import re

def remove_pii(text):
    """Remove personally identifiable information."""
    # Email addresses
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    # Phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # Social Security Numbers
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # IP addresses
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '[IP]', text)
    # Credit card numbers
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CC]', text)
    return text
```
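Run against a synthetic example, the masking looks like this (two of the same substitutions are repeated here so the snippet is self-contained; all sample data is made up):

```python
import re

def mask_pii(text):
    # Email and phone patterns from remove_pii above
    text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    return text

sample = "Contact jane.doe@example.com or call 555-867-5309 for details."
print(mask_pii(sample))
# Contact [EMAIL] or call [PHONE] for details.
```

Regex masking catches formatted identifiers, not PII in general — names, addresses, and free-text details need NER-based tools on top of this.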
Legal and Ethical Considerations in 2026
This is the most important section. The legal landscape has shifted dramatically:
Copyright
- NYT v. OpenAI and similar lawsuits have established that scraping copyrighted content for training may require licensing
- Fair use is still debated — courts are split on whether AI training is transformative
- Opt-out mechanisms — Many sites now use ai.txt or robots.txt directives to block AI training crawlers
Best Practices
- Respect robots.txt — If a site blocks your crawler, don't crawl it
- Check ai.txt — The emerging standard for AI training opt-out
- Avoid copyrighted content — News articles, books, academic papers behind paywalls
- Focus on permissive sources — Wikipedia, government data, open-source docs, Creative Commons content
- Document your sources — Keep provenance records for every document
- PII removal — Mandatory, not optional
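Respecting robots.txt is easy to automate with the standard library's `urllib.robotparser`. A minimal pre-crawl check (the bot name, URLs, and robots.txt content below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# RobotFileParser.read() can fetch robots.txt over HTTP; here we parse
# inline text so the example stays offline.
robots_txt = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Our hypothetical training crawler is blocked entirely...
print(rp.can_fetch("ExampleTrainingBot", "https://example.com/articles/"))  # False
# ...while a generic agent may fetch public pages, but not /private/
print(rp.can_fetch("GenericBot", "https://example.com/articles/"))   # True
print(rp.can_fetch("GenericBot", "https://example.com/private/x"))   # False
```

Run this check per-domain before fetching, and cache the parsed result — re-downloading robots.txt for every page is its own form of rudeness.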
Safe Data Sources
- Wikipedia — CC BY-SA license, massive multilingual corpus
- Common Crawl — Freely available web crawl archive (note the underlying pages keep their original copyrights)
- Government sites — US federal works are public domain by default
- Stack Overflow — CC BY-SA (with attribution requirements)
- arXiv — Open access research papers
- GitHub — Open source code (check licenses)
- Creative Commons content
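"Document your sources" can be as simple as attaching a provenance record to every document at collection time. A minimal sketch — the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Provenance:
    """Minimal provenance record for one collected document."""
    url: str
    license: str           # e.g. "CC BY-SA 4.0", "public domain"
    collected_at: str      # ISO 8601 timestamp
    robots_allowed: bool   # result of the robots.txt check at crawl time

record = Provenance(
    url="https://en.wikipedia.org/wiki/Artificial_intelligence",
    license="CC BY-SA 4.0",
    collected_at=datetime.now(timezone.utc).isoformat(),
    robots_allowed=True,
)
print(asdict(record)["license"])  # CC BY-SA 4.0
```

Stored alongside each text, records like this let you later remove everything from a source that changes its terms — without reprocessing the whole corpus.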
The WebPerception Advantage for Data Collection
For targeted, high-quality data collection:
```python
from datetime import datetime

# Collect data from a curated list of sources
sources = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://docs.python.org/3/tutorial/",
    # ... curated list of high-quality, permissive sources
]

dataset = []
for url in sources:
    result = extract_clean_text(url)  # Using WebPerception
    if quality_score(result['content']) > 0.7:
        clean_text = remove_pii(result['content'])
        dataset.append({
            "source": url,
            "text": clean_text,
            "title": result['title'],
            "collected_at": datetime.now().isoformat()
        })
```
WebPerception handles JavaScript rendering and anti-bot evasion, so you get cleaner text from more sources — including SPAs, documentation sites, and dynamic content that basic crawlers miss.
Conclusion
Web scraping for LLM training data in 2026 requires:
- Quality over quantity — Better to train on 1B clean tokens than 100B dirty ones
- Legal compliance — The regulatory landscape is evolving fast
- Technical rigor — Dedup, quality filtering, and PII removal are non-negotiable
- The right tools — DIY crawlers for scale, APIs like WebPerception for quality
Whether you're fine-tuning a model or building a RAG pipeline, clean web data is the foundation. Get it right, and everything downstream improves.
Need clean web data fast? Try the WebPerception API — 100 free API calls/month.