Web Scraping for RAG: Build a Knowledge Base from Any Website
Retrieval-Augmented Generation (RAG) is only as good as the data you feed it. Most RAG tutorials use static PDFs or pre-loaded documents, but the real world lives on the web. Product pages change daily. Documentation updates weekly. Competitor pricing shifts hourly.
This guide shows you how to build a live RAG knowledge base by scraping any website, chunking the content, generating embeddings, and retrieving relevant context for your LLM. No stale data. No manual uploads.
Why Web Scraping + RAG?
Traditional RAG pipelines have a freshness problem:
- Static documents go stale: your knowledge base reflects last month's reality
- Manual updates don't scale: someone has to re-upload files every time content changes
- The web has the answers: product docs, pricing pages, support articles, and competitor sites are all live HTML
By combining web scraping with RAG, you get:
- Always-fresh data: scrape on demand or on a schedule
- Any source: if it has a URL, you can build a knowledge base from it
- Structured extraction: pull clean text from messy HTML automatically
- Scale: scrape 10 pages or 10,000 with the same pipeline
Architecture Overview
Here's the full pipeline:
URLs → Scrape (WebPerception API) → Clean Text → Chunk → Embed → Vector Store → Query → LLM Answer
Each step:
Scrape: Fetch web pages and extract clean text/markdown
Chunk: Split content into retrieval-friendly segments
Embed: Convert chunks to vector embeddings
Store: Save embeddings in a vector database
Retrieve: Find relevant chunks for a user query
Generate: Pass context + query to an LLM for an answer
Prerequisites
Install the required packages:
```bash
pip install requests openai chromadb tiktoken
```
You'll need:
- A WebPerception API key from mantisapi.com (free tier: 100 calls/month)
- An OpenAI API key for embeddings and generation
Step 1: Scrape Web Pages
First, let's scrape content from any URL using the WebPerception API:
```python
import requests
from typing import List, Dict

MANTIS_API_KEY = "your_mantis_api_key"
BASE_URL = "https://api.mantisapi.com"

def scrape_page(url: str) -> Dict:
    """Scrape a single page and return clean text."""
    response = requests.post(
        f"{BASE_URL}/v1/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": url,
            "format": "markdown",
            "wait_for": "networkidle"
        }
    )
    response.raise_for_status()
    data = response.json()
    return {
        "url": url,
        "title": data.get("metadata", {}).get("title", ""),
        "content": data.get("content", ""),
        "scraped_at": data.get("timestamp", "")
    }

def scrape_urls(urls: List[str]) -> List[Dict]:
    """Scrape multiple pages."""
    results = []
    for url in urls:
        try:
            page = scrape_page(url)
            if page["content"]:
                results.append(page)
                print(f"✅ Scraped: {page['title'][:60]}")
        except Exception as e:
            print(f"❌ Failed: {url} - {e}")
    return results
```
Step 2: Chunk the Content
Raw scraped text is too long for embedding models and too coarse for precise retrieval. We need to split it into overlapping chunks:
```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks by token count."""
    encoder = tiktoken.encoding_for_model("text-embedding-3-small")
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # Final window reached the end; avoid an overlap-only tail chunk
        start = end - overlap  # Overlap for context continuity
    return chunks

def chunk_pages(pages: List[Dict]) -> List[Dict]:
    """Chunk all scraped pages with metadata."""
    all_chunks = []
    for page in pages:
        chunks = chunk_text(page["content"])
        for i, chunk in enumerate(chunks):
            all_chunks.append({
                "text": chunk,
                "url": page["url"],
                "title": page["title"],
                "chunk_index": i,
                "total_chunks": len(chunks)
            })
    print(f"📦 Created {len(all_chunks)} chunks from {len(pages)} pages")
    return all_chunks
```
Step 3: Generate Embeddings and Store
Now embed the chunks and store them in ChromaDB:
```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./knowledge_base")
collection = chroma_client.get_or_create_collection(
    name="web_knowledge",
    metadata={"hnsw:space": "cosine"}
)

def embed_and_store(chunks: List[Dict]):
    """Embed chunks and store in ChromaDB."""
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        # Generate embeddings
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        embeddings = [e.embedding for e in response.data]
        # Store in ChromaDB
        collection.add(
            ids=[f"chunk_{i + j}" for j in range(len(batch))],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                "url": c["url"],
                "title": c["title"],
                "chunk_index": c["chunk_index"]
            } for c in batch]
        )
    print(f"💾 Stored {len(chunks)} chunks in ChromaDB")
```
Step 4: Query the Knowledge Base
Retrieve relevant chunks and generate answers:
```python
def query_knowledge_base(question: str, n_results: int = 5) -> str:
    """Query the knowledge base and generate an answer."""
    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question]
    ).data[0].embedding
    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=n_results
    )
    # Build context from retrieved chunks
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_parts.append(f"[Source: {meta['title']}]\n{doc}")
    context = "\n\n---\n\n".join(context_parts)
    # Generate answer with GPT-4o
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer the question using ONLY the provided context. "
                "Cite sources by title. If the context doesn't contain "
                "the answer, say so."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content
```
Step 5: Put It All Together
Here's the complete pipeline:
```python
# 1. Define your sources
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
    "https://docs.example.com/faq",
    "https://docs.example.com/pricing",
]

# 2. Scrape
pages = scrape_urls(urls)

# 3. Chunk
chunks = chunk_pages(pages)

# 4. Embed and store
embed_and_store(chunks)

# 5. Query
answer = query_knowledge_base("How do I authenticate with the API?")
print(answer)
```
Output:
```
Based on the documentation, you authenticate with the API using a Bearer token
in the Authorization header. You can generate an API key from the dashboard
under Settings > API Keys. [Source: API Reference]
```
Keeping Your Knowledge Base Fresh
The real power is automated refresh. Set up a scraping schedule:
```python
import time
from datetime import datetime

def refresh_knowledge_base(urls: List[str], interval_hours: int = 24):
    """Periodically re-scrape and update the knowledge base."""
    while True:
        print(f"\n🔄 Refreshing knowledge base - {datetime.now()}")
        # Scrape fresh content
        pages = scrape_urls(urls)
        chunks = chunk_pages(pages)
        # Clear old data and re-index. ChromaDB's delete requires explicit
        # ids or a filter, so fetch the existing ids first.
        existing = collection.get()
        if existing["ids"]:
            collection.delete(ids=existing["ids"])
        embed_and_store(chunks)
        print(f"✅ Knowledge base updated with {len(chunks)} chunks")
        time.sleep(interval_hours * 3600)
```
For production, use a proper scheduler (cron, Celery, or an AI agent framework) instead of time.sleep.
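As a halfway house between a bare `while True`/`time.sleep` loop and a full scheduler, the standard library's `sched` module can drive periodic refreshes in-process. This is a minimal sketch; `run_periodically` is a hypothetical helper, and the tiny interval and fixed iteration count are only for demonstration:

```python
import sched
import time

def run_periodically(job, interval_seconds: float, iterations: int):
    """Run `job` every `interval_seconds`, a fixed number of times."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)

    def tick(remaining):
        job()
        if remaining > 1:
            # Re-arm the timer for the next run
            scheduler.enter(interval_seconds, 1, tick, (remaining - 1,))

    scheduler.enter(0, 1, tick, (iterations,))
    scheduler.run()

# Demo: "refresh" twice with a near-zero interval
calls = []
run_periodically(lambda: calls.append("refreshed"), interval_seconds=0.01, iterations=2)
print(calls)  # ['refreshed', 'refreshed']
```

In a real deployment you would pass `refresh_knowledge_base`'s body (minus its own loop) as the job, or better, trigger it from cron or Celery beat so a crash doesn't silently stop refreshes.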
Advanced: AI-Powered Extraction
Instead of scraping raw text, use the WebPerception API's AI extraction to get structured data:
```python
def extract_structured(url: str, prompt: str) -> Dict:
    """Use AI to extract specific information from a page."""
    response = requests.post(
        f"{BASE_URL}/v1/extract",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": url,
            "prompt": prompt,
            "schema": {
                "type": "object",
                "properties": {
                    "main_topics": {"type": "array", "items": {"type": "string"}},
                    "key_facts": {"type": "array", "items": {"type": "string"}},
                    "summary": {"type": "string"}
                }
            }
        }
    )
    response.raise_for_status()
    return response.json()["data"]

# Extract structured knowledge from a page
knowledge = extract_structured(
    "https://docs.example.com/api-reference",
    "Extract the main API endpoints, authentication methods, and rate limits"
)
```
This gives you pre-structured knowledge that's already chunked and organized, which is perfect for high-precision RAG.
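Assuming the response follows the schema above (`main_topics`, `key_facts`, `summary`), the structured result can be flattened straight into chunk records for the embedding step. `structured_to_chunks` and the `kind` field are illustrative names, not part of any API:

```python
from typing import Dict, List

def structured_to_chunks(url: str, data: Dict) -> List[Dict]:
    """Flatten an extraction result into chunk records ready for embedding."""
    chunks = []
    if data.get("summary"):
        chunks.append({"text": data["summary"], "url": url, "kind": "summary"})
    for fact in data.get("key_facts", []):
        chunks.append({"text": fact, "url": url, "kind": "fact"})
    for topic in data.get("main_topics", []):
        chunks.append({"text": f"Topic covered: {topic}", "url": url, "kind": "topic"})
    return chunks

sample = {
    "main_topics": ["authentication", "rate limits"],
    "key_facts": ["Auth uses Bearer tokens."],
    "summary": "Overview of the API.",
}
chunks = structured_to_chunks("https://docs.example.com/api-reference", sample)
print(len(chunks))  # 4
```

Each record already carries its source URL, so it slots into the same metadata scheme used by `chunk_pages`.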
Production Tips
Chunk size matters. Start with 500 tokens, overlap 50. Smaller chunks (200-300) work better for precise Q&A. Larger chunks (800-1000) capture more context for summarization.
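To sanity-check those numbers: the sliding window from Step 2 advances by `chunk_size - overlap` tokens per step, so the chunk count per page is easy to estimate. A sketch, assuming the loop stops once a window reaches the end of the text:

```python
import math

def num_chunks(total_tokens: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Estimate how many chunks the sliding window produces."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # tokens the window advances each step
    return math.ceil((total_tokens - chunk_size) / stride) + 1

print(num_chunks(2000))                               # 5
print(num_chunks(2000, chunk_size=250, overlap=50))   # 10
```

Halving the chunk size roughly doubles the chunk count, and therefore the embedding calls and storage, so tune it against your retrieval quality, not just cost.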
Deduplicate. If you scrape overlapping pages (e.g., nav menus appear on every page), use content hashing to avoid duplicate chunks.
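A minimal content-hashing sketch for that dedup step (`dedupe_chunks` is a hypothetical helper; it hashes normalized text so the same nav menu scraped from different pages collapses to one chunk):

```python
import hashlib
from typing import Dict, List

def dedupe_chunks(chunks: List[Dict]) -> List[Dict]:
    """Keep only the first chunk for each distinct normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

sample_chunks = [
    {"text": "Home | Docs | Pricing", "url": "https://docs.example.com/a"},
    {"text": "Getting started guide...", "url": "https://docs.example.com/a"},
    {"text": "Home | Docs | Pricing", "url": "https://docs.example.com/b"},
]
print(len(dedupe_chunks(sample_chunks)))  # 2
```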
Metadata enrichment. Store the source URL, scrape timestamp, and page title with every chunk. This enables source attribution and freshness filtering.
Hybrid search. Combine vector similarity with keyword search (BM25) for better retrieval. ChromaDB's where_document filter (e.g., $contains) gives basic keyword matching; for true BM25 scoring, pair the vector store with a separate lexical index.
Incremental updates. Don't re-scrape everything every time. Track content hashes and only re-embed pages that changed.
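One way to track those hashes is a small local state file mapping URL to content hash. This sketch assumes pages shaped like the `scrape_page` output (`url`, `content`); `changed_pages` and `content_hashes.json` are hypothetical names, and real systems might store the hashes in the vector store's metadata instead:

```python
import hashlib
import json
from pathlib import Path

HASH_FILE = Path("content_hashes.json")  # hypothetical local state file

def changed_pages(pages: list) -> list:
    """Return only pages whose content changed since last run; persist hashes."""
    old = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    new_hashes = dict(old)
    fresh = []
    for page in pages:
        digest = hashlib.sha256(page["content"].encode("utf-8")).hexdigest()
        if old.get(page["url"]) != digest:
            fresh.append(page)  # new or modified since last run
        new_hashes[page["url"]] = digest
    HASH_FILE.write_text(json.dumps(new_hashes))
    return fresh
```

On a refresh, pass `scrape_urls(urls)` through `changed_pages` and only chunk and embed what it returns; unchanged pages keep their existing embeddings.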
Cost Breakdown
For a 100-page knowledge base:
| Step | Cost |
|------|------|
| Scraping (100 pages) | Free tier or ~$0.50 |
| Embeddings (text-embedding-3-small) | ~$0.02 |
| Storage (ChromaDB local) | Free |
| Queries (per question) | ~$0.001 embedding + LLM cost |
Total setup cost: under $1. Refresh costs the same minus storage.
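The embedding figure is easy to verify. A back-of-envelope check, assuming text-embedding-3-small at $0.02 per million tokens (its published price at the time of writing; check current pricing) and roughly 10,000 tokens per scraped page:

```python
PRICE_PER_MILLION_TOKENS = 0.02   # text-embedding-3-small, assumed price
pages = 100
tokens_per_page = 10_000          # assumption; long doc pages vary widely

total_tokens = pages * tokens_per_page
cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"${cost:.2f}")  # $0.02
```

Shorter pages bring this down further; even at 10x the token count the embedding bill stays well under a dollar.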
What You Can Build
- Customer support bot: RAG over your help docs, always up to date
- Competitive intelligence: a knowledge base of competitor features and pricing
- Research assistant: scrape academic sources and answer complex questions
- Internal wiki search: better search over your company's scattered documentation
- Product comparison tool: scrape review sites and help users choose
Next Steps
- Get your free API key: 100 calls/month on the free tier
- WebPerception API Quickstart: full API documentation
- AI Agent Tools Integration Guide: connect to LangChain, CrewAI, and more
Web scraping + RAG turns the entire internet into your knowledge base. Stop uploading PDFs manually; let your agents scrape, chunk, and learn from live data.