Web Scraping for RAG: Build a Knowledge Base from Any Website
Retrieval-Augmented Generation (RAG) is only as good as the data you feed it. Most RAG tutorials use static PDFs or pre-loaded documents, but the real world lives on the web. Product pages change daily. Documentation updates weekly. Competitor pricing shifts hourly.
This guide shows you how to build a live RAG knowledge base by scraping any website, chunking the content, generating embeddings, and retrieving relevant context for your LLM. No stale data. No manual uploads.
Why Web Scraping + RAG?
Traditional RAG pipelines have a freshness problem:
- Static documents go stale: your knowledge base reflects last month's reality
- Manual updates don't scale: someone has to re-upload files every time content changes
- The web has the answers: product docs, pricing pages, support articles, and competitor sites are all live HTML
By combining web scraping with RAG, you get:
- Always-fresh data: scrape on demand or on a schedule
- Any source: if it has a URL, you can build a knowledge base from it
- Structured extraction: pull clean text from messy HTML automatically
- Scale: scrape 10 pages or 10,000 with the same pipeline
Architecture Overview
Here's the full pipeline:
URLs → Scrape (WebPerception API) → Clean Text → Chunk → Embed → Vector Store → Query → LLM Answer
Each step:
Scrape: Fetch web pages and extract clean text/markdown
Chunk: Split content into retrieval-friendly segments
Embed: Convert chunks to vector embeddings
Store: Save embeddings in a vector database
Retrieve: Find relevant chunks for a user query
Generate: Pass context + query to an LLM for an answer
Prerequisites
Install the required packages:
```bash
pip install requests openai chromadb tiktoken
```
You'll need:
- A WebPerception API key from mantisapi.com (free tier: 100 calls/month)
- An OpenAI API key for embeddings and generation
Step 1: Scrape Web Pages
First, let's scrape content from any URL using the WebPerception API:
```python
import requests
from typing import List, Dict

MANTIS_API_KEY = "your_mantis_api_key"
BASE_URL = "https://api.mantisapi.com"

def scrape_page(url: str) -> Dict:
    """Scrape a single page and return clean text."""
    response = requests.post(
        f"{BASE_URL}/v1/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": url,
            "format": "markdown",
            "wait_for": "networkidle"
        }
    )
    response.raise_for_status()
    data = response.json()
    return {
        "url": url,
        "title": data.get("metadata", {}).get("title", ""),
        "content": data.get("content", ""),
        "scraped_at": data.get("timestamp", "")
    }

def scrape_urls(urls: List[str]) -> List[Dict]:
    """Scrape multiple pages."""
    results = []
    for url in urls:
        try:
            page = scrape_page(url)
            if page["content"]:
                results.append(page)
                print(f"✅ Scraped: {page['title'][:60]}")
        except Exception as e:
            print(f"❌ Failed: {url} - {e}")
    return results
```
Step 2: Chunk the Content
Raw scraped text is too long for embedding models and too coarse for precise retrieval. We need to split it into overlapping chunks:
```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks by token count."""
    encoder = tiktoken.encoding_for_model("text-embedding-3-small")
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # Final window reached the end; avoid an overlap-only tail chunk
        start = end - overlap  # Overlap for context continuity
    return chunks

def chunk_pages(pages: List[Dict]) -> List[Dict]:
    """Chunk all scraped pages with metadata."""
    all_chunks = []
    for page in pages:
        chunks = chunk_text(page["content"])
        for i, chunk in enumerate(chunks):
            all_chunks.append({
                "text": chunk,
                "url": page["url"],
                "title": page["title"],
                "chunk_index": i,
                "total_chunks": len(chunks)
            })
    print(f"📦 Created {len(all_chunks)} chunks from {len(pages)} pages")
    return all_chunks
```
Step 3: Generate Embeddings and Store
Now embed the chunks and store them in ChromaDB:
```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./knowledge_base")
collection = chroma_client.get_or_create_collection(
    name="web_knowledge",
    metadata={"hnsw:space": "cosine"}
)

def embed_and_store(chunks: List[Dict]):
    """Embed chunks and store in ChromaDB."""
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        # Generate embeddings
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        embeddings = [e.embedding for e in response.data]
        # Store in ChromaDB
        collection.add(
            ids=[f"chunk_{i + j}" for j in range(len(batch))],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                "url": c["url"],
                "title": c["title"],
                "chunk_index": c["chunk_index"]
            } for c in batch]
        )
    print(f"💾 Stored {len(chunks)} chunks in ChromaDB")
```
Step 4: Query the Knowledge Base
Retrieve relevant chunks and generate answers:
```python
def query_knowledge_base(question: str, n_results: int = 5) -> str:
    """Query the knowledge base and generate an answer."""
    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question]
    ).data[0].embedding
    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=n_results
    )
    # Build context from retrieved chunks
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_parts.append(f"[Source: {meta['title']}]\n{doc}")
    context = "\n\n---\n\n".join(context_parts)
    # Generate answer with GPT-4o
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer the question using ONLY the provided context. "
                "Cite sources by title. If the context doesn't contain "
                "the answer, say so."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content
```
Step 5: Put It All Together
Here's the complete pipeline:
```python
# 1. Define your sources
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
    "https://docs.example.com/faq",
    "https://docs.example.com/pricing",
]

# 2. Scrape
pages = scrape_urls(urls)

# 3. Chunk
chunks = chunk_pages(pages)

# 4. Embed and store
embed_and_store(chunks)

# 5. Query
answer = query_knowledge_base("How do I authenticate with the API?")
print(answer)
```
Output:
```
Based on the documentation, you authenticate with the API using a Bearer token
in the Authorization header. You can generate an API key from the dashboard
under Settings > API Keys. [Source: API Reference]
```
Keeping Your Knowledge Base Fresh
The real power is automated refresh. Set up a scraping schedule:
```python
import time
from datetime import datetime

def refresh_knowledge_base(urls: List[str], interval_hours: int = 24):
    """Periodically re-scrape and update the knowledge base."""
    while True:
        print(f"\n🔄 Refreshing knowledge base - {datetime.now()}")
        # Scrape fresh content
        pages = scrape_urls(urls)
        chunks = chunk_pages(pages)
        # Clear old data and re-index. ChromaDB's delete requires explicit
        # ids or a filter, so fetch the existing ids first.
        existing = collection.get()
        if existing["ids"]:
            collection.delete(ids=existing["ids"])
        embed_and_store(chunks)
        print(f"✅ Knowledge base updated with {len(chunks)} chunks")
        time.sleep(interval_hours * 3600)
```
For production, use a proper scheduler (cron, Celery, or an AI agent framework) instead of time.sleep.
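As a halfway house between a bare `while True`/`time.sleep` loop and a full scheduler, the standard library's `sched` module can drive periodic refreshes in-process. This is a minimal sketch; `run_periodically` is a hypothetical helper, and the tiny interval and fixed iteration count are only for demonstration:

```python
import sched
import time

def run_periodically(job, interval_seconds: float, iterations: int):
    """Run `job` every `interval_seconds`, a fixed number of times."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)

    def tick(remaining):
        job()
        if remaining > 1:
            # Re-arm the timer for the next run
            scheduler.enter(interval_seconds, 1, tick, (remaining - 1,))

    scheduler.enter(0, 1, tick, (iterations,))
    scheduler.run()

# Demo: "refresh" twice with a near-zero interval
calls = []
run_periodically(lambda: calls.append("refreshed"), interval_seconds=0.01, iterations=2)
print(calls)  # ['refreshed', 'refreshed']
```

In a real deployment you would pass `refresh_knowledge_base`'s body (minus its own loop) as the job, or better, trigger it from cron or Celery beat so a crash doesn't silently stop refreshes.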
Advanced: AI-Powered Extraction
Instead of scraping raw text, use the WebPerception API's AI extraction to get structured data:
```python
def extract_structured(url: str, prompt: str) -> Dict:
    """Use AI to extract specific information from a page."""
    response = requests.post(
        f"{BASE_URL}/v1/extract",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": url,
            "prompt": prompt,
            "schema": {
                "type": "object",
                "properties": {
                    "main_topics": {"type": "array", "items": {"type": "string"}},
                    "key_facts": {"type": "array", "items": {"type": "string"}},
                    "summary": {"type": "string"}
                }
            }
        }
    )
    response.raise_for_status()
    return response.json()["data"]

# Extract structured knowledge from a page
knowledge = extract_structured(
    "https://docs.example.com/api-reference",
    "Extract the main API endpoints, authentication methods, and rate limits"
)
```
This gives you pre-structured knowledge that's already chunked and organized, which is perfect for high-precision RAG.
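Assuming the response follows the schema above (`main_topics`, `key_facts`, `summary`), the structured result can be flattened straight into chunk records for the embedding step. `structured_to_chunks` and the `kind` field are illustrative names, not part of any API:

```python
from typing import Dict, List

def structured_to_chunks(url: str, data: Dict) -> List[Dict]:
    """Flatten an extraction result into chunk records ready for embedding."""
    chunks = []
    if data.get("summary"):
        chunks.append({"text": data["summary"], "url": url, "kind": "summary"})
    for fact in data.get("key_facts", []):
        chunks.append({"text": fact, "url": url, "kind": "fact"})
    for topic in data.get("main_topics", []):
        chunks.append({"text": f"Topic covered: {topic}", "url": url, "kind": "topic"})
    return chunks

sample = {
    "main_topics": ["authentication", "rate limits"],
    "key_facts": ["Auth uses Bearer tokens."],
    "summary": "Overview of the API.",
}
chunks = structured_to_chunks("https://docs.example.com/api-reference", sample)
print(len(chunks))  # 4
```

Each record already carries its source URL, so it slots into the same metadata scheme used by `chunk_pages`.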
Production Tips
Chunk size matters. Start with 500 tokens, overlap 50. Smaller chunks (200-300) work better for precise Q&A. Larger chunks (800-1000) capture more context for summarization.
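To sanity-check those numbers: the sliding window from Step 2 advances by `chunk_size - overlap` tokens per step, so the chunk count per page is easy to estimate. A sketch, assuming the loop stops once a window reaches the end of the text:

```python
import math

def num_chunks(total_tokens: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Estimate how many chunks the sliding window produces."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # tokens the window advances each step
    return math.ceil((total_tokens - chunk_size) / stride) + 1

print(num_chunks(2000))                               # 5
print(num_chunks(2000, chunk_size=250, overlap=50))   # 10
```

Halving the chunk size roughly doubles the chunk count, and therefore the embedding calls and storage, so tune it against your retrieval quality, not just cost.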
Deduplicate. If you scrape overlapping pages (e.g., nav menus appear on every page), use content hashing to avoid duplicate chunks.
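A minimal content-hashing sketch for that dedup step (`dedupe_chunks` is a hypothetical helper; it hashes normalized text so the same nav menu scraped from different pages collapses to one chunk):

```python
import hashlib
from typing import Dict, List

def dedupe_chunks(chunks: List[Dict]) -> List[Dict]:
    """Keep only the first chunk for each distinct normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

sample_chunks = [
    {"text": "Home | Docs | Pricing", "url": "https://docs.example.com/a"},
    {"text": "Getting started guide...", "url": "https://docs.example.com/a"},
    {"text": "Home | Docs | Pricing", "url": "https://docs.example.com/b"},
]
print(len(dedupe_chunks(sample_chunks)))  # 2
```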
Metadata enrichment. Store the source URL, scrape timestamp, and page title with every chunk. This enables source attribution and freshness filtering.
Hybrid search. Combine vector similarity with keyword search (BM25) for better retrieval. ChromaDB's where_document filter (e.g., $contains) gives basic keyword matching; for true BM25 scoring, pair the vector store with a separate lexical index.
Incremental updates. Don't re-scrape everything every time. Track content hashes and only re-embed pages that changed.
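One way to track those hashes is a small local state file mapping URL to content hash. This sketch assumes pages shaped like the `scrape_page` output (`url`, `content`); `changed_pages` and `content_hashes.json` are hypothetical names, and real systems might store the hashes in the vector store's metadata instead:

```python
import hashlib
import json
from pathlib import Path

HASH_FILE = Path("content_hashes.json")  # hypothetical local state file

def changed_pages(pages: list) -> list:
    """Return only pages whose content changed since last run; persist hashes."""
    old = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    new_hashes = dict(old)
    fresh = []
    for page in pages:
        digest = hashlib.sha256(page["content"].encode("utf-8")).hexdigest()
        if old.get(page["url"]) != digest:
            fresh.append(page)  # new or modified since last run
        new_hashes[page["url"]] = digest
    HASH_FILE.write_text(json.dumps(new_hashes))
    return fresh
```

On a refresh, pass `scrape_urls(urls)` through `changed_pages` and only chunk and embed what it returns; unchanged pages keep their existing embeddings.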
Cost Breakdown
For a 100-page knowledge base:
| Step | Cost |
|------|------|
| Scraping (100 pages) | Free tier or ~$0.50 |
| Embeddings (text-embedding-3-small) | ~$0.02 |
| Storage (ChromaDB local) | Free |
| Queries (per question) | ~$0.001 embedding + LLM cost |
Total setup cost: under $1. Refresh costs the same minus storage.
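The embedding figure is easy to verify. A back-of-envelope check, assuming text-embedding-3-small at $0.02 per million tokens (its published price at the time of writing; check current pricing) and roughly 10,000 tokens per scraped page:

```python
PRICE_PER_MILLION_TOKENS = 0.02   # text-embedding-3-small, assumed price
pages = 100
tokens_per_page = 10_000          # assumption; long doc pages vary widely

total_tokens = pages * tokens_per_page
cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"${cost:.2f}")  # $0.02
```

Shorter pages bring this down further; even at 10x the token count the embedding bill stays well under a dollar.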
What You Can Build
- Customer support bot: RAG over your help docs, always up to date
- Competitive intelligence: a knowledge base of competitor features and pricing
- Research assistant: scrape academic sources and answer complex questions
- Internal wiki search: better search over your company's scattered documentation
- Product comparison tool: scrape review sites and help users choose
Next Steps
- Get your free API key: 100 calls/month on the free tier
- WebPerception API Quickstart: full API documentation
- AI Agent Tools Integration Guide: connect to LangChain, CrewAI, and more
Web scraping + RAG turns the entire internet into your knowledge base. Stop uploading PDFs manually; let your agents scrape, chunk, and learn from live data.