Web Scraping for RAG: Build a Knowledge Base from Any Website

March 9, 2026 AI Agents

Retrieval-Augmented Generation (RAG) is only as good as the data you feed it. Most RAG tutorials use static PDFs or pre-loaded documents — but the real world lives on the web. Product pages change daily. Documentation updates weekly. Competitor pricing shifts hourly.

This guide shows you how to build a live RAG knowledge base by scraping any website, chunking the content, generating embeddings, and retrieving relevant context for your LLM. No stale data. No manual uploads.

Why Web Scraping + RAG?

Traditional RAG pipelines have a freshness problem: documents are ingested once, then drift out of date while the sources they came from keep changing.

By combining web scraping with RAG, you get a knowledge base that stays in sync with the live web: answers grounded in today's content, not last quarter's export.

Architecture Overview

Here's the full pipeline:

URLs → Scrape (WebPerception API) → Clean Text → Chunk → Embed → Vector Store → Query → LLM Answer

Each step:

Scrape — Fetch web pages and extract clean text/markdown

Chunk — Split content into retrieval-friendly segments

Embed — Convert chunks to vector embeddings

Store — Save embeddings in a vector database

Retrieve — Find relevant chunks for a user query

Generate — Pass context + query to an LLM for an answer

Prerequisites

Install the required packages:

pip install requests openai chromadb tiktoken

You'll need a Mantis API key (for the WebPerception scraping endpoints) and an OpenAI API key (for embeddings and answer generation).
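
Rather than hardcoding keys as in the snippets below, a small helper can read them from the environment and fail fast if one is missing (a sketch; the variable names are just a convention):

```python
import os

def load_key(name: str) -> str:
    """Read an API key from the environment, failing fast if it's missing."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Set {name} before running the pipeline")
    return value
```

Then `MANTIS_API_KEY = load_key("MANTIS_API_KEY")` replaces the hardcoded string.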

Step 1: Scrape Web Pages

First, let's scrape content from any URL using the WebPerception API:

import requests
from typing import List, Dict

MANTIS_API_KEY = "your_mantis_api_key"
BASE_URL = "https://api.mantisapi.com"

def scrape_page(url: str) -> Dict:
    """Scrape a single page and return clean text."""
    response = requests.post(
        f"{BASE_URL}/v1/scrape",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": url,
            "format": "markdown",
            "wait_for": "networkidle"
        }
    )
    response.raise_for_status()
    data = response.json()
    return {
        "url": url,
        "title": data.get("metadata", {}).get("title", ""),
        "content": data.get("content", ""),
        "scraped_at": data.get("timestamp", "")
    }

def scrape_urls(urls: List[str]) -> List[Dict]:
    """Scrape multiple pages."""
    results = []
    for url in urls:
        try:
            page = scrape_page(url)
            if page["content"]:
                results.append(page)
                print(f"āœ… Scraped: {page['title'][:60]}")
        except Exception as e:
            print(f"āŒ Failed: {url} — {e}")
    return results
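
Scrapes fail transiently (timeouts, rate limits, flaky pages), so it's worth wrapping scrape_page in a retry helper with exponential backoff. A minimal sketch:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

Inside the scrape_urls loop, `page = with_retries(lambda: scrape_page(url))` keeps one slow page from sinking the whole run.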

Step 2: Chunk the Content

Raw scraped text is too long for embedding models. We need to split it into overlapping chunks:

import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks by token count."""
    encoder = tiktoken.encoding_for_model("text-embedding-3-small")
    tokens = encoder.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_str = encoder.decode(chunk_tokens)  # avoid shadowing chunk_text()
        chunks.append(chunk_str)
        start = end - overlap  # Overlap for context continuity

    return chunks

def chunk_pages(pages: List[Dict]) -> List[Dict]:
    """Chunk all scraped pages with metadata."""
    all_chunks = []
    for page in pages:
        chunks = chunk_text(page["content"])
        for i, chunk in enumerate(chunks):
            all_chunks.append({
                "text": chunk,
                "url": page["url"],
                "title": page["title"],
                "chunk_index": i,
                "total_chunks": len(chunks)
            })
    print(f"šŸ“¦ Created {len(all_chunks)} chunks from {len(pages)} pages")
    return all_chunks

Step 3: Generate Embeddings and Store

Now embed the chunks and store them in ChromaDB:

import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./knowledge_base")
collection = chroma_client.get_or_create_collection(
    name="web_knowledge",
    metadata={"hnsw:space": "cosine"}
)

def embed_and_store(chunks: List[Dict]):
    """Embed chunks and store in ChromaDB."""
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]

        # Generate embeddings
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        embeddings = [e.embedding for e in response.data]

        # Store in ChromaDB
        collection.add(
            ids=[f"chunk_{i+j}" for j in range(len(batch))],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                "url": c["url"],
                "title": c["title"],
                "chunk_index": c["chunk_index"]
            } for c in batch]
        )

    print(f"šŸ’¾ Stored {len(chunks)} chunks in ChromaDB")

Step 4: Query the Knowledge Base

Retrieve relevant chunks and generate answers:

def query_knowledge_base(question: str, n_results: int = 5) -> str:
    """Query the knowledge base and generate an answer."""
    # Embed the question
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question]
    ).data[0].embedding

    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=n_results
    )

    # Build context from retrieved chunks
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_parts.append(f"[Source: {meta['title']}]\n{doc}")
    context = "\n\n---\n\n".join(context_parts)

    # Generate answer with GPT-4o
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer the question using ONLY the provided context. "
                "Cite sources by title. If the context doesn't contain "
                "the answer, say so."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    return response.choices[0].message.content

Step 5: Put It All Together

Here's the complete pipeline:

# 1. Define your sources
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
    "https://docs.example.com/faq",
    "https://docs.example.com/pricing",
]

# 2. Scrape
pages = scrape_urls(urls)

# 3. Chunk
chunks = chunk_pages(pages)

# 4. Embed and store
embed_and_store(chunks)

# 5. Query
answer = query_knowledge_base("How do I authenticate with the API?")
print(answer)

Output:

Based on the documentation, you authenticate with the API using a Bearer token
in the Authorization header. You can generate an API key from the dashboard
under Settings > API Keys. [Source: API Reference]

Keeping Your Knowledge Base Fresh

The real power is automated refresh. Set up a scraping schedule:

import time
from datetime import datetime

def refresh_knowledge_base(urls: List[str], interval_hours: int = 24):
    """Periodically re-scrape and update the knowledge base."""
    while True:
        print(f"\nšŸ”„ Refreshing knowledge base — {datetime.now()}")

        # Scrape fresh content
        pages = scrape_urls(urls)
        chunks = chunk_pages(pages)

        # Clear old data and re-index. ChromaDB rejects an empty where={},
        # so drop and recreate the collection instead.
        global collection
        chroma_client.delete_collection("web_knowledge")
        collection = chroma_client.get_or_create_collection(
            name="web_knowledge",
            metadata={"hnsw:space": "cosine"}
        )
        embed_and_store(chunks)

        print(f"āœ… Knowledge base updated with {len(chunks)} chunks")
        time.sleep(interval_hours * 3600)

For production, use a proper scheduler (cron, Celery, or an AI agent framework) instead of time.sleep.

Advanced: AI-Powered Extraction

Instead of scraping raw text, use the WebPerception API's AI extraction to get structured data:

def extract_structured(url: str, prompt: str) -> Dict:
    """Use AI to extract specific information from a page."""
    response = requests.post(
        f"{BASE_URL}/v1/extract",
        headers={"Authorization": f"Bearer {MANTIS_API_KEY}"},
        json={
            "url": url,
            "prompt": prompt,
            "schema": {
                "type": "object",
                "properties": {
                    "main_topics": {"type": "array", "items": {"type": "string"}},
                    "key_facts": {"type": "array", "items": {"type": "string"}},
                    "summary": {"type": "string"}
                }
            }
        }
    )
    return response.json()["data"]

# Extract structured knowledge from a page
knowledge = extract_structured(
    "https://docs.example.com/api-reference",
    "Extract the main API endpoints, authentication methods, and rate limits"
)

This gives you pre-structured knowledge that's already chunked and organized — perfect for high-precision RAG.
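
To feed these results into the same pipeline, the extraction output can be flattened into chunk dicts that embed_and_store already accepts (a sketch, assuming the main_topics/key_facts/summary schema above):

```python
from typing import Dict, List

def structured_to_chunks(url: str, data: Dict) -> List[Dict]:
    """Flatten an extraction result into chunk dicts for embed_and_store.

    Assumes the {main_topics, key_facts, summary} schema shown above.
    """
    texts = []
    if data.get("summary"):
        texts.append(data["summary"])
    texts.extend(data.get("key_facts", []))  # one chunk per fact
    title = ", ".join(data.get("main_topics", [])[:3])
    return [
        {"text": t, "url": url, "title": title,
         "chunk_index": i, "total_chunks": len(texts)}
        for i, t in enumerate(texts)
    ]
```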

Production Tips

Chunk size matters. Start with 500 tokens, overlap 50. Smaller chunks (200-300) work better for precise Q&A. Larger chunks (800-1000) capture more context for summarization.
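
When tuning these numbers, it helps to know how many chunks (and hence embeddings) a page will produce. Since the chunk_text loop above advances its window by chunk_size - overlap tokens per iteration, the count for non-empty text is:

```python
import math

def chunk_count(total_tokens: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Number of chunks the token-window loop produces for non-empty text."""
    step = chunk_size - overlap  # tokens the window advances each iteration
    return max(1, math.ceil(total_tokens / step))
```

With the defaults, a 10,000-token page yields 23 chunks; halving chunk_size roughly doubles your embedding volume.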

Deduplicate. If you scrape overlapping pages (e.g., nav menus appear on every page), use content hashing to avoid duplicate chunks.
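
A content-hash pass before embedding keeps repeated boilerplate out of the index; a minimal sketch:

```python
import hashlib
from typing import Dict, List

def dedupe_chunks(chunks: List[Dict]) -> List[Dict]:
    """Drop chunks whose text was already seen (e.g. shared nav or footer)."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].strip().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Call it between chunk_pages and embed_and_store so duplicates never cost an embedding.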

Metadata enrichment. Store the source URL, scrape timestamp, and page title with every chunk. This enables source attribution and freshness filtering.

Hybrid search. Combine vector similarity with keyword search (BM25) for better retrieval. ChromaDB's where_document filter gives you basic keyword matching via $contains; for true BM25 scoring, pair the vector store with a separate keyword index.
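
A lightweight stand-in for full BM25 fusion is to over-fetch from collection.query, then re-rank hits by query-term overlap before truncating (a sketch; the scoring weight is an assumption to tune):

```python
from typing import List

def keyword_rerank(query: str, docs: List[str], distances: List[float],
                   weight: float = 0.1) -> List[int]:
    """Return doc indices re-ranked by vector distance minus a keyword bonus.

    Lower score = better; a cheap stand-in for real BM25 fusion.
    """
    terms = set(query.lower().split())
    scores = []
    for doc, dist in zip(docs, distances):
        overlap = len(terms & set(doc.lower().split())) / max(len(terms), 1)
        scores.append(dist - weight * overlap)  # reward query-term matches
    return sorted(range(len(docs)), key=lambda i: scores[i])
```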

Incremental updates. Don't re-scrape everything every time. Track content hashes and only re-embed pages that changed.
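
Change tracking can be as simple as persisting a hash per URL between runs; a sketch (page_hashes.json is a hypothetical local state file):

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List

HASH_FILE = Path("page_hashes.json")  # hypothetical local state file

def changed_pages(pages: List[Dict]) -> List[Dict]:
    """Return only pages whose content hash differs from the previous run."""
    old = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    new, changed = {}, []
    for page in pages:
        digest = hashlib.sha256(page["content"].encode()).hexdigest()
        new[page["url"]] = digest
        if old.get(page["url"]) != digest:
            changed.append(page)
    HASH_FILE.write_text(json.dumps(new))  # persist state for the next run
    return changed
```

Feed only `changed_pages(pages)` into chunk_pages and embed_and_store to skip re-embedding unchanged content.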

Cost Breakdown

For a 100-page knowledge base:

| Step | Cost |
|------|------|
| Scraping (100 pages) | Free tier or ~$0.50 |
| Embeddings (text-embedding-3-small) | ~$0.02 |
| Storage (ChromaDB local) | Free |
| Queries (per question) | ~$0.001 embedding + LLM cost |

Total setup cost: under $1. Refresh costs the same minus storage.
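
The embeddings line assumes roughly 10k tokens per page at a per-million-token rate (the $0.02/1M figure for text-embedding-3-small is an assumption; check current OpenAI pricing). A quick sanity check:

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float = 0.02) -> float:
    """Estimate embedding cost at a per-1M-token rate (rate is an assumption)."""
    return total_tokens / 1_000_000 * price_per_million
```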

What You Can Build

Documentation Q&A assistants, competitor price monitors, support bots grounded in live product pages: the same pipeline applies anywhere the source of truth lives on the web.

Next Steps

Web scraping + RAG turns the entire internet into your knowledge base. Stop uploading PDFs manually — let your agents scrape, chunk, and learn from live data.

Ready to try Mantis?

100 free API calls/month. No credit card required.

Get Your API Key →