Build Web Scraping Plugins for Microsoft Semantic Kernel

March 9, 2026 · 12 min read · Semantic Kernel · C# · Python · AI Agents

Microsoft Semantic Kernel is the enterprise AI orchestration framework that powers Copilot, Microsoft 365, and thousands of enterprise apps. If you're building AI agents in the .NET or Python ecosystem, Semantic Kernel's plugin architecture is the standard way to give agents capabilities, and web scraping is one of the most powerful capabilities you can add.

In this tutorial, you'll build a complete WebScrapingPlugin for Semantic Kernel that gives your agents real-time web access through the Mantis API. We'll cover both C# and Python implementations, planner-driven workflows, and production patterns like dependency injection, retries, and cost optimization.

Why Semantic Kernel for Web Scraping Agents?

Semantic Kernel stands out for enterprise AI agent development:

- Type-safe plugins: tools are plain C# or Python methods with typed parameters and descriptions the model can read
- Automatic function calling: the model decides when and how to invoke your tools
- Planner support: multi-step workflows composed automatically from your plugins
- Enterprise plumbing: first-class dependency injection and configuration in .NET

Prerequisites

You'll need:

- The .NET 8 SDK (for the C# examples) or Python 3.10+ (for the Python examples)
- An OpenAI API key in the OPENAI_API_KEY environment variable
- A Mantis API key in the MANTIS_API_KEY environment variable

Building the WebScrapingPlugin in C#

Semantic Kernel plugins are classes decorated with [KernelFunction] attributes. Each function becomes a tool the AI agent can invoke.

Step 1: Install Dependencies

dotnet add package Microsoft.SemanticKernel
dotnet add package System.Net.Http.Json

Step 2: Create the Plugin

using System.ComponentModel;
using System.Net.Http.Json;
using Microsoft.SemanticKernel;

public class WebScrapingPlugin
{
    private readonly HttpClient _http;
    private readonly string _apiKey;

    public WebScrapingPlugin(string apiKey)
    {
        _apiKey = apiKey;
        _http = new HttpClient
        {
            BaseAddress = new Uri("https://api.mantisapi.com/v1/")
        };
        _http.DefaultRequestHeaders.Add("x-api-key", _apiKey);
    }

    [KernelFunction("scrape_url")]
    [Description("Scrape a webpage and return its content as clean text or markdown")]
    public async Task<string> ScrapeUrlAsync(
        [Description("The URL to scrape")] string url,
        [Description("Output format: text or markdown")] string format = "markdown")
    {
        var response = await _http.PostAsJsonAsync("scrape", new
        {
            url,
            format,
            wait_for = "networkidle"
        });
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<ScrapeResponse>();
        return result?.Content ?? "No content returned";
    }

    [KernelFunction("screenshot_url")]
    [Description("Take a screenshot of a webpage and return the image URL")]
    public async Task<string> ScreenshotUrlAsync(
        [Description("The URL to screenshot")] string url,
        [Description("Viewport width in pixels")] int width = 1280)
    {
        var response = await _http.PostAsJsonAsync("screenshot", new
        {
            url,
            viewport = new { width, height = 720 },
            format = "png"
        });
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<ScreenshotResponse>();
        return result?.Url ?? "Screenshot failed";
    }

    [KernelFunction("extract_data")]
    [Description("Extract structured data from a webpage using AI")]
    public async Task<string> ExtractDataAsync(
        [Description("The URL to extract data from")] string url,
        [Description("What data to extract (e.g., 'product names and prices')")] string prompt)
    {
        var response = await _http.PostAsJsonAsync("extract", new
        {
            url,
            prompt,
            wait_for = "networkidle"
        });
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<ExtractResponse>();
        return result?.Data ?? "No data extracted";
    }
}

record ScrapeResponse(string Content, string Url);
record ScreenshotResponse(string Url);
record ExtractResponse(string Data);
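The record types above rely on System.Text.Json's case-insensitive web defaults to bind the API's lowercase JSON fields. For reference, the response bodies are assumed to be shaped roughly like this (illustrative values; only the field names are taken from the plugin code):

```python
# Assumed response shapes for the three endpoints (illustrative values --
# the field names match what both the C# records and the Python plugin read).
scrape_response = {"content": "# Page Title\n\nBody text...", "url": "https://example.com"}
screenshot_response = {"url": "https://example.com/shots/abc123.png"}
extract_response = {"data": '{"products": [{"name": "Widget", "price": 19.99}]}'}
```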

Step 3: Register and Use the Plugin

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion("gpt-4o", Environment.GetEnvironmentVariable("OPENAI_API_KEY")!);

var kernel = builder.Build();

// Register the web scraping plugin
var mantisApiKey = Environment.GetEnvironmentVariable("MANTIS_API_KEY")!;
kernel.Plugins.AddFromObject(new WebScrapingPlugin(mantisApiKey), "WebScraping");

// Enable automatic function calling
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory();
history.AddSystemMessage("You are a research assistant with web scraping capabilities. Use your tools to gather real-time information from the web.");

// Agent loop
history.AddUserMessage("What are the top stories on Hacker News right now? Scrape the front page.");

var response = await chat.GetChatMessageContentAsync(history, settings, kernel);
Console.WriteLine(response.Content);

Building the Plugin in Python

The Python Semantic Kernel SDK follows the same plugin pattern with decorators:

import httpx
from semantic_kernel.functions import kernel_function

class WebScrapingPlugin:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.mantisapi.com/v1"

    @kernel_function(
        name="scrape_url",
        description="Scrape a webpage and return its content as clean text or markdown"
    )
    async def scrape_url(self, url: str, format: str = "markdown") -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.base_url}/scrape",
                headers={"x-api-key": self.api_key},
                json={"url": url, "format": format, "wait_for": "networkidle"}
            )
            resp.raise_for_status()
            return resp.json().get("content", "No content returned")

    @kernel_function(
        name="screenshot_url",
        description="Take a screenshot of a webpage and return the image URL"
    )
    async def screenshot_url(self, url: str, width: int = 1280) -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.base_url}/screenshot",
                headers={"x-api-key": self.api_key},
                json={"url": url, "viewport": {"width": width, "height": 720}, "format": "png"}
            )
            resp.raise_for_status()
            return resp.json().get("url", "Screenshot failed")

    @kernel_function(
        name="extract_data",
        description="Extract structured data from a webpage using AI"
    )
    async def extract_data(self, url: str, prompt: str) -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.base_url}/extract",
                headers={"x-api-key": self.api_key},
                json={"url": url, "prompt": prompt, "wait_for": "networkidle"}
            )
            resp.raise_for_status()
            return resp.json().get("data", "No data extracted")

Using the Plugin with Python Semantic Kernel

import asyncio
import os
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings
from semantic_kernel.connectors.ai.function_choice_behavior import FunctionChoiceBehavior
from semantic_kernel.contents import ChatHistory

async def main():
    kernel = sk.Kernel()

    kernel.add_service(OpenAIChatCompletion(
        ai_model_id="gpt-4o",
        api_key=os.environ["OPENAI_API_KEY"]
    ))

    # Register plugin
    kernel.add_plugin(
        WebScrapingPlugin(os.environ["MANTIS_API_KEY"]),
        plugin_name="WebScraping"
    )

    settings = OpenAIChatPromptExecutionSettings(
        function_choice_behavior=FunctionChoiceBehavior.Auto()
    )

    chat = kernel.get_service(type=OpenAIChatCompletion)
    history = ChatHistory()
    history.add_system_message("You are a research assistant with web scraping tools.")
    history.add_user_message("Extract all product prices from https://example-store.com/deals")

    result = await chat.get_chat_message_content(chat_history=history, settings=settings, kernel=kernel)
    print(result.content)

asyncio.run(main())

Advanced: Multi-Step Planner with Web Scraping

Semantic Kernel's planner can automatically compose multi-step workflows from your plugins. Here's how to set up a research agent that plans its own scraping strategy:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Build kernel with plugins (builder configured with a chat model as in Step 3)
var kernel = builder.Build();
kernel.Plugins.AddFromObject(new WebScrapingPlugin(mantisApiKey), "WebScraping");

// Add a summarization function
kernel.Plugins.AddFromPromptDirectory("./Plugins/SummarizePlugin");

// Use function-calling planner (auto mode)
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory();
history.AddSystemMessage(@"You are a competitive intelligence agent. 
When asked to research a company:
1. Scrape their website for product info
2. Extract pricing data
3. Take a screenshot of their homepage
4. Summarize your findings

Use your tools step by step.");

history.AddUserMessage("Research Stripe's current pricing for payment processing.");

var response = await chat.GetChatMessageContentAsync(history, settings, kernel);
Console.WriteLine(response.Content);

Real-World Use Cases

1. Enterprise Content Aggregator

Build an agent that monitors competitor websites, press releases, and industry news:

[KernelFunction("monitor_competitors")]
[Description("Check competitor websites for changes and new content")]
public async Task<string> MonitorCompetitorsAsync(
    [Description("Comma-separated list of competitor URLs")] string urls)
{
    var results = new List<string>();
    foreach (var url in urls.Split(',').Select(u => u.Trim()))
    {
        var content = await ScrapeUrlAsync(url, "text");
        results.Add($"## {url}\n{content[..Math.Min(content.Length, 500)]}...");
    }
    return string.Join("\n\n", results);
}

2. Automated Due Diligence

Agents that research companies for investment or partnership decisions:

history.AddUserMessage(@"Perform due diligence on Acme Corp (acmecorp.com):
- Scrape their about page, team page, and pricing
- Extract key metrics and leadership info
- Screenshot their homepage for our records
- Summarize findings with risk assessment");

3. Price Monitoring Pipeline

Track pricing changes across SaaS competitors with scheduled agent runs:

var competitors = new[]
{
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/pricing"
};

foreach (var url in competitors)
{
    history.AddUserMessage($"Extract all pricing tiers and features from {url}");
    var result = await chat.GetChatMessageContentAsync(history, settings, kernel);
    history.AddAssistantMessage(result.Content!);
}

history.AddUserMessage("Now compare all three competitors' pricing and identify where we can undercut them.");
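The loop above runs one chat round-trip per competitor, sequentially. When you only need the raw pages, the plugin's coroutines can be fanned out concurrently instead. A minimal sketch, with a stubbed `scrape` coroutine standing in for `WebScrapingPlugin.scrape_url` so it runs offline:

```python
import asyncio

# Hypothetical stub standing in for WebScrapingPlugin.scrape_url;
# a real run would issue one HTTP request per URL.
async def scrape(url: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"content of {url}"

async def scrape_all(urls: list[str]) -> dict[str, str]:
    # Fan out all requests concurrently instead of awaiting them one by one
    results = await asyncio.gather(*(scrape(u) for u in urls))
    return dict(zip(urls, results))

competitors = [
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/pricing",
]
pages = asyncio.run(scrape_all(competitors))
print(len(pages))  # 3
```

With the pages in hand, a single chat turn can then do the comparison, which keeps the LLM round-trips (the expensive part) to one.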

Dependency Injection Setup (ASP.NET)

For production .NET apps, register the plugin via DI:

// Program.cs
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    return new WebScrapingPlugin(config["Mantis:ApiKey"]!);
});

// Expose the plugin as a KernelPlugin so the kernel from AddKernel picks it
// up automatically. Avoid calling BuildServiceProvider() during registration:
// that builds a second, leaked service container.
builder.Services.AddSingleton(sp =>
    KernelPluginFactory.CreateFromObject(
        sp.GetRequiredService<WebScrapingPlugin>(), "WebScraping"));

builder.Services.AddKernel()
    .AddOpenAIChatCompletion("gpt-4o", builder.Configuration["OpenAI:ApiKey"]!);

Error Handling and Retry Logic

[KernelFunction("scrape_url_safe")]
[Description("Scrape a URL with retry logic and error handling")]
public async Task<string> ScrapeUrlSafeAsync(
    [Description("URL to scrape")] string url,
    [Description("Max retries")] int maxRetries = 3)
{
    for (int attempt = 1; attempt <= maxRetries; attempt++)
    {
        try
        {
            return await ScrapeUrlAsync(url, "markdown");
        }
        catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        {
            if (attempt == maxRetries) throw;
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
        catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.BadGateway)
        {
            if (attempt == maxRetries) return $"Failed to scrape {url} after {maxRetries} attempts";
            await Task.Delay(TimeSpan.FromSeconds(2));
        }
    }
    return "Unexpected error";
}
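The same backoff pattern translates directly to the Python plugin. Here's a sketch with a generic retry wrapper; `RateLimitError` and `flaky_scrape` are stand-ins for httpx's 429 errors and a real scrape call, so the example runs offline:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from the scraping API."""

async def with_retries(coro_fn, *args, max_retries: int = 3):
    # Retry on rate limits with exponential backoff (2s, 4s, 8s...)
    for attempt in range(1, max_retries + 1):
        try:
            return await coro_fn(*args)
        except RateLimitError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(2 ** attempt * 0.001)  # scaled down for the demo

calls = 0

async def flaky_scrape(url: str) -> str:
    # Fails twice, then succeeds -- simulates transient rate limiting
    global calls
    calls += 1
    if calls < 3:
        raise RateLimitError
    return f"scraped {url}"

result = asyncio.run(with_retries(flaky_scrape, "https://example.com"))
print(result)  # scraped https://example.com
```

In the real plugin you would catch `httpx.HTTPStatusError` and inspect `exc.response.status_code` instead of the custom exception.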

Cost Optimization

- Cache responses: store scrape results in Redis or memory with a TTL (50-80% cost savings)
- Use text format: set format: "text" when markdown isn't needed (faster processing)
- Limit content length: truncate scraped content before sending it to the LLM (lower token costs)
- Batch requests: scrape multiple URLs in parallel (faster execution)
- Smart selectors: target specific page sections via CSS selectors (less noise, fewer tokens)

Start Building with Semantic Kernel + Mantis

Get 100 free API calls per month. No credit card required.

Get Your API Key →

What You Learned

Semantic Kernel's plugin architecture makes it straightforward to add web scraping to any .NET or Python AI agent. The combination of type-safe plugins, automatic function calling, and planner support means your agents can autonomously gather, process, and analyze web data using the same patterns that power Microsoft Copilot.

Next steps: Check out our quickstart guide to get your API key, or explore our other framework integrations.