Build Web Scraping Plugins for Microsoft Semantic Kernel
Microsoft Semantic Kernel is the enterprise AI orchestration framework that powers Copilot, Microsoft 365, and thousands of enterprise apps. If you're building AI agents in the .NET or Python ecosystem, Semantic Kernel's plugin architecture is the standard way to give agents capabilities, and web scraping is one of the most powerful capabilities you can add.
In this tutorial, you'll build a complete WebScrapingPlugin for Semantic Kernel that gives your agents real-time web access through the Mantis API. We'll cover both C# and Python implementations, planners, and multi-agent patterns.
Why Semantic Kernel for Web Scraping Agents?
Semantic Kernel stands out for enterprise AI agent development:
- Plugin architecture: clean, type-safe function registration with automatic schema generation
- Planner support: agents automatically compose multi-step plans from available plugins
- Multi-model: works with OpenAI, Azure OpenAI, Anthropic, Google, and local models
- Enterprise-ready: built by Microsoft, used in production at Fortune 500 companies
- Dependency injection: native .NET DI integration for clean, testable code
Prerequisites
You'll need:
- A Mantis API key (free tier: 100 calls/month)
- An OpenAI or Azure OpenAI API key
- .NET 8+ (for C#) or Python 3.10+ (for Python)
- The Microsoft.SemanticKernel NuGet package (for C#) or the semantic-kernel pip package (for Python)
Building the WebScrapingPlugin in C#
Semantic Kernel plugins are classes decorated with [KernelFunction] attributes. Each function becomes a tool the AI agent can invoke.
Step 1: Install Dependencies
dotnet add package Microsoft.SemanticKernel
dotnet add package System.Net.Http.Json
Step 2: Create the Plugin
using System.ComponentModel;
using System.Net.Http.Json;
using Microsoft.SemanticKernel;

public class WebScrapingPlugin
{
    private readonly HttpClient _http;
    private readonly string _apiKey;

    public WebScrapingPlugin(string apiKey)
    {
        _apiKey = apiKey;
        _http = new HttpClient
        {
            BaseAddress = new Uri("https://api.mantisapi.com/v1/")
        };
        _http.DefaultRequestHeaders.Add("x-api-key", _apiKey);
    }

    [KernelFunction("scrape_url")]
    [Description("Scrape a webpage and return its content as clean text or markdown")]
    public async Task<string> ScrapeUrlAsync(
        [Description("The URL to scrape")] string url,
        [Description("Output format: text or markdown")] string format = "markdown")
    {
        var response = await _http.PostAsJsonAsync("scrape", new
        {
            url,
            format,
            wait_for = "networkidle"
        });
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<ScrapeResponse>();
        return result?.Content ?? "No content returned";
    }

    [KernelFunction("screenshot_url")]
    [Description("Take a screenshot of a webpage and return the image URL")]
    public async Task<string> ScreenshotUrlAsync(
        [Description("The URL to screenshot")] string url,
        [Description("Viewport width in pixels")] int width = 1280)
    {
        var response = await _http.PostAsJsonAsync("screenshot", new
        {
            url,
            viewport = new { width, height = 720 },
            format = "png"
        });
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<ScreenshotResponse>();
        return result?.Url ?? "Screenshot failed";
    }

    [KernelFunction("extract_data")]
    [Description("Extract structured data from a webpage using AI")]
    public async Task<string> ExtractDataAsync(
        [Description("The URL to extract data from")] string url,
        [Description("What data to extract (e.g., 'product names and prices')")] string prompt)
    {
        var response = await _http.PostAsJsonAsync("extract", new
        {
            url,
            prompt,
            wait_for = "networkidle"
        });
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<ExtractResponse>();
        return result?.Data ?? "No data extracted";
    }
}

record ScrapeResponse(string Content, string Url);
record ScreenshotResponse(string Url);
record ExtractResponse(string Data);
Step 3: Register and Use the Plugin
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion("gpt-4o", Environment.GetEnvironmentVariable("OPENAI_API_KEY")!);
var kernel = builder.Build();
// Register the web scraping plugin
var mantisApiKey = Environment.GetEnvironmentVariable("MANTIS_API_KEY")!;
kernel.Plugins.AddFromObject(new WebScrapingPlugin(mantisApiKey), "WebScraping");
// Enable automatic function calling
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory();
history.AddSystemMessage("You are a research assistant with web scraping capabilities. Use your tools to gather real-time information from the web.");
// Agent loop
history.AddUserMessage("What are the top stories on Hacker News right now? Scrape the front page.");
var response = await chat.GetChatMessageContentAsync(history, settings, kernel);
Console.WriteLine(response.Content);
Building the Plugin in Python
The Python Semantic Kernel SDK follows the same plugin pattern with decorators:
import httpx
from semantic_kernel.functions import kernel_function

class WebScrapingPlugin:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.mantisapi.com/v1"

    @kernel_function(
        name="scrape_url",
        description="Scrape a webpage and return its content as clean text or markdown"
    )
    async def scrape_url(self, url: str, format: str = "markdown") -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.base_url}/scrape",
                headers={"x-api-key": self.api_key},
                json={"url": url, "format": format, "wait_for": "networkidle"}
            )
            resp.raise_for_status()
            return resp.json().get("content", "No content returned")

    @kernel_function(
        name="screenshot_url",
        description="Take a screenshot of a webpage and return the image URL"
    )
    async def screenshot_url(self, url: str, width: int = 1280) -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.base_url}/screenshot",
                headers={"x-api-key": self.api_key},
                json={"url": url, "viewport": {"width": width, "height": 720}, "format": "png"}
            )
            resp.raise_for_status()
            return resp.json().get("url", "Screenshot failed")

    @kernel_function(
        name="extract_data",
        description="Extract structured data from a webpage using AI"
    )
    async def extract_data(self, url: str, prompt: str) -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.base_url}/extract",
                headers={"x-api-key": self.api_key},
                json={"url": url, "prompt": prompt, "wait_for": "networkidle"}
            )
            resp.raise_for_status()
            return resp.json().get("data", "No data extracted")
Using the Plugin with Python Semantic Kernel
import asyncio
import os
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings
from semantic_kernel.connectors.ai.function_choice_behavior import FunctionChoiceBehavior
from semantic_kernel.contents import ChatHistory

async def main():
    kernel = sk.Kernel()
    kernel.add_service(OpenAIChatCompletion(
        ai_model_id="gpt-4o",
        api_key=os.environ["OPENAI_API_KEY"]
    ))

    # Register plugin
    kernel.add_plugin(
        WebScrapingPlugin(os.environ["MANTIS_API_KEY"]),
        plugin_name="WebScraping"
    )

    settings = OpenAIChatPromptExecutionSettings(
        function_choice_behavior=FunctionChoiceBehavior.Auto()
    )

    chat = kernel.get_service(type=OpenAIChatCompletion)
    history = ChatHistory()
    history.add_system_message("You are a research assistant with web scraping tools.")
    history.add_user_message("Extract all product prices from https://example-store.com/deals")

    result = await chat.get_chat_message_content(history, settings, kernel=kernel)
    print(result.content)

asyncio.run(main())
Advanced: Multi-Step Planner with Web Scraping
Semantic Kernel's planner can automatically compose multi-step workflows from your plugins. Here's how to set up a research agent that plans its own scraping strategy:
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Build kernel with plugins
var kernel = builder.Build();
kernel.Plugins.AddFromObject(new WebScrapingPlugin(mantisApiKey), "WebScraping");

// Add a summarization function
kernel.Plugins.AddFromPromptDirectory("./Plugins/SummarizePlugin");

// Use function-calling planning (auto mode)
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory();
history.AddSystemMessage(@"You are a competitive intelligence agent.
When asked to research a company:
1. Scrape their website for product info
2. Extract pricing data
3. Take a screenshot of their homepage
4. Summarize your findings
Use your tools step by step.");

history.AddUserMessage("Research Stripe's current pricing for payment processing.");
var response = await chat.GetChatMessageContentAsync(history, settings, kernel);
Console.WriteLine(response.Content);
Real-World Use Cases
1. Enterprise Content Aggregator
Build an agent that monitors competitor websites, press releases, and industry news:
[KernelFunction("monitor_competitors")]
[Description("Check competitor websites for changes and new content")]
public async Task<string> MonitorCompetitorsAsync(
    [Description("Comma-separated list of competitor URLs")] string urls)
{
    var results = new List<string>();
    foreach (var url in urls.Split(',').Select(u => u.Trim()))
    {
        var content = await ScrapeUrlAsync(url, "text");
        results.Add($"## {url}\n{content[..Math.Min(content.Length, 500)]}...");
    }
    return string.Join("\n\n", results);
}
2. Automated Due Diligence
Agents that research companies for investment or partnership decisions:
history.AddUserMessage(@"Perform due diligence on Acme Corp (acmecorp.com):
- Scrape their about page, team page, and pricing
- Extract key metrics and leadership info
- Screenshot their homepage for our records
- Summarize findings with risk assessment");
3. Price Monitoring Pipeline
Track pricing changes across SaaS competitors with scheduled agent runs:
var competitors = new[]
{
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/pricing"
};

foreach (var url in competitors)
{
    history.AddUserMessage($"Extract all pricing tiers and features from {url}");
    var result = await chat.GetChatMessageContentAsync(history, settings, kernel);
    history.AddAssistantMessage(result.Content!);
}

history.AddUserMessage("Now compare all three competitors' pricing and identify where we can undercut them.");
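When you only need the raw page data rather than the full chat loop, the per-URL scrapes can run concurrently. A sketch using asyncio.gather, assuming the Python WebScrapingPlugin from earlier (the scrape_all helper is illustrative, not part of the SDK):

```python
import asyncio

async def scrape_all(plugin, urls: list[str]) -> dict[str, str]:
    """Scrape several URLs concurrently and map each URL to its content."""
    # gather preserves input order, so zip pairs each URL with its result
    results = await asyncio.gather(*(plugin.scrape_url(u) for u in urls))
    return dict(zip(urls, results))
```

Concurrent requests cut wall-clock time roughly linearly with the number of URLs, at the cost of burst load against your API quota.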
Dependency Injection Setup (ASP.NET)
For production .NET apps, register the plugin via DI:
// Program.cs
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    return new WebScrapingPlugin(config["Mantis:ApiKey"]!);
});

// Register the plugin as a KernelPlugin; AddKernel picks up any KernelPlugin
// in the service collection, so every resolved Kernel gets it automatically.
builder.Services.AddSingleton(sp =>
    KernelPluginFactory.CreateFromObject(
        sp.GetRequiredService<WebScrapingPlugin>(), "WebScraping"));

builder.Services.AddKernel()
    .AddOpenAIChatCompletion("gpt-4o", builder.Configuration["OpenAI:ApiKey"]!);
Error Handling and Retry Logic
[KernelFunction("scrape_url_safe")]
[Description("Scrape a URL with retry logic and error handling")]
public async Task<string> ScrapeUrlSafeAsync(
    [Description("URL to scrape")] string url,
    [Description("Max retries")] int maxRetries = 3)
{
    for (int attempt = 1; attempt <= maxRetries; attempt++)
    {
        try
        {
            return await ScrapeUrlAsync(url, "markdown");
        }
        catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        {
            if (attempt == maxRetries) throw;
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
        catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.BadGateway)
        {
            if (attempt == maxRetries) return $"Failed to scrape {url} after {maxRetries} attempts";
            await Task.Delay(TimeSpan.FromSeconds(2));
        }
    }
    return "Unexpected error";
}
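The same backoff pattern translates to the Python plugin. A generic sketch (the retry_async helper and its parameters are illustrative, not part of Semantic Kernel or httpx):

```python
import asyncio
import random

async def retry_async(fn, *args, max_retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Retry an async callable with exponential backoff and a little jitter."""
    for attempt in range(1, max_retries + 1):
        try:
            return await fn(*args)
        except retry_on:
            if attempt == max_retries:
                raise
            # Backoff schedule: base_delay, 2x, 4x, ... plus jitter to avoid thundering herds
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.random() * 0.1)
```

In practice you would narrow retry_on to httpx.HTTPStatusError and inspect the status code, retrying only on 429 and 5xx responses as the C# version does.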
Cost Optimization
| Strategy | Implementation | Benefit |
|---|---|---|
| Cache responses | Store scrape results in Redis/memory with TTL | 50-80% |
| Use text format | Set format: "text" when markdown isn't needed | Faster processing |
| Limit content length | Truncate before sending to LLM | Reduce token costs |
| Batch requests | Scrape multiple URLs in parallel | Faster execution |
| Smart selectors | Target specific page sections via CSS selectors | Less noise, fewer tokens |
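Caching is usually the biggest lever. A minimal in-memory TTL cache wrapped around the Python plugin's scrape_url might look like this (TTLCache and scrape_cached are illustrative helpers; in production you would back this with Redis):

```python
import time

class TTLCache:
    """Tiny in-memory cache with per-entry expiry."""
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def set(self, key: str, value: str):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=300)

async def scrape_cached(plugin, url: str) -> str:
    # Repeat requests for the same URL within the TTL window skip the API call entirely
    hit = cache.get(url)
    if hit is not None:
        return hit
    content = await plugin.scrape_url(url)
    cache.set(url, content)
    return content
```

Pick a TTL that matches how fresh the agent's data needs to be; for price monitoring, minutes to hours is usually fine.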
Start Building with Semantic Kernel + Mantis
Get 100 free API calls per month. No credit card required.
What You Learned
- How to build a WebScrapingPlugin with scrape, screenshot, and extract functions
- Both C# and Python implementations with full code examples
- Automatic function calling: let the AI decide when to scrape
- Multi-step planning: compose complex research workflows
- Enterprise patterns: DI registration, error handling, cost optimization
- Real-world use cases: competitor monitoring, due diligence, price tracking
Semantic Kernel's plugin architecture makes it straightforward to add web scraping to any .NET or Python AI agent. The combination of type-safe plugins, automatic function calling, and planner support means your agents can autonomously gather, process, and analyze web data using the same patterns that power Microsoft Copilot.
Next steps: Check out our quickstart guide to get your API key, or explore our other framework integrations.