YouTube is the world's second-largest search engine and the biggest video platform, with over 800 million videos and 2.7 billion monthly active users. For developers, researchers, and AI agents, YouTube data is a goldmine: video trends, audience sentiment, competitor analysis, content research, and market intelligence.

But getting that data at scale isn't easy. The YouTube Data API v3 has strict quotas (10,000 units/day, roughly 100 search queries), doesn't expose transcripts, and limits comment retrieval. For anything beyond basic metadata, you need to scrape.

In this guide, you'll learn four methods to scrape YouTube data, from simple Python scripts to production-ready API solutions, plus how to handle YouTube's anti-bot measures, legal considerations, and real-world use cases with code.
## What Data Can You Extract from YouTube?
| Data Type | Available via Scraping | YouTube API v3 | Notes |
|---|---|---|---|
| Video title, description, tags | ✅ | ✅ | API costs 1 unit per video |
| View count, likes | ✅ | ✅ | Dislikes hidden since 2021 (some scrapers estimate) |
| Channel subscribers, video count | ✅ | ✅ | API costs 1 unit per channel |
| Comments & replies | ✅ | ✅ (limited) | API returns max 100 per page, costs 1 unit each |
| Video transcripts/captions | ✅ | ❌ | Major gap: scraping is the only way |
| Search results | ✅ | ✅ | API costs 100 units per search (expensive!) |
| Related/recommended videos | ✅ | ❌ (deprecated) | Removed from API in 2023 |
| Trending videos by country | ✅ | ✅ | API limited to 200 results |
| Playlist contents | ✅ | ✅ | API costs 1 unit per 50 items |
| Shorts metadata | ✅ | Partial | API doesn't distinguish Shorts from regular videos |
| Hashtag pages | ✅ | ❌ | No API endpoint for hashtag discovery |
| Revenue/analytics | ❌ | Own channel only | YouTube Analytics API: creator's own data only |
## Method 1: Python + yt-dlp + Requests (Fastest Setup)

The fastest way to extract YouTube data in Python is combining yt-dlp (the most actively maintained YouTube extractor) with direct HTTP requests for page data.

### Extract Video Metadata with yt-dlp
```python
import json

import yt_dlp

def scrape_video_metadata(url):
    """Extract comprehensive video metadata without downloading."""
    yt_opts = {
        'quiet': True,
        'skip_download': True,  # Don't download the video file
        'extract_flat': False,
    }
    with yt_dlp.YoutubeDL(yt_opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        'title': info.get('title'),
        'description': info.get('description'),
        'view_count': info.get('view_count'),
        'like_count': info.get('like_count'),
        'duration': info.get('duration'),
        'upload_date': info.get('upload_date'),
        'channel': info.get('channel'),
        'channel_id': info.get('channel_id'),
        'channel_url': info.get('channel_url'),
        'subscriber_count': info.get('channel_follower_count'),
        'tags': info.get('tags', []),
        'categories': info.get('categories', []),
        'thumbnail': info.get('thumbnail'),
        'comment_count': info.get('comment_count'),
        'age_limit': info.get('age_limit'),
        'is_live': info.get('is_live'),
        'was_live': info.get('was_live'),
    }

# Usage
video = scrape_video_metadata('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
print(json.dumps(video, indent=2))
```
### Scrape YouTube Search Results
```python
import json
import re

import requests

def scrape_youtube_search(query, max_results=20):
    """Scrape YouTube search results without API quotas."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    # Let requests URL-encode the query instead of interpolating it raw
    response = requests.get(
        'https://www.youtube.com/results',
        params={'search_query': query},
        headers=headers,
    )
    # Extract ytInitialData from the page source
    match = re.search(r'var ytInitialData = ({.*?});', response.text, re.DOTALL)
    if not match:
        return []
    data = json.loads(match.group(1))
    # Navigate the nested structure
    results = []
    try:
        contents = (data['contents']['twoColumnSearchResultsRenderer']
                    ['primaryContents']['sectionListRenderer']['contents'][0]
                    ['itemSectionRenderer']['contents'])
        for item in contents[:max_results]:
            if 'videoRenderer' in item:
                video = item['videoRenderer']
                snippets = video.get('detailedMetadataSnippets', [{}])
                results.append({
                    'video_id': video['videoId'],
                    'title': video['title']['runs'][0]['text'],
                    'url': f"https://www.youtube.com/watch?v={video['videoId']}",
                    'channel': video.get('ownerText', {}).get('runs', [{}])[0].get('text', ''),
                    'views': video.get('viewCountText', {}).get('simpleText', ''),
                    'published': video.get('publishedTimeText', {}).get('simpleText', ''),
                    'duration': video.get('lengthText', {}).get('simpleText', ''),
                    'description': ''.join(
                        s.get('text', '')
                        for s in snippets[0].get('snippetText', {}).get('runs', [])
                    ),
                })
    except (KeyError, IndexError):
        pass
    return results

# Usage
results = scrape_youtube_search('web scraping python tutorial 2026')
for r in results[:5]:
    print(f"{r['title']} | {r['views']} | {r['channel']}")
```
### Extract Video Transcripts
```python
# Requires: pip install "youtube-transcript-api<1.0"
# (1.0+ replaced the get_transcript() classmethod with YouTubeTranscriptApi().fetch())
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id, language='en'):
    """Extract video transcript/captions."""
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
    except Exception as e:
        return {'error': str(e)}
    # Full text
    full_text = ' '.join(entry['text'] for entry in transcript)
    # Timestamped entries
    return {
        'full_text': full_text,
        'segments': [{
            'text': entry['text'],
            'start': entry['start'],
            'duration': entry['duration'],
        } for entry in transcript],
        'word_count': len(full_text.split()),
    }

# Usage
transcript = get_transcript('dQw4w9WgXcQ')
print(f"Words: {transcript.get('word_count', 0)}")
print(transcript.get('full_text', '')[:500])
```
## Method 2: Playwright Headless Browser (Full JS Rendering)

For data that requires JavaScript rendering (comments, dynamically loaded content, infinite scroll, Shorts), Playwright is the best option.
```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_video_with_comments(video_url, max_comments=50):
    """Scrape video metadata + comments with Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            viewport={'width': 1920, 'height': 1080},
        )
        page = await context.new_page()
        await page.goto(video_url, wait_until='networkidle')

        # Extract ytInitialPlayerResponse for video details
        player_data = await page.evaluate('() => window.ytInitialPlayerResponse')

        # Parse video metadata
        video_data = {}
        if player_data:
            video_details = player_data.get('videoDetails', {})
            video_data = {
                'title': video_details.get('title'),
                'video_id': video_details.get('videoId'),
                'views': video_details.get('viewCount'),
                'author': video_details.get('author'),
                'channel_id': video_details.get('channelId'),
                'duration_seconds': video_details.get('lengthSeconds'),
                'keywords': video_details.get('keywords', []),
                'description': video_details.get('shortDescription'),
                'is_live': video_details.get('isLiveContent'),
            }

        # Scroll to trigger lazy-loading of the comment section
        for _ in range(5):
            await page.evaluate('window.scrollBy(0, 800)')
            await asyncio.sleep(2)

        # Extract comments from the DOM
        comments = []
        comment_elements = await page.query_selector_all('#content-text')
        for elem in comment_elements[:max_comments]:
            text = await elem.inner_text()
            if text.strip():
                comments.append(text.strip())

        video_data['comments'] = comments
        video_data['comment_count_scraped'] = len(comments)
        await browser.close()
        return video_data

# Usage
data = asyncio.run(scrape_video_with_comments(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    max_comments=30,
))
print(f"Title: {data.get('title')}")
print(f"Views: {data.get('views')}")
print(f"Comments scraped: {data.get('comment_count_scraped')}")
```
### Scrape a YouTube Channel's Video List
```python
async def scrape_channel_videos(channel_url, max_videos=100):
    """Scrape all videos from a YouTube channel."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()

        # Navigate to the channel's Videos tab
        videos_url = channel_url.rstrip('/') + '/videos'
        await page.goto(videos_url, wait_until='networkidle')

        videos = []
        last_count = 0
        # Scroll to load more videos
        while len(videos) < max_videos:
            # Extract video data from the page
            new_videos = await page.evaluate('''() => {
                const items = document.querySelectorAll('ytd-rich-item-renderer');
                return Array.from(items).map(item => {
                    const title = item.querySelector('#video-title');
                    const meta = item.querySelector('#metadata-line');
                    const link = title?.getAttribute('href');
                    const spans = meta?.querySelectorAll('span') || [];
                    return {
                        title: title?.textContent?.trim(),
                        url: link ? 'https://www.youtube.com' + link : null,
                        views: spans[0]?.textContent?.trim() || '',
                        published: spans[1]?.textContent?.trim() || '',
                    };
                }).filter(v => v.title && v.url);
            }''')
            videos = new_videos
            if len(videos) == last_count:
                break  # No more videos loading
            last_count = len(videos)
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await asyncio.sleep(2)

        await browser.close()
        return videos[:max_videos]

# Usage
videos = asyncio.run(scrape_channel_videos(
    'https://www.youtube.com/@GoogleDevelopers',
    max_videos=50,
))
print(f"Found {len(videos)} videos")
for v in videos[:5]:
    print(f"  {v['title']} | {v['views']}")
```
## Method 3: Node.js + Puppeteer (Stealth Scraping)

Node.js with Puppeteer and the stealth plugin is excellent for YouTube because it closely mimics real browser behavior, reducing detection risk.
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

async function scrapeYouTubeVideo(videoUrl) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });

  // Intercept YouTube's internal API responses
  const apiResponses = [];
  page.on('response', async (response) => {
    const url = response.url();
    if (url.includes('youtubei/v1/next')) {
      try {
        const json = await response.json();
        apiResponses.push(json);
      } catch (e) {}
    }
  });

  await page.goto(videoUrl, { waitUntil: 'networkidle2' });

  // Extract data from ytInitialPlayerResponse
  const videoData = await page.evaluate(() => {
    const playerResponse = window.ytInitialPlayerResponse;
    const initialData = window.ytInitialData;
    const details = playerResponse?.videoDetails || {};

    // Extract engagement metrics from initialData
    const contents = initialData?.contents?.twoColumnWatchNextResults
      ?.results?.results?.contents || [];
    let likes = '';
    for (const content of contents) {
      const buttons = content?.videoPrimaryInfoRenderer?.videoActions
        ?.menuRenderer?.topLevelButtons || [];
      for (const btn of buttons) {
        const toggle = btn?.segmentedLikeDislikeButtonViewModel
          ?.likeButtonViewModel?.likeButtonViewModel?.toggleButtonViewModel
          ?.toggleButtonViewModel?.defaultButtonViewModel?.buttonViewModel;
        if (toggle?.title) {
          likes = toggle.title;
          break;
        }
      }
    }

    return {
      title: details.title,
      videoId: details.videoId,
      views: parseInt(details.viewCount) || 0,
      likes,
      duration: parseInt(details.lengthSeconds) || 0,
      author: details.author,
      channelId: details.channelId,
      subscribers: details.channelFollowerCount,
      description: details.shortDescription,
      keywords: details.keywords || [],
      isLive: details.isLiveContent,
      publishDate: playerResponse?.microformat?.playerMicroformatRenderer
        ?.publishDate,
      category: playerResponse?.microformat?.playerMicroformatRenderer
        ?.category,
    };
  });

  // Scroll to load comments
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollBy(0, 600));
    await new Promise(r => setTimeout(r, 2000));
  }

  // Extract comments
  const comments = await page.evaluate(() => {
    const commentElements = document.querySelectorAll('#content-text');
    return Array.from(commentElements).map(el => el.textContent.trim())
      .filter(t => t.length > 0);
  });
  videoData.comments = comments;

  await browser.close();
  return videoData;
}

// Usage
(async () => {
  const data = await scrapeYouTubeVideo(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
  );
  console.log(`Title: ${data.title}`);
  console.log(`Views: ${data.views.toLocaleString()}`);
  console.log(`Likes: ${data.likes}`);
  console.log(`Comments: ${data.comments.length}`);
})();
```
### Batch Scrape YouTube Search Results
```javascript
async function scrapeYouTubeSearch(query, maxResults = 20) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  const searchUrl = `https://www.youtube.com/results?search_query=${
    encodeURIComponent(query)
  }`;
  await page.goto(searchUrl, { waitUntil: 'networkidle2' });

  const results = await page.evaluate(() => {
    const data = window.ytInitialData;
    const contents = data?.contents?.twoColumnSearchResultsRenderer
      ?.primaryContents?.sectionListRenderer?.contents?.[0]
      ?.itemSectionRenderer?.contents || [];
    return contents
      .filter(item => item.videoRenderer)
      .map(item => {
        const v = item.videoRenderer;
        return {
          videoId: v.videoId,
          title: v.title?.runs?.[0]?.text,
          url: `https://www.youtube.com/watch?v=${v.videoId}`,
          channel: v.ownerText?.runs?.[0]?.text,
          views: v.viewCountText?.simpleText,
          published: v.publishedTimeText?.simpleText,
          duration: v.lengthText?.simpleText,
          thumbnail: v.thumbnail?.thumbnails?.pop()?.url,
        };
      });
  });

  await browser.close();
  return results.slice(0, maxResults);
}

// Usage
(async () => {
  const results = await scrapeYouTubeSearch('ai agent tutorial 2026');
  results.forEach(r => {
    console.log(`${r.title} | ${r.views} | ${r.channel}`);
  });
})();
```
## Method 4: Mantis API (Production-Ready, One Call)

For production workloads, the Mantis WebPerception API handles anti-bot measures, proxy rotation, and JavaScript rendering automatically. One API call returns structured, extracted data.
```python
import requests

# Scrape a YouTube video page
response = requests.post('https://api.mantisapi.com/v1/scrape', json={
    'url': 'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    'render_js': True,
    'wait_for': 'networkidle',
    'extract': {
        'title': 'meta[property="og:title"]@content',
        'description': 'meta[property="og:description"]@content',
        'channel': 'link[itemprop="name"]@content',
        'views': 'meta[itemprop="interactionCount"]@content',
    },
}, headers={
    'Authorization': 'Bearer YOUR_API_KEY',
})

data = response.json()
print(data['extracted'])
```
```javascript
// Scrape YouTube search results with Mantis
const axios = require('axios');

(async () => {
  const response = await axios.post('https://api.mantisapi.com/v1/scrape', {
    url: 'https://www.youtube.com/results?search_query=web+scraping+api',
    render_js: true,
    wait_for: 'networkidle',
    screenshot: true,        // Optional: get a screenshot
    extract_schema: 'auto',  // AI-powered structured extraction
  }, {
    headers: { 'Authorization': 'Bearer YOUR_API_KEY' },
  });
  console.log(response.data);
})();
```
## YouTube's Anti-Bot Measures (What You'll Face)

YouTube uses sophisticated detection to block scrapers:
| Defense | How It Works | How to Bypass |
|---|---|---|
| Rate limiting | Blocks IPs making too many requests | Rotating residential proxies, random delays (3-10s) |
| JavaScript challenges | Page requires JS execution to render data | Headless browser (Playwright/Puppeteer) or yt-dlp |
| Consent walls | Cookie consent popup (EU) blocks content | Set consent cookies, dismiss via automation |
| Bot detection | Fingerprinting, WebDriver detection, behavioral analysis | Stealth plugins, realistic mouse movements, human-like timing |
| Dynamic class names | CSS class names change between deployments | Use data attributes, aria labels, or ytInitialData JSON instead |
| Age-gated content | Requires login to view | yt-dlp with cookies, authenticated browser sessions |
| Geo-restrictions | Content blocked by country | Proxies in the target country |
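The rate-limiting and consent-wall rows above can be sketched in a few lines. This is a minimal illustration, not a production setup: the proxy URLs are placeholders you would replace with your own pool, and the `SOCS` consent cookie value is an assumption that YouTube may change at any time.

```python
import random
import time

import requests

# Placeholder proxy pool -- substitute real residential proxy endpoints.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# EU consent wall: sending a consent cookie up front usually skips the popup.
# The exact cookie value is an assumption and may change over time.
COOKIES = {'SOCS': 'CAI'}

def polite_get(url):
    """Fetch a page through a random proxy after a human-like delay."""
    proxy = random.choice(PROXIES)
    time.sleep(random.uniform(3, 10))  # random 3-10s delay between requests
    return requests.get(
        url,
        headers=HEADERS,
        cookies=COOKIES,
        proxies={'http': proxy, 'https': proxy},
        timeout=30,
    )
```

In practice you would also retire proxies that start returning 429s and vary the delay distribution per proxy.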
## YouTube Data API v3 vs Scraping vs Mantis
| Feature | YouTube API v3 | DIY Scraping | Mantis API |
|---|---|---|---|
| Cost | Free (10K units/day) | Proxy costs ($50-200/mo) | From $29/mo (5K requests) |
| Video metadata | ✅ | ✅ | ✅ |
| Transcripts | ❌ | ✅ | ✅ |
| Comments (full threads) | Partial (quota-heavy) | ✅ | ✅ |
| Search (unlimited) | 100 searches/day max | ✅ | ✅ |
| Related videos | ❌ (deprecated) | ✅ | ✅ |
| Anti-bot handling | N/A | You build it | Built-in |
| Rate limits | 10K units/day | Depends on proxies | Plan-based |
| Maintenance | Low | High (selectors break) | Zero |
| Setup time | 30 min | Days-weeks | 5 min |
The API's biggest limitation: a single search query costs 100 units. With 10,000 free units per day, you can only make ~100 searches. At scale, the YouTube API becomes impractical, and raising the quota means applying for an extension through Google's compliance audit process.
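The quota arithmetic is easy to verify. A tiny helper, using the unit costs cited above (100 units per `search.list` call, 1 unit per `videos.list` call):

```python
SEARCH_COST = 100    # units per search.list call
VIDEO_COST = 1       # units per videos.list call
DAILY_QUOTA = 10_000  # free daily quota

def max_daily_searches(videos_per_search=0):
    """How many searches fit in the free daily quota if each
    search is followed by N per-video detail lookups."""
    per_search = SEARCH_COST + videos_per_search * VIDEO_COST
    return DAILY_QUOTA // per_search

print(max_daily_searches())    # 100 -- metadata from search results only
print(max_daily_searches(50))  # 66  -- if you also fetch 50 videos per search
```

Fetching details for every result cuts your effective search budget by a third, which is why transcript and comment workloads hit the wall so quickly.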
## Real-World Use Cases with Code

### 1. Content Research Tool for AI Agents
```python
import yt_dlp
from youtube_transcript_api import YouTubeTranscriptApi

class YouTubeResearcher:
    """AI agent tool for YouTube content research."""

    def research_topic(self, query, top_n=5):
        """Find and analyze top videos for a topic."""
        # Search for videos
        yt_opts = {
            'quiet': True,
            'skip_download': True,
            'extract_flat': True,
            'default_search': f'ytsearch{top_n}',
        }
        with yt_dlp.YoutubeDL(yt_opts) as ydl:
            results = ydl.extract_info(query, download=False)

        analyses = []
        for entry in results.get('entries', [])[:top_n]:
            video_id = entry.get('id')
            # Get transcript
            try:
                transcript = YouTubeTranscriptApi.get_transcript(video_id)
                text = ' '.join(t['text'] for t in transcript)
            except Exception:
                text = 'Transcript not available'
            analyses.append({
                'title': entry.get('title'),
                'video_id': video_id,
                'url': entry.get('url'),
                'channel': entry.get('channel'),
                'view_count': entry.get('view_count'),
                'duration': entry.get('duration'),
                'transcript_preview': text[:1000],
                'word_count': len(text.split()),
            })
        return {
            'query': query,
            'results': analyses,
            'total_views': sum(a.get('view_count', 0) or 0 for a in analyses),
        }

# Usage: well suited to AI agents analyzing content
researcher = YouTubeResearcher()
report = researcher.research_topic('web scraping best practices 2026')
for r in report['results']:
    print(f"{r['title']} | {(r['view_count'] or 0):,} views | {r['word_count']} words")
```
### 2. YouTube Competitor Tracker
```python
from datetime import datetime

import yt_dlp

def track_competitor_channel(channel_url):
    """Track a competitor's recent YouTube activity."""
    yt_opts = {
        'quiet': True,
        'skip_download': True,
        'extract_flat': True,
        'playlistend': 50,  # Last 50 videos
    }
    # Fetch the channel's uploads
    videos_url = channel_url.rstrip('/') + '/videos'
    with yt_dlp.YoutubeDL(yt_opts) as ydl:
        results = ydl.extract_info(videos_url, download=False)
    videos = results.get('entries', [])

    # Analyze recent uploads
    recent = []
    for v in videos:
        if not v:
            continue
        recent.append({
            'title': v.get('title'),
            'views': v.get('view_count', 0),
            'duration': v.get('duration'),
            'id': v.get('id'),
        })

    # Calculate metrics
    total_views = sum(v.get('views', 0) or 0 for v in recent)
    avg_views = total_views // len(recent) if recent else 0
    return {
        'channel': results.get('channel', results.get('title')),
        'recent_videos': len(recent),
        'total_recent_views': total_views,
        'avg_views_per_video': avg_views,
        'top_videos': sorted(recent, key=lambda x: x.get('views', 0) or 0, reverse=True)[:5],
        'analyzed_at': datetime.now().isoformat(),
    }

# Usage
report = track_competitor_channel('https://www.youtube.com/@TechWithTim')
print(f"Channel: {report['channel']}")
print(f"Recent videos: {report['recent_videos']}")
print(f"Avg views: {report['avg_views_per_video']:,}")
print("Top videos:")
for v in report['top_videos']:
    print(f"  {v['title']} | {(v.get('views') or 0):,} views")
```
### 3. Sentiment Analysis from Comments
```python
import asyncio

from playwright.async_api import async_playwright

async def analyze_video_sentiment(video_url, max_comments=100):
    """Scrape comments and analyze sentiment."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(video_url, wait_until='networkidle')

        # Scroll to load comments
        for _ in range(10):
            await page.evaluate('window.scrollBy(0, 1000)')
            await asyncio.sleep(1.5)

        comments = await page.evaluate('''() => {
            return Array.from(document.querySelectorAll('#content-text'))
                .map(el => el.textContent.trim())
                .filter(t => t.length > 0);
        }''')
        await browser.close()

    # Simple keyword-based sentiment (replace with an NLP model in production)
    positive_words = {'great', 'amazing', 'love', 'awesome', 'best', 'excellent', 'helpful', 'thank', 'perfect', 'fantastic'}
    negative_words = {'bad', 'worst', 'hate', 'terrible', 'awful', 'waste', 'boring', 'useless', 'scam', 'disappointed'}

    sentiment_scores = []
    for comment in comments[:max_comments]:
        words = set(comment.lower().split())
        pos = len(words & positive_words)
        neg = len(words & negative_words)
        score = 'positive' if pos > neg else ('negative' if neg > pos else 'neutral')
        sentiment_scores.append({'comment': comment[:200], 'sentiment': score})

    total = len(sentiment_scores)
    return {
        'total_comments': total,
        'positive': sum(1 for s in sentiment_scores if s['sentiment'] == 'positive'),
        'negative': sum(1 for s in sentiment_scores if s['sentiment'] == 'negative'),
        'neutral': sum(1 for s in sentiment_scores if s['sentiment'] == 'neutral'),
        'positive_pct': f"{sum(1 for s in sentiment_scores if s['sentiment'] == 'positive') / total * 100:.1f}%" if total else '0%',
        'sample_positive': [s['comment'] for s in sentiment_scores if s['sentiment'] == 'positive'][:3],
        'sample_negative': [s['comment'] for s in sentiment_scores if s['sentiment'] == 'negative'][:3],
    }

# Usage
report = asyncio.run(analyze_video_sentiment(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
))
print(f"Sentiment: {report['positive_pct']} positive ({report['total_comments']} comments)")
```
## Legal Considerations
Before scraping YouTube at scale, understand the legal landscape:
- YouTube Terms of Service: Section 5 explicitly prohibits "access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, modify or otherwise use any part of the Service" through automated means. However, ToS violations are contract disputes, not criminal matters.
- hiQ v. LinkedIn (2022): The Ninth Circuit ruled that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act (CFAA). This precedent applies broadly to public web data.
- Van Buren v. United States (2021): The Supreme Court narrowed the CFAA's "exceeds authorized access" provision, generally supporting that accessing publicly available data isn't a federal crime.
- Enforcement by Google: Google has pursued legal action against some YouTube scrapers, particularly those that download or redistribute copyrighted content.
- Copyright: Scraping metadata is different from downloading/redistributing videos. Metadata extraction is generally lower risk; video downloading is higher risk.
- GDPR/CCPA: Comments and channel data may constitute personal data under privacy regulations. Handle with care, especially for EU users.
## Getting Started
Choose the right method based on your needs:
| Method | Best For | Setup Time | Maintenance |
|---|---|---|---|
| Python + yt-dlp | Video metadata, search, transcripts | 5 min | Low (yt-dlp is well-maintained) |
| Playwright | Comments, dynamic content, full pages | 30 min | Medium |
| Node.js + Puppeteer | Stealth scraping, API interception | 30 min | Medium |
| Mantis API | Production workloads, zero maintenance | 5 min | Zero |
For most use cases, start with yt-dlp for metadata (it's incredibly robust) and add Playwright for comments/dynamic data. When you need to scale or eliminate maintenance, switch to the Mantis API.
## Stop Fighting YouTube's Anti-Bot Systems

Mantis handles proxy rotation, browser fingerprinting, and JS rendering so you can focus on the data. Start free with 100 requests/month, no credit card required.

View Pricing · Get Started Free