YouTube is the world's second-largest search engine and the biggest video platform, with over 800 million videos and 2.7 billion monthly active users. For developers, researchers, and AI agents, YouTube data is a goldmine: video trends, audience sentiment, competitor analysis, content research, and market intelligence.

But getting that data at scale isn't easy. The YouTube Data API v3 has strict quotas (10,000 units/day, roughly 100 search queries), doesn't expose transcripts, and limits comment retrieval. For anything beyond basic metadata, you need to scrape.

In this guide, you'll learn four methods to scrape YouTube data, from simple Python scripts to production-ready API solutions, plus how to handle YouTube's anti-bot measures, legal considerations, and real-world use cases with code.
## What Data Can You Extract from YouTube?
| Data Type | Available via Scraping | YouTube API v3 | Notes |
|---|---|---|---|
| Video title, description, tags | ✅ | ✅ | API costs 1 unit per video |
| View count, likes | ✅ | ✅ | Dislikes hidden since 2021 (some scrapers estimate) |
| Channel subscribers, video count | ✅ | ✅ | API costs 1 unit per channel |
| Comments & replies | ✅ | ✅ (limited) | API returns max 100 per page, costs 1 unit each |
| Video transcripts/captions | ✅ | ❌ | Major gap: scraping is the only way |
| Search results | ✅ | ✅ | API costs 100 units per search (expensive!) |
| Related/recommended videos | ✅ | ❌ (deprecated) | Removed from API in 2023 |
| Trending videos by country | ✅ | ✅ | API limited to 200 results |
| Playlist contents | ✅ | ✅ | API costs 1 unit per 50 items |
| Shorts metadata | ✅ | Partial | API doesn't distinguish Shorts from regular videos |
| Hashtag pages | ✅ | ❌ | No API endpoint for hashtag discovery |
| Revenue/analytics | ❌ | Own channel only | YouTube Analytics API: creator's own data only |
## Method 1: Python + yt-dlp + Requests (Fastest Setup)

The fastest way to extract YouTube data in Python is combining yt-dlp (the most actively maintained YouTube extractor) with direct HTTP requests for page data.

### Extract Video Metadata with yt-dlp
```python
import json

import yt_dlp

def scrape_video_metadata(url):
    """Extract comprehensive video metadata without downloading."""
    yt_opts = {
        'quiet': True,
        'skip_download': True,  # Don't download the video file
        'extract_flat': False,
    }
    with yt_dlp.YoutubeDL(yt_opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        'title': info.get('title'),
        'description': info.get('description'),
        'view_count': info.get('view_count'),
        'like_count': info.get('like_count'),
        'duration': info.get('duration'),
        'upload_date': info.get('upload_date'),
        'channel': info.get('channel'),
        'channel_id': info.get('channel_id'),
        'channel_url': info.get('channel_url'),
        'subscriber_count': info.get('channel_follower_count'),
        'tags': info.get('tags', []),
        'categories': info.get('categories', []),
        'thumbnail': info.get('thumbnail'),
        'comment_count': info.get('comment_count'),
        'age_limit': info.get('age_limit'),
        'is_live': info.get('is_live'),
        'was_live': info.get('was_live'),
    }

# Usage
video = scrape_video_metadata('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
print(json.dumps(video, indent=2))
```
### Scrape YouTube Search Results
```python
import json
import re

import requests

def scrape_youtube_search(query, max_results=20):
    """Scrape YouTube search results without API quotas."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    # Let requests URL-encode the query instead of interpolating it raw
    response = requests.get(
        'https://www.youtube.com/results',
        params={'search_query': query},
        headers=headers,
    )
    # Extract ytInitialData from the page source
    match = re.search(r'var ytInitialData = ({.*?});', response.text, re.DOTALL)
    if not match:
        return []
    data = json.loads(match.group(1))
    # Navigate the nested structure
    results = []
    try:
        contents = (data['contents']['twoColumnSearchResultsRenderer']
                    ['primaryContents']['sectionListRenderer']['contents'][0]
                    ['itemSectionRenderer']['contents'])
        for item in contents[:max_results]:
            if 'videoRenderer' in item:
                video = item['videoRenderer']
                snippets = video.get('detailedMetadataSnippets', [{}])
                results.append({
                    'video_id': video['videoId'],
                    'title': video['title']['runs'][0]['text'],
                    'url': f"https://www.youtube.com/watch?v={video['videoId']}",
                    'channel': video.get('ownerText', {}).get('runs', [{}])[0].get('text', ''),
                    'views': video.get('viewCountText', {}).get('simpleText', ''),
                    'published': video.get('publishedTimeText', {}).get('simpleText', ''),
                    'duration': video.get('lengthText', {}).get('simpleText', ''),
                    'description': ''.join(
                        s.get('text', '')
                        for s in snippets[0].get('snippetText', {}).get('runs', [])
                    ),
                })
    except (KeyError, IndexError):
        pass
    return results

# Usage
results = scrape_youtube_search('web scraping python tutorial 2026')
for r in results[:5]:
    print(f"{r['title']} | {r['views']} | {r['channel']}")
```
### Extract Video Transcripts
```python
# Requires: pip install "youtube-transcript-api<1.0"
# (1.0+ replaced the get_transcript() classmethod with YouTubeTranscriptApi().fetch())
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id, language='en'):
    """Extract video transcript/captions."""
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
    except Exception as e:
        return {'error': str(e)}
    # Full text
    full_text = ' '.join(entry['text'] for entry in transcript)
    # Timestamped entries
    return {
        'full_text': full_text,
        'segments': [{
            'text': entry['text'],
            'start': entry['start'],
            'duration': entry['duration'],
        } for entry in transcript],
        'word_count': len(full_text.split()),
    }

# Usage
transcript = get_transcript('dQw4w9WgXcQ')
print(f"Words: {transcript.get('word_count', 0)}")
print(transcript.get('full_text', '')[:500])
```
## Method 2: Playwright Headless Browser (Full JS Rendering)

For data that requires JavaScript rendering (comments, dynamically loaded content, infinite scroll, Shorts), Playwright is the best option.
```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_video_with_comments(video_url, max_comments=50):
    """Scrape video metadata + comments with Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            viewport={'width': 1920, 'height': 1080},
        )
        page = await context.new_page()
        await page.goto(video_url, wait_until='networkidle')

        # Extract ytInitialPlayerResponse for video details
        player_data = await page.evaluate('() => window.ytInitialPlayerResponse')

        # Parse video metadata
        video_data = {}
        if player_data:
            video_details = player_data.get('videoDetails', {})
            video_data = {
                'title': video_details.get('title'),
                'video_id': video_details.get('videoId'),
                'views': video_details.get('viewCount'),
                'author': video_details.get('author'),
                'channel_id': video_details.get('channelId'),
                'duration_seconds': video_details.get('lengthSeconds'),
                'keywords': video_details.get('keywords', []),
                'description': video_details.get('shortDescription'),
                'is_live': video_details.get('isLiveContent'),
            }

        # Scroll to trigger lazy-loading of the comment section
        for _ in range(5):
            await page.evaluate('window.scrollBy(0, 800)')
            await asyncio.sleep(2)

        # Extract comments from the DOM
        comments = []
        comment_elements = await page.query_selector_all('#content-text')
        for elem in comment_elements[:max_comments]:
            text = await elem.inner_text()
            if text.strip():
                comments.append(text.strip())

        video_data['comments'] = comments
        video_data['comment_count_scraped'] = len(comments)
        await browser.close()
        return video_data

# Usage
data = asyncio.run(scrape_video_with_comments(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    max_comments=30,
))
print(f"Title: {data.get('title')}")
print(f"Views: {data.get('views')}")
print(f"Comments scraped: {data.get('comment_count_scraped')}")
```
### Scrape a YouTube Channel's Video List
```python
async def scrape_channel_videos(channel_url, max_videos=100):
    """Scrape all videos from a YouTube channel."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()

        # Navigate to the channel's Videos tab
        videos_url = channel_url.rstrip('/') + '/videos'
        await page.goto(videos_url, wait_until='networkidle')

        videos = []
        last_count = 0
        # Scroll to load more videos
        while len(videos) < max_videos:
            # Extract video data from the page
            new_videos = await page.evaluate('''() => {
                const items = document.querySelectorAll('ytd-rich-item-renderer');
                return Array.from(items).map(item => {
                    const title = item.querySelector('#video-title');
                    const meta = item.querySelector('#metadata-line');
                    const link = title?.getAttribute('href');
                    const spans = meta?.querySelectorAll('span') || [];
                    return {
                        title: title?.textContent?.trim(),
                        url: link ? 'https://www.youtube.com' + link : null,
                        views: spans[0]?.textContent?.trim() || '',
                        published: spans[1]?.textContent?.trim() || '',
                    };
                }).filter(v => v.title && v.url);
            }''')
            videos = new_videos
            if len(videos) == last_count:
                break  # No more videos loading
            last_count = len(videos)
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await asyncio.sleep(2)

        await browser.close()
        return videos[:max_videos]

# Usage
videos = asyncio.run(scrape_channel_videos(
    'https://www.youtube.com/@GoogleDevelopers',
    max_videos=50,
))
print(f"Found {len(videos)} videos")
for v in videos[:5]:
    print(f"  {v['title']} | {v['views']}")
```
## Method 3: Node.js + Puppeteer (Stealth Scraping)

Node.js with Puppeteer and the stealth plugin is excellent for YouTube because it closely mimics real browser behavior, reducing detection risk.
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

async function scrapeYouTubeVideo(videoUrl) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });

  // Intercept YouTube's internal API responses
  const apiResponses = [];
  page.on('response', async (response) => {
    const url = response.url();
    if (url.includes('youtubei/v1/next')) {
      try {
        const json = await response.json();
        apiResponses.push(json);
      } catch (e) {}
    }
  });

  await page.goto(videoUrl, { waitUntil: 'networkidle2' });

  // Extract data from ytInitialPlayerResponse
  const videoData = await page.evaluate(() => {
    const playerResponse = window.ytInitialPlayerResponse;
    const initialData = window.ytInitialData;
    const details = playerResponse?.videoDetails || {};

    // Extract engagement metrics from initialData
    const contents = initialData?.contents?.twoColumnWatchNextResults
      ?.results?.results?.contents || [];
    let likes = '';
    for (const content of contents) {
      const buttons = content?.videoPrimaryInfoRenderer?.videoActions
        ?.menuRenderer?.topLevelButtons || [];
      for (const btn of buttons) {
        const toggle = btn?.segmentedLikeDislikeButtonViewModel
          ?.likeButtonViewModel?.likeButtonViewModel?.toggleButtonViewModel
          ?.toggleButtonViewModel?.defaultButtonViewModel?.buttonViewModel;
        if (toggle?.title) {
          likes = toggle.title;
          break;
        }
      }
    }

    return {
      title: details.title,
      videoId: details.videoId,
      views: parseInt(details.viewCount) || 0,
      likes,
      duration: parseInt(details.lengthSeconds) || 0,
      author: details.author,
      channelId: details.channelId,
      subscribers: details.channelFollowerCount,
      description: details.shortDescription,
      keywords: details.keywords || [],
      isLive: details.isLiveContent,
      publishDate: playerResponse?.microformat?.playerMicroformatRenderer
        ?.publishDate,
      category: playerResponse?.microformat?.playerMicroformatRenderer
        ?.category,
    };
  });

  // Scroll to load comments
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollBy(0, 600));
    await new Promise(r => setTimeout(r, 2000));
  }

  // Extract comments
  const comments = await page.evaluate(() => {
    const commentElements = document.querySelectorAll('#content-text');
    return Array.from(commentElements).map(el => el.textContent.trim())
      .filter(t => t.length > 0);
  });
  videoData.comments = comments;

  await browser.close();
  return videoData;
}

// Usage
(async () => {
  const data = await scrapeYouTubeVideo(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
  );
  console.log(`Title: ${data.title}`);
  console.log(`Views: ${data.views.toLocaleString()}`);
  console.log(`Likes: ${data.likes}`);
  console.log(`Comments: ${data.comments.length}`);
})();
```
### Batch Scrape YouTube Search Results
```javascript
async function scrapeYouTubeSearch(query, maxResults = 20) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  const searchUrl = `https://www.youtube.com/results?search_query=${
    encodeURIComponent(query)
  }`;
  await page.goto(searchUrl, { waitUntil: 'networkidle2' });

  const results = await page.evaluate(() => {
    const data = window.ytInitialData;
    const contents = data?.contents?.twoColumnSearchResultsRenderer
      ?.primaryContents?.sectionListRenderer?.contents?.[0]
      ?.itemSectionRenderer?.contents || [];
    return contents
      .filter(item => item.videoRenderer)
      .map(item => {
        const v = item.videoRenderer;
        return {
          videoId: v.videoId,
          title: v.title?.runs?.[0]?.text,
          url: `https://www.youtube.com/watch?v=${v.videoId}`,
          channel: v.ownerText?.runs?.[0]?.text,
          views: v.viewCountText?.simpleText,
          published: v.publishedTimeText?.simpleText,
          duration: v.lengthText?.simpleText,
          thumbnail: v.thumbnail?.thumbnails?.pop()?.url,
        };
      });
  });

  await browser.close();
  return results.slice(0, maxResults);
}

// Usage
(async () => {
  const results = await scrapeYouTubeSearch('ai agent tutorial 2026');
  results.forEach(r => {
    console.log(`${r.title} | ${r.views} | ${r.channel}`);
  });
})();
```
## Method 4: Mantis API (Production-Ready, One Call)

For production workloads, the Mantis WebPerception API handles anti-bot measures, proxy rotation, and JavaScript rendering automatically. One API call returns structured, extracted data.
```python
import requests

# Scrape a YouTube video page
response = requests.post('https://api.mantisapi.com/v1/scrape', json={
    'url': 'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    'render_js': True,
    'wait_for': 'networkidle',
    'extract': {
        'title': 'meta[property="og:title"]@content',
        'description': 'meta[property="og:description"]@content',
        'channel': 'link[itemprop="name"]@content',
        'views': 'meta[itemprop="interactionCount"]@content',
    },
}, headers={
    'Authorization': 'Bearer YOUR_API_KEY',
})

data = response.json()
print(data['extracted'])
```
```javascript
// Scrape YouTube search results with Mantis
const axios = require('axios');

(async () => {
  const response = await axios.post('https://api.mantisapi.com/v1/scrape', {
    url: 'https://www.youtube.com/results?search_query=web+scraping+api',
    render_js: true,
    wait_for: 'networkidle',
    screenshot: true,        // Optional: get a screenshot
    extract_schema: 'auto',  // AI-powered structured extraction
  }, {
    headers: { 'Authorization': 'Bearer YOUR_API_KEY' },
  });
  console.log(response.data);
})();
```
## YouTube's Anti-Bot Measures (What You'll Face)

YouTube uses sophisticated detection to block scrapers:
| Defense | How It Works | How to Bypass |
|---|---|---|
| Rate limiting | Blocks IPs making too many requests | Rotating residential proxies, random delays (3-10s) |
| JavaScript challenges | Page requires JS execution to render data | Headless browser (Playwright/Puppeteer) or yt-dlp |
| Consent walls | Cookie consent popup (EU) blocks content | Set consent cookies, dismiss via automation |
| Bot detection | Fingerprinting, WebDriver detection, behavioral analysis | Stealth plugins, realistic mouse movements, human-like timing |
| Dynamic class names | CSS class names change between deployments | Use data attributes, aria labels, or ytInitialData JSON instead |
| Age-gated content | Requires login to view | yt-dlp with cookies, authenticated browser sessions |
| Geo-restrictions | Content blocked by country | Proxies in the target country |
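The rate-limiting and consent-wall rows above can be sketched in a few lines. This is a minimal illustration, not a production setup: the proxy URLs are placeholders you would replace with your own pool, and the `SOCS` consent cookie value is an assumption that YouTube may change at any time.

```python
import random
import time

import requests

# Placeholder proxy pool -- substitute real residential proxy endpoints.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# EU consent wall: sending a consent cookie up front usually skips the popup.
# The exact cookie value is an assumption and may change over time.
COOKIES = {'SOCS': 'CAI'}

def polite_get(url):
    """Fetch a page through a random proxy after a human-like delay."""
    proxy = random.choice(PROXIES)
    time.sleep(random.uniform(3, 10))  # random 3-10s delay between requests
    return requests.get(
        url,
        headers=HEADERS,
        cookies=COOKIES,
        proxies={'http': proxy, 'https': proxy},
        timeout=30,
    )
```

In practice you would also retire proxies that start returning 429s and vary the delay distribution per proxy.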
## YouTube Data API v3 vs Scraping vs Mantis
| Feature | YouTube API v3 | DIY Scraping | Mantis API |
|---|---|---|---|
| Cost | Free (10K units/day) | Proxy costs ($50-200/mo) | From $29/mo (5K requests) |
| Video metadata | ✅ | ✅ | ✅ |
| Transcripts | ❌ | ✅ | ✅ |
| Comments (full threads) | Partial (quota-heavy) | ✅ | ✅ |
| Search (unlimited) | 100 searches/day max | ✅ | ✅ |
| Related videos | ❌ (deprecated) | ✅ | ✅ |
| Anti-bot handling | N/A | You build it | Built-in |
| Rate limits | 10K units/day | Depends on proxies | Plan-based |
| Maintenance | Low | High (selectors break) | Zero |
| Setup time | 30 min | Days-weeks | 5 min |
The API's biggest limitation: a single search query costs 100 units. With 10,000 free units per day, you can only make ~100 searches. At scale, the YouTube API becomes impractical, and raising the quota means applying for an extension through Google's compliance audit process.
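The quota arithmetic is easy to verify. A tiny helper, using the unit costs cited above (100 units per `search.list` call, 1 unit per `videos.list` call):

```python
SEARCH_COST = 100    # units per search.list call
VIDEO_COST = 1       # units per videos.list call
DAILY_QUOTA = 10_000  # free daily quota

def max_daily_searches(videos_per_search=0):
    """How many searches fit in the free daily quota if each
    search is followed by N per-video detail lookups."""
    per_search = SEARCH_COST + videos_per_search * VIDEO_COST
    return DAILY_QUOTA // per_search

print(max_daily_searches())    # 100 -- metadata from search results only
print(max_daily_searches(50))  # 66  -- if you also fetch 50 videos per search
```

Fetching details for every result cuts your effective search budget by a third, which is why transcript and comment workloads hit the wall so quickly.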
## Real-World Use Cases with Code

### 1. Content Research Tool for AI Agents
```python
import yt_dlp
from youtube_transcript_api import YouTubeTranscriptApi

class YouTubeResearcher:
    """AI agent tool for YouTube content research."""

    def research_topic(self, query, top_n=5):
        """Find and analyze top videos for a topic."""
        # Search for videos
        yt_opts = {
            'quiet': True,
            'skip_download': True,
            'extract_flat': True,
            'default_search': f'ytsearch{top_n}',
        }
        with yt_dlp.YoutubeDL(yt_opts) as ydl:
            results = ydl.extract_info(query, download=False)

        analyses = []
        for entry in results.get('entries', [])[:top_n]:
            video_id = entry.get('id')
            # Get transcript
            try:
                transcript = YouTubeTranscriptApi.get_transcript(video_id)
                text = ' '.join(t['text'] for t in transcript)
            except Exception:
                text = 'Transcript not available'
            analyses.append({
                'title': entry.get('title'),
                'video_id': video_id,
                'url': entry.get('url'),
                'channel': entry.get('channel'),
                'view_count': entry.get('view_count'),
                'duration': entry.get('duration'),
                'transcript_preview': text[:1000],
                'word_count': len(text.split()),
            })
        return {
            'query': query,
            'results': analyses,
            'total_views': sum(a.get('view_count', 0) or 0 for a in analyses),
        }

# Usage: well suited to AI agents analyzing content
researcher = YouTubeResearcher()
report = researcher.research_topic('web scraping best practices 2026')
for r in report['results']:
    print(f"{r['title']} | {(r['view_count'] or 0):,} views | {r['word_count']} words")
```
### 2. YouTube Competitor Tracker
```python
from datetime import datetime

import yt_dlp

def track_competitor_channel(channel_url):
    """Track a competitor's recent YouTube activity."""
    yt_opts = {
        'quiet': True,
        'skip_download': True,
        'extract_flat': True,
        'playlistend': 50,  # Last 50 videos
    }
    # Fetch the channel's uploads
    videos_url = channel_url.rstrip('/') + '/videos'
    with yt_dlp.YoutubeDL(yt_opts) as ydl:
        results = ydl.extract_info(videos_url, download=False)
    videos = results.get('entries', [])

    # Analyze recent uploads
    recent = []
    for v in videos:
        if not v:
            continue
        recent.append({
            'title': v.get('title'),
            'views': v.get('view_count', 0),
            'duration': v.get('duration'),
            'id': v.get('id'),
        })

    # Calculate metrics
    total_views = sum(v.get('views', 0) or 0 for v in recent)
    avg_views = total_views // len(recent) if recent else 0
    return {
        'channel': results.get('channel', results.get('title')),
        'recent_videos': len(recent),
        'total_recent_views': total_views,
        'avg_views_per_video': avg_views,
        'top_videos': sorted(recent, key=lambda x: x.get('views', 0) or 0, reverse=True)[:5],
        'analyzed_at': datetime.now().isoformat(),
    }

# Usage
report = track_competitor_channel('https://www.youtube.com/@TechWithTim')
print(f"Channel: {report['channel']}")
print(f"Recent videos: {report['recent_videos']}")
print(f"Avg views: {report['avg_views_per_video']:,}")
print("Top videos:")
for v in report['top_videos']:
    print(f"  {v['title']} | {(v.get('views') or 0):,} views")
```
### 3. Sentiment Analysis from Comments
```python
import asyncio

from playwright.async_api import async_playwright

async def analyze_video_sentiment(video_url, max_comments=100):
    """Scrape comments and analyze sentiment."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(video_url, wait_until='networkidle')

        # Scroll to load comments
        for _ in range(10):
            await page.evaluate('window.scrollBy(0, 1000)')
            await asyncio.sleep(1.5)

        comments = await page.evaluate('''() => {
            return Array.from(document.querySelectorAll('#content-text'))
                .map(el => el.textContent.trim())
                .filter(t => t.length > 0);
        }''')
        await browser.close()

    # Simple keyword-based sentiment (replace with an NLP model in production)
    positive_words = {'great', 'amazing', 'love', 'awesome', 'best', 'excellent', 'helpful', 'thank', 'perfect', 'fantastic'}
    negative_words = {'bad', 'worst', 'hate', 'terrible', 'awful', 'waste', 'boring', 'useless', 'scam', 'disappointed'}

    sentiment_scores = []
    for comment in comments[:max_comments]:
        words = set(comment.lower().split())
        pos = len(words & positive_words)
        neg = len(words & negative_words)
        score = 'positive' if pos > neg else ('negative' if neg > pos else 'neutral')
        sentiment_scores.append({'comment': comment[:200], 'sentiment': score})

    total = len(sentiment_scores)
    return {
        'total_comments': total,
        'positive': sum(1 for s in sentiment_scores if s['sentiment'] == 'positive'),
        'negative': sum(1 for s in sentiment_scores if s['sentiment'] == 'negative'),
        'neutral': sum(1 for s in sentiment_scores if s['sentiment'] == 'neutral'),
        'positive_pct': f"{sum(1 for s in sentiment_scores if s['sentiment'] == 'positive') / total * 100:.1f}%" if total else '0%',
        'sample_positive': [s['comment'] for s in sentiment_scores if s['sentiment'] == 'positive'][:3],
        'sample_negative': [s['comment'] for s in sentiment_scores if s['sentiment'] == 'negative'][:3],
    }

# Usage
report = asyncio.run(analyze_video_sentiment(
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
))
print(f"Sentiment: {report['positive_pct']} positive ({report['total_comments']} comments)")
```
## Legal Considerations
Before scraping YouTube at scale, understand the legal landscape:
- YouTube Terms of Service: Section 5 explicitly prohibits "access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, modify or otherwise use any part of the Service" through automated means. However, ToS violations are contract disputes, not criminal matters.
- hiQ v. LinkedIn (2022): The Ninth Circuit ruled that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act (CFAA). This precedent applies broadly to public web data.
- Van Buren v. United States (2021): The Supreme Court narrowed the CFAA's "exceeds authorized access" provision, generally supporting that accessing publicly available data isn't a federal crime.
- Enforcement by Google: Google has pursued legal action against some YouTube scrapers, particularly those that download or redistribute copyrighted content.
- Copyright: Scraping metadata is different from downloading/redistributing videos. Metadata extraction is generally lower risk; video downloading is higher risk.
- GDPR/CCPA: Comments and channel data may constitute personal data under privacy regulations. Handle with care, especially for EU users.
## Getting Started
Choose the right method based on your needs:
| Method | Best For | Setup Time | Maintenance |
|---|---|---|---|
| Python + yt-dlp | Video metadata, search, transcripts | 5 min | Low (yt-dlp is well-maintained) |
| Playwright | Comments, dynamic content, full pages | 30 min | Medium |
| Node.js + Puppeteer | Stealth scraping, API interception | 30 min | Medium |
| Mantis API | Production workloads, zero maintenance | 5 min | Zero |
For most use cases, start with yt-dlp for metadata (it's incredibly robust) and add Playwright for comments/dynamic data. When you need to scale or eliminate maintenance, switch to the Mantis API.
## Stop Fighting YouTube's Anti-Bot Systems

Mantis handles proxy rotation, browser fingerprinting, and JS rendering so you can focus on the data. Start free with 100 requests/month, no credit card required.

View Pricing · Get Started Free