YouTube Transcript API - AI Training Data
Pricing
from $0.01 / youtube transcript extraction
Extract YouTube video transcripts optimized for AI and machine learning workflows. Features chunking for LLM context limits, SRT/VTT formats, and music symbol removal. Perfect for building training datasets, content analysis, and subtitle generation.
Developer
Tan Analytics
YouTube Transcript Extractor - AI Training Data
Extract YouTube video transcripts optimized for AI and machine learning workflows.
Why This Actor?
| Feature | Free Tools | This Actor |
|---|---|---|
| AI chunking | ❌ | ✅ Split by character limit (approximate token budget) |
| Token counting | ❌ | ✅ Estimated tokens |
| Clean transcripts | ❌ | ✅ Remove ♪ [music] |
| SRT/VTT formats | ❌ | ✅ All included |
| Video metadata | ❌ | ✅ Title, author, thumbnail |
| Affordable | Limited | $0.01 per video |
Use Cases
AI Training Data
Build high-quality training datasets from YouTube videos. Chunked transcripts fit any LLM context window.
Content Analysis
Analyze video content. Get word counts, token estimates, and structured metadata.
Subtitle Generation
Export transcripts in SRT or VTT format for video editing, captions, or accessibility.
Academic Research
Extract lectures, interviews, and documentaries. Clean transcripts ready for analysis.
Features
🎯 AI-Optimized Output
- Smart chunking - Split transcripts to fit your LLM's context window
- Token estimation - Get an approximate token count for each transcript
- Clean mode - Remove music symbols (♪), [applause], [laughter] for cleaner training data
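The clean mode described above can be approximated locally. This is a minimal sketch, not the actor's actual implementation: the regular expressions and the function name `clean_transcript` are assumptions chosen to match the examples in this README (♪, [applause], [laughter], [music]).

```python
import re

# Assumed patterns; the actor's real cleaning rules are not documented here.
MUSIC_SYMBOLS = re.compile(r"[♪♫]")
BRACKETED_TAGS = re.compile(r"\[(?:music|applause|laughter)\]", re.IGNORECASE)
WHITESPACE_RUNS = re.compile(r"\s+")


def clean_transcript(text: str) -> str:
    """Strip music symbols and bracketed sound tags, then tidy whitespace."""
    text = MUSIC_SYMBOLS.sub(" ", text)
    text = BRACKETED_TAGS.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_RUNS.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_transcript("♪ We're no strangers to love ♪"))
```

Removing the tags before collapsing whitespace avoids double spaces where a tag sat between two words.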
📄 Multiple Formats
- Plain text - Raw transcript
- SRT subtitles - For video editors
- VTT subtitles - Web-compatible
- Timestamps - Optional [MM:SS] markers
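For readers who want to regenerate these formats from timed segments themselves, the following sketch shows how SRT, VTT, and [MM:SS] markers differ. It assumes segments arrive as hypothetical `(start, end, text)` tuples in seconds; the actor's internal representation is not documented here.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text); assumed shape


def format_timestamp(seconds: float, decimal_sep: str) -> str:
    """Render HH:MM:SS<sep>mmm; SRT uses ',' and VTT uses '.'."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_sep}{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """SRT cues are numbered from 1 and separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks)


def to_vtt(segments: Iterable[Segment]) -> str:
    """VTT starts with a WEBVTT header and does not require cue numbers."""
    cues = []
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{timing}\n{text}")
    return "WEBVTT\n\n" + "\n\n".join(cues)


def mmss_marker(seconds: float) -> str:
    """Inline [MM:SS] marker for plain-text output."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"
```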
📊 Metadata Enrichment
- Video title and author
- Thumbnail URL
- Duration (formatted and raw)
- Word count and character count
- Detected language
🔒 Reliability
- Automatic proxy fallback (Direct → Datacenter → Residential)
- YouTube Shorts support
- Multi-language transcripts
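The Direct → Datacenter → Residential fallback can be pictured as a simple ordered retry. The sketch below is illustrative only: `fetch_with_fallback`, the tier names, and the `fetch` callback are hypothetical and do not reflect the actor's real networking code.

```python
from typing import Callable, Optional, Sequence

# Tier order taken from the feature list; the string labels are assumptions.
PROXY_TIERS = ("direct", "datacenter", "residential")


def fetch_with_fallback(
    fetch: Callable[[str], str],
    tiers: Sequence[str] = PROXY_TIERS,
) -> str:
    """Try each tier in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for tier in tiers:
        try:
            return fetch(tier)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all tiers failed: {', '.join(tiers)}") from last_error
```

Trying the cheapest route first keeps costs down when a direct request is enough.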
Pricing
$0.01 per transcript extraction
| Videos | Cost |
|---|---|
| 10 | $0.10 |
| 100 | $1.00 |
| 1,000 | $10.00 |
No monthly commitment. Pay only for what you use.
Quick Start
Input
{
  "videoUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "language": "en",
  "chunkSize": 2000,
  "cleanTranscript": true,
  "outputFormat": "text"
}
Output
{
  "videoUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "videoId": "dQw4w9WgXcQ",
  "transcript": "♪ We're no strangers to love ♪",
  "transcriptClean": "We're no strangers to love",
  "chunks": [
    {
      "id": 0,
      "text": "We're no strangers to love...",
      "start": 1.36,
      "end": 110.0,
      "wordCount": 230
    }
  ],
  "metadata": {
    "title": "Rick Astley - Never Gonna Give You Up",
    "author": "Rick Astley",
    "thumbnailUrl": "https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
    "duration": 211.32,
    "durationFormatted": "03:31",
    "wordCount": 367,
    "estimatedTokens": 488,
    "language": "en"
  },
  "transcriptSRT": "1\n00:00:01,360 --> 00:00:03,040\n♪ We're no strangers to love ♪",
  "transcriptVTT": "WEBVTT\n\n00:00:01.360 --> 00:00:03.040\n♪ We're no strangers to love ♪"
}
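One way to turn an output record into training data is to flatten its chunks into JSON Lines. This sketch relies only on the field names shown in the sample output; the function name `chunks_to_jsonl` and the choice of record fields are assumptions.

```python
import json


def chunks_to_jsonl(item: dict) -> str:
    """Emit one JSON object per chunk, tagged with the video ID and title."""
    metadata = item.get("metadata", {})
    lines = []
    for chunk in item.get("chunks", []):
        record = {
            "videoId": item.get("videoId"),
            "title": metadata.get("title"),
            "chunkId": chunk["id"],
            "start": chunk["start"],
            "end": chunk["end"],
            "text": chunk["text"],
        }
        # ensure_ascii=False keeps non-English transcripts readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Keeping `start` and `end` on each record makes it possible to trace a training example back to its position in the video.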
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| videoUrl | string | required | YouTube video URL |
| language | string | "en" | Preferred transcript language |
| chunkSize | integer | 2000 | Max chars per chunk (0 = off) |
| cleanTranscript | boolean | false | Remove music symbols and filler |
| includeMetadata | boolean | true | Include video metadata |
| outputFormat | string | "text" | Format: text, srt, or vtt |
| includeTimestamps | boolean | true | Add [MM:SS] timestamps |
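The parameters above can be assembled and checked in code before starting a run. The sketch below is hedged on several points: `build_run_input` and `run_actor` are hypothetical helpers, the actor ID must be supplied by the reader because it is not given in this README, and it is an assumption that results land in the run's default dataset. The `apify-client` calls shown (`ApifyClient`, `actor(...).call(run_input=...)`, `dataset(...).iterate_items()`) follow that package's documented usage.

```python
from typing import Any, Dict, List

# Defaults copied from the Input Parameters table.
DEFAULTS: Dict[str, Any] = {
    "language": "en",
    "chunkSize": 2000,
    "cleanTranscript": False,
    "includeMetadata": True,
    "outputFormat": "text",
    "includeTimestamps": True,
}
ALLOWED_FORMATS = {"text", "srt", "vtt"}


def build_run_input(video_url: str, **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and validate them."""
    if not video_url:
        raise ValueError("videoUrl is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    run_input = {"videoUrl": video_url, **DEFAULTS, **overrides}
    if run_input["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError("outputFormat must be text, srt, or vtt")
    if run_input["chunkSize"] < 0:
        raise ValueError("chunkSize must be 0 (off) or a positive integer")
    return run_input


def run_actor(token: str, actor_id: str, run_input: Dict[str, Any]) -> List[dict]:
    """Start a run and collect its items (assumes default-dataset output)."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Validating locally catches typos such as a misspelled parameter name before a paid run is started.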
Supported URLs
https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/shorts/VIDEO_ID
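To check a URL against the three supported forms before submitting it, a small parser is enough. This is a sketch using only the standard library; `extract_video_id` is a hypothetical helper, and the actor may accept other URL variants not covered here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for watch, youtu.be, and shorts URLs, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path_parts = [part for part in parsed.path.split("/") if part]

    if host == "youtu.be":
        return path_parts[0] if path_parts else None
    if host == "youtube.com":
        if parsed.path == "/watch":
            # parse_qs returns lists; take the first "v" value if present.
            return parse_qs(parsed.query).get("v", [None])[0]
        if len(path_parts) == 2 and path_parts[0] == "shorts":
            return path_parts[1]
    return None
```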
FAQ
Q: What if a video has no transcript?
A: The actor returns an error for that video.
Q: Can I extract transcripts in other languages?
A: Yes. Set language to the ISO code (e.g., "es" for Spanish).
Q: What's the maximum chunk size?
A: Default is 2000 characters (~500 tokens). Set to 0 to disable chunking.
Q: How accurate is token estimation?
A: We use ~1.33 tokens per word as a rough estimate.
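The chunking and token-estimate answers above can be sketched together. The splitting strategy here (greedy, on word boundaries) is an assumption, since the README only states a maximum character count per chunk; the 1.33 tokens-per-word ratio comes from the FAQ. Both function names are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2000) -> List[str]:
    """Greedily pack whole words into chunks of at most chunk_size characters.

    chunk_size <= 0 disables chunking. A single word longer than
    chunk_size is kept whole, so such a chunk can exceed the limit.
    """
    if chunk_size <= 0:
        return [text] if text else []
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def estimate_tokens(text: str, tokens_per_word: float = 1.33) -> int:
    """Rough estimate from word count; real tokenizers will differ."""
    return round(len(text.split()) * tokens_per_word)
```

With this ratio, the 367-word sample transcript comes out at 488 estimated tokens, which matches the `estimatedTokens` value in the sample output.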
Support
Open an issue on GitHub, or contact us for enterprise pricing on large volumes.