Youtube Transcript Scraper

Pricing

from $40.00 / 1,000 results

Extract transcripts and captions from YouTube videos with language selection support. Returns timestamped segments, full concatenated text, and basic video metadata.

Rating: 5.0 (27)
Developer: Crawler Bros (maintained by Community)

Actor stats

Bookmarked: 28
Total users: 14
Monthly active users: 3
Last modified: 4 days ago

Extract transcripts and captions from any public YouTube video. Get timestamped segments, full plain-text transcripts, language metadata, and core video info — ready for AI pipelines, content analysis, summarization, translation, and research.

When a video has no captions at all (uploader disabled them, or it's music-only with no on-screen text), an optional Whisper AI fallback can transcribe the audio directly so you still get a usable transcript.

What you get

For each video the scraper returns:

| Field | Description |
|---|---|
| video_id | YouTube 11-character video ID |
| title | Video title |
| channel_name | Channel display name |
| channel_id | Channel ID (when available) |
| duration_seconds | Video duration in seconds (when available) |
| views | View count (when available) |
| published_date | Publish date in YYYY-MM-DD (when available) |
| thumbnail | Thumbnail URL |
| transcript_language | Language code of the extracted transcript (e.g. en, es, ko) |
| transcript_language_name | Full language name |
| is_auto_generated | true if the transcript is YouTube's auto-caption, false for manually uploaded captions or Whisper output |
| transcript_source | library / innertube / playwright_dom / whisper; tells you which path produced the transcript |
| language_probability | Whisper's language-detection confidence (only set when transcript_source=whisper) |
| available_languages | Array of every transcript language available for the video |
| segments | Timestamped segments, each with start, dur, and text |
| segment_count | Number of segments returned |
| full_text | Complete transcript joined into a single string |
| success | true when a transcript was extracted, false otherwise |
| error | Reason text when success=false |
| inputUrl | The URL you submitted |
| scrapedAt | ISO 8601 UTC timestamp |

Empty fields are dropped from each record so the dataset stays clean.
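Because empty fields are dropped rather than set to null, code that reads the dataset should not assume every key is present. The following is a minimal sketch of tolerant field access; the function name `describe` and the placeholder strings are illustrative assumptions, not part of the actor.

```python
from typing import Any


def describe(record: dict[str, Any]) -> str:
    """Return a one-line summary of a dataset record, tolerating dropped fields."""
    video_id = record.get("video_id", "<unknown id>")
    if not record.get("success", False):
        # `error` carries the reason text when success is false.
        return f"{video_id}: FAILED ({record.get('error', 'no reason given')})"

    title = record.get("title", "<untitled>")
    language = record.get("transcript_language", "??")
    segment_count = record.get("segment_count", 0)

    # Optional fields such as `views` may be absent entirely, so test for None.
    views = record.get("views")
    views_part = f", {views:,} views" if views is not None else ""
    return f"{video_id}: {title} [{language}, {segment_count} segments{views_part}]"
```

Using `.get()` with a default keeps the same code path working for records that come with or without optional metadata.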

Input

| Parameter | Type | Default | Description |
|---|---|---|---|
| videoUrls | Array | required | YouTube watch URLs, youtu.be short links, Shorts URLs, embed URLs, or plain 11-char video IDs. |
| language | String | "" | Preferred language code (en, es, fr, de, ja, ko, …). Empty = best available. |
| includeAutoGenerated | Boolean | true | Include YouTube auto-captions when manual ones aren't available. |
| useWhisper | Boolean | false | Fall back to local Whisper transcription when YouTube has no transcript. Adds ~30-180 s per video. |
| whisperModel | Enum | base | tiny (fastest), base (balanced), small (most accurate). Pick small for music or noisy audio. |

Supported URL formats

  • https://www.youtube.com/watch?v=dQw4w9WgXcQ
  • https://youtu.be/dQw4w9WgXcQ
  • https://www.youtube.com/shorts/VIDEO_ID
  • https://www.youtube.com/embed/VIDEO_ID
  • Plain 11-char ID: dQw4w9WgXcQ
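If you want to deduplicate or validate inputs before starting a run, the formats above can be normalized to a bare video ID on the client side. This is a sketch under stated assumptions: `extract_video_id` is a hypothetical helper, the ID character set (letters, digits, `-`, `_`) is an assumption, and the actor's own parsing may differ.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are exactly 11 characters from this alphabet.
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(value: str) -> Optional[str]:
    """Return the 11-character video ID from a supported input, or None."""
    value = value.strip()
    if VIDEO_ID_RE.fullmatch(value):
        return value  # already a plain ID

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    parts = [p for p in parsed.path.split("/") if p]
    if host == "youtu.be" and parts:
        candidate = parts[0]                      # https://youtu.be/<id>
    elif host == "youtube.com":
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif len(parts) >= 2 and parts[0] in ("shorts", "embed"):
            candidate = parts[1]                  # /shorts/<id> or /embed/<id>

    if candidate and VIDEO_ID_RE.fullmatch(candidate):
        return candidate
    return None
```

Inputs that do not match any listed format return `None`, so they can be reported before the run starts rather than surfacing later as failed records.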

Example input — multiple videos, English preferred

{
  "videoUrls": [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://youtu.be/9bZkp7q19f0",
    "https://www.youtube.com/shorts/abc123XYZ45"
  ],
  "language": "en",
  "includeAutoGenerated": true
}

Example input — Whisper fallback for transcripts-disabled videos

{
  "videoUrls": ["https://www.youtube.com/watch?v=XqZsoesa55w"],
  "useWhisper": true,
  "whisperModel": "small"
}
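When inputs are generated programmatically, it can help to validate them against the parameter table before submitting a run. The sketch below builds the input object in Python; `build_run_input` and its keyword names are hypothetical, and only the JSON keys and allowed values come from the table above.

```python
from typing import Any, Iterable

WHISPER_MODELS = ("tiny", "base", "small")  # enum values from the input table


def build_run_input(
    video_urls: Iterable[str],
    language: str = "",
    include_auto_generated: bool = True,
    use_whisper: bool = False,
    whisper_model: str = "base",
) -> dict[str, Any]:
    """Build an input object using the documented keys and defaults."""
    urls = [u.strip() for u in video_urls if u and u.strip()]
    if not urls:
        raise ValueError("videoUrls is required and must contain at least one entry")
    if whisper_model not in WHISPER_MODELS:
        raise ValueError(f"whisperModel must be one of {WHISPER_MODELS}")

    run_input: dict[str, Any] = {
        "videoUrls": urls,
        "language": language,
        "includeAutoGenerated": include_auto_generated,
        "useWhisper": use_whisper,
    }
    # The model only matters when the Whisper fallback is enabled.
    if use_whisper:
        run_input["whisperModel"] = whisper_model
    return run_input
```

The resulting dictionary serializes with `json.dumps` to the same shape as the example inputs shown here.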

Example output

{
  "video_id": "dQw4w9WgXcQ",
  "title": "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
  "channel_name": "Rick Astley",
  "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
  "duration_seconds": 213,
  "views": 1769190465,
  "published_date": "2009-10-24",
  "thumbnail": "https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg",
  "transcript_language": "en",
  "transcript_language_name": "English",
  "is_auto_generated": false,
  "transcript_source": "library",
  "available_languages": [
    { "code": "en", "name": "English", "is_auto_generated": false },
    { "code": "es-419", "name": "Spanish (Latin America)", "is_auto_generated": false }
  ],
  "segments": [
    { "start": "1.360", "dur": "1.680", "text": "[♪♪♪]" },
    { "start": "18.640", "dur": "3.240", "text": "We're no strangers to love" }
  ],
  "segment_count": 61,
  "full_text": "[♪♪♪] We're no strangers to love You know the rules and so do I ...",
  "success": true,
  "inputUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "scrapedAt": "2026-05-05T13:42:18Z"
}

Use cases

  • AI training data — Build text corpora from YouTube content for LLM fine-tuning or RAG pipelines.
  • Content repurposing — Turn long videos into blog posts, summaries, or social copy.
  • Research and analysis — Pull spoken content from lectures, interviews, podcasts, and documentaries.
  • Subtitles and accessibility — Retrieve captions for translation or accessibility workflows.
  • SEO and keyword research — Analyse spoken keywords and topics across YouTube content.
  • Competitive intelligence — Monitor what competitors say in their videos.
  • Education — Extract transcripts from online courses for indexing or study notes.
  • Sentiment analysis — Run sentiment or topic models against transcripts at scale.

FAQ

Which videos can I scrape? Any public YouTube video that has manually created or auto-generated captions; with useWhisper enabled, public videos without captions can be transcribed as well. Private, deleted, members-only, and age-restricted videos cannot be scraped; these return success=false with the reason in the error field.

What if a video has no captions at all? Set useWhisper: true to download the audio and transcribe it locally with Whisper (faster-whisper). For clear speech, whisperModel=base is the sweet spot. For music, noisy audio, or short clips, use whisperModel=small for noticeably better accuracy.

What if my requested language isn't available? The scraper first tries an exact match (en), then variants (en matches en-GB, en-US), then falls back to the best available transcript. The available_languages field always lists everything available.
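The matching order described above can be sketched as follows. This is an illustration, not the actor's implementation: `pick_language` is a hypothetical name, variants are matched by comparing the part of the code before the first hyphen, and "best available" is assumed here to mean the first manually created track, or the first track of any kind if none is manual.

```python
from typing import Optional


def pick_language(preferred: str, available: list[dict]) -> Optional[dict]:
    """Choose a track from entries shaped like the available_languages field."""
    if not available:
        return None

    wanted = preferred.strip().lower()
    if wanted:
        # 1. Exact match on the language code.
        for track in available:
            if track["code"].lower() == wanted:
                return track
        # 2. Variant match: same base language (en matches en-GB, en-US).
        base = wanted.split("-")[0]
        for track in available:
            if track["code"].lower().split("-")[0] == base:
                return track

    # 3. Fallback (assumption): prefer manual captions over auto-generated ones.
    manual = [t for t in available if not t.get("is_auto_generated", False)]
    return (manual or available)[0]
```

With the available_languages list from the example output, a request for es would resolve to the es-419 track, and a request for ko would fall back to en.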

What's the difference between transcript_source values?

  • library: pulled from YouTube's published caption tracks. This is the fastest and most reliable path, and popular videos often carry human-written captions.
  • innertube: pulled from YouTube's internal API when the library path could not retrieve the captions.
  • playwright_dom: extracted from the in-page transcript panel as a last resort.
  • whisper: generated locally from the audio with Whisper AI when YouTube has no captions at all.

How accurate are auto-generated captions vs Whisper? YouTube auto-captions are generally accurate for clear English speech. Whisper base is comparable for clean audio, and Whisper small typically beats YouTube on accents, multiple speakers, and noisy audio. Both struggle with heavy music and with audio that contains little intelligible singing or speech.

The Whisper output looks like garbage or repetitive text. Whisper tiny can hallucinate repetitive phrases on music or near-silent clips. The actor detects this automatically (low language-detection confidence combined with identical segment text) and returns success=false with an error that points you to a larger model. Re-run with whisperModel=small for music videos.
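To apply a similar check on your own side, for example with stricter limits, the two signals named above can be combined as in this sketch. The thresholds, the minimum segment count, and the name `looks_hallucinated` are all assumptions chosen for illustration; the actor's real thresholds are not documented here.

```python
from collections import Counter
from typing import Any


def looks_hallucinated(
    record: dict[str, Any],
    min_language_probability: float = 0.5,  # assumed threshold
    max_repeat_ratio: float = 0.6,          # assumed threshold
    min_segments: int = 3,                  # too few segments to judge below this
) -> bool:
    """Flag Whisper output with low language confidence AND heavy repetition."""
    if record.get("transcript_source") != "whisper":
        return False

    texts = [
        seg.get("text", "").strip().lower()
        for seg in record.get("segments", [])
        if seg.get("text", "").strip()
    ]
    if len(texts) < min_segments:
        return False

    # Share of segments taken up by the single most frequent segment text.
    most_common_count = Counter(texts).most_common(1)[0][1]
    repeat_ratio = most_common_count / len(texts)

    # language_probability is only set for Whisper transcripts.
    low_confidence = record.get("language_probability", 1.0) < min_language_probability
    return low_confidence and repeat_ratio >= max_repeat_ratio
```

Requiring both signals keeps songs with a legitimately repeated chorus from being flagged when the language was detected with high confidence.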

Can I scrape multiple videos in one run? Yes. Pass an array of URLs in videoUrls; videos are processed sequentially, and each one is pushed as its own dataset row.

How current is the data? The data is live: every run queries YouTube at request time. Schedule the actor to run daily or hourly if you need regular refreshes.

Limitations

  • Private, members-only, age-restricted, and deleted videos cannot be scraped.
  • Whisper transcription runs on CPU, so it adds roughly 30-180 s per video depending on length and model size.
  • Whisper accuracy on heavy music or pure-instrumental audio is fundamentally limited regardless of model size.
  • YouTube can change its caption infrastructure; the scraper has multiple fallback paths but a transient outage may still cause success=false for individual videos.
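Since a transient outage shows up as success=false on individual records, one way to recover is to collect the failed inputUrl values and submit them again in a later run. The sketch below builds such a follow-up input; `build_retry_input` is a hypothetical helper, and it cannot tell transient failures from permanent ones (private or deleted videos), so inspect the error text before retrying at scale.

```python
from typing import Any, Iterable, Optional


def build_retry_input(
    records: Iterable[dict[str, Any]], **overrides: Any
) -> Optional[dict[str, Any]]:
    """Collect inputUrl from failed records into a new input object.

    Returns None when nothing failed. Extra keyword arguments (for example
    useWhisper=True) are merged into the returned input unchanged.
    """
    failed_urls: list[str] = []
    for record in records:
        url = record.get("inputUrl")
        # Keep first-seen order and skip duplicates.
        if not record.get("success", False) and url and url not in failed_urls:
            failed_urls.append(url)

    if not failed_urls:
        return None
    return {"videoUrls": failed_urls, **overrides}
```

For example, `build_retry_input(items, useWhisper=True, whisperModel="small")` would retry the failed videos with the Whisper fallback enabled.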