YouTube Transcript Scraper & Bulk Downloader

Bulk YouTube transcript downloader and extractor: pulls captions (manual or auto-generated) from a single video or a bulk list of videos, in any available language. Returns a plain-text transcript plus timed segments, exportable as JSON or CSV. Failed requests are retried and proxies rotated so the captions land.

Pricing: Pay per event

Developer: DevilScrapes (maintained by Community)

🎯 What this scrapes

YouTube ships closed captions for most videos. This Actor takes a list of video URLs or bare IDs, picks the best available caption track in the language you specify, downloads every cue, and writes one clean row per video. You get the full joined transcript text plus — if you want them — the per-cue segments with start time and duration. Channel name, video title, duration, and the full list of available languages all land in the same row.

We handle the parts that make bulk transcript extraction fragile at scale: rate-limit pushback, endpoint parameter drift, and residential proxy rotation so YouTube sees a real browser session rather than a Python script hitting its timedtext endpoint in a tight loop.

Captions are public metadata published by YouTube. This Actor fetches only what YouTube's own player loads for any viewer. It does not download video files, bypass region locks, or access private or unlisted content.

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not a Python script.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session ID and exit IP on every block or rate-limit response.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per video, Retry-After header honoured.
  • 🧱 Rate-limit-aware pacing — when YouTube pushes back we slow down rather than accumulate bans across the run.
  • 🧊 Clean, typed dataset rows — Pydantic-validated output, ISO-8601 timestamps, stable IDs. Export as JSON, CSV, or Excel straight from Apify Console.
  • 💰 Pay-Per-Event pricing — you pay only for rows that land in your dataset. No data, no charge beyond the small run warm-up fee.
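
The retry behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the function names, the 1-second base delay, and the 60-second cap are assumptions, and only the numeric (delta-seconds) form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Statuses treated as retryable: 408, 429 and any 5xx, as listed above.
RETRYABLE = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5  # "up to 5 attempts per video"

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def backoff_delay(attempt: int, retry_after: Optional[str],
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after a failed attempt (attempt is 1-based).

    base and cap are illustrative values, not the Actor's real settings.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        # Server-supplied delay in seconds wins over our own schedule.
        return min(float(retry_after.strip()), cap)
    # Exponential schedule: base, 2*base, 4*base, ...
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retries(fetch: Callable[[], Response],
                       sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, headers, body = fetch()
        if status not in RETRYABLE or attempt == MAX_ATTEMPTS:
            return status, headers, body
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A production version would typically add random jitter to the delay and rotate the proxy session between attempts.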

💡 Use cases

  • RAG corpus seeding — bulk-download transcripts for a playlist of conference talks, lectures, or podcast episodes and feed them straight into a vector store or LLM context window.
  • YouTube transcript bulk download for NLP — export hundreds of transcripts at once for sentiment analysis, topic modelling, or fine-tuning data prep.
  • Podcast show-notes automation — feed each new YouTube-hosted episode through this Actor and into an LLM to generate Markdown show notes automatically.
  • Download YouTube subtitles for language learning — pull caption tracks in the target language across a playlist for comprehension practice or graded reading corpora.
  • YouTube subtitles dataset construction — build a reproducible, version-controlled transcript dataset for ML benchmarking, search indexing, or attribution research.
  • YouTube transcript for RAG pipelines — drop transcripts directly into LangChain, LlamaIndex, or any retrieval-augmented generation stack without preprocessing.

⚙️ How to use it

  1. Click Try for free at the top of the Store listing.
  2. Paste YouTube video URLs or bare video IDs into videoUrls — one per line, or as a JSON array. Shorts and youtu.be links both work.
  3. Set language to the ISO-639-1 code you want (default en). If no manual track exists in that language, the Actor falls back to an auto-generated track in that language, then to a manual track in any language, then to an auto-generated track in any language.
  4. Click Start. Results stream into the run's dataset in real time.
  5. Export from Storage → Dataset as JSON, CSV, or Excel — or pull via the Apify API.

For large lists (hundreds of videos) leave proxyConfiguration on its default of useApifyProxy: true. On the Apify FREE plan this uses datacenter proxies; upgrading to a paid plan routes through residential IPs, which handle aggressive rate-limiting with a higher success rate.
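
If you want to check or de-duplicate a list before pasting it into videoUrls, the accepted input shapes from step 2 (watch URLs, Shorts links, youtu.be links, bare IDs) can be normalised to 11-character video IDs with a short helper. This is a sketch under our own assumptions about URL shapes; the function name is hypothetical and the Actor's internal parser may accept more variants.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# A YouTube video ID is 11 characters from this alphabet.
_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(value: str) -> Optional[str]:
    """Return the video ID for a bare ID, watch URL, Shorts URL or youtu.be link."""
    value = value.strip()
    if _ID_RE.fullmatch(value):
        return value  # already a bare ID

    parsed = urlparse(value if "://" in value else "https://" + value)
    host = (parsed.hostname or "").lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]

    candidate = None
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com":
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith("/shorts/"):
            candidate = parsed.path.split("/")[2]

    if candidate and _ID_RE.fullmatch(candidate):
        return candidate
    return None  # not a recognised YouTube video reference
```

Entries that come back as `None` are worth removing before the run, since they cannot resolve to a video.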

📥 Input

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| videoUrls | array | yes | ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"] | YouTube video URLs or bare video IDs. Shorts and youtu.be links are accepted. |
| language | string | no | "en" | ISO-639-1 language code. Track selection order: manual in requested language → auto in requested language → manual any → auto any. |
| includeSegments | boolean | no | true | When true, the segments array includes one entry per cue (text + start time + duration). The joined transcript_text field is always present regardless. |
| concurrency | integer | no | 4 | Number of videos processed in parallel. Lower this if you see elevated 429s on a shared datacenter proxy. |
| proxyConfiguration | object | no | {"useApifyProxy": true} | Proxy settings. YouTube rate-limits aggressive runs, so residential routing is recommended for lists of 100+ videos. |
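
The track selection order documented for the language field can be expressed as a small priority search. The sketch below is illustrative only: `CaptionTrack` is a hypothetical shape, and exact string matching on the language code is an assumption (regional variants such as "en-US" would need extra handling).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CaptionTrack:
    """Hypothetical description of one caption track on a video."""
    language: str
    is_auto_generated: bool


def select_track(tracks: Sequence[CaptionTrack],
                 requested: str) -> Optional[CaptionTrack]:
    """Pick a track using the documented order:
    manual requested -> auto requested -> manual any -> auto any."""

    def first(match_language: bool, auto: bool) -> Optional[CaptionTrack]:
        for track in tracks:
            language_ok = not match_language or track.language == requested
            if track.is_auto_generated == auto and language_ok:
                return track
        return None

    priorities = ((True, False), (True, True), (False, False), (False, True))
    for match_language, auto in priorities:
        found = first(match_language, auto)
        if found is not None:
            return found
    return None  # the video has no caption tracks at all
```

Note that a manual track in another language outranks nothing in the requested language only after both requested-language options are exhausted, so check the language and is_auto_generated output fields to see which track was actually used.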

Example input

{
  "videoUrls": [
    "dQw4w9WgXcQ",
    "https://www.youtube.com/watch?v=9bZkp7q19f0"
  ],
  "language": "en",
  "includeSegments": false,
  "concurrency": 3,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

📤 Output

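One dataset row per input video for which a caption track was retrieved. Videos without captions are logged and skipped, so they produce no row.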

| Field | Type | Notes |
|---|---|---|
| video_id | string | YouTube video ID (11 characters). |
| video_url | string | Canonical youtube.com/watch?v= URL. |
| title | string or null | Video title parsed from the watch page. |
| channel_name | string or null | Channel display name. |
| channel_id | string or null | Channel ID. |
| duration_seconds | integer or null | Video duration in seconds. |
| language | string | Caption track language code actually used. |
| is_auto_generated | boolean | true for YouTube auto-generated tracks; false for manually uploaded captions. |
| transcript_text | string | Full transcript joined with newlines, ready to paste into an LLM prompt or search index. |
| segments | array or null | Per-cue entries with text, start, and duration when includeSegments is true. |
| available_languages | array | All caption track language codes available on the video. |
| scraped_at | string | ISO-8601 timestamp of when this row was written. |

Example output

{
  "video_id": "dQw4w9WgXcQ",
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "title": "Rick Astley - Never Gonna Give You Up (Official Music Video)",
  "channel_name": "Rick Astley",
  "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
  "duration_seconds": 213,
  "language": "en",
  "is_auto_generated": false,
  "transcript_text": "We're no strangers to love\nYou know the rules and so do I\n...",
  "segments": null,
  "available_languages": ["en", "es", "fr", "de"],
  "scraped_at": "2026-06-01T10:32:14Z"
}
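
When includeSegments is true, each row's segments array can be turned into timestamped lines for citation or review. The sketch below assumes that start is expressed in seconds, which the field list does not state explicitly; the helper names are our own.

```python
from typing import Any, List, Mapping


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS (fractions are truncated)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def timestamped_lines(row: Mapping[str, Any]) -> List[str]:
    """Build '[HH:MM:SS] text' lines from one dataset row.

    Returns an empty list when segments is null, which is what the row
    contains when includeSegments was false.
    """
    segments = row.get("segments") or []
    return [
        f"[{format_timestamp(float(segment['start']))}] {segment['text']}"
        for segment in segments
    ]
```

Because the function tolerates a null segments field, it can be mapped over an entire dataset export regardless of how each run was configured.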

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
|---|---|---|
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.004 | Per dataset row written |

Example: 1 000 transcripts in a single run cost $0.005 + 1 000 × $0.004 = $4.005, or about $4.01. No subscription, no monthly minimum, no credit card required to start; Apify gives every new account $5 of free credit.

For very large bulk runs (10 000+ transcripts/month) the per-result charge scales linearly: 10k ≈ $40, 100k ≈ $400. If your volume is that high, open an issue on the Actor's Issues tab — we can discuss a volume arrangement.
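
For budgeting, the two event prices from the table combine into a simple linear formula. A minimal sketch, using the listed rates and a hypothetical function name:

```python
from decimal import Decimal

# Event prices as listed in the pricing table.
ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.004")       # charged per dataset row written


def estimate_cost(rows: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD for `rows` dataset rows spread over `runs` runs."""
    if rows < 0 or runs < 0:
        raise ValueError("rows and runs must be non-negative")
    return runs * ACTOR_START_USD + rows * RESULT_USD
```

`Decimal` is used instead of `float` so that sub-cent prices add up exactly. Videos that are skipped for lack of captions write no row and therefore do not add to the `rows` count.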

🚧 Limitations

  • Captions disabled by uploader — some creators turn off captions entirely. Those videos return no transcript row; the Actor logs the skip and moves on.
  • Rate-limiting on large batches — YouTube pushes back on high-concurrency runs from shared datacenter IPs. Use proxyConfiguration with residential routing and keep concurrency at 3–5 for lists of 500+ videos.
  • Live streams — live captions are usually unavailable until the broadcast ends and the VOD is processed. Re-run after the stream concludes.
  • Age-gated / sign-in-required videos — this Actor does not accept YouTube credentials and cannot retrieve captions from age-restricted content.
  • Parameter drift — YouTube occasionally rotates its internal timedtext endpoint parameters. When this happens existing runs may return empty transcripts for affected videos. We monitor for this and ship a fix within 48 hours. Check the Actor's CHANGELOG for the latest version.

❓ FAQ

What's the difference between this and the youtube-transcript-api Python library?

The open-source library is great for one-off scripts. This Actor wraps equivalent logic inside Apify's cloud infrastructure, adding proxy rotation, retries, concurrency control, structured output, and the ability to schedule recurring runs, with no server required. Use the library for local experiments; use this Actor when you need bulk YouTube transcript downloads at scale without managing infrastructure.

Can I access YouTube transcripts programmatically through an API?

Yes. Every run's dataset is accessible via the Apify REST API. You can trigger runs, poll for completion, and pull results as JSON in one API call. See Apify's documentation for the full reference.
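
As a rough sketch of the "one API call" path, the request below targets the Apify API v2 endpoint that runs an Actor synchronously and returns its dataset items. The Actor ID and token are placeholders you must supply (the ID is usually written as username~actor-name), and the helper names are hypothetical. Synchronous runs are subject to a time limit on Apify's side, so long lists are better started asynchronously and polled.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str,
                      actor_input: dict) -> urllib.request.Request:
    """Prepare a POST that runs the Actor and returns its dataset items.

    actor_id and token are placeholders; copy the real values from
    Apify Console.
    """
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_and_fetch(request: urllib.request.Request) -> list:
    """Send the prepared request and decode the JSON array of dataset rows."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Splitting request construction from sending makes it easy to inspect the URL and payload before spending credit. The official apify-client package offers the same operations with polling and pagination built in.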

Can I download YouTube subtitles in languages other than English?

Yes. Set language to any ISO-639-1 code (e.g. "es" for Spanish, "ja" for Japanese, "de" for German). The Actor will select the best matching track and fall back gracefully if the exact language is unavailable. The available_languages field in every output row lists what was actually on the video.

Can it extract auto-generated closed captions?

Auto-generated tracks are fully supported and labelled clearly via the is_auto_generated field. Auto tracks are used as a fallback when no manual caption upload exists. Quality varies by video; auto-generated tracks on professionally produced content tend to be accurate.

What if no captions exist at all?

The Actor logs the video ID and skips it. We do not synthesise or transcribe audio — that's a different (much more expensive) problem.
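
Because skipped videos produce no row, you can find them after a run by comparing the IDs you submitted with the video_id values in the dataset. A minimal sketch, assuming your input list has already been reduced to bare video IDs (the function name is our own):

```python
from typing import Any, Iterable, List, Mapping


def find_skipped(requested_ids: Iterable[str],
                 rows: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return requested video IDs that have no row in the dataset,
    preserving the order in which they were requested."""
    returned = {row["video_id"] for row in rows}
    return [video_id for video_id in requested_ids if video_id not in returned]
```

The resulting list covers every cause of a missing row, not only disabled captions, so consult the run log for the specific reason.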

Can I use the transcripts in a RAG pipeline?

That is exactly the use case we built for. The transcript_text field is clean joined text ready for chunking. The segments array gives you cue-level timestamps if you want to preserve position information for citation or retrieval. Both fields export as-is into JSON; pass transcript_text as the text of a LangChain Document or LlamaIndex node and attach the remaining fields as metadata.
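
Chunking transcript_text can be as simple as grouping its newline-separated lines up to a size budget, with a small overlap so that context carries across chunk boundaries. The sketch below is one possible approach, not a recommendation from any framework; the default sizes are arbitrary assumptions.

```python
from typing import List


def chunk_transcript(text: str, max_chars: int = 1000,
                     overlap_lines: int = 1) -> List[str]:
    """Group transcript lines into chunks of roughly max_chars characters.

    Each new chunk starts with the last `overlap_lines` lines of the
    previous one. A chunk can exceed max_chars when a single line, or the
    carried-over overlap plus one line, is longer than the budget.
    """
    if max_chars <= 0 or overlap_lines < 0:
        raise ValueError("max_chars must be positive and overlap_lines non-negative")

    chunks: List[str] = []
    current: List[str] = []
    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        candidate = current + [line]
        if current and len("\n".join(candidate)) > max_chars:
            chunks.append("\n".join(current))
            tail = current[-overlap_lines:] if overlap_lines else []
            current = tail + [line]
        else:
            current = candidate
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be stored with video_id and video_url as metadata so that retrieved passages link back to their source video.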

Why is title or channel_name empty?

If YouTube returns a consent interstitial or a 429 on the watch page during metadata fetch, we still deliver the transcript but leave page-scraped fields null. The transcript itself is retrieved from a separate endpoint and succeeds independently.

💬 Your feedback

Spotted a bug, hit a rate-limit pattern we aren't handling, or need a field added? Open an issue on the Actor's Issues tab in Apify Console — we read every report and ship fixes on a weekly cadence. For parameter-drift breakages, check the CHANGELOG first; a fix is usually already in latest.