Pricing

Pay per event

Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Download captions from any Archive.org film, TV, or audio item: clean transcript text, timestamped cues, normalized SRT & VTT, one row per language. Search 3M+ captioned items, monitor for new ones. No login or API key. About $8 per 1,000 transcripts at the listed event prices.

Developer

Scrapers Delight

Last modified: 10 days ago

🎞️ Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Pull the subtitles/captions from any Internet Archive film, TV recording, or audio item — no login, no API key, no AI transcription. Archive.org hosts 3M+ captioned items (classic films, newsreels, lectures, TV news) and exposes them through public APIs; this actor downloads the caption files and parses them into clean transcript text, timestamped cues, and normalized SRT/VTT — one row per language. Point it at item URLs or an archive.org search query.

Because the captions already exist (uploaded subtitles or Archive.org's own ASR), the actor runs no speech-to-text, which keeps runs fast and cheap.

What does it do?

For each archive.org item you give it (by URL/identifier or discovered via search), it returns:

📝 Full transcript (clean plain text) — always included
⏲️ Timestamped cues — {index, start_ms, end_ms, start, end, text}
🎬 Normalized SRT / VTT — re-emitted with proper 3-digit millisecond stamps (archive's raw ASR files use non-standard 2-digit millis that break many players)
🌍 One row per caption file/language — grab English ASR plus every uploaded translation
🏷️ Item metadata — title, language, mediatype, collections, item URL
🔎 Search discovery — any advanced-search (Lucene) query, auto-scoped to captioned movies/audio, sorted by downloads
🚩 Honest flags — items with no captions, access-restricted items, and private/empty caption files are reported as such, never as silent zero-cue "successes"

No ASR, no API key — it reads the caption files archive.org already publishes.

What data does it extract?

One dataset record per caption file (per language):

🆔 identifier, 🏷️ item_title, 🔗 item_url, 🌍 language, 📦 mediatype, 🗂️ collection[], ⬇️ downloads
📄 caption_file_name, caption_format (SubRip / Web Video Text Tracks), 🌍 caption_lang_code, 🤖 is_autogenerated (.asr = archive's English ASR)
🔗 caption_url, 📏 caption_size_bytes
📝 transcript, ⏲️ segments[], 🎬 srt, vtt, 🔢 cue_count
🚩 restricted, note, ✨ is_new (monitor), 🕒 scraped_at

Example output

{
  "identifier": "Doctorin1946",
  "item_title": "Doctor in Industry (Part I)",
  "item_url": "https://archive.org/details/Doctorin1946",
  "mediatype": "movies",
  "caption_file_name": "Doctorin1946.asr.srt",
  "caption_format": "SubRip",
  "caption_lang_code": "en",
  "is_autogenerated": true,
  "caption_url": "https://archive.org/download/Doctorin1946/Doctorin1946.asr.srt",
  "caption_size_bytes": 14725,
  "cue_count": 217,
  "transcript": "When the thing with the name names …",
  "restricted": false,
  "scraped_at": "2026-06-12T00:00:00.000Z"
}

Who is it for?

🤖 AI / RAG dataset builders — millions of hours of public-domain era film and TV speech, already transcribed.
✍️ Documentary makers & editors — search inside classic films and newsreels, get ready-to-cut SRT/VTT.
🔎 Researchers & historians — full-text search across mid-century educational films, TV news, and lectures.
🌍 Localization & subtitle teams — pull every language track an item carries in one run.

How to use it (step by step)

1. Click Try for free.
2. Paste one or more item URLs (https://archive.org/details/{identifier}) or bare identifiers, or set a search query (e.g. collection:prelinger).
3. (Optional) Filter languages, toggle autogenerated (.asr) captions, and add extra formats (srt, vtt, segments).
4. Click Start, then open the Dataset tab to view or export the results.
5. (Optional) Set monitorMode plus a searchQuery and a Schedule to capture newly captioned items automatically.

Quick start

{
  "itemUrls": ["https://archive.org/details/his_girl_friday"],
  "transcriptFormats": ["txt", "srt"]
}

Search a whole collection

{
  "searchQuery": "collection:prelinger",
  "maxItems": 50,
  "transcriptFormats": ["txt", "segments"]
}

Input

Field	What it does
`itemUrls`	archive.org item URLs / identifiers
`searchQuery`	advanced-search (Lucene) query — auto-scoped to captioned movies/audio, restricted items excluded, sorted by downloads
`languages`	keep only these caption language codes (empty = all)
`includeAutoGenerated`	include archive's `.asr` English ASR captions (default on)
`transcriptFormats`	`txt` · `segments` · `srt` · `vtt`
`maxItems`	hard cap on items per run (default 5; 0 = unlimited)
`maxCaptionFilesPerItem`	cap caption files per item (default 5; 0 = all)
`monitorMode`, `alertOnNewItem`	recurring new-item watcher + alerts
`webhookUrl`, `slackWebhookUrl`, `emailRecipients`	alert channels
`proxyConfiguration`, `requestConcurrency`	proxy + parallelism
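
As a rough illustration of how these fields combine, here is a small Python sketch that assembles and sanity-checks an input object before it is submitted. The helper name `build_input`, its defaults for `transcript_formats`, and the validation rules are assumptions made for this example; only the field names and the listed format values come from the table above. Submitting the object (console or API) is left out.

```python
from __future__ import annotations

from typing import Any

# Format values listed in the input table above.
ALLOWED_FORMATS = {"txt", "segments", "srt", "vtt"}


def build_input(
    item_urls: list[str] | None = None,
    search_query: str | None = None,
    transcript_formats: tuple[str, ...] = ("txt",),  # assumed default, not documented
    max_items: int = 5,  # documented default
    languages: tuple[str, ...] = (),  # empty = all languages
) -> dict[str, Any]:
    """Hypothetical helper: build an input dict using the documented field names."""
    if not item_urls and not search_query:
        raise ValueError("provide itemUrls, a searchQuery, or both")
    unknown = set(transcript_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported transcriptFormats: {sorted(unknown)}")
    if max_items < 0:
        raise ValueError("maxItems must be 0 (unlimited) or a positive number")

    run_input: dict[str, Any] = {
        "transcriptFormats": list(transcript_formats),
        "maxItems": max_items,
    }
    # Only include optional fields that were actually set.
    if item_urls:
        run_input["itemUrls"] = list(item_urls)
    if search_query:
        run_input["searchQuery"] = search_query
    if languages:
        run_input["languages"] = list(languages)
    return run_input
```

Calling `build_input(search_query="collection:prelinger", max_items=50, transcript_formats=("txt", "segments"))` reproduces the "Search a whole collection" example, while a typo such as `"docx"` in the formats fails early instead of at run time.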

Output

Each caption file is one dataset record (fields above). Items with no captions, access-restricted items, and private/empty caption files are emitted as flagged rows (restricted, note) so you always know why a transcript is missing. Export to JSON, CSV, Excel, HTML, or RSS, or fetch via the Apify API.

How much does it cost?

Pay-per-event — and with no transcription compute, it's cheap:

Event	What it covers	Price
`lot-scraped`	each record returned	$0.004 / record
`lot-detail-enriched`	each caption file downloaded + parsed	$0.004 / file
`monitor-run-completed`	each scheduled watch run	$0.05 / run
`new-lot-detected`	each new item found by the monitor	$0.02 / item
`alert-delivered`	each Slack/email/webhook push	$0.005 / alert

That's about $8 per 1,000 transcripts (fetch + parse). No charge for actor starts or empty runs.
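
The arithmetic behind that figure can be sketched in a few lines of Python. The prices are copied from the table above; the function name `estimate_cost` and the way event counts are passed in are assumptions for illustration, and the sketch does not try to model which events a given run actually triggers.

```python
from decimal import Decimal

# Event prices in USD, copied from the pricing table above.
PRICES = {
    "lot-scraped": Decimal("0.004"),
    "lot-detail-enriched": Decimal("0.004"),
    "monitor-run-completed": Decimal("0.05"),
    "new-lot-detected": Decimal("0.02"),
    "alert-delivered": Decimal("0.005"),
}


def estimate_cost(
    records: int,
    files_parsed: int,
    monitor_runs: int = 0,
    new_items: int = 0,
    alerts: int = 0,
) -> Decimal:
    """Hypothetical estimator: multiply each event count by its listed price."""
    counts = {
        "lot-scraped": records,
        "lot-detail-enriched": files_parsed,
        "monitor-run-completed": monitor_runs,
        "new-lot-detected": new_items,
        "alert-delivered": alerts,
    }
    # Decimal avoids the rounding noise that float arithmetic adds to prices.
    return sum((PRICES[event] * n for event, n in counts.items()), Decimal("0"))


print(estimate_cost(records=1000, files_parsed=1000))  # 1,000 transcripts, one file each
```

With one caption file per item, 1,000 records plus 1,000 parsed files comes to $8; items carrying several language tracks scale both counts accordingly.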

Monitor & alert setup

1. Set a searchQuery (e.g. collection:prelinger or subject:"television news").
2. Turn on monitorMode (and keep alertOnNewItem on).
3. Add a webhookUrl, slackWebhookUrl, and/or emailRecipients.
4. Create an Apify Schedule (e.g. daily). The first run baselines the seen items; every later run outputs and alerts only new items. State persists in a named key-value store (archive-transcript-monitor-state), so it survives between runs.
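
The baseline-then-diff behaviour described in the last step can be pictured with a short Python sketch. This is not the actor's code: the function name `diff_new_items` is hypothetical, the persisted state is modelled as a plain set of identifiers, and the sketch assumes the baseline run reports nothing as new.

```python
from __future__ import annotations


def diff_new_items(
    found_ids: list[str], seen_ids: set[str] | None
) -> tuple[list[str], set[str]]:
    """Return (identifiers not seen before, updated seen set).

    seen_ids is None when no state exists yet, i.e. on the first run.
    """
    if seen_ids is None:
        # First run: record everything as the baseline and report nothing as new
        # (assumed behaviour for this sketch).
        return [], set(found_ids)
    # dict.fromkeys drops duplicate identifiers while keeping search order.
    new_ids = [i for i in dict.fromkeys(found_ids) if i not in seen_ids]
    return new_ids, seen_ids | set(new_ids)


# First scheduled run builds the baseline.
new, state = diff_new_items(["item_a", "item_b"], None)
# A later run sees one identifier it has not recorded yet.
new, state = diff_new_items(["item_b", "item_c"], state)
print(new)  # ['item_c']
```

Keeping the seen set in a named store, as the actor does, is what lets the second call start from the state the first call produced.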

How does it work without AI transcription?

Archive.org items carry caption files: uploader-provided .srt/.vtt subtitles and Archive.org's own autogenerated English ASR (.asr.srt). This actor reads the item's public metadata, picks the caption files, downloads them, and runs a hardened parser built for the variants these files show in practice: BOM-prefixed and CRLF files, 2-digit millisecond ASR stamps, <i> formatting tags, VTT headers with trailing junk, and cues without indices. It does not run speech-to-text, so there's no GPU cost and results come back quickly.
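
To make those parser quirks concrete, here is a minimal Python sketch of a tolerant cue parser. It is an illustration, not the actor's implementation: the names `parse_cues` and `to_ms` are hypothetical, and it assumes a short fractional field is a decimal fraction of a second (so `,50` means 500 ms). If the raw files instead mean 50 ms, the padding would need to go on the left.

```python
import re

# Timing line with 1 to 3 fractional digits and either "," (SRT) or "." (VTT).
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{1,3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{1,3})"
)
TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # formatting tags such as <i> and </i>


def to_ms(h: str, m: str, s: str, frac: str) -> int:
    # ASSUMPTION: a short fraction is right-padded, so "50" becomes 500 ms.
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(frac.ljust(3, "0"))


def parse_cues(raw: str) -> list[dict]:
    # Drop a leading BOM and normalise CRLF / CR line endings.
    text = raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues: list[dict] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.split("\n")
        # The index line is optional, so look for timing in the first two lines.
        for pos, line in enumerate(lines[:2]):
            match = TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line: a VTT header or stray junk, skip the block
        parts = (TAG.sub("", body_line).strip() for body_line in lines[pos + 1:])
        body = " ".join(p for p in parts if p)
        if not body:
            continue
        g = match.groups()
        cues.append(
            {
                "index": len(cues) + 1,  # renumber, since source indices may be absent
                "start_ms": to_ms(*g[:4]),
                "end_ms": to_ms(*g[4:]),
                "text": body,
            }
        )
    return cues


sample = (
    "\ufeff1\r\n00:00:01,50 --> 00:00:03,00\r\n<i>Hello</i> there\r\n\r\n"
    "00:00:04,250 --> 00:00:05,000\r\nSecond cue\r\n"
)
for cue in parse_cues(sample):
    print(cue)
```

The sample mixes a BOM, CRLF endings, a 2-digit fraction, an `<i>` tag, and a cue with no index, which are the cases named in the paragraph above.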

Is it legal to scrape archive.org captions?

The Internet Archive is a non-profit library that publishes these items and APIs for public access, and much of the captioned material is public-domain era film. The output is published media content and item stats, not personal data. Scraping public data is generally legal, but you are responsible for your use — review archive.org's Terms of Use and each item's rights/license statement before redistributing content.

FAQ

Which items have captions? 3M+ movies/audio items carry .srt/.vtt files — classic films, Prelinger educational shorts, TV news, lectures. The search mode finds them for you (it filters to format:"SubRip" OR "Web Video Text Tracks" automatically).

Is there a Whisper/ASR step? No — it downloads the caption files archive.org already publishes (including archive's own ASR track), so it's fast and cheap.

Can I get subtitles for video editing? Yes: add srt and/or vtt to transcriptFormats. The actor normalizes Archive.org's non-standard 2-digit-millisecond stamps to three-digit milliseconds (hh:mm:ss,mmm in SRT, hh:mm:ss.mmm in VTT), so the files load in standard editors and players.
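
For readers who want to re-emit subtitles from the segments output themselves, the stamp format can be sketched in Python as follows. The function names `format_stamp` and `to_srt` are hypothetical; the cue fields `start_ms`, `end_ms`, and `text` are the ones listed for timestamped cues, and the only difference between the two formats here is the separator before the milliseconds.

```python
def format_stamp(total_ms: int, fmt: str = "srt") -> str:
    """Render milliseconds as hh:mm:ss,mmm (SRT) or hh:mm:ss.mmm (VTT)."""
    if total_ms < 0:
        raise ValueError("timestamps cannot be negative")
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    sep = "," if fmt == "srt" else "."
    # Always emit three millisecond digits, zero-padded.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def to_srt(cues: list[dict]) -> str:
    """Join cues into SRT text: index, timing line, text, blank line between cues."""
    blocks = []
    for n, cue in enumerate(cues, start=1):
        timing = f"{format_stamp(cue['start_ms'])} --> {format_stamp(cue['end_ms'])}"
        blocks.append(f"{n}\n{timing}\n{cue['text']}\n")
    return "\n".join(blocks)


print(to_srt([{"start_ms": 1500, "end_ms": 3000, "text": "Hello there"}]))
```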

What about multiple languages? Each caption file becomes its own row with caption_lang_code parsed from the filename. Use languages to keep only the ones you want.

Why did an item return no transcript? Three honest cases, all flagged in the row: the item has no caption files (note), the item is access-restricted (its files are private and download as empty bodies — restricted: true), or a specific file is private/zero-byte. The actor never reports those as empty "successes".
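
When post-processing an exported dataset, those flags can be used to separate usable transcripts from flagged rows. The Python sketch below is one possible way to do it; the function names and the three bucket labels are invented for this example, and it assumes flagged rows carry no transcript or a zero `cue_count`.

```python
from collections import Counter


def classify_row(row: dict) -> str:
    """Hypothetical helper: bucket a dataset record using its documented flags."""
    if row.get("restricted"):
        return "restricted"  # access-restricted item
    if (row.get("cue_count") or 0) > 0 and row.get("transcript"):
        return "ok"
    # No caption files, or a private/zero-byte file; row["note"] explains which.
    return "missing"


def summarize(rows: list[dict]) -> Counter:
    return Counter(classify_row(row) for row in rows)


rows = [
    {"identifier": "Doctorin1946", "cue_count": 217, "transcript": "When the thing", "restricted": False},
    {"identifier": "example_restricted", "restricted": True, "note": "access-restricted"},
    {"identifier": "example_no_captions", "restricted": False, "note": "no caption files", "cue_count": 0},
]
print(summarize(rows))
```

The two `example_*` identifiers and their note strings are made up; only `Doctorin1946` comes from the example output earlier on this page.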

Can I crawl a whole collection? Yes — searchQuery: "collection:{name}" + maxItems: 0. Archive's search window caps at 10,000 rows per query; slice bigger collections by date (publicdate:[2020-01-01 TO 2021-01-01]).
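
One way to produce those date slices is to generate one query per year, as in this Python sketch. The helper name `yearly_slices` is hypothetical, and yearly granularity is an assumption; a very large collection may need monthly slices to stay under the 10,000-row window. Lucene's square-bracket ranges include both endpoints, so an item dated exactly on a boundary can match two slices and should be de-duplicated by identifier.

```python
from datetime import date


def yearly_slices(base_query: str, first_year: int, last_year: int) -> list[str]:
    """Build one searchQuery per year, following the publicdate range shown above."""
    queries = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1).isoformat()
        end = date(year + 1, 1, 1).isoformat()
        # Parentheses keep the base query intact if it contains OR clauses.
        queries.append(f"({base_query}) AND publicdate:[{start} TO {end}]")
    return queries


for query in yearly_slices("collection:prelinger", 2020, 2021):
    print(query)
```

Each generated string can be used as the searchQuery of a separate run with maxItems set to 0.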

How fresh is monitor mode? Every scheduled run re-queries your search and diffs against the named state store — you get only items it hasn't seen before, plus optional Slack/webhook/email alerts.

Does it need a proxy or login? No login or API key. Archive.org's endpoints are public; the default datacenter proxy rotation is plenty.

How do I export? JSON, CSV, Excel, HTML, or RSS from the Dataset tab, or via the Apify API.

What does a 1,000-film crawl cost? With one caption file each: 1,000 × ($0.004 + $0.004) = ~$8.

Feedback

Want full-text search inside transcripts, TV-news-specific fields, or bulk export to a single file? Open an issue on the actor.

Dailymotion Transcript Scraper — Subtitles to TXT, SRT, VTT

scrapersdelight/dailymotion-transcript-scraper

Extract any public Dailymotion video's subtitle transcript — no login, no ASR. By video URL/ID or a search query: full text, timestamped segments & SRT/VTT, plus title, owner and duration, from Dailymotion's own subtitle tracks. $2 per 1,000 videos.

Scrapers Delight

YouTube Transcript Scraper — Batch + SRT/VTT Export

vanity_arias/youtube-transcript-scraper-batch

Extract YouTube video transcripts in bulk — paste video URLs, IDs, or Shorts links and get clean text, timestamped segments, and ready-to-use SRT/VTT subtitle files. No API key, failed videos never charged.

Nvikelo Nyathi

TikTok Transcript Scraper - JSON, SRT, VTT

jamhimself/tiktok-transcript-scraper

Extracts TikTok video transcripts from native captions (no AI transcription). Input: video URLs or IDs. Output: timestamped JSON segments, plain text, SRT, VTT, or RAG chunks + metadata. $0.003 per video with a transcript; no-caption videos free.

Jaime Martinez

YouTube Transcript Scraper - JSON, SRT, VTT, RAG

jamhimself/youtube-transcript-extractor

Extract transcripts from YouTube videos. Input: video URLs or IDs + language preferences. Output: plain text, timestamped segments, SRT/VTT subtitles, and RAG chunks with deep links. 100+ languages, no API key. $0.0075 per delivered transcript.

Jaime Martinez

Vimeo Transcript Scraper — Captions to TXT, SRT & VTT

scrapersdelight/vimeo-transcript-scraper

Extract any public Vimeo video's captions and transcript — no login, no ASR. By video URL/ID or a page that links Vimeo videos: transcript text, timestamped segments & SRT/VTT, plus title, owner and duration, from Vimeo's own caption tracks. $2 per 1,000 videos.

Scrapers Delight

Loom Transcript Scraper - Captions to Text

khadinakbar/loom-transcript-scraper

Extract public Loom video transcripts as clean text, timestamped segments, VTT, and SRT. Use public share or embed URLs only; no login, cookies, or video download.

Khadin Akbar

YouTube Subtitle Extractor

entertained_rattlesnake/youtube-subtitle-extractor

Extract subtitles and transcripts from YouTube videos and export them as JSON, TXT, SRT and VTT.

Entertained Rattlesnake

Wistia Transcript Scraper — Captions to TXT, SRT & VTT

scrapersdelight/wistia-transcript-scraper

Extract any public Wistia video's transcript and captions — no login, no ASR. By hashedId or any page that embeds Wistia: full text, timestamped segments & SRT/VTT, plus title and duration, straight from Wistia's CDN. $2 per 1,000 videos.

Scrapers Delight

Archive.org Scraper

lulzasaur/archive-org-scraper

Scrape the Internet Archive (archive.org). Search 50M+ texts, 13M+ audio, 16M+ movies, and 1.3M+ software items. Get metadata, download counts, file lists, and more via public APIs.

lulz bot

Subtitle Translator — SRT & VTT

dami_studio/subtitle-translator

Translate subtitles into many languages at once. Paste an SRT/VTT file (or give a video URL to auto-transcribe), pick target languages, and get clean translated SRT + VTT back — timings preserved. For localization, accessibility, and multi-language publishing.