Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Download captions from any Archive.org film, TV, or audio item: clean transcript text, timestamped cues, normalized SRT & VTT, one row per language. Search 3M+ captioned items, monitor for new ones. No login or API key. About $8 per 1,000 transcripts ($0.004 per record plus $0.004 per caption file).

Developer: Scrapers Delight (maintained by Community)

🎞️ Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Pull the subtitles/captions from any Internet Archive film, TV recording, or audio item — no login, no API key, no AI transcription. Archive.org hosts 3M+ captioned items (classic films, newsreels, lectures, TV news) and exposes them through public APIs; this actor downloads the caption files and parses them into clean transcript text, timestamped cues, and normalized SRT/VTT — one row per language. Point it at item URLs or an archive.org search query.

Because the captions already exist (uploaded subtitles or archive's own ASR), there's no speech-to-text compute — it's fast and cheap.


What does it do?

For each archive.org item you give it (by URL/identifier or discovered via search), it returns:

  • 📝 Full transcript (clean plain text) — always included
  • ⏲️ Timestamped cues: {index, start_ms, end_ms, start, end, text}
  • 🎬 Normalized SRT / VTT — re-emitted with proper 3-digit millisecond stamps (archive's raw ASR files use non-standard 2-digit millis that break many players)
  • 🌍 One row per caption file/language — grab English ASR plus every uploaded translation
  • 🏷️ Item metadata — title, language, mediatype, collections, item URL
  • 🔎 Search discovery — any advanced-search (Lucene) query, auto-scoped to captioned movies/audio, sorted by downloads
  • 🚩 Honest flags — items with no captions, access-restricted items, and private/empty caption files are reported as such, never as silent zero-cue "successes"

No ASR, no API key — it reads the caption files archive.org already publishes.
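The timestamp normalization mentioned in the list above can be illustrated with a short sketch. This is not the actor's own code: the function name `normalize_srt_stamps` is hypothetical, and it assumes a short fraction is a decimal fraction of a second (so `,23` means 230 ms), which the listing does not state.

```python
import re

# hh:mm:ss followed by "," or "." and 1 to 3 fractional digits.
_STAMP = re.compile(r"(?<!\d)(\d{1,2}):(\d{2}):(\d{2})[,.](\d{1,3})(?!\d)")


def normalize_srt_stamps(text: str) -> str:
    """Rewrite every timestamp in `text` as hh:mm:ss,mmm (SRT style)."""

    def fix(match: re.Match) -> str:
        hours, minutes, seconds, frac = match.groups()
        # ASSUMPTION: a short fraction is a decimal fraction of a second,
        # so "23" is padded on the right to "230".
        millis = frac.ljust(3, "0")
        return f"{int(hours):02d}:{minutes}:{seconds},{millis}"

    return _STAMP.sub(fix, text)
```

Stamps that already carry three digits pass through unchanged, so the function is safe to run on uploader-provided files as well as ASR files.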


What data does it extract?

One dataset record per caption file (per language):

  • 🆔 identifier, 🏷️ item_title, 🔗 item_url, 🌍 language, 📦 mediatype, 🗂️ collection[], ⬇️ downloads
  • 📄 caption_file_name, caption_format (SubRip / Web Video Text Tracks), 🌍 caption_lang_code, 🤖 is_autogenerated (.asr = archive's English ASR)
  • 🔗 caption_url, 📏 caption_size_bytes
  • 📝 transcript, ⏲️ segments[], 🎬 srt, vtt, 🔢 cue_count
  • 🚩 restricted, note, ✨ is_new (monitor), 🕒 scraped_at

Example output

{
  "identifier": "Doctorin1946",
  "item_title": "Doctor in Industry (Part I)",
  "item_url": "https://archive.org/details/Doctorin1946",
  "mediatype": "movies",
  "caption_file_name": "Doctorin1946.asr.srt",
  "caption_format": "SubRip",
  "caption_lang_code": "en",
  "is_autogenerated": true,
  "caption_url": "https://archive.org/download/Doctorin1946/Doctorin1946.asr.srt",
  "caption_size_bytes": 14725,
  "cue_count": 217,
  "transcript": "When the thing with the name names …",
  "restricted": false,
  "scraped_at": "2026-06-12T00:00:00.000Z"
}

Who is it for?

  • 🤖 AI / RAG dataset builders — millions of hours of public-domain era film and TV speech, already transcribed.
  • ✍️ Documentary makers & editors — search inside classic films and newsreels, get ready-to-cut SRT/VTT.
  • 🔎 Researchers & historians — full-text search across mid-century educational films, TV news, and lectures.
  • 🌍 Localization & subtitle teams — pull every language track an item carries in one run.

How to use it (step by step)

  1. Click Try for free.
  2. Paste one or more item URLs (https://archive.org/details/{identifier}) or bare identifiers — or set a search query (e.g. collection:prelinger).
  3. (Optional) filter languages, toggle autogenerated (.asr) captions, add extra formats (srt, vtt, segments).
  4. Click Start, then open the Dataset tab to view/export.
  5. (Optional) set monitorMode + a searchQuery + a Schedule to capture newly captioned items automatically.

Quick start

{
  "itemUrls": ["https://archive.org/details/his_girl_friday"],
  "transcriptFormats": ["txt", "srt"]
}

Search a whole collection

{
  "searchQuery": "collection:prelinger",
  "maxItems": 50,
  "transcriptFormats": ["txt", "segments"]
}
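The same inputs can also be sent programmatically. The sketch below builds a request for Apify's synchronous "run and return dataset items" REST endpoint using only the standard library. The actor ID is a placeholder (this page does not give it), the helper names are hypothetical, and a synchronous run only suits short jobs; larger crawls would normally be started asynchronously.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) a synchronous run request that returns dataset items."""
    # Apify addresses actors as "username~actor-name" in URL paths.
    actor_path = urllib.parse.quote(actor_id, safe="~")
    return urllib.request.Request(
        f"{API_BASE}/acts/{actor_path}/run-sync-get-dataset-items",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the dataset records as a list of dicts."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look like `run_actor("<username>~<actor-name>", "<APIFY_TOKEN>", {"searchQuery": "collection:prelinger", "maxItems": 50, "transcriptFormats": ["txt", "segments"]})`, with both placeholders replaced by real values.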

Input

Field | What it does
--- | ---
itemUrls | archive.org item URLs / identifiers
searchQuery | advanced-search (Lucene) query, auto-scoped to captioned movies/audio, restricted items excluded, sorted by downloads
languages | keep only these caption language codes (empty = all)
includeAutoGenerated | include archive's .asr English ASR captions (default on)
transcriptFormats | txt · segments · srt · vtt
maxItems | hard cap on items per run (default 5; 0 = unlimited)
maxCaptionFilesPerItem | cap caption files per item (default 5; 0 = all)
monitorMode, alertOnNewItem | recurring new-item watcher + alerts
webhookUrl, slackWebhookUrl, emailRecipients | alert channels
proxyConfiguration, requestConcurrency | proxy + parallelism

Output

Each caption file is one dataset record (fields above). Items with no captions, access-restricted items, and private/empty caption files are emitted as flagged rows (restricted, note) so you always know why a transcript is missing. Export to JSON, CSV, Excel, HTML, or RSS, or fetch via the Apify API.
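When post-processing an export, it helps to separate usable transcripts from flagged rows before doing anything else. A minimal sketch, assuming flagged rows have `restricted` set or carry an empty `transcript` with a zero `cue_count` (the exact shape of flagged rows is not spelled out here); `split_records` is a hypothetical helper name.

```python
def split_records(records: list) -> tuple:
    """Split exported dataset rows into (usable, flagged) lists."""
    usable, flagged = [], []
    for row in records:
        # ASSUMPTION: a usable row has non-empty transcript text and at least one cue.
        has_text = bool(row.get("transcript")) and (row.get("cue_count") or 0) > 0
        if row.get("restricted") or not has_text:
            flagged.append(row)
        else:
            usable.append(row)
    return usable, flagged
```

The `note` field on each flagged row then explains why the transcript is missing.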


How much does it cost?

Pay-per-event — and with no transcription compute, it's cheap:

Event | What it covers | Price
--- | --- | ---
lot-scraped | each record returned | $0.004 / record
lot-detail-enriched | each caption file downloaded + parsed | $0.004 / file
monitor-run-completed | each scheduled watch run | $0.05 / run
new-lot-detected | each new item found by the monitor | $0.02 / item
alert-delivered | each Slack/email/webhook push | $0.005 / alert

That's about $8 per 1,000 transcripts (fetch + parse). No charge for actor starts or empty runs.
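The arithmetic can be checked with a small estimator that uses the prices from the table above. The function name `estimate_cost` is hypothetical, and the example assumes each transcript triggers one `lot-scraped` and one `lot-detail-enriched` event, as the $8 figure implies.

```python
from decimal import Decimal

# Prices per event, copied from the pricing table.
PRICES = {
    "lot-scraped": Decimal("0.004"),
    "lot-detail-enriched": Decimal("0.004"),
    "monitor-run-completed": Decimal("0.05"),
    "new-lot-detected": Decimal("0.02"),
    "alert-delivered": Decimal("0.005"),
}


def estimate_cost(event_counts: dict) -> Decimal:
    """Return the estimated charge in USD for the given event counts."""
    unknown = set(event_counts) - set(PRICES)
    if unknown:
        raise ValueError(f"unknown events: {sorted(unknown)}")
    return sum(
        (PRICES[name] * count for name, count in event_counts.items()),
        Decimal("0"),
    )


# 1,000 transcripts, one caption file each: 1,000 x ($0.004 + $0.004)
print(estimate_cost({"lot-scraped": 1000, "lot-detail-enriched": 1000}))
```

`Decimal` keeps the sub-cent prices exact, which avoids floating-point drift when the counts get large.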


Monitor & alert setup

  1. Set a searchQuery (e.g. collection:prelinger or subject:"television news").
  2. Turn on monitorMode (and keep alertOnNewItem on).
  3. Add a webhookUrl, slackWebhookUrl, and/or emailRecipients.
  4. Create an Apify Schedule (e.g. daily). The first run baselines the seen items; every later run outputs and alerts only new items. State persists in a named key-value store (archive-transcript-monitor-state), so it survives between runs.
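The baseline-then-diff behaviour described in step 4 can be sketched as follows. This illustrates the idea only and is not the actor's code: `diff_against_seen` is a hypothetical name, and a plain Python set stands in for the named key-value store.

```python
from typing import Iterable, Optional


def diff_against_seen(seen: Optional[set], current_ids: Iterable[str]) -> tuple:
    """Return (new_ids, updated_seen) for one monitor run.

    `seen` is None on the very first run, which only records a baseline.
    """
    current = list(dict.fromkeys(current_ids))  # de-duplicate, keep search order
    if seen is None:
        # First run: remember everything, report nothing as new.
        return [], set(current)
    new_ids = [identifier for identifier in current if identifier not in seen]
    return new_ids, set(seen) | set(current)
```

Between runs, `updated_seen` would be serialized (for example as a sorted JSON list) and loaded again as `seen` at the start of the next run.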

How does it work without AI transcription?

Archive.org items carry caption files: uploader-provided .srt/.vtt subtitles and archive's own autogenerated English ASR (.asr.srt). This actor reads the item's public metadata, picks the caption files, downloads them, and runs a hardened parser built for the irregularities these files show in practice: BOM + CRLF files, 2-digit millisecond ASR stamps, <i> formatting tags, VTT headers with trailing junk, and cues without indices. It does not run speech-to-text, so there's no GPU cost and a run spends its time only on downloading and parsing.
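To make those irregularities concrete, here is a minimal tolerant cue parser. It is a sketch under stated assumptions, not the actor's parser: `parse_cues` is a hypothetical name, short fractions are assumed to be right-padded (`,23` is read as 230 ms), and only the cases named above are handled.

```python
import re

# One timing line; the hours part is optional (VTT allows mm:ss.ttt).
_TIMING = re.compile(
    r"(?:(\d+):)?(\d{1,2}):(\d{2})[,.](\d{1,3})\s*-->\s*"
    r"(?:(\d+):)?(\d{1,2}):(\d{2})[,.](\d{1,3})"
)
_TAG = re.compile(r"<[^>]+>")  # <i>, </i>, <b>, ...


def _to_ms(hours, minutes, seconds, frac) -> int:
    # ASSUMPTION: short fractions are padded on the right, so "23" is 230 ms.
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(frac.ljust(3, "0"))


def parse_cues(raw: str) -> list:
    """Parse SRT or VTT text into a list of cue dicts."""
    text = raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        for pos, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line: a WEBVTT header, a note, or junk
        body = " ".join(_TAG.sub("", line) for line in lines[pos + 1:]).strip()
        if not body:
            continue
        groups = match.groups()
        cues.append({
            "index": len(cues) + 1,  # renumbered, so missing indices do not matter
            "start_ms": _to_ms(*groups[:4]),
            "end_ms": _to_ms(*groups[4:]),
            "text": body,
        })
    return cues
```

A plain transcript is then `" ".join(cue["text"] for cue in parse_cues(raw))`.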


The Internet Archive is a non-profit library that publishes these items and APIs for public access, and much of the captioned material is public-domain era film. The output is published media content and item stats, not personal data. Scraping public data is generally legal, but you are responsible for your use — review archive.org's Terms of Use and each item's rights/license statement before redistributing content.


FAQ

Which items have captions? 3M+ movies/audio items carry .srt/.vtt files — classic films, Prelinger educational shorts, TV news, lectures. The search mode finds them for you (it filters to format:"SubRip" OR "Web Video Text Tracks" automatically).
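For readers who want to preview what a scoped search might return, the sketch below builds a URL for archive.org's public advanced-search endpoint. The exact clauses the actor appends are not published on this page, so the scoping string is an assumption modelled on the description above, and `build_search_url` is a hypothetical helper.

```python
import urllib.parse

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(user_query: str, rows: int = 50, page: int = 1) -> str:
    """Build an advanced-search URL scoped to captioned movies/audio."""
    # ASSUMPTION: the scoping clauses approximate what the actor adds.
    scoped = (
        f"({user_query}) AND mediatype:(movies OR audio) "
        'AND format:("SubRip" OR "Web Video Text Tracks")'
    )
    params = [
        ("q", scoped),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("sort[]", "downloads desc"),
        ("rows", str(rows)),
        ("page", str(page)),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

Opening `build_search_url("collection:prelinger", rows=5)` in a browser returns a JSON response whose `response.docs` array lists the matching identifiers.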

Is there a Whisper/ASR step? No — it downloads the caption files archive.org already publishes (including archive's own ASR track), so it's fast and cheap.

Can I get subtitles for video editing? Yes — add srt and/or vtt to transcriptFormats. The actor normalizes archive's non-standard 2-digit-millisecond stamps to proper hh:mm:ss,mmm, so the files work in any editor/player.

What about multiple languages? Each caption file becomes its own row with caption_lang_code parsed from the filename. Use languages to keep only the ones you want.
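How a language code might be derived from a filename can be sketched as below. The `.asr` rule follows this page (archive's English ASR track); the rule for uploaded tracks, a 2 to 3 letter code before the extension as in `film.es.srt`, is an assumption about naming, and `describe_caption_file` is a hypothetical helper.

```python
from typing import Optional

CAPTION_FORMATS = {".srt": "SubRip", ".vtt": "Web Video Text Tracks"}


def describe_caption_file(name: str) -> Optional[dict]:
    """Return caption fields for a file name, or None if it is not a caption file."""
    lower = name.lower()
    ext = next((e for e in CAPTION_FORMATS if lower.endswith(e)), None)
    if ext is None:
        return None
    stem = name[: -len(ext)]
    suffix = stem.rsplit(".", 1)[-1].lower() if "." in stem else ""
    is_asr = suffix == "asr"
    if is_asr:
        lang = "en"  # archive's autogenerated track is English
    elif suffix.isalpha() and 2 <= len(suffix) <= 3:
        lang = suffix  # ASSUMPTION: uploaded tracks end in a language code
    else:
        lang = None
    return {
        "caption_file_name": name,
        "caption_format": CAPTION_FORMATS[ext],
        "caption_lang_code": lang,
        "is_autogenerated": is_asr,
    }
```

Filtering by `languages` then amounts to keeping the rows whose `caption_lang_code` is in the requested list, or all rows when the list is empty.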

Why did an item return no transcript? Three honest cases, all flagged in the row: the item has no caption files (note), the item is access-restricted (its files are private and download as empty bodies — restricted: true), or a specific file is private/zero-byte. The actor never reports those as empty "successes".

Can I crawl a whole collection? Yes — searchQuery: "collection:{name}" + maxItems: 0. Archive's search window caps at 10,000 rows per query; slice bigger collections by date (publicdate:[2020-01-01 TO 2021-01-01]).
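Slicing by date is easy to automate. The sketch below generates one query per calendar year in the same `publicdate:[... TO ...]` form as the example above; `yearly_slices` is a hypothetical helper. Lucene square-bracket ranges include both endpoints, so an item dated exactly on a boundary can appear in two slices, and results are worth de-duplicating by `identifier`.

```python
def yearly_slices(base_query: str, first_year: int, last_year: int) -> list:
    """Return one search query per calendar year, first_year..last_year inclusive."""
    return [
        f"({base_query}) AND publicdate:[{year}-01-01 TO {year + 1}-01-01]"
        for year in range(first_year, last_year + 1)
    ]


for query in yearly_slices("collection:prelinger", 2020, 2021):
    print(query)
```

Each generated string can be used as the `searchQuery` of a separate run with `maxItems: 0`; if a single year still exceeds 10,000 rows, the same idea applies with monthly ranges.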

How fresh is monitor mode? Every scheduled run re-queries your search and diffs against the named state store — you get only items it hasn't seen before, plus optional Slack/webhook/email alerts.

Does it need a proxy or login? No login or API key. Archive.org's endpoints are public; the default datacenter proxy rotation is plenty.

How do I export? JSON, CSV, Excel, HTML, or RSS from the Dataset tab, or via the Apify API.

What does a 1,000-film crawl cost? With one caption file each: 1,000 × ($0.004 + $0.004) = ~$8.


Feedback

Want full-text search inside transcripts, TV-news-specific fields, or bulk export to a single file? Open an issue on the actor.