TED Talks Transcript Scraper
Pricing
Pay per event
Extracts full transcripts from TED.com talks in any available language. Returns timed segments (JSON), plain text, SRT, and WebVTT formats alongside speaker metadata, tags, and multi-language availability.
Developer
BowTiedRaccoon
Maintained by Community
Scrape full transcripts from TED.com in any available language. Returns timed cue segments, plain text, SRT subtitles, WebVTT captions, and speaker metadata for every talk — packaged in one record per language per talk.
TED Talks Transcript Scraper Features
- Extracts complete transcripts with millisecond-accurate timing (427 cues for an average talk)
- Returns four formats per transcript: JSON segments, plain text, SRT, and WebVTT — most actors pick one format and call it done
- Collects speaker name, role, full bio, event name, recorded date, duration, view count, and topic tags alongside the transcript
- Reports all available language codes so you can plan multi-language runs
- Fetches only the native language by default, or every translation the talk has, or a specific list you provide
- Accepts custom start URLs for targeted scraping of individual talks
- Discovers all talks automatically via TED's year-by-year sitemap index when no URLs are given
- No proxy required — TED serves transcripts publicly, no auth or Cloudflare management involved
What Can You Do With TED Transcript Data?
- NLP researchers: Build or extend corpora for text classification, summarization, or speaker style analysis. TED-LIUM, a widely used speech-recognition corpus, is built from TED talks, and this actor gives you fresh transcript text from the same source
- Language-learning app developers: Pull parallel transcripts of the same talk (for example English and Japanese) for aligned bilingual reading and listening exercises
- AI training teams: Collect multi-speaker, multi-language text at scale; TED's volunteer-translated transcripts cover 100+ languages with consistent quality
- Public speaking coaches: Analyze rhetorical structure, pacing cues, and paragraph breaks across thousands of talks
- Translation quality researchers: Compare the same content across 60+ language variants for benchmarking MT and human translation output
- Educators and content curators: Build searchable archives of transcript text with metadata for curriculum alignment or topic discovery
How TED Talks Transcript Scraper Works
- Seed the run. If you provide startUrls, those talks are processed directly. Otherwise the scraper walks TED's year-by-year sitemap index (2006–2025) and collects every talk URL up to your maxItems budget.
- For each talk, the scraper fetches the transcript page HTML and parses the embedded __NEXT_DATA__ JSON blob. This yields the numeric talk ID, speaker details, event name, dates, view count, tags, and the full list of available language codes.
- Using the language list, the scraper calls TED's public subtitles API, one request per language, and retrieves millisecond-timed caption cues.
- The cues are assembled into four transcript formats, merged with the talk metadata, and saved as one dataset record per language.
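The parsing step can be illustrated with a minimal sketch. It assumes only the standard Next.js convention of embedding page data in a script tag with the id `__NEXT_DATA__`; the function name `extract_next_data` and the keys in the sample blob are hypothetical and do not reflect the actor's real code or TED's actual payload.

```python
import json
import re
from typing import Any

# Next.js pages embed their page data in a script tag with this id.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> dict[str, Any]:
    """Return the parsed __NEXT_DATA__ JSON blob from a page's HTML."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


# Hypothetical, heavily trimmed page; real key names inside the blob will differ.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"talkId": "66", "languages": ["en", "ja"]}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
print(data["props"]["languages"])
```

A regular expression is enough here because the blob sits in a single script tag; an HTML parser would be the more robust choice if the markup varies.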
Input
{"maxItems": 15,"startUrls": [{ "url": "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity" }],"languages": ["en", "ja"],"fetchAllLanguages": false}
| Field | Type | Default | Description |
|---|---|---|---|
| maxItems | integer | 15 | Maximum transcript records to save. One record = one talk × one language. |
| startUrls | array | — | Specific TED talk URLs to scrape. When empty, the scraper discovers talks from the sitemap. |
| languages | array | — | TED language codes to fetch (mostly ISO 639-1, e.g. ["en", "ja", "es"]; regional variants such as pt-br also occur). Leave empty for the talk's native language only. |
| fetchAllLanguages | boolean | false | When true, fetches every available translation for each talk. Overrides languages. |
Fetch all languages for a single talk:
{"maxItems": 100,"startUrls": [{ "url": "https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity" }],"fetchAllLanguages": true}
Ken Robinson's talk has 64 language translations — that input produces 64 records.
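The interplay between languages and fetchAllLanguages can be sketched as follows. This is an illustration of the rules stated in the input table, not the actor's source code; the function `select_languages` and its parameter names are hypothetical.

```python
from typing import Optional


def select_languages(
    available: list[str],
    native: str,
    languages: Optional[list[str]] = None,
    fetch_all_languages: bool = False,
) -> list[str]:
    """Pick which transcript languages to fetch for one talk."""
    if fetch_all_languages:
        # fetchAllLanguages overrides any explicit languages list.
        return list(available)
    if languages:
        # Keep only the requested codes that this talk actually offers.
        return [code for code in languages if code in available]
    # Default: the talk's native language only.
    return [native] if native in available else []


available = ["en", "ja", "es", "pt-br"]
print(select_languages(available, native="en"))
print(select_languages(available, native="en", languages=["ja", "fr"]))
print(len(select_languages(available, native="en", fetch_all_languages=True)))
```

Each selected language becomes one record, so the number of records for a run is the sum of these lists across talks, capped by maxItems.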
TED Talks Transcript Scraper Output Fields
{"talk_id": "66","slug": "sir_ken_robinson_do_schools_kill_creativity","title": "Do schools kill creativity?","speaker_name": "Sir Ken Robinson","speaker_role": "Author, educator","speaker_bio": "Creativity expert Sir Ken Robinson challenged the way we educate our children...","event": "TED2006","recorded_date": "2006-02-25","published_date": "2006-06-27T00:11:00.000Z","duration_seconds": 1148,"language": "en","language_name": "English","tags": "culture, education, creativity, dance, parenting, teaching, kids","description": "Sir Ken Robinson makes an entertaining and profoundly moving case for creating an education system...","view_count": 80149052,"thumbnail_url": "https://pi.tedcdn.com/r/pe.tedcdn.com/...","canonical_url": "https://www.ted.com/talks/sir_ken_robinson_do_schools_kill_creativity","available_languages": "pt-br, el, eo, en, vi, ca, it, sv, cs, ar, ...","transcript_plain": "Good morning. How are you? (Audience) Good. It's been great, hasn't it?...","transcript_srt": "1\n00:00:02,103 --> 00:00:04,678\nGood morning. How are you?\n\n2\n...","transcript_vtt": "WEBVTT\n\n1\n00:00:02.103 --> 00:00:04.678\nGood morning. How are you?\n\n2\n...","transcript_segments": "[{\"start_ms\":2103,\"duration_ms\":2575,\"text\":\"Good morning. How are you?\",\"start_of_paragraph\":true},...]"}
| Field | Type | Description |
|---|---|---|
| talk_id | string | Numeric TED talk ID |
| slug | string | Canonical URL slug |
| title | string | Talk title in English |
| speaker_name | string | Speaker display name |
| speaker_role | string | One-line speaker description |
| speaker_bio | string | Full speaker biography |
| event | string | Event where the talk was given (e.g. TED2006, TEDxBoston) |
| recorded_date | string | Recording date (YYYY-MM-DD) |
| published_date | string | Publication date (ISO 8601) |
| duration_seconds | number | Talk duration in seconds |
| language | string | TED language code for this transcript (e.g. en, ja, pt-br) |
| language_name | string | Full language name in English |
| tags | string | Comma-separated TED topic tags |
| description | string | Talk abstract |
| view_count | number | Total view count across platforms |
| thumbnail_url | string | Talk thumbnail image URL |
| canonical_url | string | Canonical TED.com URL |
| available_languages | string | Comma-separated codes of all available translations |
| transcript_plain | string | Full transcript as plain text |
| transcript_srt | string | Transcript in SRT subtitle format |
| transcript_vtt | string | Transcript in WebVTT format |
| transcript_segments | string | JSON-serialized timed cue array: [{start_ms, duration_ms, text, start_of_paragraph}] |
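Several fields arrive as strings that need a second parsing step: tags and available_languages are comma-separated, and transcript_segments is JSON inside a string. The sketch below shows one way to unpack them. The `Cue` dataclass and the helper names are hypothetical, and the record is a trimmed copy of the sample output above.

```python
import json
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int
    duration_ms: int
    text: str
    start_of_paragraph: bool


def split_csv(value: str) -> list[str]:
    """Split a comma-separated field such as tags into a clean list."""
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_cues(record: dict) -> list[Cue]:
    """Decode the JSON string in transcript_segments into Cue objects."""
    raw_cues = json.loads(record["transcript_segments"])
    return [
        Cue(
            start_ms=raw["start_ms"],
            duration_ms=raw["duration_ms"],
            text=raw["text"],
            start_of_paragraph=raw["start_of_paragraph"],
        )
        for raw in raw_cues
    ]


# Trimmed copy of the sample record shown above.
record = {
    "tags": "culture, education, creativity",
    "available_languages": "pt-br, el, eo, en",
    "transcript_segments": (
        '[{"start_ms":2103,"duration_ms":2575,'
        '"text":"Good morning. How are you?","start_of_paragraph":true}]'
    ),
}

print(split_csv(record["tags"]))
print(parse_cues(record)[0])
```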
🔍 FAQ
How do I scrape TED talk transcripts?
TED Talks Transcript Scraper handles discovery automatically. Provide a startUrls list for specific talks or leave it empty to pull from the sitemap. Set maxItems to cap the output, then run.
How much does TED Talks Transcript Scraper cost to run?
TED Talks Transcript Scraper charges $0.003 per transcript record (one talk × one language) plus a small platform start fee. Fetching the English transcript for 100 talks costs roughly $0.30.
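The per-record arithmetic can be checked with a small sketch. It uses the $0.003 rate stated above and deliberately leaves out the platform start fee, whose amount is not given here; the function name `estimate_record_cost` is hypothetical.

```python
from decimal import Decimal

# Per-record price quoted above; the platform start fee is not included.
PRICE_PER_RECORD = Decimal("0.003")


def estimate_record_cost(talks: int, languages_per_talk: int = 1) -> Decimal:
    """Estimate the per-record charge in USD for a run."""
    return PRICE_PER_RECORD * talks * languages_per_talk


# 100 talks in one language each.
print(estimate_record_cost(100))
# One talk in all 64 of its languages.
print(estimate_record_cost(1, languages_per_talk=64))
```

Decimal avoids the rounding noise that binary floats would introduce at this price scale.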
Can I get transcripts in multiple languages?
Yes. Set fetchAllLanguages: true to retrieve every translation for each talk, or pass a languages array with specific language codes such as "ja" or "pt-br". A popular talk like Ken Robinson's "Do Schools Kill Creativity?" has 64 language variants.
Does TED Talks Transcript Scraper need proxies?
No. TED publishes transcripts publicly — no authentication, no Cloudflare challenge, no residential proxy required. The scraper runs on standard infrastructure at a courteous pace.
What format do the timed segments come in?
Each record includes transcript_segments as a JSON string containing an array of cue objects: {start_ms, duration_ms, text, start_of_paragraph}. Timing is in milliseconds, matching TED's source data. SRT and VTT formats are derived from the same cue data.
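To show how SRT text can be derived from the cue data, here is a minimal sketch. It is not the actor's implementation; the function names are hypothetical, and the cue is taken from the sample output above.

```python
import json


def format_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def cues_to_srt(cues: list[dict]) -> str:
    """Build SRT text from cue dicts with start_ms, duration_ms, and text."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = cue["start_ms"]
        end = start + cue["duration_ms"]  # end time = start + duration
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{cue['text']}\n"
        )
    # A blank line separates consecutive SRT blocks.
    return "\n".join(blocks)


segments_json = (
    '[{"start_ms":2103,"duration_ms":2575,'
    '"text":"Good morning. How are you?","start_of_paragraph":true}]'
)
print(cues_to_srt(json.loads(segments_json)))
```

For the sample cue, the start of 2103 ms plus the duration of 2575 ms gives an end of 4678 ms, which matches the `00:00:02,103 --> 00:00:04,678` line in the sample transcript_srt field. WebVTT uses the same timing with a period instead of a comma before the milliseconds.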
Are transcripts available for all TED talks?
Most established talks have English transcripts. Translations depend on TED's volunteer community — popular talks often have 50+ languages, while talks published in the last few months may have none yet. The scraper logs a warning and skips talks with no available transcripts for the requested language.
Need More Features?
Need filtering by event, speaker, or topic? Custom language combinations? File an issue or get in touch.
Why Use TED Talks Transcript Scraper?
- Four formats, one run — plain text, SRT, WebVTT, and timestamped JSON segments in a single record; most alternatives force you to choose one and convert the rest yourself
- Multi-language by design — fetch all 64+ translations of a talk with a single flag, which is the part that makes this corpus useful for NLP alignment work
- No setup required — public access, no API keys, no proxies, sitemap-driven discovery out of the box