PBS Frontline Transcripts Scraper
Pricing
Pay per event
Scrape full transcripts from PBS Frontline documentary films. Extracts transcript body text, speaker labels, film metadata (air date, synopsis, credits), and topic tags from all Frontline documentaries on pbs.org.
Developer
BowTiedRaccoon
Maintained by Community
Scrapes full transcripts from PBS Frontline documentary films. Returns one record per documentary — title, synopsis, air date, speaker-labeled transcript body, and topic metadata. Covers the active Frontline archive (250+ films back to ~1995) using the site's sitemap index for discovery.
Frontline is one of the few public-broadcasting sources where transcripts are routinely cited in academic and journalism contexts. Each film runs 60-120 minutes and produces 30-80KB of clean, fact-checked, single-narrative text — a different unit of value from fragmented panel-show or wire-copy corpora.
What It Scrapes
Every record comes from pbs.org/wgbh/frontline/documentary/<slug>/. The transcript lives inline on the main documentary page (the /transcript/ subpath was retired; transcripts are now embedded directly). Metadata comes from JSON-LD structured data blocks on the same page.
Output Schema
| Field | Type | Description |
|---|---|---|
| film_slug | string | URL slug (e.g. the-deal-trump-bukele-gangs-el-salvador) |
| film_title | string | Documentary title |
| film_url | string | Canonical PBS URL |
| air_date | string | Original broadcast date (YYYY-MM-DD) |
| duration_minutes | number | Runtime in minutes |
| synopsis | string | Brief description from page metadata |
| producers | string | Comma-separated producing and directing credits |
| correspondents | string | Comma-separated correspondent credits |
| related_topics | string | Comma-separated PBS topic tags |
| body_html | string | Full transcript HTML with `<strong>SPEAKER:</strong>` spans |
| body_text | string | Plain-text transcript with inline speaker labels |
| speakers | string | Comma-separated unique speaker labels |
| has_viewer_discretion_notice | boolean | True if the film flags mature content |
| related_film_urls | string | Comma-separated URLs of cross-linked Frontline films |
| canonical_url | string | Canonical page URL |
| source | string | Fixed: pbs.org/wgbh/frontline |
| scraped_at | datetime | ISO 8601 scrape timestamp |
Speaker labels follow the Frontline convention: NARRATOR, PRESIDENT DONALD TRUMP, NAYIB BUKELE, etc. They are extracted directly from <strong>LABEL:</strong> spans — no inference, no cleanup required.
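As a rough illustration of that extraction, the sketch below pulls unique labels out of a `body_html` string in order of first appearance. The regex and the function name `unique_speakers` are assumptions made for this example, not the actor's actual code, and the sample HTML is invented placeholder text.

```python
import re

# Assumed pattern: a label followed by a colon, wrapped in <strong>.
SPEAKER_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")


def unique_speakers(body_html: str) -> list[str]:
    """Return speaker labels in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}  # dict keeps insertion order
    for match in SPEAKER_RE.finditer(body_html):
        seen.setdefault(match.group(1), None)
    return list(seen)


# Placeholder markup shaped like the spans described above.
sample = (
    "<p><strong>NARRATOR:</strong> First placeholder line.</p>"
    "<p><strong>NAYIB BUKELE:</strong> Second placeholder line.</p>"
    "<p><strong>NARRATOR:</strong> Third placeholder line.</p>"
)
print(", ".join(unique_speakers(sample)))
```

Joining the result with commas gives a string shaped like the `speakers` field in the schema above.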
Input Options
startUrls (array, optional) — Specific documentary URLs to scrape. Leave empty to run the full sitemap discovery and scrape all available transcripts.
maxItems (integer, optional) — Cap on total records. Default 0 (no limit). When using sitemap discovery, applies globally across all sitemaps.
Example: Single film
{"startUrls": [{"url": "https://www.pbs.org/wgbh/frontline/documentary/the-deal-trump-bukele-gangs-el-salvador/"}]}
Example: Full archive crawl (all ~250 films)
{"maxItems": 0}
Example: 50 most recent films
{"maxItems": 50}
How It Works
Discovery uses PBS Frontline's sitemap index at pbs.org/wgbh/frontline/sitemap.xml. Each of the nine sitemap-documentary sub-sitemaps holds up to 100 film URLs, ordered newest-first. Films without a transcript (some older entries that predate the site rebuild) are skipped silently.
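The discovery step can be sketched roughly as follows, assuming each sub-sitemap follows the standard sitemaps.org `urlset` format. The function name `documentary_urls` and the two `example-film-*` URLs are hypothetical, and the sample XML is a stand-in rather than real PBS output. The sketch caps URLs, which is a simplification: the actor's `maxItems` counts returned records.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed to be what the PBS sitemaps use).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def documentary_urls(sitemap_xml: str, max_items: int = 0) -> list[str]:
    """Collect documentary URLs from one sub-sitemap; 0 means no limit."""
    root = ET.fromstring(sitemap_xml)
    urls: list[str] = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        if "/frontline/documentary/" not in url:
            continue  # ignore anything that is not a film page
        urls.append(url)
        if max_items and len(urls) >= max_items:
            break
    return urls


# Hypothetical sub-sitemap content, newest film first.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.pbs.org/wgbh/frontline/documentary/the-deal-trump-bukele-gangs-el-salvador/</loc></url>
  <url><loc>https://www.pbs.org/wgbh/frontline/documentary/example-film-two/</loc></url>
  <url><loc>https://www.pbs.org/wgbh/frontline/documentary/example-film-three/</loc></url>
</urlset>"""

print(documentary_urls(sample, max_items=2))
```

Because the sub-sitemaps are ordered newest-first, stopping early is what makes a small `maxItems` return the most recent films.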
Metadata is parsed from JSON-LD blocks on each documentary page. The transcript and credits live in two Chakra UI accordion panels — panel 0 is the transcript, panel 1 is the credits. Speaker labels are extracted via a single regex pass on the <strong>LABEL:</strong> pattern Frontline uses consistently across its archive.
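A minimal sketch of the JSON-LD step, using only the Python standard library, might look like this. The class name `JsonLdParser`, the `@type` value, and the field names in the sample are assumptions for illustration; the actual structured data on a Frontline page may use different types and keys.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect every <script type="application/ld+json"> block as a dict."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._chunks: list[str] = []
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._chunks))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the page
            items = parsed if isinstance(parsed, list) else [parsed]
            self.blocks.extend(i for i in items if isinstance(i, dict))


# Hypothetical page fragment; real pages may carry several blocks.
sample_html = """<html><head>
<script type="application/ld+json">
{"@type": "TVEpisode", "name": "Example Film", "datePublished": "2020-01-01"}
</script>
</head><body></body></html>"""

parser = JsonLdParser()
parser.feed(sample_html)
parser.close()
print(parser.blocks[0]["name"], parser.blocks[0]["datePublished"])
```

Skipping blocks that fail to decode keeps one malformed script tag from discarding the rest of a page's metadata.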
The site is server-rendered Next.js with aggressive edge caching — no headless browser required, no proxy required.
Pricing
Charged per record scraped. Long-form transcripts (30-80KB each) are priced at a modest premium that reflects their per-record research value. A start price also applies to each actor run, regardless of record count.
Notes
- Films without a transcript are skipped gracefully and do not count toward maxItems.
- Some older archive films have had their transcript pages rebuilt and may appear without speaker-label markup; body text is still returned when a transcript exists.
- body_html preserves the original `<strong>` speaker spans for downstream NLP pipelines that want to distinguish speaker turns programmatically.
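For pipelines that need speaker turns, one possible approach is to split `body_html` on the label spans. The sketch below is an illustration under that assumption: `speaker_turns` and both regexes are hypothetical names and patterns, and the sample markup is invented.

```python
import re
from html import unescape

# Assumed label pattern, matching the <strong>LABEL:</strong> convention.
LABEL_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")
TAG_RE = re.compile(r"<[^>]+>")


def speaker_turns(body_html: str) -> list[tuple[str, str]]:
    """Split transcript HTML into (speaker, text) pairs in document order."""
    # With one capture group, re.split returns
    # [preamble, label1, text1, label2, text2, ...].
    parts = LABEL_RE.split(body_html)
    turns: list[tuple[str, str]] = []
    for label, chunk in zip(parts[1::2], parts[2::2]):
        text = unescape(TAG_RE.sub(" ", chunk))  # drop tags, decode entities
        turns.append((label, " ".join(text.split())))  # collapse whitespace
    return turns


sample = (
    "<p><strong>NARRATOR:</strong> Opening line.</p>"
    "<p><strong>REPORTER:</strong> A question &amp; an answer.</p>"
)
for speaker, text in speaker_turns(sample):
    print(f"{speaker}: {text}")
```

For a film whose transcript has no speaker-label markup, this returns an empty list, so a caller would fall back to `body_text`.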
Need Custom Fields or a Different Source?
File an issue or get in touch. We can add fields, filter by topic, or build adjacent scrapers in the same broadcast-transcript vertical.