PBS Frontline Transcripts Scraper

Pricing

Pay per event

Scrape full transcripts from PBS Frontline documentary films. Extracts transcript body text, speaker labels, film metadata (air date, synopsis, credits), and topic tags from all Frontline documentaries on pbs.org.

Developer: BowTiedRaccoon (maintained by Community)

Scrapes full transcripts from PBS Frontline documentary films. Returns one record per documentary — title, synopsis, air date, speaker-labeled transcript body, and topic metadata. Covers the active Frontline archive (250+ films back to ~1995) using the site's sitemap index for discovery.

Frontline is one of the few public-broadcasting sources where transcripts are routinely cited in academic and journalism contexts. Each film runs 60-120 minutes and produces 30-80KB of clean, fact-checked, single-narrative text — a different unit of value from fragmented panel-show or wire-copy corpora.

What It Scrapes

Every record comes from pbs.org/wgbh/frontline/documentary/<slug>/. The transcript lives inline on the main documentary page (the /transcript/ subpath was retired; transcripts are now embedded directly). Metadata comes from JSON-LD structured data blocks on the same page.

Output Schema

| Field | Type | Description |
|---|---|---|
| film_slug | string | URL slug (e.g. the-deal-trump-bukele-gangs-el-salvador) |
| film_title | string | Documentary title |
| film_url | string | Canonical PBS URL |
| air_date | string | Original broadcast date (YYYY-MM-DD) |
| duration_minutes | number | Runtime in minutes |
| synopsis | string | Brief description from page metadata |
| producers | string | Comma-separated producing and directing credits |
| correspondents | string | Comma-separated correspondent credits |
| related_topics | string | Comma-separated PBS topic tags |
| body_html | string | Full transcript HTML with <strong>SPEAKER:</strong> spans |
| body_text | string | Plain-text transcript with inline speaker labels |
| speakers | string | Comma-separated unique speaker labels |
| has_viewer_discretion_notice | boolean | True if the film flags mature content |
| related_film_urls | string | Comma-separated URLs of cross-linked Frontline films |
| canonical_url | string | Canonical page URL |
| source | string | Fixed: pbs.org/wgbh/frontline |
| scraped_at | datetime | ISO 8601 scrape timestamp |
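
Several fields arrive as comma-separated strings, so downstream code will usually want to split them. The following is a minimal sketch of one way to do that, assuming records are plain dictionaries keyed by the field names above; `split_list` and `normalize_record` are hypothetical helper names, not part of the actor.

```python
from datetime import date

# Fields the schema describes as comma-separated strings.
LIST_FIELDS = (
    "producers",
    "correspondents",
    "related_topics",
    "speakers",
    "related_film_urls",
)


def split_list(value):
    """Split a comma-separated string into trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def normalize_record(record):
    """Return a copy with list fields split and air_date parsed to a date."""
    out = dict(record)
    for field in LIST_FIELDS:
        out[field] = split_list(record.get(field))
    air = record.get("air_date")
    # air_date is documented as YYYY-MM-DD, which date.fromisoformat accepts.
    out["air_date"] = date.fromisoformat(air) if air else None
    return out
```

A naive split like this would break any value that itself contains a comma, so treat the resulting lists as a convenience rather than a guaranteed parse.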

Speaker labels follow the Frontline convention: NARRATOR, PRESIDENT DONALD TRUMP, NAYIB BUKELE, etc. They are extracted directly from <strong>LABEL:</strong> spans, with no inference or cleanup applied. Films whose transcripts lack this markup still return body text, but without speaker labels.
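
To illustrate the label convention, here is a small sketch that recovers the unique labels from a body_html value. The regular expression is an assumption about what the markup looks like based on the description above; it is not taken from the actor's source.

```python
import re

# Assumed shape of a speaker span: <strong>LABEL:</strong>
SPEAKER_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")


def unique_speakers(body_html):
    """Return speaker labels in order of first appearance, without duplicates."""
    seen = []
    for label in SPEAKER_RE.findall(body_html):
        if label not in seen:
            seen.append(label)
    return seen
```

Joining the result with ", " should give something comparable to the speakers field, although the actor's own normalization may differ.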

Input Options

startUrls (array, optional) — Specific documentary URLs to scrape. Leave empty to run the full sitemap discovery and scrape all available transcripts.

maxItems (integer, optional) — Cap on total records. Default 0 (no limit). When using sitemap discovery, applies globally across all sitemaps.
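
The maxItems behavior can be pictured as a single counter shared across every discovered URL. The sketch below is an illustration of that idea only, under the assumption that a film without a transcript yields no record; `take_records` and the `scrape` callable are hypothetical names.

```python
def take_records(urls, scrape, max_items=0):
    """Yield records until max_items is reached; 0 means no limit."""
    emitted = 0
    for url in urls:
        if max_items and emitted >= max_items:
            break
        record = scrape(url)  # hypothetical: returns None when no transcript exists
        if record is None:
            continue  # skipped films do not count toward the cap
        emitted += 1
        yield record
```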

Example: Single film

{
  "startUrls": [
    {"url": "https://www.pbs.org/wgbh/frontline/documentary/the-deal-trump-bukele-gangs-el-salvador/"}
  ]
}

Example: Full archive crawl (all ~250 films)

{
  "maxItems": 0
}

Example: 50 most recent films

{
  "maxItems": 50
}
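
The same inputs can be sent programmatically. This is a rough sketch using the apify-client Python package; the actor ID below is a placeholder because this page does not state it, and reading the token from an APIFY_TOKEN environment variable is an assumption about your setup.

```python
import os

# Placeholder: replace with the actor's real ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(urls=None, max_items=0):
    """Build the input object using the two documented options."""
    run_input = {"maxItems": max_items}
    if urls:
        run_input["startUrls"] = [{"url": u} for u in urls]
    return run_input


def run_actor(run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    for item in run_actor(build_run_input(max_items=50)):
        print(item["film_title"], item["air_date"])
```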

How It Works

Discovery uses PBS Frontline's sitemap index at pbs.org/wgbh/frontline/sitemap.xml. The nine sitemap-documentary sub-sitemaps each hold up to 100 film URLs, ordered newest-first. Films without a transcript (some pre-rebuild older entries) are silently skipped.
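
As a sketch of what this discovery step might look like, the snippet below parses a sitemap index and a sub-sitemap with the standard library. It assumes the files use the standard sitemaps.org namespace and that the documentary sub-sitemaps can be recognized by "sitemap-documentary" in their URL; the function names are hypothetical, and fetching the XML is left out.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed to be what the site uses).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def documentary_sitemaps(index_xml):
    """Return sub-sitemap URLs from the index that look like documentary sitemaps."""
    root = ET.fromstring(index_xml)
    locs = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
    return [loc for loc in locs if "sitemap-documentary" in loc]


def film_urls(sitemap_xml):
    """Return page URLs from one sub-sitemap, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
```

Because document order is preserved, a newest-first sitemap yields newest-first URLs.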

Metadata is parsed from JSON-LD blocks on each documentary page. The transcript and credits live in two Chakra UI accordion panels: panel 0 is the transcript, panel 1 is the credits. Speaker labels are extracted in a single regex pass over the <strong>LABEL:</strong> pattern that Frontline uses across its archive, apart from some rebuilt older pages that lack the markup.
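
For readers unfamiliar with JSON-LD, the metadata step could be approximated as follows. This is a simplified sketch, not the actor's code: it pulls every application/ld+json script block out of raw HTML with a regular expression and decodes it. Which JSON-LD keys feed which output fields is not documented here, so the sketch stops at returning the decoded blocks.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def jsonld_blocks(page_html):
    """Return every decodable JSON-LD object found in the page."""
    blocks = []
    for raw in JSONLD_RE.findall(page_html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore malformed blocks rather than failing the page
        # A block may hold a single object or a list of objects.
        blocks.extend(data if isinstance(data, list) else [data])
    return blocks
```

A real HTML parser would be more robust than a regex against unusual markup; the regex keeps the example dependency-free.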

The site is server-rendered Next.js with aggressive edge caching — no headless browser required, no proxy required.

Pricing

Charged per record scraped. Long-form transcripts (30-80KB each) are priced at a modest premium that reflects their per-record research value. A start fee is charged once per actor run, regardless of record count.

Notes

  • Films without a transcript are skipped gracefully and do not count toward maxItems.
  • Some older archive films have had their transcript pages rebuilt and may appear without speaker-label markup — body text is still returned when a transcript exists.
  • body_html preserves the original <strong> speaker spans for downstream NLP pipelines that want to distinguish speaker turns programmatically.
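
As an example of the downstream use mentioned in the last note, the sketch below splits body_html into (speaker, text) turns. It assumes the <strong>LABEL:</strong> markup described earlier and strips any other tags crudely; `speaker_turns` is a hypothetical helper, and a transcript without speaker markup simply yields an empty list.

```python
import re
from html import unescape

# Assumed shape of a speaker span: <strong>LABEL:</strong>
TURN_RE = re.compile(r"<strong>\s*([^<:]+?)\s*:\s*</strong>")
TAG_RE = re.compile(r"<[^>]+>")


def speaker_turns(body_html):
    """Split transcript HTML into a list of (speaker, text) tuples."""
    # With one capture group, re.split returns
    # [preamble, label1, text1, label2, text2, ...].
    parts = TURN_RE.split(body_html)
    turns = []
    for label, chunk in zip(parts[1::2], parts[2::2]):
        text = unescape(TAG_RE.sub(" ", chunk))  # drop tags, decode entities
        turns.append((label, " ".join(text.split())))  # collapse whitespace
    return turns
```

From here, per-speaker word counts or turn-level embeddings are a short step, for instance by grouping the tuples on their first element.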

Need Custom Fields or a Different Source?

File an issue or get in touch. We can add fields, filter by topic, or build adjacent scrapers in the same broadcast-transcript vertical.