CBS 60 Minutes Transcripts Scraper
Pricing
Pay per event
Collects full interview transcripts from CBS 60 Minutes. Discovers pages via the CBS News article sitemap, extracts the Q&A body, correspondent name, broadcast date, speaker labels, and topic tags. Video-only segments without a published transcript are skipped.
Developer: BowTiedRaccoon (maintained by Community)
Scrapes full Q&A interview transcripts from CBS News 60 Minutes. Returns one record per transcript page: title, correspondent, broadcast date, subject list, speaker-labeled body text, and topic metadata. Transcript pages are discovered automatically from the CBS News article sitemap; video-only segments without a published transcript are skipped.
60 Minutes is the most-watched US news magazine, known for long-form sit-down interviews with heads of state, CEOs, whistleblowers, and scientists. Each transcript runs 5,000-30,000 words of on-the-record Q&A, which makes the corpus useful for media research, RAG pipelines, and investigative journalism datasets.
What It Scrapes
Targets two URL patterns on cbsnews.com:
/news/<slug>-60-minutes-transcript/ (primary transcript pattern)
/news/read-the-full-transcript-of-<slug>/ (extended interview variant)
Discovery walks the CBS News monthly article sitemaps, filters by these patterns, and scrapes each matching page. Video-only stories (e.g. /news/<slug>-60-minutes/) are explicitly excluded.
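The filter described above can be sketched as a pair of regular expressions. This is a minimal illustration rather than the scraper's actual code: the function name `is_transcript_url` and the exact regexes are assumptions derived from the two URL shapes listed.

```python
import re
from urllib.parse import urlparse

# Assumed regexes, transcribed from the two documented URL shapes.
TRANSCRIPT_PATTERNS = (
    re.compile(r"^/news/[^/]+-60-minutes-transcript/?$"),
    re.compile(r"^/news/read-the-full-transcript-of-[^/]+/?$"),
)


def is_transcript_url(url: str) -> bool:
    """Return True when the URL is on cbsnews.com and matches a transcript pattern."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "cbsnews.com" and not host.endswith(".cbsnews.com"):
        return False
    return any(pattern.match(parsed.path) for pattern in TRANSCRIPT_PATTERNS)
```

A video-only path such as /news/<slug>-60-minutes/ fails both patterns because it lacks the -transcript suffix and the read-the-full-transcript-of- prefix.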
Output Schema
| Field | Type | Description |
|---|---|---|
| story_slug | string | URL slug of the transcript page |
| story_title | string | Article headline |
| story_url | string | Canonical CBS News URL |
| aired_date | string | Broadcast date (YYYY-MM-DD) |
| published_date | string | CBS News publish timestamp (ISO 8601) |
| segment_type | string | Inferred type: interview, investigation, or profile |
| correspondent | string | CBS News correspondent (e.g. Major Garrett, Lesley Stahl) |
| subjects | string | Interviewed subjects extracted from speaker labels (comma-separated) |
| synopsis | string | Article dek / meta description |
| body_html | string | Full transcript HTML preserving Q&A paragraph structure |
| body_text | string | Plain-text version of the transcript |
| speakers | string | All speaker labels found in the transcript (comma-separated) |
| is_transcript | boolean | Always true; non-transcripts are skipped |
| has_video_only_variant | boolean | True when a paired video-only story exists |
| related_story_urls | string | Related CBS News links on the page (comma-separated) |
| topics | string | CBS News topic tags (comma-separated) |
| canonical_url | string | Canonical URL from page head |
| source | string | Fixed: cbsnews.com/60-minutes |
| scraped_at | datetime | ISO 8601 scrape timestamp |
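
Several fields in the schema are comma-separated strings, so downstream code will usually want to split them into lists. The helper below is a hypothetical post-processing sketch (the name `expand_record` is not part of the scraper), and it assumes that individual names, tags, and URLs do not themselves contain commas.

```python
from datetime import date

# Fields documented above as comma-separated strings.
LIST_FIELDS = ("subjects", "speakers", "related_story_urls", "topics")


def expand_record(record: dict) -> dict:
    """Return a copy of a record with list fields split and aired_date parsed."""
    out = dict(record)
    for field in LIST_FIELDS:
        raw = record.get(field) or ""
        # Assumption: a plain comma split is safe for these values.
        out[field] = [part.strip() for part in raw.split(",") if part.strip()]
    if record.get("aired_date"):
        out["aired_date"] = date.fromisoformat(record["aired_date"])
    return out
```

Fields not listed in `LIST_FIELDS` pass through unchanged, so the result can still be serialized alongside the original record keys.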
Speaker labels follow two CBS conventions: Major Garrett: (Title Case) and MAJOR GARRETT: (ALL-CAPS, used in the extended-interview variant). Both formats are normalized and extracted.
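One way to handle both label conventions is a single paragraph-leading regex followed by case normalization. The sketch below is an assumption about how such extraction could work, not the scraper's implementation; `SPEAKER_LABEL`, `normalize_label`, and `extract_speakers` are illustrative names, and the sample names in any usage are made up.

```python
import re

# Matches one to five capitalized words followed by a colon at the start of a paragraph.
SPEAKER_LABEL = re.compile(
    r"^\s*([A-Z][A-Za-z.'\-]*(?:\s+[A-Z][A-Za-z.'\-]*){0,4}):\s+"
)


def normalize_label(label: str) -> str:
    """Convert ALL-CAPS labels to Title Case; leave other labels untouched."""
    return label.title() if label.isupper() else label


def extract_speakers(paragraphs: list[str]) -> list[str]:
    """Return unique speaker labels in order of first appearance."""
    seen: list[str] = []
    for text in paragraphs:
        match = SPEAKER_LABEL.match(text)
        if match:
            name = normalize_label(match.group(1))
            if name not in seen:
                seen.append(name)
    return seen
```

This heuristic has known limits: `str.title()` turns a name like MCDONALD into Mcdonald, and a paragraph that opens with a capitalized word and a colon (for example "Note: ...") would be misread as a speaker.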
Input Options
maxItems (integer, required) — Maximum number of transcript records to scrape. Set a higher value for bulk runs.
startDate (string, optional) — Limit sitemap discovery to a given month onwards (YYYY-MM format, e.g. "2024-01"). Defaults to all available months when omitted.
startUrls (array, optional) — One or more direct CBS News transcript URLs. When provided, sitemap discovery is skipped and only the supplied URLs are scraped. Useful for targeted re-runs of specific episodes.
Example: Specific episode
{"maxItems": 1, "startUrls": [{"url": "https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/"}]}
Example: All 2025 transcripts
{"maxItems": 200, "startDate": "2025-01"}
Example: Full archive (all available transcripts)
{"maxItems": 1000}
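The interaction between the three options can be sketched as a small planning function. This is a hypothetical illustration: `plan_run` and its return shape are not part of the actor, and the choices to reject a non-positive maxItems and to cap startUrls at maxItems are assumptions.

```python
import re

MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def plan_run(actor_input: dict) -> dict:
    """Validate the input and decide between direct-URL and sitemap mode."""
    max_items = actor_input.get("maxItems")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(max_items, int) or isinstance(max_items, bool) or max_items < 1:
        raise ValueError("maxItems is required and must be a positive integer")

    start_date = actor_input.get("startDate")
    if start_date is not None and not MONTH_RE.fullmatch(start_date):
        raise ValueError("startDate must use the YYYY-MM format")

    urls = [entry["url"] for entry in actor_input.get("startUrls") or []]
    if urls:
        # Supplied URLs skip sitemap discovery entirely.
        return {"mode": "direct", "urls": urls[:max_items]}
    return {"mode": "sitemap", "start_month": start_date, "max_items": max_items}
```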
How It Works
Discovery uses the CBS News sitemap index at cbsnews.com/xml-sitemap/index.xml. Monthly article sitemaps (article-YYYY-MM.xml) are walked in order, newest first. Each sitemap lists 3,000+ news articles; only URLs matching the transcript patterns are fetched.
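The discovery walk can be sketched with the standard library alone. The code below is an illustrative assumption, not the actor's source: it enumerates months newest first and filters the `<loc>` entries of one sitemap document. The sitemap namespace is the standard sitemaps.org one; the helper names and the second sample URL are made up.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
TRANSCRIPT_RE = re.compile(
    r"/news/(?:[^/]+-60-minutes-transcript|read-the-full-transcript-of-[^/]+)/?$"
)


def months_newest_first(start: str, end: str) -> list[str]:
    """List YYYY-MM months from end back to start, inclusive."""
    start_y, start_m = map(int, start.split("-"))
    year, month = map(int, end.split("-"))
    months = []
    while (year, month) >= (start_y, start_m):
        months.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return months


def transcript_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> values in a sitemap that match the transcript patterns."""
    root = ET.fromstring(sitemap_xml)
    locs = (el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text)
    return [url for url in locs if TRANSCRIPT_RE.search(url)]


# Hypothetical two-entry sitemap; the second URL is a made-up video-only story.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/</loc></url>
  <url><loc>https://www.cbsnews.com/news/example-story-60-minutes/</loc></url>
</urlset>"""
```

Each month string maps onto a sitemap file name of the form article-YYYY-MM.xml; only the URLs returned by `transcript_urls` would then be fetched.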
Metadata is parsed from the JSON-LD NewsArticle blocks present on every CBS article page, which provide a reliable correspondent name, publish date, and keywords. The transcript body lives in <section class="content__body"> as a sequence of <p> tags. Speaker labels are extracted from paragraph-leading Name: patterns. Ad wrappers are stripped before body extraction.
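Reading a JSON-LD block needs nothing beyond the standard library. The following is a minimal sketch under stated assumptions: the class and function names are hypothetical, and the exact shape of the CBS JSON-LD payload (for example whether `author` is an object or a list) is not confirmed by this document.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def find_news_article(html: str) -> Optional[dict]:
    """Return the first JSON-LD object whose @type is NewsArticle, if any."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the page
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                return item
    return None
```

Fields such as `headline` and `datePublished` are standard schema.org NewsArticle properties, so they can be read from the returned dictionary when present.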
CBS News pages are server-rendered behind a Varnish edge cache, and no bot protection was observed. Neither a proxy nor a headless browser is required.
Coverage Notes
60 Minutes airs approximately 45 episodes per US broadcast season, with 3-4 segments per episode. Roughly 50-70% of segments receive a published transcript; the remainder are video-only. This scraper covers transcript-bearing segments only and makes that boundary explicit: every record carries is_transcript: true, and video-only pages are skipped. The transcript archive goes back approximately 5 years, with sparser coverage for earlier seasons.
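The figures above imply a rough range of transcripts per season, which may help when choosing maxItems. The arithmetic below is only a back-of-the-envelope sketch built from the approximate numbers in this section; `expected_transcripts` is an illustrative helper.

```python
def expected_transcripts(episodes: int, segments_per_episode: int, transcript_pct: int) -> float:
    """Estimate transcripts per season from the approximate coverage figures."""
    return episodes * segments_per_episode * transcript_pct / 100


# Low end: 45 episodes x 3 segments x 50% with transcripts.
low = expected_transcripts(45, 3, 50)   # 67.5
# High end: 45 episodes x 4 segments x 70% with transcripts.
high = expected_transcripts(45, 4, 70)  # 126.0
```

On these assumptions a season yields somewhere between about 68 and 126 transcripts.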
Pricing
Charged per transcript record scraped. Long-form interviews (5,000-30,000 words each) are priced at a modest premium reflecting their per-record research value versus wire-copy or short-form corpora.