Spotify Podcast Guest Extractor
Pricing
Pay per event
Extract a structured guest-history log from Spotify podcast episodes — episode metadata plus NLP-detected guest names, roles, and confidence scores. Uses your own Spotify Developer credentials; metadata-only, no audio.
Developer
DevilScrapes
Maintained by Community
🎯 What this scrapes
Spotify publishes every show's episode list at api.spotify.com/v1/shows/{id}/episodes. This Actor authenticates with your Spotify Developer credentials, pages through episodes for one or more shows, runs spaCy + regex over each episode description, and emits one row per episode × guest pair — name, inferred role (host / cohost / guest), and a confidence score.
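The paging step described above can be sketched as follows. This is an illustration, not the Actor's own code: `iter_episodes` and the injected `fetch_page` callable are hypothetical names, and the page size of 50 is an assumption about the endpoint's `limit` cap. The `limit`, `offset`, and `market` query parameters and the `items` / `next` response fields follow the Spotify Web API's paging convention.

```python
from typing import Callable, Iterator

PAGE_SIZE = 50  # assumed cap on `limit` for this endpoint


def iter_episodes(
    show_id: str,
    fetch_page: Callable[[str, dict], dict],  # hypothetical: performs the GET, returns parsed JSON
    max_episodes: int = 20,
    market: str = "US",
) -> Iterator[dict]:
    """Yield up to max_episodes episode objects for one show."""
    url = f"https://api.spotify.com/v1/shows/{show_id}/episodes"
    offset = 0
    yielded = 0
    while yielded < max_episodes:
        limit = min(PAGE_SIZE, max_episodes - yielded)
        page = fetch_page(url, {"limit": limit, "offset": offset, "market": market})
        items = page.get("items") or []
        for episode in items:
            if episode is None:
                continue  # skip placeholder entries for unavailable episodes
            yield episode
            yielded += 1
        # Stop when the API reports no further page or returns nothing.
        if not items or page.get("next") is None:
            break
        offset += len(items)
```

Injecting `fetch_page` keeps the paging logic separate from authentication and retries, so it can be exercised with a stub instead of live credentials.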
🔥 What we handle for you
- 🛡️ Browser fingerprint rotation — curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
- 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block.
- 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured.
- 🧱 Rate-limit-aware pacing — when the target pushes back, we slow down instead of getting banned.
- 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
- 💰 Pay-Per-Event pricing — you pay a small start fee per run plus a per-result charge for each item that reaches your dataset. No subscription.
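The retry behaviour in the list above might look roughly like this. Only the status codes, the 5-attempt limit, and the Retry-After handling come from the text. The function names, base delay, cap, and jitter are assumptions made for the sketch.

```python
import random
from typing import Optional

MAX_ATTEMPTS = 5          # from the text: up to 5 attempts per page
RETRYABLE = {408, 429}    # plus any 5xx, checked below


def is_retryable(status: int) -> bool:
    """True for the status codes the text lists as retried."""
    return status in RETRYABLE or 500 <= status <= 599


def backoff_delay(
    attempt: int,                     # 1-based retry number
    retry_after: Optional[str] = None,
    base: float = 1.0,                # assumed base delay in seconds
    cap: float = 60.0,                # assumed upper bound
) -> float:
    """Seconds to wait before the next attempt."""
    if retry_after is not None:
        try:
            # Honour a numeric Retry-After header when the server sends one.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    delay = min(cap, base * 2 ** (attempt - 1))
    # Add up to 50% jitter so parallel workers do not retry in lockstep.
    return delay + random.uniform(0, delay / 2)
```

A caller would loop up to `MAX_ATTEMPTS` times, sleeping for `backoff_delay(attempt, response_headers.get("Retry-After"))` whenever `is_retryable(status)` is true.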
💡 Use cases
- Podcast guest research — build a database of who has appeared on which podcasts to find cross-show patterns.
- Booking agencies — surface guests that match a target podcast's profile to pitch your client into similar shows.
- Media intel — track when a public figure (CEO, author, researcher) makes podcast appearances.
- Entity-graph AI training — feed (person, show, date) triples into a knowledge graph for downstream ML.
- Journalists — quickly trace a person's recent podcast circuit to map their messaging and audience reach.
⚙️ How to use it
- Click Try for free at the top of the page.
- Fill in the input form — most fields have sensible defaults.
- Click Start. Output streams into the run's dataset.
- Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
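For the "fetch via the API" option, a minimal sketch using only the standard library is shown below. It targets Apify's dataset items endpoint. The helper names are hypothetical, and the dataset ID and API token are values you would take from your own run and account.

```python
import json
import urllib.parse
import urllib.request


def build_items_request(dataset_id: str, token: str, fmt: str = "json") -> urllib.request.Request:
    """Build a GET request for a dataset's items in the given format (json, csv, xlsx, ...)."""
    query = urllib.parse.urlencode({"format": fmt, "clean": "true"})
    url = f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_items(dataset_id: str, token: str) -> list:
    """Download all items as parsed JSON (performs a network call)."""
    request = build_items_request(dataset_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Separating request construction from the network call makes the URL and headers easy to inspect before anything is sent.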
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| show_ids | array | no | ["2MAi0BvDc6GTFvKFPXnkCL"] | List of 22-character Spotify show IDs (the bit after /show/ in any open.spotify.com URL). Provide this OR `showSearchQ…` |
| showSearchQuery | string | no | (none) | Free-text query passed to /v1/search. The top-matching show is used. Ignored when showIds is set. |
| maxEpisodesPerShow | integer | no | 20 | Newest episodes first. Hard cap 200. |
| clientId | string | yes | (none) | From your Spotify Developer Dashboard application (https://developer.spotify.com/dashboard). Required. |
| clientSecret | string | yes | (none) | From the same Spotify Developer application. Required. Stored as an Apify Secret — never logged. |
| market | string | no | US | Country code used for availability filtering on Spotify episode endpoints. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Optional — Spotify Web API does not IP-block at normal volumes. |
Example input
{
  "show_ids": ["2MAi0BvDc6GTFvKFPXnkCL"],
  "maxEpisodesPerShow": 3,
  "clientId": "",
  "clientSecret": "",
  "market": "US",
  "proxyConfiguration": {"useApifyProxy": false}
}
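If you start from a show URL rather than a bare ID, the ID can be pulled out before filling in the input. The helpers below are hypothetical and not part of the Actor. The alphanumeric ID pattern is an assumption based on the 22-character description in the table, and clamping to the 200-episode cap is one possible way to respect the documented limit (the Actor may reject out-of-range values instead).

```python
import re
from urllib.parse import urlparse

SHOW_ID_RE = re.compile(r"^[0-9A-Za-z]{22}$")  # assumed ID alphabet
MAX_EPISODES_CAP = 200                          # hard cap from the input table


def extract_show_id(value: str) -> str:
    """Accept a bare show ID or an open.spotify.com show URL and return the ID."""
    candidate = value.strip()
    if "/show/" in candidate:
        # Take the path segment right after /show/, ignoring query strings.
        path = urlparse(candidate).path
        candidate = path.split("/show/", 1)[1].split("/", 1)[0]
    if not SHOW_ID_RE.match(candidate):
        raise ValueError(f"not a 22-character Spotify show ID: {value!r}")
    return candidate


def clamp_max_episodes(requested: int) -> int:
    """Keep the requested episode count within 1..200."""
    return max(1, min(requested, MAX_EPISODES_CAP))
```

For example, `extract_show_id("https://open.spotify.com/show/2MAi0BvDc6GTFvKFPXnkCL?si=abc")` returns `"2MAi0BvDc6GTFvKFPXnkCL"`.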
📤 Output
Every row is one dataset item.
| Field | Type | Notes |
|---|---|---|
| show_id | string | Spotify show ID. |
| show_name | string | Show name as returned by /v1/shows/{id}. |
| episode_id | string | Spotify episode ID. |
| episode_name | string | Episode title. |
| episode_release_date | string or null | ISO date (YYYY-MM-DD). |
| episode_duration_ms | integer or null | Episode length in milliseconds. |
| guest_name | string or null | NLP-extracted guest name. Null when no guest could be identified (host-only row preserves episode metadata). |
| guest_role | string | One of host, cohost, guest. |
| confidence | number | Heuristic 0.0-1.0 — regex matches 0.85-1.00, bare NER 0.55-0.75, host-fallback 0.0. |
| episode_url | string | Public episode URL on open.spotify.com. |
| scraped_at | string | ISO 8601 UTC timestamp at row creation. |
Example output
{
  "show_id": "2MAi0BvDc6GTFvKFPXnkCL",
  "show_name": "Lex Fridman Podcast",
  "episode_id": "5kF8w2Q9pNeLBxXxNH1mxJ",
  "episode_name": "#412 \u2014 Demis Hassabis: AGI and the Future of AI",
  "episode_release_date": "2026-04-30",
  "episode_duration_ms": 9384210,
  "guest_name": "Demis Hassabis",
  "guest_role": "guest",
  "confidence": 0.92,
  "episode_url": "https://open.spotify.com/episode/5kF8w2Q9pNeLBxXxNH1mxJ",
  "scraped_at": "2026-05-16T12:00:00Z"
}
💰 Pricing
Pay-Per-Event — you pay only when these events fire:
| Event | USD | What it is |
|---|---|---|
actor-start | $0.05 | One-off warm-up charge per run |
result | $0.005 | Per dataset item |
Example: a single run that produces 1,000 results costs $0.05 + 1,000 × $0.005 = $5.05 at the rates above. No subscription, no minimum, no card to start: Apify gives every new account $5 of free credit.
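The arithmetic can be checked with a few lines of Python. The rates are copied from the table above. `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in currency sums.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.05")   # charged once per run
RESULT_USD = Decimal("0.005")       # charged per dataset item


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD for a number of results spread over a number of runs."""
    return runs * ACTOR_START_USD + results * RESULT_USD


print(estimate_cost(1000))   # one run, 1,000 rows
```

At these rates a single run of 990 results comes to exactly $5.00, which is the largest single-run result count that fits within the $5 free credit.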
🚧 Limitations
Metadata-only: no audio, no transcripts. Guest extraction quality depends on episode description content — shows with sparse or non-English descriptions yield lower recall. Spotify catalog endpoints occasionally return 404 for regionally restricted episodes; these are logged and skipped.
❓ FAQ
Why do I need a Spotify client_id and client_secret?
Spotify's Web API requires OAuth 2.0 — even for public catalog data. Creating a free app on the Spotify Developer Dashboard takes 60 seconds: https://developer.spotify.com/dashboard. The Actor uses the client_credentials grant, which gives it read-only access to the public catalog and never touches any user account.
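For reference, the client_credentials exchange looks roughly like this with the standard library. It is a sketch of the documented Spotify token flow, not the Actor's code, and `build_token_request` is a hypothetical helper name.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the POST request that trades app credentials for an access token."""
    # Credentials travel as HTTP Basic auth: base64("client_id:client_secret").
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Sending the request with `urllib.request.urlopen` returns a JSON body containing `access_token`, `token_type`, and `expires_in`. The token is then passed as `Authorization: Bearer <access_token>` on catalog calls.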
Are episode transcripts included?
No. Spotify does not expose transcripts via the public Web API; downloading episode audio would violate Spotify's ToS. This Actor is metadata-only — episode title, description, release date, plus extracted guest names from the description.
How accurate is the guest extraction?
The Actor first sweeps regex patterns common in podcast show notes (with guest X, interview with X, #412 — X, etc.) at confidence 0.85-1.00. It then falls back to spaCy's en_core_web_sm PERSON entity recognizer at confidence 0.55-0.75. Each row carries its confidence score — filter on confidence >= 0.8 for high-precision use cases.
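To make the regex pass concrete, here is a simplified sketch. The patterns, the name heuristic, and the exact confidence values are illustrative assumptions chosen to fall inside the 0.85-1.00 band described above. They are not the Actor's actual rules, and the spaCy fallback is omitted.

```python
import re

# Assumed name heuristic: one capitalised token followed by one to three more.
NAME = r"([A-Z][\w'-]+(?: [A-Z][\w'-]+){1,3})"

# (pattern, confidence) pairs; both the patterns and the scores are hypothetical.
PATTERNS = [
    (re.compile(r"#\d+\s*[\u2013\u2014-]\s*" + NAME), 0.95),   # "#412 - Name"
    (re.compile(r"\b[Ii]nterview with " + NAME), 0.90),
    (re.compile(r"\b[Ww]ith guest " + NAME), 0.90),
]


def regex_guests(text: str) -> list:
    """Return (name, confidence) pairs found by the regex sweep, best first."""
    found = {}
    for pattern, confidence in PATTERNS:
        for match in pattern.finditer(text):
            name = match.group(1)
            # Keep the highest confidence if several patterns hit the same name.
            found[name] = max(confidence, found.get(name, 0.0))
    return sorted(found.items(), key=lambda pair: -pair[1])
```

Capitalisation-based matching is what makes this English-centric: it misses lower-cased names and initials such as "J.K.", which is one reason a statistical NER fallback is useful.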
What if a show is in a language other than English?
Regex patterns are English-only in v1. The spaCy NER fallback still surfaces PERSON entities but with reduced accuracy. Multi-language NLP is planned for v2 — open an issue with your language to upvote.
Why is one row sometimes emitted with guest_name: null?
If no guest can be extracted (e.g. solo episode or sparse description), the Actor still emits a single host-role row so the episode metadata is preserved in your dataset — useful for downstream joins. The confidence is 0.0 in this case.
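Downstream, the host-fallback rows and low-confidence matches are easy to filter out. The sketch below turns exported rows into (person, show, date) triples. `high_confidence_triples` is a hypothetical helper, and the 0.8 threshold simply mirrors the suggestion in the accuracy answer above.

```python
from typing import Iterable, List, Tuple


def high_confidence_triples(
    rows: Iterable[dict], min_confidence: float = 0.8
) -> List[Tuple[str, str, str]]:
    """Keep confident guest rows and reduce them to (guest, show, release date)."""
    triples = []
    for row in rows:
        if row.get("guest_name") is None:
            continue  # host-fallback row: episode metadata only
        if row.get("guest_role") != "guest":
            continue  # skip host and cohost rows
        if row["confidence"] < min_confidence:
            continue
        triples.append(
            (row["guest_name"], row["show_name"], row["episode_release_date"])
        )
    return triples
```

Keeping the null-guest rows in the raw dataset and filtering at read time means episode-level joins still see every episode.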
💬 Your feedback
Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.