Spotify Podcast Guest Extractor

Extract a structured guest-history log from Spotify podcast episodes — episode metadata plus NLP-detected guest names, roles, and confidence scores. Uses your own Spotify Developer credentials; metadata-only, no audio.

Developer: DevilScrapes (maintained by Community)

🎯 What this scrapes

Spotify publishes every show's episode list at api.spotify.com/v1/shows/{id}/episodes. This Actor authenticates with your Spotify Developer credentials, pages through episodes for one or more shows, runs spaCy + regex over each episode description, and emits one row per episode × guest pair — name, inferred role (host / cohost / guest), and a confidence score.
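
The auth-and-paging flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the Actor's actual code: the helper names (`fetch_token`, `http_get_json`, `iter_episodes`) are hypothetical, and the `get_json` parameter exists only so the paging logic can be exercised without network access.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

TOKEN_URL = "https://accounts.spotify.com/api/token"
API_BASE = "https://api.spotify.com/v1"
PAGE_SIZE = 50  # largest page the episodes endpoint accepts


def fetch_token(client_id: str, client_secret: str) -> str:
    """Exchange app credentials for a bearer token (client_credentials grant)."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    request = urllib.request.Request(
        TOKEN_URL, data=body, headers={"Authorization": f"Basic {basic}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]


def http_get_json(url: str, token: str) -> dict:
    """GET a Web API URL with the bearer token and decode the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_episodes(
    show_id: str,
    token: str,
    max_episodes: int = 20,
    market: str = "US",
    get_json: Callable[[str, str], dict] = http_get_json,  # hypothetical test seam
) -> Iterator[dict]:
    """Yield up to max_episodes episode objects, following the `next` links."""
    query = urllib.parse.urlencode(
        {"market": market, "limit": min(PAGE_SIZE, max_episodes), "offset": 0}
    )
    url = f"{API_BASE}/shows/{show_id}/episodes?{query}"
    yielded = 0
    while url and yielded < max_episodes:
        page = get_json(url, token)
        for episode in page.get("items", []):
            if yielded >= max_episodes:
                return
            yield episode
            yielded += 1
        url = page.get("next")  # None on the last page
```

Each yielded episode object carries the description text that the guest extraction step then works on.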

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
  • 🌐 Optional residential proxy rotation via Apify Proxy (off by default, see proxyConfiguration): fresh session and exit IP on every block.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured.
  • 🧱 Rate-limit-aware pacing — when the target pushes back, we slow down instead of getting banned.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing — you only pay for results that hit your dataset. No data, no charge.
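
For readers curious what the retry bullet amounts to, here is a small sketch of exponential backoff that honours Retry-After. It is an illustration under stated assumptions, not the Actor's implementation: the function names, the 1-second base delay, the 60-second cap, and the `(status, headers, body)` response shape are all hypothetical, and only the seconds form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def is_retryable(status: int) -> bool:
    """408, 429 and any 5xx are worth another attempt."""
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, headers: Mapping[str, str],
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Prefer the server's Retry-After (seconds); otherwise double each time."""
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        return float(retry_after)
    return min(cap, base * 2 ** attempt)


def send_with_retries(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() up to max_attempts times, sleeping between retryable failures."""
    response = send()
    for attempt in range(max_attempts - 1):
        if not is_retryable(response[0]):
            break
        sleep(backoff_delay(attempt, response[1]))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the sketch testable; a production version would typically also add jitter to the computed delay.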

💡 Use cases

  • Podcast guest research — build a database of who has appeared on which podcasts to find cross-show patterns.
  • Booking agencies — surface guests that match a target podcast's profile to pitch your client into similar shows.
  • Media intel — track when a public figure (CEO, author, researcher) makes podcast appearances.
  • Entity-graph AI training — feed (person, show, date) triples into a knowledge graph for downstream ML.
  • Journalists — quickly trace a person's recent podcast circuit to map their messaging and audience reach.

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — most fields have sensible defaults.
  3. Click Start. Output streams into the run's dataset.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
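
If you prefer to fetch results programmatically, step 4 can be sketched with the Apify API's dataset items endpoint. The helper names below are hypothetical, the dataset ID comes from your own run, and the format list is limited to the three exports mentioned above.

```python
import json
import urllib.parse
import urllib.request

ALLOWED_FORMATS = {"json", "csv", "xlsx"}  # the exports named in step 4


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the Apify API URL that returns a dataset's items."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    safe_id = urllib.parse.quote(dataset_id, safe="")
    query = urllib.parse.urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{safe_id}/items?{query}"


def fetch_items(dataset_id: str, api_token: str) -> list:
    """Download all rows of a dataset as a list of dicts (needs network access)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id, "json"),
        headers={"Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

`fetch_items` expects your Apify API token, which is separate from the Spotify credentials in the input form.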

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| show_ids | array | no | ["2MAi0BvDc6GTFvKFPXnkCL"] | List of 22-character Spotify show IDs (the bit after /show/ in any open.spotify.com URL). Provide this OR `showSearchQ |
| showSearchQuery | string | no | (none) | Free-text query passed to /v1/search. The top-matching show is used. Ignored when show_ids is set. |
| maxEpisodesPerShow | integer | no | 20 | Newest episodes first. Hard cap 200. |
| clientId | string | yes | (none) | From your Spotify Developer Dashboard application (https://developer.spotify.com/dashboard). Required. |
| clientSecret | string | yes | (none) | From the same Spotify Developer application. Required. Stored as an Apify Secret and never logged. |
| market | string | no | "US" | Country code used for availability filtering on Spotify episode endpoints. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Optional. The Spotify Web API does not IP-block at normal volumes. |

Example input

{
  "show_ids": [
    "2MAi0BvDc6GTFvKFPXnkCL"
  ],
  "maxEpisodesPerShow": 3,
  "clientId": "",
  "clientSecret": "",
  "market": "US",
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
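
If you only have a show's URL, the 22-character ID for show_ids can be pulled out with a few lines of Python. This is a sketch, not part of the Actor: `show_id_from` is a hypothetical helper, and it assumes show IDs are 22 alphanumeric characters as described in the input table.

```python
import re
from urllib.parse import urlparse

SHOW_ID_RE = re.compile(r"^[0-9A-Za-z]{22}$")


def show_id_from(value: str) -> str:
    """Accept a bare show ID or an open.spotify.com show URL; return the ID."""
    value = value.strip()
    if SHOW_ID_RE.match(value):
        return value
    parsed = urlparse(value)
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.hostname == "open.spotify.com" and "show" in parts:
        index = parts.index("show")
        # The ID is the path segment right after /show/.
        if index + 1 < len(parts) and SHOW_ID_RE.match(parts[index + 1]):
            return parts[index + 1]
    raise ValueError(f"not a Spotify show ID or show URL: {value!r}")
```

Query strings such as `?si=...` are ignored because only the URL path is inspected.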

📤 Output

Every row is one dataset item.

| Field | Type | Notes |
| --- | --- | --- |
| show_id | string | Spotify show ID. |
| show_name | string | Show name as returned by /v1/shows/{id}. |
| episode_id | string | Spotify episode ID. |
| episode_name | string | Episode title. |
| episode_release_date | string or null | ISO date (YYYY-MM-DD). |
| episode_duration_ms | integer or null | Episode length in milliseconds. |
| guest_name | string or null | NLP-extracted guest name. Null when no guest could be identified (host-only row preserves episode metadata). |
| guest_role | string | One of host, cohost, guest. |
| confidence | number | Heuristic 0.0-1.0: regex matches 0.85-1.00, bare NER 0.55-0.75, host-fallback 0.0. |
| episode_url | string | Public episode URL on open.spotify.com. |
| scraped_at | string | ISO 8601 UTC timestamp at row creation. |

Example output

{
  "show_id": "2MAi0BvDc6GTFvKFPXnkCL",
  "show_name": "Lex Fridman Podcast",
  "episode_id": "5kF8w2Q9pNeLBxXxNH1mxJ",
  "episode_name": "#412 \u2014 Demis Hassabis: AGI and the Future of AI",
  "episode_release_date": "2026-04-30",
  "episode_duration_ms": 9384210,
  "guest_name": "Demis Hassabis",
  "guest_role": "guest",
  "confidence": 0.92,
  "episode_url": "https://open.spotify.com/episode/5kF8w2Q9pNeLBxXxNH1mxJ",
  "scraped_at": "2026-05-16T12:00:00Z"
}

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.05 | One-off warm-up charge per run |
| result | $0.005 | Per dataset item |

Example: 1 000 results at the rates above ≈ $5.05. No subscription, no minimum, no card to start — Apify gives every new account $5 of free credit.
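
To budget a larger job, the arithmetic behind that example can be written out as a tiny estimator. The function name is hypothetical and the rates are simply copied from the table above, so adjust them if the listed prices change.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.05")  # charged once per run
RESULT_USD = Decimal("0.005")      # charged per dataset item


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for a number of dataset items spread over some runs."""
    if results < 0 or runs < 0:
        raise ValueError("results and runs must be non-negative")
    return ACTOR_START_USD * runs + RESULT_USD * results
```

`estimate_cost(1000)` reproduces the $5.05 figure: $0.05 for the start event plus 1,000 × $0.005 for the results.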

🚧 Limitations

Metadata-only: no audio, no transcripts. Guest extraction quality depends on episode description content — shows with sparse or non-English descriptions yield lower recall. Spotify catalog endpoints occasionally return 404 for regionally restricted episodes; these are logged and skipped.

❓ FAQ

Why do I need a Spotify client_id and client_secret?

Spotify's Web API requires OAuth 2.0 — even for public catalog data. Creating a free app on the Spotify Developer Dashboard takes 60 seconds: https://developer.spotify.com/dashboard. The Actor uses the client_credentials grant, which gives it read-only access to the public catalog and never touches any user account.

Are episode transcripts included?

No. Spotify does not expose transcripts via the public Web API; downloading episode audio would violate Spotify's ToS. This Actor is metadata-only — episode title, description, release date, plus extracted guest names from the description.

How accurate is the guest extraction?

The Actor first sweeps regex patterns common in podcast show notes (with guest X, interview with X, #412 — X, etc.) at confidence 0.85-1.00. It then falls back to spaCy's en_core_web_sm PERSON entity recognizer at confidence 0.55-0.75. Each row carries its confidence score — filter on confidence >= 0.8 for high-precision use cases.
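
As a rough illustration of the regex sweep, the sketch below matches the three pattern families named above. The exact expressions, the per-pattern confidence values, and the `extract_guests` name are assumptions chosen for this example; the Actor's real patterns are not published here, and the spaCy fallback is left out so the snippet runs with the standard library alone.

```python
import re

# Two or three capitalised words, e.g. "Demis Hassabis" (an assumed name shape).
NAME = r"[A-Z][A-Za-z'\-]+(?: [A-Z][A-Za-z'\-]+){1,2}"

# (pattern, confidence) pairs; the confidence values are illustrative only.
PATTERNS = [
    (re.compile(rf"\bwith guest ({NAME})"), 0.95),
    (re.compile(rf"\b[Ii]nterview with ({NAME})"), 0.90),
    # Numbered title such as "#412 <dash> Name: topic"; \u2014/\u2013 are em/en dashes.
    (re.compile(rf"^#\d+\s*[\u2014\u2013:-]\s*({NAME})\s*:"), 0.85),
]


def extract_guests(title: str, description: str) -> list:
    """Return (name, confidence) pairs, best confidence first."""
    text = f"{title}\n{description}"
    found = {}
    for pattern, confidence in PATTERNS:
        for match in pattern.finditer(text):
            name = match.group(1)
            # Keep the highest confidence seen for each name.
            found[name] = max(confidence, found.get(name, 0.0))
    return sorted(found.items(), key=lambda item: -item[1])
```

A name matched by several patterns keeps its highest score, which is one simple way to produce a single confidence per guest.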

What if a show is in a language other than English?

Regex patterns are English-only in v1. The spaCy NER fallback still surfaces PERSON entities but with reduced accuracy. Multi-language NLP is planned for v2 — open an issue with your language to upvote.

Why is one row sometimes emitted with guest_name: null?

If no guest can be extracted (e.g. solo episode or sparse description), the Actor still emits a single host-role row so the episode metadata is preserved in your dataset — useful for downstream joins. The confidence is 0.0 in this case.
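
When post-processing the dataset, the host-fallback rows and low-confidence rows are easy to separate out. The sketch below assumes rows shaped like the example output and uses a hypothetical `guest_triples` helper to build (person, show, date) triples from high-confidence guests only.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, Optional[str]]  # (guest_name, show_name, release_date)


def guest_triples(rows: Iterable[dict], min_confidence: float = 0.8) -> List[Triple]:
    """Keep rows with a named guest at or above the confidence threshold."""
    triples = []
    for row in rows:
        if row.get("guest_name") is None:
            continue  # host-fallback row: episode metadata only
        if row.get("confidence", 0.0) < min_confidence:
            continue
        triples.append(
            (row["guest_name"], row["show_name"], row.get("episode_release_date"))
        )
    return triples
```

The skipped host-fallback rows are still useful on their own, for example to count how many episodes had no detectable guest.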

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.