Speech-to-Text Transcription

Pricing

Pay per event

Transcribe audio and video from YouTube, TikTok, podcasts, X, and 1,000+ other sites, or from any direct media URL, into accurate, speaker-labeled text. Uses Deepgram speech-to-text models with automatic language detection, multilingual support, and smart formatting.

Rating: 5.0 (3)

Developer: Harish Garg

Maintained by Community

Actor stats: 1 bookmarked, 69 total users, 52 monthly active users, last modified 3 hours ago

What does Speech-to-Text Transcription do?

Speech-to-Text Transcription converts audio from YouTube, podcasts, social video platforms, or any direct media URL into accurate, formatted text using Deepgram's Nova-3 speech recognition model. The Actor pulls the audio, sends it to Deepgram's API, and returns a full transcript with speaker diarization (identifying who said what) and smart formatting (dates, numbers, punctuation). It auto-detects the spoken language across dozens of languages, so non-English audio works out of the box. When the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large to recover difficult recordings.

Powered by the Apify platform, you get API access, scheduling, webhook integrations, and seamless data export — all without managing infrastructure.

YouTube and TikTok downloads are handled by dedicated downloaders. Because their anti-bot and login walls make direct extraction unreliable, YouTube links are downloaded through the maintained streamers/youtube-video-downloader Actor and TikTok links through clockworks/tiktok-scraper, for far better reliability. This affects pricing for those sources — see Pricing.

Why use Speech-to-Text Transcription?

  • Content repurposing — Turn video podcasts, lectures, and interviews into blog posts, articles, or documentation
  • Accessibility — Generate transcripts for hearing-impaired audiences or multilingual translation workflows
  • Research & analysis — Search, index, and analyze spoken content at scale across multiple sources
  • SEO — Create text versions of your video content to improve search engine discoverability
  • Compliance — Maintain text records of meetings, webinars, and public broadcasts
  • Pipeline transcription step — Already scraping videos with your own Actor or tool? Chain this Actor downstream and feed it the direct media URL to add transcripts to whatever you collect (see Use as a transcription step in your scraping pipeline)

Supported sources at a glance

Not sure if your source will work? Find it below — it tells you which input field to use and how to pass it for the most reliable result.

| Source | Use this field | How to pass it for best results |
| --- | --- | --- |
| YouTube | videoUrl | Paste the watch URL. Downloading is delegated to a maintained downloader for reliability; platform metadata is limited and the per-video fee is waived (see Pricing). |
| TikTok | videoUrl | Paste the video URL. Delegated downloader, but full metadata (title, channel, views) is still returned. |
| Vimeo, SoundCloud, X/Twitter, podcast RSS episodes, and 1000+ other yt-dlp sites | videoUrl | Paste the public page/episode URL. Must be public: login/age-walled posts can't be fetched. |
| Facebook / Instagram | videoUrl (the page URL) | Prefer the facebook.com / instagram.com post URL so the Actor resolves a fresh stream itself. A raw fbcdn.net / cdninstagram.com link works only if it's the progressive .mp4 (not a fragmented _dashinit.mp4 / DASH) variant and hasn't expired. |
| Your own hosted files (S3, Google Drive, Dropbox, your CDN, Apify Key-Value Store) | mediaUrl | Pass a direct public link straight to the file (mp3, mp4, wav, m4a, flac, ogg, webm, mov, mkv). Video files have their audio extracted automatically. |
| Any other direct media URL | mediaUrl | If the link points straight at an audio/video file and is publicly reachable, it works. If it serves an HTML page, it won't. |

Rule of thumb: if there's a public web page for it, use videoUrl. If you already hold a direct link to the media file itself, use mediaUrl. Provide exactly one.
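
The rule of thumb above can be sketched as a small helper. This is only an illustration: `build_run_input` and the extension check are assumptions of this sketch, not part of the Actor, and a direct link without a file extension (for example some signed CDN URLs) would need to be routed to mediaUrl by hand.

```python
from urllib.parse import urlparse

# Extensions listed for direct files in the supported-sources table.
MEDIA_EXTENSIONS = (".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mov", ".mkv")


def build_run_input(url: str) -> dict:
    """Return an Actor input with exactly one of videoUrl / mediaUrl set.

    Heuristic (an assumption, not the Actor's own logic): a URL whose path
    ends in a known media extension is treated as a direct file link.
    """
    path = urlparse(url).path.lower()
    if path.endswith(MEDIA_EXTENSIONS):
        return {"mediaUrl": url}
    return {"videoUrl": url}
```

A watch-page URL ends up in videoUrl, while a link such as https://example.com/audio/episode.mp3 ends up in mediaUrl.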

How to transcribe YouTube videos, podcasts, and audio files

  1. Open the Actor — Click "Try for free" on the Actor's page
  2. Choose one input source — Paste a platform URL (YouTube, Vimeo, podcast RSS episode, TikTok, SoundCloud, X, etc.) or a direct media file URL
  3. Configure options (optional) — Choose the transcription model, enable/disable speaker diarization, set a specific language, or adjust the maximum audio length
  4. Run the Actor — Click "Start" and wait for the transcription to complete
  5. Download results — Get your transcript from the Key-Value Store or Dataset tabs

No external API keys required — Deepgram transcription is included in the per-minute price.

Run it via API or on a schedule

Most production usage drives this Actor through the Apify API, not the Console — it's built for that. Start a run with a single POST:

curl -X POST "https://api.apify.com/v2/acts/hgservices~speech-to-text/runs?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"videoUrl": "https://www.youtube.com/watch?v=f60dheI4ARg"}'

Or from the Apify client (the Actor slug is hgservices/speech-to-text):

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")  # your Apify API token
run = client.actor("hgservices/speech-to-text").call(
    run_input={"videoUrl": "https://www.youtube.com/watch?v=..."}
)

To run it unattended, point a Schedule at it, or fire it from your own scraper's success event with webhooks and feed it a direct mediaUrl — see Use as a transcription step in your scraping pipeline for that pattern.
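
For scripted use, a run is usually followed by reading the summary record back from the run's default Dataset. The sketch below assumes `client` is an `apify_client.ApifyClient` instance; `run_and_fetch_summary` is a hypothetical helper name, and error handling is kept minimal.

```python
def run_and_fetch_summary(client, run_input: dict) -> dict:
    """Run the Actor synchronously and return its first Dataset record.

    `client` is expected to behave like apify_client.ApifyClient.
    Returns an empty dict if the run produced no record.
    """
    run = client.actor("hgservices/speech-to-text").call(run_input=run_input)
    if not run:
        return {}
    # Every run has a default Dataset; the Actor pushes its summary there.
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    return items[0] if items else {}
```

With a real client this would be called as `run_and_fetch_summary(ApifyClient("<APIFY_TOKEN>"), {"videoUrl": "https://www.youtube.com/watch?v=..."})`.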

Input

Provide exactly one of the two input sources below. All other fields are optional.

| Field | Type | Description |
| --- | --- | --- |
| videoUrl | string | URL on YouTube, Vimeo, TikTok, SoundCloud, X, an RSS podcast episode, or any of the 1000+ sites supported by yt-dlp |
| mediaUrl | string | Direct public HTTP(S) URL to an audio or video file (mp3, mp4, wav, m4a, flac, ogg, webm, mov, mkv, …) |
| maxAudioMinutes | integer | Maximum source length in minutes. Inputs longer than this fail before any cost is incurred. Default: 240 (4 h). Max: 600 (10 h) |
| model | enum | Deepgram model: nova-3 (default), nova-2, whisper-large, whisper-medium, whisper-base |
| language | string | Language code (e.g., en, es, fr); leave empty for auto-detection |
| diarize | boolean | Enable speaker diarization (default: true) |
| smartFormat | boolean | Apply smart formatting to transcript (default: true) |

For mediaUrl, both audio-only files and video files are supported — the Actor extracts the audio track from video automatically before transcription.

Common mediaUrl uses:

  • Transcribe a video from X/Twitter — pass the direct https://video.twimg.com/... file URL
  • Transcribe a Facebook video — pass the direct https://video.*.fbcdn.net/... file URL (use the progressive .mp4 variant, not a fragmented/DASH one)
  • Transcribe an Instagram Reel or post video — pass the underlying https://...cdninstagram.com/... video URL
  • Transcribe your own hosted files — an S3 presigned URL, a public Google Drive / Dropbox direct link, or an Apify Key-Value Store record URL (handy for batch pipelines that store media before transcribing)
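
Before starting a run from code, it can help to check the input against the rules in this section. The following is a minimal client-side sketch; `validate_input` is a hypothetical helper, the lower bound of 1 minute is an assumption, and the Actor performs its own validation regardless.

```python
ALLOWED_MODELS = {"nova-3", "nova-2", "whisper-large", "whisper-medium", "whisper-base"}
MAX_AUDIO_MINUTES_LIMIT = 600  # documented maximum (10 h)


def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    sources = [key for key in ("videoUrl", "mediaUrl") if run_input.get(key)]
    if len(sources) == 0:
        problems.append("Missing input: set videoUrl or mediaUrl")
    elif len(sources) == 2:
        problems.append("Conflicting input: set only one of videoUrl and mediaUrl")
    model = run_input.get("model", "nova-3")  # nova-3 is the documented default
    if model not in ALLOWED_MODELS:
        problems.append(f"Unknown model: {model}")
    minutes = run_input.get("maxAudioMinutes", 240)  # 240 is the documented default
    # The lower bound of 1 is an assumption of this sketch.
    if not isinstance(minutes, int) or not 1 <= minutes <= MAX_AUDIO_MINUTES_LIMIT:
        problems.append("maxAudioMinutes must be an integer between 1 and 600")
    return problems
```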

Output

The Actor stores two files in the Key-Value Store:

  • transcript.txt — Formatted transcript with speaker labels and paragraphs
  • transcript.json — Raw Deepgram API response with full metadata

And pushes a summary to the Dataset (visible in the Output tab and via the dataset API):

{
  "sourceType": "platform",
  "videoUrl": "https://www.youtube.com/watch?v=...",
  "videoId": "dQw4w9WgXcQ",
  "title": "Example Video Title",
  "channel": "Example Channel",
  "channelUrl": "https://www.youtube.com/@example",
  "uploadDate": "2024-09-12",
  "viewCount": 1234567,
  "thumbnail": "https://i.ytimg.com/vi/.../maxresdefault.jpg",
  "model": "nova-3",
  "language": "en",
  "diarize": true,
  "durationSeconds": 342.5,
  "transcript": "[Speaker 0] ...",
  "transcriptLength": 4521,
  "speakerCount": 3
}

sourceType is "platform" for videoUrl and "directUrl" for mediaUrl. Platform-specific fields (channel, viewCount, thumbnail, …) are null for direct media URL sources. For YouTube sources these fields are also null (downloading is delegated), and title is recovered on a best-effort basis. TikTok is delegated too but still returns full metadata (title, channel, viewCount, thumbnail, uploadDate).

Example transcript output:

[Speaker 0] Welcome everyone to today's discussion about artificial intelligence.
[Speaker 1] Thank you for having me. I think the most exciting development is in natural language processing.
[Speaker 0] I completely agree. The advances in transformer models have been remarkable.

You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
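
If you need per-speaker text rather than one string, the [Speaker N] labels shown in the example transcript can be split into turns. This sketch assumes each turn starts on its own line exactly as in that example; `parse_turns` is a hypothetical helper, and real transcripts with paragraph breaks may need adjusting.

```python
import re

# Matches lines such as "[Speaker 0] Welcome everyone ..."
SPEAKER_LINE = re.compile(r"^\[Speaker (\d+)\]\s*(.*)$")


def parse_turns(transcript: str) -> list:
    """Split a speaker-labeled transcript into (speaker_id, text) tuples."""
    turns = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        match = SPEAKER_LINE.match(line)
        if match:
            turns.append((int(match.group(1)), match.group(2)))
        elif turns and line:
            # Assumption: an unlabeled line continues the previous speaker's turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
    return turns
```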

Output data fields

| Field | Type | Description |
| --- | --- | --- |
| sourceType | string | "platform" (yt-dlp source) or "directUrl" |
| videoUrl | string | Original source URL (platform URL for videoUrl, direct URL for mediaUrl) |
| videoId | string | Platform-specific video identifier (null for direct URL sources) |
| title | string | Video or episode title (null for non-platform sources) |
| channel | string | Channel or creator name (null for non-platform sources) |
| channelUrl | string | URL to the creator's channel (null for non-platform sources) |
| uploadDate | string | ISO date the source was published (null for non-platform sources) |
| viewCount | integer | View count at time of transcription (null for non-platform sources) |
| thumbnail | string | URL to video thumbnail image (null for non-platform sources) |
| model | string | Deepgram model used for transcription |
| language | string | Language code (auto-detected when not specified) |
| diarize | boolean | Whether speaker diarization was applied |
| durationSeconds | number | Audio duration in seconds |
| transcript | string | Full formatted transcript with speaker labels |
| transcriptLength | integer | Character count of the transcript |
| speakerCount | integer | Number of distinct speakers detected (1 if diarization disabled) |

Use as a transcription step in your scraping pipeline

If you already scrape videos — with your own Apify Actor, a third-party scraper, or any custom tool — you don't need this Actor to find the media. Hand it the direct media URL via mediaUrl and it becomes the transcription stage at the end of your pipeline. This works for any platform's direct CDN URL (Facebook fbcdn.net, Instagram cdninstagram.com, X video.twimg.com, your own S3/CDN, …) as long as the link points straight at an audio or video file.

Chaining from another Apify Actor. Run your scraper first, then call this Actor with the URL it produced. From the Apify API or SDK:

// POST https://api.apify.com/v2/acts/hgservices~speech-to-text/runs?token=<APIFY_TOKEN>
{
  "mediaUrl": "https://video.fhel-1.fna.fbcdn.net/o1/v/.../video.mp4?...",
  "language": "" // leave empty to auto-detect
}

Or with the Apify client (the Actor slug is hgservices/speech-to-text):

from apify_client import ApifyClient
client = ApifyClient("<APIFY_TOKEN>")  # your Apify API token
run_input = {"mediaUrl": "https://video.fhel-1.fna.fbcdn.net/o1/v/.../video.mp4?..."}
run = client.actor("hgservices/speech-to-text").call(run_input=run_input)

You can wire this together without code using Apify task chaining / webhooks — fire this Actor on your scraper's SUCCEEDED event and map the scraped URL into mediaUrl.

Operational caveats for signed CDN URLs. Links from sites like Facebook and Instagram are signed and expire within hours, and are often bound to the IP/session that generated them. To keep delivery reliable:

  • Transcribe promptly — pass the URL to this Actor right after you scrape it; a stale link returns a 403 and the run fails.
  • Prefer the progressive .mp4 variant over fragmented/DASH (_dashinit.mp4) URLs — audio extraction can fail on fragmented-init segments.
  • Watch IP binding — if your scraper fetched the URL from one IP, fetching it here from a different one (including via proxy) may be rejected; pass a freshly minted, broadly fetchable URL.
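
When a scraper returns several candidate URLs for the same video, the second caveat above can be applied in code before calling this Actor. A minimal sketch, assuming the fragmented variants are recognizable by the _dashinit.mp4 suffix mentioned above; `pick_progressive_url` is a hypothetical helper.

```python
from typing import List, Optional


def pick_progressive_url(candidates: List[str]) -> Optional[str]:
    """Return the first .mp4 candidate that is not a fragmented/DASH init segment."""
    for url in candidates:
        # Ignore the signed query string when checking the file name.
        path = url.split("?", 1)[0].lower()
        if path.endswith(".mp4") and not path.endswith("_dashinit.mp4"):
            return url
    return None
```

If this returns None, passing the original post URL through videoUrl is the safer route.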

The Dataset summary for these runs has sourceType: "directUrl" and platform metadata fields (channel, viewCount, thumbnail, …) set to null, since there's no platform page to scrape that from — you already hold that context upstream.

Languages and choosing a model

The Actor handles non-English audio automatically — leave the language field empty and Deepgram detects the spoken language for you. If you already know the language, set it explicitly (e.g. es, fr, de, it, pt-BR, hi, ja, zh) for a small accuracy boost. Set language to multi for Nova-3 multilingual code-switching when a single recording mixes languages.

Choosing a model:

| Model | Best for |
| --- | --- |
| nova-3 (default) | Fast, high-accuracy English and major-language speech with the strongest punctuation and diarization |
| whisper-large | Music-adjacent audio, heavy accents, or less-common languages (the widest language coverage) |
| nova-2, whisper-medium, whisper-base | Benchmarking or lighter-weight alternatives |

You usually don't need to think about this: when the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large before giving up, so hard or non-English audio is recovered without you changing any settings. If a transcript still comes back empty after that, the audio almost certainly contains no transcribable speech (it's music, singing, or silence).
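
The retry behavior described above can be pictured roughly as follows. This is a sketch of the described logic, not the Actor's actual code: `transcribe` is a hypothetical callable standing in for the speech-to-text request, and the function name is made up for illustration.

```python
from typing import Callable, Optional, Tuple


def transcribe_with_fallback(
    transcribe: Callable[[str, Optional[str]], str],
    language: Optional[str] = None,
    default_model: str = "nova-3",
    fallback_model: str = "whisper-large",
) -> Tuple[str, str]:
    """Return (transcript, model_used), retrying once on an empty result.

    `transcribe(model, language)` is a hypothetical stand-in for the real
    transcription request.
    """
    text = transcribe(default_model, language)
    # Retry only for auto-detected audio (no language forced), and only once.
    if not text.strip() and not language and default_model != fallback_model:
        return transcribe(fallback_model, language), fallback_model
    return text, default_model
```

If the fallback also returns an empty string, the caller sees an empty transcript, which matches the music, singing, or silence case described above.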

Pricing / Cost estimation

This Actor uses pay-per-event pricing — you only pay for what you use, with no Apify compute units, proxy bandwidth, or Deepgram fees to track separately. Everything is bundled in:

| Event | Price (USD) | When charged |
| --- | --- | --- |
| Video processed | $0.02 | Once per run after the Actor processes your input. Also charged when a submitted post contains no media to transcribe (e.g. an Instagram photo/carousel), because the download attempt still runs. Not charged for YouTube or TikTok URLs (see below) |
| Minute of audio transcribed | $0.015 | Per minute of audio (rounded up), only after transcription succeeds |

YouTube and TikTok sources: the $0.02 per-video fee is waived, but downloading runs through a dedicated downloader Actor (streamers/youtube-video-downloader for YouTube, clockworks/tiktok-scraper for TikTok). That Actor bills your Apify account directly at its own pay-per-event rate (a fraction of a cent, up to roughly $0.0017 for a typical short video), in addition to the per-minute transcription cost above. Net effect: these transcriptions usually cost about the same as the estimates below, with the small download fee in place of the $0.02 processing fee.

Example costs (non-delegated sources):

| Video length | Total cost |
| --- | --- |
| 5 min | ~$0.10 |
| 30 min | ~$0.47 |
| 1 h | ~$0.92 |
| 2 h | ~$1.82 |
| 4 h (default cap) | ~$3.62 |

The default maxAudioMinutes cap of 240 minutes (4 hours) protects you from accidentally transcribing a marathon livestream. You can raise it up to 600 minutes (10 hours) if you need to, or lower it for tighter cost control. Videos that exceed the cap fail before any charge.
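
The example costs follow directly from the two event prices: $0.02 per run plus $0.015 per started minute. A small estimator, assuming a non-delegated source and a successful transcription (`estimate_cost_usd` is a name chosen for this sketch; the downloader fee for YouTube/TikTok is not modeled):

```python
import math

PER_VIDEO_MILLS = 20   # $0.02 "Video processed" event, in tenths of a cent
PER_MINUTE_MILLS = 15  # $0.015 per started minute of audio


def estimate_cost_usd(duration_seconds: float, per_video_fee: bool = True) -> float:
    """Estimate the charge for one successful run.

    Set per_video_fee=False for YouTube/TikTok, where the $0.02 fee is waived
    (the separate downloader charge is not included here).
    """
    minutes = math.ceil(duration_seconds / 60)  # billed minutes are rounded up
    mills = minutes * PER_MINUTE_MILLS + (PER_VIDEO_MILLS if per_video_fee else 0)
    return mills / 1000
```

For a 30-minute file this gives 0.02 + 30 × 0.015 = $0.47, matching the table.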

Trying it for free — New Apify accounts include free monthly platform credits, which are enough to transcribe several hours of audio before any out-of-pocket cost.

Tips

  • Set maxAudioMinutes lower if you're running this on a schedule or via API to cap worst-case cost per run
  • Shorter videos process faster — Videos under 30 minutes typically transcribe in under a minute
  • Stick with Nova-3 for most audio — It's Deepgram's most advanced model with superior punctuation and speaker detection. For music-heavy or less-common-language audio, pick whisper-large (though the Actor also falls back to it automatically on an empty result)
  • Set the language explicitly if you know it — this improves accuracy for non-English content; leave it empty to auto-detect
  • Disable diarization for single-speaker videos to slightly reduce transcription time

Troubleshooting: common errors and what to do

First, where to look. A run that can't produce a transcript still finishes as Succeeded (so it never counts as a platform failure) — the reason is written to the run's status message and to the statusMessage / transcriptStatus fields of the Dataset record, with an empty transcript. So if you got no text, open the run and read the status message: it names the cause and the fix. The most common ones are below.
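
Because a run without a transcript still finishes as Succeeded, automated pipelines should check the Dataset record rather than the run status. A minimal sketch, assuming the record carries the statusMessage / transcriptStatus fields named above; `transcript_or_reason` and its fallback text are this sketch's own.

```python
from typing import Tuple


def transcript_or_reason(item: dict) -> Tuple[bool, str]:
    """Return (True, transcript) or (False, reason) for one Dataset record."""
    transcript = (item.get("transcript") or "").strip()
    if transcript:
        return True, transcript
    # statusMessage / transcriptStatus are the fields named above; the final
    # fallback string is this sketch's own wording.
    reason = (
        item.get("statusMessage")
        or item.get("transcriptStatus")
        or "no status message recorded"
    )
    return False, reason
```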

Source and download problems (mostly mediaUrl)

| Message you'll see | What it means | What to do |
| --- | --- | --- |
| "Facebook/Instagram returned HTTP 403 for this CDN link" | The raw fbcdn.net / cdninstagram.com link expired or is bound to the IP/session that opened the post. | Paste the original facebook.com / instagram.com page URL into videoUrl instead of the raw CDN link; the Actor resolves a fresh stream itself. |
| "Couldn't extract audio from this Facebook/Instagram CDN file" | The link is a fragmented (DASH) stream that carries only part of the media, with no complete audio track. | Same fix: use the page URL in videoUrl, or pass the progressive .mp4 variant rather than a _dashinit.mp4 one. |
| "mediaUrl returned HTTP 403 / 404" | The link is wrong, expired, private, or not publicly reachable from the server. | Open it in a private browser window; if it doesn't play, get a fresh, public, direct file URL and re-submit promptly. |
| "mediaUrl did not return a media file…" | The URL served an HTML page (a login, expired-link, or block page) instead of a file. | Make sure the link points straight at the file, not a webpage, and that it's still live and public. |
| "Downloaded file is not a valid audio or video file" | The download was corrupt or truncated (sometimes a proxy reset mid-stream). | Re-fetch a fresh URL and retry; confirm the link points to a real media file. |
| "Could not download mediaUrl after N attempts" | Repeated network/proxy resets, or the host is down. | Retry shortly; if it persists the host may be blocking automated fetches. |

Platform-page problems (videoUrl)

| Message you'll see | What it means | What to do |
| --- | --- | --- |
| "YouTube blocked this request as automated traffic" | YouTube's bot wall. | Enable Apify proxy (proxyConfiguration.useApifyProxy) or retry later. |
| "This content requires a logged-in account…" | Login/age wall, common on TikTok, Instagram, Facebook, Reddit for non-public posts. | Use publicly accessible content; the Actor downloads anonymously and can't authenticate. |
| "Video is private / unavailable / age-restricted" | A restriction on the source itself. | Nothing the Actor can change; use content that's public and unrestricted. |
| "This post contains no video or audio to transcribe…" | The post has no media track: Instagram photo and carousel-image posts carry no video or audio. | Submit a post that contains a video or audio clip. The $0.02 processing fee still applies, since the Actor attempted the download. |
| "Video is geo-restricted…" | The content isn't available from the Actor's region. | Enable Apify proxy with a different country. |
| "Source platform rate-limited the request (HTTP 429)" | Too many requests to the platform. | Retry later or enable Apify proxy. |
| "URL is not supported" | Wrong input field, or a site yt-dlp doesn't support. | Use videoUrl for platform pages (YouTube, Vimeo, etc.) and mediaUrl for direct file links. |

Transcript, input, and limit problems

| Message you'll see | What it means | What to do |
| --- | --- | --- |
| "Transcription returned no text…" (empty) | No speech was recognized. The Actor already retried auto-detected audio on whisper-large. | Usually the audio is music, singing, or silence. If it is speech in another language and you forced language, clear it to auto-detect. See the empty transcript FAQ. |
| "Deepgram rejected the audio…" | The audio was malformed, too large, or in an unsupported format. | Re-encode to a standard format (mp3 / m4a / wav) and retry. |
| "Source duration … exceeds the configured maxAudioMinutes cap" | The source is longer than your cap. No charge is incurred. | Raise maxAudioMinutes (up to 600) if you intended to transcribe something this long. |
| "Missing input" / "Conflicting input" | You provided neither or both of videoUrl and mediaUrl. | Provide exactly one of the two. |

Tip for signed CDN links (Facebook, Instagram, X): these expire within hours and are often IP-bound, so transcribe promptly and prefer the page URL via videoUrl. See Operational caveats for signed CDN URLs for the full rundown.

Still stuck? If you're hitting a problem this guide doesn't resolve, you can reach the creator directly at harish@harishgarg.com — include the link you were trying to transcribe and the status message you saw, and I'll help you sort it out.

FAQ, disclaimers, and support

Why Deepgram Nova-3 instead of Whisper? Nova-3 is Deepgram's latest model and generally produces stronger punctuation, smart formatting (numbers, dates, currency), and native speaker diarization compared to open-source Whisper variants — without you having to stitch those features together yourself. Whisper models are still available via the model input if you prefer them for specific languages or want to benchmark the difference. You don't have to choose defensively, either: if Nova-3 returns an empty transcript on auto-detected audio, the Actor automatically retries with whisper-large, which has broader language coverage.

I got an empty transcript — what happened? This means no speech was recognized. The Actor already retries auto-detected audio on whisper-large before reporting empty, so a persistently empty result almost always means the source is music, singing, instrumental, or silence (speech models don't transcribe sung lyrics). If you forced a specific language that doesn't match the audio, clear it to auto-detect instead.

Is this Actor legal to use? You are responsible for complying with the Terms of Service of any source platform you transcribe (YouTube, Vimeo, podcast hosts, etc.) and applicable copyright laws. Only transcribe content you have rights to or that is publicly available for personal/research use.

What sources are supported? Two input modes: (1) videoUrl — any of the 1000+ sites yt-dlp supports, including YouTube, Vimeo, TikTok, SoundCloud, X, and podcast RSS episode URLs; (2) mediaUrl — any direct public URL to an audio or video file. Video files have their audio track extracted automatically.

Can I transcribe private or unlisted content? Yes, as long as the Actor can reach the file without logging in. Host the file behind a direct HTTP(S) URL (an S3 presigned URL, a public Dropbox/Google Drive direct link, or your own server) and pass it as mediaUrl. The URL only needs to be reachable for the duration of the run.

For issues, feature requests, or feedback, please use the Issues tab on this Actor's page. For custom solutions or enterprise needs, contact Apify support.