Speech-to-Text Transcription
Pricing
Pay per event
Transcribe audio and video from YouTube, TikTok, podcasts, X, and 1,000+ other sites, or any direct media URL, into accurate, speaker-labeled text. Uses Deepgram speech-to-text models with automatic language detection, multilingual support, and smart formatting.
Developer
Harish Garg
Maintained by Community
What does Speech-to-Text Transcription do?
Speech-to-Text Transcription converts audio from YouTube, podcasts, social video platforms, or any direct media URL into accurate, formatted text using Deepgram's Nova-3 speech recognition model. The Actor pulls the audio, sends it to Deepgram's API, and returns a full transcript with speaker diarization (identifying who said what) and smart formatting (dates, numbers, punctuation). It auto-detects the spoken language across dozens of languages, so non-English audio works out of the box. If the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large, which has broader language coverage.
Powered by the Apify platform, you get API access, scheduling, webhook integrations, and seamless data export — all without managing infrastructure.
YouTube and TikTok downloads are handled by dedicated downloaders. Because their anti-bot and login walls make direct extraction unreliable, YouTube links are downloaded through the maintained streamers/youtube-video-downloader Actor and TikTok links through clockworks/tiktok-scraper, for far better reliability. This affects pricing for those sources — see Pricing.
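The routing described above can be pictured as a simple hostname check. The following is a minimal sketch, not the Actor's actual code: the function name `pick_downloader` and the hostname lists are assumptions made for illustration. Only the two downloader Actor names and the yt-dlp fallback come from this page.

```python
from urllib.parse import urlparse

# Hostname suffixes are illustrative assumptions, not the Actor's real matching rules.
YOUTUBE_HOSTS = ("youtube.com", "youtu.be")
TIKTOK_HOSTS = ("tiktok.com",)


def _matches(host: str, suffixes: tuple[str, ...]) -> bool:
    """True if host equals one of the suffixes or is a subdomain of it."""
    return any(host == s or host.endswith("." + s) for s in suffixes)


def pick_downloader(video_url: str) -> str:
    """Return the component assumed to fetch audio for a platform URL."""
    host = (urlparse(video_url).hostname or "").lower()
    if _matches(host, YOUTUBE_HOSTS):
        return "streamers/youtube-video-downloader"
    if _matches(host, TIKTOK_HOSTS):
        return "clockworks/tiktok-scraper"
    # Every other platform URL is assumed to go through yt-dlp.
    return "yt-dlp"
```

A URL that resolves to one of the two delegated downloaders is also one for which the per-video fee is waived and a separate download fee applies.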
Why use Speech-to-Text Transcription?
- Content repurposing — Turn video podcasts, lectures, and interviews into blog posts, articles, or documentation
- Accessibility — Generate transcripts for hearing-impaired audiences or multilingual translation workflows
- Research & analysis — Search, index, and analyze spoken content at scale across multiple sources
- SEO — Create text versions of your video content to improve search engine discoverability
- Compliance — Maintain text records of meetings, webinars, and public broadcasts
How to transcribe YouTube videos, podcasts, and audio files
- Open the Actor — Click "Try for free" on the Actor's page
- Choose one input source — Paste a platform URL (YouTube, Vimeo, podcast RSS episode, TikTok, SoundCloud, X, etc.) or a direct media file URL
- Configure options (optional) — Choose the transcription model, enable/disable speaker diarization, set a specific language, or adjust the maximum audio length
- Run the Actor — Click "Start" and wait for the transcription to complete
- Download results — Get your transcript from the Key-Value Store or Dataset tabs
No external API keys required — Deepgram transcription is included in the per-minute price.
Input
Provide exactly one of the two input sources below. All other fields are optional.
| Field | Type | Description |
|---|---|---|
| videoUrl | string | URL on YouTube, Vimeo, TikTok, SoundCloud, X, an RSS podcast episode, or any of the 1000+ sites supported by yt-dlp |
| mediaUrl | string | Direct public HTTP(S) URL to an audio or video file (mp3, mp4, wav, m4a, flac, ogg, webm, mov, mkv, …) |
| maxAudioMinutes | integer | Maximum source length in minutes. Inputs longer than this fail before any cost is incurred. Default: 240 (4 h). Max: 600 (10 h) |
| model | enum | Deepgram model: nova-3 (default), nova-2, whisper-large, whisper-medium, whisper-base |
| language | string | Language code (e.g., en, es, fr); leave empty for auto-detection |
| diarize | boolean | Enable speaker diarization (default: true) |
| smartFormat | boolean | Apply smart formatting to transcript (default: true) |
For mediaUrl, both audio-only files and video files are supported — the Actor extracts the audio track from video automatically before transcription.
Common mediaUrl uses:
- Transcribe a video from X/Twitter: pass the direct https://video.twimg.com/... file URL
- Transcribe an Instagram Reel or post video: pass the underlying https://...cdninstagram.com/... video URL
- Transcribe your own hosted files: an S3 presigned URL, a public Google Drive / Dropbox direct link, or an Apify Key-Value Store record URL (handy for batch pipelines that store media before transcribing)
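The input rules above (exactly one source, a fixed set of models, a capped maxAudioMinutes) can be checked before a run is started. This is a hedged sketch of such a check: `build_input` is a hypothetical helper, not part of the Actor or of any Apify library, and the lower bound of 1 minute is an assumption. The field names, defaults, model list, and the 600-minute maximum are taken from the table above.

```python
from typing import Any, Optional

ALLOWED_MODELS = {"nova-3", "nova-2", "whisper-large", "whisper-medium", "whisper-base"}
MAX_AUDIO_MINUTES_LIMIT = 600  # documented maximum (10 h)


def build_input(
    video_url: Optional[str] = None,
    media_url: Optional[str] = None,
    *,
    model: str = "nova-3",
    language: Optional[str] = None,
    diarize: bool = True,
    smart_format: bool = True,
    max_audio_minutes: int = 240,
) -> dict[str, Any]:
    """Validate the options and return a dict shaped like the Actor input."""
    # Exactly one of the two sources must be provided.
    if bool(video_url) == bool(media_url):
        raise ValueError("Provide exactly one of video_url or media_url")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"Unknown model: {model!r}")
    # The lower bound of 1 is an assumption; the page only documents the maximum.
    if not 1 <= max_audio_minutes <= MAX_AUDIO_MINUTES_LIMIT:
        raise ValueError("max_audio_minutes must be between 1 and 600")

    run_input: dict[str, Any] = {
        "model": model,
        "diarize": diarize,
        "smartFormat": smart_format,
        "maxAudioMinutes": max_audio_minutes,
    }
    if video_url:
        run_input["videoUrl"] = video_url
    else:
        run_input["mediaUrl"] = media_url
    if language:  # omit the field entirely to keep auto-detection
        run_input["language"] = language
    return run_input
```

The returned dict can then be passed as the run input through whichever client or API call you already use to start Actor runs.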
Output
The Actor stores two files in the Key-Value Store:
- transcript.txt: Formatted transcript with speaker labels and paragraphs
- transcript.json: Raw Deepgram API response with full metadata
And pushes a summary to the Dataset (visible in the Output tab and via the dataset API):
{
  "sourceType": "platform",
  "videoUrl": "https://www.youtube.com/watch?v=...",
  "videoId": "dQw4w9WgXcQ",
  "title": "Example Video Title",
  "channel": "Example Channel",
  "channelUrl": "https://www.youtube.com/@example",
  "uploadDate": "2024-09-12",
  "viewCount": 1234567,
  "thumbnail": "https://i.ytimg.com/vi/.../maxresdefault.jpg",
  "model": "nova-3",
  "language": "en",
  "diarize": true,
  "durationSeconds": 342.5,
  "transcript": "[Speaker 0] ...",
  "transcriptLength": 4521,
  "speakerCount": 3
}
sourceType is "platform" for videoUrl and "directUrl" for mediaUrl. Platform-specific fields (channel, viewCount, thumbnail, …) are null for direct media URL sources. For YouTube sources these fields are also null (downloading is delegated), and title is recovered on a best-effort basis. TikTok is delegated too but still returns full metadata (title, channel, viewCount, thumbnail, uploadDate).
Example transcript output:
[Speaker 0] Welcome everyone to today's discussion about artificial intelligence.
[Speaker 1] Thank you for having me. I think the most exciting development is in natural language processing.
[Speaker 0] I completely agree. The advances in transformer models have been remarkable.
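If you want to post-process the transcript by speaker, the labelled format in the example can be split into turns. This is a small sketch that assumes every turn starts with a `[Speaker N]` label exactly as shown in the example; the helper name `parse_turns` is hypothetical and the real transcript.txt may contain extra paragraph breaks.

```python
import re

# Matches labels such as "[Speaker 0]" and captures the speaker number.
_SPEAKER_LABEL = re.compile(r"\[Speaker (\d+)\]\s*")


def parse_turns(transcript: str) -> list[tuple[int, str]]:
    """Split a speaker-labelled transcript into (speaker_id, text) pairs."""
    # re.split with one capture group yields:
    # [text_before_first_label, id_1, text_1, id_2, text_2, ...]
    parts = _SPEAKER_LABEL.split(transcript)
    turns = []
    for i in range(1, len(parts) - 1, 2):
        text = parts[i + 1].strip()
        if text:
            turns.append((int(parts[i]), text))
    return turns
```

Because the split is driven by the labels rather than by line breaks, it works whether the turns are on separate lines or run together.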
You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
Output data fields
| Field | Type | Description |
|---|---|---|
| sourceType | string | "platform" (yt-dlp source) or "directUrl" |
| videoUrl | string | Original source URL (platform URL for videoUrl, direct URL for mediaUrl) |
| videoId | string | Platform-specific video identifier (null for direct URL sources) |
| title | string | Video or episode title (null for non-platform sources) |
| channel | string | Channel or creator name (null for non-platform sources) |
| channelUrl | string | URL to the creator's channel (null for non-platform sources) |
| uploadDate | string | ISO date the source was published (null for non-platform sources) |
| viewCount | integer | View count at time of transcription (null for non-platform sources) |
| thumbnail | string | URL to video thumbnail image (null for non-platform sources) |
| model | string | Deepgram model used for transcription |
| language | string | Language code (auto-detected when not specified) |
| diarize | boolean | Whether speaker diarization was applied |
| durationSeconds | number | Audio duration in seconds |
| transcript | string | Full formatted transcript with speaker labels |
| transcriptLength | integer | Character count of the transcript |
| speakerCount | integer | Number of distinct speakers detected (1 if diarization disabled) |
Languages and choosing a model
The Actor handles non-English audio automatically — leave the language field empty and Deepgram detects the spoken language for you. If you already know the language, set it explicitly (e.g. es, fr, de, it, pt-BR, hi, ja, zh) for a small accuracy boost. Set language to multi for Nova-3 multilingual code-switching when a single recording mixes languages.
Choosing a model:
| Model | Best for |
|---|---|
| nova-3 (default) | Fast, high-accuracy English and major-language speech with the strongest punctuation and diarization |
| whisper-large | Music-adjacent audio, heavy accents, or less-common languages; offers the widest language coverage |
| nova-2, whisper-medium, whisper-base | Benchmarking or lighter-weight alternatives |
You usually don't need to think about this: when the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large before giving up, so hard or non-English audio is recovered without you changing any settings. If a transcript still comes back empty after that, the audio almost certainly contains no transcribable speech (it's music, singing, or silence).
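The retry behaviour described above amounts to a single conditional second attempt. The sketch below illustrates that logic only and is not the Actor's implementation: `transcribe_with_fallback` and the `Transcriber` callable signature are hypothetical, and the actual speech-to-text call is injected so the example stays self-contained.

```python
from typing import Callable, Optional

DEFAULT_MODEL = "nova-3"
FALLBACK_MODEL = "whisper-large"

# Hypothetical signature: (model, language) -> transcript text.
Transcriber = Callable[[str, Optional[str]], str]


def transcribe_with_fallback(
    transcribe: Transcriber,
    model: str = DEFAULT_MODEL,
    language: Optional[str] = None,
) -> tuple[str, str]:
    """Return (transcript, model_used), retrying once on an empty result.

    The retry only happens for the default model with auto-detected
    language (language is None), mirroring the description above.
    """
    text = transcribe(model, language)
    is_empty = not text.strip()
    if is_empty and model == DEFAULT_MODEL and language is None:
        # One retry on the model with broader language coverage.
        return transcribe(FALLBACK_MODEL, language), FALLBACK_MODEL
    return text, model
```

If the fallback also returns an empty string, the caller still receives an empty transcript, which matches the "no transcribable speech" case.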
Pricing / Cost estimation
This Actor uses pay-per-event pricing — you only pay for what you use, with no Apify compute units, proxy bandwidth, or Deepgram fees to track separately. Everything is bundled in:
| Event | Price (USD) | When charged |
|---|---|---|
| Video processed | $0.02 | Once per run, after audio is successfully obtained. Not charged for YouTube or TikTok URLs (see below) |
| Minute of audio transcribed | $0.015 | Per minute of audio (rounded up), only after transcription succeeds |
YouTube and TikTok sources: the $0.02 per-video fee is waived, but downloading runs through a dedicated downloader Actor (streamers/youtube-video-downloader for YouTube, clockworks/tiktok-scraper for TikTok), which bills your Apify account directly at its own small pay-per-event rate (a fraction of a cent to ~$0.0017 for a typical short video) in addition to the per-minute transcription cost above. Net effect: these transcriptions usually cost about the same as the estimates below, just with the small download fee in place of the $0.02 processing fee.
Example costs (non-delegated sources):
| Video length | Total cost |
|---|---|
| 5 min | ~$0.10 |
| 30 min | ~$0.47 |
| 1 h | ~$0.92 |
| 2 h | ~$1.82 |
| 4 h (default cap) | ~$3.62 |
The default maxAudioMinutes cap of 240 minutes (4 hours) protects you from accidentally transcribing a marathon livestream. You can raise it up to 600 minutes (10 hours) if you need to, or lower it for tighter cost control. Videos that exceed the cap fail before any charge.
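The example costs above follow directly from the two event prices. As a rough sketch, the estimate can be reproduced as below; `estimate_cost_usd` is a hypothetical helper, and for YouTube or TikTok sources (`delegated=True`) it deliberately leaves out the separate downloader Actor fee, which is billed at that Actor's own rate.

```python
import math

# Prices from the table above, in thousandths of a dollar to avoid float drift.
VIDEO_FEE_MILLS = 20    # $0.02 "Video processed" event
PER_MINUTE_MILLS = 15   # $0.015 per minute of audio, rounded up


def estimate_cost_usd(
    duration_minutes: float,
    *,
    delegated: bool = False,
    max_audio_minutes: int = 240,
) -> float:
    """Estimate this Actor's charge for one run, in USD."""
    if duration_minutes < 0:
        raise ValueError("duration_minutes must not be negative")
    if duration_minutes > max_audio_minutes:
        return 0.0  # over the cap: the run fails before any charge
    billed_minutes = math.ceil(duration_minutes)
    video_fee = 0 if delegated else VIDEO_FEE_MILLS
    return (video_fee + PER_MINUTE_MILLS * billed_minutes) / 1000
```

For a 30-minute file this gives 0.02 + 30 × 0.015 = 0.47 USD, matching the table; the 5-minute row is 0.095 USD before rounding to ~$0.10.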
Trying it for free — New Apify accounts include free monthly platform credits, which are enough to transcribe several hours of audio before any out-of-pocket cost.
Tips
- Set maxAudioMinutes lower if you're running this on a schedule or via API, to cap worst-case cost per run
- Shorter videos process faster: videos under 30 minutes typically transcribe in under a minute
- Stick with Nova-3 for most audio: it's Deepgram's most advanced model, with superior punctuation and speaker detection. For music-heavy or less-common-language audio, pick whisper-large (though the Actor also falls back to it automatically on an empty result)
- Set the language explicitly if you know it: this improves accuracy for non-English content; leave it empty to auto-detect
- Disable diarization for single-speaker videos to slightly reduce transcription time
FAQ, disclaimers, and support
Why Deepgram Nova-3 instead of Whisper?
Nova-3 is Deepgram's latest model and generally produces stronger punctuation, smart formatting (numbers, dates, currency), and native speaker diarization compared to open-source Whisper variants, without you having to stitch those features together yourself. Whisper models are still available via the model input if you prefer them for specific languages or want to benchmark the difference. You don't have to choose defensively, either: if Nova-3 returns an empty transcript on auto-detected audio, the Actor automatically retries with whisper-large, which has broader language coverage.
I got an empty transcript. What happened?
This means no speech was recognized. The Actor already retries auto-detected audio on whisper-large before reporting empty, so a persistently empty result almost always means the source is music, singing, instrumental, or silence (speech models don't transcribe sung lyrics). If you forced a specific language that doesn't match the audio, clear it to auto-detect instead.
Is this Actor legal to use?
You are responsible for complying with the Terms of Service of any source platform you transcribe (YouTube, Vimeo, podcast hosts, etc.) and applicable copyright laws. Only transcribe content you have rights to or that is publicly available for personal/research use.
What sources are supported?
Two input modes: (1) videoUrl: any of the 1000+ sites yt-dlp supports, including YouTube, Vimeo, TikTok, SoundCloud, X, and podcast RSS episode URLs; (2) mediaUrl: any direct public URL to an audio or video file. Video files have their audio track extracted automatically.
Can I transcribe private or unlisted content?
Host the file behind a direct HTTP(S) URL (S3 presigned URL, a public Dropbox/Google Drive direct link, your own server) and pass it as mediaUrl. The URL only needs to be reachable for the duration of the run.
For issues, feature requests, or feedback, please use the Issues tab on this Actor's page. For custom solutions or enterprise needs, contact Apify support.