Speech-to-Text Transcription
Pricing
Pay per event
Transcribe audio or video from YouTube and 1000+ other platforms, direct media URLs, or Key-Value Store uploads using Deepgram with speaker diarization.
Developer
Harish Garg
Maintained by Community
What does YouTube Speech-to-Text do?
YouTube Speech-to-Text extracts audio from any YouTube video and converts it into accurate, formatted text using Deepgram's Nova-3 speech recognition model. The Actor downloads the video's audio, sends it to Deepgram's API, and returns a full transcript with speaker diarization (identifying who said what), smart formatting (dates, numbers, punctuation), and language auto-detection.
Try it out with any public YouTube URL. Because the Actor runs on the Apify platform, you also get API access, scheduling, webhook integrations, and data export without managing any infrastructure.
Why use YouTube Speech-to-Text?
- Content repurposing — Turn video podcasts, lectures, and interviews into blog posts, articles, or documentation
- Accessibility — Generate transcripts for hearing-impaired audiences or multilingual translation workflows
- Research & analysis — Search, index, and analyze spoken content at scale across multiple videos
- SEO — Create text versions of your video content to improve search engine discoverability
- Compliance — Maintain text records of meetings, webinars, and public broadcasts
How to use YouTube Speech-to-Text
- Open the Actor — Click "Try for free" on the Actor's page
- Enter the YouTube URL — Paste the full URL of any public YouTube video
- Configure options (optional) — Choose the transcription model, enable/disable speaker diarization, set a specific language, or adjust the maximum video length
- Run the Actor — Click "Start" and wait for the transcription to complete
- Download results — Get your transcript from the Key-Value Store or Dataset tabs
No external API keys required — Deepgram transcription is included in the per-minute price.
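The steps above can also be driven through the API instead of the web console. The following is a minimal sketch, not the Actor author's own code: it assumes the `apify-client` Python package, whose client objects expose `actor(...).call(...)` and `dataset(...).iterate_items()`. The helper name `run_transcription` is hypothetical, and the client is passed in as an argument so the function itself has no hard dependency.

```python
from typing import Any, Dict, List


def run_transcription(
    client: Any, actor_id: str, run_input: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Start the Actor, wait for the run to finish, and return its dataset items.

    `client` is assumed to behave like apify_client.ApifyClient.
    `actor_id` is a placeholder for this Actor's ID from its Apify page.
    """
    # call() starts the run and blocks until it finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")

    # The summary record described in the Output section lands in the
    # run's default dataset.
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())
```

With `apify-client` installed, usage would look roughly like `client = ApifyClient("<YOUR_API_TOKEN>")` followed by `run_transcription(client, "<ACTOR_ID>", {"videoUrl": "https://www.youtube.com/watch?v=..."})`. Both placeholders must be replaced with your own values.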
Input
The Actor accepts the following input parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| `videoUrl` | string | Yes | URL of the YouTube video to transcribe |
| `maxAudioMinutes` | integer | No | Maximum video length in minutes. Videos longer than this fail before any cost is incurred. Default: 240 (4 h). Max: 600 (10 h) |
| `model` | enum | No | Deepgram model: `nova-3` (default), `nova-2`, `whisper-large`, `whisper-medium`, `whisper-base` |
| `language` | string | No | Language code (e.g., `en`, `es`, `fr`). Leave empty for auto-detection |
| `diarize` | boolean | No | Enable speaker diarization (default: true) |
| `smartFormat` | boolean | No | Apply smart formatting to transcript (default: true) |
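When calling the Actor programmatically, it can help to validate the input locally before starting a run. The sketch below builds an input object from the fields and limits in the table above. The function name `build_run_input` and the lower bound of 1 minute are assumptions for illustration; the Actor's own validation may differ.

```python
from typing import Any, Dict, Optional

# Values taken from the input table above.
ALLOWED_MODELS = ("nova-3", "nova-2", "whisper-large", "whisper-medium", "whisper-base")
DEFAULT_MAX_AUDIO_MINUTES = 240
MAX_AUDIO_MINUTES_LIMIT = 600


def build_run_input(
    video_url: str,
    max_audio_minutes: int = DEFAULT_MAX_AUDIO_MINUTES,
    model: str = "nova-3",
    language: Optional[str] = None,
    diarize: bool = True,
    smart_format: bool = True,
) -> Dict[str, Any]:
    """Return an input dict for the Actor, raising ValueError on bad values."""
    if not video_url:
        raise ValueError("videoUrl is required")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")
    # The lower bound of 1 is an assumption; the table only states the maximum.
    if not 1 <= max_audio_minutes <= MAX_AUDIO_MINUTES_LIMIT:
        raise ValueError(f"maxAudioMinutes must be between 1 and {MAX_AUDIO_MINUTES_LIMIT}")

    run_input: Dict[str, Any] = {
        "videoUrl": video_url,
        "maxAudioMinutes": max_audio_minutes,
        "model": model,
        "diarize": diarize,
        "smartFormat": smart_format,
    }
    # Omit the language field entirely to request auto-detection.
    if language:
        run_input["language"] = language
    return run_input
```

Leaving `language` out of the dict, rather than sending an empty string, is a design choice made here to mirror the "leave empty" wording in the table.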
Output
The Actor stores two files in the Key-Value Store:
- `transcript.txt`: Formatted transcript with speaker labels and paragraphs
- `transcript.json`: Raw Deepgram API response with full metadata
It also pushes a summary record to the Dataset (visible in the Output tab and via the dataset API):

    {
      "videoUrl": "https://www.youtube.com/watch?v=...",
      "videoId": "dQw4w9WgXcQ",
      "title": "Example Video Title",
      "channel": "Example Channel",
      "channelUrl": "https://www.youtube.com/@example",
      "uploadDate": "2024-09-12",
      "viewCount": 1234567,
      "thumbnail": "https://i.ytimg.com/vi/.../maxresdefault.jpg",
      "model": "nova-3",
      "language": "en",
      "diarize": true,
      "durationSeconds": 342.5,
      "transcript": "[Speaker 0] ...",
      "transcriptLength": 4521,
      "speakerCount": 3
    }

Example transcript output:

    [Speaker 0] Welcome everyone to today's discussion about artificial intelligence.
    [Speaker 1] Thank you for having me. I think the most exciting development is in natural language processing.
    [Speaker 0] I completely agree. The advances in transformer models have been remarkable.

You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
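For downstream analysis, the `[Speaker N]` labels shown in the example transcript can be split back into individual speaker turns. The sketch below assumes the transcript text always uses exactly that label format, which is inferred from the example rather than from a documented specification. The helper names `parse_turns` and `words_per_speaker` are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Matches a "[Speaker N]" label and everything up to the next label or end of text.
_TURN_RE = re.compile(r"\[Speaker (\d+)\]\s*(.*?)(?=\[Speaker \d+\]|\Z)", re.DOTALL)


def parse_turns(transcript: str) -> List[Tuple[int, str]]:
    """Split a diarized transcript into (speaker_id, text) turns, in order."""
    return [
        (int(match.group(1)), match.group(2).strip())
        for match in _TURN_RE.finditer(transcript)
    ]


def words_per_speaker(turns: List[Tuple[int, str]]) -> Dict[int, int]:
    """Count whitespace-separated words spoken by each speaker."""
    counts: Counter = Counter()
    for speaker, text in turns:
        counts[speaker] += len(text.split())
    return dict(counts)
```

Because the pattern looks ahead to the next label instead of relying on line breaks, it works whether the turns are separated by newlines or run together on one line.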
Pricing / Cost estimation
This Actor uses pay-per-event pricing — you only pay for what you use, with no Apify compute units, proxy bandwidth, or Deepgram fees to track separately. Everything is bundled in:
| Event | Price (USD) | When charged |
|---|---|---|
| Video processed | $0.02 | Once per video, after audio is successfully downloaded |
| Minute of audio transcribed | $0.015 | Per minute of audio (rounded up), only after transcription succeeds |
Example costs:
| Video length | Total cost |
|---|---|
| 5 min | ~$0.10 |
| 30 min | ~$0.47 |
| 1 h | ~$0.92 |
| 2 h | ~$1.82 |
| 4 h (default cap) | ~$3.62 |
The default maxAudioMinutes cap of 240 minutes (4 hours) protects you from accidentally transcribing a marathon livestream. You can raise it up to 600 minutes (10 hours) if you need to, or lower it for tighter cost control. Videos that exceed the cap fail before any charge.
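The example costs above follow directly from the two event prices: one $0.02 charge per video plus $0.015 for each started minute of audio. The sketch below reproduces that arithmetic. It assumes rounding up is applied once to the video's total duration, and it works in thousandths of a dollar to avoid floating-point drift; the function names are hypothetical.

```python
import math

# Event prices from the pricing table, in thousandths of a dollar.
VIDEO_PROCESSED_MILLS = 20  # $0.02 once per video
MINUTE_TRANSCRIBED_MILLS = 15  # $0.015 per minute of audio


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate the total charge for one successfully transcribed video."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Assumption: minutes are rounded up once, on the total duration.
    billed_minutes = math.ceil(duration_seconds / 60)
    total_mills = VIDEO_PROCESSED_MILLS + MINUTE_TRANSCRIBED_MILLS * billed_minutes
    return total_mills / 1000


def worst_case_cost_usd(max_audio_minutes: int = 240) -> float:
    """Upper bound on the charge for one run with the given maxAudioMinutes cap."""
    return estimate_cost_usd(max_audio_minutes * 60)
```

For instance, a 5-minute video comes to 0.02 + 5 × 0.015 = $0.095, which the table rounds to ~$0.10, and the default 240-minute cap bounds a single run at $3.62.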
Tips
- Set `maxAudioMinutes` lower if you're running this on a schedule or via API to cap worst-case cost per run
- Shorter videos process faster: videos under 30 minutes typically transcribe in under a minute
- Use Nova-3 for best accuracy — It's Deepgram's most advanced model with superior punctuation and speaker detection
- Set the language explicitly if you know it — this improves accuracy for non-English content
- Disable diarization for single-speaker videos to slightly reduce transcription time
FAQ, disclaimers, and support
Is this Actor legal to use? You are responsible for complying with YouTube's Terms of Service and applicable copyright laws. Only transcribe videos you have rights to or that are publicly available for personal/research use.
What video formats are supported? Any public YouTube video. The Actor uses yt-dlp to download audio in the best available quality.
Can I transcribe private or unlisted videos? Not directly through this Actor, as it relies on publicly accessible YouTube URLs.
For issues, feature requests, or feedback, please use the Issues tab on this Actor's page. For custom solutions or enterprise needs, contact Apify support.