Speech-to-Text Transcription

Transcribe audio and video from YouTube, TikTok, podcasts, X, and 1,000+ other sites, or from any direct media URL, into accurate, speaker-labeled text. Uses Deepgram speech-to-text models with automatic language detection, multilingual support, and smart formatting.

Pricing

Pay per event

Rating

5.0

(1)

Developer

Harish Garg

What does Speech-to-Text Transcription do?

Speech-to-Text Transcription converts audio from YouTube, podcasts, social video platforms, or any direct media URL into accurate, formatted text using Deepgram's Nova-3 speech recognition model. The Actor pulls the audio, sends it to Deepgram's API, and returns a full transcript with speaker diarization (identifying who said what) and smart formatting (dates, numbers, punctuation). It auto-detects the spoken language across dozens of languages, so non-English audio works out of the box. If the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large, which has broader language coverage.

Powered by the Apify platform, you get API access, scheduling, webhook integrations, and seamless data export — all without managing infrastructure.

YouTube and TikTok downloads are handled by dedicated downloaders. Because their anti-bot and login walls make direct extraction unreliable, YouTube links are downloaded through the maintained streamers/youtube-video-downloader Actor and TikTok links through clockworks/tiktok-scraper, for far better reliability. This affects pricing for those sources — see Pricing.

Why use Speech-to-Text Transcription?

Content repurposing — Turn video podcasts, lectures, and interviews into blog posts, articles, or documentation
Accessibility — Generate transcripts for hearing-impaired audiences or multilingual translation workflows
Research & analysis — Search, index, and analyze spoken content at scale across multiple sources
SEO — Create text versions of your video content to improve search engine discoverability
Compliance — Maintain text records of meetings, webinars, and public broadcasts

How to transcribe YouTube videos, podcasts, and audio files

Open the Actor — Click "Try for free" on the Actor's page
Choose one input source — Paste a platform URL (YouTube, Vimeo, podcast RSS episode, TikTok, SoundCloud, X, etc.) or a direct media file URL
Configure options (optional) — Choose the transcription model, enable/disable speaker diarization, set a specific language, or adjust the maximum audio length
Run the Actor — Click "Start" and wait for the transcription to complete
Download results — Get your transcript from the Key-Value Store or Dataset tabs

No external API keys required — Deepgram transcription is included in the per-minute price.

Input

Provide exactly one of the two input sources below. All other fields are optional.

Field	Type	Description
`videoUrl`	string	URL on YouTube, Vimeo, TikTok, SoundCloud, X, an RSS podcast episode, or any of the 1000+ sites supported by yt-dlp
`mediaUrl`	string	Direct public HTTP(S) URL to an audio or video file (mp3, mp4, wav, m4a, flac, ogg, webm, mov, mkv, …)
`maxAudioMinutes`	integer	Maximum source length in minutes. Inputs longer than this fail before any cost is incurred. Default: `240` (4 h). Max: `600` (10 h)
`model`	enum	Deepgram model: `nova-3` (default), `nova-2`, `whisper-large`, `whisper-medium`, `whisper-base`
`language`	string	Language code (e.g., `en`, `es`, `fr`) — leave empty for auto-detection
`diarize`	boolean	Enable speaker diarization (default: `true`)
`smartFormat`	boolean	Apply smart formatting to transcript (default: `true`)

For mediaUrl, both audio-only files and video files are supported — the Actor extracts the audio track from video automatically before transcription.

Common mediaUrl uses:

Transcribe a video from X/Twitter — pass the direct https://video.twimg.com/... file URL
Transcribe an Instagram Reel or post video — pass the underlying https://...cdninstagram.com/... video URL
Transcribe your own hosted files — an S3 presigned URL, a public Google Drive / Dropbox direct link, or an Apify Key-Value Store record URL (handy for batch pipelines that store media before transcribing)
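
The input rules above (exactly one source, everything else optional with defaults) can be sketched as a small helper that builds the run input. This is only an illustration: `build_input` and its argument names are hypothetical, and the lower bound of 1 minute is an assumption, since the table only states the default (240) and the maximum (600).

```python
from typing import Any, Optional

# Model names as listed in the Input table.
ALLOWED_MODELS = {"nova-3", "nova-2", "whisper-large", "whisper-medium", "whisper-base"}


def build_input(
    video_url: Optional[str] = None,
    media_url: Optional[str] = None,
    *,
    max_audio_minutes: int = 240,
    model: str = "nova-3",
    language: Optional[str] = None,
    diarize: bool = True,
    smart_format: bool = True,
) -> dict[str, Any]:
    """Build a run input dict (hypothetical helper, not part of the Actor)."""
    # Exactly one of the two sources must be set.
    if bool(video_url) == bool(media_url):
        raise ValueError("Provide exactly one of videoUrl or mediaUrl")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"Unknown model: {model}")
    # Upper bound comes from the table; the lower bound of 1 is an assumption.
    if not 1 <= max_audio_minutes <= 600:
        raise ValueError("maxAudioMinutes must be between 1 and 600")

    run_input: dict[str, Any] = {
        "maxAudioMinutes": max_audio_minutes,
        "model": model,
        "diarize": diarize,
        "smartFormat": smart_format,
    }
    if video_url:
        run_input["videoUrl"] = video_url
    else:
        run_input["mediaUrl"] = media_url
    # Leaving language out means auto-detection.
    if language:
        run_input["language"] = language
    return run_input
```

The resulting dict could then be passed as `run_input` when calling the Actor through the `apify-client` Python package (`ApifyClient(token).actor(actor_id).call(run_input=...)`). The Actor ID is not given in this text, so it is left as a placeholder here.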

Output

The Actor stores two files in the Key-Value Store:

transcript.txt — Formatted transcript with speaker labels and paragraphs
transcript.json — Raw Deepgram API response with full metadata

And pushes a summary to the Dataset (visible in the Output tab and via the dataset API):

{
    "sourceType": "platform",
    "videoUrl": "https://www.youtube.com/watch?v=...",
    "videoId": "dQw4w9WgXcQ",
    "title": "Example Video Title",
    "channel": "Example Channel",
    "channelUrl": "https://www.youtube.com/@example",
    "uploadDate": "2024-09-12",
    "viewCount": 1234567,
    "thumbnail": "https://i.ytimg.com/vi/.../maxresdefault.jpg",
    "model": "nova-3",
    "language": "en",
    "diarize": true,
    "durationSeconds": 342.5,
    "transcript": "[Speaker 0] ...",
    "transcriptLength": 4521,
    "speakerCount": 3
}

sourceType is "platform" for videoUrl and "directUrl" for mediaUrl. Platform-specific fields (channel, viewCount, thumbnail, …) are null for direct media URL sources. For YouTube sources these fields are also null (downloading is delegated), and title is recovered on a best-effort basis. TikTok is delegated too but still returns full metadata (title, channel, viewCount, thumbnail, uploadDate).

Example transcript output:

[Speaker 0] Welcome everyone to today's discussion about artificial intelligence.

[Speaker 1] Thank you for having me. I think the most exciting development is in natural language processing.

[Speaker 0] I completely agree. The advances in transformer models have been remarkable.

You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
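
For downstream processing, the speaker-labeled transcript shown above can be split into per-speaker turns. The sketch below assumes every turn starts on its own line with a `[Speaker N]` label, as in the example. The helper name `parse_turns` is hypothetical, and the real `transcript.txt` layout may differ in details.

```python
import re

# Matches a line such as "[Speaker 0] Welcome everyone ..."
SPEAKER_TURN = re.compile(r"^\[Speaker (\d+)\]\s*(.*)$")


def parse_turns(transcript: str) -> list[tuple[int, str]]:
    """Split a speaker-labeled transcript into (speaker_id, text) turns."""
    turns: list[tuple[int, str]] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate turns
        match = SPEAKER_TURN.match(line)
        if match:
            turns.append((int(match.group(1)), match.group(2)))
        elif turns:
            # Unlabeled line: treat it as a continuation of the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
        # Unlabeled lines before the first label are skipped.
    return turns


def speaker_ids(turns: list[tuple[int, str]]) -> set[int]:
    """Return the set of distinct speaker ids seen in the turns."""
    return {speaker for speaker, _ in turns}
```

Counting `speaker_ids(parse_turns(text))` gives a quick cross-check against the `speakerCount` field in the dataset summary.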

Output data fields

Field	Type	Description
`sourceType`	string	`"platform"` (yt-dlp source) or `"directUrl"`
`videoUrl`	string	Original source URL (platform URL for `videoUrl`, direct URL for `mediaUrl`)
`videoId`	string	Platform-specific video identifier (null for direct URL sources)
`title`	string	Video or episode title (null for non-platform sources)
`channel`	string	Channel or creator name (null for non-platform sources)
`channelUrl`	string	URL to the creator's channel (null for non-platform sources)
`uploadDate`	string	ISO date the source was published (null for non-platform sources)
`viewCount`	integer	View count at time of transcription (null for non-platform sources)
`thumbnail`	string	URL to video thumbnail image (null for non-platform sources)
`model`	string	Deepgram model used for transcription
`language`	string	Language code (auto-detected when not specified)
`diarize`	boolean	Whether speaker diarization was applied
`durationSeconds`	number	Audio duration in seconds
`transcript`	string	Full formatted transcript with speaker labels
`transcriptLength`	integer	Character count of the transcript
`speakerCount`	integer	Number of distinct speakers detected (1 if diarization disabled)

Languages and choosing a model

The Actor handles non-English audio automatically — leave the language field empty and Deepgram detects the spoken language for you. If you already know the language, set it explicitly (e.g. es, fr, de, it, pt-BR, hi, ja, zh) for a small accuracy boost. Set language to multi for Nova-3 multilingual code-switching when a single recording mixes languages.

Choosing a model:

Model	Best for
`nova-3` (default)	Fast, high-accuracy English and major-language speech with the strongest punctuation and diarization
`whisper-large`	Music-adjacent audio, heavy accents, or less-common languages — the widest language coverage
`nova-2`, `whisper-medium`, `whisper-base`	Benchmarking or lighter-weight alternatives

You usually don't need to think about this: when the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large before giving up, so hard or non-English audio is recovered without you changing any settings. If a transcript still comes back empty after that, the audio almost certainly contains no transcribable speech (it's music, singing, or silence).
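
The retry behavior described above can be sketched roughly as follows. This is not the Actor's actual code: `transcribe` is a hypothetical callable standing in for the Deepgram request, and the exact conditions (retry only when the default model was used and no language was set) are an interpretation of the description.

```python
from typing import Callable, Optional

DEFAULT_MODEL = "nova-3"
FALLBACK_MODEL = "whisper-large"

# Hypothetical signature: (model, language) -> transcript text.
TranscribeFn = Callable[[str, Optional[str]], str]


def transcribe_with_fallback(
    transcribe: TranscribeFn,
    model: str = DEFAULT_MODEL,
    language: Optional[str] = None,
) -> tuple[str, str]:
    """Return (transcript, model_used), retrying once on an empty result."""
    text = transcribe(model, language)
    if text.strip():
        return text, model
    # Retry only for the default model on auto-detected audio (assumption).
    if model == DEFAULT_MODEL and language is None:
        return transcribe(FALLBACK_MODEL, None), FALLBACK_MODEL
    # Explicit model or language: report the empty result as-is.
    return text, model
```

If the second attempt is also empty, the caller still receives an empty string, which matches the note that such audio most likely contains no transcribable speech.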

Pricing / Cost estimation

This Actor uses pay-per-event pricing — you only pay for what you use, with no Apify compute units, proxy bandwidth, or Deepgram fees to track separately. Everything is bundled in:

Event	Price (USD)	When charged
Video processed	$0.02	Once per run, after audio is successfully obtained. Not charged for YouTube or TikTok URLs (see below)
Minute of audio transcribed	$0.015	Per minute of audio (rounded up), only after transcription succeeds

YouTube and TikTok sources: the $0.02 per-video fee is waived, but downloading runs through a dedicated downloader Actor (streamers/youtube-video-downloader for YouTube, clockworks/tiktok-scraper for TikTok), which bills your Apify account directly at its own small pay-per-event rate (a fraction of a cent, up to about $0.0017 for a typical short video) in addition to the per-minute transcription cost above. Net effect: these transcriptions usually cost about the same as the estimates below, with the small download fee replacing the $0.02 processing fee.

Example costs (non-delegated sources):

Video length	Total cost
5 min	~$0.10
30 min	~$0.47
1 h	~$0.92
2 h	~$1.82
4 h (default cap)	~$3.62

The default maxAudioMinutes cap of 240 minutes (4 hours) protects you from accidentally transcribing a marathon livestream. You can raise it up to 600 minutes (10 hours) if you need to, or lower it for tighter cost control. Videos that exceed the cap fail before any charge.
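
The pricing rules above reduce to a short formula: a flat $0.02 per video plus $0.015 per started minute of audio. The sketch below reproduces the example table under two assumptions: rounding is to the next whole minute per run, and the cap is compared against the unrounded duration. `estimate_cost` is a hypothetical helper. For YouTube and TikTok it omits the separate downloader fee, which is billed by the downloader Actor.

```python
import math
from decimal import Decimal

PER_VIDEO = Decimal("0.02")    # "Video processed" event
PER_MINUTE = Decimal("0.015")  # "Minute of audio transcribed" event


def estimate_cost(
    duration_seconds: float,
    *,
    delegated: bool = False,
    max_audio_minutes: int = 240,
) -> Decimal:
    """Estimate the Actor's own charges in USD for one run."""
    # Inputs over the cap fail before any charge (cap check is an assumption
    # about how the duration is compared).
    if duration_seconds / 60 > max_audio_minutes:
        raise ValueError("Source exceeds maxAudioMinutes; run fails with no charge")
    billed_minutes = math.ceil(duration_seconds / 60)  # rounded up
    # The per-video fee is waived for YouTube/TikTok (delegated downloads).
    base = Decimal("0") if delegated else PER_VIDEO
    return base + PER_MINUTE * billed_minutes


if __name__ == "__main__":
    for minutes in (5, 30, 60, 120, 240):
        print(f"{minutes} min: ${estimate_cost(minutes * 60)}")
```

Using `Decimal` keeps the small per-minute amounts exact, so the results line up with the table before rounding to cents (for example, 5 minutes gives 0.095, shown as ~$0.10).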

Trying it for free — New Apify accounts include free monthly platform credits, which are enough to transcribe several hours of audio before any out-of-pocket cost.

Tips

Set maxAudioMinutes lower if you're running this on a schedule or via API to cap worst-case cost per run
Shorter videos process faster — Videos under 30 minutes typically transcribe in under a minute
Stick with Nova-3 for most audio — It's Deepgram's most advanced model with superior punctuation and speaker detection. For music-heavy or less-common-language audio, pick whisper-large (though the Actor also falls back to it automatically on an empty result)
Set the language explicitly if you know it — this improves accuracy for non-English content; leave it empty to auto-detect
Disable diarization for single-speaker videos to slightly reduce transcription time

FAQ, disclaimers, and support

Why Deepgram Nova-3 instead of Whisper? Nova-3 is Deepgram's latest model and generally produces stronger punctuation, smart formatting (numbers, dates, currency), and native speaker diarization compared to open-source Whisper variants — without you having to stitch those features together yourself. Whisper models are still available via the model input if you prefer them for specific languages or want to benchmark the difference. You don't have to choose defensively, either: if Nova-3 returns an empty transcript on auto-detected audio, the Actor automatically retries with whisper-large, which has broader language coverage.

I got an empty transcript — what happened? This means no speech was recognized. The Actor already retries auto-detected audio on whisper-large before reporting empty, so a persistently empty result almost always means the source is music, singing, instrumental, or silence (speech models don't transcribe sung lyrics). If you forced a specific language that doesn't match the audio, clear it to auto-detect instead.

Is this Actor legal to use? You are responsible for complying with the Terms of Service of any source platform you transcribe (YouTube, Vimeo, podcast hosts, etc.) and applicable copyright laws. Only transcribe content you have rights to or that is publicly available for personal/research use.

What sources are supported? Two input modes: (1) videoUrl — any of the 1000+ sites yt-dlp supports, including YouTube, Vimeo, TikTok, SoundCloud, X, and podcast RSS episode URLs; (2) mediaUrl — any direct public URL to an audio or video file. Video files have their audio track extracted automatically.

Can I transcribe private or unlisted content? Host the file behind a direct HTTP(S) URL (S3 presigned URL, a public Dropbox/Google Drive direct link, your own server) and pass it as mediaUrl. The URL only needs to be reachable for the duration of the run.

For issues, feature requests, or feedback, please use the Issues tab on this Actor's page. For custom solutions or enterprise needs, contact Apify support.
