Speech-to-Text Transcription

Transcribe audio and video from YouTube, TikTok, podcasts, X, and 1,000+ other sites, or from any direct media URL, into accurate, speaker-labeled text. Uses Deepgram speech-to-text models with automatic language detection, multilingual support, and smart formatting.

Pricing: Pay per event
Rating: 5.0 (3)
Developer: Harish Garg
Total users: 232
Last modified: a day ago

What does Speech-to-Text Transcription do?

Speech-to-Text Transcription converts audio from YouTube, podcasts, social video platforms, or any direct media URL into accurate, formatted text using Deepgram's Nova-3 speech recognition model. The Actor pulls the audio, sends it to Deepgram's API, and returns a full transcript with speaker diarization (identifying who said what) and smart formatting (dates, numbers, punctuation). It auto-detects the spoken language across dozens of languages, so non-English audio works out of the box. When the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large to recover difficult audio.

Because the Actor runs on the Apify platform, you get API access, scheduling, webhook integrations, and data export without managing any infrastructure.

YouTube, TikTok, Instagram, Douyin, and Bilibili downloads are handled by dedicated downloaders. Anti-bot, login, and geo walls make direct extraction from these platforms unreliable, so their links are downloaded through maintained downloader Actors: streamers/youtube-video-downloader (YouTube), clockworks/tiktok-scraper (TikTok), apify/instagram-scraper (Instagram), hgservices/douyin-scraper (Douyin), and hgservices/Bilibili-Video-Scraper (Bilibili). This affects pricing for those sources (see Pricing).

Why use Speech-to-Text Transcription?

Content repurposing — Turn video podcasts, lectures, and interviews into blog posts, articles, or documentation
Accessibility — Generate transcripts for hearing-impaired audiences or multilingual translation workflows
Research & analysis — Search, index, and analyze spoken content at scale across multiple sources
SEO — Create text versions of your video content to improve search engine discoverability
Compliance — Maintain text records of meetings, webinars, and public broadcasts
Pipeline transcription step — Already scraping videos with your own Actor or tool? Chain this Actor downstream and feed it the direct media URL to add transcripts to whatever you collect (see Use as a transcription step in your scraping pipeline)

Supported sources at a glance

Not sure if your source will work? Find it in the table below, which tells you which input field to use and how to pass the URL for the most reliable result.

Source	Use this field	How to pass it for best results
YouTube	`videoUrl`	Paste the watch URL. Downloading is delegated to a maintained downloader for reliability; platform metadata is limited and the per-video fee is waived (see Pricing).
TikTok	`videoUrl`	Paste the video URL. Delegated downloader, but full metadata (title, channel, views) is still returned.
Douyin (抖音)	`videoUrl`	Paste the video or `v.douyin.com` share URL. Delegated downloader with full metadata (title, author, duration); the no-watermark audio is transcribed with automatic language detection (audio is usually Mandarin).
Bilibili (哔哩哔哩)	`videoUrl`	Paste the `bilibili.com/video/...` or `b23.tv` share URL. Delegated downloader with full metadata (title, author, views, duration); avoids the geo restriction anonymous extraction hits on region-locked and bangumi content.
Vimeo, SoundCloud, X/Twitter, podcast RSS episodes, and 1000+ other sites supported by yt-dlp	`videoUrl`	Paste the public page/episode URL. The content must be public: login/age-walled posts can't be fetched.
Facebook / Instagram	`videoUrl` (the page URL)	Prefer the `facebook.com` / `instagram.com` post URL so the Actor resolves a fresh stream itself. A raw `fbcdn.net` / `cdninstagram.com` link works only if it's the progressive `.mp4` (not a fragmented `_dashinit.mp4` / DASH) variant and hasn't expired.
Your own hosted files (S3, Google Drive, Dropbox, your CDN, Apify Key-Value Store)	`mediaUrl`	Pass a direct public link straight to the file (`mp3`, `mp4`, `wav`, `m4a`, `flac`, `ogg`, `webm`, `mov`, `mkv`). Video files have their audio extracted automatically.
Any other direct media URL	`mediaUrl`	If the link points straight at an audio/video file and is publicly reachable, it works. If it serves an HTML page, it won't.

Rule of thumb: if there's a public web page for it, use videoUrl. If you already hold a direct link to the media file itself, use mediaUrl. Provide exactly one.
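
The rule of thumb above can be sketched as a small helper. This is an illustrative heuristic, not the Actor's own logic: `choose_input_field` is a hypothetical name, and the extension list simply mirrors the file types named in the table. A direct link without a file extension in its path (some signed CDN URLs, for example) would be classified as a page URL here, so treat the result as a starting point.

```python
from urllib.parse import urlparse

# File types named in the supported-sources table; a heuristic, not an exhaustive list.
MEDIA_EXTENSIONS = (".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mov", ".mkv")


def choose_input_field(url: str) -> dict:
    """Return an Actor input with exactly one of videoUrl / mediaUrl set."""
    # Only the path is inspected, so query strings such as signatures are ignored.
    path = urlparse(url).path.lower()
    if path.endswith(MEDIA_EXTENSIONS):
        return {"mediaUrl": url}
    return {"videoUrl": url}
```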

How to transcribe YouTube videos, podcasts, and audio files

Open the Actor — Click "Try for free" on the Actor's page
Choose one input source — Paste a platform URL (YouTube, Vimeo, podcast RSS episode, TikTok, SoundCloud, X, etc.) or a direct media file URL
Configure options (optional) — Choose the transcription model, enable/disable speaker diarization, set a specific language, or adjust the maximum audio length
Run the Actor — Click "Start" and wait for the transcription to complete
Download results — Get your transcript from the Key-Value Store or Dataset tabs

No external API keys required — Deepgram transcription is included in the per-minute price.

Run it via API or on a schedule

Most production usage drives this Actor through the Apify API, not the Console — it's built for that. Start a run with a single POST:

curl -X POST "https://api.apify.com/v2/acts/hgservices~speech-to-text/runs?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"videoUrl": "https://www.youtube.com/watch?v=f60dheI4ARg"}'

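Or from the Apify Python client (the Actor slug is hgservices/speech-to-text):

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")  # your Apify API token
run = client.actor("hgservices/speech-to-text").call(
    run_input={"videoUrl": "https://www.youtube.com/watch?v=..."}
)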

To run it unattended, point a Schedule at it, or fire it from your own scraper's success event with webhooks and feed it a direct mediaUrl — see Use as a transcription step in your scraping pipeline for that pattern.
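
If you would rather not install a client library, the same POST can be assembled with Python's standard library. The sketch below only builds the request object. `build_run_request` is a hypothetical helper name, and the commented-out send step assumes the response wraps the run object in a `data` key.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.apify.com/v2/acts/hgservices~speech-to-text/runs"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a run."""
    return urllib.request.Request(
        f"{API_URL}?token={token}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it is a network call and needs a real token:
# with urllib.request.urlopen(build_run_request("<APIFY_TOKEN>", {"videoUrl": "..."})) as resp:
#     run = json.load(resp)["data"]  # assumption: the run object sits under "data"
```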

Input

Provide exactly one of the two input sources below. All other fields are optional.

Field	Type	Description
`videoUrl`	string	URL on YouTube, Vimeo, TikTok, Douyin, Bilibili, SoundCloud, X, an RSS podcast episode, or any of the 1000+ sites supported by yt-dlp
`mediaUrl`	string	Direct public HTTP(S) URL to an audio or video file (mp3, mp4, wav, m4a, flac, ogg, webm, mov, mkv, …)
`maxAudioMinutes`	integer	Maximum source length in minutes. Inputs longer than this fail before any cost is incurred. Default: `240` (4 h). Max: `600` (10 h)
`model`	enum	Deepgram model: `nova-3` (default), `nova-2`, `whisper-large`, `whisper-medium`, `whisper-base`
`language`	string	Language code (e.g., `en`, `es`, `fr`) — leave empty for auto-detection
`diarize`	boolean	Enable speaker diarization (default: `true`)
`smartFormat`	boolean	Apply smart formatting to transcript (default: `true`)

For mediaUrl, both audio-only files and video files are supported — the Actor extracts the audio track from video automatically before transcription.
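
Checking an input locally before starting a run can catch the most common mistakes early. The sketch below encodes the constraints from the table above. `validate_input` is a hypothetical helper, and the lower bound of 1 minute for `maxAudioMinutes` is an assumption (the table only states the default and the maximum).

```python
ALLOWED_MODELS = {"nova-3", "nova-2", "whisper-large", "whisper-medium", "whisper-base"}
MAX_AUDIO_MINUTES_LIMIT = 600  # documented maximum (10 h)


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    # Exactly one of the two source fields must be non-empty.
    sources = [key for key in ("videoUrl", "mediaUrl") if run_input.get(key)]
    if len(sources) != 1:
        problems.append("provide exactly one of videoUrl or mediaUrl")
    model = run_input.get("model", "nova-3")
    if model not in ALLOWED_MODELS:
        problems.append(f"unknown model: {model}")
    cap = run_input.get("maxAudioMinutes", 240)
    # bool is a subclass of int in Python, so it is rejected explicitly.
    is_int = isinstance(cap, int) and not isinstance(cap, bool)
    if not is_int or not 1 <= cap <= MAX_AUDIO_MINUTES_LIMIT:  # lower bound is an assumption
        problems.append("maxAudioMinutes must be an integer between 1 and 600")
    return problems
```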

Common mediaUrl uses:

Transcribe a video from X/Twitter — pass the direct https://video.twimg.com/... file URL
Transcribe a Facebook video — pass the direct https://video.*.fbcdn.net/... file URL (use the progressive .mp4 variant, not a fragmented/DASH one)
Transcribe an Instagram Reel or post video — pass the underlying https://...cdninstagram.com/... video URL
Transcribe your own hosted files — an S3 presigned URL, a public Google Drive / Dropbox direct link, or an Apify Key-Value Store record URL (handy for batch pipelines that store media before transcribing)

Output

The Actor stores two files in the Key-Value Store:

transcript.txt — Formatted transcript with speaker labels and paragraphs
transcript.json — Raw Deepgram API response with full metadata

And pushes a summary to the Dataset (visible in the Output tab and via the dataset API):

{
    "sourceType": "platform",
    "videoUrl": "https://www.youtube.com/watch?v=...",
    "videoId": "dQw4w9WgXcQ",
    "title": "Example Video Title",
    "channel": "Example Channel",
    "channelUrl": "https://www.youtube.com/@example",
    "uploadDate": "2024-09-12",
    "viewCount": 1234567,
    "thumbnail": "https://i.ytimg.com/vi/.../maxresdefault.jpg",
    "model": "nova-3",
    "language": "en",
    "diarize": true,
    "durationSeconds": 342.5,
    "transcript": "[Speaker 0] ...",
    "transcriptLength": 4521,
    "speakerCount": 3
}

sourceType is "platform" for videoUrl and "directUrl" for mediaUrl. Platform-specific fields (channel, viewCount, thumbnail, …) are null for direct media URL sources. For the delegated platforms, metadata availability varies:

YouTube: platform-specific fields are also null (downloading is delegated), and title is recovered on a best-effort basis.
TikTok: full metadata is still returned (title, channel, viewCount, thumbnail, uploadDate).
Douyin: full metadata is returned (title, channel, uploadDate, thumbnail), except viewCount, which Douyin's listing doesn't expose (it's null).
Bilibili: full metadata is returned (title, channel, viewCount, thumbnail, uploadDate).

Example transcript output:

[Speaker 0] Welcome everyone to today's discussion about artificial intelligence.

[Speaker 1] Thank you for having me. I think the most exciting development is in natural language processing.

[Speaker 0] I completely agree. The advances in transformer models have been remarkable.
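
For downstream processing, a transcript in the format shown above can be split into speaker turns. This is a minimal sketch that assumes every turn starts with a `[Speaker N]` label as in the example. `parse_turns` is a hypothetical helper, and text that carries no labels at all (for instance with diarization disabled) yields an empty list.

```python
import re

# Matches lines such as "[Speaker 0] Welcome everyone ..."
SPEAKER_LINE = re.compile(r"^\[Speaker (\d+)\]\s*(.*)$")


def parse_turns(transcript: str) -> list[tuple[int, str]]:
    """Split a speaker-labeled transcript into (speaker_id, text) turns."""
    turns = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines between turns
        match = SPEAKER_LINE.match(line)
        if match:
            turns.append((int(match.group(1)), match.group(2)))
        elif turns:
            # Unlabeled line: treat it as a continuation of the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
    return turns
```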

You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.

Output data fields

Field	Type	Description
`sourceType`	string	`"platform"` (yt-dlp source) or `"directUrl"`
`videoUrl`	string	Original source URL (platform URL for `videoUrl`, direct URL for `mediaUrl`)
`videoId`	string	Platform-specific video identifier (null for direct URL sources)
`title`	string	Video or episode title (null for non-platform sources)
`channel`	string	Channel or creator name (null for non-platform sources)
`channelUrl`	string	URL to the creator's channel (null for non-platform sources)
`uploadDate`	string	ISO date the source was published (null for non-platform sources)
`viewCount`	integer	View count at time of transcription (null for non-platform sources)
`thumbnail`	string	URL to video thumbnail image (null for non-platform sources)
`model`	string	Deepgram model used for transcription
`language`	string	Language code (auto-detected when not specified)
`diarize`	boolean	Whether speaker diarization was applied
`durationSeconds`	number	Audio duration in seconds
`transcript`	string	Full formatted transcript with speaker labels
`transcriptLength`	integer	Character count of the transcript
`speakerCount`	integer	Number of distinct speakers detected (1 if diarization disabled)
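
Because several of these fields are null for direct URL sources, it can help to load dataset items into a typed record that tolerates missing metadata. The class below is a sketch covering a subset of the fields. `TranscriptRecord` and `from_item` are hypothetical names, and the defaults used for missing values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptRecord:
    source_type: str
    video_url: str
    transcript: str
    duration_seconds: float
    speaker_count: int
    title: Optional[str] = None
    channel: Optional[str] = None
    view_count: Optional[int] = None

    @classmethod
    def from_item(cls, item: dict) -> "TranscriptRecord":
        """Build a record from one dataset item, tolerating null metadata."""
        return cls(
            source_type=item["sourceType"],
            video_url=item["videoUrl"],
            transcript=item.get("transcript") or "",
            duration_seconds=float(item.get("durationSeconds") or 0.0),
            speaker_count=int(item.get("speakerCount") or 0),
            title=item.get("title"),
            channel=item.get("channel"),
            view_count=item.get("viewCount"),
        )
```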

Use as a transcription step in your scraping pipeline

If you already scrape videos — with your own Apify Actor, a third-party scraper, or any custom tool — you don't need this Actor to find the media. Hand it the direct media URL via mediaUrl and it becomes the transcription stage at the end of your pipeline. This works for any platform's direct CDN URL (Facebook fbcdn.net, Instagram cdninstagram.com, X video.twimg.com, your own S3/CDN, …) as long as the link points straight at an audio or video file.

Chaining from another Apify Actor. Run your scraper first, then call this Actor with the URL it produced. From the Apify API or SDK:

POST https://api.apify.com/v2/acts/hgservices~speech-to-text/runs?token=<APIFY_TOKEN>

{
    "mediaUrl": "https://video.fhel-1.fna.fbcdn.net/o1/v/.../video.mp4?...",
    "language": ""
}

Leave language empty to auto-detect.

Or with the Apify Python client (the Actor slug is hgservices/speech-to-text):

from apify_client import ApifyClient
client = ApifyClient("<APIFY_TOKEN>")  # your Apify API token
run_input = {"mediaUrl": "https://video.fhel-1.fna.fbcdn.net/o1/v/.../video.mp4?..."}
run = client.actor("hgservices/speech-to-text").call(run_input=run_input)

You can wire this together without code using Apify task chaining / webhooks — fire this Actor on your scraper's SUCCEEDED event and map the scraped URL into mediaUrl.

Operational caveats for signed CDN URLs. Links from sites like Facebook and Instagram are signed and expire within hours, and are often bound to the IP/session that generated them. To keep delivery reliable:

Transcribe promptly — pass the URL to this Actor right after you scrape it; a stale link returns a 403 and the run fails.
Prefer the progressive .mp4 variant over fragmented/DASH (_dashinit.mp4) URLs — audio extraction can fail on fragmented-init segments.
Watch IP binding — if your scraper fetched the URL from one IP, fetching it here from a different one (including via proxy) may be rejected; pass a freshly minted, broadly fetchable URL.

The Dataset summary for these runs has sourceType: "directUrl" and platform metadata fields (channel, viewCount, thumbnail, …) set to null, since there's no platform page to scrape that from — you already hold that context upstream.
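
When chaining from a scraper, a small mapping step can turn scraped items into inputs for this Actor and drop links that the caveats above flag as likely to fail. This is a sketch only: `to_transcription_inputs` is a hypothetical helper, and `downloadUrl` is a placeholder for whatever field your scraper stores the direct media link in.

```python
def to_transcription_inputs(items: list[dict], url_field: str = "downloadUrl") -> list[dict]:
    """Map scraped items to Actor inputs, skipping URLs likely to fail."""
    inputs = []
    for item in items:
        url = item.get(url_field)
        if not url or not url.startswith(("http://", "https://")):
            continue  # nothing fetchable to transcribe
        if "_dashinit" in url:
            continue  # fragmented/DASH init segment: audio extraction can fail
        # An empty language leaves detection to the Actor.
        inputs.append({"mediaUrl": url, "language": ""})
    return inputs
```

Each returned dictionary can be passed as the run input for one run. Because signed links expire, build and submit these inputs right after scraping rather than batching them for later.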

Languages and choosing a model

The Actor handles non-English audio automatically — leave the language field empty and Deepgram detects the spoken language for you. If you already know the language, set it explicitly (e.g. es, fr, de, it, pt-BR, hi, ja, zh) for a small accuracy boost. Set language to multi for Nova-3 multilingual code-switching when a single recording mixes languages.

Choosing a model:

Model	Best for
`nova-3` (default)	Fast, high-accuracy English and major-language speech with the strongest punctuation and diarization
`whisper-large`	Music-adjacent audio, heavy accents, or less-common languages — the widest language coverage
`nova-2`, `whisper-medium`, `whisper-base`	Benchmarking or lighter-weight alternatives

You usually don't need to think about this: when the default model returns an empty transcript on auto-detected audio, the Actor retries once with whisper-large before giving up, so hard or non-English audio is recovered without you changing any settings. If a transcript still comes back empty after that, the audio almost certainly contains no transcribable speech (it's music, singing, or silence).
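
The fallback described above can be pictured with the following sketch. It is not the Actor's actual implementation: `transcribe_with_fallback` is a hypothetical name, and the `transcribe` callable stands in for whatever performs the real transcription. It shows the documented behaviour, which is one retry on `whisper-large` only when the language was auto-detected and the first result was empty.

```python
from typing import Callable, Optional


def transcribe_with_fallback(
    transcribe: Callable[[str, Optional[str]], str],
    language: Optional[str] = None,
    default_model: str = "nova-3",
    fallback_model: str = "whisper-large",
) -> tuple[str, str]:
    """Return (transcript, model_used), retrying once on an empty auto-detected result."""
    text = transcribe(default_model, language)
    if text.strip() or language:
        # Got text, or the language was forced: no fallback in this sketch.
        return text, default_model
    # Empty result on auto-detected audio: retry once on the fallback model.
    return transcribe(fallback_model, language), fallback_model
```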

Pricing / Cost estimation

This Actor uses pay-per-event pricing: you only pay for what you use, with no Apify compute units, proxy bandwidth, or Deepgram fees to track separately. Apart from the downloader fee for delegated sources described below, everything is bundled into two events:

Event	Price (USD)	When charged
Video processed	$0.02	Once per run after the Actor processes your input. Also charged when a submitted post contains no media to transcribe (e.g. an Instagram photo/carousel) — the download attempt still runs. Not charged for delegated sources — YouTube, TikTok, Instagram, Douyin, and Bilibili (see below)
Minute of audio transcribed	$0.015	Per minute of audio (rounded up), only after transcription succeeds

YouTube, TikTok, Instagram, Douyin, and Bilibili sources: the $0.02 per-video fee is waived, but downloading runs through a dedicated downloader Actor (streamers/youtube-video-downloader for YouTube, clockworks/tiktok-scraper for TikTok, apify/instagram-scraper for Instagram, hgservices/douyin-scraper for Douyin, hgservices/Bilibili-Video-Scraper for Bilibili). That downloader bills your Apify account directly at its own pay-per-event rate (a fraction of a cent, up to about $0.0017 for a typical short video), in addition to the per-minute transcription cost above. Net effect: these transcriptions usually cost about the same as the estimates below, with the small download fee replacing the $0.02 processing fee.

Example costs (non-delegated sources):

Video length	Total cost
5 min	~$0.10
30 min	~$0.47
1 h	~$0.92
2 h	~$1.82
4 h (default cap)	~$3.62

The default maxAudioMinutes cap of 240 minutes (4 hours) protects you from accidentally transcribing a marathon livestream. You can raise it up to 600 minutes (10 hours) if you need to, or lower it for tighter cost control. Videos that exceed the cap fail before any charge.
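
The example costs follow directly from the two event prices: $0.02 per run plus $0.015 per started minute. The estimator below is a sketch with a hypothetical name, `estimate_cost_usd`. It covers only this Actor's own charges, so for delegated sources the separate downloader fee is not included. Amounts are held in thousandths of a dollar to avoid floating-point drift.

```python
import math

VIDEO_FEE_MILLS = 20   # $0.02 per run, in thousandths of a dollar
PER_MINUTE_MILLS = 15  # $0.015 per started minute of audio


def estimate_cost_usd(duration_seconds: float, delegated: bool = False) -> float:
    """Estimate this Actor's charges for one successful transcription."""
    minutes = math.ceil(duration_seconds / 60)  # billed per minute, rounded up
    mills = minutes * PER_MINUTE_MILLS
    if not delegated:
        mills += VIDEO_FEE_MILLS  # waived for delegated sources
    return mills / 1000
```

For a 30-minute non-delegated source this gives 30 × $0.015 + $0.02 = $0.47, matching the table.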

Trying it for free — New Apify accounts include free monthly platform credits, which are enough to transcribe several hours of audio before any out-of-pocket cost.

Tips

Set maxAudioMinutes lower if you're running this on a schedule or via API to cap worst-case cost per run
Shorter videos process faster — Videos under 30 minutes typically transcribe in under a minute
Stick with Nova-3 for most audio — It's Deepgram's most advanced model with superior punctuation and speaker detection. For music-heavy or less-common-language audio, pick whisper-large (though the Actor also falls back to it automatically on an empty result)
Set the language explicitly if you know it — this improves accuracy for non-English content; leave it empty to auto-detect
Disable diarization for single-speaker videos to slightly reduce transcription time

Troubleshooting: common errors and what to do

First, where to look. A run that can't produce a transcript still finishes as Succeeded (so it never counts as a platform failure) — the reason is written to the run's status message and to the statusMessage / transcriptStatus fields of the Dataset record, with an empty transcript. So if you got no text, open the run and read the status message: it names the cause and the fix. The most common ones are below.

Source and download problems (mostly `mediaUrl`)

Message you'll see	What it means	What to do
"Facebook/Instagram returned HTTP 403 for this CDN link"	The raw `fbcdn.net` / `cdninstagram.com` link expired or is bound to the IP/session that opened the post.	Paste the original facebook.com / instagram.com page URL into `videoUrl` instead of the raw CDN link — the Actor resolves a fresh stream itself.
"Couldn't extract audio from this Facebook/Instagram CDN file"	The link is a fragmented (DASH) stream that carries only part of the media, with no complete audio track.	Same fix: use the page URL in `videoUrl`, or pass the progressive `.mp4` variant rather than a `_dashinit.mp4` one.
"mediaUrl returned HTTP 403 / 404"	The link is wrong, expired, private, or not publicly reachable from the server.	Open it in a private browser window; if it doesn't play, get a fresh, public, direct file URL and re-submit promptly.
"mediaUrl did not return a media file…"	The URL served an HTML page (a login, expired-link, or block page) instead of a file.	Make sure the link points straight at the file, not a webpage, and that it's still live and public.
"Downloaded file is not a valid audio or video file"	The download was corrupt or truncated (sometimes a proxy reset mid-stream).	Re-fetch a fresh URL and retry; confirm the link points to a real media file.
"Could not download mediaUrl after N attempts"	Repeated network/proxy resets, or the host is down.	Retry shortly; if it persists the host may be blocking automated fetches.

Platform-page problems (`videoUrl`)

Message you'll see	What it means	What to do
"YouTube blocked this request as automated traffic"	YouTube's bot wall.	Enable Apify proxy (`proxyConfiguration.useApifyProxy`) or retry later.
"This content requires a logged-in account…"	Login/age wall — common on TikTok, Instagram, Facebook, Reddit for non-public posts.	Use publicly accessible content; the Actor downloads anonymously and can't authenticate.
"Video is private / unavailable / age-restricted"	A restriction on the source itself.	Nothing the Actor can change — use content that's public and unrestricted.
"This post contains no video or audio to transcribe…"	The post has no media track — Instagram photo and carousel-image posts carry no video or audio.	Submit a post that contains a video or audio clip. The $0.02 processing fee still applies, since the Actor attempted the download.
"Video is geo-restricted…"	The content isn't available from the Actor's region.	Enable Apify proxy with a different country.
"Source platform rate-limited the request (HTTP 429)"	Too many requests to the platform.	Retry later or enable Apify proxy.
"URL is not supported"	Wrong input field, or a site yt-dlp doesn't support.	Use `videoUrl` for platform pages (YouTube, Vimeo, etc.) and `mediaUrl` for direct file links.

Transcript, input, and limit problems

Message you'll see	What it means	What to do
"Transcription returned no text…" (empty)	No speech was recognized. The Actor already retried auto-detected audio on `whisper-large`.	Usually the audio is music, singing, or silence. If it is speech in another language and you forced `language`, clear it to auto-detect. See the empty transcript FAQ.
"Deepgram rejected the audio…"	The audio was malformed, too large, or in an unsupported format.	Re-encode to a standard format (mp3 / m4a / wav) and retry.
"Source duration … exceeds the configured maxAudioMinutes cap"	The source is longer than your cap. No charge is incurred.	Raise `maxAudioMinutes` (up to 600) if you intended to transcribe something this long.
"Missing input" / "Conflicting input"	You provided neither or both of `videoUrl` and `mediaUrl`.	Provide exactly one of the two.

Tip for signed CDN links (Facebook, Instagram, X): these expire within hours and are often IP-bound, so transcribe promptly and prefer the page URL via videoUrl. See Operational caveats for signed CDN URLs for the full rundown.
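
Since a run that produces no transcript still finishes as Succeeded, automated pipelines need to inspect the Dataset records themselves. The sketch below flags records with an empty transcript and marks those whose message matches a failure the tables above suggest retrying. `failed_transcripts` and `is_retryable` are hypothetical helpers, and the substring markers are assumptions taken from the table wording, so the real status messages may be phrased differently.

```python
# Substrings from messages in the tables above whose suggested fix includes retrying.
RETRYABLE_MARKERS = (
    "Could not download mediaUrl after",
    "rate-limited the request",
    "blocked this request as automated traffic",
)


def is_retryable(status_message: str) -> bool:
    """True when the message matches a failure the guide says to retry."""
    return any(marker in status_message for marker in RETRYABLE_MARKERS)


def failed_transcripts(items: list[dict]) -> list[dict]:
    """Summarize every dataset record that came back without transcript text."""
    failures = []
    for item in items:
        if (item.get("transcript") or "").strip():
            continue  # transcript present: nothing to report
        reason = item.get("statusMessage") or "no status message recorded"
        failures.append(
            {"url": item.get("videoUrl"), "reason": reason, "retry": is_retryable(reason)}
        )
    return failures
```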

Still stuck? If you're hitting a problem this guide doesn't resolve, you can reach the creator directly at harish@harishgarg.com — include the link you were trying to transcribe and the status message you saw, and I'll help you sort it out.

FAQ, disclaimers, and support

Why Deepgram Nova-3 instead of Whisper? Nova-3 is Deepgram's latest model and generally produces stronger punctuation, smart formatting (numbers, dates, currency), and native speaker diarization compared to open-source Whisper variants — without you having to stitch those features together yourself. Whisper models are still available via the model input if you prefer them for specific languages or want to benchmark the difference. You don't have to choose defensively, either: if Nova-3 returns an empty transcript on auto-detected audio, the Actor automatically retries with whisper-large, which has broader language coverage.

I got an empty transcript — what happened? This means no speech was recognized. The Actor already retries auto-detected audio on whisper-large before reporting empty, so a persistently empty result almost always means the source is music, singing, instrumental, or silence (speech models don't transcribe sung lyrics). If you forced a specific language that doesn't match the audio, clear it to auto-detect instead.

Is this Actor legal to use? You are responsible for complying with the Terms of Service of any source platform you transcribe (YouTube, Vimeo, podcast hosts, etc.) and applicable copyright laws. Only transcribe content you have rights to or that is publicly available for personal/research use.

What sources are supported? Two input modes: (1) videoUrl — any of the 1000+ sites yt-dlp supports, including YouTube, Vimeo, TikTok, Douyin, Bilibili, SoundCloud, X, and podcast RSS episode URLs; (2) mediaUrl — any direct public URL to an audio or video file. Video files have their audio track extracted automatically.

Can I transcribe private or unlisted content? Host the file behind a direct HTTP(S) URL (S3 presigned URL, a public Dropbox/Google Drive direct link, your own server) and pass it as mediaUrl. The URL only needs to be reachable for the duration of the run.

For issues, feature requests, or feedback, please use the Issues tab on this Actor's page. For custom solutions or enterprise needs, contact Apify support.
