Transcribe Video to Text & Audio to Text — 99+ Languages
Pricing
Pay per event
Transcribe video to text and audio to text in bulk on Apify. 99+ languages, word-level timestamps, speaker diarization, SRT/VTT export. Try free.
Developer
SIÁN OÜ
Transcribe Video to Text and Audio to Text — Bulk Transcription
Transcribe video to text and audio to text in bulk. 99+ languages, word-level timestamps, speaker diarization, and SRT/VTT subtitle export. Process 100+ files per hour with 10× parallel processing. Free tier available.
How to transcribe video to text in 4 steps
- Paste your URLs or upload files: drop direct audio or video URLs into the `audioUrls` field, or upload files directly via `audioFiles`. Both lists are processed together.
- Pick an output format: JSON for full data (transcript + segments + word timestamps), or SRT/VTT for subtitle files. Toggle speaker diarization, translation to English, and EU-region processing as needed.
- Run the actor: files are processed in parallel, 10 at a time on the paid tier (3 on the free tier); 100 files take about an hour on the paid tier.
- Download results: every file lands in the dataset with the transcript, detected language, duration in seconds, and per-segment timestamps.
Supported formats: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM. Max 1 GB per file on the paid tier.
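A pre-flight check can catch unsupported files before they are submitted. The sketch below is only an illustration and not part of the actor: the helper name `is_supported` is hypothetical, and it assumes the file type can be read from the extension in the URL path, which will not hold for every URL.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions taken from the supported-formats list above.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".aac", ".opus", ".ogg",
    ".m4a", ".mp4", ".mpeg", ".mov", ".webm",
}


def is_supported(url: str) -> bool:
    """Return True if the URL path ends in a supported extension (hypothetical helper)."""
    # urlparse drops the query string, so "?download=1" does not affect the suffix.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS


# Example URLs are placeholders.
urls = [
    "https://example.com/recording.mp3",
    "https://example.com/clip.MOV?download=1",
    "https://example.com/notes.pdf",
]
accepted = [u for u in urls if is_supported(u)]
```

Only the URLs in `accepted` would then go into `audioUrls`; links without a file extension would need a different check, such as inspecting the response's content type.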
Example output
A typical successful transcription returns:
    {
      "transcript": "the ugliest human emotion that exists envy nobody ever wants to admit that they're envious...",
      "detected_language": "en",
      "duration": 57.0775,
      "segments": [
        {
          "id": 0,
          "text": "the ugliest human emotion that exists envy nobody ever wants to admit...",
          "start": 0.26,
          "end": 20.56,
          "speaker": "SPEAKER_00",
          "language": "en",
          "words": [
            { "word": "the", "start": 0.26, "end": 0.26, "speaker": "SPEAKER_00" },
            { "word": "ugliest", "start": 0.26, "end": 0.78, "speaker": "SPEAKER_00" },
            { "word": "human", "start": 0.78, "end": 1.20, "speaker": "SPEAKER_00" }
          ]
        }
      ],
      "speakers": ["SPEAKER_00"],
      "languages": ["en"],
      "fileSizeMB": 0.92,
      "success": true
    }
Every result includes the full transcript, segment-level timestamps, optional word-level timestamps, language detection, audio duration in seconds, file size, and (when speaker diarization is enabled) speaker labels per segment and per word.
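Word-level timestamps make it possible to locate a quote in the audio. The following is a minimal sketch of that idea, assuming the result has the shape shown in the example output; `find_phrase` and `normalize` are hypothetical helpers, not functions provided by the actor.

```python
import string

# Trimmed copy of the example output above (only the fields this sketch reads).
result = {
    "segments": [
        {
            "id": 0,
            "start": 0.26,
            "end": 20.56,
            "words": [
                {"word": "the", "start": 0.26, "end": 0.26, "speaker": "SPEAKER_00"},
                {"word": "ugliest", "start": 0.26, "end": 0.78, "speaker": "SPEAKER_00"},
                {"word": "human", "start": 0.78, "end": 1.20, "speaker": "SPEAKER_00"},
            ],
        }
    ]
}


def normalize(token: str) -> str:
    """Lowercase a token and strip surrounding whitespace and punctuation."""
    return token.strip().strip(string.punctuation).lower()


def find_phrase(result: dict, phrase: str):
    """Return (start, end) in seconds for the first match of `phrase`, or None."""
    words = [w for seg in result.get("segments", []) for w in seg.get("words", [])]
    target = [normalize(t) for t in phrase.split()]
    if not target:
        return None
    tokens = [normalize(w["word"]) for w in words]
    # Slide a window of len(target) words across the transcript.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    return None


clip = find_phrase(result, "Ugliest human")
```

Here `clip` holds the start time of "ugliest" and the end time of "human", which could be passed to a clip-extraction tool.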
Speaker diarization
Toggle the Speaker Diarization input to identify who's speaking in multi-person audio. Each segment and each word receives a speaker label (SPEAKER_00, SPEAKER_01, …) so you can build clean transcripts of meetings, interviews, panels, and podcast conversations. Diarization is powered by pyannote-audio and billed at $0.0001 per audio second, only when enabled.
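One way to turn labeled segments into a readable transcript is to merge consecutive segments from the same speaker into a single turn. The sketch below assumes segments carry the `speaker`, `start`, `end`, and `text` fields shown in the example output; the `speaker_turns` helper and the sample segments are made up for illustration.

```python
def speaker_turns(segments: list) -> list:
    """Merge consecutive segments that share a speaker label into one turn."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


# Illustrative segments, not real actor output.
sample_segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Thanks for joining."},
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 4.0, "text": "Let's get started."},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 6.0, "text": "Sounds good."},
]

turns = speaker_turns(sample_segments)
lines = [f"{t['speaker']}: {t['text']}" for t in turns]
```

The three sample segments collapse into two turns, one per speaker, which can then be printed or exported line by line.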
SRT / VTT subtitle export
Set the Output Format to srt or vtt and the actor returns a ready-to-use subtitle file with cue timing inferred from the transcription. Useful for:
- Adding subtitles to YouTube videos
- Generating closed captions for accessibility
- Translating subtitles via the Translate to English option
Set Timestamp Granularities to word to get cue precision down to individual words.
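The actor returns subtitle files itself, so the following is only a sketch of how segment timestamps map onto SRT cues, which may help when cues need custom post-processing. It assumes segments with `start`, `end`, and `text` fields as in the example output; both function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Build an SRT document with one numbered cue per segment."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


srt = segments_to_srt([{"start": 0.26, "end": 20.56, "text": "the ugliest human emotion"}])
```

WebVTT cues follow the same pattern, with a `WEBVTT` header line and a period instead of a comma before the milliseconds.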
Why teams choose this actor
- ✅ 99+ languages — automatic detection across English, Spanish, French, Mandarin, Arabic, Portuguese, and 90+ more
- 🎤 Speaker diarization powered by pyannote-audio — segment-level and word-level speaker labels
- ⏱️ Word-level timestamps on every transcription: `{word, start, end, speaker}` per word, ready for clip extraction and quote search
- 🎬 SRT and VTT subtitles included on every successful run: no extra step, no extra charge
- 🚀 10× parallel on the paid tier — 100 files in ~1 hour vs ~16 hours sequential
- 🌐 Translate to English in the same run for non-English audio
- 🇪🇺 EU-region processing toggle for GDPR-aligned workflows
- 💰 Pay per audio second — no subscriptions, no minimums; only pay for the audio you actually transcribe
Use cases
- 🎙️ Podcasters generating show notes, blog repurposing, and YouTube captions for video podcast episodes
- 💼 Sales and ops teams archiving meeting recordings (Zoom, Teams, Meet) for coaching, QA, and compliance
- 📰 Journalists and qualitative researchers turning interview tape into searchable transcripts with attributed quotes
- 🎓 Students and educators transcribing lectures, seminars, and recorded study sessions
- 🎬 Video editors and content creators producing accurate caption tracks for long-form content
- 📊 Customer support teams transcribing support call recordings for sentiment and CSAT analysis
- 🧪 AI / LLM developers building RAG pipelines or training data from spoken-word audio sources
No matter which use case you fall into, this actor handles it — pay only for the audio seconds you actually transcribe.
Pricing & tiers
Pay only for the audio seconds you actually transcribe. No subscriptions, no minimums.
| FREE tier | PAID tier |
|---|---|
| Perfect for testing and small jobs | Built for production volume |
| Up to 5 URLs per run | Unlimited URLs per run |
| 50 MB max per file | 1 GB max per file |
| 200 MB / 20 minutes monthly | Unlimited monthly volume |
| 3 concurrent files | 10 concurrent files (10× parallel) |
| No credit card required | $0.0005 per audio second |
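As a rough illustration of the free-tier limits in the table, a planned run can be checked before it is submitted. This sketch covers only the per-run URL count and the per-file size; it does not track the monthly 200 MB / 20 minute allowance, and `free_tier_problems` is a hypothetical helper that assumes file sizes are already known.

```python
# Limits copied from the FREE tier column above.
FREE_TIER = {"max_urls_per_run": 5, "max_file_mb": 50}


def free_tier_problems(files: list) -> list:
    """Return human-readable problems for a run; `files` is a list of (url, size_mb)."""
    problems = []
    if len(files) > FREE_TIER["max_urls_per_run"]:
        problems.append(
            f"{len(files)} files exceeds the {FREE_TIER['max_urls_per_run']}-URL limit per run"
        )
    for url, size_mb in files:
        if size_mb > FREE_TIER["max_file_mb"]:
            problems.append(f"{url} is {size_mb} MB, over the {FREE_TIER['max_file_mb']} MB limit")
    return problems


# Placeholder URLs and sizes.
planned = [
    ("https://example.com/a.mp3", 12.0),
    ("https://example.com/b.mp4", 80.0),
]
problems = free_tier_problems(planned)
```

An empty list means the run fits the per-run free-tier limits; otherwise the messages say what to trim or move to the paid tier.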
Optional add-ons (only billed when enabled):
| Feature | Price |
|---|---|
| Speaker diarization | $0.0001 per audio second |
| Translate to English | $0.0003 per audio second |
| EU-region processing | $0.0007 per audio second (replaces base $0.0005) |
A 60-minute meeting with diarization on the paid tier costs approximately $2.16 ($1.80 transcription + $0.36 diarization).
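The arithmetic behind that figure can be sketched as a small estimator. It uses the per-second prices from the tables above and assumes that add-ons are simply added to the base rate and that EU-region processing replaces the base rate, as the add-on table states; `estimate_cost` is a hypothetical helper and actual billing may round differently.

```python
from decimal import Decimal

# Prices in USD per audio second, copied from the pricing tables above.
BASE = Decimal("0.0005")
EU_BASE = Decimal("0.0007")  # replaces the base rate when EU processing is on
DIARIZATION = Decimal("0.0001")
TRANSLATION = Decimal("0.0003")


def estimate_cost(audio_seconds, diarization=False, translation=False, eu_region=False) -> Decimal:
    """Estimate the paid-tier cost in USD, rounded to cents."""
    rate = EU_BASE if eu_region else BASE
    if diarization:
        rate += DIARIZATION
    if translation:
        rate += TRANSLATION
    # Decimal avoids binary floating-point drift in the multiplication.
    return (Decimal(str(audio_seconds)) * rate).quantize(Decimal("0.01"))


# 60-minute meeting with diarization: 3600 s x ($0.0005 + $0.0001) = $2.16
meeting = estimate_cost(60 * 60, diarization=True)
```

The same call with `eu_region=True` prices the base at $0.0007 per second instead of $0.0005.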
Integration examples
JavaScript / Node.js
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

    const run = await client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call({
        audioUrls: ['https://example.com/recording.mp3'],
        speakerDiarization: true,
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items[0].transcript);
    console.log(items[0].srt); // ready-to-use SRT subtitle string
Python
    from apify_client import ApifyClient

    client = ApifyClient('YOUR_APIFY_TOKEN')

    run = client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call(run_input={
        'audioUrls': ['https://example.com/recording.mp3'],
        'speakerDiarization': True,
    })

    items = client.dataset(run['defaultDatasetId']).list_items().items
    print(items[0]['transcript'])
    print(items[0]['vtt'])  # ready-to-use WebVTT subtitle string
cURL
    curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
      -H 'Content-Type: application/json' \
      -d '{"audioUrls": ["https://example.com/recording.mp3"], "speakerDiarization": true}'
n8n / Zapier / Make
Wire this actor as a downstream step on any "new file uploaded" or "webhook received" trigger. The dataset record returned per item includes transcript, segments[].words[], srt, and vtt — drop them into Notion, Slack, Google Sheets, Airtable, or your CRM with no transformation step.
FAQ
How accurate is the transcription?
Powered by an industrial speech-to-text pipeline tuned for natural conversation. Accuracy is typically 95–99% on clean audio in supported languages, lower on noisy recordings or strong accents. Word-level timestamps are returned even when accuracy is imperfect, so you can verify and correct faster than transcribing from scratch.
What audio and video formats are supported?
MP3, M4A, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, MPEG, WebM. Max 50 MB per file on the free tier, 1 GB per file on the paid tier.
Can I transcribe non-English audio?
Yes: auto-detection covers 99+ languages including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, Hindi, and many more. Toggle Translate to English to receive an English transcript alongside the timestamps.
Is speaker diarization included?
Yes, opt-in via the Speaker Diarization toggle. Each segment and word gets labeled SPEAKER_00, SPEAKER_01, etc. Powered by pyannote-audio. Billed at $0.0001 per audio second only when enabled.
How does pricing work?
Pay-per-audio-second. The free tier covers small jobs and testing without a credit card. The paid tier is $0.0005 per second of audio (plus optional add-ons for diarization, translation, and EU processing). No subscriptions, no minimums.
Can I use this in n8n, Zapier, or Make?
Yes. The actor exposes a standard Apify run/dataset API. Use any "trigger → run actor → use dataset items" pattern. The dataset record includes transcript, segments[].words[], srt, and vtt ready to feed into downstream tools.
Where does my audio data go?
Audio is sent to a transcription pipeline running on AWS (US region by default). Toggle EU-region processing for GDPR-aligned EU-only routing. Files and transcripts are not retained after the run completes.
How long does a transcription take?
A 1-minute audio clip usually finishes in 5–15 seconds. A 60-minute meeting takes 1–3 minutes on the paid tier. Bulk batches of 100 files complete in ~1 hour with 10× parallel processing.
Legal disclaimer
Use this actor only on audio you have rights to transcribe — your own recordings, content with consent, or properly licensed media. The actor does not retain audio or transcripts beyond the run's lifetime. EU-region processing is available via the EU Processing toggle for GDPR-aligned workflows. SIÁN Agency provides this actor as-is; users are responsible for the legal use of transcribed content.
Support
Join the Telegram support group, email support@sian-agency.online, or open an issue on the SIÁN Agency Apify Store page.
More from SIÁN Agency
Platform-specific scrapers + transcribers:
- Instagram AI Transcript Extractor
- Best TikTok AI Transcript Extractor
- YouTube Shorts AI Transcript Extractor
- Facebook AI Transcript Extractor
Browse the full SIÁN Agency Apify Store for all available actors.