Transcribe Video to Text & Audio to Text — 99+ Languages
Pricing
Pay per event
Transcribe video to text and audio to text in bulk on Apify. 99+ languages, word-level timestamps, speaker diarization, SRT/VTT export. Try free.
Rating: 5.0 (2)
Developer: SIÁN OÜ
Maintained by Community
Actor stats: 4 bookmarked · 69 total users · 17 monthly active users · last modified 3 days ago
Transcribe Audio & Video to Text — 99+ Languages, SRT/VTT 🚀
🎉 NEW: Paste YouTube · TikTok · Instagram links directly — no manual download
Built for podcasters, journalists, sales/ops teams, video editors, and AI/RAG developers
📋 Overview
Bulk-transcribe any audio or video to text — direct files (MP3, WAV, MP4, MOV, M4A, OPUS…) or YouTube / TikTok / Instagram URLs, in 99+ languages, with speaker diarization, word-level timestamps, and ready-to-publish SRT/VTT subtitles.
Why choose this actor:
- ✅ 99+ languages auto-detected — English, Spanish, French, Mandarin, Arabic, Portuguese, Hindi, and 90+ more, no manual selection needed
- ⚡ 10× parallel on the paid tier — 100 files in ~1 hour vs ~16 hours sequential
- 🎯 95–99% accuracy on clean audio, with word-level timestamps `{word, start, end, speaker}` on every transcript
- 💰 Pay per audio second: $0.0005/sec, no subscriptions, no minimums; only billed for audio actually transcribed
- 💎 SRT and VTT subtitles included on every successful run — no extra step, no extra charge
- ✨ NEW (PAID): Paste YouTube, TikTok, or Instagram URLs directly — the actor resolves the media and transcribes in one pass
✨ Features
- 🌍 99+ Languages: Auto-detect or force a specific language from a curated dropdown
- 🎤 Speaker Diarization: Per-segment and per-word `SPEAKER_00`, `SPEAKER_01`, … labels (pyannote-audio)
- ⏱️ Word-Level Timestamps: Every word ships with `start`, `end`, and (optionally) `speaker` for clip-accurate editing
- 🎬 SRT + VTT Output: Ready-to-use subtitle file strings on every successful run
- 🔗 Multi-Source Inputs: Direct file URLs and uploads on every tier; YouTube, TikTok, Instagram URLs on the PAID tier — mixed in one run
- 🌐 Translate to English: Optional one-pass translation for non-English audio
- 🇪🇺 EU-Region Processing: Toggle for GDPR-aligned routing
- 🚀 Parallel Bulk Processing: 10 concurrent files on the paid tier
- 🛡️ Hard FREE-Tier Precheck: Never charged if a file exceeds the free-tier cap
- 📊 45+ Dataset Fields: Transcript, segments, words, speakers, languages, SRT, VTT, duration, file size, source platform, and more
🎬 Quick Start
Paste a direct media URL or a YouTube/TikTok/Instagram link, hit Run, and get back a structured transcript with subtitles. No setup, no SDK install.
    curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/runs?token=YOUR_TOKEN' \
      -H 'Content-Type: application/json' \
      -d '{"audioUrls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"]}'
🚀 Getting Started (3 Simple Steps)
Step 1: Add your audio or video
Paste one or more URLs into the "Audio/Video URLs" field (direct files, YouTube, TikTok, or Instagram links), or upload files from your computer via the "Upload Audio File or Video to Text" field. Both lists are processed together.
Step 2: Pick the extras you need
Toggle Speaker Diarization, Translate to English, or EU-Region Processing (all optional, all billed only when enabled).
Step 3: Run the actor
Files process in parallel on the paid tier (10 concurrent). Every result lands in the dataset with a full transcript, segments, word-level timestamps, and SRT + VTT subtitle strings.
That's it! In under 5 minutes, you'll have:
- A clean text transcript per file
- Ready-to-publish `.srt` and `.vtt` subtitle strings
- Word-level timestamps and speaker labels for editing, search, and clip extraction
📥 Input Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| `audioUrls` | array | No | Direct media URLs (any tier) or YouTube / TikTok / Instagram links (PAID tier only), one per line |
| `audioFiles` | array | No | Audio/video files uploaded from your computer; processed alongside `audioUrls` |
| `language` | string | No | Source language; `auto` for auto-detect (99+ options) |
| `translateToEnglish` | boolean | No | Translate non-English audio to English (PAID, +$0.0003/sec) |
| `useEuServers` | boolean | No | Process inside the EU for GDPR alignment (PAID, replaces base rate with $0.0007/sec) |
| `speakerDiarization` | boolean | No | Identify and label different speakers (PAID, +$0.0001/sec) |
Example (single direct URL):

    {
      "audioUrls": ["https://example.com/podcast.mp3"]
    }

Example (YouTube + Instagram + diarization):

    {
      "audioUrls": [
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
        "https://www.instagram.com/reel/Cxyz123/"
      ],
      "speakerDiarization": true
    }

Example (bulk + language + translation):

    {
      "audioUrls": [
        "https://example.com/interview-es.m4a",
        "https://example.com/lecture-fr.mp4"
      ],
      "language": "auto",
      "translateToEnglish": true
    }

- Supported formats: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM
- FREE tier: 1 direct file URL (or uploaded file) per run, ≤5 MB or ≤60 seconds per file (hard precheck: never charged if over cap). YouTube / TikTok / Instagram links require PAID.
- PAID tier: unlimited URLs and files (direct, YouTube, TikTok, Instagram), up to 1 GB per file.
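The tier rules above can be checked on the client before a run is started. The following is a minimal sketch, not the actor's own precheck: the function names and the host list are hypothetical, and it assumes a file is over the FREE cap when either the size or the duration limit is exceeded.

```python
from urllib.parse import urlparse

# Limits copied from the tier description above.
FREE_MAX_MB = 5
FREE_MAX_SECONDS = 60

# Hypothetical host list used to spot platform links; the actor's real
# routing logic is not documented here.
PLATFORM_HOSTS = ("youtube.com", "youtu.be", "tiktok.com", "instagram.com")


def is_platform_url(url: str) -> bool:
    """Return True if the URL host is YouTube, TikTok, or Instagram."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in PLATFORM_HOSTS)


def free_tier_problems(urls, size_mb=None, duration_s=None):
    """List the reasons a run would not fit the FREE tier (empty list = fits).

    Assumption: a file is over the cap when EITHER limit is exceeded.
    """
    problems = []
    if len(urls) > 1:
        problems.append("FREE tier allows 1 file per run")
    if any(is_platform_url(u) for u in urls):
        problems.append("platform links require the PAID tier")
    if size_mb is not None and size_mb > FREE_MAX_MB:
        problems.append("file exceeds 5 MB")
    if duration_s is not None and duration_s > FREE_MAX_SECONDS:
        problems.append("file exceeds 60 seconds")
    return problems
```

An empty list means the input looks FREE-tier compatible; the actor's hard precheck still has the final say.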
📤 Output
Results are saved to the Apify dataset with 20+ fields per row:
| Field | Type | Description |
|---|---|---|
| `transcript` | string | Complete text transcription |
| `detected_language` | string | ISO language code (e.g. `en`, `es`) |
| `duration` | number | Audio length in seconds |
| `fileSizeMB` | number | File size in megabytes |
| `segments` | array | Timestamped segments with `text`, `start`, `end`, `speaker`, `language`, and `words[]` (word-level timing) |
| `srt` | string | Ready-to-use SRT subtitle file content |
| `vtt` | string | Ready-to-use WebVTT subtitle file content |
| `speakers` | array | Unique speaker labels (when diarization is enabled) |
| `languages` | array | Languages detected across segments |
| `sourcePlatform` | string | How input was routed: `direct`, `youtube`, `tiktok`, `instagram` |
| `mediaUrl` | string | Resolved direct media URL sent to the transcription engine |
| `mediaTitle` | string | Title from the source platform (when available) |
| `mediaAuthor` | string | Author/uploader from the source platform (when available) |
| `transcriptSource` | string | `engine` (word-level + speakers) or `captions` (line-level only) |
| `inputUrl` | string | Original URL you submitted |
| `success` | boolean | Whether the transcription completed |
| `processedAt` | string | ISO 8601 timestamp |
Example:
    {
      "transcript": "the ugliest human emotion that exists envy nobody ever wants to admit that they're envious...",
      "detected_language": "en",
      "duration": 57.0775,
      "fileSizeMB": 0.92,
      "segments": [
        {
          "id": 0,
          "text": "the ugliest human emotion that exists envy nobody ever wants to admit...",
          "start": 0.26,
          "end": 20.56,
          "speaker": "SPEAKER_00",
          "language": "en",
          "words": [
            { "word": "the", "start": 0.26, "end": 0.26, "speaker": "SPEAKER_00" },
            { "word": "ugliest", "start": 0.26, "end": 0.78, "speaker": "SPEAKER_00" },
            { "word": "human", "start": 0.78, "end": 1.20, "speaker": "SPEAKER_00" }
          ]
        }
      ],
      "speakers": ["SPEAKER_00"],
      "languages": ["en"],
      "srt": "1\n00:00:00,260 --> 00:00:20,560\nthe ugliest human emotion that exists envy...\n\n",
      "vtt": "WEBVTT\n\n1\n00:00:00.260 --> 00:00:20.560\nthe ugliest human emotion that exists envy...\n\n",
      "sourcePlatform": "direct",
      "mediaUrl": "https://example.com/podcast.mp3",
      "success": true,
      "processedAt": "2026-05-21T12:00:00Z"
    }
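A row shaped like the example above can be turned into speaker turns by walking `segments[].words[]`. This is a sketch under assumptions: the helper name `speaker_turns` is hypothetical, and it assumes each word carries `word`, `start`, and `end`, with `speaker` present only when diarization was enabled (falling back to the segment's `speaker` otherwise).

```python
def speaker_turns(item):
    """Merge consecutive words from the same speaker into turns.

    Returns a list of dicts with 'speaker', 'start', 'end', and 'text'.
    """
    turns = []
    for segment in item.get("segments", []):
        for w in segment.get("words", []):
            # Prefer the per-word label; fall back to the segment label.
            speaker = w.get("speaker", segment.get("speaker"))
            text = w["word"].strip()
            if turns and turns[-1]["speaker"] == speaker:
                # Same speaker keeps talking: extend the current turn.
                turns[-1]["end"] = w["end"]
                turns[-1]["text"] += " " + text
            else:
                turns.append(
                    {"speaker": speaker, "start": w["start"], "end": w["end"], "text": text}
                )
    return turns
```

Each turn keeps its own `start` and `end`, so attributed quotes can be traced back to a position in the recording.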
💼 Use Cases & Examples
1. Podcast Show Notes & Repurposing
Podcasters turning long episodes into searchable show notes, blog posts, and social clips.
Input: A direct MP3 URL (or YouTube link if the episode is also on YouTube)
Output: Full transcript + per-word timestamps + SRT/VTT for the video version
Use: Generate timestamped show notes, cut pull-quote clips for Reels/Shorts, publish a searchable blog version of every episode.
2. Meeting & Sales-Call Archival
Sales and ops teams archiving Zoom, Teams, and Meet recordings for coaching, QA, and compliance.
Input: Recordings uploaded via audioFiles or hosted on cloud storage
Output: Transcript with speaker labels (SPEAKER_00, SPEAKER_01…) and per-word timing
Use: Build a searchable internal knowledge base of every customer conversation; coach reps on specific moments; compliance evidence with attributed quotes.
3. Journalist & Researcher Interviews
Journalists and qualitative researchers turning interview tape into clean attributed transcripts.
Input: A phone-recorded interview (M4A from voice memos, MP3 from a recorder), uploaded directly or hosted somewhere
Output: Speaker-labeled transcript with word-level timing
Use: Quote-accurate writing, faster fact-checking, and a searchable archive of every interview.
4. Lecture & Course Transcription
Students, professors, and online educators transcribing recorded lectures, seminars, and workshops.
Input: Lecture MP4/M4A files or unlisted YouTube URLs
Output: Full transcript + SRT subtitles for the video
Use: Study notes, accessibility for hearing-impaired students, captions on every uploaded lecture.
5. Video Subtitles & Captions
Video editors and content creators producing accurate caption tracks for long-form content.
Input: Direct video URL or YouTube link
Output: SRT and VTT subtitle file strings ready to drop into Premiere, DaVinci, or HTML5 <track> elements
Use: Add captions to every published video in one bulk run; localize via the translation toggle.
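Because `srt` and `vtt` arrive as strings in each dataset row, they need to be written to files before an editor or a `<track>` element can use them. A minimal sketch, assuming the row is already loaded as a dict; the helper name `write_subtitles` and the file naming are hypothetical.

```python
from pathlib import Path


def write_subtitles(item, stem, out_dir="."):
    """Write the row's 'srt' and 'vtt' strings to <stem>.srt and <stem>.vtt.

    Returns the list of paths written; missing or empty fields are skipped.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for ext in ("srt", "vtt"):
        content = item.get(ext)
        if not content:
            continue  # e.g. a failed row without subtitle strings
        path = out / f"{stem}.{ext}"
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written
```

For example, `write_subtitles(items[0], "episode-01", "subs")` would produce `subs/episode-01.srt` and `subs/episode-01.vtt` when both fields are present.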
6. Customer-Support QA & Sentiment
Support teams transcribing call recordings for CSAT analysis and agent coaching.
Input: Support-call recordings (uploaded or hosted)
Output: Speaker-separated transcripts that can be fed into sentiment models
Use: Identify churn signals, coach agents on real conversations, benchmark CSAT against transcript patterns.
7. RAG & LLM Training Data
AI/LLM developers building retrieval-augmented-generation pipelines or training data from spoken-word sources.
Input: Bulk podcasts, conference talks, YouTube lectures
Output: Clean text + word-level timing for chunking and citation
Use: Build voice-grounded knowledge bases, citation-aware Q&A systems, or fine-tune on domain-specific spoken content.
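One way to use word-level timing for chunking and citation is to group words into fixed-duration windows and keep each window's time range. The sketch below is an illustration, not part of the actor: `chunk_words` and the 30-second default are hypothetical, and it relies on `segments[].words[]` being present (rows with `transcriptSource` set to `captions` are line-level only).

```python
def chunk_words(item, max_seconds=30.0):
    """Group words into chunks spanning at most max_seconds each.

    Every chunk keeps 'start', 'end', and the original 'inputUrl' so an
    answer built from it can cite a time range in the source recording.
    """
    chunks = []
    current = None
    for segment in item.get("segments", []):
        for w in segment.get("words", []):
            # Start a new chunk when adding this word would exceed the window.
            if current is None or w["end"] - current["start"] > max_seconds:
                current = {"start": w["start"], "end": w["end"], "words": []}
                chunks.append(current)
            current["words"].append(w["word"].strip())
            current["end"] = w["end"]
    return [
        {
            "text": " ".join(c["words"]),
            "start": c["start"],
            "end": c["end"],
            "source": item.get("inputUrl"),
        }
        for c in chunks
    ]
```

Time-based windows are only one option; splitting on segment boundaries or speaker changes may suit some retrieval setups better.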
🔗 Integration Examples
JavaScript / Node.js
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

    const run = await client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call({
      audioUrls: ['https://example.com/recording.mp3'],
      speakerDiarization: true,
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items[0].transcript);
    console.log(items[0].srt); // ready-to-use SRT subtitle string
Python
    from apify_client import ApifyClient

    client = ApifyClient('YOUR_APIFY_TOKEN')

    run = client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call(run_input={
        'audioUrls': ['https://example.com/recording.mp3'],
        'speakerDiarization': True,
    })

    for item in client.dataset(run['defaultDatasetId']).iterate_items():
        print(item['transcript'])
        print(item['vtt'])  # ready-to-use WebVTT subtitle string
cURL
    curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
      -H 'Content-Type: application/json' \
      -d '{"audioUrls": ["https://example.com/recording.mp3"],"speakerDiarization": true}'
Automation Workflows (n8n / Zapier / Make)
- Trigger: New file in Drive/S3, new podcast episode in RSS, or a webhook
- HTTP Request: Call this actor's `runs` endpoint
- Process: Read `transcript`, `segments[].words[]`, `srt`, `vtt` from the dataset
- Action: Save to Notion / Airtable / Google Sheets / Slack / CRM with no transformation step
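The HTTP Request step above amounts to a JSON `POST` to the `runs` endpoint shown in the Quick Start. The sketch below only builds that request with the standard library so the URL and body can be inspected; `build_run_request` is a hypothetical helper, and the request is not sent until it is passed to `urllib.request.urlopen`.

```python
import json
from urllib.parse import quote
from urllib.request import Request

API_BASE = "https://api.apify.com/v2/acts"
# Actor ID in the tilde form used by the cURL examples in this README.
ACTOR_ID = "sian.agency~INCREDIBLY-FAST-audio-transcriber"


def build_run_request(token, audio_urls, **options):
    """Build (but do not send) the POST request that starts a run.

    Extra keyword arguments become input fields, e.g. speakerDiarization=True.
    """
    payload = {"audioUrls": list(audio_urls), **options}
    url = f"{API_BASE}/{ACTOR_ID}/runs?token={quote(token)}"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

In n8n, Zapier, or Make, the same URL, header, and JSON body go into the platform's HTTP request step.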
📊 Performance & Pricing
FREE Tier (Try It Now)
- 1 direct file URL or uploaded file per run — full feature access, same engine, same quality
- ≤5 MB or ≤60 seconds per file (hard precheck — never charged if over cap)
- No credit card required
- Perfect for testing accuracy on your audio before scaling up
PAID Tier (Production Ready)
- Unlimited URLs and files per run
- Up to 1 GB per file — long-form audio and video without splitting
- YouTube, TikTok, and Instagram URLs supported — paste any link, the actor resolves the media
- 10× parallel processing — 100 files in ~1 hour
- Pay-per-result: only charged for audio you actually transcribe
Base pricing:
| Event | Price |
|---|---|
| Audio second processed | $0.0005 / sec |
| Audio second processed (EU region) | $0.0007 / sec (replaces base) |
| Speaker diarization | +$0.0001 / sec (only when enabled) |
| Translate to English | +$0.0003 / sec (only when enabled) |
💰 Best price on the market — a 60-minute meeting with speaker diarization costs ~$2.16 ($1.80 transcription + $0.36 diarization). No subscriptions, no monthly minimums.
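The arithmetic behind that figure can be reproduced from the pricing table. A minimal sketch: `estimate_cost` is a hypothetical helper, and it assumes the diarization and translation add-ons stack additively on whichever base rate applies (standard or EU).

```python
from decimal import Decimal

# Per-second rates copied from the pricing table above.
BASE_RATE = Decimal("0.0005")
EU_RATE = Decimal("0.0007")          # replaces the base rate
DIARIZATION_RATE = Decimal("0.0001")
TRANSLATION_RATE = Decimal("0.0003")


def estimate_cost(seconds, diarization=False, translate=False, eu_region=False):
    """Estimate the charge in USD for one file of the given duration.

    Assumption: add-ons are summed on top of the applicable base rate.
    """
    rate = EU_RATE if eu_region else BASE_RATE
    if diarization:
        rate += DIARIZATION_RATE
    if translate:
        rate += TRANSLATION_RATE
    return rate * Decimal(str(seconds))


# 60-minute meeting with diarization: 3600 s x ($0.0005 + $0.0001) = $2.16
print(estimate_cost(3600, diarization=True))
```

`Decimal` is used instead of floats so that small per-second rates multiply out without rounding noise.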
❓ Frequently Asked Questions
Q: How accurate is the transcription?
A: Typically 95–99% on clean audio in supported languages. Word-level timestamps ship with every transcript so you can verify and correct faster than transcribing from scratch.
Q: What audio and video formats are supported?
A: MP3, M4A, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, MPEG, WebM. FREE tier: 1 URL/file per run at ≤5 MB or ≤60 seconds. PAID tier: unlimited at up to 1 GB per file.
Q: Can I transcribe non-English audio?
A: Yes. Auto-detection covers 99+ languages, including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, Hindi, and more. Toggle Translate to English to receive an English transcript alongside the original timestamps.
Q: Is speaker diarization included?
A: It's an opt-in toggle. When enabled, every segment and word gets SPEAKER_00, SPEAKER_01, … labels (pyannote-audio). Billed at $0.0001 per audio second only when used.
Q: Can I paste YouTube, TikTok, or Instagram links?
A: Yes — on the PAID tier. Paste them straight into audioUrls and the actor resolves the media and transcribes it. The FREE tier supports direct media file URLs (MP3, MP4, etc.) and uploaded files only — upgrade to PAID to unlock YouTube, TikTok, and Instagram link support. For per-word timestamps and speaker labels on YouTube specifically, use a direct file URL. For Facebook, use the dedicated Facebook AI Transcript Extractor.
Q: What output formats can I export?
A: JSON, CSV, and Excel, directly from the Apify dataset. Each row also includes ready-to-use SRT and WebVTT subtitle strings.
Q: How long does processing take?
A: A 1-minute clip usually finishes in 5–15 seconds. A 60-minute meeting takes 1–3 minutes on the paid tier. Bulk batches of 100 files complete in ~1 hour with 10× parallel processing.
Q: Is this legal? Where does my data go?
A: Audio is sent to a transcription pipeline (US region by default; toggle EU-Region Processing for GDPR-aligned routing). Files and transcripts are not retained beyond the run. See the legal section below.
🐛 Troubleshooting
A file is rejected on the FREE tier
- FREE tier caps each file at 5 MB or 60 seconds (whichever applies first). The precheck happens before any transcription cost — you're never charged for an over-cap file. Upgrade to PAID for files up to 1 GB.
A YouTube / TikTok / Instagram link is rejected on the FREE tier
- Platform URL transcription is a PAID-only feature. The FREE tier accepts direct media file URLs (MP3, MP4, M4A, …) and uploaded files. Upgrade to PAID to paste YouTube, TikTok, and Instagram links directly.
YouTube/TikTok/Instagram link returns no speaker labels or per-word timestamps
- Platform links use the platform's own captions where available, which are line-level only. For word-level timing + diarization, transcribe a direct file URL instead (download once, then paste the file URL).
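Rows that fell back to platform captions can be spotted through the `transcriptSource` field from the output table. A small sketch, assuming the dataset rows are already loaded as dicts; the helper name `caption_only_inputs` is hypothetical.

```python
def caption_only_inputs(items):
    """Return the inputUrl of each successful row transcribed from captions.

    These rows are line-level only, so they are the candidates for a re-run
    with a direct file URL when word timing or speaker labels are needed.
    """
    return [
        item.get("inputUrl")
        for item in items
        if item.get("success") and item.get("transcriptSource") == "captions"
    ]
```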
error: unsupported file type
- Confirm the URL ends in (or serves) one of: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM. HTML pages, images, and PDFs are blocked before processing.
Low-quality transcript on noisy audio
- Strong accents, background music, or compressed phone recordings reduce accuracy. Try explicitly setting `language` (instead of `auto`) to give the model a hint.
Rate limits or timeouts
- The actor handles parallelism and retries internally. If a single run consistently times out, split your batch — the pay-per-second model means there's no penalty for using multiple smaller runs.
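Splitting a batch comes down to slicing the URL list into several run inputs. A minimal sketch: `split_batch` is a hypothetical helper and the default of 25 URLs per run is an arbitrary illustration, not a documented limit.

```python
def split_batch(urls, batch_size=25):
    """Split a list of URLs into run inputs of at most batch_size URLs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        {"audioUrls": urls[i : i + batch_size]}
        for i in range(0, len(urls), batch_size)
    ]
```

Each returned dict can be passed as the input of a separate run, for example as `run_input` in the Python client.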
⚖️ Is it legal to scrape data?
Our actors are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our actors, when used for ethical purposes by Apify users, are safe.
However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers.
Use this actor only on audio you have rights to transcribe — your own recordings, content with consent, or properly licensed media. Audio and transcripts are not retained beyond the run's lifetime. EU-Region Processing is available via the toggle for GDPR-aligned workflows.
You can also read Apify's blog post on the legality of web scraping.
🤝 Support
Join our active support community
- For issues or questions, open an issue in the actor's repository
- Check SIÁN Agency Store for more automation tools
- ✉️ apify@sian-agency.online
Built by SIÁN Agency | More Tools