Transcribe Video to Text & Audio to Text — 99+ Languages

Pricing

Pay per event

Transcribe video to text and audio to text in bulk on Apify. 99+ languages, word-level timestamps, speaker diarization, SRT/VTT export. Try free.

Rating: 5.0 (2)

Developer: SIÁN OÜ (Maintained by Community)

Actor stats: 4 bookmarked, 69 total users, 17 monthly active users, last modified 3 days ago

Transcribe Audio & Video to Text — 99+ Languages, SRT/VTT 🚀

Built for podcasters, journalists, sales/ops teams, video editors, and AI/RAG developers


📋 Overview

Bulk-transcribe any audio or video to text — direct files (MP3, WAV, MP4, MOV, M4A, OPUS…) or YouTube / TikTok / Instagram URLs, in 99+ languages, with speaker diarization, word-level timestamps, and ready-to-publish SRT/VTT subtitles.

Why use it:

  • 99+ languages auto-detected — English, Spanish, French, Mandarin, Arabic, Portuguese, Hindi, and 90+ more, no manual selection needed
  • 10× parallel on the paid tier — 100 files in ~1 hour vs ~16 hours sequential
  • 🎯 95–99% accuracy on clean audio with word-level timestamps {word, start, end, speaker} on every transcript
  • 💰 Pay per audio second — $0.0005/sec, no subscriptions, no minimums; only billed for audio actually transcribed
  • 💎 SRT and VTT subtitles included on every successful run — no extra step, no extra charge
  • NEW (PAID): Paste YouTube, TikTok, or Instagram URLs directly — the actor resolves the media and transcribes in one pass

✨ Features

  • 🌍 99+ Languages: Auto-detect or force a specific language from a curated dropdown
  • 🎤 Speaker Diarization: Per-segment and per-word SPEAKER_00, SPEAKER_01, … labels (pyannote-audio)
  • ⏱️ Word-Level Timestamps: Every word ships with start, end, and (optionally) speaker for clip-accurate editing
  • 🎬 SRT + VTT Output: Ready-to-use subtitle file strings on every successful run
  • 🔗 Multi-Source Inputs: Direct file URLs and uploads on every tier; YouTube, TikTok, Instagram URLs on the PAID tier — mixed in one run
  • 🌐 Translate to English: Optional one-pass translation for non-English audio
  • 🇪🇺 EU-Region Processing: Toggle for GDPR-aligned routing
  • 🚀 Parallel Bulk Processing: 10 concurrent files on the paid tier
  • 🛡️ Hard FREE-Tier Precheck: Never charged if a file exceeds the free-tier cap
  • 📊 45+ Dataset Fields: Transcript, segments, words, speakers, languages, SRT, VTT, duration, file size, source platform, and more

🎬 Quick Start

Paste a direct media URL or a YouTube/TikTok/Instagram link, hit Run, and get back a structured transcript with subtitles. No setup, no SDK install.

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/runs?token=YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{"audioUrls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"]}'

🚀 Getting Started (3 Simple Steps)

Step 1: Add your audio or video

Paste one or more URLs into Audio/Video URLs (direct files, YouTube, TikTok, or Instagram links) — or upload files from your computer via Upload Audio File or Video to Text. Both lists are processed together.

Step 2: Pick the extras you need

Toggle Speaker Diarization, Translate to English, or EU-Region Processing (all optional, all billed only when enabled).

Step 3: Run the actor

Files process in parallel on the paid tier (10 concurrent). Every result lands in the dataset with a full transcript, segments, word-level timestamps, and SRT + VTT subtitle strings.

That's it! In under 5 minutes, you'll have:

  • A clean text transcript per file
  • Ready-to-publish .srt and .vtt subtitle strings
  • Word-level timestamps and speaker labels for editing, search, and clip extraction

📥 Input Configuration

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audioUrls | array | No | Direct media URLs (any tier) or YouTube / TikTok / Instagram links (PAID tier only), one per line |
| audioFiles | array | No | Audio/video files uploaded from your computer; processed alongside audioUrls |
| language | string | No | Source language; auto for auto-detect (99+ options) |
| translateToEnglish | boolean | No | Translate non-English audio to English (PAID, +$0.0003/sec) |
| useEuServers | boolean | No | Process inside the EU for GDPR alignment (PAID, replaces base rate with $0.0007/sec) |
| speakerDiarization | boolean | No | Identify and label different speakers (PAID, +$0.0001/sec) |

Example — single direct URL:

{
  "audioUrls": ["https://example.com/podcast.mp3"]
}

Example — YouTube + Instagram + diarization:

{
  "audioUrls": [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://www.instagram.com/reel/Cxyz123/"
  ],
  "speakerDiarization": true
}

Example — bulk + language + translation:

{
  "audioUrls": [
    "https://example.com/interview-es.m4a",
    "https://example.com/lecture-fr.mp4"
  ],
  "language": "auto",
  "translateToEnglish": true
}

Supported formats: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM

FREE tier: 1 direct file URL (or uploaded file) per run, ≤5 MB or ≤60 seconds per file (hard precheck: never charged if over cap). YouTube / TikTok / Instagram links require PAID.

PAID tier: unlimited URLs and files (direct, YouTube, TikTok, Instagram), up to 1 GB per file.
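The tier rules above can be checked on the client side before a run is started. The sketch below is only an illustration, not the actor's own precheck: the helper names `classify_url` and `needs_paid_tier` and the hostname list are assumptions, and the returned labels simply mirror the sourcePlatform values described in the Output section.

```python
from urllib.parse import urlparse

# Hostname suffixes are an assumption for illustration; the actor's own
# routing rules are not documented here.
PLATFORM_HOSTS = {
    "youtube": ("youtube.com", "youtu.be"),
    "tiktok": ("tiktok.com",),
    "instagram": ("instagram.com",),
}


def classify_url(url: str) -> str:
    """Return 'youtube', 'tiktok', 'instagram', or 'direct' for a URL."""
    host = (urlparse(url).hostname or "").lower()
    for platform, suffixes in PLATFORM_HOSTS.items():
        if any(host == s or host.endswith("." + s) for s in suffixes):
            return platform
    return "direct"


def needs_paid_tier(audio_urls: list[str], uploaded_files: int = 0) -> list[str]:
    """List the reasons a planned run would not fit the FREE tier."""
    reasons = []
    # FREE tier: one direct URL or one uploaded file per run.
    if len(audio_urls) + uploaded_files > 1:
        reasons.append("more than one URL or file in the run")
    # Platform links are PAID-only.
    for url in audio_urls:
        platform = classify_url(url)
        if platform != "direct":
            reasons.append(f"{platform} link: {url}")
    return reasons
```

An empty list means the input looks compatible with the FREE tier as far as URL type and count go; the 5 MB / 60 second cap is still enforced by the actor itself.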


📤 Output

Results are saved to the Apify dataset with 20+ fields per row:

| Field | Type | Description |
| --- | --- | --- |
| transcript | string | Complete text transcription |
| detected_language | string | ISO language code (e.g. en, es) |
| duration | number | Audio length in seconds |
| fileSizeMB | number | File size in megabytes |
| segments | array | Timestamped segments with text, start, end, speaker, language, and words[] (word-level timing) |
| srt | string | Ready-to-use SRT subtitle file content |
| vtt | string | Ready-to-use WebVTT subtitle file content |
| speakers | array | Unique speaker labels (when diarization is enabled) |
| languages | array | Languages detected across segments |
| sourcePlatform | string | How input was routed: direct, youtube, tiktok, instagram |
| mediaUrl | string | Resolved direct media URL sent to the transcription engine |
| mediaTitle | string | Title from the source platform (when available) |
| mediaAuthor | string | Author/uploader from the source platform (when available) |
| transcriptSource | string | engine (word-level + speakers) or captions (line-level only) |
| inputUrl | string | Original URL you submitted |
| success | boolean | Whether the transcription completed |
| processedAt | string | ISO 8601 timestamp |

Example:

{
  "transcript": "the ugliest human emotion that exists envy nobody ever wants to admit that they're envious...",
  "detected_language": "en",
  "duration": 57.0775,
  "fileSizeMB": 0.92,
  "segments": [
    {
      "id": 0,
      "text": "the ugliest human emotion that exists envy nobody ever wants to admit...",
      "start": 0.26,
      "end": 20.56,
      "speaker": "SPEAKER_00",
      "language": "en",
      "words": [
        { "word": "the", "start": 0.26, "end": 0.26, "speaker": "SPEAKER_00" },
        { "word": "ugliest", "start": 0.26, "end": 0.78, "speaker": "SPEAKER_00" },
        { "word": "human", "start": 0.78, "end": 1.20, "speaker": "SPEAKER_00" }
      ]
    }
  ],
  "speakers": ["SPEAKER_00"],
  "languages": ["en"],
  "srt": "1\n00:00:00,260 --> 00:00:20,560\nthe ugliest human emotion that exists envy...\n\n",
  "vtt": "WEBVTT\n\n1\n00:00:00.260 --> 00:00:20.560\nthe ugliest human emotion that exists envy...\n\n",
  "sourcePlatform": "direct",
  "mediaUrl": "https://example.com/podcast.mp3",
  "success": true,
  "processedAt": "2026-05-21T12:00:00Z"
}
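The segment in the example above spans about 20 seconds, which is long for a single on-screen caption. As a sketch of what the word-level timing makes possible, the hypothetical helper below regroups segments[].words[] into shorter SRT cues. The function names and the default of 7 words per cue are assumptions for illustration; the actor's own srt field is produced by the actor and is not generated this way.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(segments: list[dict], max_words: int = 7) -> str:
    """Build SRT cues of at most max_words words from word-level timing."""
    # Flatten the words of all segments into one ordered list.
    words = [w for seg in segments for w in seg.get("words", [])]
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in group)
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        # Each cue: index, time range, text, then a blank line.
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n\n")
    return "".join(cues)
```

Rows whose transcriptSource is captions carry line-level timing only, so this kind of regrouping applies to rows that actually contain words[].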

💼 Use Cases & Examples

1. Podcast Show Notes & Repurposing

Podcasters turning long episodes into searchable show notes, blog posts, and social clips.

  • Input: A direct MP3 URL (or YouTube link if the episode is also on YouTube)
  • Output: Full transcript + per-word timestamps + SRT/VTT for the video version
  • Use: Generate timestamped show notes, cut pull-quote clips for Reels/Shorts, publish a searchable blog version of every episode.

2. Meeting & Sales-Call Archival

Sales and ops teams archiving Zoom, Teams, and Meet recordings for coaching, QA, and compliance.

  • Input: Recordings uploaded via audioFiles or hosted on cloud storage
  • Output: Transcript with speaker labels (SPEAKER_00, SPEAKER_01…) and per-word timing
  • Use: Build a searchable internal knowledge base of every customer conversation; coach reps on specific moments; keep compliance evidence with attributed quotes.
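For attributed quotes, it often helps to collapse the per-word speaker labels into speaker turns. The following is a minimal sketch under the assumption that each word carries the speaker key shown in the Output example; the `speaker_turns` helper and the "UNKNOWN" fallback label are hypothetical, not part of the actor.

```python
def speaker_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for seg in segments:
        for w in seg.get("words", []):
            # Fallback label is an assumption for rows without diarization.
            speaker = w.get("speaker", "UNKNOWN")
            if turns and turns[-1]["speaker"] == speaker:
                # Same speaker keeps talking: extend the current turn.
                turns[-1]["text"] += " " + w["word"].strip()
                turns[-1]["end"] = w["end"]
            else:
                # Speaker changed: open a new turn.
                turns.append({
                    "speaker": speaker,
                    "start": w["start"],
                    "end": w["end"],
                    "text": w["word"].strip(),
                })
    return turns
```

Each turn keeps its start and end time, so a quote can be traced back to the exact moment in the recording.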

3. Journalist & Researcher Interviews

Journalists and qualitative researchers turning interview tape into clean attributed transcripts.

  • Input: A phone-recorded interview (M4A from voice memos, MP3 from a recorder), uploaded directly or hosted somewhere
  • Output: Speaker-labeled transcript with word-level timing
  • Use: Quote-accurate writing, faster fact-checking, and a searchable archive of every interview.

4. Lecture & Course Transcription

Students, professors, and online educators transcribing recorded lectures, seminars, and workshops.

  • Input: Lecture MP4/M4A files or unlisted YouTube URLs
  • Output: Full transcript + SRT subtitles for the video
  • Use: Study notes, accessibility for hearing-impaired students, captions on every uploaded lecture.

5. Video Subtitles & Captions

Video editors and content creators producing accurate caption tracks for long-form content.

  • Input: Direct video URL or YouTube link
  • Output: SRT and VTT subtitle file strings ready to drop into Premiere, DaVinci, or HTML5 <track> elements
  • Use: Add captions to every published video in one bulk run; localize via the translation toggle.
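Because srt and vtt arrive as strings inside each dataset row, they need to be written to files before an editor or a <track> element can load them. A small sketch, assuming a row shaped like the Output example; the `write_subtitles` helper and its file naming are hypothetical.

```python
from pathlib import Path


def write_subtitles(item: dict, out_dir: str, stem: str) -> list[Path]:
    """Write the row's srt and vtt strings to <stem>.srt and <stem>.vtt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for ext in ("srt", "vtt"):
        content = item.get(ext)
        if not content:
            # Rows without subtitle strings (for example failed runs) are skipped.
            continue
        path = out / f"{stem}.{ext}"
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written
```

Calling `write_subtitles(item, "captions", "episode-01")` would produce captions/episode-01.srt and captions/episode-01.vtt for a successful row.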

6. Customer-Support QA & Sentiment

Support teams transcribing call recordings for CSAT analysis and agent coaching.

  • Input: Support-call recordings (uploaded or hosted)
  • Output: Speaker-separated transcripts that can be fed into sentiment models
  • Use: Identify churn signals, coach agents on real conversations, benchmark CSAT against transcript patterns.

7. RAG & LLM Training Data

AI/LLM developers building retrieval-augmented-generation pipelines or training data from spoken-word sources.

  • Input: Bulk podcasts, conference talks, YouTube lectures
  • Output: Clean text + word-level timing for chunking and citation
  • Use: Build voice-grounded knowledge bases, citation-aware Q&A systems, or fine-tune on domain-specific spoken content.
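One way to use word-level timing for chunking and citation is to cut the transcript into size-limited chunks that each remember their time span. The sketch below is an assumption about how a pipeline might do this, not something the actor provides; `chunk_words` and the 500-character default are hypothetical choices.

```python
def chunk_words(segments: list[dict], max_chars: int = 500) -> list[dict]:
    """Group words into chunks of at most max_chars, keeping time spans."""
    chunks, current = [], None
    for seg in segments:
        for w in seg.get("words", []):
            token = w["word"].strip()
            # Close the current chunk if adding this word would overflow it.
            if current and len(current["text"]) + 1 + len(token) > max_chars:
                chunks.append(current)
                current = None
            if current is None:
                current = {"text": token, "start": w["start"], "end": w["end"]}
            else:
                current["text"] += " " + token
                current["end"] = w["end"]
    if current:
        chunks.append(current)
    return chunks
```

Storing start and end next to each chunk lets a retrieval system cite the exact time range a passage came from.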


🔗 Integration Examples

JavaScript / Node.js

import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });
// Start the actor and wait for the run to finish
const run = await client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call({
    audioUrls: ['https://example.com/recording.mp3'],
    speakerDiarization: true,
});
// Read the results from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].transcript);
console.log(items[0].srt); // ready-to-use SRT subtitle string

Python

from apify_client import ApifyClient
client = ApifyClient('YOUR_APIFY_TOKEN')
# Start the actor and wait for the run to finish
run = client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call(run_input={
    'audioUrls': ['https://example.com/recording.mp3'],
    'speakerDiarization': True,
})
# Iterate over the rows in the run's default dataset
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['transcript'])
    print(item['vtt'])  # ready-to-use WebVTT subtitle string

cURL

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "audioUrls": ["https://example.com/recording.mp3"],
    "speakerDiarization": true
  }'

Automation Workflows (n8n / Zapier / Make)

  1. Trigger: New file in Drive/S3, new podcast episode in RSS, or a webhook
  2. HTTP Request: Call this actor's runs endpoint
  3. Process: Read transcript, segments[].words[], srt, vtt from the dataset
  4. Action: Save to Notion / Airtable / Google Sheets / Slack / CRM with no transformation step

📊 Performance & Pricing

FREE Tier (Try It Now)

  • 1 direct file URL or uploaded file per run — full feature access, same engine, same quality
  • ≤5 MB or ≤60 seconds per file (hard precheck — never charged if over cap)
  • No credit card required
  • Perfect for testing accuracy on your audio before scaling up

PAID Tier

  • Unlimited URLs and files per run
  • Up to 1 GB per file: long-form audio and video without splitting
  • YouTube, TikTok, and Instagram URLs supported: paste any link and the actor resolves the media
  • 10× parallel processing: 100 files in ~1 hour
  • Pay per audio second: only charged for audio you actually transcribe

Base pricing:

| Event | Price |
| --- | --- |
| Audio second processed | $0.0005 / sec |
| Audio second processed (EU region) | $0.0007 / sec (replaces base) |
| Speaker diarization | +$0.0001 / sec (only when enabled) |
| Translate to English | +$0.0003 / sec (only when enabled) |

💰 Best price on the market — a 60-minute meeting with speaker diarization costs ~$2.16 ($1.80 transcription + $0.36 diarization). No subscriptions, no monthly minimums.
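The arithmetic behind that figure can be reproduced with a small estimator. This is a sketch built from the per-second rates listed above; the `estimate_cost` helper is hypothetical, and it assumes the diarization and translation add-ons stack on whichever base rate applies (standard or EU), which the pricing table implies but does not state outright.

```python
from decimal import Decimal

# Per-second rates in USD, copied from the pricing table.
BASE = Decimal("0.0005")
BASE_EU = Decimal("0.0007")      # replaces the base rate when EU routing is on
DIARIZATION = Decimal("0.0001")  # add-on, only when enabled
TRANSLATION = Decimal("0.0003")  # add-on, only when enabled


def estimate_cost(seconds, eu=False, diarization=False, translate=False) -> Decimal:
    """Estimate the cost in USD for a given number of audio seconds."""
    rate = BASE_EU if eu else BASE
    if diarization:
        rate += DIARIZATION
    if translate:
        rate += TRANSLATION
    # Decimal avoids binary floating-point drift in money arithmetic.
    return rate * Decimal(str(seconds))


# 60-minute meeting with diarization: 3600 s x (0.0005 + 0.0001) = 2.16 USD
print(estimate_cost(3600, diarization=True))
```

Treat the result as an estimate only; the current rates on the pricing page are authoritative.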

🔗 View current pricing


❓ Frequently Asked Questions

Q: How accurate is the transcription?
A: Typically 95–99% on clean audio in supported languages. Word-level timestamps ship with every transcript so you can verify and correct faster than transcribing from scratch.

Q: What audio and video formats are supported?
A: MP3, M4A, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, MPEG, WebM. FREE tier: 1 URL/file per run at ≤5 MB or ≤60 seconds. PAID tier: unlimited at up to 1 GB per file.

Q: Can I transcribe non-English audio?
A: Yes. Auto-detection covers 99+ languages including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, Hindi, and more. Toggle Translate to English to receive an English transcript alongside the original timestamps.

Q: Is speaker diarization included?
A: It's an opt-in toggle. When enabled, every segment and word gets SPEAKER_00, SPEAKER_01, … labels (pyannote-audio). Billed at $0.0001 per audio second only when used.

Q: Can I paste YouTube, TikTok, or Instagram links?
A: Yes, on the PAID tier. Paste them straight into audioUrls and the actor resolves the media and transcribes it. The FREE tier supports direct media file URLs (MP3, MP4, etc.) and uploaded files only; upgrade to PAID to unlock YouTube, TikTok, and Instagram link support. For per-word timestamps and speaker labels on YouTube specifically, use a direct file URL. For Facebook, use the dedicated Facebook AI Transcript Extractor.

Q: What output formats can I export?
A: JSON, CSV, Excel, directly from the Apify dataset. Per row you also get ready-to-use SRT and WebVTT subtitle strings.

Q: How long does processing take?
A: A 1-minute clip usually finishes in 5–15 seconds. A 60-minute meeting takes 1–3 minutes on the paid tier. Bulk batches of 100 files complete in ~1 hour with 10× parallel processing.

Q: Is this legal? Where does my data go?
A: Audio is sent to a transcription pipeline (US region by default; toggle EU-Region Processing for GDPR-aligned routing). Files and transcripts are not retained beyond the run. See the legal section below.


🐛 Troubleshooting

A file is rejected on the FREE tier

  • FREE tier caps each file at 5 MB or 60 seconds (whichever applies first). The precheck happens before any transcription cost — you're never charged for an over-cap file. Upgrade to PAID for files up to 1 GB.

A YouTube / TikTok / Instagram link is rejected on the FREE tier

  • Platform URL transcription is a PAID-only feature. The FREE tier accepts direct media file URLs (MP3, MP4, M4A, …) and uploaded files. Upgrade to PAID to paste YouTube, TikTok, and Instagram links directly.

YouTube/TikTok/Instagram link returns no speaker labels or per-word timestamps

  • Platform links use the platform's own captions where available, which are line-level only. For word-level timing + diarization, transcribe a direct file URL instead (download once, then paste the file URL).

error: unsupported file type

  • Confirm the URL ends in (or serves) one of: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM. HTML pages, images, and PDFs are blocked before processing.

Low-quality transcript on noisy audio

  • Strong accents, background music, or compressed phone recordings reduce accuracy. Try explicitly setting language (instead of auto) to give the model a hint.

Rate limits or timeouts

  • The actor handles parallelism and retries internally. If a single run consistently times out, split your batch — the pay-per-second model means there's no penalty for using multiple smaller runs.
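Splitting a batch can be as simple as slicing the URL list into several input payloads, one per run. A minimal sketch: the `split_batches` helper and the default of 25 URLs per run are arbitrary assumptions, not a documented limit of the actor.

```python
def split_batches(urls: list[str], batch_size: int = 25) -> list[dict]:
    """Split a URL list into actor input payloads of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        {"audioUrls": urls[i:i + batch_size]}
        for i in range(0, len(urls), batch_size)
    ]
```

Each returned dict can be passed as the run input of a separate run, for example through the Python client call shown in the Integration Examples section.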

Our actors are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our actors, when used for ethical purposes by Apify users, are safe.

However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers.

Use this actor only on audio you have rights to transcribe — your own recordings, content with consent, or properly licensed media. Audio and transcripts are not retained beyond the run's lifetime. EU-Region Processing is available via the toggle for GDPR-aligned workflows.

You can also read Apify's blog post on the legality of web scraping.


🤝 Support

Telegram Support

Join our active support community


Built by SIÁN Agency | More Tools