Transcribe Video to Text & Audio to Text — 99+ Languages

Transcribe video to text and audio to text in bulk on Apify. 99+ languages, word-level timestamps, speaker diarization, SRT/VTT export. Try free.

Pricing: Pay per event
Rating: 5.0 (2)
Developer: SIÁN OÜ
Total users: 135
Last modified: 6 days ago

Transcribe Audio & Video to Text — 99+ Languages, SRT/VTT 🚀

🎉 NEW: Paste YouTube · TikTok · Instagram links directly — no manual download

Built for podcasters, journalists, sales/ops teams, video editors, and AI/RAG developers

📋 Overview

Bulk-transcribe any audio or video to text — direct files (MP3, WAV, MP4, MOV, M4A, OPUS…) or YouTube / TikTok / Instagram URLs, in 99+ languages, with speaker diarization, word-level timestamps, and ready-to-publish SRT/VTT subtitles.

Why teams choose it:

✅ 99+ languages auto-detected — English, Spanish, French, Mandarin, Arabic, Portuguese, Hindi, and 90+ more, no manual selection needed
⚡ 10× parallel on the paid tier — 100 files in ~1 hour vs ~16 hours sequential
🎯 95–99% accuracy on clean audio with word-level timestamps {word, start, end, speaker} on every transcript
💰 Pay per audio second — $0.0005/sec, no subscriptions, no minimums; only billed for audio actually transcribed
💎 SRT and VTT subtitles included on every successful run — no extra step, no extra charge
✨ NEW (PAID): Paste YouTube, TikTok, or Instagram URLs directly — the actor resolves the media and transcribes in one pass

✨ Features

🌍 99+ Languages: Auto-detect or force a specific language from a curated dropdown
🎤 Speaker Diarization: Per-segment and per-word SPEAKER_00, SPEAKER_01, … labels (pyannote-audio)
⏱️ Word-Level Timestamps: Every word ships with start, end, and (optionally) speaker for clip-accurate editing
🎬 SRT + VTT Output: Ready-to-use subtitle file strings on every successful run
🔗 Multi-Source Inputs: Direct file URLs and uploads on every tier; YouTube, TikTok, Instagram URLs on the PAID tier — mixed in one run
🌐 Translate to English: Optional one-pass translation for non-English audio
🇪🇺 EU-Region Processing: Toggle for GDPR-aligned routing
🚀 Parallel Bulk Processing: 10 concurrent files on the paid tier
🛡️ Hard FREE-Tier Precheck: Never charged if a file exceeds the free-tier cap
📊 45+ Dataset Fields: Transcript, segments, words, speakers, languages, SRT, VTT, duration, file size, source platform, and more

🎬 Quick Start

Paste a direct media URL or a YouTube/TikTok/Instagram link, hit Run, and get back a structured transcript with subtitles. No setup, no SDK install.

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/runs?token=YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{"audioUrls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"]}'

🚀 Getting Started (3 Simple Steps)

Step 1: Add your audio or video

Paste one or more URLs into Audio/Video URLs (direct files, YouTube, TikTok, or Instagram links) — or upload files from your computer via Upload Audio File or Video to Text. Both lists are processed together.

Step 2: Pick the extras you need

Toggle Speaker Diarization, Translate to English, or EU-Region Processing (all optional, all billed only when enabled).

Step 3: Run the actor

Files process in parallel on the paid tier (10 concurrent). Every result lands in the dataset with a full transcript, segments, word-level timestamps, and SRT + VTT subtitle strings.

That's it! In under 5 minutes, you'll have:

A clean text transcript per file
Ready-to-publish .srt and .vtt subtitle strings
Word-level timestamps and speaker labels for editing, search, and clip extraction

📥 Input Configuration

Field	Type	Required	Description
`audioUrls`	array	No	Direct media URLs (any tier) or YouTube / TikTok / Instagram links (PAID tier only), one per line
`audioFiles`	array	No	Audio/video files uploaded from your computer; processed alongside `audioUrls`
`language`	string	No	Source language; `auto` for auto-detect (99+ options)
`translateToEnglish`	boolean	No	Translate non-English audio to English (PAID, +$0.0003/sec)
`useEuServers`	boolean	No	Process inside the EU for GDPR alignment (PAID, replaces base rate with $0.0007/sec)
`speakerDiarization`	boolean	No	Identify and label different speakers (PAID, +$0.0001/sec)

Example — single direct URL:

{
  "audioUrls": ["https://example.com/podcast.mp3"]
}

Example — YouTube + Instagram + diarization:

{
  "audioUrls": [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://www.instagram.com/reel/Cxyz123/"
  ],
  "speakerDiarization": true
}

Example — bulk + language + translation:

{
  "audioUrls": [
    "https://example.com/interview-es.m4a",
    "https://example.com/lecture-fr.mp4"
  ],
  "language": "auto",
  "translateToEnglish": true
}

Supported formats: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM

FREE tier: 1 direct file URL (or uploaded file) per run, ≤5 MB or ≤60 seconds per file (hard precheck; never charged if over cap). YouTube / TikTok / Instagram links require PAID.

PAID tier: unlimited URLs and files (direct, YouTube, TikTok, Instagram), up to 1 GB per file.
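The FREE-tier limits above can be checked on the client side before starting a run. The following is a minimal sketch, not the actor's own precheck: the host list, the function names, and the reading of "≤5 MB or ≤60 seconds" as "reject if either limit is exceeded" are assumptions made for illustration.

```python
from urllib.parse import urlparse

FREE_MAX_MB = 5        # FREE-tier size cap stated above
FREE_MAX_SECONDS = 60  # FREE-tier duration cap stated above

# Hypothetical host list for this sketch; the actor's real routing rules may differ.
PLATFORM_HOSTS = ("youtube.com", "youtu.be", "tiktok.com", "instagram.com")


def is_platform_url(url: str) -> bool:
    """Return True if the URL's host is one of the platform hosts or a subdomain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in PLATFORM_HOSTS)


def free_tier_problems(urls, size_mb=None, duration_s=None):
    """List reasons a run would likely not fit the FREE tier (empty list means OK)."""
    problems = []
    if len(urls) > 1:
        problems.append("FREE tier allows only one URL or file per run")
    if any(is_platform_url(u) for u in urls):
        problems.append("YouTube / TikTok / Instagram links require PAID")
    # Assumption: a file is rejected when either cap is exceeded.
    if size_mb is not None and size_mb > FREE_MAX_MB:
        problems.append("file is larger than 5 MB")
    if duration_s is not None and duration_s > FREE_MAX_SECONDS:
        problems.append("file is longer than 60 seconds")
    return problems
```

Size and duration are optional because they are often unknown before download; the actor's hard precheck remains the authoritative check.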

📤 Output

Results are saved to the Apify dataset. The main fields in each row:

Field	Type	Description
`transcript`	string	Complete text transcription
`detected_language`	string	ISO language code (e.g. `en`, `es`)
`duration`	number	Audio length in seconds
`fileSizeMB`	number	File size in megabytes
`segments`	array	Timestamped segments with `text`, `start`, `end`, `speaker`, `language`, and `words[]` (word-level timing)
`srt`	string	Ready-to-use SRT subtitle file content
`vtt`	string	Ready-to-use WebVTT subtitle file content
`speakers`	array	Unique speaker labels (when diarization is enabled)
`languages`	array	Languages detected across segments
`sourcePlatform`	string	How input was routed: `direct`, `youtube`, `tiktok`, `instagram`
`mediaUrl`	string	Resolved direct media URL sent to the transcription engine
`mediaTitle`	string	Title from the source platform (when available)
`mediaAuthor`	string	Author/uploader from the source platform (when available)
`transcriptSource`	string	`engine` (word-level + speakers) or `captions` (line-level only)
`inputUrl`	string	Original URL you submitted
`success`	boolean	Whether the transcription completed
`processedAt`	string	ISO 8601 timestamp

Example:

{
  "transcript": "the ugliest human emotion that exists envy nobody ever wants to admit that they're envious...",
  "detected_language": "en",
  "duration": 57.0775,
  "fileSizeMB": 0.92,
  "segments": [
    {
      "id": 0,
      "text": "the ugliest human emotion that exists envy nobody ever wants to admit...",
      "start": 0.26,
      "end": 20.56,
      "speaker": "SPEAKER_00",
      "language": "en",
      "words": [
        { "word": "the",     "start": 0.26, "end": 0.26, "speaker": "SPEAKER_00" },
        { "word": "ugliest", "start": 0.26, "end": 0.78, "speaker": "SPEAKER_00" },
        { "word": "human",   "start": 0.78, "end": 1.20, "speaker": "SPEAKER_00" }
      ]
    }
  ],
  "speakers": ["SPEAKER_00"],
  "languages": ["en"],
  "srt": "1\n00:00:00,260 --> 00:00:20,560\nthe ugliest human emotion that exists envy...\n\n",
  "vtt": "WEBVTT\n\n1\n00:00:00.260 --> 00:00:20.560\nthe ugliest human emotion that exists envy...\n\n",
  "sourcePlatform": "direct",
  "mediaUrl": "https://example.com/podcast.mp3",
  "success": true,
  "processedAt": "2026-05-21T12:00:00Z"
}
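A row shaped like the example above can be turned into a readable, speaker-attributed transcript. This is a minimal sketch that relies only on the `segments`, `speaker`, `start`, `end`, and `text` fields shown in the example; the function names `speaker_turns` and `format_turns` are hypothetical helpers, not part of the actor.

```python
def speaker_turns(item):
    """Merge consecutive segments that share a speaker into single turns."""
    turns = []
    for seg in item.get("segments", []):
        speaker = seg.get("speaker")  # may be missing when diarization is off
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return turns


def format_turns(turns):
    """Render turns as '[mm:ss] SPEAKER: text' lines."""
    lines = []
    for t in turns:
        minutes, seconds = divmod(int(t["start"]), 60)
        label = t["speaker"] or "UNKNOWN"
        lines.append(f"[{minutes:02d}:{seconds:02d}] {label}: {t['text']}")
    return "\n".join(lines)
```

Calling `format_turns(speaker_turns(item))` on a dataset row yields one line per turn, which is usually easier to quote from than raw segments.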

💼 Use Cases & Examples

1. Podcast Show Notes & Repurposing

Podcasters turning long episodes into searchable show notes, blog posts, and social clips.

Input: A direct MP3 URL (or a YouTube link if the episode is also on YouTube)
Output: Full transcript + per-word timestamps + SRT/VTT for the video version
Use: Generate timestamped show notes, extract pull-quote clips for Reels/Shorts, publish a searchable blog version of every episode.

2. Meeting & Sales-Call Archival

Sales and ops teams archiving Zoom, Teams, and Meet recordings for coaching, QA, and compliance.

Input: Recordings uploaded via audioFiles or hosted on cloud storage
Output: Transcript with speaker labels (SPEAKER_00, SPEAKER_01…) and per-word timing
Use: Build a searchable internal knowledge base of every customer conversation; coach reps on specific moments; compliance evidence with attributed quotes.

3. Journalist & Researcher Interviews

Journalists and qualitative researchers turning interview tape into clean attributed transcripts.

Input: A phone-recorded interview (M4A from voice memos, MP3 from a recorder), uploaded directly or hosted somewhere
Output: Speaker-labeled transcript with word-level timing
Use: Quote-accurate writing, faster fact-checking, and a searchable archive of every interview.

4. Lecture & Course Transcription

Students, professors, and online educators transcribing recorded lectures, seminars, and workshops.

Input: Lecture MP4/M4A files or unlisted YouTube URLs
Output: Full transcript + SRT subtitles for the video
Use: Study notes, accessibility for hearing-impaired students, captions on every uploaded lecture.

5. Video Subtitles & Captions

Video editors and content creators producing accurate caption tracks for long-form content.

Input: Direct video URL or YouTube link
Output: SRT and VTT subtitle file strings ready to drop into Premiere, DaVinci, or HTML5 <track> elements
Use: Add captions to every published video in one bulk run; localize via the translation toggle.
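Because `srt` and `vtt` arrive as strings in each dataset row, they need to be written to disk before an editor or a `<track>` element can load them. A minimal sketch, assuming a row dict with the `srt` and `vtt` fields from the Output section; `save_subtitles` and its parameters are hypothetical names:

```python
from pathlib import Path


def save_subtitles(item, out_dir, stem):
    """Write the row's `srt` and `vtt` strings to <stem>.srt and <stem>.vtt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for ext in ("srt", "vtt"):
        content = item.get(ext)
        if not content:  # rows without subtitle strings are skipped
            continue
        path = out / f"{stem}.{ext}"
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written
```

For example, `save_subtitles(item, "subs", "episode-01")` would produce `subs/episode-01.srt` and `subs/episode-01.vtt` for a row that carries both strings.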

6. Customer-Support QA & Sentiment

Support teams transcribing call recordings for CSAT analysis and agent coaching.

Input: Support-call recordings (uploaded or hosted)
Output: Speaker-separated transcripts that can be fed into sentiment models
Use: Identify churn signals, coach agents on real conversations, benchmark CSAT against transcript patterns.

7. RAG & LLM Training Data

AI/LLM developers building retrieval-augmented-generation pipelines or training data from spoken-word sources.

Input: Bulk podcasts, conference talks, YouTube lectures
Output: Clean text + word-level timing for chunking and citation
Use: Build voice-grounded knowledge bases, citation-aware Q&A systems, or fine-tune on domain-specific spoken content.
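One possible way to use word-level timing for chunking and citation is to group words into fixed-size chunks that keep their start and end times. This is a sketch under assumptions: the fixed word count is an arbitrary chunking strategy chosen for illustration, `chunk_words` is a hypothetical helper, and the row is assumed to carry `segments[].words[]` as in the Output example.

```python
def chunk_words(item, max_words=200):
    """Group word entries into chunks of at most `max_words`, keeping time bounds."""
    # Flatten the per-segment word lists into one ordered list.
    words = [w for seg in item.get("segments", []) for w in seg.get("words", [])]
    chunks = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        chunks.append({
            "text": " ".join(w["word"] for w in group),
            "start": group[0]["start"],     # start time of the first word
            "end": group[-1]["end"],        # end time of the last word
            "source": item.get("inputUrl"), # original URL, usable as a citation
        })
    return chunks
```

Each chunk can then be embedded and stored with its `start`, `end`, and `source`, so an answer can point back to the exact moment in the recording.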

🔗 Integration Examples

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call({
  audioUrls: ['https://example.com/recording.mp3'],
  speakerDiarization: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].transcript);
console.log(items[0].srt); // ready-to-use SRT subtitle string

Python

from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

run = client.actor('sian.agency/INCREDIBLY-FAST-audio-transcriber').call(run_input={
    'audioUrls': ['https://example.com/recording.mp3'],
    'speakerDiarization': True,
})

for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['transcript'])
    print(item['vtt'])  # ready-to-use WebVTT subtitle string

cURL

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~INCREDIBLY-FAST-audio-transcriber/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "audioUrls": ["https://example.com/recording.mp3"],
    "speakerDiarization": true
  }'

Automation Workflows (n8n / Zapier / Make)

Trigger: New file in Drive/S3, new podcast episode in RSS, or a webhook
HTTP Request: Call this actor's runs endpoint
Process: Read transcript, segments[].words[], srt, vtt from the dataset
Action: Save to Notion / Airtable / Google Sheets / Slack / CRM with no transformation step

📊 Performance & Pricing

FREE Tier (Try It Now)

1 direct file URL or uploaded file per run — full feature access, same engine, same quality
≤5 MB or ≤60 seconds per file (hard precheck — never charged if over cap)
No credit card required
Perfect for testing accuracy on your audio before scaling up

PAID Tier (Production Ready)

Unlimited URLs and files per run
Up to 1 GB per file — long-form audio and video without splitting
YouTube, TikTok, and Instagram URLs supported — paste any link, the actor resolves the media
10× parallel processing — 100 files in ~1 hour
Pay-per-result: only charged for audio you actually transcribe

Base pricing:

Event	Price
Audio second processed	$0.0005 / sec
Audio second processed (EU region)	$0.0007 / sec (replaces base)
Speaker diarization	+$0.0001 / sec (only when enabled)
Translate to English	+$0.0003 / sec (only when enabled)

💰 Best price on the market — a 60-minute meeting with speaker diarization costs ~$2.16 ($1.80 transcription + $0.36 diarization). No subscriptions, no monthly minimums.
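The per-second rates in the table can be combined into a quick cost estimate. A minimal sketch using only the rates listed above; `estimate_cost` is a hypothetical helper, and it assumes the diarization and translation add-ons stack on top of the EU rate the same way they stack on the base rate, which the table does not state explicitly.

```python
from decimal import Decimal

BASE_RATE = Decimal("0.0005")         # audio second processed
EU_RATE = Decimal("0.0007")           # EU region, replaces the base rate
DIARIZATION_RATE = Decimal("0.0001")  # add-on, only when enabled
TRANSLATION_RATE = Decimal("0.0003")  # add-on, only when enabled


def estimate_cost(duration_seconds, *, diarization=False, translate=False, eu=False):
    """Estimate the price in USD for one file of the given audio length."""
    rate = EU_RATE if eu else BASE_RATE
    # Assumption: add-ons are added on top of whichever base rate applies.
    if diarization:
        rate += DIARIZATION_RATE
    if translate:
        rate += TRANSLATION_RATE
    return rate * Decimal(str(duration_seconds))


# A 60-minute meeting with diarization: 3600 s x $0.0006/s
meeting_cost = estimate_cost(3600, diarization=True)
```

Here `meeting_cost` equals 2.16, matching the worked example above ($1.80 transcription + $0.36 diarization). `Decimal` is used so the small per-second rates do not pick up floating-point rounding noise.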

🔗 View current pricing

❓ Frequently Asked Questions

Q: How accurate is the transcription? A: Typically 95–99% on clean audio in supported languages. Word-level timestamps ship with every transcript so you can verify and correct faster than transcribing from scratch.

Q: What audio and video formats are supported? A: MP3, M4A, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, MPEG, WebM. FREE tier: 1 URL/file per run at ≤5 MB or ≤60 seconds. PAID tier: unlimited at up to 1 GB per file.

Q: Can I transcribe non-English audio? A: Yes — auto-detection across 99+ languages including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, Hindi, and more. Toggle Translate to English to receive an English transcript alongside the original timestamps.

Q: Is speaker diarization included? A: It's an opt-in toggle. When enabled, every segment and word gets SPEAKER_00, SPEAKER_01, … labels (pyannote-audio). Billed at $0.0001 per audio second only when used.

Q: Can I paste YouTube, TikTok, or Instagram links? A: Yes — on the PAID tier. Paste them straight into audioUrls and the actor resolves the media and transcribes it. The FREE tier supports direct media file URLs (MP3, MP4, etc.) and uploaded files only — upgrade to PAID to unlock YouTube, TikTok, and Instagram link support. For per-word timestamps and speaker labels on YouTube specifically, use a direct file URL. For Facebook, use the dedicated Facebook AI Transcript Extractor.

Q: What output formats can I export? A: JSON, CSV, Excel — directly from the Apify dataset. Per-row you also get ready-to-use SRT and WebVTT subtitle strings.

Q: How long does processing take? A: A 1-minute clip usually finishes in 5–15 seconds. A 60-minute meeting takes 1–3 minutes on the paid tier. Bulk batches of 100 files complete in ~1 hour with 10× parallel processing.

Q: Is this legal? Where does my data go? A: Audio is sent to a transcription pipeline (US region by default; toggle EU-Region Processing for GDPR-aligned routing). Files and transcripts are not retained beyond the run. See the legal section below.

🐛 Troubleshooting

A file is rejected on the FREE tier

FREE tier caps each file at 5 MB or 60 seconds (whichever applies first). The precheck happens before any transcription cost — you're never charged for an over-cap file. Upgrade to PAID for files up to 1 GB.

A YouTube / TikTok / Instagram link is rejected on the FREE tier

Platform URL transcription is a PAID-only feature. The FREE tier accepts direct media file URLs (MP3, MP4, M4A, …) and uploaded files. Upgrade to PAID to paste YouTube, TikTok, and Instagram links directly.

YouTube/TikTok/Instagram link returns no speaker labels or per-word timestamps

Platform links use the platform's own captions where available, which are line-level only. For word-level timing + diarization, transcribe a direct file URL instead (download once, then paste the file URL).
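Rows that came from platform captions can be spotted through the `transcriptSource` field described in the Output section. A minimal sketch, assuming `items` is a list of dataset rows; `rows_without_word_timing` is a hypothetical helper:

```python
def rows_without_word_timing(items):
    """Return the input URLs of successful rows built from line-level captions."""
    return [
        item.get("inputUrl")
        for item in items
        if item.get("success") and item.get("transcriptSource") == "captions"
    ]
```

The returned URLs are the candidates to download once and resubmit as direct file URLs when word-level timing or speaker labels are needed.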

error: unsupported file type

Confirm the URL ends in (or serves) one of: MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, WebM. HTML pages, images, and PDFs are blocked before processing.

Low-quality transcript on noisy audio

Strong accents, background music, or compressed phone recordings reduce accuracy. Try explicitly setting language (instead of auto) to give the model a hint.

Rate limits or timeouts

The actor handles parallelism and retries internally. If a single run consistently times out, split your batch — the pay-per-second model means there's no penalty for using multiple smaller runs.
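Splitting a large batch into several smaller runs can be sketched as follows. The batch size of 10 is an arbitrary choice for illustration (it mirrors the paid tier's 10 concurrent files, but nothing in this document requires it), and both helper names are hypothetical.

```python
def split_batches(urls, batch_size=10):
    """Split a list of URLs into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def build_run_inputs(urls, batch_size=10, **options):
    """Build one actor input dict per batch, repeating the shared options."""
    return [{"audioUrls": batch, **options} for batch in split_batches(urls, batch_size)]
```

Each dict returned by `build_run_inputs` can be passed as `run_input` to the client call shown in the Python integration example, one run per batch.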

⚖️ Is it legal to scrape data?

Our actors are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our actors, when used for ethical purposes by Apify users, are safe.

However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers.

Use this actor only on audio you have rights to transcribe — your own recordings, content with consent, or properly licensed media. Audio and transcripts are not retained beyond the run's lifetime. EU-Region Processing is available via the toggle for GDPR-aligned workflows.

You can also read Apify's blog post on the legality of web scraping.

🤝 Support

Join our active support community

For issues or questions, open an issue in the actor's repository
Check SIÁN Agency Store for more automation tools
✉️ apify@sian-agency.online

Built by SIÁN Agency | More Tools
