Pricing

from $0.15 / 1,000 audio seconds processed

Transcribe Voice Memo to Text — Speaker Labels & Timestamps

Transcribe iPhone and Android voice memos to text. Speaker labels, word-level timestamps, SRT/VTT. Bulk upload, 99+ languages. Try free.

Developer

SIÁN OÜ

Last modified

11 days ago

How to transcribe a voice memo in 4 steps

1. Upload your voice memo files: drop .m4a (iPhone Voice Memos), .mp3, .wav, or another supported format into the Upload Voice Memo Files field. You can drag in many files at once for bulk transcription.
2. Pick your options: auto-detect the language or choose from 99+, toggle speaker diarization for multi-speaker recordings, and optionally translate non-English audio to English.
3. Run the actor: files are processed 10 at a time in parallel on the paid tier (3 at a time on the free tier); bulk batches usually finish in minutes.
4. Download results: every file lands in the dataset with the transcript, segment-level and word-level timestamps, speaker labels, and ready-to-use SRT/VTT subtitle strings.

Supported formats: M4A, MP3, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, WebM. Max 1 GB per file on the paid tier.
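
Before uploading a large batch, it can help to screen files locally against the format list and the per-file size limits stated on this page. The following is a minimal sketch, not part of the actor: the helper name `check_file` is hypothetical, and it assumes 1 MB = 1,000,000 bytes, which this page does not specify.

```python
from pathlib import Path

# Extensions taken from the supported-formats list on this page.
SUPPORTED_EXTENSIONS = {
    ".m4a", ".mp3", ".wav", ".flac", ".aac",
    ".opus", ".ogg", ".mp4", ".mov", ".webm",
}

# Per-file limits from this page: 50 MB (free tier), 1 GB (paid tier).
# Assumption: decimal megabytes; the actor's exact rounding is not documented here.
MAX_BYTES = {"free": 50 * 1_000_000, "paid": 1_000 * 1_000_000}


def check_file(name: str, size_bytes: int, tier: str = "paid") -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES[tier]:
        problems.append(f"file exceeds the {tier}-tier size limit")
    return problems


print(check_file("interview.M4A", 12_000_000, tier="free"))
print(check_file("notes.txt", 1_000, tier="free"))
print(check_file("lecture.wav", 80_000_000, tier="free"))
```

A file that fails only the free-tier size check, like the last one, would pass with `tier="paid"`.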

Example output — voice memo transcript with speaker labels

A typical successful transcription returns:

{
  "transcript": "Quick note for the team — the client wants the launch pushed to Q3...",
  "detected_language": "en",
  "duration": 42.18,
  "segments": [
    {
      "id": 0,
      "text": "Quick note for the team — the client wants the launch pushed to Q3.",
      "start": 0.20,
      "end": 4.86,
      "speaker": "SPEAKER_00",
      "language": "en",
      "words": [
        { "word": "Quick",  "start": 0.20, "end": 0.42, "speaker": "SPEAKER_00" },
        { "word": "note",   "start": 0.42, "end": 0.68, "speaker": "SPEAKER_00" },
        { "word": "for",    "start": 0.68, "end": 0.84, "speaker": "SPEAKER_00" }
      ]
    }
  ],
  "srt": "1\n00:00:00,200 --> 00:00:04,860\nQuick note for the team — the client wants the launch pushed to Q3.",
  "vtt": "WEBVTT\n\n00:00:00.200 --> 00:00:04.860\nQuick note for the team — the client wants the launch pushed to Q3.",
  "speakers": ["SPEAKER_00"],
  "languages": ["en"],
  "fileSizeMB": 0.31,
  "success": true
}

Every result includes the full transcript, segment-level timestamps, word-level timestamps, language detection, voice memo duration in seconds, file size, ready-to-use srt and vtt subtitle strings, and (when speaker diarization is enabled) speaker labels per segment and per word.
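
As an illustration of how these fields fit together, the sketch below turns one dataset item into readable, timestamped lines. It assumes only the field names shown in the example output (`segments`, `start`, `end`, `speaker`, `text`); the helper names and the sample item are hypothetical, and the `speaker` field is treated as optional because labels are only returned when diarization is enabled.

```python
def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(item: dict) -> list[str]:
    """Build one '[start - end] SPEAKER: text' line per segment."""
    lines = []
    for seg in item.get("segments", []):
        # Assumption: 'speaker' may be missing or empty when diarization is off.
        speaker = seg.get("speaker")
        prefix = f"{speaker}: " if speaker else ""
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {prefix}{seg['text']}")
    return lines


# Illustrative item shaped like the example output (not real actor output).
sample_item = {
    "segments": [
        {"text": "Quick note for the team.", "start": 0.20, "end": 4.86,
         "speaker": "SPEAKER_00"},
        {"text": "Got it, I will update the plan.", "start": 5.10, "end": 7.95,
         "speaker": "SPEAKER_01"},
    ]
}

for line in segment_lines(sample_item):
    print(line)
```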

Speaker diarization

Toggle the Speaker Diarization input to identify who is speaking in multi-person voice memos such as interview-style recordings, family conversations, and group brainstorms. Each segment and each word receives a speaker label (SPEAKER_00, SPEAKER_01, …) so you can keep one person's quotes separate from another's. Diarization is powered by pyannote-audio, an open-source speaker diarization toolkit. It is charged at $0.0001 per audio second and billed only when enabled.
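
One way to keep each person's quotes separate is to group segment text by speaker label after the run. This is a minimal sketch that assumes the `segments[].speaker` and `segments[].text` fields from the example output; the function name, the "UNKNOWN" fallback label, and the sample item are hypothetical.

```python
from collections import defaultdict


def quotes_by_speaker(item: dict) -> dict[str, list[str]]:
    """Group segment text by speaker label, preserving segment order."""
    grouped = defaultdict(list)
    for seg in item.get("segments", []):
        # Fallback label for segments without a speaker (for example when
        # diarization was not enabled); "UNKNOWN" is not an actor value.
        label = seg.get("speaker") or "UNKNOWN"
        grouped[label].append(seg["text"])
    return dict(grouped)


# Illustrative item shaped like the example output (not real actor output).
sample_item = {
    "segments": [
        {"text": "Quick note for the team.", "speaker": "SPEAKER_00"},
        {"text": "Got it, I will update the plan.", "speaker": "SPEAKER_01"},
        {"text": "Thanks.", "speaker": "SPEAKER_00"},
    ]
}

for speaker, quotes in quotes_by_speaker(sample_item).items():
    print(speaker, quotes)
```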

SRT / VTT export for video repurposing

Every transcription returns ready-to-use srt and vtt subtitle strings. Save the field value as a .srt or .vtt file and:

Drop it into a video editor to caption a video version of your voice memo (great for short-form social posts)
Use it as a starter caption track for YouTube uploads
Add accessible captions to web embeds with an HTML5 <track> element (use the .vtt file, since <track> expects WebVTT)

Set Timestamp Granularities to word for cue precision down to individual words.
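
Saving the field values to disk can be sketched as follows. The code assumes a dataset item that carries the `srt` and `vtt` strings shown in the example output; the `save_subtitles` helper, the file stem, and the sample item are hypothetical.

```python
import tempfile
from pathlib import Path


def save_subtitles(item: dict, out_dir: str, stem: str) -> list[Path]:
    """Write the item's 'srt' and 'vtt' strings to <stem>.srt and <stem>.vtt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for field in ("srt", "vtt"):
        text = item.get(field)
        if not text:
            continue  # skip missing or empty subtitle fields
        if not text.endswith("\n"):
            text += "\n"  # end the file with a newline
        path = out / f"{stem}.{field}"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


# Illustrative item shaped like the example output (not real actor output).
sample_item = {
    "srt": "1\n00:00:00,200 --> 00:00:04,860\nQuick note for the team.",
    "vtt": "WEBVTT\n\n00:00:00.200 --> 00:00:04.860\nQuick note for the team.",
}

with tempfile.TemporaryDirectory() as tmp:
    for path in save_subtitles(sample_item, tmp, "voice-memo"):
        print(path.name, path.stat().st_size, "bytes")
```

Writing with UTF-8 encoding keeps non-English transcripts intact in most video editors and browsers.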

Why voice-memo users choose this actor

✅ 99+ languages — automatic detection across English, Spanish, French, Mandarin, Arabic, Portuguese, and 90+ more
📤 Direct file upload — drop .m4a, .mp3, .wav straight from your phone or Mac, no need to host them anywhere first
🎤 Speaker diarization powered by pyannote-audio — separate the recorder from interview guests automatically
⏱️ Word-level timestamps on every transcription — {word, start, end, speaker} per word, ready for quote search and clip extraction
🎬 SRT and VTT subtitles included on every successful run — perfect for turning a voice memo into a captioned video
🚀 Bulk processing — drop in 10, 50, or 200 files at once; 10× parallel on the paid tier
💰 Pay per audio second — no subscriptions, no minimums; only pay for the audio you actually transcribe
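
The quote-search use of word-level timestamps mentioned above can be sketched in a few lines. The example assumes the `segments[].words[]` layout from the example output, with `word`, `start`, and `end` per entry; `find_phrase` is a hypothetical helper, and the matching is a simple case-insensitive comparison that ignores surrounding punctuation.

```python
import string


def _normalize(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation and whitespace."""
    return token.strip(string.punctuation + string.whitespace).lower()


def find_phrase(item: dict, phrase: str) -> list[tuple[float, float]]:
    """Return (start, end) times for every occurrence of phrase in the words."""
    words = [w for seg in item.get("segments", []) for w in seg.get("words", [])]
    tokens = [_normalize(w["word"]) for w in words]
    target = [_normalize(t) for t in phrase.split()]
    hits = []
    if not target:
        return hits
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append((words[i]["start"], words[i + len(target) - 1]["end"]))
    return hits


# Word timings copied from the example output on this page.
sample_item = {
    "segments": [
        {"words": [
            {"word": "Quick", "start": 0.20, "end": 0.42},
            {"word": "note", "start": 0.42, "end": 0.68},
            {"word": "for", "start": 0.68, "end": 0.84},
        ]}
    ]
}

print(find_phrase(sample_item, "note for"))  # [(0.42, 0.84)]
```

The returned time range can be passed to an audio or video editor to cut the matching clip.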

Use cases

🎓 Students — turn lecture and seminar voice memos into searchable study notes; never lose a key concept buried in an hour of audio
📰 Journalists — transcribe phone-recorded interviews on the go; pull attributed quotes with word-level timestamps
💼 Professionals dictating notes — convert post-meeting voice recaps, brainstorms, and quick ideas into shareable text
🧪 Qualitative researchers — preserve participant voice memos with speaker separation for thematic analysis
✍️ Writers and creators — capture voice notes in the wild, edit them as text drafts later
🎙️ Interview-style podcasters recording on phones — get clean, attributed transcripts ready for show notes
📱 Anyone using iPhone Voice Memos or Android voice recorders — the lowest-friction path from spoken word to text

Pricing & tiers

Pay only for the audio seconds you actually transcribe. No subscriptions, no minimums.

FREE tier	PAID tier
Perfect for testing and small jobs	Built for production volume
Up to 5 files per run	Unlimited files per run
50 MB max per file	1 GB max per file
200 MB / 20 minutes monthly	Unlimited monthly volume
3 concurrent files	10 concurrent files (10× parallel)
No credit card required	$0.0005 per audio second

Optional add-ons (only billed when enabled):

Feature	Price
Speaker diarization	$0.0001 per audio second
Translate to English	$0.0003 per audio second
EU-region processing	$0.0007 per audio second (replaces base $0.0005)

A 5-minute voice memo on the paid tier costs approximately $0.15 (transcription only).
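
The arithmetic behind that figure is 300 seconds × $0.0005 = $0.15. The sketch below generalizes it using the per-second rates from the tables above. It assumes add-on rates are simply added to the base rate and that no rounding or minimum charge applies, neither of which this page states; `estimate_cost` is a hypothetical helper, not part of the actor.

```python
from decimal import Decimal

# Per-second rates from the pricing tables on this page.
BASE_RATE = Decimal("0.0005")
EU_RATE = Decimal("0.0007")           # replaces the base rate when enabled
DIARIZATION_RATE = Decimal("0.0001")  # add-on
TRANSLATION_RATE = Decimal("0.0003")  # add-on


def estimate_cost(audio_seconds, diarization=False, translate=False,
                  eu_region=False) -> Decimal:
    """Estimate the paid-tier cost in USD for a given audio duration.

    Assumption: add-ons are billed additively per audio second, with no
    rounding or minimum charge.
    """
    rate = EU_RATE if eu_region else BASE_RATE
    if diarization:
        rate += DIARIZATION_RATE
    if translate:
        rate += TRANSLATION_RATE
    return Decimal(str(audio_seconds)) * rate


print(estimate_cost(300))                    # 0.1500 (5-minute memo)
print(estimate_cost(300, diarization=True))  # 0.1800
print(estimate_cost(3600, diarization=True, translate=True, eu_region=True))  # 3.9600
```

Decimal is used instead of float so that small per-second rates multiply without binary rounding noise.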

Integration examples

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('sian.agency/transcribe-voice-memo-to-text').call({
    audioFiles: ['https://example.com/voice-memo.m4a'],
    speakerDiarization: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].transcript);
console.log(items[0].srt);

Python

from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

run = client.actor('sian.agency/transcribe-voice-memo-to-text').call(run_input={
    'audioFiles': ['https://example.com/voice-memo.m4a'],
    'speakerDiarization': True,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items[0]['transcript'])
print(items[0]['vtt'])

cURL

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~transcribe-voice-memo-to-text/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "audioFiles": ["https://example.com/voice-memo.m4a"],
    "speakerDiarization": true
  }'

n8n / Zapier / Make

Wire this actor as a downstream step on any "new voice memo synced" trigger (Dropbox, iCloud, Google Drive). The dataset record returned per item includes transcript, segments[].words[], srt, and vtt — drop them into Notion, Slack, Google Sheets, Obsidian, or your CRM with no transformation step.

FAQ

How accurate is voice memo transcription? The actor uses an industrial-grade speech-to-text pipeline tuned for natural conversation. Accuracy is typically 95–99% on clean iPhone or Android voice recordings and lower in noisy environments or with strong accents. Word-level timestamps are returned even when accuracy is imperfect, so you can verify and correct the text faster than transcribing from scratch.

What audio formats are supported? M4A (iPhone Voice Memos default), MP3, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, WebM. Max 50 MB per file on the free tier, 1 GB per file on the paid tier.

Can I transcribe non-English voice memos? Yes — auto-detection across 99+ languages including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, Hindi. Toggle Translate to English to receive an English transcript alongside the timestamps.

Is speaker diarization included? Yes, opt-in via the Speaker Diarization toggle. Each segment and word gets labeled SPEAKER_00, SPEAKER_01, etc. Powered by pyannote-audio. Billed at $0.0001 per audio second only when enabled.

How does pricing work? Pay-per-audio-second. The free tier covers small jobs and testing without a credit card. The paid tier is $0.0005 per second of audio, plus optional add-ons for diarization, translation, and EU processing. No subscriptions.

Can I use this in n8n, Zapier, or Make? Yes. The actor exposes a standard Apify run/dataset API. Use any "trigger → run actor → use dataset items" pattern. The dataset record includes transcript, segments[].words[], srt, and vtt ready to feed into downstream tools.

Do I need to host my voice memos somewhere first? No. Use the Upload Voice Memo Files field to upload directly from your computer or phone. Apify stores the upload in your key-value store and the actor processes it from there.

How long does a transcription take? A 5-minute voice memo usually finishes in 10–30 seconds. A 60-minute recording takes 1–3 minutes on the paid tier. Bulk batches process 10 files in parallel on the paid tier and 3 on the free tier.

Legal disclaimer

Use this actor only on voice memos and audio you have rights to transcribe — your own recordings, content with consent, or properly licensed media. The actor does not retain audio or transcripts beyond the run's lifetime. EU-region processing is available via the EU Processing toggle for GDPR-aligned workflows. SIÁN Agency provides this actor as-is; users are responsible for the legal use of transcribed content.

Support

Join the Telegram support group, email apify@sian-agency.online, or open an issue on the SIÁN Agency Apify Store page.

More from SIÁN Agency

Browse the full SIÁN Agency Apify Store for all available actors.
