Transcribe Voice Memo to Text — Speaker Labels & Timestamps

Pricing

from $0.15 / 1,000 audio seconds processed

Transcribe iPhone and Android voice memos to text. Speaker labels, word-level timestamps, SRT/VTT. Bulk upload, 99+ languages. Try free.

Developer

SIÁN OÜ

Maintained by Community

Transcribe iPhone and Android voice memos to text. Drop your .m4a files into the upload field, get clean transcripts with speaker labels, word-level timestamps, and SRT/VTT subtitles ready for video repurposing. 99+ languages, bulk processing, free tier available.


How to transcribe a voice memo in 4 steps

  1. Upload your voice memo files — drop .m4a (iPhone Voice Memos), .mp3, .wav, or any common audio format into the Upload Voice Memo Files field. You can drag in many at once for bulk transcription.
  2. Pick your options — auto-detect language or pick from 99+, toggle speaker diarization for multi-speaker recordings, and optionally translate non-English audio to English.
  3. Run the actor — files process 10 at a time in parallel on the paid tier; bulk batches are usually done in minutes.
  4. Download results — every file lands in the dataset with the transcript, segment-level + word-level timestamps, speaker labels, and ready-to-use SRT/VTT subtitle strings.

Supported formats: M4A, MP3, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, WebM. Max 1 GB per file on the paid tier.
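A pre-flight check can catch unsupported or oversized files before you upload them. The following is a minimal sketch, not part of the actor: the `check_file` helper is hypothetical, the extension list is copied from the supported formats above, the 50 MB free-tier limit comes from the pricing section, and treating 1 MB as 1,000,000 bytes is an assumption.

```python
from pathlib import Path

# Extensions copied from the "Supported formats" list.
SUPPORTED_EXTENSIONS = {
    ".m4a", ".mp3", ".wav", ".flac", ".aac",
    ".opus", ".ogg", ".mp4", ".mov", ".webm",
}

# Per-file limits per tier. Assumption: 1 MB = 1,000,000 bytes, 1 GB = 1,000 MB.
MAX_BYTES = {"free": 50 * 1_000_000, "paid": 1_000 * 1_000_000}


def check_file(name: str, size_bytes: int, tier: str = "paid") -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    suffix = Path(name).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES[tier]:
        problems.append(f"file exceeds the {tier}-tier size limit")
    return problems


if __name__ == "__main__":
    print(check_file("interview.m4a", 12_000_000, tier="free"))
    print(check_file("notes.txt", 1_000, tier="free"))
```

The function returns messages instead of raising, so a bulk batch can report every rejected file in one pass.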


Example output — voice memo transcript with speaker labels

A typical successful transcription returns:

{
  "transcript": "Quick note for the team — the client wants the launch pushed to Q3...",
  "detected_language": "en",
  "duration": 42.18,
  "segments": [
    {
      "id": 0,
      "text": "Quick note for the team — the client wants the launch pushed to Q3.",
      "start": 0.20,
      "end": 4.86,
      "speaker": "SPEAKER_00",
      "language": "en",
      "words": [
        { "word": "Quick", "start": 0.20, "end": 0.42, "speaker": "SPEAKER_00" },
        { "word": "note", "start": 0.42, "end": 0.68, "speaker": "SPEAKER_00" },
        { "word": "for", "start": 0.68, "end": 0.84, "speaker": "SPEAKER_00" }
      ]
    }
  ],
  "srt": "1\n00:00:00,200 --> 00:00:04,860\nQuick note for the team — the client wants the launch pushed to Q3.",
  "vtt": "WEBVTT\n\n00:00:00.200 --> 00:00:04.860\nQuick note for the team — the client wants the launch pushed to Q3.",
  "speakers": ["SPEAKER_00"],
  "languages": ["en"],
  "fileSizeMB": 0.31,
  "success": true
}

Every result includes the full transcript, segment-level timestamps, word-level timestamps, language detection, voice memo duration in seconds, file size, ready-to-use srt and vtt subtitle strings, and (when speaker diarization is enabled) speaker labels per segment and per word.
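Word-level timestamps make it straightforward to locate where something was said. The sketch below assumes a dataset item shaped like the example output (`segments[].words[]` with `word`, `start`, `end`, and an optional `speaker`); the `find_word` helper and the trimmed `SAMPLE` record are illustrative and not part of the actor.

```python
# Trimmed copy of the example output, kept only to make the sketch runnable.
SAMPLE = {
    "segments": [
        {
            "words": [
                {"word": "Quick", "start": 0.20, "end": 0.42, "speaker": "SPEAKER_00"},
                {"word": "note", "start": 0.42, "end": 0.68, "speaker": "SPEAKER_00"},
                {"word": "for", "start": 0.68, "end": 0.84, "speaker": "SPEAKER_00"},
            ]
        }
    ]
}


def find_word(result: dict, query: str) -> list[tuple[float, float, str]]:
    """Return (start, end, speaker) for each word equal to `query`.

    Matching ignores case and surrounding punctuation or whitespace.
    """
    needle = query.lower()
    hits = []
    for segment in result.get("segments", []):
        for entry in segment.get("words", []):
            token = entry["word"].strip(" .,!?;:\"'").lower()
            if token == needle:
                hits.append((entry["start"], entry["end"], entry.get("speaker", "")))
    return hits


if __name__ == "__main__":
    for start, end, speaker in find_word(SAMPLE, "note"):
        print(f"{speaker}: {start:.2f}s to {end:.2f}s")
```

The returned start and end times can be passed to an audio or video editor to cut a clip around the quote.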


Speaker diarization

Toggle the Speaker Diarization input to identify who is speaking in multi-person voice memos such as interviews, family conversations, and group brainstorms. Each segment and each word receives a speaker label (SPEAKER_00, SPEAKER_01, …) so you can keep one person's quotes separate from another's. Diarization is powered by pyannote-audio, an open-source speaker diarization toolkit. It is billed at $0.0001 per audio second, and only when enabled.
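With diarization enabled, consecutive segments from the same speaker can be merged into readable turns. This is a rough sketch under the assumption that each segment carries `text` and `speaker` fields as in the example output; the `speaker_turns` helper, the `UNKNOWN` fallback label, and the demo conversation are hypothetical.

```python
def speaker_turns(result: dict) -> list[tuple[str, str]]:
    """Merge adjacent segments that share a speaker into (speaker, text) turns."""
    turns: list[tuple[str, str]] = []
    for segment in result.get("segments", []):
        # Fall back to a placeholder label when diarization was not enabled.
        speaker = segment.get("speaker", "UNKNOWN")
        text = segment["text"].strip()
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


if __name__ == "__main__":
    # Hypothetical two-speaker recording, not real actor output.
    demo = {
        "segments": [
            {"text": "Thanks for joining.", "speaker": "SPEAKER_00"},
            {"text": "Can you hear me?", "speaker": "SPEAKER_00"},
            {"text": "Loud and clear.", "speaker": "SPEAKER_01"},
        ]
    }
    for speaker, text in speaker_turns(demo):
        print(f"{speaker}: {text}")
```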


SRT / VTT export for video repurposing

Every transcription returns ready-to-use srt and vtt subtitle strings. Save the field value as a .srt or .vtt file and:

  • Drop it into a video editor to caption a video version of your voice memo (great for short-form social posts)
  • Use it as a starter caption track for YouTube uploads
  • Add HTML5 <track> accessibility captions to web embeds (use the vtt string, since <track> expects WebVTT)

Set Timestamp Granularities to word for cue precision down to individual words.
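Saving the subtitle strings to disk takes only a few lines. The sketch below assumes a dataset item with `srt` and `vtt` string fields as in the example output; `subtitle_files` and `save_subtitles` are hypothetical helper names, and the `stem` argument is whatever base filename you choose.

```python
from pathlib import Path


def subtitle_files(result: dict, stem: str) -> dict[str, str]:
    """Map output filenames to subtitle content, skipping missing or empty fields."""
    files = {}
    for field in ("srt", "vtt"):
        content = result.get(field)
        if content:
            files[f"{stem}.{field}"] = content
    return files


def save_subtitles(result: dict, stem: str, out_dir: str = ".") -> list[Path]:
    """Write the subtitle strings to <out_dir>/<stem>.srt and <out_dir>/<stem>.vtt."""
    written = []
    for filename, content in subtitle_files(result, stem).items():
        path = Path(out_dir) / filename
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    item = {
        "srt": "1\n00:00:00,200 --> 00:00:04,860\nQuick note for the team.",
        "vtt": "WEBVTT\n\n00:00:00.200 --> 00:00:04.860\nQuick note for the team.",
    }
    print(save_subtitles(item, "voice-memo"))
```

Writing UTF-8 explicitly avoids platform-dependent default encodings, which matters for non-English transcripts.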


Why voice-memo users choose this actor

  • 99+ languages — automatic detection across English, Spanish, French, Mandarin, Arabic, Portuguese, and 90+ more
  • 📤 Direct file upload — drop .m4a, .mp3, .wav straight from your phone or Mac, no need to host them anywhere first
  • 🎤 Speaker diarization powered by pyannote-audio — separate the recorder from interview guests automatically
  • ⏱️ Word-level timestamps on every transcription — {word, start, end, speaker} per word, ready for quote search and clip extraction
  • 🎬 SRT and VTT subtitles included on every successful run — perfect for turning a voice memo into a captioned video
  • 🚀 Bulk processing — drop in 10, 50, or 200 files at once; 10× parallel on the paid tier
  • 💰 Pay per audio second — no subscriptions, no minimums; only pay for the audio you actually transcribe

Use cases

  • 🎓 Students — turn lecture and seminar voice memos into searchable study notes; never lose a key concept buried in an hour of audio
  • 📰 Journalists — transcribe phone-recorded interviews on the go; pull attributed quotes with word-level timestamps
  • 💼 Professionals dictating notes — convert post-meeting voice recaps, brainstorms, and quick ideas into shareable text
  • 🧪 Qualitative researchers — preserve participant voice memos with speaker separation for thematic analysis
  • ✍️ Writers and creators — capture voice notes in the wild, edit them as text drafts later
  • 🎙️ Interview-style podcasters recording on phones — get clean, attributed transcripts ready for show notes
  • 📱 Anyone using iPhone Voice Memos or Android voice recorders — the lowest-friction path from spoken word to text

Pricing & tiers

Pay only for the audio seconds you actually transcribe. No subscriptions, no minimums.

| FREE tier | PAID tier |
| --- | --- |
| Perfect for testing and small jobs | Built for production volume |
| Up to 5 files per run | Unlimited files per run |
| 50 MB max per file | 1 GB max per file |
| 200 MB / 20 minutes monthly | Unlimited monthly volume |
| 3 concurrent files | 10 concurrent files (10× parallel) |
| No credit card required | $0.0005 per audio second |

Optional add-ons (only billed when enabled):

| Feature | Price |
| --- | --- |
| Speaker diarization | $0.0001 per audio second |
| Translate to English | $0.0003 per audio second |
| EU-region processing | $0.0007 per audio second (replaces base $0.0005) |

A 5-minute voice memo on the paid tier costs approximately $0.15 (transcription only).
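The arithmetic behind that figure is 300 seconds × $0.0005 = $0.15. A small estimator can extend it to the add-ons. This is a sketch only: the `estimate_cost` function is hypothetical, the rates are copied from the tables above, and it assumes add-on rates simply stack on top of the base (or EU) rate, which the listing does not state explicitly.

```python
from decimal import Decimal

# Rates in USD per audio second, copied from the pricing tables.
BASE_RATE = Decimal("0.0005")
EU_RATE = Decimal("0.0007")           # replaces the base rate when EU processing is on
DIARIZATION_RATE = Decimal("0.0001")
TRANSLATION_RATE = Decimal("0.0003")


def estimate_cost(
    audio_seconds: float,
    diarization: bool = False,
    translate: bool = False,
    eu_region: bool = False,
) -> Decimal:
    """Estimate the paid-tier cost in USD for one recording.

    Assumption: add-on rates are added to the base (or EU) rate per second.
    """
    rate = EU_RATE if eu_region else BASE_RATE
    if diarization:
        rate += DIARIZATION_RATE
    if translate:
        rate += TRANSLATION_RATE
    # Convert via str so float inputs do not carry binary rounding noise.
    return rate * Decimal(str(audio_seconds))


if __name__ == "__main__":
    print(estimate_cost(5 * 60))                    # 5-minute memo, transcription only
    print(estimate_cost(5 * 60, diarization=True))  # same memo with speaker labels
```

Decimal is used instead of float so that small per-second rates multiply out exactly.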


Integration examples

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('sian.agency/transcribe-voice-memo-to-text').call({
    audioFiles: ['https://example.com/voice-memo.m4a'],
    speakerDiarization: true,
});

// Each processed file is one item in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].transcript);
console.log(items[0].srt);

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('sian.agency/transcribe-voice-memo-to-text').call(run_input={
    'audioFiles': ['https://example.com/voice-memo.m4a'],
    'speakerDiarization': True,
})

# Each processed file is one item in the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items[0]['transcript'])
print(items[0]['vtt'])

cURL

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~transcribe-voice-memo-to-text/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "audioFiles": ["https://example.com/voice-memo.m4a"],
    "speakerDiarization": true
  }'

n8n / Zapier / Make

Wire this actor as a downstream step on any "new voice memo synced" trigger (Dropbox, iCloud, Google Drive). The dataset record returned per item includes transcript, segments[].words[], srt, and vtt — drop them into Notion, Slack, Google Sheets, Obsidian, or your CRM with no transformation step.


FAQ

How accurate is voice memo transcription? Transcription runs on an industrial speech-to-text pipeline tuned for natural conversation. Accuracy is typically 95–99% on clean iPhone or Android voice recordings, and lower in noisy environments or with strong accents. Word-level timestamps are returned even when accuracy is imperfect, so you can verify and correct faster than transcribing from scratch.

What audio formats are supported? M4A (iPhone Voice Memos default), MP3, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, WebM. Max 50 MB per file on the free tier, 1 GB per file on the paid tier.

Can I transcribe non-English voice memos? Yes. The language is auto-detected across 99+ languages, including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, and Hindi. Toggle Translate to English to receive an English transcript alongside the timestamps.

Is speaker diarization included? Yes, opt-in via the Speaker Diarization toggle. Each segment and word gets labeled SPEAKER_00, SPEAKER_01, etc. Powered by pyannote-audio. Billed at $0.0001 per audio second only when enabled.

How does pricing work? Pricing is per audio second. The free tier covers small jobs and testing without a credit card. The paid tier is $0.0005 per second of audio, plus optional add-ons for diarization and translation; EU processing replaces the base rate with $0.0007 per second. No subscriptions.

Can I use this in n8n, Zapier, or Make? Yes. The actor exposes a standard Apify run/dataset API. Use any "trigger → run actor → use dataset items" pattern. The dataset record includes transcript, segments[].words[], srt, and vtt ready to feed into downstream tools.

Do I need to host my voice memos somewhere first? No. Use the Upload Voice Memo Files field to upload directly from your computer or phone. Apify stores the upload in your key-value store and the actor processes it from there.

How long does a transcription take? A 5-minute voice memo usually finishes in 10–30 seconds. A 60-minute recording takes 1–3 minutes on the paid tier. Bulk batches process 10 files in parallel.


Use this actor only on voice memos and audio you have rights to transcribe — your own recordings, content with consent, or properly licensed media. The actor does not retain audio or transcripts beyond the run's lifetime. EU-region processing is available via the EU Processing toggle for GDPR-aligned workflows. SIÁN Agency provides this actor as-is; users are responsible for the legal use of transcribed content.


Support

Join the Telegram support group, email support@sian-agency.online, or open an issue on the SIÁN Agency Apify Store page.


More from SIÁN Agency

Browse the full SIÁN Agency Apify Store for all available actors.