Transcribe Interview to Text — for Journalists & Researchers

Pricing

from $0.15 / 1,000 seconds of video processed

Transcribe interviews and recorded conversations to text. Speaker labels for interviewer and guest, word-level timestamps, SRT/VTT. Try free.

Developer

SIÁN OÜ

Maintained by Community

SIÁN Agency Store · Telegram Support · Instagram AI Transcript Extractor · Best TikTok AI Transcript Extractor · YouTube Shorts AI Transcript Extractor · Facebook AI Transcript Extractor

Transcribe interviews and recorded conversations to text. Built for journalists, qualitative researchers, market researchers, and anyone with hours of interview tape. Speaker labels for interviewer and guest, word-level timestamps for precise quote extraction, SRT/VTT subtitles, 99+ languages.


How to transcribe an interview in 4 steps

  1. Upload your interview recordings: drop .m4a, .mp3, .wav, .mp4, or any common format into the Upload Interview Recordings field. Bulk uploads are supported.
  2. Pick your options: let the actor auto-detect the language or pick one of 99+, toggle speaker diarization to separate the interviewer from each guest, and optionally translate non-English interviews to English.
  3. Run the actor: on the paid tier, recordings are processed 10 at a time in parallel, so an entire project's interviews can be transcribed in one run.
  4. Download results: every recording lands in the dataset with its transcript, segment- and word-level timestamps, speaker labels, and ready-to-use SRT/VTT subtitle strings.

Supported formats: M4A, MP3, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, WebM. Max 1 GB per file on the paid tier.
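If you are scripting bulk uploads, a quick pre-flight check can catch files the actor would reject before a run starts. The Python sketch below is a minimal example, assuming the format list above and the per-file limits stated on this page (50 MB on the free tier, 1 GB on the paid tier). `preflight_check` is a hypothetical helper, not part of the actor, and reading "MB" as 1024² bytes is an assumption.

```python
from pathlib import Path

# Formats and per-file limits as listed on this page.
# Assumption: "MB" and "GB" are binary units (1024**2 and 1024**3 bytes).
SUPPORTED_EXTENSIONS = {
    ".m4a", ".mp3", ".wav", ".flac", ".aac",
    ".opus", ".ogg", ".mp4", ".mov", ".webm",
}
MAX_BYTES = {"free": 50 * 1024**2, "paid": 1024**3}


def preflight_check(filename: str, size_bytes: int, tier: str = "paid") -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()  # case-insensitive extension match
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    limit = MAX_BYTES[tier]
    if size_bytes > limit:
        problems.append(f"file is {size_bytes} bytes; {tier} tier limit is {limit} bytes")
    return problems
```

For example, `preflight_check("source-call.M4A", 12_000_000, tier="free")` returns an empty list, while a `.docx` file is flagged as an unsupported format.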


Example output — interview transcript with speaker labels

{
  "transcript": "Interviewer: Tell me about the first time you realized... Guest: Honestly, it was when my mentor pulled me aside and said...",
  "detected_language": "en",
  "duration": 1432.7,
  "segments": [
    {
      "id": 0,
      "text": "Tell me about the first time you realized you wanted to do this.",
      "start": 0.42,
      "end": 4.18,
      "speaker": "SPEAKER_00",
      "language": "en",
      "words": [
        { "word": "Tell", "start": 0.42, "end": 0.61, "speaker": "SPEAKER_00" },
        { "word": "me", "start": 0.61, "end": 0.74, "speaker": "SPEAKER_00" }
      ]
    },
    {
      "id": 1,
      "text": "Honestly, it was when my mentor pulled me aside.",
      "start": 4.86,
      "end": 8.94,
      "speaker": "SPEAKER_01",
      "language": "en",
      "words": []
    }
  ],
  "srt": "1\n00:00:00,420 --> 00:00:04,180\nTell me about the first time you realized you wanted to do this.\n\n2\n00:00:04,860 --> 00:00:08,940\nHonestly, it was when my mentor pulled me aside.",
  "vtt": "WEBVTT\n\n00:00:00.420 --> 00:00:04.180\nTell me about the first time you realized you wanted to do this.\n\n00:00:04.860 --> 00:00:08.940\nHonestly, it was when my mentor pulled me aside.",
  "speakers": ["SPEAKER_00", "SPEAKER_01"],
  "languages": ["en"],
  "fileSizeMB": 12.6,
  "success": true
}

Every result includes the full transcript, segment-level timestamps, word-level timestamps, language detection, recording duration in seconds, file size, ready-to-use srt and vtt subtitle strings, and (when speaker diarization is enabled) speaker labels per segment and per word.
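A minimal Python sketch for turning one dataset item into a readable, speaker-labeled transcript, assuming the field names shown in the example output (`segments`, `start`, `text`, `speaker`). The helper names are hypothetical, and the fallback to `UNKNOWN` assumes the `speaker` field is missing or null when diarization is off.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions of a second are truncated)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def readable_transcript(item: dict) -> str:
    """One line per segment: [start] SPEAKER: text."""
    lines = []
    for seg in item.get("segments", []):
        # Assumption: no usable speaker label when diarization is disabled.
        speaker = seg.get("speaker") or "UNKNOWN"
        lines.append(f"[{format_timestamp(seg['start'])}] {speaker}: {seg['text']}")
    return "\n".join(lines)
```

For the example output above, the first line reads `[00:00:00] SPEAKER_00: Tell me about the first time you realized you wanted to do this.`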


Built for journalists, qualitative researchers, market researchers

  • 📰 Journalists — turn phone-recorded interviews into clean transcripts ready for quote pulling
  • 🧪 Qualitative researchers — preserve participant voices with speaker separation for thematic analysis
  • 📊 Market researchers — bulk-transcribe focus group and 1:1 interview tapes
  • 📚 Oral historians — searchable, time-stamped archives of long-form interviews
  • 🎙️ Podcasters publishing interview shows — transcripts for show notes, blog repurposing, and SEO

Speaker diarization (interviewer / guest separation)

Toggle the Speaker Diarization input and the actor labels every segment and every word with the speaker who said it: SPEAKER_00, SPEAKER_01, SPEAKER_02, and so on. Labels are assigned in order of first speaking turn, so SPEAKER_00 is usually the interviewer and SPEAKER_01 the first guest to answer. This makes it straightforward to extract quote attributions for journalism or to code qualitative data for research. Powered by pyannote-audio. Charged per audio second and billed only when enabled.
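Because labels follow speaking order rather than role, it is worth mapping them to real names only after checking a few segments by ear. The sketch below, assuming the `segments` shape from the example output, merges consecutive segments from the same speaker into turns for quote attribution. `speaker_turns` and the names used with it are hypothetical.

```python
from typing import Optional


def speaker_turns(segments: list[dict], names: Optional[dict] = None) -> list[dict]:
    """Merge consecutive segments from the same speaker into turns.

    `names` maps diarization labels (e.g. "SPEAKER_00") to real names;
    labels without a mapping are kept as-is.
    """
    names = names or {}
    turns: list[dict] = []
    for seg in segments:
        label = seg.get("speaker") or "UNKNOWN"
        who = names.get(label, label)
        if turns and turns[-1]["speaker"] == who:
            # Same speaker kept talking: extend the current turn.
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": who, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"]})
    return turns
```

Usage: `speaker_turns(item["segments"], {"SPEAKER_00": "Reporter", "SPEAKER_01": "Guest"})` returns one entry per uninterrupted turn, each with its start and end time for checking the quote against the audio.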


Translate foreign-language interviews to English

Toggle Translate to English and the actor returns an English translation of the transcript with timing preserved, so you can conduct interviews in your subject's native language and publish in English. Combine it with Speaker Diarization to get speaker-attributed English quotes. Billed separately, and only when enabled.


SRT / VTT subtitle export

Every transcription returns ready-to-use srt and vtt subtitle strings. Save the field value as a .srt or .vtt file and:

  • Publish a video version of the interview with subtitles for YouTube, Vimeo, or your CMS
  • Add HTML5 <track> accessibility captions to embedded video
  • Build a searchable interview archive with timestamps

Set Timestamp Granularities to word for cue precision down to individual words.
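If you want cue boundaries of your own choosing, word-level timestamps are enough to build SRT cues yourself. A minimal sketch, assuming each entry in `segments[].words[]` has `word`, `start`, and `end` as in the example output; `words_to_srt` and the seven-word cue length are illustrative choices, not actor settings.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group word-level timestamps into numbered SRT cues of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        # Cue runs from the first word's start to the last word's end.
        cues.append(f"{len(cues) + 1}\n"
                    f"{srt_time(chunk[0]['start'])} --> {srt_time(chunk[-1]['end'])}\n"
                    f"{text}")
    return "\n\n".join(cues)
```

To keep the actor's own cues instead, write `item["srt"]` or `item["vtt"]` directly to a `.srt` or `.vtt` file. The sketch ignores speaker changes, so a cue can span two speakers; splitting on the speaker label first would avoid that.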


Why interviewers choose this transcriber

  • Interviewer ↔ guest separation out of the box via pyannote-audio diarization — clean attributed quotes ready for publication
  • ⏱️ Word-level timestamps for every word — find any quote in a 90-minute interview in seconds
  • 🌐 Translate non-English interviews to English in the same run — perfect for international journalism and cross-cultural research
  • 🎬 SRT and VTT subtitles included for video versions of interviews
  • 🌍 99+ languages — automatic detection, no manual selection
  • 🇪🇺 EU-region processing for GDPR-aligned research workflows
  • 💰 Pay per audio second — no per-minute Rev.com markups, no Otter subscription
  • 🚀 10× parallel on the paid tier — an entire research project's worth of interviews done in one run
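Word-level timestamps turn quote lookup into a simple sequence search. A minimal sketch, assuming the `words` entries from the example output flattened across segments; `find_quote` is hypothetical, and its normalization (lowercasing, dropping punctuation) is a simplification that will not catch paraphrases or transcription errors.

```python
import re
from typing import Optional


def _norm(token: str) -> str:
    """Lowercase and strip punctuation/whitespace so 'aside.' matches 'aside'."""
    return re.sub(r"[^\w']", "", token.lower())


def find_quote(words: list[dict], phrase: str) -> Optional[tuple]:
    """Return (start, end) in seconds of the first occurrence of phrase, or None."""
    target = [_norm(t) for t in phrase.split()]
    if not target:
        return None
    tokens = [_norm(w["word"]) for w in words]
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return words[i]["start"], words[i + n - 1]["end"]
    return None
```

Usage: build `all_words = [w for seg in item["segments"] for w in seg["words"]]`, then call `find_quote(all_words, "my mentor pulled me aside")` to get the span to cue up in your audio player.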

Use cases

  • 📰 Investigative journalists transcribing source interviews and pulling attributed quotes for stories
  • 🎙️ Long-form podcasters generating publication-ready transcripts of every guest interview
  • 🧪 Qualitative researchers coding participant transcripts in NVivo, Atlas.ti, or MAXQDA
  • 📊 Market research firms transcribing focus groups and customer 1:1 sessions for thematic analysis
  • 📚 Oral history projects preserving long-form recorded interviews with timestamped speaker tracks
  • 🎓 Academic researchers conducting qualitative fieldwork in foreign-language contexts (transcribe + translate in one pass)
  • ✍️ Authors and biographers working from hours of recorded conversations for book material
  • 🎬 Documentary filmmakers preparing rough-cut transcripts of interview tape for editing

Pricing & tiers

Pay only for the audio seconds you actually transcribe. No subscriptions, no minimums.

| | FREE tier | PAID tier |
|---|---|---|
| Best for | Testing and small jobs | Production volume |
| Interviews per run | Up to 5 | Unlimited |
| Max file size | 50 MB | 1 GB |
| Monthly volume | 200 MB / 20 minutes | Unlimited |
| Concurrent files | 3 | 10 (10× parallel) |
| Cost | Free, no credit card required | $0.0005 per audio second |

Optional add-ons (only billed when enabled):

| Feature | Price |
|---|---|
| Speaker diarization | $0.0001 per audio second |
| Translate to English | $0.0003 per audio second |
| EU-region processing | $0.0007 per audio second (replaces the base $0.0005) |

A 60-minute interview with diarization on the paid tier costs approximately $2.16 ($1.80 transcription + $0.36 diarization). Compare to Rev.com's $1.50/min ($90 for the same interview).
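The arithmetic behind that estimate, as a small Python sketch: each enabled add-on adds its per-second rate, and EU processing replaces the base rate. It assumes billing is strictly linear in audio seconds, with no minimums and no free-tier allowance; `estimate_cost` is a hypothetical helper, and your Apify invoice remains the authoritative figure.

```python
from decimal import Decimal

# Per-audio-second rates (USD) as listed in the pricing tables above.
BASE_RATE = Decimal("0.0005")
EU_BASE_RATE = Decimal("0.0007")  # replaces BASE_RATE when EU processing is on
DIARIZATION_RATE = Decimal("0.0001")
TRANSLATION_RATE = Decimal("0.0003")


def estimate_cost(audio_seconds: float, *, diarization: bool = False,
                  translate: bool = False, eu_region: bool = False) -> Decimal:
    """Estimated charge in USD, rounded to cents (Decimal avoids float drift)."""
    seconds = Decimal(str(audio_seconds))
    rate = EU_BASE_RATE if eu_region else BASE_RATE
    if diarization:
        rate += DIARIZATION_RATE
    if translate:
        rate += TRANSLATION_RATE
    return (seconds * rate).quantize(Decimal("0.01"))
```

`estimate_cost(3600, diarization=True)` returns `Decimal('2.16')`, matching the figure above.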


Integration examples

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

// Top-level await requires an ES module (e.g. a .mjs file or "type": "module").
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('sian.agency/transcribe-interview-to-text').call({
    audioFiles: ['https://example.com/interview-with-source.m4a'],
    speakerDiarization: true,
    translateToEnglish: false,
});

// Each processed recording is one item in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].transcript);
console.log(items[0].srt);

Python

from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and block until the run finishes.
run = client.actor('sian.agency/transcribe-interview-to-text').call(run_input={
    'audioFiles': ['https://example.com/interview-with-source.m4a'],
    'speakerDiarization': True,
    'translateToEnglish': False,
})

# Each processed recording is one item in the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items[0]['transcript'])
print(items[0]['vtt'])

cURL

curl -X POST 'https://api.apify.com/v2/acts/sian.agency~transcribe-interview-to-text/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "audioFiles": ["https://example.com/interview.m4a"],
    "speakerDiarization": true
  }'

n8n / Zapier / Make

Connect this actor to a "new file in research-recordings folder" trigger (Dropbox, Google Drive, OneDrive). Each dataset item includes transcript, segments[].words[], srt, and vtt; send them to Notion (research database), Airtable (interview log), MAXQDA/NVivo (qualitative coding), or Google Docs (story drafts).
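Many qualitative tools and spreadsheets import CSV, so one row per segment is a convenient interchange format. A minimal sketch, assuming the `segments` shape from the example output; the column layout and the `segments_to_csv` helper are illustrative choices, and how a given tool (Airtable, NVivo, MAXQDA) maps the columns is something to check in that tool.

```python
import csv
import io


def segments_to_csv(segments: list[dict]) -> str:
    """Serialize segments as CSV text with columns speaker, start, end, text."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # quotes fields containing commas or quotes
    writer.writerow(["speaker", "start", "end", "text"])
    for seg in segments:
        writer.writerow([
            seg.get("speaker") or "",  # empty when diarization was off
            f"{seg['start']:.2f}",
            f"{seg['end']:.2f}",
            seg["text"],
        ])
    return buf.getvalue()
```

Write the returned string to a `.csv` file opened with `newline=""`, as the `csv` module documentation recommends, so line endings are not doubled on Windows.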


FAQ

How accurate is interview transcription? Powered by a speech-to-text pipeline tuned for natural conversation. Accuracy is typically 95–99% on clean studio or quiet-room interviews and lower on phone-recorded or noisy field interviews. Word-level timestamps are returned even when accuracy is imperfect, so you can quickly verify quotes and their attributions against the audio.

What audio and video formats are supported? M4A, MP3, WAV, FLAC, AAC, OPUS, OGG, MP4, MOV, WebM. Max 50 MB per file on the free tier, 1 GB on the paid tier.

Can I transcribe foreign-language interviews? Yes — auto-detection across 99+ languages including Spanish, French, German, Mandarin, Japanese, Portuguese, Arabic, Hindi, Russian, and many more. Toggle Translate to English to receive an English transcript alongside the timestamped original.

Is speaker diarization included? Yes, opt-in via the Speaker Diarization toggle. Each segment and word gets a label (SPEAKER_00, SPEAKER_01, etc.) in order of first speaking turn, so SPEAKER_00 is usually the interviewer. Powered by pyannote-audio. Billed at $0.0001 per audio second, only when enabled.

How does pricing work? Pay-per-audio-second. The free tier covers small jobs and testing without a credit card. The paid tier is $0.0005 per second of audio. A 1-hour interview with diarization is approximately $2.16 — versus Rev.com at ~$90 for the same length.

Can I integrate this into my qualitative research workflow? Yes. The actor exposes a standard Apify run/dataset API. The dataset record includes transcript, segments[].words[], srt, and vtt ready to feed into NVivo, MAXQDA, Atlas.ti, Dovetail, or any qualitative analysis tool that accepts plain text or VTT.

What if my interview is multi-speaker (panel, focus group)? Speaker diarization handles up to ~6 distinct speakers reliably. Each speaker is labeled SPEAKER_00 through SPEAKER_N in temporal order of first speaking turn.
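For panels and focus groups, a per-speaker talk-time summary is a quick sanity check on diarization (for example, a participant who never appears, or one label covering two voices). A minimal sketch, assuming the `segments` fields from the example output and that segments arrive in time order; `talk_time` is hypothetical, and summing segment durations ignores overlapping speech.

```python
from collections import defaultdict


def talk_time(segments: list[dict]) -> dict:
    """Total seconds spoken per speaker label, summed over segment durations."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.get("speaker") or "UNKNOWN"] += seg["end"] - seg["start"]
    # Dicts keep insertion order, so speakers appear in order of first turn.
    return {label: round(secs, 2) for label, secs in totals.items()}
```

On the two example segments this gives roughly 3.76 s for SPEAKER_00 and 4.08 s for SPEAKER_01.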

How long does a transcription take? A 60-minute interview takes 1–3 minutes on the paid tier. Bulk batches of 10 interviews complete in 5–10 minutes (parallelized).


Use this actor only on interviews you have rights to transcribe — your own recordings with subject consent, properly licensed media, or material covered by journalistic source agreements. Some jurisdictions require subject consent for recording; you are responsible for compliance with applicable laws and IRB requirements for academic research. The actor does not retain audio or transcripts beyond the run's lifetime. EU-region processing is available via the EU Processing toggle for GDPR-aligned workflows. SIÁN Agency provides this actor as-is.


Support

Join the Telegram support group, email support@sian-agency.online, or open an issue on the SIÁN Agency Apify Store page.


More from SIÁN Agency

Platform-specific scrapers + transcribers:

Browse the full SIÁN Agency Apify Store for all available actors.