PDF to MP3 - Convert PDF, EPUB, DOCX & Text to Audiobook
Pricing
from $40.00 / 1,000 audio minutes generated
Convert PDF, EPUB, DOCX, Markdown, HTML, TXT, and RTF to MP3 audiobooks. Free Microsoft Edge TTS (no API key) with OCR for scanned PDFs, 70+ languages, and optional OpenAI or ElevenLabs voices. ~$0.04/min.
Developer
Marielise
Maintained by Community
Text to Audio Narrator
Turn any document — PDF, DOCX, EPUB, Markdown, plain text, HTML, or RTF — into an MP3 audiobook in one run. Paste a URL, upload a file, drop in raw text, or send base64 bytes. Pick a voice. Click run. Get a downloadable MP3. No prompts to chain, no manual chunking, no ffmpeg gymnastics. Just clean audio at the end.
Even scanned / image-only PDFs work — when a page has no text layer, the actor automatically OCRs it (Tesseract) and narrates the recovered text.
Use the free Edge TTS voices by default (400+ neural voices, 70+ languages, no API key) — or bring your own OpenAI / ElevenLabs key for premium voices and steerable narration.
Perfect for: Anyone with a backlog of Project Gutenberg / Calibre EPUBs and zero time to read them, researchers listening to arXiv PDFs on a commute, business users turning Word DOCX reports into audio briefings, devs converting READMEs and blog drafts into audio for proofing, knowledge workers narrating long reports, accessibility-first publishers, podcast producers prototyping audiobook conversions, students reviewing textbook chapters on the go, journalists drafting voiceovers, and Substack writers exporting their newsletter to audio.
Features
| Feature | Description |
|---|---|
| Seven input formats | PDF (text-layer), DOCX (Word, via mammoth), EPUB (ebook, spine-ordered chapters), Markdown (syntax stripped), plain text, HTML (tags stripped), RTF (control codes stripped). Auto-detected from magic bytes / mimetype / extension |
| OCR for scanned PDFs | Pages with no selectable text layer are auto-rendered and OCR'd with Tesseract (7 languages: EN, ES, FR, DE, IT, PT, NL) so scanned books and photographed pages narrate too. Only the pages that actually need OCR are processed and billed |
| Encrypted PDF support | Provide the password via pdfPassword to decrypt and narrate password-protected PDFs |
| SSRF-guarded URL fetch + proxy | Document URLs are validated against private / internal address ranges before fetching. Optional Apify proxy for hosts that block datacenter IPs |
| Four ways to provide it | Public URL, file upload, base64 paste, or raw text paste (great for blog drafts & ChatGPT replies) |
| Free Edge TTS by default | Microsoft Edge neural voices, no API key, no per-character cost — works out of the box |
| One document in, one (or many) MP3s out | Short inputs produce a single MP3. Long ones are auto-split into chapter-sized parts, plus a shareable INDEX.html page with inline players + download links |
| Multiple TTS engines | Edge TTS (FREE, default), OpenAI gpt-4o-mini-tts (steerable), OpenAI tts-1 / tts-1-hd, ElevenLabs Flash v2.5 / Turbo v2.5 |
| BYOK for premium models | OpenAI and ElevenLabs models require your own API key (we never mark up the provider price) |
| Steerable narration | With gpt-4o-mini-tts you can prompt the voice ("calm audiobook narrator", "energetic podcast host", "slow and deliberate") without retraining |
| Auto language detection | Edge TTS auto-picks a matching voice based on the text language. Or force a specific Azure voice name (e.g. en-US-AndrewNeural) |
| Markdown-aware | Markdown syntax is stripped (no "asterisk asterisk bold asterisk asterisk"). Headings, lists, links, code fences, tables, and inline formatting all read as natural prose |
| Page / section range support | Narrate the whole document, a single chapter, or a custom slice. Works on PDFs (real pages) and TXT / MD / HTML (~3000-char pseudo-pages). Range syntax: 1-10, 1,3,5, 1-3,7-9 |
| Smart chunking | Text is split on paragraph and sentence boundaries before TTS. Hard cuts respect word boundaries so chunks don't start mid-word |
| Pre-flight cost preview + hard cap | Every run prints an estimated ceiling cost before TTS starts and writes a PREVIEW key. Set maxCostUsd to abort before TTS if the estimate is too high — and it also clamps the actual audio-minute charge so the final bill never exceeds your cap |
| Provider-adaptive concurrency | Auto-caps parallel TTS calls per provider (ElevenLabs 2, Edge 8, OpenAI 10) so free tiers don't hit 429 storms |
| Resume failed runs | Already-synthesized chunks are cached. If a long run times out or fails partway, the next run picks up where it left off — no re-paying for TTS already done |
| Skip-failed-chunks mode | If a single chunk keeps failing after retries, skip it and keep narrating the rest of the document (configurable). Auth / quota errors always abort cleanly |
| ffmpeg concat + ID3 tags | Chunks are stitched into valid MP3 containers with correct duration metadata and ID3 tags (title, album, track, genre=Audiobook) so players show proper info. No "audio glitch at minute 4" bugs |
| Transparent pricing | Pay-per-event: per page narrated + per audio minute. No surcharges, no markups on BYOK providers |
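The SSRF guard listed in the table can be illustrated with a minimal Python sketch. This is not the actor's actual code: the function names `is_public_address` and `assert_safe_url` are hypothetical, and the rule used here (reject anything that is not globally routable) is an assumption about what "private / internal address ranges" covers.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_public_address(ip: str) -> bool:
    """Return True only for globally routable, non-multicast addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_global and not addr.is_multicast


def assert_safe_url(url: str) -> None:
    """Raise ValueError if the URL could reach a private or internal host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    # Check every resolved address; a single non-public hit rejects the URL.
    for info in socket.getaddrinfo(parsed.hostname, None):
        ip = info[4][0]
        if not is_public_address(ip):
            raise ValueError(f"{parsed.hostname} resolves to non-public address {ip}")
```

A check like this only validates the address at lookup time. A production guard would also need to pin the resolved IP for the actual request and re-check every redirect hop.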
How to Use
Step 1: Provide a document
Any one of:
- Document URL: any publicly reachable URL ending in `.pdf`, `.docx`, `.epub`, `.md`, `.markdown`, `.txt`, `.html`, `.htm`, or `.rtf`
- Upload file: drag and drop a PDF / DOCX / EPUB / MD / TXT / HTML / RTF (uploaded to the run's key-value store)
- Base64 content: paste raw base64 bytes; format auto-detected from magic bytes / mimetype
- Raw text paste: paste prose, Markdown, or HTML directly into the `text` field (perfect for blog drafts, ChatGPT replies, READMEs)

Scanned / image-only PDF pages (no text layer) are automatically OCR'd when `enableOcr` is on (default). For encrypted / password-protected PDFs, supply `pdfPassword`.
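To make "auto-detected from magic bytes" concrete, here is a minimal Python sketch of byte-level sniffing. The function name `sniff_format` and the exact order of checks are assumptions for illustration; the actor also consults the mimetype and file extension, which this sketch leaves out.

```python
import io
import zipfile


def sniff_format(data: bytes) -> str:
    """Guess a document format from its leading bytes (illustrative only)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(b"PK\x03\x04"):
        # DOCX and EPUB are both ZIP archives, so look inside to tell them apart.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
            if "mimetype" in names and zf.read("mimetype").strip() == b"application/epub+zip":
                return "epub"
            if "word/document.xml" in names:
                return "docx"
        return "zip"
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    # Markdown and plain text have no signature; treat them as text here.
    return "text"
```

Markdown cannot be told apart from plain text by signature alone, which is one reason an extension or mimetype hint is still useful.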
Step 2: Pick a voice and model
| Model | Best for | API key required |
|---|---|---|
| `edge-tts` (default) | Free, long books, multi-language | No key (free) |
| `openai-gpt-4o-mini-tts` | Steerable narration with style instructions | OpenAI key (BYOK) |
| `openai-tts-1` | Bulk runs, low cost, supports speed | OpenAI key (BYOK) |
| `openai-tts-1-hd` | High-quality OpenAI audio | OpenAI key (BYOK) |
| `elevenlabs-flash-v2_5` | Fast, real-time-quality voices | ElevenLabs key (BYOK) |
| `elevenlabs-turbo-v2_5` | Highest-quality ElevenLabs voices | ElevenLabs key (BYOK) |
Step 3: (Optional) Set a range, voice, speed, or instructions
- `voice`: leave blank for auto. For Edge TTS use an Azure ShortName like `en-US-AndrewNeural` or `es-ES-ElviraNeural`. For OpenAI: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`, `coral`, `sage`. For ElevenLabs: a voice ID.
- `language`: `auto` (recommended) or a specific ISO code (Edge TTS only; OpenAI voices are multilingual).
- `pageRange`: e.g. `1-10`, `1,3,5`, or `1-3,7-9`. Empty = whole document. For non-PDF formats, "pages" are ~3000-char sections.
- `speed`: 0.25 to 4.0. Only applies to `openai-tts-1` / `openai-tts-1-hd`.
- `instructions`: free-form style hint for `openai-gpt-4o-mini-tts`, e.g. "Calm, slow audiobook narrator with a neutral accent."
- `enableOcr`: on by default. Auto-OCRs scanned PDF pages that have no text layer. Turn off to fail fast on scans instead.
- `pdfPassword`: password for encrypted PDFs.
- `proxyConfiguration`: optional Apify proxy, used only for the Document URL fetch.
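The `pageRange` syntax can be expanded into page numbers with a few lines of Python. This is a minimal sketch, not the actor's parser: the function name `parse_page_range` is hypothetical, and clipping out-of-range pages to the document length (instead of raising an error) is an assumption.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-3,7-9" into sorted, unique 1-based page numbers."""
    if not spec.strip():
        return list(range(1, total_pages + 1))  # empty = whole document
    pages: set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        start_s, sep, end_s = token.partition("-")
        start = int(start_s)
        end = int(end_s) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {token!r}")
        # Assumption: clip to the document length rather than failing.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```

For example, `parse_page_range("1-3,7-9", 20)` yields pages 1, 2, 3, 7, 8, and 9.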
Step 4: Run and download
The Actor:
- Downloads / decodes / reads the input
- Detects the format (PDF magic bytes + content-type + extension + content sniff)
- Extracts and normalises the text (page-range aware, Markdown / HTML aware)
- Splits into TTS-sized chunks at sentence boundaries (word-boundary safe hard cuts)
- Synthesises each chunk with the chosen provider in parallel
- Folds chunks into chapter-sized parts as they complete (ffmpeg concat)
- Uploads each part to the key-value store + writes a shareable INDEX.html
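The chunking step in the list above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actor's implementation: the helper names `hard_cut` and `chunk_text` are hypothetical, sentences are detected with a naive punctuation regex, and chunks are packed greedily across paragraph boundaries.

```python
import re


def hard_cut(text: str, limit: int) -> list[str]:
    """Cut an oversized sentence at the last space at or before the limit."""
    pieces = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: fall back to a raw cut
        pieces.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        pieces.append(text)
    return pieces


def chunk_text(text: str, limit: int = 4000) -> list[str]:
    """Pack sentences into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for paragraph in re.split(r"\n\s*\n", text):
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if not sentence:
                continue
            for piece in hard_cut(sentence, limit):
                candidate = f"{current} {piece}".strip()
                if len(candidate) <= limit:
                    current = candidate
                else:
                    chunks.append(current)
                    current = piece
    if current:
        chunks.append(current)
    return chunks
```

The default of 4000 mirrors the documented `chunkSize` default. A real splitter would need better sentence detection for abbreviations and for scripts that do not use spaces.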
You'll find the result in:
- The dataset: one row with metadata (`indexUrl`, `audioUrl`, `partsCount`, `parts[]`, `durationSeconds`, `chars`, `pagesProcessed`, `cost`, `status`)
- The key-value store: each MP3 part, the `INDEX.html` page, the `PREVIEW` estimate, and the `OUTPUT` record
Input Reference
| Field | Type | Required | Description |
|---|---|---|---|
| `documentUrl` | string | one of | Public URL (PDF / DOCX / EPUB / MD / TXT / HTML / RTF) |
| `documentFile` | file | one of | Upload a PDF / DOCX / EPUB / MD / TXT / HTML / RTF from your device |
| `documentBase64` | string | one of | Base64-encoded document bytes |
| `text` | string | one of | Paste raw prose, Markdown, or HTML directly |
| `model` | enum | no | TTS model (default `edge-tts`, free) |
| `voice` | string | no | Edge ShortName, OpenAI voice, or ElevenLabs voice ID |
| `language` | enum | no | Auto-detect (default) or specific ISO code |
| `speed` | number | no | 0.25 to 4.0, default 1.0. `tts-1` / `tts-1-hd` only |
| `instructions` | string | no | Free-form style for `gpt-4o-mini-tts` |
| `pageRange` | string | no | e.g. `1-10` or `1,3,5`. Empty = full document |
| `chunkSize` | integer | no | 500 to 4096, default 4000 (auto-clamped to 2500 for ElevenLabs) |
| `concurrency` | integer | no | 1 to 20 parallel TTS requests, default 5 (auto-clamped per provider) |
| `resume` | boolean | no | Skip already-synthesized chunks from previous runs (default true) |
| `skipFailedChunks` | boolean | no | Skip individual chunk failures instead of aborting (default true) |
| `maxPartMb` | integer | no | Max size per MP3 part, default 40 MB |
| `maxCostUsd` | number | no | Hard cap (min 0.02). Aborts before TTS if the estimate exceeds it, and clamps the audio-minute charge so the final bill never exceeds the cap |
| `enableOcr` | boolean | no | OCR scanned / image-only PDF pages (default true) |
| `pdfPassword` | secret | no | Password for encrypted PDFs |
| `proxyConfiguration` | object | no | Apify proxy for the Document URL fetch |
| `openaiApiKey` | secret | required for `openai-*` | Your OpenAI API key (BYOK) |
| `elevenlabsApiKey` | secret | required for `elevenlabs-*` | Your ElevenLabs API key (BYOK) |
| `debug` | boolean | no | Verbose logs |
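The fields above can also be sent programmatically. The following Python sketch assumes the `apify-client` package and uses its `ApifyClient(...).actor(...).call(run_input=...)` and `dataset(...).iterate_items()` calls. The helper names `build_run_input` and `run_actor` are hypothetical, and the actor ID and token are placeholders you must supply yourself.

```python
from typing import Optional


def build_run_input(document_url: str, page_range: str = "",
                    max_cost_usd: Optional[float] = None) -> dict:
    """Assemble an input object using field names from the table above."""
    run_input = {
        "documentUrl": document_url,
        "model": "edge-tts",  # free default, no API key needed
        "language": "auto",
        "resume": True,
    }
    if page_range:
        run_input["pageRange"] = page_range
    if max_cost_usd is not None:
        run_input["maxCostUsd"] = max_cost_usd
    return run_input


def run_actor(token: str, actor_id: str, run_input: dict) -> list:
    """Start the actor, wait for it to finish, and return its dataset rows."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like `run_actor(token, actor_id, build_run_input("https://example.com/paper.pdf", "1-10", 2.0))`, where the URL is only an example value.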
Output Example
    {
      "indexUrl": "https://api.apify.com/v2/key-value-stores/.../records/INDEX",
      "audioUrl": "https://api.apify.com/v2/key-value-stores/.../records/narration-abc123-part001.mp3",
      "audioKvKey": "narration-abc123-part001.mp3",
      "durationSeconds": 1843.2,
      "partsCount": 2,
      "parts": [
        { "part": 1, "key": "narration-abc123-part001.mp3", "url": "https://...", "durationSeconds": 1200, "bytes": 12500000 },
        { "part": 2, "key": "narration-abc123-part002.mp3", "url": "https://...", "durationSeconds": 643.2, "bytes": 6700000 }
      ],
      "chars": 48210,
      "pagesProcessed": 24,
      "ocrPagesProcessed": 0,
      "voice": "en-US-AndrewNeural",
      "model": "edge-tts",
      "cost": 2.13,
      "status": "success",
      "chunksTotal": 13,
      "chunksSucceeded": 13,
      "chunksFailed": 0,
      "generatedAt": "2026-06-02T10:30:00.000Z"
    }
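A consumer of this record typically wants the download URLs and a quick health check. The sketch below is one way to do that in Python; the function name `summarize_record` is hypothetical, and it relies only on fields shown in the example above.

```python
import json


def summarize_record(record_json: str) -> dict:
    """Pull the download URLs and total duration out of one dataset row."""
    record = json.loads(record_json)
    parts = record.get("parts", [])
    return {
        # Treat the run as clean only if it succeeded and no chunk was skipped.
        "ok": record.get("status") == "success" and record.get("chunksFailed", 0) == 0,
        "urls": [p["url"] for p in parts],
        "minutes": round(sum(p["durationSeconds"] for p in parts) / 60, 1),
    }
```

With `skipFailedChunks` enabled, checking `chunksFailed` matters: a run can report success while some passages are missing from the audio.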
Use Cases
- Any ebook → free audiobook — drop a Project Gutenberg EPUB and listen to a full classic novel. Spine-ordered chapters narrate in the right sequence.
- Word doc → audio briefing — drop your DOCX report and listen on a commute instead of skimming on screen.
- Audiobook prototyping — convert your ebook PDF into MP3 to validate narrator tone before commissioning a human voiceover.
- Research papers on the go — listen to arXiv PDFs during a commute or workout.
- README → audio — paste your project README and listen to your own docs to spot rough explanations.
- Blog draft proofing — paste a Markdown blog draft and listen to it before publishing. You'll hear awkward phrasing you'd never catch reading.
- ChatGPT reply → podcast snippet: copy a long ChatGPT response into the `text` field and listen as audio.
- Accessibility: generate audio versions of internal documentation for screen-reader-light workflows.
- Onboarding — pipe HR PDFs (handbooks, policies) into audio for distributed teams.
- Newsletter audio versions — automatically narrate weekly reports (PDF, MD, or HTML) for paying subscribers.
- Language learning — narrate text in different voices and speeds to practise listening comprehension.
- Substack → audio export — export a post as HTML and narrate it for a podcast feed.
Pricing
This Actor uses Pay Per Event so you only pay for the work the run actually does. No premium-voice surcharges, no provider markups.
| Event | Price (USD) | When charged |
|---|---|---|
| `actor-start` | $0.02 | Once per run, after the document loads successfully |
| `pdf-page-narrated` | $0.05 | Once per page (PDF) or per ~3000-char section (TXT / MD / HTML) successfully narrated |
| `audio-minute-generated` | $0.03 | Once per minute of MP3 output |
| `ocr-page-processed` | $0.10 | Only for scanned / image-only PDF pages that had no text layer and were recovered via OCR. Text-layer PDFs never pay this |
Typical cost example — a 20-page research paper (~40k chars, ~50 minutes of audio):
- `actor-start`: $0.02
- 20 × `pdf-page-narrated`: $1.00
- ~50 × `audio-minute-generated`: $1.50
- Total: ~$2.52 for the full paper
For OpenAI / ElevenLabs models, this is all you pay Apify. You pay the provider directly with your own API key on top — that's the whole point of BYOK: no markup.
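The arithmetic in the cost example can be reproduced with a small Python estimator. The prices come from the event table above; the function name `estimate_cost` is hypothetical, and rounding partial audio minutes up is an assumption, since the table does not say how fractions of a minute are billed.

```python
import math

# Event prices copied from the pricing table (USD).
PRICES = {
    "actor-start": 0.02,
    "pdf-page-narrated": 0.05,
    "audio-minute-generated": 0.03,
    "ocr-page-processed": 0.10,
}


def estimate_cost(pages: int, audio_minutes: float, ocr_pages: int = 0) -> float:
    """Estimate the Apify charge for one run, excluding BYOK provider fees."""
    billed_minutes = math.ceil(audio_minutes)  # assumption: partial minutes round up
    total = (
        PRICES["actor-start"]
        + pages * PRICES["pdf-page-narrated"]
        + billed_minutes * PRICES["audio-minute-generated"]
        + ocr_pages * PRICES["ocr-page-processed"]
    )
    return round(total, 2)
```

`estimate_cost(20, 50)` returns 2.52, matching the 20-page example, and each scanned page adds $0.10 on top.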
Pre-flight cost preview & hard cap
Every run writes a PREVIEW key to the key-value store BEFORE TTS starts, with pages to process, estimated audio minutes, and the estimated ceiling cost. The same numbers are printed to the run log.
Set the optional maxCostUsd input to enforce a hard cap: if the estimate exceeds it, the run aborts cleanly before any TTS — you only pay the actor-start fee (plus any OCR already performed). The actual audio-minute charge is also clamped to the cap, so even if the produced audio runs longer than estimated (slow speech, CJK scripts) the final bill never exceeds your cap. Combine with Apify's run-level Max total charge as a second belt-and-suspenders limit.
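The two behaviours of `maxCostUsd` (abort before TTS, then clamp the audio-minute charge) can be sketched as below. This is a guess at the shape of the logic, not the actor's code: `CostCapExceeded`, `check_preflight`, and `clamp_audio_charge` are hypothetical names, and treating page and start fees as fixed charges that the clamp works around is an assumption.

```python
from typing import Optional


class CostCapExceeded(Exception):
    """Raised when the pre-flight estimate is above the user's cap."""


def check_preflight(estimated_cost: float, max_cost_usd: Optional[float]) -> None:
    """Abort before any TTS call if the estimate is over the cap."""
    if max_cost_usd is not None and estimated_cost > max_cost_usd:
        raise CostCapExceeded(
            f"estimate ${estimated_cost:.2f} exceeds cap ${max_cost_usd:.2f}"
        )


def clamp_audio_charge(fixed_charges: float, audio_charge: float,
                       max_cost_usd: Optional[float]) -> float:
    """Limit the audio charge so fixed charges plus audio never exceed the cap."""
    if max_cost_usd is None:
        return round(audio_charge, 2)
    remaining = max(0.0, max_cost_usd - fixed_charges)
    return round(min(audio_charge, remaining), 2)
```

In this sketch, a run with $1.02 of fixed charges, $1.80 of audio, and a $2.52 cap would be billed $1.50 for audio, so the total lands exactly on the cap.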
BYOK — Bring Your Own Key
| Model family | Key field | Where to get it | Free tier? |
|---|---|---|---|
| `edge-tts` (default) | none | No key needed | Yes, completely free |
| `openai-*` | `openaiApiKey` | https://platform.openai.com/api-keys | OpenAI charges per character |
| `elevenlabs-*` | `elevenlabsApiKey` | https://elevenlabs.io/app/settings/api-keys | ElevenLabs free tier available |
The actor never logs your keys (isSecret: true) and never proxies your calls through our servers — your key talks directly to the provider from inside the actor's run.
FAQ
Which document formats are supported?
- PDF (`.pdf`): text-layer PDFs extract natively; scanned / image-only pages are auto-OCR'd (Tesseract).
- DOCX (`.docx`): Word documents, parsed with `mammoth`. Styles, lists, tables, footnotes handled natively.
- EPUB (`.epub`): ebooks. Walked in spine order so chapters narrate in the right sequence. HTML stripped per chapter.
- Markdown (`.md`, `.markdown`, `.mdx`): syntax stripped so the voice reads natural prose.
- Plain text (`.txt`, `.text`): UTF-8, BOM handled.
- HTML (`.html`, `.htm`, `.xhtml`): tags stripped, entities decoded, scripts and styles removed.
- RTF (`.rtf`): control codes stripped, unicode escapes and hex bytes decoded.
Does it work on scanned PDFs?
Yes. When a PDF page has no selectable text layer, the actor renders it (poppler pdftoppm) and runs OCR (Tesseract) to recover the text, then narrates it. OCR runs only on pages that need it, and those pages are billed via the ocr-page-processed event ($0.10/page). Built-in OCR languages: English, Spanish, French, German, Italian, Portuguese, Dutch (others fall back to English). Turn it off with enableOcr: false to fail fast on scans instead.
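The render-then-OCR flow described here maps onto two command-line calls. The Python sketch below only builds the argument lists, using standard `pdftoppm` flags (`-f`, `-l`, `-r`, `-png`, `-singlefile`) and the `tesseract <image> stdout -l <lang>` form. The helper names, the 300 DPI setting, the `page-N` file prefix, and the empty-text test for "needs OCR" are all assumptions, not details taken from the actor.

```python
# Tesseract language codes for the seven built-in languages.
TESSERACT_LANGS = {
    "en": "eng", "es": "spa", "fr": "fra", "de": "deu",
    "it": "ita", "pt": "por", "nl": "nld",
}


def needs_ocr(page_text: str) -> bool:
    """Assumption: a page with no extractable characters has no text layer."""
    return not page_text.strip()


def ocr_commands(pdf_path: str, page: int, iso_lang: str, dpi: int = 300):
    """Build the render and recognize commands for one scanned page."""
    lang = TESSERACT_LANGS.get(iso_lang.lower(), "eng")  # others fall back to English
    prefix = f"page-{page}"
    render = ["pdftoppm", "-f", str(page), "-l", str(page), "-r", str(dpi),
              "-png", "-singlefile", pdf_path, prefix]
    recognize = ["tesseract", f"{prefix}.png", "stdout", "-l", lang]
    return render, recognize
```

Each list can be passed to `subprocess.run(cmd, check=True, capture_output=True)` on a machine with poppler and Tesseract installed; the recognized text then arrives on the second command's standard output.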
Does it work on password-protected PDFs?
Yes — pass the password in the pdfPassword input and the actor decrypts the PDF before extraction.
What about ODT, MOBI, AZW3, or Pages?
Not supported in v0.1. Convert ODT to DOCX first; for Kindle formats, convert via Calibre to EPUB.
Why do OpenAI / ElevenLabs models require BYOK?
So we never mark up the provider price. Pay Apify for the actor work, pay OpenAI / ElevenLabs directly for the TTS calls. Cleaner, cheaper, more honest. For zero-key zero-friction runs, the default edge-tts is free and gives great quality on 70+ languages.
How is "page" defined for TXT / MD / HTML?
There are no real pages, so the actor splits the cleaned text into ~3000-char pseudo-pages — roughly the length of one PDF page of prose. This keeps pageRange and per-page billing fair across formats.
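A pseudo-page split of this kind can be sketched in a few lines of Python. The function name `pseudo_pages` is hypothetical, and breaking at the last space inside each 3000-character window is an assumption; the actor may pick its boundaries differently.

```python
def pseudo_pages(text: str, target: int = 3000) -> list[str]:
    """Split cleaned text into sections of at most `target` characters."""
    pages, start = [], 0
    while start < len(text):
        end = min(start + target, len(text))
        if end < len(text):
            # Prefer to break at the last space inside the window.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        pages.append(text[start:end].strip())
        start = end
    return [p for p in pages if p]
```

Under this sketch, the number of billable sections for a TXT / MD / HTML input is simply `len(pseudo_pages(text))`, and `pageRange` indexes into that list starting from 1.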
How long can the document be?
PDFs up to 50 MB. EPUB up to 40 MB. DOCX up to 30 MB. TXT / MD / HTML / RTF up to 20 MB of decoded text. There is no hard page limit. Long inputs are auto-split into chapter-sized MP3 parts (configurable via maxPartMb).
What if my run times out or fails partway?
Re-run with the same input. The resume option (on by default) skips already-synthesized chunks via a shared cache, so you only pay TTS for the missing pieces.
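One common way to build such a cache is to key each chunk by a hash of everything that affects its audio. The Python sketch below illustrates the idea only; the key layout, the `chunk-` prefix, and the helper names `chunk_cache_key` and `chunks_to_synthesize` are assumptions, not the actor's actual scheme.

```python
import hashlib


def chunk_cache_key(model: str, voice: str, text: str,
                    speed: float = 1.0, instructions: str = "") -> str:
    """Derive a stable key from every setting that changes the audio."""
    payload = "\x1f".join([model, voice, f"{speed:.2f}", instructions, text])
    return "chunk-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


def chunks_to_synthesize(chunks: list, cached_keys: set,
                         model: str, voice: str) -> list:
    """Return only the chunks whose audio is not already cached."""
    return [c for c in chunks if chunk_cache_key(model, voice, c) not in cached_keys]
```

Because the voice and model are part of the key in this sketch, changing either one would invalidate the cache and trigger a full re-synthesis, which is the safe behaviour.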
Can I get word-level timestamps?
v0.1 does not emit timestamps. Coming in a future version.
Can I use multiple speakers / podcast mode?
Not in v0.1. Single-voice narration only.
What audio format is produced?
MP3, standard playback on any device. Edge TTS produces 24 kHz mono; OpenAI / ElevenLabs use their default high-quality output.
Can I override the voice with a custom ElevenLabs voice?
Yes — paste any ElevenLabs voice ID into the voice field when using an elevenlabs-* model. You can clone your own voice in your ElevenLabs account and use that ID here.
Is my API key safe?
Yes. API keys are marked isSecret: true in the input schema and are never logged or persisted.
Built and maintained by Equipinico
Need a custom variant (different language model, custom voices, SSML support, podcast multi-speaker)? Reach out via the Apify Store contact link.