PDF to MP3 - Convert PDF, EPUB, DOCX & Text to Audiobook

Pricing

from $40.00 / 1,000 audio minutes generated

Convert PDF, EPUB, DOCX, Markdown, HTML, TXT, and RTF to MP3 audiobooks. Free Microsoft Edge TTS (no API key) with OCR for scanned PDFs, 70+ languages, and optional OpenAI or ElevenLabs voices. ~$0.04/min.

Developer

Marielise

Maintained by Community

Text to Audio Narrator

TTS: Edge free + OpenAI + ElevenLabs
Formats: PDF · DOCX · EPUB · MD · TXT · HTML · RTF
OCR: scanned PDFs
Output: MP3 audiobook

Turn any document — PDF, DOCX, EPUB, Markdown, plain text, HTML, or RTF — into an MP3 audiobook in one run. Paste a URL, upload a file, drop in raw text, or send base64 bytes. Pick a voice. Click run. Get a downloadable MP3. No prompts to chain, no manual chunking, no ffmpeg gymnastics. Just clean audio at the end.

Even scanned / image-only PDFs work — when a page has no text layer, the actor automatically OCRs it (Tesseract) and narrates the recovered text.

Use the free Edge TTS voices by default (400+ neural voices, 70+ languages, no API key) — or bring your own OpenAI / ElevenLabs key for premium voices and steerable narration.

Perfect for:

  • Anyone with a backlog of Project Gutenberg / Calibre EPUBs and zero time to read them
  • Researchers listening to arXiv PDFs on a commute
  • Business users turning Word DOCX reports into audio briefings
  • Devs converting READMEs and blog drafts into audio for proofing
  • Knowledge workers narrating long reports
  • Accessibility-first publishers
  • Podcast producers prototyping audiobook conversions
  • Students reviewing textbook chapters on the go
  • Journalists drafting voiceovers
  • Substack writers exporting their newsletter to audio

Features

| Feature | Description |
| --- | --- |
| Seven input formats | PDF (text-layer), DOCX (Word, via mammoth), EPUB (ebook, spine-ordered chapters), Markdown (syntax stripped), plain text, HTML (tags stripped), RTF (control codes stripped). Auto-detected from magic bytes / mimetype / extension |
| OCR for scanned PDFs | Pages with no selectable text layer are auto-rendered and OCR'd with Tesseract (7 languages: EN, ES, FR, DE, IT, PT, NL) so scanned books and photographed pages narrate too. Only the pages that actually need OCR are processed and billed |
| Encrypted PDF support | Provide the password via pdfPassword to decrypt and narrate password-protected PDFs |
| SSRF-guarded URL fetch + proxy | Document URLs are validated against private / internal address ranges before fetching. Optional Apify proxy for hosts that block datacenter IPs |
| Four ways to provide it | Public URL, file upload, base64 paste, or raw text paste (great for blog drafts & ChatGPT replies) |
| Free Edge TTS by default | Microsoft Edge neural voices, no API key, no per-character cost. Works out of the box |
| One document in, one (or many) MP3s out | Short inputs produce a single MP3. Long ones are auto-split into chapter-sized parts, plus a shareable INDEX.html page with inline players + download links |
| Multiple TTS engines | Edge TTS (FREE, default), OpenAI gpt-4o-mini-tts (steerable), OpenAI tts-1 / tts-1-hd, ElevenLabs Flash v2.5 / Turbo v2.5 |
| BYOK for premium models | OpenAI and ElevenLabs models require your own API key (we never mark up the provider price) |
| Steerable narration | With gpt-4o-mini-tts you can prompt the voice ("calm audiobook narrator", "energetic podcast host", "slow and deliberate") without retraining |
| Auto language detection | Edge TTS auto-picks a matching voice based on the text language. Or force a specific Azure voice name (e.g. en-US-AndrewNeural) |
| Markdown-aware | Markdown syntax is stripped (no "asterisk asterisk bold asterisk asterisk"). Headings, lists, links, code fences, tables, and inline formatting all read as natural prose |
| Page / section range support | Narrate the whole document, a single chapter, or a custom slice. Works on PDFs (real pages) and TXT / MD / HTML (~3000-char pseudo-pages). Range syntax: 1-10 or 1,3,5 or 1-3,7-9 |
| Smart chunking | Text is split on paragraph and sentence boundaries before TTS. Hard cuts respect word boundaries so chunks don't start mid-word |
| Pre-flight cost preview + hard cap | Every run prints an estimated ceiling cost before TTS starts and writes a PREVIEW key. Set maxCostUsd to abort before TTS if the estimate is too high; it also clamps the actual audio-minute charge so the final bill never exceeds your cap |
| Provider-adaptive concurrency | Auto-caps parallel TTS calls per provider (ElevenLabs 2, Edge 8, OpenAI 10) so free tiers don't hit 429 storms |
| Resume failed runs | Already-synthesized chunks are cached. If a long run times out or fails partway, the next run picks up where it left off, with no re-paying for TTS already done |
| Skip-failed-chunks mode | If a single chunk keeps failing after retries, skip it and keep narrating the rest of the document (configurable). Auth / quota errors always abort cleanly |
| ffmpeg concat + ID3 tags | Chunks are stitched into valid MP3 containers with correct duration metadata and ID3 tags (title, album, track, genre=Audiobook) so players show proper info. No "audio glitch at minute 4" bugs |
| Transparent pricing | Pay-per-event: per page narrated + per audio minute. No surcharges, no markups on BYOK providers |
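
The provider-adaptive concurrency row can be pictured with a short sketch. This is not the Actor's source: the function names are hypothetical, and only the three ceilings (ElevenLabs 2, Edge 8, OpenAI 10) come from the table above. It assumes the requested parallelism is clamped to the provider ceiling and enforced with a semaphore.

```python
import asyncio

# Per-provider ceilings quoted in the feature table (the dict keys are assumed names).
PROVIDER_CAPS = {"elevenlabs": 2, "edge": 8, "openai": 10}


def effective_concurrency(provider: str, requested: int) -> int:
    """Clamp the requested parallelism to the provider ceiling, never below 1."""
    cap = PROVIDER_CAPS.get(provider, requested)
    return max(1, min(requested, cap))


async def synthesize_all(chunks, provider, requested, synthesize):
    """Run the caller-supplied coroutine `synthesize(chunk)` with bounded parallelism."""
    limit = asyncio.Semaphore(effective_concurrency(provider, requested))

    async def worker(chunk):
        async with limit:
            return await synthesize(chunk)

    # gather() preserves input order, so results line up with the chunks.
    return await asyncio.gather(*(worker(c) for c in chunks))
```

With these assumptions, a request for 5 parallel calls against ElevenLabs would run 2 at a time, while the same request against OpenAI would run all 5.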

How to Use

Step 1: Provide a document

Any one of:

  • Document URL — any publicly reachable URL ending in .pdf, .docx, .epub, .md, .markdown, .txt, .html, .htm, or .rtf
  • Upload file — drag and drop a PDF / DOCX / EPUB / MD / TXT / HTML / RTF (uploaded to the run's key-value store)
  • Base64 content — paste raw base64 bytes; format auto-detected from magic bytes / mimetype
  • Raw text paste — paste prose, Markdown, or HTML directly into the text field (perfect for blog drafts, ChatGPT replies, READMEs)
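
To make "auto-detected from magic bytes" concrete, here is a minimal sketch of how such sniffing can work. It is an illustration, not the Actor's detection code: the function name and the plain-text fallback are assumptions. The byte signatures themselves are standard: PDFs start with `%PDF-`, RTF files start with `{\rtf`, and DOCX and EPUB are both ZIP containers that differ in their member names.

```python
import io
import os
import zipfile

# Extension fallback for formats that have no reliable byte signature.
EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".epub": "epub", ".rtf": "rtf",
    ".md": "md", ".markdown": "md", ".txt": "txt",
    ".html": "html", ".htm": "html",
}


def sniff_format(data: bytes, filename: str = "") -> str:
    """Guess the document format from leading bytes, then from the file extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(b"PK\x03\x04"):
        # DOCX and EPUB are ZIP archives; tell them apart by what they contain.
        try:
            names = set(zipfile.ZipFile(io.BytesIO(data)).namelist())
        except zipfile.BadZipFile:
            names = set()
        if "word/document.xml" in names:
            return "docx"
        if "META-INF/container.xml" in names:
            return "epub"
    ext = os.path.splitext(filename.lower())[1]
    # Assumption: anything unrecognised is treated as plain text.
    return EXTENSIONS.get(ext, "txt")
```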

Scanned / image-only PDF pages (no text layer) are automatically OCR'd when enableOcr is on (default). For encrypted / password-protected PDFs, supply pdfPassword.

Step 2: Pick a voice and model

| Model | Best for | API key required |
| --- | --- | --- |
| edge-tts (default) | Free, long books, multi-language | No key (free) |
| openai-gpt-4o-mini-tts | Steerable narration with style instructions | OpenAI key (BYOK) |
| openai-tts-1 | Bulk runs, low cost, supports speed | OpenAI key (BYOK) |
| openai-tts-1-hd | High-quality OpenAI audio | OpenAI key (BYOK) |
| elevenlabs-flash-v2_5 | Fast, real-time-quality voices | ElevenLabs key (BYOK) |
| elevenlabs-turbo-v2_5 | Highest-quality ElevenLabs voices | ElevenLabs key (BYOK) |

Step 3: (Optional) Set a range, voice, speed, or instructions

  • voice — leave blank for auto. For Edge TTS use an Azure ShortName like en-US-AndrewNeural, es-ES-ElviraNeural. For OpenAI: alloy, echo, fable, onyx, nova, shimmer, coral, sage. For ElevenLabs: a voice ID.
  • language — auto (recommended) or a specific ISO code (Edge TTS only; OpenAI voices are multilingual).
  • pageRange — e.g. 1-10 or 1,3,5 or 1-3,7-9. Empty = whole document. For non-PDF formats, "pages" are ~3000-char sections.
  • speed — 0.25 to 4.0. Only applies to openai-tts-1 / openai-tts-1-hd.
  • instructions — free-form style hint for openai-gpt-4o-mini-tts, e.g. "Calm, slow audiobook narrator with a neutral accent."
  • enableOcr — on by default. Auto-OCRs scanned PDF pages that have no text layer. Turn off to fail fast on scans instead.
  • pdfPassword — password for encrypted PDFs.
  • proxyConfiguration — optional Apify proxy, used only for the Document URL fetch.
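
The pageRange syntax above (1-10, 1,3,5, 1-3,7-9, empty for the whole document) can be read as a comma-separated list of single pages and inclusive ranges. The parser below is a sketch under that reading. The function name is hypothetical, and silently clamping pages beyond the end of the document is an assumption, not documented behaviour.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-3,7-9" into a sorted list of 1-based page numbers."""
    if not spec.strip():
        # Empty spec means the whole document.
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages past the end of the document are ignored.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```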

Step 4: Run and download

The Actor:

  1. Downloads / decodes / reads the input
  2. Detects the format (PDF magic bytes + content-type + extension + content sniff)
  3. Extracts and normalises the text (page-range aware, Markdown / HTML aware)
  4. Splits into TTS-sized chunks at sentence boundaries (word-boundary safe hard cuts)
  5. Synthesises each chunk with the chosen provider in parallel
  6. Folds chunks into chapter-sized parts as they complete (ffmpeg concat)
  7. Uploads each part to the key-value store + writes a shareable INDEX.html
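
Step 4 in the list above (sentence-boundary chunks with word-boundary-safe hard cuts) might look roughly like the following. This is a simplified sketch and not the Actor's implementation: the regular expression, the function name, and the greedy packing strategy are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Pack sentences into chunks of at most max_chars characters."""
    # Split after sentence punctuation, or on blank lines between paragraphs.
    sentences = re.split(r"(?<=[.!?])\s+|\n\s*\n", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A sentence longer than the limit gets hard cuts at the last space that fits.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no space available: cut mid-word as a last resort
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Greedy packing keeps chunks close to the size limit, which reduces the number of TTS requests; the trade-off is that a chunk can span a paragraph break.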

You'll find the result in:

  • The dataset — one row with metadata (indexUrl, audioUrl, partsCount, parts[], durationSeconds, chars, pagesProcessed, cost, status)
  • The key-value store — each MP3 part, the INDEX.html page, the PREVIEW estimate, and the OUTPUT record

Input Reference

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| documentUrl | string | one of | Public URL (PDF / DOCX / EPUB / MD / TXT / HTML / RTF) |
| documentFile | file | one of | Upload a PDF / DOCX / EPUB / MD / TXT / HTML / RTF from your device |
| documentBase64 | string | one of | Base64-encoded document bytes |
| text | string | one of | Paste raw prose, Markdown, or HTML directly |
| model | enum | no | TTS model (default edge-tts, free) |
| voice | string | no | Edge ShortName, OpenAI voice, or ElevenLabs voice ID |
| language | enum | no | Auto-detect (default) or specific ISO code |
| speed | number | no | 0.25 to 4.0, default 1.0. tts-1 / tts-1-hd only |
| instructions | string | no | Free-form style for gpt-4o-mini-tts |
| pageRange | string | no | e.g. 1-10 or 1,3,5. Empty = full document |
| chunkSize | integer | no | 500 to 4096, default 4000 (auto-clamped to 2500 for ElevenLabs) |
| concurrency | integer | no | 1 to 20 parallel TTS requests, default 5 (auto-clamped per provider) |
| resume | boolean | no | Skip already-synthesized chunks from previous runs (default true) |
| skipFailedChunks | boolean | no | Skip individual chunk failures instead of aborting (default true) |
| maxPartMb | integer | no | Max size per MP3 part, default 40 MB |
| maxCostUsd | number | no | Hard cap (min 0.02). Aborts before TTS if the estimate exceeds it, and clamps the audio-minute charge so the final bill never exceeds the cap |
| enableOcr | boolean | no | OCR scanned / image-only PDF pages (default true) |
| pdfPassword | secret | no | Password for encrypted PDFs |
| proxyConfiguration | object | no | Apify proxy for the Document URL fetch |
| openaiApiKey | secret | required for openai-* | Your OpenAI API key (BYOK) |
| elevenlabsApiKey | secret | required for elevenlabs-* | Your ElevenLabs API key (BYOK) |
| debug | boolean | no | Verbose logs |
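
As a rough illustration of how these fields fit together, the sketch below builds a run input and starts the Actor from Python. Several things here are assumptions: the helper names are hypothetical, "one of" is read as exactly one source field, the actor ID must be filled in from the Store page, and the run step relies on the separately installed `apify-client` package.

```python
SOURCE_FIELDS = ("documentUrl", "documentFile", "documentBase64", "text")


def build_run_input(**fields) -> dict:
    """Validate the documented constraints locally before starting a run."""
    run_input = {key: value for key, value in fields.items() if value is not None}
    sources = [name for name in SOURCE_FIELDS if name in run_input]
    if len(sources) != 1:
        raise ValueError(f"provide exactly one of {SOURCE_FIELDS}, got {sources}")
    model = run_input.setdefault("model", "edge-tts")  # documented default
    if model.startswith("openai-") and not run_input.get("openaiApiKey"):
        raise ValueError("openai-* models require openaiApiKey")
    if model.startswith("elevenlabs-") and not run_input.get("elevenlabsApiKey"):
        raise ValueError("elevenlabs-* models require elevenlabsApiKey")
    return run_input


def run_actor(actor_id: str, token: str, run_input: dict) -> list:
    """Start the Actor and return its dataset rows (needs `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily so the validator works without it

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `build_run_input(text="Hello world", pageRange="1-2")` returns a dictionary with `model` filled in as `edge-tts`, ready to pass to `run_actor`.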

Output Example

{
  "indexUrl": "https://api.apify.com/v2/key-value-stores/.../records/INDEX",
  "audioUrl": "https://api.apify.com/v2/key-value-stores/.../records/narration-abc123-part001.mp3",
  "audioKvKey": "narration-abc123-part001.mp3",
  "durationSeconds": 1843.2,
  "partsCount": 2,
  "parts": [
    { "part": 1, "key": "narration-abc123-part001.mp3", "url": "https://...", "durationSeconds": 1200, "bytes": 12500000 },
    { "part": 2, "key": "narration-abc123-part002.mp3", "url": "https://...", "durationSeconds": 643.2, "bytes": 6700000 }
  ],
  "chars": 48210,
  "pagesProcessed": 24,
  "ocrPagesProcessed": 0,
  "voice": "en-US-AndrewNeural",
  "model": "edge-tts",
  "cost": 2.13,
  "status": "success",
  "chunksTotal": 13,
  "chunksSucceeded": 13,
  "chunksFailed": 0,
  "generatedAt": "2026-06-02T10:30:00.000Z"
}
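
A consumer of this record will usually want the part keys, a readable running time, and a quick sanity check. The helper below is a hypothetical sketch that relies only on the field names shown in the example above; the half-second tolerance is an arbitrary choice.

```python
def summarize(record: dict) -> dict:
    """Pull the fields a downstream script typically needs from one output record."""
    parts = record.get("parts", [])
    parts_total = sum(part["durationSeconds"] for part in parts)
    minutes, seconds = divmod(round(record["durationSeconds"]), 60)
    return {
        "ok": record["status"] == "success" and record["chunksFailed"] == 0,
        "keys": [part["key"] for part in parts],
        "length": f"{minutes}:{seconds:02d}",
        # The per-part durations should add up to the overall duration.
        "parts_match_total": abs(parts_total - record["durationSeconds"]) < 0.5,
    }
```

For the example record, the two parts (1200 s and 643.2 s) add up to the reported 1843.2 s, which is a little under 31 minutes.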

Use Cases

  • Any ebook → free audiobook — drop a Project Gutenberg EPUB and listen to a full classic novel. Spine-ordered chapters narrate in the right sequence.
  • Word doc → audio briefing — drop your DOCX report and listen on a commute instead of skimming on screen.
  • Audiobook prototyping — convert your ebook PDF into MP3 to validate narrator tone before commissioning a human voiceover.
  • Research papers on the go — listen to arXiv PDFs during a commute or workout.
  • README → audio — paste your project README and listen to your own docs to spot rough explanations.
  • Blog draft proofing — paste a Markdown blog draft and listen to it before publishing. You'll hear awkward phrasing you'd never catch reading.
  • ChatGPT reply → podcast snippet — copy a long ChatGPT response into the text field and listen as audio.
  • Accessibility — generate audio versions of internal documentation for screen-reader-light workflows.
  • Onboarding — pipe HR PDFs (handbooks, policies) into audio for distributed teams.
  • Newsletter audio versions — automatically narrate weekly reports (PDF, MD, or HTML) for paying subscribers.
  • Language learning — narrate text in different voices and speeds to practise listening comprehension.
  • Substack → audio export — export a post as HTML and narrate it for a podcast feed.

Pricing

This Actor uses Pay Per Event so you only pay for the work the run actually does. No premium-voice surcharges, no provider markups.

| Event | Price (USD) | When charged |
| --- | --- | --- |
| actor-start | $0.02 | Once per run, after the document loads successfully |
| pdf-page-narrated | $0.05 | Once per page (PDF) or per ~3000-char section (TXT / MD / HTML) successfully narrated |
| audio-minute-generated | $0.03 | Once per minute of MP3 output |
| ocr-page-processed | $0.10 | Only for scanned / image-only PDF pages that had no text layer and were recovered via OCR. Text-layer PDFs never pay this |

Typical cost example — a 20-page research paper (~40k chars, ~50 minutes of audio):

  • actor-start: $0.02
  • 20 × pdf-page-narrated: $1.00
  • ~50 × audio-minute-generated: $1.50
  • Total: ~$2.52 for the full paper
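
The arithmetic in this example can be reproduced with a small estimator. The event prices come from the pricing table; the function name is hypothetical, and rounding a partial audio minute up to a full minute is an assumption, since the listing does not say how partial minutes are billed.

```python
import math

# Event prices from the pricing table, in cents to avoid floating-point drift.
PRICES_CENTS = {
    "actor-start": 2,
    "pdf-page-narrated": 5,
    "audio-minute-generated": 3,
    "ocr-page-processed": 10,
}


def estimate_cost_usd(pages: int, audio_minutes: float, ocr_pages: int = 0) -> float:
    """Estimate the Apify charge for one run (provider fees for BYOK models are extra)."""
    cents = (
        PRICES_CENTS["actor-start"]
        + pages * PRICES_CENTS["pdf-page-narrated"]
        # Assumption: a partial minute is billed as a full minute.
        + math.ceil(audio_minutes) * PRICES_CENTS["audio-minute-generated"]
        + ocr_pages * PRICES_CENTS["ocr-page-processed"]
    )
    return cents / 100
```

For the 20-page paper above, `estimate_cost_usd(20, 50)` gives 2.52. If all 20 pages were scans needing OCR, the same run would come to 4.52.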

For OpenAI / ElevenLabs models, this is all you pay Apify. You pay the provider directly with your own API key on top — that's the whole point of BYOK: no markup.

Pre-flight cost preview & hard cap

Every run writes a PREVIEW key to the key-value store BEFORE TTS starts, with pages to process, estimated audio minutes, and the estimated ceiling cost. The same numbers are printed to the run log.

Set the optional maxCostUsd input to enforce a hard cap: if the estimate exceeds it, the run aborts cleanly before any TTS — you only pay the actor-start fee (plus any OCR already performed). The actual audio-minute charge is also clamped to the cap, so even if the produced audio runs longer than estimated (slow speech, CJK scripts) the final bill never exceeds your cap. Combine with Apify's run-level Max total charge as a second belt-and-suspenders limit.
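
The two behaviours described here, aborting before TTS and clamping the audio-minute charge, can be sketched as follows. This is an interpretation and not the Actor's billing code: the function names are hypothetical, amounts are in cents, and "fixed" stands for everything already charged apart from audio minutes (start fee, pages, OCR).

```python
from typing import Optional


def passes_preflight(estimate_cents: int, cap_cents: Optional[int]) -> bool:
    """Return False when the pre-flight estimate exceeds the cap, so TTS never starts."""
    return cap_cents is None or estimate_cents <= cap_cents


def clamp_audio_charge(fixed_cents: int, audio_cents: int, cap_cents: Optional[int]) -> int:
    """Limit the audio-minute charge so fixed + audio never exceeds the cap."""
    if cap_cents is None:
        return audio_cents
    headroom = max(0, cap_cents - fixed_cents)
    return min(audio_cents, headroom)
```

Under these assumptions, a run with $1.02 of fixed charges and a $2.52 cap would bill at most $1.50 for audio, even if the narration ran to 60 minutes ($1.80) instead of the estimated 50.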

BYOK — Bring Your Own Key

| Model family | Key field | Where to get it | Free tier? |
| --- | --- | --- | --- |
| edge-tts (default) | No key needed | | Yes, completely free |
| openai-* | openaiApiKey | https://platform.openai.com/api-keys | OpenAI charges per character |
| elevenlabs-* | elevenlabsApiKey | https://elevenlabs.io/app/settings/api-keys | ElevenLabs free tier available |

The actor never logs your keys (isSecret: true) and never proxies your calls through our servers — your key talks directly to the provider from inside the actor's run.

FAQ

Which document formats are supported?

  • PDF (.pdf) — text-layer PDFs extract natively; scanned / image-only pages are auto-OCR'd (Tesseract).
  • DOCX (.docx) — Word documents, parsed with mammoth. Styles, lists, tables, footnotes handled natively.
  • EPUB (.epub) — ebooks. Walked in spine order so chapters narrate in the right sequence. HTML stripped per chapter.
  • Markdown (.md, .markdown, .mdx) — syntax stripped so the voice reads natural prose.
  • Plain text (.txt, .text) — UTF-8, BOM handled.
  • HTML (.html, .htm, .xhtml) — tags stripped, entities decoded, scripts and styles removed.
  • RTF (.rtf) — control codes stripped, unicode escapes and hex bytes decoded.

Does it work on scanned PDFs?

Yes. When a PDF page has no selectable text layer, the actor renders it (poppler pdftoppm) and runs OCR (Tesseract) to recover the text, then narrates it. OCR runs only on pages that need it, and those pages are billed via the ocr-page-processed event ($0.10/page). Built-in OCR languages: English, Spanish, French, German, Italian, Portuguese, Dutch (others fall back to English). Turn it off with enableOcr: false to fail fast on scans instead.
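
For readers who want to reproduce the render-then-OCR step locally, the sketch below chains the same two tools from Python. It is an approximation of the approach described above, not the Actor's code: the function names, the 300 DPI setting, and the file naming are assumptions, and both `pdftoppm` and `tesseract` must be installed with the relevant language data.

```python
import subprocess
from pathlib import Path

# ISO 639-1 codes for the seven built-in languages, mapped to Tesseract language codes.
TESSERACT_LANGS = {
    "en": "eng", "es": "spa", "fr": "fra", "de": "deu",
    "it": "ita", "pt": "por", "nl": "nld",
}


def tesseract_lang(iso_code: str) -> str:
    """Map an ISO code to a Tesseract code, falling back to English as described above."""
    return TESSERACT_LANGS.get(iso_code.lower()[:2], "eng")


def ocr_page(pdf_path: str, page: int, iso_code: str = "en", dpi: int = 300, workdir: str = ".") -> str:
    """Render one PDF page to PNG with pdftoppm, then recover its text with Tesseract."""
    prefix = Path(workdir) / f"page-{page}"
    subprocess.run(
        ["pdftoppm", "-f", str(page), "-l", str(page), "-r", str(dpi),
         "-png", "-singlefile", str(pdf_path), str(prefix)],
        check=True,
    )
    result = subprocess.run(
        ["tesseract", f"{prefix}.png", "stdout", "-l", tesseract_lang(iso_code)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```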

Does it work on password-protected PDFs?

Yes — pass the password in the pdfPassword input and the actor decrypts the PDF before extraction.

What about ODT, MOBI, AZW3, or Pages?

Not supported in v0.1. Convert ODT to DOCX first; for Kindle formats, convert via Calibre to EPUB.

Why do OpenAI / ElevenLabs models require BYOK?

So we never mark up the provider price. Pay Apify for the actor work, pay OpenAI / ElevenLabs directly for the TTS calls. Cleaner, cheaper, more honest. For zero-key zero-friction runs, the default edge-tts is free and gives great quality on 70+ languages.

How is "page" defined for TXT / MD / HTML?

There are no real pages, so the actor splits the cleaned text into ~3000-char pseudo-pages — roughly the length of one PDF page of prose. This keeps pageRange and per-page billing fair across formats.
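
One way to picture the pseudo-page split is the sketch below. The exact boundary rule the Actor uses is not documented here, so backing off to the nearest preceding space is an assumption, as is the function name.

```python
def split_pseudo_pages(text: str, page_chars: int = 3000) -> list[str]:
    """Split cleaned text into pseudo-pages of at most page_chars characters."""
    pages, start = [], 0
    while start < len(text):
        end = min(start + page_chars, len(text))
        if end < len(text):
            # Assumption: back off to the last space so a page never ends mid-word.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        pages.append(text[start:end].strip())
        start = end
    return [page for page in pages if page]
```

Under this rule a 7,500-character article becomes three pseudo-pages, so `pageRange: 2` would select roughly the middle third and the run would bill three pdf-page-narrated events for the full text.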

How long can the document be?

PDFs up to 50 MB. EPUB up to 40 MB. DOCX up to 30 MB. TXT / MD / HTML / RTF up to 20 MB of decoded text. There is no hard page limit. Long inputs are auto-split into chapter-sized MP3 parts (configurable via maxPartMb).

What if my run times out or fails partway?

Re-run with the same input. The resume option (on by default) skips already-synthesized chunks via a shared cache, so you only pay TTS for the missing pieces.

Can I get word-level timestamps?

v0.1 does not emit timestamps. Coming in a future version.

Can I use multiple speakers / podcast mode?

Not in v0.1. Single-voice narration only.

What audio format is produced?

MP3, standard playback on any device. Edge TTS produces 24 kHz mono; OpenAI / ElevenLabs use their default high-quality output.

Can I override the voice with a custom ElevenLabs voice?

Yes — paste any ElevenLabs voice ID into the voice field when using an elevenlabs-* model. You can clone your own voice in your ElevenLabs account and use that ID here.

Is my API key safe?

Yes. API keys are marked isSecret: true in the input schema and are never logged or persisted.

Built and maintained by Equipinico

Need a custom variant (different language model, custom voices, SSML support, podcast multi-speaker)? Reach out via the Apify Store contact link.