Pricing

from $40.00 / 1,000 audio minutes generated

PDF to MP3 - Convert PDF, EPUB, DOCX & Text to Audiobook

Convert PDF, EPUB, DOCX, Markdown, HTML, TXT, and RTF to MP3 audiobooks. Free Microsoft Edge TTS (no API key) with OCR for scanned PDFs, 70+ languages, and optional OpenAI or ElevenLabs voices. ~$0.04/min.

Developer

Marielise

Text to Audio Narrator

Turn any document — PDF, DOCX, EPUB, Markdown, plain text, HTML, or RTF — into an MP3 audiobook in one run. Paste a URL, upload a file, drop in raw text, or send base64 bytes. Pick a voice. Click run. Get a downloadable MP3. No prompts to chain, no manual chunking, no ffmpeg gymnastics. Just clean audio at the end.

Even scanned / image-only PDFs work — when a page has no text layer, the actor automatically OCRs it (Tesseract) and narrates the recovered text.

Use the free Edge TTS voices by default (400+ neural voices, 70+ languages, no API key) — or bring your own OpenAI / ElevenLabs key for premium voices and steerable narration.

Perfect for: Anyone with a backlog of Project Gutenberg / Calibre EPUBs and zero time to read them, researchers listening to arXiv PDFs on a commute, business users turning Word DOCX reports into audio briefings, devs converting READMEs and blog drafts into audio for proofing, knowledge workers narrating long reports, accessibility-first publishers, podcast producers prototyping audiobook conversions, students reviewing textbook chapters on the go, journalists drafting voiceovers, and Substack writers exporting their newsletter to audio.

Features

Feature	Description
Seven input formats	PDF (text-layer), DOCX (Word, via mammoth), EPUB (ebook, spine-ordered chapters), Markdown (syntax stripped), plain text, HTML (tags stripped), RTF (control codes stripped). Auto-detected from magic bytes / mimetype / extension
OCR for scanned PDFs	Pages with no selectable text layer are auto-rendered and OCR'd with Tesseract (7 languages: EN, ES, FR, DE, IT, PT, NL) so scanned books and photographed pages narrate too. Only the pages that actually need OCR are processed and billed
Encrypted PDF support	Provide the password via `pdfPassword` to decrypt and narrate password-protected PDFs
SSRF-guarded URL fetch + proxy	Document URLs are validated against private / internal address ranges before fetching. Optional Apify proxy for hosts that block datacenter IPs
Four ways to provide it	Public URL, file upload, base64 paste, or raw text paste (great for blog drafts & ChatGPT replies)
Free Edge TTS by default	Microsoft Edge neural voices, no API key, no per-character cost — works out of the box
One document in, one (or many) MP3s out	Short inputs produce a single MP3. Long ones are auto-split into chapter-sized parts, plus a shareable INDEX.html page with inline players + download links
Multiple TTS engines	Edge TTS (FREE, default), OpenAI gpt-4o-mini-tts (steerable), OpenAI tts-1 / tts-1-hd, ElevenLabs Flash v2.5 / Turbo v2.5
BYOK for premium models	OpenAI and ElevenLabs models require your own API key (we never mark up the provider price)
Steerable narration	With gpt-4o-mini-tts you can prompt the voice ("calm audiobook narrator", "energetic podcast host", "slow and deliberate") without retraining
Auto language detection	Edge TTS auto-picks a matching voice based on the text language. Or force a specific Azure voice name (e.g. `en-US-AndrewNeural`)
Markdown-aware	Markdown syntax is stripped (no "asterisk asterisk bold asterisk asterisk"). Headings, lists, links, code fences, tables, and inline formatting all read as natural prose
Page / section range support	Narrate the whole document, a single chapter, or a custom slice. Works on PDFs (real pages) and TXT / MD / HTML (~3000-char pseudo-pages). Range syntax: `1-10`, `1,3,5`, `1-3,7-9`
Smart chunking	Text is split on paragraph and sentence boundaries before TTS. Hard cuts respect word boundaries so chunks don't start mid-word
Pre-flight cost preview + hard cap	Every run prints an estimated ceiling cost before TTS starts and writes a `PREVIEW` key. Set `maxCostUsd` to abort before TTS if the estimate is too high — and it also clamps the actual audio-minute charge so the final bill never exceeds your cap
Provider-adaptive concurrency	Auto-caps parallel TTS calls per provider (ElevenLabs 2, Edge 8, OpenAI 10) so free tiers don't hit 429 storms
Resume failed runs	Already-synthesized chunks are cached. If a long run times out or fails partway, the next run picks up where it left off — no re-paying for TTS already done
Skip-failed-chunks mode	If a single chunk keeps failing after retries, skip it and keep narrating the rest of the document (configurable). Auth / quota errors always abort cleanly
ffmpeg concat + ID3 tags	Chunks are stitched into valid MP3 containers with correct duration metadata and ID3 tags (title, album, track, genre=Audiobook) so players show proper info. No "audio glitch at minute 4" bugs
Transparent pricing	Pay-per-event: a per-run start fee, per page narrated, per audio minute, and per OCR'd page. No surcharges, no markups on BYOK providers
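
The SSRF guard listed above can be pictured with a small sketch. This is not the actor's code: the function names are hypothetical, and a production guard would also resolve DNS names and check every resolved address, which is skipped here to keep the example offline.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_address(host: str) -> bool:
    """Return True when `host` is an IP literal in a non-public range."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real guard would resolve the name and
        # check each resolved address; this sketch lets it through.
        return False
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def validate_document_url(url: str) -> str:
    """Hypothetical pre-fetch check: reject non-HTTP(S) and internal targets."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    host = parsed.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost" or is_blocked_address(host):
        raise ValueError(f"refusing to fetch internal address: {host}")
    return url
```

With this sketch, `validate_document_url("http://169.254.169.254/latest")` raises `ValueError`, while a public hostname is returned unchanged.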

How to Use

Step 1: Provide a document

Any one of:

Document URL — any publicly reachable URL ending in .pdf, .docx, .epub, .md, .markdown, .txt, .html, .htm, or .rtf
Upload file — drag and drop a PDF / DOCX / EPUB / MD / TXT / HTML / RTF (uploaded to the run's key-value store)
Base64 content — paste raw base64 bytes; format auto-detected from magic bytes / mimetype
Raw text paste — paste prose, Markdown, or HTML directly into the text field (perfect for blog drafts, ChatGPT replies, READMEs)

Scanned / image-only PDF pages (no text layer) are automatically OCR'd when enableOcr is on (default). For encrypted / password-protected PDFs, supply pdfPassword.
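
As a rough illustration of format auto-detection from magic bytes with an extension fallback, consider the sketch below. The function name, the returned labels, and the rule that an unrecognised ZIP is treated as DOCX are all assumptions; the actor's real detection also uses the mimetype and a content sniff.

```python
import os

# Extension fallback table (the labels on the right are hypothetical).
EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".epub": "epub",
    ".md": "markdown", ".markdown": "markdown",
    ".txt": "text", ".html": "html", ".htm": "html", ".rtf": "rtf",
}


def sniff_format(data: bytes, filename: str = "") -> str:
    """Guess the document format from leading bytes, then the extension."""
    ext = os.path.splitext(filename.lower())[1]
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(b"PK\x03\x04"):
        # DOCX and EPUB are both ZIP containers. An EPUB stores an
        # uncompressed "mimetype" entry first, so its media type is
        # visible near the start of the file.
        if b"application/epub+zip" in data[:128] or ext == ".epub":
            return "epub"
        return "docx"  # assumption: any other ZIP is treated as DOCX
    return EXTENSIONS.get(ext, "text")
```

Checking bytes before the extension means a PDF saved with the wrong extension is still parsed as a PDF.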

Step 2: Pick a voice and model

Model	Best for	API key required
`edge-tts` (default)	Free, long books, multi-language	No key — free
`openai-gpt-4o-mini-tts`	Steerable narration with style instructions	OpenAI key (BYOK)
`openai-tts-1`	Bulk runs, low cost, supports `speed`	OpenAI key (BYOK)
`openai-tts-1-hd`	High-quality OpenAI audio	OpenAI key (BYOK)
`elevenlabs-flash-v2_5`	Fast, real-time-quality voices	ElevenLabs key (BYOK)
`elevenlabs-turbo-v2_5`	Highest-quality ElevenLabs voices	ElevenLabs key (BYOK)
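
The per-provider limits quoted in this README (ElevenLabs 2, Edge 8, OpenAI 10 parallel calls; `chunkSize` capped at 2500 for ElevenLabs) can be sketched as a clamp. The helper name and the way the provider is derived from the model ID are assumptions for illustration only.

```python
# Concurrency caps as stated in the Features table.
PROVIDER_MAX_CONCURRENCY = {"edge": 8, "openai": 10, "elevenlabs": 2}


def effective_settings(model: str, concurrency: int = 5,
                       chunk_size: int = 4000) -> tuple[int, int]:
    """Return (concurrency, chunk_size) after applying documented limits."""
    provider = model.split("-", 1)[0]  # "edge-tts" -> "edge"
    if provider not in PROVIDER_MAX_CONCURRENCY:
        raise ValueError(f"unknown model: {model}")
    # Input bounds: concurrency 1 to 20, chunkSize 500 to 4096.
    concurrency = max(1, min(concurrency, 20, PROVIDER_MAX_CONCURRENCY[provider]))
    chunk_size = max(500, min(chunk_size, 4096))
    if provider == "elevenlabs":
        chunk_size = min(chunk_size, 2500)
    return concurrency, chunk_size
```

For example, `effective_settings("elevenlabs-flash-v2_5")` yields `(2, 2500)` even though the defaults are 5 and 4000.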

Step 3: (Optional) Set a range, voice, speed, or instructions

voice — leave blank for auto. For Edge TTS use an Azure ShortName like en-US-AndrewNeural, es-ES-ElviraNeural. For OpenAI: alloy, echo, fable, onyx, nova, shimmer, coral, sage. For ElevenLabs: a voice ID.
language — auto (recommended) or a specific ISO code (Edge TTS only — OpenAI voices are multilingual).
pageRange — e.g. 1-10 or 1,3,5 or 1-3,7-9. Empty = whole document. For non-PDF formats, "pages" are ~3000-char sections.
speed — 0.25 to 4.0. Only applies to openai-tts-1 / openai-tts-1-hd.
instructions — free-form style hint for openai-gpt-4o-mini-tts, e.g. "Calm, slow audiobook narrator with a neutral accent."
enableOcr — on by default. Auto-OCRs scanned PDF pages that have no text layer. Turn off to fail fast on scans instead.
pdfPassword — password for encrypted PDFs.
proxyConfiguration — optional Apify proxy, used only for the Document URL fetch.
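
To make the `pageRange` syntax concrete, here is a minimal parser sketch. It is an illustration, not the actor's implementation; clamping ranges to the document length and rejecting reversed ranges are assumptions.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec like "1-3,7-9" into sorted 1-based page numbers."""
    spec = spec.strip()
    if not spec:
        # Empty means the whole document.
        return list(range(1, total_pages + 1))
    pages: set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(token)
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {token!r}")
        # Assumption: pages past the end of the document are ignored.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```

Under these assumptions, `parse_page_range("1-3,7-9", 20)` gives pages 1, 2, 3, 7, 8, and 9.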

Step 4: Run and download

The Actor:

Downloads / decodes / reads the input
Detects the format (PDF magic bytes + content-type + extension + content sniff)
Extracts and normalises the text (page-range aware, Markdown / HTML aware)
Splits into TTS-sized chunks at sentence boundaries (word-boundary safe hard cuts)
Synthesises each chunk with the chosen provider in parallel
Folds chunks into chapter-sized parts as they complete (ffmpeg concat)
Uploads each part to the key-value store + writes a shareable INDEX.html
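
The chunking step in the pipeline above can be sketched as follows. This is a simplified stand-in under stated assumptions: it only splits on sentence-ending punctuation (the actor also uses paragraph boundaries), and the function name is hypothetical.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 4000) -> list[str]:
    """Pack whole sentences into chunks of at most `chunk_size` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut a sentence that is longer than a chunk, preferring
        # the last space so no chunk starts mid-word.
        while len(sentence) > chunk_size:
            cut = sentence.rfind(" ", 0, chunk_size + 1)
            if cut <= 0:
                cut = chunk_size  # no space at all: forced cut
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact matters for TTS because each chunk is synthesized independently, so a cut mid-sentence would produce an audible break in intonation.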

You'll find the result in:

The dataset — one row with metadata (indexUrl, audioUrl, partsCount, parts[], durationSeconds, chars, pagesProcessed, cost, status)
The key-value store — each MP3 part, the INDEX.html page, the PREVIEW estimate, and the OUTPUT record

Input Reference

Field	Type	Required	Description
`documentUrl`	string	one of	Public URL (PDF / DOCX / EPUB / MD / TXT / HTML / RTF)
`documentFile`	file	one of	Upload a PDF / DOCX / EPUB / MD / TXT / HTML / RTF from your device
`documentBase64`	string	one of	Base64-encoded document bytes
`text`	string	one of	Paste raw prose, Markdown, or HTML directly
`model`	enum	no	TTS model (default `edge-tts`, free)
`voice`	string	no	Edge ShortName, OpenAI voice, or ElevenLabs voice ID
`language`	enum	no	Auto-detect (default) or specific ISO code
`speed`	number	no	0.25 to 4.0, default 1.0. tts-1 / tts-1-hd only
`instructions`	string	no	Free-form style for gpt-4o-mini-tts
`pageRange`	string	no	e.g. `1-10` or `1,3,5`. Empty = full document
`chunkSize`	integer	no	500 to 4096, default 4000 (auto-clamped to 2500 for ElevenLabs)
`concurrency`	integer	no	1 to 20 parallel TTS requests, default 5 (auto-clamped per provider)
`resume`	boolean	no	Skip already-synthesized chunks from previous runs (default true)
`skipFailedChunks`	boolean	no	Skip individual chunk failures instead of aborting (default true)
`maxPartMb`	integer	no	Max size per MP3 part, default 40MB
`maxCostUsd`	number	no	Hard cap (min 0.02). Aborts before TTS if the estimate exceeds it, and clamps the audio-minute charge so the final bill never exceeds the cap
`enableOcr`	boolean	no	OCR scanned / image-only PDF pages (default true)
`pdfPassword`	secret	no	Password for encrypted PDFs
`proxyConfiguration`	object	no	Apify proxy for the Document URL fetch
`openaiApiKey`	secret	required for openai-*	Your OpenAI API key (BYOK)
`elevenlabsApiKey`	secret	required for elevenlabs-*	Your ElevenLabs API key (BYOK)
`debug`	boolean	no	Verbose logs
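
A hedged sketch of building this input and starting a run from Python follows. The field names come from the table above; `build_run_input` and `run_actor` are hypothetical helpers, the actor ID is a placeholder you must supply, and the client calls assume the `apify-client` package, where `call()` returns a dict-like run record.

```python
def build_run_input(document_url: str, model: str = "edge-tts", **options) -> dict:
    """Assemble an input object and check the BYOK key requirement."""
    run_input = {"documentUrl": document_url, "model": model, **options}
    if model.startswith("openai-") and not run_input.get("openaiApiKey"):
        raise ValueError("openai-* models need openaiApiKey")
    if model.startswith("elevenlabs-") and not run_input.get("elevenlabsApiKey"):
        raise ValueError("elevenlabs-* models need elevenlabsApiKey")
    return run_input


def run_actor(token: str, actor_id: str, run_input: dict) -> list[dict]:
    """Start the run and return its dataset rows (requires network access)."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Assumption: `run` is a dict that carries the default dataset ID.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be `build_run_input("https://example.com/paper.pdf", pageRange="1-10", maxCostUsd=3.0)`, passed to `run_actor` together with your API token and the actor's ID from its Store page.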

Output Example

{
    "indexUrl": "https://api.apify.com/v2/key-value-stores/.../records/INDEX",
    "audioUrl": "https://api.apify.com/v2/key-value-stores/.../records/narration-abc123-part001.mp3",
    "audioKvKey": "narration-abc123-part001.mp3",
    "durationSeconds": 1843.2,
    "partsCount": 2,
    "parts": [
        { "part": 1, "key": "narration-abc123-part001.mp3", "url": "https://...", "durationSeconds": 1200, "bytes": 12500000 },
        { "part": 2, "key": "narration-abc123-part002.mp3", "url": "https://...", "durationSeconds": 643.2, "bytes": 6700000 }
    ],
    "chars": 48210,
    "pagesProcessed": 24,
    "ocrPagesProcessed": 0,
    "voice": "en-US-AndrewNeural",
    "model": "edge-tts",
    "cost": 2.13,
    "status": "success",
    "chunksTotal": 13,
    "chunksSucceeded": 13,
    "chunksFailed": 0,
    "generatedAt": "2026-06-02T10:30:00.000Z"
}
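
One way to consume a record shaped like the example above is sketched here. The helper name and the returned summary keys are hypothetical; only the field names read from the record come from the example.

```python
def summarize_output(record: dict) -> dict:
    """Condense one dataset row into the values a caller usually needs."""
    parts = record.get("parts", [])
    total_seconds = sum(part["durationSeconds"] for part in parts)
    total_bytes = sum(part["bytes"] for part in parts)
    return {
        # Treat the run as clean only if nothing was skipped.
        "ok": record.get("status") == "success" and record.get("chunksFailed", 0) == 0,
        "minutes": round(total_seconds / 60, 1),
        "megabytes": round(total_bytes / 1_000_000, 1),
        "urls": [part["url"] for part in parts],
    }
```

For the example record, the two parts add up to 1843.2 seconds, which matches the top-level `durationSeconds`.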

Use Cases

Any ebook → free audiobook — drop a Project Gutenberg EPUB and listen to a full classic novel. Spine-ordered chapters narrate in the right sequence.
Word doc → audio briefing — drop your DOCX report and listen on a commute instead of skimming on screen.
Audiobook prototyping — convert your ebook PDF into MP3 to validate narrator tone before commissioning a human voiceover.
Research papers on the go — listen to arXiv PDFs during a commute or workout.
README → audio — paste your project README and listen to your own docs to spot rough explanations.
Blog draft proofing — paste a Markdown blog draft and listen to it before publishing. You'll hear awkward phrasing you'd never catch reading.
ChatGPT reply → podcast snippet — copy a long ChatGPT response into the text field and listen as audio.
Accessibility — generate audio versions of internal documentation for screen-reader-light workflows.
Onboarding — pipe HR PDFs (handbooks, policies) into audio for distributed teams.
Newsletter audio versions — automatically narrate weekly reports (PDF, MD, or HTML) for paying subscribers.
Language learning — narrate text in different voices and speeds to practise listening comprehension.
Substack → audio export — export a post as HTML and narrate it for a podcast feed.

Pricing

This Actor uses Pay Per Event so you only pay for the work the run actually does. No premium-voice surcharges, no provider markups.

Event	Price (USD)	When charged
`actor-start`	$0.02	Once per run, after the document loads successfully
`pdf-page-narrated`	$0.05	Once per page (PDF) or per ~3000-char section (TXT / MD / HTML) successfully narrated
`audio-minute-generated`	$0.03	Once per minute of MP3 output
`ocr-page-processed`	$0.10	Only for scanned / image-only PDF pages that had no text layer and were recovered via OCR. Text-layer PDFs never pay this

Typical cost example — a 20-page research paper (~40k chars, ~50 minutes of audio):

actor-start: $0.02
20 × pdf-page-narrated: $1.00
~50 × audio-minute-generated: $1.50
Total: ~$2.52 for the full paper
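
The arithmetic above can be reproduced with a short sketch that uses the event prices from the table. Working in cents avoids floating-point drift; the function name is hypothetical, and it assumes that an OCR'd page pays both the OCR event and the page-narrated event and that the caller passes whole billable minutes.

```python
# Event prices from the pricing table, in cents.
PRICE_CENTS = {
    "actor-start": 2,
    "pdf-page-narrated": 5,
    "audio-minute-generated": 3,
    "ocr-page-processed": 10,
}


def estimate_cost_usd(pages: int, audio_minutes: int, ocr_pages: int = 0) -> float:
    """Sum the pay-per-event charges for one run, in USD."""
    cents = (
        PRICE_CENTS["actor-start"]
        + pages * PRICE_CENTS["pdf-page-narrated"]
        + audio_minutes * PRICE_CENTS["audio-minute-generated"]
        + ocr_pages * PRICE_CENTS["ocr-page-processed"]
    )
    return cents / 100


print(estimate_cost_usd(pages=20, audio_minutes=50))  # 2.52
```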

For OpenAI / ElevenLabs models, this is all you pay Apify. You pay the provider directly with your own API key on top — that's the whole point of BYOK: no markup.

Pre-flight cost preview & hard cap

Every run writes a PREVIEW key to the key-value store BEFORE TTS starts, with pages to process, estimated audio minutes, and the estimated ceiling cost. The same numbers are printed to the run log.

Set the optional maxCostUsd input to enforce a hard cap: if the estimate exceeds it, the run aborts cleanly before any TTS — you only pay the actor-start fee (plus any OCR already performed). The actual audio-minute charge is also clamped to the cap, so even if the produced audio runs longer than estimated (slow speech, CJK scripts) the final bill never exceeds your cap. Combine with Apify's run-level Max total charge as a second belt-and-suspenders limit.
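
The two behaviours of `maxCostUsd` described above, a pre-flight abort and a clamp on the audio-minute charge, can be sketched like this. Both function names are hypothetical and the actor's internal accounting may differ.

```python
from typing import Optional


def should_proceed(estimate_usd: float, max_cost_usd: Optional[float] = None) -> bool:
    """Pre-flight check: continue to TTS only if the estimate fits the cap."""
    return max_cost_usd is None or estimate_usd <= max_cost_usd


def clamp_audio_charge(other_charges_usd: float, audio_charge_usd: float,
                       max_cost_usd: Optional[float] = None) -> float:
    """Limit the audio-minute charge to whatever headroom the cap leaves."""
    if max_cost_usd is None:
        return audio_charge_usd
    headroom = max(0.0, max_cost_usd - other_charges_usd)
    return round(min(audio_charge_usd, headroom), 2)
```

With a $3.00 cap and $1.02 already charged for the start fee and pages, an audio charge that would have been $2.40 is reduced to $1.98, so the total stays at the cap.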

BYOK — Bring Your Own Key

Model family	Key field	Where to get it	Free tier?
`edge-tts` (default)	—	No key needed	Yes — completely free
`openai-*`	`openaiApiKey`	https://platform.openai.com/api-keys	OpenAI charges per character
`elevenlabs-*`	`elevenlabsApiKey`	https://elevenlabs.io/app/settings/api-keys	ElevenLabs free tier available

The actor never logs your keys (isSecret: true) and never proxies your calls through our servers — your key talks directly to the provider from inside the actor's run.

FAQ

Which document formats are supported?

PDF (.pdf) — text-layer PDFs extract natively; scanned / image-only pages are auto-OCR'd (Tesseract).
DOCX (.docx) — Word documents, parsed with mammoth. Styles, lists, tables, footnotes handled natively.
EPUB (.epub) — ebooks. Walked in spine order so chapters narrate in the right sequence. HTML stripped per chapter.
Markdown (.md, .markdown, .mdx) — syntax stripped so the voice reads natural prose.
Plain text (.txt, .text) — UTF-8, BOM handled.
HTML (.html, .htm, .xhtml) — tags stripped, entities decoded, scripts and styles removed.
RTF (.rtf) — control codes stripped, unicode escapes and hex bytes decoded.

Does it work on scanned PDFs?

Yes. When a PDF page has no selectable text layer, the actor renders it (poppler pdftoppm) and runs OCR (Tesseract) to recover the text, then narrates it. OCR runs only on pages that need it, and those pages are billed via the ocr-page-processed event ($0.10/page). Built-in OCR languages: English, Spanish, French, German, Italian, Portuguese, Dutch (others fall back to English). Turn it off with enableOcr: false to fail fast on scans instead.

Does it work on password-protected PDFs?

Yes — pass the password in the pdfPassword input and the actor decrypts the PDF before extraction.

What about ODT, MOBI, AZW3, or Pages?

Not supported in v0.1. Convert ODT to DOCX first; for Kindle formats, convert via Calibre to EPUB.

Why do OpenAI / ElevenLabs models require BYOK?

So we never mark up the provider price. Pay Apify for the actor work, pay OpenAI / ElevenLabs directly for the TTS calls. Cleaner, cheaper, more honest. For zero-key, zero-friction runs, the default edge-tts is free and gives great quality in 70+ languages.

How is "page" defined for TXT / MD / HTML?

There are no real pages, so the actor splits the cleaned text into ~3000-char pseudo-pages — roughly the length of one PDF page of prose. This keeps pageRange and per-page billing fair across formats.
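
A minimal sketch of pseudo-page splitting is shown below. Whether the actor cuts at paragraph boundaries or at a fixed character count is not stated, so grouping whole paragraphs up to the target is an assumption, as is the function name.

```python
def split_pseudo_pages(text: str, target_chars: int = 3000) -> list[str]:
    """Group paragraphs into sections of roughly `target_chars` characters."""
    pages: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Start a new pseudo-page when adding this paragraph would overflow.
        if current and len(current) + 2 + len(paragraph) > target_chars:
            pages.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        pages.append(current)
    return pages
```

In this sketch a single paragraph longer than the target stays whole, so a section can exceed 3000 characters; a stricter variant would cut it at a sentence boundary.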

How long can the document be?

PDFs up to 50 MB. EPUB up to 40 MB. DOCX up to 30 MB. TXT / MD / HTML / RTF up to 20 MB of decoded text. There is no hard page limit. Long inputs are auto-split into chapter-sized MP3 parts (configurable via maxPartMb).

What if my run times out or fails partway?

Re-run with the same input. The resume option (on by default) skips already-synthesized chunks via a shared cache, so you only pay TTS for the missing pieces.
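
The resume behaviour can be pictured as a content-addressed cache. The sketch below is an assumption about how such a cache could work, not the actor's implementation: the key derivation, the `synthesize` callback, and the dict used as a store are all hypothetical.

```python
import hashlib
from typing import Callable


def chunk_cache_key(text: str, model: str, voice: str) -> str:
    """Derive a stable key from everything that affects the audio."""
    payload = "\x1f".join([model, voice, text])
    return "chunk-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


def synthesize_with_resume(chunks: list[str], model: str, voice: str,
                           cache: dict, synthesize: Callable[[str], bytes]) -> list[bytes]:
    """Call the TTS provider only for chunks missing from the cache."""
    audio = []
    for chunk in chunks:
        key = chunk_cache_key(chunk, model, voice)
        if key not in cache:
            cache[key] = synthesize(chunk)  # paid call, made once per chunk
        audio.append(cache[key])
    return audio
```

Because the key includes the model and voice, changing either one produces new keys and the chunks are synthesized again rather than reusing stale audio.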

Can I get word-level timestamps?

v0.1 does not emit timestamps. Coming in a future version.

Can I use multiple speakers / podcast mode?

Not in v0.1. Single-voice narration only.

What audio format is produced?

MP3, playable on any standard device. Edge TTS produces 24 kHz mono; OpenAI / ElevenLabs use their default high-quality output.

Can I override the voice with a custom ElevenLabs voice?

Yes — paste any ElevenLabs voice ID into the voice field when using an elevenlabs-* model. You can clone your own voice in your ElevenLabs account and use that ID here.

Is my API key safe?

Yes. API keys are marked isSecret: true in the input schema and are never logged or persisted.

Built and maintained by Equipinico

Need a custom variant (different language model, custom voices, SSML support, multi-speaker podcast mode)? Reach out via the Apify Store contact link.
