Pricing

from $2.80 / 1,000 document reads

PDF & Document to Markdown - PDF, DOCX & HTML for LLMs

Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown with metadata (title, pages, word count) and an optional AI summary. The document counterpart to a web reader — built for RAG ingestion, document Q&A, and AI agents (LangChain, LlamaIndex). Fast, structured, single-call.

Developer

AIDevs

Last modified

3 days ago

AI Document Reader

Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown — with metadata and an optional AI summary, in a single call.

AI Document Reader is the document counterpart to a web-page reader. Point it at a document URL and it auto-detects the format, extracts the real content (not the binary noise), and returns structured text, Markdown, and metadata that you can feed straight into an LLM, a vector database, or a RAG pipeline.

Why AI Document Reader

Most "read a document" steps in AI pipelines are a mess of format-specific parsers, broken encodings, and inconsistent output. This Actor gives you one endpoint for the most common document formats and a single, predictable output shape regardless of whether the source was a PDF, a Word file, a text file, or an HTML page.

One call, one record. Each run returns exactly one structured document record.
LLM-ready by default. You get both clean plain text and Markdown — no post-processing required.
Bring-your-own-key summaries. Optional AI TL;DR + key points using your own OpenAI key, so model cost stays with you.

When to use it

Ingesting PDFs/DOCX into a RAG pipeline or vector database.
Building a document Q&A bot or research agent that needs clean text from a link.
No-code automations (Make, Zapier, n8n) that receive a document URL and need its contents.
Quickly turning a report, whitepaper, or contract into Markdown for an LLM prompt.

When NOT to use it

Deep-crawling an entire website — use a site crawler instead; this reads a single document/URL.
Scanned/image-only PDFs — there is no OCR step, so image-only PDFs return little or no text.
Password-protected or login-gated files — the Actor fetches the URL as an anonymous client.

Built for

AI engineers, data teams, RAG/LLM developers, and automation builders who need a reliable "document → text" primitive.

How it works

Fetch. The Actor downloads the document at url as raw bytes (with redirects followed).
Detect. It identifies the format from the Content-Type header, the URL extension, and the file's magic bytes (e.g. %PDF for PDFs, PK for the ZIP container that DOCX files use).
Extract.
- PDF → parsed with pdf-parse (text + page count + embedded title/author).
- DOCX → converted to HTML with mammoth, then to clean Markdown.
- HTML → main content isolated (nav/header/footer/scripts removed) and converted to Markdown.
- TXT / Markdown → returned as-is.
(Optional) Summarize. If summarize is on and an OpenAI key is supplied, it generates a TL;DR + key points.
Output. One record is pushed to the dataset; usage is billed per event.
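
The Detect step can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the function name detect_file_type, the lookup tables, and the order of precedence (magic bytes first, then Content-Type, then extension) are all assumptions.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables; the Actor's real rules are not documented here.
CONTENT_TYPES = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "text/html": "html",
    "text/plain": "text",
    "text/markdown": "text",
}
EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".html": "html",
    ".htm": "html", ".txt": "text", ".md": "text",
}


def detect_file_type(url: str, content_type: str, head: bytes) -> str:
    """Guess the format from magic bytes, then Content-Type, then URL extension."""
    # Magic bytes are checked first here because servers can mislabel downloads.
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK"):
        # Every ZIP starts with PK; a stricter check would look inside the archive.
        return "docx"
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";")[0].strip().lower()
    if mime in CONTENT_TYPES:
        return CONTENT_TYPES[mime]
    # urlparse().path excludes the query string, so "?x=1" does not hide the extension.
    path = urlparse(url).path.lower()
    for ext, kind in EXTENSIONS.items():
        if path.endswith(ext):
            return kind
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "text"
```

The return values match the fileType values listed in the output reference (pdf, docx, html, text).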

How to call it

From the Console

Open the Actor, paste a document URL into Document URL, optionally enable Generate AI summary with your OpenAI key, and click Start. Read the result in the Output tab.

From the API

Start a run with a POST request to the Apify API, then read the record from the run's default dataset. Conceptually:

POST https://api.apify.com/v2/acts/entranced_gelato~ai-document-reader/runs?token=<APIFY_TOKEN>
{
  "url": "https://example.com/report.pdf",
  "summarize": true,
  "openaiApiKey": "sk-...",
  "model": "gpt-4o-mini"
}

The Actor is also callable over MCP, so AI agents can invoke it as a tool.

Input reference

Field	Type	Required	Default	Description
`url`	string	Yes	—	Direct URL to the document (PDF, DOCX, TXT, or HTML).
`summarize`	boolean	No	`false`	Generate an AI TL;DR + key points (requires `openaiApiKey`).
`openaiApiKey`	string (secret)	No	—	Your OpenAI key; used only for the summary.
`model`	string	No	`gpt-4o-mini`	OpenAI model for the summary.
`maxChars`	integer	No	`0`	Cap the length of returned text/markdown (`0` = no limit).
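
A small helper can assemble this input and catch obvious mistakes before a run is started. The sketch below is an assumption about sensible client-side checks, not the Actor's own validation; build_run_input and its Python parameter names are hypothetical, while the emitted keys are the field names from the table above.

```python
from typing import Any, Optional


def build_run_input(
    url: str,
    summarize: bool = False,
    openai_api_key: Optional[str] = None,
    model: str = "gpt-4o-mini",
    max_chars: int = 0,
) -> dict[str, Any]:
    """Assemble an input object using the documented field names."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    if summarize and not openai_api_key:
        raise ValueError("summarize=True requires openaiApiKey")
    if max_chars < 0:
        raise ValueError("maxChars must be 0 (no limit) or a positive integer")

    run_input: dict[str, Any] = {
        "url": url,
        "summarize": summarize,
        "maxChars": max_chars,
    }
    if summarize:
        # Only send the key and model when a summary is actually requested.
        run_input["openaiApiKey"] = openai_api_key
        run_input["model"] = model
    return run_input
```

The returned dictionary can be passed as run_input to the Apify client or sent as the JSON body of the API request.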

Output reference

One dataset record per run:

Field	Description
`url`	The document URL that was read.
`fileType`	Detected format: `pdf`, `docx`, `html`, or `text`.
`title`	Document title (from PDF info or first heading), if available.
`author`	Author metadata, if present.
`pageCount`	Number of pages (PDF only).
`wordCount`	Word count of the extracted text.
`content`	Clean plain text.
`markdown`	LLM-ready Markdown version.
`summary`	AI TL;DR (only when summarization is enabled).
`keyPoints`	Array of key points (only when summarization is enabled).
`fetchedAt`	ISO 8601 timestamp (UTC) of the run.

Pricing

Pay per event — you only pay for what you run:

Document read — charged once per successful run (one document).
AI summary — a small premium that applies only when you enable summarization. You supply your own OpenAI key, so the model's cost is billed by OpenAI separately and is never added to the Actor price.

Apify platform/compute usage is included in the per-event price. See the Pricing tab for current rates.
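
For budgeting, the per-event model reduces to simple arithmetic. The sketch below uses the "from $2.80 / 1,000 document reads" figure listed for this Actor as a default. The summary premium is not stated in this README, so summary_price_per_1000 is a placeholder you would fill in from the Pricing tab, and estimate_cost is a hypothetical helper name.

```python
from decimal import Decimal


def estimate_cost(
    documents: int,
    summaries: int = 0,
    read_price_per_1000: Decimal = Decimal("2.80"),
    summary_price_per_1000: Decimal = Decimal("0"),  # placeholder: not stated in this README
) -> Decimal:
    """Estimate Actor charges in USD, excluding your own OpenAI usage."""
    if documents < 0 or summaries < 0:
        raise ValueError("counts must be non-negative")
    if summaries > documents:
        raise ValueError("summaries cannot exceed documents (one summary per read)")
    total_per_1000 = documents * read_price_per_1000 + summaries * summary_price_per_1000
    return total_per_1000 / 1000
```

At the listed starting rate, 25,000 document reads without summaries come to 25,000 × 2.80 / 1,000 = $70. Actual rates may differ by plan, so treat the default as a lower bound.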

Integrations

LangChain / LlamaIndex — feed content/markdown into document loaders and vector stores.
Make / Zapier / n8n — trigger on a document URL, store the structured output.
MCP — expose the Actor as a tool for autonomous agents.

🔌 Integrations & code examples

Call it from the API

curl "https://api.apify.com/v2/acts/entranced_gelato~ai-document-reader/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/report.pdf" }'

Python (Apify client)

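from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("entranced_gelato/ai-document-reader").call(
    run_input={"url": "https://example.com/whitepaper.pdf"}
)

# Each run pushes exactly one record to its default dataset.
doc = next(client.dataset(run["defaultDatasetId"]).iterate_items())
# pageCount is only set for PDFs, so read it with .get().
print(doc["fileType"], doc.get("pageCount"), "pages,", doc["wordCount"], "words")
print(doc["markdown"][:500])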

LangChain (ingest a document into a RAG chain)

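from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the Apify token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()

# Run the Actor and map its single dataset record to a LangChain Document.
loader = apify.call_actor(
    actor_id="entranced_gelato/ai-document-reader",
    run_input={"url": "https://example.com/report.pdf"},
    dataset_mapping_function=lambda item: Document(
        # Prefer Markdown; fall back to plain text if it is missing or empty.
        page_content=item.get("markdown") or item.get("content") or "",
        metadata={"source": item["url"], "fileType": item.get("fileType")},
    ),
)
docs = loader.load()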

MCP — add it to Claude, Cursor, or any agent

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/ai-document-reader"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}

The same record also works with LlamaIndex, Make, Zapier, and n8n: trigger on a document URL and store the structured output.

Example output

{
  "url": "https://example.com/report.pdf",
  "fileType": "pdf",
  "title": "Annual Report 2025",
  "author": "Example Corp",
  "pageCount": 42,
  "wordCount": 18734,
  "content": "Annual Report 2025\n\nLetter from the CEO\n\nThis year we...",
  "markdown": "# Annual Report 2025\n\n## Letter from the CEO\n\nThis year we...",
  "fetchedAt": "2026-07-02T07:20:00.000Z"
}
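
Before embedding a record like this into a vector database, the markdown field usually needs to be split into chunks. The following is one possible approach, assuming heading-based splitting with a character cap. split_markdown_by_heading and the 2,000-character default are illustrative choices, not part of the Actor.

```python
import re


def split_markdown_by_heading(markdown: str, max_chars: int = 2000) -> list[dict]:
    """Split Markdown into chunks that start at headings, each capped at max_chars."""
    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Text before the first heading gets an empty heading label.
        heading = ""
        if section.startswith("#"):
            heading = section.splitlines()[0].lstrip("#").strip()
        # Oversized sections are cut into fixed-size windows.
        for start in range(0, len(section), max_chars):
            chunks.append({"heading": heading, "text": section[start:start + max_chars]})
    return chunks
```

Note that this simple pattern also splits on lines inside code blocks that begin with "# ", so documents with embedded code may need a Markdown-aware splitter instead.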

FAQ

Does it OCR scanned PDFs? No. It extracts embedded text; image-only PDFs need an OCR step first.
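
Because image-only PDFs return little or no text, a caller can flag suspicious records and route them to a separate OCR step. The heuristic below is an assumption: looks_image_only and the threshold of 20 words per page are hypothetical and should be tuned on your own documents.

```python
def looks_image_only(record: dict, min_words_per_page: int = 20) -> bool:
    """Flag PDF records whose text is too sparse to come from a real text layer."""
    if record.get("fileType") != "pdf":
        return False
    pages = record.get("pageCount") or 0
    words = record.get("wordCount") or 0
    if pages == 0:
        # No page count available: only flag the record if no text came back at all.
        return words == 0
    return words / pages < min_words_per_page
```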

Which DOCX features are preserved? Headings, paragraphs, lists, bold/italic, and links are converted to Markdown. Complex tables and embedded objects may be simplified.

Can I cap output size? Yes — set maxChars to truncate very long documents.

Limitations

No OCR (image-only PDFs).
No authentication / cookies (public URLs only).
One document per run (use a list-driven task or orchestrator for batches).
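
For batches, a thin orchestrator can start one run per URL. The sketch below is one way to do that with a thread pool. read_many and read_with_apify are hypothetical helper names, and the max_workers default is arbitrary. read_with_apify repeats the call pattern from the Python client example in this README.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def read_many(
    urls: Iterable[str],
    read_one: Callable[[str], dict],
    max_workers: int = 4,
) -> list[dict]:
    """Call read_one for each URL in parallel, keeping input order."""
    def safe_read(url: str) -> dict:
        try:
            return read_one(url)
        except Exception as exc:
            # One bad document should not abort the whole batch.
            return {"url": url, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_read, urls))


def read_with_apify(url: str, token: str = "<APIFY_TOKEN>") -> dict:
    """Start one Actor run for a single URL and return its dataset record."""
    # Imported lazily so read_many can be used and tested without apify_client.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("entranced_gelato/ai-document-reader").call(run_input={"url": url})
    return next(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Usage would look like read_many(urls, read_with_apify). Each URL is still billed as a separate document read.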

See also

PDF Extractor: PDF → Clean Markdown + JSON for LLM/RAG

Document Parser — PDF/DOCX to Markdown & JSON for RAG

PDF to Markdown & JSON Extractor for LLMs

PDF & DOCX to Markdown — Document Extractor for LLM/RAG

PDF to Text API | Document Extraction for LLMs & RAG

OCR & Document Extractor – PDF & Image to Text, JSON, Word

Doc-to-Markdown/JSON RAG Prep - Convert PDF & DOCX for RAG

PDF Text Extractor - Extract Text from PDF by URL API

PDF Text Extractor — Text & Metadata from URLs

PDF Parser API
