PDF & Document to Markdown - PDF, DOCX & HTML for LLMs

Pricing

from $30.00 / 1,000 document reads

Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown with metadata (title, pages, word count) and an optional AI summary. The document counterpart to a web reader — built for RAG ingestion, document Q&A, and AI agents (LangChain, LlamaIndex). Fast, structured, single-call.

Developer

AIDevs

Maintained by Community

AI Document Reader

PDF & Document to Markdown

Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown — with metadata and an optional AI summary, in a single call.

AI Document Reader is the document counterpart to a web-page reader. Point it at a document URL and it auto-detects the format, extracts the real content (not the binary noise), and returns structured text, Markdown, and metadata that you can feed straight into an LLM, a vector database, or a RAG pipeline.


Why AI Document Reader

Most "read a document" steps in AI pipelines are a mess of format-specific parsers, broken encodings, and inconsistent output. This Actor gives you one endpoint for the most common document formats and a single, predictable output shape regardless of whether the source was a PDF, a Word file, a text file, or an HTML page.

  • One call, one record. Each run returns exactly one structured document record.
  • LLM-ready by default. You get both clean plain text and Markdown — no post-processing required.
  • Bring-your-own-key summaries. Optional AI TL;DR + key points using your own OpenAI key, so model cost stays with you.

When to use it

  • Ingesting PDFs/DOCX into a RAG pipeline or vector database.
  • Building a document Q&A bot or research agent that needs clean text from a link.
  • No-code automations (Make, Zapier, n8n) that receive a document URL and need its contents.
  • Quickly turning a report, whitepaper, or contract into Markdown for an LLM prompt.

When NOT to use it

  • Deep-crawling an entire website — use a site crawler instead; this reads a single document/URL.
  • Scanned/image-only PDFs — there is no OCR step, so image-only PDFs return little or no text.
  • Password-protected or login-gated files — the Actor fetches the URL as an anonymous client.

Built for

AI engineers, data teams, RAG/LLM developers, and automation builders who need a reliable "document → text" primitive.


How it works

  1. Fetch. The Actor downloads the document at url as raw bytes (with redirects followed).
  2. Detect. It identifies the format from the content-type header, the URL extension, and the file's magic bytes (e.g. %PDF for PDF, PK for the ZIP container that DOCX files use).
  3. Extract.
    • PDF → parsed with pdf-parse (text + page count + embedded title/author).
    • DOCX → converted to HTML with mammoth, then to clean Markdown.
    • HTML → main content isolated (nav/header/footer/scripts removed) and converted to Markdown.
    • TXT / Markdown → returned as-is.
  4. (Optional) Summarize. If summarize is on and an OpenAI key is supplied, it generates a TL;DR + key points.
  5. Output. One record is pushed to the dataset; usage is billed per event.
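The Detect step can be illustrated with a short sketch. This is not the Actor's code (the libraries named above suggest it runs on Node.js); it is a Python approximation of the same idea. The lookup tables, the order in which the three signals are checked, and the `text` fallback are all assumptions.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables; the Actor's real rules are not published.
CONTENT_TYPES = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "text/html": "html",
    "text/plain": "text",
    "text/markdown": "text",
}
EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".html": "html",
    ".htm": "html", ".txt": "text", ".md": "text",
}


def detect_format(content_type: str, url: str, head: bytes) -> str:
    """Guess the format from magic bytes, then the header, then the extension."""
    # Magic bytes come first here because servers sometimes mislabel files.
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK"):
        # Every ZIP starts with PK; treating it as DOCX is a simplification.
        return "docx"
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";")[0].strip().lower()
    if mime in CONTENT_TYPES:
        return CONTENT_TYPES[mime]
    # urlparse().path excludes the query string, so "?x=1" does not interfere.
    path = urlparse(url).path.lower()
    for ext, kind in EXTENSIONS.items():
        if path.endswith(ext):
            return kind
    return "text"  # assumed fallback when nothing matches
```

Checking several signals matters in practice: a PDF served as `application/octet-stream` from a URL with no extension is still recognized by its first four bytes.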

How to call it

From the Console

Open the Actor, paste a document URL into Document URL, optionally enable Generate AI summary with your OpenAI key, and click Start. Read the result in the Output tab.

From the API

Run it via the Apify API and read the dataset. Conceptually:

POST https://api.apify.com/v2/acts/entranced_gelato~ai-document-reader/runs?token=<APIFY_TOKEN>
{
  "url": "https://example.com/report.pdf",
  "summarize": true,
  "openaiApiKey": "sk-...",
  "model": "gpt-4o-mini"
}

The Actor is also callable over MCP, so AI agents can invoke it as a tool.


Input reference

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | | Direct URL to the document (PDF, DOCX, TXT, or HTML). |
| summarize | boolean | No | false | Generate an AI TL;DR + key points (requires openaiApiKey). |
| openaiApiKey | string (secret) | No | | Your OpenAI key; used only for the summary. |
| model | string | No | gpt-4o-mini | OpenAI model for the summary. |
| maxChars | integer | No | 0 | Cap the length of returned text/markdown (0 = no limit). |
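The input fields have one dependency worth checking before a run is started: summarize only works when openaiApiKey is supplied. The helper below is a minimal sketch of building and validating the input on the client side. The function `build_input` and its error messages are hypothetical; the field names and defaults come from the input reference.

```python
from typing import Optional


def build_input(
    url: str,
    summarize: bool = False,
    openai_api_key: Optional[str] = None,
    model: str = "gpt-4o-mini",
    max_chars: int = 0,
) -> dict:
    """Build a run input dict, rejecting combinations that cannot work."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL")
    if summarize and not openai_api_key:
        raise ValueError("summarize=True requires openai_api_key")
    if max_chars < 0:
        raise ValueError("max_chars must be 0 (no limit) or positive")

    run_input = {"url": url, "summarize": summarize, "maxChars": max_chars}
    if summarize:
        # Only send the key and model when they will actually be used.
        run_input["openaiApiKey"] = openai_api_key
        run_input["model"] = model
    return run_input
```

Failing early on the client avoids starting a run that cannot produce the requested summary.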

Output reference

One dataset record per run:

| Field | Description |
| --- | --- |
| url | The document URL that was read. |
| fileType | Detected format: pdf, docx, html, or text. |
| title | Document title (from PDF info or first heading), if available. |
| author | Author metadata, if present. |
| pageCount | Number of pages (PDF only). |
| wordCount | Word count of the extracted text. |
| content | Clean plain text. |
| markdown | LLM-ready Markdown version. |
| summary | AI TL;DR (only when summarization is enabled). |
| keyPoints | Array of key points (only when summarization is enabled). |
| fetchedAt | ISO timestamp of the run. |
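Several output fields are conditional (title, author, pageCount, summary, keyPoints), so downstream code should not assume they exist. One way to handle this is a small typed wrapper, sketched below. The class `DocumentRecord` is hypothetical, and which fields are always present is an assumption inferred from the output reference.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentRecord:
    """Typed view of one output record (hypothetical helper, not part of the Actor)."""
    url: str
    file_type: str
    word_count: int
    content: str
    markdown: str
    fetched_at: str
    title: Optional[str] = None
    author: Optional[str] = None
    page_count: Optional[int] = None       # PDF only
    summary: Optional[str] = None          # only with summarization
    key_points: List[str] = field(default_factory=list)

    @classmethod
    def from_item(cls, item: dict) -> "DocumentRecord":
        # Fields read with item[...] are assumed to always be present.
        return cls(
            url=item["url"],
            file_type=item["fileType"],
            word_count=item.get("wordCount", 0),
            content=item.get("content", ""),
            markdown=item.get("markdown", ""),
            fetched_at=item["fetchedAt"],
            title=item.get("title"),
            author=item.get("author"),
            page_count=item.get("pageCount"),
            summary=item.get("summary"),
            key_points=list(item.get("keyPoints") or []),
        )
```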

Pricing

Pay per event — you only pay for what you run:

  • Document read — charged once per successful run (one document).
  • AI summary — a small premium that applies only when you enable summarization. You supply your own OpenAI key, so the model's cost is billed by OpenAI separately and is never added to the Actor price.

Apify platform/compute usage is included in the per-event price. See the Pricing tab for current rates.
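For budgeting, the listed starting rate of $30.00 per 1,000 document reads works out to $0.03 per document. The sketch below does that arithmetic; the function `estimate_read_cost` is hypothetical, and it leaves out the AI summary premium and your OpenAI bill because neither rate is stated here.

```python
def estimate_read_cost(documents: int, price_per_1000: float = 30.0) -> float:
    """Estimate the document-read charge in USD.

    The default is the listed "from" rate; check the Pricing tab for the
    rate that applies to your plan. Summary charges are not included.
    """
    if documents < 0:
        raise ValueError("documents must not be negative")
    return documents * price_per_1000 / 1000
```

For example, `estimate_read_cost(250)` gives 7.5, meaning about $7.50 for 250 successful reads at the starting rate.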

Integrations

  • LangChain / LlamaIndex — feed content/markdown into document loaders and vector stores.
  • Make / Zapier / n8n — trigger on a document URL, store the structured output.
  • MCP — expose the Actor as a tool for autonomous agents.

🔌 Integrations & code examples

Call it from the API

curl "https://api.apify.com/v2/acts/entranced_gelato~ai-document-reader/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/report.pdf" }'

Python (Apify client)

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("entranced_gelato/ai-document-reader").call(
    run_input={"url": "https://example.com/whitepaper.pdf"}
)

# Each run produces exactly one record in its default dataset.
doc = next(client.dataset(run["defaultDatasetId"]).iterate_items())
print(doc["fileType"], doc["pageCount"], "pages,", doc["wordCount"], "words")
print(doc["markdown"][:500])

LangChain (ingest a document into a RAG chain)

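from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the Apify token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()

# Run the Actor and map its dataset record to a LangChain Document.
loader = apify.call_actor(
    actor_id="entranced_gelato/ai-document-reader",
    run_input={"url": "https://example.com/report.pdf"},
    dataset_mapping_function=lambda item: Document(
        # Prefer Markdown; fall back to plain text if it is missing or empty.
        page_content=item.get("markdown") or item.get("content") or "",
        metadata={"source": item["url"], "fileType": item.get("fileType")},
    ),
)
docs = loader.load()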

MCP — add it to Claude, Cursor, or any agent

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/ai-document-reader"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}

Also works with LlamaIndex, Make, Zapier, and n8n — trigger on a document URL, store the structured output.

Example output

{
  "url": "https://example.com/report.pdf",
  "fileType": "pdf",
  "title": "Annual Report 2025",
  "author": "Example Corp",
  "pageCount": 42,
  "wordCount": 18734,
  "content": "Annual Report 2025\n\nLetter from the CEO\n\nThis year we...",
  "markdown": "# Annual Report 2025\n\n## Letter from the CEO\n\nThis year we...",
  "fetchedAt": "2026-07-02T07:20:00.000Z"
}

FAQ

Does it OCR scanned PDFs? No. It extracts embedded text; image-only PDFs need an OCR step first.

Which DOCX features are preserved? Headings, paragraphs, lists, bold/italic, and links are converted to Markdown. Complex tables and embedded objects may be simplified.

Can I cap output size? Yes — set maxChars to truncate very long documents.

Limitations

  • No OCR (image-only PDFs).
  • No authentication / cookies (public URLs only).
  • One document per run (use a list-driven task or orchestrator for batches).
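Since each run handles one document, batches need a small loop on the caller's side. The sketch below runs the Actor once per URL and collects failures instead of stopping at the first one. The helper names `read_document` and `read_many` are hypothetical; `client` is expected to be an `ApifyClient` instance from the `apify_client` package, used through the same `actor(...).call(...)` and `dataset(...).iterate_items()` calls as the Python client example in this README.

```python
ACTOR_ID = "entranced_gelato/ai-document-reader"


def read_document(client, url: str):
    """Run the Actor for one URL and return its single record, or None."""
    run = client.actor(ACTOR_ID).call(run_input={"url": url})
    if run is None:
        raise RuntimeError(f"run did not complete for {url}")
    items = client.dataset(run["defaultDatasetId"]).iterate_items()
    return next(iter(items), None)


def read_many(client, urls):
    """Read documents one run at a time; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = read_document(client, url)
        except Exception as exc:  # keep going so one bad URL does not end the batch
            errors[url] = str(exc)
    return results, errors
```

Typical use would be `results, errors = read_many(ApifyClient("<APIFY_TOKEN>"), urls)`. The loop is sequential for simplicity; running documents in parallel is possible but is left out of this sketch.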
