PDF & Document to Markdown - PDF, DOCX & HTML for LLMs
Pricing
from $30.00 / 1,000 document reads
Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown with metadata (title, pages, word count) and an optional AI summary. The document counterpart to a web reader — built for RAG ingestion, document Q&A, and AI agents (LangChain, LlamaIndex). Fast, structured, single-call.
Developer
AIDevs
Maintained by Community
AI Document Reader
Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown — with metadata and an optional AI summary, in a single call.
AI Document Reader is the document counterpart to a web-page reader. Point it at a document URL and it auto-detects the format, extracts the real content (not the binary noise), and returns structured text, Markdown, and metadata that you can feed straight into an LLM, a vector database, or a RAG pipeline.
Why AI Document Reader
Most "read a document" steps in AI pipelines are a mess of format-specific parsers, broken encodings, and inconsistent output. This Actor gives you one endpoint for the most common document formats and a single, predictable output shape regardless of whether the source was a PDF, a Word file, a text file, or an HTML page.
- One call, one record. Each run returns exactly one structured `document` record.
- LLM-ready by default. You get both clean plain text and Markdown, with no post-processing required.
- Bring-your-own-key summaries. Optional AI TL;DR + key points using your own OpenAI key, so model cost stays with you.
When to use it
- Ingesting PDFs/DOCX into a RAG pipeline or vector database.
- Building a document Q&A bot or research agent that needs clean text from a link.
- No-code automations (Make, Zapier, n8n) that receive a document URL and need its contents.
- Quickly turning a report, whitepaper, or contract into Markdown for an LLM prompt.
When NOT to use it
- Deep-crawling an entire website — use a site crawler instead; this reads a single document/URL.
- Scanned/image-only PDFs — there is no OCR step, so image-only PDFs return little or no text.
- Password-protected or login-gated files — the Actor fetches the URL as an anonymous client.
Built for
AI engineers, data teams, RAG/LLM developers, and automation builders who need a reliable "document → text" primitive.
How it works
- Fetch. The Actor downloads the document at `url` as raw bytes (with redirects followed).
- Detect. It identifies the format from the content-type header, the URL extension, and the file's magic bytes (e.g. `%PDF`, `PK` for DOCX zips).
- Extract.
  - PDF → parsed with `pdf-parse` (text + page count + embedded title/author).
  - DOCX → converted to HTML with `mammoth`, then to clean Markdown.
  - HTML → main content isolated (nav/header/footer/scripts removed) and converted to Markdown.
  - TXT / Markdown → returned as-is.
- (Optional) Summarize. If `summarize` is on and an OpenAI key is supplied, it generates a TL;DR + key points.
- Output. One record is pushed to the dataset; usage is billed per event.
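The Detect step can be pictured with a short Python sketch. This is not the Actor's actual code: `detect_format` is a hypothetical helper, and the order in which the three signals are consulted (magic bytes, then content-type, then URL extension) is an assumption made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables; the Actor's real mappings are not documented.
CONTENT_TYPES = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "text/html": "html",
    "text/plain": "text",
    "text/markdown": "text",
}
EXTENSIONS = {
    ".pdf": "pdf", ".docx": "docx", ".html": "html",
    ".htm": "html", ".txt": "text", ".md": "text",
}


def detect_format(data: bytes, content_type: str = "", url: str = "") -> str:
    """Return 'pdf', 'docx', 'html', or 'text' for a downloaded document."""
    # 1. Magic bytes: the strongest signal when present.
    if data.lstrip().startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"PK"):
        # Every zip archive starts with PK; treating it as DOCX is a simplification.
        return "docx"

    # 2. Content-type header, ignoring parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in CONTENT_TYPES:
        return CONTENT_TYPES[media_type]

    # 3. URL extension, taken from the path so query strings do not interfere.
    path = urlparse(url).path.lower()
    for extension, file_type in EXTENSIONS.items():
        if path.endswith(extension):
            return file_type

    # 4. Last resort: sniff for an HTML prologue, otherwise fall back to text.
    head = data[:256].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "text"
```

Checking the bytes first means a PDF served with a generic `application/octet-stream` header is still recognized, which is the usual reason to combine several signals rather than trust one.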
How to call it
From the Console
Open the Actor, paste a document URL into Document URL, optionally enable Generate AI summary with your OpenAI key, and click Start. Read the result in the Output tab.
From the API
Run it via the Apify API and read the dataset. Conceptually:
```http
POST https://api.apify.com/v2/acts/entranced_gelato~ai-document-reader/runs?token=<APIFY_TOKEN>

{
  "url": "https://example.com/report.pdf",
  "summarize": true,
  "openaiApiKey": "sk-...",
  "model": "gpt-4o-mini"
}
```
The Actor is also callable over MCP, so AI agents can invoke it as a tool.
Input reference
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Direct URL to the document (PDF, DOCX, TXT, or HTML). |
| `summarize` | boolean | No | `false` | Generate an AI TL;DR + key points (requires `openaiApiKey`). |
| `openaiApiKey` | string (secret) | No | - | Your OpenAI key; used only for the summary. |
| `model` | string | No | `gpt-4o-mini` | OpenAI model for the summary. |
| `maxChars` | integer | No | `0` | Cap the length of returned text/markdown (0 = no limit). |
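As a rough illustration of the table above, the following Python sketch assembles a run input and checks it on the client side before a run is started. `build_run_input` is a hypothetical helper, not part of the Actor or the Apify client, and whether the Actor itself rejects the same inputs is not stated here.

```python
from typing import Any, Optional


def build_run_input(
    url: str,
    summarize: bool = False,
    openai_api_key: Optional[str] = None,
    model: str = "gpt-4o-mini",
    max_chars: int = 0,
) -> dict[str, Any]:
    """Build the Actor input dict, using the field names from the input table."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be a direct http(s) URL to the document")
    if summarize and not openai_api_key:
        # The table says summarize requires openaiApiKey, so fail early.
        raise ValueError("summarize=True requires openai_api_key")
    if max_chars < 0:
        raise ValueError("max_chars must be 0 (no limit) or a positive integer")

    run_input: dict[str, Any] = {
        "url": url,
        "summarize": summarize,
        "model": model,
        "maxChars": max_chars,
    }
    # Only send the secret when it is actually needed.
    if summarize:
        run_input["openaiApiKey"] = openai_api_key
    return run_input
```

The returned dict can be passed as the JSON body of an API call or as `run_input` to a client library.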
Output reference
One dataset record per run:
| Field | Description |
|---|---|
| `url` | The document URL that was read. |
| `fileType` | Detected format: `pdf`, `docx`, `html`, or `text`. |
| `title` | Document title (from PDF info or first heading), if available. |
| `author` | Author metadata, if present. |
| `pageCount` | Number of pages (PDF only). |
| `wordCount` | Word count of the extracted text. |
| `content` | Clean plain text. |
| `markdown` | LLM-ready Markdown version. |
| `summary` | AI TL;DR (only when summarization is enabled). |
| `keyPoints` | Array of key points (only when summarization is enabled). |
| `fetchedAt` | ISO timestamp of the run. |
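For typed pipelines, a record with these fields could be mapped onto a small dataclass. The sketch below is one possible shape, assuming the fields marked "if available" or "only when" may be absent or null; `DocumentRecord` and `from_item` are hypothetical names, not something the Actor ships.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DocumentRecord:
    url: str
    file_type: str
    content: str
    markdown: str
    word_count: int
    fetched_at: str
    title: Optional[str] = None
    author: Optional[str] = None
    page_count: Optional[int] = None  # PDF only
    summary: Optional[str] = None     # only when summarization is enabled
    key_points: list[str] = field(default_factory=list)

    @classmethod
    def from_item(cls, item: dict[str, Any]) -> "DocumentRecord":
        """Convert one dataset item (camelCase keys) into a typed record."""
        return cls(
            url=item["url"],
            file_type=item["fileType"],
            content=item.get("content") or "",
            markdown=item.get("markdown") or "",
            word_count=int(item.get("wordCount") or 0),
            fetched_at=item.get("fetchedAt") or "",
            title=item.get("title"),
            author=item.get("author"),
            page_count=item.get("pageCount"),
            summary=item.get("summary"),
            key_points=list(item.get("keyPoints") or []),
        )
```

Keeping the optional fields optional avoids `KeyError` surprises when a non-PDF document has no `pageCount` or when summarization was off.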
Pricing
Pay per event — you only pay for what you run:
- Document read — charged once per successful run (one document).
- AI summary — a small premium that applies only when you enable summarization. You supply your own OpenAI key, so the model's cost is billed by OpenAI separately and is never added to the Actor price.
Apify platform/compute usage is included in the per-event price. See the Pricing tab for current rates.
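A back-of-the-envelope estimate follows directly from the per-event model. The sketch below assumes the listed starting rate of $30.00 per 1,000 document reads; the summary premium is not stated in this document, so it is left as a parameter that defaults to zero. `estimate_cost` is a hypothetical helper, and the Pricing tab remains the source of truth.

```python
def estimate_cost(
    documents: int,
    summarized: int = 0,
    read_rate_per_1000: float = 30.0,        # listed "from" price; may differ by plan
    summary_premium_per_1000: float = 0.0,   # unknown here; supply the real rate
) -> float:
    """Estimate Actor charges in USD, excluding OpenAI's own billing."""
    if documents < 0 or summarized < 0:
        raise ValueError("counts must be non-negative")
    if summarized > documents:
        raise ValueError("summarized cannot exceed documents")
    read_cost = documents * read_rate_per_1000 / 1000
    summary_cost = summarized * summary_premium_per_1000 / 1000
    return round(read_cost + summary_cost, 2)
```

At the listed starting rate, 250 document reads with no summaries come to 250 × 30 / 1000 = $7.50.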
Integrations
- LangChain / LlamaIndex: feed `content` / `markdown` into document loaders and vector stores.
- Make / Zapier / n8n: trigger on a document URL, store the structured output.
- MCP — expose the Actor as a tool for autonomous agents.
🔌 Integrations & code examples
Call it from the API
```bash
curl "https://api.apify.com/v2/acts/entranced_gelato~ai-document-reader/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/report.pdf" }'
```
Python (Apify client)
```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and block until the run finishes.
run = client.actor("entranced_gelato/ai-document-reader").call(
    run_input={"url": "https://example.com/whitepaper.pdf"}
)

# Each run pushes exactly one record, so take the first dataset item.
doc = next(client.dataset(run["defaultDatasetId"]).iterate_items())
print(doc["fileType"], doc["pageCount"], "pages,", doc["wordCount"], "words")
print(doc["markdown"][:500])
```
LangChain (ingest a document into a RAG chain)
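```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the Apify token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()

# Run the Actor and map its single dataset item to a LangChain Document.
loader = apify.call_actor(
    actor_id="entranced_gelato/ai-document-reader",
    run_input={"url": "https://example.com/report.pdf"},
    dataset_mapping_function=lambda item: Document(
        # Prefer Markdown, fall back to plain text; .get() tolerates missing keys.
        page_content=item.get("markdown") or item.get("content") or "",
        metadata={"source": item["url"], "fileType": item.get("fileType")},
    ),
)
docs = loader.load()
```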
MCP — add it to Claude, Cursor, or any agent
```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/ai-document-reader"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}
```
Also works with LlamaIndex, Make, Zapier, and n8n — trigger on a document URL, store the structured output.
Example output
```json
{
  "url": "https://example.com/report.pdf",
  "fileType": "pdf",
  "title": "Annual Report 2025",
  "author": "Example Corp",
  "pageCount": 42,
  "wordCount": 18734,
  "content": "Annual Report 2025\n\nLetter from the CEO\n\nThis year we...",
  "markdown": "# Annual Report 2025\n\n## Letter from the CEO\n\nThis year we...",
  "fetchedAt": "2026-07-02T07:20:00.000Z"
}
```
FAQ
Does it OCR scanned PDFs? No. It extracts embedded text; image-only PDFs need an OCR step first.
Which DOCX features are preserved? Headings, paragraphs, lists, bold/italic, and links are converted to Markdown. Complex tables and embedded objects may be simplified.
Can I cap output size? Yes — set maxChars to truncate very long documents.
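Because image-only PDFs come back with little or no text, a caller may want to flag suspicious records and route them to an OCR step. The sketch below is only a heuristic: `looks_image_only` is a hypothetical helper, and the threshold of 20 words per page is an arbitrary assumption that should be tuned on real documents.

```python
from typing import Any, Mapping


def looks_image_only(record: Mapping[str, Any], min_words_per_page: float = 20.0) -> bool:
    """Guess whether a PDF record is a scan with no embedded text layer."""
    if record.get("fileType") != "pdf":
        return False  # the OCR caveat only concerns PDFs
    pages = record.get("pageCount") or 0
    words = record.get("wordCount") or 0
    if pages <= 0:
        # No page count available: only an empty extraction is treated as suspicious.
        return words == 0
    return words / pages < min_words_per_page
```

A record flagged this way can be skipped, logged, or sent through an OCR tool before the document is read again.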
Limitations
- No OCR (image-only PDFs).
- No authentication / cookies (public URLs only).
- One document per run (use a list-driven task or orchestrator for batches).
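For batches, a thin orchestrator can fan out one run per URL. The sketch below keeps the Actor call abstract: `read_one` is a hypothetical callable that takes a URL and returns the run's record (for example, a wrapper around an Apify client call), and `read_many` collects results and errors in input order. The worker count of 4 is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Optional

Result = tuple[str, Optional[dict[str, Any]], Optional[Exception]]


def read_many(
    urls: Iterable[str],
    read_one: Callable[[str], dict[str, Any]],
    max_workers: int = 4,
) -> list[Result]:
    """Call read_one(url) for every URL; return (url, record, error) tuples."""

    def safe_read(url: str) -> Result:
        # One failing document should not abort the rest of the batch.
        try:
            return url, read_one(url), None
        except Exception as exc:
            return url, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        return list(pool.map(safe_read, urls))
```

Since each successful run is billed as one document read, the length of the URL list is also a quick upper bound on the number of read events a batch will trigger.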
See also
- AI Web Page Reader - any URL to clean text + Markdown.
- AI Competitive Brief Generator - any company URL to a competitive, SEO, or sales brief.