Pricing

from $0.00001 / actor start

Document Parser — PDF/DOCX to Markdown & JSON for RAG

Convert PDF, DOCX, PPTX, XLSX, HTML and images into clean Markdown or JSON for RAG and LLM pipelines. Powered by IBM's open-source Docling.

Rating

0.0

(0)

Developer

Rahul Bhiwagade

Last modified

a month ago

Document Parser — PDF, DOCX & more → Markdown / JSON for RAG & LLMs

Turn messy documents into clean, structured Markdown or JSON that's ready to drop straight into RAG pipelines, vector databases, and LLM prompts.

Send one or more document URLs and get back well-structured content with headings, lists, reading order, and real tables preserved — powered by state-of-the-art open-source document-AI models for layout analysis and table structure recognition.

No local setup, no GPU, no model wrangling. Just URLs in, clean text out.

✨ Why use this

Built for RAG / LLMs — Markdown output drops cleanly into prompts and chunkers; JSON output gives you structured elements for custom pipelines.
Real table extraction — tables come back as proper Markdown tables (rows/columns intact), not jumbled text.
Layout-aware — detects headings, lists, captions, and correct reading order across multi-column pages.
Many formats, one Actor — PDF, Word, PowerPoint, Excel, HTML, and images.
Robust — each document is processed independently; one bad URL never fails the whole run, and errors come back with a clear, human-readable reason.
Optional OCR — extract text from scanned or image-only PDFs.
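
To illustrate how the Markdown output can feed a chunker, here is a minimal sketch that splits a Markdown string into one chunk per heading. `chunk_by_heading` is a hypothetical helper, not part of the Actor, and it assumes ATX-style headings (`#`, `##`, ...) like the `## Abstract` heading in the sample output.

```python
import re

# Matches ATX-style Markdown headings such as "## Abstract".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def _has_content(section: dict) -> bool:
    """True if the section has a heading or at least one non-blank line."""
    return section["heading"] is not None or any(
        line.strip() for line in section["lines"]
    )


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section (hypothetical helper)."""
    sections = []
    current = {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the section collected so far.
            if _has_content(current):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if _has_content(current):
        sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Text that appears before the first heading is kept as a chunk whose `heading` is `None`. The sketch does not special-case fenced code inside the Markdown, so a `#` comment at the start of a line in a code sample would be treated as a heading.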

📄 Supported formats

| Type | Extensions |
| --- | --- |
| PDF | `.pdf` |
| Word | `.docx` |
| PowerPoint | `.pptx` |
| Excel | `.xlsx` |
| Web / markup | `.html`, `.md` |
| Images | `.png`, `.jpg`, `.tiff` (with OCR) |

💡 Common use cases

RAG ingestion — convert a library of PDFs/Docs into Markdown for chunking and embedding.
Knowledge bases & search — extract clean, structured text from reports, manuals, and contracts.
LLM context — feed papers, datasheets, or filings to a model without copy-paste noise.
Dataset building — turn document collections into structured JSON for training or analysis.
Table harvesting — pull tables out of financial reports or research papers as usable Markdown.
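
For the table-harvesting case, the sketch below pulls pipe tables out of a Markdown string as lists of rows. `extract_markdown_tables` is a hypothetical helper, not something the Actor provides. It assumes each table row starts and ends with `|` and that cells contain no escaped pipes, so check it against your own output.

```python
def extract_markdown_tables(markdown: str) -> list[list[list[str]]]:
    """Return each pipe table as a list of rows; each row is a list of cell strings."""
    tables = []
    current = []
    for line in markdown.splitlines():
        stripped = line.strip()
        is_row = (
            len(stripped) > 1
            and stripped.startswith("|")
            and stripped.endswith("|")
        )
        if is_row:
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. |---|:---:|
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```

The first row of each returned table is its header row. A data row whose cells consist only of `-` or `:` characters would be mistaken for a separator and dropped.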

🚀 How to use

From the Apify Console

Click Try for free / Start.
Paste one or more Document URLs (direct links to the files).
Pick an Output format — markdown, json, or both.
(Optional) Turn on OCR for scanned/image PDFs.
Click Start and grab the results from the Dataset tab (export as JSON, CSV, Excel, or via API).

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `documentUrls` | array of strings | ✅ | Direct URLs to the documents to convert. |
| `outputFormat` | `markdown` \| `json` \| `both` | | Output format. Default: `markdown`. |
| `doOcr` | boolean | | Run OCR on scanned/image PDFs (slower). Default: `false`. |

Example input

{
  "documentUrls": [
    "https://arxiv.org/pdf/2408.09869",
    "https://www.example.com/report.pdf"
  ],
  "outputFormat": "both",
  "doOcr": false
}
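
If you build this input in code, a small validation step can catch mistakes before a run starts. The sketch below is one way to do that. `build_run_input` is a hypothetical helper, and both the non-empty check and the HTTP(S) check are assumptions based on the field descriptions above rather than documented Actor behavior.

```python
from urllib.parse import urlparse

# Values listed for outputFormat in the input table.
ALLOWED_FORMATS = {"markdown", "json", "both"}


def build_run_input(document_urls, output_format="markdown", do_ocr=False):
    """Validate arguments and return the Actor input dict (hypothetical helper)."""
    if not document_urls:
        raise ValueError("documentUrls is required and must not be empty")
    for url in document_urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) links count as direct URLs.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute HTTP(S) URL: {url!r}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(
            f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}"
        )
    return {
        "documentUrls": list(document_urls),
        "outputFormat": output_format,
        "doOcr": bool(do_ocr),
    }
```

The returned dict has the same shape as the example input and can be passed as `run_input` when calling the Actor.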

Output

One dataset item per document:

{
  "url": "https://arxiv.org/pdf/2408.09869",
  "status": "success",
  "markdown": "## Abstract\n\nThis technical report introduces ...",
  "json": { "schema_name": "DoclingDocument", "texts": [ ... ], "tables": [ ... ] }
}

If a document can't be processed, you get a clear error instead of a crash:

{
  "url": "https://example.com/locked.pdf",
  "status": "error",
  "error": "Download failed with HTTP 403. The URL may be private, expired, or protected (e.g. Cloudflare/login). Provide a direct, publicly accessible document link."
}
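
Because every item carries a `status` field, a run's results can be split into successes and failures after the fact, for example to retry only the failed URLs. A minimal sketch, assuming items shaped like the two examples above (`partition_results` and `failure_reasons` are hypothetical helper names):

```python
def partition_results(items):
    """Split dataset items into (successes, failures) by their "status" field."""
    successes, failures = [], []
    for item in items:
        if item.get("status") == "success":
            successes.append(item)
        else:
            # Anything that is not an explicit success is treated as a failure.
            failures.append(item)
    return successes, failures


def failure_reasons(items):
    """Map each failed URL to its error message."""
    _, failures = partition_results(items)
    return {item.get("url"): item.get("error") for item in failures}
```

The keys of `failure_reasons(items)` give the list of URLs to fix or resubmit.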

🔌 Use the results via API

Run the Actor and read its output from your own code with the Apify API client:

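from apify_client import ApifyClient

# Authenticate with your personal Apify API token.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("genuine_qa/document-parser").call(run_input={
    "documentUrls": ["https://arxiv.org/pdf/2408.09869"],
    "outputFormat": "markdown",
})

# Each dataset item corresponds to one input document.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["status"] == "success":
        print(item["markdown"])
    else:
        # Failed documents carry a human-readable reason in "error".
        print("Failed:", item["url"], "->", item["error"])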

You can also export results directly as JSON/CSV/Excel from the dataset's Export button, or pull them from the Dataset API.
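
As a sketch of the Dataset API route without the client library, the snippet below builds the dataset items URL and fetches it with the standard library. The helper names are hypothetical. It assumes the public Apify API v2 endpoint `/v2/datasets/{datasetId}/items` with its `format` query parameter and bearer-token authentication, so check the Apify API reference before relying on it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the items URL for a dataset (hypothetical helper)."""
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?{query}"


def fetch_dataset_items(dataset_id: str, token: str) -> list:
    """Download all items of a dataset as parsed JSON (performs a network call)."""
    request = Request(
        dataset_items_url(dataset_id, "json"),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return json.load(response)
```

Here `dataset_id` is the run's `defaultDatasetId`. Passing `"csv"` as `fmt` to `dataset_items_url` gives a URL that returns CSV instead of JSON.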

⚙️ Tips & performance

Memory: the conversion models need room — run with 4 GB+ memory for reliable results, more for large or OCR-heavy documents.
First page is slowest: models load once per run, so converting many documents in a single run is more efficient than one run per document.
OCR is heavier: only enable `doOcr` when documents are scanned or image-based. It is significantly slower than parsing digital text.
Use direct links: point to the actual file URL. Pages behind logins, paywalls, or anti-bot challenges (e.g. Cloudflare) can't be downloaded and will return a clear error.
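
The memory and batching tips can be combined as in the sketch below, which groups URLs into batches and prepares the keyword arguments for one run per batch. The helper names and the batch size are hypothetical; the source states no limit on URLs per run. It assumes the Python `apify_client` `call()` method accepts a `memory_mbytes` argument.

```python
def batch_urls(urls, batch_size):
    """Yield consecutive batches of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


def call_kwargs(urls, do_ocr=False, memory_mbytes=4096):
    """Build keyword arguments for one Actor run over a batch of URLs."""
    return {
        "run_input": {
            "documentUrls": list(urls),
            "outputFormat": "markdown",
            "doOcr": do_ocr,
        },
        # 4096 MB follows the 4 GB+ recommendation above.
        "memory_mbytes": memory_mbytes,
    }
```

Each batch can then be submitted with `client.actor("genuine_qa/document-parser").call(**call_kwargs(batch))`. Larger batches mean the models load less often, at the cost of longer individual runs.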

❓ FAQ

Does it handle scanned PDFs? Yes, enable `doOcr`. For digital (text-based) PDFs, leave it off for much faster, higher-fidelity results.

Are tables preserved? Yes. Tables are reconstructed and emitted as Markdown tables, and as structured cells in the JSON output.

Can I process many documents at once? Yes, pass multiple URLs in `documentUrls`. Each becomes its own dataset item.

What happens if one URL is bad? That single document is marked `"status": "error"` with a readable message; the rest of the run continues normally.

Do my documents leave the run? The Actor downloads each URL you provide, converts it inside the run, and writes the result to your dataset. It doesn't send your documents anywhere else.

PDF to Markdown Converter: Docling Parser for AI & RAG

raional/pdf-to-markdown-converter

Convert PDF, DOCX, PPTX, XLSX, HTML and images to clean Markdown and structured JSON using IBM's open-source Docling library. Preserves headings, tables, and page structure. RAG-ready chunked output mode for LLM pipelines.

Raion Al

RAG Document Converter

web.harvester/rag-document-converter

Convert PDF, DOCX, PPTX, and other documents to clean Markdown optimized for RAG pipelines. Preserves structure, tables, and headers. Powered by IBM Docling.

Web Harvester

Doc-to-Markdown/JSON RAG Prep - Convert PDF & DOCX for RAG

bigjoecoding/doc-to-markdown-json-rag-prep

Convert PDF, DOCX, PPTX and webpages to clean Markdown and RAG-ready JSON chunks for your embedding pipeline. No LLM cost. $0.03 per document.

Joseph Curry

PDF & DOCX to Markdown — Document Extractor for LLM/RAG

fetchbase/document-to-markdown

Convert PDF and Word (DOCX) documents into clean Markdown, text, or JSON. Smart PDF paragraph reflow, page markers for RAG citations, full DOCX structure (headings, lists, tables), custom auth headers. No browser — parses in seconds. Charged per page processed — no startup fee.

Fetchbase

PDF & Document to Markdown - PDF, DOCX & HTML for LLMs

entranced_gelato/ai-document-reader

Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown with metadata (title, pages, word count) and an optional AI summary. The document counterpart to a web reader — built for RAG ingestion, document Q&A, and AI agents (LangChain, LlamaIndex). Fast, structured, single-call.

AIDevs

Word, PowerPoint & Excel to Markdown — for RAG & AI Agents

lizaraco/office-docs-to-markdown

Convert DOCX, PPTX, and XLSX files to clean, LLM-ready markdown at scale. Headings, tables, slides, and sheets preserved. Never-fail runs, per-document output. The Office twin of PDF-to-Markdown.

Shawn Downs

PDF Extractor: PDF → Clean Markdown + JSON for LLM/RAG

boxbox10/pdf-extractor

Turn any PDF URL into clean, LLM-ready Markdown + structured JSON (title, metadata, per-page text, page count, word count, token count). Perfect for RAG pipelines, AI agents, and LLM document ingestion.

Marvin Eguilos

PDF to Markdown & JSON Converter (Docling)

actorzlab/docling-pdf-converter

Convert PDF documents to clean Markdown, structured JSON, and plain text using IBM's open-source Docling AI. Handles text PDFs and scanned documents (OCR), extracts tables and images. No external API key required — runs fully on-device.

Khalil Drissi

Pandoc Document Converter - HTML to Markdown, DOCX, EPUB, PPTX

scrapeworks/pandoc-document-converter

Convert documents between formats with Pandoc in the cloud: HTML to Markdown for LLMs and RAG, Markdown to Word DOCX, EPUB e-books, PowerPoint PPTX, LaTeX, reStructuredText and more. Feed it URLs or raw text, get one converted document per input.