Document Parser — PDF/DOCX to Markdown & JSON for RAG

Pricing

from $0.00001 / actor start

Convert PDF, DOCX, PPTX, XLSX, HTML and images into clean Markdown or JSON for RAG and LLM pipelines. Powered by IBM's open-source Docling.

Rating

0.0 (0)

Developer

Rahul Bhiwagade

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 13 hours ago

Document Parser — PDF, DOCX & more → Markdown / JSON for RAG & LLMs

Turn messy documents into clean, structured Markdown or JSON that's ready to drop straight into RAG pipelines, vector databases, and LLM prompts.

Send one or more document URLs and get back well-structured content with headings, lists, reading order, and real tables preserved. Conversion is powered by IBM's open-source Docling, which supplies the layout-analysis and table-structure-recognition models.

No local setup, no GPU, no model wrangling. Just URLs in, clean text out.


✨ Why use this

  • Built for RAG / LLMs — Markdown output drops cleanly into prompts and chunkers; JSON output gives you structured elements for custom pipelines.
  • Real table extraction — tables come back as proper Markdown tables (rows/columns intact), not jumbled text.
  • Layout-aware — detects headings, lists, captions, and correct reading order across multi-column pages.
  • Many formats, one Actor — PDF, Word, PowerPoint, Excel, HTML, and images.
  • Robust — each document is processed independently; one bad URL never fails the whole run, and errors come back with a clear, human-readable reason.
  • Optional OCR — extract text from scanned or image-only PDFs.

📄 Supported formats

| Type | Extensions |
|------|------------|
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Excel | .xlsx |
| Web / markup | .html, .md |
| Images | .png, .jpg, .tiff (with OCR) |

💡 Common use cases

  • RAG ingestion — convert a library of PDFs/Docs into Markdown for chunking and embedding.
  • Knowledge bases & search — extract clean, structured text from reports, manuals, and contracts.
  • LLM context — feed papers, datasheets, or filings to a model without copy-paste noise.
  • Dataset building — turn document collections into structured JSON for training or analysis.
  • Table harvesting — pull tables out of financial reports or research papers as usable Markdown.
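
For the RAG ingestion case, the returned Markdown usually has to be split into chunks before embedding. The following is a minimal sketch, assuming heading-based splitting suits your corpus; `chunk_markdown` and `max_chars` are hypothetical names for illustration and are not part of the Actor.

```python
import re

# A Markdown ATX heading: one to six '#' characters, a space, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown at headings, then cap each chunk at max_chars."""
    sections: list[list[str]] = [[]]
    for line in markdown.splitlines():
        # Start a new section whenever a heading line appears.
        if HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: list[str] = []
    for section in sections:
        text = "\n".join(section).strip()
        # Hard-split oversized sections so no chunk exceeds max_chars.
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks


sample = "## Abstract\n\nShort summary.\n\n## Methods\n\nDetails here."
for chunk in chunk_markdown(sample):
    print(repr(chunk))
```

The hard split by character count can cut through a table or sentence, and a `#` comment inside a code fence would be treated as a heading, so a production chunker would likely need to be more careful than this sketch.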

🚀 How to use

From the Apify Console

  1. Click Try for free / Start.
  2. Paste one or more Document URLs (direct links to the files).
  3. Pick an Output format: markdown, json, or both.
  4. (Optional) Turn on OCR for scanned/image PDFs.
  5. Click Start and grab the results from the Dataset tab (export as JSON, CSV, Excel, or via API).

Input

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| documentUrls | array of strings | Yes | Direct URLs to the documents to convert. |
| outputFormat | one of markdown, json, both | No | Output format. Default: markdown. |
| doOcr | boolean | No | Run OCR on scanned/image PDFs (slower). Default: false. |

Example input

{
  "documentUrls": [
    "https://arxiv.org/pdf/2408.09869",
    "https://www.example.com/report.pdf"
  ],
  "outputFormat": "both",
  "doOcr": false
}
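
If you build the input in code rather than in the Console, it can help to validate it before starting a run. This is a rough sketch that uses only the three input fields documented above; the helper name `build_input` and its checks are assumptions for illustration, and the Actor may apply its own validation.

```python
ALLOWED_FORMATS = {"markdown", "json", "both"}


def build_input(urls, output_format="markdown", do_ocr=False):
    """Return an Actor input dict, rejecting obviously invalid values."""
    # Drop empty entries and surrounding whitespace.
    cleaned = [u.strip() for u in urls if u and u.strip()]
    if not cleaned:
        raise ValueError("documentUrls must contain at least one URL")
    bad = [u for u in cleaned if not u.startswith(("http://", "https://"))]
    if bad:
        raise ValueError(f"not an http(s) URL: {bad[0]}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "documentUrls": cleaned,
        "outputFormat": output_format,
        "doOcr": bool(do_ocr),
    }


print(build_input(["https://arxiv.org/pdf/2408.09869"], output_format="both"))
```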

Output

One dataset item per document:

{
  "url": "https://arxiv.org/pdf/2408.09869",
  "status": "success",
  "markdown": "## Abstract\n\nThis technical report introduces ...",
  "json": { "schema_name": "DoclingDocument", "texts": [ ... ], "tables": [ ... ] }
}

If a document can't be processed, you get a clear error instead of a crash:

{
  "url": "https://example.com/locked.pdf",
  "status": "error",
  "error": "Download failed with HTTP 403. The URL may be private, expired, or protected (e.g. Cloudflare/login). Provide a direct, publicly accessible document link."
}
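
Because failures arrive as ordinary dataset items, downstream code can separate them from successes by checking `status`. A small sketch, assuming items have the shape shown above; `split_results` is a hypothetical helper and the sample items are abbreviated illustrations, not real run output.

```python
def split_results(items):
    """Partition dataset items into successes and failures by "status"."""
    ok, failed = [], []
    for item in items:
        # Anything that is not an explicit success is treated as a failure.
        (ok if item.get("status") == "success" else failed).append(item)
    return ok, failed


# Abbreviated sample items modeled on the two shapes above.
items = [
    {"url": "https://arxiv.org/pdf/2408.09869", "status": "success", "markdown": "## Abstract"},
    {"url": "https://example.com/locked.pdf", "status": "error", "error": "Download failed with HTTP 403."},
]
ok, failed = split_results(items)
for item in failed:
    print(f'{item["url"]}: {item.get("error", "unknown error")}')
```

Collecting the failed URLs this way makes it straightforward to retry them in a later run once the links are fixed.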

🔌 Use the results via API

Run the Actor and read its output from your own code with the Apify API client:

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("genuine_qa/document-parser").call(run_input={
    "documentUrls": ["https://arxiv.org/pdf/2408.09869"],
    "outputFormat": "markdown",
})

# Each dataset item is one converted (or failed) document.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["status"] == "success":
        print(item["markdown"])
    else:
        print("Failed:", item["url"], "->", item["error"])

You can also export results directly as JSON/CSV/Excel from the dataset's Export button, or pull them from the Dataset API.
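
For the Dataset API route, the sketch below builds the items URL and fetches it with the standard library only. It assumes the Apify API v2 path `/v2/datasets/{datasetId}/items` with a `format` query parameter and Bearer-token authentication; check the Apify API reference before relying on it. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the URL of the dataset items endpoint for one dataset."""
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode({'format': fmt})}"


def fetch_items(dataset_id: str, token: str) -> list:
    """Download the dataset items as JSON, sending the token as a Bearer header."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Only builds the URL; fetch_items() needs a real dataset ID and token.
print(dataset_items_url("<DATASET_ID>", fmt="csv"))
```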


⚙️ Tips & performance

  • Memory: the conversion models need room — run with 4 GB+ memory for reliable results, more for large or OCR-heavy documents.
  • First page is slowest: models load once per run, so converting many documents in a single run is more efficient than one run per document.
  • OCR is heavier: only enable doOcr when documents are scanned or image-based — it's significantly slower than parsing digital text.
  • Use direct links: point to the actual file URL. Pages behind logins, paywalls, or anti-bot challenges (e.g. Cloudflare) can't be downloaded and will return a clear error.
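
The memory and batching tips can be combined in one call: pass every URL in a single run and request at least 4 GB. This sketch assumes the Python client's `call()` accepts a `memory_mbytes` argument, which is worth confirming against the client version you have installed; `call_kwargs` and `run_batch` are hypothetical helper names.

```python
def call_kwargs(urls, do_ocr=False, memory_mbytes=4096):
    """Keyword arguments for one batched run covering all URLs."""
    return {
        "run_input": {
            "documentUrls": list(urls),
            "outputFormat": "markdown",
            "doOcr": do_ocr,
        },
        # 4096 MB matches the 4 GB+ recommendation; raise it for OCR-heavy jobs.
        "memory_mbytes": memory_mbytes,
    }


def run_batch(token, urls, **options):
    """Start a single run for every URL so the models load only once."""
    # Imported lazily so the rest of the sketch works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    return client.actor("genuine_qa/document-parser").call(**call_kwargs(urls, **options))


print(call_kwargs(["https://arxiv.org/pdf/2408.09869"])["memory_mbytes"])
```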

❓ FAQ

Does it handle scanned PDFs? Yes — enable doOcr. For digital (text-based) PDFs, leave it off for much faster, higher-fidelity results.

Are tables preserved? Yes. Tables are reconstructed and emitted as Markdown tables, and as structured cells in the JSON output.
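
To pull those tables back out of the Markdown output, a simple line scanner is often enough. This is a sketch under the assumption that tables are emitted as pipe tables with leading and trailing `|` on each row; `extract_tables` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
def extract_tables(markdown: str) -> list[list[list[str]]]:
    """Collect pipe tables as lists of rows, each row a list of cell strings."""
    tables, current = [], []
    # The trailing empty line flushes a table that ends the document.
    for line in markdown.splitlines() + [""]:
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|") and len(stripped) > 1:
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the |---|---| separator row under the header.
            if not all(cell and set(cell) <= set("-: ") for cell in cells):
                current.append(cells)
        elif current:
            tables.append(current)
            current = []
    return tables


md = "Intro\n\n| Year | Revenue |\n|------|---------|\n| 2023 | 10 |\n| 2024 | 12 |\n\nOutro"
for row in extract_tables(md)[0]:
    print(row)
```

When `outputFormat` includes json, the structured table data in the JSON output is likely a more reliable source than re-parsing Markdown.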

Can I process many documents at once? Yes — pass multiple URLs in documentUrls. Each becomes its own dataset item.

What happens if one URL is bad? That single document is marked "status": "error" with a readable message; the rest of the run continues normally.

Do my documents leave the run? The Actor downloads each URL you provide, converts it inside the run, and writes the result to your dataset. It doesn't send your documents anywhere else.