Pricing

$50.00 / 1,000 PDF conversions

PDF to Markdown Converter - Extract & Format Text

Convert PDF documents to clean, readable markdown format. Perfect for documentation and knowledge bases.

Rating

0.0

(0)

Developer

daehwan kim

Actor stats

Last modified

5 days ago

Office to Markdown — RAG-Ready Document Extractor

Convert PDF, DOCX, PPTX, XLSX, HTML, images, audio, and 20+ other formats into clean, LLM-ready Markdown in one API call. Powered by Microsoft MarkItDown, an open-source document-to-Markdown engine.

Optimized for RAG pipelines, embedding ingestion, AI knowledge bases, and document understanding workflows where the quality of upstream chunking determines the quality of downstream retrieval.

v2.0 (2026-05-25) — Upgraded from PDF-only to multi-format. pdfUrl is kept as a legacy alias of fileUrl for backwards compatibility.

Why This Actor

Plain-text extraction with pdf-parse or pdfplumber returns raw text but loses structure: headings collapse, tables turn into unstructured strings, and lists flatten. For RAG, that means lower retrieval precision and more hallucination downstream.

Microsoft MarkItDown preserves:

Heading hierarchy → #, ##, ### mapped from document outline
Tables → real Markdown tables, not pipe-broken text
Lists → bullet and numbered list integrity
Code blocks → fenced code blocks preserved
Image alt-text → embedded into the flow for context

For DOCX and PPTX, semantic structure (slide titles, footnotes, comments) is preserved. For images and audio, OCR / transcription fallback runs automatically.
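
One practical consequence of getting real Markdown tables back is that they can be parsed into records again. The following is a minimal sketch, assuming the converter emits GitHub-style pipe tables with a `---` separator row; `parse_first_table` and `split_row` are hypothetical helper names, and escaped pipes inside cells are not handled.

```python
import re

# Matches a separator row such as "| --- | :---: |" (assumed GitHub-style table syntax).
SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def split_row(line: str) -> list[str]:
    """Split one pipe-delimited row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_first_table(markdown: str) -> list[dict[str, str]]:
    """Return the first pipe table in `markdown` as a list of row dicts."""
    lines = markdown.splitlines()
    for i in range(len(lines) - 1):
        # A table starts at a header row that is followed by a separator row.
        if "|" in lines[i] and SEPARATOR.match(lines[i + 1].strip()):
            header = split_row(lines[i])
            rows = []
            for line in lines[i + 2:]:
                if "|" not in line:
                    break  # first non-table line ends the table
                rows.append(dict(zip(header, split_row(line))))
            return rows
    return []
```

When the table has been flattened into plain text, there is no separator row to anchor on and this kind of recovery is no longer possible.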

Supported Formats

Category	Formats
Documents	PDF, DOCX, PPTX, XLSX, ODT, RTF
Web / Markup	HTML, HTM, XML, MHTML
Data	CSV, JSON, TSV
Plain	TXT, MD
Images (with OCR)	PNG, JPG, JPEG, GIF, BMP, WEBP
Audio (with transcription)	MP3, WAV, M4A
Archives	ZIP (recursive), EPUB
Others	YouTube URLs (transcript), Outlook MSG

Max file size: 100 MB per request.
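
To make the table above easier to apply in a batch job, here is a small sketch that maps a URL's file extension to one of the listed categories. The `categorize` helper is hypothetical and purely client-side. It returns `None` when the URL alone is inconclusive (for example, an arXiv link with no file extension, or a YouTube URL), in which case the Actor's own format detection decides.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions copied from the Supported Formats table.
FORMAT_CATEGORIES = {
    "Documents": {"pdf", "docx", "pptx", "xlsx", "odt", "rtf"},
    "Web / Markup": {"html", "htm", "xml", "mhtml"},
    "Data": {"csv", "json", "tsv"},
    "Plain": {"txt", "md"},
    "Images (with OCR)": {"png", "jpg", "jpeg", "gif", "bmp", "webp"},
    "Audio (with transcription)": {"mp3", "wav", "m4a"},
    "Archives": {"zip", "epub"},
    "Others": {"msg"},
}


def categorize(url: str) -> Optional[str]:
    """Return the table category for the URL's extension, or None if unknown."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None  # inconclusive: no extension, or one not in the table
```

A `None` result should not be treated as a rejection; it only means the extension gave no hint.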

Use Cases

RAG ingestion — Convert document libraries into Markdown chunks before embedding with OpenAI / Voyage / Cohere
AI knowledge bases — Bulk import company wikis, training material, manuals into vector DBs
Document Q&A — Pre-process source documents for Claude / GPT structured extraction
Compliance archival — Normalize multi-format historical records to searchable Markdown
Migration projects — Move from SharePoint / Confluence to modern docs-as-code platforms
LLM fine-tuning data prep — Clean Markdown corpus from heterogeneous source files
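
For the RAG ingestion case, preserved headings make it possible to chunk by section rather than by fixed character windows. The sketch below is one possible approach, not something the Actor does for you: `chunk_by_heading` is a hypothetical helper that splits Markdown at ATX headings (`#` to `######`) and tags each chunk with its heading path. It skips heading-like lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # fenced code block delimiter


def chunk_by_heading(markdown: str) -> list[dict[str, str]]:
    """Split Markdown into one chunk per section, tagged with its heading path."""
    chunks: list[dict[str, str]] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk's `path` (for example `Guide > Setup`) can be stored as metadata next to the embedding so retrieved passages keep their context.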

Input

Field	Type	Required	Description
`fileUrl`	string	✅	Direct HTTPS URL to a supported document (max 100 MB)
`pdfUrl`	string	—	Legacy alias for `fileUrl` (v1 compatibility)
`includePageBreaks`	boolean	—	Insert a horizontal rule between pages (PDF/PPTX). Default `false`
`truncateChars`	integer	—	Cap Markdown output at N characters. `0` = no cap (default)

{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "includePageBreaks": true,
  "truncateChars": 0
}
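
If older callers still send `pdfUrl`, it can help to normalize inputs on the client before starting a run. The sketch below is a hypothetical helper, not the Actor's actual validation logic. It applies the defaults from the table above and assumes `fileUrl` takes precedence when both keys are present, which the page does not state.

```python
def normalize_input(raw: dict) -> dict:
    """Map the legacy `pdfUrl` alias to `fileUrl` and fill in documented defaults."""
    # Assumption: fileUrl wins if both fileUrl and pdfUrl are supplied.
    file_url = raw.get("fileUrl") or raw.get("pdfUrl")
    if not file_url:
        raise ValueError("fileUrl is required")

    truncate_chars = int(raw.get("truncateChars", 0))
    if truncate_chars < 0:
        raise ValueError("truncateChars must be 0 (no cap) or a positive integer")

    return {
        "fileUrl": file_url,
        "includePageBreaks": bool(raw.get("includePageBreaks", False)),
        "truncateChars": truncate_chars,
    }
```

Passing the normalized dict as `run_input` keeps v1-style and v2-style callers on the same code path.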

Output

One dataset item per run:

Field	Type	Description
`fileUrl`	string	Source URL
`fileFormat`	string	Detected file extension (pdf, docx, ...)
`byteSize`	integer	Bytes downloaded
`charCount`	integer	Character length of resulting Markdown
`wordCount`	integer	Whitespace-tokenized word count
`markdown`	string	Final cleaned Markdown
`disclaimer`	string	Conversion accuracy notice
`error`	string	Populated only on failure

{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "fileFormat": "pdf",
  "byteSize": 1043820,
  "charCount": 48230,
  "wordCount": 7821,
  "markdown": "# Tree of Thoughts: Deliberate Problem Solving with Large Language Models\n\n## Abstract\n\nLanguage models are increasingly being deployed for general problem solving..."
}
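
A short sketch of how a consumer might handle one dataset item: fail fast when `error` is populated, then sanity-check the reported counts against the Markdown itself. `check_item` is a hypothetical helper. It assumes `charCount` is the string length in code points and `wordCount` is a plain whitespace split; the field descriptions suggest this but do not guarantee it.

```python
def check_item(item: dict) -> dict:
    """Raise on a failed conversion, otherwise compare reported counts to the text."""
    if item.get("error"):
        raise RuntimeError(f"conversion failed: {item['error']}")

    markdown = item["markdown"]
    return {
        # Assumed definitions; a mismatch may only mean the Actor counts differently.
        "chars_match": len(markdown) == item.get("charCount"),
        "words_match": len(markdown.split()) == item.get("wordCount"),
    }
```

A mismatch is a cue to inspect the item, not proof of a faulty conversion.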

Pricing

$0.05 per document converted (event: pdf-converted)
Charged only after successful conversion + dataset push
No charge on download / conversion failures
Apify platform compute usage is billed separately to users (passOnCosts enabled)
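
As a worked example of the arithmetic above, the sketch below estimates event charges for a batch. `estimate_event_cost` is a hypothetical helper. It counts only successful conversions, as the billing rules describe, and leaves out Apify platform compute, which is billed separately.

```python
from decimal import Decimal

PRICE_PER_CONVERSION = Decimal("0.05")  # USD per pdf-converted event


def estimate_event_cost(attempted: int, failed: int = 0) -> Decimal:
    """Estimate event charges in USD; failed conversions are not billed."""
    if attempted < 0 or failed < 0 or failed > attempted:
        raise ValueError("expected 0 <= failed <= attempted")
    return PRICE_PER_CONVERSION * (attempted - failed)
```

For instance, 1,000 successful conversions come to $50.00, which matches the per-1,000 price quoted for this Actor.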

Quick Start

curl

curl -X POST "https://api.apify.com/v2/acts/ntriqpro~pdf-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "fileUrl": "https://arxiv.org/pdf/2305.10601",
    "includePageBreaks": true
  }'
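
The same request can be assembled in Python with only the standard library. This is a sketch under two assumptions: that the Apify API accepts the token as an `Authorization: Bearer` header in place of the `token` query parameter, and that a helper like `build_run_request` (a hypothetical name) suits your setup. It builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(token: str, actor_id: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an Actor run."""
    return urllib.request.Request(
        f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: header-based auth instead of ?token=... in the URL.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sending it is then a call to `urllib.request.urlopen(request)`. Keeping the token out of the URL also keeps it out of most access logs.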

Python (Apify Client)

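from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("ntriqpro/pdf-to-markdown").call(run_input={
    "fileUrl": "https://example.com/report.docx"
})

# Each run pushes one dataset item; `error` is populated only on failure.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
item = items[0]
if item.get("error"):
    raise RuntimeError(item["error"])
print(item["markdown"])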

JavaScript (Apify Client)

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the Actor and wait for the run to finish.
const run = await client.actor('ntriqpro/pdf-to-markdown').call({
  fileUrl: 'https://example.com/slides.pptx'
});

// Read the single dataset item produced by the run.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Limitations

Limitation	Detail
Scanned image-only PDFs	OCR is applied but accuracy depends on scan quality
Encrypted / password-protected files	Not supported
Files > 100 MB	Hard-rejected to protect compute cost
Non-Latin scripts (CJK, Arabic, etc.)	Supported but proofreading recommended for production
Streaming sources (S3 signed URLs)	Supported as long as URL is HTTPS-reachable

Always validate critical extractions against the source.
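
Because failed downloads and conversions are not charged but still cost time, some of the limits above can be checked before a run is started. The sketch below is a hypothetical client-side pre-check. `precheck` is not part of the Actor, the byte size is assumed to come from somewhere like a HEAD request's `Content-Length`, and 100 MB is interpreted as 100 × 1,048,576 bytes, which the page does not specify.

```python
from typing import Optional
from urllib.parse import urlparse

MAX_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def precheck(url: str, content_length: Optional[int] = None) -> list[str]:
    """Return a list of reasons the file would likely be rejected (empty if none)."""
    problems = []
    if urlparse(url).scheme.lower() != "https":
        problems.append("URL is not HTTPS")
    # Size is only checked when the caller already knows it.
    if content_length is not None and content_length > MAX_BYTES:
        problems.append("file exceeds the 100 MB cap")
    return problems
```

An empty list means only that no known limit was hit; encrypted files and poor scans still surface at conversion time.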

Technology Stack

Microsoft MarkItDown (MIT) — multi-format → Markdown converter
httpx (BSD) — Async HTTP client with streaming + size cap
Apify SDK for Python (Apache 2.0) — Actor runtime

Disclaimer

This Actor is an unofficial open-source wrapper around Microsoft MarkItDown. It is not affiliated with, sponsored by, or endorsed by Microsoft Corporation. Conversion fidelity depends on source-document structure; results are provided for informational and AI ingestion purposes only and are not a substitute for human review of critical or regulated documents.

Changelog

2.0 (2026-05-25) — Migrated to Python + Microsoft MarkItDown. Multi-format support (DOCX, PPTX, XLSX, HTML, images, audio, etc.). Output schema enriched with fileFormat / byteSize / charCount. pdfUrl retained as alias of fileUrl.
1.0 (2026-04-14) — Initial release with pdf-parse JavaScript backend (PDF only).

Related Actors

invoice-extraction-mcp — Structured line-item extraction from invoice PDFs
blueprint-intelligence — AI floor-plan and architectural-drawing analyzer
content-factory — Convert documents into quizzes, flashcards, slide decks, podcast scripts

⭐ Rate this Actor

If this saves you time, please leave a review — it helps other teams discover it.
