PDF to Markdown Converter - Extract & Format Text

Pricing

$50.00 / 1,000 converted documents

Convert PDF documents to clean, readable markdown format. Perfect for documentation and knowledge bases.

Developer

daehwan kim

Maintained by Community

Office to Markdown — RAG-Ready Document Extractor

Convert PDF, DOCX, PPTX, XLSX, HTML, images, audio, and 20+ other formats into clean LLM-ready Markdown in one API call. Powered by Microsoft MarkItDown — the highest-fidelity open-source document-to-Markdown engine available.

Optimized for RAG pipelines, embedding ingestion, AI knowledge bases, and document understanding workflows where the quality of upstream chunking determines the quality of downstream retrieval.

v2.0 (2026-05-25) — Upgraded from PDF-only to multi-format. pdfUrl is kept as a legacy alias of fileUrl for backwards compatibility.

Why This Actor

pdf-parse and pdfplumber extract raw text but lose structure: headings collapse, tables stringify, lists flatten. For RAG, that means lower retrieval precision and more hallucination downstream.

Microsoft MarkItDown preserves:

  • Heading hierarchy → #, ##, ### mapped from document outline
  • Tables → real Markdown tables, not pipe-broken text
  • Lists → bullet and numbered list integrity
  • Code blocks → code fences preserved
  • Image alt-text → embedded into the flow for context

For DOCX and PPTX, semantic structure (slide titles, footnotes, comments) is preserved. For images and audio, OCR / transcription fallback runs automatically.
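To illustrate why a preserved heading hierarchy matters for chunking, here is a minimal sketch that splits converted Markdown into one section per heading. The helper name `chunk_by_headings` and the returned dictionary shape are illustrative assumptions, not part of this Actor or of MarkItDown, and a production chunker would likely also enforce a size limit per chunk.

```python
import re
from typing import Any, Dict, List

# ATX headings: one to six '#' characters, whitespace, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str) -> List[Dict[str, Any]]:
    """Split Markdown into sections, one per heading (hypothetical helper)."""
    sections: List[Dict[str, Any]] = []
    # Level 0 holds any text that appears before the first heading.
    current: Dict[str, Any] = {"level": 0, "title": "", "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append(current)
            current = {"level": len(match.group(1)), "title": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)

    # Join each body and drop sections that have neither a title nor text.
    result: List[Dict[str, Any]] = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if section["title"] or text:
            result.append({"level": section["level"], "title": section["title"], "text": text})
    return result
```

Each returned section keeps its heading level and title, so the title can be stored as metadata next to the embedded text.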

Supported Formats

| Category | Formats |
| --- | --- |
| Documents | PDF, DOCX, PPTX, XLSX, ODT, RTF |
| Web / Markup | HTML, HTM, XML, MHTML |
| Data | CSV, JSON, TSV |
| Plain | TXT, MD |
| Images (with OCR) | PNG, JPG, JPEG, GIF, BMP, WEBP |
| Audio (with transcription) | MP3, WAV, M4A |
| Archives | ZIP (recursive), EPUB |
| Others | YouTube URLs (transcript), Outlook MSG |

Max file size: 100 MB per request.
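A client-side pre-flight check can catch oversized files before a run is started. The sketch below is an assumption-laden illustration: the helper `preflight` is hypothetical, "100 MB" is read as 100 × 1024 × 1024 bytes (the Actor may count differently), and the file size is expected to come from the caller, for example from a `Content-Length` header.

```python
from typing import List
from urllib.parse import urlparse

# Assumption: "100 MB" interpreted as 100 MiB; the Actor's exact threshold may differ.
MAX_BYTES = 100 * 1024 * 1024


def preflight(url: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems: List[str] = []
    if urlparse(url).scheme != "https":
        problems.append("URL is not HTTPS")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 100 MB")
    return problems
```

The check deliberately ignores the file extension, because URLs such as `https://arxiv.org/pdf/2305.10601` carry no usable extension.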

Use Cases

  • RAG ingestion — Convert document libraries into Markdown chunks before embedding with OpenAI / Voyage / Cohere
  • AI knowledge bases — Bulk import company wikis, training material, manuals into vector DBs
  • Document Q&A — Pre-process source documents for Claude / GPT structured extraction
  • Compliance archival — Normalize multi-format historical records to searchable Markdown
  • Migration projects — Move from SharePoint / Confluence to modern docs-as-code platforms
  • LLM fine-tuning data prep — Clean Markdown corpus from heterogeneous source files

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| fileUrl | string | | Direct HTTPS URL to a supported document (max 100 MB) |
| pdfUrl | string | | Legacy alias for fileUrl (v1 compatibility) |
| includePageBreaks | boolean | | Insert a horizontal rule between pages (PDF/PPTX). Default false |
| truncateChars | integer | | Cap Markdown output at N characters. 0 = no cap (default) |
{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "includePageBreaks": true,
  "truncateChars": 0
}
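The input fields can be assembled with a small helper that also maps the legacy `pdfUrl` name onto `fileUrl`. This is a sketch only: `build_run_input` is a hypothetical function, and preferring `file_url` when both values are given is an assumption, since the Actor's own precedence is not documented here.

```python
from typing import Any, Dict, Optional


def build_run_input(
    file_url: Optional[str] = None,
    pdf_url: Optional[str] = None,      # legacy v1 field name
    include_page_breaks: bool = False,  # documented default
    truncate_chars: int = 0,            # 0 = no cap (documented default)
) -> Dict[str, Any]:
    """Build the Actor input, accepting the legacy pdfUrl name (hypothetical helper)."""
    # Assumption: file_url wins if both are supplied.
    url = file_url or pdf_url
    if not url:
        raise ValueError("either file_url or pdf_url is required")
    if truncate_chars < 0:
        raise ValueError("truncate_chars must be 0 or a positive integer")
    return {
        "fileUrl": url,
        "includePageBreaks": include_page_breaks,
        "truncateChars": truncate_chars,
    }
```

The returned dictionary can be passed as `run_input` to the Apify client call.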

Output

One dataset item per run:

| Field | Type | Description |
| --- | --- | --- |
| fileUrl | string | Source URL |
| fileFormat | string | Detected file extension (pdf, docx, ...) |
| byteSize | integer | Bytes downloaded |
| charCount | integer | Character length of resulting Markdown |
| wordCount | integer | Whitespace-tokenized word count |
| markdown | string | Final cleaned Markdown |
| disclaimer | string | Conversion accuracy notice |
| error | string | Populated only on failure |
{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "fileFormat": "pdf",
  "byteSize": 1043820,
  "charCount": 48230,
  "wordCount": 7821,
  "markdown": "# Tree of Thoughts: Deliberate Problem Solving with Large Language Models\n\n## Abstract\n\nLanguage models are increasingly being deployed for general problem solving..."
}
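Because `error` is populated only on failure, a consumer can check it before using `markdown`. The sketch below assumes the field names from the table above; `ConversionError`, `extract_markdown`, and `local_stats` are hypothetical names. The local counts follow the field descriptions (character length, whitespace-split words) and are not guaranteed to match the Actor's own numbers exactly.

```python
from typing import Any, Dict


class ConversionError(Exception):
    """Raised when a dataset item reports a failed conversion (hypothetical)."""


def extract_markdown(item: Dict[str, Any]) -> str:
    """Return the Markdown from a dataset item, or raise if the run failed."""
    error = item.get("error")
    if error:
        raise ConversionError(f"{item.get('fileUrl', '<unknown>')}: {error}")
    markdown = item.get("markdown")
    if not isinstance(markdown, str) or not markdown.strip():
        raise ConversionError("item has no markdown content")
    return markdown


def local_stats(markdown: str) -> Dict[str, int]:
    """Recompute character and whitespace-tokenized word counts locally."""
    return {"charCount": len(markdown), "wordCount": len(markdown.split())}
```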

Pricing

  • $0.05 per document converted (event: pdf-converted)
  • Charged only after successful conversion + dataset push
  • No charge on download / conversion failures
  • Apify platform compute usage is billed separately to users (passOnCosts enabled)
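The per-document fee can be estimated with simple arithmetic. This sketch covers only the `pdf-converted` event at the $0.05 rate listed above; platform compute usage is billed separately and is not modeled. The function name `estimate_conversion_fee` is an illustrative assumption.

```python
from decimal import Decimal

# USD charged per successful conversion (pdf-converted event), from the list above.
PRICE_PER_DOCUMENT = Decimal("0.05")


def estimate_conversion_fee(successful_conversions: int) -> Decimal:
    """Estimate the event fee in USD; failed conversions are not charged."""
    if successful_conversions < 0:
        raise ValueError("successful_conversions must not be negative")
    return PRICE_PER_DOCUMENT * successful_conversions
```

For example, 1,000 successful conversions come to $50.00, which matches the per-1,000 price shown at the top of the page.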

Quick Start

curl

curl -X POST "https://api.apify.com/v2/acts/ntriqpro~pdf-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "fileUrl": "https://arxiv.org/pdf/2305.10601",
    "includePageBreaks": true
  }'

Python (Apify Client)

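from apify_client import ApifyClient
# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")
# Start the Actor and wait for the run to finish.
run = client.actor("ntriqpro/pdf-to-markdown").call(run_input={
    "fileUrl": "https://example.com/report.docx"
})
# Each run pushes one dataset item; print its Markdown.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(items[0]["markdown"])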

JavaScript (Apify Client)

import { ApifyClient } from 'apify-client';
// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });
// Start the Actor and wait for the run to finish.
const run = await client.actor('ntriqpro/pdf-to-markdown').call({
    fileUrl: 'https://example.com/slides.pptx'
});
// Each run pushes one dataset item; print its Markdown.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Limitations

| Limitation | Detail |
| --- | --- |
| Scanned image-only PDFs | OCR is applied but accuracy depends on scan quality |
| Encrypted / password-protected files | Not supported |
| Files > 100 MB | Hard-rejected to protect compute cost |
| Non-Latin scripts (CJK, Arabic, etc.) | Supported but proofreading recommended for production |
| Streaming sources (S3 signed URLs) | Supported as long as URL is HTTPS-reachable |

Always validate critical extractions against the source.

Technology Stack

Disclaimer

This Actor is an unofficial open-source wrapper around Microsoft MarkItDown. It is not affiliated with, sponsored by, or endorsed by Microsoft Corporation. Conversion fidelity depends on source-document structure; results are provided for informational and AI ingestion purposes only and are not a substitute for human review of critical or regulated documents.

Changelog

  • 2.0 (2026-05-25) — Migrated to Python + Microsoft MarkItDown. Multi-format support (DOCX, PPTX, XLSX, HTML, images, audio, etc.). Output schema enriched with fileFormat / byteSize / charCount. pdfUrl retained as alias of fileUrl.
  • 1.0 (2026-04-14) — Initial release with pdf-parse JavaScript backend (PDF only).

⭐ Rate this Actor

If this saves you time, please leave a review — it helps other teams discover it.