PDF to Markdown Converter - Extract & Format Text
Pricing
$50.00 / 1,000 PDFs converted
Convert PDF documents to clean, readable markdown format. Perfect for documentation and knowledge bases.
Developer
daehwan kim
Maintained by Community
Office to Markdown — RAG-Ready Document Extractor
Convert PDF, DOCX, PPTX, XLSX, HTML, images, audio, and 20+ other formats into clean LLM-ready Markdown in one API call. Powered by Microsoft MarkItDown — the highest-fidelity open-source document-to-Markdown engine available.
Optimized for RAG pipelines, embedding ingestion, AI knowledge bases, and document understanding workflows where the quality of upstream chunking determines the quality of downstream retrieval.
v2.0 (2026-05-25) — Upgraded from PDF-only to multi-format.
`pdfUrl` is kept as a legacy alias of `fileUrl` for backwards compatibility.
Why This Actor
`pdf-parse` and `pdfplumber` extract raw text but lose structure: headings collapse, tables turn into plain strings, lists flatten. For RAG, that means lower retrieval precision and more hallucination downstream.
Microsoft MarkItDown preserves:
- Heading hierarchy → `#`, `##`, `###` mapped from document outline
- Tables → real Markdown tables, not pipe-broken text
- Lists → bullet and numbered list integrity
- Code blocks → fenced code blocks preserved
- Image alt-text → embedded into the flow for context
For DOCX and PPTX, semantic structure (slide titles, footnotes, comments) is preserved. For images and audio, OCR / transcription fallback runs automatically.
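Because the heading hierarchy survives conversion, a downstream RAG step can split on headings instead of fixed character windows. The following is a minimal sketch of that idea and is not part of this Actor: `chunk_by_heading` is a hypothetical helper, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX headings only: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list:
    """Split Markdown into chunks, one per heading, keeping the heading path.

    Hypothetical helper for illustration; not provided by the Actor.
    """
    chunks = []
    path = []  # stack of (level, title) for the current position
    body = []  # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading path (for example `["Title", "Abstract"]`), which can be stored as metadata next to the embedding so retrieved passages keep their context.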
Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, PPTX, XLSX, ODT, RTF |
| Web / Markup | HTML, HTM, XML, MHTML |
| Data | CSV, JSON, TSV |
| Plain | TXT, MD |
| Images (with OCR) | PNG, JPG, JPEG, GIF, BMP, WEBP |
| Audio (with transcription) | MP3, WAV, M4A |
| Archives | ZIP (recursive), EPUB |
| Others | YouTube URLs (transcript), Outlook MSG |
Max file size: 100 MB per request.
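A client-side pre-flight check can catch obviously unsuitable inputs before a run is started. The sketch below is an illustration only: `preflight` and `guess_format` are hypothetical helpers, the extension set is copied from the table above, and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption. An unknown extension is not treated as an error, because URLs such as arXiv links carry no file extension and the Actor's own format detection is not documented here.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Extensions copied from the Supported Formats table.
SUPPORTED = {
    "pdf", "docx", "pptx", "xlsx", "odt", "rtf",
    "html", "htm", "xml", "mhtml",
    "csv", "json", "tsv", "txt", "md",
    "png", "jpg", "jpeg", "gif", "bmp", "webp",
    "mp3", "wav", "m4a", "zip", "epub", "msg",
}
MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" interpreted as 100 MiB


def guess_format(url: str) -> Optional[str]:
    """Return the URL's extension if it is in the supported list, else None."""
    ext = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    return ext if ext in SUPPORTED else None


def preflight(url: str, content_length: Optional[int] = None) -> list:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if urlparse(url).scheme != "https":
        problems.append("URL must use HTTPS")
    if content_length is not None and content_length > MAX_BYTES:
        problems.append("file exceeds the 100 MB limit")
    return problems
```

The `content_length` argument would typically come from a `Content-Length` response header when the server provides one; when it is unknown, pass `None` and the size check is skipped.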
Use Cases
- RAG ingestion — Convert document libraries into Markdown chunks before embedding with OpenAI / Voyage / Cohere
- AI knowledge bases — Bulk import company wikis, training material, manuals into vector DBs
- Document Q&A — Pre-process source documents for Claude / GPT structured extraction
- Compliance archival — Normalize multi-format historical records to searchable Markdown
- Migration projects — Move from SharePoint / Confluence to modern docs-as-code platforms
- LLM fine-tuning data prep — Clean Markdown corpus from heterogeneous source files
Input
| Field | Type | Required | Description |
|---|---|---|---|
| `fileUrl` | string | ✅ | Direct HTTPS URL to a supported document (max 100 MB) |
| `pdfUrl` | string | optional | Legacy alias for `fileUrl` (v1 compatibility) |
| `includePageBreaks` | boolean | optional | Insert horizontal-rule between pages (PDF/PPTX). Default `false` |
| `truncateChars` | integer | optional | Cap Markdown output at N characters. `0` = no cap (default) |
```json
{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "includePageBreaks": true,
  "truncateChars": 0
}
```
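When the input is assembled in code, it can help to validate it against the field table before starting a run. This is a sketch under stated assumptions: `build_run_input` and `migrate_v1_input` are hypothetical helpers, and preferring `fileUrl` when both `fileUrl` and `pdfUrl` are present is an assumption, since the precedence is not documented here.

```python
def build_run_input(file_url: str,
                    include_page_breaks: bool = False,
                    truncate_chars: int = 0) -> dict:
    """Build an input payload using the field names from the table above."""
    if not file_url.startswith("https://"):
        raise ValueError("fileUrl must be a direct HTTPS URL")
    if truncate_chars < 0:
        raise ValueError("truncateChars must be 0 (no cap) or a positive integer")
    return {
        "fileUrl": file_url,
        "includePageBreaks": include_page_breaks,
        "truncateChars": truncate_chars,
    }


def migrate_v1_input(payload: dict) -> dict:
    """Rename the legacy pdfUrl field to fileUrl without mutating the input.

    Assumption: if both fields are present, fileUrl is kept.
    """
    migrated = dict(payload)
    legacy = migrated.pop("pdfUrl", None)
    if legacy is not None and "fileUrl" not in migrated:
        migrated["fileUrl"] = legacy
    return migrated
```

The Actor still accepts `pdfUrl`, so the migration step is optional; it is mainly useful for keeping stored v1 payloads consistent with the v2 field name.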
Output
One dataset item per run:
| Field | Type | Description |
|---|---|---|
| `fileUrl` | string | Source URL |
| `fileFormat` | string | Detected file extension (pdf, docx, ...) |
| `byteSize` | integer | Bytes downloaded |
| `charCount` | integer | Character length of resulting Markdown |
| `wordCount` | integer | Whitespace-tokenized word count |
| `markdown` | string | Final cleaned Markdown |
| `disclaimer` | string | Conversion accuracy notice |
| `error` | string | Populated only on failure |
```json
{
  "fileUrl": "https://arxiv.org/pdf/2305.10601",
  "fileFormat": "pdf",
  "byteSize": 1043820,
  "charCount": 48230,
  "wordCount": 7821,
  "markdown": "# Tree of Thoughts: Deliberate Problem Solving with Large Language Models\n\n## Abstract\n\nLanguage models are increasingly being deployed for general problem solving..."
}
```
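Since `error` is populated only on failure, consumers may want to check it before reading `markdown`. The sketch below shows one way to do that and to recompute the two counters locally. `extract_markdown` and `recount` are hypothetical helpers; treating `charCount` as the Python string length and `wordCount` as the number of whitespace-separated tokens is an assumption based on the field descriptions above.

```python
def extract_markdown(item: dict) -> str:
    """Return the Markdown from a dataset item, or raise if the item reports a failure."""
    if item.get("error"):
        raise RuntimeError(
            f"conversion failed for {item.get('fileUrl')}: {item['error']}"
        )
    return item["markdown"]


def recount(markdown: str) -> dict:
    """Recompute the counters locally (assumed definitions, see lead-in)."""
    return {
        "charCount": len(markdown),           # code points in the string
        "wordCount": len(markdown.split()),   # whitespace-separated tokens
    }
```

Comparing `recount(item["markdown"])` with the reported values is a cheap sanity check that the text was not altered between the dataset and your pipeline.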
Pricing
- $0.05 per document converted (event: `pdf-converted`)
- Charged only after successful conversion + dataset push
- No charge on download / conversion failures
- Apify platform compute usage is billed separately to users (passOnCosts enabled)
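As a worked example of the pricing rules above, the per-event charge for a batch can be estimated from the number of successful conversions alone. This is a rough sketch: `estimate_event_cost` is a hypothetical helper, and it covers only the `pdf-converted` event, not the separately billed platform compute.

```python
from decimal import Decimal

PRICE_PER_DOCUMENT = Decimal("0.05")  # per successful conversion, from the list above


def estimate_event_cost(successful: int, failed: int = 0) -> Decimal:
    """Estimate pay-per-event charges in USD.

    Failed conversions are accepted as an argument only to make explicit
    that they do not contribute to the total.
    """
    if successful < 0 or failed < 0:
        raise ValueError("counts must be non-negative")
    return PRICE_PER_DOCUMENT * successful


# 240 successful conversions and 10 failures in one batch.
print(estimate_event_cost(successful=240, failed=10))
```

At this rate, 1,000 successful conversions come to $50.00, which matches the headline price.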
Quick Start
curl
```bash
curl -X POST "https://api.apify.com/v2/acts/ntriqpro~pdf-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"fileUrl": "https://arxiv.org/pdf/2305.10601", "includePageBreaks": true}'
```
Python (Apify Client)
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("ntriqpro/pdf-to-markdown").call(
    run_input={"fileUrl": "https://example.com/report.docx"}
)
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(items[0]["markdown"])
```
JavaScript (Apify Client)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('ntriqpro/pdf-to-markdown').call({
    fileUrl: 'https://example.com/slides.pptx',
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);
```
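For a document library, the single-file pattern can be wrapped in a loop. The sketch below is one possible arrangement, not an official client feature: `convert_many` is a hypothetical helper that expects a client object exposing the same `actor(...).call(...)` and `dataset(...).iterate_items()` methods used in the Python example, and it runs the Actor once per URL, so each successful document is billed as one conversion.

```python
def convert_many(client, urls, actor_id="ntriqpro/pdf-to-markdown"):
    """Run the Actor once per URL and return {url: dataset item}.

    `client` is expected to behave like the ApifyClient from the Python
    example; it is passed in so this sketch has no import of its own.
    """
    results = {}
    for url in urls:
        run = client.actor(actor_id).call(run_input={"fileUrl": url})
        if run is None:
            # Defensive guard in case the run did not finish.
            results[url] = {"fileUrl": url, "error": "run did not finish"}
            continue
        items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
        # The Actor documents one dataset item per run.
        results[url] = items[0] if items else {
            "fileUrl": url,
            "error": "no dataset item returned",
        }
    return results
```

Runs here are sequential for simplicity. Failed items keep an `error` key, so callers can filter them out before embedding.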
Limitations
| Limitation | Detail |
|---|---|
| Scanned image-only PDFs | OCR is applied but accuracy depends on scan quality |
| Encrypted / password-protected files | Not supported |
| Files > 100 MB | Hard-rejected to protect compute cost |
| Non-Latin scripts (CJK, Arabic, etc.) | Supported but proofreading recommended for production |
| Streaming sources (S3 signed URLs) | Supported as long as URL is HTTPS-reachable |
Always validate critical extractions against the source.
Technology Stack
- Microsoft MarkItDown (MIT) — multi-format → Markdown converter
- httpx (BSD) — Async HTTP client with streaming + size cap
- Apify SDK for Python (Apache 2.0) — Actor runtime
Disclaimer
This Actor is an unofficial open-source wrapper around Microsoft MarkItDown. It is not affiliated with, sponsored by, or endorsed by Microsoft Corporation. Conversion fidelity depends on source-document structure; results are provided for informational and AI ingestion purposes only and are not a substitute for human review of critical or regulated documents.
Changelog
- 2.0 (2026-05-25): Migrated to Python + Microsoft MarkItDown. Multi-format support (DOCX, PPTX, XLSX, HTML, images, audio, etc.). Output schema enriched with `fileFormat` / `byteSize` / `charCount`. `pdfUrl` retained as alias of `fileUrl`.
- 1.0 (2026-04-14): Initial release with `pdf-parse` JavaScript backend (PDF only).
🔗 Related Actors by ntriqpro
- invoice-extraction-mcp — Structured line-item extraction from invoice PDFs
- blueprint-intelligence — AI floor-plan and architectural-drawing analyzer
- content-factory — Convert documents into quizzes, flashcards, slide decks, podcast scripts
⭐ Rate this Actor
If this saves you time, please leave a review — it helps other teams discover it.