PDF to Markdown Converter - AI-Powered with OCR & Tables

Convert PDFs to clean Markdown with GPU-accelerated AI. Extracts tables, LaTeX formulas, and images from complex layouts. Supports OCR for scanned docs in 8 languages. Batch process hundreds of PDFs in parallel via URL, upload, or API.

Pricing

Pay per event

Developer

ClearPath

Maintained by Community

PDF to Markdown

An AI-powered PDF to Markdown converter on Apify, with GPU acceleration for complex layouts, tables, formulas, and images.

Convert any PDF document into clean, structured Markdown with intelligent layout detection, OCR for scanned documents, and optional image extraction. Built for developers who need reliable PDF processing at scale.

  • Complex layouts — Multi-column documents, academic papers, financial reports
  • Table extraction — Preserves table structure in Markdown format
  • Formula support — Mathematical equations converted to LaTeX
  • Batch processing — Process hundreds of PDFs in parallel
  • Multiple input methods — File upload, URLs, or base64 API calls

Key Features

Document Processing

  • Intelligent layout detection — Handles single and multi-column layouts automatically
  • Table recognition — Extracts tables with proper Markdown formatting
  • Formula extraction — Converts mathematical formulas to LaTeX notation
  • Image extraction — Optionally extract embedded images with public URLs
  • OCR support — Process scanned PDFs with 8 language options

Developer-Friendly

  • Batch processing — Submit multiple PDFs in a single run
  • Parallel execution — Configurable concurrency (1-10 simultaneous PDFs)
  • Three output modes — Choose between text-only, with images, or full extraction
  • Structured output — Consistent JSON schema for easy integration
  • Public image URLs — Images stored in Apify Key-Value Store, not base64 blobs

Reliability

  • Automatic retries — Exponential backoff for transient failures
  • Partial success — Continues processing if individual PDFs fail
  • Detailed status — Per-document success/error reporting
  • Processing metrics — Page count, markdown length, processing time
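
The Actor's actual retry schedule is not documented here, so the following is only a generic sketch of what exponential backoff means: the wait doubles after each failed attempt, up to a cap. The function name and every default value are illustrative assumptions, not the Actor's settings.

```python
import random


def backoff_delays(max_retries=4, base=1.0, cap=30.0, jitter=False):
    """Return the wait in seconds before each retry.

    Generic exponential backoff: base * 2**attempt, limited to `cap`.
    All defaults are hypothetical; the Actor's real values are not published.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        if jitter:
            # Optional "full jitter": pick a random wait up to the computed delay
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays
```

The same pattern can be useful on the client side if you re-submit PDFs whose dataset item came back with an error status.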

Use Cases

For RAG Pipeline Developers

  • Prepare documents for LLM retrieval — Convert PDFs to clean text for embedding
  • Build knowledge bases — Extract structured content from document libraries
  • Enhance chatbot context — Feed processed documents into AI assistants
  • Create searchable archives — Transform PDF collections into queryable text

For Data Engineers

  • Document migration — Convert legacy PDF archives to Markdown
  • Content pipelines — Automate PDF processing in data workflows
  • ETL integration — Extract text data for downstream processing
  • Compliance archival — Create text-based backups of PDF documents

For Researchers

  • Extract tables from papers — Pull data tables from academic PDFs
  • Process formulas — Convert mathematical notation to LaTeX
  • Batch analysis — Process entire paper collections
  • Figure extraction — Capture charts and diagrams with metadata

Quick Start

Basic — Single PDF URL

{
  "pdfUrls": ["https://example.com/document.pdf"]
}

Advanced — Batch with Images

{
  "pdfUrls": [
    "https://example.com/report-q1.pdf",
    "https://example.com/report-q2.pdf",
    "https://example.com/report-q3.pdf"
  ],
  "outputMode": "markdown_images",
  "concurrency": 5
}

Complete — All Parameters

{
  "pdfFile": null,
  "pdfUrls": ["https://example.com/document.pdf"],
  "pdfBase64Items": [
    {
      "filename": "uploaded-doc.pdf",
      "data": "JVBERi0xLjQKJeLjz9..."
    }
  ],
  "outputMode": "full",
  "language": "en",
  "concurrency": 3
}

Pricing — Pay Per Event (PPE)

Transparent pay-per-PDF pricing based on output mode:

| Output Mode | Price per PDF | Description |
|-------------|---------------|-------------|
| markdown | $0.02 | Text-only extraction |
| markdown_images | $0.03 | Text + extracted images stored in KV |
| full | $0.04 | Text + images + raw JSON metadata |

Cost Examples

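| PDFs | Output Mode | Total Cost |
|------|-------------|------------|
| 10 | markdown | $0.20 |
| 50 | markdown | $1.00 |
| 100 | markdown | $2.00 |
| 100 | markdown_images | $3.00 |
| 100 | full | $4.00 |
| 500 | markdown | $10.00 |
| 1,000 | markdown | $20.00 |
| 1,000 | markdown_images | $30.00 |
| 1,000 | full | $40.00 |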
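
The cost examples are simply the per-PDF price multiplied by the number of PDFs. A small estimator, using the prices listed in this README, might look like the sketch below. The function name is hypothetical, and the sketch assumes every submitted PDF is billed; whether failed PDFs are charged is not stated here.

```python
from decimal import Decimal

# Per-PDF prices from the pricing table, kept as Decimal to avoid float rounding
PRICE_PER_PDF = {
    "markdown": Decimal("0.02"),
    "markdown_images": Decimal("0.03"),
    "full": Decimal("0.04"),
}


def estimate_cost(pdf_count, output_mode="markdown"):
    """Estimate the run cost in USD for `pdf_count` PDFs in one output mode."""
    if pdf_count < 0:
        raise ValueError("pdf_count must not be negative")
    if output_mode not in PRICE_PER_PDF:
        raise ValueError(f"unknown output mode: {output_mode!r}")
    return PRICE_PER_PDF[output_mode] * pdf_count
```

For example, `estimate_cost(1000, "markdown_images")` reproduces the $30.00 row of the table.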

Cost Optimization Tips

  • Use markdown mode if you don't need images
  • Filter PDFs before submission to avoid processing irrelevant documents
  • Start with lower concurrency and scale up as needed

Input Parameters

| Parameter | Type | Default | Required | Description |
|-----------|------|---------|----------|-------------|
| pdfFile | file upload | - | No* | Upload a single PDF file via the Apify UI |
| pdfUrls | string[] | - | No* | Array of URLs pointing to PDF files |
| pdfBase64Items | object[] | - | No* | Array of base64-encoded PDFs with filenames |
| outputMode | enum | markdown | No | Output format: markdown, markdown_images, or full |
| language | enum | en | No | Language hint for OCR accuracy |
| concurrency | integer | 3 | No | Parallel processing (1-10) |

*At least one PDF source is required (pdfFile, pdfUrls, or pdfBase64Items).
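
Checking these constraints before starting a run can save a failed (and possibly billed) call. The helper below is a hypothetical client-side sketch, not part of the Actor: it enforces the rules documented in this README (at least one source, a known output mode and language code, concurrency between 1 and 10). It leaves out pdfFile, which is described as a UI upload.

```python
OUTPUT_MODES = {"markdown", "markdown_images", "full"}
# Language codes as listed under "Supported Languages"
LANGUAGES = {"en", "ch", "chinese_cht", "japan", "korean", "ta", "te", "ka"}


def build_run_input(pdf_urls=None, pdf_base64_items=None,
                    output_mode="markdown", language="en", concurrency=3):
    """Validate the documented constraints and return an Actor input dict."""
    if not pdf_urls and not pdf_base64_items:
        raise ValueError("provide pdf_urls and/or pdf_base64_items")
    if output_mode not in OUTPUT_MODES:
        raise ValueError(f"outputMode must be one of {sorted(OUTPUT_MODES)}")
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {sorted(LANGUAGES)}")
    if not 1 <= concurrency <= 10:
        raise ValueError("concurrency must be between 1 and 10")

    run_input = {
        "outputMode": output_mode,
        "language": language,
        "concurrency": concurrency,
    }
    # Only include the sources that were actually supplied
    if pdf_urls:
        run_input["pdfUrls"] = list(pdf_urls)
    if pdf_base64_items:
        run_input["pdfBase64Items"] = list(pdf_base64_items)
    return run_input
```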

Output Modes

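| Mode | Markdown | Images | JSON Content | Best For |
|------|----------|--------|--------------|----------|
| markdown | Yes | No | No | Text extraction, RAG pipelines |
| markdown_images | Yes | Yes (URLs) | No | Full document conversion |
| full | Yes | Yes (URLs) | Yes | Analysis, debugging |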

Supported Languages

| Code | Language |
|------|----------|
| en | English |
| ch | Chinese (Simplified) |
| chinese_cht | Chinese (Traditional) |
| japan | Japanese |
| korean | Korean |
| ta | Tamil |
| te | Telugu |
| ka | Kannada |

Base64 Input Format

For API integration, use pdfBase64Items:

{
  "pdfBase64Items": [
    {
      "filename": "invoice-001.pdf",
      "data": "JVBERi0xLjQKJeLjz9MKNSAwIG9iago8PC..."
    },
    {
      "filename": "contract-2024.pdf",
      "data": "JVBERi0xLjUKJeLjz9MKMSAwIG9iago8PC..."
    }
  ]
}
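
One way to build these items from local files is sketched below. It assumes the data field holds plain standard base64 with no data-URI prefix, which matches the sample values above (they start with the encoding of `%PDF-1.`). Both helper names are hypothetical.

```python
import base64
from pathlib import Path


def to_base64_item(filename, pdf_bytes):
    """Wrap raw PDF bytes in the {"filename", "data"} shape shown above."""
    return {
        "filename": filename,
        "data": base64.b64encode(pdf_bytes).decode("ascii"),
    }


def items_from_paths(paths):
    """Read local PDF files and return a list usable as pdfBase64Items."""
    return [to_base64_item(Path(p).name, Path(p).read_bytes()) for p in paths]
```

Base64 inflates the payload by roughly a third, so for large files pdfUrls may be the lighter option.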

Output

Each PDF produces one dataset item with the following structure:

{
  "filename": "annual-report-2024.pdf",
  "sourceType": "url",
  "sourceUrl": "https://example.com/annual-report-2024.pdf",
  "status": "success",
  "markdown": "# Annual Report 2024\n\n## Executive Summary\n\nThis year marked significant growth across all business units...\n\n## Financial Highlights\n\n| Metric | Q1 | Q2 | Q3 | Q4 |\n|--------|-----|-----|-----|-----|\n| Revenue | $12M | $14M | $15M | $18M |\n| Profit | $2M | $3M | $3.5M | $4M |\n\n## Strategic Initiatives\n\n### Digital Transformation\n\nOur investment in AI-powered solutions delivered...",
  "pageCount": 24,
  "markdownLength": 45230,
  "images": [
    "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/figure-1.png",
    "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/chart-revenue.png",
    "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/logo.jpg"
  ],
  "imageCount": 3,
  "jsonContent": null,
  "error": null,
  "errorDetails": null,
  "processingTimeMs": 32450,
  "timestamp": "2025-01-15T10:30:00.000Z"
}

Output Fields

| Field | Type | Description |
|-------|------|-------------|
| filename | string | Original filename |
| sourceType | string | Input source: url, upload, or base64 |
| sourceUrl | string | Source URL (if applicable) |
| status | string | success or error |
| markdown | string | Extracted Markdown content |
| pageCount | number | Number of pages in PDF |
| markdownLength | number | Character count of markdown |
| images | array | List of public URLs for extracted images |
| imageCount | number | Number of extracted images |
| jsonContent | object | Raw extraction metadata (full mode only) |
| error | string | User-friendly error message (if failed) |
| errorDetails | string | Additional error context |
| processingTimeMs | number | Processing duration in milliseconds |
| timestamp | string | ISO 8601 timestamp |

Error Output Example

{
  "filename": "encrypted-doc.pdf",
  "sourceType": "url",
  "sourceUrl": "https://example.com/encrypted-doc.pdf",
  "status": "error",
  "markdown": null,
  "pageCount": null,
  "markdownLength": 0,
  "images": [],
  "imageCount": 0,
  "jsonContent": null,
  "error": "PDF is password-protected",
  "errorDetails": null,
  "processingTimeMs": 1250,
  "timestamp": "2025-01-15T10:31:00.000Z"
}
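
Because a batch can mix success and error items, it can help to separate them before further processing. The sketch below works on plain dicts with the fields shown above. The function names are hypothetical, and null values such as pageCount on failed items are treated as zero.

```python
def split_results(items):
    """Separate dataset items into (succeeded, failed) by their status field."""
    succeeded = [item for item in items if item.get("status") == "success"]
    failed = [item for item in items if item.get("status") != "success"]
    return succeeded, failed


def summarize(items):
    """Return counts, total pages, and error messages keyed by filename."""
    succeeded, failed = split_results(items)
    return {
        "succeeded": len(succeeded),
        "failed": len(failed),
        # pageCount may be null, so fall back to 0
        "pages": sum(item.get("pageCount") or 0 for item in succeeded),
        "errors": {item["filename"]: item.get("error") for item in failed},
    }
```

The errors mapping gives a quick list of files to inspect or re-submit.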

API Integration

Python

from apify_client import ApifyClient

client = ApifyClient("your_api_token")

run_input = {
    "pdfUrls": [
        "https://example.com/report-q1.pdf",
        "https://example.com/report-q2.pdf",
    ],
    "outputMode": "markdown_images",
    "concurrency": 3,
}

# Start the Actor and wait for the run to finish
run = client.actor("your-username/pdf-to-markdown").call(run_input=run_input)

# Fetch results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["status"] == "success":
        print(f"Processed: {item['filename']}")
        print(f"Pages: {item['pageCount']}")
        print(f"Markdown length: {item['markdownLength']} chars")
        # Save markdown to file (UTF-8 so non-ASCII text is written correctly)
        with open(f"{item['filename']}.md", "w", encoding="utf-8") as f:
            f.write(item["markdown"])
    else:
        print(f"Failed: {item['filename']} - {item['error']}")

JavaScript / TypeScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_api_token' });

const input = {
    pdfUrls: [
        'https://example.com/report-q1.pdf',
        'https://example.com/report-q2.pdf',
    ],
    outputMode: 'markdown_images',
    concurrency: 3,
};

const run = await client.actor('your-username/pdf-to-markdown').call(input);
const { items } = await client.dataset(run.defaultDatasetId).listItems();

for (const item of items) {
    if (item.status === 'success') {
        console.log(`Processed: ${item.filename}`);
        console.log(`Pages: ${item.pageCount}`);
        console.log(`Images: ${item.imageCount}`);
    } else {
        console.log(`Failed: ${item.filename} - ${item.error}`);
    }
}

cURL

curl -X POST "https://api.apify.com/v2/acts/your-username~pdf-to-markdown/runs?token=your_api_token" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrls": ["https://example.com/document.pdf"],
    "outputMode": "markdown"
  }'
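
The same request can be prepared from Python's standard library. The sketch below only builds the request object and does not send it; pass the result to `urllib.request.urlopen` to start a run. It sends the token in an `Authorization: Bearer` header rather than the query string, which the Apify API also accepts and which keeps the token out of URLs and logs. The function name is hypothetical, and the Actor ID is the placeholder used in the cURL example.

```python
import json
import urllib.request


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that starts an Actor run."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "your-username~pdf-to-markdown",  # placeholder Actor ID from the cURL example
    "your_api_token",
    {"pdfUrls": ["https://example.com/document.pdf"], "outputMode": "markdown"},
)
```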

Technical Requirements

| Requirement | Value |
|-------------|-------|
| Memory | 512 MB |
| Processing Time | 20-45 seconds per PDF |
| Max Queue Wait | 2 minutes |
| Max Processing Time | 5 minutes per PDF |
| Concurrency | 1-10 parallel PDFs |
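
These figures allow a rough wall-clock estimate for a batch. The sketch below assumes PDFs are processed in full waves of `concurrency` at a time and that each takes the typical 20-45 seconds. It ignores queue wait, retries, and slow outliers, so treat the result as an order-of-magnitude guide only. The function name is hypothetical.

```python
import math


def estimate_batch_seconds(pdf_count, concurrency=3, per_pdf_seconds=(20, 45)):
    """Return a (low, high) estimate in seconds for processing a batch."""
    if not 1 <= concurrency <= 10:
        raise ValueError("concurrency must be between 1 and 10")
    # Number of sequential "waves" when `concurrency` PDFs run side by side
    waves = math.ceil(pdf_count / concurrency)
    low, high = per_pdf_seconds
    return waves * low, waves * high
```

Under these assumptions, 100 PDFs at concurrency 5 need 20 waves, or about 400 to 900 seconds.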

Supported PDF Types

  • Standard text PDFs
  • Scanned documents (via OCR)
  • Multi-column layouts
  • Tables and forms
  • Academic papers with formulas
  • Reports with charts and figures

Limitations

  • Password-protected PDFs are not supported
  • Maximum recommended file size: 50 MB per PDF
  • Very complex layouts may have reduced accuracy

FAQ

What types of PDFs can this Actor process?

This Actor handles most PDF types including standard text documents, scanned images (via OCR), multi-column layouts, academic papers, financial reports, and documents with tables and formulas.

How long does processing take?

Most PDFs complete in 20-45 seconds. Complex documents with many pages or images may take longer. The Actor has a 5-minute timeout per PDF.

Can I process scanned documents?

Yes! The Actor includes OCR (Optical Character Recognition) that works with scanned PDFs. Use the language parameter to improve accuracy for non-English documents.

What languages are supported for OCR?

Eight languages: English, Chinese (Simplified and Traditional), Japanese, Korean, Tamil, Telugu, and Kannada.

How are images stored?

When using markdown_images or full mode, extracted images are stored in Apify's Key-Value Store. The output contains public URLs that remain accessible as long as your storage retention allows.

What happens if a PDF fails to process?

The Actor continues processing other PDFs and reports failures in the output. Each item has a status field (success or error) and an error field with a user-friendly message.

Can I process PDFs via API without uploading files?

Yes! Use the pdfBase64Items parameter to submit base64-encoded PDF content directly, or use pdfUrls to provide URLs that the Actor will fetch.

Is there a free trial?

Yes, Apify offers free platform credits for new users. You can test the Actor with sample PDFs before committing to paid usage.

How do I handle large batches efficiently?

Increase the concurrency parameter (up to 10) to process more PDFs in parallel. For very large batches, consider splitting into multiple runs.
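
Splitting a long URL list into several run inputs can be sketched as follows. The batch size of 100 is an arbitrary assumption, not a documented limit, and both function names are hypothetical.

```python
def chunk(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def build_batch_inputs(pdf_urls, batch_size=100, output_mode="markdown", concurrency=10):
    """Return one Actor input dict per batch of URLs."""
    return [
        {"pdfUrls": batch, "outputMode": output_mode, "concurrency": concurrency}
        for batch in chunk(list(pdf_urls), batch_size)
    ]
```

Each returned dict can be passed as the input of a separate run, so a failure in one run does not hold up the others.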

What's the difference between output modes?

  • markdown: Text only, smallest output, fastest
  • markdown_images: Text + image URLs, good for full document conversion
  • full: Everything including raw JSON metadata, best for analysis/debugging

Data Export

Export your results in multiple formats:

  • JSON — Full structured data for programmatic access
  • CSV — Spreadsheet-compatible format
  • Excel — Direct import to Microsoft Excel
  • XML — Legacy system integration

Automation

  • Scheduled runs — Process PDFs on a recurring schedule
  • Webhooks — Get notified when processing completes
  • API integration — Trigger runs from your application
  • Apify integrations — Connect with Zapier, Make, and more

Support

  • Issues & Bugs: Use the Issues tab on this Actor's page
  • Feature Requests: Open an issue or contact via email
  • Email: max@mapa.slmail.me
  • Response Time: Usually within 24 hours

This Actor processes documents that you provide. You are responsible for:

  • Having the right to process the documents you submit
  • Complying with applicable data protection regulations (GDPR, CCPA, etc.)
  • Ensuring processed content doesn't violate any terms of service

The Actor does not store your PDFs beyond the processing duration.


Start Converting PDFs to Markdown Now


Transform your document workflows with accurate, AI-powered PDF extraction.