PDF to Markdown Converter - AI-Powered with OCR & Tables
Pricing
Pay per event
Convert PDFs to clean Markdown with GPU-accelerated AI. Extracts tables, LaTeX formulas, and images from complex layouts. Supports OCR for scanned docs in 8 languages. Batch process hundreds of PDFs in parallel via URL, upload, or API.
Developer: ClearPath
PDF to Markdown
The most accurate PDF to Markdown converter on Apify — AI-powered with GPU acceleration for complex layouts, tables, formulas, and images.
Convert any PDF document into clean, structured Markdown with intelligent layout detection, OCR for scanned documents, and optional image extraction. Built for developers who need reliable PDF processing at scale.
- Complex layouts — Multi-column documents, academic papers, financial reports
- Table extraction — Preserves table structure in Markdown format
- Formula support — Mathematical equations converted to LaTeX
- Batch processing — Process hundreds of PDFs in parallel
- Multiple input methods — File upload, URLs, or base64 API calls
Key Features
Document Processing
- Intelligent layout detection — Handles single and multi-column layouts automatically
- Table recognition — Extracts tables with proper Markdown formatting
- Formula extraction — Converts mathematical formulas to LaTeX notation
- Image extraction — Optionally extract embedded images with public URLs
- OCR support — Process scanned PDFs with 8 language options
Developer-Friendly
- Batch processing — Submit multiple PDFs in a single run
- Parallel execution — Configurable concurrency (1-10 simultaneous PDFs)
- Three output modes — Choose between text-only, with images, or full extraction
- Structured output — Consistent JSON schema for easy integration
- Public image URLs — Images stored in Apify Key-Value Store, not base64 blobs
Reliability
- Automatic retries — Exponential backoff for transient failures
- Partial success — Continues processing if individual PDFs fail
- Detailed status — Per-document success/error reporting
- Processing metrics — Page count, markdown length, processing time
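The retry behaviour listed above can be pictured with a small client-side sketch. This is not the Actor's internal code: the helper name `retry_with_backoff`, the delay values, and the jitter are all assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt seconds, then retry.

    Hypothetical helper: the attempt count and delays are illustrative defaults,
    not values taken from the Actor.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits between attempts are roughly 1, 2, and 4 seconds. The `sleep` parameter exists only so the waiting can be replaced in tests.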
Use Cases
For RAG Pipeline Developers
- Prepare documents for LLM retrieval — Convert PDFs to clean text for embedding
- Build knowledge bases — Extract structured content from document libraries
- Enhance chatbot context — Feed processed documents into AI assistants
- Create searchable archives — Transform PDF collections into queryable text
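For the embedding use case above, one common preparation step is to split the returned Markdown into sections at headings. The sketch below is one possible approach, not part of the Actor; `split_by_headings` is a hypothetical helper, and it does not special-case `#` lines that appear inside code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks = []
    title, lines = None, []

    def flush():
        # Skip an empty preamble, but keep headings even if their body is empty.
        if title is not None or any(line.strip() for line in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded separately, with the heading kept as metadata for retrieval.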
For Data Engineers
- Document migration — Convert legacy PDF archives to Markdown
- Content pipelines — Automate PDF processing in data workflows
- ETL integration — Extract text data for downstream processing
- Compliance archival — Create text-based backups of PDF documents
For Researchers
- Extract tables from papers — Pull data tables from academic PDFs
- Process formulas — Convert mathematical notation to LaTeX
- Batch analysis — Process entire paper collections
- Figure extraction — Capture charts and diagrams with metadata
Quick Start
Basic — Single PDF URL
    {
      "pdfUrls": ["https://example.com/document.pdf"]
    }
Advanced — Batch with Images
    {
      "pdfUrls": [
        "https://example.com/report-q1.pdf",
        "https://example.com/report-q2.pdf",
        "https://example.com/report-q3.pdf"
      ],
      "outputMode": "markdown_images",
      "concurrency": 5
    }
Complete — All Parameters
    {
      "pdfFile": null,
      "pdfUrls": ["https://example.com/document.pdf"],
      "pdfBase64Items": [
        {
          "filename": "uploaded-doc.pdf",
          "data": "JVBERi0xLjQKJeLjz9..."
        }
      ],
      "outputMode": "full",
      "language": "en",
      "concurrency": 3
    }
Pricing — Pay Per Event (PPE)
Transparent pay-per-PDF pricing based on output mode:
| Output Mode | Price per PDF | Description |
|---|---|---|
| markdown | $0.02 | Text-only extraction |
| markdown_images | $0.03 | Text + extracted images stored in KV |
| full | $0.04 | Text + images + raw JSON metadata |
Cost Examples
| PDFs | Output Mode | Total Cost |
|---|---|---|
| 10 | markdown | $0.20 |
| 50 | markdown | $1.00 |
| 100 | markdown | $2.00 |
| 100 | markdown_images | $3.00 |
| 100 | full | $4.00 |
| 500 | markdown | $10.00 |
| 1,000 | markdown | $20.00 |
| 1,000 | markdown_images | $30.00 |
| 1,000 | full | $40.00 |
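The figures in the table follow from multiplying the PDF count by the per-PDF price. A minimal sketch of that arithmetic, assuming every submitted PDF is billed at the listed rate (the source does not say whether failed PDFs are charged); `estimate_cost` is a hypothetical helper name:

```python
from decimal import Decimal

# Per-PDF prices copied from the pricing table above.
PRICE_PER_PDF = {
    "markdown": Decimal("0.02"),
    "markdown_images": Decimal("0.03"),
    "full": Decimal("0.04"),
}


def estimate_cost(pdf_count: int, output_mode: str = "markdown") -> Decimal:
    """Return the estimated cost in USD for pdf_count PDFs in the given mode."""
    if pdf_count < 0:
        raise ValueError("pdf_count must be non-negative")
    if output_mode not in PRICE_PER_PDF:
        raise ValueError(f"unknown output mode: {output_mode!r}")
    # Decimal avoids float rounding noise such as 0.03 * 100 != 3.0.
    return PRICE_PER_PDF[output_mode] * pdf_count
```

For example, `estimate_cost(100, "markdown_images")` gives 3.00, matching the corresponding table row.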
Cost Optimization Tips
- Use markdown mode if you don't need images
- Filter PDFs before submission to avoid processing irrelevant documents
- Start with lower concurrency and scale up as needed
Input Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| pdfFile | file upload | - | No* | Upload a single PDF file via the Apify UI |
| pdfUrls | string[] | - | No* | Array of URLs pointing to PDF files |
| pdfBase64Items | object[] | - | No* | Array of base64-encoded PDFs with filenames |
| outputMode | enum | markdown | No | Output format: markdown, markdown_images, or full |
| language | enum | en | No | Language hint for OCR accuracy |
| concurrency | integer | 3 | No | Parallel processing (1-10) |
*At least one PDF source is required (pdfFile, pdfUrls, or pdfBase64Items).
Output Modes
| Mode | Markdown | Images | JSON Content | Best For |
|---|---|---|---|---|
| markdown | Yes | No | No | Text extraction, RAG pipelines |
| markdown_images | Yes | Yes (URLs) | No | Full document conversion |
| full | Yes | Yes (URLs) | Yes | Analysis, debugging |
Supported Languages
| Code | Language |
|---|---|
| en | English |
| ch | Chinese (Simplified) |
| chinese_cht | Chinese (Traditional) |
| japan | Japanese |
| korean | Korean |
| ta | Tamil |
| te | Telugu |
| ka | Kannada |
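The constraints in the tables above (at least one PDF source, three output modes, eight language codes, concurrency between 1 and 10) can be checked locally before starting a run. The sketch below is a hypothetical client-side check, not the Actor's own validation, and the message wording is an assumption.

```python
VALID_OUTPUT_MODES = {"markdown", "markdown_images", "full"}
VALID_LANGUAGES = {"en", "ch", "chinese_cht", "japan", "korean", "ta", "te", "ka"}
SOURCE_KEYS = ("pdfFile", "pdfUrls", "pdfBase64Items")


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in run_input; an empty list means it looks valid."""
    problems = []

    # At least one of the three sources must be present and non-empty.
    if not any(run_input.get(key) for key in SOURCE_KEYS):
        problems.append("provide at least one of pdfFile, pdfUrls, or pdfBase64Items")

    # Missing optional fields fall back to the documented defaults.
    mode = run_input.get("outputMode", "markdown")
    if mode not in VALID_OUTPUT_MODES:
        problems.append(f"outputMode must be one of {sorted(VALID_OUTPUT_MODES)}")

    language = run_input.get("language", "en")
    if language not in VALID_LANGUAGES:
        problems.append(f"language must be one of {sorted(VALID_LANGUAGES)}")

    concurrency = run_input.get("concurrency", 3)
    is_int = isinstance(concurrency, int) and not isinstance(concurrency, bool)
    if not is_int or not 1 <= concurrency <= 10:
        problems.append("concurrency must be an integer from 1 to 10")

    return problems
```

Returning every problem at once, rather than raising on the first, makes it easier to fix a batch configuration in one pass.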
Base64 Input Format
For API integration, use pdfBase64Items:
    {
      "pdfBase64Items": [
        {
          "filename": "invoice-001.pdf",
          "data": "JVBERi0xLjQKJeLjz9MKNSAwIG9iago8PC..."
        },
        {
          "filename": "contract-2024.pdf",
          "data": "JVBERi0xLjUKJeLjz9MKMSAwIG9iago8PC..."
        }
      ]
    }
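A payload in this shape can be built from local files with Python's standard library. This is a minimal sketch: `build_base64_items` is a hypothetical helper, and the `%PDF-` header check is an added sanity test rather than something the Actor requires. It assumes plain base64 with no data-URI prefix, which matches the `JVBERi0x...` values shown above (the encoding of `%PDF-1`).

```python
import base64
from pathlib import Path


def build_base64_items(paths) -> dict:
    """Read each PDF from disk and return an input dict with pdfBase64Items."""
    items = []
    for path in map(Path, paths):
        data = path.read_bytes()
        # Every PDF file starts with the bytes "%PDF-"; reject anything else early.
        if not data.startswith(b"%PDF-"):
            raise ValueError(f"{path} does not look like a PDF")
        items.append({
            "filename": path.name,
            "data": base64.b64encode(data).decode("ascii"),
        })
    return {"pdfBase64Items": items}
```

The returned dict can be merged with `outputMode`, `language`, and `concurrency` before it is passed as the run input.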
Output
Each PDF produces one dataset item with the following structure:
    {
      "filename": "annual-report-2024.pdf",
      "sourceType": "url",
      "sourceUrl": "https://example.com/annual-report-2024.pdf",
      "status": "success",
      "markdown": "# Annual Report 2024\n\n## Executive Summary\n\nThis year marked significant growth across all business units...\n\n## Financial Highlights\n\n| Metric | Q1 | Q2 | Q3 | Q4 |\n|--------|-----|-----|-----|-----|\n| Revenue | $12M | $14M | $15M | $18M |\n| Profit | $2M | $3M | $3.5M | $4M |\n\n## Strategic Initiatives\n\n### Digital Transformation\n\nOur investment in AI-powered solutions delivered...",
      "pageCount": 24,
      "markdownLength": 45230,
      "images": [
        "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/figure-1.png",
        "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/chart-revenue.png",
        "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/logo.jpg"
      ],
      "imageCount": 3,
      "jsonContent": null,
      "error": null,
      "errorDetails": null,
      "processingTimeMs": 32450,
      "timestamp": "2025-01-15T10:30:00.000Z"
    }
Output Fields
| Field | Type | Description |
|---|---|---|
| filename | string | Original filename |
| sourceType | string | Input source: url, upload, or base64 |
| sourceUrl | string | Source URL (if applicable) |
| status | string | success or error |
| markdown | string | Extracted Markdown content |
| pageCount | number | Number of pages in PDF |
| markdownLength | number | Character count of markdown |
| images | array | List of public URLs for extracted images |
| imageCount | number | Number of extracted images |
| jsonContent | object | Raw extraction metadata (full mode only) |
| error | string | User-friendly error message (if failed) |
| errorDetails | string | Additional error context |
| processingTimeMs | number | Processing duration in milliseconds |
| timestamp | string | ISO 8601 timestamp |
Error Output Example
    {
      "filename": "encrypted-doc.pdf",
      "sourceType": "url",
      "sourceUrl": "https://example.com/encrypted-doc.pdf",
      "status": "error",
      "markdown": null,
      "pageCount": null,
      "markdownLength": 0,
      "images": [],
      "imageCount": 0,
      "jsonContent": null,
      "error": "PDF is password-protected",
      "errorDetails": null,
      "processingTimeMs": 1250,
      "timestamp": "2025-01-15T10:31:00.000Z"
    }
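Because a run can mix successful and failed items, it can help to summarize the dataset before using it. The sketch below relies only on the documented `status`, `pageCount`, `filename`, and `error` fields; `summarize_results` is a hypothetical helper, and it takes already-fetched items so it works without network access.

```python
def summarize_results(items) -> dict:
    """Count successes and failures and collect error messages by filename."""
    items = list(items)
    succeeded = [item for item in items if item.get("status") == "success"]
    failed = [item for item in items if item.get("status") != "success"]
    return {
        "succeeded": len(succeeded),
        "failed": len(failed),
        # pageCount is null on failed items, so only successful ones are summed.
        "totalPages": sum(item.get("pageCount") or 0 for item in succeeded),
        "errors": {item["filename"]: item.get("error") for item in failed},
    }
```

Applied to the two example items in this section, it reports one success with 24 pages and one failure keyed by `encrypted-doc.pdf`.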
API Integration
Python
    from apify_client import ApifyClient

    client = ApifyClient("your_api_token")

    run_input = {
        "pdfUrls": [
            "https://example.com/report-q1.pdf",
            "https://example.com/report-q2.pdf",
        ],
        "outputMode": "markdown_images",
        "concurrency": 3,
    }

    run = client.actor("your-username/pdf-to-markdown").call(run_input=run_input)

    # Fetch results
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        if item["status"] == "success":
            print(f"Processed: {item['filename']}")
            print(f"Pages: {item['pageCount']}")
            print(f"Markdown length: {item['markdownLength']} chars")
            # Save markdown to file (UTF-8 so non-English text is written correctly)
            with open(f"{item['filename']}.md", "w", encoding="utf-8") as f:
                f.write(item["markdown"])
        else:
            print(f"Failed: {item['filename']} - {item['error']}")
JavaScript / TypeScript
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'your_api_token' });

    const input = {
        pdfUrls: [
            'https://example.com/report-q1.pdf',
            'https://example.com/report-q2.pdf',
        ],
        outputMode: 'markdown_images',
        concurrency: 3,
    };

    const run = await client.actor('your-username/pdf-to-markdown').call(input);

    const { items } = await client.dataset(run.defaultDatasetId).listItems();

    for (const item of items) {
        if (item.status === 'success') {
            console.log(`Processed: ${item.filename}`);
            console.log(`Pages: ${item.pageCount}`);
            console.log(`Images: ${item.imageCount}`);
        } else {
            console.log(`Failed: ${item.filename} - ${item.error}`);
        }
    }
cURL
    curl -X POST "https://api.apify.com/v2/acts/your-username~pdf-to-markdown/runs?token=your_api_token" \
      -H "Content-Type: application/json" \
      -d '{"pdfUrls": ["https://example.com/document.pdf"], "outputMode": "markdown"}'
Technical Requirements
| Requirement | Value |
|---|---|
| Memory | 512 MB |
| Processing Time | 20-45 seconds per PDF |
| Max Queue Wait | 2 minutes |
| Max Processing Time | 5 minutes per PDF |
| Concurrency | 1-10 parallel PDFs |
Supported PDF Types
- Standard text PDFs
- Scanned documents (via OCR)
- Multi-column layouts
- Tables and forms
- Academic papers with formulas
- Reports with charts and figures
Limitations
- Password-protected PDFs are not supported
- Maximum recommended file size: 50 MB per PDF
- Very complex layouts may have reduced accuracy
FAQ
What types of PDFs can this Actor process?
This Actor handles most PDF types including standard text documents, scanned images (via OCR), multi-column layouts, academic papers, financial reports, and documents with tables and formulas.
How long does processing take?
Most PDFs complete in 20-45 seconds. Complex documents with many pages or images may take longer. The Actor has a 5-minute timeout per PDF.
Can I process scanned documents?
Yes! The Actor includes OCR (Optical Character Recognition) that works with scanned PDFs. Use the language parameter to improve accuracy for non-English documents.
What languages are supported for OCR?
Eight languages: English, Chinese (Simplified and Traditional), Japanese, Korean, Tamil, Telugu, and Kannada.
How are images stored?
When using markdown_images or full mode, extracted images are stored in Apify's Key-Value Store. The output contains public URLs that remain accessible as long as your storage retention allows.
What happens if a PDF fails to process?
The Actor continues processing other PDFs and reports failures in the output. Each item has a status field (success or error) and an error field with a user-friendly message.
Can I process PDFs via API without uploading files?
Yes! Use the pdfBase64Items parameter to submit base64-encoded PDF content directly, or use pdfUrls to provide URLs that the Actor will fetch.
Is there a free trial?
Yes, Apify offers free platform credits for new users. You can test the Actor with sample PDFs before committing to paid usage.
How do I handle large batches efficiently?
Increase the concurrency parameter (up to 10) to process more PDFs in parallel. For very large batches, consider splitting into multiple runs.
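Splitting a large list of URLs into several run inputs can be sketched as below. The batch size of 100 is an arbitrary assumption, not a documented limit, and `build_run_inputs` is a hypothetical helper; each resulting dict would be passed to a separate Actor run.

```python
def chunked(values: list, batch_size: int) -> list[list]:
    """Split values into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def build_run_inputs(urls, batch_size=100, output_mode="markdown", concurrency=10) -> list[dict]:
    """Build one run input per batch of PDF URLs, all sharing the same settings."""
    return [
        {"pdfUrls": batch, "outputMode": output_mode, "concurrency": concurrency}
        for batch in chunked(list(urls), batch_size)
    ]
```

For 250 URLs and the default batch size this yields three run inputs, holding 100, 100, and 50 URLs.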
What's the difference between output modes?
- markdown: Text only, smallest output, fastest
- markdown_images: Text + image URLs, good for full document conversion
- full: Everything including raw JSON metadata, best for analysis/debugging
Data Export
Export your results in multiple formats:
- JSON — Full structured data for programmatic access
- CSV — Spreadsheet-compatible format
- Excel — Direct import to Microsoft Excel
- XML — Legacy system integration
Automation
- Scheduled runs — Process PDFs on a recurring schedule
- Webhooks — Get notified when processing completes
- API integration — Trigger runs from your application
- Apify integrations — Connect with Zapier, Make, and more
Support
- Issues & Bugs: Use the Issues tab on this Actor's page
- Feature Requests: Open an issue or contact via email
- Email: max@mapa.slmail.me
- Response Time: Usually within 24 hours
Legal Compliance
This Actor processes documents that you provide. You are responsible for:
- Having the right to process the documents you submit
- Complying with applicable data protection regulations (GDPR, CCPA, etc.)
- Ensuring processed content doesn't violate any terms of service
The Actor does not store your PDFs beyond the processing duration.
Start Converting PDFs to Markdown Now
Transform your document workflows with accurate, AI-powered PDF extraction.
