Pricing

Pay per event

PDF to Markdown Converter - AI-Powered with OCR & Tables

Convert PDFs to clean Markdown with GPU-accelerated AI. Extracts tables, LaTeX formulas, and images from complex layouts. Supports OCR for scanned docs in 8 languages. Batch process hundreds of PDFs in parallel via URL, upload, or API.

Developer

ClearPath

Last modified: 9 days ago

PDF to Markdown

The most accurate PDF to Markdown converter on Apify — AI-powered with GPU acceleration for complex layouts, tables, formulas, and images.

Convert any PDF document into clean, structured Markdown with intelligent layout detection, OCR for scanned documents, and optional image extraction. Built for developers who need reliable PDF processing at scale.

Complex layouts — Multi-column documents, academic papers, financial reports
Table extraction — Preserves table structure in Markdown format
Formula support — Mathematical equations converted to LaTeX
Batch processing — Process hundreds of PDFs in parallel
Multiple input methods — File upload, URLs, or base64 API calls

Key Features

Document Processing

Intelligent layout detection — Handles single and multi-column layouts automatically
Table recognition — Extracts tables with proper Markdown formatting
Formula extraction — Converts mathematical formulas to LaTeX notation
Image extraction — Optionally extract embedded images with public URLs
OCR support — Process scanned PDFs with 8 language options

Developer-Friendly

Batch processing — Submit multiple PDFs in a single run
Parallel execution — Configurable concurrency (1-10 simultaneous PDFs)
Three output modes — Choose between text-only, with images, or full extraction
Structured output — Consistent JSON schema for easy integration
Public image URLs — Images stored in Apify Key-Value Store, not base64 blobs

Reliability

Automatic retries — Exponential backoff for transient failures
Partial success — Continues processing if individual PDFs fail
Detailed status — Per-document success/error reporting
Processing metrics — Page count, markdown length, processing time
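
The retries listed above happen inside the Actor, and the exact schedule is not documented here. As a rough illustration of what exponential backoff means, the sketch below computes a delay schedule. The helper name `backoff_delays` and the base, factor, and cap values are assumptions chosen for the example, not the Actor's actual settings.

```python
def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> list[float]:
    """Return the wait time in seconds before each retry attempt.

    Hypothetical helper: base, factor and cap are illustrative values,
    not the Actor's real configuration.
    """
    # Each delay is the previous one multiplied by `factor`, limited to `cap`.
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


if __name__ == "__main__":
    print(backoff_delays(4))
```

The same pattern can be reused on the client side when a call to start a run fails for a transient reason such as a network timeout.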

Use Cases

For RAG Pipeline Developers

Prepare documents for LLM retrieval — Convert PDFs to clean text for embedding
Build knowledge bases — Extract structured content from document libraries
Enhance chatbot context — Feed processed documents into AI assistants
Create searchable archives — Transform PDF collections into queryable text

For Data Engineers

Document migration — Convert legacy PDF archives to Markdown
Content pipelines — Automate PDF processing in data workflows
ETL integration — Extract text data for downstream processing
Compliance archival — Create text-based backups of PDF documents

For Researchers

Extract tables from papers — Pull data tables from academic PDFs
Process formulas — Convert mathematical notation to LaTeX
Batch analysis — Process entire paper collections
Figure extraction — Capture charts and diagrams with metadata

Quick Start

Basic — Single PDF URL

{
  "pdfUrls": ["https://example.com/document.pdf"]
}

Advanced — Batch with Images

{
  "pdfUrls": [
    "https://example.com/report-q1.pdf",
    "https://example.com/report-q2.pdf",
    "https://example.com/report-q3.pdf"
  ],
  "outputMode": "markdown_images",
  "concurrency": 5
}

Complete — All Parameters

{
  "pdfFile": null,
  "pdfUrls": ["https://example.com/document.pdf"],
  "pdfBase64Items": [
    {
      "filename": "uploaded-doc.pdf",
      "data": "JVBERi0xLjQKJeLjz9..."
    }
  ],
  "outputMode": "full",
  "language": "en",
  "concurrency": 3
}

Pricing — Pay Per Event (PPE)

Transparent pay-per-PDF pricing based on output mode:

Output Mode	Price per PDF	Description
`markdown`	$0.02	Text-only extraction
`markdown_images`	$0.03	Text + extracted images stored in KV
`full`	$0.04	Text + images + raw JSON metadata

Cost Examples

PDFs	Output Mode	Total Cost
10	`markdown`	$0.20
50	`markdown`	$1.00
100	`markdown`	$2.00
100	`markdown_images`	$3.00
100	`full`	$4.00
500	`markdown`	$10.00
1,000	`markdown`	$20.00
1,000	`markdown_images`	$30.00
1,000	`full`	$40.00
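
The totals in the table are simply the per-PDF price multiplied by the number of PDFs. A minimal sketch of that arithmetic follows; `estimate_cost` is a hypothetical helper, and it assumes every submitted PDF is billed at the listed rate and that no other platform charges apply.

```python
from decimal import Decimal

# Per-PDF prices copied from the pricing table above.
PRICE_PER_PDF = {
    "markdown": Decimal("0.02"),
    "markdown_images": Decimal("0.03"),
    "full": Decimal("0.04"),
}


def estimate_cost(pdf_count: int, output_mode: str = "markdown") -> Decimal:
    """Estimate the pay-per-event charge in USD for a batch (hypothetical helper)."""
    if pdf_count < 0:
        raise ValueError("pdf_count must be non-negative")
    if output_mode not in PRICE_PER_PDF:
        raise ValueError(f"unknown output mode: {output_mode!r}")
    # Decimal avoids the rounding noise that binary floats add to money values.
    return PRICE_PER_PDF[output_mode] * pdf_count


if __name__ == "__main__":
    print(estimate_cost(100, "markdown_images"))
```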

Cost Optimization Tips

Use markdown mode if you don't need images
Filter PDFs before submission to avoid processing irrelevant documents
Start with lower concurrency and scale up as needed

Input Parameters

Parameter	Type	Default	Required	Description
`pdfFile`	file upload	-	No*	Upload a single PDF file via the Apify UI
`pdfUrls`	string[]	-	No*	Array of URLs pointing to PDF files
`pdfBase64Items`	object[]	-	No*	Array of base64-encoded PDFs with filenames
`outputMode`	enum	`markdown`	No	Output format: `markdown`, `markdown_images`, or `full`
`language`	enum	`en`	No	Language hint for OCR accuracy
`concurrency`	integer	`3`	No	Parallel processing (1-10)

*At least one PDF source is required (pdfFile, pdfUrls, or pdfBase64Items).
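
Checking the input locally before starting a run can catch mistakes early. The sketch below encodes the constraints from the table above (one PDF source required, three output modes, concurrency from 1 to 10) together with the language codes listed under Supported Languages. `validate_run_input` is a hypothetical helper, and the Actor's own validation may differ.

```python
ALLOWED_MODES = {"markdown", "markdown_images", "full"}
ALLOWED_LANGUAGES = {"en", "ch", "chinese_cht", "japan", "korean", "ta", "te", "ka"}
SOURCE_KEYS = ("pdfFile", "pdfUrls", "pdfBase64Items")


def validate_run_input(run_input: dict) -> list[str]:
    """Return a list of problems found in run_input (empty if none were found)."""
    problems = []
    # At least one PDF source must be present and non-empty.
    if not any(run_input.get(key) for key in SOURCE_KEYS):
        problems.append("one of pdfFile, pdfUrls or pdfBase64Items is required")
    if run_input.get("outputMode", "markdown") not in ALLOWED_MODES:
        problems.append(f"outputMode must be one of {sorted(ALLOWED_MODES)}")
    if run_input.get("language", "en") not in ALLOWED_LANGUAGES:
        problems.append(f"language must be one of {sorted(ALLOWED_LANGUAGES)}")
    concurrency = run_input.get("concurrency", 3)
    # bool is a subclass of int in Python, so reject it explicitly.
    is_int = isinstance(concurrency, int) and not isinstance(concurrency, bool)
    if not is_int or not 1 <= concurrency <= 10:
        problems.append("concurrency must be an integer from 1 to 10")
    return problems


if __name__ == "__main__":
    print(validate_run_input({"pdfUrls": ["https://example.com/document.pdf"]}))
```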

Output Modes

Mode	Markdown	Images	JSON Content	Best For
`markdown`	Yes	No	No	Text extraction, RAG pipelines
`markdown_images`	Yes	Yes (URLs)	No	Full document conversion
`full`	Yes	Yes (URLs)	Yes	Analysis, debugging

Supported Languages

Code	Language
`en`	English
`ch`	Chinese (Simplified)
`chinese_cht`	Chinese (Traditional)
`japan`	Japanese
`korean`	Korean
`ta`	Tamil
`te`	Telugu
`ka`	Kannada

Base64 Input Format

For API integration, use pdfBase64Items:

{
  "pdfBase64Items": [
    {
      "filename": "invoice-001.pdf",
      "data": "JVBERi0xLjQKJeLjz9MKNSAwIG9iago8PC..."
    },
    {
      "filename": "contract-2024.pdf",
      "data": "JVBERi0xLjUKJeLjz9MKMSAwIG9iago8PC..."
    }
  ]
}
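
A `pdfBase64Items` entry can be built from raw PDF bytes with the standard library. This is a minimal sketch: `make_base64_item` is a hypothetical helper, and it assumes the `data` field takes plain standard base64 with no data-URI prefix, which matches the `JVBERi0...` prefix in the examples above (the encoding of `%PDF-`).

```python
import base64


def make_base64_item(filename: str, pdf_bytes: bytes) -> dict:
    """Build one pdfBase64Items entry from raw PDF bytes (hypothetical helper)."""
    return {
        "filename": filename,
        # b64encode returns bytes; decode to str so the item is JSON-serializable.
        "data": base64.b64encode(pdf_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    # Placeholder bytes for illustration, not a complete PDF document.
    item = make_base64_item("invoice-001.pdf", b"%PDF-1.4\n")
    print(item["data"])
```

In practice the bytes would come from reading a file in binary mode, for example `Path("invoice-001.pdf").read_bytes()`.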

Output

Each PDF produces one dataset item with the following structure:

{
  "filename": "annual-report-2024.pdf",
  "sourceType": "url",
  "sourceUrl": "https://example.com/annual-report-2024.pdf",
  "status": "success",
  "markdown": "# Annual Report 2024\n\n## Executive Summary\n\nThis year marked significant growth across all business units...\n\n## Financial Highlights\n\n| Metric | Q1 | Q2 | Q3 | Q4 |\n|--------|-----|-----|-----|-----|\n| Revenue | $12M | $14M | $15M | $18M |\n| Profit | $2M | $3M | $3.5M | $4M |\n\n## Strategic Initiatives\n\n### Digital Transformation\n\nOur investment in AI-powered solutions delivered...",
  "pageCount": 24,
  "markdownLength": 45230,
  "images": [
    "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/figure-1.png",
    "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/chart-revenue.png",
    "https://api.apify.com/v2/key-value-stores/abc123/records/annual-report-2024/logo.jpg"
  ],
  "imageCount": 3,
  "jsonContent": null,
  "error": null,
  "errorDetails": null,
  "processingTimeMs": 32450,
  "timestamp": "2025-01-15T10:30:00.000Z"
}

Output Fields

Field	Type	Description
`filename`	string	Original filename
`sourceType`	string	Input source: `url`, `upload`, or `base64`
`sourceUrl`	string	Source URL (if applicable)
`status`	string	`success` or `error`
`markdown`	string	Extracted Markdown content
`pageCount`	number	Number of pages in PDF
`markdownLength`	number	Character count of markdown
`images`	array	List of public URLs for extracted images
`imageCount`	number	Number of extracted images
`jsonContent`	object	Raw extraction metadata (full mode only)
`error`	string	User-friendly error message (if failed)
`errorDetails`	string	Additional error context
`processingTimeMs`	number	Processing duration in milliseconds
`timestamp`	string	ISO 8601 timestamp

Error Output Example

{
  "filename": "encrypted-doc.pdf",
  "sourceType": "url",
  "sourceUrl": "https://example.com/encrypted-doc.pdf",
  "status": "error",
  "markdown": null,
  "pageCount": null,
  "markdownLength": 0,
  "images": [],
  "imageCount": 0,
  "jsonContent": null,
  "error": "PDF is password-protected",
  "errorDetails": null,
  "processingTimeMs": 1250,
  "timestamp": "2025-01-15T10:31:00.000Z"
}
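
Because every PDF yields one item with a `status` field, a batch can be summarized by splitting items on that field. The sketch below works on plain dictionaries shaped like the examples above; `summarize_items` is a hypothetical helper, not part of the Actor or the Apify client.

```python
def summarize_items(items: list[dict]) -> dict:
    """Summarize dataset items by status (hypothetical helper)."""
    succeeded = [item for item in items if item.get("status") == "success"]
    failed = [item for item in items if item.get("status") != "success"]
    return {
        "succeeded": len(succeeded),
        "failed": len(failed),
        # pageCount is null for failed items, so only successes are summed.
        "pages": sum(item.get("pageCount") or 0 for item in succeeded),
        "errors": {item.get("filename"): item.get("error") for item in failed},
    }


if __name__ == "__main__":
    sample = [
        {"filename": "annual-report-2024.pdf", "status": "success", "pageCount": 24},
        {"filename": "encrypted-doc.pdf", "status": "error", "pageCount": None,
         "error": "PDF is password-protected"},
    ]
    print(summarize_items(sample))
```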

API Integration

Python

from apify_client import ApifyClient

client = ApifyClient("your_api_token")

run_input = {
    "pdfUrls": [
        "https://example.com/report-q1.pdf",
        "https://example.com/report-q2.pdf",
    ],
    "outputMode": "markdown_images",
    "concurrency": 3,
}

run = client.actor("your-username/pdf-to-markdown").call(run_input=run_input)

# Fetch results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["status"] == "success":
        print(f"Processed: {item['filename']}")
        print(f"Pages: {item['pageCount']}")
        print(f"Markdown length: {item['markdownLength']} chars")
        # Save markdown to file (UTF-8 so non-Latin OCR output is written correctly)
        with open(f"{item['filename']}.md", "w", encoding="utf-8") as f:
            f.write(item["markdown"])
    else:
        print(f"Failed: {item['filename']} - {item['error']}")

JavaScript / TypeScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_api_token' });

const input = {
    pdfUrls: [
        'https://example.com/report-q1.pdf',
        'https://example.com/report-q2.pdf',
    ],
    outputMode: 'markdown_images',
    concurrency: 3,
};

const run = await client.actor('your-username/pdf-to-markdown').call(input);

const { items } = await client.dataset(run.defaultDatasetId).listItems();

for (const item of items) {
    if (item.status === 'success') {
        console.log(`Processed: ${item.filename}`);
        console.log(`Pages: ${item.pageCount}`);
        console.log(`Images: ${item.imageCount}`);
    } else {
        console.log(`Failed: ${item.filename} - ${item.error}`);
    }
}

cURL

curl -X POST "https://api.apify.com/v2/acts/your-username~pdf-to-markdown/runs?token=your_api_token" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrls": ["https://example.com/document.pdf"],
    "outputMode": "markdown"
  }'

Technical Requirements

Requirement	Value
Memory	512 MB
Processing Time	20-45 seconds per PDF
Max Queue Wait	2 minutes
Max Processing Time	5 minutes per PDF
Concurrency	1-10 parallel PDFs
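
The figures above allow a back-of-the-envelope estimate of how long a batch might take. The sketch below is a simplification, not a guarantee: it assumes PDFs are processed in full waves of `concurrency` documents, that each PDF takes the typical 20-45 seconds, and that queue wait is zero. `estimate_batch_seconds` is a hypothetical helper.

```python
import math


def estimate_batch_seconds(pdf_count: int, concurrency: int = 3,
                           per_pdf_range: tuple[int, int] = (20, 45)) -> tuple[int, int]:
    """Return a rough (low, high) wall-clock estimate in seconds (hypothetical helper)."""
    if not 1 <= concurrency <= 10:
        raise ValueError("concurrency must be between 1 and 10")
    # Number of sequential "waves" needed when `concurrency` PDFs run at once.
    waves = math.ceil(pdf_count / concurrency)
    low, high = per_pdf_range
    return waves * low, waves * high


if __name__ == "__main__":
    print(estimate_batch_seconds(100, concurrency=5))
```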

Supported PDF Types

Standard text PDFs
Scanned documents (via OCR)
Multi-column layouts
Tables and forms
Academic papers with formulas
Reports with charts and figures

Limitations

Password-protected PDFs are not supported
Maximum recommended file size: 50 MB per PDF
Very complex layouts may have reduced accuracy

FAQ

What types of PDFs can this Actor process?

This Actor handles most PDF types including standard text documents, scanned images (via OCR), multi-column layouts, academic papers, financial reports, and documents with tables and formulas.

How long does processing take?

Most PDFs complete in 20-45 seconds. Complex documents with many pages or images may take longer. The Actor has a 5-minute timeout per PDF.

Can I process scanned documents?

Yes! The Actor includes OCR (Optical Character Recognition) that works with scanned PDFs. Use the language parameter to improve accuracy for non-English documents.

What languages are supported for OCR?

Eight languages: English, Chinese (Simplified and Traditional), Japanese, Korean, Tamil, Telugu, and Kannada.

How are images stored?

When using markdown_images or full mode, extracted images are stored in Apify's Key-Value Store. The output contains public URLs that remain accessible as long as your storage retention allows.

What happens if a PDF fails to process?

The Actor continues processing other PDFs and reports failures in the output. Each item has a status field (success or error) and an error field with a user-friendly message.

Can I process PDFs via API without uploading files?

Yes! Use the pdfBase64Items parameter to submit base64-encoded PDF content directly, or use pdfUrls to provide URLs that the Actor will fetch.

Is there a free trial?

Yes, Apify offers free platform credits for new users. You can test the Actor with sample PDFs before committing to paid usage.

How do I handle large batches efficiently?

Increase the concurrency parameter (up to 10) to process more PDFs in parallel. For very large batches, consider splitting into multiple runs.
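
Splitting a large batch into multiple runs amounts to chunking the URL list. A minimal sketch, assuming a chunk size of 100 (an arbitrary choice for illustration, since no per-run limit is stated here) and a hypothetical helper named `chunked`:

```python
from collections.abc import Iterator


def chunked(urls: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


if __name__ == "__main__":
    all_urls = [f"https://example.com/doc-{n}.pdf" for n in range(250)]
    for batch in chunked(all_urls, 100):
        # Each batch would become the pdfUrls value of a separate run.
        print(len(batch))
```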

What's the difference between output modes?

markdown: Text only, smallest output, fastest
markdown_images: Text + image URLs, good for full document conversion
full: Everything including raw JSON metadata, best for analysis/debugging

Data Export

Export your results in multiple formats:

JSON — Full structured data for programmatic access
CSV — Spreadsheet-compatible format
Excel — Direct import to Microsoft Excel
XML — Legacy system integration

Automation

Scheduled runs — Process PDFs on a recurring schedule
Webhooks — Get notified when processing completes
API integration — Trigger runs from your application
Apify integrations — Connect with Zapier, Make, and more

Support

Issues & Bugs: Use the Issues tab on this Actor's page
Feature Requests: Open an issue or contact via email
Email: max@mapa.slmail.me
Response Time: Usually within 24 hours

Legal Compliance

This Actor processes documents that you provide. You are responsible for:

Having the right to process the documents you submit
Complying with applicable data protection regulations (GDPR, CCPA, etc.)
Ensuring processed content doesn't violate any terms of service

The Actor does not store your PDFs beyond the processing duration.

Start Converting PDFs to Markdown Now

Transform your document workflows with accurate, AI-powered PDF extraction.
