PDF to Markdown Converter

Pricing

$4.00/month + usage

Convert PDFs to clean Markdown with optional OCR for scanned documents. Uses PDF.js for text extraction and Tesseract.js for optical character recognition.

Developer

Web Harvester

Maintained by Community

Convert PDFs to clean Markdown with optional OCR for scanned documents. Lightweight alternative to heavy document processing tools.

Features

  • Fast Text Extraction: Uses PDF.js for native text PDFs
  • OCR Support: Tesseract.js for scanned/image documents
  • Smart Mode: The combined mode auto-detects the best extraction method for each page
  • Layout Preservation: Maintains document structure
  • Multi-language OCR: 14+ languages supported
  • Batch Processing: Convert multiple PDFs at once

Input

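| Parameter      | Type    | Default | Description                                 |
|----------------|---------|---------|---------------------------------------------|
| file           | string  | -       | Upload a PDF file                           |
| pdfUrls        | array   | -       | URLs of PDFs to convert                     |
| mode           | string  | "quick" | Extraction mode (quick, ocr, or combined)   |
| language       | string  | "eng"   | OCR language                                |
| preserveLayout | boolean | true    | Preserve document structure                 |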
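
As a rough illustration of how these parameters fit together, the Python sketch below assembles an input object and rejects modes other than the three documented ones. The helper name `build_input` and its checks are assumptions made for this example; the Actor may validate its input differently.

```python
import json

# Modes documented in the "Extraction Modes" section.
VALID_MODES = {"quick", "ocr", "combined"}


def build_input(pdf_urls, mode="quick", language="eng", preserve_layout=True):
    """Build an input dict using the parameter names from the table above.

    Hypothetical helper: the checks are illustrative, not the Actor's own.
    """
    if not pdf_urls:
        raise ValueError("pdf_urls must contain at least one URL")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return {
        "pdfUrls": list(pdf_urls),
        "mode": mode,
        "language": language,
        "preserveLayout": preserve_layout,
    }


if __name__ == "__main__":
    actor_input = build_input(["https://example.com/document.pdf"], mode="combined")
    # The resulting JSON can be passed to the Actor as its input.
    print(json.dumps(actor_input, indent=2))
```

Rejecting an unknown mode before starting a run avoids paying for a run that would fail or silently fall back to a default.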

Extraction Modes

  • quick: Fast extraction using PDF.js - best for native text PDFs
  • ocr: Tesseract OCR - use for scanned documents or images
  • combined: Auto-detects per page - uses OCR when text extraction fails
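
The per-page decision in combined mode can be pictured with a small Python sketch. This is not the Actor's actual logic: the character threshold `min_chars` and the "ocr" label are assumptions chosen for illustration. Only the "pdf.js" label appears in the documented output.

```python
def choose_method(extracted_text, min_chars=20):
    """Pick an extraction method for a single page.

    Assumption: a page that yields fewer than `min_chars` non-whitespace
    characters from text extraction is treated as scanned and sent to OCR.
    """
    visible = "".join(extracted_text.split())
    return "pdf.js" if len(visible) >= min_chars else "ocr"


def plan_pages(page_texts, min_chars=20):
    """Return one method label per page, in page order."""
    return [choose_method(text, min_chars) for text in page_texts]


if __name__ == "__main__":
    pages = [
        "Chapter 1\nThis page has a real text layer.",  # native text page
        "",                                              # scanned page, no text layer
        "   \n  ",                                       # whitespace only
    ]
    print(plan_pages(pages))
```

A whitespace-insensitive character count is used so that pages containing only stray spaces or line breaks still fall through to OCR.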

Output

Results are saved to the dataset:

{
  "status": "success",
  "fileName": "document.pdf",
  "pdfUrl": "https://...",
  "markdown": "# Document Title\n\nContent here...",
  "pageCount": 5,
  "extractionMethod": "pdf.js",
  "characterCount": 12345
}
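
One way to consume these results is to write each item's `markdown` field to its own file. The Python sketch below assumes the dataset items have already been loaded as a list of dicts shaped like the example above. The helper name `save_markdown` and the file-naming scheme are illustrative choices, not part of the Actor.

```python
import re
from pathlib import Path


def save_markdown(items, out_dir):
    """Write the `markdown` field of each successful item to a .md file.

    Items whose `status` is not "success" are skipped.
    Returns the list of paths written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, item in enumerate(items):
        if item.get("status") != "success":
            continue
        # Derive a safe file name from `fileName`; fall back to the item index.
        stem = Path(item.get("fileName") or f"document-{index}").stem
        safe_stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem) or f"document-{index}"
        target = out_path / f"{safe_stem}.md"
        target.write_text(item.get("markdown", ""), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    sample = [{
        "status": "success",
        "fileName": "document.pdf",
        "markdown": "# Document Title\n\nContent here...",
    }]
    for path in save_markdown(sample, "converted"):
        print(path)
```

Two items with the same `fileName` would overwrite each other in this sketch; adding the index to the file name is a simple way around that.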

Use Cases

  1. LLM Preprocessing: Convert PDFs for AI/RAG pipelines
  2. Documentation Migration: Convert PDF docs to Markdown
  3. Content Extraction: Pull text from reports and papers
  4. Accessibility: Make PDF content more accessible
  5. Archive Conversion: Convert legacy PDFs to modern format

Supported Languages (OCR)

  • English, French, German, Spanish, Italian
  • Portuguese, Dutch, Polish, Russian
  • Chinese (Simplified/Traditional)
  • Japanese, Korean, Arabic
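
The default `language` value "eng" is a standard Tesseract language code, so the other languages are presumably selected the same way. The Python sketch below maps the names listed above to Tesseract's standard codes. Whether the Actor accepts each of these codes unchanged is an assumption, and `language_code` is a hypothetical helper.

```python
# Standard Tesseract language codes for the languages listed above.
# Assumption: the Actor's `language` parameter accepts these codes as-is.
LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Italian": "ita",
    "Portuguese": "por",
    "Dutch": "nld",
    "Polish": "pol",
    "Russian": "rus",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
}


def language_code(name):
    """Look up the Tesseract code for a language name from the list above."""
    try:
        return LANGUAGE_CODES[name]
    except KeyError:
        raise ValueError(f"unsupported OCR language: {name!r}") from None


if __name__ == "__main__":
    print(language_code("German"))
```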

Example

# Using Apify CLI
apify run -i '{
  "pdfUrls": ["https://example.com/document.pdf"],
  "mode": "combined",
  "language": "eng"
}'

Technical Notes

  • Quick mode is 10-50x faster than OCR
  • OCR quality depends on scan quality and resolution
  • Combined mode adds overhead because each page is analyzed before an extraction method is chosen
  • Large PDFs may require more memory
  • Some complex layouts may not convert perfectly