Pricing

$4.00/month + usage

PDF to Markdown Converter

Convert PDFs to clean Markdown with optional OCR for scanned documents. Uses PDF.js for text extraction and Tesseract.js for optical character recognition.

Developer

Web Harvester

Last modified: 5 months ago

Features

Fast Text Extraction: Uses PDF.js for native text PDFs
OCR Support: Tesseract.js for scanned/image documents
Smart Mode: The `combined` mode auto-detects the best extraction method for each page
Layout Preservation: Maintains document structure
Multi-language OCR: 14+ languages supported
Batch Processing: Convert multiple PDFs at once

Input

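| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `file` | string | - | Upload a PDF file |
| `pdfUrls` | array | - | URLs of PDFs to convert |
| `mode` | string | "quick" | Extraction mode: `quick`, `ocr`, or `combined` |
| `language` | string | "eng" | OCR language code |
| `preserveLayout` | boolean | true | Preserve document structure |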
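
As a rough illustration of how these parameters fit together, the following Python sketch assembles an input object using the parameter names and defaults from the table. The helper `build_input` and its validation rules are hypothetical; the Actor may accept or reject inputs differently.

```python
import json

# Mode names as listed under "Extraction Modes".
VALID_MODES = {"quick", "ocr", "combined"}


def build_input(pdf_urls, mode="quick", language="eng", preserve_layout=True):
    """Assemble an input dict with the documented parameter names and defaults."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not pdf_urls:
        # Assumption: this sketch only models URL input, not the `file` upload.
        raise ValueError("at least one PDF URL is required")
    return {
        "pdfUrls": list(pdf_urls),
        "mode": mode,
        "language": language,
        "preserveLayout": preserve_layout,
    }


run_input = build_input(["https://example.com/document.pdf"], mode="combined")
print(json.dumps(run_input, indent=4))
```

The printed JSON can be passed as the Actor input, for example on the command line.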

Extraction Modes

quick: Fast extraction using PDF.js - best for native text PDFs
ocr: Tesseract OCR - use for scanned documents or images
combined: Auto-detects per page - uses OCR when text extraction fails
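
The per-page decision that combined mode describes could look roughly like the sketch below. This is one plausible heuristic, not the Actor's actual logic: the `min_chars` threshold, the function names, and the `"ocr"` method label are assumptions.

```python
def choose_method(extracted_text, min_chars=20):
    """Pick an extraction method for one page.

    Returns "pdf.js" when the native text layer yields enough characters,
    otherwise "ocr". The threshold is an assumed value for illustration.
    """
    usable = len(extracted_text.strip()) >= min_chars
    return "pdf.js" if usable else "ocr"


def plan_pages(page_texts, min_chars=20):
    """Return the chosen method for every page, in page order."""
    return [choose_method(text, min_chars) for text in page_texts]


# One page with a text layer, one empty page, one page with only whitespace.
pages = ["Chapter 1\n\nThis page has a native text layer.", "", "  \n "]
print(plan_pages(pages))  # ['pdf.js', 'ocr', 'ocr']
```

Checking each page separately is what lets a mixed document (typed pages plus scanned inserts) avoid OCR on the pages that do not need it.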

Output

Results are saved to the dataset:

{
    "status": "success",
    "fileName": "document.pdf",
    "pdfUrl": "https://...",
    "markdown": "# Document Title\n\nContent here...",
    "pageCount": 5,
    "extractionMethod": "pdf.js",
    "characterCount": 12345
}
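
A minimal sketch of consuming such dataset items in Python, assuming they have already been downloaded as a list of dicts. The helper name `collect_markdown` and the treatment of any non-`"success"` status as a failure are assumptions; only the field names come from the example above.

```python
def collect_markdown(items):
    """Map each successful item's file name (as .md) to its Markdown text."""
    converted = {}
    for item in items:
        if item.get("status") != "success":
            continue  # assumption: any other status means the conversion failed
        name = item.get("fileName", "document.pdf")
        stem = name[:-4] if name.lower().endswith(".pdf") else name
        converted[stem + ".md"] = item.get("markdown", "")
    return converted


items = [
    {
        "status": "success",
        "fileName": "document.pdf",
        "markdown": "# Document Title\n\nContent here...",
        "pageCount": 5,
    },
    {"status": "error", "fileName": "broken.pdf"},  # hypothetical failure record
]
result = collect_markdown(items)
print(sorted(result))  # ['document.md']
```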

Use Cases

LLM Preprocessing: Convert PDFs for AI/RAG pipelines
Documentation Migration: Convert PDF docs to Markdown
Content Extraction: Pull text from reports and papers
Accessibility: Make PDF content more accessible
Archive Conversion: Convert legacy PDFs to modern format

Supported Languages (OCR)

English, French, German, Spanish, Italian
Portuguese, Dutch, Polish, Russian
Chinese (Simplified/Traditional)
Japanese, Korean, Arabic
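
Since the `language` default is "eng", the parameter presumably takes Tesseract language codes. The mapping below pairs each listed language with its standard Tesseract code; whether the Actor accepts exactly these values is an assumption, and `language_code` is a hypothetical helper.

```python
# Standard Tesseract traineddata codes for the languages listed above.
OCR_LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Italian": "ita",
    "Portuguese": "por",
    "Dutch": "nld",
    "Polish": "pol",
    "Russian": "rus",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
}


def language_code(name):
    """Look up the code for a language name, failing loudly on unknown names."""
    try:
        return OCR_LANGUAGE_CODES[name]
    except KeyError:
        raise ValueError(f"unsupported OCR language: {name!r}") from None


print(language_code("German"))  # deu
```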

Example

# Using Apify CLI
apify run -i '{
    "pdfUrls": ["https://example.com/document.pdf"],
    "mode": "combined",
    "language": "eng"
}'

Technical Notes

Quick mode is 10-50x faster than OCR
OCR quality depends on scan quality and resolution
Combined mode adds overhead because each page is analyzed before an extraction method is chosen
Large PDFs may require more memory
Some complex layouts may not convert perfectly
