PDF OCR API - Document Extraction

Pricing

from $0.01 / 1,000 results

Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.

Developer: The Howlers (maintained by Community)

PDF OCR API

Extract text from PDF files using OCR (Tesseract). Supports scanned documents, images, and multi-page PDFs. Returns structured text with page numbers and confidence scores.

Features

  • OCR extraction via Tesseract (eng, spa, fra, deu supported)
  • Multi-page PDFs with parallel processing (4 pages at a time)
  • Page range selection (e.g., "1-5" or "1,3,5")
  • Output formats: JSON, plain text, or Markdown
  • Webhook support for async workflows
  • 100MB file size limit, 500 page limit
  • Pay-per-page billing — demo mode is free

Quick Start

{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "eng",
  "outputFormat": "json"
}

Demo Mode

Set demoMode: true to test with sample data at no charge. When you are ready for real results, set demoMode: false and provide a PDF URL or a Base64-encoded PDF.

Input Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| pdfUrl | string | Yes* | URL of the PDF file to process |
| pdfBase64 | string | Yes* | Base64-encoded PDF (alternative to URL) |
| language | string | No | OCR language: eng, spa, fra, deu (default: eng) |
| pageRange | string | No | Pages to process (e.g., "1-5" or "1,3,5") |
| outputFormat | string | No | Output format: text, json, markdown |
| detectTables | boolean | No | Attempt to preserve table structure |
| webhookUrl | string | No | Webhook URL for async results |
| demoMode | boolean | No | Return sample output without processing (free) |

*Either pdfUrl or pdfBase64 is required
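The parameters above can be assembled and checked in code before starting a run. The following is a minimal sketch using only the Python standard library. The helper name `build_input` is hypothetical, and two of its rules are assumptions of this sketch rather than documented behavior: it rejects input that supplies both a URL and file bytes, and it allows demo mode without any PDF.

```python
import base64
import json

# Values taken from the parameter table above.
SUPPORTED_LANGUAGES = {"eng", "spa", "fra", "deu"}
OUTPUT_FORMATS = {"text", "json", "markdown"}


def build_input(pdf_url=None, pdf_bytes=None, language="eng",
                output_format="json", page_range=None,
                detect_tables=False, demo_mode=False) -> dict:
    """Build the input object for a run (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    # Assumption: treat URL and Base64 as mutually exclusive.
    if pdf_url is not None and pdf_bytes is not None:
        raise ValueError("provide pdf_url or pdf_bytes, not both")
    # Assumption: demo mode returns sample data, so it needs no PDF.
    if pdf_url is None and pdf_bytes is None and not demo_mode:
        raise ValueError("either pdf_url or pdf_bytes is required")

    payload = {"language": language, "outputFormat": output_format}
    if pdf_url is not None:
        payload["pdfUrl"] = pdf_url
    if pdf_bytes is not None:
        # pdfBase64 expects the raw file bytes encoded as Base64 text.
        payload["pdfBase64"] = base64.b64encode(pdf_bytes).decode("ascii")
    if page_range:
        payload["pageRange"] = page_range
    if detect_tables:
        payload["detectTables"] = True
    if demo_mode:
        payload["demoMode"] = True
    return payload


print(json.dumps(build_input(pdf_url="https://example.com/document.pdf"), indent=2))
```

The resulting dictionary can be passed as the run input through whichever client you use to start the Actor.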

Output Format

{
  "success": true,
  "fileName": "document.pdf",
  "totalPages": 5,
  "processedPages": 5,
  "language": "eng",
  "processingTime": 2.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "This is the extracted text from page 1...",
      "confidence": 95.2,
      "wordCount": 342,
      "hasImages": true
    }
  ],
  "fullText": "Complete document text concatenated...",
  "wordCount": 1250,
  "averageConfidence": 94.5
}
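Because each page carries its own confidence score, a consumer can flag pages that may need manual review. The sketch below assumes the JSON shape shown above. The function name `low_confidence_pages`, the threshold of 80, and the second page in the sample data are illustrative assumptions, not values from the service.

```python
def low_confidence_pages(result: dict, threshold: float = 80.0) -> list[int]:
    """Return the page numbers whose OCR confidence is below the threshold.

    The default threshold is an arbitrary choice for this sketch.
    """
    return [
        page["pageNumber"]
        for page in result.get("pages", [])
        if page["confidence"] < threshold
    ]


# Sample data in the documented shape; page 2 is invented for illustration.
sample_result = {
    "success": True,
    "fileName": "document.pdf",
    "pages": [
        {"pageNumber": 1, "text": "This is the extracted text from page 1...",
         "confidence": 95.2, "wordCount": 342, "hasImages": True},
        {"pageNumber": 2, "text": "Faint scanned text...",
         "confidence": 61.0, "wordCount": 120, "hasImages": True},
    ],
}

print(low_confidence_pages(sample_result))
```

Flagged pages are natural candidates for a rerun with a better scan or a different language setting.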

Pricing

Pay-per-page billing:

  • $0.10 per page processed with OCR
  • Demo mode is free (no charge)
  • Only charged for pages that produce output

Cost Examples

| Document | Pages | Cost |
|---|---|---|
| Single invoice | 1 | $0.10 |
| 10-page contract | 10 | $1.00 |
| 50-page report | 50 | $5.00 |
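The cost examples follow directly from the $0.10 per-page rate listed above. A minimal sketch of the arithmetic, using integer cents to avoid floating-point rounding; the function name `estimate_cost` is hypothetical. Since only pages that produce output are charged, the result is best read as an upper bound.

```python
PRICE_PER_PAGE_CENTS = 10  # $0.10 per page, as listed under Pricing


def estimate_cost(pages: int) -> str:
    """Return the maximum OCR charge for a page count, formatted in dollars."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    cents = pages * PRICE_PER_PAGE_CENTS
    return f"${cents // 100}.{cents % 100:02d}"


for page_count in (1, 10, 50):
    print(page_count, estimate_cost(page_count))
```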

Use Cases

  • Invoice processing: Extract line items, totals, dates
  • Document digitization: Make scanned documents searchable
  • Contract analysis: Extract key terms and clauses
  • Academic papers: Extract text from PDF research papers
  • Legal discovery: Process document sets at scale
  • Form processing: Pull data from scanned forms

Common Problems

Empty or low-confidence results

Cause: The PDF may consist mostly of images with poor scan quality. Fix: Rescan the source document at a higher resolution (DPI), and make sure the language parameter matches the language of the document.

Timeout errors

Cause: Very large or complex pages can exceed processing limits. Fix: Use pageRange to process specific pages instead of the entire document.
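One way to apply the pageRange fix is to split a large document into several smaller runs. The sketch below generates pageRange strings for consecutive batches. The function name `page_range_batches` and the default batch size of 25 are assumptions chosen for illustration; a suitable size depends on how complex the pages are.

```python
def page_range_batches(total_pages: int, batch_size: int = 25) -> list[str]:
    """Split pages 1..total_pages into consecutive pageRange strings."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be at least 1")
    batches = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # Emit a bare page number when the batch holds a single page.
        batches.append(str(start) if start == end else f"{start}-{end}")
    return batches


print(page_range_batches(60))
```

Each string can be submitted as the pageRange of a separate run, and the per-page results merged afterwards by pageNumber.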

Demo data showing instead of real results

Cause: demoMode is still set to true. Fix: Set demoMode: false and provide a pdfUrl or pdfBase64.


Built by John Rippy