PDF OCR Tool

Pricing

from $4.00 / 1,000 page ocrs

Extract text from scanned PDFs and images using Tesseract OCR. Multi-page support with configurable DPI and language selection (11 languages; see Supported Languages). Output as plain text, markdown, or structured JSON, with an optional per-page breakdown.

Developer

junipr

Maintained by Community

Extract text from scanned PDFs and image-based documents using built-in Tesseract.js OCR — no API keys required, no external services, runs entirely on Apify.

Overview

Most PDF text extractors fail silently on scanned documents: the PDF looks normal but contains images instead of selectable text. This actor solves that with a two-stage pipeline:

  1. Smart detection — first attempts direct text extraction (fast, free). If the PDF is text-based, it returns results immediately.
  2. OCR fallback — if the PDF is scanned or image-based (fewer than 20 chars/page), it renders each page using a headless Chrome browser and runs Tesseract.js OCR on the resulting images.

Every result includes per-page confidence scores, so you can gauge how reliable each extraction was.

Features

  • No API keys — Tesseract.js runs entirely inside the actor. Zero external dependencies, zero cost per call to a third-party API.
  • 11 languages — English, French, German, Spanish, Italian, Portuguese, Simplified Chinese, Japanese, Korean, Arabic, Russian.
  • Smart detection — text-based PDFs take the fast path (direct extraction). Only scanned PDFs incur the OCR overhead.
  • Confidence scores — every page reports an OCR confidence value (0–100). Text-extracted pages always score 100.
  • Extraction method — each result reports text-extraction, ocr, or hybrid so you know what happened.
  • Batch processing — supply as many PDF URLs as needed. Configurable concurrency keeps memory usage in check.
  • Metadata extraction — title, author, subject, creator, producer, creation date, modification date.
  • Multiple output formats — plain text, markdown, or structured JSON.
  • Page-by-page output — optional pages array with per-page text, char count, confidence, and method.

Input

All fields have defaults — run with zero configuration using the built-in sample PDF.

| Field | Type | Default | Description |
|---|---|---|---|
| pdfUrls | array | W3C sample PDF | List of {url, label} objects to process |
| language | string | eng | Tesseract OCR language code |
| outputFormat | string | text | text, markdown, or json |
| extractMetadata | boolean | true | Include PDF metadata in output |
| pageByPage | boolean | true | Include per-page breakdown with confidence scores |
| maxPages | integer | 0 (all) | Limit pages per PDF (0 = no limit) |
| dpi | integer | 300 | Rendering resolution for OCR (higher = better accuracy, slower) |
| maxConcurrency | integer | 2 | Parallel PDFs (keep low; OCR is CPU-heavy) |
| requestTimeout | integer | 120000 | Download timeout in milliseconds |
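
As a minimal sketch of how these fields combine, the following Python snippet merges caller overrides onto the documented defaults and applies a few sanity checks. The `DEFAULTS` dict and the `build_input` helper are hypothetical names used for illustration, not part of the actor. The `pdfUrls` default is omitted because the sample PDF's URL is not given in this README, and the range checks are assumptions.

```python
import json

# Defaults copied from the input table. pdfUrls is left out because the
# built-in sample URL is not stated in this README.
DEFAULTS = {
    "language": "eng",
    "outputFormat": "text",
    "extractMetadata": True,
    "pageByPage": True,
    "maxPages": 0,
    "dpi": 300,
    "maxConcurrency": 2,
    "requestTimeout": 120000,
}

OUTPUT_FORMATS = {"text", "markdown", "json"}


def build_input(pdf_urls, **overrides):
    """Hypothetical helper: merge overrides onto the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    run_input = {**DEFAULTS, **overrides, "pdfUrls": list(pdf_urls)}
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError("outputFormat must be text, markdown, or json")
    # Assumed sanity checks; the actor's own validation may differ.
    if run_input["maxPages"] < 0 or run_input["dpi"] <= 0:
        raise ValueError("maxPages must be >= 0 and dpi must be positive")
    return run_input


example = build_input(
    [{"url": "https://example.com/report.pdf", "label": "Q4 Report"}],
    dpi=150,
    maxConcurrency=1,
)
print(json.dumps(example, indent=2))
```

The resulting dict is the kind of object you would paste into the actor's input editor or pass as run input through an API client.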

Output

Each dataset item corresponds to one PDF:

{
  "url": "https://example.com/report.pdf",
  "label": "Q4 Report",
  "fileName": "report.pdf",
  "method": "ocr",
  "metadata": {
    "title": "Annual Report 2023",
    "author": "Acme Corp",
    "creationDate": "2024-01-15"
  },
  "text": "Full extracted text here...",
  "pageCount": 12,
  "averageConfidence": 94.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page text here...",
      "charCount": 847,
      "confidence": 96.1,
      "method": "ocr"
    }
  ],
  "extractedAt": "2024-03-11T12:00:00.000Z",
  "errors": []
}
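
To illustrate how a consumer might read an item like this, the Python sketch below recomputes a mean confidence and a character total from the `pages` array. The `summarize_item` helper is a hypothetical name, and treating the average as an unweighted mean of the per-page values is an assumption, since this README does not say how the actor computes `averageConfidence`.

```python
from statistics import mean


def summarize_item(item):
    """Hypothetical helper: derive simple totals from one dataset item."""
    pages = item.get("pages", [])  # the pages array is optional (pageByPage)
    confidences = [p["confidence"] for p in pages]
    return {
        "url": item["url"],
        "method": item["method"],
        "pages_listed": len(pages),
        "total_chars": sum(p["charCount"] for p in pages),
        # Assumption: unweighted mean; the actor may weight pages differently.
        "mean_confidence": round(mean(confidences), 1) if confidences else None,
        "had_errors": bool(item.get("errors")),
    }


item = {
    "url": "https://example.com/report.pdf",
    "method": "ocr",
    "pages": [
        {"pageNumber": 1, "text": "...", "charCount": 847, "confidence": 96.0, "method": "ocr"},
        {"pageNumber": 2, "text": "...", "charCount": 153, "confidence": 91.0, "method": "ocr"},
    ],
    "errors": [],
}
print(summarize_item(item))
```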

Extraction Methods

| Method | When used |
|---|---|
| text-extraction | PDF contains embedded text (≥20 chars/page average) |
| ocr | PDF is scanned or image-based; all pages processed with Tesseract |
| hybrid | Mixed document: some pages had text, others needed OCR |
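
The three labels can be sketched as a small classifier over per-page character counts. This is only an illustration under stated assumptions: the README gives the 20-character threshold as a per-document average, while the sketch applies it page by page so that a mixed document can come out as hybrid. `classify_method` is a hypothetical name, and the actor's real decision logic may differ.

```python
MIN_CHARS_PER_PAGE = 20  # threshold quoted in this README


def classify_method(chars_per_page):
    """Assumed rule: a page with >= 20 embedded characters counts as text."""
    if not chars_per_page:
        raise ValueError("document has no pages")
    has_text = [n >= MIN_CHARS_PER_PAGE for n in chars_per_page]
    if all(has_text):
        return "text-extraction"  # every page had an embedded text layer
    if not any(has_text):
        return "ocr"  # no usable text anywhere, OCR every page
    return "hybrid"  # some pages had text, others need OCR


for counts in ([1200, 950, 1010], [0, 3, 0], [1200, 0, 980]):
    print(counts, "->", classify_method(counts))
```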

Supported Languages

eng (English), fra (French), deu (German), spa (Spanish), ita (Italian), por (Portuguese), chi_sim (Simplified Chinese), jpn (Japanese), kor (Korean), ara (Arabic), rus (Russian).

Language data is downloaded from Tesseract's CDN on first use. Subsequent runs on the same build reuse the cached data automatically.

Performance & Cost

  • Text-based PDFs process quickly, typically within seconds per document.
  • Scanned PDFs require rendering + OCR — expect 5–30 seconds per page depending on resolution and document complexity.
  • Set dpi: 150 for faster processing when accuracy is less critical. Use dpi: 300–600 for small or dense text.
  • Set maxConcurrency: 1 for large batches if you hit memory limits.
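
For planning a batch, the figures above can be turned into rough bounds. The sketch below is back-of-the-envelope arithmetic, not a benchmark: `estimate_ocr_run` is a hypothetical helper, it assumes every page needs OCR, it assumes the pages are spread evenly across at least `concurrency` PDFs (since maxConcurrency counts parallel PDFs, not pages), and it uses the listed starting price of $4.00 per 1,000 OCR'd pages, which may not be the price that applies to your run.

```python
def estimate_ocr_run(pages, concurrency=2, price_per_1000=4.00):
    """Rough bounds from the README figures: 5-30 seconds per OCR page."""
    if pages < 0 or concurrency < 1:
        raise ValueError("pages must be >= 0 and concurrency >= 1")
    # Assumption: work divides evenly across the parallel PDFs.
    low_seconds = pages * 5 / concurrency
    high_seconds = pages * 30 / concurrency
    return {
        "min_minutes": round(low_seconds / 60, 1),
        "max_minutes": round(high_seconds / 60, 1),
        "min_cost_usd": round(pages / 1000 * price_per_1000, 2),
    }


estimate = estimate_ocr_run(pages=200, concurrency=2)
print(estimate)
```

If the upper bound approaches the actor timeout, consider lowering dpi, setting maxPages, or splitting the batch across runs.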

FAQ

Does this work on password-protected PDFs?

No. Password-protected PDFs cannot be parsed without the password, and the actor has no password input. It reports a parse error and returns an empty result for those files.

What DPI should I use?

  • 72–150 DPI: Fast, lower accuracy. Fine for large clear text.
  • 300 DPI (default): Good balance of speed and accuracy for most scanned documents.
  • 400–600 DPI: Best accuracy for small fonts, handwriting, or dense tables. Significantly slower.
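
The speed trade-off follows from simple arithmetic: rendered pixel dimensions scale linearly with DPI, so doubling the DPI quadruples the number of pixels the OCR engine has to process. The sketch below assumes a US Letter page (8.5 × 11 inches); `render_size` is a hypothetical helper, and the actor's actual rendering dimensions depend on each PDF's page size.

```python
def render_size(dpi, width_in=8.5, height_in=11.0):
    """Pixel dimensions of a page rendered at the given DPI (US Letter default)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    megapixels = width_px * height_px / 1e6
    return width_px, height_px, megapixels


for dpi in (150, 300, 600):
    w, h, mp = render_size(dpi)
    print(f"{dpi} DPI: {w} x {h} px (about {mp:.1f} megapixels)")
```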

Why does my PDF show method: text-extraction even though it looks scanned?

Some PDFs embed an invisible text layer over scanned images (common in documents processed by Adobe Acrobat or similar tools). The actor detects this embedded text and uses it directly, which is usually more accurate than re-running OCR on those documents.

Can I process multiple languages in one PDF?

The actor's language input takes one language code per run. For multilingual documents, run the actor once per language and compare the results, or use eng, which handles many Latin-script languages adequately.
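
One way to compare runs is to keep the result with the highest averageConfidence. The sketch below assumes you have already collected one dataset item per language for the same PDF. `pick_best_run` is a hypothetical helper, and confidence is only a proxy: a high score does not guarantee the text is correct.

```python
def pick_best_run(items_by_language):
    """Hypothetical helper: choose the language whose run scored highest."""
    if not items_by_language:
        raise ValueError("no runs to compare")
    # Items without averageConfidence are treated as 0 so they never win.
    return max(
        items_by_language.items(),
        key=lambda pair: pair[1].get("averageConfidence", 0.0),
    )


runs = {
    "eng": {"averageConfidence": 71.2, "text": "..."},
    "deu": {"averageConfidence": 93.8, "text": "..."},
}
language, item = pick_best_run(runs)
print(language, item["averageConfidence"])
```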

What happens if OCR confidence is low?

Low confidence (below roughly 60 on the 0–100 scale) usually means the scan quality is poor, the wrong language is selected, or the document contains complex layouts. Try increasing dpi, selecting the correct language, or pre-processing the PDF to improve image quality.
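
A small filter over the pages array makes it easy to find which pages to re-run or review by hand. This is a sketch only: `low_confidence_pages` is a hypothetical helper, and the cutoff of 60 is the approximate figure from this FAQ, not a fixed rule.

```python
LOW_CONFIDENCE = 60  # approximate cutoff mentioned in this FAQ


def low_confidence_pages(item, threshold=LOW_CONFIDENCE):
    """Return page numbers whose confidence falls below the threshold."""
    return [
        page["pageNumber"]
        for page in item.get("pages", [])
        if page["confidence"] < threshold
    ]


scanned = {
    "pages": [
        {"pageNumber": 1, "confidence": 96.1},
        {"pageNumber": 2, "confidence": 48.5},
        {"pageNumber": 3, "confidence": 59.9},
    ]
}
print(low_confidence_pages(scanned))
```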

Is there a page limit?

The default maxPages is 0 (no limit); set it to a positive number to cap the pages processed per PDF. The actor timeout is 60 minutes, so for very large batches increase the timeout in the run options.

Competitive Advantage

Unlike alternatives that require Google Vision API, OpenAI, or AWS Textract (all paid, all requiring configured API keys), this actor uses Tesseract.js, which is open source, runs locally inside the actor, and has zero per-call API cost. You only pay for Apify compute time.