PDF OCR Tool
Pricing
from $4.00 / 1,000 page ocrs
Extract text from scanned PDFs and images using Tesseract OCR. 11 supported languages, multi-page support. Configurable DPI and language selection. Output as plain text, markdown, or structured JSON per page.
Developer

junipr
Extract text from scanned PDFs and image-based documents using built-in Tesseract.js OCR — no API keys required, no external services, runs entirely on Apify.
Overview
Most PDF text extractors fail silently on scanned documents: the PDF looks normal but contains images instead of selectable text. This actor solves that with a two-stage pipeline:
- Smart detection — first attempts direct text extraction (fast, free). If the PDF is text-based, it returns results immediately.
- OCR fallback — if the PDF is scanned or image-based (fewer than 20 chars/page), it renders each page using a headless Chrome browser and runs Tesseract.js OCR on the resulting images.
Every result includes per-page confidence scores so you know exactly how reliable each extraction was.
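The two-stage pipeline above can be sketched roughly as follows. This is an illustrative sketch, not the actor's actual code: `extract_pages` and the `run_ocr` callback are hypothetical names, and applying the 20-character threshold page by page is an assumption.

```python
from typing import Callable, List, Tuple

MIN_CHARS_PER_PAGE = 20  # threshold quoted in the overview


def extract_pages(
    embedded_texts: List[str],
    run_ocr: Callable[[int], Tuple[str, float]],
) -> List[dict]:
    """Use embedded text where there is enough of it, otherwise fall back to OCR.

    embedded_texts: text found by direct extraction, one string per page.
    run_ocr: hypothetical callback returning (text, confidence) for a page index.
    """
    results = []
    for index, text in enumerate(embedded_texts):
        if len(text.strip()) >= MIN_CHARS_PER_PAGE:
            # Fast path: embedded text is used as-is and scored 100.
            results.append({"pageNumber": index + 1, "text": text,
                            "confidence": 100.0, "method": "text-extraction"})
        else:
            # Slow path: the page is treated as scanned and sent to OCR.
            ocr_text, confidence = run_ocr(index)
            results.append({"pageNumber": index + 1, "text": ocr_text,
                            "confidence": confidence, "method": "ocr"})
    return results


def fake_ocr(index: int) -> Tuple[str, float]:
    # Stand-in for the real render + Tesseract step.
    return (f"recognized text for page {index + 1}", 91.5)


pages = extract_pages(["This page has plenty of embedded text.", ""], fake_ocr)
print([p["method"] for p in pages])
```

Passing the OCR step in as a callback keeps the sketch runnable without a browser or Tesseract installed.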
Features
- No API keys — Tesseract.js runs entirely inside the actor. Zero external dependencies, zero cost per call to a third-party API.
- 11 languages — English, French, German, Spanish, Italian, Portuguese, Simplified Chinese, Japanese, Korean, Arabic, Russian.
- Smart detection — text-based PDFs take the fast path (direct extraction). Only scanned PDFs incur the OCR overhead.
- Confidence scores — every page reports an OCR confidence value (0–100). Text-extracted pages always score 100.
- Extraction method: each result reports `text-extraction`, `ocr`, or `hybrid` so you know what happened.
- Batch processing: supply as many PDF URLs as needed. Configurable concurrency keeps memory usage in check.
- Metadata extraction — title, author, subject, creator, producer, creation date, modification date.
- Multiple output formats — plain text, markdown, or structured JSON.
- Page-by-page output: optional `pages` array with per-page text, char count, confidence, and method.
Input
All fields have defaults — run with zero configuration using the built-in sample PDF.
| Field | Type | Default | Description |
|---|---|---|---|
| `pdfUrls` | array | W3C sample PDF | List of `{url, label}` objects to process |
| `language` | string | `eng` | Tesseract OCR language code |
| `outputFormat` | string | `text` | `text`, `markdown`, or `json` |
| `extractMetadata` | boolean | `true` | Include PDF metadata in output |
| `pageByPage` | boolean | `true` | Include per-page breakdown with confidence scores |
| `maxPages` | integer | `0` (all) | Limit pages per PDF (0 = no limit) |
| `dpi` | integer | `300` | Rendering resolution for OCR (higher = better accuracy, slower) |
| `maxConcurrency` | integer | `2` | Parallel PDFs (keep low; OCR is CPU-heavy) |
| `requestTimeout` | integer | `120000` | Download timeout in milliseconds |
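As a rough illustration of the fields above, a run input could be assembled and sanity-checked before submission like this. The `check_input` helper is hypothetical and not part of the actor; the field names and defaults come from the table, and the URL is a placeholder.

```python
import json

SUPPORTED_LANGUAGES = {"eng", "fra", "deu", "spa", "ita", "por",
                       "chi_sim", "jpn", "kor", "ara", "rus"}
OUTPUT_FORMATS = {"text", "markdown", "json"}


def check_input(run_input: dict) -> list:
    """Return a list of problems found in a run input (empty if none)."""
    problems = []
    if run_input.get("language", "eng") not in SUPPORTED_LANGUAGES:
        problems.append("unsupported language code")
    if run_input.get("outputFormat", "text") not in OUTPUT_FORMATS:
        problems.append("outputFormat must be text, markdown, or json")
    if run_input.get("dpi", 300) <= 0:
        problems.append("dpi must be positive")
    if run_input.get("maxPages", 0) < 0:
        problems.append("maxPages must be 0 (no limit) or greater")
    return problems


run_input = {
    "pdfUrls": [{"url": "https://example.com/report.pdf", "label": "Q4 Report"}],
    "language": "eng",
    "outputFormat": "json",
    "extractMetadata": True,
    "pageByPage": True,
    "maxPages": 0,
    "dpi": 300,
    "maxConcurrency": 2,
    "requestTimeout": 120000,
}

print(json.dumps(run_input, indent=2))
print(check_input(run_input))
```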
Output
Each dataset item corresponds to one PDF:
```json
{
  "url": "https://example.com/report.pdf",
  "label": "Q4 Report",
  "fileName": "report.pdf",
  "method": "ocr",
  "metadata": {
    "title": "Annual Report 2023",
    "author": "Acme Corp",
    "creationDate": "2024-01-15"
  },
  "text": "Full extracted text here...",
  "pageCount": 12,
  "averageConfidence": 94.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page text here...",
      "charCount": 847,
      "confidence": 96.1,
      "method": "ocr"
    }
  ],
  "extractedAt": "2024-03-11T12:00:00.000Z",
  "errors": []
}
```
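A dataset item shaped like the one above might be consumed as in the sketch below. The `describe` helper is hypothetical, the item is a trimmed copy of the sample, and it is an assumption that `pages` is simply absent when `pageByPage` is disabled.

```python
import json

raw_item = """
{
  "url": "https://example.com/report.pdf",
  "label": "Q4 Report",
  "fileName": "report.pdf",
  "method": "ocr",
  "pageCount": 12,
  "averageConfidence": 94.3,
  "pages": [
    {"pageNumber": 1, "text": "Page text here...", "charCount": 847,
     "confidence": 96.1, "method": "ocr"}
  ],
  "errors": []
}
"""


def describe(item: dict) -> str:
    """Build a one-line summary of a dataset item."""
    if item.get("errors"):
        # A non-empty errors array means the PDF could not be processed cleanly.
        return f"{item['fileName']}: failed ({len(item['errors'])} error(s))"
    pages = item.get("pages", [])  # assumed missing when pageByPage is false
    return (f"{item['fileName']}: {item['pageCount']} pages via {item['method']}, "
            f"average confidence {item['averageConfidence']:.1f}, "
            f"{len(pages)} page record(s) included")


item = json.loads(raw_item)
print(describe(item))
```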
Extraction Methods
| Method | When used |
|---|---|
| `text-extraction` | PDF contains embedded text (≥20 chars/page average) |
| `ocr` | PDF is scanned or image-based; all pages are processed with Tesseract |
| `hybrid` | Mixed document: some pages had text, others needed OCR |
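One way to read the table is that the document-level `method` is derived from the per-page methods. The sketch below follows that reading; `document_method` is a hypothetical helper, and the actor may decide differently (for example, from the average character count).

```python
from typing import Iterable


def document_method(page_methods: Iterable[str]) -> str:
    """Collapse per-page methods into a single document-level label."""
    kinds = set(page_methods)
    if not kinds:
        raise ValueError("at least one page is required")
    if kinds == {"text-extraction"}:
        return "text-extraction"
    if kinds == {"ocr"}:
        return "ocr"
    # Any mixture of the two counts as a hybrid document.
    return "hybrid"


print(document_method(["text-extraction", "text-extraction"]))
print(document_method(["ocr", "ocr", "ocr"]))
print(document_method(["text-extraction", "ocr"]))
```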
Supported Languages
eng (English), fra (French), deu (German), spa (Spanish), ita (Italian), por (Portuguese), chi_sim (Simplified Chinese), jpn (Japanese), kor (Korean), ara (Arabic), rus (Russian).
Language data is downloaded from Tesseract's CDN on first use. Subsequent runs on the same build reuse the cached data.
Performance & Cost
- Text-based PDFs process very fast (seconds per document).
- Scanned PDFs require rendering + OCR — expect 5–30 seconds per page depending on resolution and document complexity.
- Set `dpi: 150` for faster processing when accuracy is less critical. Use `dpi: 300–600` for small or dense text.
- Set `maxConcurrency: 1` for large batches if you hit memory limits.
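The 5–30 seconds per page figure gives a quick way to budget a run against the 60-minute actor timeout mentioned in the FAQ. The sketch assumes pages are processed one after another and ignores download time and concurrency, so treat the numbers as a rough bound only.

```python
SECONDS_PER_PAGE = (5, 30)      # range quoted for scanned pages
ACTOR_TIMEOUT_SECONDS = 60 * 60  # 60-minute timeout from the FAQ


def estimate_ocr_seconds(page_count: int) -> tuple:
    """Return (best case, worst case) OCR time in seconds."""
    low, high = SECONDS_PER_PAGE
    return page_count * low, page_count * high


def fits_default_timeout(page_count: int) -> bool:
    """True if even the worst-case estimate stays within the timeout."""
    _, worst = estimate_ocr_seconds(page_count)
    return worst <= ACTOR_TIMEOUT_SECONDS


print(estimate_ocr_seconds(12))   # (60, 360)
print(fits_default_timeout(120))  # True: 120 * 30 s = 3600 s
print(fits_default_timeout(121))  # False
```

Under these assumptions, a batch of more than about 120 scanned pages is where raising the timeout or setting `maxPages` starts to matter.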
FAQ
Does this work on password-protected PDFs?
No. Password-protected PDFs cannot be parsed without the password. The actor reports a parse error and returns an empty result for those files.
What DPI should I use?
- 72–150 DPI: Fast, lower accuracy. Fine for large clear text.
- 300 DPI (default): Good balance of speed and accuracy for most scanned documents.
- 400–600 DPI: Best accuracy for small fonts, handwriting, or dense tables. Significantly slower.
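To see why higher DPI is slower, it helps to work out how many pixels a page becomes. The sketch assumes a US Letter page (8.5 × 11 inches) and a renderer that scales linearly with DPI; neither detail comes from the actor itself.

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 300, 600):
    width, height = rendered_size(8.5, 11, dpi)
    megapixels = width * height / 1_000_000
    print(f"{dpi} DPI: {width} x {height} px ({megapixels:.1f} MP)")
```

Doubling the DPI quadruples the pixel count, so a 600 DPI page carries 16 times as many pixels as the same page at 150 DPI.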
Why does my PDF show `method: text-extraction` even though it looks scanned?
Some PDFs embed invisible text layers over scanned images (common in documents processed by Adobe Acrobat or similar tools). The actor detects this embedded text and uses it directly — it's more accurate than re-running OCR on those documents.
Can I process multiple languages in one PDF?
The `language` input takes a single language code per run. For multilingual documents, run the actor once per language and compare results, or use `eng`, which handles many Latin-script languages adequately, although accented characters may be misrecognized.
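When comparing two runs of the same PDF with different language settings, one simple heuristic is to keep, for each page, the result with the higher confidence. The `merge_by_confidence` helper below is hypothetical, and confidence values from different language models are not strictly comparable, so this is a starting point rather than a reliable rule.

```python
from typing import List


def merge_by_confidence(run_a: List[dict], run_b: List[dict]) -> List[dict]:
    """Pick, page by page, whichever run reported the higher confidence."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same pages")
    merged = []
    for page_a, page_b in zip(run_a, run_b):
        if page_a["pageNumber"] != page_b["pageNumber"]:
            raise ValueError("page numbers do not line up")
        # Ties go to the first run.
        merged.append(page_a if page_a["confidence"] >= page_b["confidence"] else page_b)
    return merged


english_run = [
    {"pageNumber": 1, "text": "Annual report", "confidence": 95.0},
    {"pageNumber": 2, "text": "Resume executif", "confidence": 71.0},
]
french_run = [
    {"pageNumber": 1, "text": "Annual repart", "confidence": 82.0},
    {"pageNumber": 2, "text": "Résumé exécutif", "confidence": 93.0},
]

best = merge_by_confidence(english_run, french_run)
print([page["text"] for page in best])
```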
What happens if OCR confidence is low?
Low confidence (below ~60%) usually means the scan quality is poor, the wrong language is selected, or the document contains complex layouts. Try increasing dpi, selecting the correct language, or pre-processing the PDF to improve image quality.
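With `pageByPage` enabled, pages that fall under the rough 60% mark can be picked out for a re-run at higher `dpi` or with a different language. A minimal sketch, assuming the item layout shown in the Output section; `pages_to_review` is a hypothetical helper.

```python
LOW_CONFIDENCE = 60.0  # approximate cut-off mentioned above


def pages_to_review(item: dict, threshold: float = LOW_CONFIDENCE) -> list:
    """Return page numbers of OCR pages whose confidence is below the threshold."""
    return [
        page["pageNumber"]
        for page in item.get("pages", [])
        if page["method"] == "ocr" and page["confidence"] < threshold
    ]


item = {
    "fileName": "report.pdf",
    "pages": [
        {"pageNumber": 1, "confidence": 96.1, "method": "ocr"},
        {"pageNumber": 2, "confidence": 48.7, "method": "ocr"},
        {"pageNumber": 3, "confidence": 100.0, "method": "text-extraction"},
        {"pageNumber": 4, "confidence": 59.9, "method": "ocr"},
    ],
}

print(pages_to_review(item))
```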
Is there a page limit?
Default is 0 (no limit). Set maxPages to limit pages per PDF. Actor timeout is 60 minutes — for very large batches, increase the actor's timeout in run options.
Competitive Advantage
Unlike alternatives that require Google Vision API, OpenAI, or AWS Textract (all paid, all requiring API keys to be configured), this actor uses Tesseract.js which is open-source, runs locally inside the actor, and has zero per-call API cost. You only pay for Apify compute time.