Pricing

from $9.80 / 1,000 page ocrs

PDF OCR Tool — Scanned PDF Text Extraction

Run OCR on scanned PDFs and image-based documents. Extract text by page with language options, confidence scores, and searchable text exports.

Developer

junipr

Last modified

3 days ago

Overview

Most PDF text extractors fail silently on scanned documents: the PDF looks normal but contains images instead of selectable text. This actor solves that with a two-stage pipeline:

Smart detection: the actor first attempts direct text extraction (fast, with no OCR overhead). If the PDF is text-based, it returns results immediately.
OCR fallback: if the PDF is scanned or image-based (fewer than 20 chars/page on average), the actor renders each page in a headless Chrome browser and runs Tesseract.js OCR on the resulting images.
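
The detection step can be pictured as a simple threshold check. The sketch below is an illustration only, not the actor's actual code: `needs_ocr` and its parameter names are hypothetical, and it assumes the 20 chars/page figure is compared against the average across all pages.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Return True when directly extracted text is too sparse to trust.

    Assumption: the threshold is compared against the average number of
    characters per page, after trimming surrounding whitespace.
    """
    if not page_texts:
        return True  # nothing extractable, so fall back to OCR
    average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return average < min_chars_per_page


# A text-based PDF yields plenty of characters per page.
print(needs_ocr(["Quarterly revenue rose 12% year over year.", "Costs were flat."]))  # False
# A scanned PDF typically yields empty or near-empty strings.
print(needs_ocr(["", "  ", "3"]))  # True
```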

Every result includes per-page confidence scores so you can judge how reliable each extraction was.

Features

No API keys: Tesseract.js runs entirely inside the actor. No external OCR service is called, so there is no per-call cost to a third-party API.
11 languages: English, French, German, Spanish, Italian, Portuguese, Simplified Chinese, Japanese, Korean, Arabic, Russian.
Smart detection: text-based PDFs take the fast path (direct extraction). Only scanned PDFs incur the OCR overhead.
Confidence scores: every page reports a confidence value (0–100). Text-extracted pages always score 100.
Extraction method: each result reports text-extraction, ocr, or hybrid so you know which path was taken.
Batch processing: supply as many PDF URLs as needed. Configurable concurrency keeps memory usage in check.
Metadata extraction: title, author, subject, creator, producer, creation date, modification date.
Multiple output formats: plain text, markdown, or structured JSON.
Page-by-page output: optional pages array with per-page text, character count, confidence, and method.

Input

All fields have defaults — run with zero configuration using the bounded W3C default PDF.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `pdfUrls` | array | W3C default PDF | List of `{url, label}` objects to process |
| `language` | string | `eng` | Tesseract OCR language code |
| `outputFormat` | string | `text` | `text`, `markdown`, or `json` |
| `extractMetadata` | boolean | `true` | Include PDF metadata in output |
| `pageByPage` | boolean | `true` | Include per-page breakdown with confidence scores |
| `maxPages` | integer | `10` | Limit pages per PDF (0 = no limit) |
| `dpi` | integer | `300` | Rendering resolution for OCR (higher = better accuracy, slower) |
| `maxConcurrency` | integer | `2` | Parallel PDFs (keep low; OCR is CPU-heavy) |
| `requestTimeout` | integer | `120000` | Download timeout in milliseconds |
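
As an illustration of how these fields combine, the sketch below builds a run input that overrides a few defaults and performs some basic sanity checks before a run is started. The URL and label are placeholders, `check_input` is a hypothetical helper, and the http(s) check is an assumption rather than a documented rule of the actor.

```python
import json

# Only fields from the input table are used; the URL and label are placeholders.
run_input = {
    "pdfUrls": [{"url": "https://example.com/report.pdf", "label": "Q4 Report"}],
    "language": "deu",
    "outputFormat": "json",
    "maxPages": 25,
    "dpi": 150,
    "maxConcurrency": 1,
}

ALLOWED_FORMATS = {"text", "markdown", "json"}


def check_input(data: dict) -> list[str]:
    """Collect obvious problems before starting a run (hypothetical helper)."""
    problems = []
    if data.get("outputFormat", "text") not in ALLOWED_FORMATS:
        problems.append("outputFormat must be text, markdown, or json")
    if data.get("maxPages", 10) < 0:
        problems.append("maxPages must be 0 (no limit) or a positive integer")
    for entry in data.get("pdfUrls", []):
        # Assumption: the actor expects plain http(s) URLs.
        if not entry.get("url", "").startswith(("http://", "https://")):
            problems.append(f"not an http(s) URL: {entry.get('url')!r}")
    return problems


print(check_input(run_input))  # an empty list means no problems were found
print(json.dumps(run_input, indent=2))
```

Omitted fields such as `extractMetadata` and `pageByPage` fall back to the defaults listed in the table.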

Output

Each dataset item corresponds to one PDF:

{
  "url": "https://example.com/report.pdf",
  "label": "Q4 Report",
  "fileName": "report.pdf",
  "method": "ocr",
  "metadata": {
    "title": "Annual Report 2023",
    "author": "Acme Corp",
    "creationDate": "2024-01-15"
  },
  "text": "Full extracted text here...",
  "pageCount": 12,
  "averageConfidence": 94.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page text here...",
      "charCount": 847,
      "confidence": 96.1,
      "method": "ocr"
    }
  ],
  "extractedAt": "2024-03-11T12:00:00.000Z",
  "errors": []
}
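
A dataset item with this shape can be condensed into a short summary for logging or quality checks. The sketch below is one possible approach: `summarize` is a hypothetical helper, the sample item is trimmed to two pages, and the mean it computes is derived from the `pages` array (whether `averageConfidence` is calculated the same way is an assumption).

```python
import json
from statistics import mean

# A trimmed sample item using the field names from the output example.
item = json.loads("""
{
  "url": "https://example.com/report.pdf",
  "method": "ocr",
  "pageCount": 2,
  "pages": [
    {"pageNumber": 1, "text": "Page one", "charCount": 8, "confidence": 96.1, "method": "ocr"},
    {"pageNumber": 2, "text": "Page two", "charCount": 8, "confidence": 92.5, "method": "ocr"}
  ],
  "errors": []
}
""")


def summarize(item: dict) -> dict:
    """Condense one dataset item into a few headline numbers."""
    pages = item.get("pages", [])
    confidence = round(mean(p["confidence"] for p in pages), 1) if pages else None
    return {
        "url": item["url"],
        "method": item["method"],
        "pages_returned": len(pages),
        "total_chars": sum(p["charCount"] for p in pages),
        "mean_page_confidence": confidence,
        "has_errors": bool(item.get("errors")),
    }


print(summarize(item))
```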

Extraction Methods

| Method | When used |
| --- | --- |
| `text-extraction` | PDF contains embedded text (≥20 chars/page average) |
| `ocr` | PDF is scanned or image-based; all pages processed with Tesseract |
| `hybrid` | Mixed document: some pages had text, others needed OCR |
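
One way to read this table: the document-level method follows from the per-page methods. The sketch below illustrates that relationship under the assumption that each page is labelled either `text-extraction` or `ocr`; `overall_method` is a hypothetical helper, not the actor's code.

```python
def overall_method(page_methods: list[str]) -> str:
    """Derive a document-level method from per-page methods."""
    kinds = set(page_methods)
    if kinds == {"text-extraction"}:
        return "text-extraction"
    if kinds == {"ocr"}:
        return "ocr"
    if kinds == {"text-extraction", "ocr"}:
        return "hybrid"  # some pages had embedded text, others needed OCR
    raise ValueError(f"unexpected page methods: {sorted(kinds)}")


print(overall_method(["text-extraction", "text-extraction"]))  # text-extraction
print(overall_method(["ocr", "text-extraction", "ocr"]))       # hybrid
```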

Supported Languages

eng (English), fra (French), deu (German), spa (Spanish), ita (Italian), por (Portuguese), chi_sim (Simplified Chinese), jpn (Japanese), kor (Korean), ara (Arabic), rus (Russian).

Language data is downloaded from Tesseract's CDN on first use. Subsequent runs on the same build reuse the cached data automatically.

Performance & Cost

Text-based PDFs process quickly, typically within seconds per document.
Scanned PDFs require rendering plus OCR: expect 5–30 seconds per page depending on resolution and document complexity.
Set dpi: 150 for faster processing when accuracy is less critical. Use dpi: 300–600 for small or dense text.
Set maxConcurrency: 1 for large batches if you hit memory limits.
Pay-per-event pricing: the page-ocr event costs $9.80 per 1,000 successfully processed pages under the queued monetization update. Failed downloads, parse/render failures, and zero-page results are not charged.
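
The per-page price makes a rough budget easy to work out. The sketch below covers only the page-ocr event at the quoted rate; `estimate_cost` is a hypothetical helper, and any platform usage billed separately by Apify is not included.

```python
from decimal import Decimal

PRICE_PER_1000_PAGES = Decimal("9.80")  # page-ocr event price quoted above


def estimate_cost(charged_pages: int) -> Decimal:
    """Estimated page-ocr charge in USD for a number of successfully processed pages."""
    return (PRICE_PER_1000_PAGES * charged_pages / 1000).quantize(Decimal("0.01"))


print(estimate_cost(250))   # 2.45
print(estimate_cost(1000))  # 9.80
```

Decimal arithmetic is used here so that the result rounds to cents without floating-point artifacts.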

FAQ

Does this work on password-protected PDFs?

No. Password-protected PDFs cannot be parsed without the password. The actor will report a parse error and return an empty result for those files.

What DPI should I use?

72–150 DPI: Fast, lower accuracy. Fine for large clear text.
300 DPI (default): Good balance of speed and accuracy for most scanned documents.
400–600 DPI: Best accuracy for small fonts, handwriting, or dense tables. Significantly slower.
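
The speed difference between these settings follows from simple arithmetic: the number of pixels the OCR engine has to process grows with the square of the DPI. The sketch below assumes a US Letter page (8.5 x 11 inches) purely for illustration; `rendered_size` is a hypothetical helper, and the actor's actual rendering dimensions may differ.

```python
def rendered_size(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (US Letter assumed)."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 300, 600):
    w, h = rendered_size(dpi)
    print(f"{dpi} DPI: {w} x {h} px = {w * h / 1e6:.1f} megapixels")
# 150 DPI: 1275 x 1650 px = 2.1 megapixels
# 300 DPI: 2550 x 3300 px = 8.4 megapixels
# 600 DPI: 5100 x 6600 px = 33.7 megapixels
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI is noticeably slower than the 300 DPI default.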

Why does my PDF show method: text-extraction even though it looks scanned?

Some PDFs embed an invisible text layer over the scanned images (common in documents processed by Adobe Acrobat or similar tools). The actor detects this embedded text and uses it directly, which is more accurate than re-running OCR on those documents.

Can I process multiple languages in one PDF?

The actor accepts one language code per run. For multilingual documents, run the actor twice with different language settings and compare results, or use eng, which handles many Latin-script languages adequately.
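
Comparing two runs can be done page by page using the reported confidence. The sketch below is one possible heuristic, not a feature of the actor: `best_pages` is a hypothetical helper, the sample data is invented for illustration, and confidence values from different language models are only roughly comparable.

```python
def best_pages(run_a: dict, run_b: dict) -> list[dict]:
    """For each page number, keep the page with the higher confidence."""
    by_number: dict[int, dict] = {}
    for item in (run_a, run_b):
        for page in item.get("pages", []):
            current = by_number.get(page["pageNumber"])
            if current is None or page["confidence"] > current["confidence"]:
                by_number[page["pageNumber"]] = page
    return [by_number[n] for n in sorted(by_number)]


# Invented sample results for the same PDF, run once with eng and once with fra.
eng_run = {"pages": [
    {"pageNumber": 1, "text": "Annual report", "confidence": 95.0},
    {"pageNumber": 2, "text": "Resume de l'exercice", "confidence": 71.0},
]}
fra_run = {"pages": [
    {"pageNumber": 1, "text": "Annual rep0rt", "confidence": 80.0},
    {"pageNumber": 2, "text": "Résumé de l'exercice", "confidence": 93.0},
]}

for page in best_pages(eng_run, fra_run):
    print(page["pageNumber"], page["confidence"], page["text"])
```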

What happens if OCR confidence is low?

Low confidence (below ~60%) usually means the scan quality is poor, the wrong language is selected, or the document contains complex layouts. Try increasing dpi, selecting the correct language, or pre-processing the PDF to improve image quality.
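
Pages that fall below this rough threshold can be picked out of a result automatically. The sketch below assumes the page fields shown in the output example; `low_confidence_pages` is a hypothetical helper, and 60 is only the approximate cutoff mentioned above.

```python
def low_confidence_pages(item: dict, threshold: float = 60.0) -> list[int]:
    """Page numbers of OCR'd pages whose confidence is below the threshold."""
    return [
        page["pageNumber"]
        for page in item.get("pages", [])
        if page["method"] == "ocr" and page["confidence"] < threshold
    ]


# Invented sample result for illustration.
item = {"pages": [
    {"pageNumber": 1, "confidence": 96.1, "method": "ocr"},
    {"pageNumber": 2, "confidence": 48.7, "method": "ocr"},
    {"pageNumber": 3, "confidence": 100, "method": "text-extraction"},
]}

print(low_confidence_pages(item))  # [2]
```

The flagged pages are candidates for a second run with a higher dpi or a different language setting.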

Is there a page limit?

The default is 10 pages per PDF. Set maxPages: 0 only when you intentionally want no page limit. The actor timeout is 60 minutes; for very large batches, increase the timeout in the run options.
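
To judge whether a batch is likely to fit within the timeout, the page limit can be combined with the per-page timing quoted in Performance & Cost. The sketch below is a rough worst-case estimate: both helpers are hypothetical, 30 seconds per page is the upper end of the quoted range, and it assumes that maxConcurrency parallelizes perfectly, which real runs may not achieve.

```python
def pages_to_process(page_count: int, max_pages: int = 10) -> int:
    """Apply the maxPages rule: 0 means no limit."""
    return page_count if max_pages == 0 else min(page_count, max_pages)


def worst_case_minutes(
    page_counts: list[int],
    max_pages: int = 10,
    seconds_per_page: float = 30,
    max_concurrency: int = 2,
) -> float:
    """Upper-bound OCR time in minutes, assuming ideal parallelism."""
    total_pages = sum(pages_to_process(n, max_pages) for n in page_counts)
    return total_pages * seconds_per_page / max_concurrency / 60


batch = [12] * 40  # forty scanned PDFs of 12 pages each
print(worst_case_minutes(batch))               # 100.0, above the 60-minute default
print(worst_case_minutes(batch, max_pages=0))  # 120.0 with no page limit
```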

Competitive Advantage

Unlike alternatives that require Google Vision API, OpenAI, or AWS Textract (all paid, all requiring API keys to be configured), this actor uses Tesseract.js, which is open-source, runs locally inside the actor, and has zero third-party API cost. Runs are billed through the actor's pay-per-event (PPE) page-ocr event plus the platform-usage policy shown by Apify.
