PDF OCR API - Document Extraction

Pricing

from $0.01 / 1,000 results

Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.

Developer

John Rippy

Maintained by Community

PDF OCR API

"Extract Text from Any PDF for Pennies" by John Rippy | johnrippy.link


Stop Paying for Expensive OCR Services

You're currently paying: Adobe Acrobat Pro ($22.99/mo), ABBYY FineReader ($199/year), Google Document AI ($1.50/1000 pages), Amazon Textract ($1.50/1000 pages).

What if you could extract text for a fraction of the cost?

The PDF OCR API extracts text from any PDF - scanned documents, image-based PDFs, and multi-page files:

  • Scanned document support (OCR)
  • Multi-page processing (any length)
  • Support for 14 languages (English, Spanish, French, German, Chinese, Japanese, and more)
  • Table structure preservation
  • Multiple output formats (text, JSON, Markdown)
  • Confidence scores per page
  • Page-by-page results

Pay only for what you use. No monthly subscriptions. No minimum commitments.


Why Choose This Over Traditional OCR Services

1. Pay-Per-Page, Not Per-Month

Traditional tools: $20-$200/month for your business.

This actor: Pay per page processed. Process 100 pages for ~$5. Process 1,000 for ~$50.

At $0.05 per page, anything under about 460 pages a month costs less than a $22.99 Adobe subscription.

2. Support for Any PDF

  • Scanned PDFs: Image-based documents from scanners
  • Digital PDFs: Native text extraction (faster, more accurate)
  • Mixed PDFs: Pages with both text and images
  • Multi-page: No page-count limit, subject to the 50MB file size cap

3. 14 Languages Supported

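| Language | Code | Language | Code |
|---|---|---|---|
| English | eng | Russian | rus |
| Spanish | spa | Japanese | jpn |
| French | fra | Chinese (Simplified) | chi_sim |
| German | deu | Chinese (Traditional) | chi_tra |
| Italian | ita | Korean | kor |
| Portuguese | por | Arabic | ara |
| Dutch | nld | Polish | pol |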

4. Table Detection

Preserve table structure from scanned documents. Get rows and columns as structured data.


Quick Start Examples

Example 1: Extract Text from URL

{
    "pdfUrl": "https://example.com/document.pdf",
    "language": "eng",
    "outputFormat": "text"
}

Example 2: Process Specific Pages

{
    "pdfUrl": "https://example.com/document.pdf",
    "pageRange": "1-5",
    "language": "eng",
    "detectTables": true
}
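
Before submitting a run, it can help to check which pages a `pageRange` string actually covers, since each processed page is billed. The following is a minimal sketch, assuming the two documented forms ("1-5" and "1,3,5") and, as a further assumption, that they can be mixed in one string. `expandPageRange` is a hypothetical client-side helper, not part of the actor.

```javascript
// Hypothetical helper: expands a pageRange string into a sorted list of page numbers.
// Assumes "1-5" (range) and "1,3,5" (list) forms; mixing them is an assumption.
function expandPageRange(pageRange) {
  const pages = new Set();
  for (const part of pageRange.split(',')) {
    const match = part.trim().match(/^(\d+)(?:-(\d+))?$/);
    if (!match) throw new Error(`Invalid pageRange segment: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid pageRange segment: "${part}"`);
    }
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return [...pages].sort((a, b) => a - b);
}

console.log(expandPageRange('1-5'));   // [ 1, 2, 3, 4, 5 ]
console.log(expandPageRange('1,3,5')); // [ 1, 3, 5 ]
```

The length of the returned array gives the number of pages the run would be charged for under these assumptions.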

Example 3: Multi-Language Document

{
    "pdfUrl": "https://example.com/document.pdf",
    "language": "spa",
    "outputFormat": "json"
}

Example 4: With Webhook

{
    "pdfUrl": "https://example.com/document.pdf",
    "webhookUrl": "https://hooks.zapier.com/hooks/catch/12345/abcdef/"
}

Input Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| pdfUrl | string | Yes* | URL of the PDF file to process |
| pdfBase64 | string | Yes* | Base64-encoded PDF (alternative to URL) |
| language | string | No | OCR language hint (default: eng) |
| pageRange | string | No | Pages to process (e.g., "1-5" or "1,3,5") |
| outputFormat | string | No | Output format: text, json, markdown |
| detectTables | boolean | No | Attempt to preserve table structure |
| webhookUrl | string | No | Webhook URL for async results |
| demoMode | boolean | No | Return sample output without processing |

*Either pdfUrl or pdfBase64 is required
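
Validating the input on the client catches mistakes before a run is started. The sketch below builds an input object from the parameters listed above. `buildOcrInput` is a hypothetical helper, and rejecting requests that set both `pdfUrl` and `pdfBase64` is an assumption (the documentation only says one of them is required).

```javascript
// Hypothetical helper: assembles an actor input object from the documented parameters.
function buildOcrInput({ pdfUrl, pdfBase64, language = 'eng', ...rest }) {
  // The docs require pdfUrl or pdfBase64; rejecting both at once is an assumption.
  if (Boolean(pdfUrl) === Boolean(pdfBase64)) {
    throw new Error('Provide exactly one of pdfUrl or pdfBase64');
  }
  const formats = ['text', 'json', 'markdown'];
  if (rest.outputFormat !== undefined && !formats.includes(rest.outputFormat)) {
    throw new Error(`outputFormat must be one of: ${formats.join(', ')}`);
  }
  // "eng" is the documented default language hint.
  return { ...(pdfUrl ? { pdfUrl } : { pdfBase64 }), language, ...rest };
}

const input = buildOcrInput({
  pdfUrl: 'https://example.com/document.pdf',
  detectTables: true,
});
console.log(input);
```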


Output Format

{
    "success": true,
    "fileName": "document.pdf",
    "totalPages": 5,
    "processedPages": 5,
    "language": "eng",
    "processingTime": 2.3,
    "pages": [
        {
            "pageNumber": 1,
            "text": "This is the extracted text from page 1...",
            "confidence": 95.2,
            "wordCount": 342,
            "hasImages": true,
            "tables": [
                {
                    "rows": 5,
                    "columns": 3,
                    "data": [["Header1", "Header2", "Header3"], ...]
                }
            ]
        }
    ],
    "fullText": "Complete document text concatenated...",
    "wordCount": 1250,
    "averageConfidence": 94.5
}
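
Because each table arrives as an array of rows under `data`, it converts easily to CSV for spreadsheets or downstream tools. This is a minimal sketch, assuming the output shape shown above. `tableToCsv` and `collectTables` are hypothetical helpers, and the sample data is invented for illustration.

```javascript
// Hypothetical helpers for working with the documented output shape.

// Converts one table's `data` (array of rows) to CSV, quoting cells when needed.
function tableToCsv(table) {
  const escapeCell = (cell) => {
    const text = String(cell ?? '');
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  return table.data.map((row) => row.map(escapeCell).join(',')).join('\n');
}

// Collects every table in a result, tagged with the page it came from.
function collectTables(result) {
  return result.pages.flatMap((page) =>
    (page.tables ?? []).map((table) => ({
      pageNumber: page.pageNumber,
      csv: tableToCsv(table),
    }))
  );
}

// Sample data invented for illustration, following the output format above.
const sampleResult = {
  pages: [
    {
      pageNumber: 1,
      tables: [
        { rows: 2, columns: 2, data: [['Item', 'Total'], ['Widget, large', '12.50']] },
      ],
    },
    { pageNumber: 2, tables: [] },
  ],
};
console.log(collectTables(sampleResult));
```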

Pay-Per-Event Pricing

You only pay for what you use. No monthly fees. No minimums.

| Event | Description | Price |
|---|---|---|
| page_processed | Each page extracted | $0.05 |
| table_detected | Each table found and parsed | $0.02 |

Cost Examples

| Volume | This Actor | Adobe | Google Doc AI |
|---|---|---|---|
| 100 pages | ~$5 | $22.99/mo | $0.15 |
| 500 pages | ~$25 | $22.99/mo | $0.75 |
| 1,000 pages | ~$50 | $22.99/mo | $1.50 |
| 10,000 pages | ~$500 | $22.99/mo | $15.00 |

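For low volume (under roughly 460 pages a month at $0.05 per page), this actor costs less than a $22.99 Adobe subscription and requires no commitment. At higher volumes, the figures above show that a flat subscription or a per-page cloud API is cheaper.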
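
The per-event prices make it straightforward to estimate a bill in advance. The sketch below is an illustration only: it uses the $0.05 and $0.02 event prices and the $22.99 Adobe figure quoted in this README, and works in whole cents to avoid floating-point rounding. `estimateCost` and `breakEvenPages` are hypothetical names, and actual charges may differ.

```javascript
// Event prices from the pricing table above, in cents.
const CENTS_PER_PAGE = 5;  // page_processed
const CENTS_PER_TABLE = 2; // table_detected

// Estimated cost in dollars for a given number of pages and detected tables.
function estimateCost(pages, tables = 0) {
  return (pages * CENTS_PER_PAGE + tables * CENTS_PER_TABLE) / 100;
}

// Largest monthly page count that still costs no more than a flat monthly fee
// (page charges only, no tables).
function breakEvenPages(monthlyFeeDollars) {
  return Math.floor(Math.round(monthlyFeeDollars * 100) / CENTS_PER_PAGE);
}

console.log(estimateCost(100));      // 5
console.log(estimateCost(1000, 40)); // 50.8
console.log(breakEvenPages(22.99));  // 459
```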


Use Cases

Document Digitization

  • Archive processing: Make historical documents searchable
  • Paper to digital: Convert scanned documents to text
  • Record keeping: Digitize contracts, invoices, receipts

Data Extraction

  • Invoice processing: Extract line items, totals, dates
  • Form processing: Pull data from scanned forms
  • Contract analysis: Extract key terms and clauses

Research & Academia

  • Academic papers: Extract text from PDF research papers
  • Book scanning: Digitize book chapters and pages
  • Citation extraction: Pull references from documents

Legal & Compliance

  • Legal discovery: Process large document sets
  • Contract review: Extract text for analysis
  • Compliance audits: Digitize paper records

Developers

  • API integration: RESTful JSON responses
  • Webhook support: Async processing for large documents
  • Multiple formats: Text, JSON, or Markdown output

Confidence Scores

Each page includes a confidence score (0-100%):

| Score | Quality | Recommended Action |
|---|---|---|
| 95-100% | Excellent | High confidence, use as-is |
| 85-94% | Good | Reliable for most purposes |
| 70-84% | Fair | Review for critical data |
| Below 70% | Poor | Manual verification recommended |

Low confidence usually indicates:

  • Poor scan quality
  • Unusual fonts
  • Handwritten text
  • Low resolution images
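
The score bands can be applied in code to route pages for review automatically. This is a minimal sketch, assuming each band starts at its lower bound, so a fractional score such as 94.5 counts as "Good". `classifyConfidence` and `pagesNeedingReview` are hypothetical helpers, and the 85 default threshold is an illustrative choice.

```javascript
// Maps a 0-100 confidence score to the quality bands in the table above.
// Assumption: bands start at their lower bound, so 94.5 falls under "Good".
function classifyConfidence(score) {
  if (score >= 95) return { quality: 'Excellent', action: 'High confidence, use as-is' };
  if (score >= 85) return { quality: 'Good', action: 'Reliable for most purposes' };
  if (score >= 70) return { quality: 'Fair', action: 'Review for critical data' };
  return { quality: 'Poor', action: 'Manual verification recommended' };
}

// Returns the page numbers whose confidence falls below the given threshold.
function pagesNeedingReview(result, threshold = 85) {
  return result.pages
    .filter((page) => page.confidence < threshold)
    .map((page) => page.pageNumber);
}

// Sample data invented for illustration.
const ocrResult = {
  pages: [
    { pageNumber: 1, confidence: 95.2 },
    { pageNumber: 2, confidence: 78.4 },
    { pageNumber: 3, confidence: 61.0 },
  ],
};
console.log(classifyConfidence(95.2).quality); // Excellent
console.log(pagesNeedingReview(ocrResult));    // [ 2, 3 ]
```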

API Integration

Using the Apify API (JavaScript)

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('localhowl/pdf-ocr-api').call({
    pdfUrl: 'https://example.com/document.pdf',
    language: 'eng',
    outputFormat: 'json'
});

// Results are stored in the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].fullText);

Using cURL

curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"pdfUrl": "https://example.com/document.pdf",
"language": "eng"
}'

Base64 Upload (for local files)

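# Convert PDF to base64 on a single line (strip the line wraps some base64 tools add)
base64 < document.pdf | tr -d '\n' > document_b64.txt
# Send to API (the inner double quotes keep the shell from word-splitting the payload)
curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"pdfBase64": "'"$(cat document_b64.txt)"'",
"language": "eng"
}'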
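
For larger files, building the payload in code avoids shell quoting and command-line length limits. The sketch below shows one way to prepare a `pdfBase64` input in Node.js. `toBase64Input` is a hypothetical helper, and the bytes used here are a stand-in rather than a real PDF.

```javascript
// Hypothetical helper: wraps raw PDF bytes (a Node.js Buffer) in an actor input object.
function toBase64Input(pdfBytes, language = 'eng') {
  if (!Buffer.isBuffer(pdfBytes) || pdfBytes.length === 0) {
    throw new Error('Expected a non-empty Buffer of PDF bytes');
  }
  // Buffer#toString('base64') produces a single line with no wrapping.
  return { pdfBase64: pdfBytes.toString('base64'), language };
}

// Stand-in bytes for illustration; in practice, read the file from disk.
const fakePdf = Buffer.from('%PDF-1.4\n% placeholder\n', 'latin1');
const base64Input = toBase64Input(fakePdf);
console.log(Object.keys(base64Input)); // [ 'pdfBase64', 'language' ]
```

In a real script the bytes would typically come from `fs.readFileSync('document.pdf')`, and the returned object would be passed as the actor input in the same way as the `pdfUrl` example.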

Webhook Integration (Zapier, Make, n8n)

Webhook Payload Format

{
    "event": "ocr_completed",
    "timestamp": "2025-12-23T12:00:00.000Z",
    "actor": "pdf-ocr-api",
    "runId": "abc123",
    "totalPages": 10,
    "processedPages": 10,
    "averageConfidence": 92.5,
    "fullText": "...",
    "pages": [...]
}
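
A receiving service usually reduces this payload to a short status before forwarding it, for example to Slack. The following is a minimal sketch, assuming the payload fields shown above. `summarizeWebhook` is a hypothetical function, and the review threshold of 85 is an illustrative choice rather than something the actor defines.

```javascript
// Hypothetical handler: turns a webhook payload into a short status summary.
// Returns null for events other than "ocr_completed".
function summarizeWebhook(payload, reviewThreshold = 85) {
  if (!payload || payload.event !== 'ocr_completed') return null;
  const { runId, totalPages, processedPages, averageConfidence } = payload;
  return {
    runId,
    complete: processedPages === totalPages,
    needsReview: averageConfidence < reviewThreshold,
    message:
      `OCR run ${runId}: ${processedPages}/${totalPages} pages, ` +
      `${averageConfidence}% average confidence`,
  };
}

// Payload values taken from the example above (text and pages omitted).
const summary = summarizeWebhook({
  event: 'ocr_completed',
  runId: 'abc123',
  totalPages: 10,
  processedPages: 10,
  averageConfidence: 92.5,
});
console.log(summary.message);
// OCR run abc123: 10/10 pages, 92.5% average confidence
```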

Common Automations

  • Google Drive: Save extracted text alongside PDFs
  • Notion/Coda: Create searchable document database
  • Slack: Notify when processing completes
  • CRM: Attach extracted text to records

Limitations

  • File Size: Maximum 50MB per PDF
  • Handwriting: Limited support for handwritten text
  • Complex Layouts: Multi-column layouts may merge incorrectly
  • Image Quality: Low-resolution scans reduce accuracy
  • Encrypted PDFs: Password-protected PDFs not supported
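
Some of these limits can be checked locally before a run is started. The sketch below is a heuristic pre-flight check under two assumptions: that "50MB" means 50 × 1024 × 1024 bytes, and that an `/Encrypt` entry in the file bytes indicates a password-protected PDF. `preflightPdf` is a hypothetical helper, and the encryption test can produce false positives or miss unusual files.

```javascript
// Assumption: the documented "50MB" limit is interpreted as 50 MiB.
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Hypothetical pre-flight check: returns a list of likely problems (empty if none found).
function preflightPdf(pdfBytes, maxBytes = MAX_PDF_BYTES) {
  const problems = [];
  if (pdfBytes.length > maxBytes) {
    problems.push(`File is ${pdfBytes.length} bytes, over the ${maxBytes}-byte limit`);
  }
  // Heuristic: encrypted PDFs normally reference an /Encrypt dictionary.
  if (pdfBytes.includes('/Encrypt')) {
    problems.push('PDF appears to be encrypted');
  }
  return problems;
}

// Stand-in bytes for illustration, not real PDFs.
const plainBytes = Buffer.from('%PDF-1.7\ntrailer << /Root 1 0 R >>\n', 'latin1');
const lockedBytes = Buffer.from('%PDF-1.7\ntrailer << /Encrypt 5 0 R >>\n', 'latin1');
console.log(preflightPdf(plainBytes));  // []
console.log(preflightPdf(lockedBytes)); // [ 'PDF appears to be encrypted' ]
```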

Support


Built by John Rippy | johnrippy.link


Keywords

pdf ocr, pdf text extraction, ocr api, scanned pdf to text, document digitization, pdf scraper, image to text, optical character recognition, pdf parser, document processing, invoice ocr, form extraction, adobe alternative, abbyy alternative, tesseract ocr, multi-language ocr