PDF OCR API - Document Extraction

Pricing

from $0.01 / 1,000 results

Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.

Rating: 0.0 (0)

Developer: John Rippy

Maintained by Community

Actor stats: 0 bookmarked, 6 total users, 2 monthly active users, last modified 13 days ago

PDF OCR API

API Integration

This actor connects to an external API service. You'll need valid API credentials from the service provider.


Extract text from PDF files using OCR. Supports scanned documents, images, and multi-page PDFs. Returns structured text with page numbers and confidence scores. Built by John Rippy (https://www.linkedin.com/in/johnrippy/ | https://johnrippy.link/).

Features

  • Direct API integration
  • Structured JSON output
  • Error handling and retries
  • Pay-per-event billing

Quick Start

{
  "pdfUrl": "https://example.com/document.pdf"
}

Demo Mode

Set demoMode: true to test with sample data (no charges). When you're ready for real results, set demoMode: false or omit it.

{
  "demoMode": true,
  ...
}

Input Parameters

Parameter     Type     Required  Description
pdfUrl        string   Yes*      URL of the PDF file to process
pdfBase64     string   Yes*      Base64-encoded PDF (alternative to URL)
language      string   No        OCR language hint (default: eng)
pageRange     string   No        Pages to process (e.g., "1-5" or "1,3,5")
outputFormat  string   No        Output format: text, json, markdown
detectTables  boolean  No        Attempt to preserve table structure
webhookUrl    string   No        Webhook URL for async results
demoMode      boolean  No        Return sample output without processing

*Either pdfUrl or pdfBase64 is required
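
As a rough illustration of the parameters above, the following Python sketch assembles an input object and enforces the "either pdfUrl or pdfBase64" rule before anything is sent. The function name `build_run_input` and its arguments are hypothetical helpers, not part of the actor; only the JSON keys come from the table. Rejecting the case where both sources are supplied is an assumption, since the table only states that one of them is required.

```python
import base64
from typing import Optional

# Values listed for outputFormat in the parameter table.
ALLOWED_FORMATS = {"text", "json", "markdown"}


def build_run_input(
    pdf_url: Optional[str] = None,
    pdf_bytes: Optional[bytes] = None,
    language: str = "eng",
    page_range: Optional[str] = None,
    output_format: Optional[str] = None,
    detect_tables: Optional[bool] = None,
    demo_mode: bool = False,
) -> dict:
    """Assemble an input object using the key names from the parameter table."""
    # Assumption: exactly one PDF source is allowed per request.
    if (pdf_url is None) == (pdf_bytes is None):
        raise ValueError("Provide exactly one of pdf_url or pdf_bytes")
    if output_format is not None and output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")

    run_input: dict = {"language": language}
    if pdf_url is not None:
        run_input["pdfUrl"] = pdf_url
    else:
        # pdfBase64 expects the file content as Base64 text.
        run_input["pdfBase64"] = base64.b64encode(pdf_bytes).decode("ascii")

    # Optional keys are only included when the caller sets them.
    if page_range:
        run_input["pageRange"] = page_range
    if output_format is not None:
        run_input["outputFormat"] = output_format
    if detect_tables is not None:
        run_input["detectTables"] = detect_tables
    if demo_mode:
        run_input["demoMode"] = True
    return run_input


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(build_run_input(pdf_url="https://example.com/document.pdf", page_range="1-5"))
```

Validating locally like this avoids spending a run on an input that is missing its PDF source.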


Output Format

{
  "success": true,
  "fileName": "document.pdf",
  "totalPages": 5,
  "processedPages": 5,
  "language": "eng",
  "processingTime": 2.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "This is the extracted text from page 1...",
      "confidence": 95.2,
      "wordCount": 342,
      "hasImages": true,
      "tables": [
        {
          "rows": 5,
          "columns": 3,
          "data": [["Header1", "Header2", "Header3"], ...]
        }
      ]
    }
  ],
  "fullText": "Complete document text concatenated...",
  "wordCount": 1250,
  "averageConfidence": 94.5
}
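
One way to consume a result shaped like the sample above is sketched below in Python. It flags pages whose confidence falls under a threshold and turns a table's `data` rows into dictionaries. The helper names, the threshold of 90, the reading of confidence as a 0-100 scale, and the treatment of the first row as headers are all assumptions drawn from the sample values, not documented behavior.

```python
from typing import Any


def low_confidence_pages(result: dict[str, Any], threshold: float = 90.0) -> list[int]:
    """Return page numbers whose OCR confidence is below the threshold."""
    return [
        page["pageNumber"]
        for page in result.get("pages", [])
        if page.get("confidence", 0.0) < threshold
    ]


def table_to_records(table: dict[str, Any]) -> list[dict[str, str]]:
    """Map each data row to a dict keyed by the first row (assumed to be headers)."""
    data = table.get("data", [])
    if not data:
        return []
    headers, *rows = data
    return [dict(zip(headers, row)) for row in rows]


if __name__ == "__main__":
    # Made-up result in the documented shape, for illustration only.
    sample = {
        "pages": [
            {
                "pageNumber": 1,
                "confidence": 95.2,
                "tables": [
                    {"rows": 2, "columns": 2, "data": [["Item", "Total"], ["Widget", "9.99"]]}
                ],
            },
            {"pageNumber": 2, "confidence": 71.0, "tables": []},
        ]
    }
    print(low_confidence_pages(sample))
    print(table_to_records(sample["pages"][0]["tables"][0]))
```

Pages flagged this way are candidates for manual review or for a second pass with a different language hint.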

Pricing

This actor uses pay-per-event billing:

  • data_point: $0.01 per result

Use Cases

Document Digitization

  • Archive processing: Make historical documents searchable
  • Paper to digital: Convert scanned documents to text
  • Record keeping: Digitize contracts, invoices, receipts

Data Extraction

  • Invoice processing: Extract line items, totals, dates
  • Form processing: Pull data from scanned forms
  • Contract analysis: Extract key terms and clauses

Research & Academia

  • Academic papers: Extract text from PDF research papers
  • Book scanning: Digitize book chapters and pages
  • Citation extraction: Pull references from documents

Legal & Compliance

  • Legal discovery: Process large document sets
  • Contract review: Extract text for analysis
  • Compliance audits: Digitize paper records

Developers

  • API integration: RESTful JSON responses
  • Webhook support: Async processing for large documents
  • Multiple formats: Text, JSON, or Markdown output


Common Problems & Solutions

"Invalid API key" error

Cause: Your API key is wrong, expired, or doesn't have the right permissions.
Fix: Double-check your API key. Make sure you copied it exactly, without extra spaces.

"Rate limit exceeded" error

Cause: You've hit the API's rate limits.
Fix: Wait a few minutes, then try again. Consider reducing the number of concurrent requests.
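
The "wait, then try again" advice can be automated with exponential backoff. The Python sketch below is a generic pattern, not the actor's own retry logic: `RateLimitError` is a hypothetical exception standing in for however your client reports a rate-limit response, and the delay values are arbitrary starting points.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical exception representing a 'Rate limit exceeded' response."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            # Give up after the final attempt and surface the error.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request() -> str:
        # Simulated request that is rate limited twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitError("Rate limit exceeded")
        return "ok"

    print(call_with_backoff(flaky_request, base_delay=0.1))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.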

Empty or incomplete results

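Cause: The host serving the PDF may block automated access, or the requested content does not exist (for example, a wrong URL or page range).
Fix:

  • Check that pdfUrl is correct and reachable
  • Try different parameters, such as pageRange or language
  • Some sites block automated access; in that case, supply the file via pdfBase64 instead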

Demo data showing instead of real results

Cause: demoMode is still set to true.
Fix: Set demoMode: false and provide your API key(s).


Built by John Rippy | Actor Arsenal