PDF OCR API - Document Extraction
Pricing
from $0.01 / 1,000 results
Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.
PDF OCR API
"Extract Text from Any PDF for Pennies" by John Rippy | johnrippy.link
Stop Paying for Expensive OCR Services
You're currently paying: Adobe Acrobat Pro ($22.99/mo), ABBYY FineReader ($199/year), Google Document AI ($1.50/1000 pages), Amazon Textract ($1.50/1000 pages).
What if you could extract text for a fraction of the cost?
The PDF OCR API extracts text from any PDF - scanned documents, image-based PDFs, and multi-page files:
- Scanned document support (OCR)
- Multi-page processing (any length)
- Support for 14 languages (English, Spanish, French, German, Chinese, Japanese, and more)
- Table structure preservation
- Multiple output formats (text, JSON, Markdown)
- Confidence scores per page
- Page-by-page results
Pay only for what you use. No monthly subscriptions. No minimum commitments.
Why Choose This Over Traditional OCR Services
1. Pay-Per-Page, Not Per-Month
Traditional tools: $20-$200/month for your business.
This actor: Pay per page processed. Process 100 pages for ~$5. Process 1,000 for ~$50.
Process 400 pages/month (~$20) and still pay less than an Adobe subscription.
2. Support for Any PDF
- Scanned PDFs: Image-based documents from scanners
- Digital PDFs: Native text extraction (faster, more accurate)
- Mixed PDFs: Pages with both text and images
- Multi-page: No limit on document length
3. 14 Languages Supported
| Language | Code | Language | Code |
|---|---|---|---|
| English | eng | Russian | rus |
| Spanish | spa | Japanese | jpn |
| French | fra | Chinese (Simplified) | chi_sim |
| German | deu | Chinese (Traditional) | chi_tra |
| Italian | ita | Korean | kor |
| Portuguese | por | Arabic | ara |
| Dutch | nld | Polish | pol |
4. Table Detection
Preserve table structure from scanned documents. Get rows and columns as structured data.
Quick Start Examples
Example 1: Extract Text from URL
{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "eng",
  "outputFormat": "text"
}
Example 2: Process Specific Pages
{
  "pdfUrl": "https://example.com/document.pdf",
  "pageRange": "1-5",
  "language": "eng",
  "detectTables": true
}
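The `pageRange` value in Example 2 is a plain string, so it can help to expand it on the client side first, for instance to count how many pages a run will bill. The sketch below is one way to do that. `parsePageRange` is a hypothetical helper, not part of the actor, and it assumes mixed forms such as "1-3,7" are acceptable, which the actor itself may not allow.

```javascript
// Hypothetical client-side helper: expands a pageRange string such as
// "1-5" or "1,3,5" into a sorted list of unique page numbers.
// Assumption: mixed forms like "1-3,7" are treated as valid here; the
// actor's own parser may be stricter.
function parsePageRange(pageRange) {
  const pages = new Set();
  for (const part of pageRange.split(',')) {
    const match = part.trim().match(/^(\d+)(?:-(\d+))?$/);
    if (!match) {
      throw new Error(`Invalid page range segment: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range segment: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return [...pages].sort((a, b) => a - b);
}

console.log(parsePageRange('1-5'));   // [ 1, 2, 3, 4, 5 ]
console.log(parsePageRange('1,3,5')); // [ 1, 3, 5 ]
```

The length of the returned array gives the number of pages the request asks for.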
Example 3: Multi-Language Document
{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "spa",
  "outputFormat": "json"
}
Example 4: With Webhook
{
  "pdfUrl": "https://example.com/document.pdf",
  "webhookUrl": "https://hooks.zapier.com/hooks/catch/12345/abcdef/"
}
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| pdfUrl | string | Yes* | URL of the PDF file to process |
| pdfBase64 | string | Yes* | Base64-encoded PDF (alternative to URL) |
| language | string | No | OCR language hint (default: eng) |
| pageRange | string | No | Pages to process (e.g., "1-5" or "1,3,5") |
| outputFormat | string | No | Output format: text, json, markdown |
| detectTables | boolean | No | Attempt to preserve table structure |
| webhookUrl | string | No | Webhook URL for async results |
| demoMode | boolean | No | Return sample output without processing |
*Either pdfUrl or pdfBase64 is required
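Checking an input object against the parameter table before starting a run avoids paying for a run that fails on a typo. The following is a minimal sketch of such a check. `validateInput` is a hypothetical helper, the language list is copied from the language table in this README, and the actor may apply additional or different rules on its side.

```javascript
// Language codes copied from the "14 Languages Supported" table.
const SUPPORTED_LANGUAGES = new Set([
  'eng', 'spa', 'fra', 'deu', 'ita', 'por', 'nld',
  'rus', 'jpn', 'chi_sim', 'chi_tra', 'kor', 'ara', 'pol',
]);
const OUTPUT_FORMATS = new Set(['text', 'json', 'markdown']);

// Hypothetical pre-flight check; returns a list of problems (empty if OK).
function validateInput(input) {
  const errors = [];
  const hasUrl = typeof input.pdfUrl === 'string' && input.pdfUrl.length > 0;
  const hasBase64 = typeof input.pdfBase64 === 'string' && input.pdfBase64.length > 0;
  if (!hasUrl && !hasBase64) {
    errors.push('Either pdfUrl or pdfBase64 is required');
  }
  if (input.language !== undefined && !SUPPORTED_LANGUAGES.has(input.language)) {
    errors.push(`Unsupported language code: ${input.language}`);
  }
  if (input.outputFormat !== undefined && !OUTPUT_FORMATS.has(input.outputFormat)) {
    errors.push(`Unsupported outputFormat: ${input.outputFormat}`);
  }
  for (const flag of ['detectTables', 'demoMode']) {
    if (input[flag] !== undefined && typeof input[flag] !== 'boolean') {
      errors.push(`${flag} must be a boolean`);
    }
  }
  return errors;
}

// A valid input produces an empty list; an invalid one lists each problem.
console.log(validateInput({ pdfUrl: 'https://example.com/document.pdf', language: 'eng' }));
console.log(validateInput({ language: 'xx', detectTables: 'yes' }));
```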
Output Format
{
  "success": true,
  "fileName": "document.pdf",
  "totalPages": 5,
  "processedPages": 5,
  "language": "eng",
  "processingTime": 2.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "This is the extracted text from page 1...",
      "confidence": 95.2,
      "wordCount": 342,
      "hasImages": true,
      "tables": [
        {
          "rows": 5,
          "columns": 3,
          "data": [["Header1", "Header2", "Header3"], ...]
        }
      ]
    }
  ],
  "fullText": "Complete document text concatenated...",
  "wordCount": 1250,
  "averageConfidence": 94.5
}
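Since each table arrives as an array of rows under `pages[].tables[].data`, turning it into CSV for a spreadsheet takes only a few lines. The sketch below assumes the result has the shape shown above. `tablesToCsv` and `toCsvCell` are hypothetical helpers, and the sample cell values are made up for illustration.

```javascript
// Escapes a cell for CSV: quote it if it contains a comma, quote, or newline.
function toCsvCell(value) {
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Converts every table in a result object to CSV text, tagged with its page.
function tablesToCsv(result) {
  const output = [];
  for (const page of result.pages ?? []) {
    (page.tables ?? []).forEach((table, index) => {
      const csv = table.data.map((row) => row.map(toCsvCell).join(',')).join('\n');
      output.push({ pageNumber: page.pageNumber, tableIndex: index, csv });
    });
  }
  return output;
}

// Sample shaped like the output format; the cell values are invented.
const sampleOcrResult = {
  pages: [{
    pageNumber: 1,
    tables: [{
      rows: 2,
      columns: 3,
      data: [['Item', 'Qty', 'Price'], ['Widget, large', '2', '9.50']],
    }],
  }],
};
console.log(tablesToCsv(sampleOcrResult)[0].csv);
// Item,Qty,Price
// "Widget, large",2,9.50
```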
Pay-Per-Event Pricing
You only pay for what you use. No monthly fees. No minimums.
| Event | Description | Price |
|---|---|---|
| page_processed | Each page extracted | $0.05 |
| table_detected | Each table found and parsed | $0.02 |
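The two event prices combine into a simple per-run estimate. The sketch below uses the prices from the table. `estimateCost` is a hypothetical helper, and it assumes each detected table is billed once on top of the page charge, which is how the table reads but is not spelled out further here.

```javascript
// Prices copied from the pay-per-event table, expressed in cents so the
// arithmetic stays in whole numbers.
const PRICE_CENTS = { page_processed: 5, table_detected: 2 };

// Estimates the charge for a run, in dollars.
// Assumption: every detected table adds one table_detected event.
function estimateCost(pages, tables = 0) {
  const cents = pages * PRICE_CENTS.page_processed + tables * PRICE_CENTS.table_detected;
  return cents / 100;
}

console.log(estimateCost(100));      // 5
console.log(estimateCost(1000, 40)); // 50.8
```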
Cost Examples
| Task | This Actor | Adobe | Google Doc AI |
|---|---|---|---|
| 100 pages | ~$5 | $22.99/mo | $0.15 |
| 500 pages | ~$25 | $22.99/mo | $0.75 |
| 1,000 pages | ~$50 | $22.99/mo | $1.50 |
| 10,000 pages | ~$500 | $22.99/mo | $15.00 |
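For low volume (under about 460 pages per month at $0.05 per page), this costs less than a $22.99/month subscription. At higher volumes, compare against the per-page cloud API prices in the table above before committing.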
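The point at which per-page pricing overtakes a flat subscription follows directly from the two prices. This sketch computes it for the $22.99 Adobe figure and the $0.05 page price quoted in this README. `breakEvenPages` is a hypothetical helper and ignores table charges.

```javascript
// Largest whole number of pages per month whose per-page cost stays at or
// below a flat subscription price. Works in cents to avoid float drift.
function breakEvenPages(subscriptionDollars, pricePerPageDollars = 0.05) {
  const subscriptionCents = Math.round(subscriptionDollars * 100);
  const pageCents = Math.round(pricePerPageDollars * 100);
  return Math.floor(subscriptionCents / pageCents);
}

console.log(breakEvenPages(22.99)); // 459
```

At 459 pages the run costs $22.95; one more page brings it to $23.00, just above the subscription price.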
Use Cases
Document Digitization
- Archive processing: Make historical documents searchable
- Paper to digital: Convert scanned documents to text
- Record keeping: Digitize contracts, invoices, receipts
Data Extraction
- Invoice processing: Extract line items, totals, dates
- Form processing: Pull data from scanned forms
- Contract analysis: Extract key terms and clauses
Research & Academia
- Academic papers: Extract text from PDF research papers
- Book scanning: Digitize book chapters and pages
- Citation extraction: Pull references from documents
Legal & Compliance
- Legal discovery: Process large document sets
- Contract review: Extract text for analysis
- Compliance audits: Digitize paper records
Developers
- API integration: RESTful JSON responses
- Webhook support: Async processing for large documents
- Multiple formats: Text, JSON, or Markdown output
Confidence Scores
Each page includes a confidence score (0-100%):
| Score | Quality | Recommended Action |
|---|---|---|
| 95-100% | Excellent | High confidence, use as-is |
| 85-94% | Good | Reliable for most purposes |
| 70-84% | Fair | Review for critical data |
| Below 70% | Poor | Manual verification recommended |
Low confidence usually indicates:
- Poor scan quality
- Unusual fonts
- Handwritten text
- Low resolution images
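The confidence bands above translate naturally into a review queue. The sketch below flags pages that fall under a chosen threshold. `qualityBand` and `pagesNeedingReview` are hypothetical helpers, and because the table lists whole-number ranges, the handling of fractional scores such as 94.5 is an assumption.

```javascript
// Maps a page confidence score to the quality bands in the table above.
// Assumption: fractional scores fall into the lower band (94.5 is "Good").
function qualityBand(confidence) {
  if (confidence >= 95) return 'Excellent';
  if (confidence >= 85) return 'Good';
  if (confidence >= 70) return 'Fair';
  return 'Poor';
}

// Returns the pages whose confidence falls below a review threshold.
function pagesNeedingReview(result, threshold = 85) {
  return result.pages
    .filter((page) => page.confidence < threshold)
    .map((page) => ({ pageNumber: page.pageNumber, band: qualityBand(page.confidence) }));
}

// Invented scores, shaped like the pages array in the output format.
const scoredResult = {
  pages: [
    { pageNumber: 1, confidence: 95.2 },
    { pageNumber: 2, confidence: 78.0 },
    { pageNumber: 3, confidence: 61.4 },
  ],
};
console.log(pagesNeedingReview(scoredResult));
```

With the default threshold of 85, pages 2 ("Fair") and 3 ("Poor") are returned and page 1 is skipped.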
API Integration
Using the Apify API (JavaScript)
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('localhowl/pdf-ocr-api').call({
  pdfUrl: 'https://example.com/document.pdf',
  language: 'eng',
  outputFormat: 'json'
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].fullText);
Using cURL
curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"pdfUrl": "https://example.com/document.pdf", "language": "eng"}'
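The `/runs` endpoint used above starts a run and returns its metadata, not the extracted text. For short documents, Apify's `run-sync-get-dataset-items` endpoint runs the actor and returns the dataset items in a single response. The sketch below builds that URL and wraps the request. `buildRunSyncUrl` and `runOcr` are hypothetical helper names, `runOcr` is defined but not called here, and the synchronous endpoint has a response time limit, so long documents are likely better served by the asynchronous call plus a webhook.

```javascript
// Builds the Apify endpoint that runs an actor and returns its dataset items
// in one HTTP response. The API expects "user~actor" in the path.
function buildRunSyncUrl(actorId, token) {
  const pathId = actorId.replace('/', '~');
  const url = new URL(`https://api.apify.com/v2/acts/${pathId}/run-sync-get-dataset-items`);
  url.searchParams.set('token', token);
  return url.toString();
}

// Not invoked here: sends the input and resolves to the dataset items.
// Uses the global fetch available in Node 18+.
async function runOcr(input, token) {
  const response = await fetch(buildRunSyncUrl('localhowl/pdf-ocr-api', token), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(input),
  });
  if (!response.ok) {
    throw new Error(`Apify API returned HTTP ${response.status}`);
  }
  return response.json();
}

console.log(buildRunSyncUrl('localhowl/pdf-ocr-api', 'YOUR_API_TOKEN'));
```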
Base64 Upload (for local files)
# Convert PDF to base64 on a single line (tr strips the line wraps some base64 builds add)
base64 < document.pdf | tr -d '\n' > document_b64.txt
# Send to API
curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"pdfBase64": "'"$(cat document_b64.txt)"'", "language": "eng"}'
Webhook Integration (Zapier, Make, n8n)
Webhook Payload Format
{
  "event": "ocr_completed",
  "timestamp": "2025-12-23T12:00:00.000Z",
  "actor": "pdf-ocr-api",
  "runId": "abc123",
  "totalPages": 10,
  "processedPages": 10,
  "averageConfidence": 92.5,
  "fullText": "...",
  "pages": [...]
}
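A receiving service usually needs only a few fields from this payload to decide what to do next. The sketch below shows one possible shape for that logic, independent of any web framework. `summarizeWebhook` is a hypothetical helper, and the review threshold of 85 is an assumption borrowed from the confidence table, not something the actor enforces.

```javascript
// Hypothetical handler logic for the webhook payload shown above.
// Returns a short summary, or null when the event is not an OCR completion.
function summarizeWebhook(payload) {
  if (payload.event !== 'ocr_completed') {
    return null;
  }
  return {
    runId: payload.runId,
    complete: payload.processedPages === payload.totalPages,
    needsReview: payload.averageConfidence < 85, // assumed threshold
    message: `OCR run ${payload.runId}: ${payload.processedPages}/${payload.totalPages} pages, ` +
      `average confidence ${payload.averageConfidence}%`,
  };
}

// Payload values taken from the example above (fullText and pages omitted).
const examplePayload = {
  event: 'ocr_completed',
  timestamp: '2025-12-23T12:00:00.000Z',
  actor: 'pdf-ocr-api',
  runId: 'abc123',
  totalPages: 10,
  processedPages: 10,
  averageConfidence: 92.5,
};
console.log(summarizeWebhook(examplePayload).message);
// OCR run abc123: 10/10 pages, average confidence 92.5%
```

The `message` string is the kind of text a Slack or email notification step could forward as-is.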
Common Automations
- Google Drive: Save extracted text alongside PDFs
- Notion/Coda: Create searchable document database
- Slack: Notify when processing completes
- CRM: Attach extracted text to records
Limitations
- File Size: Maximum 50MB per PDF
- Handwriting: Limited support for handwritten text
- Complex Layouts: Multi-column layouts may merge incorrectly
- Image Quality: Low-resolution scans reduce accuracy
- Encrypted PDFs: Password-protected PDFs not supported
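The file size limit is easy to check before uploading, and for `pdfBase64` input it is worth knowing how much the encoding inflates the request body. The sketch below is a hypothetical pre-flight check. It assumes "50MB" means 50 × 1024 × 1024 bytes and that the limit applies to the raw PDF, not to its base64 form; neither detail is stated above.

```javascript
const MAX_PDF_BYTES = 50 * 1024 * 1024; // Assumption: "50MB" means 50 MiB.

// Base64 encodes every 3 bytes as 4 characters, padding the final group.
function base64Length(byteLength) {
  return Math.ceil(byteLength / 3) * 4;
}

// Hypothetical pre-flight check against the documented size limit.
function checkPdfSize(byteLength) {
  return {
    withinLimit: byteLength <= MAX_PDF_BYTES,
    base64Characters: base64Length(byteLength),
  };
}

console.log(checkPdfSize(3 * 1024 * 1024));
// { withinLimit: true, base64Characters: 4194304 }
```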
Support
- Email: john@johnrippy.link
- GitHub: Report issues on the repository
Built by John Rippy | johnrippy.link
Keywords
pdf ocr, pdf text extraction, ocr api, scanned pdf to text, document digitization, pdf scraper, image to text, optical character recognition, pdf parser, document processing, invoice ocr, form extraction, adobe alternative, abbyy alternative, tesseract ocr, multi-language ocr