PDF OCR API - Document Extraction
Pricing
from $0.01 / 1,000 results
Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.
Developer
The Howlers
Last modified: 7 days ago
PDF OCR API
Extract text from PDF files using OCR (Tesseract). Supports scanned documents, images, and multi-page PDFs. Returns structured text with page numbers and confidence scores.
Features
- OCR extraction via Tesseract (eng, spa, fra, deu supported)
- Multi-page PDFs with parallel processing (4 pages at a time)
- Page range selection (e.g., "1-5" or "1,3,5")
- Output formats: JSON, plain text, or Markdown
- Webhook support for async workflows
- 100MB file size limit, 500 page limit
- Pay-per-page billing — demo mode is free
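The page range syntax listed above ("1-5" or "1,3,5") can be illustrated with a small client-side parser. This is only a sketch: `parse_page_range` is a hypothetical helper, not part of the actor, and it assumes 1-based inclusive ranges and that mixed forms such as "1-3,7" are accepted, which the feature list does not confirm.

```python
from typing import List, Optional


def parse_page_range(spec: str, total_pages: Optional[int] = None) -> List[int]:
    """Expand a string such as "1-5" or "1,3,5" into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty segment in page range {spec!r}")
        if "-" in part:
            # Assumption: ranges are inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid segment {part!r} in page range {spec!r}")
        pages.update(range(start, end + 1))
    if total_pages is not None:
        # Drop pages beyond the end of the document.
        pages = {page for page in pages if page <= total_pages}
    return sorted(pages)
```

Checking a range locally before submitting a run may help catch typos before any pages are billed.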
Quick Start
{"pdfUrl": "https://example.com/document.pdf","language": "eng","outputFormat": "json"}
Demo Mode
Set demoMode: true to test with sample data at no charge. For real results, set demoMode: false and provide either pdfUrl or pdfBase64.
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| pdfUrl | string | Yes* | URL of the PDF file to process |
| pdfBase64 | string | Yes* | Base64-encoded PDF (alternative to URL) |
| language | string | No | OCR language: eng, spa, fra, deu (default: eng) |
| pageRange | string | No | Pages to process (e.g., "1-5" or "1,3,5") |
| outputFormat | string | No | Output format: text, json, markdown |
| detectTables | boolean | No | Attempt to preserve table structure |
| webhookUrl | string | No | Webhook URL for async results |
| demoMode | boolean | No | Return sample output without processing (free) |
*Either pdfUrl or pdfBase64 is required
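As a rough illustration of how these parameters fit together, the sketch below assembles an input object and checks it against the documented constraints. `build_input` is a hypothetical helper; the parameter names and allowed values come from the table, while three points are assumptions: that exactly one of pdfUrl and pdfBase64 should be sent, that neither is needed in demo mode, and that the 100MB limit applies to the raw file size.

```python
import base64
import json
from pathlib import Path

SUPPORTED_LANGUAGES = {"eng", "spa", "fra", "deu"}
OUTPUT_FORMATS = {"text", "json", "markdown"}
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: limit is on the raw file, in binary MB


def build_input(pdf_url=None, pdf_path=None, language="eng", output_format="json",
                page_range=None, detect_tables=False, demo_mode=False):
    """Assemble an input object using the parameter names from the table."""
    # Assumption: a real run needs exactly one source; demo mode needs none.
    if not demo_mode and bool(pdf_url) == bool(pdf_path):
        raise ValueError("provide exactly one of pdf_url or pdf_path")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")

    payload = {
        "language": language,
        "outputFormat": output_format,
        "detectTables": detect_tables,
        "demoMode": demo_mode,
    }
    if pdf_url:
        payload["pdfUrl"] = pdf_url
    if pdf_path:
        data = Path(pdf_path).read_bytes()
        if len(data) > MAX_FILE_BYTES:
            raise ValueError("file exceeds the 100MB limit")
        payload["pdfBase64"] = base64.b64encode(data).decode("ascii")
    if page_range:
        payload["pageRange"] = page_range
    return payload


if __name__ == "__main__":
    example = build_input(pdf_url="https://example.com/document.pdf", page_range="1-5")
    print(json.dumps(example, indent=2))
```

Validating locally means a mistyped language code or a missing source fails before the run starts rather than after.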
Output Format
{
  "success": true,
  "fileName": "document.pdf",
  "totalPages": 5,
  "processedPages": 5,
  "language": "eng",
  "processingTime": 2.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "This is the extracted text from page 1...",
      "confidence": 95.2,
      "wordCount": 342,
      "hasImages": true
    }
  ],
  "fullText": "Complete document text concatenated...",
  "wordCount": 1250,
  "averageConfidence": 94.5
}
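The per-page confidence values make it possible to flag pages that may need a manual check. The sketch below parses a trimmed copy of the sample output and lists pages below a threshold. `low_confidence_pages` is a hypothetical helper, the threshold of 90 is an arbitrary choice, and the 0-100 confidence scale is inferred from the sample values rather than stated by the source.

```python
import json

# Trimmed copy of the sample output, keeping only the fields used here.
SAMPLE_OUTPUT = """
{
  "success": true,
  "totalPages": 5,
  "pages": [
    {"pageNumber": 1, "text": "This is the extracted text from page 1...",
     "confidence": 95.2, "wordCount": 342, "hasImages": true}
  ],
  "averageConfidence": 94.5
}
"""


def low_confidence_pages(result, threshold=90.0):
    """Return page numbers whose OCR confidence is below the threshold."""
    return [
        page["pageNumber"]
        for page in result.get("pages", [])
        # A page with no confidence value is treated as low confidence.
        if page.get("confidence", 0.0) < threshold
    ]


if __name__ == "__main__":
    result = json.loads(SAMPLE_OUTPUT)
    print(low_confidence_pages(result))
```

Pages flagged this way could be rescanned or reprocessed individually with pageRange.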
Pricing
Pay-per-page billing:
- $0.10 per page processed with OCR
- Demo mode is free (no charge)
- You are charged only for pages that produce output
Cost Examples
| Document | Pages | Cost |
|---|---|---|
| Single invoice | 1 | $0.10 |
| 10-page contract | 10 | $1.00 |
| 50-page report | 50 | $5.00 |
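The cost table follows directly from the $0.10 per-page rate. A minimal sketch of the arithmetic is shown below; `estimate_cost` is a hypothetical helper, and because pages that produce no output are not charged, the result is best read as an upper bound.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.10")  # per-page rate from the pricing section


def estimate_cost(pages: int, demo_mode: bool = False) -> Decimal:
    """Upper-bound cost in USD for processing the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if demo_mode:
        return Decimal("0.00")  # demo mode is free
    return PRICE_PER_PAGE * pages


if __name__ == "__main__":
    for pages in (1, 10, 50):
        print(pages, estimate_cost(pages))
```

Decimal is used instead of float so the totals stay exact to the cent.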
Use Cases
- Invoice processing: Extract line items, totals, dates
- Document digitization: Make scanned documents searchable
- Contract analysis: Extract key terms and clauses
- Academic papers: Extract text from PDF research papers
- Legal discovery: Process document sets at scale
- Form processing: Pull data from scanned forms
Common Problems
Empty or low-confidence results
Cause: The PDF may be image-heavy and scanned at poor quality.
Fix: Rescan the source document at a higher DPI, and make sure the language parameter matches the language of the document.
Timeout errors
Cause: Very large or complex pages can exceed processing limits.
Fix: Use pageRange to process specific pages instead of the entire document.
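One way to apply this fix to a long document is to split it into consecutive pageRange values and submit one run per batch. The sketch below generates those strings. `page_batches` is a hypothetical helper, the batch size of 20 is an arbitrary starting point, and it assumes a single page number such as "5" is a valid pageRange value.

```python
from typing import List


def page_batches(total_pages: int, batch_size: int = 20) -> List[str]:
    """Split pages 1..total_pages into consecutive pageRange strings."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # Assumption: a lone page number is accepted as a range.
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


if __name__ == "__main__":
    print(page_batches(50, 20))
```

If a batch still times out, reducing the batch size narrows down which pages are the slow ones.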
Demo data showing instead of real results
Cause: demoMode is still set to true.
Fix: Set demoMode: false and provide a pdfUrl or pdfBase64.