Pricing

$24.99/month + usage

Document Extractor API - AI-Powered PDF & Text Analysis

Extract text and data from PDF, Word, and image documents using AI-powered OCR. Convert documents to structured JSON, analyze content, and extract insights. No API keys are required, and the actor falls back to mirror OCR services when a primary service is unavailable.

Rating

0.0

(0)

Developer

Brennan Crawford

Last modified

2 months ago

🚀 Revolutionary Features

🧠 AI-Powered OCR: Advanced text extraction from PDFs, images, and documents
🔄 No-API Protocol: Zero authentication required with mirror fallbacks
📄 Multi-Format Support: PDF, Word, images, HTML, and text files
🌐 Mirror Fallbacks: Automatic fallback to alternative OCR services
🔍 Smart Filtering: Extract only documents containing specific keywords
📊 Multiple Outputs: JSON, Markdown, plain text, or structured formats
🌍 Language Detection: Automatic language identification
⚡ High Performance: Process multiple documents in parallel

🎯 Use Cases

Document Processing

Extract text from scanned PDFs and images
Convert documents to searchable text
Process invoices, contracts, and reports
Analyze research papers and articles

Content Analysis

Extract key information from documents
Filter documents by keywords and topics
Analyze document structure and metadata
Prepare documents for AI processing

Business Intelligence

Process financial reports and statements
Extract data from legal documents
Analyze customer communications
Monitor document trends and patterns

📋 Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `documentUrls` | string | "" | URLs of documents to process (one per line) |
| `extractionStrategy` | string | "hybrid" | OCR, text, hybrid, or advanced extraction |
| `outputFormat` | string | "json" | JSON, Markdown, plain text, or structured |
| `languageDetection` | boolean | true | Detect document language automatically |
| `includeMetadata` | boolean | true | Extract document metadata |
| `maxTextLength` | integer | 10000 | Maximum characters per document |
| `searchKeywords` | string | "" | Filter by keywords (comma-separated) |
| `useMirrorFallbacks` | boolean | true | Enable mirror site fallbacks |
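
As a rough illustration of how these parameters fit together, the sketch below assembles an input object in Python. The helper name `build_input` is hypothetical, and the defaults simply mirror the table above. It does not validate the values of `extractionStrategy` or `outputFormat`, because the exact accepted strings are not listed here.

```python
import json

# Defaults copied from the parameter table above.
DEFAULTS = {
    "documentUrls": "",
    "extractionStrategy": "hybrid",
    "outputFormat": "json",
    "languageDetection": True,
    "includeMetadata": True,
    "maxTextLength": 10000,
    "searchKeywords": "",
    "useMirrorFallbacks": True,
}


def build_input(urls, keywords=(), **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    run_input = {**DEFAULTS, **overrides}
    run_input["documentUrls"] = "\n".join(urls)       # one URL per line
    run_input["searchKeywords"] = ",".join(keywords)  # comma-separated
    return run_input


if __name__ == "__main__":
    example = build_input(
        ["https://example.com/a.pdf", "https://example.com/b.png"],
        keywords=["revenue", "profit"],
        maxTextLength=5000,
    )
    print(json.dumps(example, indent=2))
```

Rejecting unknown keys early is a design choice of this sketch. It catches misspelled parameter names before a run is started.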

📄 Supported Document Types

PDF Documents

Text-based PDFs with direct extraction
Scanned PDFs with OCR processing
Multi-page document support
Table and figure extraction

Image Files

JPEG, PNG, GIF, BMP, TIFF support
Advanced OCR technology
Multi-language text recognition
High accuracy processing

Text Documents

HTML and web pages
Plain text files
Markdown documents
Structured content extraction

📊 Output Format Examples

JSON Output

{
  "document_id": "doc_12345",
  "file_name": "report.pdf",
  "file_type": "pdf",
  "extracted_text": "Complete document text content...",
  "text_length": 5420,
  "extraction_method": "pdf_direct",
  "language": "eng",
  "confidence_score": 0.95,
  "processing_time": 2.3,
  "extracted_at": "2024-01-15T10:30:00Z"
}
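
A record like the one above can be loaded into a typed structure before further processing. The following is a minimal sketch: the `ExtractionRecord` class, the `parse_record` and `is_reliable` helpers, and the 0.9 threshold are assumptions made for illustration, not part of the actor.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ExtractionRecord:
    """Hypothetical typed view of a subset of the JSON output fields."""
    document_id: str
    file_name: str
    extracted_text: str
    language: str
    confidence_score: float
    extracted_at: datetime


def parse_record(raw: str) -> ExtractionRecord:
    data = json.loads(raw)
    return ExtractionRecord(
        document_id=data["document_id"],
        file_name=data["file_name"],
        extracted_text=data["extracted_text"],
        language=data["language"],
        confidence_score=float(data["confidence_score"]),
        # Swap the trailing "Z" for an explicit offset so that older
        # Python versions can parse the timestamp.
        extracted_at=datetime.fromisoformat(
            data["extracted_at"].replace("Z", "+00:00")
        ),
    )


def is_reliable(record: ExtractionRecord, threshold: float = 0.9) -> bool:
    """Assumed rule of thumb: keep records at or above the threshold."""
    return record.confidence_score >= threshold


# Sample record copied from the JSON output example above.
SAMPLE = """
{
  "document_id": "doc_12345",
  "file_name": "report.pdf",
  "file_type": "pdf",
  "extracted_text": "Complete document text content...",
  "text_length": 5420,
  "extraction_method": "pdf_direct",
  "language": "eng",
  "confidence_score": 0.95,
  "processing_time": 2.3,
  "extracted_at": "2024-01-15T10:30:00Z"
}
"""

if __name__ == "__main__":
    record = parse_record(SAMPLE)
    print(record.file_name, record.language, is_reliable(record))
```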

Markdown Output

# report.pdf

Complete document text content with proper formatting...

Structured Output

{
  "extracted_text": "...",
  "structured_data": {
    "word_count": 850,
    "line_count": 120,
    "char_count": 5420,
    "has_tables": true,
    "has_images": false
  }
}
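
The count fields in `structured_data` appear to be simple text statistics. How the actor computes them is not documented, so the sketch below only shows one plausible way to derive comparable numbers from extracted text. The function name `text_stats` is hypothetical. `has_tables` and `has_images` are left out because they cannot be derived from plain text alone.

```python
def text_stats(text: str) -> dict:
    """Compute simple counts comparable to the structured_data fields."""
    return {
        "word_count": len(text.split()),       # whitespace-separated tokens
        "line_count": len(text.splitlines()),  # number of lines
        "char_count": len(text),               # includes whitespace
    }


if __name__ == "__main__":
    print(text_stats("Total revenue\nrose 12%"))
```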

🔧 Technical Architecture

No-API Protocol Implementation

Primary OCR Services: OCR.space, PDF24, Optiic
Mirror Fallbacks: Jina AI proxies for reliability
Zero Authentication: Public demo endpoints
Error Handling: Graceful degradation with sample data
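
The fallback behaviour described above can be sketched as trying a list of providers in order until one succeeds. This is an illustration only. The provider functions are stand-ins and do not call OCR.space, PDF24, Optiic, or Jina AI, whose endpoints and response formats are not described here. Unlike the actor, which is described as degrading to sample data, this sketch returns `None` when every provider fails.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A provider is a (name, callable) pair; the callable takes a URL and returns text.
Provider = Tuple[str, Callable[[str], str]]


def extract_with_fallbacks(
    url: str, providers: Iterable[Provider]
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each provider in order; return (provider_name, text, errors)."""
    errors: List[str] = []
    for name, extract in providers:
        try:
            return name, extract(url), errors
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    return None, None, errors


# Stand-in providers used only to demonstrate the control flow.
def failing_primary(url: str) -> str:
    raise TimeoutError("primary service timed out")


def working_mirror(url: str) -> str:
    return f"text extracted from {url}"


if __name__ == "__main__":
    print(extract_with_fallbacks(
        "https://example.com/doc.pdf",
        [("primary", failing_primary), ("mirror", working_mirror)],
    ))
```

Collecting the error messages alongside the result makes it possible to tell from the output which services were skipped.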

Processing Pipeline

Document Detection: Automatic file type identification
Extraction Method: Direct text or OCR based on content
Language Detection: Automatic language identification
Format Conversion: Output in requested format
Quality Assurance: Confidence scoring and validation
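
The first two pipeline stages can be pictured roughly as follows. This sketch guesses the file type from the URL's extension and then picks an extraction method. The function names, the extension lists, and the `"ocr"`, `"text"`, and `"unsupported"` labels are assumptions; only `"pdf_direct"` appears in the sample output on this page. A real implementation would probably inspect the file contents rather than trust the extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif"}
TEXT_EXTENSIONS = {".html", ".htm", ".txt", ".md"}


def detect_file_type(url: str) -> str:
    """Classify a document URL by the extension of its path."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return "unknown"


def choose_method(file_type: str, has_text_layer: bool = False) -> str:
    """Pick direct extraction when text is available, OCR otherwise."""
    if file_type == "pdf":
        return "pdf_direct" if has_text_layer else "ocr"
    if file_type == "image":
        return "ocr"
    if file_type == "text":
        return "text"
    return "unsupported"


if __name__ == "__main__":
    for url in ("https://example.com/doc1.pdf", "https://example.com/doc2.jpg"):
        kind = detect_file_type(url)
        print(url, kind, choose_method(kind))
```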

🚀 Getting Started

# Clone the actor
apify pull document-extractor-api

# Install dependencies
pip install -r requirements.txt

# Test locally
python test_extractor.py

# Deploy to Apify
apify push

📈 Performance Metrics

Processing Speed: 2-5 seconds per document
Accuracy: 95%+ for clear documents
Language Support: 100+ languages
File Size: Up to 50MB per document
Concurrent Processing: Multiple documents

🌐 Integration Examples

Basic Document Extraction

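# Extract text from PDF documents.
# Assumes `from apify import Actor` and an active `async with Actor:` context.
# The actor ID is taken from the CLI command in Getting Started and may need a username prefix.
run = await Actor.call("document-extractor-api", run_input={
    "documentUrls": "https://example.com/document.pdf",
    "extractionStrategy": "hybrid",
    "outputFormat": "json",
})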

Keyword-Based Filtering

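# Extract only documents containing specific terms.
# Assumes `from apify import Actor` and an active `async with Actor:` context.
# The actor ID is taken from the CLI command in Getting Started and may need a username prefix.
run = await Actor.call("document-extractor-api", run_input={
    "documentUrls": "https://example.com/financial-report.pdf",
    "searchKeywords": "revenue,profit,financial",
    "maxTextLength": 5000,
})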
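
To make the filtering semantics concrete, here is a small client-side sketch of what a keyword filter of this kind might do. Whether the actor matches any keyword or all of them, and whether matching is case-sensitive, is not stated on this page. The version below assumes a case-insensitive match on any keyword, and the function name `matches_keywords` is hypothetical.

```python
def matches_keywords(text: str, search_keywords: str) -> bool:
    """Return True if the text contains at least one of the keywords.

    `search_keywords` uses the same comma-separated format as the
    `searchKeywords` input parameter.
    """
    keywords = [k.strip().lower() for k in search_keywords.split(",") if k.strip()]
    if not keywords:
        return True  # an empty filter keeps every document
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


if __name__ == "__main__":
    docs = ["Quarterly Revenue grew by 12%", "Meeting notes from Tuesday"]
    kept = [d for d in docs if matches_keywords(d, "revenue,profit,financial")]
    print(kept)
```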

Batch Processing

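# Process multiple documents.
# Assumes `from apify import Actor` and an active `async with Actor:` context.
# The actor ID is taken from the CLI command in Getting Started and may need a username prefix.
run = await Actor.call("document-extractor-api", run_input={
    # One URL per line, without the indentation a triple-quoted string would add
    "documentUrls": "\n".join([
        "https://example.com/doc1.pdf",
        "https://example.com/doc2.jpg",
        "https://example.com/doc3.html",
    ]),
    "extractionStrategy": "advanced",
    "languageDetection": True,
})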
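
Parallel processing of a batch is commonly implemented with a bounded pool of concurrent tasks. The sketch below shows that general pattern with `asyncio`. It is not the actor's implementation: `process_all`, `fake_worker`, and the concurrency limit of 3 are placeholders chosen for illustration.

```python
import asyncio


async def process_all(urls, worker, max_concurrency=3):
    """Run `worker(url)` for every URL with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url):
        async with semaphore:
            return await worker(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(run_one(url) for url in urls))


async def fake_worker(url):
    """Stand-in for real extraction; yields control once, then returns a stub."""
    await asyncio.sleep(0)
    return {"url": url, "status": "ok"}


if __name__ == "__main__":
    batch = [
        "https://example.com/doc1.pdf",
        "https://example.com/doc2.jpg",
        "https://example.com/doc3.html",
    ]
    print(asyncio.run(process_all(batch, fake_worker)))
```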

🛡️ Privacy & Security

No Data Storage: Documents processed in memory only
Secure Processing: HTTPS connections for all requests
Privacy Compliant: No personal data retention
Mirror Reliability: Multiple service endpoints

🌐 Actor URL

https://console.apify.com/actors/document-extractor-api

Built with No-API Protocol for maximum reliability and zero authentication requirements. The first agentic document extractor designed for AI workflows and automated processing.
