landingai-ade-extractor
Pricing
from $20,000.00 / 1,000 results
Official LandingAI Agentic Document Extraction (ADE) wrapper for Apify. Turn any PDF or image (invoices, receipts, IDs, forms, contracts, passports) into structured JSON in seconds, with no prompt engineering needed.
Developer

Data Farming Team
LandingAI ADE Document Extractor Actor
An Apify Actor that wraps LandingAI's Agentic Document Extraction (ADE) library to extract structured data from visual documents (PDFs and images) via API.
Features
- 🔍 Intelligent Document Extraction: Extracts structured data from PDFs and images using AI
- 📄 Multiple Formats: Supports both PDF documents and image files
- 🎯 Custom Instructions: Provide specific instructions for what data to extract
- 🖼️ Visual Groundings: Optional saving of grounding images showing where data was extracted
- ⚡ Async Processing: Built with async/await for optimal performance
- ✅ Comprehensive Tests: Full test coverage with pytest following TDD practices
Input
The Actor accepts the following input parameters:
{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/document.pdf",
  "documentPath": "/path/to/local/document.pdf",
  "instructions": "Extract invoice number, date, and total amount",
  "saveGroundings": true
}
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| apiKey | String | Yes | Your LandingAI API key for authentication |
| documentUrl | String | Conditional* | URL of the document to process |
| documentPath | String | Conditional* | Local path to the document file |
| instructions | String | No | Custom instructions for extraction (default: "Extract all key information from this document.") |
| saveGroundings | Boolean | No | Whether to save grounding images to key-value store (default: false) |
| useProxies | Boolean | No | Enable routing API calls through Apify proxies (default: false) |
| proxyConfiguration | Object | No | Apify proxy configuration (required if useProxies is true) |
*Either documentUrl or documentPath must be provided. If both are provided, documentUrl takes priority.
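The priority rule above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the Actor's actual source; the function name resolve_document_source is hypothetical.

```python
from typing import Any, Dict


def resolve_document_source(actor_input: Dict[str, Any]) -> str:
    """Pick the document location, preferring documentUrl over documentPath.

    Hypothetical helper: it mirrors the rule stated in the README,
    not necessarily how the Actor implements it.
    """
    url = (actor_input.get("documentUrl") or "").strip()
    path = (actor_input.get("documentPath") or "").strip()
    if url:
        return url  # documentUrl takes priority when both are present
    if path:
        return path
    raise ValueError("Either documentUrl or documentPath must be provided.")


# Both fields set: the URL is chosen.
print(resolve_document_source({
    "documentUrl": "https://example.com/document.pdf",
    "documentPath": "/path/to/local/document.pdf",
}))
```

Validating this before any API call keeps a missing-source mistake from consuming a paid extraction.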
Output
The Actor pushes results to the dataset with the following structure:
{
  "structured_data": {
    "invoice_number": "INV-2024-001",
    "date": "2024-12-05",
    "total": 1500.00
  },
  "markdown": "# Invoice INV-2024-001\n\nDate: 2024-12-05\nTotal: $1,500.00",
  "document_source": "https://example.com/invoice.pdf",
  "extraction_time_seconds": 2.34,
  "total_time_seconds": 2.45,
  "instructions": "Extract invoice number, date, and total amount",
  "groundings_saved": true
}
Output Fields
| Field | Type | Description |
|---|---|---|
| structured_data | Object | Extracted structured information as key-value pairs |
| markdown | String | Document content formatted as markdown |
| document_source | String | Source URL or path of the processed document |
| extraction_time_seconds | Number | Time taken for the extraction process |
| total_time_seconds | Number | Total execution time including I/O |
| instructions | String | The instructions used for extraction |
| groundings_saved | Boolean | Whether grounding images were saved |
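When consuming dataset items downstream, it can help to check that every documented field is present before relying on it. The sketch below works on the sample item shown above; the helper names missing_fields and overhead_seconds are hypothetical and are not part of the Actor.

```python
import json
from typing import Any, Dict, List

# Field names taken from the output table in this README.
EXPECTED_FIELDS = {
    "structured_data", "markdown", "document_source",
    "extraction_time_seconds", "total_time_seconds",
    "instructions", "groundings_saved",
}

SAMPLE_ITEM: Dict[str, Any] = json.loads("""
{
  "structured_data": {"invoice_number": "INV-2024-001", "date": "2024-12-05", "total": 1500.00},
  "markdown": "# Invoice INV-2024-001",
  "document_source": "https://example.com/invoice.pdf",
  "extraction_time_seconds": 2.34,
  "total_time_seconds": 2.45,
  "instructions": "Extract invoice number, date, and total amount",
  "groundings_saved": true
}
""")


def missing_fields(item: Dict[str, Any]) -> List[str]:
    """Return the documented fields that are absent from a dataset item."""
    return sorted(EXPECTED_FIELDS - set(item))


def overhead_seconds(item: Dict[str, Any]) -> float:
    """Time spent outside the extraction call (I/O and bookkeeping)."""
    return round(item["total_time_seconds"] - item["extraction_time_seconds"], 2)


print(missing_fields(SAMPLE_ITEM))
print(overhead_seconds(SAMPLE_ITEM))
```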
Grounding Images
When saveGroundings is set to true, the Actor saves visual annotations to the key-value store showing where data was extracted from the document. These images are saved with keys like:
grounding-0
grounding-1
grounding-2
You can access these images from the Actor's key-value store after the run completes.
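If you list the keys of the store yourself, a plain string sort would place grounding-10 before grounding-2. The following sketch filters and orders keys numerically. It assumes the grounding-N pattern shown above continues for larger indices, and the function name sorted_grounding_keys is hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: grounding images always use the key pattern "grounding-<index>".
_GROUNDING_KEY = re.compile(r"^grounding-(\d+)$")


def sorted_grounding_keys(keys: Iterable[str]) -> List[str]:
    """Keep only grounding keys and order them by their numeric index."""
    indexed = []
    for key in keys:
        match = _GROUNDING_KEY.match(key)
        if match:
            indexed.append((int(match.group(1)), key))
    return [key for _, key in sorted(indexed)]


print(sorted_grounding_keys(["grounding-10", "INPUT", "grounding-2", "grounding-0"]))
```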
Proxy Support
The Actor supports routing LandingAI API calls through Apify's proxy servers. This is useful for:
- Rate Limit Management: Distribute requests across multiple IPs
- Geographic Restrictions: Access region-specific content
- IP Rotation: Avoid blocks from excessive requests
Using Proxies
Enable proxy support by setting useProxies to true:
{
  "apiKey": "your-api-key",
  "documentUrl": "https://example.com/document.pdf",
  "useProxies": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}
Proxy Configuration Options
| Option | Description |
|---|---|
| useApifyProxy | Enable Apify's proxy service (default: true) |
| apifyProxyGroups | Proxy groups: ["RESIDENTIAL"], ["DATACENTER"], ["GOOGLE_SERP"] |
| apifyProxyCountry | Two-letter country code (e.g., "US", "GB", "DE") |
Note: Proxy usage requires an Apify subscription with proxy access enabled.
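As a rough sketch, the proxy-enabled input can be assembled programmatically so that useProxies and proxyConfiguration always travel together. The helper with_proxy is hypothetical; only the field names come from the tables above.

```python
import json
from typing import Any, Dict, List, Optional


def with_proxy(actor_input: Dict[str, Any],
               groups: Optional[List[str]] = None,
               country: Optional[str] = None) -> Dict[str, Any]:
    """Return a copy of actor_input with proxy routing enabled.

    Hypothetical helper; it only arranges the documented fields.
    """
    proxy_config: Dict[str, Any] = {"useApifyProxy": True}
    if groups:
        proxy_config["apifyProxyGroups"] = list(groups)
    if country:
        if len(country) != 2 or not country.isalpha():
            raise ValueError("apifyProxyCountry must be a two-letter country code")
        proxy_config["apifyProxyCountry"] = country.upper()
    # useProxies and proxyConfiguration are set together, as the input table requires.
    return {**actor_input, "useProxies": True, "proxyConfiguration": proxy_config}


run_input = with_proxy(
    {"apiKey": "your-api-key", "documentUrl": "https://example.com/document.pdf"},
    groups=["RESIDENTIAL"],
    country="us",
)
print(json.dumps(run_input, indent=2))
```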
Local Development
Prerequisites
- Python 3.9+
- Apify CLI: npm install -g apify-cli
- LandingAI API key
Installation
- Clone this repository or navigate to the Actor directory:
$ cd ade-extractor
- Install dependencies:
$ pip install -r requirements.txt
- Set up your API key in storage/key_value_stores/default/INPUT.json:
{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/sample.pdf",
  "instructions": "Extract key information",
  "saveGroundings": false
}
Running Locally
$ apify run
Running Tests
The Actor includes comprehensive unit tests following TDD (Test-Driven Development) practices:
# Run all tests
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
# Run specific test file
python -m pytest tests/test_extraction.py -v
Test coverage includes:
- ✅ Successful document extraction from URLs
- ✅ Successful document extraction from local files
- ✅ Invalid API key handling
- ✅ Missing document source validation
- ✅ Saving grounding images
- ✅ Empty groundings handling
- ✅ Input validation scenarios
- ✅ Complete extraction workflows
- ✅ Grounding image workflows
Deployment
Deploy to Apify Platform
- Authenticate with Apify:
$ apify login
- Deploy the Actor:
$ apify push
Error Handling
The Actor includes comprehensive error handling for:
- Invalid API Key: Returns error with message about checking API key
- Missing Required Parameters: Validates all required inputs before processing
- API Connection Errors: Catches and reports connection issues with LandingAI API
- Document Access Errors: Handles cases where document URL/path is inaccessible
- Unexpected Errors: Catches and logs all unexpected errors with full details
API Rate Limits
Please be aware of LandingAI API rate limits and quotas. The Actor processes documents asynchronously but respects API limitations.
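If you call the Actor repeatedly from your own code, one common way to stay within rate limits is to retry with exponential backoff. The sketch below is a generic pattern and not something this Actor is documented to do internally. The name retry_with_backoff and its parameters are hypothetical, and in practice you would catch only the specific rate-limit error rather than every exception.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(operation: Callable[[], Awaitable[T]],
                             max_attempts: int = 4,
                             base_delay: float = 1.0) -> T:
    """Await operation(), retrying with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:  # assumption: narrow this to the rate-limit error you observe
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Delay doubles each attempt (1x, 2x, 4x, ...) plus random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")
```

A caller would wrap its own coroutine, for example `await retry_with_backoff(lambda: submit_document(url))`, where submit_document stands in for whatever function starts a run.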
Best Practices
- Use Specific Instructions: The more specific your extraction instructions, the better the results
- Enable Groundings for Debugging: Turn on saveGroundings when testing to verify extraction accuracy
- Handle Large Documents: For large PDFs, be aware of processing time and API timeouts
- Secure API Keys: Always store API keys as secrets, never hardcode them
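For local runs, one way to avoid hardcoding the key is to read it from an environment variable and generate INPUT.json from it. This is a minimal sketch: the variable name LANDINGAI_API_KEY and both helper functions are assumptions made for illustration, not conventions defined by the Actor.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict

# Local input location used by `apify run`, as described under Installation.
DEFAULT_INPUT_PATH = Path("storage/key_value_stores/default/INPUT.json")


def build_local_input(document_url: str,
                      env_var: str = "LANDINGAI_API_KEY") -> Dict[str, Any]:
    """Build Actor input with the API key read from the environment.

    The environment variable name is a hypothetical choice.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {
        "apiKey": api_key,
        "documentUrl": document_url,
        "instructions": "Extract key information",
        "saveGroundings": False,
    }


def write_local_input(actor_input: Dict[str, Any],
                      path: Path = DEFAULT_INPUT_PATH) -> Path:
    """Write the input as JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return path
```

Keep the generated INPUT.json out of version control, since it contains the key in plain text.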
Architecture
The Actor follows Apify best practices:
- Async/Await: All I/O operations use async for optimal performance
- Input Validation: Early validation of all input parameters
- Error Handling: Comprehensive error handling with descriptive messages
- Logging: Detailed logging using Apify's logging system
- Storage: Proper use of Dataset and Key-Value Store
- Docker Compatible: No hardcoded local paths, fully containerized
Technology Stack
- Python 3.9+: Modern async Python
- Apify SDK: Official Apify Python SDK
- LandingAI ADE: Agentic Document Extraction library
- Pytest: Test framework with async support
- asyncio: Asynchronous I/O support
Development Approach
This Actor was developed following Test-Driven Development (TDD) principles:
- ✅ Wrote comprehensive documentation and comments
- ✅ Created failing unit tests (red phase)
- ✅ Implemented minimal code to pass tests (green phase)
- ✅ Refactored while maintaining test coverage
All 11 tests pass with comprehensive coverage of core functionality.
License
This Actor is provided as-is for use with the Apify platform and LandingAI services.
Support
For issues related to:
- Actor functionality: Create an issue in this repository
- LandingAI API: Contact LandingAI support
- Apify platform: Visit Apify documentation
Version History
1.0.0 (2024-12-05)
- Initial release
- Support for PDF and image document extraction
- Grounding image storage
- Comprehensive test suite
- Full async/await support
- Docker-compatible implementation