landingai-ade-extractor

Pricing

from $20,000.00 / 1,000 results

Official LandingAI Agentic Document Extraction (ADE) wrapper for Apify. Turn any PDF or image (invoices, receipts, IDs, forms, contracts, passports) into structured JSON in seconds, with no prompt engineering needed.

Developer

Data Farming Team

Maintained by Community

LandingAI ADE Document Extractor Actor

An Apify Actor that wraps LandingAI's Agentic Document Extraction (ADE) library to extract structured data from visual documents (PDFs and images) via API.

Features

  • 🔍 Intelligent Document Extraction: Extracts structured data from PDFs and images using AI
  • 📄 Multiple Formats: Supports both PDF documents and image files
  • 🎯 Custom Instructions: Provide specific instructions for what data to extract
  • 🖼️ Visual Groundings: Optional saving of grounding images showing where data was extracted
  • Async Processing: Built with async/await for optimal performance
  • Comprehensive Tests: Full test coverage with pytest following TDD practices

Input

The Actor accepts the following input parameters:

{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/document.pdf",
  "documentPath": "/path/to/local/document.pdf",
  "instructions": "Extract invoice number, date, and total amount",
  "saveGroundings": true
}

Input Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| apiKey | String | Yes | Your LandingAI API key for authentication |
| documentUrl | String | Conditional* | URL of the document to process |
| documentPath | String | Conditional* | Local path to the document file |
| instructions | String | No | Custom instructions for extraction (default: "Extract all key information from this document.") |
| saveGroundings | Boolean | No | Whether to save grounding images to key-value store (default: false) |
| useProxies | Boolean | No | Enable routing API calls through Apify proxies (default: false) |
| proxyConfiguration | Object | No | Apify proxy configuration (required if useProxies is true) |

*Either documentUrl or documentPath must be provided. If both are provided, documentUrl takes priority.
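
The source-selection rule can be sketched in Python as follows. This is a minimal illustration, not the Actor's actual code: the function name `resolve_document_source` and the error messages are hypothetical, and only the parameter names and the priority rule come from the table above.

```python
from typing import Any, Mapping


def resolve_document_source(actor_input: Mapping[str, Any]) -> str:
    """Return the document to process, preferring documentUrl over documentPath.

    Hypothetical helper: it mirrors the documented rule, not the Actor's source.
    """
    if not actor_input.get("apiKey"):
        raise ValueError("apiKey is required")

    # documentUrl takes priority when both fields are present.
    source = actor_input.get("documentUrl") or actor_input.get("documentPath")
    if not source:
        raise ValueError("Either documentUrl or documentPath must be provided")
    return source
```

With both fields set, the helper returns the URL; with only `documentPath` set, it returns the path; with neither, it raises `ValueError` before any API call would be made.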

Output

The Actor pushes results to the dataset with the following structure:

{
  "structured_data": {
    "invoice_number": "INV-2024-001",
    "date": "2024-12-05",
    "total": 1500.00
  },
  "markdown": "# Invoice INV-2024-001\n\nDate: 2024-12-05\nTotal: $1,500.00",
  "document_source": "https://example.com/invoice.pdf",
  "extraction_time_seconds": 2.34,
  "total_time_seconds": 2.45,
  "instructions": "Extract invoice number, date, and total amount",
  "groundings_saved": true
}

Output Fields

| Field | Type | Description |
| --- | --- | --- |
| structured_data | Object | Extracted structured information as key-value pairs |
| markdown | String | Document content formatted as markdown |
| document_source | String | Source URL or path of the processed document |
| extraction_time_seconds | Number | Time taken for the extraction process |
| total_time_seconds | Number | Total execution time including I/O |
| instructions | String | The instructions used for extraction |
| groundings_saved | Boolean | Whether grounding images were saved |
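
One way to start a run and read these fields from Python is sketched below. It assumes `client` behaves like `ApifyClient` from the `apify-client` package, in a version where `call()` returns a dict-like run object with a `defaultDatasetId` key. The helper names and the Actor ID you pass in are hypothetical.

```python
from typing import Any, Dict, List


def run_and_collect(client: Any, actor_id: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return its dataset items.

    `client` is assumed to expose actor(...).call(...) and
    dataset(...).list_items() the way apify_client.ApifyClient does.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return client.dataset(run["defaultDatasetId"]).list_items().items


def summarize(item: Dict[str, Any]) -> str:
    """Build a one-line summary from the documented output fields."""
    data = item.get("structured_data") or {}
    return (
        f'{item.get("document_source")}: {len(data)} fields '
        f'in {item.get("extraction_time_seconds")}s'
    )
```

In real use you would pass a client constructed with your own Apify token and the Actor's ID as shown on its store page. Taking the client as a parameter keeps the sketch testable without network access.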

Grounding Images

When saveGroundings is set to true, the Actor saves visual annotations to the key-value store showing where data was extracted from the document. These images are saved with keys like:

  • grounding-0
  • grounding-1
  • grounding-2

You can access these images from the Actor's key-value store after the run completes.
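
A sketch of reading those images back is shown below. It assumes the keys are numbered contiguously from `grounding-0`, and that `store` behaves like the key-value store client in `apify-client`, whose `get_record(key)` returns a dict with a `"value"` entry or `None` when the key is absent. The function name `collect_groundings` is hypothetical.

```python
from typing import Any, Dict


def collect_groundings(store: Any, max_images: int = 100) -> Dict[str, Any]:
    """Read grounding-0, grounding-1, ... and stop at the first missing key.

    Assumes contiguous numbering; a gap in the sequence ends the scan early.
    """
    images: Dict[str, Any] = {}
    for index in range(max_images):
        key = f"grounding-{index}"
        record = store.get_record(key)
        if record is None:
            break
        images[key] = record["value"]
    return images
```

The `max_images` cap is only a safety bound for the loop, not a limit stated by the Actor.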

Proxy Support

The Actor supports routing LandingAI API calls through Apify's proxy servers. This is useful for:

  • Rate Limit Management: Distribute requests across multiple IPs
  • Geographic Restrictions: Access region-specific content
  • IP Rotation: Avoid blocks from excessive requests

Using Proxies

Enable proxy support by setting useProxies to true:

{
  "apiKey": "your-api-key",
  "documentUrl": "https://example.com/document.pdf",
  "useProxies": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}

Proxy Configuration Options

| Option | Description |
| --- | --- |
| useApifyProxy | Enable Apify's proxy service (default: true) |
| apifyProxyGroups | Proxy groups: ["RESIDENTIAL"], ["DATACENTER"], ["GOOGLE_SERP"] |
| apifyProxyCountry | Two-letter country code (e.g., "US", "GB", "DE") |

Note: Proxy usage requires an Apify subscription with proxy access enabled.
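
The proxy fields can also be assembled and checked in code before starting a run. The sketch below is illustrative only: `with_proxy` and `check_proxy_settings` are hypothetical helpers, and the validation reflects just the rules stated in this README (a two-letter country code, and `proxyConfiguration` required when `useProxies` is true).

```python
from typing import Any, Dict, Mapping, Optional, Sequence


def with_proxy(
    run_input: Mapping[str, Any],
    groups: Optional[Sequence[str]] = None,
    country: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of run_input with proxy settings added."""
    proxy: Dict[str, Any] = {"useApifyProxy": True}
    if groups:
        proxy["apifyProxyGroups"] = list(groups)
    if country:
        if len(country) != 2 or not country.isalpha():
            raise ValueError("apifyProxyCountry must be a two-letter code")
        proxy["apifyProxyCountry"] = country.upper()

    result = dict(run_input)  # leave the caller's input untouched
    result["useProxies"] = True
    result["proxyConfiguration"] = proxy
    return result


def check_proxy_settings(run_input: Mapping[str, Any]) -> None:
    """Raise if useProxies is true but proxyConfiguration is missing."""
    if run_input.get("useProxies") and not run_input.get("proxyConfiguration"):
        raise ValueError("proxyConfiguration is required when useProxies is true")
```

Calling `with_proxy(base_input, groups=["RESIDENTIAL"], country="us")` produces the same shape as the JSON example in this section, with the country code normalized to upper case.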

Local Development

Prerequisites

  • Python 3.9+
  • Apify CLI: npm install -g apify-cli
  • LandingAI API key

Installation

  1. Clone this repository or navigate to the Actor directory:
$ cd ade-extractor
  2. Install dependencies:
$ pip install -r requirements.txt
  3. Set up your API key in storage/key_value_stores/default/INPUT.json:
{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/sample.pdf",
  "instructions": "Extract key information",
  "saveGroundings": false
}

Running Locally

$ apify run

Running Tests

The Actor includes comprehensive unit tests following TDD (Test-Driven Development) practices:

# Run all tests
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
# Run specific test file
python -m pytest tests/test_extraction.py -v

Test coverage includes:

  • ✅ Successful document extraction from URLs
  • ✅ Successful document extraction from local files
  • ✅ Invalid API key handling
  • ✅ Missing document source validation
  • ✅ Saving grounding images
  • ✅ Empty groundings handling
  • ✅ Input validation scenarios
  • ✅ Complete extraction workflows
  • ✅ Grounding image workflows

Deployment

Deploy to Apify Platform

  1. Authenticate with Apify:
$ apify login
  2. Deploy the Actor:
$ apify push

Error Handling

The Actor includes comprehensive error handling for:

  • Invalid API Key: Returns error with message about checking API key
  • Missing Required Parameters: Validates all required inputs before processing
  • API Connection Errors: Catches and reports connection issues with LandingAI API
  • Document Access Errors: Handles cases where document URL/path is inaccessible
  • Unexpected Errors: Catches and logs all unexpected errors with full details

API Rate Limits

LandingAI applies rate limits and quotas to API usage. The Actor processes documents asynchronously, but its requests are still subject to those limits.
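
If you launch many runs from your own code, one common way to stay within such limits is to retry with exponential backoff. The sketch below is a generic pattern, not something the Actor is documented to do. `RateLimitError` is a hypothetical exception standing in for whatever your client raises on a rate-limited request, and the delay values are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical marker for a rate-limited request (for example HTTP 429)."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, doubling the wait after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.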

Best Practices

  1. Use Specific Instructions: The more specific your extraction instructions, the better the results
  2. Enable Groundings for Debugging: Turn on saveGroundings when testing to verify extraction accuracy
  3. Handle Large Documents: For large PDFs, be aware of processing time and API timeouts
  4. Secure API Keys: Always store API keys as secrets, never hardcode them
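
As a sketch of the last point, the run input can be built from an environment variable instead of a hardcoded key. The variable name `LANDINGAI_API_KEY` and the helper `build_run_input` are assumptions made for illustration; the input field names are the ones documented above.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_run_input(
    document_url: str,
    instructions: Optional[str] = None,
    save_groundings: bool = False,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Any]:
    """Build the Actor input, reading the API key from the environment."""
    api_key = env.get("LANDINGAI_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set LANDINGAI_API_KEY before starting a run")

    run_input: Dict[str, Any] = {
        "apiKey": api_key,
        "documentUrl": document_url,
        "saveGroundings": save_groundings,
    }
    if instructions:
        # Omit the field when empty so the Actor's default instructions apply.
        run_input["instructions"] = instructions
    return run_input
```

Passing `save_groundings=True` while testing follows the second practice in the list, and the key never appears in source control.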

Architecture

The Actor follows Apify best practices:

  • Async/Await: All I/O operations use async for optimal performance
  • Input Validation: Early validation of all input parameters
  • Error Handling: Comprehensive error handling with descriptive messages
  • Logging: Detailed logging using Apify's logging system
  • Storage: Proper use of Dataset and Key-Value Store
  • Docker Compatible: No hardcoded local paths, fully containerized

Technology Stack

  • Python 3.9+: Modern async Python
  • Apify SDK: Official Apify Python SDK
  • LandingAI ADE: Agentic Document Extraction library
  • Pytest: Test framework with async support
  • asyncio: Asynchronous I/O support

Development Approach

This Actor was developed following Test-Driven Development (TDD) principles:

  1. ✅ Written comprehensive documentation and comments
  2. ✅ Created failing unit tests (red phase)
  3. ✅ Implemented minimal code to pass tests (green phase)
  4. ✅ Refactored while maintaining test coverage

All 11 tests pass with comprehensive coverage of core functionality.

License

This Actor is provided as-is for use with the Apify platform and LandingAI services.

Support

For issues related to:

  • Actor functionality: Create an issue in this repository
  • LandingAI API: Contact LandingAI support
  • Apify platform: Visit Apify documentation

Version History

1.0.0 (2024-12-05)

  • Initial release
  • Support for PDF and image document extraction
  • Grounding image storage
  • Comprehensive test suite
  • Full async/await support
  • Docker-compatible implementation