Pricing

from $20,000.00 / 1,000 results

landingai-ade-extractor

Official LandingAI Agentic Document Extraction (ADE) wrapper for Apify. Turn any PDF or image (invoices, receipts, IDs, forms, contracts, passports) into perfect structured JSON in seconds – no prompt engineering needed.

Developer

Data Farming Team

Last modified

3 months ago

LandingAI ADE Document Extractor Actor

An Apify Actor that wraps LandingAI's Agentic Document Extraction (ADE) library to extract structured data from visual documents (PDFs and images) via API.

Features

🔍 Intelligent Document Extraction: Extracts structured data from PDFs and images using AI
📄 Multiple Formats: Supports both PDF documents and image files
🎯 Custom Instructions: Provide specific instructions for what data to extract
🖼️ Visual Groundings: Optional saving of grounding images showing where data was extracted
⚡ Async Processing: Built with async/await for optimal performance
✅ Comprehensive Tests: Full test coverage with pytest following TDD practices

Input

The Actor accepts the following input parameters:

{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/document.pdf",
  "documentPath": "/path/to/local/document.pdf",
  "instructions": "Extract invoice number, date, and total amount",
  "saveGroundings": true
}

Input Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `apiKey` | String | Yes | Your LandingAI API key for authentication |
| `documentUrl` | String | Conditional* | URL of the document to process |
| `documentPath` | String | Conditional* | Local path to the document file |
| `instructions` | String | No | Custom instructions for extraction (default: "Extract all key information from this document.") |
| `saveGroundings` | Boolean | No | Whether to save grounding images to key-value store (default: false) |
| `useProxies` | Boolean | No | Enable routing API calls through Apify proxies (default: false) |
| `proxyConfiguration` | Object | No | Apify proxy configuration (required if `useProxies` is true) |

*Either documentUrl or documentPath must be provided. If both are provided, documentUrl takes priority.
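
The rules above (required key, one of two document sources, URL priority, default instructions) can be sketched as a small validation step. This is an illustration only: `resolve_input` and the normalized key names are hypothetical and are not taken from the Actor's source.

```python
from typing import Any, Dict

# Default taken from the parameter table above.
DEFAULT_INSTRUCTIONS = "Extract all key information from this document."


def resolve_input(actor_input: Dict[str, Any]) -> Dict[str, Any]:
    """Validate raw Actor input and return normalized settings (hypothetical helper)."""
    api_key = actor_input.get("apiKey")
    if not api_key:
        raise ValueError("apiKey is required")

    url = actor_input.get("documentUrl")
    path = actor_input.get("documentPath")
    if not url and not path:
        raise ValueError("Provide either documentUrl or documentPath")

    return {
        "api_key": api_key,
        # documentUrl takes priority when both sources are given.
        "document_source": url or path,
        "instructions": actor_input.get("instructions") or DEFAULT_INSTRUCTIONS,
        "save_groundings": bool(actor_input.get("saveGroundings", False)),
    }
```

Validating before any network call means a missing source fails fast instead of consuming API quota.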

Output

The Actor pushes results to the dataset with the following structure:

{
  "structured_data": {
    "invoice_number": "INV-2024-001",
    "date": "2024-12-05",
    "total": 1500.00
  },
  "markdown": "# Invoice INV-2024-001\n\nDate: 2024-12-05\nTotal: $1,500.00",
  "document_source": "https://example.com/invoice.pdf",
  "extraction_time_seconds": 2.34,
  "total_time_seconds": 2.45,
  "instructions": "Extract invoice number, date, and total amount",
  "groundings_saved": true
}

Output Fields

| Field | Type | Description |
| --- | --- | --- |
| `structured_data` | Object | Extracted structured information as key-value pairs |
| `markdown` | String | Document content formatted as markdown |
| `document_source` | String | Source URL or path of the processed document |
| `extraction_time_seconds` | Number | Time taken for the extraction process |
| `total_time_seconds` | Number | Total execution time including I/O |
| `instructions` | String | The instructions used for extraction |
| `groundings_saved` | Boolean | Whether grounding images were saved |
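
When consuming dataset items downstream, it can help to load them into a typed record. The sketch below assumes each item carries exactly the fields listed above; `ExtractionResult` and `overhead_seconds` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ExtractionResult:
    """Typed view of one dataset item (hypothetical; mirrors the documented fields)."""

    structured_data: Dict[str, Any]
    markdown: str
    document_source: str
    extraction_time_seconds: float
    total_time_seconds: float
    instructions: str
    groundings_saved: bool

    @classmethod
    def from_item(cls, item: Dict[str, Any]) -> "ExtractionResult":
        # Raises KeyError if a documented field is missing from the item.
        return cls(**{f.name: item[f.name] for f in fields(cls)})

    @property
    def overhead_seconds(self) -> float:
        # Time spent outside the extraction call itself (I/O, setup).
        return self.total_time_seconds - self.extraction_time_seconds
```

With the sample output above, `overhead_seconds` is the difference between the two timing fields, roughly 0.11 seconds.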

Grounding Images

When saveGroundings is set to true, the Actor saves visual annotations to the key-value store showing where data was extracted from the document. These images are saved with keys like:

grounding-0
grounding-1
grounding-2

You can access these images from the Actor's key-value store after the run completes.
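
The key naming shown above can be sketched as follows. The zero-based `grounding-<index>` pattern is inferred from the example keys, and `InMemoryStore` is a stand-in written for this example, not the Apify SDK's key-value store class.

```python
import asyncio
from typing import Dict, List


class InMemoryStore:
    """Stand-in for a key-value store (hypothetical; not the Apify SDK class)."""

    def __init__(self) -> None:
        self.values: Dict[str, bytes] = {}

    async def set_value(self, key: str, value: bytes) -> None:
        self.values[key] = value


async def save_groundings(store: InMemoryStore, images: List[bytes]) -> List[str]:
    """Store each image under grounding-<index> and return the keys used."""
    keys: List[str] = []
    for index, image in enumerate(images):
        key = f"grounding-{index}"
        await store.set_value(key, image)
        keys.append(key)
    return keys
```

An empty image list simply yields no keys, which matches the "empty groundings" case a caller may need to handle.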

Proxy Support

The Actor supports routing LandingAI API calls through Apify's proxy servers. This is useful for:

Rate Limit Management: Distribute requests across multiple IPs
Geographic Restrictions: Access region-specific content
IP Rotation: Avoid blocks from excessive requests

Using Proxies

Enable proxy support by setting useProxies to true:

{
  "apiKey": "your-api-key",
  "documentUrl": "https://example.com/document.pdf",
  "useProxies": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}

Proxy Configuration Options

| Option | Description |
| --- | --- |
| `useApifyProxy` | Enable Apify's proxy service (default: true) |
| `apifyProxyGroups` | Proxy groups: `["RESIDENTIAL"]`, `["DATACENTER"]`, `["GOOGLE_SERP"]` |
| `apifyProxyCountry` | Two-letter country code (e.g., "US", "GB", "DE") |

Note: Proxy usage requires an Apify subscription with proxy access enabled.
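
If you build the run input programmatically, a small helper can keep the proxy options consistent with the table above. `build_proxy_input` is a hypothetical helper, and the country-code check is an assumption about what counts as valid input.

```python
from typing import Any, Dict, List, Optional


def build_proxy_input(
    groups: Optional[List[str]] = None,
    country: Optional[str] = None,
) -> Dict[str, Any]:
    """Return the proxy-related part of the Actor input (hypothetical helper)."""
    config: Dict[str, Any] = {"useApifyProxy": True}
    if groups:
        config["apifyProxyGroups"] = list(groups)
    if country:
        # Assumed check: the option expects a two-letter country code.
        if len(country) != 2 or not country.isalpha():
            raise ValueError(f"Expected a two-letter country code, got {country!r}")
        config["apifyProxyCountry"] = country.upper()
    # proxyConfiguration is required whenever useProxies is true.
    return {"useProxies": True, "proxyConfiguration": config}
```

The returned dictionary can be merged into the rest of the input, for example `{**base_input, **build_proxy_input(["RESIDENTIAL"], "us")}`.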

Local Development

Prerequisites

Python 3.9+
Apify CLI: npm install -g apify-cli
LandingAI API key

Installation

Clone this repository or navigate to the Actor directory:

$ cd ade-extractor

Install dependencies:

$ pip install -r requirements.txt

Set up your API key in storage/key_value_stores/default/INPUT.json:

{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/sample.pdf",
  "instructions": "Extract key information",
  "saveGroundings": false
}
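
To avoid typing the key into the file by hand, the input file can be generated from an environment variable. This is a sketch under assumptions: the variable name `LANDINGAI_API_KEY` and the helper `write_local_input` are hypothetical, and the generated file still contains the key, so keep it out of version control.

```python
import json
import os
from pathlib import Path

# Location of the local input file, relative to the Actor project directory.
INPUT_RELATIVE_PATH = Path("storage/key_value_stores/default/INPUT.json")


def write_local_input(
    project_dir: Path,
    document_url: str,
    env_var: str = "LANDINGAI_API_KEY",  # hypothetical variable name
) -> Path:
    """Write INPUT.json using an API key read from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before generating INPUT.json")

    payload = {
        "apiKey": api_key,
        "documentUrl": document_url,
        "instructions": "Extract key information",
        "saveGroundings": False,
    }
    target = project_dir / INPUT_RELATIVE_PATH
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```

Calling `write_local_input(Path("."), "https://example.com/sample.pdf")` from the Actor directory would produce the file shown above.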

Running Locally

$ apify run

Running Tests

The Actor includes comprehensive unit tests following TDD (Test-Driven Development) practices:

# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Run specific test file
python -m pytest tests/test_extraction.py -v

Test coverage includes:

✅ Successful document extraction from URLs
✅ Successful document extraction from local files
✅ Invalid API key handling
✅ Missing document source validation
✅ Saving grounding images
✅ Empty groundings handling
✅ Input validation scenarios
✅ Complete extraction workflows
✅ Grounding image workflows
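
For readers unfamiliar with testing async code against a stubbed extractor, a test in this style might look like the sketch below. It is not taken from the Actor's test suite: `run_workflow` is a hypothetical function, and the extractor is replaced with `unittest.mock.AsyncMock` so no API call is made.

```python
import asyncio
from typing import Any, Dict
from unittest.mock import AsyncMock


async def run_workflow(extract: Any, source: str, instructions: str) -> Dict[str, Any]:
    """Hypothetical workflow: call an injected extractor and shape the record."""
    data = await extract(source, instructions)
    return {
        "structured_data": data,
        "document_source": source,
        "instructions": instructions,
    }


def test_workflow_returns_structured_data() -> None:
    # The stub stands in for the real extraction call.
    extract = AsyncMock(return_value={"invoice_number": "INV-2024-001"})
    record = asyncio.run(
        run_workflow(extract, "https://example.com/invoice.pdf", "Extract invoice number")
    )
    extract.assert_awaited_once_with(
        "https://example.com/invoice.pdf", "Extract invoice number"
    )
    assert record["structured_data"]["invoice_number"] == "INV-2024-001"
```

Using `asyncio.run` inside a plain test function keeps the sketch independent of any pytest async plugin.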

Deployment

Deploy to Apify Platform

Authenticate with Apify:

$ apify login

Deploy the Actor:

$ apify push

Error Handling

The Actor includes comprehensive error handling for:

Invalid API Key: Returns an error with a message asking you to check the API key
Missing Required Parameters: Validates all required inputs before processing
API Connection Errors: Catches and reports connection issues with LandingAI API
Document Access Errors: Handles cases where document URL/path is inaccessible
Unexpected Errors: Catches and logs all unexpected errors with full details
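
One way to picture these categories is a wrapper that maps exceptions to descriptive messages. The sketch below is illustrative only: `InvalidApiKeyError` and `extract_safely` are hypothetical, and the exception types the Actor actually catches are not stated in this document.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class InvalidApiKeyError(Exception):
    """Hypothetical error raised when the API rejects the key."""


async def extract_safely(
    extract: Callable[[], Awaitable[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Run an extraction coroutine and convert failures into error records."""
    try:
        return {"ok": True, "result": await extract()}
    except InvalidApiKeyError:
        return {"ok": False, "error": "Invalid API key. Check the apiKey input."}
    except FileNotFoundError as exc:
        return {"ok": False, "error": f"Document not accessible: {exc}"}
    except ConnectionError as exc:
        return {"ok": False, "error": f"API connection error: {exc}"}
    except Exception as exc:  # last resort: report the type and message
        return {"ok": False, "error": f"Unexpected error: {type(exc).__name__}: {exc}"}
```

Specific handlers come first so that the generic fallback only sees errors nothing else matched.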

API Rate Limits

Be aware of LandingAI API rate limits and quotas. The Actor processes documents asynchronously, but its requests still count against those limits.
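
If you call the extraction from your own code and run into rate limiting, retrying with exponential backoff is a common pattern. This sketch is not part of the Actor: `RateLimitError` and `with_backoff` are hypothetical names, and how LandingAI actually signals a rate limit is an assumption here.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error representing a rate-limited API response."""


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> T:
    """Retry `call` on RateLimitError, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from concurrent callers.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

Errors other than `RateLimitError` propagate immediately, so genuine failures are not hidden behind retries.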

Best Practices

Use Specific Instructions: The more specific your extraction instructions, the better the results
Enable Groundings for Debugging: Turn on saveGroundings when testing to verify extraction accuracy
Handle Large Documents: For large PDFs, be aware of processing time and API timeouts
Secure API Keys: Always store API keys as secrets, never hardcode them

Architecture

The Actor follows Apify best practices:

Async/Await: All I/O operations use async for optimal performance
Input Validation: Early validation of all input parameters
Error Handling: Comprehensive error handling with descriptive messages
Logging: Detailed logging using Apify's logging system
Storage: Proper use of Dataset and Key-Value Store
Docker Compatible: No hardcoded local paths, fully containerized

Technology Stack

Python 3.9+: Modern async Python
Apify SDK: Official Apify Python SDK
LandingAI ADE: Agentic Document Extraction library
Pytest: Test framework with async support
asyncio: Asynchronous I/O support

Development Approach

This Actor was developed following Test-Driven Development (TDD) principles:

✅ Wrote comprehensive documentation and comments
✅ Created failing unit tests (red phase)
✅ Implemented minimal code to pass tests (green phase)
✅ Refactored while maintaining test coverage

All 11 tests pass with comprehensive coverage of core functionality.

License

This Actor is provided as-is for use with the Apify platform and LandingAI services.

Support

For issues related to:

Actor functionality: Create an issue in this repository
LandingAI API: Contact LandingAI support
Apify platform: Visit Apify documentation

Version History

1.0.0 (2024-12-05)

Initial release
Support for PDF and image document extraction
Grounding image storage
Comprehensive test suite
Full async/await support
Docker-compatible implementation
