Pricing

from $450.00 / 1,000 tax documents extracted

AI OCR for Tax Documents: Invoices, Balance Sheets & Tables

Extract structured data from invoices, receipts, balance sheets and tabular PDFs with AI. Returns issuer, dates, totals, taxes and tables as JSON. Upload a file or pass URLs; batch or real-time API.

Developer

Acme AI

Last modified

2 months ago

🧾 AI OCR for Tax Documents (Invoices, Balance Sheets & Tables)

Turn invoices, receipts, balance sheets, bank statements and tabular PDFs into clean, structured JSON with AI. Upload a file or pass document URLs, and get back the document type, issuer/recipient, dates, totals, taxes, line-item tables and a summary - ready for your accounting system or spreadsheet.

🎯 Built for tax and accounting teams. Not a generic text dump: the AI detects the document type and extracts the fields that matter for that type, plus the tables, while preserving what the layout means.

What you get (per document)

| Field | Description |
| --- | --- |
| `documentType` | invoice, receipt, balance_sheet, income_statement, bank_statement, purchase_order, table, other |
| `issuerName` / `issuerTaxId` | Vendor/company and tax ID (VAT, CNPJ, EIN...) |
| `recipientName` / `recipientTaxId` | Buyer/customer and tax ID |
| `documentNumber`, `issueDate`, `dueDate` | Document identification |
| `currency`, `subtotal`, `taxAmount`, `totalAmount` | Currency and monetary amounts (amounts as plain numbers) |
| `tables[]` | Extracted tables (line items, balances...) with columns + rows |
| `keyValues` | Any other labelled fields (payment terms, account no., period...) |
| `summary` | One-line description |
| `fileMetadata` | `type`, `sizeBytes`, `pageCount` (PDF) |

How to use

Upload a file in the input, or pass a list of URLs for batch processing:

{
  "documentUrls": [
    "https://example.com/invoice.pdf",
    "https://example.com/receipt.jpg"
  ]
}

Supports PDF, PNG, JPG and WebP. Up to 50 documents per run (send larger volumes via sequential calls). PDFs are read natively (multi-page); images are auto-optimized before analysis.
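Because a run accepts at most 50 documents, a longer list has to be split into several inputs and sent as sequential calls. The following is a minimal sketch of that split; the helper name `build_run_inputs` and the constant `MAX_DOCS_PER_RUN` are illustrative and not part of the Actor.

```python
from typing import Iterator

MAX_DOCS_PER_RUN = 50  # per-run limit stated above


def build_run_inputs(urls: list[str], batch_size: int = MAX_DOCS_PER_RUN) -> Iterator[dict]:
    """Yield one Actor input per batch of at most `batch_size` URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield {"documentUrls": urls[start:start + batch_size]}


# Hypothetical URLs, used only to show the batch sizes.
urls = [f"https://example.com/doc-{i}.pdf" for i in range(120)]
run_inputs = list(build_run_inputs(urls))
print([len(run_input["documentUrls"]) for run_input in run_inputs])  # [50, 50, 20]
```

Each yielded dictionary has the same shape as the input shown above, so it can be sent as the body of one run.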

Pricing

Charged per document successfully extracted (event tax-document-extracted). Documents that fail to download or can't be read are not charged.
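As a rough illustration of per-document charging, the sketch below estimates the cost of a run from its output items. It assumes the listed starting price of $450.00 per 1,000 documents (the rate that applies to your account may differ) and assumes that an item with `"success": true` corresponds to one charged event; the function name `estimate_charge` is hypothetical.

```python
from decimal import Decimal

# Assumption: the listed starting price of $450.00 per 1,000 extracted documents.
PRICE_PER_1000_DOCS = Decimal("450.00")


def estimate_charge(items: list[dict], price_per_1000: Decimal = PRICE_PER_1000_DOCS) -> Decimal:
    """Estimate the charge for one run from its output items."""
    # Assumption: an item with "success": true corresponds to one billed event.
    billable = sum(1 for item in items if item.get("success") is True)
    return (price_per_1000 * billable / 1000).quantize(Decimal("0.01"))


items = [
    {"documentUrl": "https://example.com/invoice.pdf", "success": True},
    {"documentUrl": "https://example.com/receipt.jpg", "success": True},
    {"documentUrl": "https://example.com/blurry.jpg", "success": False},
]
print(estimate_charge(items))  # 0.90
```

Two of the three items count as extracted, so the estimate is 2 x $0.45.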

Example output

[
  {
    "documentUrl": "https://example.com/invoice.pdf",
    "success": true,
    "documentType": "invoice",
    "issuerName": "ACME Ltda",
    "issuerTaxId": "12345678000190",
    "recipientName": "Globex Inc",
    "documentNumber": "INV-2024-001",
    "issueDate": "2024-03-15",
    "dueDate": "2024-04-15",
    "currency": "USD",
    "subtotal": 1100.0,
    "taxAmount": 150.0,
    "totalAmount": 1250.0,
    "tables": [
      { "title": "Line items", "columns": ["description", "qty", "unitPrice", "total"],
        "rows": [ { "description": "Consulting", "qty": 10, "unitPrice": 110, "total": 1100 } ] }
    ],
    "keyValues": { "paymentTerms": "Net 30" },
    "summary": "Invoice from ACME Ltda to Globex Inc, total USD 1250.",
    "fileMetadata": { "type": "pdf", "sizeBytes": 84210, "pageCount": 1 },
    "failureReason": null,
    "processedAt": "2026-01-01T12:00:00.000Z",
    "error": null
  }
]
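To show how the output might feed a spreadsheet or a sanity check, here is a small sketch that works on one output item. The helper names `totals_are_consistent` and `table_to_csv` are illustrative. The consistency check assumes that `subtotal + taxAmount` should equal `totalAmount`, which holds for the example above but not for every document (discounts or shipping can break it).

```python
import csv
import io
import math


def totals_are_consistent(doc: dict, tolerance: float = 0.01) -> bool:
    """Return True when subtotal + taxAmount matches totalAmount within `tolerance`."""
    amounts = [doc.get(key) for key in ("subtotal", "taxAmount", "totalAmount")]
    if any(amount is None for amount in amounts):
        return False  # nothing to compare if a field was not extracted
    subtotal, tax, total = amounts
    return math.isclose(subtotal + tax, total, abs_tol=tolerance)


def table_to_csv(table: dict) -> str:
    """Render one entry of `tables[]` as CSV text, using its `columns` as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=table["columns"], extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table["rows"])
    return buffer.getvalue()


# Values copied from the example output above.
doc = {
    "subtotal": 1100.0,
    "taxAmount": 150.0,
    "totalAmount": 1250.0,
    "tables": [
        {
            "title": "Line items",
            "columns": ["description", "qty", "unitPrice", "total"],
            "rows": [{"description": "Consulting", "qty": 10, "unitPrice": 110, "total": 1100}],
        }
    ],
}
print(totals_are_consistent(doc))
print(table_to_csv(doc["tables"][0]))
```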

FAQ

Which documents work best? Clear digital PDFs and sharp scans/photos. Very low-resolution or handwritten documents may not be readable - the reason is reported in failureReason.

Does it handle multi-page PDFs? Yes. PDFs are read natively, including tables and layout, across pages.

Can I upload a file directly? Yes - use the upload field in the input, or call the API with a document URL.

Can I call it in real time? Yes. The Standby endpoint POST /extract responds synchronously. See below.

🔌 API integration

Batch run:

curl -X POST "https://api.apify.com/v2/acts/acme-ai~ocr-tax-document-ai/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"documentUrls":["https://example.com/invoice.pdf","https://example.com/receipt.jpg"]}'
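The same batch call can be made from Python with only the standard library. This is a sketch that mirrors the curl command above; the function names and the 300-second timeout are assumptions chosen for illustration, and the response is assumed to be the JSON array shown under Example output.

```python
import json
import urllib.parse
import urllib.request

ACTOR_ID = "acme-ai~ocr-tax-document-ai"  # taken from the curl example above


def build_batch_request(token: str, document_urls: list[str]) -> urllib.request.Request:
    """Build the POST request for a synchronous batch run."""
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps({"documentUrls": document_urls}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def run_batch(token: str, document_urls: list[str], timeout: float = 300.0) -> list[dict]:
    """Send the request and return the parsed dataset items."""
    request = build_batch_request(token, document_urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `run_batch("YOUR_TOKEN", ["https://example.com/invoice.pdf"])` performs the network request; `build_batch_request` alone is useful for inspecting the URL and body first.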

Standby (POST /extract):

curl -X POST "https://acme-ai--ocr-tax-document-ai.apify.actor/extract" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  --compressed \
  -d '{"documentUrls":["https://example.com/invoice.pdf","https://example.com/receipt.jpg"]}'

For the Standby endpoint, send the token in the Authorization: Bearer header, never in the URL.
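A Python equivalent of the Standby call might look like the sketch below, with the token carried in the header rather than the query string. The function names are hypothetical, and because the response shape of `/extract` is not documented here, the code returns the parsed JSON as-is without assuming its structure.

```python
import json
import urllib.error
import urllib.request

STANDBY_URL = "https://acme-ai--ocr-tax-document-ai.apify.actor/extract"  # from the curl example


def build_extract_request(token: str, document_urls: list[str]) -> urllib.request.Request:
    """Build the POST /extract request with the token in the Authorization header."""
    body = json.dumps({"documentUrls": document_urls}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(STANDBY_URL, data=body, headers=headers, method="POST")


def extract(token: str, document_urls: list[str], timeout: float = 120.0):
    """Send the request; return parsed JSON, or raise RuntimeError with the HTTP status."""
    request = build_extract_request(token, document_urls)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        detail = exc.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"/extract failed with HTTP {exc.code}: {detail}") from exc
```

The 120-second timeout is an arbitrary choice for the sketch; adjust it to the size of the documents you send.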

Notes

This Actor analyzes documents you provide. You are responsible for having the right to process them and any personal or financial data they may contain.
