# Agentic Document Extractor (`solutionssmart/agentic-document-extractor-local`) Actor

Extract RAG-ready chunks with provenance from PDFs, scans, images, DOCX, XLSX, PPTX, CSV, TXT, and Markdown using a local-first Apify Actor.

- **URL**: https://apify.com/solutionssmart/agentic-document-extractor-local.md
- **Developed by:** [Solutions Smart](https://apify.com/solutionssmart) (community)
- **Categories:** AI, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Agentic Document Extractor

Extract public documents into clean, RAG-ready chunks with provenance.

This Actor downloads documents from public URLs, converts them into normalized semantic blocks, and outputs structured chunks that are ready for vector databases, search pipelines, LLM retrieval, and downstream automation. It is designed for practical ingestion workflows where you want deterministic extraction, traceable source context, and clean machine-readable output instead of raw OCR dumps.

### Why use it

- Converts common business documents into structured chunks, not just plain text blobs
- Preserves provenance with page ranges and bounding boxes when available
- Handles mixed document sets in one run
- Exposes stable `SUMMARY` and `MANIFEST` records for orchestration and monitoring
- Works well as a preprocessing step for RAG, indexing, classification, and enrichment pipelines

### 🧾 Supported formats

- PDF
- Images: PNG, JPG, JPEG, TIFF, WEBP, GIF
- DOCX
- XLSX
- CSV
- PPTX
- TXT
- Markdown

### How extraction works

- PDFs use the embedded text layer first for speed and accuracy
- Sparse or scanned PDFs can fall back to OCR depending on `ocrFallbackMode`
- Images are processed with OCR
- DOCX files are converted into headings, paragraphs, lists, and tables
- XLSX and CSV files are converted into sheet-aware table blocks
- PPTX files prefer LibreOffice-to-PDF conversion and fall back to XML text extraction when needed
- Chunking is deterministic and based on structure, page boundaries, tables, size limits, and overlap
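
The size-limit and overlap part of the last bullet can be illustrated with a minimal sketch. This is not the Actor's implementation, which also takes structure, page boundaries, and tables into account. `chunk_text` is a hypothetical helper; only the default values 1800 and 200 are taken from the input schema.

```python
def chunk_text(text: str, max_chars: int = 1800, overlap_chars: int = 200) -> list[str]:
    """Split text into fixed-size windows where neighbours share `overlap_chars` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")

    step = max_chars - overlap_chars  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

Because the split depends only on the input text and the two parameters, repeated runs over the same text yield the same chunks, which is the property "deterministic" refers to here.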

### 🎯 Typical use cases

- Preparing document corpora for RAG or vector search
- Normalizing invoices, reports, slide decks, and spreadsheets before AI processing
- Building ingestion pipelines that need both chunk text and source provenance
- Converting legacy documents into structured JSON for automation workflows

### 📥 Input example

Use `documents` to provide public file URLs and tune chunking or OCR behavior as needed.

```json
{
  "documents": [
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/skew.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/typewriter.png"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.xlsx"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pptx"
    }
  ],
  "maxConcurrency": 3,
  "ocrLanguages": ["eng"],
  "ocrFallbackMode": "auto",
  "chunkMaxChars": 1800,
  "chunkOverlapChars": 200,
  "maxPagesPerDocument": 200,
  "emitMarkdown": true,
  "emitRawText": true,
  "emitBoundingBoxes": true
}
````

### 📤 Output

The Actor writes one dataset item per chunk and also stores two stable records in the default key-value store:

- `SUMMARY` for run-level metrics 📊
- `MANIFEST` for per-document status, warnings, and failure reporting 🗂️
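
As a rough sketch, both records could be read with the Python client's key-value store API once a run has finished. `read_run_records` is a hypothetical helper name, and the internal fields of `SUMMARY` and `MANIFEST` are not documented here, so inspect the returned values before relying on any particular key.

```python
from typing import Any, Optional


def read_run_records(client: Any, run: dict) -> dict[str, Optional[Any]]:
    """Fetch the SUMMARY and MANIFEST values from a finished run's default key-value store."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    records = {}
    for key in ("SUMMARY", "MANIFEST"):
        record = store.get_record(key)  # returns None when the record does not exist
        records[key] = record["value"] if record is not None else None
    return records
```

Here `client` is assumed to be an `ApifyClient` instance and `run` the dictionary returned by `client.actor(...).call(...)`, as in the Python example in the API section.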

Each chunk item includes:

- `documentId`, `sourceUrl`, `fileType`
- `chunkId`, `chunkIndex`, `chunkType`
- `text`, `markdown`
- `pageStart`, `pageEnd`
- `sectionPath`
- `bbox`
- `charCount`, `tokenEstimate`
- `language`
- `extractionMode`

### 🧩 Example dataset item

```json
{
  "documentId": "caa40e3b17148c75",
  "sourceUrl": "https://example.com/report.pdf",
  "fileType": "pdf",
  "chunkId": "caa40e3b17148c75-1",
  "chunkIndex": 0,
  "chunkType": "page",
  "text": "Quarterly revenue report...",
  "markdown": "Quarterly revenue report...",
  "pageStart": 1,
  "pageEnd": 2,
  "sectionPath": ["Executive Summary"],
  "bbox": {
    "pageNumber": 1,
    "x": 90,
    "y": 71.28,
    "width": 431.88,
    "height": 68.16
  },
  "charCount": 324,
  "tokenEstimate": 81,
  "language": "eng",
  "extractionMode": "text_layer"
}
```
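
One way to consume such an item downstream is to map it to a record for a vector store while keeping its provenance. The sketch below is only an illustration: `to_vector_record` and the `id`/`text`/`metadata` record shape are assumptions, not something the Actor or any particular vector database prescribes.

```python
def to_vector_record(item: dict) -> dict:
    """Map one chunk item to an id/text/metadata record that keeps its provenance."""
    start, end = item.get("pageStart"), item.get("pageEnd")
    if start is None:
        pages = ""  # no page information available for this chunk
    elif end is None or end == start:
        pages = f" (p. {start})"
    else:
        pages = f" (pp. {start}-{end})"

    return {
        "id": item["chunkId"],
        # Prefer plain text; fall back to markdown if raw text was not emitted.
        "text": item.get("text") or item.get("markdown") or "",
        "metadata": {
            "documentId": item["documentId"],
            "sourceUrl": item["sourceUrl"],
            "sectionPath": " > ".join(item.get("sectionPath") or []),
            "citation": f"{item['sourceUrl']}{pages}",
        },
    }
```

The `citation` string gives a retrieval pipeline something human-readable to show next to an answer, while `documentId` and `sourceUrl` allow filtering or deleting all chunks of one document.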

### 🛠️ Operational notes

- Public URLs only in v1
- Runs are deterministic and do not require an LLM provider
- OCR quality depends on the source file and available OCR tooling
- PPTX conversion uses LibreOffice when available and falls back gracefully when it is not

### 🚧 Current limitations

- Public URLs only in v1. No cookies, auth headers, or private file fetch support.
- Advanced form semantics, checkbox state extraction, and layout-aware table reconstruction are intentionally limited.
- Scanned PDF OCR depends on rasterization tooling being available.

### Price

The Actor charges only after a successful extraction. Once the charge limit for a configured event is reached, it stops starting new documents.

# Actor input Schema

## `documents` (type: `array`):

Public document URLs to process.

## `maxConcurrency` (type: `integer`):

Maximum number of documents to download and process in parallel.

## `ocrLanguages` (type: `array`):

Tesseract language codes used when OCR is required.

## `ocrFallbackMode` (type: `string`):

Choose whether OCR runs automatically on sparse PDFs, always, or never.

## `chunkMaxChars` (type: `integer`):

Maximum approximate character length for each chunk.

## `chunkOverlapChars` (type: `integer`):

Approximate overlap between adjacent chunks.

## `maxPagesPerDocument` (type: `integer`):

Hard limit on pages or slides processed from a single document.

## `emitMarkdown` (type: `boolean`):

Include markdown-formatted chunk content when available.

## `emitRawText` (type: `boolean`):

Include plain text chunk content.

## `emitBoundingBoxes` (type: `boolean`):

Include bounding box provenance when the extractor can provide it.

## Actor input object example

```json
{
  "documents": [
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"
    }
  ],
  "maxConcurrency": 3,
  "ocrLanguages": [
    "eng"
  ],
  "ocrFallbackMode": "auto",
  "chunkMaxChars": 1800,
  "chunkOverlapChars": 200,
  "maxPagesPerDocument": 200,
  "emitMarkdown": true,
  "emitRawText": true,
  "emitBoundingBoxes": true
}
```
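
Before starting a run, it can help to check an input object locally. The following sketch is a hypothetical pre-flight check, not part of the Actor or the Apify client; the bounds, the allowed `ocrFallbackMode` values, and the URL pattern are copied from the OpenAPI input schema later in this document.

```python
import re

# Bounds copied from the OpenAPI input schema in this document.
INT_BOUNDS = {
    "maxConcurrency": (1, 10),
    "chunkMaxChars": (250, 10000),
    "chunkOverlapChars": (0, 2000),
    "maxPagesPerDocument": (1, 2000),
}
OCR_MODES = {"auto", "always", "never"}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input passed these checks."""
    problems = []
    documents = actor_input.get("documents")
    if not isinstance(documents, list) or not documents:
        problems.append("documents must be a non-empty array")
    else:
        for i, doc in enumerate(documents):
            url = doc.get("url") if isinstance(doc, dict) else None
            if not isinstance(url, str) or not re.match(r"^https?://", url):
                problems.append(f"documents[{i}].url must be an http or https URL")
    for name, (low, high) in INT_BOUNDS.items():
        if name in actor_input:
            value = actor_input[name]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(value, int) or isinstance(value, bool) or not low <= value <= high:
                problems.append(f"{name} must be an integer in [{low}, {high}]")
    if actor_input.get("ocrFallbackMode", "auto") not in OCR_MODES:
        problems.append("ocrFallbackMode must be one of: always, auto, never")
    return problems
```

The platform validates the input against the schema as well; a local check like this mainly shortens the feedback loop in scripts and tests.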

# Actor output Schema

## `results` (type: `string`):

No description

## `summary` (type: `string`):

No description

## `manifest` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "documents": [
        {
            "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"
        },
        {
            "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("solutionssmart/agentic-document-extractor-local").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "documents": [
        {"url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"},
        {"url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"},
    ]
}

# Run the Actor and wait for it to finish
run = client.actor("solutionssmart/agentic-document-extractor-local").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "documents": [
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf"
    },
    {
      "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx"
    }
  ]
}' |
apify call solutionssmart/agentic-document-extractor-local --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=solutionssmart/agentic-document-extractor-local",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Agentic Document Extractor",
        "description": "Extract RAG-ready chunks with provenance from PDFs, scans, images, DOCX, XLSX, PPTX, CSV, TXT, and Markdown using a local-first Apify Actor.",
        "version": "0.1",
        "x-build-id": "Ce9CQWE4FIyjpwD9b"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/solutionssmart~agentic-document-extractor-local/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-solutionssmart-agentic-document-extractor-local",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/solutionssmart~agentic-document-extractor-local/runs": {
            "post": {
                "operationId": "runs-sync-solutionssmart-agentic-document-extractor-local",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/solutionssmart~agentic-document-extractor-local/run-sync": {
            "post": {
                "operationId": "run-sync-solutionssmart-agentic-document-extractor-local",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "documents"
                ],
                "properties": {
                    "documents": {
                        "title": "📚 Documents",
                        "minItems": 1,
                        "type": "array",
                        "description": "Public document URLs to process.",
                        "items": {
                            "type": "object",
                            "additionalProperties": false,
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "title": "🔗 Document URL",
                                    "type": "string",
                                    "description": "Public http or https URL for a document to extract.",
                                    "pattern": "^https?://"
                                },
                                "id": {
                                    "title": "🆔 Document ID",
                                    "type": "string",
                                    "description": "Optional stable identifier to use in outputs."
                                },
                                "label": {
                                    "title": "🏷️ Label",
                                    "type": "string",
                                    "description": "Optional human-friendly document label."
                                },
                                "fileName": {
                                    "title": "📝 Preferred file name",
                                    "type": "string",
                                    "description": "Optional file name override used when the URL does not contain a useful name."
                                },
                                "mimeTypeHint": {
                                    "title": "🧠 MIME type hint",
                                    "type": "string",
                                    "description": "Optional MIME type hint for ambiguous URLs."
                                }
                            }
                        }
                    },
                    "maxConcurrency": {
                        "title": "⚡ Max concurrency",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum number of documents to download and process in parallel.",
                        "default": 3
                    },
                    "ocrLanguages": {
                        "title": "🔤 OCR languages",
                        "type": "array",
                        "description": "Tesseract language codes used when OCR is required.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "eng"
                        ]
                    },
                    "ocrFallbackMode": {
                        "title": "👀 OCR fallback mode",
                        "enum": [
                            "auto",
                            "always",
                            "never"
                        ],
                        "type": "string",
                        "description": "Choose whether OCR runs automatically on sparse PDFs, always, or never.",
                        "default": "auto"
                    },
                    "chunkMaxChars": {
                        "title": "✂️ Chunk max chars",
                        "minimum": 250,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum approximate character length for each chunk.",
                        "default": 1800
                    },
                    "chunkOverlapChars": {
                        "title": "🔁 Chunk overlap chars",
                        "minimum": 0,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Approximate overlap between adjacent chunks.",
                        "default": 200
                    },
                    "maxPagesPerDocument": {
                        "title": "📄 Max pages per document",
                        "minimum": 1,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Hard limit on pages or slides processed from a single document.",
                        "default": 200
                    },
                    "emitMarkdown": {
                        "title": "🧾 Emit markdown",
                        "type": "boolean",
                        "description": "Include markdown-formatted chunk content when available.",
                        "default": true
                    },
                    "emitRawText": {
                        "title": "🔡 Emit raw text",
                        "type": "boolean",
                        "description": "Include plain text chunk content.",
                        "default": true
                    },
                    "emitBoundingBoxes": {
                        "title": "📐 Emit bounding boxes",
                        "type": "boolean",
                        "description": "Include bounding box provenance when the extractor can provide it.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
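
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be turned into a request using only the Python standard library. This is a minimal sketch: `build_sync_request` is a hypothetical helper, and it only builds the request so that it can be inspected before anything is sent.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/solutionssmart~agentic-document-extractor-local/run-sync-get-dataset-items"


def build_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})  # token is a required query parameter
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would start the run and block until it finishes. The specification only documents the 200 response as "OK", so treating the body as a JSON array of dataset items is an assumption based on the endpoint summary.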
