Apify Store listing · Developer: Solutions Smart · Maintained by Community · Pricing: Pay per event · Last modified 6 days ago
Agentic Document Extractor

Extract public documents into clean, RAG-ready chunks with provenance.

This Actor downloads documents from public URLs, converts them into normalized semantic blocks, and outputs structured chunks that are ready for vector databases, search pipelines, LLM retrieval, and downstream automation. It is designed for practical ingestion workflows where you want deterministic extraction, traceable source context, and clean machine-readable output instead of raw OCR dumps.

Why use it

  • Converts common business documents into structured chunks, not just plain text blobs
  • Preserves provenance with page ranges and bounding boxes when available
  • Handles mixed document sets in one run
  • Exposes stable SUMMARY and MANIFEST records for orchestration and monitoring
  • Works well as a preprocessing step for RAG, indexing, classification, and enrichment pipelines

🧾 Supported formats

  • PDF
  • Images: PNG, JPG, JPEG, TIFF, WEBP, GIF
  • DOCX
  • XLSX
  • CSV
  • PPTX
  • TXT
  • Markdown

How extraction works

  • PDFs use the embedded text layer first for speed and accuracy
  • Sparse or scanned PDFs can fall back to OCR depending on ocrFallbackMode
  • Images are processed with OCR
  • DOCX files are converted into headings, paragraphs, lists, and tables
  • XLSX and CSV files are converted into sheet-aware table blocks
  • PPTX files prefer LibreOffice-to-PDF conversion and fall back to XML text extraction when needed
  • Chunking is deterministic and based on structure, page boundaries, tables, size limits, and overlap
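The deterministic chunking step can be sketched as a simple size-and-overlap splitter. The Actor's real chunker also honors structure, page boundaries, and tables; this minimal sketch only illustrates how chunkMaxChars and chunkOverlapChars interact.

```python
def chunk_text(text: str, max_chars: int = 1800, overlap: int = 200) -> list[str]:
    """Deterministically split text into overlapping fixed-size chunks.

    Illustrative only: the Actor's actual chunker also respects headings,
    page boundaries, and table blocks, which this sketch ignores.
    """
    if overlap >= max_chars:
        raise ValueError("chunkOverlapChars must be smaller than chunkMaxChars")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks share context
    return chunks
```

Because the split depends only on the input text and the two parameters, repeated runs over the same document always produce identical chunk boundaries.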

🎯 Typical use cases

  • Preparing document corpora for RAG or vector search
  • Normalizing invoices, reports, slide decks, and spreadsheets before AI processing
  • Building ingestion pipelines that need both chunk text and source provenance
  • Converting legacy documents into structured JSON for automation workflows

📥 Input example

Use the documents array to provide public file URLs; the remaining fields tune chunking and OCR behavior.

{
  "documents": [
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf" },
    { "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/skew.pdf" },
    { "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/typewriter.png" },
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx" },
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.xlsx" },
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pptx" }
  ],
  "maxConcurrency": 3,
  "ocrLanguages": ["eng"],
  "ocrFallbackMode": "auto",
  "chunkMaxChars": 1800,
  "chunkOverlapChars": 200,
  "maxPagesPerDocument": 200,
  "emitMarkdown": true,
  "emitRawText": true,
  "emitBoundingBoxes": true
}
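A run can also be started programmatically with the apify-client package. The helper below assembles an input payload matching the example above; the `<ACTOR_ID>` and `<APIFY_TOKEN>` placeholders are assumptions to be replaced with the real Store ID and your API token.

```python
def build_run_input(urls, chunk_max_chars=1800, chunk_overlap_chars=200):
    """Assemble an Actor input payload matching the example above."""
    return {
        "documents": [{"url": u} for u in urls],
        "maxConcurrency": 3,
        "ocrLanguages": ["eng"],
        "ocrFallbackMode": "auto",
        "chunkMaxChars": chunk_max_chars,
        "chunkOverlapChars": chunk_overlap_chars,
        "maxPagesPerDocument": 200,
        "emitMarkdown": True,
        "emitRawText": True,
        "emitBoundingBoxes": True,
    }

# Starting a run (assumes `pip install apify-client` and this Actor's
# Store ID in place of <ACTOR_ID>):
#
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<ACTOR_ID>").call(run_input=build_run_input([
#     "https://example.com/report.pdf",
# ]))
```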

📤 Output

The Actor writes one dataset item per chunk and also stores two stable records in the default key-value store:

  • SUMMARY for run-level metrics 📊
  • MANIFEST for per-document status, warnings, and failure reporting 🗂️
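Once a run finishes, both records can be fetched from the run's default key-value store with apify-client. The MANIFEST schema is not spelled out here, so the status check below assumes hypothetical per-document entries with sourceUrl and status fields, in line with the "per-document status" description above.

```python
# Fetching the stable records (assumes apify-client and a finished `run`):
#
# store = client.key_value_store(run["defaultKeyValueStoreId"])
# summary = store.get_record("SUMMARY")["value"]
# manifest = store.get_record("MANIFEST")["value"]

def unfinished_documents(manifest_entries):
    """Return entries whose (hypothetical) status field is not 'succeeded'.

    The entry shape is an assumption for illustration; check the actual
    MANIFEST record for the real schema.
    """
    return [e for e in manifest_entries if e.get("status") != "succeeded"]
```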

Each chunk item includes:

  • documentId, sourceUrl, fileType
  • chunkId, chunkIndex, chunkType
  • text, markdown
  • pageStart, pageEnd
  • sectionPath
  • bbox
  • charCount, tokenEstimate
  • language
  • extractionMode

🧩 Example dataset item

{
  "documentId": "caa40e3b17148c75",
  "sourceUrl": "https://example.com/report.pdf",
  "fileType": "pdf",
  "chunkId": "caa40e3b17148c75-1",
  "chunkIndex": 0,
  "chunkType": "page",
  "text": "Quarterly revenue report...",
  "markdown": "Quarterly revenue report...",
  "pageStart": 1,
  "pageEnd": 2,
  "sectionPath": ["Executive Summary"],
  "bbox": {
    "pageNumber": 1,
    "x": 90,
    "y": 71.28,
    "width": 431.88,
    "height": 68.16
  },
  "charCount": 324,
  "tokenEstimate": 81,
  "language": "eng",
  "extractionMode": "text_layer"
}
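Dataset items like the one above can be flattened into text-plus-metadata records for a vector store. The mapping below is illustrative, not part of the Actor's output; the field names follow the chunk schema. (In the example, tokenEstimate works out to charCount / 4, a common characters-per-token heuristic, though the Actor's exact estimator is not documented here.)

```python
def to_rag_document(item: dict) -> dict:
    """Flatten one chunk item into a text + metadata record for indexing.

    Field names come from the chunk schema above; the flattened shape
    itself is an illustrative choice, not something the Actor emits.
    """
    return {
        "id": item["chunkId"],
        "text": item["text"],
        "metadata": {
            "source": item["sourceUrl"],
            "file_type": item["fileType"],
            "pages": [item["pageStart"], item["pageEnd"]],
            "section": " > ".join(item.get("sectionPath") or []),
            "extraction_mode": item["extractionMode"],
        },
    }
```

Keeping pageStart, pageEnd, and sectionPath in the metadata lets a retrieval pipeline cite the exact source location of each chunk.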

🛠️ Operational notes

  • Public URLs only in v1
  • Runs are deterministic and do not require an LLM provider
  • OCR quality depends on the source file and available OCR tooling
  • PPTX conversion uses LibreOffice when available and falls back gracefully when it is not

🚧 Current limitations

  • Public URLs only in v1. No cookies, auth headers, or private file fetch support.
  • Advanced form semantics, checkbox state extraction, and layout-aware table reconstruction are intentionally limited.
  • Scanned PDF OCR depends on rasterization tooling being available.

Price

The Actor charges only after successful extraction. Once the configured charge limit for an event is reached, it stops starting new documents.