Apify Store listing · Developer: Solutions Smart · Maintained by Community · Pricing: Pay per event · Last modified 6 days ago
Agentic Document Extractor

Extract public documents into clean, RAG-ready chunks with provenance.

This Actor downloads documents from public URLs, converts them into normalized semantic blocks, and outputs structured chunks that are ready for vector databases, search pipelines, LLM retrieval, and downstream automation. It is designed for practical ingestion workflows where you want deterministic extraction, traceable source context, and clean machine-readable output instead of raw OCR dumps.

Why use it

  • Converts common business documents into structured chunks, not just plain text blobs
  • Preserves provenance with page ranges and bounding boxes when available
  • Handles mixed document sets in one run
  • Exposes stable SUMMARY and MANIFEST records for orchestration and monitoring
  • Works well as a preprocessing step for RAG, indexing, classification, and enrichment pipelines

🧾 Supported formats

  • PDF
  • Images: PNG, JPG, JPEG, TIFF, WEBP, GIF
  • DOCX
  • XLSX
  • CSV
  • PPTX
  • TXT
  • Markdown

How extraction works

  • PDFs use the embedded text layer first for speed and accuracy
  • Sparse or scanned PDFs can fall back to OCR depending on ocrFallbackMode
  • Images are processed with OCR
  • DOCX files are converted into headings, paragraphs, lists, and tables
  • XLSX and CSV files are converted into sheet-aware table blocks
  • PPTX files prefer LibreOffice-to-PDF conversion and fall back to XML text extraction when needed
  • Chunking is deterministic and based on structure, page boundaries, tables, size limits, and overlap
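The deterministic chunking step can be sketched as a simple size-and-overlap splitter. The Actor's real chunker also honors structure, page boundaries, and tables; this minimal sketch only illustrates how chunkMaxChars and chunkOverlapChars interact.

```python
def chunk_text(text: str, max_chars: int = 1800, overlap: int = 200) -> list[str]:
    """Deterministically split text into overlapping fixed-size chunks.

    Illustrative only: the Actor's actual chunker also respects headings,
    page boundaries, and table blocks, which this sketch ignores.
    """
    if overlap >= max_chars:
        raise ValueError("chunkOverlapChars must be smaller than chunkMaxChars")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks share context
    return chunks
```

Because the split depends only on the input text and the two parameters, repeated runs over the same document always produce identical chunk boundaries.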

🎯 Typical use cases

  • Preparing document corpora for RAG or vector search
  • Normalizing invoices, reports, slide decks, and spreadsheets before AI processing
  • Building ingestion pipelines that need both chunk text and source provenance
  • Converting legacy documents into structured JSON for automation workflows

📥 Input example

Use the documents array to provide public file URLs; the remaining fields tune chunking and OCR behavior.

{
  "documents": [
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pdf" },
    { "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/skew.pdf" },
    { "url": "https://raw.githubusercontent.com/ocrmypdf/OCRmyPDF/main/tests/resources/typewriter.png" },
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.docx" },
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.xlsx" },
    { "url": "https://raw.githubusercontent.com/alexschiller/file-format-commons/master/files/ffc.pptx" }
  ],
  "maxConcurrency": 3,
  "ocrLanguages": ["eng"],
  "ocrFallbackMode": "auto",
  "chunkMaxChars": 1800,
  "chunkOverlapChars": 200,
  "maxPagesPerDocument": 200,
  "emitMarkdown": true,
  "emitRawText": true,
  "emitBoundingBoxes": true
}
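A run can also be started programmatically with the apify-client package. The helper below assembles an input payload matching the example above; the `<ACTOR_ID>` and `<APIFY_TOKEN>` placeholders are assumptions to be replaced with the real Store ID and your API token.

```python
def build_run_input(urls, chunk_max_chars=1800, chunk_overlap_chars=200):
    """Assemble an Actor input payload matching the example above."""
    return {
        "documents": [{"url": u} for u in urls],
        "maxConcurrency": 3,
        "ocrLanguages": ["eng"],
        "ocrFallbackMode": "auto",
        "chunkMaxChars": chunk_max_chars,
        "chunkOverlapChars": chunk_overlap_chars,
        "maxPagesPerDocument": 200,
        "emitMarkdown": True,
        "emitRawText": True,
        "emitBoundingBoxes": True,
    }

# Starting a run (assumes `pip install apify-client` and this Actor's
# Store ID in place of <ACTOR_ID>):
#
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<ACTOR_ID>").call(run_input=build_run_input([
#     "https://example.com/report.pdf",
# ]))
```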

📤 Output

The Actor writes one dataset item per chunk and also stores two stable records in the default key-value store:

  • SUMMARY for run-level metrics 📊
  • MANIFEST for per-document status, warnings, and failure reporting 🗂️
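Once a run finishes, both records can be fetched from the run's default key-value store with apify-client. The MANIFEST schema is not spelled out here, so the status check below assumes hypothetical per-document entries with sourceUrl and status fields, in line with the "per-document status" description above.

```python
# Fetching the stable records (assumes apify-client and a finished `run`):
#
# store = client.key_value_store(run["defaultKeyValueStoreId"])
# summary = store.get_record("SUMMARY")["value"]
# manifest = store.get_record("MANIFEST")["value"]

def unfinished_documents(manifest_entries):
    """Return entries whose (hypothetical) status field is not 'succeeded'.

    The entry shape is an assumption for illustration; check the actual
    MANIFEST record for the real schema.
    """
    return [e for e in manifest_entries if e.get("status") != "succeeded"]
```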

Each chunk item includes:

  • documentId, sourceUrl, fileType
  • chunkId, chunkIndex, chunkType
  • text, markdown
  • pageStart, pageEnd
  • sectionPath
  • bbox
  • charCount, tokenEstimate
  • language
  • extractionMode

🧩 Example dataset item

{
  "documentId": "caa40e3b17148c75",
  "sourceUrl": "https://example.com/report.pdf",
  "fileType": "pdf",
  "chunkId": "caa40e3b17148c75-1",
  "chunkIndex": 0,
  "chunkType": "page",
  "text": "Quarterly revenue report...",
  "markdown": "Quarterly revenue report...",
  "pageStart": 1,
  "pageEnd": 2,
  "sectionPath": ["Executive Summary"],
  "bbox": {
    "pageNumber": 1,
    "x": 90,
    "y": 71.28,
    "width": 431.88,
    "height": 68.16
  },
  "charCount": 324,
  "tokenEstimate": 81,
  "language": "eng",
  "extractionMode": "text_layer"
}
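Dataset items like the one above can be flattened into text-plus-metadata records for a vector store. The mapping below is illustrative, not part of the Actor's output; the field names follow the chunk schema. (In the example, tokenEstimate works out to charCount / 4, a common characters-per-token heuristic, though the Actor's exact estimator is not documented here.)

```python
def to_rag_document(item: dict) -> dict:
    """Flatten one chunk item into a text + metadata record for indexing.

    Field names come from the chunk schema above; the flattened shape
    itself is an illustrative choice, not something the Actor emits.
    """
    return {
        "id": item["chunkId"],
        "text": item["text"],
        "metadata": {
            "source": item["sourceUrl"],
            "file_type": item["fileType"],
            "pages": [item["pageStart"], item["pageEnd"]],
            "section": " > ".join(item.get("sectionPath") or []),
            "extraction_mode": item["extractionMode"],
        },
    }
```

Keeping pageStart, pageEnd, and sectionPath in the metadata lets a retrieval pipeline cite the exact source location of each chunk.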

🛠️ Operational notes

  • Public URLs only in v1
  • Runs are deterministic and do not require an LLM provider
  • OCR quality depends on the source file and available OCR tooling
  • PPTX conversion uses LibreOffice when available and falls back gracefully when it is not

🚧 Current limitations

  • Public URLs only in v1. No cookies, auth headers, or private file fetch support.
  • Advanced form semantics, checkbox state extraction, and layout-aware table reconstruction are intentionally limited.
  • Scanned PDF OCR depends on rasterization tooling being available.

Price

The Actor charges only after successful extraction. Once the configured charge limit for an event is reached, it stops starting new documents.