RAG Document Converter

Pricing

$4.00/month + usage

Convert PDF, DOCX, PPTX, and other documents to clean Markdown optimized for RAG pipelines. Preserves structure, tables, and headers. Powered by IBM Docling.

Developer

Web Harvester

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 10 days ago

📄 Convert documents to clean Markdown optimized for RAG pipelines

What This Actor Does

  • Multi-format support - PDF, DOCX, PPTX, XLSX, HTML, images
  • Structure preservation - Keeps headers, tables, lists intact
  • RAG-optimized - Clean Markdown for LLM ingestion
  • Section chunking - Split by headers for vector stores
  • Metadata extraction - Title, author, page count
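
To illustrate what "split by headers" means in practice, here is a minimal sketch of header-based splitting done locally on Markdown text. It is not the Actor's implementation; the function name `split_by_headers` and the returned field names are assumptions chosen for this example.

```python
import re

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into sections, starting a new one at each heading."""
    sections = []
    # Text that appears before the first heading gets an empty title.
    current = {"title": "", "level": 0, "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # "#" lines inside code are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append(current)
            current = {
                "title": match.group(2),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    sections.append(current)
    # Drop sections that have neither a title nor any text.
    return [
        {
            "title": s["title"],
            "level": s["level"],
            "content": "\n".join(s["lines"]).strip(),
        }
        for s in sections
        if s["title"] or any(text.strip() for text in s["lines"])
    ]
```

Each returned section keeps its heading level, so a caller can merge small sub-sections back into their parent before embedding.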

Use Cases

| Use Case | Description |
| --- | --- |
| RAG Pipelines | Convert docs for retrieval-augmented generation |
| Knowledge Bases | Build searchable documentation |
| Content Migration | Convert legacy documents |
| LLM Context | Prepare documents for LLM analysis |
| Document Search | Index documents for semantic search |

Input Examples

Basic PDF to Markdown

{
  "fileUrls": ["https://example.com/document.pdf"],
  "outputFormat": "markdown"
}

With Section Chunking

{
  "fileUrls": ["https://example.com/report.pdf"],
  "outputFormat": "markdown",
  "chunkBySection": true
}

Multiple Formats

{
  "fileUrls": [
    "https://example.com/doc.pdf",
    "https://example.com/slides.pptx",
    "https://example.com/data.xlsx"
  ],
  "outputFormat": "markdown"
}

With OCR

{
  "fileUrls": ["https://example.com/scanned.pdf"],
  "outputFormat": "markdown",
  "enableOcr": true
}
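
The inputs above can also be submitted from Python. The sketch below assumes the official `apify-client` package, an API token in the `APIFY_TOKEN` environment variable, and that the Actor writes its results to the run's default dataset. The Actor ID is a placeholder, since this page does not state it.

```python
def run_converter(client, actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    # Assumption: one result item per document lands in the default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    # Imported here so run_converter stays usable without the package installed.
    import os

    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    items = run_converter(
        client,
        "<username>/<actor-name>",  # placeholder: copy the real ID from the Actor page
        {
            "fileUrls": ["https://example.com/document.pdf"],
            "outputFormat": "markdown",
        },
    )
    for item in items:
        print(item["source"], item.get("success"))
```

Calling `main()` performs a real run and consumes compute units, so it is left uninvoked here.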

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| fileUrls | array | - | Document URLs (required) |
| outputFormat | string | "markdown" | Output format |
| enableOcr | boolean | false | Use OCR for scanned docs |
| preserveTables | boolean | true | Convert tables |
| extractImages | boolean | false | Extract embedded images |
| chunkBySection | boolean | false | Split by headers |
| includeMetadata | boolean | true | Include doc metadata |
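
As a rough sketch, the parameters and defaults in this table can be mirrored in a small helper that builds and checks an input object before a run. The helper name `build_run_input` is hypothetical, and whether the Actor itself rejects unknown keys or values is not stated here. The allowed `outputFormat` values are taken from the Output Formats section.

```python
# Defaults copied from the Configuration table.
DEFAULTS = {
    "outputFormat": "markdown",
    "enableOcr": False,
    "preserveTables": True,
    "extractImages": False,
    "chunkBySection": False,
    "includeMetadata": True,
}
# Values listed in the Output Formats section.
OUTPUT_FORMATS = {"markdown", "html", "json", "text"}


def build_run_input(file_urls: list[str], **overrides) -> dict:
    """Return a complete input object, with table defaults for omitted options."""
    if not file_urls:
        raise ValueError("fileUrls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    run_input = {"fileUrls": list(file_urls), **DEFAULTS, **overrides}
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported outputFormat: {run_input['outputFormat']!r}")
    for name, default in DEFAULTS.items():
        if isinstance(default, bool) and not isinstance(run_input[name], bool):
            raise TypeError(f"{name} must be a boolean")
    return run_input
```

For example, `build_run_input(["https://example.com/scanned.pdf"], enableOcr=True)` yields the "With OCR" input plus the remaining defaults spelled out.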

Supported Formats

| Format | Extensions |
| --- | --- |
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Excel | .xlsx |
| HTML | .html, .htm |
| Images | .png, .jpg, .jpeg, .tiff, .bmp |
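
A pre-flight check against this list can catch obviously unsupported files before a run is started. The sketch below only inspects the URL path. How the Actor itself detects file types is not documented here, so URLs without a recognizable extension are set aside for review rather than treated as invalid. The function name is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp",
}


def partition_urls(urls: list[str]) -> tuple[list[str], list[str]]:
    """Split URLs into (recognized, unrecognized) by file extension."""
    recognized, unrecognized = [], []
    for url in urls:
        # Use only the path so query strings and fragments are ignored.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in SUPPORTED_EXTENSIONS:
            recognized.append(url)
        else:
            unrecognized.append(url)
    return recognized, unrecognized
```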

Output Formats

| Format | Description |
| --- | --- |
| markdown | Clean Markdown (default, RAG-optimized) |
| html | HTML with structure |
| json | Lossless structured JSON |
| text | Plain text |

Output

{
  "source": "https://example.com/document.pdf",
  "outputFormat": "markdown",
  "outputUrl": "https://api.apify.com/v2/key-value-stores/.../records/converted-12345.md",
  "contentPreview": "# Document Title\n\n## Introduction\n\nThis document covers...",
  "metadata": {
    "title": "Annual Report 2024",
    "pageCount": 42
  },
  "pageCount": 42,
  "success": true
}
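
Because `contentPreview` holds only the start of the document, the full text has to be read from `outputUrl`. The sketch below shows one way to do that with the standard library. It assumes the record URL is readable without extra authentication and that a failed conversion is reported as `"success": false`, neither of which is confirmed on this page.

```python
from urllib.request import urlopen


def fetch_text(url: str) -> str:
    """Download a text record and decode it as UTF-8."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def load_markdown(item: dict, fetch=fetch_text) -> str:
    """Return the full converted text for one output item."""
    if not item.get("success"):
        # Assumption: failed conversions carry success=false.
        raise ValueError(f"conversion failed for {item.get('source')}")
    url = item.get("outputUrl")
    if url:
        return fetch(url)
    # Fall back to the preview when no record URL is present.
    return item.get("contentPreview", "")
```

The `fetch` parameter exists so the download step can be swapped out, for example for a client that adds an API token.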

With Section Chunking

{
  "source": "https://example.com/document.pdf",
  "sections": [
    { "title": "Introduction", "content": "..." },
    { "title": "Methodology", "content": "..." },
    { "title": "Results", "content": "..." }
  ],
  "sectionCount": 3,
  "success": true
}
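
Chunked output like this maps naturally onto vector-store records. The following sketch turns one output item into a list of records with a stable ID, the text to embed, and source metadata. The record layout (`id`, `text`, `metadata`) is an assumption for illustration, not a format required by any particular store.

```python
import hashlib


def sections_to_records(item: dict) -> list[dict]:
    """Convert a section-chunked output item into embeddable records."""
    records = []
    for index, section in enumerate(item.get("sections", [])):
        text = section.get("content", "").strip()
        if not text:
            continue  # skip headings that have no body text
        # Derive a stable ID from the source URL and the section position.
        key = f"{item['source']}#{index}"
        records.append(
            {
                "id": hashlib.sha1(key.encode("utf-8")).hexdigest()[:16],
                # Keep the title in the text so the embedding sees it.
                "text": f"{section['title']}\n\n{text}",
                "metadata": {
                    "source": item["source"],
                    "section": section["title"],
                    "position": index,
                },
            }
        )
    return records
```

Deriving the ID from the source and position means re-running the conversion overwrites existing records instead of duplicating them, provided the section order is unchanged.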

RAG Integration

LangChain

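# Newer LangChain releases ship splitters in the langchain-text-splitters package;
# older releases use: from langchain.text_splitter import MarkdownTextSplitter
from langchain_text_splitters import MarkdownTextSplitter

# `result` is one item of the Actor output (see the Output section)
markdown = result["contentPreview"]  # preview only; fetch outputUrl for the full document
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)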

LlamaIndex

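# llama-index 0.10+ exposes Document in llama_index.core (older: from llama_index import Document)
from llama_index.core import Document
# `markdown` and `result` are defined as in the LangChain snippet
doc = Document(text=markdown, metadata=result["metadata"])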

Cost Estimation

| Scale | Documents | Compute Units |
| --- | --- | --- |
| Small | 10 | ~0.05 |
| Medium | 50 | ~0.2 |
| Large | 200 | ~0.8 |
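
For batch sizes between the rows above, a rough figure can be interpolated from the table. The sketch below does this linearly. The zero-documents anchor and the extrapolation beyond 200 documents are assumptions, and real usage will vary with document size and OCR.

```python
# (documents, approximate compute units) pairs from the Cost Estimation table.
COST_TABLE = [(10, 0.05), (50, 0.2), (200, 0.8)]


def estimate_compute_units(documents: int) -> float:
    """Linearly interpolate an approximate compute-unit cost."""
    if documents < 0:
        raise ValueError("documents must be non-negative")
    # Assumption: zero documents cost zero compute units.
    points = [(0, 0.0)] + COST_TABLE
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if documents <= x1:
            return y0 + (y1 - y0) * (documents - x0) / (x1 - x0)
    # Beyond the last row, continue at the rate of the final segment.
    (x0, y0), (x1, y1) = points[-2], points[-1]
    return y1 + (y1 - y0) / (x1 - x0) * (documents - x1)
```

The table implies roughly 0.004 to 0.005 compute units per document, so the estimate is close to a flat per-document rate.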

Technical Details

  • Language: Python 3.12
  • Library: IBM Docling
  • Memory: 1 GB to 4 GB, depending on document size
  • Parser backend: DoclingParseV2 (advertised as 10x faster)

Limitations

  • OCR requires additional processing time
  • Very large documents may need more memory
  • Some complex layouts may lose formatting

Keywords: docling, rag, pdf, markdown, convert, document, llm, retrieval, langchain, llamaindex