RAG Document Converter
Pricing
$4.00/month + usage
Convert PDF, DOCX, PPTX, and other documents to clean Markdown optimized for RAG pipelines. Preserves structure, tables, and headers. Powered by IBM Docling.
Developer

Web Harvester
Maintained by Community
📄 Convert documents to clean Markdown optimized for RAG pipelines
What This Actor Does
- Multi-format support - PDF, DOCX, PPTX, XLSX, HTML, images
- Structure preservation - Keeps headers, tables, lists intact
- RAG-optimized - Clean Markdown for LLM ingestion
- Section chunking - Split by headers for vector stores
- Metadata extraction - Title, author, page count
Use Cases
| Use Case | Description |
|---|---|
| RAG Pipelines | Convert docs for retrieval-augmented generation |
| Knowledge Bases | Build searchable documentation |
| Content Migration | Convert legacy documents |
| LLM Context | Prepare documents for LLM analysis |
| Document Search | Index documents for semantic search |
Input Examples
Basic PDF to Markdown
{"fileUrls": ["https://example.com/document.pdf"],"outputFormat": "markdown"}
With Section Chunking
{"fileUrls": ["https://example.com/report.pdf"],"outputFormat": "markdown","chunkBySection": true}
Multiple Formats
{"fileUrls": ["https://example.com/doc.pdf","https://example.com/slides.pptx","https://example.com/data.xlsx"],"outputFormat": "markdown"}
With OCR
{"fileUrls": ["https://example.com/scanned.pdf"],"outputFormat": "markdown","enableOcr": true}
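The inputs above can be submitted from Python with the official `apify-client` package. The following is a minimal sketch, not the actor's documented workflow: the `actor_id` argument is a placeholder you must fill in from the actor's store page, and it assumes that each converted document is written as one item to the run's default dataset.

```python
import os


def run_converter(run_input, actor_id, token=None):
    """Run the actor and return its dataset items as a list of dicts.

    actor_id is a placeholder supplied by the caller; the real ID is
    shown on the actor's Apify Store page.
    """
    # Imported lazily so this module loads even without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token or os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    # Assumption: one result object per document lands in the default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Same payload as the "With OCR" example above.
ocr_input = {
    "fileUrls": ["https://example.com/scanned.pdf"],
    "outputFormat": "markdown",
    "enableOcr": True,
}
```

A call would look like `run_converter(ocr_input, actor_id="<username>/<actor-name>")`, with `APIFY_TOKEN` set in the environment.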
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| fileUrls | array | - | Document URLs (required) |
| outputFormat | string | "markdown" | Output format |
| enableOcr | boolean | false | Use OCR for scanned docs |
| preserveTables | boolean | true | Convert tables |
| extractImages | boolean | false | Extract embedded images |
| chunkBySection | boolean | false | Split by headers |
| includeMetadata | boolean | true | Include doc metadata |
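To catch typos before starting a run, the defaults in the Configuration table can be mirrored in a small client-side helper. This is only a sketch: `build_input` is a hypothetical name, the checks run locally, and the actor itself may validate its input differently.

```python
# Defaults copied from the Configuration table.
DEFAULTS = {
    "outputFormat": "markdown",
    "enableOcr": False,
    "preserveTables": True,
    "extractImages": False,
    "chunkBySection": False,
    "includeMetadata": True,
}
# Values listed under Output Formats.
OUTPUT_FORMATS = {"markdown", "html", "json", "text"}


def build_input(file_urls, **overrides):
    """Merge caller overrides into the documented defaults and validate them."""
    if not file_urls:
        raise ValueError("fileUrls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    if options["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported outputFormat: {options['outputFormat']!r}")
    for name, default in DEFAULTS.items():
        # Every boolean default must stay a boolean after overrides.
        if isinstance(default, bool) and not isinstance(options[name], bool):
            raise TypeError(f"{name} must be a boolean")
    return {"fileUrls": list(file_urls), **options}
```

For example, `build_input(["https://example.com/report.pdf"], chunkBySection=True)` returns a complete input object with every other parameter at its default.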
Supported Formats
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Excel | .xlsx |
| HTML | .html, .htm |
| Images | .png, .jpg, .jpeg, .tiff, .bmp |
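A quick pre-flight check can filter out URLs whose extension is not listed. The sketch below is an assumption about how a caller might do this, not part of the actor: it looks only at the URL path, so a link with no file extension is reported as unsupported even if the actor could still process it.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions from the Supported Formats table; .pdf is included because
# the input examples use it.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp",
}


def is_supported(url):
    """Return True if the URL path ends in a supported extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS


def split_supported(urls):
    """Split URLs into (supported, skipped) lists, preserving order."""
    supported = [u for u in urls if is_supported(u)]
    skipped = [u for u in urls if not is_supported(u)]
    return supported, skipped
```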
Output Formats
| Format | Description |
|---|---|
| markdown | Clean Markdown (default, RAG-optimized) |
| html | HTML with structure |
| json | Lossless structured JSON |
| text | Plain text |
Output
{
  "source": "https://example.com/document.pdf",
  "outputFormat": "markdown",
  "outputUrl": "https://api.apify.com/v2/key-value-stores/.../records/converted-12345.md",
  "contentPreview": "# Document Title\n\n## Introduction\n\nThis document covers...",
  "metadata": {
    "title": "Annual Report 2024",
    "pageCount": 42
  },
  "pageCount": 42,
  "success": true
}
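Because `contentPreview` appears to hold only an excerpt, downstream code will usually want the full file behind `outputUrl`. The helper below is a sketch under two assumptions that the source does not confirm: the record URL can be read with a plain GET, and the preview may be truncated. `load_markdown` and `_http_get` are hypothetical names.

```python
from urllib.request import urlopen


def _http_get(url):
    # Plain GET; assumes the record is readable without extra authentication.
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def load_markdown(result, fetch=_http_get):
    """Return the converted text for one result item."""
    if not result.get("success"):
        raise ValueError(f"conversion failed for {result.get('source')}")
    url = result.get("outputUrl")
    if url:
        return fetch(url)
    # Fall back to the preview, which may be truncated.
    return result.get("contentPreview", "")
```

The `fetch` parameter exists so the download step can be replaced, for example with an authenticated client or a stub in tests.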
With Section Chunking
{
  "source": "https://example.com/document.pdf",
  "sections": [
    { "title": "Introduction", "content": "..." },
    { "title": "Methodology", "content": "..." },
    { "title": "Results", "content": "..." }
  ],
  "sectionCount": 3,
  "success": true
}
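For a vector store, each section typically becomes one record with its own ID and metadata. The following sketch shows one possible mapping from the chunked output above; the record layout (`id`, `text`, `metadata`) and the function name are illustrative assumptions, not a format the actor produces.

```python
def sections_to_records(result):
    """Flatten a chunked result into one record per non-empty section."""
    source = result["source"]
    records = []
    for index, section in enumerate(result.get("sections", [])):
        content = section.get("content", "").strip()
        if not content:
            continue  # skip sections with no text to embed
        records.append({
            "id": f"{source}#section-{index}",
            # Re-attach the heading so the chunk keeps its context.
            "text": f"## {section['title']}\n\n{content}",
            "metadata": {
                "source": source,
                "section": section["title"],
                "index": index,
            },
        })
    return records
```

Keeping the section title in both the text and the metadata lets a retriever filter by section while the embedding still sees the heading.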
RAG Integration
LangChain
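# Newer LangChain releases ship splitters in the langchain-text-splitters package;
# older releases use `from langchain.text_splitter import MarkdownTextSplitter`.
from langchain_text_splitters import MarkdownTextSplitter

# `result` is one item from the actor output
markdown = result["contentPreview"]  # or fetch the full file from result["outputUrl"]
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)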
LlamaIndex
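# llama-index 0.10+; older releases use `from llama_index import Document`
from llama_index.core import Document

# `markdown` and `result` are the same values as in the LangChain example
doc = Document(text=markdown, metadata=result["metadata"])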
Cost Estimation
| Scale | Documents | Compute Units |
|---|---|---|
| Small | 10 | ~0.05 |
| Medium | 50 | ~0.2 |
| Large | 200 | ~0.8 |
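The table works out to roughly 0.004 to 0.005 compute units per document. For other batch sizes, a rough estimate can be interpolated from those three rows. This is a back-of-the-envelope sketch only: actual usage depends on document size, OCR, and memory settings, and the figure excludes the monthly fee.

```python
# (documents, approximate compute units) rows from the Cost Estimation table.
COST_TABLE = [(10, 0.05), (50, 0.2), (200, 0.8)]


def estimate_compute_units(documents):
    """Estimate compute units by interpolating between the table rows."""
    if documents <= 0:
        return 0.0
    first_docs, first_cu = COST_TABLE[0]
    if documents <= first_docs:
        # Below the smallest row: scale with its per-document rate.
        return documents * first_cu / first_docs
    for (d0, c0), (d1, c1) in zip(COST_TABLE, COST_TABLE[1:]):
        if documents <= d1:
            return c0 + (c1 - c0) * (documents - d0) / (d1 - d0)
    # Above the largest row: extrapolate with its per-document rate.
    last_docs, last_cu = COST_TABLE[-1]
    return documents * last_cu / last_docs
```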
Technical Details
- Language: Python 3.12
- Library: IBM Docling
- Memory: 1GB-4GB (depends on document size)
- Performance: uses the DoclingParseV2 backend, described as about 10x faster than the earlier Docling parser
Limitations
- OCR requires additional processing time
- Very large documents may need more memory
- Some complex layouts may lose formatting
Keywords: docling, rag, pdf, markdown, convert, document, llm, retrieval, langchain, llamaindex