PDF to Markdown RAG-Ready
Pricing
From $1.00 / 1,000 RAG-ready chunks
Premium PDF scraper that preserves tables and structure. Optimized for RAG.
Developer

Dmitry Goncharov
PDF to Markdown RAG-Ready Scraper
🚀 Convert complex PDF documents into clean, structured Markdown, optimized for RAG pipelines, LLM fine-tuning, and AI agents.
Why This Actor?
Extracting text from PDFs is easy, but extracting meaning is hard. This Actor is specifically tuned for the needs of modern AI:
| Feature | Standard PDF Parsers | This Actor |
|---|---|---|
| Table Preservation | ❌ Scrambled text | ✅ Structured Markdown tables |
| Hierarchical Headings | ❌ Flat text | ✅ Nested sections (H1-H6) |
| Semantic Chunking | ❌ Arbitrary splits | ✅ Context-aware RAG chunks |
| Metadata Extraction | ❌ Minimal | ✅ Author, Title, Creator, Dates |
| RAG-Ready Output | ❌ Full file only | ✅ Chunked JSON for Vector DBs |
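To make "context-aware" chunking concrete, here is a minimal sketch of one way to split Markdown so that chunks never cross a heading boundary and each chunk remembers its heading trail. It is an illustration only, not the Actor's actual implementation: the function name `chunk_markdown` and its splitting rules are assumptions. The output keys `chunk` and `headings` mirror the field names used in the Actor's output.

```python
import re

# Matches ATX headings such as "## 3.1 Quarterly Results"
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str, chunk_size: int = 1000) -> list[dict]:
    """Split Markdown at headings, then by size, keeping the heading trail.

    Hypothetical helper for illustration; a single line longer than
    chunk_size is kept whole rather than cut mid-line.
    """
    chunks = []
    path = []    # (level, title) pairs for the current heading trail
    buffer = []  # lines collected for the chunk in progress

    def flush():
        # Emit the buffer only if it holds more than bare headings
        has_body = any(l.strip() and not HEADING_RE.match(l) for l in buffer)
        if has_body:
            chunks.append({
                "chunk": "\n".join(buffer).strip(),
                "headings": [title for _, title in path],
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section before the trail changes
            level = len(match.group(1))
            # Drop headings at the same or a deeper level
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        elif buffer and sum(len(l) + 1 for l in buffer) + len(line) > chunk_size:
            flush()  # size limit reached; continue under the same headings
        buffer.append(line)

    flush()
    return chunks


sample = (
    "# 3. Financial Growth\n"
    "## 3.1 Quarterly Results\n"
    "Our revenue grew by 15%.\n"
    "## 3.2 Outlook\n"
    "We expect growth."
)
for c in chunk_markdown(sample):
    print(c["headings"], "->", repr(c["chunk"]))
```

Splitting at headings first and by size second keeps every chunk inside a single section, which is what lets the heading trail serve as retrieval context.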
🎯 RAG-Ready Output
Every PDF is broken down into semantically coherent chunks, ready to be indexed into Chroma, Pinecone, or Weaviate:
    {
      "url": "https://example.com/report.pdf",
      "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
      "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
      "docMetadata": {
        "title": "Annual Report 2024",
        "author": "Corporate Strategy Team",
        "pageCount": 42
      }
    }
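Before indexing, a record like the one above usually needs a stable ID, the text to embed, and flat metadata. The sketch below shows one possible shaping step using only the standard library. The helper `to_vector_record`, the ID scheme, and the choice to prepend the heading trail to the embedded text are assumptions for illustration, not part of the Actor or of any vector database client.

```python
import hashlib

# Example item, copied from the output format shown above
item = {
    "url": "https://example.com/report.pdf",
    "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
    "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
    "docMetadata": {
        "title": "Annual Report 2024",
        "author": "Corporate Strategy Team",
        "pageCount": 42,
    },
}


def to_vector_record(item: dict) -> dict:
    """Turn one dataset item into an {id, text, metadata} record (hypothetical shape)."""
    breadcrumb = " > ".join(item.get("headings", []))
    # Prepending the heading trail gives the embedding model section context
    text = f"{breadcrumb}\n\n{item['chunk']}" if breadcrumb else item["chunk"]
    # Deterministic ID: re-running the same PDF overwrites instead of duplicating
    digest = hashlib.sha256(f"{item['url']}|{item['chunk']}".encode("utf-8"))
    metadata = {
        "source": item["url"],
        "headings": breadcrumb,
        # docMetadata may be absent if metadata output is disabled (assumption)
        **item.get("docMetadata", {}),
    }
    return {"id": digest.hexdigest()[:16], "text": text, "metadata": metadata}


record = to_vector_record(item)
print(record["id"], record["metadata"]["title"])
```

The resulting `id`, `text`, and `metadata` fields map onto the ID, document, and metadata arguments that most vector stores accept on upsert.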
Key Features
- Structural Integrity: Preserves bold text, lists, and hierarchical structure.
- Premium OCR: Handles scanned PDFs and image-heavy documents (optional).
- Embedded Tables: Converts complex PDF tables into clean Markdown format.
- Smart Metadata: Automatically extracts document info for better context in RAG.
- Pay-Per-Event: No fixed monthly costs. You pay only for what you process.
🔗 LangChain Integration (Python)
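    # Requires: pip install langchain-community apify-client
    # Older LangChain releases exposed these classes as
    # langchain.document_loaders and langchain.docstore.document.
    from langchain_community.document_loaders import ApifyDatasetLoader
    from langchain_core.documents import Document

    loader = ApifyDatasetLoader(
        dataset_id="YOUR_DATASET_ID",
        # Map each dataset item (one chunk) to a LangChain Document
        dataset_mapping_function=lambda item: Document(
            page_content=item["chunk"],
            metadata={
                "source": item["url"],
                "headings": " > ".join(item["headings"]),
                **item["docMetadata"],
            },
        ),
    )
    docs = loader.load()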
Input Parameters
| Field | Type | Description |
|---|---|---|
| `urls` | Array | List of PDF URLs to process |
| `chunkSize` | Number | Maximum characters per semantic chunk (default: 1000) |
| `enableChunking` | Boolean | Whether to split document into RAG chunks |
| `includeMetadata` | Boolean | Include original PDF metadata in output |
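As a rough illustration of how these fields fit together, the sketch below builds a run input and starts the Actor with the `apify-client` Python package. Several things here are assumptions: the helper names, the placeholder `YOUR_ACTOR_ID`, that `urls` takes plain strings, and the `True` defaults for the two booleans (the table only documents the default for `chunkSize`).

```python
def build_run_input(urls, chunk_size=1000, enable_chunking=True, include_metadata=True):
    """Validate arguments and return an input dict using the documented field names."""
    urls = list(urls)
    if not urls:
        raise ValueError("urls must contain at least one PDF URL")
    bad = [u for u in urls if not u.lower().startswith(("http://", "https://"))]
    if bad:
        raise ValueError(f"not an http(s) URL: {bad[0]}")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of characters")
    return {
        "urls": urls,
        "chunkSize": chunk_size,
        "enableChunking": enable_chunking,    # default is an assumption
        "includeMetadata": include_metadata,  # default is an assumption
    }


def run_actor(token, actor_id, run_input):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so build_run_input works without the package installed
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["https://example.com/report.pdf"], chunk_size=800)
print(run_input)
# items = run_actor("YOUR_APIFY_TOKEN", "YOUR_ACTOR_ID", run_input)  # placeholders
```

Validating locally before the call avoids paying the Actor start fee for a run that would fail on malformed input.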
Pricing
Pay per Event:
- Actor Start: $0.01 per GB of memory
- RAG-Ready Chunk: $0.001 per extracted chunk
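The two events above make a run's cost easy to estimate in advance. The sketch below works through the arithmetic with `Decimal` to avoid floating-point rounding. It assumes the start fee is charged once per run and scales linearly with allocated memory, and it ignores any other platform charges. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

ACTOR_START_PER_GB = Decimal("0.01")  # USD per GB of memory, per Actor start
PRICE_PER_CHUNK = Decimal("0.001")    # USD per extracted RAG-ready chunk


def estimate_cost(num_chunks, memory_gb=1, runs=1):
    """Estimate the total in USD under the assumptions stated above."""
    start_fee = ACTOR_START_PER_GB * Decimal(str(memory_gb)) * runs
    chunk_fee = PRICE_PER_CHUNK * num_chunks
    return start_fee + chunk_fee


# One run with 1 GB of memory producing 1,000 chunks:
# $0.01 start fee + 1,000 x $0.001 = $1.01
print(f"${estimate_cost(1000, memory_gb=1):.2f}")
```

The chunk fee dominates the start fee for anything beyond a handful of chunks, so `chunkSize` is the main cost lever: larger chunks mean fewer billed events for the same document.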
Author
Built with ❤️ by HEDELKA for the AI Engineering community.
Questions? Open a GitHub issue or contact us on the Apify platform.