PDF to Markdown RAG-Ready Scraper

🚀 Convert complex PDF documents into clean, structured Markdown — perfectly optimized for RAG pipelines, LLM fine-tuning, and AI agents.

Why This Actor?

Extracting text from PDFs is easy, but extracting meaning is hard. This Actor is specifically tuned for the needs of modern AI:

| Feature | Standard PDF Parsers | This Actor |
|---|---|---|
| Table Preservation | ❌ Scrambled text | ✅ Structured Markdown tables |
| Hierarchical Headings | ❌ Flat text | ✅ Nested sections (H1-H6) |
| Semantic Chunking | ❌ Arbitrary splits | ✅ Context-aware RAG chunks |
| Metadata Extraction | ❌ Minimal | ✅ Author, Title, Creator, Dates |
| RAG-Ready Output | ❌ Full file only | ✅ Chunked JSON for Vector DBs |

🎯 RAG-Ready Output

Every PDF is broken down into semantically coherent chunks, ready to be indexed into Chroma, Pinecone, or Weaviate:

{
  "url": "https://example.com/report.pdf",
  "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
  "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
  "docMetadata": {
    "title": "Annual Report 2024",
    "author": "Corporate Strategy Team",
    "pageCount": 42
  }
}
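
For illustration, here is a minimal sketch of loading one such record into a local Chroma collection. The record shape follows the JSON above; the collection name and query are arbitrary, and Chroma's default embedding function is assumed.

# Minimal sketch (not part of the Actor): index chunk records into Chroma.
import chromadb

# One example record, shaped like the Actor's output above.
items = [
    {
        "url": "https://example.com/report.pdf",
        "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
        "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
        "docMetadata": {"title": "Annual Report 2024", "pageCount": 42},
    }
]

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.get_or_create_collection("pdf_chunks")

collection.add(
    ids=[f"{item['url']}#{i}" for i, item in enumerate(items)],
    documents=[item["chunk"] for item in items],
    # Chroma metadata values must be scalars, so the heading path is joined into one string.
    metadatas=[
        {
            "source": item["url"],
            "headings": " > ".join(item["headings"]),
            "title": item["docMetadata"]["title"],
        }
        for item in items
    ],
)

results = collection.query(query_texts=["How did revenue change last quarter?"], n_results=1)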

Key Features

  • Structural Integrity: Preserves bold text, lists, and hierarchical structure.
  • Premium OCR: Handles scanned PDFs and image-heavy documents (optional).
  • Embedded Tables: Converts complex PDF tables into clean Markdown format.
  • Smart Metadata: Automatically extracts document info for better context in RAG.
  • Pay-Per-Event: No fixed monthly costs. You pay only for what you process.

🔗 LangChain Integration (Python)

from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document
# On LangChain >= 0.2, import from langchain_community.document_loaders
# and langchain_core.documents instead.

# Map each dataset item to a Document: the chunk text becomes the page
# content; the source URL, heading path, and PDF metadata become metadata.
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["chunk"],
        metadata={
            "source": item["url"],
            "headings": " > ".join(item["headings"]),
            **item["docMetadata"],
        },
    ),
)
docs = loader.load()
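
From there the documents can go straight into any LangChain vector store. The follow-up sketch below assumes the langchain-openai and faiss-cpu packages and an OpenAI API key in the environment; any embedding model or store can be substituted.

# Follow-up sketch: build a retriever over the loaded documents.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
relevant_chunks = retriever.invoke("What drove revenue growth in Q3?")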

Input Parameters

| Field | Type | Description |
|---|---|---|
| urls | Array | List of PDF URLs to process |
| chunkSize | Number | Maximum characters per semantic chunk (default: 1000) |
| enableChunking | Boolean | Whether to split the document into RAG chunks |
| includeMetadata | Boolean | Include original PDF metadata in output |
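
You can also start a run programmatically with the official Apify API client. The sketch below is illustrative: the API token and Actor ID are placeholders, and the input mirrors the fields above (check the Actor's input schema for the exact shape of urls).

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

run = client.actor("username/pdf-to-markdown-rag-ready").call(  # placeholder Actor ID
    run_input={
        "urls": ["https://example.com/report.pdf"],  # see the input schema for the exact shape
        "chunkSize": 1000,
        "enableChunking": True,
        "includeMetadata": True,
    }
)

# Each dataset item is one RAG-ready chunk, shaped like the JSON example above.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["chunk"][:80])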

Pricing

Pay per Event:

  • Actor Start: $0.01 per GB of memory
  • RAG-Ready Chunk: $0.001 per extracted chunk
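
For example, a run on 1 GB of memory that produces 500 chunks would cost roughly $0.01 + 500 × $0.001 = $0.51; the actual chunk count depends on the document length and your chunkSize setting.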

Author

Built with ❤️ by HEDELKA for the AI Engineering community.

Questions? Open a GitHub issue or contact us on the Apify platform.