PDF to Markdown RAG-Ready Scraper

🚀 Convert complex PDF documents into clean, structured Markdown — perfectly optimized for RAG pipelines, LLM fine-tuning, and AI agents.

Why This Actor?

Extracting text from PDFs is easy, but extracting meaning is hard. This Actor is specifically tuned for the needs of modern AI:

| Feature | Standard PDF Parsers | This Actor |
|---|---|---|
| Table Preservation | ❌ Scrambled text | ✅ Structured Markdown tables |
| Hierarchical Headings | ❌ Flat text | ✅ Nested sections (H1-H6) |
| Semantic Chunking | ❌ Arbitrary splits | ✅ Context-aware RAG chunks |
| Metadata Extraction | ❌ Minimal | ✅ Author, Title, Creator, Dates |
| RAG-Ready Output | ❌ Full file only | ✅ Chunked JSON for Vector DBs |

🎯 RAG-Ready Output

Every PDF is broken down into semantically coherent chunks, ready to be indexed into Chroma, Pinecone, or Weaviate:

{
  "url": "https://example.com/report.pdf",
  "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
  "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
  "docMetadata": {
    "title": "Annual Report 2024",
    "author": "Corporate Strategy Team",
    "pageCount": 42
  }
}
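
For illustration, here is a minimal sketch of loading one such record into a local Chroma collection. The record shape follows the JSON above; the collection name and query are arbitrary, and Chroma's default embedding function is assumed.

# Minimal sketch (not part of the Actor): index chunk records into Chroma.
import chromadb

# One example record, shaped like the Actor's output above.
items = [
    {
        "url": "https://example.com/report.pdf",
        "chunk": "### 3.1 Quarterly Results\nOur revenue grew by 15%...",
        "headings": ["3. Financial Growth", "3.1 Quarterly Results"],
        "docMetadata": {"title": "Annual Report 2024", "pageCount": 42},
    }
]

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.get_or_create_collection("pdf_chunks")

collection.add(
    ids=[f"{item['url']}#{i}" for i, item in enumerate(items)],
    documents=[item["chunk"] for item in items],
    # Chroma metadata values must be scalars, so the heading path is joined into one string.
    metadatas=[
        {
            "source": item["url"],
            "headings": " > ".join(item["headings"]),
            "title": item["docMetadata"]["title"],
        }
        for item in items
    ],
)

results = collection.query(query_texts=["How did revenue change last quarter?"], n_results=1)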

Key Features

  • Structural Integrity: Preserves bold text, lists, and hierarchical structure.
  • Premium OCR: Handles scanned PDFs and image-heavy documents (optional).
  • Embedded Tables: Converts complex PDF tables into clean Markdown format.
  • Smart Metadata: Automatically extracts document info for better context in RAG.
  • Pay-Per-Event: No fixed monthly costs. You pay only for what you process.

🔗 LangChain Integration (Python)

from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document
# On LangChain >= 0.2, import from langchain_community.document_loaders
# and langchain_core.documents instead.

# Map each dataset item to a Document: the chunk text becomes the page
# content; the source URL, heading path, and PDF metadata become metadata.
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["chunk"],
        metadata={
            "source": item["url"],
            "headings": " > ".join(item["headings"]),
            **item["docMetadata"],
        },
    ),
)
docs = loader.load()
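
From there the documents can go straight into any LangChain vector store. The follow-up sketch below assumes the langchain-openai and faiss-cpu packages and an OpenAI API key in the environment; any embedding model or store can be substituted.

# Follow-up sketch: build a retriever over the loaded documents.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
relevant_chunks = retriever.invoke("What drove revenue growth in Q3?")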

Input Parameters

| Field | Type | Description |
|---|---|---|
| urls | Array | List of PDF URLs to process |
| chunkSize | Number | Maximum characters per semantic chunk (default: 1000) |
| enableChunking | Boolean | Whether to split the document into RAG chunks |
| includeMetadata | Boolean | Include original PDF metadata in output |
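
You can also start a run programmatically with the official Apify API client. The sketch below is illustrative: the API token and Actor ID are placeholders, and the input mirrors the fields above (check the Actor's input schema for the exact shape of urls).

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

run = client.actor("username/pdf-to-markdown-rag-ready").call(  # placeholder Actor ID
    run_input={
        "urls": ["https://example.com/report.pdf"],  # see the input schema for the exact shape
        "chunkSize": 1000,
        "enableChunking": True,
        "includeMetadata": True,
    }
)

# Each dataset item is one RAG-ready chunk, shaped like the JSON example above.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["chunk"][:80])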

Pricing

Pay per Event:

  • Actor Start: $0.01 per GB of memory
  • RAG-Ready Chunk: $0.001 per extracted chunk
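
For example, a run on 1 GB of memory that produces 500 chunks would cost roughly $0.01 + 500 × $0.001 = $0.51; the actual chunk count depends on the document length and your chunkSize setting.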

Author

Built with ❤️ by HEDELKA for the AI Engineering community.

Questions? Open a GitHub issue or contact us on the Apify platform.