Pricing

from $5.00 / 1,000 PDFs chunked

PDF → RAG Chunks (Token-Aware, Vector-Ready)

Downloads any PDF and chunks it into semantically coherent segments ready for embedding/RAG, with configurable chunk size and overlap. Returns one row per chunk with page number, character count, and token estimate. Feed the output directly into OpenAI text-embedding-3, Voyage, or Cohere. $0.005 per PDF + $0.0002 per chunk.

Developer

Hojun Lee

Last modified: 12 days ago

PDF → RAG Chunks

Download any PDF and chunk it into semantically coherent segments ready for embedding/RAG. Chunk size and overlap are configurable. No LLM cost (zero tokens). Vector-ready output. $0.005 per PDF + $0.0002 per chunk.

⚡ Run in 30 seconds

Click Start with default settings — downloads the sample PDF, splits it into overlapping text chunks at the default size, and returns each chunk with its page number, chunk index, and character count, ready to feed into any embedding pipeline. No LLM tokens consumed.
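To make "overlapping text chunks" concrete, here is a minimal sketch of a fixed-window character chunker. It is an illustration only: the actor's actual splitter is not shown on this page and may also respect sentence or paragraph boundaries. The function name `chunk_text` is hypothetical; the defaults mirror `chunkSizeChars` and `overlapChars`.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# Synthetic 4000-character "page" used only for demonstration.
page_text = "".join(str(i % 10) for i in range(4000))
chunks = chunk_text(page_text)
print([len(c) for c in chunks])  # [1500, 1500, 1400]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.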

Input Parameters

Parameter	Type	Default	Description
`urls`	array	`[]`	PDF URLs to download and chunk.
`url`	string	``	Used when 'urls' is empty.
`chunkSizeChars`	integer	`1500`	Target chars per chunk. ~1500 chars ≈ ~375 tokens (close to embedding-
`overlapChars`	integer	`200`	Char overlap between consecutive chunks (improves retrieval recall).
`maxPages`	integer	`200`	Stop after this many pages.
`skipEmpty`	boolean	`true`	Skip pages with no extractable text (e.g. scanned images).
`userAgent`	string	``	Custom UA.
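The "~1500 chars ≈ ~375 tokens" figure in the table implies about 4 characters per token, a common rule of thumb for English text. The sketch below applies that ratio; whether the actor computes `token_estimate` exactly this way is an assumption, and `estimate_tokens` is a hypothetical helper name.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token count from character length (assumes ~4 chars per token)."""
    return len(text) // chars_per_token


print(estimate_tokens("x" * 1500))  # 375
print(estimate_tokens("x" * 1480))  # 370
```

For exact counts, run the chunk text through the tokenizer of the embedding model you plan to use; the heuristic can be noticeably off for code, tables, or non-English text.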

Why this exists

To build a RAG (retrieval-augmented generation) system over a corpus of PDFs, you need:

1. Download → extract text per page
2. Chunk into semantic segments (1000-2000 chars typical)
3. Optional: embed each chunk and store it in a vector DB
4. Query: embed the question, retrieve the top-k chunks, ask the LLM

This actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's text-embedding-3-small, Voyage AI, Cohere Embed, or any embedding model.
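The query step that follows chunking and embedding can be sketched as a brute-force cosine-similarity search. This is a toy illustration with hand-made 3-dimensional vectors; the `embedding` field is hypothetical (the actor does not emit it, you would attach it yourself after embedding), and a real system would use a vector DB instead of a linear scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Return the k rows whose embedding is most similar to the query vector."""
    scored = sorted(rows, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return scored[:k]


# Toy chunk rows with made-up embeddings.
rows = [
    {"global_chunk_index": 0, "page": 1, "embedding": [1.0, 0.0, 0.0]},
    {"global_chunk_index": 1, "page": 1, "embedding": [0.0, 1.0, 0.0]},
    {"global_chunk_index": 2, "page": 2, "embedding": [0.9, 0.1, 0.0]},
]
hits = top_k([1.0, 0.0, 0.0], rows, k=2)
print([h["global_chunk_index"] for h in hits])  # [0, 2]
```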

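Other chunking services (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This actor costs roughly $0.50-$0.85 per 1K pages, depending on how many chunks each page produces (see the examples under Pricing).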

What you get

Summary row (one per PDF)

{
  "_type": "summary",
  "url": "https://www.sec.gov/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "chunk_size_chars": 1500,
  "overlap_chars": 200
}

Per-chunk row

{
  "_type": "chunk",
  "url": "https://...",
  "page": 12,
  "chunk_index": 0,
  "global_chunk_index": 17,
  "text": "Item 1A. Risk Factors\n\nOur business is...",
  "char_count": 1480,
  "token_estimate": 370
}
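Because summary and chunk rows share one dataset, a consumer usually separates them first. The sketch below groups chunk rows by `url` and restores document order. It assumes `global_chunk_index` counts chunks across the whole PDF while `chunk_index` restarts on each page, which is what the sample row suggests but is not stated explicitly. `group_chunks` is a hypothetical helper.

```python
from collections import defaultdict


def group_chunks(items: list[dict]) -> dict[str, list[dict]]:
    """Return {url: chunk rows sorted by global_chunk_index}; summary rows are ignored."""
    by_url: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        if item.get("_type") == "chunk":
            by_url[item["url"]].append(item)
    for rows in by_url.values():
        rows.sort(key=lambda r: r["global_chunk_index"])
    return dict(by_url)


# Made-up rows in the shape shown above, deliberately out of order.
items = [
    {"_type": "summary", "url": "https://example.com/a.pdf", "ok": True, "page_count": 2},
    {"_type": "chunk", "url": "https://example.com/a.pdf", "page": 2,
     "chunk_index": 0, "global_chunk_index": 1, "text": "second"},
    {"_type": "chunk", "url": "https://example.com/a.pdf", "page": 1,
     "chunk_index": 0, "global_chunk_index": 0, "text": "first"},
]
grouped = group_chunks(items)
print([r["text"] for r in grouped["https://example.com/a.pdf"]])  # ['first', 'second']
```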

Quick start

Single PDF

{
  "url": "https://www.example.com/report.pdf"
}
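The parameter table says `url` is used when `urls` is empty. A minimal sketch of that fallback, useful when building inputs programmatically, might look like this. `resolve_urls` is a hypothetical helper, not part of the actor, and the whitespace handling is an assumption.

```python
def resolve_urls(run_input: dict) -> list[str]:
    """Prefer the `urls` array; fall back to the single `url` when it is empty."""
    candidates = run_input.get("urls") or []
    urls = [u.strip() for u in candidates if u and u.strip()]
    if urls:
        return urls
    single = (run_input.get("url") or "").strip()
    return [single] if single else []


print(resolve_urls({"url": "https://www.example.com/report.pdf"}))
# ['https://www.example.com/report.pdf']
```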

Batch with custom chunk size

{
  "urls": [
    "https://...filing1.pdf",
    "https://...filing2.pdf"
  ],
  "chunkSizeChars": 2000,
  "overlapChars": 300,
  "maxPages": 100
}

Optimize for OpenAI text-embedding-3-small (8K-token max)

{
  "url": "https://...",
  "chunkSizeChars": 1500,
  "overlapChars": 200
}

Recommended chunk sizes

Embedding model	chunkSizeChars	Notes
OpenAI text-embedding-3-small	1500	~375 tokens, fits well
OpenAI text-embedding-3-large	2000	~500 tokens
Voyage voyage-3-large	1500	optimal balance
Cohere embed-v3	1800	works with 512-token chunks

Overlap of 100-300 chars boosts recall by ~5-10% with minimal storage cost.
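The storage cost of overlap can be estimated with a little arithmetic. The sketch below assumes a plain fixed-window splitter (each chunk advances by `chunk_size - overlap` characters); the actor's real chunk counts may differ, for example if it splits per page or at sentence boundaries. Both function names are hypothetical.

```python
import math


def chunk_count(n_chars: int, chunk_size: int = 1500, overlap: int = 200) -> int:
    """Number of fixed-size windows needed to cover n_chars characters."""
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((n_chars - chunk_size) / stride)


def overlap_overhead(chunk_size: int, overlap: int) -> float:
    """Approximate fraction of extra text stored because of overlap (long documents)."""
    return overlap / (chunk_size - overlap)


print(chunk_count(4000))                        # 3
print(round(overlap_overhead(1500, 200), 3))    # 0.154
```

Under this model, 100-300 chars of overlap on 1500-char chunks stores roughly 7-25% more text, and the chunk count (and therefore the per-chunk charge) grows by the same factor.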

Pricing

Pay-Per-Event:

$0.005 per PDF processed
$0.0002 per chunk emitted

Run	Chunks	Cost
One 80-page 10-K	~200	$0.045
Batch of 100 papers @ 20 pages	~6000	$1.70
Compliance archive 1000 PDFs	~80000	$21

For comparison, Unstructured.io starts at $30+/mo plus per-document fees, and LangChain Hosted at $500+/mo.
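The rows in the cost table follow directly from the two per-event prices. A small calculator, sketched here with `Decimal` to avoid floating-point rounding, reproduces them; `run_cost` is a hypothetical helper and the chunk counts are the approximate figures from the table.

```python
from decimal import Decimal

PER_PDF = Decimal("0.005")     # charged once per PDF processed
PER_CHUNK = Decimal("0.0002")  # charged once per chunk emitted


def run_cost(pdfs: int, chunks: int) -> Decimal:
    """Total pay-per-event cost in USD for a run."""
    return pdfs * PER_PDF + chunks * PER_CHUNK


examples = [
    ("One 80-page 10-K", 1, 200),
    ("Batch of 100 papers @ 20 pages", 100, 6000),
    ("Compliance archive 1000 PDFs", 1000, 80000),
]
for label, pdfs, chunks in examples:
    print(f"{label}: ${run_cost(pdfs, chunks):.3f}")
```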

Pipeline pattern: PDFs → vector DB

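import os

from apify_client import ApifyClient
from openai import OpenAI
from pinecone import Pinecone

# API keys come from environment variables (the variable names are this example's choice).
apify = ApifyClient(os.environ["APIFY_TOKEN"])
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# 1. Chunk PDFs
run = apify.actor("gochujang/pdf-rag-chunker").call(run_input={
    "urls": ["https://...filing.pdf"],
    "chunkSizeChars": 1500,
})

# 2. Embed each chunk (summary rows are filtered out)
items = apify.dataset(run["defaultDatasetId"]).iterate_items()
chunks = [c for c in items if c.get("_type") == "chunk"]
embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[c["text"] for c in chunks],
).data

# 3. Upsert to vector DB
# The "rag-docs" index is assumed to exist already with dimension 1536,
# the default output size of text-embedding-3-small.
index = pc.Index("rag-docs")
index.upsert(vectors=[
    {"id": f"{c['url']}-{c['global_chunk_index']}",
     "values": e.embedding,
     "metadata": {"url": c["url"], "page": c["page"]}}
    for c, e in zip(chunks, embeddings)
])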
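The pipeline above sends every chunk in a single embedding request and a single upsert. Embedding and vector-DB APIs generally cap request size, so larger runs usually need batching. The helper below is a generic sketch; the batch size of 100 is an arbitrary illustrative value, so check your providers' documented limits.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in chunk rows; in the pipeline these would come from the dataset.
chunks = [{"global_chunk_index": i} for i in range(250)]
sizes = [len(batch) for batch in batched(chunks, 100)]
print(sizes)  # [100, 100, 50]
```

In the pipeline, you would loop over `batched(chunks, size)` and run the embed and upsert steps once per batch.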

Limitations

Scanned PDFs (image-only) — Returns 0 chunks. Use OCR-equipped actor.
Multi-column research papers — Reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
No embedding included — Embedding requires your own OpenAI/Voyage/Cohere key (different vendor). We focus on chunking only to keep costs predictable.
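Since scanned PDFs return 0 chunks, it can help to flag them after a run so they can be routed to an OCR tool. The sketch below compares summary rows against chunk rows. It assumes a scanned PDF still produces a summary row with `ok` set to true, which this page does not confirm; `urls_without_chunks` is a hypothetical helper.

```python
from collections import Counter


def urls_without_chunks(items: list[dict]) -> list[str]:
    """URLs that have a successful summary row but no chunk rows."""
    chunk_counts = Counter(i["url"] for i in items if i.get("_type") == "chunk")
    return [
        i["url"]
        for i in items
        if i.get("_type") == "summary" and i.get("ok") and chunk_counts[i["url"]] == 0
    ]


# Made-up dataset rows: one text PDF and one image-only PDF.
items = [
    {"_type": "summary", "url": "https://example.com/text.pdf", "ok": True},
    {"_type": "chunk", "url": "https://example.com/text.pdf", "page": 1,
     "chunk_index": 0, "global_chunk_index": 0, "text": "hello"},
    {"_type": "summary", "url": "https://example.com/scan.pdf", "ok": True},
]
print(urls_without_chunks(items))  # ['https://example.com/scan.pdf']
```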

Related actors

PDF Text & Table Extractor — Same engine, returns full text instead of chunks
Web Page → Markdown Converter — HTML equivalent
Article Summarizer — For one-shot summaries
JSON Schema Generator

Feedback

A short review helps RAG engineers find it: Leave a review on Apify Store
