PDF → RAG Chunks (Token-Aware, Vector-Ready)
Pricing
Pay per usage
Download any PDF and chunk it into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. Returns one row per chunk with page, char count, and token estimate. Feed directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.
Developer
Hojun Lee
Maintained by Community
PDF → RAG Chunks
Download any PDF and chunk it into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. No LLM cost (zero tokens). Vector-ready output. $0.005 per PDF + $0.0002 per chunk.
Why this exists
To build a RAG (retrieval-augmented generation) system over a corpus of PDFs, you need to:
1. Download each PDF and extract text per page
2. Chunk the text into semantic segments (1000-2000 chars is typical)
3. Optionally embed each chunk and store it in a vector DB
4. At query time, embed the question, retrieve the top-k chunks, and ask the LLM
This actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's text-embedding-3-small, Voyage AI, Cohere Embed, or any embedding model.
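To make the chunk size and overlap parameters concrete, here is a minimal sketch of a character-based sliding window. It is an illustration only: the function name `chunk_text` is hypothetical, and the actor's own splitter (which aims for semantically coherent segments) is not documented here and may well work differently.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into windows of at most `chunk_size` chars.

    Each window after the first starts `overlap` chars before the
    previous window ended. Naive sketch: it ignores sentence and
    paragraph boundaries entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Example: 4000 chars of synthetic text, default size and overlap.
sample = "".join(chr(97 + i % 26) for i in range(4000))
pieces = chunk_text(sample)
```

A real splitter would typically snap window edges to sentence or paragraph breaks, so actual chunk lengths vary around the configured size.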
Other chunking SaaS (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This actor costs roughly $0.50 per 1K pages in chunk fees at about 2.5 chunks per page, plus $0.005 per PDF.
What you get
Summary row (one per PDF)
{"_type": "summary","url": "https://www.sec.gov/.../aapl-10k.pdf","ok": true,"page_count": 80,"title": "Apple Inc. — Annual Report 2024","author": "Apple Inc.","chunk_size_chars": 1500,"overlap_chars": 200}
Per-chunk row
{"_type": "chunk","url": "https://...","page": 12,"chunk_index": 0,"global_chunk_index": 17,"text": "Item 1A. Risk Factors\n\nOur business is...","char_count": 1480,"token_estimate": 370}
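Since summary and chunk rows arrive in the same dataset, a consumer usually separates them by `_type` first. The sketch below does that and also shows a rough token estimate. The 4-characters-per-token ratio is an assumption inferred from the sample row (1480 chars, 370 tokens), not a documented formula, and the helper names are hypothetical.

```python
def split_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate summary rows from chunk rows using the `_type` field."""
    summaries = [r for r in rows if r.get("_type") == "summary"]
    chunks = [r for r in rows if r.get("_type") == "chunk"]
    return summaries, chunks


def estimate_tokens(char_count: int, chars_per_token: int = 4) -> int:
    """Rough token estimate; the chars-per-token ratio is an assumption."""
    return char_count // chars_per_token


# Rows shaped like the samples above (values are illustrative).
rows = [
    {"_type": "summary", "url": "https://example.com/a.pdf", "ok": True, "page_count": 80},
    {"_type": "chunk", "url": "https://example.com/a.pdf", "page": 12,
     "chunk_index": 0, "char_count": 1480, "token_estimate": 370},
]
summaries, chunks = split_rows(rows)
```

For exact token counts, run the target model's own tokenizer over `text` instead of relying on a character ratio.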
Quick start
Single PDF
{"url": "https://www.example.com/report.pdf"}
Batch with custom chunk size
{"urls": ["https://...filing1.pdf","https://...filing2.pdf"],"chunkSizeChars": 2000,"overlapChars": 300,"maxPages": 100}
Optimize for OpenAI text-embedding-3-small (8K-token max)
{"url": "https://...","chunkSizeChars": 1500,"overlapChars": 200}
Recommended chunk sizes
| Embedding model | chunkSizeChars | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1500 | ~375 tokens, fits well |
| OpenAI text-embedding-3-large | 2000 | ~500 tokens |
| Voyage voyage-3-large | 1500 | optimal balance |
| Cohere embed-v3 | 1800 | works with 512-token chunks |
An overlap of 100-300 chars can boost recall by roughly 5-10% at minimal storage cost.
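To see what a given chunk size and overlap imply, the sketch below estimates the chunk count and the extra characters stored because of overlap. It assumes a plain sliding window over characters, which is a simplification: the actor's splitter may cut at different points, so treat the numbers as approximations. Both function names are hypothetical.

```python
import math


def expected_chunks(text_len: int, chunk_size: int = 1500, overlap: int = 200) -> int:
    """Number of windows a plain sliding window would produce."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    # One full window, then one more per (chunk_size - overlap) chars left.
    return 1 + math.ceil((text_len - chunk_size) / (chunk_size - overlap))


def stored_chars(text_len: int, chunk_size: int = 1500, overlap: int = 200) -> int:
    """Total chars across all chunks: every chunk after the first repeats `overlap` chars."""
    n = expected_chunks(text_len, chunk_size, overlap)
    return text_len + max(n - 1, 0) * overlap


n = expected_chunks(4000)        # 3 windows for 4000 chars
total = stored_chars(4000)       # 4400 chars stored, 400 of them repeated
```

With the defaults, the repeated text is 200 chars per additional chunk, which is why storage grows only modestly as overlap increases.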
Pricing
Pay-Per-Event:
- $0.005 per PDF processed
- $0.0002 per chunk emitted
| Run | Chunks | Cost |
|---|---|---|
| One 80-page 10-K | ~200 | $0.045 |
| Batch of 100 papers @ 20 pages | ~6000 | $1.70 |
| Compliance archive 1000 PDFs | ~80000 | $21 |
For comparison: Unstructured.io ($30+/mo plus per-document fees) or LangChain Hosted ($500+/mo).
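The table's figures follow directly from the two per-event prices. A small estimator, sketched here with a hypothetical `estimate_cost` helper, reproduces them; `Decimal` is used so the results compare exactly instead of picking up floating-point noise.

```python
from decimal import Decimal

PRICE_PER_PDF = Decimal("0.005")     # per PDF processed
PRICE_PER_CHUNK = Decimal("0.0002")  # per chunk emitted


def estimate_cost(pdf_count: int, chunk_count: int) -> Decimal:
    """Estimated run cost in USD for the given number of PDFs and chunks."""
    return PRICE_PER_PDF * pdf_count + PRICE_PER_CHUNK * chunk_count


one_10k = estimate_cost(1, 200)          # one 80-page 10-K, ~200 chunks
paper_batch = estimate_cost(100, 6000)   # 100 papers, ~6000 chunks
archive = estimate_cost(1000, 80000)     # 1000 PDFs, ~80000 chunks
```

The chunk counts are the approximate values from the table; actual counts depend on the text density of each PDF and on the chosen chunk size and overlap.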
Pipeline pattern: PDFs → vector DB
import os

from apify_client import ApifyClient
from openai import OpenAI
from pinecone import Pinecone

# Credentials come from environment variables (variable names are assumptions).
# 1. Chunk PDFs
client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("gochujang/pdf-rag-chunker").call(run_input={
    "urls": ["https://...filing.pdf"],
    "chunkSizeChars": 1500,
})

# 2. Embed each chunk (skip the per-PDF summary rows)
chunks = list(client.dataset(run["defaultDatasetId"]).iterate_items())
chunks = [c for c in chunks if c.get("_type") == "chunk"]
embeddings = OpenAI().embeddings.create(  # reads OPENAI_API_KEY from the environment
    model="text-embedding-3-small",
    input=[c["text"] for c in chunks],
).data

# 3. Upsert to vector DB
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-docs")
index.upsert(vectors=[
    {
        "id": f"{c['url']}-{c['global_chunk_index']}",
        "values": embeddings[i].embedding,
        "metadata": {"url": c["url"], "page": c["page"]},
    }
    for i, c in enumerate(chunks)
])
Limitations
- Scanned PDFs (image-only): these return 0 chunks. Use an OCR-equipped actor instead.
- Multi-column research papers: reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
- No embedding included: embedding requires your own OpenAI/Voyage/Cohere key (a separate vendor). This actor focuses on chunking only to keep costs predictable.
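Because scanned PDFs come back with no chunks, it can help to flag them after a run so they can be routed to OCR. The sketch below assumes that summary and chunk rows for the same PDF carry the same `url` value, as the sample rows suggest; the helper name `urls_without_chunks` is hypothetical.

```python
from collections import Counter


def urls_without_chunks(rows: list[dict]) -> list[str]:
    """Return URLs that have a summary row but no chunk rows."""
    chunk_counts = Counter(r["url"] for r in rows if r.get("_type") == "chunk")
    return [
        r["url"]
        for r in rows
        if r.get("_type") == "summary" and chunk_counts[r["url"]] == 0
    ]


# Illustrative rows: one text PDF with a chunk, one scanned PDF without.
rows = [
    {"_type": "summary", "url": "https://example.com/text.pdf", "ok": True},
    {"_type": "chunk", "url": "https://example.com/text.pdf", "page": 1, "text": "..."},
    {"_type": "summary", "url": "https://example.com/scanned.pdf", "ok": True},
]
needs_ocr = urls_without_chunks(rows)
```

A PDF that failed to download would also have no chunks, so checking the summary row's `ok` field alongside this list helps tell the two cases apart.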
Related actors (same author)
- PDF Text & Table Extractor — Same engine, returns full text instead of chunks
- Web Page → Markdown Converter — HTML equivalent
- Article Summarizer — For one-shot summaries
- JSON Schema Generator
Feedback
A short review helps RAG engineers find it: Leave a review on Apify Store