PDF to RAG Markdown Chunks for Embeddings

Pricing: from $3.00 per 1,000 pages parsed

Convert PDFs into token-bounded Markdown chunks for RAG, embeddings, and vector databases (Pinecone, Chroma, Weaviate, Qdrant). Set maxTokens + overlap; get clean chunks with page number, token count, and SHA-256 content hash for dedup. JSON dataset ready for any LLM pipeline.

Developer: Adam (maintained by Community)

DocForge: Documents to AI-Ready Markdown

Turn PDF files you own into deterministic, token-bounded text chunks that are ready for RAG pipelines and embeddings.

What it does

DocForge takes a list of PDF URLs that you own or are authorized to process, downloads each file, extracts its text, and splits that text into deterministic, token-bounded chunks. Each chunk is emitted as a structured dataset record carrying its source document, chunk index, an estimated token count, and a content hash. A final run summary reports how many pages were parsed and how many chunks were emitted.

The chunker is overlap-aware: you control the target chunk size and the overlap between consecutive chunks, and no chunk exceeds your configured maxTokens. Each chunk's text is emitted in the markdown field as plain extracted text (no layout reconstruction or rich Markdown formatting is applied), so it drops straight into a vector store or embedding job.

Before any work begins, DocForge requires an explicit ownership attestation. If that attestation is not set, the run is rejected with zero billing. Documents that fail to download or parse are caught, logged, and skipped rather than guessed at, so the dataset only contains content that was actually extracted.

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| pdfUrls | array of strings | Yes | URLs of PDFs you own or are authorized to process. |
| chunking | object | No | Chunking options. Prefilled with maxTokens: 512 and overlapTokens: 64. |
| ownership_attestation | boolean | Yes | You confirm you own or are authorized to process these documents. Must be true or the run is rejected before any billing. |
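As an illustration of the input shape, the sketch below assembles a run input with the three fields from the table. The helper name `build_run_input`, its client-side checks, and the example URL are hypothetical and are not part of the Actor; only the field names and defaults come from the table.

```python
import json


def build_run_input(pdf_urls, max_tokens=512, overlap_tokens=64, attested=False):
    """Assemble a DocForge-style input dict (hypothetical helper).

    The checks here are local conveniences; the Actor applies its own gates.
    """
    if not attested:
        # The Actor rejects runs without the attestation, so fail early.
        raise ValueError("ownership_attestation must be true")
    if not pdf_urls:
        raise ValueError("pdfUrls must contain at least one URL")
    return {
        "pdfUrls": list(pdf_urls),
        "chunking": {"maxTokens": max_tokens, "overlapTokens": overlap_tokens},
        "ownership_attestation": True,
    }


if __name__ == "__main__":
    # Placeholder URL, for illustration only.
    run_input = build_run_input(["https://example.com/handbook.pdf"], attested=True)
    print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be pasted into the Actor's JSON input editor or passed to whichever client you use to start the run.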

The chunking object accepts:

  • maxTokens (default 512) — the maximum estimated token size of each chunk; no chunk exceeds this.
  • overlapTokens (default 64) — how much each chunk overlaps the previous one, to preserve context across chunk boundaries.

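Token counts are word-based estimates (approximately words × 1.3), not exact tokenizer counts. A chunk that fits maxTokens by this estimate can therefore still run over when counted by a specific model's tokenizer, so leave headroom if your embedding model has a hard limit.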
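To make the interaction between maxTokens, overlapTokens, and the word-based estimate concrete, here is a minimal sketch of an overlap-aware chunker. It is not DocForge's actual code: splitting on whitespace, rounding the estimate up, and converting the overlap to a whole number of words are all assumptions made for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Word-based estimate: words * 1.3, rounded up (rounding is an assumption)."""
    words = len(text.split())
    return -(-words * 13 // 10)  # integer ceil(words * 1.3)


def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list:
    """Split text into chunks whose estimated token count never exceeds max_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    # Largest word count whose estimate still fits within max_tokens.
    max_words = max_tokens * 10 // 13
    if max_words < 1:
        raise ValueError("max_tokens is too small to hold a single word")
    # Overlap expressed in words, kept below the window so the chunker advances.
    overlap_words = min(overlap_tokens * 10 // 13, max_words - 1)
    step = max_words - overlap_words

    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1000))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, estimate_tokens(chunk))
```

Because the window size is derived from the same estimate used to report token_count, the bound holds by construction in this sketch; the same input and settings always yield the same chunks.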

Output

DocForge writes two record types to the dataset, distinguished by record_type.

chunk — one record per emitted text chunk:

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | Always chunk. |
| source_doc | string | The source PDF URL the chunk came from. |
| page_number | integer | Present for schema compatibility; currently emitted as 1 for every chunk (DocForge does not map chunks back to their originating page). |
| chunk_index | integer | Zero-based index of the chunk within its document. |
| markdown | string | The chunk's text (plain extracted text). |
| token_count | integer | Estimated token count for the chunk. |
| content_hash | string | Deterministic sha256:<64 hex> hash of the chunk text. |
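The content_hash field lends itself to deduplication before embedding. The sketch below recomputes a hash in the documented sha256:<64 hex> format and drops repeated chunks. It assumes the hash covers the UTF-8 bytes of the markdown field with no normalization, which this page does not state; the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash in the documented 'sha256:<64 hex>' format.

    Assumption: the digest is taken over the UTF-8 bytes of the chunk text as-is.
    """
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_chunks(records):
    """Keep the first chunk record seen for each content_hash, in input order."""
    seen = set()
    unique = []
    for record in records:
        if record.get("record_type") != "chunk":
            continue  # skip run_summary and any other record types
        digest = record["content_hash"]
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    records = [
        {"record_type": "chunk", "markdown": "alpha", "content_hash": content_hash("alpha")},
        {"record_type": "chunk", "markdown": "alpha", "content_hash": content_hash("alpha")},
        {"record_type": "run_summary", "pages_parsed": 1, "chunks_emitted": 2},
    ]
    print(len(dedupe_chunks(records)))
```

If a recomputed hash does not match the emitted one, the encoding assumption above is the first thing to check.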

run_summary — one record per run:

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | Always run_summary. |
| pages_parsed | integer | Total document pages parsed in the run. |
| chunks_emitted | integer | Total chunks emitted in the run. |
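Since both record types share one dataset, a consumer typically separates them by record_type before use. The following sketch does that and cross-checks chunks_emitted against the number of chunk records. The helper names are hypothetical, and the check assumes every chunk from the run is present in the items you pass in.

```python
def split_records(items):
    """Return (chunk_records, run_summary_or_None) from mixed dataset items."""
    chunks = [item for item in items if item.get("record_type") == "chunk"]
    summaries = [item for item in items if item.get("record_type") == "run_summary"]
    # The output description says there is one run_summary per run.
    summary = summaries[0] if summaries else None
    return chunks, summary


def summary_matches(items) -> bool:
    """True when a summary exists and its chunks_emitted equals the chunk count."""
    chunks, summary = split_records(items)
    return summary is not None and summary["chunks_emitted"] == len(chunks)


if __name__ == "__main__":
    items = [
        {"record_type": "chunk", "chunk_index": 0, "markdown": "first"},
        {"record_type": "chunk", "chunk_index": 1, "markdown": "second"},
        {"record_type": "run_summary", "pages_parsed": 3, "chunks_emitted": 2},
    ]
    print(summary_matches(items))
```

A mismatch usually means the dataset was read only partially, for example with a page limit on the export.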

Pricing

DocForge uses Apify Pay-Per-Event pricing. You are billed only for what a successful, gated run actually does:

| Event | Price (USD) | When it fires |
| --- | --- | --- |
| actor_run_start | $0.02 | Once per run, after the run's gates pass. |
| page_parsed | $0.003 | Per document page converted to text. |
| chunk_emitted | $0.0005 | Per RAG chunk emitted. |

Example run cost. Processing a single 40-page PDF that yields 120 chunks:

  • 1 × actor_run_start = $0.02
  • 40 × page_parsed = $0.12
  • 120 × chunk_emitted = $0.06
  • Total = $0.20
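The same arithmetic can be wrapped in a small estimator for budgeting other runs. This is a sketch using the event prices listed in the pricing table; the function name is hypothetical, and prices may change, so check the table before relying on it.

```python
from decimal import Decimal

# Event prices in USD, copied from the pricing table.
PRICES = {
    "actor_run_start": Decimal("0.02"),
    "page_parsed": Decimal("0.003"),
    "chunk_emitted": Decimal("0.0005"),
}


def estimate_run_cost(pages: int, chunks: int, runs: int = 1) -> Decimal:
    """Estimated cost in USD for the given number of runs, pages, and chunks."""
    if min(pages, chunks, runs) < 0:
        raise ValueError("counts must be non-negative")
    return (
        runs * PRICES["actor_run_start"]
        + pages * PRICES["page_parsed"]
        + chunks * PRICES["chunk_emitted"]
    )


if __name__ == "__main__":
    # The worked example: one run, a 40-page PDF, 120 chunks.
    print(f"${estimate_run_cost(pages=40, chunks=120):.2f}")
```

Decimal is used instead of float so that sub-cent event prices add up without rounding drift.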

If the ownership attestation is missing, the run is rejected with zero billing.

Why this Actor

  • Deterministic, idempotent output. Every chunk carries a sha256: content hash computed directly from its text, so identical input produces identical hashes — ideal for deduplication, change detection, and re-run safety.
  • Ownership-gated by design. A required attestation must be true before any processing or billing happens. DocForge runs on PDFs you provide and are authorized to use; it does not crawl or scrape third-party sites.
  • No invented content. Text is extracted deterministically with no LLM in the loop. Documents that fail to fetch or parse are caught, logged, and skipped — they are not hallucinated or padded. The run summary reflects only what was genuinely parsed and emitted.
  • Embeddings-ready chunking. Token-bounded chunks with configurable overlap mean no chunk exceeds your maxTokens, and context is preserved across boundaries — output that's ready to embed without further reshaping.

About this Actor

This Actor is AI-authored and operated under the publisher's LLC. It uses Actor.charge() strictly to bill the customer for the Pay-Per-Event units above; the Actor contains no payout or money-out capability. All claims here reflect behavior present in the Actor's code.