Website & Docs to Markdown + RAG Chunks

Pricing

from $3.00 / 1,000 pages processed

Turn websites & docs into clean Markdown plus token-bounded, embeddings-ready RAG chunks (heading lineage + sha256) ready for Pinecone, Weaviate, Qdrant or pgvector. Field extraction and AEO mode (FAQ, answer-first, llms.txt, citations) are reserved in the schema but not produced by the current run path. Robots honored; ownership required.

Developer

Adam

Maintained by Community

RAG-Ready Content Structurer

Turn pages you own into deterministic, embeddings-ready RAG chunks — clean, hashed, and token-bounded.

What it does

This Actor ingests a list of URLs that you own or are authorized to crawl, fetches each page, strips boilerplate, and converts it into clean Markdown. It then produces RAG chunks: semantic (heading-aware) or fixed-size chunks with full heading lineage, a deterministic token estimate, and a sha256 content hash per chunk — ready to upsert into a vector store.

The pipeline is built around a fail-closed gate chain:

  1. Gate A — Ownership attestation. You must attest you own/are authorized for every URL. If not true, the run is rejected before any fetch, with zero events billed.
  2. Gate B — Paid-plan only. Default-deny until the Apify paid-plan run flag is confirmed; free/unknown plans run nothing and are not billed.
  3. Gate C — robots.txt. Always honored and cannot be disabled. Disallowed URLs emit 0 chunks, are charged 0, and are listed in robots_skipped_urls.
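The ordering of the three gates can be sketched roughly as follows. This is a minimal illustration, not the Actor's implementation: `run_gates`, `GateResult`, and the rejection reason strings are hypothetical names, and the sketch assumes a single robots.txt body covers every URL (a real run would fetch one per host).

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class GateResult:
    allowed_urls: List[str] = field(default_factory=list)
    robots_skipped_urls: List[str] = field(default_factory=list)
    rejected_reason: Optional[str] = None  # set when Gate A or Gate B stops the run


def run_gates(urls, ownership_attestation, paid_plan_confirmed,
              robots_txt_lines, user_agent="*"):
    """Evaluate the gates in order; a failed gate stops before any fetch."""
    # Gate A: the attestation must be exactly True, anything else rejects.
    if ownership_attestation is not True:
        return GateResult(rejected_reason="ownership_not_attested")
    # Gate B: default-deny unless the paid plan is positively confirmed.
    if paid_plan_confirmed is not True:
        return GateResult(rejected_reason="paid_plan_not_confirmed")
    # Gate C: robots.txt is always applied, URL by URL.
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    result = GateResult()
    for url in urls:
        if parser.can_fetch(user_agent, url):
            result.allowed_urls.append(url)
        else:
            result.robots_skipped_urls.append(url)
    return result
```

Checking `is not True` rather than truthiness keeps the chain fail-closed: a missing or unknown flag (`None`) is treated the same as an explicit `False`.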

Cleaning uses Mozilla Readability + Turndown for boilerplate-stripped Markdown. Chunking is deterministic: identical input yields byte-stable output and the same sha256 content hash, and no chunk exceeds maxTokens.

Note: the token count per chunk is a deterministic word-based estimate used to bound chunk size, not a tiktoken count.
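To make the determinism claim concrete, here is a small sketch of fixed-size chunking with a word-based token estimate and a sha256 content hash. It rests on assumptions: the Actor's actual estimator is not documented beyond being word-based, so one token per whitespace-separated word is a guess, and the function names are illustrative. Joining words with single spaces also flattens Markdown line breaks, which a real chunker would preserve.

```python
import hashlib


def estimate_tokens(text: str) -> int:
    """Word-based estimate: one token per whitespace-separated word (assumed ratio)."""
    return len(text.split())


def content_hash(text: str) -> str:
    """sha256 over the UTF-8 bytes, formatted like the `sha256:...` field."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def fixed_chunks(text: str, max_tokens: int = 512, overlap_tokens: int = 64):
    """Split into windows of at most max_tokens words, stepping by max - overlap."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        body = " ".join(words[start:start + max_tokens])
        chunks.append({
            "chunk_index": index,
            "chunk_text": body,
            "token_count": estimate_tokens(body),
            "content_hash": content_hash(body),
        })
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Because the function depends only on its arguments, calling it twice on the same text returns equal chunk lists with equal hashes, which is the property that makes re-runs safe.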

Input

Defined by INPUT_SCHEMA.json. Key fields:

| Field | Type | Notes |
|---|---|---|
| source | object | The URLs to ingest (URL list) that you own/are authorized to crawl. Includes maxPages (default 1000, max 50000). Required. |
| ownership_attestation | boolean | You attest you own or are authorized for every URL. Must be true or the run is rejected (zero billing). Required. |
| render | enum | http (default, cheapest) or browser (Playwright/Chromium for JS-heavy pages). |
| chunking | object | strategy (semantic or fixed), maxTokens (128–2048, default 512), overlapTokens (default 64). No chunk exceeds maxTokens. |
| language | string (nullable) | Optional ISO language-code hint (e.g. en). |
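As an illustration of the constraints above, the following sketch checks an input object before a run is submitted. Several details are assumptions rather than documented behavior: the `urls` key inside `source`, the lower bound of 1 for `maxPages`, and the rule that `overlapTokens` must be smaller than `maxTokens`. INPUT_SCHEMA.json remains the authoritative definition.

```python
def validate_input(actor_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    if actor_input.get("ownership_attestation") is not True:
        problems.append("ownership_attestation must be true")

    source = actor_input.get("source")
    if not isinstance(source, dict):
        problems.append("source is required")
    else:
        max_pages = source.get("maxPages", 1000)
        if not 1 <= max_pages <= 50000:  # lower bound of 1 is assumed
            problems.append("source.maxPages must be between 1 and 50000")

    if actor_input.get("render", "http") not in ("http", "browser"):
        problems.append("render must be 'http' or 'browser'")

    chunking = actor_input.get("chunking", {})
    strategy = chunking.get("strategy")
    if strategy is not None and strategy not in ("semantic", "fixed"):
        problems.append("chunking.strategy must be 'semantic' or 'fixed'")
    max_tokens = chunking.get("maxTokens", 512)
    if not 128 <= max_tokens <= 2048:
        problems.append("chunking.maxTokens must be between 128 and 2048")
    overlap = chunking.get("overlapTokens", 64)
    if not 0 <= overlap < max_tokens:  # assumed constraint
        problems.append("chunking.overlapTokens must be >= 0 and below maxTokens")
    return problems


example_input = {
    "source": {"urls": ["https://docs.example.com/guide"], "maxPages": 100},  # "urls" is an assumed key
    "ownership_attestation": True,
    "render": "http",
    "chunking": {"strategy": "semantic", "maxTokens": 512, "overlapTokens": 64},
    "language": "en",
}
```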

Output

Defined by dataset_schema.json. Every record carries a record_type. The run path emits:

  • chunk (RAG): source_url, page_title, section_path (heading lineage, e.g. ["Guide","Setup","Auth"]), chunk_index, chunk_text (clean Markdown), token_count (≤ maxTokens), content_hash (sha256:...), language, extracted_fields, retrieved_at, render_mode.
  • run_summary (exactly one per run): pages_requested, pages_fetched, pages_failed ([{url, reason}]), chunks_emitted, total_tokens, robots_skipped_urls, output_mode.

The dataset schema also defines AEO record types (faq_pair, answer_block, llms_txt, citation_block) and a structured-extraction extracted_fields shape, with prebuilt dataset views (RAG chunks, AEO assets, Run summary). These AEO/extraction outputs are reserved in the schema but are not produced by the current run path — the Actor currently emits RAG chunk records plus one run_summary.

The RAG chunks view is ready to upsert into Pinecone / Weaviate / Qdrant / pgvector.
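One way to prepare chunk records for a vector store is sketched below. It maps each record to a store-neutral `{id, text, metadata}` item and leaves embedding and the actual upsert call to whichever client you use. Using `content_hash` as the item ID is a design choice made for this example, not something the Actor prescribes; it means identical text appearing on two pages collapses into one item.

```python
def to_upsert_item(chunk: dict) -> dict:
    """Map one chunk record to a generic {id, text, metadata} item."""
    if chunk.get("record_type") != "chunk":
        raise ValueError("expected a record with record_type == 'chunk'")
    return {
        "id": chunk["content_hash"],
        "text": chunk["chunk_text"],
        "metadata": {
            "source_url": chunk["source_url"],
            "page_title": chunk["page_title"],
            # Flatten the heading lineage, e.g. "Guide > Setup > Auth".
            "section_path": " > ".join(chunk["section_path"]),
            "chunk_index": chunk["chunk_index"],
            "token_count": chunk["token_count"],
            "language": chunk.get("language"),
            "retrieved_at": chunk["retrieved_at"],
        },
    }


def build_upserts(records):
    """Keep chunk records only and drop repeated content hashes."""
    seen = set()
    items = []
    for record in records:
        if record.get("record_type") != "chunk":
            continue  # skips run_summary and any other record types
        if record["content_hash"] in seen:
            continue
        seen.add(record["content_hash"])
        items.append(to_upsert_item(record))
    return items
```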

Pricing

Pay-Per-Event. You are billed only for what actually runs (after the gates), via Actor.charge():

| Event | Price | When charged |
|---|---|---|
| actor_run_start | $0.05 | Once per run, only after the ownership + paid-plan + pilot gates pass. Never on a rejected or free-plan run. |
| page_processed | $0.003 | Per page successfully fetched + converted + chunked. Failed pages and robots-disallowed URLs charge $0. |
| field_extracted | $0.005 | Per (page × requested field) pair returning a non-null value. Reserved for structured extraction, which is not active in the current run path, so this event is not charged today. |

The developer keeps 80% and Apify keeps 20% (standard Apify 80/20 split).

Example run cost — 100 owned pages, RAG mode: $0.05 + (100 × $0.003) = $0.35.
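The same arithmetic can be written as a small estimator for other run sizes. This is a sketch using the two prices listed above; the function and constant names are illustrative, and `field_extracted` is left out because that event is not charged today.

```python
from decimal import Decimal

ACTOR_RUN_START = Decimal("0.05")   # charged once per run that passes the gates
PAGE_PROCESSED = Decimal("0.003")   # charged per successfully processed page


def estimate_run_cost(pages_processed: int, gates_passed: bool = True) -> Decimal:
    """Estimated charge in USD; rejected or free-plan runs cost nothing."""
    if not gates_passed:
        return Decimal("0")
    return ACTOR_RUN_START + PAGE_PROCESSED * pages_processed
```

`Decimal` avoids binary floating-point rounding in the totals. Only pages that were actually fetched, converted, and chunked should be counted in `pages_processed`, since failed and robots-disallowed pages charge $0.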

Why this Actor

  • Deterministic & idempotent. Cleaning and chunking are pure and byte-stable: identical input produces identical chunks and identical sha256 content hashes — safe re-runs, safe vector-store de-duplication.
  • Ownership-gated and robots-compliant by design. A mandatory ownership/authorization attestation rejects unauthorized runs before any fetch (zero billing), and robots.txt is forced on and cannot be disabled.
  • Paid-plan only, fail-closed billing. Default-deny until a paid plan is confirmed; failed and robots-disallowed pages charge $0.
  • Embeddings-ready output with token bounds. Chunks carry heading lineage and a token estimate that never exceeds your maxTokens, in a schema with a prebuilt view for Pinecone / Weaviate / Qdrant / pgvector.
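Stable hashes also make incremental syncs straightforward. As a sketch of the idea (the `diff_runs` helper is hypothetical and not part of the Actor), two runs over the same site can be compared by `content_hash` to decide which chunks need embedding and which stale ones can be removed:

```python
def diff_runs(previous_chunks, current_chunks):
    """Compare two runs by content hash and classify each hash."""
    prev = {c["content_hash"] for c in previous_chunks}
    curr = {c["content_hash"] for c in current_chunks}
    return {
        "to_upsert": sorted(curr - prev),   # new or changed text
        "to_delete": sorted(prev - curr),   # text that no longer exists
        "unchanged": sorted(prev & curr),   # safe to skip re-embedding
    }
```

Only the hashes in `to_upsert` need new embeddings, so an unchanged site yields an empty work list on re-run.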

About

This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is the only billing path and it bills the customer only — the Actor has no payout or money-out capability; revenue settlement is handled entirely by Apify's monetization rail.