Website & Docs to Markdown + RAG Chunks
Pricing
from $3.00 / 1,000 pages processed
Turn websites & docs into clean Markdown plus token-bounded, embeddings-ready RAG chunks (heading lineage + sha256) ready for Pinecone, Weaviate, Qdrant or pgvector. Optional no-hallucination field extraction and AEO mode (FAQ, answer-first, llms.txt, citations). Robots honored; ownership required.
Developer
Adam
Maintained by Community
RAG-Ready Content Structurer
Turn pages you own into deterministic, embeddings-ready RAG chunks — clean, hashed, and token-bounded.
What it does
This Actor ingests a list of URLs that you own or are authorized to crawl, fetches each page, strips boilerplate, and converts it into clean Markdown. It then produces RAG chunks: semantic (heading-aware) or fixed-size chunks with full heading lineage, a deterministic token estimate, and a sha256 content hash per chunk — ready to upsert into a vector store.
The pipeline is built around a fail-closed gate chain:
- Gate A: Ownership attestation. You must attest that you own or are authorized to crawl every URL. If the attestation is not true, the run is rejected before any fetch, with zero events billed.
- Gate B: Paid-plan only. Default-deny until the Apify paid-plan run flag is confirmed; free or unknown plans run nothing and are not billed.
- Gate C: robots.txt. Always honored and cannot be disabled. Disallowed URLs emit 0 chunks, are charged $0, and are listed in `robots_skipped_urls`.
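The gate order above can be sketched as follows. This is a minimal illustration, not the Actor's code: the function name `run_gates`, the `GateResult` shape, the rejection reason strings, and the simplification of checking every URL against a single robots.txt body (real robots.txt files are per host) are all assumptions. Only the standard-library `urllib.robotparser` is used.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class GateResult:
    # Hypothetical result shape, for illustration only.
    allowed_urls: List[str] = field(default_factory=list)
    robots_skipped_urls: List[str] = field(default_factory=list)
    rejected_reason: Optional[str] = None


def run_gates(urls, ownership_attestation, paid_plan_confirmed, robots_txt, user_agent="*"):
    """Apply gates A, B, C in order; a failed A or B stops before any URL is considered."""
    # Gate A: the attestation must be exactly True (fail closed on None, "yes", etc.).
    if ownership_attestation is not True:
        return GateResult(rejected_reason="ownership_attestation_not_true")
    # Gate B: default-deny unless the paid-plan flag is positively confirmed.
    if paid_plan_confirmed is not True:
        return GateResult(rejected_reason="paid_plan_not_confirmed")
    # Gate C: robots.txt is always consulted; disallowed URLs are skipped, not fetched.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    result = GateResult()
    for url in urls:
        if parser.can_fetch(user_agent, url):
            result.allowed_urls.append(url)
        else:
            result.robots_skipped_urls.append(url)
    return result
```

Under these assumptions, a run with `ownership_attestation=False` returns immediately with a rejection reason and empty URL lists, which mirrors the "rejected before any fetch" behavior described above.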
Cleaning uses Mozilla Readability + Turndown for boilerplate-stripped Markdown. Chunking is deterministic: identical input yields byte-stable output and the same sha256 content hash, and no chunk exceeds maxTokens.
Note: the token count per chunk is a deterministic word-based estimate used to bound chunk size, not a tiktoken count.
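To make the determinism claim concrete, here is a rough sketch of fixed-size chunking with a word-based token estimate and a sha256 content hash. The Actor's actual estimator and chunk boundaries are not documented here, so the rule "one token per whitespace-separated word", the function names, and the whitespace re-joining (which flattens Markdown line breaks) are assumptions made for illustration.

```python
import hashlib


def estimate_tokens(text: str) -> int:
    """Assumed estimator: one token per whitespace-separated word."""
    return len(text.split())


def content_hash(text: str) -> str:
    """sha256 over the UTF-8 bytes, prefixed in the 'sha256:...' style."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def fixed_chunks(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list:
    """Split text into word windows of at most max_tokens, overlapping by overlap_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        chunk_text = " ".join(words[start:start + max_tokens])
        chunks.append({
            "chunk_index": index,
            "chunk_text": chunk_text,
            "token_count": estimate_tokens(chunk_text),
            "content_hash": content_hash(chunk_text),
        })
        # Stop once this window has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```

Because the function depends only on its arguments, calling it twice with the same text and settings yields identical chunk texts and identical hashes, which is the property that makes re-runs safe to de-duplicate.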
Input
Defined by INPUT_SCHEMA.json. Key fields:
| Field | Type | Notes |
|---|---|---|
| `source` | object | The URLs to ingest (URL list) that you own/are authorized to crawl. Includes `maxPages` (default 1000, max 50000). Required. |
| `ownership_attestation` | boolean | You attest you own or are authorized for every URL. Must be `true` or the run is rejected (zero billing). Required. |
| `render` | enum | `http` (default, cheapest) or `browser` (Playwright/Chromium for JS-heavy pages). |
| `chunking` | object | `strategy` (`semantic` or `fixed`), `maxTokens` (128–2048, default 512), `overlapTokens` (default 64). No chunk exceeds `maxTokens`. |
| `language` | string (nullable) | Optional ISO language-code hint (e.g. `en`). |
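As an illustration of the fields above, the following sketch builds an input object and checks it against the documented bounds before a run is started. It is not the Actor's own validation: the `urls` key inside `source` is a hypothetical name (the table only says the object holds a URL list), and the lower bound of 1 for `maxPages` and the rule that `overlapTokens` must be smaller than `maxTokens` are assumptions.

```python
ALLOWED_RENDER = {"http", "browser"}
ALLOWED_STRATEGY = {"semantic", "fixed"}

run_input = {
    # "urls" is a hypothetical key name; check INPUT_SCHEMA.json for the real one.
    "source": {"urls": ["https://example.com/docs"], "maxPages": 100},
    "ownership_attestation": True,
    "render": "http",
    "chunking": {"strategy": "semantic", "maxTokens": 512, "overlapTokens": 64},
    "language": "en",
}


def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input passes these checks."""
    errors = []
    source = run_input.get("source")
    if not isinstance(source, dict):
        errors.append("source is required and must be an object")
    elif not 1 <= source.get("maxPages", 1000) <= 50000:  # lower bound is assumed
        errors.append("source.maxPages must be between 1 and 50000")
    if run_input.get("ownership_attestation") is not True:
        errors.append("ownership_attestation must be true")
    if run_input.get("render", "http") not in ALLOWED_RENDER:
        errors.append("render must be 'http' or 'browser'")
    chunking = run_input.get("chunking", {})
    strategy = chunking.get("strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGY:
        errors.append("chunking.strategy must be 'semantic' or 'fixed'")
    max_tokens = chunking.get("maxTokens", 512)
    if not 128 <= max_tokens <= 2048:
        errors.append("chunking.maxTokens must be between 128 and 2048")
    if not 0 <= chunking.get("overlapTokens", 64) < max_tokens:  # assumed rule
        errors.append("chunking.overlapTokens must be >= 0 and below maxTokens")
    return errors
```

Running a check like this locally is optional, since the platform validates input against the schema, but it can catch an out-of-range `maxTokens` or a missing attestation before a run is submitted.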
Output
Defined by dataset_schema.json. Every record carries a record_type. The run path emits:
- `chunk` (RAG): `source_url`, `page_title`, `section_path` (heading lineage, e.g. `["Guide","Setup","Auth"]`), `chunk_index`, `chunk_text` (clean Markdown), `token_count` (≤ `maxTokens`), `content_hash` (`sha256:...`), `language`, `extracted_fields`, `retrieved_at`, `render_mode`.
- `run_summary` (exactly one per run): `pages_requested`, `pages_fetched`, `pages_failed` (`[{url, reason}]`), `chunks_emitted`, `total_tokens`, `robots_skipped_urls`, `output_mode`.
The dataset schema also defines AEO record types (faq_pair, answer_block, llms_txt, citation_block) and a structured-extraction extracted_fields shape, with prebuilt dataset views (RAG chunks, AEO assets, Run summary). These AEO/extraction outputs are reserved in the schema but are not produced by the current run path — the Actor currently emits RAG chunk records plus one run_summary.
The RAG chunks view is ready to upsert into Pinecone / Weaviate / Qdrant / pgvector.
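One possible way to prepare chunk records for an upsert is sketched below. It uses the record fields listed above but targets no particular vector store: the output item shape (`id`, `text`, `metadata`), the function name, and the choice of `content_hash` as the item id are assumptions, and computing embeddings is left to the caller.

```python
def to_upsert_items(records: list) -> list:
    """Keep chunk records only and de-duplicate them by content_hash."""
    items = {}
    for record in records:
        # run_summary and any other record types are not embedded.
        if record.get("record_type") != "chunk":
            continue
        item_id = record["content_hash"]
        # setdefault keeps the first occurrence, so repeated chunks collapse to one item.
        items.setdefault(item_id, {
            "id": item_id,
            "text": record["chunk_text"],
            "metadata": {
                "source_url": record["source_url"],
                "section_path": " > ".join(record["section_path"]),
                "chunk_index": record["chunk_index"],
                "token_count": record["token_count"],
            },
        })
    return list(items.values())
```

Using the hash as the id means that re-running the Actor on unchanged pages overwrites the same items rather than creating duplicates. The trade-off is that identical text appearing on two different pages collapses into one item; combining `source_url` with the hash would avoid that if it matters for your retrieval setup.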
Pricing
Pay-Per-Event. You are billed only for what actually runs (after the gates), via Actor.charge():
| Event | Price | When charged |
|---|---|---|
| `actor_run_start` | $0.05 | Once per run, only after the ownership + paid-plan + pilot gates pass. Never on a rejected or free-plan run. |
| `page_processed` | $0.003 | Per page successfully fetched + converted + chunked. Failed pages and robots-disallowed URLs charge $0. |
| `field_extracted` | $0.005 | Per (page × requested field) pair returning a non-null value. Reserved for structured extraction, which is not active in the current run path, so this event is not charged today. |
The developer keeps 80% and Apify keeps 20% (standard Apify 80/20 split).
Example run cost — 100 owned pages, RAG mode:
$0.05 + (100 × $0.003) = $0.35.
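The same arithmetic can be written as a small estimator, using the event prices from the table above. The function name and the `gates_passed` parameter are illustrative; `Decimal` is used so that cent-level sums do not pick up floating-point noise.

```python
from decimal import Decimal

# Event prices as listed in the pricing table.
PRICES = {
    "actor_run_start": Decimal("0.05"),
    "page_processed": Decimal("0.003"),
    "field_extracted": Decimal("0.005"),
}


def estimate_run_cost(pages_processed: int, fields_extracted: int = 0,
                      gates_passed: bool = True) -> Decimal:
    """Estimate the charge for one run; a run rejected at the gates costs nothing."""
    if not gates_passed:
        return Decimal("0")
    return (
        PRICES["actor_run_start"]
        + pages_processed * PRICES["page_processed"]
        + fields_extracted * PRICES["field_extracted"]
    )
```

For the example above, `estimate_run_cost(100)` gives a value equal to `Decimal("0.35")`. Only pages that were successfully processed should be counted, since failed and robots-disallowed pages are not charged.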
Why this Actor
- Deterministic & idempotent. Cleaning and chunking are pure and byte-stable: identical input produces identical chunks and identical `sha256` content hashes, which makes re-runs and vector-store de-duplication safe.
- Ownership-gated and robots-compliant by design. A mandatory ownership/authorization attestation rejects unauthorized runs before any fetch (zero billing), and `robots.txt` is forced on and cannot be disabled.
- Paid-plan only, fail-closed billing. Default-deny until a paid plan is confirmed; failed and robots-disallowed pages charge $0.
- Embeddings-ready output with token bounds. Chunks carry heading lineage and a token estimate that never exceeds your `maxTokens`, in a schema with a prebuilt view for Pinecone / Weaviate / Qdrant / pgvector.
About
This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is the only billing path and it bills the customer only — the Actor has no payout or money-out capability; revenue settlement is handled entirely by Apify's monetization rail.