Pricing

from $3.00 / 1,000 pages processed

Website & Docs to Markdown + RAG Chunks

Turn websites & docs into clean Markdown plus token-bounded, embeddings-ready RAG chunks (heading lineage + sha256) ready for Pinecone, Weaviate, Qdrant or pgvector. Optional no-hallucination field extraction and AEO mode (FAQ, answer-first, llms.txt, citations). Robots honored; ownership required.

Developer

Adam

Last modified: 15 days ago

RAG-Ready Content Structurer

Turn pages you own into deterministic, embeddings-ready RAG chunks — clean, hashed, and token-bounded.

What it does

This Actor ingests a list of URLs that you own or are authorized to crawl, fetches each page, strips boilerplate, and converts it into clean Markdown. It then produces RAG chunks: semantic (heading-aware) or fixed-size chunks with full heading lineage, a deterministic token estimate, and a sha256 content hash per chunk — ready to upsert into a vector store.

The pipeline is built around a fail-closed gate chain:

Gate A — Ownership attestation. You must attest you own/are authorized for every URL. If not true, the run is rejected before any fetch, with zero events billed.
Gate B — Paid-plan only. Default-deny until the Apify paid-plan run flag is confirmed; free/unknown plans run nothing and are not billed.
Gate C — robots.txt. Always honored and cannot be disabled. Disallowed URLs emit 0 chunks, are charged 0, and are listed in robots_skipped_urls.
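The gate order above can be sketched in a few lines of Python. This is a minimal illustration, not the Actor's actual code: `run_gates`, `GateResult`, the reason strings, and the `ExampleBot` user agent are hypothetical names, and the sketch assumes all URLs share one host whose robots.txt text has already been fetched by the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class GateResult:
    allowed_urls: List[str] = field(default_factory=list)
    robots_skipped_urls: List[str] = field(default_factory=list)
    rejected_reason: Optional[str] = None  # set when Gate A or B rejects the run


def run_gates(urls, ownership_attestation, paid_plan_confirmed,
              robots_txt, user_agent="ExampleBot"):
    """Fail-closed: anything not explicitly confirmed counts as a denial."""
    result = GateResult()

    # Gate A: the attestation must be literally True; None or "yes" is rejected.
    if ownership_attestation is not True:
        result.rejected_reason = "ownership_attestation"
        return result

    # Gate B: default-deny until the paid-plan flag is confirmed.
    if paid_plan_confirmed is not True:
        result.rejected_reason = "paid_plan"
        return result

    # Gate C: robots.txt is always consulted; there is no switch to skip it.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in urls:
        if parser.can_fetch(user_agent, url):
            result.allowed_urls.append(url)
        else:
            result.robots_skipped_urls.append(url)
    return result
```

Because Gates A and B return before any URL is examined, a rejected run never reaches the fetch step, which matches the "rejected before any fetch" behavior described above.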

Cleaning uses Mozilla Readability + Turndown for boilerplate-stripped Markdown. Chunking is deterministic: identical input yields byte-stable output and the same sha256 content hash, and no chunk exceeds maxTokens.

Note: the token count per chunk is a deterministic word-based estimate used to bound chunk size, not a tiktoken count.
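To make the determinism claim concrete, here is a small Python sketch of a word-based token estimate and a sha256 content hash. The exact estimation formula the Actor uses is not documented here, so the one-token-per-word rule is an assumption, and `estimate_tokens` and `content_hash` are hypothetical helper names.

```python
import hashlib


def estimate_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # This is a deterministic estimate, not a tiktoken count.
    return len(text.split())


def content_hash(text: str) -> str:
    # Hash the UTF-8 bytes so identical text always yields the same digest.
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Both functions are pure: the same chunk text gives the same estimate and the same hash on every run, which is what makes hash-based de-duplication safe.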

Input

Defined by INPUT_SCHEMA.json. Key fields:

Field	Type	Notes
`source`	object	The URLs to ingest (URL list) that you own/are authorized to crawl. Includes `maxPages` (default 1000, max 50000). Required.
`ownership_attestation`	boolean	You attest you own or are authorized for every URL. Must be `true` or the run is rejected (zero billing). Required.
`render`	enum	`http` (default, cheapest) or `browser` (Playwright/Chromium for JS-heavy pages).
`chunking`	object	`strategy` (`semantic` or `fixed`), `maxTokens` (128–2048, default 512), `overlapTokens` (default 64). No chunk exceeds `maxTokens`.
`language`	string (nullable)	Optional ISO language-code hint (e.g. `en`).
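The constraints in the table can be checked before a run is submitted. The following Python sketch is a hypothetical client-side validator, not part of the Actor: `validate_input` is an invented name, the lower bound of 1 for `maxPages` and the rule that `overlapTokens` must be smaller than `maxTokens` are assumptions, and the inner shape of `source` beyond `maxPages` is left untouched because it is not specified above.

```python
def validate_input(actor_input: dict) -> dict:
    """Return a normalized copy of the input or raise ValueError."""
    if actor_input.get("ownership_attestation") is not True:
        raise ValueError("ownership_attestation must be true")

    source = actor_input.get("source")
    if not isinstance(source, dict):
        raise ValueError("source is required and must be an object")
    max_pages = source.get("maxPages", 1000)
    if not 1 <= max_pages <= 50000:  # lower bound is an assumption
        raise ValueError("maxPages must be between 1 and 50000")

    render = actor_input.get("render", "http")
    if render not in ("http", "browser"):
        raise ValueError("render must be 'http' or 'browser'")

    chunking = actor_input.get("chunking", {})
    strategy = chunking.get("strategy")  # no default is documented
    if strategy is not None and strategy not in ("semantic", "fixed"):
        raise ValueError("strategy must be 'semantic' or 'fixed'")
    max_tokens = chunking.get("maxTokens", 512)
    if not 128 <= max_tokens <= 2048:
        raise ValueError("maxTokens must be between 128 and 2048")
    overlap = chunking.get("overlapTokens", 64)
    if not 0 <= overlap < max_tokens:  # assumption: overlap below maxTokens
        raise ValueError("overlapTokens must be >= 0 and < maxTokens")

    return {
        "source": {**source, "maxPages": max_pages},
        "ownership_attestation": True,
        "render": render,
        "chunking": {"strategy": strategy, "maxTokens": max_tokens,
                     "overlapTokens": overlap},
        "language": actor_input.get("language"),
    }
```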

Output

Defined by dataset_schema.json. Every record carries a record_type. The run path emits:

chunk (RAG): chunk_id (position-stable key for idempotent vector-DB upserts — re-crawling updates the same vectors instead of duplicating them), source_url, page_title, section_path (heading lineage, e.g. ["Guide","Setup","Auth"]), heading (immediate section title), chunk_index, section_chunk_index/section_chunk_count (position within the section), chunk_text (clean Markdown), token_count (≤ maxTokens), char_count, word_count, overlap_prev (chunk carries overlap from the previous one), content_hash (sha256:...), language, extracted_fields, retrieved_at, render_mode.
run_summary (exactly one per run): pages_requested, pages_fetched, pages_failed ([{url, reason}] — pages that failed extraction are isolated here at zero charge, never crashing the run), chunks_emitted, total_tokens, robots_skipped_urls, output_mode.
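The idempotent-upsert idea behind `chunk_id` and `content_hash` can be illustrated with a plain dictionary standing in for a vector store. This is a sketch under that assumption; `upsert_chunks` is a hypothetical helper, and a real integration would call the vector database's own upsert API and re-embed only the inserted or updated chunks.

```python
def upsert_chunks(store: dict, records: list) -> dict:
    """Upsert chunk records keyed by chunk_id; skip unchanged content."""
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for rec in records:
        # run_summary and any other record types are ignored here.
        if rec.get("record_type") != "chunk":
            continue
        key = rec["chunk_id"]
        existing = store.get(key)
        if existing is None:
            stats["inserted"] += 1
        elif existing["content_hash"] == rec["content_hash"]:
            # Same position, same content: nothing to re-embed.
            stats["unchanged"] += 1
            continue
        else:
            stats["updated"] += 1
        store[key] = rec
    return stats
```

Running the same dataset through `upsert_chunks` twice leaves the store unchanged on the second pass, and a re-crawl with edited text overwrites the existing key instead of adding a duplicate.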

Chunking quality. Code fences are kept atomic — a chunk never contains a half-open ``` fence; an oversized code block is split with each piece re-wrapped in its original fence + language, so every chunk is independently valid, embeddable Markdown. Splits prefer natural boundaries (paragraph → sentence → word → char), and a hard character cap (secondary to the token bound) means a whitespace-free blob (minified JS, base64) can't masquerade as a tiny chunk and silently blow an embedding model's real token limit. All output is deterministic and byte-stable across runs.
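The fence re-wrapping behavior can be sketched as follows. This is an illustration only: `split_code_block` is a hypothetical function, it bounds pieces by line count rather than by the Actor's token estimate, and it assumes the block opens and closes with a plain three-backtick fence.

```python
from typing import List

FENCE = "`" * 3  # built programmatically so this example stays a valid block


def split_code_block(block: str, max_lines: int) -> List[str]:
    """Split a fenced code block into pieces, each re-wrapped in its fence."""
    lines = block.split("\n")
    opener, body, closer = lines[0], lines[1:-1], lines[-1]
    if not opener.startswith(FENCE) or closer.strip() != FENCE:
        raise ValueError("expected a block that opens and closes with a fence")
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    if len(body) <= max_lines:
        return [block]  # already small enough; keep it atomic

    pieces = []
    for start in range(0, len(body), max_lines):
        part = body[start:start + max_lines]
        # Reuse the original opener so the language tag survives the split.
        pieces.append("\n".join([opener, *part, closer]))
    return pieces
```

Each returned piece contains exactly one opening and one closing fence, so no chunk is left with a half-open code block.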

The dataset schema also defines AEO record types (faq_pair, answer_block, llms_txt, citation_block) and a structured-extraction extracted_fields shape, with prebuilt dataset views (RAG chunks, AEO assets, Run summary). These AEO/extraction outputs are reserved in the schema but are not produced by the current run path — the Actor currently emits RAG chunk records plus one run_summary.

The RAG chunks view is ready to upsert into Pinecone / Weaviate / Qdrant / pgvector.

Pricing

Pay-Per-Event. You are billed only for what actually runs (after the gates), via Actor.charge():

Event	Price	When charged
`actor_run_start`	$0.05	Once per run, only after the ownership + paid-plan + pilot gates pass. Never on a rejected or free-plan run.
`page_processed`	$0.003	Per page successfully fetched + converted + chunked. Failed pages and robots-disallowed URLs charge $0.
`field_extracted`	$0.005	Per `(page × requested field)` pair returning a non-null value. Reserved for structured extraction, which is not active in the current run path, so this event is not charged today.

The developer keeps 80% and Apify keeps 20% (standard Apify 80/20 split).

Example run cost — 100 owned pages, RAG mode: $0.05 + (100 × $0.003) = $0.35.
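The same arithmetic can be expressed as a small Python helper for estimating other run sizes. The prices come from the table above; `run_cost` and its parameters are hypothetical names, and `Decimal` is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

PRICES = {
    "actor_run_start": Decimal("0.05"),
    "page_processed": Decimal("0.003"),
    "field_extracted": Decimal("0.005"),  # reserved; not charged today
}


def run_cost(pages_processed: int, gates_passed: bool = True,
             fields_extracted: int = 0) -> Decimal:
    """Estimate the charge for one run in USD."""
    if not gates_passed:
        # Rejected and free-plan runs bill zero events.
        return Decimal("0")
    return (PRICES["actor_run_start"]
            + pages_processed * PRICES["page_processed"]
            + fields_extracted * PRICES["field_extracted"])
```

For example, `run_cost(100)` equals `Decimal("0.35")`, matching the worked example, and only pages that were successfully processed should be counted in `pages_processed`.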

Why this Actor

Deterministic & idempotent. Cleaning and chunking are pure and byte-stable: identical input produces identical chunks and identical sha256 content hashes — safe re-runs, safe vector-store de-duplication.
Ownership-gated and robots-compliant by design. A mandatory ownership/authorization attestation rejects unauthorized runs before any fetch (zero billing), and robots.txt is forced on and cannot be disabled.
Paid-plan only, fail-closed billing. Default-deny until a paid plan is confirmed; failed and robots-disallowed pages charge $0.
Embeddings-ready output with token bounds. Chunks carry heading lineage and a token estimate that never exceeds your maxTokens, in a schema with a prebuilt view for Pinecone / Weaviate / Qdrant / pgvector.

About

This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is the only billing path and it bills the customer only — the Actor has no payout or money-out capability; revenue settlement is handled entirely by Apify's monetization rail.
