Pricing

$2.00 / 1,000 dataset items scraped

RAG Web Crawler: Clean Markdown + Token-Sized Chunks

Turn any website into embeddings-ready chunks for RAG and vector databases. Structure-aware token-sized chunking, clean LLM-ready markdown, per-chunk citations and metadata, dedup, and junk filtering. Pay per result, no surprise compute bills.

Developer

Harry Schoeller

Last modified: a month ago

RAG Web Crawler — Clean Markdown + Token-Sized Chunks, Pay-Per-Result

Turn any website into embeddings-ready chunks with citations and predictable per-chunk pricing. No CSS selector tuning, no runaway compute bills.

Generic crawlers hand you raw pages and make you build the RAG pipeline yourself. This actor hands you clean, token-sized, deduplicated, citable chunks — at a fixed price per chunk you keep.

What it does

Clean LLM-ready Markdown — @mozilla/readability strips nav, footers, ads, and cookie banners; turndown + GFM converts the cleaned DOM to Markdown with heading hierarchy, fenced code blocks, and tables preserved.
Structure-aware, token-budgeted chunking — splits on the heading tree, then recursively sub-splits oversized sections to your token budget (default 512) with overlap (default 75). Code blocks and tables are kept intact, never split mid-block.
Rich per-chunk provenance — every chunk ships source URL + deep anchor, page title, full headings path, content hash, token count, content type, and language for metadata-filtered vector search and deep-link citations.
Dedup + junk filtering — exact content-hash dedup plus 64-bit SimHash near-duplicate collapsing, and low-information / nav-residue chunk filtering.
Four output formats — chunks-jsonl (one record per chunk), markdown (one record per page), langchain (drop-in {page_content, metadata} Document JSON), and jsonl-bulk (flat one-record-per-chunk for DB/COPY/pgvector).
Incremental / delta sync — on scheduled re-runs, only NEW or CHANGED pages are re-emitted (and billed). Makes daily/weekly crawls cheap.
Budget guarantee — maxPages is a hard ceiling; billing is per emitted result, so a runaway crawl can never produce a runaway bill.
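
The near-duplicate collapsing mentioned in the list above can be illustrated with a minimal 64-bit SimHash sketch. This is not the actor's code: the word tokenizer, the MD5-based token hash, and the `max_distance` threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Return a 64-bit SimHash fingerprint built from the text's word tokens."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit token hash: first 8 bytes of MD5 (assumption; any stable hash works).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    # max_distance is a hypothetical threshold, not the actor's actual setting.
    return hamming(simhash64(a), simhash64(b)) <= max_distance
```

Two chunks whose fingerprints differ in only a few bits share most of their tokens, so one of them can be dropped before it is emitted.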

Incremental / delta sync — cheap scheduled re-runs

Turn on Incremental sync (incremental: true) and schedule the actor to run daily or weekly. The first run does a full crawl and seeds a per-URL content-hash state in a named key-value store. Every later run crawls the site, but only re-emits chunks for pages that are new or changed — unchanged pages cost nothing. A typical weekly docs re-crawl re-emits a handful of pages instead of hundreds.

State is automatic. The state store name defaults to a deterministic hash of your start URLs, so a scheduled task reuses its own prior state with zero config. Set stateStoreName explicitly to share state across tasks/schedules.
forceFullCrawl: true re-emits everything and rebuilds the baseline — use after changing chunking settings or to refresh a stale index.
emitDeletions: true writes a tombstone record ({ deleted: true, url, ... }) to a separate deletions dataset for every URL that disappeared since the last run, so downstream vector stores can purge stale vectors. Tombstones are not billed.
When incremental is ON, each emitted record carries a change_status (new | changed) in its metadata.

The run summary (OUTPUT key-value record) includes a delta block: pages_new, pages_changed, pages_unchanged, pages_deleted, chunks_skipped_unchanged (the spend you saved), state_store, prior_run_id.
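
The delta logic described above amounts to comparing two URL-to-content-hash maps. A minimal sketch, assuming the state is a plain dict per run; `diff_pages` and its return shape are hypothetical, and the tombstone carries only the two fields named above:

```python
def diff_pages(prior: dict[str, str], current: dict[str, str]) -> dict:
    """Compare URL -> content-hash maps from the previous and the current crawl."""
    new = sorted(u for u in current if u not in prior)
    changed = sorted(u for u in current if u in prior and current[u] != prior[u])
    unchanged = sorted(u for u in current if u in prior and current[u] == prior[u])
    deleted = sorted(u for u in prior if u not in current)

    # Only new and changed pages are re-emitted; each carries its change_status.
    emit = {u: "new" for u in new}
    emit.update({u: "changed" for u in changed})

    return {
        "emit": emit,
        # One tombstone per URL that disappeared since the last run.
        "tombstones": [{"deleted": True, "url": u} for u in deleted],
        "delta": {
            "pages_new": len(new),
            "pages_changed": len(changed),
            "pages_unchanged": len(unchanged),
            "pages_deleted": len(deleted),
        },
    }
```

Downstream, the tombstone URLs are the keys to purge from the vector store, and the `emit` map tells you which pages need re-embedding.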

When all incremental options are OFF (the default), behavior and output are byte-for-byte identical to v1.0.

Output (chunks-jsonl)

{
  "id": "a1f3c9e29b2c4d10",
  "url": "https://docs.example.com/guide/install",
  "title": "Getting Started — Example Docs",
  "chunkIndex": 3,
  "chunkTotal": 11,
  "headingsPath": ["Getting Started", "Setup", "Installation"],
  "text": "## Installation\n\nInstall via npm:\n\n```bash\nnpm install crawlee\n```",
  "tokenEstimate": 498,
  "fetchedAt": "2026-06-20T14:02:11Z",
  "content_hash": "sha256:...",
  "metadata": {
    "source_url": "https://docs.example.com/guide/install",
    "deep_link": "https://docs.example.com/guide/install#installation",
    "anchor": "installation",
    "canonical_url": "https://docs.example.com/guide/install",
    "page_title": "Getting Started — Example Docs",
    "char_count": 2104,
    "content_type": "mixed",
    "language": "en",
    "last_modified": null,
    "crawl_timestamp": "2026-06-20T14:02:11Z"
  }
}

Each record maps 1:1 to a vector-DB upsert: { id, values=embed(text), metadata }.
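
As a rough sketch of that mapping, the helper below turns one chunks-jsonl record into an upsert payload. `to_upsert` and `fake_embed` are hypothetical names; `fake_embed` is a placeholder and not a real embedding model. Null metadata values are dropped on the assumption that some vector stores reject them.

```python
from typing import Callable, Sequence


def to_upsert(record: dict, embed: Callable[[str], Sequence[float]]) -> dict:
    """Map one chunks-jsonl record to {id, values, metadata}."""
    # Drop null metadata values (assumption: the target store does not accept nulls).
    metadata = {k: v for k, v in record["metadata"].items() if v is not None}
    return {
        "id": record["id"],
        "values": list(embed(record["text"])),
        "metadata": metadata,
    }


def fake_embed(text: str) -> list[float]:
    # Placeholder vector for demonstration only; swap in your embedding client.
    return [float(len(text)), float(text.count("\n"))]
```

Because the chunk `id` is used as the vector id, re-running the upsert for a changed chunk overwrites the old vector instead of adding a second one, provided the id is stable across runs (an assumption the source does not confirm).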

Output (langchain)

One record per chunk, drop-in for LangChain — [Document(**r) for r in dataset]:

{
  "page_content": "## Installation\n\nInstall via npm...",
  "metadata": {
    "id": "a1f3c9e29b2c4d10",
    "source": "https://docs.example.com/guide/install",
    "title": "Getting Started — Example Docs",
    "deep_link": "https://docs.example.com/guide/install#installation",
    "canonical_url": "https://docs.example.com/guide/install",
    "headings_path": ["Getting Started", "Setup", "Installation"],
    "chunk_index": 3,
    "chunk_total": 11,
    "content_type": "mixed",
    "language": "en",
    "token_estimate": 498,
    "char_count": 2104,
    "content_hash": "sha256:...",
    "last_modified": null,
    "crawl_timestamp": "2026-06-20T14:02:11Z"
  }
}
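
The metadata above is enough to build a deep-link citation for an answer. A minimal sketch that works on the plain dict form of a record, so it needs no LangChain import; `format_citation` is a hypothetical helper and the citation layout is an arbitrary choice:

```python
def format_citation(doc: dict) -> str:
    """Build a human-readable citation from a langchain-format record."""
    meta = doc["metadata"]
    trail = " > ".join(meta.get("headings_path", []))
    # Prefer the deep link to the section; fall back to the page URL.
    link = meta.get("deep_link") or meta["source"]
    label = f"{meta['title']}: {trail}" if trail else meta["title"]
    return f"{label} ({link})"
```

With a LangChain `Document`, the same function can be applied to `{"metadata": doc.metadata}`.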

Output (jsonl-bulk)

Fully flat, with one record per chunk, for generic bulk import (DB COPY / pgvector):

{
  "id": "a1f3c9e29b2c4d10",
  "text": "## Installation\n\nInstall via npm...",
  "source_url": "https://docs.example.com/guide/install",
  "deep_link": "https://docs.example.com/guide/install#installation",
  "canonical_url": "https://docs.example.com/guide/install",
  "title": "Getting Started — Example Docs",
  "headings_path": "Getting Started > Setup > Installation",
  "chunk_index": 3,
  "chunk_total": 11,
  "content_type": "mixed",
  "language": "en",
  "token_estimate": 498,
  "char_count": 2104,
  "content_hash": "sha256:...",
  "last_modified": null,
  "crawl_timestamp": "2026-06-20T14:02:11Z"
}
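
Because every field is a scalar, a jsonl-bulk export converts directly to CSV for a bulk load. The sketch below assumes the dataset was exported as JSON Lines (one record per line); `jsonl_to_csv` is a hypothetical helper, and the column order simply follows the example record above.

```python
import csv
import io
import json

COLUMNS = [
    "id", "text", "source_url", "deep_link", "canonical_url", "title",
    "headings_path", "chunk_index", "chunk_total", "content_type", "language",
    "token_estimate", "char_count", "content_hash", "last_modified", "crawl_timestamp",
]


def jsonl_to_csv(jsonl_text: str) -> str:
    """Convert jsonl-bulk records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # Write nulls as empty fields; the csv module quotes embedded newlines and commas.
        writer.writerow({c: "" if row.get(c) is None else row[c] for c in COLUMNS})
    return out.getvalue()
```

In PostgreSQL's CSV mode an unquoted empty field is read as NULL by default, which is why nulls are written as empty fields here. The embedding column is not part of this file and would be filled in a separate step.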

Input

See .actor/input_schema.json. Key fields: startUrls, crawlScope, maxCrawlDepth, maxPages, renderJs, outputFormat, chunkSize, chunkOverlap, dedupNearDuplicates, filterJunkChunks, and the incremental sync fields incremental, forceFullCrawl, stateStoreName, emitDeletions.
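
To make the interplay of chunkSize and chunkOverlap concrete, here is a deliberately simplified sliding-window sketch. It is not the actor's structure-aware chunker: it ignores headings, code blocks, and tables, uses whitespace-separated words as a stand-in for real tokens, and assumes both settings are measured in tokens. `window_chunks` is a hypothetical name.

```python
def window_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 75) -> list[str]:
    """Split text into windows of chunk_size tokens that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # whitespace words stand in for tokenizer tokens
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap produces more chunks for the same text, and with per-chunk pricing, more billable items.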

Pricing

Pay-Per-Event. Billable unit = one emitted dataset item. Deduped and junk-filtered chunks are not billed.

Event	Price
Per chunk emitted (chunks-jsonl)	$0.0008 / chunk ($0.80 / 1,000)
Per page emitted (markdown)	$0.002 / page ($2.00 / 1,000)
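
The arithmetic behind the table can be sketched as a small estimator. Only the two prices listed above are included; `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal

# Prices per emitted dataset item, taken from the table above.
PRICE_PER_ITEM = {
    "chunks-jsonl": Decimal("0.0008"),
    "markdown": Decimal("0.002"),
}


def estimate_cost(output_format: str, emitted_items: int) -> Decimal:
    """Return the estimated charge in USD for a run that emits the given number of items."""
    # Raises KeyError for formats whose price is not listed above.
    return PRICE_PER_ITEM[output_format] * emitted_items
```

For example, a run that emits 5,000 chunks in chunks-jsonl format comes to 5,000 × $0.0008 = $4.00, and an incremental re-run that emits 40 changed chunks comes to about $0.03.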

Run locally

npm install
npm run build
apify run    # reads .actor/INPUT.json

Roadmap (v1.2+)

Inline embeddings, direct vector-DB push (Pinecone/Qdrant/Weaviate/pgvector), missedGraceRuns before tombstoning, Standby low-latency mode.


Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.