RAG Web Crawler: Clean Markdown + Token-Sized Chunks avatar

RAG Web Crawler: Clean Markdown + Token-Sized Chunks

Pricing

$2.00 / 1,000 dataset item scrapeds

Go to Apify Store
RAG Web Crawler: Clean Markdown + Token-Sized Chunks

RAG Web Crawler: Clean Markdown + Token-Sized Chunks

Turn any website into embeddings-ready chunks for RAG and vector databases. Structure-aware token-sized chunking, clean LLM-ready markdown, per-chunk citations and metadata, dedup, and junk filtering. Pay per result, no surprise compute bills.

Pricing

$2.00 / 1,000 dataset item scrapeds

Rating

0.0

(0)

Developer

Harry Schoeller

Harry Schoeller

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

4 days ago

Last modified

Share

RAG Web Crawler — Clean Markdown + Token-Sized Chunks, Pay-Per-Result

Turn any website into embeddings-ready chunks with citations and predictable per-chunk pricing. No CSS tuning, no runaway compute bills.

Generic crawlers hand you raw pages and make you build the RAG pipeline yourself. This actor hands you clean, token-sized, deduplicated, citable chunks — at a fixed price per chunk you keep.

What it does

  • Clean LLM-ready Markdown@mozilla/readability strips nav, footers, ads, and cookie banners; turndown + GFM converts the cleaned DOM to Markdown with heading hierarchy, fenced code blocks, and tables preserved.
  • Structure-aware, token-budgeted chunking — splits on the heading tree, then recursively sub-splits oversized sections to your token budget (default 512) with overlap (default 75). Code blocks and tables are kept intact, never split mid-block.
  • Rich per-chunk provenance — every chunk ships source URL + deep anchor, page title, full headings path, content hash, token count, content type, and language for metadata-filtered vector search and deep-link citations.
  • Dedup + junk filtering — exact content-hash dedup plus 64-bit SimHash near-duplicate collapsing, and low-information / nav-residue chunk filtering.
  • Four output formatschunks-jsonl (one record per chunk), markdown (one record per page), langchain (drop-in {page_content, metadata} Document JSON), and jsonl-bulk (flat one-record-per-chunk for DB/COPY/pgvector).
  • Incremental / delta sync — on scheduled re-runs, only NEW or CHANGED pages are re-emitted (and billed). Makes daily/weekly crawls cheap.
  • Budget guaranteemaxPages is a hard ceiling; billing is per emitted result, so a runaway crawl can never produce a runaway bill.

Incremental / delta sync — cheap scheduled re-runs

Turn on Incremental sync (incremental: true) and schedule the actor to run daily or weekly. The first run does a full crawl and seeds a per-URL content-hash state in a named key-value store. Every later run crawls the site, but only re-emits chunks for pages that are new or changed — unchanged pages cost nothing. A typical weekly docs re-crawl re-emits a handful of pages instead of hundreds.

  • State is automatic. The state store name defaults to a deterministic hash of your start URLs, so a scheduled task reuses its own prior state with zero config. Set stateStoreName explicitly to share state across tasks/schedules.
  • forceFullCrawl: true re-emits everything and rebuilds the baseline — use after changing chunking settings or to refresh a stale index.
  • emitDeletions: true writes a tombstone record ({ deleted: true, url, ... }) to a separate deletions dataset for every URL that disappeared since the last run, so downstream vector stores can purge stale vectors. Tombstones are not billed.
  • When incremental is ON, each emitted record carries a change_status (new | changed) in its metadata.

The run summary (OUTPUT key-value record) includes a delta block: pages_new, pages_changed, pages_unchanged, pages_deleted, chunks_skipped_unchanged (the spend you saved), state_store, prior_run_id.

When all incremental options are OFF (the default), behavior and output are byte-for-byte identical to v1.0.

Output (chunks-jsonl)

{
"id": "a1f3c9e29b2c4d10",
"url": "https://docs.example.com/guide/install",
"title": "Getting Started — Example Docs",
"chunkIndex": 3,
"chunkTotal": 11,
"headingsPath": ["Getting Started", "Setup", "Installation"],
"text": "## Installation\n\nInstall via npm:\n\n```bash\nnpm install crawlee\n```",
"tokenEstimate": 498,
"fetchedAt": "2026-06-20T14:02:11Z",
"content_hash": "sha256:...",
"metadata": {
"source_url": "https://docs.example.com/guide/install",
"deep_link": "https://docs.example.com/guide/install#installation",
"anchor": "installation",
"canonical_url": "https://docs.example.com/guide/install",
"page_title": "Getting Started — Example Docs",
"char_count": 2104,
"content_type": "mixed",
"language": "en",
"last_modified": null,
"crawl_timestamp": "2026-06-20T14:02:11Z"
}
}

Each record maps 1:1 to a vector-DB upsert: { id, values=embed(text), metadata }.

Output (langchain)

One record per chunk, drop-in for LangChain — [Document(**r) for r in dataset]:

{
"page_content": "## Installation\n\nInstall via npm...",
"metadata": {
"id": "a1f3c9e29b2c4d10",
"source": "https://docs.example.com/guide/install",
"title": "Getting Started — Example Docs",
"deep_link": "https://docs.example.com/guide/install#installation",
"canonical_url": "https://docs.example.com/guide/install",
"headings_path": ["Getting Started", "Setup", "Installation"],
"chunk_index": 3,
"chunk_total": 11,
"content_type": "mixed",
"language": "en",
"token_estimate": 498,
"char_count": 2104,
"content_hash": "sha256:...",
"last_modified": null,
"crawl_timestamp": "2026-06-20T14:02:11Z"
}
}

Output (jsonl-bulk)

Fully flat one-record-per-chunk for generic bulk import (DB COPY / pgvector):

{
"id": "a1f3c9e29b2c4d10",
"text": "## Installation\n\nInstall via npm...",
"source_url": "https://docs.example.com/guide/install",
"deep_link": "https://docs.example.com/guide/install#installation",
"canonical_url": "https://docs.example.com/guide/install",
"title": "Getting Started — Example Docs",
"headings_path": "Getting Started > Setup > Installation",
"chunk_index": 3,
"chunk_total": 11,
"content_type": "mixed",
"language": "en",
"token_estimate": 498,
"char_count": 2104,
"content_hash": "sha256:...",
"last_modified": null,
"crawl_timestamp": "2026-06-20T14:02:11Z"
}

Input

See .actor/input_schema.json. Key fields: startUrls, crawlScope, maxCrawlDepth, maxPages, renderJs, outputFormat, chunkSize, chunkOverlap, dedupNearDuplicates, filterJunkChunks, and the incremental sync fields incremental, forceFullCrawl, stateStoreName, emitDeletions.

Pricing

Pay-Per-Event. Billable unit = one emitted dataset item. Deduped and junk-filtered chunks are not billed.

EventPrice
Per chunk emitted (chunks-jsonl)$0.0008 / chunk ($0.80 / 1,000)
Per page emitted (markdown)$0.002 / page ($2.00 / 1,000)

Run locally

npm install
npm run build
apify run # reads .actor/INPUT.json

Roadmap (v1.2+)

Inline embeddings, direct vector-DB push (Pinecone/Qdrant/Weaviate/pgvector), missedGraceRuns before tombstoning, Standby low-latency mode.