RAG Web Crawler: Clean Markdown + Token-Sized Chunks
Pricing
$2.00 / 1,000 dataset items scraped
Turn any website into embeddings-ready chunks for RAG and vector databases. Structure-aware token-sized chunking, clean LLM-ready markdown, per-chunk citations and metadata, dedup, and junk filtering. Pay per result, no surprise compute bills.
Developer
Harry Schoeller
Maintained by Community
RAG Web Crawler — Clean Markdown + Token-Sized Chunks, Pay-Per-Result
Turn any website into embeddings-ready chunks with citations and predictable per-chunk pricing. No CSS tuning, no runaway compute bills.
Generic crawlers hand you raw pages and make you build the RAG pipeline yourself. This actor hands you clean, token-sized, deduplicated, citable chunks — at a fixed price per chunk you keep.
What it does
- Clean LLM-ready Markdown: `@mozilla/readability` strips nav, footers, ads, and cookie banners; `turndown` + GFM converts the cleaned DOM to Markdown with heading hierarchy, fenced code blocks, and tables preserved.
- Structure-aware, token-budgeted chunking: splits on the heading tree, then recursively sub-splits oversized sections to your token budget (default 512) with overlap (default 75). Code blocks and tables are kept intact, never split mid-block.
- Rich per-chunk provenance: every chunk ships source URL + deep anchor, page title, full headings path, content hash, token count, content type, and language for metadata-filtered vector search and deep-link citations.
- Dedup + junk filtering: exact content-hash dedup plus 64-bit SimHash near-duplicate collapsing, and low-information / nav-residue chunk filtering.
- Four output formats: `chunks-jsonl` (one record per chunk), `markdown` (one record per page), `langchain` (drop-in `{page_content, metadata}` Document JSON), and `jsonl-bulk` (flat one-record-per-chunk for DB/COPY/pgvector).
- Incremental / delta sync: on scheduled re-runs, only NEW or CHANGED pages are re-emitted (and billed). Makes daily/weekly crawls cheap.
- Budget guarantee: `maxPages` is a hard ceiling; billing is per emitted result, so a runaway crawl can never produce a runaway bill.
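To make the chunking bullet concrete, here is a minimal Python sketch of heading-first splitting with a token budget and overlap. It is an illustration under stated assumptions, not the actor's implementation: tokens are approximated as whitespace-separated words (the actor's tokenizer is not documented here), oversized sections are packed paragraph by paragraph in a single pass rather than split recursively, and all function names are hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def count_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def split_sections(markdown: str) -> list[str]:
    """Split on Markdown headings, ignoring '#' lines inside code fences."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return [s for s in sections if s.strip()]


def split_blocks(section: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks, current, in_fence = [], [], False
    for line in section.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_markdown(markdown: str, chunk_size: int = 512, overlap: int = 75) -> list[str]:
    chunks = []
    for section in split_sections(markdown):
        if count_tokens(section) <= chunk_size:
            chunks.append(section.strip())
            continue
        current: list[str] = []
        for block in split_blocks(section):
            if current and count_tokens("\n\n".join(current + [block])) > chunk_size:
                chunks.append("\n\n".join(current))
                # Carry whole trailing blocks forward as overlap, up to `overlap` tokens.
                carried, budget = [], overlap
                for prev in reversed(current):
                    if count_tokens(prev) > budget:
                        break
                    carried.insert(0, prev)
                    budget -= count_tokens(prev)
                # Drop the overlap if it would push the next chunk over budget.
                if count_tokens("\n\n".join(carried + [block])) > chunk_size:
                    carried = []
                current = carried
            current.append(block)
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Because blocks are the smallest unit, a fenced code block or a table (which contains no blank lines) is never cut in half; a single block larger than the budget is emitted as one oversized chunk instead.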
Incremental / delta sync — cheap scheduled re-runs
Turn on Incremental sync (incremental: true) and schedule the actor to run
daily or weekly. The first run does a full crawl and seeds a per-URL content-hash
state in a named key-value store. Every later run crawls the site, but only
re-emits chunks for pages that are new or changed — unchanged pages cost
nothing. A typical weekly docs re-crawl re-emits a handful of pages instead of
hundreds.
- State is automatic. The state store name defaults to a deterministic hash of your start URLs, so a scheduled task reuses its own prior state with zero config. Set `stateStoreName` explicitly to share state across tasks/schedules.
- `forceFullCrawl: true` re-emits everything and rebuilds the baseline. Use it after changing chunking settings or to refresh a stale index.
- `emitDeletions: true` writes a tombstone record (`{ deleted: true, url, ... }`) to a separate `deletions` dataset for every URL that disappeared since the last run, so downstream vector stores can purge stale vectors. Tombstones are not billed.
- When incremental is ON, each emitted record carries a `change_status` (`new` | `changed`) in its metadata.
The run summary (OUTPUT key-value record) includes a delta block:
pages_new, pages_changed, pages_unchanged, pages_deleted,
chunks_skipped_unchanged (the spend you saved), state_store, prior_run_id.
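The delta logic described above can be sketched as a comparison between the prior per-URL hash state and the pages seen in the current run. This is a rough Python illustration, not the actor's code: the `rag-state-` prefix, the function names, and the choice to hash the page text with SHA-256 are assumptions, and only the four page counters of the delta block are computed.

```python
import hashlib


def default_state_store_name(start_urls: list[str]) -> str:
    """Deterministic name derived from the start URLs (naming scheme is hypothetical)."""
    joined = "\n".join(sorted(start_urls))
    return "rag-state-" + hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def content_hash(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_run(prior_state: dict[str, str], pages: dict[str, str]):
    """Compare this run's pages (url -> text) with prior state (url -> hash)."""
    new_state: dict[str, str] = {}
    emit: dict[str, str] = {}  # url -> change_status for pages to re-emit
    delta = {"pages_new": 0, "pages_changed": 0, "pages_unchanged": 0, "pages_deleted": 0}
    for url, text in pages.items():
        digest = content_hash(text)
        new_state[url] = digest
        if url not in prior_state:
            status = "new"
        elif prior_state[url] != digest:
            status = "changed"
        else:
            delta["pages_unchanged"] += 1  # skipped, so nothing is emitted or billed
            continue
        emit[url] = status
        delta[f"pages_{status}"] += 1
    # URLs present last run but missing now become tombstones.
    tombstones = [{"deleted": True, "url": url} for url in sorted(set(prior_state) - set(pages))]
    delta["pages_deleted"] = len(tombstones)
    return emit, tombstones, new_state, delta
```

Passing an empty `prior_state` reproduces a first run or a forced full crawl: every page is classified as `new` and the returned state becomes the new baseline.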
When all incremental options are OFF (the default), behavior and output are byte-for-byte identical to v1.0.
Output (chunks-jsonl)
{"id": "a1f3c9e29b2c4d10","url": "https://docs.example.com/guide/install","title": "Getting Started — Example Docs","chunkIndex": 3,"chunkTotal": 11,"headingsPath": ["Getting Started", "Setup", "Installation"],"text": "## Installation\n\nInstall via npm:\n\n```bash\nnpm install crawlee\n```","tokenEstimate": 498,"fetchedAt": "2026-06-20T14:02:11Z","content_hash": "sha256:...","metadata": {"source_url": "https://docs.example.com/guide/install","deep_link": "https://docs.example.com/guide/install#installation","anchor": "installation","canonical_url": "https://docs.example.com/guide/install","page_title": "Getting Started — Example Docs","char_count": 2104,"content_type": "mixed","language": "en","last_modified": null,"crawl_timestamp": "2026-06-20T14:02:11Z"}}
Each record maps 1:1 to a vector-DB upsert: { id, values=embed(text), metadata }.
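As a sketch of that mapping, the Python below turns `chunks-jsonl` lines into upsert payloads. The `embed` function is a deterministic placeholder so the example runs without a model; replace it with a real embedding call. Dropping `null` metadata values is an assumption about the target store (some vector databases reject nulls), so check yours before relying on it.

```python
import hashlib
import json


def embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedding (hash bytes scaled to 0..1). Not a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def to_upsert(record: dict) -> dict:
    """Map one chunks-jsonl record to { id, values, metadata }."""
    # Assumption: the target store rejects null metadata values, so drop them.
    metadata = {k: v for k, v in record["metadata"].items() if v is not None}
    return {"id": record["id"], "values": embed(record["text"]), "metadata": metadata}


def load_upserts(jsonl: str) -> list[dict]:
    """Parse JSONL text (one record per line) into upsert payloads."""
    return [to_upsert(json.loads(line)) for line in jsonl.splitlines() if line.strip()]
```

The resulting list can be handed to whatever batch-upsert call your vector database client provides.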
Output (langchain)
One record per chunk, drop-in for LangChain — [Document(**r) for r in dataset]:
{"page_content": "## Installation\n\nInstall via npm...","metadata": {"id": "a1f3c9e29b2c4d10","source": "https://docs.example.com/guide/install","title": "Getting Started — Example Docs","deep_link": "https://docs.example.com/guide/install#installation","canonical_url": "https://docs.example.com/guide/install","headings_path": ["Getting Started", "Setup", "Installation"],"chunk_index": 3,"chunk_total": 11,"content_type": "mixed","language": "en","token_estimate": 498,"char_count": 2104,"content_hash": "sha256:...","last_modified": null,"crawl_timestamp": "2026-06-20T14:02:11Z"}}
Output (jsonl-bulk)
Fully flat one-record-per-chunk for generic bulk import (DB COPY / pgvector):
{"id": "a1f3c9e29b2c4d10","text": "## Installation\n\nInstall via npm...","source_url": "https://docs.example.com/guide/install","deep_link": "https://docs.example.com/guide/install#installation","canonical_url": "https://docs.example.com/guide/install","title": "Getting Started — Example Docs","headings_path": "Getting Started > Setup > Installation","chunk_index": 3,"chunk_total": 11,"content_type": "mixed","language": "en","token_estimate": 498,"char_count": 2104,"content_hash": "sha256:...","last_modified": null,"crawl_timestamp": "2026-06-20T14:02:11Z"}
Input
See .actor/input_schema.json. Key fields: startUrls, crawlScope,
maxCrawlDepth, maxPages, renderJs, outputFormat, chunkSize,
chunkOverlap, dedupNearDuplicates, filterJunkChunks, and the incremental
sync fields incremental, forceFullCrawl, stateStoreName, emitDeletions.
Pricing
Pay-Per-Event. Billable unit = one emitted dataset item. Deduped and junk-filtered chunks are not billed.
| Event | Price |
|---|---|
| Per chunk emitted (chunks-jsonl) | $0.0008 / chunk ($0.80 / 1,000) |
| Per page emitted (markdown) | $0.002 / page ($2.00 / 1,000) |
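For budgeting, the table reduces to simple arithmetic. The Python sketch below uses only the two prices listed above; `chunks_per_page` is a hypothetical planning figure (the actual number depends on page length and your chunk size), and the helper names are illustrative.

```python
# Prices per emitted item, taken from the table above.
PRICES = {"chunks-jsonl": 0.0008, "markdown": 0.002}


def estimate_cost(items: int, output_format: str = "chunks-jsonl") -> float:
    """Cost in USD for a number of emitted (billed) dataset items."""
    if output_format not in PRICES:
        raise ValueError(f"no price listed for {output_format!r}")
    return round(items * PRICES[output_format], 4)


def budget_estimate(max_pages: int, output_format: str = "chunks-jsonl",
                    chunks_per_page: int = 11) -> float:
    """Estimate for a crawl capped at max_pages.

    For 'markdown' this is a true ceiling (one item per page). For
    'chunks-jsonl' it depends on the assumed chunks_per_page.
    """
    items = max_pages if output_format == "markdown" else max_pages * chunks_per_page
    return estimate_cost(items, output_format)
```

For example, a 200-page crawl at an assumed 11 chunks per page is 2,200 chunks, or about $1.76; the same crawl in `markdown` format is capped at $0.40.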
Run locally
```bash
npm install
npm run build
apify run   # reads .actor/INPUT.json
```
Roadmap (v1.2+)
Inline embeddings, direct vector-DB push (Pinecone/Qdrant/Weaviate/pgvector),
missedGraceRuns before tombstoning, Standby low-latency mode.