All notable changes to AI / RAG Web Crawler are documented here.
The format is based on Keep a Changelog ,
and this project adheres to Semantic Versioning .
- Initial release.
- HTTP crawler (Crawlee
CheerioCrawler) with link discovery, depth + page limits, same-domain crawling, include/exclude URL globs.
- Main-content extraction: strips nav/header/footer/sidebar/scripts/ads, prefers
<article>/<main>, converts to clean Markdown (node-html-markdown — no headless browser).
- RAG chunking: paragraph-aware, overlapping chunks sized for embeddings; one dataset row per chunk, or one row per page when chunking is off.
- Per-page error isolation, configurable concurrency, optional Apify Proxy, optional cleaned-HTML output.
- Outputs: chunk rows in the dataset + SUMMARY key-value record.
- Apify input / dataset / output schemas (all validated), README, and Store listing copy.
- Vitest suite covering boilerplate stripping, Markdown conversion, link preservation, and chunking (size, overlap, metadata).