Sitemap to Changed-Only RAG JSONL
Pricing
$1.00 / 1,000 sitemap pages processed
Crawl sitemap.xml files and emit only added, changed, or deleted Markdown/JSONL chunks for cheaper RAG reindexing.
Developer
Orbiscribe Labs
Use this Actor when you already have a public sitemap and need to know what changed before reindexing a knowledge base, search index, or chatbot corpus.
It fetches sitemap URLs, follows sitemap indexes, extracts public HTML pages, hashes readable content, and compares pages against the previous snapshot stored in Apify key-value storage.
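The hash-and-compare step can be pictured with a minimal sketch. This is not the Actor's implementation: the whitespace normalization, the SHA-256 choice, the function names, and the `{url: hash}` snapshot shape are all assumptions made for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash whitespace-normalized text so trivial reflows do not count as changes.

    Normalization and SHA-256 are illustrative assumptions, not the Actor's method.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {url: hash} snapshots and bucket URLs by change type."""
    result: dict[str, list[str]] = {"added": [], "changed": [], "unchanged": [], "deleted": []}
    for url, digest in current.items():
        if url not in previous:
            result["added"].append(url)
        elif previous[url] != digest:
            result["changed"].append(url)
        else:
            result["unchanged"].append(url)
    # Anything in the old snapshot that the new crawl did not see is treated as deleted.
    result["deleted"] = [url for url in previous if url not in current]
    return result


# Hypothetical snapshots; example.com URLs are placeholders.
previous = {
    "https://example.com/docs/a": content_hash("Install the CLI."),
    "https://example.com/docs/b": content_hash("Old pricing text."),
    "https://example.com/docs/c": content_hash("Removed page."),
}
current = {
    "https://example.com/docs/a": content_hash("Install   the CLI."),
    "https://example.com/docs/b": content_hash("New pricing text."),
    "https://example.com/docs/d": content_hash("Brand new page."),
}
delta = diff_snapshots(previous, current)
```

Only the URLs in `delta["added"]`, `delta["changed"]`, and `delta["deleted"]` would need work in a downstream index; the rest can be skipped.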
Why use this instead of a generic crawler?
Generic crawlers are useful for the first crawl. This Actor is for the second, third, and hundredth crawl, when you do not want to re-embed unchanged pages.
- paste a sitemap.xml
- keep only useful docs paths with includeUrlPatterns
- schedule repeat runs
- emit added, changed, and deleted records
- succeed with a no-change summary when scheduled runs find nothing new
- leave includeUnchanged off for changed-only reindexing
- export RAG_DELTA_CHUNKS_JSONL directly to your vector pipeline
- pay only for processed sitemap pages, with a small live default run
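To make the first two steps concrete, here is a rough sketch of reading a sitemap and narrowing it with include and exclude patterns. It assumes patterns are plain substrings of the URL, which this page does not confirm (the Actor may use globs or regular expressions). The helper names and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" or "urlset", [loc, ...]) for one sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    return kind, locs


def filter_urls(urls, include_patterns=(), exclude_patterns=()):
    """Keep URLs matching any include pattern and no exclude pattern.

    Substring matching is an assumption made for this sketch.
    """
    kept = []
    for url in urls:
        if include_patterns and not any(p in url for p in include_patterns):
            continue
        if any(p in url for p in exclude_patterns):
            continue
        kept.append(url)
    return kept


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/api/auth</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""

kind, locs = extract_locs(SAMPLE)
kept = filter_urls(locs, include_patterns=["/docs/"], exclude_patterns=["/api/"])
```

When `kind` is "index", each returned location is itself a sitemap to fetch and parse in turn, which is what following sitemap indexes amounts to.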
What you get
- Dataset rows for sitemap pages and chunks.
- changeType values: added, changed, unchanged, deleted, and no_changes.
- Clean Markdown, main text, headings, canonical URL, content hash, and source sitemap metadata.
- Key-value outputs: RAG_CHUNKS_JSONL, RAG_DELTA_CHUNKS_JSONL, DOCUMENTS_JSONL, URL_INVENTORY, CHANGE_SUMMARY, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.
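As a sketch of how a consumer might act on the delta output, the snippet below turns JSONL records into an upsert/delete plan for a vector index. Only the changeType field and its values come from this page; the `url` field name and the sample records are assumptions, so check a real export for the actual schema.

```python
import json


def plan_index_updates(jsonl_text: str) -> dict[str, list[str]]:
    """Group record URLs into upsert and delete work based on changeType.

    The "url" key is a hypothetical field name used for illustration.
    """
    plan: dict[str, list[str]] = {"upsert": [], "delete": []}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        change = record.get("changeType")
        if change in ("added", "changed"):
            plan["upsert"].append(record["url"])
        elif change == "deleted":
            plan["delete"].append(record["url"])
        # "unchanged" and "no_changes" rows require no index work.
    return plan


# Hypothetical records standing in for a RAG_DELTA_CHUNKS_JSONL export.
sample = "\n".join(
    [
        json.dumps({"changeType": "added", "url": "https://example.com/docs/new"}),
        json.dumps({"changeType": "changed", "url": "https://example.com/docs/edited"}),
        json.dumps({"changeType": "deleted", "url": "https://example.com/docs/gone"}),
        json.dumps({"changeType": "no_changes"}),
    ]
)
plan = plan_index_updates(sample)
```

A pipeline would then re-embed the `upsert` entries and remove vectors for the `delete` entries, leaving everything else in the index untouched.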
Common workflows
- Reindex only changed pages from a documentation or marketing site.
- Schedule weekly sitemap diffs and send changed records to a webhook.
- Keep an auditable URL inventory with hashes and source sitemap provenance.
- Export JSONL for a vector database without reshaping Apify dataset rows.
Input
Start with one or more sitemapUrls. Turn on compareToPreviousRun for scheduled delta runs. Keep includeUnchanged off when you only want records that require action.
The default input runs a tiny live Apify docs sitemap sample and filters to the Actor marketing playbook path, so the first run shows real content instead of a site-wide URL inventory:
{
  "sitemapUrls": ["https://docs.apify.com/sitemap_base.xml"],
  "includeUrlPatterns": ["/academy/actor-marketing-playbook/"],
  "excludeUrlPatterns": [],
  "compareToPreviousRun": true,
  "includeUnchanged": false,
  "maxPages": 5,
  "dryRun": false
}
Use dryRun: true when you want bundled demo records without crawling live pages or triggering custom pay-per-event charges.
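If you generate inputs from code, a small helper along these lines keeps the dry-run and scheduled variants consistent. The field names are taken from the default input; the `build_run_input` helper itself is hypothetical and not part of the Actor or of any Apify library.

```python
import json


def build_run_input(sitemap_urls, include_patterns=(), *, dry_run=False, max_pages=5):
    """Assemble an input object with the fields shown in the default input."""
    if not sitemap_urls:
        raise ValueError("at least one sitemap URL is required")
    return {
        "sitemapUrls": list(sitemap_urls),
        "includeUrlPatterns": list(include_patterns),
        "excludeUrlPatterns": [],
        "compareToPreviousRun": True,  # needed for scheduled delta runs
        "includeUnchanged": False,     # changed-only output
        "maxPages": max_pages,
        "dryRun": dry_run,
    }


# A dry-run variant of the default input, printed as JSON for pasting into the input form.
demo_input = build_run_input(
    ["https://docs.apify.com/sitemap_base.xml"],
    ["/academy/actor-marketing-playbook/"],
    dry_run=True,
)
print(json.dumps(demo_input, indent=2))
```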
Pricing
Recommended monetization: Pay per Event at $0.001 per sitemap-rag-page.
That is $1 per 1,000 processed sitemap pages, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large sitemap crawls.
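For budgeting, the custom event charge can be estimated with simple arithmetic. The sketch below assumes the 25 free processed sources apply once to the run being estimated, which this page does not spell out, and it excludes normal Apify platform usage. The function and constant names are illustrative.

```python
PRICE_PER_PAGE_USD = 0.001   # $0.001 per sitemap-rag-page event
FREE_PLAN_FREE_PAGES = 25    # first 25 processed sources for free-plan callers


def estimate_event_cost(pages: int, free_plan: bool = False, dry_run: bool = False) -> float:
    """Estimate the custom pay-per-event charge in USD (platform usage excluded)."""
    if dry_run:
        return 0.0  # dry runs are uncharged
    billable = max(0, pages - FREE_PLAN_FREE_PAGES) if free_plan else pages
    return round(billable * PRICE_PER_PAGE_USD, 3)


paid_cost = estimate_event_cost(1000)                  # 1,000 billable pages
free_plan_cost = estimate_event_cost(1000, free_plan=True)  # 975 billable pages
dry_cost = estimate_event_cost(1000, dry_run=True)
```

Setting maxPages to the page count you budgeted for is a simple way to keep a run within the estimate.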
Limits and compliance
Public pages only. This Actor does not bypass logins, paywalls, robots policies, or access controls. Extraction quality depends on the page structure, so run a small crawl before scheduling a large site.