Sitemap to Changed-Only RAG JSONL

Pricing

$1.00 / 1,000 processed sitemap pages

Crawl sitemap.xml files and emit only added, changed, or deleted Markdown/JSONL chunks for cheaper RAG reindexing.

Developer

Orbiscribe Labs

Maintained by Community

Use this Actor when you already have a public sitemap and need to know what changed before reindexing a knowledge base, search index, or chatbot corpus.

It fetches sitemap URLs, follows sitemap indexes, extracts public HTML pages, hashes readable content, and compares pages against the previous snapshot stored in Apify key-value storage.
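
The hash-and-compare step described above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual implementation: the SHA-256 digest, the whitespace normalization, and the helper names `content_hash` and `diff_snapshots` are all assumptions made for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash whitespace-normalized text so spacing-only edits do not count as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Classify every URL as added, changed, unchanged, or deleted.

    Both arguments map URL -> content hash (hypothetical snapshot shape).
    """
    changes = {}
    for url, digest in current.items():
        if url not in previous:
            changes[url] = "added"
        elif previous[url] != digest:
            changes[url] = "changed"
        else:
            changes[url] = "unchanged"
    for url in previous:
        if url not in current:
            changes[url] = "deleted"
    return changes


# Snapshot from the previous run versus the pages seen in this run.
previous = {
    "https://example.com/a": content_hash("Hello world"),
    "https://example.com/b": content_hash("Old text"),
    "https://example.com/c": content_hash("Going away"),
}
current = {
    "https://example.com/a": content_hash("Hello   world"),
    "https://example.com/b": content_hash("New text"),
    "https://example.com/d": content_hash("Brand new"),
}
changes = diff_snapshots(previous, current)
```

Comparing hashes rather than full page text keeps the stored snapshot small, which is presumably why a key-value record is enough to carry state between scheduled runs.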

Why use this instead of a generic crawler?

Generic crawlers are useful for the first crawl. This Actor is for the second, third, and hundredth crawl, when you do not want to re-embed unchanged pages.

  • paste a sitemap.xml
  • keep only useful docs paths with includeUrlPatterns
  • schedule repeat runs
  • emit added, changed, and deleted records
  • succeed with a no-change summary when scheduled runs find nothing new
  • leave includeUnchanged off for changed-only reindexing
  • export RAG_DELTA_CHUNKS_JSONL directly to your vector pipeline
  • pay only for processed sitemap pages, starting with a small live default run
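
As a rough picture of the first two steps in the list above, the sketch below pulls URLs out of a sitemap document and applies include and exclude patterns. It assumes patterns are matched as plain substrings, which this page does not confirm; the Actor may use globs or regular expressions instead. The helper names and the sample sitemap are hypothetical, and sitemap-index recursion and HTTP fetching are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]


def filter_urls(urls, include, exclude, max_pages):
    """Keep URLs matching any include pattern and no exclude pattern.

    Assumption: patterns are substrings; an empty include list keeps everything.
    """
    kept = []
    for url in urls:
        if include and not any(pattern in url for pattern in include):
            continue
        if any(pattern in url for pattern in exclude):
            continue
        kept.append(url)
    return kept[:max_pages]


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/legacy/old</loc></url>
  <url><loc>https://example.com/blog/post</loc></url>
</urlset>"""

all_urls = extract_urls(SAMPLE_SITEMAP)
selected = filter_urls(all_urls, include=["/docs/"], exclude=["/legacy/"], max_pages=5)
```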

What you get

  • Dataset rows for sitemap pages and chunks.
  • changeType values: added, changed, unchanged, deleted, and no_changes.
  • Clean Markdown, main text, headings, canonical URL, content hash, and source sitemap metadata.
  • Key-value outputs: RAG_CHUNKS_JSONL, RAG_DELTA_CHUNKS_JSONL, DOCUMENTS_JSONL, URL_INVENTORY, CHANGE_SUMMARY, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.

Common workflows

  • Reindex only changed pages from a documentation or marketing site.
  • Schedule weekly sitemap diffs and send changed records to a webhook.
  • Keep an auditable URL inventory with hashes and source sitemap provenance.
  • Export JSONL for a vector database without reshaping Apify dataset rows.
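
For the reindexing workflows above, a downstream consumer might split the delta JSONL into upserts and deletions before touching the vector store. The sketch below is one possible shape. Only the `changeType` field and its values come from this page; the `url` and `text` field names and the `plan_index_updates` helper are assumptions for illustration.

```python
import json


def plan_index_updates(jsonl_text: str) -> tuple[list[dict], list[str]]:
    """Split delta records into chunks to (re)embed and URLs to remove."""
    upserts, deletes = [], []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        change = record.get("changeType")
        if change in ("added", "changed"):
            upserts.append(record)
        elif change == "deleted":
            deletes.append(record["url"])  # "url" is an assumed field name
        # "unchanged" and "no_changes" rows need no index action.
    return upserts, deletes


# Hypothetical delta export with one row per change type.
sample_jsonl = "\n".join(
    json.dumps(row)
    for row in [
        {"changeType": "added", "url": "https://example.com/docs/new", "text": "New page"},
        {"changeType": "changed", "url": "https://example.com/docs/intro", "text": "Updated"},
        {"changeType": "deleted", "url": "https://example.com/docs/old"},
        {"changeType": "unchanged", "url": "https://example.com/docs/same", "text": "Same"},
    ]
)
upserts, deletes = plan_index_updates(sample_jsonl)
```

Handling deletions separately matters because a vector index that only ever receives upserts will keep serving chunks from pages that no longer exist.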

Input

Start with one or more sitemapUrls. Turn on compareToPreviousRun for scheduled delta runs. Keep includeUnchanged off when you only want records that require action.

The default input crawls a small live sample from the Apify docs sitemap, filtered to the Actor marketing playbook path, so the first run shows real content instead of a site-wide URL inventory:

{
  "sitemapUrls": ["https://docs.apify.com/sitemap_base.xml"],
  "includeUrlPatterns": ["/academy/actor-marketing-playbook/"],
  "excludeUrlPatterns": [],
  "compareToPreviousRun": true,
  "includeUnchanged": false,
  "maxPages": 5,
  "dryRun": false
}
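
If you start runs from code rather than the Apify Console, the same input could be passed through the `apify-client` Python package along these lines. This is a sketch under several assumptions: the Actor ID is not given on this page, so it stays a caller-supplied placeholder; the delta JSONL is assumed to be written to the run's default key-value store; and the helper names are hypothetical.

```python
def build_run_input(sitemap_urls, include_patterns=None, max_pages=5):
    """Build an input object with the fields shown in the default input."""
    return {
        "sitemapUrls": list(sitemap_urls),
        "includeUrlPatterns": list(include_patterns or []),
        "excludeUrlPatterns": [],
        "compareToPreviousRun": True,
        "includeUnchanged": False,
        "maxPages": max_pages,
        "dryRun": False,
    }


def run_and_fetch_delta(token: str, actor_id: str, run_input: dict) -> str:
    """Run the Actor and return the RAG_DELTA_CHUNKS_JSONL record as text."""
    # Imported lazily so build_run_input works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not finish")
    # Assumption: outputs live in the run's default key-value store.
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("RAG_DELTA_CHUNKS_JSONL")
    if record is None:
        return ""
    value = record["value"]
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, str):
        return value
    raise TypeError(f"Unexpected record value type: {type(value).__name__}")


run_input = build_run_input(
    ["https://docs.apify.com/sitemap_base.xml"],
    include_patterns=["/academy/actor-marketing-playbook/"],
)
```

Calling `run_and_fetch_delta(token, actor_id, run_input)` with your own API token and this Actor's ID from its store page would then return the delta JSONL as a string, provided the storage assumption above holds.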

Use dryRun: true when you want bundled demo records without crawling live pages or incurring custom pay-per-event charges.

Pricing

Recommended monetization: Pay per Event at $0.001 per sitemap-rag-page.

That is $1 per 1,000 processed sitemap pages, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large sitemap crawls.
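
To budget a crawl, the custom event charge can be estimated with simple arithmetic. The sketch below assumes the free-plan allowance of 25 is counted in pages and applied per estimate; the page does not say whether it resets per run, so treat that as an open question. Normal Apify platform usage is billed separately and is not included. The function and constant names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.001")  # $0.001 per sitemap-rag-page event
FREE_PLAN_ALLOWANCE = 25  # assumed to be counted in processed pages


def estimate_event_charge(pages: int, free_plan: bool = False, dry_run: bool = False) -> Decimal:
    """Estimate the custom pay-per-event charge in USD, excluding platform usage."""
    if dry_run:
        return Decimal("0")  # dry runs are uncharged
    billable = max(pages - FREE_PLAN_ALLOWANCE, 0) if free_plan else pages
    return billable * PRICE_PER_PAGE


paid_cost = estimate_event_charge(1000)
free_plan_cost = estimate_event_charge(1000, free_plan=True)
small_free_run = estimate_event_charge(5, free_plan=True)
```

Under these assumptions, 1,000 pages cost $1.00 on a paid plan and $0.975 on the free plan, while the default 5-page run costs nothing in event charges for a free-plan caller.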

Limits and compliance

Public pages only. This Actor does not bypass logins, paywalls, robots policies, or access controls. Extraction quality depends on the page structure, so run a small crawl before scheduling a large site.