Actor Website Change Monitor

Pricing

Pay per usage

Developer

Egor Kaleynik

Maintained by Community

Last modified: 3 days ago

Website Change Detection & Monitoring (Apify Actor)

A stateful website-monitoring actor designed for long-running scheduled (cron) use. Unlike one-shot extractors, it stores URL history, compares snapshots between runs, and emits change events.

Modes

  • monitor (Mode A): the user provides an explicit URL list (startUrls) and the actor watches those URLs for changes.
  • discover (Mode B): the actor discovers URLs from a domain (or a category URL seed) and then monitors them.
  • promo-detect (Mode C): the actor generates promo URL candidates from a pattern library, probes them in batch, then monitors the pages that are live.

Amazon rule:

  • platform=amazon + mode=promo-detect is blocked by input validation.

Scheduler Guidance

  • Default: every 24h
  • Recommended: every 12h
  • Hard minimum: 6h
  • Maximum interval: 168h (7 days)

Input guardrail:

  • checkIntervalHours must be between 6 and 168 (inclusive).
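
The Amazon rule and the interval guardrail could be enforced roughly as follows. This is a minimal sketch, not the actor's actual validation code: the `MonitorInput` shape, the `platform` field name, and the error messages are assumptions made for illustration.

```typescript
// Hypothetical input shape: only the fields the two guardrails need.
type Mode = "monitor" | "discover" | "promo-detect";

interface MonitorInput {
  mode: Mode;
  platform?: string;           // assumed field name, e.g. "amazon" or "shopify"
  checkIntervalHours?: number; // falls back to the documented 24h default
}

const MIN_INTERVAL_HOURS = 6;
const MAX_INTERVAL_HOURS = 168;
const DEFAULT_INTERVAL_HOURS = 24;

// Returns human-readable validation errors; an empty array means the input is valid.
function validateInput(input: MonitorInput): string[] {
  const errors: string[] = [];
  const interval = input.checkIntervalHours ?? DEFAULT_INTERVAL_HOURS;

  if (
    !Number.isFinite(interval) ||
    interval < MIN_INTERVAL_HOURS ||
    interval > MAX_INTERVAL_HOURS
  ) {
    errors.push(
      `checkIntervalHours must be between ${MIN_INTERVAL_HOURS} and ${MAX_INTERVAL_HOURS}`
    );
  }
  if (input.platform === "amazon" && input.mode === "promo-detect") {
    errors.push("mode=promo-detect is not allowed with platform=amazon");
  }
  return errors;
}
```

Collecting all errors instead of throwing on the first one lets a single failed run report every problem with the input at once.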

Tiering Model

  • Tier 1: checked every run.
  • Tier 2: checked on modulo rotation (runCounter % N === tierOffset), where N=tier2Modulo.
  • Tier 3: baseline capture tier (new URLs).
  • Quarantine: after 5 consecutive fetch errors.

Transitions:

  • Tier 3 -> Tier 2 after baseline captured.
  • Tier 2 -> Tier 1 after detected change.
  • Tier 1 -> Tier 2 after 3 stable runs.
  • Any -> Quarantine after 5 consecutive errors.
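
The tier rules and transitions above can be read as a small state machine. The sketch below is one possible encoding under stated assumptions: the `UrlState` fields, the `CheckOutcome` labels, and the choice to skip quarantined URLs entirely are illustrative, not taken from the actor's source.

```typescript
type Tier = "tier1" | "tier2" | "tier3" | "quarantine";
type CheckOutcome = "baseline" | "changed" | "stable" | "error";

// Hypothetical per-URL registry entry.
interface UrlState {
  tier: Tier;
  tierOffset: number;        // slot in the Tier 2 rotation, 0..tier2Modulo-1
  stableRuns: number;        // consecutive runs without a detected change
  consecutiveErrors: number; // consecutive fetch errors
}

const STABLE_RUNS_TO_DEMOTE = 3;
const ERRORS_TO_QUARANTINE = 5;

// Decides whether a URL should be checked in the current run.
function isDue(state: UrlState, runCounter: number, tier2Modulo: number): boolean {
  switch (state.tier) {
    case "tier1":
      return true;
    case "tier2":
      return runCounter % tier2Modulo === state.tierOffset;
    case "tier3":
      return true;  // assumption: new URLs are fetched right away for a baseline
    case "quarantine":
      return false; // assumption: quarantined URLs are not checked
  }
}

// Applies the documented transitions and returns the next state.
function nextState(state: UrlState, outcome: CheckOutcome): UrlState {
  if (outcome === "error") {
    const consecutiveErrors = state.consecutiveErrors + 1;
    const tier: Tier =
      consecutiveErrors >= ERRORS_TO_QUARANTINE ? "quarantine" : state.tier;
    return { ...state, tier, consecutiveErrors };
  }
  const base: UrlState = { ...state, consecutiveErrors: 0 };
  if (state.tier === "tier3" && outcome === "baseline") {
    return { ...base, tier: "tier2", stableRuns: 0 };
  }
  if (state.tier === "tier2" && outcome === "changed") {
    return { ...base, tier: "tier1", stableRuns: 0 };
  }
  if (state.tier === "tier1" && outcome === "stable") {
    const stableRuns = state.stableRuns + 1;
    return stableRuns >= STABLE_RUNS_TO_DEMOTE
      ? { ...base, tier: "tier2", stableRuns: 0 }
      : { ...base, stableRuns };
  }
  if (outcome === "changed") {
    return { ...base, stableRuns: 0 }; // a change resets the stability streak
  }
  return base;
}
```

With tier2Modulo=3, a Tier 2 URL whose tierOffset is 1 would be checked on runs 1, 4, 7, and so on, which is how the rotation spreads Tier 2 load across runs.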

Platform Support

  • Detection: Shopify, WooCommerce, Magento, Amazon, BigCommerce, PrestaShop, Generic.
  • Platform extraction: selector packs for Shopify, WooCommerce, Magento, and Amazon, plus JSON/API-like parsing, with JSON-LD as a fallback.

Discovery Strategies

Mode discover

  1. Sitemap recursion (/sitemap.xml, /sitemap_index.xml, robots.txt references)
  2. Shopify API pagination (/products.json?limit=250&since_id=...) for Shopify platforms
  3. Category/listing crawl fallback (depth and page-limit configurable)
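
As an illustration of strategy 2, the sketch below pages through /products.json using since_id. It is a hedged example rather than the actor's implementation: the `fetchJson` callback is a hypothetical stand-in for the real HTTP client, and the code assumes each page is ordered by ascending product id so the last id can serve as the next cursor.

```typescript
// Subset of the storefront response used here.
interface ShopifyProduct {
  id: number;
  handle: string;
}

// Hypothetical JSON fetcher, injected so the sketch can run against a stub.
type JsonFetcher = (url: string) => Promise<{ products: ShopifyProduct[] }>;

// Collects product page URLs until the feed is exhausted or maxUrls is reached.
async function listShopifyProductUrls(
  origin: string,
  fetchJson: JsonFetcher,
  maxUrls: number
): Promise<string[]> {
  const urls: string[] = [];
  let sinceId = 0;

  while (urls.length < maxUrls) {
    const page = await fetchJson(
      `${origin}/products.json?limit=250&since_id=${sinceId}`
    );
    if (page.products.length === 0) break; // no more products

    for (const product of page.products) {
      if (urls.length >= maxUrls) break;
      urls.push(`${origin}/products/${product.handle}`);
    }

    const lastId = page.products[page.products.length - 1].id;
    if (lastId <= sinceId) break; // cursor did not advance; avoid looping forever
    sinceId = lastId;
  }
  return urls;
}
```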

Amazon discover

  • Strategy 1: amazonAsins list (build /dp/{ASIN} URLs)
  • Strategy 2: keyword search crawl (/s), parse data-asin
  • Strategy 3: browse node crawl (/b?node=...), parse data-asin
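
The three Amazon strategies reduce to two small operations: turning ASINs into /dp/ URLs and pulling data-asin values out of listing HTML. The following is a minimal sketch; the default origin and the regex-based extraction are assumptions (the actor may well use an HTML parser instead).

```typescript
// Strategy 1: build a product detail URL from an ASIN.
// The default origin is an assumption; other marketplaces use other hosts.
function asinToUrl(asin: string, origin: string = "https://www.amazon.com"): string {
  return `${origin}/dp/${encodeURIComponent(asin)}`;
}

// Strategies 2 and 3: extract unique data-asin values from search (/s) or
// browse-node (/b?node=...) HTML. Empty data-asin attributes are skipped
// because the pattern requires a 10-character alphanumeric value.
function extractAsins(html: string): string[] {
  const pattern = /data-asin="([A-Z0-9]{10})"/g;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    seen.add(match[1]);
  }
  return Array.from(seen);
}
```

Deduplicating through a Set matters here because listing pages often repeat the same product in several widgets.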

Mode C Probe Engine

Batch probe engine (before registry merge):

  • Concurrency control (probeConcurrency)
  • Per-domain throttle (probeDelayMs)
  • HEAD-first probing with GET fallback for HEAD-blocking responses
  • Redirect capture (MODE_C_REDIRECTS KV key)
  • Probe lifecycle events:
    • page_appeared for newly live probed URLs
    • page_disappeared for previously known URLs that fail probe
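
A probe engine with these properties might look like the sketch below. It is illustrative only: the `Fetcher` callback is a hypothetical stand-in for the HTTP client, the list of statuses treated as "HEAD blocked" and the rule that 2xx/3xx means live are assumptions, and the per-domain delay (probeDelayMs) and redirect capture are left out for brevity.

```typescript
type Method = "HEAD" | "GET";

// Hypothetical fetcher, injected so the sketch can run against a stub.
type Fetcher = (url: string, method: Method) => Promise<{ status: number }>;

interface ProbeResult {
  url: string;
  live: boolean;
  method: Method; // which request produced the verdict
}

// Assumption: these statuses mean "HEAD is refused", not "page is missing".
const HEAD_BLOCKED_STATUSES = [403, 405, 501];

const isLiveStatus = (status: number): boolean => status >= 200 && status < 400;

// HEAD first; repeat as GET only when the server appears to block HEAD.
async function probe(url: string, fetcher: Fetcher): Promise<ProbeResult> {
  const head = await fetcher(url, "HEAD");
  if (!HEAD_BLOCKED_STATUSES.includes(head.status)) {
    return { url, live: isLiveStatus(head.status), method: "HEAD" };
  }
  const get = await fetcher(url, "GET");
  return { url, live: isLiveStatus(get.status), method: "GET" };
}

// Runs probes with at most `probeConcurrency` requests in flight.
async function probeAll(
  urls: string[],
  fetcher: Fetcher,
  probeConcurrency: number
): Promise<ProbeResult[]> {
  const results: ProbeResult[] = new Array(urls.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < urls.length) {
      const index = next++;
      results[index] = await probe(urls[index], fetcher);
    }
  };
  const workers = Math.max(1, Math.min(probeConcurrency, urls.length));
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}

type LifecycleEvent = { type: "page_appeared" | "page_disappeared"; url: string };

// Compares probe results with the set of URLs previously known to be live.
function lifecycleEvents(knownLive: Set<string>, results: ProbeResult[]): LifecycleEvent[] {
  const events: LifecycleEvent[] = [];
  for (const result of results) {
    const wasLive = knownLive.has(result.url);
    if (result.live && !wasLive) events.push({ type: "page_appeared", url: result.url });
    if (!result.live && wasLive) events.push({ type: "page_disappeared", url: result.url });
  }
  return events;
}
```

Each worker pulls the next index from a shared counter, so results keep the input order regardless of which probe finishes first.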

Localization:

  • Promo URL candidates include static locales and dynamic prefixes extracted from homepage hreflang links.
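
One way to derive the dynamic prefixes is to take the path of every hreflang alternate on the homepage and combine it with the promo patterns. The sketch below assumes exactly that; the function names, the regex-based tag scan, and the example patterns are hypothetical, and the static locale list is not covered.

```typescript
// Collects locale path prefixes (e.g. "/de", "/pl/pl") from
// <link ... hreflang="..." href="..."> tags in the homepage HTML.
function extractLocalePrefixes(homepageHtml: string, homepageUrl: string): string[] {
  const prefixes = new Set<string>();
  const linkTags = homepageHtml.match(/<link\b[^>]*>/gi) ?? [];

  for (const tag of linkTags) {
    if (!/\bhreflang\s*=/i.test(tag)) continue; // not a language alternate
    const href = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag);
    if (!href) continue;
    try {
      // Resolve relative hrefs against the homepage, then drop trailing slashes.
      const path = new URL(href[1], homepageUrl).pathname.replace(/\/+$/, "");
      if (path !== "") prefixes.add(path); // "" is the unprefixed root
    } catch {
      // Ignore malformed href values.
    }
  }
  return Array.from(prefixes);
}

// Expands each pattern for the unprefixed root and for every locale prefix.
function buildPromoCandidates(origin: string, prefixes: string[], patterns: string[]): string[] {
  const candidates: string[] = [];
  for (const prefix of ["", ...prefixes]) {
    for (const pattern of patterns) {
      candidates.push(`${origin}${prefix}${pattern}`);
    }
  }
  return candidates;
}
```

For example, prefixes `/de` and `/pl/pl` combined with a hypothetical pattern `/sale` would yield three candidates: the root `/sale`, `/de/sale`, and `/pl/pl/sale`.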

Amazon Hardening

Implemented safeguards:

  • Forces JS rendering (renderJs=true semantics)
  • Randomized delay profile (3-8s)
  • Rotating desktop user agents
  • Session cookies persisted in-run
  • Residential proxy enforcement (proxyConfig.apifyProxyGroups must include RESIDENTIAL)
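
Two of these safeguards are easy to show in isolation: the residential proxy check and the randomized delay. The sketch below is an assumption-laden illustration; the function names and error text are invented, and the injectable `random` parameter exists only to make the delay testable.

```typescript
// Subset of the Apify proxy input object used by the check.
interface ProxyConfig {
  useApifyProxy?: boolean;
  apifyProxyGroups?: string[];
}

// Throws unless the RESIDENTIAL group is configured.
function assertResidentialProxy(proxyConfig: ProxyConfig | undefined): void {
  const groups = proxyConfig?.apifyProxyGroups ?? [];
  if (!groups.includes("RESIDENTIAL")) {
    throw new Error("Amazon runs require proxyConfig.apifyProxyGroups to include RESIDENTIAL");
  }
}

// Picks a delay uniformly from [minMs, maxMs]; defaults match the 3-8s profile.
function randomDelayMs(
  minMs: number = 3000,
  maxMs: number = 8000,
  random: () => number = Math.random
): number {
  return Math.floor(minMs + random() * (maxMs - minMs + 1));
}

// Waits for a random delay before the next Amazon request.
function politePause(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, randomDelayMs()));
}
```

Failing fast on the proxy check, before any request is made, avoids spending a run on traffic that is likely to be blocked.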

Stateful Storage (KV)

  • urlRegistry
  • urlRegistryByDomain
  • platformCache
  • runCounter
  • failedWebhooks
  • deadWebhooks
  • MODE_C_REDIRECTS

Webhook Reliability

  • Failed sends buffered into failedWebhooks
  • Retry on next run before dispatching new events
  • Retry cap: 20 attempts
  • Exceeded entries moved to deadWebhooks dead-letter queue

Run output includes webhook retry/dead-letter metrics.
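
The retry flow can be sketched as a single pass over a queue in which buffered events go first. This is a hedged illustration: the `Sender` callback, the entry shape, and the metric names are assumptions, and the sketch treats the 20th failed attempt as the point where an entry moves to the dead-letter queue.

```typescript
interface BufferedWebhook {
  url: string;
  payload: unknown;
  attempts: number; // failed delivery attempts so far
}

interface WebhookState {
  failedWebhooks: BufferedWebhook[];
  deadWebhooks: BufferedWebhook[];
}

// Hypothetical sender: resolves to true when the endpoint accepted the event.
type Sender = (url: string, payload: unknown) => Promise<boolean>;

const MAX_ATTEMPTS = 20;

async function dispatchWithRetry(
  state: WebhookState,
  newEvents: { url: string; payload: unknown }[],
  send: Sender
) {
  // Buffered failures are retried before any new event is dispatched.
  const queue: BufferedWebhook[] = [
    ...state.failedWebhooks,
    ...newEvents.map((event) => ({ ...event, attempts: 0 })),
  ];
  const failedWebhooks: BufferedWebhook[] = [];
  const deadWebhooks: BufferedWebhook[] = [...state.deadWebhooks];
  let delivered = 0;

  for (const item of queue) {
    // A thrown or rejected send counts as a failed attempt.
    const ok = await send(item.url, item.payload).catch(() => false);
    if (ok) {
      delivered++;
      continue;
    }
    const attempts = item.attempts + 1;
    if (attempts >= MAX_ATTEMPTS) {
      deadWebhooks.push({ ...item, attempts }); // retry cap reached
    } else {
      failedWebhooks.push({ ...item, attempts }); // keep for the next run
    }
  }

  return {
    state: { failedWebhooks, deadWebhooks },
    metrics: {
      delivered,
      pendingRetry: failedWebhooks.length,
      deadLettered: deadWebhooks.length - state.deadWebhooks.length,
    },
  };
}
```

Because an event may be delivered more than once across runs (for instance when the endpoint processed it but the response was lost), receivers should handle duplicates idempotently.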

Input Highlights

Core:

  • mode, startUrls, domain, platformHint
  • monitorSelectors, changeThreshold
  • maxUrls, tier2Modulo, checkIntervalHours
  • notificationWebhook

Mode B:

  • categoryCrawlDepth, maxCategoryPages

Mode C:

  • probeOnlyMode, probeConcurrency, probeDelayMs
  • userPatterns

Amazon:

  • proxyConfig
  • amazonAsins, amazonSearchKeywords, amazonSearchCategoryId, amazonBrowseNodeId

Scale:

  • checkConcurrency (parallel URL checks)

domain input accepts either:

  • host/domain, e.g. www.reserved.com
  • full URL seed, e.g. https://www.reserved.com/pl/pl/kobieta

When a full URL seed is provided in discover mode, category crawl starts from that URL first.
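
Distinguishing the two accepted forms could be done along these lines. The `DomainSeed` shape and the function name are assumptions for illustration; the sketch treats any value starting with http:// or https:// as a full URL seed.

```typescript
interface DomainSeed {
  host: string;     // host to discover on
  seedUrl?: string; // present only when a full URL was provided
}

// Splits the `domain` input into a host and an optional crawl seed.
function parseDomainInput(domain: string): DomainSeed {
  const trimmed = domain.trim();
  if (/^https?:\/\//i.test(trimmed)) {
    const url = new URL(trimmed); // throws on malformed URLs
    return { host: url.host, seedUrl: url.href };
  }
  // Bare host: drop trailing slashes and normalize case.
  return { host: trimmed.replace(/\/+$/, "").toLowerCase() };
}
```

With the examples above, `www.reserved.com` yields only a host, while `https://www.reserved.com/pl/pl/kobieta` yields the same host plus a seed URL that the category crawl can start from.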

Local Development

Install/build/run:

npm install
npm run build
npm start

Run tests:

npm run test

Smoke inputs are in smoke-tests/.

Deployment Sanity Checklist

  1. npm run build passes.
  2. npm run test passes.
  3. Verify .actor/actor.json points to INPUT_SCHEMA.json and Dockerfile.
  4. For Amazon runs, set proxy group to RESIDENTIAL.
  5. Create Apify Scheduler task with interval >= 6h.
  6. Validate webhook endpoint availability and idempotency handling.

Cost Model Notes

  • HTML fetch path (Cheerio/gotScraping) is the cheap default.
  • JS rendering is more expensive and should be used only where needed.
  • Tiering and modulo rotation are the primary cost controls at scale.
  • Probe-first Mode C avoids expensive full checks for dead URLs.