Scrape totaljobs.com - the UK's largest job board. Salary data, employer contact details, full job descriptions, and job-change monitoring. Incremental mode detects new and changed listings. Compact output for AI agents and MCP workflows.
Unreleased
Critical billing — caught by external review of v1.3.0 before ship
SERP-only Scrape.do path no longer double-bills. With includeDetails: false, Scrape.do pushed every row directly, but scrapedoSerpUsed = pendingDetailJobs.length > 0 evaluated to false because nothing was queued for detail enrichment, so CheerioCrawler ran the same start URLs again, producing duplicate dataset rows and charging twice. The flag now derives from "did Scrape.do successfully parse a SERP page", and the SERP-only push routes through the central pushOutputForJob helper. The regression shipped in v1.3.0, which was held before going live; production stayed on v1.2.x.
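A minimal sketch of the corrected derivation; the helper names (fetchViaScrapedo, parseSerpJobs, runCheerioFallback) are hypothetical stand-ins, not the actor's real identifiers:

```typescript
type SerpJob = { url: string; title: string };

declare function fetchViaScrapedo(url: string): Promise<string>;
declare function parseSerpJobs(html: string): SerpJob[] | null;
declare function pushOutputForJob(job: SerpJob): Promise<void>;
declare function runCheerioFallback(urls: string[]): Promise<void>;

async function crawlSerps(startUrls: string[]): Promise<void> {
  // The flag now means "Scrape.do successfully parsed a SERP page" ...
  let scrapedoSerpParsed = false;
  for (const url of startUrls) {
    const jobs = parseSerpJobs(await fetchViaScrapedo(url));
    if (jobs) {
      scrapedoSerpParsed = true; // true even when zero detail jobs are queued
      for (const job of jobs) await pushOutputForJob(job); // central push helper
    }
  }
  // ... instead of `pendingDetailJobs.length > 0`, which was false on
  // SERP-only runs and let CheerioCrawler re-crawl the same start URLs.
  if (!scrapedoSerpParsed) await runCheerioFallback(startUrls);
}
```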
The push contract is now wired into every engine path (Playwright SERP+detail, Firefox detail, browser-fetch detail, Playwright detail crawler, plus the onDetailFailed SERP-fallback push in main.ts); see the call-site sketch after this list. v1.3.0 only reached routes.ts and Scrape.do, leaving the rest with the same C2/I1/I2 bugs the helper was meant to retire:
markSeen-on-dedup-hit: the local maybePush returned true on a dedup hit, so callers unconditionally called incremental.markSeen for rows that never reached the dataset, poisoning incremental state.
detailsFetched: true for parser failures: a 200 response with no JSON-LD JobPosting now correctly emits detailsFetched: false; previously it was unconditionally true whenever the HTTP response succeeded.
SERP-only rows missing changeType / firstSeenAt / lastSeenAt: the engine paths advertised these lifecycle fields but only Scrape.do (post-v1.3.0) populated them.
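What the corrected call sites now look like, as a hedged sketch (the real pushOutputForJob signature may differ; parseDetailJsonLd matches the name used elsewhere in these notes):

```typescript
declare function parseDetailJsonLd(html: string): Record<string, unknown> | null;
declare function pushOutputForJob(
  job: { id: string },
  opts: { detail: Record<string, unknown> | null },
): Promise<{ pushed: boolean }>;

async function onDetailResponse(job: { id: string }, html: string): Promise<void> {
  const detail = parseDetailJsonLd(html);
  // detailsFetched derives from the parse result, not the HTTP status: a 200
  // with no JSON-LD JobPosting yields detailsFetched: false downstream.
  await pushOutputForJob(job, { detail });
  // No incremental.markSeen here: the helper marks seen only for rows that
  // actually reached the dataset, so dedup hits and cap rejections can no
  // longer poison incremental state. Lifecycle fields (changeType,
  // firstSeenAt, lastSeenAt) are populated inside the helper for every path.
}
```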
Diagnostics
writeFailureDiagnostics now fires on every early-exit branch. Invalid maxResults, invalid startUrl, and state-lock conflict each write a FAILURE_DIAGNOSTICS KV record so consumers can distinguish "ran clean with 0 results" from "never started". The lock-conflict exit happens before the outer try/catch wrapper, so it gets its own write rather than relying on the thrown_error fallback.
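A hedged sketch of the diagnostics write; the record shape and reason strings are assumptions, not the actor's exact schema:

```typescript
import { Actor } from 'apify';

async function writeFailureDiagnostics(reason: string, detail?: unknown): Promise<void> {
  const store = await Actor.openKeyValueStore();
  await store.setValue('FAILURE_DIAGNOSTICS', {
    reason, // e.g. 'invalid_max_results', 'invalid_start_url', 'state_lock_conflict'
    detail: detail ?? null,
    at: new Date().toISOString(),
  });
}

// Early-exit branches call it before exiting, e.g. (illustrative):
// if (!(await acquireStateLock())) {
//   await writeFailureDiagnostics('state_lock_conflict');
//   await Actor.exit();
// }
```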
State store
Incremental state is pruned on save, in two phases: an age-based prune drops entries whose lastSeenAt is older than 90 days, then a soft size budget (8 MB, leaving headroom under Apify's 9 MB per-record limit) triggers oldest-first pruning if the JSON payload still exceeds it. Without this, the v2 timestamp-per-id format would eventually breach the KV limit and start failing every save on long-lived state stores. INCREMENTAL_RETENTION_MS and INCREMENTAL_MAX_BYTES are exported so other actors can tune them.
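A minimal sketch of the two-phase prune, assuming the v2 state shape described under 1.3.0 below (the constants match the changelog; the traversal is illustrative):

```typescript
export const INCREMENTAL_RETENTION_MS = 90 * 24 * 60 * 60 * 1000; // 90 days
export const INCREMENTAL_MAX_BYTES = 8 * 1024 * 1024; // headroom under Apify's 9 MB record cap

type SeenEntry = { firstSeenAt: string; lastSeenAt: string };

function pruneState(seen: Record<string, SeenEntry>, now = Date.now()): Record<string, SeenEntry> {
  // Phase 1: age-based prune. Drop entries not seen within the retention window.
  const entries = Object.entries(seen).filter(
    ([, e]) => now - Date.parse(e.lastSeenAt) <= INCREMENTAL_RETENTION_MS,
  );
  // Phase 2: size budget. If the JSON payload still exceeds the budget,
  // drop oldest-first (by lastSeenAt) until it fits.
  entries.sort(([, a], [, b]) => Date.parse(a.lastSeenAt) - Date.parse(b.lastSeenAt));
  while (
    entries.length > 0 &&
    Buffer.byteLength(JSON.stringify(Object.fromEntries(entries))) > INCREMENTAL_MAX_BYTES
  ) {
    entries.shift(); // remove the oldest entry
  }
  return Object.fromEntries(entries);
}
```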
1.3.0 — 2026-05-04
Critical correctness — push contract
pushOutputForJob helper centralises every dataset write. It owns changeType, firstSeenAt, lastSeenAt, transform, push, dedup, and incremental.markSeen. Callers now go through one function instead of 7 ad-hoc copies, so future fixes don't have to be replicated across engine paths (exactly the failure mode that left the Scrape.do paths stale through v1.1.0 / v1.2.0). A sketch of the helper's core invariant follows below.
markSeen is now strictly post-push. Previously the routes.ts SERP-only and LABEL_DETAIL paths called incremental.markSeen BEFORE attemptPush, so a cap rejection or transient push failure left the job permanently locked out of future incremental runs. The helper now enforces the invariant "markSeen iff push succeeded".
Cap-rejected jobs are no longer marked seen. The SERP onDetailJob callback used to mark cap-overflow jobs as seen ("// cap reached — mark remaining jobs seen"), causing silent data loss across runs. That logic is removed; future runs rediscover those jobs as NEW.
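A sketch of the helper's core invariant; the signature, the dedup/cap hooks, and the changeType derivation shown here are assumptions based on the notes above, not the actor's exact code:

```typescript
import { Actor } from 'apify';

declare const incremental: {
  lookup(id: string): { firstSeenAt: string } | undefined;
  markSeen(id: string, at: string): void;
};
declare function isDuplicate(id: string): boolean;
declare function capReached(): boolean;

async function pushOutputForJob(
  id: string,
  row: Record<string, unknown>,
): Promise<{ pushed: boolean }> {
  // Rejected rows are never marked seen, so future runs can rediscover them.
  if (isDuplicate(id) || capReached()) return { pushed: false };
  const now = new Date().toISOString();
  const prior = incremental.lookup(id);
  await Actor.pushData({
    ...row,
    changeType: prior ? 'CHANGED' : 'NEW', // real logic may also compare contentHash
    firstSeenAt: prior?.firstSeenAt ?? now,
    lastSeenAt: now,
  });
  incremental.markSeen(id, now); // strictly after the dataset write succeeded
  return { pushed: true };
}
```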
State model
IncrementalState v2 KV format. Per-id timestamps {firstSeenAt, lastSeenAt} replace the bare seen-set. Legacy v1 {ids} state migrates automatically on first load by synthesising timestamps from the run-level updatedAt. OutputItem.firstSeenAt / lastSeenAt now carry real cross-run semantics.
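A sketch of the migration, assuming the shapes described above (exact field names in the real store may differ):

```typescript
type StateV1 = { ids: string[]; updatedAt?: string };
type StateV2 = { version: 2; seen: Record<string, { firstSeenAt: string; lastSeenAt: string }> };

function migrateState(raw: StateV1 | StateV2): StateV2 {
  if ('version' in raw) return raw; // already v2
  // v1 carried only a bare seen-set, so timestamps are synthesised from the
  // run-level updatedAt (falling back to "now" if even that is missing).
  const at = raw.updatedAt ?? new Date().toISOString();
  const seen: StateV2['seen'] = {};
  for (const id of raw.ids) seen[id] = { firstSeenAt: at, lastSeenAt: at };
  return { version: 2, seen };
}
```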
State lock TTL bumped 30 min → 90 min. With actor timeoutSecs: 1800 (30 min), the previous TTL was identical to the timeout, so any long-but-legal run could trigger a stale_override race mid-write.
Lock release is compare-and-delete. Previously setValue(lockKey, null) ran unconditionally, so a late-exiting run could nuke a concurrent run's freshly-acquired lock. The release now reads the current lock first and only clears it if runId matches.
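A sketch of the compare-and-delete release (the lock record shape is an assumption):

```typescript
import { Actor } from 'apify';

type LockRecord = { runId: string; acquiredAt: string };

async function releaseStateLock(lockKey: string, myRunId: string): Promise<void> {
  const store = await Actor.openKeyValueStore();
  const current = await store.getValue<LockRecord>(lockKey);
  // Only clear a lock we still own; a late-exiting run must not delete a
  // lock that a concurrent run has just acquired.
  if (current?.runId === myRunId) await store.setValue(lockKey, null);
}
```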
Scrape.do detail engine — bugs from the v1.2.0 audit pass
The "comprehensive correctness pass" of v1.1.0 missed both Scrape.do detail blocks (primary + deferred retry). All three engine-specific bugs are now fixed in those paths:
Currency: parseDetailJsonLd now receives defaultCurrencyForGeo(geo). UK listings were getting EUR salaries; they now correctly get GBP.
contentHash is positionally null-safe: it previously used .filter(Boolean).join('|'), which dropped empty strings and shifted slot positions, so a missing field caused an unrelated value to be hashed into its slot (see the sketch after this list).
detailsFetched = Boolean(detail) — was unconditionally true when Scrape.do returned 200, claiming detail success even when the parser couldn't read the JSON-LD.
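The positional-null-safe hash, as a sketch (the field list in the usage comment is illustrative):

```typescript
import { createHash } from 'node:crypto';

function contentHash(fields: Array<string | null | undefined>): string {
  // Keep every slot in place: map missing values to '' instead of filtering
  // them out, so a missing field can no longer shift later values into the
  // wrong slot and silently change what gets hashed.
  const canonical = fields.map((f) => f ?? '').join('|');
  return createHash('sha256').update(canonical).digest('hex');
}

// contentHash([title, company, salaryText, location]) now stays positionally
// stable even when salaryText is undefined.
```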
Cap math
SERP collection at main.ts:916 now compares pushed + reserved + pendingDetailJobs.length against maxResults instead of just pendingDetailJobs.length: the same I7 bug from the v1.1.0 audit, in a different code path that had been missed.
Input contract
geo runtime default: 'TOTALJOBS'. Was DEFAULTS.geo (= 'DE') inherited from a shared constants file. API/CLI callers omitting geo now match the schema's "Fixed to Totaljobs (UK)" promise.
includeDetails runtime default: true. It was false while the schema advertised true, so Console UI users got details but API/CLI callers omitting the field got SERP-only data. The Apify quality-test guard lives in prefill: false (a separate field) and is unchanged.
Drift-audit test added: it parses main.ts for runtime defaults and compares them against input_schema.json (sketched below).
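A hedged sketch of the drift-audit idea; the regexes, file paths, and schema keys are assumptions about how the real test extracts defaults:

```typescript
import { readFileSync } from 'node:fs';
import { test } from 'node:test';
import assert from 'node:assert';

test('runtime defaults match input_schema.json', () => {
  const schema = JSON.parse(readFileSync('.actor/input_schema.json', 'utf8'));
  const source = readFileSync('src/main.ts', 'utf8');

  // Assumes main.ts applies defaults like `input.geo ?? 'TOTALJOBS'`.
  const geoDefault = source.match(/geo\s*\?\?\s*'([A-Z_]+)'/)?.[1];
  assert.equal(geoDefault, schema.properties.geo.default);

  const detailsDefault = source.match(/includeDetails\s*\?\?\s*(true|false)/)?.[1];
  assert.equal(detailsDefault, String(schema.properties.includeDetails.default));
});
```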
Diagnostics
Scrape.do SERP failures now record block signals. A 403/429/transport block on the optimised SERP path used to surface as "no signals + 0 pushed" in the run summary, indistinguishable from an empty query.
Failure diagnostics on early exit / thrown errors. Validation failures (missing query, invalid geo) and unhandled exceptions now write a FAILURE_DIAGNOSTICS KV record so consumers can tell a clean run from one that never started.
1.2.0 — 2026-05-04
Performance
Inline detail fetches now run with bounded concurrency (playwrightCrawler.ts). Each SERP page used to fetch its 25 details one at a time; details are now fetched with concurrency min(maxConcurrency, 8) per SERP page (see the pool sketch below). End-to-end inline-detail throughput is up ~4-8× on default settings.
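A generic bounded worker-pool sketch of the pattern (not the actor's exact code):

```typescript
async function mapBounded<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  // `limit` workers pull from a shared cursor; single-threaded JS makes the
  // `next++` read-and-increment safe without locking.
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}

// Per SERP page: await mapBounded(detailUrls, Math.min(maxConcurrency, 8), fetchDetail);
```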
Firefox detail crawler reuses browser contexts across jobs (up to 10 uses per context, then auto-recycled). It previously created and immediately closed a fresh context per job (and per retry), which OOM'd on 500+-job runs and added ~1-2s of overhead per fetch. Retries still get a fresh context to avoid keeping a flagged identity.
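A sketch of the reuse policy, assuming a Playwright Browser handle (MAX_CONTEXT_USES and the single-slot pool are illustrative):

```typescript
import type { Browser, BrowserContext } from 'playwright';

const MAX_CONTEXT_USES = 10; // recycle after 10 jobs, per the note above

let current: { ctx: BrowserContext; uses: number } | null = null;

async function acquireContext(browser: Browser, isRetry: boolean): Promise<BrowserContext> {
  // Retries always get a fresh context so a flagged identity is not reused.
  if (isRetry || !current || current.uses >= MAX_CONTEXT_USES) {
    if (current) await current.ctx.close().catch(() => {});
    current = { ctx: await browser.newContext(), uses: 0 };
  }
  current.uses += 1;
  return current.ctx;
}
```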
Scrape.do retry backoff now jittered (±20% uniform). Multiple actors retrying at the same tick produced thundering-herd spikes against the proxy; jitter spreads the second wave.
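The jitter in one line, as a sketch:

```typescript
// Uniform jitter in [0.8, 1.2] around the base backoff delay.
function jitteredDelayMs(baseMs: number): number {
  return Math.round(baseMs * (0.8 + Math.random() * 0.4));
}
```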
Schema
dataset_schema.json drift fixed: descriptionHtml, descriptionMarkdown, contentHash, changeType were missing from the "all" view's transformation fields. Display labels added for 11 lifecycle/repost/extraction fields that previously appeared with raw key names. A drift-audit test guards against future regressions.
1.1.0 — 2026-05-03
Critical fixes
Pagination cap removed: maxPages was previously clamped to 1 in the browser-fetch path, so users requesting maxPages=10 got only the first page of results. The clamp was a leftover speculative optimization; the SERP path is owned by CheerioCrawler/Playwright, and pagination now runs end-to-end. (Reported by @cleme_ntino.)
Pass-2 escalation no longer double-bills: pendingDetailJobs is now cleared after each detail phase. Previously, escalating from datacenter to residential proxy re-pushed the entire pass-1 set against the cap, billing users twice for the same listings.
Detail retry no longer false-fails: Firefox detail crawler used to retry whenever description was empty, even when the JSON-LD JobPosting block was present and complete. Retries now only fire when JSON-LD is entirely missing.
Incremental state isolation: Two runs with identical query+geo+location but different age/radius/contractType/etc. used to share state, silently suppressing fresh hits in run B. Filter dimensions are now hashed into the state-key prefix.
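A sketch of folding filter dimensions into the state key (the dimension list and key format are assumptions):

```typescript
import { createHash } from 'node:crypto';

type RunInput = {
  query: string;
  geo: string;
  location?: string;
  age?: number;
  radius?: number;
  contractType?: string;
};

function stateKeyFor(input: RunInput): string {
  const dims = [input.age, input.radius, input.contractType]
    .map((v) => String(v ?? ''))
    .join('|');
  const filterHash = createHash('sha256').update(dims).digest('hex').slice(0, 12);
  // Runs differing only in filters now get distinct state, so run B's fresh
  // hits are no longer suppressed by run A's seen-set.
  return `${input.query}:${input.geo}:${input.location ?? ''}:${filterHash}`;
}
```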
Phone & URL extraction now read post-format text (no longer broken by HTML→Markdown conversion); email extraction reads raw HTML (so mailto: anchors aren't lost). The previous code did the opposite of both.
changeType: 'NEW' now wired across all detail engines (Firefox, Playwright, browser-fetch, Scrape.do). Was missing on multiple paths, so incremental subscribers couldn't tell new from existing items.
contentHash is now null-safe: previously a missing field would throw inside the SHA-256 hashing call.
Lock-acquisition errors no longer mask the root cause: Actor.fail() was being thrown during state-lock acquisition, swallowing the underlying error message. A plain Error is now thrown instead.
State lock always released on failure: try/catch added around the main run body so a crash mid-run still releases the lock instead of holding it for the full TTL.
Important fixes
Currency mapping is now geo-aware: UK GBP, EU EUR, ZA ZAR. Salaries from JSON-LD without explicit currency previously defaulted to EUR for everything.
Scrape.do success criterion fixed: changed from html.length > 5000 to JSON-LD presence check. Long block pages used to count as success; legitimate compact templates used to count as failure.
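The content-based check, as a sketch (the regex is a plain-string approximation of detecting a schema.org JobPosting in embedded JSON-LD):

```typescript
function looksLikeDetailSuccess(html: string): boolean {
  // Old heuristic: html.length > 5000. Long block pages passed it; legitimate
  // compact templates failed it. Presence of a JobPosting JSON-LD block is a
  // content signal rather than a size signal.
  return /<script[^>]*type=["']application\/ld\+json["'][^>]*>[\s\S]*?"@type"\s*:\s*"JobPosting"/i.test(html);
}
```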
Telegram/WhatsApp message splits at semantic boundaries: notifications now split at \n\n boundaries before falling back to hard slices, preventing job entries from being chopped mid-sentence.
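A sketch of the boundary-aware splitter (limit handling is illustrative; the real code may differ):

```typescript
function splitMessage(text: string, limit: number): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const block of text.split('\n\n')) {
    // Flush the current chunk if appending this block would overflow it.
    if (current && current.length + 2 + block.length > limit) {
      chunks.push(current);
      current = '';
    }
    if (block.length > limit) {
      // A single oversized block still needs hard slices, as a last resort.
      for (let i = 0; i < block.length; i += limit) chunks.push(block.slice(i, i + limit));
    } else {
      current = current ? `${current}\n\n${block}` : block;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```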
Notification dispatch gated on success: notifications were previously dispatched even when the run had failed midway.
Detail uniqueKey discriminated by pass: pass-2 detail retries used to be deduped against pass-1 entries by Crawlee's RequestQueue, making escalation a no-op. UniqueKey now includes the pass label.
startUrls hostname validation: invalid hostnames are rejected up front instead of failing mid-run with a confusing error.
onDetailJob cap math: pass-2 escalation now correctly accounts for pushed + reserved + pendingDetailJobs.length against maxResults.
stateStoreName default: was "stepstone-state" in code but "totaljobs-state" in input_schema.json. Aligned to "totaljobs-state".
Compact output
salaryMin / salaryMax added to compact field set (essential for AI-agent salary filtering).
Operational
Default memory bumped from 1024MB → 2048MB; default timeout 300s → 1800s. Browser detail paths previously OOM'd on larger runs and timed out on maxPages>5.
1.0.x — 2026-04-30
Fixed: startUrls now processes all URLs in the array. Previously the optimized SERP path only used the first URL; subsequent URLs were silently ignored. Each URL is now its own pagination universe with shared dedup + maxResults cap.
0.1.x — 2026-04-14
Added: descriptionHtml, descriptionMarkdown output fields (triple-format descriptions for RAG/LLM pipelines)
Added: contentHash output field (SHA-256 fingerprint of content-identifying fields)