Scrape totaljobs.com - UK’s largest job board. Salary data, employer contact details, full job descriptions, and job change monitoring. Incremental mode detects new and changed listings. Compact output for AI agents and MCP workflows.
Added schema drift monitoring for incremental runs, with source-field baselines persisted in the configured state store.
2026-05-14
Fixed: maxPages no longer silently caps maxResults. When you request more results than the default page budget allows, the actor now extends pagination automatically.
1.3.1 — 2026-05-04
Critical billing — caught by external review of v1.3.0 before ship
SERP-only fast path no longer double-bills. With includeDetails: false, the fast path pushed every row directly, but its usage flag evaluated to false because nothing was queued for detail enrichment. The standard path then ran the same start URLs again, producing duplicate dataset rows and charging twice. The flag now derives from "did the fast path successfully parse a SERP page", and the SERP-only push routes through the central pushOutputForJob helper. v1.3.0 shipped this regression but was held before going live; production stayed on v1.2.x.
Push contract is now wired into every engine path (browser SERP+detail, alternate detail, browser-fetch detail, detail crawler, plus the onDetailFailed SERP-fallback push in main.ts). v1.3.0 only reached routes.ts and the fast path, leaving the rest with the same C2/I1/I2 bugs the helper was meant to retire:
markSeen-on-dedup-hit: local maybePush returned true on dedup, callers then unconditionally called incremental.markSeen for rows that never reached the dataset, poisoning incremental state.
detailsFetched: true for parser failures: a 200 response with no JSON-LD JobPosting now correctly emits detailsFetched: false. Previously it was true whenever the HTTP response succeeded.
SERP-only rows missing changeType / firstSeenAt / lastSeenAt: the engine paths advertised these lifecycle fields but only one path (post-v1.3.0) populated them.
Diagnostics
writeFailureDiagnostics now fires on every early-exit branch. Invalid maxResults, invalid startUrl, and state-lock conflict each write a FAILURE_DIAGNOSTICS KV record so consumers can distinguish "ran clean with 0 results" from "never started". Lock-conflict happens before the outer try/catch wrap, so it gets its own write rather than relying on the thrown_error fallback.
State store
Incremental state is pruned on save. Two-phase: an age-based prune drops entries whose lastSeenAt is older than 90 days, then a soft size budget (8 MB, leaving headroom under Apify's 9 MB per-record limit) triggers oldest-first pruning if the JSON payload still exceeds it. Without this, the v2 timestamp-per-id format would eventually breach the KV limit and start failing every save on long-lived state stores. INCREMENTAL_RETENTION_MS and INCREMENTAL_MAX_BYTES are exported so other actors can tune them.
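The two-phase prune can be sketched roughly as follows. This is a minimal illustration and not the actor's code: the SeenEntry shape, the pruneState name, and the use of string length as a stand-in for byte size are assumptions. Only the two constant names and their values come from the entry above.

```typescript
// Hypothetical shapes for illustration; only the constant names are from the changelog.
interface SeenEntry { firstSeenAt: string; lastSeenAt: string }
type SeenMap = Record<string, SeenEntry>;

const INCREMENTAL_RETENTION_MS = 90 * 24 * 60 * 60 * 1000; // 90 days
const INCREMENTAL_MAX_BYTES = 8 * 1024 * 1024; // 8 MB soft budget

function pruneState(
  entries: SeenMap,
  now: number,
  retentionMs: number = INCREMENTAL_RETENTION_MS,
  maxBytes: number = INCREMENTAL_MAX_BYTES,
): SeenMap {
  // Phase 1: keep only ids seen within the retention window,
  // sorted newest-first so phase 2 can drop from the tail.
  const ids = Object.keys(entries)
    .filter((id) => now - Date.parse(entries[id].lastSeenAt) <= retentionMs)
    .sort((a, b) => Date.parse(entries[b].lastSeenAt) - Date.parse(entries[a].lastSeenAt));

  const build = (keep: string[]): SeenMap => {
    const out: SeenMap = {};
    for (const id of keep) out[id] = entries[id];
    return out;
  };

  // Phase 2: while the serialised payload is over budget, drop the oldest id.
  // String length approximates byte size for ASCII-heavy payloads (assumption).
  while (ids.length > 0 && JSON.stringify(build(ids)).length > maxBytes) {
    ids.pop();
  }
  return build(ids);
}
```

Re-serialising on every iteration is quadratic in the worst case; a production version would more likely estimate per-entry size once and subtract as it drops entries.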
1.3.0 — 2026-05-04
Critical correctness — push contract
pushOutputForJob helper centralises every dataset write. Owns changeType, firstSeenAt, lastSeenAt, transform, push, dedup, and incremental.markSeen. Callers now go through one function instead of 7 ad-hoc copies, so future fixes don't have to be replicated across engine paths.
markSeen is now strictly post-push. Previously routes.ts SERP-only and LABEL_DETAIL paths called incremental.markSeen BEFORE attemptPush, so a cap rejection or transient push failure left the job permanently locked out of future incremental runs. Helper invariant now enforces "markSeen iff push succeeded".
Cap-rejected jobs are no longer marked seen. The SERP onDetailJob callback used to mark cap-overflow jobs as seen ("// cap reached — mark remaining jobs seen"), causing silent data loss across runs. Removed; future runs now rediscover those jobs as NEW.
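The "markSeen iff push succeeded" invariant described above might look something like the sketch below. The Job and PushDeps shapes, the result labels, and the pushOutputForJobSketch name are hypothetical, and the sketch is synchronous for brevity. The real helper also owns transform and the lifecycle fields, which are omitted here.

```typescript
// Hypothetical, simplified dependencies; not the actor's real signatures.
interface Job { id: string }
interface PushDeps {
  isDuplicate(id: string): boolean;
  capReached(): boolean;
  push(job: Job): void; // may throw on a transient failure
  markSeen(id: string): void;
}

type PushResult = "pushed" | "duplicate" | "cap_rejected" | "push_failed";

function pushOutputForJobSketch(job: Job, deps: PushDeps): PushResult {
  // Dedup hits and cap rejections return early and never touch incremental state.
  if (deps.isDuplicate(job.id)) return "duplicate";
  if (deps.capReached()) return "cap_rejected";
  try {
    deps.push(job);
  } catch {
    // A failed push leaves the job unseen so a later run can rediscover it.
    return "push_failed";
  }
  // Only reached when the row actually landed in the dataset.
  deps.markSeen(job.id);
  return "pushed";
}
```

Returning a labelled result instead of a boolean is a design choice of this sketch. It makes it harder for a caller to mistake a dedup hit for a successful push.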
State model
IncrementalState v2 KV format. Per-id timestamps {firstSeenAt, lastSeenAt} replace the bare seen-set. Legacy v1 {ids} state migrates automatically on first load by synthesising timestamps from the run-level updatedAt. OutputItem.firstSeenAt / lastSeenAt now carry real cross-run semantics.
State lock TTL bumped 30 min → 90 min. With actor timeoutSecs: 1800 (30 min), the previous TTL was identical to the timeout, so any long-but-legal run could trigger a stale_override race mid-write.
Lock release is compare-and-delete. Previously setValue(lockKey, null) ran unconditionally, so a late-exiting run could nuke a concurrent run's freshly-acquired lock. Now reads current lock first and only clears if runId matches.
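A compare-and-delete release along the lines described above could be sketched like this. The LockRecord and KvStore shapes and the in-memory store are assumptions for illustration, written synchronously for brevity. A real key-value store would be asynchronous, and a read followed by a write is not atomic unless the store supports conditional writes.

```typescript
// Hypothetical lock record and store interface; not the Apify SDK API.
interface LockRecord { runId: string; acquiredAt: number }
interface KvStore {
  get(key: string): LockRecord | null;
  set(key: string, value: LockRecord | null): void;
}

function releaseLock(store: KvStore, lockKey: string, runId: string): boolean {
  const current = store.get(lockKey);
  // Only clear the lock if this run still owns it.
  if (current === null || current.runId !== runId) return false;
  store.set(lockKey, null);
  return true;
}

// In-memory stand-in so the sketch can be exercised without a real store.
function memoryStore(): KvStore {
  const data = new Map<string, LockRecord | null>();
  return {
    get: (key) => data.get(key) ?? null,
    set: (key, value) => { data.set(key, value); },
  };
}
```

With this shape, a late-exiting run whose lock was already taken over gets false back and leaves the newer run's lock in place.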
Detail engine bugs from the v1.2.0 audit pass
The "comprehensive correctness pass" of v1.1.0 missed both detail blocks (primary + deferred retry). All three engine-specific bugs are now fixed in those paths:
Currency: parseDetailJsonLd now receives defaultCurrencyForGeo(geo). UK was getting EUR salaries; now correctly GBP.
contentHash is now positional and null-safe. It previously used .filter(Boolean).join('|'), which dropped empty strings and shifted slot positions, so a missing field moved later values into the wrong slot of the hash input.
detailsFetched = Boolean(detail) — was unconditionally true when the detail response succeeded, claiming detail success even when the parser couldn't read the JSON-LD.
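To illustrate the contentHash slot-shifting problem from the list above, here is a small sketch comparing the old join with a positional one. The function names and the example field order are hypothetical. Only the .filter(Boolean).join('|') pattern and the use of SHA-256 come from this changelog.

```typescript
import { createHash } from "node:crypto";

type Field = string | null | undefined;

// Old behaviour: empty and missing fields vanish, so later values shift left.
function legacyKey(fields: Field[]): string {
  return fields.filter(Boolean).join("|");
}

// Positional variant: every slot is kept, with missing values as empty strings.
function positionalKey(fields: Field[]): string {
  return fields.map((field) => field ?? "").join("|");
}

// Hypothetical wrapper: SHA-256 over the positional key, hex-encoded.
function contentHashSketch(fields: Field[]): string {
  return createHash("sha256").update(positionalKey(fields)).digest("hex");
}
```

For example, ["Dev", "", "London"] and ["Dev", "London", ""] collapse to the same string under legacyKey but stay distinct under positionalKey. A field value that itself contains the separator could still be ambiguous; this sketch does not address that.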
Cap math
SERP collection at main.ts:916 now compares pushed + reserved + pendingDetailJobs.length against maxResults instead of just pendingDetailJobs.length. Same I7 bug from the v1.1.0 audit, different code path that had been missed.
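The difference between the two cap checks can be shown with a short sketch. The CapCounters shape, the function names, and the strict less-than comparison are assumptions; the entry above only states which counters are compared against maxResults.

```typescript
// Hypothetical counters; pendingDetailJobs stands in for pendingDetailJobs.length.
interface CapCounters {
  pushed: number;
  reserved: number;
  pendingDetailJobs: number;
}

// Old check: ignores rows already pushed or reserved.
function hasCapacityOld(c: CapCounters, maxResults: number): boolean {
  return c.pendingDetailJobs < maxResults;
}

// New check: counts everything that will end up against the cap.
function hasCapacity(c: CapCounters, maxResults: number): boolean {
  return c.pushed + c.reserved + c.pendingDetailJobs < maxResults;
}
```

With 40 rows pushed, 5 reserved, 5 pending, and maxResults of 50, the old check would still accept more jobs while the new one stops collecting.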
Input contract
geo runtime default: 'TOTALJOBS'. API/CLI callers omitting geo now match the "Fixed to Totaljobs (UK)" promise.
includeDetails runtime default: true. Console UI users and API/CLI callers now get the same default detail-enriched output.
Default drift audit added — guards against future mismatches between Console defaults and API/CLI defaults.
Diagnostics
Optimised SERP failures now record block signals. A 403/429/transport block on the optimised SERP path used to surface as "no signals + 0 pushed" in run summary, indistinguishable from an empty query.
Failure diagnostics on early exit / thrown errors. Validation failures (missing query, invalid geo) and unhandled exceptions now write a FAILURE_DIAGNOSTICS KV record so consumers can tell a clean run from one that never started.
1.2.0 — 2026-05-04
Performance
Inline detail fetches now run with bounded concurrency. Each SERP page used to fetch its 25 details one-at-a-time; now bounded by min(maxConcurrency, 8) per SERP page. End-to-end inline-detail throughput up ~4-8× on default settings.
Alternate detail crawler reuses browser contexts across jobs (up to 10 uses per context, then auto-recycled). Previously created and immediately closed a fresh context per job (and per retry), which OOM'd on 500+-job runs and added ~1-2s of overhead per fetch. Retries still get a fresh context after a failed identity.
External retry backoff now jittered (±20% uniform). Multiple actors retrying at the same tick produced thundering-herd spikes; jitter spreads the second wave.
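A ±20% uniform jitter like the one described above can be sketched in a few lines. The jitteredDelay name and the injectable random source are assumptions made so the example can be tested deterministically.

```typescript
// Scales a base delay by a factor drawn uniformly from [0.8, 1.2).
// The random parameter is injectable purely for testing (assumption).
function jitteredDelay(baseMs: number, random: () => number = Math.random): number {
  const factor = 0.8 + random() * 0.4;
  return Math.round(baseMs * factor);
}
```

Two actors that both compute a 1000 ms backoff would then typically wake at different points between 800 ms and 1200 ms rather than on the same tick.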
Schema
Output view drift fixed: descriptionHtml, descriptionMarkdown, contentHash, changeType were missing from the "all" export view. Display labels added for 11 lifecycle/repost/extraction fields that previously appeared with raw key names. A drift-audit test guards against future regressions.
1.1.0 — 2026-05-03
Critical fixes
Pagination cap removed: maxPages was previously clamped to 1 in the browser-fetch path, so users requesting maxPages=10 got only the first page of results. The clamp was a leftover speculative optimization; the SERP path now runs pagination end-to-end. (Reported by @cleme_ntino.)
Pass-2 escalation no longer double-bills: pendingDetailJobs is now cleared after each detail phase. Previously, escalating after a failed pass re-pushed the entire pass-1 set against the cap, billing users twice for the same listings.
Detail retry no longer false-fails: Firefox detail crawler used to retry whenever description was empty, even when the JSON-LD JobPosting block was present and complete. Retries now only fire when JSON-LD is entirely missing.
Incremental state isolation: Two runs with identical query+geo+location but different age/radius/contractType/etc. used to share state, silently suppressing fresh hits in run B. Filter dimensions are now hashed into the state-key prefix.
Phone & URL extraction now read post-format text (no longer broken by HTML→Markdown conversion); email extraction reads raw HTML (so mailto: anchors aren't lost). The previous code did the opposite of both.
changeType: 'NEW' now wired across all detail engines. Was missing on multiple paths, so incremental subscribers couldn't tell new from existing items.
contentHash is now null-safe: previously a missing field would throw inside the SHA-256 hashing call.
Lock-acquisition errors no longer mask root cause: Actor.fail() was being thrown during state-lock acquisition, which swallowed the underlying error message. Now throws a plain Error.
State lock always released on failure: try/catch added around the main run body so a crash mid-run still releases the lock instead of holding it for the full TTL.
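For the incremental state isolation fix in the list above, one way to fold filter dimensions into a state key is sketched below. The key layout, the Filters type, and the function names are hypothetical. The changelog does not say which hash function is used, so the sketch uses FNV-1a over UTF-16 code units only to stay dependency-free.

```typescript
type Filters = Record<string, string | number | boolean | undefined>;

// 32-bit FNV-1a over UTF-16 code units, hex-encoded to 8 characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical key layout: geo, query, location, then a hash of the filters.
function stateKey(query: string, geo: string, location: string, filters: Filters): string {
  // Sort keys so the same filters always produce the same canonical string.
  const canonical = Object.keys(filters)
    .filter((key) => filters[key] !== undefined)
    .sort()
    .map((key) => `${key}=${String(filters[key])}`)
    .join("&");
  return `${geo}:${query}:${location}:${fnv1a(canonical)}`;
}
```

Two runs that differ only in radius then read and write different state records, while reordering the same filters leaves the key unchanged.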
Important fixes
Currency mapping is now geo-aware: UK GBP, EU EUR, ZA ZAR. Salaries from JSON-LD without explicit currency previously defaulted to EUR for everything.
External detail success criterion fixed: changed from html.length > 5000 to JSON-LD presence check. Long block pages used to count as success; legitimate compact templates used to count as failure.
Telegram/WhatsApp message splits at semantic boundaries: notifications now split at \n\n boundaries before falling back to hard slices, preventing job entries from being chopped mid-sentence.
Notification dispatch gated on success: previously dispatched even when the run had failed mid-way.
Detail uniqueKey discriminated by pass: pass-2 detail retries used to be deduped against pass-1 entries by the request queue, making escalation a no-op. UniqueKey now includes the pass label.
startUrls hostname validation: invalid hostnames are rejected up front instead of failing mid-run with a confusing error.
onDetailJob cap math: pass-2 escalation now correctly accounts for pushed + reserved + pendingDetailJobs.length against maxResults.
stateStoreName default: now consistently defaults to "totaljobs-state" for all entry points.
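The notification splitting described earlier in this list (paragraph boundaries first, hard slices as a fallback) can be sketched as follows. The splitMessage name and the limit parameter are assumptions; the changelog does not state the per-message limit the actor uses.

```typescript
// Splits text into chunks of at most `limit` characters, preferring "\n\n" boundaries.
function splitMessage(text: string, limit: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const block of text.split("\n\n")) {
    const candidate = current === "" ? block : current + "\n\n" + block;
    if (candidate.length <= limit) {
      current = candidate;
      continue;
    }
    // The block does not fit alongside what we have: flush the current chunk.
    if (current !== "") {
      chunks.push(current);
      current = "";
    }
    // Fallback: hard-slice a single block that is longer than the limit.
    let rest = block;
    while (rest.length > limit) {
      chunks.push(rest.slice(0, limit));
      rest = rest.slice(limit);
    }
    current = rest;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

A job entry is only cut mid-text when that single entry exceeds the limit on its own.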
Compact output
salaryMin / salaryMax added to compact field set (essential for AI-agent salary filtering).
Operational
Default memory bumped from 1024MB → 2048MB; default timeout 300s → 1800s. Browser detail paths previously OOM'd on larger runs and timed out on maxPages>5.
1.0.x — 2026-04-30
Fixed: startUrls now processes all URLs in the array. Previously the optimized SERP path only used the first URL; subsequent URLs were silently ignored. Each URL is now its own pagination universe with shared dedup + maxResults cap.
0.1.x — 2026-04-14
Added: descriptionHtml, descriptionMarkdown output fields (triple-format descriptions for RAG/LLM pipelines)
Added: contentHash output field (SHA-256 hash of content-identifying fields)