- Output usability fields. Added
summary, contactPhone, searchQuery, and searchUrl to every live output record. summary is derived from the job description for quick scanning; contactPhone exposes the first validated phone number as a top-level field; search provenance makes multi-URL and pasted-URL runs auditable per row.
- Compact output updated. Compact mode now includes
summary, contactEmail, contactPhone, applyUrl, searchQuery, and searchUrl.
- Local output audit tool. Added
tools/local-output-audit.mjs and refreshed docs/OUTPUT_SAMPLE_AUDIT.md from a broad local Willhaben sample.
- README output example refreshed. The sample now shows high-value populated fields instead of a null-heavy matrix.
- Coverage summary footer. Every run now ends with a structured
Coverage summary block that makes the full drop-chain visible: reported → fetched → unique → kept-after-filters → emitted. Multi-URL runs additionally get a per-URL table showing exactly which start URLs contributed how many jobs, and tasks whose later pages were skipped because of maxResults are flagged inline (e.g. ⚠ 237 jobs on pages 2–4 skipped (maxResults)). Motivated by emmat's reported scenario where 13 URLs at the default maxResults=25 produced output that looked like only a fraction of the source — the cap was doing exactly what it was documented to do, but the schema description never made that visible. The footer surfaces the full picture so the cause is self-diagnosable from the run log alone.
- Pre-fetch warning for skewed
maxResults/taskCount ratio. When more than 3 start URLs are combined with a maxResults value below taskCount × 10, the actor now emits a warning before SERP work starts so the user can cancel and re-run with a sensible cap rather than only discovering the truncation after the fact. Suggested raise value scales with task count (taskCount × 50, floor 500).
- maxResults schema description rewritten to make explicit that the cap is a global total across all start URLs (not per-URL), with a worked example: 13 URLs + maxResults=25 ≈ 2 jobs per URL. Recommends 0 (unlimited) or a high explicit total for full multi-URL coverage.
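The pre-fetch warning rule can be sketched roughly as follows. This is a minimal illustration, not the actor's code: the helper name `checkCapRatio` and the `CapWarning` shape are hypothetical, and treating `maxResults = 0` as "never warn" is an assumption based on 0 meaning unlimited.

```typescript
// Hypothetical helper; the thresholds come from the changelog entry, the names do not.
interface CapWarning {
  perUrlAverage: number;
  suggestedMaxResults: number;
}

function checkCapRatio(taskCount: number, maxResults: number): CapWarning | null {
  // Assumption: 0 means "unlimited", so there is nothing to warn about.
  if (maxResults === 0) return null;
  // Warn only when more than 3 start URLs share a cap below taskCount x 10.
  if (taskCount <= 3 || maxResults >= taskCount * 10) return null;
  return {
    perUrlAverage: maxResults / taskCount,
    // Suggested raise value: taskCount x 50, with a floor of 500.
    suggestedMaxResults: Math.max(500, taskCount * 50),
  };
}

const capWarning = checkCapRatio(13, 25);
if (capWarning) {
  console.warn(
    `maxResults=25 across 13 start URLs is about ${capWarning.perUrlAverage.toFixed(1)} jobs per URL; ` +
      `consider ${capWarning.suggestedMaxResults} or 0 (unlimited).`,
  );
}
```

With 13 URLs and a cap of 25, the sketch reports roughly 1.9 jobs per URL and suggests 650.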
- 37 unit tests for the coverage summary covering: emmat's 13-URL scenario, single-URL no-truncation, cap=0 unlimited with cross-URL overlap, invalid-URLs-rejected, filters dropping with reason breakdown, incremental-mode classification breakdown, empty-SERP rendering, number formatting, and the pre-fetch cap-warning helper across all threshold edge cases. Total: 312 tests, all green.
- Phone-number digits no longer leak into
salaryMax. Live smoke-test against Senior Java Developer:in (STRABAG) produced salaryMax: 224221491 because parseSalary scanned every digit run in the description, including the contact phone +43 1 224221491. The parser now accepts minBound / maxBound options (defaults: 100 / 10M) that filter implausible candidates before computing min/max. With the fix, the same job now correctly returns salaryMax: 55772 (the real upper bound from "50.908 bis 55.772 brutto/Jahr"). 3 regression tests added.
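A minimal sketch of the bound-filtering idea, assuming a simplified parser that collects every digit run (with optional dot thousands separators) and then discards implausible candidates. The function name `parseSalarySketch` and the regular expression are illustrative assumptions; only the `minBound` / `maxBound` options and their defaults come from the entry above.

```typescript
interface SalaryBounds {
  minBound?: number;
  maxBound?: number;
}

interface SalaryRange {
  salaryMin: number | null;
  salaryMax: number | null;
}

// Hypothetical simplified parser, not the actor's parseSalary.
function parseSalarySketch(
  text: string,
  { minBound = 100, maxBound = 10_000_000 }: SalaryBounds = {},
): SalaryRange {
  // Match "50.908"-style numbers first, then plain digit runs.
  const candidates = (text.match(/\d{1,3}(?:\.\d{3})+|\d+/g) ?? [])
    .map((raw) => Number(raw.replace(/\./g, "")))
    // Drop implausible values such as phone-number fragments.
    .filter((n) => n >= minBound && n <= maxBound);
  if (candidates.length === 0) return { salaryMin: null, salaryMax: null };
  return { salaryMin: Math.min(...candidates), salaryMax: Math.max(...candidates) };
}

const sampleText = "Gehalt: 50.908 bis 55.772 brutto/Jahr. Kontakt: +43 1 224221491";
console.log(parseSalarySketch(sampleText));
```

In this sample text the phone fragments 43 and 1 fall below the lower bound and 224221491 exceeds the upper bound, so only the two salary figures survive.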
- SERP page-1 failure now propagates instead of silently returning an empty universe. Previously a single 503 on page 1 in incremental mode would mark every prior-state listing as
EXPIRED (universe corruption); the run now fails fast so state is preserved.
- Retry loop no longer self-cancels.
fetchWithRetry now builds a fresh AbortSignal per attempt — the prior single-signal pattern caused the first timeout to abort every subsequent retry instantly. Network-level errors (ECONNRESET, ETIMEDOUT, fetch wrappers) are now retryable, not just retryable HTTP statuses.
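The per-attempt signal pattern might look like the sketch below. It is not the actor's `fetchWithRetry`: the names `withRetry`, `Attempt`, and `isRetryableError` are hypothetical, the list of retryable error codes is an assumption, and backoff delays and HTTP-status handling are left out for brevity.

```typescript
// Hypothetical sketch: `attempt` stands in for the real request function.
type Attempt<T> = (signal: AbortSignal) => Promise<T>;

// Assumption: these Node-style error codes mark a network-level failure.
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

function isRetryableError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { name?: string; code?: string; cause?: unknown };
  // A per-attempt timeout surfaces as an abort.
  if (e.name === "AbortError" || e.name === "TimeoutError") return true;
  if (e.code !== undefined && RETRYABLE_CODES.has(e.code)) return true;
  // fetch wrappers often nest the real error under `cause`.
  return e.cause !== undefined && isRetryableError(e.cause);
}

async function withRetry<T>(attempt: Attempt<T>, maxAttempts = 3, timeoutMs = 10_000): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    // A new controller per attempt, so an earlier timeout cannot abort later tries.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await attempt(controller.signal);
    } catch (err) {
      lastError = err;
      if (!isRetryableError(err)) throw err;
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

A caller would pass the signal straight through, for example `withRetry((signal) => fetch(url, { signal }))`.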
- EXPIRED stubs no longer charged. The billing event is now sized by live items only (NEW/UPDATED/UNCHANGED/REAPPEARED); tombstones are operational data with no business value.
- Run-footer headline split. Footer now reports
X new/updated jobs exported (live only) plus a separate 🪦 Y expired jobs emitted as stubs line — no more "exported X new/updated" when half the dataset is tombstones.
- buildExpiredStub is now typed. Previously, EXPIRED stubs were returned as OutputItem via an "as unknown as" cast, which left all but a handful of fields undefined. Every field is now explicitly initialized to null so dataset views render predictable cells, and adding a new OutputItem field correctly causes a TypeScript error.
- State-lock failures throw a plain
Error instead of throw await Actor.fail(...) — the prior pattern interfered with the outer try/catch's releaseLock cleanup path on lock-loss.
- contactName preserved when detail's contact is empty. When Willhaben occasionally ships { email: ... } with no firstname/lastname, the prior code null-blanked any SERP-side contactName; we now fall through to the existing value.
- salaryMax now parsed from description even when API has no salary. The fallback used to require an API min before attempting description parsing; we now parse both min and max from description text when the API returns nothing.
- maxResults default aligned across the actor configuration (was inconsistent before).
- 1 smoke-test expectation updated for the default result-limit change. All 201 unit/integration tests green.
- applyUrl now populated with the public Willhaben URL instead of always being null. Notifications (Telegram, Discord, Slack, WhatsApp) previously rendered without an apply link; they now include a working URL for every job. The field is held back at null inside the tracked-content hash so existing incremental state stays compatible, which avoids a spurious UPDATED flood on the first run after upgrade.
- Incremental mode universe is now complete when
maxResults is set. Previously only page 1 (~90 ids) per task was used to build universeIds, so jobs on page 2+ were wrongly flagged EXPIRED next run. Incremental mode now lifts the per-task page cap entirely so EXPIRED detection sees the full universe.
- Compact mode no longer breaks notifications. Compact stripping is now applied only at dataset-write time. Notifications always receive the full record so titles, links, and descriptions render correctly.
- Vienna-local timestamps now carry the correct timezone offset. Naked
creationDate/postedDate/lastModifiedDate/lastReorderedAt from the API used to get a bare Z suffix that mis-represented Vienna wall-clock as UTC (off by 1–2 hours depending on DST). The actor now appends +01:00 or +02:00 based on Europe/Vienna DST for the date in question.
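One way to pick the offset is to apply the EU daylight-saving rule directly: summer time runs from the last Sunday of March (02:00 local) to the last Sunday of October (03:00 local). The sketch below assumes that rule and a naked `YYYY-MM-DDTHH:mm...` input; the function names are hypothetical and the actor may well derive the offset differently (for example via `Intl`).

```typescript
// Day-of-month of the last Sunday in the given month (monthIndex is 0-based).
function lastSundayOfMonth(year: number, monthIndex: number): number {
  const lastDay = new Date(Date.UTC(year, monthIndex + 1, 0));
  return lastDay.getUTCDate() - lastDay.getUTCDay();
}

// Hypothetical helper: offset for a Vienna wall-clock timestamp without zone info.
function viennaOffset(naked: string): "+01:00" | "+02:00" {
  const m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})/.exec(naked);
  if (!m) throw new Error(`Unparseable timestamp: ${naked}`);
  const [y, mo, d, h, mi] = m.slice(1).map(Number);
  // Compare wall-clock values on a common axis by pretending they are UTC.
  const wall = Date.UTC(y, mo - 1, d, h, mi);
  const dstStart = Date.UTC(y, 2, lastSundayOfMonth(y, 2), 2, 0); // 02:00 CET
  const dstEnd = Date.UTC(y, 9, lastSundayOfMonth(y, 9), 3, 0); // 03:00 CEST
  // The repeated hour at the October switch is treated as CEST (first pass).
  return wall >= dstStart && wall < dstEnd ? "+02:00" : "+01:00";
}

function withViennaOffset(naked: string): string {
  return `${naked.replace(/Z$/, "")}${viennaOffset(naked)}`;
}

console.log(withViennaOffset("2024-07-15T10:00:00")); // summer date
console.log(withViennaOffset("2024-01-15T10:00:00")); // winter date
```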
- Identical
startUrls are now deduped by canonical form (sorted query params, no trailing slash). An accidentally-pasted duplicate no longer fires a redundant SERP pipeline.
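Canonical-form deduplication can be sketched with the standard `URL` API. The helper names and the example URLs are hypothetical; only the two normalization steps (sorted query params, no trailing slash) come from the entry above.

```typescript
// Hypothetical helper: normalize a start URL so equivalent spellings compare equal.
function canonicalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.searchParams.sort(); // query order no longer matters
  url.pathname = url.pathname.replace(/\/+$/, ""); // drop trailing slash(es)
  return url.toString();
}

// Keeps the first spelling of each canonical URL, preserving input order.
function dedupeStartUrls(urls: string[]): string[] {
  const seen = new Map<string, string>();
  for (const raw of urls) {
    const key = canonicalizeUrl(raw);
    if (!seen.has(key)) seen.set(key, raw);
  }
  return Array.from(seen.values());
}

const deduped = dedupeStartUrls([
  "https://www.willhaben.at/jobs/suche/?b=2&a=1",
  "https://www.willhaben.at/jobs/suche?a=1&b=2", // same search, different spelling
  "https://www.willhaben.at/jobs/suche?a=1&b=3",
]);
console.log(deduped.length);
```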
- employmentModes is sorted before hashing in TrackedFields, so a re-ordering of the same set no longer triggers a spurious UPDATED classification. Migration note: jobs whose multi-mode order in the API does not happen to match alphabetical order may emit one UPDATED on the first run after upgrade; single-mode jobs (the majority) are unaffected.
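The order-insensitive hashing input can be illustrated as below. The `TrackedFieldsSketch` shape and the sample values are hypothetical stand-ins; the point is only that the array is copied and sorted before serialization, so the string fed to the hash is identical for any ordering of the same set.

```typescript
// Hypothetical, trimmed-down stand-in for the real tracked-field set.
interface TrackedFieldsSketch {
  title: string;
  employmentModes: string[];
}

// Produces the canonical string that would then be passed to the content hash.
function canonicalTrackedString(fields: TrackedFieldsSketch): string {
  return JSON.stringify({
    title: fields.title,
    // Copy before sorting so the caller's array is not mutated.
    employmentModes: [...fields.employmentModes].sort(),
  });
}

const a = canonicalTrackedString({ title: "Developer", employmentModes: ["Vollzeit", "Teilzeit"] });
const b = canonicalTrackedString({ title: "Developer", employmentModes: ["Teilzeit", "Vollzeit"] });
console.log(a === b);
```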
- Per-task fetch budget now scales with task count. With multiple
startUrls each task fetches min(maxResults, max(pageSize, ⌈maxResults × 2 / tasks⌉)) rows. Power-user runs with maxResults ≫ pageSize now save SERP requests by not fetching the full cap on every task.
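The budget formula translates directly into code. The parameter names below are assumptions, and the sketch does not model `maxResults = 0` (unlimited), which the formula as stated does not cover.

```typescript
// min(maxResults, max(pageSize, ceil(maxResults * 2 / tasks)))
function perTaskBudget(maxResults: number, tasks: number, pageSize: number): number {
  const fairShareWithHeadroom = Math.ceil((maxResults * 2) / tasks);
  return Math.min(maxResults, Math.max(pageSize, fairShareWithHeadroom));
}

// Illustrative values; a page size of 90 is an assumption for the example.
console.log(perTaskBudget(1000, 10, 90)); // large cap, many tasks
console.log(perTaskBudget(25, 13, 90)); // small cap: never exceeds maxResults
```

With a cap of 1000 spread over 10 tasks, each task fetches 200 rows instead of 1000; with a cap below the page size, the budget simply equals the cap.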
- Repost detection is now O(1) per current item via a pre-built content-hash → expired-prior-entry index. Previously each item triggered a linear scan of priorState.jobs.
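A sketch of the index-based lookup, under the assumption that each prior entry carries a job ID, a content hash, and a status. The `PriorJobSketch` shape and both function names are hypothetical.

```typescript
// Hypothetical shape of a prior-state entry.
interface PriorJobSketch {
  jobId: string;
  contentHash: string;
  status: "ACTIVE" | "EXPIRED";
}

// Built once per run: content hash -> first expired prior entry with that hash.
function buildExpiredIndex(priorJobs: PriorJobSketch[]): Map<string, PriorJobSketch> {
  const index = new Map<string, PriorJobSketch>();
  for (const job of priorJobs) {
    if (job.status === "EXPIRED" && !index.has(job.contentHash)) {
      index.set(job.contentHash, job);
    }
  }
  return index;
}

// Constant-time lookup per current item; returns the original job's ID or null.
function findRepostOf(
  current: { jobId: string; contentHash: string },
  index: Map<string, PriorJobSketch>,
): string | null {
  const hit = index.get(current.contentHash);
  return hit !== undefined && hit.jobId !== current.jobId ? hit.jobId : null;
}

const expiredIndex = buildExpiredIndex([
  { jobId: "100", contentHash: "abc", status: "EXPIRED" },
  { jobId: "101", contentHash: "def", status: "ACTIVE" },
]);
console.log(findRepostOf({ jobId: "200", contentHash: "abc" }, expiredIndex));
```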
- Page data parser now matches across newlines. If Willhaben ever ships a pretty-printed payload, parsing won't silently fail.
- EXPIRED stub items are now routed through filterCompact when compact: true is set, matching the schema of live items in the dataset.
- Empty/whitespace
stateKey now falls back to the automatic search-specific state key when the field is left blank in the UI.
- EXPIRED stub items are no longer sent to notifications. When
emitExpired: true, the per-job stub has only timestamps + jobId — Telegram/Discord/Slack would have rendered them as (untitled) with no link. They still appear in the dataset; only notification rendering is skipped.
- Retry delays now include ±20% jitter to break up thundering-herd retries when many parallel SERP pages hit the same 5xx.
- HTTP 429 is now retryable (was previously bubbled directly).
- Actor.charge failures log at debug level instead of being silently swallowed, so systematic billing breakage is now observable.
- descriptionHtml field removed; it had always been null since v0.2.0. Use description (raw, often HTML) or descriptionMarkdown (auto-converted).
- Telegram messages over 4096 chars now split on blank-line (job-entry) boundaries, falling back to hard-cut only when a single block exceeds the cap.
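The splitting strategy can be sketched as a greedy packer over blank-line-separated blocks. The function name is hypothetical, and measuring length in JavaScript string units is an assumption that may not match how Telegram counts characters in every case.

```typescript
// Hypothetical helper: pack blank-line-separated blocks into chunks of at most `limit`.
function splitOnBlankLines(text: string, limit = 4096): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const block of text.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${block}` : block;
    if (candidate.length <= limit) {
      current = candidate;
      continue;
    }
    // The block does not fit: flush what has been collected so far.
    if (current) chunks.push(current);
    // Fallback: hard-cut a single block that exceeds the cap on its own.
    let rest = block;
    while (rest.length > limit) {
      chunks.push(rest.slice(0, limit));
      rest = rest.slice(limit);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(splitOnBlankLines("job one\n\njob two\n\njob three", 18));
```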
- parseStartUrl treats present-but-empty params (?keyword=&region=900) as undefined, matching how absent params behave.
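The empty-parameter handling might be sketched as follows, using the standard `URL` API. The function name `readSearchParams` and the example URL are hypothetical, and treating whitespace-only values as empty is an additional assumption.

```typescript
// Hypothetical helper: collect query params, mapping empty values to undefined.
function readSearchParams(rawUrl: string): Record<string, string | undefined> {
  const out: Record<string, string | undefined> = {};
  new URL(rawUrl).searchParams.forEach((value, key) => {
    // Assumption: whitespace-only counts as empty, same as an absent param.
    out[key] = value.trim() === "" ? undefined : value;
  });
  return out;
}

const parsed = readSearchParams("https://www.willhaben.at/jobs/suche?keyword=&region=900");
console.log(parsed.keyword, parsed.region);
```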
- 51 new tests across
transform, searchTasks, notifications, mergeTaskResults, computePerTaskBudget, and a new live multi-URL regression suite that verifies emmat's reported scenario against the real Willhaben API.
- Total: 281 tests, all green (incl. 80 live tests).
Fixed — Multi-URL output now distributes across all startUrls
- With multiple
startUrls, the global maxResults cap was filled entirely from the
first URL's items because the merge preserved task-1-first insertion order. Users
with 13 startUrls and maxResults=5 saw only 5 results, all from URL 1, even though
the other 12 URLs ran successfully. Reported by user emmat.
- Results are now interleaved round-robin across tasks before the global cap, so
every URL contributes at least
floor(maxResults / tasks) items.
- Per-task pagination still fetches up to
maxResults rows so the merger has enough
data to interleave from. Cap is applied once globally after merge + filters.
- Log line for multi-task runs now shows per-task contributions:
T1=2 T2=2 T3=1 ….
- maxResults prefill in the input schema raised from 5 to 25 so casual UI runs
  with multiple startUrls see a fair sample from every URL by default.
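The round-robin merge described in this section can be sketched as below. The function name is hypothetical, and deduplication and filtering (which the entries say happen before the cap) are omitted to keep the example short.

```typescript
// Hypothetical helper: take item i from every task before item i+1 from any task.
function interleaveRoundRobin<T>(perTask: T[][]): T[] {
  const merged: T[] = [];
  const longest = Math.max(0, ...perTask.map((task) => task.length));
  for (let i = 0; i < longest; i++) {
    for (const task of perTask) {
      if (i < task.length) merged.push(task[i]);
    }
  }
  return merged;
}

// Three tasks, global cap of 5 applied once after the merge.
const capped = interleaveRoundRobin([
  ["a1", "a2", "a3"],
  ["b1"],
  ["c1", "c2"],
]).slice(0, 5);
console.log(capped.join(","));
```

Because the cap is applied only after interleaving, each task that has enough items contributes at least floor(maxResults / tasks) of them.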
Changed — stateKey is now optional
- When incremental mode is enabled,
stateKey is no longer required. If omitted, a
stable identifier is auto-generated from your search inputs so different searches
never share state — narrower runs no longer accidentally mark jobs from broader
runs as EXPIRED.
- Migration note: existing schedules that already pass an explicit
stateKey keep
their prior state intact. Schedules that previously errored ("stateKey is required")
will now succeed and start fresh state.
- startUrls now processes all URLs in the array. Previously only the first URL was used; subsequent URLs were silently dropped. Each URL becomes its own search task; results are merged and deduped by job ID across all tasks. Reported by user emmat.
- companyVatId: Austrian VAT/UID number for direct B2B outreach (~70% populated)
- companyActiveAdverts: Count of active job postings per employer (hiring-volume signal, 100% populated)
- salaryMin, salaryMax, salaryCurrency, salaryPeriod: Structured salary fields (parsed from API + description text)
- countryCode: ISO 3166-1 (always "AT")
- locations[]: Multi-location array {name, federalState, country}
- isFeatured: Promoted/topJob flag
- isFreshlyPosted: 24-48h freshness flag
- internalApplicationOnly: Apply via Willhaben vs external
- requiresExternalApplication: External application form required
- requiresProfessionalExperience: Professional experience required
- createdAt, lastReorderedAt: Separate from firstPublishDate/lastModifiedDate
- extractedEmails[]: Regex-extracted emails from description text
- extractedPhones[]: Defensive phone-number extraction (strict mode default; lenient available)
- startUrls[]: Paste raw Willhaben search URLs; query params merge with explicit input
- sortBy: publish_date_desc (newest) or relevance
- salaryMinFilter, salaryMaxFilter: Post-fetch salary range filter (EUR)
- whatAnd, whatExclude: Post-fetch keyword AND/NOT filter
- emitUnchanged, emitExpired: Incremental emission policy
- skipReposts: Drop reposts of previously expired jobs
- telegramToken, telegramChatId: Telegram notifications
- discordWebhookUrl: Discord notifications
- slackWebhookUrl: Slack notifications
- whatsappAccessToken, whatsappPhoneNumberId, whatsappTo: WhatsApp Cloud API (free-form, 24h service window)
- notificationLimit, notifyOnlyChanges: Notification controls
- phoneExtractionMode: strict (default) or lenient
- Full incremental classification —
changeType now correctly emits NEW, UPDATED, UNCHANGED, EXPIRED, REAPPEARED (uppercase). Previously only new/updated were generated despite README claims.
- Incremental fields populated —
firstSeenAt, lastSeenAt, previousSeenAt, expiredAt are now real fields on output records (not just README aspiration).
- State-lock on incremental runs — Concurrent runs sharing the same
stateKey now refuse with Actor.fail instead of silently corrupting state.
- Default memory — 128 MB → 256 MB; max 512 MB → 1024 MB.
- Salary structure: legacy
salary/salaryTimeFrame retained; new salaryMin/Max/Currency/Period are the canonical fields. salaryMax parsed from description text via salaryParser.
- Date timestamps now carry UTC
Z suffix (was naked CET/CEST).
- descriptionMarkdown now actually runs htmlToMarkdown (previously a passthrough of plain text).
- README claims about
firstSeenAt/lastSeenAt/emitUnchanged/UNCHANGED/REAPPEARED/EXPIRED/isRepost are now true.
- "Skill tags" → "language skills" in description.
- Now imports canonical
_lib/incrementalState.ts + _lib/stateLock.ts + _lib/notifications.ts + _lib/phoneExtractor.ts (no more hand-rolled simpleHash state).
- Added:
descriptionHtml, descriptionMarkdown output fields (triple-format descriptions for RAG/LLM pipelines)
- Added:
contentHash output field (stable hash of content-identifying fields, used for change detection)
- Added: cross-run repost detection (
isRepost, repostOfId, repostDetectedAt)
- Added:
skipReposts input to exclude detected reposts from output
- Initial release
- Search Austrian job listings on willhaben.at by keyword, location, and filters
- Salary, company profile, and contact info extraction
- Incremental mode with change detection
- Compact output mode for AI-agent and MCP workflows