Scrape willhaben.at — Austria's largest classifieds platform. Pull listings from any pasted search URL across every platform section with incremental change tracking that emits only new and updated items between runs.
jobSalaryPeriod — normalized period ("hour" / "day" / "week" / "month" / "year"), distinct from the raw jobSalaryTimeFrame Willhaben ships in German.
The pre-existing jobSalary / jobSalaryTimeFrame / jobSalaryText fields are preserved unchanged for backward compatibility.
Salary parser ships with bounds (defaults: min €100, max €10M) and a 5× ratio cap on description-parsed maxes, so phone numbers and VAT ids in descriptions can't leak into salary values. (Same defense as willhaben-scraper v0.2.6.)
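The bounds and ratio cap described above can be sketched roughly as follows. Only the default bounds (€100 and €10M) and the 5× ratio come from this changelog; the function name, the `SalaryBounds` shape, and the choice to null out a rejected value are assumptions made for illustration.

```typescript
// Hypothetical helper: names and return shape are illustrative, not the actor's real API.
interface SalaryBounds {
  min: number;      // lowest plausible salary value
  max: number;      // highest plausible salary value
  maxRatio: number; // largest allowed max/min ratio
}

const DEFAULT_BOUNDS: SalaryBounds = { min: 100, max: 10_000_000, maxRatio: 5 };

function sanitizeSalaryRange(
  low: number | null,
  high: number | null,
  bounds: SalaryBounds = DEFAULT_BOUNDS,
): { min: number | null; max: number | null } {
  const inBounds = (v: number | null): v is number =>
    v !== null && Number.isFinite(v) && v >= bounds.min && v <= bounds.max;

  const min = inBounds(low) ? low : null;
  let max = inBounds(high) ? high : null;

  // A parsed max far above the min is more likely a phone number or VAT id
  // than a salary, so it is dropped rather than trusted.
  if (min !== null && max !== null && (max < min || max > min * bounds.maxRatio)) {
    max = null;
  }
  return { min, max };
}
```

With these defaults, a description fragment such as "2500 ... 6641234" would keep the minimum and discard the second number, because it exceeds five times the minimum.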
Fixed
contactName preserved when detail's contact is empty (parity with willhaben-scraper v0.2.5). When Willhaben ships { email: ... } without firstname/lastname, prior jobContactName is no longer null-blanked.
0.3.2 — 2026-05-03
Fixed — Critical
SERP page-1 failure now propagates in both jobs and attribute (immo / autos / marktplatz) tasks instead of silently returning an empty universe. A single 5xx on page 1 in incremental mode would have marked every prior-state listing as EXPIRED (universe corruption); runs now fail fast so state is preserved.
Retry loop no longer self-cancels. fetchWithRetry now builds a fresh AbortSignal per attempt; the prior single-signal pattern caused the first timeout to abort every subsequent retry instantly. Network-level errors (ECONNRESET, ETIMEDOUT, EAI_AGAIN, fetch wrappers) are now retryable, not just retryable HTTP statuses.
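A minimal sketch of the per-attempt signal pattern, assuming a fetch-like callback is injected. The real fetchWithRetry signature is not shown in this changelog, so the parameter names, the status set, and the linear delay here are all hypothetical.

```typescript
// Assumed sets; the changelog names the error codes but not the exact status list.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "EAI_AGAIN"]);

function isRetryableError(err: unknown): boolean {
  const e = (err ?? {}) as { name?: string; code?: string; cause?: { code?: string } };
  // Our own timeout surfaces as an abort; treat it as retryable.
  if (e.name === "AbortError" || e.name === "TimeoutError") return true;
  // Fetch wrappers often put the network code on `cause` instead of the error itself.
  const code = e.code ?? e.cause?.code;
  return code !== undefined && RETRYABLE_CODES.has(code);
}

type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<{ status: number }>;

async function fetchWithRetry(
  url: string,
  doFetch: FetchLike,
  { attempts = 3, timeoutMs = 10_000, baseDelayMs = 500 } = {},
): Promise<{ status: number }> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    // New controller per attempt: an earlier timeout cannot abort this one.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await doFetch(url, { signal: controller.signal });
      if (!RETRYABLE_STATUSES.has(res.status)) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      if (!isRetryableError(err)) throw err;
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  throw lastError;
}
```

The key point is that the controller is created inside the loop. Sharing one already-aborted signal across attempts makes every later attempt reject immediately, which is the failure mode this release fixes.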
Hostname validation on searchUrl / startUrls. Pasted non-Willhaben URLs are now rejected with a clear error (searchUrl must be a willhaben.at URL) instead of silently parsing into a request that goes nowhere meaningful.
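The hostname check can be sketched as below. The error text matches the message quoted above; the function name and the decision to accept any willhaben.at subdomain are assumptions.

```typescript
// Hypothetical validator; the actor's real helper may differ.
function assertWillhabenUrl(raw: string, field = "searchUrl"): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    // Unparseable input gets the same clear message as a wrong host.
    throw new Error(`${field} must be a willhaben.at URL`);
  }
  const host = url.hostname.toLowerCase();
  // Exact match or a real subdomain; a suffix like "willhaben.at.evil.com" fails both tests.
  if (host !== "willhaben.at" && !host.endsWith(".willhaben.at")) {
    throw new Error(`${field} must be a willhaben.at URL`);
  }
  return url;
}
```

Comparing the parsed hostname, rather than searching the raw string for "willhaben.at", avoids accepting look-alike hosts or URLs that merely mention the domain in a query parameter.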
Fixed — Important
EXPIRED stubs no longer charged. Billing event now sizes by live items only (NEW/UPDATED/UNCHANGED/REAPPEARED); tombstones are operational data with no business value.
Run-footer headline split. Footer now reports X new/updated listings exported (live only) plus a separate 🪦 Y expired listings emitted as stubs line — no more "exported X new/updated" when half the dataset is tombstones.
buildExpiredStub is now typed. Adding a new OutputItem field would (correctly) cause a TS error rather than silently leak undefined into the dataset.
State-lock failures now throw a plain Error instead of using throw await Actor.fail(...), because the prior pattern interfered with the outer try/catch's releaseLock cleanup path.
immoAvailableDate now ISO-normalized, matching all other date fields.
validThrough and jobExpiryDate are now consistent. Hash input and output field now use the same normalized expiryDate so a presentation-only timestamp drift can't trigger a spurious UPDATED.
State-key auto-fingerprint is section-aware. For non-jobs runs, jobs-* filters no longer enter the fingerprint; for non-immo runs, immoSubTypes doesn't enter; etc. Prevents accidental state-sharing across unrelated universes.
Fixed — Nice-to-have
Default memory raised from 256 MB to 512 MB; max raised to 2048 MB. The 256 MB default OOMed on maxResults: 0 runs over large universes.
Dead applyExtraction removed.
Tests
7 new tests covering emission-policy toggles and full classification round-trip (prior-active missing → EXPIRED, firstSeenAt preservation, REAPPEARED detection). Total: 157 unit/integration tests, all green.
0.3.1 — 2026-05-03
Added — Emission policy controls (parity with willhaben-scraper)
emitUnchanged — when enabled, incremental runs also emit listings
classified as UNCHANGED (no tracked content drift since last run). Default
false — most pipelines only want NEW / UPDATED / REAPPEARED.
emitExpired — when enabled, listings tracked in the prior state but
missing from the current run are emitted as EXPIRED stubs (timestamps +
listingId only, no live data). Notifications skip these stubs (no title /
apply link to render). Default false.
Closes the schema drift between this actor and willhaben-scraper, which
has had these toggles since v0.2.0.
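How the two toggles interact with classification can be sketched like this. The status names and toggle names come from this changelog; the state shape, the function names, and the exact REAPPEARED rule (a listing that was marked expired and is present again) are assumptions.

```typescript
type Status = "NEW" | "UPDATED" | "UNCHANGED" | "REAPPEARED" | "EXPIRED";

// Hypothetical prior-state record; the actor's stored state may hold more fields.
interface PriorEntry { contentHash: string; expired: boolean }
interface EmissionPolicy { emitUnchanged: boolean; emitExpired: boolean }

function classify(prior: PriorEntry | undefined, currentHash: string): Status {
  if (!prior) return "NEW";
  if (prior.expired) return "REAPPEARED";
  return prior.contentHash === currentHash ? "UNCHANGED" : "UPDATED";
}

function planEmission(
  prior: Map<string, PriorEntry>,
  current: Map<string, string>, // listingId -> contentHash
  policy: EmissionPolicy,
): Array<{ listingId: string; status: Status }> {
  const out: Array<{ listingId: string; status: Status }> = [];
  current.forEach((hash, listingId) => {
    const status = classify(prior.get(listingId), hash);
    // UNCHANGED is the only live status gated by a toggle.
    if (status !== "UNCHANGED" || policy.emitUnchanged) out.push({ listingId, status });
  });
  prior.forEach((entry, listingId) => {
    // Previously active listings missing from this run become EXPIRED stubs.
    if (!current.has(listingId) && !entry.expired && policy.emitExpired) {
      out.push({ listingId, status: "EXPIRED" });
    }
  });
  return out;
}
```

With both toggles left at their default of false, only NEW, UPDATED, and REAPPEARED entries reach the output.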
0.3.0 — 2026-05-03
Added — Multi-URL startUrls input
New startUrls input accepts an array of willhaben.at search URLs. Each URL
becomes its own search task; results are merged round-robin across tasks
before the global maxResults cap so every URL contributes to the output.
Without this, a small maxResults would be filled entirely from URL #1's
items even when the other URLs were also fetched (the bug user emmat
reported against the jobs scraper).
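A rough sketch of the round-robin merge, assuming each task yields an ordered array of items. The test list in this release mentions a mergeTaskResults function, but its signature is not shown, so the name and parameters below are hypothetical, as is treating maxResults of 0 as "no cap".

```typescript
// Hypothetical merge: take one item from each task in turn until the cap is hit.
function mergeRoundRobin<T>(tasks: T[][], maxResults: number): T[] {
  const cap = maxResults > 0 ? maxResults : Infinity; // assumption: 0 means unlimited
  const merged: T[] = [];
  const longest = Math.max(0, ...tasks.map((t) => t.length));

  for (let i = 0; i < longest && merged.length < cap; i++) {
    for (const task of tasks) {
      if (i < task.length) {
        merged.push(task[i]);
        if (merged.length >= cap) break;
      }
    }
  }
  return merged;
}

// Three URLs with 3, 1, and 2 results, capped at 4:
const sample = mergeRoundRobin([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]], 4);
console.log(sample); // [ 'a1', 'b1', 'c1', 'a2' ]
```

A plain concatenation with the same cap would have returned only items from the first URL plus one from the second, which is the behavior the report described.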
All URLs in a single run must target the same section (jobs / immobilien /
autos / marktplatz). Mixed-section runs are rejected with a clear error
asking the user to split into separate runs.
Identical URLs are deduped by canonical form (sorted query params, trimmed
trailing slash) so an accidentally-pasted duplicate doesn't fire two
identical SERP pipelines.
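Canonical-form deduplication might look like the sketch below. Sorting query params and trimming the trailing slash follow the description above; dropping the URL fragment and the function names are assumptions.

```typescript
// Hypothetical canonicalizer for comparing pasted search URLs.
function canonicalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.searchParams.sort(); // stable sort by parameter name
  const path = url.pathname.length > 1 ? url.pathname.replace(/\/+$/, "") : url.pathname;
  const query = url.searchParams.toString();
  // The fragment is intentionally left out (assumption: it never affects the search).
  return `${url.origin}${path}${query ? `?${query}` : ""}`;
}

function dedupeUrls(urls: string[]): string[] {
  const seen = new Set<string>();
  return urls.filter((u) => {
    const key = canonicalizeUrl(u);
    if (seen.has(key)) return false;
    seen.add(key);
    return true; // keep the first occurrence in its original form
  });
}
```

Under this sketch, `https://www.willhaben.at/jobs/suche/?b=2&a=1` and `https://www.willhaben.at/jobs/suche?a=1&b=2` map to the same key, so only the first one starts a search task.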
The single-string searchUrl field still works (treated as a one-entry
startUrls array) for backward compatibility with existing schedules.
Fixed — Structural (parity with willhaben-scraper v0.2.4)
Jobs URL filters now extracted from URL. A pasted /jobs/suche URL with
?location=Wien&region=900&employment_type=vollzeit previously dropped
these params silently into extraParams (which the jobs API client doesn't
read). They're now mapped onto the typed jobLocation / jobRegion /
jobEmploymentMode fields so per-URL job filters actually take effect.
Vienna-local timestamps now carry the correct timezone offset. Naked
firstPublishDate / PUBLISHED_String / expiryDate from the API used to
pass through without normalization, mis-representing Vienna wall-clock
values. Now appends +01:00 or +02:00 based on Europe/Vienna DST for
the date in question.
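One way to derive the offset for a naive Vienna timestamp is to ask the runtime's time zone database through Intl. This is a sketch under the assumption that naive values look like `YYYY-MM-DDTHH:mm:ss`; the function names are hypothetical and values that already carry an offset are passed through untouched.

```typescript
// Offset of Europe/Vienna from UTC, in minutes, at a given instant.
function viennaOffsetMinutes(instantMs: number): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "Europe/Vienna", hour12: false,
    year: "numeric", month: "2-digit", day: "2-digit",
    hour: "2-digit", minute: "2-digit", second: "2-digit",
  }).formatToParts(new Date(instantMs));
  const get = (type: string): number => Number(parts.find((p) => p.type === type)?.value);
  // Some engines print midnight as hour "24"; fold it back to 0.
  const wallAsUtc = Date.UTC(get("year"), get("month") - 1, get("day"), get("hour") % 24, get("minute"), get("second"));
  return Math.round((wallAsUtc - instantMs) / 60_000);
}

function withViennaOffset(naive: string): string {
  const m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})(?::(\d{2}))?$/.exec(naive);
  if (!m) return naive; // already has an offset, or is not in the assumed format
  const localAsUtc = Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +(m[6] ?? 0));
  // First guess the offset, then re-evaluate it at the resulting instant.
  const guess = viennaOffsetMinutes(localAsUtc);
  const offset = viennaOffsetMinutes(localAsUtc - guess * 60_000);
  const sign = offset < 0 ? "-" : "+";
  const abs = Math.abs(offset);
  const hh = String(Math.floor(abs / 60)).padStart(2, "0");
  const mm = String(abs % 60).padStart(2, "0");
  return `${naive}${sign}${hh}:${mm}`;
}
```

For a January date this appends +01:00 and for a July date +02:00. Wall-clock times inside the autumn changeover hour are ambiguous, and this sketch does not attempt to resolve them in any particular direction.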
EXPIRED stub items no longer sent to notifications. Per-job stubs only
carry timestamps + listingId — Telegram/Discord/Slack would have rendered
them as (untitled) with no link. Still appear in the dataset; only
notification rendering is skipped.
Telegram messages over 4096 chars now split on blank-line (entry-block)
boundaries, falling back to hard-cut only when a single entry exceeds the cap.
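The splitting strategy can be sketched as follows, assuming entries are separated by blank lines. The function name is hypothetical, and JavaScript string length (UTF-16 code units) is used as an approximation of Telegram's character count.

```typescript
const TELEGRAM_LIMIT = 4096;

// Hypothetical splitter: pack whole entry blocks into chunks, hard-cut only oversized blocks.
function splitMessage(text: string, limit: number = TELEGRAM_LIMIT): string[] {
  const chunks: string[] = [];
  let current = "";

  for (const block of text.split(/\n\s*\n/)) {
    if (block.length > limit) {
      // A single entry exceeds the cap: flush what we have, then cut it into pieces.
      if (current) {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < block.length; i += limit) {
        chunks.push(block.slice(i, i + limit));
      }
      continue;
    }
    const candidate = current ? `${current}\n\n${block}` : block;
    if (candidate.length > limit) {
      chunks.push(current); // current is non-empty here, since block alone fits
      current = block;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each returned chunk is sent as its own message, so an entry is never torn in half unless it is longer than the limit by itself.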
HTTP 429 is now retryable (was previously bubbled directly to the caller).
Retry delays now include ±20% jitter to break up thundering-herd retries
when many parallel SERP pages hit the same upstream 5xx.
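The ±20% jitter amounts to scaling each delay by a random factor between 0.8 and 1.2. In this sketch the function name and the injectable random source are assumptions added to make the behavior testable.

```typescript
// Hypothetical helper: returns baseMs scaled by a uniform factor in [0.8, 1.2).
function jitteredDelay(baseMs: number, random: () => number = Math.random): number {
  const factor = 0.8 + random() * 0.4;
  return Math.round(baseMs * factor);
}
```

Because each parallel page draws its own factor, retries that failed at the same moment no longer hit the upstream again at the same moment.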
Actor.charge failures log at debug level instead of being silently
swallowed — systematic billing breakage is now observable in dev/CI logs.
Empty/whitespace stateKey is now coerced to null so the
auto-fingerprint fallback fires when the field is left blank in the UI.
Repost detection is now O(1) per current item via a pre-built
content-hash → expired-prior-entry index. Previously it ran a linear scan
over the full prior state for every item.
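The index-based lookup can be sketched as below. The record shape and function names are hypothetical; the idea taken from the entry above is only that expired prior entries are keyed by content hash once, then each current item does a single map lookup.

```typescript
// Hypothetical prior-state record.
interface PriorEntry { listingId: string; contentHash: string; expired: boolean }

// Built once per run: O(n) over the prior state.
function buildExpiredIndex(prior: PriorEntry[]): Map<string, PriorEntry> {
  const index = new Map<string, PriorEntry>();
  for (const entry of prior) {
    // Keep the first expired entry per hash (assumption: any match is enough to flag a repost).
    if (entry.expired && !index.has(entry.contentHash)) {
      index.set(entry.contentHash, entry);
    }
  }
  return index;
}

// O(1) per current item: same content under a different listingId counts as a repost.
function findRepostOf(
  item: { listingId: string; contentHash: string },
  index: Map<string, PriorEntry>,
): PriorEntry | undefined {
  const match = index.get(item.contentHash);
  return match && match.listingId !== item.listingId ? match : undefined;
}
```

For a run with m current items against n prior entries, this replaces roughly m × n comparisons with n insertions plus m lookups.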
Tests
30 new tests (mergeTaskResults, computePerTaskBudget, searchTasks).
Total: 150 tests, all green.
0.2.1 — 2026-05-01
Changed — stateKey is now optional
When incremental mode is enabled, stateKey is no longer required. If omitted, a
stable identifier is auto-generated from your search inputs (section, query,
location, sub-type, price and job filters, plus any params from a pasted searchUrl)
so different searches never share state.
Migration note: existing schedules with an explicit stateKey keep their prior
state intact. Schedules that previously errored ("stateKey is required") will now
succeed and start fresh state.
0.1.x — 2026-04-14
Added: descriptionHtml, descriptionMarkdown output fields (triple-format descriptions for RAG/LLM pipelines)
Added: contentHash output field (stable hash of content-identifying fields, used for change detection)