- Locale-aware price parser. Brazilian/European thousands separator (`.`) was being parsed as a decimal (`R$2.800` → 2.8 instead of 2800). New parser handles US (`$1,234.56`), EU (`€1.234,56`), BRL (`R$2.800`), JPY (`¥29,000`), and multi-grouped (`1.234.567`) formats with 21/21 unit tests passing.
- `startUrls` hard failure. When every user-provided `startUrls` entry is invalid or non-Craigslist, the actor now fails fast with an actionable status message instead of silently falling through to scrape the default `city: 'newyork'`.
- Greatly improved typed attribute extraction. Cars+trucks now parse `year`, `make`, `model`, `make_model`, `odometer` (numeric), `fuel`, `transmission`, `drive`, `cylinders`, `title_status`, `body_type`, `vin`, `condition`, `paint_color`. Housing now parses `bedrooms`, `bathrooms`, `sqft` (numeric, from the `2BR / 1Ba - 700ft²` block and from the JSON-LD `Apartment` schema), plus `pets_allowed`, `smoking_allowed`, `cats_ok`, `dogs_ok`, `wheelchaccess`, `airconditioning`, `ev_charging`, `rent_period`, `housing_type`, `laundry`, `parking`, `available`, `furnished`. Jobs continue to expose `compensation`, `employment_type`, `job_title`.
- Address now prefers the richer JSON-LD `Apartment.address` (street, city, region, postal, country) over the bare `.mapaddress` text, and strips Craigslist's literal `(google map)` placeholder.
- Geo falls back to JSON-LD `latitude`/`longitude` when the `#map` div is absent.
- Stress-tested across 8 categories × 3 cities.
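
The separator handling described in the price-parser entry above can be sketched roughly as follows. This is an illustrative Python sketch, not the actor's actual parser: the function name `parse_price` and the "exactly three trailing digits means grouping" heuristic are assumptions chosen to reproduce the formats listed in that entry.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Parse a price string that may use '.' or ',' as its grouping separator.

    Hypothetical helper; the heuristics below are assumptions, not the
    actor's real rules.
    """
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    # Drop trailing punctuation such as the full stop in "costs $5."
    number = match.group(0).rstrip(".,")

    has_dot, has_comma = "." in number, "," in number
    if has_dot and has_comma:
        # Both present: whichever appears last is the decimal separator.
        decimal_sep = "." if number.rfind(".") > number.rfind(",") else ","
        group_sep = "," if decimal_sep == "." else "."
        number = number.replace(group_sep, "").replace(decimal_sep, ".")
    elif has_dot or has_comma:
        sep = "." if has_dot else ","
        head, _, tail = number.rpartition(sep)
        if number.count(sep) > 1 or len(tail) == 3:
            # Repeated separator, or exactly three trailing digits:
            # treat it as a thousands separator (R$2.800, 1.234.567).
            number = number.replace(sep, "")
        else:
            # One separator followed by one or two digits: decimal (12,50).
            number = f"{head}.{tail}"
    return Decimal(number)
```

Under these assumptions `parse_price("R$2.800")` yields `Decimal("2800")` rather than 2.8. The trade-off is that a genuine three-decimal value such as `1.234` would also be read as 1234, which seems acceptable for listing prices but would not be for arbitrary numbers.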
- Initial release.
- All 8 Craigslist categories (jobs, housing, for-sale, services, community, gigs, resumes, events) with typed per-category attribute extraction (jobs: compensation+employmentType+role, housing: bedrooms+bathrooms+sqft+laundry+parking, cars: year+make+model+odometer+fuel+transmission+title status, etc.).
- Multi-city auto-fanout via `topUSMetros` (top 25 US metros) and `topGlobalMetros` (top 50 worldwide) presets, or an arbitrary `cities` list.
- Cross-city repost deduplication via post-body fingerprint (SHA-256 over normalised title + body + price + first image hostname).
- Two-stage scraping: search-result cards (cheap `listing-scraped` event @ $0.005) and optional full post detail (`listing-detailed` event @ $0.008) for description, full image gallery, attributes, contact reply URL.
- Optional public contact extraction (`extractContacts`): emails/phones present in the post body or reply URL only; no logged-in scraping, no reply-form bypass.
- Residential proxy + Crawlee session pool (max 30 uses/session, retire on 403/429), circuit breaker at >50% error rate over rolling 20 requests.
- MCP-ready: 5-part tool description, 4-sentence input field descriptions, flat ≤500-token dataset items, `_warnings` and `_summary` sentinel records on partial success.
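
The repost fingerprint from the deduplication entry above (SHA-256 over normalised title + body + price + first image hostname) could look something like the sketch below. The entry does not spell out the normalisation, so the lowercasing, whitespace collapsing, field separator, and the names `repost_fingerprint` and `_normalise` are all assumptions made for illustration.

```python
import hashlib
import re
from typing import Optional
from urllib.parse import urlparse


def _normalise(text: str) -> str:
    # Assumed normalisation: collapse whitespace runs, trim, lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def repost_fingerprint(
    title: str,
    body: str,
    price: Optional[str],
    first_image_url: Optional[str],
) -> str:
    """Return a hex SHA-256 digest identifying a post across cities.

    Hypothetical helper: only the hostname of the first image is hashed,
    so differing image paths do not change the fingerprint.
    """
    host = ""
    if first_image_url:
        host = urlparse(first_image_url).hostname or ""
    parts = [_normalise(title), _normalise(body), _normalise(price or ""), host]
    # A unit-separator character keeps field boundaries unambiguous.
    payload = "\x1f".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two copies of the same post in different cities then hash to the same value even when casing or whitespace differs, so the second can be dropped by checking the digest against a set of fingerprints already seen.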
v0.4 (2026-05-22) — Soft-fail on invalid input: convert Actor.fail on bad startUrls / no cities to Actor.exit with WARNING status. User-input errors no longer kill the success-rate metric.
v0.5 (2026-06-08) — Soft-fail on empty-result-with-errors: the end-of-run guard no longer calls Actor.fail() when 0 listings are scraped after request errors. A misspelled free-text city subdomain (dead host → proxy 5xx) or a transient Craigslist block now exits cleanly with an actionable WARNING status instead of a FAILED run. Removes the last input/transient cause of the 30-day failure rate; honest data-collection failures are still surfaced in the status message + RUN_SUMMARY.
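
As a rough illustration of the v0.5 change, the end-of-run decision might be modelled like this. It is a sketch under assumptions: `RunOutcome`, `end_of_run_outcome`, the `soft_fail` flag, and the message wording are hypothetical and exist only to contrast the old and new behaviour; the real guard lives in the actor and calls the Apify SDK directly.

```python
from typing import NamedTuple


class RunOutcome(NamedTuple):
    fail_run: bool       # True corresponds to ending the run as FAILED
    status_message: str  # text surfaced as the run's status message


def end_of_run_outcome(
    listings_scraped: int,
    request_errors: int,
    soft_fail: bool = True,
) -> RunOutcome:
    """Decide how a run should end (hypothetical model of the guard).

    soft_fail=True mirrors the v0.5 behaviour described above;
    soft_fail=False mirrors the earlier hard-fail guard.
    """
    if listings_scraped == 0 and request_errors > 0:
        message = (
            f"WARNING: 0 listings scraped; {request_errors} request(s) failed. "
            "Check the city subdomain spelling or retry later."
        )
        # Before v0.5 this case ended the run as FAILED.
        return RunOutcome(fail_run=not soft_fail, status_message=message)
    return RunOutcome(
        fail_run=False,
        status_message=(
            f"Scraped {listings_scraped} listing(s) "
            f"with {request_errors} request error(s)."
        ),
    )
```

With the default `soft_fail=True`, a run that hits only dead hosts ends cleanly with the warning message instead of counting as a failed run, while the error count stays visible in the status text.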