Redfin property data enriched with 45 government data fields from 14 free APIs — Census demographics, FEMA flood zones, FBI crime, Walk Score, HUD rent benchmarks, and more. USPS-normalized, confidence-scored, analysis-ready. From $4/1K properties.
Multi-source expansion from 2 scrapers to 10, government data enrichment from 4 APIs to 16,
and 40+ bug fixes across 6 audit rounds. Test suite grew from 625 to 1797.
Census ACS enrichment — tract-level demographics from api.census.gov (free, no key):
median_household_income, median_home_value, vacancy_rate, population, median_age,
owner_occupied_pct, renter_occupied_pct. Batch-optimized: one API call per county.
FBI Crime enrichment — state-level UCR data from api.usa.gov/crime/fbi/cde:
violent_crime_rate, property_crime_rate, crime_data_year. Per-offense independent
fallback when one rate fails.
Enrichment framework — src/enrichment/ with BaseEnrichment ABC, cache/batch
pattern, _validate_cached() hook, and _no_cache flag to prevent partial-data
cache poisoning.
GovernmentData model — includeGovernmentData input flag, formatter exclusions,
consistent key presence across all records (two-pass formatter).
Decodo (Smartproxy) proxy support — residential proxy via PROXY_HOST/PORT/USER/PASS
env vars for Realtor.com scraping. Falls back gracefully when vars not set.
address_undisclosed flag on Address model — numeric-only streets and "Undisclosed"
URLs flagged to prevent garbage canonical_key.
monthly_rent field on PriceInfo for rental listings (separate from sale price).
Listing type to source routing: for_rent removes Redfin and adds Zumper;
for_sale/sold removes Zumper.
Stale-while-error cache fallback — CacheManager.get_stale() serves expired entries
with warning when enrichment APIs fail.
days_on_market computed for sold records from price_history (NAR DOM definition).
Foreclosure.com scraper — distressed property listings via JSON-LD CollectionPage
extraction. 10s crawl delay per robots.txt. New fields: auction_date,
foreclosure_status, estimated_market_value. 29 tests.
Realtor.com Kasada bypass — replaced curl_cffi (~7% success) with Camoufox
(headless) + Patchright fallback. Session warming, sticky proxy, normal distribution
timing, __NEXT_DATA__ extraction via page.evaluate(). 43 tests.
Zillow search scraper — curl_cffi PUT API with Chrome124 TLS impersonation.
Map-splitting subdivides bounding box into quadrants when results hit 500 cap
(recurses to depth 5). Zillow-specific fields: zestimate, rent_zestimate,
broker_name. 56 tests.
Government REO scraper — HUD ArcGIS API, Freddie Mac HomeSteps JSON-LD, USDA
AJAX. Bulkhead isolation per agency. New fields: reo_agency, reo_case_number.
52 tests.
FSBO scraper — FSBO.com JSON API + ForSaleByOwner.com JSON API. Seller PII
opt-in via includeSellerContact flag (CCPA compliance). 61 tests.
Tax Sale scraper (Bid4Assets) — HTML scraping with Kendo UI pagination. New
fields: auction_opening_bid, lien_amount, parcel_id, sale_type.
Land Bank scraper (Detroit) — Socrata SODA API with ArcGIS Feature Service
fallback. New field: property_condition. Template design for other cities.
New Construction scraper — KB Home (embedded regionMapData JSON) +
NewHomeSource (HTML data-* attributes). New fields: builder_name,
community_name, completion_date, base_price.
10 active scrapers total: Redfin, Realtor.com, Zillow, Zumper, Foreclosure.com,
Government REO, FSBO, Tax Sale, Land Bank, New Construction.
Phase 6: Enrichment expansion (6A-6B)
FHFA House Price Index enrichment — 12-month % change via bulk CSV download.
Uses MSA/CBSA codes from Geocodio (ZIP5/County paths removed after verification
against real data). New field: hpi_12mo_change.
IRS Statistics of Income enrichment — estimated median AGI by ZIP (bracket
interpolation) + county migration net flow. New fields: zip_median_agi,
migration_net_flow.
FEMA NFHL flood zone enrichment — ArcGIS REST API lookup. Free, no API key.
CFPB HMDA enrichment — mortgage approval rate + mean loan amount via
aggregation API. Uses mean (not median) due to API limitations. Free, no API key.
Bulk data infrastructure — BulkDataManager for downloading/caching/parsing
large CSV files with freshness checks (30d FHFA, 365d IRS) and stale-while-error
fallback.
cbsa_code field on UnifiedProperty (from Geocodio, used for FHFA HPI lookup).
GovernmentData expanded from 23 to 36 fields across 16 enrichment APIs (Geocodio,
Placekey, Census ACS, Walk Score, HUD FMR, FEMA Disasters, FBI Crime, FHFA HPI,
IRS SOI, FEMA NFHL, CFPB HMDA, BLS QCEW, BLS OES, FRED, EPA Brownfields, Parcl Labs).
BLS QCEW enrichment — county-level total employment + average weekly wage via
Open Data API (CSV slices). No auth required. Quarterly data with ~6 month lag.
New fields: qcew_total_employment, qcew_avg_weekly_wage.
BLS OES enrichment — metro-level median hourly wage via Public Data API v2
(25-char series ID, datatype 13 = annual median / 2080). Requires BLS_API_KEY
(free, 500 queries/day). New field: oes_median_hourly_wage.
FRED mortgage rate enrichment — national 30-year fixed rate from MORTGAGE30US
series. Weekly Thursday updates. Value is STRING, "." = missing. Requires
FRED_API_KEY (free, 120 req/min). New field: mortgage_rate_30yr.
EPA Brownfields enrichment — Superfund/SEMS site proximity via FRS REST API.
1-mile radius search, boolean result. No auth required. New field:
brownfield_within_1mi.
Parcl Labs price feed enrichment — ZIP-level price per sqft via two-step
search (free) + price feed (1 credit). Requires PARCLLABS_API_KEY (free 1K
credits/month). New field: parcl_price_feed.
Changed
Tests consolidated with @pytest.mark.parametrize — same coverage, fewer
repetitive test functions.
All datetime.now() calls replaced with datetime.now(timezone.utc) at 7
locations (ISO 8601 compliant with +00:00 suffix).
garage field changed from string to integer ("2-car attached" -> 2, "Yes" -> 1).
tax_amount removed from TaxHistoryEntry model, schema, and output (Redfin
does not provide tax amounts — was always null).
stories field extraction improved: publicRecordsInfo fallback for REcolorado MLS,
amenity key matching, validated <= 104 (tallest US building).
Source-aware dead field exclusions in formatter: Zillow/foreclosure/rental fields
omitted when source not active (fixes shallow-copy mutation bug).
run_summary.json enriched with timestamps, duration_seconds,
pipeline_version (git hash), and per-field fill_rates.
FHFA HPI rewritten from broken ZIP5/County paths to working MSA/CBSA lookup
after verification against real government CSV data.
Census ACS vintage updated from 2023 to 2024 (available since Jan 2026).
Placekey quota tracker corrected from 100K/month to 10K/day (actual free tier).
False-positive validation warnings on 100% of records — enrichment warnings fired
before geo enrichment populated fields. Added refresh_enrichment_warnings() called
after enrichment.
Dead fields stories, garage, date_listed excluded from output when always null
(Redfin-only fields added to _DEAD_FIELD_EXCLUSIONS).
rent_estimate leaking into field_confidence on both multi-source and single-source
paths.
PH unit designator embedded in street name — added _extract_embedded_unit() for
USPS Pub 28 Table 2 designators, _clean_unit_hyphens() strips artifacts.
lot_sqft=1 on condos (Redfin data error) — soft validation nullifies lot_sqft < 10
on non-land properties.
date_listed returned oldest listing event (20+ years ago) instead of most recent.
price.listing on sold records used oldest listing price instead of last ask before
sale.
Stale listing price on re-listed sold properties — cycle-break detection identifies
prior transaction boundaries in price_history. Refined: duplicate Sold events from
MLS+Public Records on adjacent dates are not cycle breaks; only intervening Sold with
different price indicates prior cycle.
price_change garbage (-99%, +21,328%) across sale/rental context boundaries — context
propagation walks chronologically to inherit sale/rental context. "Rental Removed"
correctly switches back to sale context.
Building-level bulk sale transactions filtered from condo price_history (county public
records recording $12.7M against 900 sqft units). Recalculates price_change after
filtering.
All records now have identical top-level keys (null/[] instead of missing keys).
Price history events deduplicated (same date+event+price -> keep one).
Parking fallback capped at 10 to exclude building-level counts; parking-only entries
without "garage" in name serve as fallback for garage count.
HUD FMR: API requires entity IDs ({fips}99999), not raw ZIP codes. Fixed for all
markets.
5-session pre-launch audit. 33 issues resolved, 625 tests, 80 records verified across
5 US markets (Austin, Denver, Seattle, Phoenix, Kalispell).
Fixed
Pre-audit (QA sessions 1-4)
CRITICAL: Historical sold data leaking onto for_sale listings — price.sold and
listing.date_sold were populated on active for_sale records with historical sale data
from Redfin's GIS response. Now only populated when listing.status == "sold". (#26)
CRITICAL: Dedup Layer 4 merging distinct condo units — Placekey dedup merged all
units in a building into one record (54% data loss in Denver). Added unit-discrimination
guard to Layers 1 and 4. (#13)
CRITICAL: Census tract and FIPS code 0% populated — Geocodio parser hard-coded year
"2020" but API returns "2025". Now uses dynamic year key. Also switched from full_fips
(15-digit) to county_fips (5-digit). (#1, #2, #14, #17)
CRITICAL: Placekey enrichment 0% populated — load_dotenv() was missing from entry
points, so API keys were not loaded for local runs. (#3)
CRITICAL: for_rent silently returned for_sale data — Redfin's search API does not
support rental listings. Now returns empty results with a clear warning instead of wrong
data. (#19)
CRITICAL: Sold listings missing price.sold and date_sold — Parser now extracts
soldDate (epoch ms) and maps price.value to sold_price for sold listings. (#20)
Nullable fields (placekey, census_tract, fips_code, address.unit) now serialized as
null instead of being omitted from JSON output. (#27)
Placekey API query now includes unit number in street_address for unit-level Placekeys
on condos. (#15)
Poisoned null values cleared from geocoding and Placekey caches. (#16)
County enrichment: address.county now populated from Geocodio (was always null). (#37)
Spatial mismatch detection: skip census/FIPS enrichment when Geocodio location differs
from source coordinates by >2km, preventing wrong-county FIPS assignment. (#38)
Session A: Schema integrity
Dead fields removed from customer output: price_history, tax_history,
rent_estimate, rent_estimate_source. Excluded via _DEAD_FIELD_EXCLUSIONS in
formatter — internal model fields preserved for Phase 3 re-enablement. (#35, #36)
CSV list/dict fields now serialize as JSON strings (was Python repr()). (#29)
CSV output now delivered to customer via Apify KV store (was generated but not
accessible). (#30)
Metadata field_fill_rates now includes enrichment fields. (#31)
Placekey failures surfaced in metadata warnings. (#32)
Session D: Dedup hardening & security
Layer 3 proximity dedup now checks street address similarity (token_sort_ratio > 0.7)
before merging — prevents false merge of different addresses within 50m. (#28)
httpx logger set to WARNING to suppress API keys from URL logs. (#34)
Added
17 regression tests in test_output_invariants.py covering all known failure modes:
for_sale/sold invariant, FIPS format, nullable fields, USPS unit compliance, dedup
false merge prevention, condo unit preservation, dead field exclusion. (#39)
Schema validation test: output validated against .actor/dataset_schema.json.
74 new tests across Sessions A-D (551 → 625 total).
4-session QA audit trail: QA-TRACKER.md, TEST-RESULTS-*.md (348 records tested
across 12 US markets covering for_sale, sold, for_rent, boundary, and error cases).
python-dotenv dependency for local .env file loading.
Changed
README updated with real output examples, honest Realtor.com limitations, rate limit
guidance, and canonical_key documentation.
FIPS code format changed from 15-digit full_fips to 5-digit county_fips.