Houzz Pro Scraper
Under maintenancePricing
from $3.00 / 1,000 results
Houzz Pro Scraper
Under maintenanceScrapes Houzz professional listings into a clean lead dataset (name, phone, address; optional rating/reviews/website/license). Two-tier crawl. Pay-Per-Result: priced per record (use maxResults to cap spend).
Pricing
from $3.00 / 1,000 results
Rating
0.0
(0)
Developer
Nathan Black
Maintained by CommunityActor stats
0
Bookmarked
2
Total users
1
Monthly active users
3 days ago
Last modified
Categories
Share
An Apify actor that turns the Houzz professional directory into a clean, deduplicated lead dataset: business name, phone, and address for every listed pro, plus optional rating, full review text, website, license, services and cost estimates. Built to be reliable, current, and honestly specified — every field rate below is measured from a real run, not claimed.
Two crawl tiers:
- Tier 1 (listing) — one request returns ~15–20 pros from the page's structured JSON-LD. This is the base lead record (name / phone / address).
- Tier 2 (enrichment) — opt-in (
enrich: true); fetches each pro's profile to add rating, reviews, website, license, services, cost estimate and badges.
Output fields & measured fill rates
Honesty note. Nulls are kept in the output — a null is an honest signal that Houzz didn't expose that field for that pro, not a dropped field. Fill rates below come from the real runs described under Coverage & caveats. Your numbers will vary by category and location.
Tier 1 — base record
Measured on 500 records, general-contractor + architects across Austin/Dallas/Houston (TX) plus a few IL/NC, 2026-06-22.
| Field | Fill | Notes |
|---|---|---|
profile_id | 100% | The pf~<id>; stable dedup key. |
profile_url | 100% | Canonical Houzz profile URL (from JSON-LD sameAs). |
name | 100% | Business name. |
telephone | 98% | The field that makes this a lead list. |
locality (city) | 99% | |
region (state) | 98% | |
postal_code | 91% | |
country | 100% | |
street_address | 72% | Many pros list city/region only, no street. |
latitude / longitude | 80% | Geo present for ~4 of 5 pros. |
image | 100% | Primary photo URL. |
area_served | 100% |
Dedup verified clean: 500 unique profile_ids out of 500 pushed (546 overlapping/
sponsored duplicates were dropped before push).
Tier 2 — enrichment (enrich: true)
Measured on 20 enriched records, general-contractor / Austin TX, full review history, 2026-06-22.
| Field | Fill | Notes |
|---|---|---|
full_address | 100% | Full street address from the profile. |
about | 100% | Profile description. |
services | 100% | Service list. |
rating | 90% | Star rating (×10 on Houzz, normalized to e.g. 5.0). |
rating_count | 90% | Number of reviews behind the rating. |
reviews | 90% | Full review objects (see below). |
sub_ratings | 85% | Quality / responsiveness / value aspect ratings. |
website | 85% | The pro's own site (rawDomain); junk http:// dropped. |
badges | 85% | Houzz merit / standout badges. |
cost_estimate | 75% | Typical project cost string. |
awards | 40% | Best-of-Houzz etc. |
estimate_details | 35% | |
license_number / license_status / license_verified_status | 0% (in this sample) | Texas issues no state general-contractor license, so this is a true null, not a parse miss. The parser extracts license number/status where Houzz exposes it; fill in a licensed trade/state has not been separately measured — don't advertise it as guaranteed. |
The 10% of pros with no rating/reviews are simply new pros with zero reviews — an
honest null, not a gap.
Review objects (reviews[], measured over 407 real reviews):
body, rating, author, created all 100% filled; review_id, relationship,
project_date, modified, num_likes, status present; project_price only ~21%
(Houzz's "Choose one" placeholder is dropped to null). Full review history is fetched by
default in one request per pro (up to Houzz's 1024-review cap); use maxReviews to
cap it.
See samples/sample_dataset.json (20 real enriched records) and samples/sample_record_enriched.json (one record, trimmed to 2 reviews, for a quick look).
Coverage & caveats
- Fill rates are sample-specific. The Tier-1 numbers are TX-heavy: with a
maxResultscap, the run fills from the first seeds, so Austin/Dallas dominated before later cities were reached. Different categories/regions will differ — re-measure for your slice. - Location resolution depends on Houzz's in-page region links. Locations are resolved
by matching the slug against the region links Houzz renders on the category page. If a
slug doesn't appear (e.g.
san-antonio-tx-usdid not resolve in testing), it's reported inunresolved_locationsand skipped — it does not crash the run. Prefer major metros, or pass an explicit listing URL viastartUrls. totalResultsis a pagination ceiling, not a true count. Houzz reports ~1499 for a category regardless of region. The crawl early-stops when a page yields no new pros rather than paginating to the ceiling, so a single city yields its real unique count (a few hundred), not 1499.- No "complete national database" claim. Coverage is whatever the crawl + dedup actually yields for the categories/locations you request, stated plainly.
- License/awards/estimate fields are genuinely partial. They reflect what each pro filled in on Houzz; treat the rates above as the expectation, not a guarantee.
Reliability & volume (what was actually exercised)
Recon (R4) found no rate-limit / anti-bot infrastructure (Fastly + Envoy; no
Cloudflare, no x-ratelimit-*, no 429). Build 6 validated this with a gradual real
volume ramp, watching for the first non-200:
| Run | Scope | Pages | Pushed | Result |
|---|---|---|---|---|
| 50 | GC / Austin | 4 | 50 | clean — 0 failures, 0 non-200 |
| 200 | GC / Austin+Dallas+Houston | 15 | 200 | clean — 0 failures, 0 non-200 |
| 500 | GC + architects / 5 metros (8 seeds) | 39 | 500 | clean — 0 failures, 0 non-200 |
| 20 (enrich) | GC / Austin, full reviews | 2 + 40 | 20 | clean — 407 reviews, 0 failures |
Honest limit of this validation: the largest clean run exercised here was 500 Tier-1 records / 39 pages and a 20-pro full enrichment. Sustained tens-of-thousands / full multi-geo crawls were not exercised (respectful-volume guardrail) — do not assume 10k+ works untested. Ramp gradually and watch for the first non-200.
Resilience built in (Builds 4): jittered ~1–3s pacing (configurable), exponential
backoff + retry on transient errors / 429 / 5xx, per-seed isolation (one bad seed doesn't
kill the run), and an optional proxy-rotation fallback for IP blocks
(proxyConfiguration). Proxy is not required at modest volume — it's a tested
contingency.
Pricing — Pay-Per-Result
The actor is monetized per result: one price per unique record written to the dataset,
via Apify's built-in apify-default-dataset-item event. There is no per-event accounting —
you pay per lead.
maxResultsis your spend cap. The run stops after that many unique pros.- Only clean, unique, non-junk records are pushed (dedup by
profile_id; records with no profile id are skipped), so you're never billed for duplicates or junk. enrich: trueyields richer rows at the same per-result price (it costs ~2 extra requests/pro behind the scenes; the per-result price is set at publish time).
Usage
Input
| Field | Type | Default | Meaning |
|---|---|---|---|
categories | string[] | ["general-contractor"] | Houzz category slugs. Resolved to the canonical probr0-bo~t_<id> listing URL. |
locations | string[] | — | Location slugs (e.g. austin-tx-us). Empty = national listing. |
startUrls | url[] | — | Explicit Houzz listing URLs, instead of / in addition to categories+locations. |
enrich | bool | false | Fetch profiles for Tier-2 fields. |
maxReviews | int | 0 | Per-pro review cap; 0 = full history. |
maxResults | int | 0 | Spend cap (unique pros); 0 = until exhausted. |
minDelaySecs / maxDelaySecs | int | 1 / 3 | Jittered pacing bounds. Keep min ≥ 1 to be a good citizen. |
proxyConfiguration | object | off | Optional proxy / IP-block contingency. |
Example (inputs/example.json):
{"categories": ["general-contractor"],"locations": ["austin-tx-us"],"enrich": false,"maxResults": 30,"minDelaySecs": 1,"maxDelaySecs": 3}
Run locally (without the Apify platform)
pip install -r requirements.txt # ideally in a venv (Python 3.13)scripts/run_local.sh # uses inputs/example.jsonscripts/run_local.sh inputs/enrich_smoke.json # an enrichment example
Results land in storage/datasets/default/ (gitignored); logs print a per-run summary
(pages_fetched, records_seen, duplicates, pushed, failed_seeds, enriched,
reviews_fetched).
Tests
pip install -r requirements-dev.txtpython -m pytest -q # 41 tests, fixture-based, no network
Legal / Terms note
Houzz's Terms of Use prohibit scraping and commercial reuse of listings; robots.txt
allowing the paths does not override that contract (CA jurisdiction). Publishing this as a
paid public actor is a deliberate, owner-accepted business-risk decision (see
docs/PLAN.md §7). The actor is built to be a good citizen — polite pacing, gradual ramp,
honest spec — to reduce both ethical and enforcement risk.
How this project is built
LLM-driven, multi-session project. GitHub Issues are the source of truth. See
CLAUDE.md (operating rules), docs/PLAN.md (the approved plan), docs/WORKFLOW.md
(conventions), and pinned issue #1 (state & handoff log).