Houzz Pro Scraper avatar

Houzz Pro Scraper

Under maintenance

Pricing

from $3.00 / 1,000 results

Go to Apify Store
Houzz Pro Scraper

Houzz Pro Scraper

Under maintenance

Scrapes Houzz professional listings into a clean lead dataset (name, phone, address; optional rating/reviews/website/license). Two-tier crawl. Pay-Per-Result: priced per record (use maxResults to cap spend).

Pricing

from $3.00 / 1,000 results

Rating

0.0

(0)

Developer

Nathan Black

Nathan Black

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

3 days ago

Last modified

Share

An Apify actor that turns the Houzz professional directory into a clean, deduplicated lead dataset: business name, phone, and address for every listed pro, plus optional rating, full review text, website, license, services and cost estimates. Built to be reliable, current, and honestly specified — every field rate below is measured from a real run, not claimed.

Two crawl tiers:

  • Tier 1 (listing) — one request returns ~15–20 pros from the page's structured JSON-LD. This is the base lead record (name / phone / address).
  • Tier 2 (enrichment) — opt-in (enrich: true); fetches each pro's profile to add rating, reviews, website, license, services, cost estimate and badges.

Output fields & measured fill rates

Honesty note. Nulls are kept in the output — a null is an honest signal that Houzz didn't expose that field for that pro, not a dropped field. Fill rates below come from the real runs described under Coverage & caveats. Your numbers will vary by category and location.

Tier 1 — base record

Measured on 500 records, general-contractor + architects across Austin/Dallas/Houston (TX) plus a few IL/NC, 2026-06-22.

FieldFillNotes
profile_id100%The pf~<id>; stable dedup key.
profile_url100%Canonical Houzz profile URL (from JSON-LD sameAs).
name100%Business name.
telephone98%The field that makes this a lead list.
locality (city)99%
region (state)98%
postal_code91%
country100%
street_address72%Many pros list city/region only, no street.
latitude / longitude80%Geo present for ~4 of 5 pros.
image100%Primary photo URL.
area_served100%

Dedup verified clean: 500 unique profile_ids out of 500 pushed (546 overlapping/ sponsored duplicates were dropped before push).

Tier 2 — enrichment (enrich: true)

Measured on 20 enriched records, general-contractor / Austin TX, full review history, 2026-06-22.

FieldFillNotes
full_address100%Full street address from the profile.
about100%Profile description.
services100%Service list.
rating90%Star rating (×10 on Houzz, normalized to e.g. 5.0).
rating_count90%Number of reviews behind the rating.
reviews90%Full review objects (see below).
sub_ratings85%Quality / responsiveness / value aspect ratings.
website85%The pro's own site (rawDomain); junk http:// dropped.
badges85%Houzz merit / standout badges.
cost_estimate75%Typical project cost string.
awards40%Best-of-Houzz etc.
estimate_details35%
license_number / license_status / license_verified_status0% (in this sample)Texas issues no state general-contractor license, so this is a true null, not a parse miss. The parser extracts license number/status where Houzz exposes it; fill in a licensed trade/state has not been separately measured — don't advertise it as guaranteed.

The 10% of pros with no rating/reviews are simply new pros with zero reviews — an honest null, not a gap.

Review objects (reviews[], measured over 407 real reviews): body, rating, author, created all 100% filled; review_id, relationship, project_date, modified, num_likes, status present; project_price only ~21% (Houzz's "Choose one" placeholder is dropped to null). Full review history is fetched by default in one request per pro (up to Houzz's 1024-review cap); use maxReviews to cap it.

See samples/sample_dataset.json (20 real enriched records) and samples/sample_record_enriched.json (one record, trimmed to 2 reviews, for a quick look).


Coverage & caveats

  • Fill rates are sample-specific. The Tier-1 numbers are TX-heavy: with a maxResults cap, the run fills from the first seeds, so Austin/Dallas dominated before later cities were reached. Different categories/regions will differ — re-measure for your slice.
  • Location resolution depends on Houzz's in-page region links. Locations are resolved by matching the slug against the region links Houzz renders on the category page. If a slug doesn't appear (e.g. san-antonio-tx-us did not resolve in testing), it's reported in unresolved_locations and skipped — it does not crash the run. Prefer major metros, or pass an explicit listing URL via startUrls.
  • totalResults is a pagination ceiling, not a true count. Houzz reports ~1499 for a category regardless of region. The crawl early-stops when a page yields no new pros rather than paginating to the ceiling, so a single city yields its real unique count (a few hundred), not 1499.
  • No "complete national database" claim. Coverage is whatever the crawl + dedup actually yields for the categories/locations you request, stated plainly.
  • License/awards/estimate fields are genuinely partial. They reflect what each pro filled in on Houzz; treat the rates above as the expectation, not a guarantee.

Reliability & volume (what was actually exercised)

Recon (R4) found no rate-limit / anti-bot infrastructure (Fastly + Envoy; no Cloudflare, no x-ratelimit-*, no 429). Build 6 validated this with a gradual real volume ramp, watching for the first non-200:

RunScopePagesPushedResult
50GC / Austin450clean — 0 failures, 0 non-200
200GC / Austin+Dallas+Houston15200clean — 0 failures, 0 non-200
500GC + architects / 5 metros (8 seeds)39500clean — 0 failures, 0 non-200
20 (enrich)GC / Austin, full reviews2 + 4020clean — 407 reviews, 0 failures

Honest limit of this validation: the largest clean run exercised here was 500 Tier-1 records / 39 pages and a 20-pro full enrichment. Sustained tens-of-thousands / full multi-geo crawls were not exercised (respectful-volume guardrail) — do not assume 10k+ works untested. Ramp gradually and watch for the first non-200.

Resilience built in (Builds 4): jittered ~1–3s pacing (configurable), exponential backoff + retry on transient errors / 429 / 5xx, per-seed isolation (one bad seed doesn't kill the run), and an optional proxy-rotation fallback for IP blocks (proxyConfiguration). Proxy is not required at modest volume — it's a tested contingency.


Pricing — Pay-Per-Result

The actor is monetized per result: one price per unique record written to the dataset, via Apify's built-in apify-default-dataset-item event. There is no per-event accounting — you pay per lead.

  • maxResults is your spend cap. The run stops after that many unique pros.
  • Only clean, unique, non-junk records are pushed (dedup by profile_id; records with no profile id are skipped), so you're never billed for duplicates or junk.
  • enrich: true yields richer rows at the same per-result price (it costs ~2 extra requests/pro behind the scenes; the per-result price is set at publish time).

Usage

Input

FieldTypeDefaultMeaning
categoriesstring[]["general-contractor"]Houzz category slugs. Resolved to the canonical probr0-bo~t_<id> listing URL.
locationsstring[]Location slugs (e.g. austin-tx-us). Empty = national listing.
startUrlsurl[]Explicit Houzz listing URLs, instead of / in addition to categories+locations.
enrichboolfalseFetch profiles for Tier-2 fields.
maxReviewsint0Per-pro review cap; 0 = full history.
maxResultsint0Spend cap (unique pros); 0 = until exhausted.
minDelaySecs / maxDelaySecsint1 / 3Jittered pacing bounds. Keep min ≥ 1 to be a good citizen.
proxyConfigurationobjectoffOptional proxy / IP-block contingency.

Example (inputs/example.json):

{
"categories": ["general-contractor"],
"locations": ["austin-tx-us"],
"enrich": false,
"maxResults": 30,
"minDelaySecs": 1,
"maxDelaySecs": 3
}

Run locally (without the Apify platform)

pip install -r requirements.txt # ideally in a venv (Python 3.13)
scripts/run_local.sh # uses inputs/example.json
scripts/run_local.sh inputs/enrich_smoke.json # an enrichment example

Results land in storage/datasets/default/ (gitignored); logs print a per-run summary (pages_fetched, records_seen, duplicates, pushed, failed_seeds, enriched, reviews_fetched).

Tests

pip install -r requirements-dev.txt
python -m pytest -q # 41 tests, fixture-based, no network

Houzz's Terms of Use prohibit scraping and commercial reuse of listings; robots.txt allowing the paths does not override that contract (CA jurisdiction). Publishing this as a paid public actor is a deliberate, owner-accepted business-risk decision (see docs/PLAN.md §7). The actor is built to be a good citizen — polite pacing, gradual ramp, honest spec — to reduce both ethical and enforcement risk.


How this project is built

LLM-driven, multi-session project. GitHub Issues are the source of truth. See CLAUDE.md (operating rules), docs/PLAN.md (the approved plan), docs/WORKFLOW.md (conventions), and pinned issue #1 (state & handoff log).