Houzz Professional Scraper · Multi-Country Search URLs avatar

Houzz Professional Scraper · Multi-Country Search URLs

Pricing

from $1.99 / 1,000 saved professional — no inboxes

Go to Apify Store
Houzz Professional Scraper · Multi-Country Search URLs

Houzz Professional Scraper · Multi-Country Search URLs

Paste supported Houzz browse URLs—same filters you use online. CRM-ready rows with phones, bios, postal fields, ratings, profile & partner URLs. Turn on enrichment for a bounded crawl per pro site to fill missing inbox, phone gaps, and extra social anchors—no-login HTTP.

Pricing

from $1.99 / 1,000 saved professional — no inboxes

Rating

0.0

(0)

Developer

Corentin Robert

Corentin Robert

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

22 days ago

Last modified

Share

Houzz professionals (multi‑country storefronts)

Paste the Houzz browse URL kept in your browser—the page after filtering by professional category + geography—on **houzz.com**, **houzz.fr**, **houzz.co.uk**, **houzz.de**, **houzz.it**, **houzz.es**, **houzz.jp**, **houzz.ru**, Nordic storefronts, or APAC hubs like **houzz.in**, **houzz.sg**, **houzz.nz** (see codebase allow‑list — append new apex domains whenever Houzz opens one).

Each storefront uses the same crawler contract: Cheerio discovers profile cards linking to **pf~ numeric identifiers**, pagination follows **rel="next", detailed rows read **LocalBusiness JSON‑LD.

Outbound / analysts export one JSON row per professional with storefront domain (houzz_site), display name, publicly shown telephone, postal fragments when rendered, shortened bio, aggregated review counters, canonical profile URL plus sameAs sites (often own website / social graphs), plus the originating search URL (listing_source_url) for bookkeeping.


What this does not do

  • Houzz almost never emits mailbox addresses straight on profile cards—email usually stays blank. Respect regional privacy law and marketplace terms whenever you augment data.
  • By default the Actor only downloads Houzz pages. Optional website email lookup (input toggle) opens the outbound homepage and then follows up to 10 distinct same-site navigation links from the markup (DOM order, hard cap **12** page fetches incl. homepage). Many destinations still omit mailboxes from raw HTML or block datacenter egress.
  • Extremely new storefront hostnames outside the allow‑list surface a validation error until you whitelist the apex domain (src/lib/houzzPure.jsHOUZZ_REGISTERED_DOMAINS).

Input (Console)

The Console focuses on essentials—plus an optional outbound email lookup toggle:

FieldMeaning
Search URLsPaste the storefront browse URL(s) Houzz keeps in the browser bar once your filters land. Separate lines keep multiple scopes easy.
Maximum professionalsRow cap — use 0 for no numerical limit (run until listing chains end, subject to maxListingPages and duplicates).
Look for email on their websiteNo / Yes (dropdown). Runs a bounded outbound crawl when Yes.

Console scope: Cheerio parallelism, dataset row shape (**spreadsheet** vs **complete**), and other tuning keys are defaults in code — they intentionally do not appear as Console fields so the storefront form stays beginner-friendly. Automated runs delivered as full JSON input (API / apify run --input-file / MCP) may still pass **maxConcurrency**, **minConcurrency**, **datasetJson**, plus the other knobs in the advanced table below.

Input (API / advanced-only)

Use these together with searchUrls / maxItems. Not shown on Apify Console — pass through the full JSON payload (apify run --input-file=…, Schedules actor input, MCP, REST) when you need to override internals.

KeyMeaning
startUrlsAlias of searchUrls.
maxItems / maxProfessionals / maxResultsRow ceiling; **0** = no numerical cap (browse until chains end or the task stops; maxListingPages and duplicate skips still apply).
maxListingPagesPagination depth via rel="next" (default inside code: 80).
maxConcurrencyCheerio parallelism (code default **56**, cap **80**). Hidden from Console.
minConcurrencyOptional floor (2 … maxConcurrency) — API JSON only unless you fork the Actor.
datasetJson**spreadsheet** (built-in default) or **complete**. Hidden from Console — spreadsheet matches compact CSV; complete keeps extractor columns.
websiteEmailEnrichment**"yes"** / **"no"** in the Console schema; **true** / **false** and strings like **"on"** still work from API forks. Bounded outbound site crawl fills gaps (inbox, phone-from-site when storefront line is absent, outbound social merges).
verboseLogsWhen true, prints Verbose — website inbox: lines (mailbox string + shortened profile/page URLs after enrichment), catalogs every browse sheet line, Cheerio parallelism notes, fuller retry traces — louder RUN_LOG for troubleshooting only.
websiteEmailMaxPagesTotal pages fetched per outbound site (homepage + follow-ups, max **12**). Defaults to **11** so up to **10** in-site navigation targets from the homepage are eligible after counting the homepage itself. Lower it to save bandwidth when you rarely need enrichment.
csvDetailedExportWhen **true** and running locally, output.csv keeps the full dataset column sweep (sparse JSON‑LD helpers included). Default **false: tighter CRM‑style columns (~21) in a sensible outreach order (id, name, merged **phone, merged **email**, address, geo, ratings, budget, bio short, URLs).
proxyConfiguration{ "useApifyProxy": true, ... } when you need Managed or Residential egress.

Example inputs

Minimal (Console-parity):

{
"searchUrls": [
"https://www.houzz.com/professionals/kitchen-and-bath/chicago-il-us-probr0-bo~t_11790~r_4887398",
"https://www.houzz.fr/professionals/architectes-d-interieur/c/Boulogne_Billancourt"
],
"maxItems": 30
}

Larger scrape via API knobs:

{
"searchUrls": [
"https://www.houzz.fr/professionals/architectes-d-interieur/c/Paris"
],
"maxItems": 500,
"maxListingPages": 120,
"maxConcurrency": 16
}

Outputs

Dataset JSON defaults to spreadsheet-aligned objects (same merges as compact **output.csv: one phone, one email, **website / Facebook / Instagram / LinkedIn / YouTube, bio short — empty keys stripped). Automation can pass **datasetJson: complete** in JSON input when you want the extractor dump (**social_links**, **contact_point_*** phones, enrichment helpers like **email_website_fetch_url**, sparse Schema.org facets).

Key-value store (default run store): .actor/key_value_store_schema.json groups Run log (RUN_LOG, text/plain) and Run input snapshot (INPUT, application/json) so Console / API users can focus on those keys first. This Actor does not ship a live-view HTTP server—ignore any OpenAPI field in the platform unless you add such a server yourself.

Locally, **output.csv** is regenerated from downloaded dataset rows (**csvDetailedExport** controls wide vs CRM layout). Rows dedupe within a single run (**houzz_professional_id / pf~**). **scrape_error** is surfaced in spreadsheet JSON only when present (failed grid + JSON‑LD salvage message).

**websiteEmailEnrichment** fans out on the outbound site (homepage + up to **10** same-origin navigation targets, **websiteEmailMaxPages** caps total GETs incl. SPA script probes — max **12**). Mailbox / phone‑from‑site results fold into the merged **email** / **phone** columns when **datasetJson is spreadsheet** — use **complete** JSON to inspect **email_website_status** and probe URLs alongside raw Houzz extractor fields (business_schema_types, full bio, **typical_job_budget**, **structured_data_url**, etc.). RUN_LOG mirrors the live console stream. Messages stay outcome-led: profiles saved (with optional counts of website-inbox fills between milestones), condensed catalog progress, wall-clock shutdown. Long browse chains use throttled catalog heartbeats. Aggregates such as duplicates, SPA-grid rescues, scrape_error totals, and outbound-enrichment breakdowns roll into shutdown. Per-profile URLs and harvested mailboxes only appear behind **verboseLogs: true**. Crawlee's default "Crawled X/Y pages" platform line and periodic Statistics ticks are suppressed.

Views:

  • **overview** — compact reviewer columns incl. storefront + URL columns.
  • **complete** — Console view spanning every tracked field (spreadsheet JSON still skips unset keys in the underlying items).

How it works (short)

  1. Request each pasted browse/search URL (listing wave) with storefront-accurate Accept-Language.
  2. Harvest anchor + payload links containing **pf~ identifiers** (/professionals/… or localized folders such as /professionnels/…).
  3. Follow **rel="next"** pagination until quotas or exhaustion.
  4. Open each profile (detail wave), read **LocalBusiness** JSON‑LD where present, and scrape the **#business professional-details table** rendered in HTML (localized labels — FR/EN/DE/ES…) to backfill postcode, budget range, outbound website (/trk/ links decoded into social_links) when structured data skips those fields — or build a workable row entirely from that grid when JSON-LD is absent.
  5. When enabled, **websiteEmailEnrichment** uses Crawlee sendRequest (same proxy path as Houzz passes) against the homepage of the first non-social outbound link, then sequentially GETs same-site pages linked from the first 10 usable navigation anchors (until websiteEmailMaxPages caps the run). JavaScript-only footers or blocked datacenter traffic can still produce email_website_status: not_found.

Local development

Follows the standard Apify local run contract (see Local apify run — INPUT validation in Apify Actor docs): the CLI validates input before your code starts.

  1. npm install
  2. npm test
  3. Input files: storage/key_value_stores/default/INPUT.json is what Actor.getInput() reads first. If you run **npm start from the repo root**, a **./input.json** file (next to package.json) is merged after that file — duplicate keys in ./input.json win (websiteEmailEnrichment, csvDetailedExport, etc.). For pure-CLI runs, use **apify run --input-file=./input.json**. **storage/**, **input.json**, and **output.csv** stay gitignored.
  4. apify run or npm start (same Node entrypoint).

Outputs locally: JSON rows under **storage/datasets/default/** (defaults to spreadsheet CRM shape locally too unless your **input.json** sets **datasetJson: complete**) and **output.csv** at the repo root when not on Apify Cloud. The CSV defaults to a narrow contact sheet (~25 ordered columns—ids, outreach fields, geography, ratings, truncated bio, social/directory links—see csvDetailedExport in the README API table); set **csvDetailedExport: true** in input for the exhaustive column layout aligned with debugging the JSON‑LD extractor. Spreadsheet-safe single-line cells; bios stay multi-line only in the JSON files. Cloud runs use Console export for CSV.

Docker base image matches Dockerfile.


Pay per result (two email tiers — Apify Console)

Single automatic event apify-default-dataset-item = one price for every row.

To bill avec email vs sans email:

  1. In Development → Monetization → Pay per event, add two events with prices (Bronze/Silver/Gold ladders as usual). event names must match code exactly:
    • houzz-professional-with-email — spreadsheet-style merged inbox present (email or email_from_website before shaping).
    • houzz-professional-no-email — no usable merged inbox line.
  2. Set apify-default-dataset-item to \$0.00 (or remove/disable that unit if the Console lets you)—the SDK adds both the explicit charge and the default-dataset surcharge when prices are nonzero, so skipping the $0 guard double-charges.
  3. This Actor already calls Actor.pushData(payload, eventName) (v1.18+). For FREE / non–pay-per-event Actors this is harmless; charges apply only where those events exist in pricing.

Support

Professional services, white-label scraping, or custom Houzz integrations: corentin@outreacher.fr.