Houzz Professional Scraper · Multi-Country Search URLs
Pricing
from $1.99 / 1,000 saved professionals (no inboxes)
Paste supported Houzz browse URLs with the same filters you use online. Get CRM-ready rows with phones, bios, postal fields, ratings, and profile and partner URLs. Turn on enrichment to run a bounded, no-login HTTP crawl of each professional's website that fills missing emails and phone numbers and collects extra social links.
Developer
Corentin Robert
Maintained by Community
Houzz professionals (multi‑country storefronts)
Paste the Houzz browse URL kept in your browser—the page after filtering by professional category + geography—on **houzz.com**, **houzz.fr**, **houzz.co.uk**, **houzz.de**, **houzz.it**, **houzz.es**, **houzz.jp**, **houzz.ru**, Nordic storefronts, or APAC hubs like **houzz.in**, **houzz.sg**, **houzz.nz** (see codebase allow‑list — append new apex domains whenever Houzz opens one).
Each storefront uses the same crawler contract: Cheerio discovers profile cards linking to **pf~ numeric identifiers**, pagination follows **rel="next"**, and detailed rows read **LocalBusiness JSON‑LD**.
Outbound teams and analysts get one JSON row per professional: storefront domain (houzz_site), display name, publicly shown telephone, postal fragments when rendered, shortened bio, aggregated review counters, the canonical profile URL plus sameAs sites (often the professional's own website or social profiles), and the originating search URL (listing_source_url) for bookkeeping.
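As a rough illustration of how a schema.org LocalBusiness JSON‑LD object could be flattened into such a row, here is a minimal sketch. Only `houzz_site` and `listing_source_url` are field names taken from this README; the function name, the other keys, and the sample URLs are assumptions made for the example, not the Actor's actual extractor.

```javascript
// Hypothetical helper: flatten a schema.org LocalBusiness JSON-LD object into one row.
// Keys other than houzz_site and listing_source_url are illustrative assumptions.
function rowFromJsonLd(jsonLd, listingSourceUrl, profileUrl) {
  const address = jsonLd.address || {};
  const rating = jsonLd.aggregateRating || {};
  // sameAs may be a single string or an array in JSON-LD; normalize to an array.
  const sameAs = Array.isArray(jsonLd.sameAs)
    ? jsonLd.sameAs
    : jsonLd.sameAs
      ? [jsonLd.sameAs]
      : [];
  return {
    houzz_site: new URL(profileUrl).hostname.replace(/^www\./, ''),
    name: jsonLd.name || null,
    phone: jsonLd.telephone || null,
    postal_code: address.postalCode || null,
    city: address.addressLocality || null,
    rating_value: rating.ratingValue != null ? Number(rating.ratingValue) : null,
    review_count: rating.reviewCount != null ? Number(rating.reviewCount) : null,
    profile_url: profileUrl,
    same_as: sameAs,
    listing_source_url: listingSourceUrl,
  };
}
```

Missing JSON‑LD properties map to `null` here so a later shaping step can decide whether to drop them.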
What this does not do
- Houzz almost never exposes email addresses directly on profile cards, so `email` usually stays blank. Respect regional privacy law and marketplace terms whenever you augment data.
- By default the Actor only downloads Houzz pages. The optional website email lookup (input toggle) opens the outbound homepage and then follows up to 10 distinct same-site navigation links from the markup (DOM order, hard cap **12** page fetches incl. homepage). Many destinations still omit mailboxes from raw HTML or block datacenter egress.
- Storefront hostnames too new to be on the allow‑list surface a validation error until you add the apex domain to it (`src/lib/houzzPure.js` → `HOUZZ_REGISTERED_DOMAINS`).
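The allow‑list check described in the last point could look roughly like the sketch below. The constant name comes from this README, but its contents here are an illustrative subset, and the function name and error message are assumptions rather than the code in `src/lib/houzzPure.js`.

```javascript
// Illustrative subset only; the real list lives in src/lib/houzzPure.js.
const HOUZZ_REGISTERED_DOMAINS = ['houzz.com', 'houzz.fr', 'houzz.co.uk', 'houzz.de'];

// Hypothetical validator: returns the matching apex domain or throws.
function assertSupportedStorefront(searchUrl) {
  const host = new URL(searchUrl).hostname.toLowerCase();
  // Accept the apex itself or any subdomain of it (e.g. www.houzz.fr).
  const match = HOUZZ_REGISTERED_DOMAINS.find(
    (domain) => host === domain || host.endsWith('.' + domain),
  );
  if (!match) {
    throw new Error(`Unsupported Houzz storefront: ${host}`);
  }
  return match;
}
```

Matching on `'.' + domain` rather than a bare suffix keeps look-alike hosts such as `nothouzz.com` from passing.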
Input (Console)
The Console focuses on essentials—plus an optional outbound email lookup toggle:
| Field | Meaning |
|---|---|
| Search URLs | Paste the storefront browse URL(s) Houzz keeps in the browser bar once your filters land. Separate lines keep multiple scopes easy. |
| Maximum professionals | Row cap — use 0 for no numerical limit (run until listing chains end, subject to maxListingPages and duplicates). |
| Look for email on their website | No / Yes (dropdown). Runs a bounded outbound crawl when Yes. |
Console scope: Cheerio parallelism, dataset row shape (**spreadsheet** vs **complete**), and other tuning keys are defaults in code — they intentionally do not appear as Console fields so the storefront form stays beginner-friendly. Automated runs delivered as full JSON input (API / apify run --input-file / MCP) may still pass **maxConcurrency**, **minConcurrency**, **datasetJson**, plus the other knobs in the advanced table below.
Input (API / advanced-only)
Use these together with searchUrls / maxItems. Not shown on Apify Console — pass through the full JSON payload (apify run --input-file=…, Schedules actor input, MCP, REST) when you need to override internals.
| Key | Meaning |
|---|---|
| startUrls | Alias of searchUrls. |
| maxItems / maxProfessionals / maxResults | Row ceiling; **0** = no numerical cap (browse until chains end or the task stops; maxListingPages and duplicate skips still apply). |
| maxListingPages | Pagination depth via rel="next" (default inside code: 80). |
| maxConcurrency | Cheerio parallelism (code default **56**, cap **80**). Hidden from Console. |
| minConcurrency | Optional floor (2 … maxConcurrency). API JSON only unless you fork the Actor. |
| datasetJson | **spreadsheet** (built-in default) or **complete**. Hidden from Console. spreadsheet matches the compact CSV; complete keeps extractor columns. |
| websiteEmailEnrichment | **"yes"** / **"no"** in the Console schema; **true** / **false** and strings like **"on"** still work from API forks. Bounded outbound site crawl fills gaps (inbox, phone-from-site when storefront line is absent, outbound social merges). |
| verboseLogs | When true, prints `Verbose — website inbox:` lines (mailbox string + shortened profile/page URLs after enrichment), catalogs every browse sheet line, Cheerio parallelism notes, and fuller retry traces. This makes RUN_LOG louder and is meant for troubleshooting only. |
| websiteEmailMaxPages | Total pages fetched per outbound site (homepage + follow-ups, max **12**). Defaults to **11** so up to **10** in-site navigation targets from the homepage are eligible after counting the homepage itself. Lower it to save bandwidth when you rarely need enrichment. |
| csvDetailedExport | When **true** and running locally, output.csv keeps the full dataset column sweep (sparse JSON‑LD helpers included). Default **false**: tighter CRM‑style columns (~21) in a sensible outreach order (id, name, merged **phone**, merged **email**, address, geo, ratings, budget, bio short, URLs). |
| proxyConfiguration | { "useApifyProxy": true, ... } when you need Managed or Residential egress. |
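To make the defaults and caps in the table concrete, the sketch below normalizes a raw input object. The numeric defaults and caps (56/80, 80 pages, 11/12) and the accepted enrichment values come from the table; the helper names, the lower bounds of 1, and treating a missing row cap as 0 are assumptions for illustration.

```javascript
// Hypothetical helpers; not the Actor's real input parser.
function toBool(value) {
  if (typeof value === 'boolean') return value;
  if (typeof value === 'string') {
    return ['yes', 'true', 'on'].includes(value.trim().toLowerCase());
  }
  return false;
}

function toInt(value, fallback) {
  const n = Number.parseInt(value, 10);
  return Number.isFinite(n) ? n : fallback;
}

function clamp(n, lo, hi) {
  return Math.min(hi, Math.max(lo, n));
}

function normalizeInput(raw = {}) {
  return {
    // startUrls is documented as an alias of searchUrls.
    searchUrls: raw.searchUrls ?? raw.startUrls ?? [],
    // Assumption: a missing row cap is treated as 0 (no numerical limit).
    maxItems: Math.max(0, toInt(raw.maxItems ?? raw.maxProfessionals ?? raw.maxResults, 0)),
    maxListingPages: Math.max(1, toInt(raw.maxListingPages, 80)),
    maxConcurrency: clamp(toInt(raw.maxConcurrency, 56), 1, 80),
    datasetJson: raw.datasetJson === 'complete' ? 'complete' : 'spreadsheet',
    websiteEmailEnrichment: toBool(raw.websiteEmailEnrichment),
    websiteEmailMaxPages: clamp(toInt(raw.websiteEmailMaxPages, 11), 1, 12),
  };
}
```

Clamping instead of rejecting out-of-range values is a design choice made for this sketch; the Actor itself may validate differently.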
Example inputs
Minimal (Console-parity):

    {
      "searchUrls": [
        "https://www.houzz.com/professionals/kitchen-and-bath/chicago-il-us-probr0-bo~t_11790~r_4887398",
        "https://www.houzz.fr/professionals/architectes-d-interieur/c/Boulogne_Billancourt"
      ],
      "maxItems": 30
    }

Larger scrape via API knobs:

    {
      "searchUrls": [
        "https://www.houzz.fr/professionals/architectes-d-interieur/c/Paris"
      ],
      "maxItems": 500,
      "maxListingPages": 120,
      "maxConcurrency": 16
    }

Outputs
Dataset JSON defaults to spreadsheet-aligned objects (same merges as compact **output.csv**: one phone, one email, website / Facebook / Instagram / LinkedIn / YouTube, bio short, with empty keys stripped). Automation can pass **datasetJson: complete** in JSON input when you want the extractor dump (**social_links**, **contact_point_\*** phones, enrichment helpers like **email_website_fetch_url**, sparse Schema.org facets).
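The "empty keys stripped" behaviour can be pictured with a small sketch like the one below. The function name and the exact definition of "empty" (null, undefined, blank strings, empty arrays) are assumptions; the Actor may apply different rules.

```javascript
// Hypothetical shaping step: drop keys whose values carry no information.
function stripEmptyKeys(row) {
  return Object.fromEntries(
    Object.entries(row).filter(([, value]) => {
      if (value === null || value === undefined) return false;
      if (typeof value === 'string') return value.trim() !== '';
      if (Array.isArray(value)) return value.length > 0;
      return true; // keep numbers (including 0), booleans, and objects
    }),
  );
}
```

Numeric zeros are kept on purpose in this sketch so that a review count of 0 is not mistaken for a missing value.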
Key-value store (default run store): .actor/key_value_store_schema.json groups Run log (RUN_LOG, text/plain) and Run input snapshot (INPUT, application/json) so Console / API users can focus on those keys first. This Actor does not ship a live-view HTTP server—ignore any OpenAPI field in the platform unless you add such a server yourself.
Locally, **output.csv** is regenerated from downloaded dataset rows (**csvDetailedExport** controls wide vs CRM layout). Rows dedupe within a single run (**houzz_professional_id / pf~**). **scrape_error** is surfaced in spreadsheet JSON only when present (failed grid + JSON‑LD salvage message).
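Within-run deduplication on `houzz_professional_id` can be sketched with a `Set`, as below. The factory name and its return convention are assumptions for illustration only.

```javascript
// Hypothetical deduper: returns a predicate that is true only the first
// time a given houzz_professional_id is seen during the run.
function createDeduper() {
  const seen = new Set();
  return function isFirstSighting(row) {
    const id = String(row.houzz_professional_id);
    if (seen.has(id)) return false;
    seen.add(id);
    return true;
  };
}
```

Because the `Set` lives in memory, this only dedupes inside a single run, which matches the scope stated above.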
**websiteEmailEnrichment** fans out on the outbound site (homepage + up to **10** same-origin navigation targets, **websiteEmailMaxPages** caps total GETs incl. SPA script probes, max **12**). Mailbox / phone‑from‑site results fold into the merged **email** / **phone** columns when **datasetJson** is **spreadsheet**; use **complete** JSON to inspect **email_website_status** and probe URLs alongside raw Houzz extractor fields (business_schema_types, full bio, **typical_job_budget**, **structured_data_url**, etc.).
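The page budget described above can be sketched as a pure planning function: given the homepage URL and the hrefs found on it in DOM order, it returns the URLs that would be fetched. The function name is hypothetical, and the sketch does not model SPA script probes or how navigation anchors are picked from the markup.

```javascript
// Hypothetical planner for the bounded outbound crawl.
// maxPages counts the homepage; it is clamped to the documented hard cap of 12.
function planEnrichmentFetches(homepageUrl, hrefsInDomOrder, maxPages = 11) {
  const home = new URL(homepageUrl);
  const cap = Math.min(Math.max(maxPages, 1), 12);
  const plan = [home.href];
  const seen = new Set([home.href]);

  for (const href of hrefsInDomOrder) {
    // Stop at the page cap or after 10 follow-up links, whichever comes first.
    if (plan.length >= cap || plan.length - 1 >= 10) break;
    let target;
    try {
      target = new URL(href, home); // resolves relative links
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    target.hash = ''; // treat /contact and /contact#form as the same page
    if (target.origin !== home.origin) continue; // same-origin only
    if (seen.has(target.href)) continue; // distinct pages only
    seen.add(target.href);
    plan.push(target.href);
  }
  return plan;
}
```

With the default of 11 pages this yields the homepage plus at most 10 follow-ups, which is the arithmetic behind the documented default.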
RUN_LOG mirrors the live console stream. Messages stay outcome-led: profiles saved (with optional counts of website-inbox fills between milestones), condensed catalog progress, wall-clock shutdown. Long browse chains use throttled catalog heartbeats. Aggregates such as duplicates, SPA-grid rescues, scrape_error totals, and outbound-enrichment breakdowns roll into shutdown. Per-profile URLs and harvested mailboxes only appear behind **verboseLogs: true**. Crawlee's default "Crawled X/Y pages" platform line and periodic Statistics ticks are suppressed.
Views:
- **overview**: compact reviewer columns incl. storefront + URL columns.
- **complete**: Console view spanning every tracked field (spreadsheet JSON still skips unset keys in the underlying items).
How it works (short)
- Request each pasted browse/search URL (listing wave) with a storefront-accurate `Accept-Language`.
- Harvest anchor + payload links containing **pf~ identifiers** (`/professionals/…` or localized folders such as `/professionnels/…`).
- Follow **rel="next"** pagination until quotas or exhaustion.
- Open each profile (detail wave), read **LocalBusiness** JSON‑LD where present, and scrape the **#businessprofessional-details table** rendered in HTML (localized labels: FR/EN/DE/ES…) to backfill postcode, budget range, outbound website (/trk/links decoded into `social_links`) when structured data skips those fields, or build a workable row entirely from that grid when JSON-LD is absent.
- When enabled, **websiteEmailEnrichment** uses Crawlee `sendRequest` (same proxy path as Houzz passes) against the homepage of the first non-social outbound link, then sequentially GETs same-site pages linked from the first 10 usable navigation anchors (until `websiteEmailMaxPages` caps the run). JavaScript-only footers or blocked datacenter traffic can still produce `email_website_status: not_found`.
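The listing-wave steps above (harvest `pf~` links, then follow `rel="next"`) can be approximated as follows. The Actor uses Cheerio; this sketch deliberately swaps in plain regular expressions so it runs without dependencies, and it only handles double-quoted attributes. The function name and the sample URL shapes used to exercise it are assumptions.

```javascript
// Hypothetical listing parser: regex stand-in for the Cheerio selectors.
function parseListingHtml(html, pageUrl) {
  // Map of pf~ numeric id -> first absolute profile URL seen for that id.
  const profilesById = new Map();
  const hrefRe = /href="([^"]*pf~(\d+)[^"]*)"/g;
  for (const match of html.matchAll(hrefRe)) {
    const [, href, id] = match;
    if (!profilesById.has(id)) {
      profilesById.set(id, new URL(href, pageUrl).href);
    }
  }

  // First <a> or <link> tag carrying rel="next", if any.
  const nextTag = html.match(/<(?:a|link)\b[^>]*\brel="next"[^>]*>/i);
  const nextHref = nextTag && nextTag[0].match(/\bhref="([^"]+)"/i);

  return {
    profiles: [...profilesById].map(([id, url]) => ({ id, url })),
    nextPageUrl: nextHref ? new URL(nextHref[1], pageUrl).href : null,
  };
}
```

A crawler loop would enqueue `nextPageUrl` until it is `null` or the `maxListingPages` budget is spent, and enqueue each profile URL for the detail wave.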
Local development
Follows the standard Apify local run contract (see Local apify run — INPUT validation in Apify Actor docs): the CLI validates input before your code starts.
- `npm install`
- `npm test`
- Input files: `storage/key_value_stores/default/INPUT.json` is what `Actor.getInput()` reads first. If you run **npm start from the repo root**, a **./input.json** file (next to `package.json`) is merged after that file, and duplicate keys in `./input.json` win (`websiteEmailEnrichment`, `csvDetailedExport`, etc.). For pure-CLI runs, use **apify run --input-file=./input.json**. **storage/**, **input.json**, and **output.csv** stay gitignored.
- `apify run` or `npm start` (same Node entrypoint).
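The merge order for local input files amounts to a shallow object merge in which `./input.json` is applied last. The sketch below shows that precedence with plain objects; the function name is hypothetical, and reading the two files from disk is left out.

```javascript
// Hypothetical merge: keys from ./input.json override keys from INPUT.json.
function mergeLocalInputs(storeInput, repoInput) {
  return { ...(storeInput ?? {}), ...(repoInput ?? {}) };
}

const merged = mergeLocalInputs(
  { maxItems: 30, websiteEmailEnrichment: 'no' }, // storage/.../INPUT.json
  { websiteEmailEnrichment: 'yes' },              // ./input.json
);
// merged.maxItems stays 30; merged.websiteEmailEnrichment becomes 'yes'.
```

A shallow merge is assumed here, so a nested object such as `proxyConfiguration` would be replaced wholesale rather than combined key by key.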
Outputs locally: JSON rows under **storage/datasets/default/** (defaults to spreadsheet CRM shape locally too unless your **input.json** sets **datasetJson: complete**) and **output.csv** at the repo root when not on Apify Cloud. The CSV defaults to a narrow contact sheet (~25 ordered columns—ids, outreach fields, geography, ratings, truncated bio, social/directory links—see csvDetailedExport in the README API table); set **csvDetailedExport: true** in input for the exhaustive column layout aligned with debugging the JSON‑LD extractor. Spreadsheet-safe single-line cells; bios stay multi-line only in the JSON files. Cloud runs use Console export for CSV.
Docker base image matches Dockerfile.
Pay per result (two email tiers — Apify Console)
By default, the single automatic event `apify-default-dataset-item` charges one price for every row.
To bill rows with an email differently from rows without one:
- In Development → Monetization → Pay per event, add two events with prices (Bronze/Silver/Gold ladders as usual). Event names must match the code exactly:
  - `houzz-professional-with-email`: spreadsheet-style merged inbox present (`email` or `email_from_website` before shaping).
  - `houzz-professional-no-email`: no usable merged inbox line.
- Set `apify-default-dataset-item` to $0.00 (or remove/disable that unit if the Console lets you). The SDK adds both the explicit charge and the default-dataset surcharge when prices are nonzero, so skipping the $0 guard double-charges.
- This Actor already calls `Actor.pushData(payload, eventName)` (v1.18+). For FREE / non-pay-per-event Actors this is harmless; charges apply only where those events exist in pricing.
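Choosing between the two event names could be as simple as the sketch below. The event names and the `email` / `email_from_website` fields come from this section; the function name and the definition of a "usable" inbox (any non-blank string) are assumptions.

```javascript
// Hypothetical selector for the pay-per-event name of one row.
function chargeEventFor(row) {
  const inbox = [row.email, row.email_from_website].find(
    (value) => typeof value === 'string' && value.trim() !== '',
  );
  return inbox ? 'houzz-professional-with-email' : 'houzz-professional-no-email';
}

// Inside the Actor, the result would be passed along with the row, e.g.:
//   await Actor.pushData(row, chargeEventFor(row));
```

Running the selector before spreadsheet shaping matters, because `email_from_website` may no longer exist as a separate key afterwards.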
Support
Professional services, white-label scraping, or custom Houzz integrations: corentin@outreacher.fr.