# Website Lead Enricher (`operational_zirconia/website-lead-enricher`) Actor

Extract emails, phones, social profiles, and company data from any website. CRM-ready B2B lead enrichment with HubSpot, Salesforce, and Pipedrive export modes. Quality score, WHOIS lookup, and E.164 phone normalization included.

- **URL**: https://apify.com/operational_zirconia/website-lead-enricher.md
- **Developed by:** [RH Studios](https://apify.com/operational_zirconia) (community)
- **Categories:** Social media, E-commerce, Lead generation
- **Stats:** 2 total users, 1 monthly user, 85.7% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $1.90 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Website Lead Enricher

**Turn any website into CRM-ready B2B leads — emails, phones, social profiles, company data, plus detected email naming conventions and per-domain bounce-risk scoring.**

> 🚀 **[Try it on Apify Store →](https://apify.com/operational_zirconia/website-lead-enricher)** — runs in your browser, free tier included, no signup needed for the first batch.
>
> 🌐 **[See the visual pipeline →](https://website-lead-enricher.netlify.app/)** — interactive diagram + sample JSON/CSV output, no signup required.

---

### Why this Actor?

- 🎯 **Stop guessing who's reachable** — per-record `isSendable` flag + per-domain `bounceRiskBucket` so Instantly, Smartlead, and Apollo can filter out high-bounce-risk domains *before* you spend sending credits
- 📧 **5–10× the email coverage per domain** — Email Pattern Finder detects `first.last` / `flast` / `first` conventions from existing emails, runs a single SMTP catch-all probe, and generates 2–10 predicted team emails per domain (or 20–200 when paired with Hunter.io)
- 🔌 **Drop-in HTTP API for agents and apps** — Standby mode exposes `/leads`, `/leads/{domain}`, `/stats`, `/health` for AI agents, MCP integrations, and embedded B2B tools
- 📊 **CRM-ready exports** — HubSpot / Salesforce / Pipedrive column shapes built in; import without mapping
- 🤖 **Heuristic, not AI** — deterministic rules, no LLM cost, no external API keys, fully auditable
- 🛡️ **No silent failures** — per-step error isolation: one bad step never kills the record; every step carries `ok` / `error` status + structured `{code, message}` on failure
- ⚡ **Up to 1,000 URLs per run**, ~5s/record, parallel processing up to 10 concurrent

---

### What you get per record

Every input URL produces one record with these fields:

| Field | Type | What it tells you |
|---|---|---|
| 📧 **Emails** | `string[]` classified | Corporate vs. generic vs. invalid; throwaway domains filtered |
| 📱 **Phones** | `string[]` E.164 | Normalized for 50+ countries |
| 🌐 **Socials** | object | LinkedIn, Facebook, Instagram, X/Twitter, YouTube (validated, not generic pages) |
| 🏢 **Company** | object | WHOIS registrant + registration date (opt-in) |
| 📍 **Address** | object | City, postal code, country extracted from page text |
| ⭐ **Quality score** | `0-100` | Per-record score with `breakdown` + `missing_fields` array |
| 🏷️ **Company type** | enum | 14 verticals (saas, saas_b2b, agency, ecommerce, legal, medical, consulting, manufacturing, media, nonprofit, education, realestate, finance, other) with confidence |
| 📨 **isSendable** | boolean | Safe to mail? (see Outreach safety below) |
| 🔍 **emailPattern** | string | Detected naming convention: `first.last`, `flast`, `first`, etc. (or `null`) |
| 🎯 **bounceRiskBucket** | `low` / `medium` / `high` | Per-domain deliverability risk |
| 📋 **generatedEmails** | array | Predicted team emails with provenance tags (`page-discovered`, `pattern-from-page`, `pattern-alternate`) |
| 📞 **contactForm** | boolean + URL | Same-domain `<form>` on `/contact` etc. (3rd-party form vendors excluded) |
| ⚠️ **scrapeError** | object \| null | Machine-readable failure code on hard errors |
| 🛡️ **pipelineData.steps[]** | array | Per-step status + duration + error per record |

Full schema: [`docs/NextSteps/EmailPatternFinder.md`](docs/NextSteps/EmailPatternFinder.md) and [`.actor/dataset_schema.json`](.actor/dataset_schema.json).

---

### Cost & performance

| Batch size | Compute units (typical) | Wall-clock |
|---|---|---|
| 100 URLs | ~5 CU | ~50s |
| 1,000 URLs | ~50 CU | ~5–8 min |

**Free every run:** heuristic extraction (no API cost). **Pay only when you opt in:** WHOIS lookups (~1s/URL), proxy bandwidth (DATACENTER ~$2.50/GB, RESIDENTIAL ~$12/GB).

---

### Outreach safety

Two complementary signals tell you whether to mail a record:

#### 1. Per-record: `isSendable`

`isSendable: true` only when **all** of the following hold:

- A personal email is present (not `no-reply@`, `noreply@`, or `postmaster@`)
- The personal email's domain has valid MX (or A fallback) — 2s timeout
- The domain is not a known spam-trap (mailinator, tempmail, guerrillamail)

Form-only records (no email, no phone) are flagged with `isSendableReason: ["not_contactable"]` so outreach tools can route them to a manual follow-up track instead of a campaign. Records with `isSendable: true` can be mapped straight to a campaign.
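
The three conditions above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the Actor's implementation: `is_sendable`, `ROLE_LOCAL_PARTS`, and `SPAM_TRAP_DOMAINS` are hypothetical names, the spam-trap domain spellings are guesses, and the MX result is passed in as a flag instead of being looked up. The reason strings are the ones listed in the output schema, and `not_contactable` is simplified here to "no email at all".

```python
# Hypothetical constants; the Actor's real lists are not published in this README.
ROLE_LOCAL_PARTS = {"no-reply", "noreply", "postmaster"}
SPAM_TRAP_DOMAINS = {"mailinator.com", "tempmail.com", "guerrillamail.com"}


def is_sendable(email, mx_valid):
    """Return (sendable, reasons) for one candidate email; email may be None."""
    if not email or "@" not in email:
        return False, ["not_contactable"]
    local, domain = email.lower().rsplit("@", 1)
    reasons = []
    if local in ROLE_LOCAL_PARTS:
        reasons.append("generic_email")
    if not mx_valid:
        reasons.append("no_mx")
    if domain in SPAM_TRAP_DOMAINS:
        reasons.append("spam_trap")
    # Sendable only when no rule produced a failure reason.
    return (not reasons), reasons
```

Returning the reasons alongside the flag mirrors the `isSendable` / `isSendableReason` pairing, so a caller can route each failure type differently.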

#### 2. Per-domain: `patternAnalysis.bounceRiskBucket`

| Bucket | Means |
|---|---|
| `low` | Domain has MX, server rejects unknown recipients, pattern confidence clears the goal threshold. **Safe to send.** |
| `medium` | SMTP probe inconclusive OR catch-all with valid MX OR `quick-outreach` with low confidence. **Test before blasting.** |
| `high` | Domain unreachable OR catch-all + no MX. **Don't send.** |

Threshold tuned by the `goal` input:

| `goal` | `bounceRiskBucket: "low"` requires | Outreach strategy |
|---|---|---|
| `quick-outreach` | `isCatchAll: false` AND `mxValid` AND `patternConfidence >= 0.9` | `single-shot` — only the primary pattern |
| `high-deliverability` (default) | `isCatchAll: false` AND `mxValid` | `fallback` — try alternate if primary bounces |
| `max-coverage` | any reachable domain | `progressive` — start strict, loosen based on response |
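
One way to read the two tables above together is as a single decision function. The sketch below is an interpretation, not the Actor's code: `bounce_risk_bucket` and the `reachable` flag are hypothetical, and the order in which the rules are checked (the `high` conditions first, then the goal-specific `low` rules, then `medium` as the fallback) is an assumption the tables do not spell out.

```python
def bounce_risk_bucket(goal, mx_valid, is_catch_all, pattern_confidence, reachable=True):
    """Map probe results to 'low' / 'medium' / 'high' (is_catch_all may be None)."""
    # "high": domain unreachable, or catch-all with no MX.
    if not reachable or (is_catch_all is True and not mx_valid):
        return "high"
    # max-coverage accepts any reachable domain.
    if goal == "max-coverage":
        return "low"
    if is_catch_all is False and mx_valid:
        # quick-outreach additionally requires patternConfidence >= 0.9.
        if goal == "quick-outreach" and pattern_confidence < 0.9:
            return "medium"
        return "low"
    # Inconclusive probe (None) or catch-all with valid MX.
    return "medium"
```

For example, the same domain with `patternConfidence` of 0.5 is `low` under `high-deliverability` but `medium` under `quick-outreach`.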

The `patternAnalysis.isCatchAll` field is a tri-state (`true` / `false` / `null`) populated by a single-RCPT-TO SMTP probe on the domain's primary MX. The probe is stampede-cached, so concurrent calls for the same domain share one TCP socket. It has a 1-second timeout and never blocks the step on unresponsive mail servers.
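
Stampede caching with a timeout can be illustrated with `asyncio`. This is a generic sketch of the technique, not the Actor's implementation (which this README does not show): `SingleFlightCache` and `probe_with_timeout` are hypothetical names, and the probe itself is supplied by the caller.

```python
import asyncio


class SingleFlightCache:
    """Share one probe per domain among all concurrent callers."""

    def __init__(self, probe):
        self._probe = probe  # async callable: domain -> True / False / None
        self._tasks = {}     # domain -> asyncio.Task (kept after completion)

    async def get(self, domain):
        task = self._tasks.get(domain)
        if task is None:
            # The first caller starts the probe; later callers await the same task.
            task = asyncio.ensure_future(self._probe(domain))
            self._tasks[domain] = task
        return await task


async def probe_with_timeout(probe, domain, timeout=1.0):
    """Return the probe result, or None (inconclusive) if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(probe(domain), timeout)
    except asyncio.TimeoutError:
        return None
```

Mapping a timeout to `None` matches the tri-state described above: an unresponsive mail server yields an inconclusive result instead of an error.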

See [`docs/plans/IsSendable-implementation.md`](docs/plans/IsSendable-implementation.md) and [`docs/plans/EmailPatternFinder-implementation.md`](docs/plans/EmailPatternFinder-implementation.md) for the full algorithms.

---

### How it works

1. **Submit** up to 1,000 URLs per run (bare domains auto-prefixed with `https://`)
2. **Scrape** each site with Cheerio-based HTML extraction (lightweight, no headless browser overhead), rotating user agents, and automatic retry with exponential backoff
3. **Validate & enrich** — emails classified, phones normalized, socials verified, WHOIS looked up, email pattern detected, SMTP catch-all probed
4. **Export** — one row per URL in the Apify Dataset, or download as a CSV ready for HubSpot, Salesforce, or Pipedrive

> **Note on JS-heavy sites:** the production pipeline uses Cheerio + Axios only — no headless browser. Sites that render content client-side (React/Vue SPAs) will produce partial results. Pair with the optional `proxyConfiguration` to bypass anti-bot gates on protected sites. See the full pipeline below.
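
The bare-domain handling in step 1 can be sketched as follows. `normalize_url` is a hypothetical helper written for illustration; the `domain` rule (lowercased, without a leading `www.`) follows the output schema description, and everything else is an assumption.

```python
from urllib.parse import urlsplit


def normalize_url(raw):
    """Prefix bare domains with https:// and return (url, domain)."""
    raw = raw.strip()
    url = raw if "://" in raw else "https://" + raw
    host = (urlsplit(url).hostname or "").lower()
    # Drop a leading "www." so records for the same site share one domain key.
    domain = host[4:] if host.startswith("www.") else host
    return url, domain
```

With the inputs used elsewhere in this README, `normalize_url("www.example.org/contact")` gives `("https://www.example.org/contact", "example.org")`.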

#### Pipeline at a glance

<!-- docs/pipeline-diagram.html for the standalone hosted version of this diagram -->

```mermaid
flowchart LR
    A[URLs<br/>up to 1,000] --> B[Step 1: Scrape<br/>Cheerio + Axios]
    B --> C[Step 2: Email<br/>Pattern Finder<br/>DNS + SMTP probe]
    C --> D[Classify &<br/>Validate<br/>phones, socials,<br/>company type]
    D --> E[Quality Score<br/>0-100]
    E --> F[Export]
    F --> F1[Apify Dataset<br/>one row per URL]
    F --> F2[CRM-ready CSV<br/>HubSpot / Salesforce / Pipedrive]
    F --> F3[Standby HTTP API<br/>/leads /leads/&#123;domain&#125; /stats]
    F --> F4[KV Store<br/>runSummary]

    classDef input fill:#1f2937,color:#fff,stroke:#0ea5e9,stroke-width:2px
    classDef step fill:#0ea5e9,color:#fff,stroke:#0369a1
    classDef output fill:#10b981,color:#fff,stroke:#047857
    class A input
    class B,C,D,E step
    class F,F1,F2,F3,F4 output
````

A standalone, color-rendered version of this diagram is live at **[website-lead-enricher.netlify.app](https://website-lead-enricher.netlify.app/)**. The source is [`docs/pipeline-diagram.html`](docs/pipeline-diagram.html) — feel free to fork it for your own pipeline pages. Drop the live URL into your Apify Actor long description for a richer preview than plain markdown.

***

### 🔌 Live HTTP API (Standby mode)

Run this Actor in **Apify Standby mode** and it spins up a read-only HTTP API on the standby port — perfect for AI agents, MCP integrations, embedded B2B tools, and Zapier/n8n-style workflows where you want a stable queryable endpoint instead of one-shot batch runs.

| Endpoint | Returns |
|---|---|
| `GET /health` | Liveness probe (`{ status: "ok", uptimeMs }`) |
| `GET /leads` | Paginated list of enriched leads (max 1000 per page, supports `?limit=` and `?offset=`) |
| `GET /leads/{domain}` | Single-lead lookup by domain — full record shape (same as one Dataset row) |
| `GET /stats` | Run-level summary: `stepErrors` per pipeline step, `droppedRecords`, `totalRecords`, `durationMs` |

CORS is open by default. The OpenAPI schema lives in [`.actor/openapi.json`](.actor/openapi.json) — import it into Postman, Insomnia, or any OpenAPI generator to scaffold a client in seconds.

```bash
# Get all sendable leads for a campaign import.
# Start the Actor in Standby mode from https://apify.com/operational_zirconia/website-lead-enricher first,
# then replace <your-standby-host> with the standby URL the Apify Console gives you.
# The URL is quoted so the shell does not treat "?" as a glob character.
curl -s "https://<your-standby-host>/leads?limit=500" | \
  jq '.[] | select(.isSendable == true) | {email: .contacts.emails_corporate, domain, company: .company.name}'
```

> The dataset is populated by previous normal (non-standby) runs; the standby server reads from the same Dataset and Key-Value Store. Pair a normal run with standby mode and the API stays queryable as long as the Actor is running.
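
For result sets larger than one page, the `?limit=` and `?offset=` parameters can be combined into a simple paging loop. The sketch below uses only the standard library and assumes that `/leads` returns a bare JSON array, as the `jq '.[]'` filter above implies. `fetch_page` and `iter_leads` are hypothetical names, and any authentication your standby URL requires is left out.

```python
import json
from urllib.request import urlopen


def fetch_page(base_url, limit, offset):
    """GET one page of /leads; assumes the body is a JSON array of records."""
    with urlopen(f"{base_url}/leads?limit={limit}&offset={offset}") as resp:
        return json.load(resp)


def iter_leads(base_url, page_size=500, fetch=fetch_page):
    """Yield every lead, advancing the offset until a short page is returned."""
    offset = 0
    while True:
        page = fetch(base_url, page_size, offset)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size
```

The `fetch` parameter exists so the loop can be exercised without a running standby server; in normal use, `[lead for lead in iter_leads(base_url) if lead["isSendable"]]` reproduces the filter from the `curl` example.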

***

### CRM-ready export

Set `csvMode` in the input and get a file formatted exactly for your platform:

| Mode | Booleans | Use case |
|------|----------|----------|
| `standard` | `true` / `false` | Generic CSV for custom tooling |
| `hubspot` | `true` / `false` | HubSpot Contact Import |
| `salesforce` | `TRUE` / `FALSE` | Salesforce Lead Import Wizard |
| `pipedrive` | `1` / `0` | Pipedrive Person Import |

```json
{
  "urls": ["https://acme.com", "https://stripe.com"],
  "csvMode": "hubspot"
}
```

Output: `OUTPUT_HUBSPOT_CSV` in the Key-Value Store tab — import directly, no transformation.
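
The boolean column in the table above amounts to a small lookup. The following sketch shows that mapping along with the Key-Value Store key naming. `format_bool` and `output_key` are hypothetical helpers, and the key pattern is inferred from `OUTPUT_HUBSPOT_CSV` and the `OUTPUT_<MODE>_CSV` description in the input schema.

```python
# (true text, false text) per csvMode, taken from the table above.
BOOLEAN_STYLES = {
    "standard": ("true", "false"),
    "hubspot": ("true", "false"),
    "salesforce": ("TRUE", "FALSE"),
    "pipedrive": ("1", "0"),
}


def format_bool(value, csv_mode="standard"):
    """Render a boolean the way the chosen CRM import expects it."""
    true_text, false_text = BOOLEAN_STYLES[csv_mode]
    return true_text if value else false_text


def output_key(csv_mode):
    """Key-Value Store key for the exported CSV (assumed naming pattern)."""
    return f"OUTPUT_{csv_mode.upper()}_CSV"
```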

***

### Filter by company type

Each input URL is heuristically classified into one of 14 verticals (`saas`, `saas_b2b`, `agency`, `ecommerce`, `legal`, `medical`, `consulting`, `manufacturing`, `media`, `nonprofit`, `education`, `realestate`, `finance`, or `other`) using schema.org markup, meta description, and body-text keywords. Set `companyTypes` to keep only the verticals you care about.

```json
{
  "urls": ["https://acme.com", "https://bobslegal.com", "https://carsforkids.com"],
  "companyTypes": ["saas", "consulting"]
}
```

Dropped records remain in the Apify Dataset with `passedCompanyTypeFilter: false` so you can audit them; they are removed from the local CSV/JSON export.
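
The keep-in-dataset, drop-from-export behavior can be sketched like this. `apply_company_type_filter` is a hypothetical helper; only the `companyType` and `passedCompanyTypeFilter` field names come from this README.

```python
def apply_company_type_filter(records, company_types):
    """Flag every record, then return (all_records, exported_records)."""
    allowed = set(company_types)
    for record in records:
        # An empty allow-list keeps everything.
        record["passedCompanyTypeFilter"] = (
            not allowed or record.get("companyType") in allowed
        )
    exported = [r for r in records if r["passedCompanyTypeFilter"]]
    return records, exported
```

The first return value corresponds to what stays in the dataset for auditing, the second to what reaches the CSV/JSON export.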

***

### Email Pattern Finder in depth

Step 2 detects the company's email naming convention from the emails Step 1 found on the page, validates the domain with MX + a single SMTP catch-all probe, and emits a `generatedEmails[]` array plus a `patternAnalysis` block.

#### What's emitted

```json
{
  "emailPattern": "first.last",
  "patternConfidence": 0.92,
  "generatedEmails": [
    { "address": "jan.curry@acme.com", "name": "Jan Curry", "source": "page-discovered" },
    { "address": "ada.lovelace@acme.com", "name": "Ada Lovelace", "source": "pattern-from-page" },
    { "address": "curry.jan@acme.com", "name": "Jan Curry", "source": "pattern-alternate" }
  ],
  "patternAnalysis": {
    "mxValid": true,
    "isCatchAll": false,
    "emailCulture": "strict-format",
    "sequenceStrategy": "fallback",
    "bounceRiskBucket": "low"
  }
}
```

#### `source` enum values

| Source | Meaning |
|---|---|
| `page-discovered` | Email Step 1 already found on the page that parses to a personal name |
| `pattern-from-page` | The detected pattern applied to a contact name found on the page |
| `pattern-alternate` | A backup pattern applied to the same names (when confidence is low) |

See [`docs/NextSteps/EmailPatternFinder.md`](docs/NextSteps/EmailPatternFinder.md) for the full spec, [`docs/plans/EmailPatternFinder-adr.md`](docs/plans/EmailPatternFinder-adr.md) for the architecture decisions, and [`docs/plans/EmailPatternFinder-implementation.md`](docs/plans/EmailPatternFinder-implementation.md) for the build plan.
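
To make the pattern names concrete, here is a rough sketch of applying and detecting the three conventions named in this README. It is illustrative only: `PATTERNS`, `apply_pattern`, and `detect_pattern` are hypothetical, names are assumed to have at least a first and a last token, and the Actor's actual detection and confidence logic lives in the linked spec.

```python
# Local-part builders for the three conventions mentioned in this README.
PATTERNS = {
    "first.last": lambda first, last: f"{first}.{last}",
    "flast": lambda first, last: f"{first[0]}{last}",
    "first": lambda first, last: first,
}


def apply_pattern(pattern, full_name, domain):
    """Build an address for full_name; raises ValueError for single-word names."""
    first, *_, last = full_name.lower().split()
    return f"{PATTERNS[pattern](first, last)}@{domain}"


def detect_pattern(email, full_name):
    """Return the first pattern that reproduces email from full_name, else None."""
    email = email.lower()
    _, domain = email.rsplit("@", 1)
    for name in PATTERNS:
        if apply_pattern(name, full_name, domain) == email:
            return name
    return None
```

Detecting `first.last` from `jan.curry@acme.com` and then applying it to "Ada Lovelace" yields `ada.lovelace@acme.com`, which corresponds to the `pattern-from-page` entry in the example above.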

***

### Input

```json
{
  "urls": ["https://site1.com", "stripe.com", "www.example.org/contact"],
  "maxConcurrency": 5,
  "includeWhois": false,
  "csvMode": "standard",
  "companyTypes": ["saas", "consulting"],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["DATACENTER"]
  },
  "skipEmailPatternFinder": false,
  "goal": "high-deliverability"
}
```

| Field | Default | Description |
|-------|---------|-------------|
| `urls` | **required** | Up to 1,000 URLs or bare domains |
| `maxConcurrency` | `5` | Parallel requests (1–10). Use 1–2 for large batches |
| `includeWhois` | `false` | Adds registrant name and registration date (~1s extra per URL) |
| `csvMode` | `standard` | `standard`, `hubspot`, `salesforce`, or `pipedrive` |
| `companyTypes` | `[]` | Allow-list of verticals. Empty = include all. |
| `proxyConfiguration` | `{ useApifyProxy: false }` | Optional. Routes requests through Apify's proxy pool — see [Proxy support](#proxy-support) |
| `skipEmailPatternFinder` | `false` | Skip Step 2 (Email Pattern Finder) — when true, no DNS / SMTP work is performed |
| `searchWhois` | `false` | Mine the WHOIS registrant email and add it to `generatedEmails[]` with `source: "whois-registrant"`. No-op when `skipEmailPatternFinder: true` |
| `goal` | `high-deliverability` | Outreach intent. `quick-outreach` (strict, single-shot), `high-deliverability` (medium, fallback), `max-coverage` (loose, progressive) |
| `hunterApiKey` | `null` | Optional Hunter.io API key. When set, pulls additional emails from Hunter's domain-search API into `generatedEmails[]` with `source: "hunter-api"`. Free tier works. Failures populate `patternAnalysis.hunterError` without failing the step |
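
Before starting a run, the documented limits can be checked on the client side. The sketch below validates only the constraints stated in the table; `build_input` is a hypothetical helper, the minimum of one URL is an assumption, and the Actor may enforce further rules of its own.

```python
VALID_GOALS = {"quick-outreach", "high-deliverability", "max-coverage"}
VALID_CSV_MODES = {"standard", "hubspot", "salesforce", "pipedrive"}


def build_input(urls, max_concurrency=5, goal="high-deliverability", csv_mode="standard"):
    """Validate the documented limits and return an Actor input object."""
    urls = list(urls)
    if not 1 <= len(urls) <= 1000:
        raise ValueError("urls must contain between 1 and 1,000 entries")
    if not 1 <= max_concurrency <= 10:
        raise ValueError("maxConcurrency must be between 1 and 10")
    if goal not in VALID_GOALS:
        raise ValueError(f"unknown goal: {goal}")
    if csv_mode not in VALID_CSV_MODES:
        raise ValueError(f"unknown csvMode: {csv_mode}")
    return {
        "urls": urls,
        "maxConcurrency": max_concurrency,
        "goal": goal,
        "csvMode": csv_mode,
    }
```

The returned dictionary can be passed as `run_input` to the Python client call shown in the API section.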

***

### Sample output

```json
{
  "url": "https://www.acme.com",
  "domain": "acme.com",
  "scrapedAt": "2026-06-21T10:00:00Z",
  "contacts": {
    "emails": [
      { "address": "jan@acme.com", "type": "corporate" },
      { "address": "contact@acme.com", "type": "generic" }
    ],
    "phones": ["+12125551234"]
  },
  "socials": { "linkedin": "https://linkedin.com/company/acme" },
  "qualityScore": { "total": 85, "breakdown": { "completeness": 80, "emailValidity": 100, "phoneValidity": 100, "socialPresence": 60 } },
  "companyType": "saas",
  "isSendable": true,
  "emailPattern": "first.last",
  "patternConfidence": 0.92,
  "generatedEmails": [
    { "address": "jan.curry@acme.com", "name": "Jan Curry", "source": "page-discovered" }
  ],
  "patternAnalysis": {
    "mxValid": true,
    "isCatchAll": false,
    "bounceRiskBucket": "low",
    "emailCulture": "strict-format"
  },
  "dataQuality": "medium",
  "scrapeError": null
}
```

Full schema: [`.actor/dataset_schema.json`](.actor/dataset_schema.json).

***

### Quality you can trust

- **No "nan"** — null/NaN values become empty fields, never broken cells
- **UTF-8 BOM** — accented company names import cleanly into Excel and every CRM
- **CSV injection guard (CWE-1236)** — formula-triggering values (`=`, `+`, `-`, `@`) are quoted to prevent execution when the CSV is opened in Excel
- **Single homepage fetch** — company name, socials, and address extracted from the same response; no wasteful re-scraping
- **WHOIS cache** — duplicate domains in one run cost nothing
- **Graceful errors** — failed URLs still appear in the dataset with error context, so nothing is lost silently
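
Three of the guarantees above (empty cells for null/NaN, the UTF-8 BOM, and the formula guard) can be sketched in a few lines. This is a generic illustration, not the Actor's exporter: `safe_cell` and `write_csv` are hypothetical, and prefixing risky values with a single quote is one common mitigation chosen here as an assumption, since the README does not say exactly how values are quoted.

```python
import csv
import io

FORMULA_PREFIXES = ("=", "+", "-", "@")


def safe_cell(value):
    """Turn None/NaN into an empty cell and neutralize formula-like text."""
    if value is None or value != value:  # NaN is the only value unequal to itself
        return ""
    text = str(value)
    # Assumed mitigation: a leading apostrophe makes spreadsheets treat it as text.
    return "'" + text if text.startswith(FORMULA_PREFIXES) else text


def write_csv(rows, header):
    """Return CSV bytes with a UTF-8 BOM so Excel detects the encoding."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    for row in rows:
        writer.writerow([safe_cell(row.get(col)) for col in header])
    return ("\ufeff" + buffer.getvalue()).encode("utf-8")
```

Note that under this particular mitigation an E.164 phone such as `+12125551234` would also receive the apostrophe, because it starts with `+`.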

***

### Proxy support

Hit a Cloudflare block? Scraping EU sites that geo-fence US IPs? Add `proxyConfiguration` to your input and the Actor will route every request through the Apify-managed proxy pool. The default is **off**, so you only pay for proxy bandwidth when you opt in.

| Tier | Apify cost | Best for | Failure mode |
|---|---|---|---|
| `DATACENTER` | ~$2.50/GB | US sites without aggressive anti-bot | Blocked by Cloudflare / Akamai |
| `RESIDENTIAL` | ~$12/GB | Anti-bot sites, EU geo-targeting, compliance-sensitive leads | 4–5× the bandwidth cost |

EU geo-pinning example:

```json
{
  "urls": ["https://acme.de", "https://example.de"],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyCountry": "DE"
  }
}
```

Every run that uses a proxy prints a summary line at the end so you can track cost. If any URL hits a Cloudflare challenge, you'll see a tip suggesting you enable the proxy.

***

### Scrape errors

Every record carries a top-level `scrapeError: object | null` field. `code` is one of eight machine-readable categories:

| Code | Meaning | Retry? |
|------|---------|--------|
| `timeout` | Request exceeded the timeout budget | ✅ Retry |
| `blocked` | HTTP 403 or Cloudflare / bot-challenge signal | ⚠️ With proxy |
| `dns_error` | DNS lookup failed (`ENOTFOUND` / `EAI_AGAIN`) | ❌ Permanent |
| `tls_error` | Certificate / TLS handshake failed | ❌ Permanent |
| `5xx` | Upstream 5xx response | ✅ Retry |
| `4xx` | Other 4xx response (404, 429, …) | ⚠️ Depends |
| `empty` | Fetch succeeded but no contact data extracted | ⚠️ Optional |
| `unknown` | Unclassified failure | ⚠️ Case-by-case |

**Partial-success rule:** if any path in the scrape loop yielded data, `scrapeError` is cleared to `null`. A record that got even one email from `/contact` succeeded.
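
A consumer can turn the "Retry?" column into a lookup when deciding what to do with each record. In the sketch below, `RETRY_POLICY`, `next_action`, and the action labels are hypothetical; only the eight codes and the `scrapeError` shape come from this README.

```python
# Action labels are illustrative; the codes mirror the table above.
RETRY_POLICY = {
    "timeout": "retry",
    "blocked": "retry_with_proxy",
    "dns_error": "give_up",
    "tls_error": "give_up",
    "5xx": "retry",
    "4xx": "review",
    "empty": "optional",
    "unknown": "review",
}


def next_action(record):
    """Decide what to do with a record based on its scrapeError code."""
    error = record.get("scrapeError")
    if error is None:
        return "done"  # success, including partial success
    return RETRY_POLICY.get(error.get("code"), "review")
```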

***

### Use cases

- **Sales prospecting** — find decision-maker emails and direct phones for outbound campaigns
- **Cold outreach prep** — build targeted lists with verified corporate emails and bounce-risk per domain
- **Lead enrichment** — append real contact data to existing CRM records
- **Competitor research** — map competitor digital presence at scale
- **Domain due diligence** — WHOIS-backed company name and registration date for vendor research

***

### Technical notes

- Cheerio-based HTML extraction (lightweight, no headless browser overhead)
- Automatic retry with exponential backoff
- Rotating user agents to reduce blocks
- Configurable timeout (15s default, 5s WHOIS, 1s SMTP probe)
- Optional Apify proxy integration (DATACENTER / RESIDENTIAL / country pinning / custom URLs)
- 951 tests across 49 test suites

***

**Categories:** Lead generation · Data scraping · Sales automation

**Tags:** email scraper, phone extractor, social media finder, B2B lead enrichment, CRM enrichment, contact discovery, WHOIS lookup, sales automation, proxy support, Cloudflare bypass, residential proxy, datacenter proxy, email pattern finder, bounce risk

# Actor input Schema

## `urls` (type: `array`):

List of URLs or domains to enrich. Each entry is fetched, parsed, and analysed for contact data (emails, phones, social links). Bare domains automatically get 'https://' prepended. Example: \['https://acme.com', 'stripe.com', 'www.example.org/contact']. Total runtime scales with the list size; processing runs in parallel up to 'maxConcurrency'.

## `maxConcurrency` (type: `integer`):

Maximum number of URLs processed in parallel. Example: with 20 URLs and concurrency 5, the Actor handles 5 at a time and queues the remaining 15. Higher values finish faster but may trigger rate limits or get your IP blocked by target sites; lower values are slower but safer. Recommended: 3-5 for small batches (< 50 URLs), 1-2 for large batches (50+ URLs).

## `includeWhois` (type: `boolean`):

When true, each output row includes a 'whois' object with domain registration data: registrar name, creation/expiration dates, and registrant organization (when public per GDPR redaction). When false, only scraped page data is returned and no WHOIS lookup is performed. Example: enabling this for 'acme.com' adds {registrar: 'GoDaddy', createdAt: '2010-03-15', registrantOrg: 'ACME Corp'} to that row; disabling it skips the lookup entirely and saves ~1 second per URL. Default is false to minimize per-domain cost — enable explicitly when you need registrant data.

## `skipEmailPatternFinder` (type: `boolean`):

Skip Step 2 (email pattern detection) — when true, the pipeline emits a synthetic 'pattern' step with status 'skipped' and no DNS lookups are performed. When false (default), Step 2 runs and adds emailPattern, patternConfidence, generatedEmails\[], patternAnalysis, alternateEmailPatterns, and dataQuality to every record. See docs/NextSteps/EmailPatternFinder.md.

## `searchWhois` (type: `boolean`):

Mine the WHOIS registrant email (when includeWhois is true and the registry exposes it) and add it to generatedEmails\[] with source 'whois-registrant'. No-op when skipEmailPatternFinder is true. Default false.

## `goal` (type: `string`):

Tunes patternAnalysis.bounceRiskBucket thresholds and sequenceStrategy per outreach intent. 'quick-outreach' is strict — only the safest ~20% of generated patterns ship (sequenceStrategy: 'single-shot'). 'high-deliverability' (default) is medium — top ~60% ship (sequenceStrategy: 'fallback'). 'max-coverage' is loose — top ~95% ship (sequenceStrategy: 'progressive'). See docs/NextSteps/EmailPatternFinder.md §goal semantics.

## `hunterApiKey` (type: `string`):

Optional Hunter.io API key — when set, the Step 2 pipeline additionally pulls emails from Hunter.io's domain-search API and adds them to generatedEmails\[] with source 'hunter-api'. Failures (401, 429, network) populate patternAnalysis.hunterError without failing the step. Free tier is fine for low-volume runs. Leave empty to skip. See docs/NextSteps/EmailPatternFinder.md.

## `csvMode` (type: `string`):

Format of the generated CSV. 'standard' (generic - 18 columns, booleans true/false), 'hubspot' (First Name, Last Name, Email, Company, Phone, LinkedIn Profile URL - booleans true/false), 'salesforce' (FirstName, LastName, Email, Company, Phone, LeadSource - booleans TRUE/FALSE), 'pipedrive' (name, email, phone, org\_name, linkedin - booleans 1/0). Default: 'standard'. Use the mode matching your CRM to import the CSV directly without mapping columns. In Apify cloud, the CRM-ready CSV is written to the Key-value store tab as OUTPUT\_<MODE>\_CSV. Locally, it is written to LOCAL\_OUTPUT\_DIR with the mode included in the file name.

## `companyTypes` (type: `array`):

Allow-list filter for company types. Each input website is heuristically classified as one of the supported verticals (saas, saas\_b2b, agency, ecommerce, legal, medical, consulting, manufacturing, media, nonprofit, education, realestate, finance, or 'other' when no rule matches). When this list is non-empty, only records whose inferred companyType appears in the list are kept in the output; the rest are marked with passedCompanyTypeFilter=false in the dataset and dropped from the local CSV / JSON export. When this list is empty (default), every record is kept regardless of companyType. Note: 'other' is intentionally not in the enum — it is the fallback label when no rule matches, so filtering on it is meaningless.

## `proxyConfiguration` (type: `object`):

Apify proxy passthrough. **Default is OFF (no proxy)**: the Actor makes direct HTTP requests when this section is left untouched. Flip 'Use Apify Proxy' on to route through the Apify-managed pool. Tier matters: **DATACENTER** is US-only and ~$2.50/GB (cheap US scrapes, blocked by Cloudflare/Akamai); **RESIDENTIAL** supports 200+ countries (DE, FR, UK, JP, BR, etc.) and is ~$12/GB (bypasses most anti-bot, required for non-US geo-targeting). Use 'apifyProxyCountry' to pin exit-node geography. For BYO proxies, set 'useApifyProxy' off and provide 'proxyUrls' in 'scheme://user:pass@host:port' form.

## Actor input object example

```json
{
  "urls": [
    "https://example.com"
  ],
  "maxConcurrency": 5,
  "includeWhois": false,
  "skipEmailPatternFinder": false,
  "searchWhois": false,
  "goal": "high-deliverability",
  "csvMode": "standard",
  "companyTypes": []
}
```

# Actor output Schema

## `enrichedLeads` (type: `string`):

JSON array of enriched lead records, one per input URL. Each record contains: url (original input), domain (lowercased, no www.), scrapedAt (ISO 8601 timestamp), company {name, registrant, createdAt} populated from WHOIS or page metadata, contacts {emails\[] classified as corporate vs generic, email\_summary {corporate, generic, total}, phones\[] normalized to E.164}, socials {linkedin, facebook, instagram, twitter, youtube, ...} with full https URLs, qualityScore {total 0-100, breakdown {completeness 0-40, emailValidity 0-30, phoneValidity 0-20, socialPresence 0-10}}, companyType (heuristic vertical: saas, saas\_b2b, agency, ecommerce, legal, medical, consulting, manufacturing, media, nonprofit, education, realestate, finance, other) with companyTypeConfidence (high|medium|low) and passedCompanyTypeFilter boolean, isSendable (boolean, strictly stricter than contactable — true when a personal email exists with valid MX / A-record and the email is not a known spam-trap) with isSendableReason\[] (list of failure reasons: not\_contactable, generic\_email, no\_mx, spam\_trap), contactFormDetected (boolean) and contactFormUrl (string|null) per ContactFormDetected.md, scrapeError (object|null per ScrapeError.md — { code, message, httpStatus? } when scraping failed, null on success; code is one of: timeout, blocked, dns\_error, tls\_error, 5xx, 4xx, empty, unknown), pipelineData.steps\[] (per-step execution metadata: {name, status, durationMs, required, error?}; status is `ok` or `error`, error shape mirrors scrapeError — see docs/NextSteps/PerStepErrorIsolation.md), missing\_fields\[] (names of absent fields), and meta {errors\[], warnings\[], duration\_ms}. Fetch via GET {template-resolved URL} to retrieve the full list as JSON.

## `runSummary` (type: `string`):

JSON object aggregating per-step error counts and dropped-record metrics across the whole run. Shipped via the local-mode JSON envelope (and, in a follow-on, the Apify `/stats` HTTP endpoint). Shape: { stepErrors: { <stepName>: <count> }, droppedRecords: number, totalRecords: number, durationMs: number }. `stepErrors` is a per-pipeline-step counter (e.g. { scrape: 0, whois: 12, companyName: 3, addresses: 8, socials: 22, emailClassification: 0, companyType: 1, sendability: 5, qualityScore: 0, phoneDedup: 0 }) — each value is the count of records where that step ended in `status: "error"`. `droppedRecords` counts records lost to a failed required step (only `scrape` today, plus records dropped by the post-batch companyTypes filter when active). `totalRecords` reflects every record that landed in the dataset (success + dropped). `durationMs` is the total wall-clock run time. Consumers can fetch this in cloud mode via the Key-Value Store `OUTPUT_RUN_SUMMARY` key, or read it directly from the local-mode JSON envelope.
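
The `stepErrors` counter can be recomputed from dataset rows, which is handy for cross-checking a run. This sketch relies only on the `pipelineData.steps[]` shape described above. `summarize_step_errors` is a hypothetical helper, and it leaves out `droppedRecords` and `durationMs`, which cannot be derived from the step list alone.

```python
from collections import Counter


def summarize_step_errors(records):
    """Count, per step name, how many records ended that step in 'error'."""
    step_errors = Counter()
    for record in records:
        for step in record.get("pipelineData", {}).get("steps", []):
            # Adding a bool registers the step with 0 when it succeeded.
            step_errors[step["name"]] += step["status"] == "error"
    return {"stepErrors": dict(step_errors), "totalRecords": len(records)}
```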

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://example.com"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("operational_zirconia/website-lead-enricher").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "urls": ["https://example.com"] }

# Run the Actor and wait for it to finish
run = client.actor("operational_zirconia/website-lead-enricher").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://example.com"
  ]
}' |
apify call operational_zirconia/website-lead-enricher --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=operational_zirconia/website-lead-enricher",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Lead Enricher",
        "description": "Extract emails, phones, social profiles, and company data from any website. CRM-ready B2B lead enrichment with HubSpot, Salesforce, and Pipedrive export modes. Quality score, WHOIS lookup, and E.164 phone normalization included.",
        "version": "0.0",
        "x-build-id": "iojJsmueZgBrGRj4y"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/operational_zirconia~website-lead-enricher/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-operational_zirconia-website-lead-enricher",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/operational_zirconia~website-lead-enricher/runs": {
            "post": {
                "operationId": "runs-sync-operational_zirconia-website-lead-enricher",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/operational_zirconia~website-lead-enricher/run-sync": {
            "post": {
                "operationId": "run-sync-operational_zirconia-website-lead-enricher",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs to scrape",
                        "type": "array",
                        "description": "List of URLs or domains to enrich. Each entry is fetched, parsed, and analysed for contact data (emails, phones, social links). Bare domains automatically get 'https://' prepended. Example: ['https://acme.com', 'stripe.com', 'www.example.org/contact']. Total runtime scales with the list size; processing runs in parallel up to 'maxConcurrency'.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum number of URLs processed in parallel. Example: with 20 URLs and concurrency 5, the Actor handles 5 at a time and queues the remaining 15. Higher values finish faster but may trigger rate limits or get your IP blocked by target sites; lower values are slower but safer. Recommended: 3-5 for small batches (< 50 URLs), 1-2 for large batches (50+ URLs).",
                        "default": 5
                    },
                    "includeWhois": {
                        "title": "Include WHOIS",
                        "type": "boolean",
                        "description": "When true, each output row includes a 'whois' object with domain registration data: registrar name, creation/expiration dates, and registrant organization (when public per GDPR redaction). When false, only scraped page data is returned and no WHOIS lookup is performed. Example: enabling this for 'acme.com' adds {registrar: 'GoDaddy', createdAt: '2010-03-15', registrantOrg: 'ACME Corp'} to that row; disabling it skips the lookup entirely and saves ~1 second per URL. Default is false to minimize per-domain cost — enable explicitly when you need registrant data.",
                        "default": false
                    },
                    "skipEmailPatternFinder": {
                        "title": "Skip email pattern finder",
                        "type": "boolean",
                        "description": "Skip Step 2 (email pattern detection) — when true, the pipeline emits a synthetic 'pattern' step with status 'skipped' and no DNS lookups are performed. When false (default), Step 2 runs and adds emailPattern, patternConfidence, generatedEmails[], patternAnalysis, alternateEmailPatterns, and dataQuality to every record. See docs/NextSteps/EmailPatternFinder.md.",
                        "default": false
                    },
                    "searchWhois": {
                        "title": "Search WHOIS for registrant email",
                        "type": "boolean",
                        "description": "Mine the WHOIS registrant email (when includeWhois is true and the registry exposes it) and add it to generatedEmails[] with source 'whois-registrant'. No-op when skipEmailPatternFinder is true. Default false.",
                        "default": false
                    },
                    "goal": {
                        "title": "Outreach goal",
                        "enum": [
                            "quick-outreach",
                            "high-deliverability",
                            "max-coverage"
                        ],
                        "type": "string",
                        "description": "Tunes patternAnalysis.bounceRiskBucket thresholds and sequenceStrategy per outreach intent. 'quick-outreach' is strict — only the safest ~20% of generated patterns ship (sequenceStrategy: 'single-shot'). 'high-deliverability' (default) is medium — top ~60% ship (sequenceStrategy: 'fallback'). 'max-coverage' is loose — top ~95% ship (sequenceStrategy: 'progressive'). See docs/NextSteps/EmailPatternFinder.md §goal semantics.",
                        "default": "high-deliverability"
                    },
                    "hunterApiKey": {
                        "title": "Hunter.io API key (optional)",
                        "type": "string",
                        "description": "Optional Hunter.io API key — when set, the Step 2 pipeline additionally pulls emails from Hunter.io's domain-search API and adds them to generatedEmails[] with source 'hunter-api'. Failures (401, 429, network) populate patternAnalysis.hunterError without failing the step. Free tier is fine for low-volume runs. Leave empty to skip. See docs/NextSteps/EmailPatternFinder.md."
                    },
                    "csvMode": {
                        "title": "CSV Export Mode",
                        "enum": [
                            "standard",
                            "hubspot",
                            "salesforce",
                            "pipedrive"
                        ],
                        "type": "string",
                        "description": "Format of the generated CSV. 'standard' (generic - 18 columns, booleans true/false), 'hubspot' (First Name, Last Name, Email, Company, Phone, LinkedIn Profile URL - booleans true/false), 'salesforce' (FirstName, LastName, Email, Company, Phone, LeadSource - booleans TRUE/FALSE), 'pipedrive' (name, email, phone, org_name, linkedin - booleans 1/0). Default: 'standard'. Use the mode matching your CRM to import the CSV directly without mapping columns. In Apify cloud, the CRM-ready CSV is written to the Key-value store tab as OUTPUT_<MODE>_CSV. Locally, it is written to LOCAL_OUTPUT_DIR with the mode included in the file name.",
                        "default": "standard"
                    },
                    "companyTypes": {
                        "title": "Company Types Filter",
                        "type": "array",
                        "description": "Allow-list filter for company types. Each input website is heuristically classified as one of the supported verticals (saas, saas_b2b, agency, ecommerce, legal, medical, consulting, manufacturing, media, nonprofit, education, realestate, finance, or 'other' when no rule matches). When this list is non-empty, only records whose inferred companyType appears in the list are kept in the output; the rest are marked with passedCompanyTypeFilter=false in the dataset and dropped from the local CSV / JSON export. When this list is empty (default), every record is kept regardless of companyType. Note: 'other' is intentionally not in the enum — it is the fallback label when no rule matches, so filtering on it is meaningless.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "saas",
                                "saas_b2b",
                                "agency",
                                "ecommerce",
                                "legal",
                                "medical",
                                "consulting",
                                "manufacturing",
                                "media",
                                "nonprofit",
                                "education",
                                "realestate",
                                "finance"
                            ],
                            "enumTitles": [
                                "SaaS",
                                "SaaS B2B",
                                "Agency",
                                "E-commerce",
                                "Legal",
                                "Medical",
                                "Consulting",
                                "Manufacturing",
                                "Media",
                                "Nonprofit",
                                "Education",
                                "Real estate",
                                "Finance"
                            ]
                        },
                        "default": []
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Apify proxy passthrough. **Default is OFF (no proxy)**: the Actor makes direct HTTP requests when this section is left untouched. Turn 'Use Apify Proxy' on to route through the Apify-managed pool. Tier matters: **DATACENTER** is US-only and ~$2.50/GB (cheap US scrapes, blocked by Cloudflare/Akamai); **RESIDENTIAL** supports 200+ countries (DE, FR, UK, JP, BR, etc.) and is ~$12/GB (bypasses most anti-bot systems, required for non-US geo-targeting). Use 'apifyProxyCountry' to pin exit-node geography. To bring your own proxies, set 'useApifyProxy' off and provide 'proxyUrls' in 'scheme://user:pass@host:port' form."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
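
The paths in the specification can also be called over plain HTTP without a client library. The sketch below builds, but does not send, a POST request for the `run-sync-get-dataset-items` path using only the Python standard library. The function name `build_sync_request` is hypothetical; the base URL, the `~` separator in the Actor path, and the `token` query parameter all come from the specification above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"  # from the "servers" entry in the spec


def build_sync_request(actor_name, token, run_input):
    """Build a POST request for the run-sync-get-dataset-items path."""
    # The API path uses "~" where the Actor name has "/".
    actor_path = actor_name.replace("/", "~")
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{actor_path}/run-sync-get-dataset-items?{query}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_sync_request(
    "operational_zirconia/website-lead-enricher",
    "<YOUR_API_TOKEN>",
    {"urls": ["https://example.com"]},
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen` and parse the response body as JSON. Because this call waits for the run to finish, it suits short runs; for long URL lists, the `/runs` path returns immediately with run metadata.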

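The `csvMode` description in the input schema states that each export mode renders boolean columns differently. The sketch below restates that mapping as a lookup table so the difference is easy to see side by side. It illustrates the documented behaviour only and is not the Actor's actual code; `BOOL_STYLES` and `format_bool` are hypothetical names.

```python
# Boolean renderings per csvMode, as listed in the csvMode description.
BOOL_STYLES = {
    "standard": ("true", "false"),
    "hubspot": ("true", "false"),
    "salesforce": ("TRUE", "FALSE"),
    "pipedrive": ("1", "0"),
}


def format_bool(value, csv_mode="standard"):
    """Render a boolean the way the given csvMode is documented to."""
    true_text, false_text = BOOL_STYLES[csv_mode]
    return true_text if value else false_text


for mode in BOOL_STYLES:
    print(mode, format_bool(True, mode), format_bool(False, mode))
```

A lookup like this is useful when post-processing the standard CSV for a CRM whose import expects a specific boolean spelling.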