# Email & Contact Extractor — Emails, Phones, Socials (`haketa/email-extractor`) Actor

Give a URL or list of domains — get back emails, phones and social profiles (LinkedIn, Twitter/X, Facebook, Instagram, YouTube, GitHub, TikTok, Pinterest) from the homepage and canonical contact pages. Decodes Cloudflare-obfuscated and text-obfuscated emails.

- **URL**: https://apify.com/haketa/email-extractor.md
- **Developed by:** [Haketa](https://apify.com/haketa) (community)
- **Categories:** Lead generation, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $1.20 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Since this Actor supports Apify Store discounts, the price drops as your subscription plan tier rises.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Email & Contact Extractor | Emails, Phones, Socials from Any Website

Paste a list of company URLs or bare domains. Get back a clean dataset with every email address, phone number and social-profile link the actor can find on the homepage and the canonical contact pages — `/contact`, `/about`, `/team`, `/impressum`, `/contatti`, and the rest of the universal contact-page conventions.

Cloudflare-obfuscated emails (the most common protection on company sites) are decoded automatically. So are `name [at] domain [dot] com` and `&#64;` HTML-entity tricks. The output is normalised, deduplicated and ready to ship straight to your CRM, lead-gen workflow, recruiting pipeline or AI agent.

> **TL;DR** — One Apify run, one row per website, all the contact data the site itself publishes. No login. No browser. No spreadsheet wrangling.

---

### What you get

For each website you paste in, the actor returns **one row** with:

| Field | What |
| --- | --- |
| `rootDomain` | Normalised domain (no `www.`, no protocol) |
| `title` | Homepage / contact-page title — useful for filtering |
| `emails` | Array of every email found, deduplicated and lower-cased |
| `phones` | Array of every phone number found, normalised |
| `linkedinUrl` | Company LinkedIn page |
| `twitterUrl` | Company Twitter / X handle |
| `facebookUrl` | Company Facebook page |
| `instagramUrl` | Company Instagram handle |
| `youtubeUrl` | Company YouTube channel |
| `githubUrl` | Company GitHub organisation |
| `tiktokUrl` | Company TikTok |
| `pinterestUrl` | Company Pinterest |
| `emailCount`, `phoneCount`, `socialCount` | Counts for quick filtering |
| `pagesScraped` | Which pages on the domain contributed data |
| `fetchedPagesCount` | How many pages were successfully fetched |
| `errors` | Errors per page if any (null when everything worked) |
| `inputUrl` | The URL you gave us, for round-tripping |
| `scrapedAt` | ISO timestamp |

The complete schema is in `dataset_schema.json`.

---

### How the matching works

#### Emails

- **`mailto:` links** — anything inside `<a href="mailto:…">` is captured.
- **Cloudflare email protection** — every site that ticks the "Email Address Obfuscation" box in their Cloudflare dashboard puts emails behind a `data-cfemail="…"` span. The actor decodes the hex / XOR cipher automatically, so `[email&#160;protected]` becomes the real address.
- **Plain-text regex** — strict pattern that demands a real TLD (`.com`, `.io`, etc., not `.png`).
- **Obfuscation patterns** — `name [at] domain [dot] com`, `(at)`, `&#64;`, `_at_` — all normalised before regex extraction.
- **False-positive filter** — known decoys (`example.com`, `yourdomain.com`, `sentry.io` token hashes), image-extension tails (`@2x.png`), version-string-looking numerics — all dropped.
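
The steps above could be sketched roughly as follows. This is a minimal Python illustration, not the actor's source: the function names, the `KNOWN_TLDS` allow-list and the `DECOY_DOMAINS` set are assumptions made for the example. The Cloudflare part follows the widely documented scheme in which the first byte of the `data-cfemail` hex string is the XOR key for the remaining bytes.

```python
import html
import re

# Illustrative values only; the actor's real lists are not documented here.
KNOWN_TLDS = {"com", "io", "org", "net", "co", "de", "it", "es"}
DECOY_DOMAINS = {"example.com", "yourdomain.com", "sentry.io"}

CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}")


def decode_cfemail(hex_string: str) -> str:
    """Decode a Cloudflare data-cfemail value: byte 0 is the XOR key."""
    data = bytes.fromhex(hex_string)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def deobfuscate(text: str) -> str:
    """Undo HTML entities (&#64;) and [at] / (at) / _at_ / [dot] spellings."""
    text = html.unescape(text)
    text = re.sub(r"\s*(?:\[at\]|\(at\)|_at_)\s*", "@", text, flags=re.I)
    return re.sub(r"\s*(?:\[dot\]|\(dot\))\s*", ".", text, flags=re.I)


def extract_emails(raw_html: str) -> list[str]:
    """Return deduplicated, lower-cased emails in order of discovery."""
    candidates = [decode_cfemail(h) for h in CFEMAIL_RE.findall(raw_html)]
    candidates += EMAIL_RE.findall(deobfuscate(raw_html).lower())
    emails = []
    for email in candidates:
        email = email.lower()
        domain = email.rsplit("@", 1)[-1]
        tld = domain.rsplit(".", 1)[-1]
        # Drop image tails such as logo@2x.png and known decoy domains.
        if tld in KNOWN_TLDS and domain not in DECOY_DOMAINS and email not in emails:
            emails.append(email)
    return emails
```

In this sketch `mailto:` links need no special handling, because the address inside the `href` is picked up by the same plain-text pattern.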

#### Phone numbers

- **`tel:` links** — `<a href="tel:+1-415-…">` extracted and normalised.
- **International / explicit patterns** — `+CC` prefixed digits in body text.
- **Noise control** — bare 4-digit-only numbers ignored, capped at 20 phones per page to keep the column clean.
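
A rough Python sketch of these three rules, under stated assumptions: the two regular expressions and the digits-only normalisation are illustrative guesses, not the actor's code, and the real normalised format may differ (the output sample further down keeps spaces).

```python
import re

TEL_HREF_RE = re.compile(r'href="tel:([^"]+)"', re.I)
# Assumed shape of an international number: "+" followed by 8 to 20 digits,
# optionally separated by spaces, dots, dashes or parentheses.
INTL_RE = re.compile(r"\+\d[\d\s().-]{6,18}\d")
MAX_PHONES_PER_PAGE = 20  # cap mentioned in the list above


def normalise_phone(raw: str) -> str:
    """Keep a leading '+' and strip every other non-digit character."""
    digits = re.sub(r"\D", "", raw)
    return ("+" if raw.strip().startswith("+") else "") + digits


def extract_phones(raw_html: str) -> list[str]:
    candidates = TEL_HREF_RE.findall(raw_html) + INTL_RE.findall(raw_html)
    phones = []
    for raw in candidates:
        phone = normalise_phone(raw)
        if len(phone.lstrip("+")) <= 4:  # ignore bare 4-digit numbers
            continue
        if phone not in phones:
            phones.append(phone)
    return phones[:MAX_PHONES_PER_PAGE]
```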

#### Socials

Per-platform regexes match the canonical profile URL shape, skipping share / intent / login links. For LinkedIn we accept `/company/`, `/in/` and `/school/`; for Twitter we skip `/intent/`, `/share?`, `/i/`; for Facebook we skip `/sharer/`, `/dialog/`, `/tr?`; etc.
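
To make the skip rules concrete, here is a hedged Python sketch for three of the eight platforms. The patterns are written for this example from the rules described above and are not the actor's actual regexes; `SOCIAL_PATTERNS` and `extract_socials` are hypothetical names.

```python
import re
from typing import Optional

SOCIAL_PATTERNS = {
    # Accept /company/, /in/ and /school/ profile shapes only.
    "linkedinUrl": re.compile(
        r"https?://(?:[a-z]{2,3}\.)?linkedin\.com/(?:company|in|school)/[\w%-]+", re.I
    ),
    # Skip /intent/, /share? and /i/ links via a negative lookahead.
    "twitterUrl": re.compile(
        r"https?://(?:www\.)?(?:twitter|x)\.com/(?!intent/|share\?|i/)\w{1,15}", re.I
    ),
    # Skip /sharer/, /dialog/ and /tr? links.
    "facebookUrl": re.compile(
        r"https?://(?:www\.)?facebook\.com/(?!sharer/|dialog/|tr\?)[\w.-]+", re.I
    ),
}


def extract_socials(raw_html: str) -> dict[str, Optional[str]]:
    """Return the first canonical profile URL per platform, or None."""
    result = {}
    for field, pattern in SOCIAL_PATTERNS.items():
        match = pattern.search(raw_html)
        result[field] = match.group(0) if match else None
    return result
```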

#### Multi-page coverage

Every input URL becomes a small queue of up to `maxPagesPerDomain` URLs: the input itself, then the canonical contact paths (`/contact`, `/contact-us`, `/about`, `/about-us`, `/team`, `/company`, `/impressum`, `/imprint`, `/contatti`, `/contacto`, `/kontakt`). Pages that return 404 are skipped silently and don't fail the row; we just move on and aggregate whatever we find.
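
The queue construction might look like the following Python sketch. `build_page_queue` is a hypothetical helper, and two details are assumptions: that bare domains are upgraded to `https://`, and that a contact path identical to the input URL is not queued twice.

```python
from urllib.parse import urlsplit

DEFAULT_CONTACT_PATHS = [
    "/contact", "/contact-us", "/about", "/about-us", "/team", "/company",
    "/impressum", "/imprint", "/contatti", "/contacto", "/kontakt",
]


def build_page_queue(input_url, contact_paths=DEFAULT_CONTACT_PATHS,
                     max_pages_per_domain=6):
    """Input URL first, then contact paths, capped at max_pages_per_domain."""
    if "://" not in input_url:
        input_url = "https://" + input_url  # assumption: bare domains use HTTPS
    parts = urlsplit(input_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    queue = [input_url]
    for path in contact_paths:
        candidate = origin + path
        if candidate not in queue:
            queue.append(candidate)
    return queue[:max_pages_per_domain]
```

With the default limit of 6, an input of `https://stripe.com/about` yields that page plus `/contact`, `/contact-us`, `/about-us`, `/team` and `/company` in this sketch.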

---

### Use cases

#### B2B sales prospecting
You bought a list of 5,000 SaaS companies from Apollo / Crunchbase but only have website URLs. Run the list through this actor, get emails + LinkedIn + phones for every domain, then push to your outbound tool.

#### Recruiter pipeline
Have a list of target companies for a hiring campaign? Run them through, get HR / talent / hiring-manager emails (anything published on `/team` or `/about`) plus the company LinkedIn for InMail follow-up.

#### Brand monitoring & PR
Maintain a database of journalists' / agencies' contact emails. Run their domains periodically to track changes — new staff appear on `/team`, contact emails change with company restructures.

#### Investor / VC sourcing
Have a list of portfolio companies' websites? Pull their emails, founders' socials and the `socialCount` tally in one pass.

#### Marketing partnerships / affiliate programs
Find the right inbox at thousands of potential partners (`partnerships@`, `bizdev@`, `affiliates@`). The actor surfaces every email on the page; you filter for the role-keywords you care about.
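
A downstream role-keyword filter can be as small as the Python sketch below. The keyword tuple and the `role_emails` name are assumptions for illustration; adjust the keywords to the roles you care about.

```python
ROLE_KEYWORDS = ("partnerships", "bizdev", "affiliates")


def role_emails(emails, keywords=ROLE_KEYWORDS):
    """Keep emails whose local part contains one of the role keywords."""
    kept = []
    for email in emails:
        local_part = email.split("@", 1)[0].lower()
        if any(keyword in local_part for keyword in keywords):
            kept.append(email)
    return kept
```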

#### Lead-list enrichment
Already have an existing lead list? Paste the URL column and you get an enriched table back with emails, phones and social handles per row. Pair it with the dataset's CSV export and you're 30 seconds away from a refreshed Sheets tab.

#### AI agents & workflows
This actor is a perfect input for downstream LLM-driven enrichment: deduplicate companies by domain, get their public contact surface, then let your agent decide which mailbox to send the outbound to.

#### Compliance / GDPR audit
Map a company's entire externally-visible contact surface in one run. Compare against your own internal records to find stale or orphaned mailboxes.

---

### Inputs (full list)

The canonical definitions live in `input_schema.json`; here's the human summary.

- **`startUrls`** *(array)* — URLs or bare domains. Both `stripe.com` and `https://stripe.com/about` are accepted. Each entry becomes one output row.
- **`maxPagesPerDomain`** *(integer)* — How many pages on the website to fetch. Page 1 is always the URL you gave us; the rest come from `contactPaths`. Default `6`, minimum `1` (cheapest, just the input page), maximum `30`.
- **`contactPaths`** *(array)* — Which sub-paths to probe on each domain. Defaults cover EN + DE + IT + ES for international portfolios.
- **`includeSocials`** *(boolean)* — Toggle social-link extraction. Default `true`.
- **`includePhones`** *(boolean)* — Toggle phone extraction. Default `true`.
- **`decodeObfuscatedEmails`** *(boolean)* — Toggle Cloudflare + text obfuscation decoding. Default `true`.
- **`maxConcurrency`** *(integer)* — How many websites to crawl in parallel. Default `5`, max `20`.
- **`requestDelay`** *(integer)* — Milliseconds between page fetches *inside* one website. Default `800`.
- **`proxyConfiguration`** *(proxy)* — Apify Proxy. Defaults to Apify Datacenter (rotated per request).

---

### Example inputs

#### 1. Enrich a list of SaaS companies (default)

```json
{
  "startUrls": [
    { "url": "https://stripe.com" },
    { "url": "https://ramp.com" },
    { "url": "https://airtable.com" },
    { "url": "https://notion.so" },
    { "url": "https://figma.com" }
  ]
}
````

#### 2. Bare domains, just the homepage (cheapest mode)

```json
{
  "startUrls": [
    { "url": "stripe.com" },
    { "url": "ramp.com" }
  ],
  "maxPagesPerDomain": 1
}
```

#### 3. Emails only, no socials / phones

```json
{
  "startUrls": [{ "url": "https://anthropic.com" }],
  "includeSocials": false,
  "includePhones": false
}
```

#### 4. Deep contact-page coverage for a tricky DACH site

```json
{
  "startUrls": [{ "url": "https://www.bosch.com" }],
  "maxPagesPerDomain": 12,
  "contactPaths": [
    "/de/contact",
    "/en/contact",
    "/impressum",
    "/de/impressum",
    "/en/legal-notice",
    "/de/karriere",
    "/en/careers",
    "/about-bosch",
    "/de/unternehmen",
    "/contact-us"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```

#### 5. Big list, high concurrency

```json
{
  "startUrls": [
    { "url": "https://stripe.com" },
    { "url": "https://ramp.com" },
    { "url": "https://airtable.com" }
  ],
  "maxConcurrency": 10,
  "requestDelay": 500
}
```

***

### Output sample

```json
{
  "inputUrl": "https://stripe.com",
  "rootDomain": "stripe.com",
  "title": "Stripe | Financial Infrastructure to Grow Your Revenue",
  "emails": ["press@stripe.com", "support@stripe.com"],
  "phones": ["+1 888 926 2289"],
  "linkedinUrl": "https://www.linkedin.com/company/stripe",
  "twitterUrl": "https://twitter.com/stripe",
  "facebookUrl": "https://www.facebook.com/StripeHQ",
  "instagramUrl": null,
  "youtubeUrl": "https://www.youtube.com/@StripeDevs",
  "githubUrl": "https://github.com/stripe",
  "tiktokUrl": null,
  "pinterestUrl": null,
  "emailCount": 2,
  "phoneCount": 1,
  "socialCount": 5,
  "pagesScraped": [
    "https://stripe.com",
    "https://stripe.com/contact",
    "https://stripe.com/about"
  ],
  "fetchedPagesCount": 3,
  "errors": null,
  "scrapedAt": "2026-05-31T18:14:02.000Z"
}
```

***

### Cost & throughput

This actor uses Apify's pay-per-event pricing. The exact tier is set on the Apify Store listing.

Throughput on the default config (`maxPagesPerDomain: 6`, `maxConcurrency: 5`, `requestDelay: 800`):

- \~5–8 websites per minute → 300–500 / hour.
- Drop `maxPagesPerDomain` to `1` and raise `maxConcurrency` to `10` for ~30 sites/minute.

With the default `maxPagesPerDomain` of 6, each input URL touches up to 6 pages (the input page plus contact paths), so a 1,000-URL run makes at most 6,000 requests. Light by HTTP standards; this is why the actor stays cheap.
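
The arithmetic behind these figures, as a small Python sketch. `estimate_run` is a hypothetical helper; the 5 to 8 sites per minute range is the default-config estimate quoted above, not a guarantee.

```python
import math


def estimate_run(url_count, max_pages_per_domain=6, sites_per_minute=(5, 8)):
    """Upper bound on requests plus a rough duration range in minutes."""
    slow, fast = sites_per_minute
    return {
        "max_requests": url_count * max_pages_per_domain,
        "fastest_minutes": math.ceil(url_count / fast),
        "slowest_minutes": math.ceil(url_count / slow),
    }


print(estimate_run(1000))
# {'max_requests': 6000, 'fastest_minutes': 125, 'slowest_minutes': 200}
```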

***

### How the technique stacks up

There are dozens of "email finder" tools out there. Most are closed-API SaaS with monthly subscriptions and per-credit pricing. Here's where this Apify actor positions itself:

- **No subscription** — pay only for the runs you trigger.
- **Cloudflare email-protection decoder built in** — most generic scrapers miss obfuscated emails; we decode every `data-cfemail` span automatically.
- **Multi-language contact-page paths** — DE / IT / ES / EN by default, so EU runs aren't crippled by the English-only assumption.
- **Per-request session rotation via Apify Proxy** — surviving long lists without IP rate-limit drama is the difference between 95% success and 25%.
- **0-row runs exit cleanly, not as FAILED** — your scheduled enrichment job stays green when a few of the input domains 404.
- **All eight socials in one pass** — LinkedIn, Twitter / X, Facebook, Instagram, YouTube, GitHub, TikTok, Pinterest — no second tool needed for that column.

***

### Tips & troubleshooting

**Q: Half my rows have `emailCount: 0`. What's wrong?**
A: Three usual suspects:

1. The domain's `/contact` page lives at a custom path (e.g. `/contact-us-en`, `/about/team`). Extend `contactPaths` with the candidate you saw in your browser.
2. The site uses an aggressive bot wall (Cloudflare full challenge, PerimeterX). Switch the proxy group to `RESIDENTIAL` — the row's `errors` column will tell you which page got blocked.
3. The site simply doesn't publish emails on its public pages. Some enterprise / private-equity sites do this on purpose; no scraper will fix that.

**Q: I'm seeing too many "junk" emails (`hash@sentry.io`, `you@example.com`, …).**
A: We filter the worst offenders, but new ones appear every week. Filter the `emails` column downstream by domain: keep only entries whose root-domain (`@stripe.com`) matches the row's `rootDomain`.
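
A minimal Python sketch of that downstream filter, assuming rows shaped like the output sample (an `emails` array and a `rootDomain` string). `keep_own_domain_emails` is a hypothetical name, and accepting subdomains such as `mail.stripe.com` is a choice made for this example.

```python
def keep_own_domain_emails(row: dict) -> list[str]:
    """Keep emails on the row's rootDomain or one of its subdomains."""
    root = row["rootDomain"].lower()
    kept = []
    for email in row.get("emails") or []:
        domain = email.rsplit("@", 1)[-1].lower()
        if domain == root or domain.endswith("." + root):
            kept.append(email)
    return kept
```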

**Q: Phones are noisy / wrong.**
A: Phone extraction is intentionally conservative — we only pick `tel:` links plus clearly-international `+CC …` patterns. If your inputs are US-only sites you can disable `includePhones` and rely on a US-specific phone scraper downstream.

**Q: I want one row per email, not per website.**
A: Use Apify's "Transform dataset" integration with `flat`-ish JS — `for (const e of item.emails) yield { ...item, email: e }`. The aggregate shape is easier to reason about as the default but trivial to explode.
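
If you post-process the dataset in Python instead, the same explode step could look like this sketch. `explode_by_email` is a hypothetical helper that works on rows already downloaded from the dataset.

```python
def explode_by_email(items):
    """Yield one copy of each row per email, with an added 'email' field."""
    for item in items:
        for email in item.get("emails") or []:
            yield {**item, "email": email}
```

Rows without any email produce no output rows in this version.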

**Q: How do I deduplicate when I run the same list weekly?**
A: Schedule the run with Apify Schedules and deduplicate downstream with `rootDomain` as the key, for example with a small transform: `groupBy(rootDomain)`, keeping the most recent `scrapedAt` per domain.
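
One way to read `groupBy(rootDomain)` is sketched below in Python: merge the rows from several weekly runs and keep the newest row per domain. `latest_per_domain` is a hypothetical helper, and it assumes every `scrapedAt` uses the same ISO 8601 UTC format as the output sample, so plain string comparison orders the timestamps correctly.

```python
def latest_per_domain(rows):
    """Keep the most recently scraped row for each rootDomain."""
    latest = {}
    for row in rows:
        key = row["rootDomain"]
        if key not in latest or row["scrapedAt"] > latest[key]["scrapedAt"]:
            latest[key] = row
    return list(latest.values())
```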

**Q: How fresh is the data?**
A: Real-time. Every page is fetched live; there's no cache layer in the actor.

**Q: Can I scrape a list of 100,000 URLs in one go?**
A: Yes, but break it into 5,000-URL batches per run for predictable cost and to keep the dataset payload reasonable. Schedule them daily via Apify Schedules.
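
Splitting a long list into per-run inputs is straightforward; the Python sketch below shows one way. `batch_inputs` is a hypothetical helper, and each yielded dictionary follows the `startUrls` input shape used in the examples above.

```python
def batch_inputs(urls, batch_size=5000):
    """Yield one Actor input per batch of at most batch_size URLs."""
    for start in range(0, len(urls), batch_size):
        chunk = urls[start:start + batch_size]
        yield {"startUrls": [{"url": url} for url in chunk]}
```

Each yielded input can then be passed to a separate run, for example through the Python client shown in the API section.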

**Q: My proxy errors are persistent.**
A: Two levers — (1) drop `maxConcurrency` to 2-3, (2) switch the proxy group to `RESIDENTIAL`. The actor automatically rotates the proxy session per request, but very heavy load on cheap datacenter IPs will eventually get throttled by Cloudflare.

***

### Legal & ethical use

This actor reads public, indexable HTML — the same pages Google sees. Use the output responsibly:

- **GDPR / CAN-SPAM** apply to your usage, not to the act of scraping. If you cold-email EU contacts, you need a lawful basis (legitimate interest works for B2B if the email is professional, the topic is relevant, and you respect opt-outs). The actor does not bypass any login, paywall or robots.txt-disallowed path.
- **Role addresses vs personal** — `info@`, `support@`, `press@` are explicitly company contact points and safe to use. `firstname.lastname@` are individuals; treat them per the rules above.
- Don't spam. Don't scrape behind paywalls. Don't use this for credential stuffing or harassment. Apify will deactivate any account that does.

***

### How this compares to SaaS email finders

| | This actor | Hunter.io | Snov.io | Apollo.io |
| --- | --- | --- | --- | --- |
| Pricing | Pay-per-run, no subscription | $49–$499/mo | $39–$249/mo | $59–$149/mo |
| Per-result cost | Pennies | $0.10–$1.00/credit | $0.10–$0.30/credit | Bundled |
| Cloudflare email-protection decoder | ✅ Built in | ❌ | ❌ | ❌ |
| Multi-page crawl per domain | ✅ Up to 30 | Single page | Single page | Single page |
| Eight social profiles in one pass | ✅ | LinkedIn only | LinkedIn only | LinkedIn focus |
| DE / IT / ES / EN contact paths | ✅ Built in | English | English | English |
| Self-host / data ownership | ✅ Your Apify account | ❌ | ❌ | ❌ |
| Roll into your own pipeline | ✅ REST / webhook / SDK | API | API | API |

The trade-off: SaaS finders try to **predict** the email of a specific person (`first.last@domain`). This actor **extracts** every email the website actually publishes. If you need predicted emails for individuals, pair this with a verification tool. If you need real published mailboxes (PR, support, sales, partnerships) — this is what you want.

***

### Industry-specific playbooks

#### B2B SaaS sales (outbound)

Run your TAM list through the actor, filter rows where the email's domain matches the row's `rootDomain` (drops noise like `support@cloudflare.com` from sites using CF). Concatenate `linkedinUrl` into your sequencer for combined email + LinkedIn touch.

#### Staffing agencies / executive search

Map a target list of companies to their `linkedinUrl` (for InMail) plus public emails (`hr@`, `talent@`, `careers@`). Then dedup against your existing reach-out CRM so you don't double-touch.

#### PR & media outreach

Pull `press@`, `media@`, `pr@` mailboxes off a list of brand websites. The `title` column doubles as a quick brand-positioning hint before you draft the pitch.

#### Real-estate / property investors

Agency websites typically expose `info@`, `sales@` plus a phone number. Run a postcode-filtered list of agent sites through this actor and you've built a regional outbound list in one batch.

#### Venture capital / corp dev

For every portfolio / target company, pull socials + emails + phones. The `githubUrl` is gold for technical-due-diligence — it surfaces public open-source activity that paid databases often miss.

#### Local SEO agencies

Pair this actor with a Google Maps scraper: feed agency website URLs from the Maps results into this actor, get back the emails and socials you couldn't see in the Maps card.

***

### Common patterns we've seen

A few patterns that crop up frequently and how to handle them:

- **`info@`, `hello@`, `contact@` dominate the results.** These are the universal role mailboxes — perfectly usable for B2B outreach, but tend to be triaged slowly. Pair them with the LinkedIn URL for a faster route to a human.
- **Agencies hide their team behind "Book a call" forms.** When a row has `emailCount: 0` but `socialCount: 5+` and the page title looks polished, you've hit one of these. The LinkedIn URL is still the best outbound entry point.
- **DACH (DE / AT / CH) sites concentrate everything on `/impressum`.** Required by German law. Our default `contactPaths` already includes `/impressum` and `/imprint`.
- **Cloudflare-obfuscated emails appear as `[email&#160;protected]` in raw HTML.** Without the decoder you'd miss them entirely; with it on (the default), they're transparently captured.
- **Newer sites publish socials via `<link rel>` tags in `<head>`.** We scan raw HTML so these are captured too even when they're missing from anchor links.

***

### Changelog

- **1.0** — Initial release. Email regex + mailto + Cloudflare-decoder + text-obfuscation patterns. Phones via `tel:` and international patterns. Eight social profiles. Multi-page-per-domain crawl, concurrent workers, Apify Proxy.

***

### Roadmap & feature requests

We read every Apify Store review and comment. High-priority candidates for v1.1+:

- Per-email row mode (one row per discovered email).
- Name extraction near emails (`John Smith — john@…`).
- Email pattern detection (`{first}.{last}@domain.com`) for known-staff inference.
- Configurable phone region (US-only / UK-only filter).
- Optional sitemap.xml expansion when the input domain has a sitemap.
- DNS-based deliverability check (MX / SMTP probe).

Drop a comment on the Store page if any of these would unblock you.

# Actor input Schema

## `startUrls` (type: `array`):

URLs or bare domains. 'stripe.com', 'https://stripe.com' and 'https://stripe.com/contact' are all accepted. Each entry becomes one output row.

## `maxPagesPerDomain` (type: `integer`):

How many pages of a website to crawl. Page 1 is always the URL you supplied; additional pages come from the list below in order. Set to 1 to only look at the page you supplied (cheapest), 5–8 for typical contact-page coverage.

## `contactPaths` (type: `array`):

Sub-paths the actor probes on each domain when Max Pages Per Website > 1. The defaults cover the canonical places companies stash their contact info, including German Impressum / Italian Contatti for EU.

## `includeSocials` (type: `boolean`):

Extract LinkedIn, Twitter/X, Facebook, Instagram, YouTube, GitHub, TikTok and Pinterest profile links.

## `includePhones` (type: `boolean`):

Extract phone numbers from `tel:` links and explicit phone patterns. Disable if you only need emails.

## `decodeObfuscatedEmails` (type: `boolean`):

Decode common email obfuscations: Cloudflare /cdn-cgi/l/email-protection (the most common one; many sites use it), 'name \[at] domain \[dot] com', `&#64;` HTML entities, and similar.

## `maxConcurrency` (type: `integer`):

How many websites to crawl at the same time. 3–5 is friendly. Raise for fast runs on large lists, lower if you see proxy errors.

## `requestDelay` (type: `integer`):

Delay between page fetches inside one website. 500–1500 ms is polite.

## `proxyConfiguration` (type: `object`):

ON by default — Apify Datacenter rotates IPs which is plenty for the long tail of company websites. If you target Cloudflare-protected or rate-limited domains and see HTTP 403 in the log, switch the group to RESIDENTIAL.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://stripe.com"
    },
    {
      "url": "https://ramp.com"
    }
  ],
  "maxPagesPerDomain": 6,
  "contactPaths": [
    "/contact",
    "/contact-us",
    "/about",
    "/about-us",
    "/team",
    "/company",
    "/impressum",
    "/imprint",
    "/contatti",
    "/contacto",
    "/kontakt"
  ],
  "includeSocials": true,
  "includePhones": true,
  "decodeObfuscatedEmails": true,
  "maxConcurrency": 5,
  "requestDelay": 800,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": []
  }
}
```

# Actor output Schema

## `rootDomain` (type: `string`):

The website's root domain

## `title` (type: `string`):

Homepage / contact-page title

## `emails` (type: `string`):

Comma-separated emails

## `phones` (type: `string`):

Comma-separated phone numbers

## `linkedinUrl` (type: `string`):

LinkedIn profile URL

## `twitterUrl` (type: `string`):

Twitter / X profile URL

## `facebookUrl` (type: `string`):

Facebook page URL

## `instagramUrl` (type: `string`):

Instagram profile URL

## `youtubeUrl` (type: `string`):

YouTube channel URL

## `githubUrl` (type: `string`):

GitHub organisation URL

## `tiktokUrl` (type: `string`):

TikTok profile URL

## `inputUrl` (type: `string`):

Echo of the URL you supplied

## `scrapedAt` (type: `string`):

ISO timestamp

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://stripe.com"
        },
        {
            "url": "https://ramp.com"
        }
    ],
    "maxPagesPerDomain": 6,
    "contactPaths": [
        "/contact",
        "/contact-us",
        "/about",
        "/about-us",
        "/team",
        "/company",
        "/impressum",
        "/imprint",
        "/contatti",
        "/contacto",
        "/kontakt"
    ],
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": []
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("haketa/email-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [
        { "url": "https://stripe.com" },
        { "url": "https://ramp.com" },
    ],
    "maxPagesPerDomain": 6,
    "contactPaths": [
        "/contact",
        "/contact-us",
        "/about",
        "/about-us",
        "/team",
        "/company",
        "/impressum",
        "/imprint",
        "/contatti",
        "/contacto",
        "/kontakt",
    ],
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": [],
    },
}

# Run the Actor and wait for it to finish
run = client.actor("haketa/email-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://stripe.com"
    },
    {
      "url": "https://ramp.com"
    }
  ],
  "maxPagesPerDomain": 6,
  "contactPaths": [
    "/contact",
    "/contact-us",
    "/about",
    "/about-us",
    "/team",
    "/company",
    "/impressum",
    "/imprint",
    "/contatti",
    "/contacto",
    "/kontakt"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": []
  }
}' |
apify call haketa/email-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=haketa/email-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Email & Contact Extractor — Emails, Phones, Socials",
        "description": "Give a URL or list of domains — get back emails, phones and social profiles (LinkedIn, Twitter/X, Facebook, Instagram, YouTube, GitHub, TikTok, Pinterest) from the homepage and canonical contact pages. Decodes Cloudflare-obfuscated and text-obfuscated emails.",
        "version": "1.0",
        "x-build-id": "3St7A6FPt6oMbm8Jc"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/haketa~email-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-haketa-email-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/haketa~email-extractor/runs": {
            "post": {
                "operationId": "runs-sync-haketa-email-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/haketa~email-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-haketa-email-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Websites",
                        "type": "array",
                        "description": "URLs or bare domains. 'stripe.com', 'https://stripe.com' and 'https://stripe.com/contact' are all accepted. Each entry becomes one output row.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxPagesPerDomain": {
                        "title": "Max Pages Per Website",
                        "minimum": 1,
                        "maximum": 30,
                        "type": "integer",
                        "description": "How many pages of a website to crawl. Page 1 is always the URL you supplied; additional pages come from the list below in order. Set to 1 to only look at the page you supplied (cheapest), 5–8 for typical contact-page coverage.",
                        "default": 6
                    },
                    "contactPaths": {
                        "title": "Contact-Page Paths to Try",
                        "type": "array",
                        "description": "Sub-paths the Actor probes on each domain when Max Pages Per Website > 1. The defaults cover the canonical places where companies publish their contact info, including German Impressum and Italian Contatti pages for EU sites.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "includeSocials": {
                        "title": "Include Social Profiles",
                        "type": "boolean",
                        "description": "Extract LinkedIn, Twitter/X, Facebook, Instagram, YouTube, GitHub, TikTok and Pinterest profile links.",
                        "default": true
                    },
                    "includePhones": {
                        "title": "Include Phone Numbers",
                        "type": "boolean",
                        "description": "Extract phone numbers from `tel:` links and explicit phone patterns. Disable if you only need emails.",
                        "default": true
                    },
                    "decodeObfuscatedEmails": {
                        "title": "Decode Obfuscated Emails",
                        "type": "boolean",
                        "description": "Decode common email obfuscations: Cloudflare /cdn-cgi/l/email-protection (the most common one, used by many sites), 'name [at] domain [dot] com', '&#64;' HTML entities, and similar.",
                        "default": true
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency (parallel websites)",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "How many websites to crawl at the same time. 3–5 is friendly. Raise for fast runs on large lists, lower if you see proxy errors.",
                        "default": 5
                    },
                    "requestDelay": {
                        "title": "Per-Page Delay (ms)",
                        "minimum": 100,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Delay between page fetches inside one website. 500–1500 ms is polite.",
                        "default": 800
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "ON by default. Apify Datacenter proxies rotate IPs, which is sufficient for the long tail of company websites. If you target Cloudflare-protected or rate-limited domains and see HTTP 403 in the log, switch the proxy group to RESIDENTIAL."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
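
The `inputSchema` bounds and the `run-sync` path above can be checked client-side before a run is started. The following is a minimal sketch, not the Actor's own code: `build_input` and `build_run_sync_request` are hypothetical helper names, and `API_BASE` assumes the public Apify API base URL, which is not shown in this part of the definition. The request is only constructed, never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: public Apify API base URL (not stated in the schema excerpt above).
API_BASE = "https://api.apify.com/v2"
# Path taken from the OpenAPI definition above.
ACTOR_PATH = "/acts/haketa~email-extractor/run-sync"

# (minimum, maximum) bounds copied from inputSchema.
INTEGER_BOUNDS = {
    "maxPagesPerDomain": (1, 30),
    "maxConcurrency": (1, 20),
    "requestDelay": (100, 10000),
}


def build_input(sites, **options):
    """Return a dict shaped like inputSchema, or raise ValueError."""
    for name, value in options.items():
        if name in INTEGER_BOUNDS:
            low, high = INTEGER_BOUNDS[name]
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= high:
                raise ValueError(
                    f"{name} must be an integer in [{low}, {high}], got {value!r}"
                )
    # Each startUrls item is an object with a required "url" key.
    return {"startUrls": [{"url": site} for site in sites], **options}


def build_run_sync_request(token, actor_input):
    """Build (but do not send) the POST request for the run-sync endpoint."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

To actually start a run, the returned object could be passed to `urllib.request.urlopen` with a real token in place of a placeholder. Only the three bounded integer fields are validated here; the remaining properties are passed through unchanged.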
