# Email Scraper (`weu/my-actor`) Actor

It scrapes anything that looks like an email on a website. It applies 9 protocols and gathers raw material that you later need to clean. It's a garbage collector. Gather everything there is, and then pick what you need. Results are exported to kv store in an xlsx file. Give it a go and have fun! 👍

- **URL**: https://apify.com/weu/my-actor.md
- **Developed by:** [WEU](https://apify.com/weu) (community)
- **Categories:** Automation, Developer tools, Lead generation
- **Stats:** 1 total users, 1 monthly users, 0.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Python Email Scraper

<!--  This is the Email Scraper Readme file -->

This Actor specializes in raw scraping of email addresses from websites. It crawls a website (or a list of websites) in search of anything that looks like an email. It implements 9 different scraping protocols to leave nothing behind. It works as a garbage collector, returning strings in the format "something@domain". It is very useful for business projects that need email lists for marketing or for reaching out to a wide range of audiences.
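
To give a feel for what "anything that looks like an email" means, here is a minimal sketch of pattern-based extraction. The regular expression and the `extract_email_candidates` helper are illustrative assumptions, not the Actor's actual protocols, and the pattern is deliberately permissive so that false positives show up too.

```python
import re

# Deliberately permissive "something@domain" pattern (an assumption for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_email_candidates(text: str) -> list[str]:
    """Return unique email-like strings, lowercased, in order of first appearance."""
    seen: set[str] = set()
    found: list[str] = []
    for match in EMAIL_RE.findall(text):
        candidate = match.lower()
        if candidate not in seen:
            seen.add(candidate)
            found.append(candidate)
    return found


html = '<a href="mailto:Sales@Example.com">sales@example.com</a> or logo@2x.png'
print(extract_email_candidates(html))
# → ['sales@example.com', 'logo@2x.png']
```

The second hit is an image file name, not an address. That is the kind of raw material you should expect to clean after a run.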

### Included features

- **Flexible scraping parameters** that adjust to any circumstances, from shallow to deep scraping, depending on your specific interests and Apify budget.
- **Convenient export** of results to an Excel file, so no additional formatting of data is needed. Just download it and you're set!
- **Strong email detection**, going from plain HTML crawling to JavaScript injection. No stone unturned.
- **Resilient to Apify server swaps**, keeping data across server changes and resuming from where it left off.
- **Parallel scraping** of several pages at a time, saving time and costs. The number of pages scraped in parallel can be set in the input data.
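
The parallel scraping feature can be pictured as a cap on how many pages are in flight at once. The sketch below shows one common way to do that with an `asyncio.Semaphore`. The `scrape_page` function is a hypothetical stand-in for loading a page, and nothing here is taken from the Actor's source.

```python
import asyncio


async def scrape_page(url: str) -> str:
    """Hypothetical stand-in for loading and scanning one page."""
    await asyncio.sleep(0.01)
    return f"scanned {url}"


async def scrape_all(urls: list[str], parallel_limit: int) -> tuple[list[str], int]:
    """Scrape all URLs with at most `parallel_limit` pages in flight at once."""
    semaphore = asyncio.Semaphore(parallel_limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous pages observed

    async def bounded(url: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await scrape_page(url)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(bounded(u) for u in urls))
    return list(results), peak


urls = [f"https://example.com/page/{i}" for i in range(12)]
results, peak = asyncio.run(scrape_all(urls, parallel_limit=5))
print(len(results), "pages scanned, peak concurrency:", peak)
```

Raising `parallel_limit` shortens the wall-clock time of the crawl, but every extra page in flight needs memory.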

### How it works

It fishes for emails on websites, page by page, starting at the top-level pages and going down the rabbit hole as deep as you set in the parameters. Every time it finds emails, it stores them in an Excel file in the key-value store. That way, if you cancel the scrape midway or it shuts down for any reason, you won't lose your progress. It's just one output file. Easy, compact, ready to use. No additional CSV exporting or formatting. Just good old Excel, widely compatible with every platform there is.

An important technical note: the crawler first scrapes all pages found at depth 0, then moves on to depth 1, 2, and so on, until it reaches the maximum depth, the maximum number of emails, or the maximum number of pages per website. These multiple boundaries guarantee that your scrape won't run forever and spend all your credits. They also define the type of scrape you want to do.
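
The depth-by-depth order and the three stop conditions can be sketched as a breadth-first crawl. This is only an illustration under stated assumptions: the `SITE` dictionary is a made-up in-memory website, and the `crawl` function is not the Actor's code.

```python
from collections import deque

# Hypothetical in-memory site: page -> (emails on the page, outgoing links).
SITE = {
    "/": (["info@example.com"], ["/about", "/contact"]),
    "/about": ([], ["/team", "/"]),
    "/contact": (["sales@example.com"], ["/team"]),
    "/team": (["ceo@example.com"], []),
}


def crawl(start: str, max_depth: int, max_pages: int, max_emails: int):
    """Breadth-first crawl: all depth-0 pages, then depth 1, and so on."""
    emails: list[str] = []
    visited = {start}
    queue = deque([(start, 0)])
    pages_scraped = 0
    # Stop as soon as any one of the boundaries is reached.
    while queue and pages_scraped < max_pages and len(emails) < max_emails:
        page, depth = queue.popleft()
        found, links = SITE.get(page, ([], []))
        pages_scraped += 1
        for email in found:
            if email not in emails and len(emails) < max_emails:
                emails.append(email)
        if depth < max_depth:  # only follow links while below the depth limit
            for link in links:
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return emails, pages_scraped


print(crawl("/", max_depth=2, max_pages=100, max_emails=30))
# → (['info@example.com', 'sales@example.com', 'ceo@example.com'], 4)
print(crawl("/", max_depth=1, max_pages=100, max_emails=30))
# → (['info@example.com', 'sales@example.com'], 3)
```

Because the queue is first in, first out, every page at one depth is processed before any page at the next depth.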

### Getting started

Before starting the scrape, fill in the necessary input data. Any missing value falls back to a default, except for the website addresses, which are explicitly required. The success of this scraper is directly tied to the input data you provide. If you want a shallow scrape, lower the values of depth, load and post-load timeouts, maximum pages per website, and maximum emails to find. Do the opposite for deeper scraping.

The parallel scraping limit is best kept as high as your Apify plan allows. Higher parallel activity means faster credit usage but also a faster crawl, and if you do the math, faster is always better. Keep in mind that more parallel tasks consume more memory, so you must set a memory allocation high enough for the amount of parallel scraping you're aiming for. A higher memory allocation also increases compute unit consumption, so handle it with care. As a rough idea, 1 GB of memory is a manageable threshold for 10 pages scraped in parallel on most runs, but the exact allocation will strongly depend on your specific crawling needs.
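
To "do the math" on memory versus speed, here is a rough sketch. It assumes Apify's documented definition of a compute unit as 1 GB of memory used for 1 hour. The memory sizes and run times below are made-up numbers for illustration, not measurements of this Actor.

```python
def compute_units(memory_mb: int, run_seconds: float) -> float:
    """Compute units = memory in GB multiplied by run time in hours."""
    return (memory_mb / 1024) * (run_seconds / 3600)


# Hypothetical runs: (memory in MB, run time in minutes).
for memory_mb, minutes in [(1024, 60), (2048, 30), (4096, 30)]:
    cu = compute_units(memory_mb, minutes * 60)
    print(f"{memory_mb} MB for {minutes} min -> {cu:.2f} CU")
```

In this sketch, doubling the memory costs the same number of compute units if the run finishes in half the time, and costs more if it does not.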

It is also important to handle the depth input value carefully. Most websites keep their emails between depth 0 and 2. Going beyond that can grow the number of pages to scrape exponentially. Handle with care.
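
A quick back-of-the-envelope sketch of why depth matters. It assumes, purely for illustration, that every page links to the same number of new pages. Real sites vary, and `maxPagesPerSite` caps the total anyway.

```python
def max_pages_up_to_depth(links_per_page: int, max_depth: int) -> int:
    """Upper bound on pages reachable from one start URL up to `max_depth`."""
    return sum(links_per_page ** depth for depth in range(max_depth + 1))


# Hypothetical site where every page links to 10 pages not seen before.
for depth in range(5):
    print(f"maxDepth={depth}: up to {max_pages_up_to_depth(10, depth)} pages")
# maxDepth=2 gives up to 111 pages, maxDepth=4 gives up to 11111 pages
```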

# Actor input Schema

## `startUrls` (type: `array`):

List of starting URLs to crawl

## `maxDepth` (type: `integer`):

Maximum depth of links to follow from each start URL

## `maxPagesPerSite` (type: `integer`):

Maximum number of pages to scrape per site

## `parallelLimit` (type: `integer`):

Maximum number of pages to scrape in parallel

## `maxEmails` (type: `integer`):

Stop crawling a site once this many emails have been found

## `pageLoadTimeoutMs` (type: `integer`):

Timeout for page load before giving up.

## `postLoadWaitMs` (type: `integer`):

Extra wait time after DOM content is loaded

## Actor input object example

```json
{
  "maxDepth": 2,
  "maxPagesPerSite": 100,
  "parallelLimit": 5,
  "maxEmails": 30,
  "pageLoadTimeoutMs": 30000,
  "postLoadWaitMs": 3000
}
````
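
The example above leaves out `startUrls`, which the input schema marks as required. A minimal sketch of building a complete input in Python follows. The defaults and minimums are taken from the schema in the OpenAPI specification below, while the `build_input` helper, the non-empty check, and the `https://example.com` URL are assumptions for illustration.

```python
import json

# Defaults and minimums as listed in the Actor's input schema.
DEFAULTS = {
    "maxDepth": 2,
    "maxPagesPerSite": 100,
    "parallelLimit": 5,
    "maxEmails": 30,
    "pageLoadTimeoutMs": 30000,
    "postLoadWaitMs": 3000,
}
MINIMUMS = {"maxEmails": 1, "pageLoadTimeoutMs": 1000, "postLoadWaitMs": 0}


def build_input(urls: list[str], **overrides: int) -> dict:
    """Build an input object: required startUrls plus defaults and overrides."""
    if not urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    for name, minimum in MINIMUMS.items():
        if settings[name] < minimum:
            raise ValueError(f"{name} must be at least {minimum}")
    # Each start URL is an object with a "url" property, as in the schema.
    return {"startUrls": [{"url": url} for url in urls], **settings}


# A shallow scrape of a placeholder site.
shallow = build_input(["https://example.com"], maxDepth=1, maxPagesPerSite=20)
print(json.dumps(shallow, indent=2))
```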

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

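// Prepare Actor input
// `startUrls` is required by the input schema; https://example.com is a placeholder
const input = {
    startUrls: [{ url: 'https://example.com' }],
};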

// Run the Actor and wait for it to finish
const run = await client.actor("weu/my-actor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

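# Prepare the Actor input
# `startUrls` is required by the input schema; https://example.com is a placeholder
run_input = {
    "startUrls": [{"url": "https://example.com"}],
}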

# Run the Actor and wait for it to finish
run = client.actor("weu/my-actor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
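
Since this Actor exports its results as an xlsx file in the key-value store, the dataset may well be empty. Below is a hedged sketch of picking the workbook out of the run's default key-value store. It assumes your installed `apify-client` version provides `list_keys()` and `get_record_as_bytes()` on the key-value store client. The record key used by the Actor is not documented here, so the sketch detects the file by its ZIP signature instead of by name. `FakeStore` and the `emails` key are hypothetical stand-ins so the example runs offline.

```python
XLSX_MAGIC = b"PK\x03\x04"  # .xlsx workbooks are ZIP archives, which start with these bytes


def find_xlsx_records(store) -> dict[str, bytes]:
    """Return {key: content} for every record in the store that looks like an .xlsx file."""
    workbooks: dict[str, bytes] = {}
    for item in store.list_keys()["items"]:
        record = store.get_record_as_bytes(item["key"])
        if record is not None and record["value"].startswith(XLSX_MAGIC):
            workbooks[item["key"]] = record["value"]
    return workbooks


class FakeStore:
    """Offline stand-in that mimics the two client methods used above."""

    def __init__(self, records: dict[str, bytes]):
        self._records = records

    def list_keys(self) -> dict:
        return {"items": [{"key": key} for key in self._records]}

    def get_record_as_bytes(self, key: str):
        value = self._records.get(key)
        return None if value is None else {"key": key, "value": value}


# Hypothetical store contents: the run input plus one workbook.
demo = FakeStore({"INPUT": b'{"startUrls": []}', "emails": XLSX_MAGIC + b"workbook bytes"})
print(list(find_xlsx_records(demo)))
# → ['emails']
```

With the real client, you would pass `client.key_value_store(run["defaultKeyValueStoreId"])` instead of `demo` and write each returned value to a `.xlsx` file on disk.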

## CLI example

```bash
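# startUrls is required by the input schema; https://example.com is a placeholder
echo '{"startUrls": [{"url": "https://example.com"}]}' |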
apify call weu/my-actor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=weu/my-actor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Email Scraper",
        "description": "It scrapes anything that looks like an email on a website. It applies 9 protocols and gathers raw material that you later need to clean. It's a garbage collector. Gather everything there is, and then pick what you need. Results are exported to kv store in an xlsx file. Give it a go and have fun! 👍",
        "version": "0.0",
        "x-build-id": "U7oNJqAFt1bp8LhZP"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/weu~my-actor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-weu-my-actor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/weu~my-actor/runs": {
            "post": {
                "operationId": "runs-sync-weu-my-actor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/weu~my-actor/run-sync": {
            "post": {
                "operationId": "run-sync-weu-my-actor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "List of starting URLs to crawl",
                        "items": {
                            "type": "object",
                            "properties": {
                                "url": {
                                    "title": "URL",
                                    "type": "string",
                                    "description": "The starting URL to crawl"
                                }
                            },
                            "required": [
                                "url"
                            ]
                        }
                    },
                    "maxDepth": {
                        "title": "Max Crawl Depth",
                        "type": "integer",
                        "description": "Maximum depth of links to follow from each start URL",
                        "default": 2
                    },
                    "maxPagesPerSite": {
                        "title": "Max Pages Per Site",
                        "type": "integer",
                        "description": "Maximum number of pages to scrape per site",
                        "default": 100
                    },
                    "parallelLimit": {
                        "title": "Parallel Limit",
                        "type": "integer",
                        "description": "Maximum number of pages to scrape in parallel",
                        "default": 5
                    },
                    "maxEmails": {
                        "title": "Max Emails to Collect",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Stop crawling a site once this many emails have been found",
                        "default": 30
                    },
                    "pageLoadTimeoutMs": {
                        "title": "Page Load Timeout (ms)",
                        "minimum": 1000,
                        "type": "integer",
                        "description": "Timeout for page load before giving up.",
                        "default": 30000
                    },
                    "postLoadWaitMs": {
                        "title": "Post Load Wait (ms)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Extra wait time after DOM content is loaded",
                        "default": 3000
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
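
If you prefer calling the REST API directly, the paths in the specification above translate into plain HTTP POST requests. Here is a minimal sketch using only the Python standard library. The `build_run_request` helper and the placeholder start URL are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(token: str, actor_input: dict, endpoint: str = "runs") -> urllib.request.Request:
    """Build a POST request for one of the Actor's run endpoints.

    `endpoint` is one of the path suffixes from the specification:
    "runs", "run-sync", or "run-sync-get-dataset-items".
    """
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/weu~my-actor/{endpoint}?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://example.com"}]},  # placeholder start URL
)
print(request.get_method(), request.full_url)
# To actually start a run, pass the request to urllib.request.urlopen(request).
```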
