# Serp CWD (`l_axis/serp-cwd`) Actor

Website discovery for companies

- **URL:** https://apify.com/l_axis/serp-cwd.md
- **Developed by:** [LR](https://apify.com/l_axis) (community)
- **Categories:** Automation, Developer tools, Agents
- **Stats:** 4 total users, 3 monthly users, 50.0% runs succeeded
- **User rating:** 5.00 out of 5 stars

## Pricing

Pay per usage

This Actor is free to use; you pay only for the Apify platform usage it consumes, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Google SERP Company Discovery

This Actor finds likely official company websites from Google search results.

It is built for company-website discovery workflows that involve:

- searching for `"Company Name" + town`
- parsing the top organic results
- rejecting directories, social pages, job boards, and obvious junk
- returning one of:
  - an accepted website
  - a review candidate
  - a rejected/no-site result

### What It Does

For each input company row, the Actor:

1. Builds a Google query from `companyName` and `town`
2. Fetches raw HTML through either:
   - Apify `GOOGLE_SERP` proxy by default
   - or your own proxy URLs if you provide them
3. Parses the organic SERP results
4. Scores candidate domains according to the selected `matchMode`:
   - `strict`
   - `loose`
   - `raw`
5. Pushes normalized rows into the default dataset
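
Step 1 can be pictured with a short sketch. This is not the Actor's source code: the function name `build_search_url`, the `https://www.` prefix, and the `start` pagination offset are assumptions made for illustration; only `companyName`, `town`, `googleDomain`, and `language` come from the input described in this README.

```python
from urllib.parse import urlencode


def build_search_url(company_name, town=None, google_domain="google.co.uk",
                     language="en", page=0):
    """Build one Google search URL for a company row (illustrative only)."""
    # Quote the company name so it is searched as a phrase, then append the town.
    query = f'"{company_name}"'
    if town:
        query += f" {town}"
    params = {"q": query, "hl": language}
    if page > 0:
        # Assumption: later result pages are requested with a `start` offset of 10.
        params["start"] = page * 10
    return f"https://www.{google_domain}/search?{urlencode(params)}"


first_page = build_search_url("Example Engineering Ltd", "Leeds")
second_page = build_search_url("Example Engineering Ltd", "Leeds", page=1)
print(first_page)
print(second_page)
```

The query string is percent-encoded by `urlencode`, so quotes and spaces in company names are safe to pass through.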

### Why This Actor Exists

The goal is to keep the discovery engine portable and publishable:

- portable because the core matching logic is not tied to a specific SERP SaaS
- publishable because Apify handles proxying, hosting, scaling, scheduling, and monetization

### Input

You can provide company rows in either of these ways:

- `searches`
  - inline array of `{ companyNumber, companyName, town }`
- `sourceDatasetId`
  - a dataset where items expose:
    - `companyName` or `company_name`
    - optional `companyNumber` or `company_number`
    - optional `town`
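
If you prepare rows yourself before passing them as `searches`, a small normalizer can accept both naming styles listed above. The sketch below is one possible approach, not the Actor's own loader; `normalize_row` is a hypothetical helper and the sample items are made up.

```python
def normalize_row(item):
    """Map a dataset item onto the inline `searches` shape, or return None."""
    name = item.get("companyName") or item.get("company_name")
    if not name or not str(name).strip():
        return None  # companyName is the only required field
    row = {"companyName": str(name).strip()}
    number = item.get("companyNumber") or item.get("company_number")
    if number:
        row["companyNumber"] = str(number)
    town = item.get("town")
    if town:
        row["town"] = str(town).strip()
    return row


items = [
    {"company_name": "Example Engineering Ltd", "company_number": "12345678", "town": "Leeds"},
    {"companyName": "Sample Bakery Ltd"},
    {"town": "York"},  # no company name, so this item is dropped
]
searches = [row for row in map(normalize_row, items) if row is not None]
print(searches)
```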

Useful input fields:

- `limit`
- `maxConcurrency`
- `googleDomain`
- `language`
- `pagesPerQuery`
- `matchMode`
- `proxySettings`
- `customProxyUrls`
- `proxyProviderLabel`
- `resumeFromCheckpoint`

### Output

Each dataset row includes:

- `company_number`
- `company_name`
- `town`
- `query`
- `classification`
- `selected_url`
- `selected_domain`
- `selected_title`
- `selected_position`
- `selected_score`
- `review_candidate_url`
- `review_candidate_domain`
- `organic_result_count`
- `raw_organic_results`
- `http_status`
- `elapsed_seconds`
- `response_bytes`
- `proxy_provider`
- `error`

#### Classification meanings

- `accepted`
  - strong heuristic match to the company’s own site
- `review`
  - plausible candidate, but not strong enough to auto-accept
- `rejected`
  - no credible official website found
- `raw`
  - SERP parsed only, no selection applied
- `error`
  - request or parse failure
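
Downstream code will usually branch on `classification`. The following sketch shows one way to do that. The action labels (`use`, `check_manually`, and so on) are hypothetical, and reading the candidate for `review` rows from `review_candidate_url` is an assumption based on the output field names.

```python
def next_action(row):
    """Return an (action, url) pair for one output row (illustrative only)."""
    classification = row.get("classification")
    if classification == "accepted":
        return ("use", row.get("selected_url"))
    if classification == "review":
        # Assumption: the candidate awaiting review is in review_candidate_url.
        return ("check_manually", row.get("review_candidate_url"))
    if classification == "error":
        return ("retry", None)
    if classification == "raw":
        return ("unscored", None)
    return ("no_site", None)  # rejected, or anything unexpected


# Made-up sample rows for demonstration.
rows = [
    {"classification": "accepted", "selected_url": "https://www.example-engineering.co.uk/"},
    {"classification": "review", "review_candidate_url": "https://example-eng.com/"},
    {"classification": "rejected"},
    {"classification": "error", "error": "timeout"},
]
actions = [next_action(row) for row in rows]
print(actions)
```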

### Match Modes

#### `strict`

Production-oriented heuristic.

Accepts only strong brand/domain matches. Borderline results are marked `review`.

#### `loose`

Fast benchmark mode.

Uses token-domain matching, which is more permissive than `strict` and useful for quick smoke tests.

#### `raw`

Returns parsed SERP results without choosing a winner.
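
To give a feel for what token-domain matching can mean, here is a deliberately naive sketch. It is not the Actor's scoring logic: the stop-word list, the three-character minimum, and the helper names are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical stop words: legal suffixes and fillers that rarely appear in domains.
STOP_TOKENS = {"ltd", "limited", "llp", "plc", "the", "and", "co", "company"}


def name_tokens(company_name):
    """Lowercase alphanumeric tokens of the company name, minus stop words."""
    tokens = re.findall(r"[a-z0-9]+", company_name.lower())
    return [t for t in tokens if t not in STOP_TOKENS]


def domain_label(url):
    """First label of the hostname, ignoring a leading 'www.' (naive)."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def loose_match(company_name, url):
    """True if any meaningful name token appears inside the domain label."""
    label = domain_label(url)
    return any(len(t) >= 3 and t in label for t in name_tokens(company_name))


print(loose_match("Example Engineering Ltd", "https://www.example-engineering.co.uk/about"))
print(loose_match("Example Engineering Ltd", "https://www.linkedin.com/company/example"))
```

A check this permissive accepts any domain that shares a single token with the company name, which illustrates why a stricter mode with a `review` bucket is preferable for production runs.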

### Checkpointing

The Actor writes resumable state to the default key-value store using `checkpointKey`.

If you rerun with:

- the same input order
- the same `checkpointKey`
- `resumeFromCheckpoint = true`

the Actor skips rows already completed in the earlier run.
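
As a sketch of the three conditions above, the two inputs below differ only in `resumeFromCheckpoint`. The key name `CHECKPOINT_BATCH_1` and the sample companies are made up; the field names come from the input schema.

```python
# Shared part of the input: same rows in the same order, same checkpoint key.
base_input = {
    "searches": [
        {"companyName": "Example Engineering Ltd", "town": "Leeds"},
        {"companyName": "Sample Bakery Ltd", "town": "York"},
    ],
    "checkpointKey": "CHECKPOINT_BATCH_1",  # hypothetical key name
}

first_run_input = {**base_input, "resumeFromCheckpoint": False}
resume_run_input = {**base_input, "resumeFromCheckpoint": True}

# Either dict can be passed as `run_input` to the Python client call
# used in the API section of this document.
print(resume_run_input)
```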

### Proxy options

By default, the Actor uses Apify `GOOGLE_SERP`.

If you want to test another provider such as DataImpulse, pass one or more full proxy URLs in `customProxyUrls`. Example:

```json
{
  "searches": [
    { "companyName": "Example Engineering Ltd", "town": "Leeds" }
  ],
  "customProxyUrls": [
    "http://LOGIN:PASSWORD@gw.dataimpulse.com:823"
  ],
  "proxyProviderLabel": "dataimpulse_residential"
}
````

If `customProxyUrls` is present, it overrides Apify proxy usage. The Apify SDK rotates the provided URLs round-robin. If you provide only one rotating gateway URL, the provider's own rotation still happens server-side.
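
Round-robin rotation simply means each request takes the next URL in the list and wraps around at the end. The sketch below illustrates the idea with the standard library; it is not the Apify SDK's internal code, and the proxy hosts are placeholders.

```python
from itertools import cycle

# Placeholder proxy URLs; replace with real gateway URLs from your provider.
proxy_urls = [
    "http://LOGIN:PASSWORD@proxy-a.example.com:8000",
    "http://LOGIN:PASSWORD@proxy-b.example.com:8000",
]

rotation = cycle(proxy_urls)

# Five consecutive requests alternate between the two URLs: a, b, a, b, a.
assigned = [next(rotation) for _ in range(5)]
for index, proxy in enumerate(assigned):
    print(index, proxy)
```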

### Benchmark script

Use `scripts/benchmark_proxy_providers.py` to run the same fixed sample through:

- Apify `GOOGLE_SERP`
- DataImpulse datacenter
- DataImpulse residential
- DataImpulse mobile
- DataImpulse premium residential

Expected environment variables:

- `APIFY_TOKEN`
- `DATAIMPULSE_DATACENTER_PROXY_URL`
- `DATAIMPULSE_RESIDENTIAL_PROXY_URL`
- `DATAIMPULSE_MOBILE_PROXY_URL`
- `DATAIMPULSE_PREMIUM_RESIDENTIAL_PROXY_URL`

Example:

```bash
python scripts/benchmark_proxy_providers.py --input-json sample_searches.json --max-concurrency 25
```

### Notes

- This Actor uses raw HTTP requests, not browser automation.
- `pagesPerQuery > 1` increases proxy spend because each page counts separately.
- Google HTML changes over time, so parsing logic should be revalidated periodically.
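
Because each page is a separate request, an upper bound on proxy requests is easy to estimate before a run. This is a minimal sketch assuming every requested page is actually fetched; `max_proxy_requests` is a hypothetical helper, and the 1 to 10 range mirrors the `pagesPerQuery` limits in the input schema.

```python
def max_proxy_requests(search_count, pages_per_query=1):
    """Upper bound on SERP requests for a run (illustrative only)."""
    if not 1 <= pages_per_query <= 10:
        raise ValueError("pagesPerQuery must be between 1 and 10")
    return search_count * pages_per_query


print(max_proxy_requests(100))      # one page per query
print(max_proxy_requests(100, 3))   # three pages per query triples the requests
```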

### Suggested internal benchmark

Compare this Actor against your current SERP providers on the same fixed 100-company sample and track:

- HTTP success rate
- accepted count
- review count
- obvious false positives
- average `response_bytes`
- estimated proxy cost per 1k searches
- cost per accepted website
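
Most of these metrics can be computed directly from the output rows. The sketch below is one way to do it, under stated assumptions: `summarize` is a hypothetical helper, an `http_status` of 200 counts as success, the total proxy cost is supplied by you from your provider's billing, and the sample rows are made up. False positives are left out because they need manual labels.

```python
def summarize(rows, proxy_cost_usd):
    """Aggregate benchmark metrics from output rows (illustrative only)."""
    total = len(rows)
    ok_rows = [r for r in rows if r.get("http_status") == 200]
    accepted = sum(1 for r in rows if r.get("classification") == "accepted")
    review = sum(1 for r in rows if r.get("classification") == "review")
    sizes = [r.get("response_bytes", 0) for r in ok_rows]
    return {
        "http_success_rate": len(ok_rows) / total if total else 0.0,
        "accepted": accepted,
        "review": review,
        "avg_response_bytes": sum(sizes) / len(sizes) if sizes else 0.0,
        "cost_per_1k_searches": proxy_cost_usd / total * 1000 if total else 0.0,
        "cost_per_accepted": proxy_cost_usd / accepted if accepted else None,
    }


# Made-up sample rows for demonstration.
rows = [
    {"classification": "accepted", "http_status": 200, "response_bytes": 100000},
    {"classification": "accepted", "http_status": 200, "response_bytes": 120000},
    {"classification": "review", "http_status": 200, "response_bytes": 80000},
    {"classification": "error", "http_status": 429, "response_bytes": 0},
]
summary = summarize(rows, proxy_cost_usd=0.01)
print(summary)
```

Running the same function over each provider's dataset gives directly comparable numbers for the fixed sample.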

# Actor input Schema

## `searches` (type: `array`):

Inline company rows to process. Each item should contain `companyName` and optionally `companyNumber` and `town`.

## `sourceDatasetId` (type: `string`):

Optional dataset ID containing company rows. Items should expose `companyName` or `company_name`, plus optional `companyNumber`/`company_number` and `town`.

## `limit` (type: `integer`):

Maximum number of searches to run after loading inline rows and dataset rows.

## `maxConcurrency` (type: `integer`):

Number of concurrent Google SERP requests. `GOOGLE_SERP` allows up to 200 concurrent connections per account.

## `proxyGroup` (type: `string`):

Apify proxy group to use. `GOOGLE_SERP` is most reliable; `RESIDENTIAL` is cheaper but may get blocked.

## `googleDomain` (type: `string`):

Google hostname to query. For UK search use google.co.uk.

## `language` (type: `string`):

Google hl parameter.

## `pagesPerQuery` (type: `integer`):

How many SERP pages to fetch. Each page counts as a separate SERP proxy request.

## `matchMode` (type: `string`):

Scoring mode for choosing likely company websites.

## `proxySettings` (type: `object`):

Optional Apify proxy selection. Leave empty to use the default `GOOGLE_SERP` group unless `customProxyUrls` are provided.

## `customProxyUrls` (type: `array`):

Optional full proxy URLs. If provided, these override Apify proxy usage and are rotated round-robin by the Apify SDK.

## `proxyProviderLabel` (type: `string`):

Optional label written to each output row and summary, e.g. `dataimpulse_residential`.

## `includeHtmlSnippet` (type: `boolean`):

Store the first part of the raw Google HTML in each output row for debugging.

## `checkpointKey` (type: `string`):

Key-value store key used for resumable state. Reuse the same key to continue a previous run.

## `resumeFromCheckpoint` (type: `boolean`):

When enabled, skip searches already recorded under the checkpoint key.

## `pushBatchSize` (type: `integer`):

How many result rows to buffer before pushing them to the default dataset.

## `requestTimeoutSecs` (type: `integer`):

Per-request timeout in seconds.

## Actor input object example

```json
{
  "limit": 100,
  "maxConcurrency": 50,
  "proxyGroup": "GOOGLE_SERP",
  "googleDomain": "google.co.uk",
  "language": "en",
  "pagesPerQuery": 1,
  "matchMode": "strict",
  "proxyProviderLabel": "",
  "includeHtmlSnippet": false,
  "checkpointKey": "CHECKPOINT",
  "resumeFromCheckpoint": false,
  "pushBatchSize": 25,
  "requestTimeoutSecs": 45
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

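// Prepare Actor input (sample row; replace with your own companies)
const input = {
    searches: [
        { companyName: 'Example Engineering Ltd', town: 'Leeds' },
    ],
};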

// Run the Actor and wait for it to finish
const run = await client.actor("l_axis/serp-cwd").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

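# Prepare the Actor input (sample row; replace with your own companies)
run_input = {
    "searches": [
        {"companyName": "Example Engineering Ltd", "town": "Leeds"},
    ],
}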

# Run the Actor and wait for it to finish
run = client.actor("l_axis/serp-cwd").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"searches": [{"companyName": "Example Engineering Ltd", "town": "Leeds"}]}' |
apify call l_axis/serp-cwd --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=l_axis/serp-cwd",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Serp CWD",
        "description": "Website discovery for companies",
        "version": "0.0",
        "x-build-id": "WrndB8alUG5LhaWde"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/l_axis~serp-cwd/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-l_axis-serp-cwd",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/l_axis~serp-cwd/runs": {
            "post": {
                "operationId": "runs-sync-l_axis-serp-cwd",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/l_axis~serp-cwd/run-sync": {
            "post": {
                "operationId": "run-sync-l_axis-serp-cwd",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "searches": {
                        "title": "Searches",
                        "type": "array",
                        "description": "Inline company rows to process. Each item should contain companyName and optionally companyNumber and town.",
                        "items": {
                            "type": "object",
                            "properties": {
                                "companyNumber": {
                                    "title": "Company number",
                                    "type": "string",
                                    "description": "Companies House number (e.g. 12345678)."
                                },
                                "companyName": {
                                    "title": "Company name",
                                    "type": "string",
                                    "description": "Registered company name to search for."
                                },
                                "town": {
                                    "title": "Town",
                                    "type": "string",
                                    "description": "Town or city to narrow the search (optional)."
                                }
                            },
                            "required": [
                                "companyName"
                            ]
                        }
                    },
                    "sourceDatasetId": {
                        "title": "Source dataset",
                        "type": "string",
                        "description": "Optional dataset ID containing company rows. Items should expose companyName or company_name, plus optional companyNumber/company_number and town."
                    },
                    "limit": {
                        "title": "Limit",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of searches to run after loading inline rows and dataset rows.",
                        "default": 100
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "maximum": 200,
                        "type": "integer",
                        "description": "Number of concurrent Google SERP requests. GOOGLE_SERP allows up to 200 concurrent connections per account.",
                        "default": 50
                    },
                    "proxyGroup": {
                        "title": "Proxy group",
                        "enum": [
                            "GOOGLE_SERP",
                            "RESIDENTIAL"
                        ],
                        "type": "string",
                        "description": "Apify proxy group to use. GOOGLE_SERP is most reliable; RESIDENTIAL is cheaper but may get blocked.",
                        "default": "GOOGLE_SERP"
                    },
                    "googleDomain": {
                        "title": "Google domain",
                        "type": "string",
                        "description": "Google hostname to query. For UK search use google.co.uk.",
                        "default": "google.co.uk"
                    },
                    "language": {
                        "title": "Language",
                        "type": "string",
                        "description": "Google hl parameter.",
                        "default": "en"
                    },
                    "pagesPerQuery": {
                        "title": "Pages per query",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "How many SERP pages to fetch. Each page counts as a separate SERP proxy request.",
                        "default": 1
                    },
                    "matchMode": {
                        "title": "Match mode",
                        "enum": [
                            "strict",
                            "loose",
                            "raw"
                        ],
                        "type": "string",
                        "description": "Scoring mode for choosing likely company websites.",
                        "default": "strict"
                    },
                    "proxySettings": {
                        "title": "Apify proxy settings",
                        "type": "object",
                        "description": "Optional Apify proxy selection. Leave empty to use the default GOOGLE_SERP group unless customProxyUrls are provided."
                    },
                    "customProxyUrls": {
                        "title": "Custom proxy URLs",
                        "type": "array",
                        "description": "Optional full proxy URLs. If provided, these override Apify proxy usage and are rotated round-robin by the Apify SDK.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "proxyProviderLabel": {
                        "title": "Proxy provider label",
                        "type": "string",
                        "description": "Optional label written to each output row and summary, e.g. dataimpulse_residential.",
                        "default": ""
                    },
                    "includeHtmlSnippet": {
                        "title": "Include raw HTML snippet",
                        "type": "boolean",
                        "description": "Store the first part of the raw Google HTML in each output row for debugging.",
                        "default": false
                    },
                    "checkpointKey": {
                        "title": "Checkpoint key",
                        "type": "string",
                        "description": "Key-value store key used for resumable state. Reuse the same key to continue a previous run.",
                        "default": "CHECKPOINT"
                    },
                    "resumeFromCheckpoint": {
                        "title": "Resume from checkpoint",
                        "type": "boolean",
                        "description": "When enabled, skip searches already recorded under the checkpoint key.",
                        "default": false
                    },
                    "pushBatchSize": {
                        "title": "Push batch size",
                        "minimum": 1,
                        "maximum": 200,
                        "type": "integer",
                        "description": "How many result rows to buffer before pushing them to the default dataset.",
                        "default": 25
                    },
                    "requestTimeoutSecs": {
                        "title": "Request timeout",
                        "minimum": 5,
                        "maximum": 180,
                        "type": "integer",
                        "description": "Per-request timeout in seconds.",
                        "default": 45
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
