Serp CWD

Website discovery for companies

Developer

LR

Maintained by Community

Google SERP Company Discovery

This Actor finds likely official company websites from Google search results.

It is built for company-website discovery workflows that involve:

  • searching Google for "Company Name" + town
  • parsing the top organic results
  • rejecting directories, social pages, job boards, and obvious junk
  • returning one of:
    • an accepted website
    • a review candidate
    • a rejected/no-site result

What It Does

For each input company row, the Actor:

  1. Builds a Google query from companyName and town
  2. Fetches raw HTML through either:
    • Apify GOOGLE_SERP proxy by default
    • or your own proxy URLs if you provide them
  3. Parses the organic SERP results
  4. Evaluates candidate domains according to matchMode:
    • strict
    • loose
    • raw (no selection applied)
  5. Pushes normalized rows into the default dataset
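
Steps 1 and 2 above can be sketched in Python as follows. This is a minimal illustration, not the Actor's own code: the function names are hypothetical, the value format of googleDomain (a bare domain such as google.co.uk) is an assumption, and so is the choice of 10 results per page for the start offset.

```python
from typing import List, Optional
from urllib.parse import urlencode


def build_query(company_name: str, town: Optional[str] = None) -> str:
    """Quote the company name and append the town when one is given."""
    query = f'"{company_name.strip()}"'
    if town and town.strip():
        query += f" {town.strip()}"
    return query


def build_search_urls(
    query: str,
    google_domain: str = "google.com",  # assumed format of googleDomain
    language: str = "en",
    pages_per_query: int = 1,
) -> List[str]:
    """Return one search URL per results page (10 results per page assumed)."""
    urls = []
    for page in range(pages_per_query):
        params = {"q": query, "hl": language, "start": page * 10}
        urls.append(f"https://www.{google_domain}/search?{urlencode(params)}")
    return urls
```

Each URL in the returned list corresponds to one proxied request, which is why a higher pagesPerQuery multiplies the number of fetches per company.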

Why This Actor Exists

The goal is to keep the discovery engine portable and publishable:

  • portable because the core matching logic is not tied to a specific SERP SaaS
  • publishable because Apify handles proxying, hosting, scaling, scheduling, and monetization

Input

You can provide company rows in either of these ways:

  • searches
    • inline array of { companyNumber, companyName, town }
  • sourceDatasetId
    • a dataset where items expose:
      • companyName or company_name
      • optional companyNumber or company_number
      • optional town
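
If you prepare a source dataset yourself, it can help to check that every item resolves to a usable row. The sketch below shows one way to accept both the camelCase and snake_case field names listed above. The function name normalize_row and the decision to drop items without a company name are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def normalize_row(item: Dict[str, Any]) -> Optional[Dict[str, Optional[str]]]:
    """Map a dataset item onto { companyNumber, companyName, town }.

    Returns None when the item has no usable company name.
    """
    name = item.get("companyName") or item.get("company_name")
    if not name or not str(name).strip():
        return None
    number = item.get("companyNumber") or item.get("company_number")
    town = item.get("town")
    return {
        "companyNumber": str(number).strip() if number else None,
        "companyName": str(name).strip(),
        "town": str(town).strip() if town else None,
    }
```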

Useful input fields:

  • limit
  • maxConcurrency
  • googleDomain
  • language
  • pagesPerQuery
  • matchMode
  • proxySettings
  • customProxyUrls
  • proxyProviderLabel
  • resumeFromCheckpoint
  • checkpointKey

Output

Each dataset row includes:

  • company_number
  • company_name
  • town
  • query
  • classification
  • selected_url
  • selected_domain
  • selected_title
  • selected_position
  • selected_score
  • review_candidate_url
  • review_candidate_domain
  • organic_result_count
  • raw_organic_results
  • http_status
  • elapsed_seconds
  • response_bytes
  • proxy_provider
  • error

Classification meanings

  • accepted
    • strong heuristic match to the company’s own site
  • review
    • plausible candidate, but not strong enough to auto-accept
  • rejected
    • no credible official website found
  • raw
    • SERP parsed only, no selection applied
  • error
    • request or parse failure
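
A downstream consumer will usually split rows by classification. The following sketch assumes the rows have already been downloaded as a list of dicts, that accepted rows carry their URL in selected_url, and that review rows carry theirs in review_candidate_url. The function name triage is hypothetical.

```python
from collections import Counter


def triage(rows):
    """Count rows per classification and collect accepted and review URLs."""
    # Rows without a classification are counted as errors in this sketch.
    counts = Counter(row.get("classification") or "error" for row in rows)
    accepted = [
        (row.get("company_number"), row.get("selected_url"))
        for row in rows
        if row.get("classification") == "accepted"
    ]
    needs_review = [
        (row.get("company_number"), row.get("review_candidate_url"))
        for row in rows
        if row.get("classification") == "review"
    ]
    return counts, accepted, needs_review
```

Accepted URLs can be written straight to your records, while the review list is the queue for a manual check.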

Match Modes

strict

Production-oriented heuristic.

Accepts only strong brand/domain matches. Borderline results are marked review.

loose

Fast benchmark mode.

Uses token-to-domain matching, which is more permissive than strict. Useful for quick smoke tests.

raw

Returns parsed SERP results without choosing a winner.
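
To make the difference between the modes concrete, here is a deliberately simplified sketch. It is not the Actor's scoring logic: the suffix list, the blocked-domain list, the 3-character token minimum, and the rule that loose never returns review are all assumptions chosen for illustration.

```python
import re

# Illustrative lists only; the Actor's real lists are not documented here.
LEGAL_SUFFIXES = {"ltd", "limited", "llp", "plc", "inc", "llc"}
BLOCKED_DOMAINS = {"linkedin.com", "facebook.com", "indeed.com", "yell.com"}


def name_tokens(company_name):
    """Lowercase alphanumeric tokens with legal suffixes removed."""
    tokens = re.findall(r"[a-z0-9]+", company_name.lower())
    return [t for t in tokens if t not in LEGAL_SUFFIXES]


def domain_label(domain):
    """First label after an optional 'www.', with punctuation stripped."""
    host = domain.lower()
    if host.startswith("www."):
        host = host[4:]
    return re.sub(r"[^a-z0-9]", "", host.split(".")[0])


def classify(company_name, domain, mode="strict"):
    if mode == "raw":
        return "raw"  # no selection applied
    host = domain.lower()
    if any(host == b or host.endswith("." + b) for b in BLOCKED_DOMAINS):
        return "rejected"  # directory, social page, or job board
    tokens = name_tokens(company_name)
    label = domain_label(domain)
    if tokens and "".join(tokens) == label:
        return "accepted"  # full brand/domain match
    partial = any(len(t) >= 3 and t in label for t in tokens)
    if mode == "loose":
        return "accepted" if partial else "rejected"
    return "review" if partial else "rejected"  # strict: borderline goes to review
```

With these rules, "Example Engineering Ltd" against example-group.com is accepted in loose mode but only flagged for review in strict mode. The sketch ignores subdomains other than www, which a real implementation would need to handle.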

Checkpointing

The Actor writes resumable state to the default key-value store using checkpointKey.

If you rerun with:

  • the same input order
  • the same checkpointKey
  • resumeFromCheckpoint = true

the Actor skips rows already completed in the earlier run.
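
The requirement to keep the same input order suggests that completion is tracked by row position. The sketch below illustrates that idea under this assumption, using a plain dict as a stand-in for the key-value store. The state layout and the function name run_with_checkpoint are hypothetical.

```python
def run_with_checkpoint(rows, process, store, checkpoint_key, resume_from_checkpoint=True):
    """Process rows in order, skipping indexes recorded under checkpoint_key."""
    state = store.get(checkpoint_key) if resume_from_checkpoint else None
    done = set(state["completed"]) if state else set()
    results = []
    for index, row in enumerate(rows):
        if index in done:
            continue  # completed in an earlier run
        results.append(process(row))
        done.add(index)
        # Persist after every row so an interrupted run can resume here.
        store[checkpoint_key] = {"completed": sorted(done)}
    return results
```

Because the state refers to positions rather than company identifiers, reordering the input between runs would cause the wrong rows to be skipped.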

Proxy options

By default, the Actor uses Apify GOOGLE_SERP.

If you want to test another provider such as DataImpulse, pass one or more full proxy URLs in customProxyUrls. Example:

{
  "searches": [
    { "companyName": "Example Engineering Ltd", "town": "Leeds" }
  ],
  "customProxyUrls": [
    "http://LOGIN:PASSWORD@gw.dataimpulse.com:823"
  ],
  "proxyProviderLabel": "dataimpulse_residential"
}

If customProxyUrls is present, it overrides Apify proxy usage. The Apify SDK rotates the provided URLs round-robin. If you provide only one rotating gateway URL, the provider's own rotation still happens server-side.
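
Round-robin rotation simply means each request takes the next URL in the list and wraps around at the end. The plain-Python sketch below illustrates the behaviour only. It is not the Apify SDK's implementation, and the function name proxy_rotation is hypothetical.

```python
from itertools import cycle


def proxy_rotation(custom_proxy_urls):
    """Return an endless iterator that yields the proxy URLs in order."""
    if not custom_proxy_urls:
        raise ValueError("customProxyUrls must contain at least one URL")
    return cycle(custom_proxy_urls)
```

With a single gateway URL the iterator yields the same URL every time, so any IP rotation comes from the provider rather than from the client.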

Benchmark script

Use scripts/benchmark_proxy_providers.py to run the same fixed sample through:

  • Apify GOOGLE_SERP
  • DataImpulse datacenter
  • DataImpulse residential
  • DataImpulse mobile
  • DataImpulse premium residential

Expected environment variables:

  • APIFY_TOKEN
  • DATAIMPULSE_DATACENTER_PROXY_URL
  • DATAIMPULSE_RESIDENTIAL_PROXY_URL
  • DATAIMPULSE_MOBILE_PROXY_URL
  • DATAIMPULSE_PREMIUM_RESIDENTIAL_PROXY_URL

Example:

$ python scripts/benchmark_proxy_providers.py --input-json sample_searches.json --max-concurrency 25

Notes

  • This Actor uses raw HTTP requests, not browser automation.
  • pagesPerQuery > 1 increases proxy spend because each results page is a separate request.
  • Google HTML changes over time, so parsing logic should be revalidated periodically.

Suggested internal benchmark

Compare this Actor against your current SERP providers on the same fixed 100-company sample and track:

  • HTTP success rate
  • accepted count
  • review count
  • obvious false positives
  • average response_bytes
  • estimated proxy cost per 1k searches
  • cost per accepted website
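
Most of these metrics can be computed directly from the output rows. The sketch below assumes that HTTP success means http_status equal to 200 and that you supply the provider's cost per 1,000 searches yourself. The function name summarize is hypothetical. Counting obvious false positives still needs a manual pass over the accepted rows.

```python
def summarize(rows, cost_per_1k_searches=None):
    """Aggregate benchmark metrics from a list of output rows."""
    total = len(rows)
    ok = sum(1 for r in rows if r.get("http_status") == 200)
    accepted = sum(1 for r in rows if r.get("classification") == "accepted")
    review = sum(1 for r in rows if r.get("classification") == "review")
    # Rows with a missing or zero response_bytes are left out of the average.
    sizes = [r["response_bytes"] for r in rows if r.get("response_bytes")]
    summary = {
        "searches": total,
        "http_success_rate": ok / total if total else 0.0,
        "accepted": accepted,
        "review": review,
        "avg_response_bytes": sum(sizes) / len(sizes) if sizes else 0.0,
    }
    if cost_per_1k_searches is not None:
        total_cost = cost_per_1k_searches * total / 1000
        summary["estimated_cost"] = total_cost
        # None when nothing was accepted, to avoid dividing by zero.
        summary["cost_per_accepted"] = total_cost / accepted if accepted else None
    return summary
```

Running it once per provider on the same fixed sample gives directly comparable numbers.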