SCOUTR Nordics (Google Maps Scraper)
Specialized Google Maps scraper for the Nordic countries. Geocodes a start address, finds businesses within a radius, and extracts name, address, website, and phone. Visits each website to fetch real contact information. Support for other countries will be added in separate Actors once the code is fully optimized.
An Apify Actor that:
- Geocodes a start address (Nominatim, with maps.co fallback + caching and safety guards).
- Searches Google Local results (`tbm=lcl`) for your keywords across a hex grid of centers within a given radius.
- Extracts name, address, rating, reviews, phone, website, and Maps URL from local result cards.
- If no website is present, performs a fallback web search (Google with DuckDuckGo/Bing fallback) constrained to the country’s ccTLD.
- Crawls the website (same registered domain, a few pages) to collect emails, additional phones, and social links when enabled.
- Normalizes phones to E.164 for Nordic countries and filters out junk emails/platform domains.
- Adds compact debug logs so you can tell if it’s actually working instead of silently hanging.
Optimized for the Nordic countries: Norway, Sweden, Denmark, Finland, and Iceland. Language (`hl`), region (`gl`), and ccTLD filters adapt automatically based on geocoding or a manual country override.
ToS & legality: Scraping Google HTML may violate their Terms of Service. Use responsibly, at low rates, with proxies as needed, and comply with local laws and site policies. This Actor avoids official Google APIs and parses public HTML only.
What’s new (2025-11-13)
Quality & correctness
- Single record per venue: each `(name, address)` combination is emitted once per keyword. There is no longer a “basic record” followed by a second “enriched record” in the dataset.
- Global deduplication: cross-tile, cross-keyword dedupe by `(name, address)`, so the same venue isn’t emitted multiple times when it appears in overlapping tiles.
- Keyword buckets & expansion: you can now use broad category labels like `"Food & Drinks"` or `"Retail & Stores"`; the actor expands them into multiple concrete search terms (e.g. `restaurant`, `bakery`, `café`, etc.) and then deduplicates the results (see the sketch after this list).
- Distance fallback (optional): `enable_distance_fallback` can estimate `distance_m` by geocoding the listing address when the Maps URL doesn’t expose coordinates.
- Address enrichment with caching: address-based geocoding (for enrichment and distance fallback) is cached in a TTL cache, so the same address is only geocoded once per run, improving consistency and speed.
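Below is a minimal sketch of how such bucket expansion can work. The bucket names and the terms inside them are illustrative assumptions; the Actor’s real bucket lists are internal.

```python
# Illustrative sketch of keyword bucket expansion.
# BUCKETS contents are assumptions, not the Actor's real lists.
BUCKETS = {
    "food & drinks": ["restaurant", "bakery", "café", "bar", "pub", "pizzeria"],
    "retail & stores": ["supermarket", "clothing store", "bookstore", "kiosk"],
}

def expand_keywords(keywords: list[str]) -> list[str]:
    """Expand bucket labels into concrete queries, then dedupe (order preserved)."""
    expanded: list[str] = []
    for kw in keywords:
        expanded.extend(BUCKETS.get(kw.strip().lower(), [kw]))
    return list(dict.fromkeys(expanded))  # dedupe, keeping first-seen order

print(expand_keywords(["Food & Drinks", "florist"]))
# ['restaurant', 'bakery', 'café', 'bar', 'pub', 'pizzeria', 'florist']
```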
Stability & performance
- Prime tile execution & early stop: the actor processes all keywords on the first grid point (page 0) before starting the worker pool. If no results are found for any keyword on that first tile, the run ends early with an empty dataset instead of wasting time on a hopeless grid.
- Anti-stall work queue: keyword × tile jobs are queued and processed by a bounded worker pool; each keyword respects a global `max_per_keyword` cap across all tiles.
- Global Google rate limiting: Google Local requests are funneled through a shared gate + token bucket, so multiple tiles/keywords don’t collectively hammer Google and trigger hard throttling (see the sketch after this list).
- Backoff & pacing: per-request sleeps, retry/backoff on `429`/`403`, and a gentle HTML fetch cadence to reduce throttling.
- Timeout guards:
  - Hard timeouts around tile processing (worker jobs) and dataset writes to prevent long-running stalls on a single tile or record.
  - Timeboxed geocoding and address enrichment so slow geocoders cannot freeze a page.
- Geocoding hard-fail guard: after repeated Nominatim/maps.co failures, the actor disables further geocoding for the rest of the run, so the rest of the pipeline can continue.
- Compact debug logs: high-signal log lines at each major step: `[boot]`, `[geocode]`, `[sched]`, `[prime]`, `[tile]`, `[google.lcl]`, `[site.search]`, `[crawl]`, `[addr]`, `[push]`, `[done]`.
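For intuition, the shared gate + token bucket pattern looks roughly like the asyncio sketch below. This is a generic illustration of the technique, not the Actor’s actual code.

```python
import asyncio
import time

class TokenBucket:
    """Generic token bucket: ~`rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, float(capacity)
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.gate = asyncio.Lock()  # shared gate: workers pass through one at a time

    async def acquire(self) -> None:
        async with self.gate:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                await asyncio.sleep((1.0 - self.tokens) / self.rate)

# All keyword x tile workers would share one bucket, e.g.:
#   bucket = TokenBucket(rate=0.5, capacity=2)  # ~1 request every 2 seconds
#   await bucket.acquire()  # before each Google Local fetch
```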
Input
{"start_address": "Karl Johans gate 1, Oslo, Norway","range_meters": 2000,"keywords": ["restaurant", "florist", "bakery"],"max_per_keyword": 120,"tile_m": 1200,"fetch_emails": true,"fallback_search": "google","max_pages_per_site": 5,"request_timeout_s": 20,"concurrency": 6,"country_override": null,"respect_robots": false,"user_agent": null,"enable_distance_fallback": false}
Parameters
| Key | Type | Default | Description |
|---|---|---|---|
| `start_address` | string | — | Geocoding center (Nominatim; maps.co fallback). |
| `range_meters` | integer | 2000 | Radius around the center to cover with a hex grid of search points. |
| `keywords` | array[string] | — | Categories/queries searched via Google Local (`tbm=lcl`). Can be specific terms (`"florist"`, `"dentist"`) or broad buckets like `"Food & Drinks"`. |
| `max_per_keyword` | integer | 120 | Hard cap on total items per keyword across all grid tiles. Prevents task explosion. |
| `tile_m` | integer | 1200 | Approximate spacing between grid points. Smaller means denser coverage and more requests. |
| `fetch_emails` | boolean | true | If true, crawl discovered websites (same registered domain) for emails/phones/socials. |
| `fallback_search` | enum | `"google"` | Fallback site-search provider order; rotates across Google, DuckDuckGo, and Bing. |
| `max_pages_per_site` | integer | 5 | Crawl budget per site. |
| `request_timeout_s` | integer | 20 | HTTP timeout for requests, in seconds. |
| `concurrency` | integer | 6 | Worker count for keyword × tile jobs, with a separate limiter for site crawling. |
| `country_override` | string | null | Force country ISO2 (NO/SE/DK/FI/IS) when geocoding is ambiguous. |
| `respect_robots` | boolean | false | When true, skip paths disallowed by robots.txt during the site crawl. |
| `user_agent` | string | null | Override the default desktop UA if needed. |
| `enable_distance_fallback` | boolean | false | When true, geocodes listing addresses to estimate `distance_m` if Maps coordinates are missing. |
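To start runs programmatically, something like the following should work with the official Apify Python client. The Actor ID is inferred from the store URL later in this README; verify it (and your token) in the Apify Console.

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Actor ID inferred from the store URL in this README; verify before use.
run = client.actor("odaudlegur/scoutr-nordics-google-maps-scraper").call(
    run_input={
        "start_address": "Karl Johans gate 1, Oslo, Norway",
        "range_meters": 2000,
        "keywords": ["florist"],
        "fetch_emails": True,
    }
)

# Iterate the finished run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("name"), item.get("website"), item.get("emails"))
```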
Output
Each dataset item is a JSON object like:
{"_type": "listing","source": "google_lcl","keyword": "florist","search_center": {"lat": 59.9139, "lng": 10.7522},"start_center": {"lat": 59.9139, "lng": 10.7522},"distance_m": 340,"country": "NO","name": "Blomst AS","address": "Dronningens gate 1, 0152 Oslo","rating": 4.6,"reviews": 37,"phone": "+4722000000","phones_from_site": ["+4722000000"],"gmaps_url": "https://www.google.com/maps/place/...","maps_url": "https://www.google.com/maps/place/...","website": "https://blomst.no/","emails": ["post@blomst.no"],"social": {"instagram": ["https://www.instagram.com/blomst/"]},"fallback_used": false,"lat": 59.9139,"lon": 10.7522,"start_address": "Karl Johans gate 1, Oslo, Norway","range_meters": 2000,"keywords": ["restaurant", "florist", "bakery"]}
Field notes
- `address`: cleaned to exclude hours/amenities/phones; prefers variants containing postcode and city.
- `distance_m` (see the sketch after this list):
  - Primary: distance from the start center when the Maps URL exposes coordinates.
  - Optional fallback: when `enable_distance_fallback=true`, attempts to geocode the listing address and compute the distance from the start center.
  - May be `null` when neither coordinate source is available or geocoding fails.
- `phone`: the first phone parsed from the local card text, normalized to E.164 for the target country.
- `phones_from_site`: phones harvested from the website, normalized to E.164.
- `emails`: filtered to avoid platforms (CDNs, analytics, ESPs). Same-domain addresses are prioritized; common freemail is allowed.
- `fallback_used`: `true` if the website came from the fallback web search rather than the local card.
- Uniqueness: per keyword, each `(name, address)` pair appears at most once in the dataset, even if it shows up in multiple tiles.
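For reference, `distance_m` is a straight-line (great-circle) distance. A standard haversine computation like the sketch below reproduces the idea; whether the Actor uses exactly this formula is an assumption.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000.0  # mean Earth radius in meters
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dl / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Start center from the example output vs. a hypothetical listing nearby:
print(round(haversine_m(59.9139, 10.7522, 59.9108, 10.7465)))  # ~470 m
```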
How it works
- Geocoding: resolves `start_address` to lat/lng and a rich `address` dict used to set `gl`/`hl` and ccTLD constraints.
- Keyword expansion: the `keywords` list is normalized and expanded. Broad buckets like `"Food & Drinks"` or `"Retail & Stores"` fan out into many concrete queries (e.g. `restaurant`, `bakery`, `café`, `supermarket`), which are then deduped.
- Hex grid: builds a compact hexagonal grid of points covering `range_meters` (see the sketch after this list).
- Prime pass for early results: the first grid point is processed for all keywords (page 0) before spinning up the worker pool, so the dataset gets initial rows quickly. If nothing is found on this prime tile, the actor stops early with an empty dataset.
- Local search: for each `(keyword, grid point)` pair, the actor fetches `tbm=lcl` HTML pages in steps of 10 results, with backoff on throttling.
- Parsing: extracts name, Maps URL, rating, reviews, and a cleaned address using three sources:
  - the card text (with category/headline and hours removed),
  - the `/maps/dir//…` address segment,
  - the `/maps/place/…` slug when it looks address-like.
  It then picks the richest candidate, preferring ones with postcode + city.
- Website: uses visible “Website” buttons and, if missing, performs a ccTLD-biased fallback search.
- Crawl (optional): visits up to `max_pages_per_site` pages, prioritizing `/kontakt`, `/contact`, `/about`, etc.
  - Emails are taken from visible text, `mailto:` links, JSON-LD, meta tags, attributes, Cloudflare `data-cfemail`, JavaScript string tricks, and base64.
  - Phones are normalized to E.164 with country rules; times and prices are rejected.
  - Social links are captured if present.
- Distance fallback (optional): when `enable_distance_fallback` is true and Maps coordinates are missing, the actor geocodes the listing address and recomputes `distance_m`.
- Deduplication: `(name, address)` is used as a global key across all tiles and keywords; only one enriched record per venue per keyword is emitted.
- Throttling/backoff: per-request pacing, retry/backoff on `429`/`403`, a global token bucket for Google Local, and modest default concurrency.
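The hex grid step can be pictured with simple offset geometry. The sketch below is a rough illustration; the spacing math and the local meters-to-degrees conversion are assumptions, not the Actor’s exact code.

```python
from math import cos, radians, sin

def hex_grid(lat: float, lng: float, range_m: float, tile_m: float):
    """Yield (lat, lng) centers on a hexagonal lattice covering a circle of radius range_m."""
    m_per_deg_lat = 111_320.0                       # meters per degree of latitude
    m_per_deg_lng = 111_320.0 * cos(radians(lat))   # shrinks with latitude
    row_pitch = tile_m * sin(radians(60))           # vertical spacing of hex rows
    rows, cols = int(range_m / row_pitch) + 1, int(range_m / tile_m) + 1
    for r in range(-rows, rows + 1):
        y = r * row_pitch                           # north offset, meters
        x_shift = tile_m / 2 if r % 2 else 0.0      # stagger alternate rows
        for c in range(-cols, cols + 1):
            x = c * tile_m + x_shift                # east offset, meters
            if x * x + y * y <= range_m * range_m:  # keep points inside the circle
                yield lat + y / m_per_deg_lat, lng + x / m_per_deg_lng

print(len(list(hex_grid(59.9139, 10.7522, range_m=2000, tile_m=1200))), "search centers")
```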
Debug logging
The actor writes compact, high-signal logs:
- `[boot]` startup, input summary, proxy usage
- `[geocode]` geocoding provider progress
- `[sched]` number of tiles, number of keywords, job queue size, workers
- `[prime]` initial prime tile/keyword execution for early dataset results (and early exit when no results exist)
- `[tile]` which keyword/tile is executing
- `[google.lcl]` fetch/page status, result counts, backoffs
- `[site.search]` fallback site search attempts/hits
- `[crawl]` per-site crawl start and summary (emails/phones found)
- `[addr]` address pipeline result
- `[push]` each pushed item with a short summary
- `[done]` completion
Performance tips
- Keep `concurrency` modest; prefer residential proxies if you see many `429`/`403` responses.
- For large areas, increase `tile_m` before increasing `range_meters`.
- Use `max_per_keyword` to prevent runaway item counts when keywords are broad (especially with buckets like `"Food & Drinks"`).
- Set `fetch_emails=false` for fast discovery runs; re-crawl for email data in a second pass.
- If you only care about approximate distance and many listings lack coordinates, consider enabling `enable_distance_fallback`.
FAQ
I’m not technical. How am I supposed to use this?
In simple terms:
1. Pick a starting point. Type a real-world address into `start_address` (for example: “Karl Johans gate 1, Oslo, Norway”). This is the center of your search.
2. Choose how far to search. Set `range_meters` to how far around that point you care about. 2,000–3,000 meters covers a neighborhood or city center; larger values mean more area, more results, and more time.
3. Tell it what you’re looking for. In `keywords`, you can:
   - put specific things: `"restaurant"`, `"florist"`, `"dentist"`, `"kindergarten"`;
   - or use broad groups like `"Food & Drinks"`, `"Retail & Stores"`, `"Health & Beauty"`. The actor automatically expands these into many detailed searches.
4. Decide if you want emails or just places.
   - `fetch_emails = true`: slower, but tries to visit each website to collect emails, phones, and social links.
   - `fetch_emails = false`: much faster; returns basic listing info (name, address, phone, website, etc.).
5. Run the actor and wait. The actor will:
   - geocode your start address,
   - sweep the area tile by tile,
   - collect matching businesses into the dataset.
6. Download your dataset. When it’s done, open the dataset in Apify and export it as JSON, CSV, XLSX, or whatever format you prefer.
What counts as a “keyword”? Can I just say “Food & Drinks”?
Yes.
You have two options:
- Specific keyword. Example: `"florist"` → the actor searches Google for florists near your area.
- Bucket keyword (broad group). Example: `"Food & Drinks"` → the actor internally expands this into many concrete queries like `restaurant`, `bakery`, `café`, `kiosk`, `bar`, `pub`, `pizzeria`, etc., then deduplicates the results.
Using buckets is helpful when you don’t know all the specific labels but you want broad coverage. Just remember: broader buckets = more searches = more time and more results.
Why does it take so long sometimes?
Because the actor is doing a lot of work to avoid getting you blocked and to squeeze out contact data:
- It scans an area, not just a single point. Your `range_meters` is turned into a hexagonal grid of many small “tiles” around your start address. Each tile runs a search for each keyword.
- Google results come in pages of 10. For each tile, the actor walks the results page by page and parses each listing card.
- It has to be polite to Google. The actor:
  - slows down between requests,
  - backs off when it sees “too many requests” or access-denied responses,
  - uses a global rate limiter so multiple workers don’t overload Google at once.
  This means more safety, but no instant gratification.
- Website crawling is heavier than just searching. If `fetch_emails=true`:
  - each site may have several pages visited (`/contact`, `/about`, etc.),
  - the actor reads the HTML and JavaScript, tries to decode obfuscated emails, and normalizes phone numbers,
  - all of this has to respect timeouts so one slow site doesn’t freeze the whole run.
- Safety timeouts and retries. There are time limits around:
  - tile processing,
  - geocoding,
  - website crawling,
  - writing to the dataset.
  These keep things from hanging forever, but they also mean the actor would rather wait a bit and retry than fail immediately.

So if you give it:
- a large radius,
- very broad buckets (`"Food & Drinks"`, `"Retail & Stores"`, `"Events & Hospitality"`),
- `fetch_emails = true`,
- and a high `max_per_keyword`,

…it will happily eat time while collecting a large, enriched lead list.
How do I run it for “maximum data”?
Use settings like this when you want as many enriched leads as possible, not speed:
- `range_meters`: 3,000–7,000 (or more, depending on city size).
- `tile_m`: 1,000–1,500 (denser tiles if you want fewer gaps).
- `keywords`: a mix of specific terms and buckets (e.g. `"Food & Drinks"`, `"Retail & Stores"`, `"Health & Beauty"`).
- `max_per_keyword`: 200–400 (or higher, but be mindful of dataset size).
- `fetch_emails`: `true`.
- `max_pages_per_site`: 5–8 (more pages = a better chance of finding emails).
- `enable_distance_fallback`: `true` if you care about distance and many listings miss coordinates.
- `concurrency`: moderate (e.g. 6–10) with decent proxies to avoid throttling.
This mode is for when you’re building contact/lead lists and are fine with the job taking longer.
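Translated into a concrete input, a maximum-data run might look like this (values picked from the ranges above):

```json
{
  "start_address": "Karl Johans gate 1, Oslo, Norway",
  "range_meters": 5000,
  "tile_m": 1200,
  "keywords": ["Food & Drinks", "Retail & Stores", "Health & Beauty"],
  "max_per_keyword": 300,
  "fetch_emails": true,
  "max_pages_per_site": 6,
  "enable_distance_fallback": true,
  "concurrency": 8
}
```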
How do I run it for a “fast scan”?
If you just want a quick overview of places, try:
- `range_meters`: 1,000–2,000.
- `tile_m`: 500–2,000 (fewer tiles).
- `keywords`: more focused (`"restaurant"`, `"café"`, `"florist"`); avoid huge buckets.
- `max_per_keyword`: 50–100.
- `fetch_emails`: `false` (this is the biggest speed boost).
- `max_pages_per_site`: ignored if `fetch_emails=false`.
- `enable_distance_fallback`: `false` (skip the extra geocoding).
- `concurrency`: 4–6.
You’ll still get name, address, phone, website, Maps URL, rating, reviews, but you skip the heavier website crawling work.
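As a concrete input, a fast scan might look like this (values picked from the ranges above):

```json
{
  "start_address": "Karl Johans gate 1, Oslo, Norway",
  "range_meters": 1500,
  "tile_m": 1500,
  "keywords": ["restaurant", "café", "florist"],
  "max_per_keyword": 80,
  "fetch_emails": false,
  "enable_distance_fallback": false,
  "concurrency": 4
}
```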
Why does it sometimes finish quickly with no results?
Two common reasons:
1. The prime tile found nothing. The actor first tries all keywords at the first grid point. If literally nothing is returned for any of them, it assumes that either:
   - the keywords are not a good match for that area, or
   - there is some issue with the search that will repeat everywhere.
   In that case it stops early and returns an empty dataset instead of pointlessly scanning the rest of the grid.
2. Your keywords are very niche or misspelled. If you use very specific or misspelled keywords, Google may have few or no local results. Try:
   - more general terms (`"restaurant"` instead of `"organic vegan fine dining"`), or
   - bucket labels like `"Food & Drinks"`.
Troubleshooting
No or very late results in the dataset
- The actor runs a prime pass on the first grid point for all keywords (page 0) before the worker pool starts. You should see early results from that tile; if not, check the logs around `[prime]`, `[tile]`, and `[push]`.
- If the prime tile returns zero results for all keywords, the actor ends early with an empty dataset by design.
- Verify the dataset view is filtered to the current run.
Duplicates in output
- Current builds emit one enriched record per `(name, address, keyword)`.
- If you still see identical rows, check for:
  - slight address variations (e.g. a different postcode or formatting), as illustrated in the sketch below;
  - different keywords targeting the same venue (this is expected: uniqueness is per keyword, not global).
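The sketch below shows why near-identical addresses survive dedupe: the key is the literal `(name, address)` text, so any textual difference yields a new key. The normalization shown is an assumption about the general approach, not the Actor’s exact rules.

```python
import re

def dedupe_key(name: str, address: str) -> tuple[str, str]:
    """Whitespace/case-normalized (name, address) key; different text stays distinct."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(name), norm(address)

seen: set[tuple[str, str]] = set()
for name, addr in [
    ("Blomst AS", "Dronningens gate 1, 0152 Oslo"),
    ("Blomst AS", "Dronningens gate 1, Oslo"),  # same venue, postcode missing
]:
    key = dedupe_key(name, addr)
    print(key, "-> duplicate" if key in seen else "-> new row")
    seen.add(key)
# The second entry becomes a separate row because its address text differs.
```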
Addresses include hours or are missing postcode/city
- The address resolver strips hours/amenities and merges candidates from the card text and Maps URL slugs, preferring those with postcode + city.
- If the card address is very minimal, enabling `enable_distance_fallback` may also improve downstream data quality via geocoding enrichment.
Actor “stalls” with many keywords
- The job queue and per-keyword caps prevent stalls. Check the logs for `[sched] jobs queued=… workers=…` and ongoing `[google.lcl]` or `[crawl]` lines.
- If the logs are quiet, raise `APIFY_LOG_LEVEL` to `DEBUG` temporarily to observe progress.
Few or no emails
- Many sites hide emails; try increasing `max_pages_per_site` slightly. Obfuscated emails are often still decodable (see the sketch below).
- Ensure `fetch_emails=true`.
- Consider running during local business hours when some sites expose contact widgets.
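As one example of what “hidden emails” means: Cloudflare’s email obfuscation stores the address in a `data-cfemail` attribute, XOR-encoded with a one-byte key, and decoding it is mechanical:

```python
def decode_cfemail(cfemail: str) -> str:
    """Decode a Cloudflare data-cfemail hex string: byte 0 is the XOR key."""
    data = bytes.fromhex(cfemail)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")

# Hypothetical value scraped from <a data-cfemail="...">:
print(decode_cfemail("42322d313602202e2d2f31366c2c2d"))  # -> post@blomst.no
```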
Roadmap / future improvements
- Deeper Maps parsing: structured card parsing for more stable address and coordinates extraction.
- Entity resolution: fuzzy dedupe across alternate names and addresses.
- Smart pagination: adaptive stop rules based on per-keyword coverage quality.
- Site crawl heuristics: sitemap discovery and targeted link scoring for contact pages.
- Language models for email extraction: context-aware extraction where regex/DOM misses.
- Rate-aware scheduler: dynamic concurrency based on recent throttle signals.
- Optional CSV/Parquet export: direct tabular outputs with schema validation.
Notes & tips
- Blocking: Google may throttle or redirect. Use Apify Proxy with appropriate pools and keep concurrency conservative.
- Languages: parsing handles multiple container patterns and Nordic website-button labels (`nettside`, `webbplats`, etc.).
- Email quality: filters out ESP/CDN/analytics/platform domains; allows common freemail.
- Robots: enable `respect_robots` when policy requires it (see the sketch after this list); note that some sites block generic crawler paths.
- Legal: always ensure your use complies with local laws and target-site terms.
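When `respect_robots` is enabled, the check is conceptually a standard robots.txt lookup. Python’s standard library covers the idea; this is a minimal sketch of the concept (with a hypothetical site), not the Actor’s code.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://blomst.no/robots.txt")  # hypothetical target site
rp.read()

ua = "Mozilla/5.0"  # whichever user_agent the run is configured with
for url in ("https://blomst.no/kontakt", "https://blomst.no/admin"):
    print(url, "->", "allowed" if rp.can_fetch(ua, url) else "disallowed by robots.txt")
```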
Supported & planned regions
| Region | Status | Details | Link |
|---|---|---|---|
| Nordics | Optimized | Last optimized: 2025-11-13 (NO/SE/DK/FI/IS) | https://apify.com/odaudlegur/scoutr-nordics-google-maps-scraper |
| Western EU | Planned | — | — |
| Eastern EU | Planned | — | — |
| North America | Not started | — | — |
| South America | Not started | — | — |
| East/SE Asia | Not started | — | — |
| Middle East | Not started | — | — |
| Africa | Not started | — | — |
| Oceania | Not started | — | — |
Create an issue if you’d like your country prioritized.
Changelog
- 2025-11-13
  - Added keyword bucket expansion (e.g. `"Food & Drinks"`, `"Retail & Stores"`) that fans out into multiple concrete queries.
  - Introduced cached, timeboxed geocoding with a TTL cache and a hard-fail guard to keep runs alive under provider throttling.
  - Added global Google Local rate limiting (shared gate + token bucket) and stricter job-level timeouts around tile processing, geocoding, and dataset writes.
  - Prime tile now runs all keywords on the first grid point and ends early if no results are found for any keyword.
  - Improved logging around `[push]`, distance fallback, and geocoding; added a default maps.co API key for smoother fallback behavior.
  - Added the FAQ explaining non-technical usage and speed vs. data tradeoffs.
- 2025-11-11
  - Single enriched record per `(name, address, keyword)`; removed separate “basic then enriched” pushes to avoid duplicates.
  - Prime tile/keyword processed before the worker pool to ensure early dataset results.
  - Added `enable_distance_fallback` to optionally geocode listing addresses for distance estimation.
  - Added hard timeouts around tile processing and dataset writes; minor logging improvements (`[prime]`, clearer `[push]`).
- 2025-11-07
  - Address cleanup to remove hours/amenities/inline phones from address text.
  - Address enrichment from `/maps/dir//…` and `/maps/place/…` slugs; prefer postcode + city.
  - Global `(name, address)` dedupe across tiles and keywords.
  - Anti-stall queue with per-keyword caps; improved backoff on throttling.
  - Compact debug logs per major step.
- 2025-11-04
  - Nordic tuning for language/region and ccTLD-biased fallback search.
  - Email filtering improvements and E.164 phone normalization.
Disclaimer & License
This Apify Actor is provided “as is”, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement. Follow local laws, do not use it for malicious purposes, and do not use this code to spam.
ToS & legality (Reminder): Scraping Google HTML may violate their Terms of Service. Use responsibly, at low rates, with proxies if needed, and comply with local laws and site policies. This Actor avoids official Google APIs and parses public HTML only.
© 2025 SLSH. All rights reserved. Copying or modifying the source code is prohibited.


