Proff.no Lead Scraper (Beta)
Retrieve leads from proff.no, the easy way. This Actor collects each business's name, address, email addresses, phone numbers, and social links.


An Apify Actor that:

  1. Starts from one or more Proff.no bransjesøk (listing) URLs.
  2. Extracts company detail URLs (/selskap/...) from listing pages and follows pagination via ?page= up to a configurable depth.
  3. Visits each company detail page and extracts name, categories, phone, address, website and a set of raw emails.
  4. Validates and normalizes the website URL, then checks its HTTP status (2xx/3xx vs. 404/unavailable/banned).
  5. Crawls the company website (same registered base domain) for a small set of HTML pages to collect additional emails and social links.
  6. Optionally crawls social profiles (Facebook, Instagram, LinkedIn) when present, as an extra fallback to find emails.
  7. Sanitizes and deduplicates emails, filters out tracking / garbage patterns, and prioritizes emails matching the company domain (or common freemail domains).
  8. Reuses a per-domain website crawl cache so multiple companies on the same domain do not trigger repeated crawls.
  9. Uses async HTTP with concurrency limits and timeouts to scrape Proff and external sites efficiently while staying reasonably stable.

Optimized for Norwegian Proff listings, but with address heuristics that can also handle Swedish-style postcodes when needed.

ToS & legality: Scraping Proff HTML may violate their Terms of Service. Use responsibly, at low rates, with proxies as needed, and comply with local laws and site policies. This Actor avoids official APIs and parses public HTML only.


What’s new (2025-11-14)

Quality & correctness

  • Proff-specific detail extraction: the actor now directly understands Proff layout:
    • Detail links discovered via a[href*="/selskap/"] on listing pages.
    • Pagination resolved from the "Side X av Y" text plus the ?page= parameter.
  • Structured address parsing:
    • First tries JSON-LD PostalAddress blocks when present.
    • Falls back to microdata and finally heuristic address extraction tuned for Norwegian 4-digit postcodes (with optional Swedish NNN NN support).
    • JSON-LD is no longer limited to Sweden-only postcodes.
  • Website detection and validation:
    • Uses canonical URL (<link rel="canonical">), og:url, and JSON-LD url/sameAs before falling back to “best external link”.
    • Marks each website with a website_details status (ok, 404, unavailable, banned, n/a).
  • Layered email collection:
    • Detail page: visible text, mailto: links, Proff-style button/slug patterns, with HTML entities unescaped.
    • Website crawl: same-domain pages prioritized by contact-like paths (/kontakt, /contact, /about, etc.).
    • Social profiles: optionally fetch emails from Facebook, Instagram, LinkedIn pages when no site emails are found.
  • Sanitization & filtering:
    • Strict and fallback regexes, length checks, tracking-substring filters (button-email, tracking, click, census).
    • Rejects .jpg/.pdf/.js tails, slashes inside the address, and suspicious tracking-style local parts.
  • Domain-aware filtering (a sketch of this filter follows the list):
    • When a website is valid:
      • Keeps emails whose domain matches the website's base domain (or a subdomain of it) or belongs to a common freemail provider.
    • When the website is missing/banned/unavailable:
      • Keeps all emails from the detail page and social sources that pass basic validity checks.
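As a rough illustration, the domain-aware filter might look like the sketch below (the freemail list, regexes, and function name are illustrative, not the actor's actual identifiers):

```python
import re

# Illustrative lists; the actor's real regexes and freemail set may differ.
FREEMAIL = {"gmail.com", "outlook.com", "hotmail.com", "yahoo.com", "icloud.com"}
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
TRACKING_SUBSTRINGS = ("button-email", "tracking", "click", "census")

def filter_emails(emails: list[str], base_domain: str | None) -> list[str]:
    """Keep plausible emails; prefer the company domain when a valid website exists."""
    kept = []
    for email in emails:
        email = email.strip().lower()
        if not EMAIL_RE.match(email) or any(t in email for t in TRACKING_SUBSTRINGS):
            continue  # fails basic validity or looks like tracking garbage
        domain = email.split("@", 1)[1]
        if base_domain is None:
            kept.append(email)  # no valid website: keep everything plausible
        elif domain == base_domain or domain.endswith("." + base_domain):
            kept.append(email)  # company domain or one of its subdomains
        elif domain in FREEMAIL:
            kept.append(email)  # small businesses often use freemail addresses
    return list(dict.fromkeys(kept))  # dedupe while preserving order

print(filter_emails(["Post@Firma.no", "x@tracking-pixel.io"], "firma.no"))
# -> ['post@firma.no']
```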

Stability & performance

  • Bounded website crawling:
    • Same-domain internal links only (by registered base domain).
    • Skips binary / heavy assets (pdf/doc/media/archives).
    • Hard limit of site_email_max_pages pages per root website.
  • Per-domain crawl cache:
    • Results for each base domain are cached in memory for the run.
    • Multiple companies under the same domain reuse the same (emails, socials) rather than recrawling.
  • Concurrency-aware max_results:
    • Uses an async lock and shared counters to enforce a strict upper bound on the number of detail pages processed and dataset records pushed.
    • Once max_results is reached, workers stop picking up new detail jobs.
  • Time-bounded HTTP:
    • Configurable timeout_seconds for HTTP operations.
    • httpx.AsyncClient with limits on total and keep-alive connections.
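A minimal sketch of the HTTP client setup under those constraints (the exact limit values are assumptions, not the actor's settings):

```python
import asyncio
import httpx

async def main(timeout_seconds: int = 30, concurrency: int = 5) -> None:
    # Cap total and keep-alive connections roughly in line with the worker count.
    limits = httpx.Limits(max_connections=concurrency * 2,
                          max_keepalive_connections=concurrency)
    timeout = httpx.Timeout(timeout_seconds, connect=10.0)
    async with httpx.AsyncClient(limits=limits, timeout=timeout,
                                 follow_redirects=True) as client:
        resp = await client.get("https://www.proff.no/")
        print(resp.status_code)

asyncio.run(main())
```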

Input

{
  "start_urls": [
    { "url": "https://www.proff.no/bransjesøk?q=Restauranter%20og%20kafeer&region=Østlandet&county=Innlandet" }
  ],
  "max_depth": 3,
  "max_results": 200,
  "site_email_max_pages": 3,
  "timeout_seconds": 30,
  "concurrency": 5,
  "headers": null
}

Parameters

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| start_urls | array[object] | see above | List of starting listing URLs, each object with a url field. Typically Proff bransjesøk result pages. |
| max_depth | integer | 3 | Maximum depth of listing pagination per seed. Depth 0 is the initial listing page; depth 1 corresponds to ?page=2, and so on. |
| max_results | integer | 0 | Global hard cap on the number of detail pages processed and records pushed. 0 means no cap. |
| site_email_max_pages | integer | 3 | Maximum number of HTML pages to crawl per website when searching for emails and social links. |
| timeout_seconds | integer | 30 | Read timeout for HTTP responses (also influences the overall httpx.Timeout). |
| concurrency | integer | 5 | Number of worker coroutines fetching from the Proff request queue in parallel. |
| headers | object or null | default headers | Optional HTTP headers override. When null, a sensible Norwegian desktop UA and Accept-Language are used. |

Default headers

When you do not provide headers, the actor uses:

{
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
  "Accept-Language": "nb-NO,nb;q=0.9,no;q=0.8,sv;q=0.7,en;q=0.6"
}

You can override these if you need a different UA or language profile, but the defaults are tuned for Norway / Nordic sites.


Output

Each dataset item is a JSON object like:

{
  "source_url": "https://www.proff.no/selskap/eksempel-restaurant-as/oslo/123456789/",
  "name": "Eksempel Restaurant AS",
  "categories": "Restauranter, Serveringssteder",
  "phone": "22 00 00 00",
  "address": "Eksempelveien 12, 0123 Oslo",
  "website": "https://eksempelrestaurant.no",
  "website_details": "ok",
  "email1": "post@eksempelrestaurant.no",
  "email2": "booking@eksempelrestaurant.no",
  "email3": "eksempelrestaurant@gmail.com",
  "social_facebook": "https://www.facebook.com/eksempelrestaurant",
  "social_instagram": "https://www.instagram.com/eksempelrestaurant",
  "social_linkedin": "n/a",
  "social_x": "n/a",
  "social_youtube": "n/a",
  "social_tiktok": "n/a",
  "social_pinterest": "n/a"
}

Field notes

  • source_url: The company detail URL on Proff that was scraped. This is the primary identifier and is always present.

  • name: Business name, usually taken from <h1>, og:title, or the <title> tag.

  • categories: A comma-separated string of categories/industries extracted from Proff category containers (e.g. data-qa="categories" or similar elements).

  • phone: Primary phone number as shown on the Proff detail page (see the sketch after this list):

    • First attempt: tel: links.
    • Fallback: a "Telefon NNN NN NN" pattern in the page text.
    • Last resort: a generic Nordic-ish phone regex on the page.
  • address: Postal address:

    • Prefers JSON-LD PostalAddress when available and valid.
    • Otherwise uses microdata, and finally heuristic extraction with Norwegian 4-digit or Swedish-style postcodes and street tokens.
    • Set to "n/a" when no plausible address is found.
  • website: Best guess at the company's website:

    • Canonical URL, og:url, or JSON-LD url / sameAs when external and not banned.
    • Otherwise, the "best external link" found on the page (excluding social and banned domains).
    • Set to "n/a" when no suitable candidate is found or when the site is on the ban list.
  • website_details: Status of the website check:

    • "ok" – HTTP 2xx/3xx.
    • "404" – the site returns 404.
    • "unavailable" – persistent errors / non-OK status.
    • "banned" – the site is on the ban list.
    • "n/a" – no website to check.
  • Emails (email1, email2, …)

    • Emails originate from:

      • The Proff detail page (visible text, mailto: links, slug/button patterns).
      • The website crawl (same-domain pages).
      • Social profiles (Facebook/Instagram/LinkedIn) when the detail page and website yielded none.
    • email1 is always present; if no emails are found at all, it is set to "n/a".

    • email2, email3, etc. appear only when additional distinct emails were discovered.

  • Social fields

    • social_facebook, social_instagram, social_linkedin, social_x, social_youtube, social_tiktok, social_pinterest.
    • Each field is either a cleaned canonical URL (tracking params removed) or "n/a" when not found.
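The layered phone lookup noted under phone above might be sketched like this (the regexes are illustrative, not the actor's exact patterns):

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_phone(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # 1. Prefer explicit tel: links.
    link = soup.select_one('a[href^="tel:"]')
    if link:
        return link["href"].removeprefix("tel:").strip()
    text = soup.get_text(" ", strip=True)
    # 2. Look for a labeled "Telefon ..." pattern.
    m = re.search(r"Telefon\s*:?\s*(\+?\d[\d ]{6,14}\d)", text)
    if m:
        return m.group(1).strip()
    # 3. Last resort: a generic Nordic-ish 8-digit number, optionally with +47.
    m = re.search(r"(?:\+47\s?)?\d{2}\s?\d{2}\s?\d{2}\s?\d{2}", text)
    return m.group(0) if m else "n/a"

print(extract_phone('<a href="tel:+4722000000">Ring oss</a>'))  # +4722000000
```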

How it works

1. Request queue setup

  • Reads start_urls from actor input.

  • For each entry with a url, enqueues a listing request with:

    • user_data = {'depth': 0, 'type': 'listing'}
    • unique_key = url for deduplication.
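With the Apify Python SDK, the seeding step might look roughly like the sketch below (assuming the dict-based request form; depending on the SDK version, a Request object may be required instead):

```python
from apify import Actor

async def seed_queue() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        queue = await Actor.open_request_queue()
        for entry in actor_input.get("start_urls", []):
            url = (entry or {}).get("url")
            if not url:
                continue
            Actor.log.info(f"Enqueuing {url}")
            # uniqueKey = url deduplicates repeated seeds within the run.
            await queue.add_request({
                "url": url,
                "uniqueKey": url,
                "userData": {"depth": 0, "type": "listing"},
            })
```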

2. Listing handling

For each listing request:

  1. Fetches the Proff listing page.

  2. Parses HTML with BeautifulSoup.

  3. Extracts detail links:

    • Any <a> with href containing "/selskap/".
    • Normalized to absolute URLs.
  4. Applies max_results constraint:

    • A shared detail_enqueued counter and lock ensure that only up to max_results detail pages are ever enqueued.
  5. Enqueues detail requests for each selected company URL:

    • user_data = {'depth': depth + 1, 'type': 'detail'}.
  6. Handles pagination:

    • Searches the page text for "Side X av Y".
    • Constructs ?page=N for the next listing page while X < Y and depth < max_depth (see the sketch below).
    • Enqueues the next listing page as another listing request at the incremented depth.
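A minimal sketch of that pagination step (the regex and URL handling are illustrative):

```python
import re
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def next_listing_url(page_text: str, current_url: str,
                     depth: int, max_depth: int) -> str | None:
    """Build the next ?page= URL while "Side X av Y" indicates more pages."""
    m = re.search(r"Side\s+(\d+)\s+av\s+(\d+)", page_text)
    if not m:
        return None
    current, total = int(m.group(1)), int(m.group(2))
    if current >= total or depth >= max_depth:
        return None
    parts = urlparse(current_url)
    query = parse_qs(parts.query)
    query["page"] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(next_listing_url("Side 1 av 12",
                       "https://www.proff.no/bransjes%C3%B8k?q=barnehage", 0, 3))
# -> https://www.proff.no/bransjes%C3%B8k?q=barnehage&page=2
```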

3. Detail handling

For each detail request:

  1. Fetches the company detail page.

  2. Extracts:

    • Name from h1, Open Graph title, or <title>.

    • Categories from Proff-specific category containers.

    • Phone via tel: links and common Norwegian phone patterns.

    • Address via:

      • JSON-LD PostalAddress.
      • Microdata [itemprop="streetAddress"], [itemprop="postalCode"], [itemprop="addressLocality"].
      • Proff-specific fallback (Adresse ...) and general heuristics.
    • Website via:

      • Canonical URL.
      • og:url.
      • JSON-LD url / sameAs.
      • Best external link excluding Proff and social domains.
    • Raw emails from:

      • mailto: links.
      • Page text (clipped to 500k chars).
      • Proff/Hitta-style slug/button patterns.
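For the JSON-LD path, the address extraction might be sketched as follows (a simplification; the actor's real fallback chain also covers microdata and heuristics):

```python
import json
from bs4 import BeautifulSoup

def address_from_jsonld(html: str) -> str | None:
    """Pull street/postcode/locality from a schema.org PostalAddress block, if any."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        for node in data if isinstance(data, list) else [data]:
            addr = node.get("address") if isinstance(node, dict) else None
            if isinstance(addr, dict):
                locality = " ".join(filter(None, [addr.get("postalCode"),
                                                  addr.get("addressLocality")]))
                joined = ", ".join(p for p in [addr.get("streetAddress"), locality] if p)
                if joined:
                    return joined
    return None

html = ('<script type="application/ld+json">{"address": {"streetAddress": '
        '"Eksempelveien 12", "postalCode": "0123", "addressLocality": "Oslo"}}</script>')
print(address_from_jsonld(html))  # Eksempelveien 12, 0123 Oslo
```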

4. Website validation & crawl

If a website candidate exists and is not banned:

  1. Status check:

    • Performs a GET with redirect following.
    • Marks website_details as "ok", "404", or "unavailable".
  2. Site crawl (when status is ok; see the sketch after this list):

    • Computes a base domain from the host (example.no).

    • Looks up a per-run cache:

      • If cached: reuse (emails, socials) for this domain.

      • If not cached:

        • Starts from the root URL.
        • Visits same-domain HTML pages, skipping binary extensions.
        • Prioritizes URLs containing contact-related keywords.
        • Stops after site_email_max_pages pages or when the queue empties.
        • Extracts emails + socials from each page.
        • Stores results in WEBSITE_CRAWL_CACHE.
  3. Email filtering with website context:

    • When website is ok:

      • Keeps only emails that:

        • Match the website base domain or
        • Are from known freemail providers (gmail.com, outlook.com, etc.).
    • When website is missing/banned/unavailable:

      • Uses all valid detail-page emails, plus any valid emails from socials.
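Condensed into a sketch, the cached, bounded crawl might look like this (WEBSITE_CRAWL_CACHE is the only name taken from the description; the link handling is simplified and, for example, does not resolve relative URLs):

```python
import re
import httpx

WEBSITE_CRAWL_CACHE: dict[str, set[str]] = {}  # base domain -> emails found
CONTACT_HINTS = ("kontakt", "contact", "about", "om-oss")
SKIP_EXTS = (".pdf", ".doc", ".jpg", ".png", ".zip", ".mp4")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

async def crawl_site(client: httpx.AsyncClient, root: str,
                     base_domain: str, max_pages: int = 3) -> set[str]:
    if base_domain in WEBSITE_CRAWL_CACHE:
        return WEBSITE_CRAWL_CACHE[base_domain]  # reuse the per-run cache
    emails: set[str] = set()
    queue, seen, visited = [root], {root}, 0
    while queue and visited < max_pages:
        # Visit contact-like URLs first.
        queue.sort(key=lambda u: not any(h in u.lower() for h in CONTACT_HINTS))
        url = queue.pop(0)
        visited += 1
        try:
            resp = await client.get(url, follow_redirects=True)
        except httpx.HTTPError:
            continue  # non-fatal: skip broken pages
        emails.update(EMAIL_RE.findall(resp.text))
        for link in re.findall(r'href="([^"]+)"', resp.text):
            # Same registered domain only; skip binary/heavy assets.
            if base_domain in link and link not in seen \
                    and not link.lower().endswith(SKIP_EXTS):
                seen.add(link)
                queue.append(link)
    WEBSITE_CRAWL_CACHE[base_domain] = emails
    return emails
```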

5. Social extraction

The actor:

  • Parses JSON-LD sameAs arrays/strings and regular <a> tags.

  • Detects social domains using an exact/endswith match against:

    • facebook.com, instagram.com, linkedin.com, x.com, twitter.com, youtube.com, tiktok.com, pinterest.com.
  • Normalizes URLs by removing tracking parameters (utm_*, fbclid, etc.); see the sketch after this list.

  • Fills the social_* fields if they exist; otherwise sets them to "n/a".
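The domain matching and URL cleanup could be sketched like this (the tracking-parameter list is illustrative):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

SOCIAL_DOMAINS = ("facebook.com", "instagram.com", "linkedin.com", "x.com",
                  "twitter.com", "youtube.com", "tiktok.com", "pinterest.com")

def clean_social_url(url: str) -> str | None:
    parts = urlparse(url)
    host = parts.netloc.lower().removeprefix("www.")
    # Exact or subdomain match only, so lookalike hosts cannot slip through.
    if not any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS):
        return None
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if not k.startswith("utm_") and k not in {"fbclid", "gclid"}]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(clean_social_url("https://www.facebook.com/firma?utm_source=proff&fbclid=abc"))
# -> https://www.facebook.com/firma
```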

6. Email fallback via socials

If no emails were found after:

  • Detail page extraction, and
  • Website crawl (or website missing/unavailable),

then:

  • The actor fetches high-priority social profiles (Facebook, Instagram, LinkedIn).
  • Extracts emails from those pages.
  • If still none are found, email1 is set to "n/a".

7. Concurrency & stopping conditions

  • Multiple worker coroutines pull requests from the queue as long as stop_crawl is False.

  • max_results is enforced with a lock:

    • detail_enqueued limits how many detail requests can be added.
    • results_pushed limits how many records are pushed.
  • When results_pushed reaches max_results, stop_crawl is set to True and workers stop after finishing the current job.
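In spirit, the guard looks something like this sketch (the class and attribute names are illustrative):

```python
import asyncio

class ResultLimiter:
    """Concurrency-safe cap on how many records the workers may push."""

    def __init__(self, max_results: int) -> None:
        self.max_results = max_results  # 0 means "no cap"
        self.results_pushed = 0
        self.stop_crawl = False
        self._lock = asyncio.Lock()

    async def try_reserve(self) -> bool:
        """Reserve one result slot; flip stop_crawl once the cap is reached."""
        async with self._lock:
            if self.max_results and self.results_pushed >= self.max_results:
                self.stop_crawl = True  # workers check this before taking new jobs
                return False
            self.results_pushed += 1
            return True
```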


Debug logging

The actor writes concise, useful logs such as:

  • Enqueuing https://www.proff.no/bransjesøk?... – initial listing seeds.
  • Scraping https://www.proff.no/bransjesøk?... (depth=0, type=listing) ... – listing processing starts.
  • Scraping https://www.proff.no/selskap/... (depth=1, type=detail) ... – detail processing starts.
  • While crawling for contacts, skipping https://example.com/path: <error> – website crawl failures (non-fatal).
  • While crawling social page, skipping https://facebook.com/...: <error> – social crawl failures (non-fatal).
  • Worker N finished. – each worker's completion summary.

Set APIFY_LOG_LEVEL to DEBUG for more granular insight during development or troubleshooting.


Performance tips

  • Limit max_results when doing exploratory runs to keep datasets manageable and runs short.

  • Keep concurrency moderate (e.g. 5–10) when crawling many external sites; raise gradually if you have strong infrastructure/proxies.

  • Tune site_email_max_pages:

    • 1–2 for quick scans (basic contacts).
    • 3–5 for deeper email hunting.
  • Use sensible start_urls:

    • More focused Proff filters (region/county, query) mean less noise and fewer unnecessary detail pages.
  • Consider adding your own HTTP headers if you need a different language or user agent profile.


FAQ

I’m not technical. How am I supposed to use this?

  1. Find a Proff search page. Go to Proff and run a bransjesøk (e.g. “Restauranter og kafeer” in a region/county), then copy the URL from your browser.

  2. Paste it into start_urls. In the actor input, set:

    "start_urls": [
    { "url": "PASTE_YOUR_PROFF_URL_HERE" }
    ]

    You can add multiple listing URLs if you want to cover several regions or queries in one run.

  3. Decide how many results you want

    • For a small test, use "max_results": 50.
    • For a serious lead list, use 200–500, or leave it at 0 (no cap, but the run may be long).
  4. Choose how hard to search for emails

    • site_email_max_pages = 1–2 → faster, fewer emails, minimal crawling.
    • site_email_max_pages = 3–5 → slower, better chance to find contact pages.
  5. Run the actor. Wait for the run to finish, then open the dataset in Apify and export it as CSV/JSON/XLSX.


What counts as a “good” start_url?

Any Proff listing page that shows a paginated list of companies is fine, e.g.:

  • https://www.proff.no/bransjesøk?q=Restauranter%20og%20kafeer&region=Østlandet&county=Innlandet
  • https://www.proff.no/bransjesøk?q=barnehage&region=Hele%20Norge

Avoid linking directly to a single company detail page as a start_url; the actor handles those too, but you lose the whole “listing to many details” benefit.


Why does it sometimes return fewer records than max_results?

Common reasons:

  • The Proff listing(s) simply do not have that many companies matching your search.
  • Pagination stops when "Side X av Y" indicates there are no more pages.
  • Some detail pages may fail completely (rare), and the actor logs an error and moves on.

The actor guarantees not to exceed max_results, but it cannot invent records that Proff does not provide.


Why do some records only have email1 = "n/a"?

That means:

  • No valid emails were found on:

    • The Proff detail page,
    • The company website (if any and reachable),
    • The prioritized social profiles (Facebook/Instagram/LinkedIn).
  • Many businesses use contact forms or hide emails behind logins/JS widgets; those are out of reach without more invasive techniques.

You can still use the phone, address, and website fields for outreach or enrichment elsewhere.


Why are some emails freemail (gmail/outlook/etc.)?

When the website is reachable and valid:

  • The actor keeps:

    • Emails on the same base domain as the website.
    • Emails on common freemail providers (gmail, outlook, etc.), since many small businesses use them.

This means you will see combinations such as:

  • post@firma.no
  • navn.firma@gmail.com

When the website is missing or down, the actor keeps all valid-looking emails from the detail page and social profiles.


Why is the website sometimes "n/a" even though Proff shows something clickable?

Likely causes:

  • The Proff “website” link points to a banned domain (e.g. generic info/lookup services) that is intentionally ignored.
  • The site consistently returns non-HTML or very broken responses that fail the validation.
  • The link leads to something that is detected as a social media or share/intent URL, not a real website.

The actor prefers to output "n/a" rather than pretend a non-company site is the official website.


Troubleshooting

Actor exits immediately with no data

  • Check that start_urls is a non-empty array, each element an object with a non-empty url field.
  • The actor logs an error and exits if there are no valid start_urls.

Timeouts and partial data

  • Increase timeout_seconds slightly if you see frequent timeouts in logs for slower sites.
  • Reduce site_email_max_pages and/or concurrency to lighten the load.

Too many external-sites-related errors

  • Many external sites may block or throttle; this is normal at small scale.
  • If you rely heavily on site crawling, consider using a suitable proxy configuration on the Apify platform.

Roadmap / future improvements

  • Better Proff layout adaptation if/when Proff updates their DOM structure.
  • Company ID extraction from URLs for easier merging with other datasets.
  • Optional Proff organization number field (org.nr parsing).
  • Multi-country tuning for similar portals (e.g., Swedish company directories).
  • Configurable social crawl strategy (which networks to prioritize or exclude).
  • Schema validation for output with stricter types.

Supported & planned regions

| Region | Status | Details / Link |
| --- | --- | --- |
| Norway | Optimized | Proff.no |
| Sweden | Planned | Allabolag.se |

Create an issue or contact the author if you’d like a specific directory or country prioritized.


Changelog

  • 2025-11-14

    • Reworked actor to target Proff.no instead of Google Maps Local results.
    • Added listing vs detail separation with Proff-specific detail link & pagination logic.
    • Implemented JSON-LD-first address extraction with Norwegian postcode awareness.
    • Hardened social extraction from JSON-LD sameAs and <a> tags with safe domain matching.
    • Introduced per-domain website crawl cache to avoid redundant crawling.
    • Enforced concurrency-safe max_results limits using async locks and shared counters.

Disclaimer & License

This Apify Actor is provided “as is”, without warranty of any kind — express or implied — including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement. Please follow local laws, do not use for malicious purposes and do not use this code to spam.

I will find you

ToS & legality (Reminder): Scraping Proff HTML may violate their Terms of Service. Use responsibly, at low rates, with proxies if needed, and comply with local laws and site policies. This Actor avoids official APIs and parses public HTML only.

© 2025 SLSH. All rights reserved. Copying or modifying the source code is prohibited.