Cybersecurity Intelligence Directory Scraper
Pricing: from $0.001 / actor start
Scrapes the Cybersecurity Intelligence Supplier Directory (cybersecurityintelligence.com) for company profiles including name, website, description, location, phone, and category tags.
Developer: Jon Froemming
Apify Actor that scrapes the Cybersecurity Intelligence Supplier Directory (cybersecurityintelligence.com) using Python, Crawlee, and Playwright (Chromium).
It collects company listings from category pages, optionally follows each company’s detail page for richer fields (website, phone, tags), and writes structured rows to the default dataset.
This file is linked from .actor/actor.json as the Actor readme shown on the Apify platform.
Table of contents
- What it scrapes
- Input
- Output
- Deduplication
- How a run completes on Apify
- Local development
- Deploy
- Legal and etiquette
- Project layout
What it scrapes
| Phase | Description |
|---|---|
| Categories | If you do not pass `categories`, the Actor opens `browse_categories.php`, collects every supplier-directory category link, and enqueues them. Blog `/category/` links are ignored so only real listing URLs are used. |
| Listings | For each category URL, it parses `.listingsWrapper` blocks (name, short description, address snippet) and follows pagination via `ul.pagination`. |
| Details (optional) | When `scrapeDetailPages` is `true`, each company link is enqueued as a detail request; the handler extracts full profile data and pushes one dataset item per company. |
Country filter: If `country` is set, a location segment is appended to category URLs (for example `US` → `location/usa/`) using a small built-in code → slug map. Leave `country` empty to scrape all locations.
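The country filter described above could be sketched like this. The map contents and function name are placeholders for illustration, not the Actor's actual identifiers:

```python
# Hypothetical sketch of the code -> slug country filter. The real map in the
# Actor's source is larger and its entries may differ.
COUNTRY_SLUGS = {"US": "usa", "UK": "uk", "DE": "germany"}

def with_country(category_url: str, country: str) -> str:
    """Append a location segment to a category URL, or return it unchanged."""
    slug = COUNTRY_SLUGS.get(country.strip().upper()) if country else None
    if not slug:
        # Empty or unrecognized code -> scrape all locations.
        return category_url
    return f"{category_url.rstrip('/')}/location/{slug}/"
```

An unknown or empty code falls through to the unfiltered URL, matching the "leave empty for worldwide" behavior.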
Input
Configure the Actor in the Apify console or via JSON input. All fields are optional unless noted.
| Field | Type | Default | Description |
|---|---|---|---|
| `categories` | string[] | `[]` | Category slugs only (e.g. `cloud-security`, `managed-security-services`). Empty = scrape all categories from the browse index. |
| `country` | string | `""` | Filter by country code (`US`, `UK`, `DE`, …) or leave empty for worldwide. |
| `maxPagesPerCategory` | integer | `0` | Cap on listing pages per category. `0` = unlimited (follow "next" until none). Maximum allowed by the schema: 500. |
| `scrapeDetailPages` | boolean | `true` | `true`: visit each company detail page (website, phone, tags). `false`: only data visible on listing cards (faster, fewer fields). |
| `maxConcurrency` | integer | `3` | Playwright concurrency (1–10). Raise carefully on Apify; higher values increase load on the target site and memory use. |
Example input (full directory, details on)
```json
{
  "categories": [],
  "country": "",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 3
}
```
Example input (specific categories, US only)
```json
{
  "categories": ["cloud-security", "managed-security-services"],
  "country": "US",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 2
}
```
Output
Results are stored in the default dataset (see the Actor Output tab in Apify for the dataset items link).
Each item is one company. Typical fields:
| Field | Description |
|---|---|
| `company_name` | Display name |
| `website` | Company site URL, when found on the detail page |
| `domain` | Hostname derived from `website` |
| `description` | Longer text from the detail page (truncated in code for safety) |
| `location` | Address / region text |
| `phone` | Phone number, if present (from `tel:` links) |
| `industry_tags` | Comma-separated category/tag strings from the page |
| `source_url` | Page URL used for this row |
| `directory_source` | Constant label identifying this directory |
| `date_scraped` | UTC date (`YYYY-MM-DD`) |
Field presence depends on `scrapeDetailPages` and what the site exposes for each company.
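For orientation, one dataset item might look like the example below. All values are made up for illustration; actual values depend on the scraped page:

```json
{
  "company_name": "Acme Security",
  "website": "https://acme-security.example",
  "domain": "acme-security.example",
  "description": "Acme Security provides managed detection and response services...",
  "location": "Austin, Texas, USA",
  "phone": "+1 512 555 0100",
  "industry_tags": "Cloud Security, Managed Security Services",
  "source_url": "https://www.cybersecurityintelligence.com/...",
  "directory_source": "cybersecurityintelligence.com",
  "date_scraped": "2025-01-01"
}
```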
Deduplication
The directory lists the same supplier profile URL under multiple categories. Without dedupe you would get repeated rows for one company.
| Mode | Behavior |
|---|---|
| `scrapeDetailPages: true` | Immediately before each `push_data`, the Actor checks a normalized profile URL (scheme, host, path; UTM query params stripped). The first successful extraction for that URL is written to the dataset; later handler invocations for the same URL skip output and log `Skip duplicate profile output`. |
| `scrapeDetailPages: false` | Listing-only rows apply the same rule to the company link URL from the category page, so each company appears at most once per run. |
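The reserve-then-release gate around `push_data` could be sketched as follows. The set and function names are placeholders, not the Actor's actual identifiers:

```python
# Hypothetical sketch of the in-memory output gate. State lives only for the
# current run, matching the dedupe behavior described in this section.
emitted: set[str] = set()

def try_reserve(key: str) -> bool:
    """Reserve a normalized profile URL; return False if already emitted."""
    if key in emitted:
        return False
    emitted.add(key)
    return True

def release(key: str) -> None:
    """Undo a reservation (e.g. when push_data fails) so a retry can emit."""
    emitted.discard(key)
```

In a handler, `try_reserve` would run just before `push_data`, with `release` in the failure path so a Crawlee retry can still produce the row.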
Details:
- Normalization uses the same logic as the scraper's `clean_url` helper (e.g. trailing slashes and `utm_*` query parameters removed).
- Dedupe state is kept in memory for the current run only. A new Apify run starts with an empty set, so the default dataset for that run can contain one row per company again (expected for a fresh dataset).
- Reservations are released if `push_data` fails, so Crawlee retries can still emit a row for that profile URL.
- The Crawlee request queue may still drop duplicate detail URLs by URL key; the output gate is an extra guarantee when listing cards or retries could otherwise double-emit.
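The normalization rule described above could look roughly like this. It is a hypothetical reimplementation of the `clean_url` helper, not the Actor's actual code:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def clean_url(url: str) -> str:
    """Normalize a profile URL: lowercase host, strip trailing slash and utm_* params."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_")
    ]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment entirely; it never identifies a distinct profile.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(query), ""))
```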
How a run completes on Apify
The entrypoint is `python -m my_actor`, which calls `crawler.run()` once. Crawlee drains the request queue for that run: categories → listing pages → detail pages (if enabled). You do not need a shell loop on the platform for a full crawl.
For local development, `scripts/run_until_done.sh` can repeat `apify run` if you want to retry until the local queue reports zero pending requests (optional; see the script's header comments).
Local development
Requirements: Python 3.x, the Apify CLI, and Docker (so `apify run` uses the same image as production).
$ cd cybersecurity-intelligence-scraper
$ apify login
$ apify run
Optional full local loop:
$ ./scripts/run_until_done.sh
Environment variables used by the helper script: `MAX_ATTEMPTS` (default 200), `SLEEP_SECONDS` (default 5).
Deploy
From this directory:
$ apify push
Ensure `.actor/actor.json`, `input_schema.json`, `output_schema.json`, and `dataset_schema.json` stay valid; Apify validates them at build time.
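For orientation, a minimal `.actor/actor.json` might have the shape sketched below. The values here are placeholders; consult the Apify Actor specification for the authoritative field list:

```json
{
  "actorSpecification": 1,
  "name": "cybersecurity-intelligence-scraper",
  "version": "0.1",
  "input": "./input_schema.json",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```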
Legal and etiquette
Only run this Actor in compliance with the target site’s terms of service, robots.txt, and applicable law. Use reasonable concurrency; the defaults are conservative.
Project layout
| Path | Role |
|---|---|
| `my_actor/` | `main.py` (crawler setup, start URLs), `routes.py` (handlers) |
| `.actor/` | Actor manifest, input/output/dataset schemas |
| `Dockerfile` | `apify/actor-python-playwright` base, `CMD python -m my_actor` |
| `scripts/run_until_done.sh` | Optional local multi-attempt runner |