Cybersecurity Intelligence Directory Scraper

Scrapes the Cybersecurity Intelligence Supplier Directory (cybersecurityintelligence.com) for company profiles including name, website, description, location, phone, and category tags.

Pricing: from $0.001 / actor start
Developer: Jon Froemming (maintained by Community)

Apify Actor that scrapes the Cybersecurity Intelligence Supplier Directory (cybersecurityintelligence.com) using Python, Crawlee, and Playwright (Chromium).

It collects company listings from category pages, optionally follows each company’s detail page for richer fields (website, phone, tags), and writes structured rows to the default dataset.

This file is linked from .actor/actor.json as the Actor readme shown on the Apify platform.


Table of contents

  1. What it scrapes
  2. Input
  3. Output
  4. Deduplication
  5. How a run completes on Apify
  6. Local development
  7. Deploy
  8. Legal and etiquette
  9. Project layout

What it scrapes

| Phase | Description |
| --- | --- |
| Categories | If you do not pass categories, the Actor opens browse_categories.php, collects every supplier-directory category link, and enqueues them. Blog /category/ links are ignored so only real listing URLs are used. |
| Listings | For each category URL, it parses .listingsWrapper blocks (name, short description, address snippet) and follows pagination via ul.pagination. |
| Details (optional) | When scrapeDetailPages is true, each company link is enqueued as a detail request; the handler extracts full profile data and pushes one dataset item per company. |

Country filter: If country is set, a location segment is appended to category URLs (for example, US maps to location/usa/) using a small built-in code → slug map. Leave country empty to scrape all locations.
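As a rough sketch, the country filter can be modeled as a lookup table plus URL concatenation. The map entries and base URL below are illustrative assumptions; the Actor's actual built-in code → slug map may contain different entries.

```python
# Illustrative sketch of the country filter. COUNTRY_SLUGS entries and the
# base URL are assumptions; the Actor's real map may differ.
COUNTRY_SLUGS = {"US": "usa", "UK": "uk", "DE": "germany"}

def category_url(base: str, slug: str, country: str = "") -> str:
    """Build a category listing URL, appending a location segment
    when a known country code is given."""
    url = f"{base.rstrip('/')}/{slug}/"
    code = country.strip().upper()
    if code in COUNTRY_SLUGS:
        url += f"location/{COUNTRY_SLUGS[code]}/"
    return url
```

An unknown or empty country code simply leaves the URL unfiltered, matching the "leave empty for worldwide" behavior.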


Input

Configure the Actor in the Apify console or via JSON input. All fields are optional unless noted.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| categories | string[] | [] | Category slugs only (e.g. cloud-security, managed-security-services). Empty = scrape all categories from the browse index. |
| country | string | "" | Filter by country code (US, UK, DE, …) or leave empty for worldwide. |
| maxPagesPerCategory | integer | 0 | Cap listing pages per category. 0 = unlimited (follow "next" until none). Max allowed in schema: 500. |
| scrapeDetailPages | boolean | true | true: visit each company detail page (website, phone, tags). false: only data visible on listing cards (faster, fewer fields). |
| maxConcurrency | integer | 3 | Playwright concurrency (1–10). Raise carefully on Apify; higher values increase load on the target site and memory use. |

Example input (full directory, details on)

```json
{
  "categories": [],
  "country": "",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 3
}
```

Example input (specific categories, US only)

```json
{
  "categories": ["cloud-security", "managed-security-services"],
  "country": "US",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 2
}
```

Output

Results are stored in the default dataset (see the Actor Output tab in Apify for the dataset items link).

Each item is one company. Typical fields:

| Field | Description |
| --- | --- |
| company_name | Display name |
| website | Company site URL when found on the detail page |
| domain | Hostname derived from website |
| description | Longer text from the detail page (truncated in code for safety) |
| location | Address / region text |
| phone | Phone number if present (tel: links) |
| industry_tags | Comma-separated category/tag strings from the page |
| source_url | Page URL used for this row |
| directory_source | Constant label identifying this directory |
| date_scraped | UTC date (YYYY-MM-DD) |

Field presence depends on scrapeDetailPages and what the site exposes for each company.
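Two of the derived fields above (domain and date_scraped) are straightforward to compute. The helper below is a hypothetical sketch of that derivation; the Actor's implementation may differ in detail (for example, in how www. prefixes are handled).

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

def derived_fields(website: str) -> dict:
    """Hypothetical helper: derive the domain from the website URL and
    stamp the row with the current UTC date in YYYY-MM-DD form."""
    host = (urlparse(website).hostname or "").removeprefix("www.")
    return {
        "domain": host,
        "date_scraped": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
    }
```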


Deduplication

The directory lists the same supplier profile URL under multiple categories. Without dedupe you would get repeated rows for one company.

| Mode | Behavior |
| --- | --- |
| scrapeDetailPages: true | Immediately before each push_data, the Actor checks a normalized profile URL (scheme, host, path; UTM query params stripped). The first successful extraction for that URL is written to the dataset; later handler invocations for the same URL skip output and log "Skip duplicate profile output". |
| scrapeDetailPages: false | Listing-only rows apply the same rule to the company link URL from the category page, so each company appears at most once per run. |

Details:

  • Normalization uses the same logic as the scraper’s clean_url helper (e.g. trailing slashes, utm_* removed).
  • Dedupe state is in memory for the current run only. A new Apify run starts with an empty set, so the default dataset for that run can contain one row per company again (expected for a fresh dataset).
  • Reservations are released if push_data fails so Crawlee retries can still emit a row for that profile URL.
  • The Crawlee request queue may still drop duplicate detail URLs by URL key; the output gate is an extra guarantee when listing cards or retries could otherwise double-emit.
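The normalization and in-memory gate described above can be sketched as follows. This mirrors the documented behavior of the clean_url helper (lowercased host, trailing slash and utm_* params dropped) but is not the scraper's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def clean_url(url: str) -> str:
    """Normalize a profile URL as described: keep scheme/host/path,
    drop utm_* query params and any trailing slash. Sketch only; the
    scraper's real clean_url helper may differ in detail."""
    p = urlparse(url)
    query = urlencode(
        [(k, v) for k, v in parse_qsl(p.query) if not k.startswith("utm_")]
    )
    return urlunparse((p.scheme, p.netloc.lower(), p.path.rstrip("/"), "", query, ""))

seen: set[str] = set()

def should_emit(url: str) -> bool:
    """Per-run, in-memory dedupe gate checked right before push_data."""
    key = clean_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Because the seen set lives in memory, it resets on every run, which matches the note above that a fresh run can emit one row per company again.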

How a run completes on Apify

The entrypoint is python -m my_actor, which calls crawler.run() once. Crawlee drains the request queue for that run: categories → listing pages → detail pages (if enabled). You do not need a shell loop on the platform for a full crawl.
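The drain order can be illustrated with a toy queue simulation. The labels and page counts below are made up purely for illustration; real crawling is done by Crawlee's request queue, not this code.

```python
from collections import deque

def simulate_run(categories: list[str], pages_per_category: int = 2,
                 scrape_details: bool = True) -> list[tuple[str, str]]:
    """Toy model of one crawler.run() pass: the queue is seeded with
    category requests, each handler may enqueue more work, and the run
    ends when the queue is empty. Not the Actor's real code."""
    queue = deque(("category", c) for c in categories)
    handled = []
    while queue:
        kind, label = queue.popleft()
        handled.append((kind, label))
        if kind == "category":
            queue.extend(("listing", f"{label}/page-{i}")
                         for i in range(1, pages_per_category + 1))
        elif kind == "listing" and scrape_details:
            queue.append(("detail", f"{label}/company-profile"))
    return handled
```

When the queue empties, the run is complete; no outer retry loop is needed on the platform.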

For local development, scripts/run_until_done.sh can repeat apify run if you want to retry until the local queue reports zero pending requests (optional; see script header comments).


Local development

Requirements: Python 3.x, Apify CLI, Docker (for apify run with the same image as production).

```bash
cd cybersecurity-intelligence-scraper
apify login
apify run
```

Optional full local loop:

```bash
./scripts/run_until_done.sh
```

Environment variables used by the helper script: MAX_ATTEMPTS (default 200), SLEEP_SECONDS (default 5).


Deploy

From this directory:

```bash
apify push
```

Ensure .actor/actor.json, input_schema.json, output_schema.json, and dataset_schema.json stay valid; Apify validates them at build time.


Legal and etiquette

Only run this Actor in compliance with the target site’s terms of service, robots.txt, and applicable law. Use reasonable concurrency; the defaults are conservative.


Project layout

| Path | Role |
| --- | --- |
| my_actor/ | main.py (crawler setup, start URLs); routes.py (handlers) |
| .actor/ | Actor manifest, input/output/dataset schemas |
| Dockerfile | apify/actor-python-playwright base image; CMD python -m my_actor |
| scripts/run_until_done.sh | Optional local multi-attempt runner |