# Cybersecurity Intelligence Directory Scraper (`jonfr0/cybersecurity-intelligence-scraper`) Actor

Scrapes the Cybersecurity Intelligence Supplier Directory (cybersecurityintelligence.com) for company profiles including name, website, description, location, phone, and category tags.

- **URL**: https://apify.com/jonfr0/cybersecurity-intelligence-scraper.md
- **Developed by:** [Jon Froemming](https://apify.com/jonfr0) (community)
- **Categories:** Lead generation, Agents
- **Stats:** 2 total users, 1 monthly user, 0.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $0.001 / Actor start

This Actor is billed per event and per usage: you pay a fixed price for specific events, plus the cost of Apify platform usage.
Because this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Cybersecurity Intelligence Directory Scraper

Apify Actor that scrapes the **Cybersecurity Intelligence** [Supplier Directory](https://www.cybersecurityintelligence.com) (`cybersecurityintelligence.com`) using **Python**, **Crawlee**, and **Playwright** (Chromium).

It collects company listings from category pages, optionally follows each company’s **detail page** for richer fields (website, phone, tags), and writes structured rows to the **default dataset**.

This file is linked from [`.actor/actor.json`](.actor/actor.json) as the Actor **readme** shown on the Apify platform.

---

### Table of contents

1. [What it scrapes](#what-it-scrapes)
2. [Input](#input)
3. [Output](#output)
4. [Deduplication](#deduplication)
5. [How a run completes on Apify](#how-a-run-completes-on-apify)
6. [Local development](#local-development)
7. [Deploy](#deploy)
8. [Legal and etiquette](#legal-and-etiquette)
9. [Project layout](#project-layout)

---

### What it scrapes

| Phase | Description |
|--------|-------------|
| **Categories** | If you do not pass `categories`, the Actor opens `browse_categories.php`, collects every supplier-directory category link, and enqueues them. Blog `/category/` links are ignored so only real listing URLs are used. |
| **Listings** | For each category URL, it parses `.listingsWrapper` blocks (name, short description, address snippet) and follows pagination via `ul.pagination`. |
| **Details** (optional) | When `scrapeDetailPages` is `true`, each company link is enqueued as a detail request; the handler extracts full profile data and pushes one dataset item per company. |

**Country filter:** If `country` is set, a location segment is appended to category URLs (for example `US` → `location/usa/`) using a small built-in code → slug map. Leave `country` empty to scrape all locations.
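
The country filter could look roughly like the sketch below. It is not the Actor's actual code: `with_location` and `COUNTRY_SLUGS` are hypothetical names, only the `US` → `usa` pair comes from this README, and raising on an unknown code is an assumed behavior.

```python
# Only the US -> "usa" pair is documented in this README; the Actor's
# real code -> slug map is not shown, so no other entries are guessed here.
COUNTRY_SLUGS = {"US": "usa"}


def with_location(category_url: str, country: str, slugs: dict = COUNTRY_SLUGS) -> str:
    """Append a `location/<slug>/` segment when a known country code is given."""
    code = country.strip().upper()
    if not code:
        # Empty filter: leave the category URL untouched (all locations).
        return category_url
    slug = slugs.get(code)
    if slug is None:
        # Assumption: fail loudly instead of silently scraping worldwide.
        raise ValueError(f"No location slug known for country code {code!r}")
    return category_url.rstrip("/") + f"/location/{slug}/"
```

Here `category_url` is whatever listing URL the crawler has already resolved; the sketch deliberately does not assume how the site structures those URLs.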

---

### Input

Configure the Actor in the Apify console or via JSON input. All fields are optional unless noted.

| Field | Type | Default | Description |
|--------|------|---------|-------------|
| `categories` | `string[]` | `[]` | Category **slugs** only (e.g. `cloud-security`, `managed-security-services`). Empty = scrape **all** categories from the browse index. |
| `country` | `string` | `""` | Filter by country code (`US`, `UK`, `DE`, …) or leave empty for worldwide. |
| `maxPagesPerCategory` | `integer` | `0` | Cap listing pages **per category**. `0` = unlimited (follow “next” until none). Max allowed in schema: `500`. |
| `scrapeDetailPages` | `boolean` | `true` | `true`: visit each company detail page (website, phone, tags). `false`: only data visible on listing cards (faster, fewer fields). |
| `maxConcurrency` | `integer` | `3` | Playwright concurrency (`1`–`10`). Raise carefully on Apify; higher values increase load on the target site and memory use. |
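
If you build the input programmatically, it can help to apply the documented defaults and bounds before starting a run. The helper below is a minimal sketch; `normalize_input` is a hypothetical name and is not part of the Actor or the Apify client. The platform validates input against the schema as well, so this only surfaces mistakes earlier.

```python
from typing import Optional

# Defaults and limits copied from the input table above.
DEFAULTS = {
    "categories": [],
    "country": "",
    "maxPagesPerCategory": 0,
    "scrapeDetailPages": True,
    "maxConcurrency": 3,
}


def normalize_input(raw: Optional[dict]) -> dict:
    """Merge user input over the defaults and check the documented ranges."""
    merged = {**DEFAULTS, **(raw or {})}
    if not 0 <= merged["maxPagesPerCategory"] <= 500:
        raise ValueError("maxPagesPerCategory must be between 0 and 500")
    if not 1 <= merged["maxConcurrency"] <= 10:
        raise ValueError("maxConcurrency must be between 1 and 10")
    # Drop blank slugs; build a new list so DEFAULTS is never mutated.
    merged["categories"] = [s.strip() for s in merged["categories"] if s.strip()]
    return merged
```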

#### Example input (full directory, details on)

```json
{
  "categories": [],
  "country": "",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 3
}
````

#### Example input (specific categories, US only)

```json
{
  "categories": ["cloud-security", "managed-security-services"],
  "country": "US",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 2
}
```

***

### Output

Results are stored in the **default dataset** (see the Actor **Output** tab in Apify for the dataset items link).

Each item is one company. Typical fields:

| Field | Description |
|--------|-------------|
| `company_name` | Display name |
| `website` | Company site URL when found on the detail page |
| `domain` | Hostname derived from `website` |
| `description` | Longer text from the detail page (truncated in code for safety) |
| `location` | Address / region text |
| `phone` | Phone if present (`tel:` links) |
| `industry_tags` | Comma-separated category/tag strings from the page |
| `source_url` | Page URL used for this row |
| `directory_source` | Constant label identifying this directory |
| `date_scraped` | UTC date (`YYYY-MM-DD`) |

Field presence depends on `scrapeDetailPages` and what the site exposes for each company.
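
Because fields can be missing, downstream code is safer if it treats everything except the name as optional. The following is a consumer-side sketch, not part of the Actor: `Company` and `parse_row` are hypothetical names, and the sample values in the usage note are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Company:
    company_name: str
    website: Optional[str] = None
    domain: Optional[str] = None
    location: Optional[str] = None
    phone: Optional[str] = None
    industry_tags: List[str] = field(default_factory=list)


def parse_row(item: dict) -> Company:
    """Turn one dataset item into a Company, tolerating absent fields."""
    raw_tags = item.get("industry_tags") or ""
    # `industry_tags` is documented as a comma-separated string.
    tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
    return Company(
        company_name=item.get("company_name") or "",
        website=item.get("website") or None,
        domain=item.get("domain") or None,
        location=item.get("location") or None,
        phone=item.get("phone") or None,
        industry_tags=tags,
    )
```

For example, `parse_row({"company_name": "Acme", "industry_tags": "Cloud Security, SIEM"})` yields a `Company` with two tags and `None` for every field the row did not include.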

***

### Deduplication

The directory lists the **same supplier profile URL** under multiple categories. Without dedupe you would get repeated rows for one company.

| Mode | Behavior |
|------|------------|
| **`scrapeDetailPages: true`** | Immediately before each `push_data`, the Actor checks a **normalized profile URL** (scheme, host, path; UTM query params stripped). The **first** successful extraction for that URL is written to the dataset; later handler invocations for the same URL skip output and log `Skip duplicate profile output`. |
| **`scrapeDetailPages: false`** | Listing-only rows use the same rule on the **company link URL** from the category page so each company appears **at most once** per run. |

Details:

- Normalization uses the same logic as the scraper’s `clean_url` helper (e.g. trailing slashes, `utm_*` removed).
- Dedupe state is **in memory for the current run only**. A **new** Apify run starts with an empty set, so the default dataset for that run can contain one row per company again (expected for a fresh dataset).
- Reservations are **released** if `push_data` fails so Crawlee retries can still emit a row for that profile URL.
- The Crawlee **request queue** may still drop duplicate detail URLs by URL key; the output gate is an extra guarantee when listing cards or retries could otherwise double-emit.
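
The reserve-then-release pattern described above can be sketched as follows. This is an illustration under assumptions, not the Actor's code: `normalize_profile_url`, `OutputGate`, and `emit_once` are hypothetical names, the real `clean_url` helper may differ (for instance in how it treats non-UTM query parameters, which this sketch keeps), and `push` stands in for the asynchronous `push_data` call as a plain synchronous callable.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_profile_url(url: str) -> str:
    """Lowercase scheme/host, drop utm_* params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class OutputGate:
    """In-memory set of profile URLs already written during this run."""

    def __init__(self) -> None:
        self._seen = set()

    def reserve(self, url: str) -> bool:
        key = normalize_profile_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def release(self, url: str) -> None:
        self._seen.discard(normalize_profile_url(url))


def emit_once(gate: OutputGate, url: str, item: dict, push) -> bool:
    """Push `item` unless this profile URL was already emitted."""
    if not gate.reserve(url):
        return False  # duplicate: skip output
    try:
        push(item)
    except Exception:
        gate.release(url)  # let a retry emit the row later
        raise
    return True
```

Reserving before the push, rather than recording after it, keeps two concurrent handlers for the same profile from both passing the check.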

***

### How a run completes on Apify

The entrypoint is `python -m my_actor`, which calls **`crawler.run()`** once. Crawlee drains the **request queue** for that run: categories → listing pages → detail pages (if enabled). You do **not** need a shell loop on the platform for a full crawl.

For **local** development, `scripts/run_until_done.sh` can repeat `apify run` if you want to retry until the local queue reports zero pending requests (optional; see script header comments).

***

### Local development

**Requirements:** Python 3.x, [Apify CLI](https://docs.apify.com/cli/), Docker (for `apify run` with the same image as production).

```bash
cd cybersecurity-intelligence-scraper
apify login
apify run
```

Optional full local loop:

```bash
./scripts/run_until_done.sh
```

Environment variables used by the helper script: `MAX_ATTEMPTS` (default `200`), `SLEEP_SECONDS` (default `5`).
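
The script itself is not reproduced in this README, so the following is only a sketch of the retry loop it describes, written in Python for illustration. `run_until_done`, `run_once`, and `pending_requests` are hypothetical names; how the real script launches a run and counts pending requests is not shown here, so both are passed in as callables. Only the two environment variables and their defaults come from the text above.

```python
import os
import time
from typing import Callable, Optional


def run_until_done(
    run_once: Callable[[], None],
    pending_requests: Callable[[], int],
    max_attempts: Optional[int] = None,
    sleep_seconds: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Repeat a run until the queue is empty; return the attempts used."""
    if max_attempts is None:
        max_attempts = int(os.environ.get("MAX_ATTEMPTS", "200"))
    if sleep_seconds is None:
        sleep_seconds = float(os.environ.get("SLEEP_SECONDS", "5"))
    for attempt in range(1, max_attempts + 1):
        run_once()  # e.g. one local run of the Actor
        if pending_requests() == 0:
            return attempt
        sleep(sleep_seconds)
    raise RuntimeError(f"Queue still not empty after {max_attempts} attempts")
```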

***

### Deploy

From this directory:

```bash
apify push
```

Ensure `.actor/actor.json`, `input_schema.json`, `output_schema.json`, and `dataset_schema.json` stay valid; Apify validates them at build time.

***

### Legal and etiquette

Only run this Actor in compliance with the target site’s **terms of service**, **robots.txt**, and applicable law. Use reasonable concurrency; the defaults are conservative.

***

### Project layout

| Path | Role |
|------|------|
| `my_actor/` | `main.py` (crawler setup, start URLs), `routes.py` (handlers) |
| `.actor/` | Actor manifest, input/output/dataset schemas |
| `Dockerfile` | `apify/actor-python-playwright` base, `CMD python -m my_actor` |
| `scripts/run_until_done.sh` | Optional local multi-attempt runner |

# Actor input Schema

## `categories` (type: `array`):

Specific category slugs to scrape (e.g. 'cloud-security', 'managed-security-services'). Leave empty to scrape all categories.

## `country` (type: `string`):

Filter companies by country code (e.g. 'US', 'UK', 'DE'). Leave empty to scrape all countries.

## `maxPagesPerCategory` (type: `integer`):

Maximum listing pages per category (~15 companies per page). Use 0 for unlimited (follow pagination until no next page).
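
The `0` = unlimited rule, and the rough size of a capped crawl, can be expressed as in the sketch below. Both function names are hypothetical, and the figure of 15 companies per page is the approximate value quoted above, not a guarantee.

```python
from typing import Optional


def should_follow_next(pages_done: int, max_pages: int) -> bool:
    """Decide whether to enqueue the next listing page of a category."""
    # 0 means unlimited: keep going while the site offers a next page.
    return max_pages == 0 or pages_done < max_pages


def estimated_company_cap(max_pages: int, per_page: int = 15) -> Optional[int]:
    """Approximate upper bound on companies per category (None if unlimited)."""
    return None if max_pages == 0 else max_pages * per_page
```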

## `scrapeDetailPages` (type: `boolean`):

If true, visits each company's detail page for full data (website, phone, etc.). If false, only extracts data visible on category listing pages.

## `maxConcurrency` (type: `integer`):

Maximum number of concurrent browser pages.

## Actor input object example

```json
{
  "categories": [],
  "country": "",
  "maxPagesPerCategory": 0,
  "scrapeDetailPages": true,
  "maxConcurrency": 3
}
```

# Actor output Schema

## `companies` (type: `string`):

Rows from the default dataset (company profiles from the supplier directory).

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("jonfr0/cybersecurity-intelligence-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("jonfr0/cybersecurity-intelligence-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call jonfr0/cybersecurity-intelligence-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=jonfr0/cybersecurity-intelligence-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Cybersecurity Intelligence Directory Scraper",
        "description": "Scrapes the Cybersecurity Intelligence Supplier Directory (cybersecurityintelligence.com) for company profiles including name, website, description, location, phone, and category tags.",
        "version": "0.1",
        "x-build-id": "4kIcWePDKJKM69018"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/jonfr0~cybersecurity-intelligence-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-jonfr0-cybersecurity-intelligence-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/jonfr0~cybersecurity-intelligence-scraper/runs": {
            "post": {
                "operationId": "runs-sync-jonfr0-cybersecurity-intelligence-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/jonfr0~cybersecurity-intelligence-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-jonfr0-cybersecurity-intelligence-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "categories": {
                        "title": "Categories",
                        "type": "array",
                        "description": "Specific category slugs to scrape (e.g. 'cloud-security', 'managed-security-services'). Leave empty to scrape all categories.",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    },
                    "country": {
                        "title": "Country Filter",
                        "type": "string",
                        "description": "Filter companies by country code (e.g. 'US', 'UK', 'DE'). Leave empty to scrape all countries.",
                        "default": ""
                    },
                    "maxPagesPerCategory": {
                        "title": "Max Pages Per Category",
                        "minimum": 0,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum listing pages per category (~15 companies per page). Use 0 for unlimited (follow pagination until no next page).",
                        "default": 0
                    },
                    "scrapeDetailPages": {
                        "title": "Scrape Detail Pages",
                        "type": "boolean",
                        "description": "If true, visits each company's detail page for full data (website, phone, etc.). If false, only extracts data visible on category listing pages.",
                        "default": true
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum number of concurrent browser pages.",
                        "default": 3
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
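
For projects that cannot use the client libraries, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. The sketch below only builds the request; `build_run_sync_request` is a hypothetical helper name, and the assumption that the response body is a JSON array of dataset items follows the endpoint summary rather than a response schema, since the specification does not define one.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "jonfr0~cybersecurity-intelligence-scraper"


def build_run_sync_request(token: str, run_input: dict) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it (not executed here, requires a valid token):
#   from urllib.request import urlopen
#   with urlopen(build_run_sync_request("<YOUR_API_TOKEN>", {"country": "US"})) as resp:
#       items = json.load(resp)
```

A synchronous call holds the HTTP connection open for the whole run, so for a full-directory crawl the `/runs` endpoint followed by a later dataset read is likely the better fit.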
