# HHS Data Breach Scraper (`automation-lab/hhs-data-breach-scraper`) Actor

Extract public HIPAA breach reports from the HHS OCR portal for compliance monitoring, cybersecurity research, and legal lead workflows.

- **URL**: https://apify.com/automation-lab/hhs-data-breach-scraper.md
- **Developed by:** [Stas Persiianenko](https://apify.com/automation-lab) (community)
- **Categories:** Other
- **Stats:** 3 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.02 / 1,000 breach reports saved

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## HHS Data Breach Scraper

Extract public HIPAA breach report rows from the HHS OCR Breach Portal.

### What does HHS Data Breach Scraper do?

HHS Data Breach Scraper collects rows from the public U.S. Department of Health and Human Services Office for Civil Rights breach portal.
It turns the public HIPAA breach report table into clean JSON records for monitoring, compliance dashboards, legal lead generation, and cybersecurity research.

### Who is it for?

- 🏥 Healthcare compliance teams monitoring newly reported HIPAA breaches.
- 🛡️ Cybersecurity vendors tracking healthcare incidents and affected organizations.
- ⚖️ Legal and insurance teams building breach-response lead lists.
- 📊 Data teams maintaining internal breach intelligence dashboards.
- 🧾 Consultants preparing recurring reports for covered entities and business associates.

### Why use this Actor?

The HHS OCR portal is public, but the data is exposed through a JSF/PrimeFaces table that is awkward to script against by hand.
This Actor handles the session, ViewState token, and report-table pagination, then emits typed records that are ready for export.
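To illustrate what handling a ViewState token involves, here is a minimal sketch that pulls the hidden `javax.faces.ViewState` field out of a JSF page. The sample HTML, the form id, and the helper names are hypothetical, and this is not the Actor's own implementation; it only shows the general JSF convention of echoing a hidden token back with each postback.

```python
from html.parser import HTMLParser
from typing import Optional


class ViewStateParser(HTMLParser):
    """Collects the value of the first javax.faces.ViewState hidden input."""

    def __init__(self) -> None:
        super().__init__()
        self.view_state: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <input> elements can carry the token; stop after the first hit.
        if tag != "input" or self.view_state is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "javax.faces.ViewState":
            self.view_state = attr_map.get("value")


def extract_view_state(html: str) -> Optional[str]:
    """Return the ViewState token found in the page, or None if absent."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


# Hypothetical page fragment; the real portal markup may differ.
SAMPLE_HTML = """
<form id="ocrForm" method="post">
  <input type="hidden" name="javax.faces.ViewState" value="-123456789:987654321" />
</form>
"""

token = extract_view_state(SAMPLE_HTML)
```

A client would typically send this token back as a form field on each pagination request within the same session.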

### Data source

The Actor uses the public HHS OCR Breach Portal:

`https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf`

No login, private account, or captcha is required for the public report table.

### Data fields

| Field | Description |
| --- | --- |
| `coveredEntity` | Name of the covered entity in the HHS table |
| `state` | State or territory abbreviation |
| `coveredEntityType` | Covered entity type such as Healthcare Provider or Business Associate |
| `individualsAffected` | Number of affected individuals as an integer |
| `breachSubmissionDate` | Submission date normalized to `YYYY-MM-DD` |
| `breachSubmissionDateRaw` | Original HHS `MM/DD/YYYY` date |
| `typeOfBreach` | Breach type list |
| `locationOfBreachedInformation` | Breached information location list |
| `businessAssociatePresent` | Boolean value from the HHS hidden column |
| `webDescription` | Optional web description column when HHS provides it |
| `hhsBreachId` | HHS table row key |
| `sourceUrl` | HHS report page URL |
| `scrapedAt` | Timestamp when the row was saved |
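The relationship between `breachSubmissionDateRaw` and `breachSubmissionDate` can be sketched as follows. The function name is hypothetical and the Actor's internal parsing may differ; the sketch only shows one way to turn `MM/DD/YYYY` into `YYYY-MM-DD`.

```python
from datetime import datetime


def normalize_hhs_date(raw: str) -> str:
    """Convert an HHS MM/DD/YYYY date string to ISO YYYY-MM-DD."""
    # strptime raises ValueError on malformed input, which surfaces bad rows early.
    return datetime.strptime(raw.strip(), "%m/%d/%Y").date().isoformat()


normalized = normalize_hhs_date("06/02/2026")
```

ISO dates sort correctly as plain strings, which is why the normalized field is convenient for chronological ordering.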

### How much does it cost to scrape HHS data breach reports?

The Actor uses pay-per-event pricing.
There is a small start fee for each run and a per-record fee for each breach report saved.
Use a small `maxItems` value for quick checks and larger values for scheduled backfills.

### Input options

- `maxItems` — maximum number of breach rows to save.
- `startPage` — zero-based HHS report page to start from.
- `state` — optional state abbreviation filter.
- `coveredEntityQuery` — optional case-insensitive covered-entity name filter.
- `includeWebDescription` — include the hidden web description field when available.

### Example input

```json
{
  "maxItems": 100,
  "startPage": 0,
  "state": "",
  "coveredEntityQuery": "",
  "includeWebDescription": true
}
````

### Example output

```json
{
  "coveredEntity": "JASON R EGBERT OD PC",
  "state": "WA",
  "coveredEntityType": "Healthcare Provider",
  "individualsAffected": 1225,
  "breachSubmissionDate": "2026-06-02",
  "breachSubmissionDateRaw": "06/02/2026",
  "typeOfBreach": ["Hacking/IT Incident"],
  "locationOfBreachedInformation": ["Network Server"],
  "businessAssociatePresent": true,
  "webDescription": null,
  "hhsBreachId": "1453895",
  "sourceUrl": "https://ocrportal.hhs.gov/ocr/breach/breach_report_hip.jsf",
  "scrapedAt": "2026-06-21T03:04:29.531Z"
}
```
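If you load records into typed code, a small dataclass can mirror the output shape shown above. This is a sketch: the class name, the snake_case attribute names, and the `from_item` helper are assumptions made for illustration, not part of the Actor.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class BreachReport:
    hhs_breach_id: str
    covered_entity: str
    state: str
    covered_entity_type: str
    individuals_affected: int
    breach_submission_date: str  # normalized YYYY-MM-DD
    type_of_breach: List[str]
    location_of_breached_information: List[str]
    business_associate_present: bool
    web_description: Optional[str]

    @classmethod
    def from_item(cls, item: Dict[str, Any]) -> "BreachReport":
        """Build a record from one dataset item, failing loudly on missing keys."""
        return cls(
            hhs_breach_id=str(item["hhsBreachId"]),
            covered_entity=item["coveredEntity"],
            state=item["state"],
            covered_entity_type=item["coveredEntityType"],
            individuals_affected=int(item["individualsAffected"]),
            breach_submission_date=item["breachSubmissionDate"],
            type_of_breach=list(item.get("typeOfBreach") or []),
            location_of_breached_information=list(
                item.get("locationOfBreachedInformation") or []
            ),
            business_associate_present=bool(item["businessAssociatePresent"]),
            web_description=item.get("webDescription"),
        )


# Subset of the example output record from this README.
example_item = {
    "coveredEntity": "JASON R EGBERT OD PC",
    "state": "WA",
    "coveredEntityType": "Healthcare Provider",
    "individualsAffected": 1225,
    "breachSubmissionDate": "2026-06-02",
    "typeOfBreach": ["Hacking/IT Incident"],
    "locationOfBreachedInformation": ["Network Server"],
    "businessAssociatePresent": True,
    "webDescription": None,
    "hhsBreachId": "1453895",
}
report = BreachReport.from_item(example_item)
```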

### How to run

1. Open the Actor on Apify.
2. Set `maxItems` to the number of breach rows you need.
3. Optionally add a `state` or `coveredEntityQuery` filter.
4. Start the run.
5. Export the dataset as JSON, CSV, Excel, or via API.

### Monitoring workflow

Schedule the Actor daily or weekly with `maxItems` set to 100 or 200.
Compare new `hhsBreachId` values against your previous dataset to detect newly disclosed breach reports.
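The comparison step can be sketched like this, assuming you keep the previously seen IDs in your own storage. The function name and the sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def find_new_reports(
    previous_ids: Iterable[str], items: Iterable[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Return items whose hhsBreachId was not seen in earlier runs."""
    seen = set(previous_ids)
    new_items = []
    for item in items:
        breach_id = item.get("hhsBreachId")
        # Skip rows without an ID and rows already seen (including repeats in this run).
        if breach_id and breach_id not in seen:
            seen.add(breach_id)
            new_items.append(item)
    return new_items


# Hypothetical data: one known ID and one new ID.
latest_run = [
    {"hhsBreachId": "1001", "coveredEntity": "Example Clinic"},
    {"hhsBreachId": "1002", "coveredEntity": "Example Hospital"},
]
new_reports = find_new_reports({"1001"}, latest_run)
```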

### Compliance workflow

Compliance teams can use the output to enrich internal registers with affected-count totals, breach type, covered entity type, and submission date.
The normalized fields reduce manual cleanup before loading the data into spreadsheets or BI tools.
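As one possible way to produce affected-count totals for a register, the sketch below groups records by `coveredEntityType`. The function name and sample rows are hypothetical; adapt the grouping key to whatever your register needs.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def summarize_by_entity_type(items: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count reports and sum individualsAffected per covered entity type."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"reports": 0, "individualsAffected": 0}
    )
    for item in items:
        key = item.get("coveredEntityType") or "Unknown"
        summary[key]["reports"] += 1
        summary[key]["individualsAffected"] += int(item.get("individualsAffected") or 0)
    return dict(summary)


# Hypothetical rows for illustration only.
rows = [
    {"coveredEntityType": "Healthcare Provider", "individualsAffected": 1225},
    {"coveredEntityType": "Healthcare Provider", "individualsAffected": 500},
    {"coveredEntityType": "Business Associate", "individualsAffected": 9000},
]
totals = summarize_by_entity_type(rows)
```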

### Cybersecurity workflow

Security vendors can monitor healthcare breach disclosures, prioritize incidents by affected individuals, and identify covered entities that may need response services.
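Prioritizing by affected individuals can be as simple as the following sketch; the function name and sample rows are hypothetical.

```python
import heapq
from typing import Any, Dict, Iterable, List


def top_incidents(items: Iterable[Dict[str, Any]], n: int = 10) -> List[Dict[str, Any]]:
    """Return the n records with the largest individualsAffected, largest first."""
    return heapq.nlargest(n, items, key=lambda item: item.get("individualsAffected") or 0)


# Hypothetical rows for illustration only.
incidents = [
    {"coveredEntity": "A", "individualsAffected": 700},
    {"coveredEntity": "B", "individualsAffected": 52000},
    {"coveredEntity": "C", "individualsAffected": 1225},
]
largest_two = top_incidents(incidents, n=2)
```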

### Lead generation workflow

Legal, insurance, and consulting teams can filter by state or entity name, then combine the results with CRM enrichment and outreach tools.

### Tips

- Start with `maxItems: 100` for the newest portal page.
- Use `startPage` for older pages when backfilling.
- Keep scheduled runs conservative; HHS is a public government portal.
- Use `hhsBreachId` to de-duplicate records across runs.
- Use `breachSubmissionDate` for chronological sorting.
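The last two tips can be combined into one merge step. This is a sketch with hypothetical names and data; it assumes every record carries `hhsBreachId` and the normalized `breachSubmissionDate`.

```python
from typing import Any, Dict, Iterable, List


def merge_runs(*runs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge several runs, keep one record per hhsBreachId, newest submissions first."""
    by_id: Dict[str, Dict[str, Any]] = {}
    for items in runs:
        for item in items:
            # A later run overwrites an earlier copy of the same breach ID.
            by_id[item["hhsBreachId"]] = item
    # YYYY-MM-DD strings sort chronologically without date parsing.
    return sorted(by_id.values(), key=lambda item: item["breachSubmissionDate"], reverse=True)


# Hypothetical runs with one overlapping record.
run_a = [{"hhsBreachId": "1", "breachSubmissionDate": "2026-05-30"}]
run_b = [
    {"hhsBreachId": "1", "breachSubmissionDate": "2026-05-30"},
    {"hhsBreachId": "2", "breachSubmissionDate": "2026-06-02"},
]
merged = merge_runs(run_a, run_b)
```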

### Limitations

The Actor extracts the public report table as provided by HHS.
If HHS changes JSF component names or the table structure, the Actor may need an update.
Filters are applied after rows are fetched from the portal page, so a very narrow filter may match only a few rows per page; you may need a higher `maxItems` or several runs with different `startPage` values.
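To see why narrow filters can yield few rows, here is a sketch of post-fetch filtering. It assumes an exact match on `state` and a case-insensitive substring match on `coveredEntity`, following the input descriptions; the function name and the sample page are hypothetical, and the Actor's actual filter code is not shown in this README.

```python
from typing import Any, Dict


def matches_filters(item: Dict[str, Any], state: str = "", covered_entity_query: str = "") -> bool:
    """Return True if the row passes both optional filters."""
    if state and (item.get("state") or "").upper() != state.strip().upper():
        return False
    query = covered_entity_query.strip().lower()
    if query and query not in (item.get("coveredEntity") or "").lower():
        return False
    return True


# Hypothetical page of 100 fetched rows, only 3 of which are from WA.
page = [{"state": "WA", "coveredEntity": f"Clinic {i}"} for i in range(3)]
page += [{"state": "TX", "coveredEntity": f"Hospital {i}"} for i in range(97)]

kept = [row for row in page if matches_filters(row, state="wa")]
```

Here a full page of fetched rows produces only three saved records, which is the effect described above.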

### Integrations

- Export JSON to a data lake for breach intelligence.
- Send CSV output to a compliance analyst.
- Trigger alerts when a new `hhsBreachId` appears.
- Join by `coveredEntity` with enrichment providers.
- Use the Apify API to feed dashboards.

### API usage with Node.js

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const run = await client.actor('automation-lab/hhs-data-breach-scraper').call({
  maxItems: 100,
  includeWebDescription: true
});
console.log(run.defaultDatasetId);
```

### API usage with Python

```python
from apify_client import ApifyClient
import os

client = ApifyClient(os.environ['APIFY_TOKEN'])
run = client.actor('automation-lab/hhs-data-breach-scraper').call(run_input={
    'maxItems': 100,
    'includeWebDescription': True,
})
print(run['defaultDatasetId'])
```

### API usage with cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~hhs-data-breach-scraper/runs?token=$APIFY_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"maxItems":100,"includeWebDescription":true}'
```

### MCP usage

Use this Actor through the Apify MCP server with:

`https://mcp.apify.com/?tools=automation-lab/hhs-data-breach-scraper`

Claude Code setup:

```bash
claude mcp add --transport http apify-hhs-breaches "https://mcp.apify.com/?tools=automation-lab/hhs-data-breach-scraper"
```

Claude Desktop JSON config:

```json
{
  "mcpServers": {
    "apify-hhs-breaches": {
      "url": "https://mcp.apify.com/?tools=automation-lab/hhs-data-breach-scraper"
    }
  }
}
```

Example prompts:

- "Run the HHS data breach scraper for the newest 100 reports and summarize the largest incidents."
- "Find California HIPAA breach reports from the latest HHS OCR page."
- "Compare today's HHS breach IDs with yesterday's dataset."

### Dataset exports

Apify datasets can be downloaded as JSON, CSV, Excel, XML, RSS, or HTML.
For recurring monitoring, use the dataset API and store the latest `hhsBreachId` values in your own system.
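A dataset export request can be prepared as in the sketch below. It uses the Apify dataset items endpoint with its `format` query parameter and a bearer token header; the helper name, the placeholder dataset ID, and the placeholder token are hypothetical, and the request is only built here, not sent.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

EXPORT_FORMATS = {"json", "jsonl", "csv", "xlsx", "xml", "rss", "html"}


def build_export_request(dataset_id: str, fmt: str = "csv", token: Optional[str] = None) -> Request:
    """Build (but do not send) a GET request for a dataset export."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"Unsupported export format: {fmt}")
    url = (
        f"https://api.apify.com/v2/datasets/{quote(dataset_id, safe='')}/items"
        f"?{urlencode({'format': fmt})}"
    )
    # Sending the token in a header keeps it out of URLs and access logs.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return Request(url, headers=headers)


# Placeholder values; substitute the run's defaultDatasetId and your own token.
request = build_export_request("DATASET_ID", fmt="csv", token="YOUR_API_TOKEN")
```

Pass the result to `urllib.request.urlopen` to download the file.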

### Legality and responsible use

This Actor collects publicly available government records from the HHS OCR Breach Portal.
Always use the data responsibly and follow applicable privacy, compliance, and outreach rules.
The Actor does not bypass access controls or collect private account data.

### Troubleshooting

If a run returns fewer items than expected, increase `maxItems` or remove narrow filters.
If HHS changes its JSF table, open an issue with the run ID and logs so the extractor can be updated.

### Related scrapers

Automation Lab also builds public-data and compliance-focused Apify Actors.
Use this Actor alongside future security-header, trust-center, privacy, and government-record scrapers for broader risk monitoring.

### FAQ

#### Does this Actor need proxies?

No proxy is required for the public HHS OCR report table in normal operation.

#### Can it scrape all historical rows?

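Yes, within the input limits. The Actor paginates the PrimeFaces report table in 100-row batches, and `maxItems` is capped at 5000 per run, so split larger backfills into several runs with different `startPage` values.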
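For backfills larger than one run allows, the arithmetic can be sketched as follows. The 100-row page size and the 5000-row `maxItems` cap come from the input schema; the function name and the total of 12,000 rows are hypothetical, and the sketch assumes no filters and that a run starting at page `p` reads consecutive rows from row `p * 100` onward.

```python
import math
from typing import Dict, List

PAGE_SIZE = 100           # rows per portal page, per the input schema description
MAX_ITEMS_PER_RUN = 5000  # schema maximum for maxItems


def plan_backfill(total_rows: int, start_page: int = 0) -> List[Dict[str, int]]:
    """Split a backfill into run inputs that each respect the maxItems cap."""
    runs = []
    remaining = total_rows
    page = start_page
    while remaining > 0:
        batch = min(remaining, MAX_ITEMS_PER_RUN)
        runs.append({"startPage": page, "maxItems": batch})
        # Advance by the number of pages this batch consumes.
        page += math.ceil(batch / PAGE_SIZE)
        remaining -= batch
    return runs


plan = plan_backfill(12000)
```

For 12,000 rows this yields three runs starting at pages 0, 50, and 100.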

#### Can I filter by state?

Yes. Set `state` to a two-letter abbreviation such as `CA` or `TX`.

#### Can I monitor only new breaches?

Yes. Schedule the Actor and compare new runs against previously stored `hhsBreachId` values.

#### Is this official HHS data?

The data comes from the public HHS OCR breach report table, but the Actor itself is not affiliated with or endorsed by HHS.

### Changelog

- Initial version: HTTP-only JSF extraction for the public HHS OCR HIPAA breach report table.

# Actor input Schema

## `maxItems` (type: `integer`):

Maximum number of HHS OCR breach report rows to save. The portal currently returns 100 rows per page.

## `startPage` (type: `integer`):

Zero-based HHS report page to start from. Use 0 for the newest reports.

## `state` (type: `string`):

Optional two-letter US state or territory abbreviation. Filtering is applied after fetching rows from HHS.

## `coveredEntityQuery` (type: `string`):

Optional case-insensitive substring to match in the covered entity name.

## `includeWebDescription` (type: `boolean`):

Include the hidden Web Description field when the portal provides it.

## Actor input object example

```json
{
  "maxItems": 20,
  "startPage": 0,
  "includeWebDescription": true
}
```
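Before starting a run, you may want to check an input object on the client side. The sketch below applies the bounds and defaults listed in the OpenAPI input schema (`maxItems` 1 to 5000, default 20; `startPage` 0 to 1000, default 0; `includeWebDescription` default true). The function name is hypothetical, and the platform performs its own validation regardless.

```python
from typing import Any, Dict


def validate_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a normalized input object or raise ValueError on invalid values."""

    def bounded_int(name: str, default: int, low: int, high: int) -> int:
        value = raw.get(name, default)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        return value

    result: Dict[str, Any] = {
        "maxItems": bounded_int("maxItems", 20, 1, 5000),
        "startPage": bounded_int("startPage", 0, 0, 1000),
        "includeWebDescription": bool(raw.get("includeWebDescription", True)),
    }
    # Optional string filters are passed through only when non-empty.
    state = str(raw.get("state") or "").strip().upper()
    if state:
        result["state"] = state
    query = str(raw.get("coveredEntityQuery") or "").strip()
    if query:
        result["coveredEntityQuery"] = query
    return result


checked = validate_input({"maxItems": 100, "state": "ca"})
```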

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "maxItems": 20,
    "startPage": 0,
    "state": "",
    "coveredEntityQuery": "",
    "includeWebDescription": true
};

// Run the Actor and wait for it to finish
const run = await client.actor("automation-lab/hhs-data-breach-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "maxItems": 20,
    "startPage": 0,
    "state": "",
    "coveredEntityQuery": "",
    "includeWebDescription": True,
}

# Run the Actor and wait for it to finish
run = client.actor("automation-lab/hhs-data-breach-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "maxItems": 20,
  "startPage": 0,
  "state": "",
  "coveredEntityQuery": "",
  "includeWebDescription": true
}' |
apify call automation-lab/hhs-data-breach-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=automation-lab/hhs-data-breach-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "HHS Data Breach Scraper",
        "description": "Extract public HIPAA breach reports from the HHS OCR portal for compliance monitoring, cybersecurity research, and legal lead workflows.",
        "version": "0.1",
        "x-build-id": "tfVgOJAIXAwAS4Irn"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/automation-lab~hhs-data-breach-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-automation-lab-hhs-data-breach-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/automation-lab~hhs-data-breach-scraper/runs": {
            "post": {
                "operationId": "runs-sync-automation-lab-hhs-data-breach-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/automation-lab~hhs-data-breach-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-automation-lab-hhs-data-breach-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "maxItems": {
                        "title": "Maximum breach reports",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Maximum number of HHS OCR breach report rows to save. The portal currently returns 100 rows per page.",
                        "default": 20
                    },
                    "startPage": {
                        "title": "Start page",
                        "minimum": 0,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Zero-based HHS report page to start from. Use 0 for the newest reports.",
                        "default": 0
                    },
                    "state": {
                        "title": "State filter",
                        "type": "string",
                        "description": "Optional two-letter US state or territory abbreviation. Filtering is applied after fetching rows from HHS."
                    },
                    "coveredEntityQuery": {
                        "title": "Covered entity contains",
                        "type": "string",
                        "description": "Optional case-insensitive substring to match in the covered entity name."
                    },
                    "includeWebDescription": {
                        "title": "Include web description",
                        "type": "boolean",
                        "description": "Include the hidden Web Description field when the portal provides it.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
