# Pastebin Public Archive OSINT Scraper (`thescrapelab/pastebin-osint-scraper`) Actor

Monitor recent public Pastebin pastes, filter them with keywords or regex, and export structured OSINT-ready results on Apify without browser automation.

- **URL**: https://apify.com/thescrapelab/pastebin-osint-scraper.md
- **Developed by:** [Inus Grobler](https://apify.com/thescrapelab) (community)
- **Categories:** Developer tools, Other, SEO tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.99 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
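
As a rough worked example of the pricing above, the sketch below estimates what a batch of saved results would cost at the listed starting rate of $0.99 per 1,000 results. The helper name `estimate_result_cost` is hypothetical, and the calculation assumes that saved results are the only billed event; the Actor's pricing page may list other events or tiers.

```python
from decimal import Decimal

# Starting price from the Pricing section: $0.99 per 1,000 results.
PRICE_PER_1000_RESULTS = Decimal("0.99")


def estimate_result_cost(result_count: int) -> Decimal:
    """Estimate the USD cost of `result_count` saved results (hypothetical helper)."""
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    cost = PRICE_PER_1000_RESULTS * result_count / 1000
    return cost.quantize(Decimal("0.0001"))  # round to a hundredth of a cent


if __name__ == "__main__":
    # Example schedule: 3 saved results per run, one run per hour, for 30 days.
    print(estimate_result_cost(3 * 24 * 30))
```

`Decimal` is used instead of `float` so that small per-result amounts do not pick up binary rounding noise.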

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Pastebin Public Archive OSINT Scraper

<p align="center">
  <img src="./assets/logo.svg" alt="Pastebin Public Archive OSINT Scraper logo" width="220">
</p>

Scrape the current public Pastebin archive, filter it for useful signals, and export clean results to Apify without using a browser.

This Actor is built for people who want recent public Pastebin data in a structured, ready-to-use format. It is lightweight, stateless, cheap to run, and easy to schedule.

### What this actor does

- checks the current public Pastebin archive
- fetches the raw text for each selected paste
- optionally filters by keywords
- optionally extracts emails, URLs, keys, or other indicators with regex
- saves structured results to the dataset
- saves a plain-language `RUN_SUMMARY` so you can see what happened in the run

### Best use cases

- OSINT monitoring
- credential and leak discovery
- threat-intelligence enrichment
- scheduled monitoring of recent public pastes
- low-cost high-volume checks without browser automation

### What to expect

- Every run is independent. The Actor does not remember previous runs.
- It uses plain HTTP requests, not Playwright or Puppeteer.
- It is designed to keep Compute Unit usage low.
- It works against the public archive page, so it can only collect what Pastebin exposes at that moment.

### Quick start

If you just want to confirm it works, run it with the default settings or use this input:

```json
{
  "maxPastesPerRun": 25,
  "maxResults": 3,
  "fetchDetailMetadata": false,
  "keywords": [],
  "regexPatterns": []
}
````

That gives you a fast, inexpensive test run with a small result set.

### Input fields

| Field | What it means |
| --- | --- |
| `maxPastesPerRun` | How many recent archive entries the Actor should inspect in this run. |
| `maxResults` | Maximum number of matching items to save to the dataset. |
| `fetchDetailMetadata` | If enabled, the Actor makes one extra request per saved item to fetch the author and posted date. |
| `keywords` | Optional list of words or phrases. If you provide keywords, only pastes containing at least one of them are saved. |
| `regexPatterns` | Optional Python regex patterns used to extract structured matches from each saved paste. |
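
The `keywords` and `regexPatterns` rows can be read as a two-step process: keywords decide whether a paste is saved, and regex patterns extract matches from the pastes that are saved. The following is a minimal sketch of that reading, not the Actor's source code. The function name `build_record` is hypothetical, case-insensitive substring matching for keywords is an assumption, and the returned keys simply reuse the dataset field names `matched_keywords`, `regex_matches_flat`, and `regex_match_count`.

```python
import re
from typing import Optional


def build_record(raw_text: str, keywords: list[str], regex_patterns: list[str]) -> Optional[dict]:
    """Return a dataset-style record, or None when the keyword filter rejects the paste."""
    lowered = raw_text.lower()
    # Assumption: keywords match as case-insensitive substrings.
    matched = [kw for kw in keywords if kw.lower() in lowered]
    if keywords and not matched:
        return None  # keywords were provided, but none occur in the paste

    # In this sketch, regex patterns only extract matches; they do not filter.
    flat: list[str] = []
    for pattern in regex_patterns:
        flat.extend(m.group(0) for m in re.finditer(pattern, raw_text))

    return {
        "matched_keywords": matched,
        "regex_matches_flat": flat,
        "regex_match_count": len(flat),
    }


if __name__ == "__main__":
    sample = "password=hunter2 contact: admin@example.com"
    email = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    print(build_record(sample, ["password", "token"], [email]))
```

With an empty `keywords` list, every inspected paste passes the filter, which matches the "cheap first test" configuration.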

### Recommended settings

#### Cheap first test

- `maxPastesPerRun`: `25`
- `maxResults`: `3`
- `fetchDetailMetadata`: `false`
- `keywords`: `[]`
- `regexPatterns`: `[]`

#### Broad low-cost monitoring

- `maxPastesPerRun`: `50` to `125`
- `maxResults`: keep this low if you only want a shortlist
- `fetchDetailMetadata`: `false`

#### High-signal filtering

- add keywords such as `password`, `token`, `private key`, or brand-specific terms
- add regex patterns for emails, domains, URLs, API keys, or wallet strings
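
Because `regexPatterns` are Python regular expressions, it can help to check locally that each pattern compiles before putting it into the Actor input. The sketch below uses only the standard `re` module; the helper name `find_invalid_patterns` is hypothetical, and how the Actor itself reacts to an invalid pattern is not documented here.

```python
import re


def find_invalid_patterns(patterns: list[str]) -> dict[str, str]:
    """Map each pattern that fails to compile to its error message."""
    invalid: dict[str, str] = {}
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            invalid[pattern] = str(exc)
    return invalid


if __name__ == "__main__":
    candidates = [
        r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",  # email addresses
        r"AKIA[0-9A-Z]{16}",  # AWS access key ID format
        r"(unclosed",  # deliberately broken pattern
    ]
    for pattern, error in find_invalid_patterns(candidates).items():
        print(f"invalid pattern {pattern!r}: {error}")
```

When a pattern is written into JSON input, each backslash has to be doubled (for example `\\.`), as in the input object example in this document.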

### Output

The Actor gives you two main outputs:

- dataset items with the saved Pastebin records
- a `RUN_SUMMARY` record in the default key-value store

Useful dataset fields include:

- `paste_id`
- `url`
- `raw_url`
- `title`
- `author`
- `date_posted`
- `raw_text_preview`
- `matched_keywords`
- `regex_matches_flat`
- `regex_match_count`
- `detail_metadata_requested`
- `fetched_at`
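
Since every run is independent, two scheduled runs may save the same paste. One way to handle this downstream is to de-duplicate on the `paste_id` field listed above. This is a sketch of consumer-side logic, not a feature of the Actor; the helper name `keep_unseen` is hypothetical, and persisting `seen_ids` between runs is left to the caller.

```python
def keep_unseen(items: list[dict], seen_ids: set[str]) -> list[dict]:
    """Return items whose paste_id is new, and record those IDs in seen_ids."""
    fresh = []
    for item in items:
        paste_id = item.get("paste_id")
        if paste_id is None or paste_id in seen_ids:
            continue  # skip records without an ID and records already handled
        seen_ids.add(paste_id)
        fresh.append(item)
    return fresh


if __name__ == "__main__":
    seen: set[str] = set()
    first_run = [{"paste_id": "a1"}, {"paste_id": "b2"}]
    second_run = [{"paste_id": "b2"}, {"paste_id": "c3"}]
    print(keep_unseen(first_run, seen))
    print(keep_unseen(second_run, seen))
```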

The `RUN_SUMMARY` helps explain the run at a glance. It includes:

- how many archive entries were seen
- how many were selected for the run
- how many were processed and saved
- how many were filtered out
- whether the source page exposed fewer items than you requested
- counts for fetch, metadata, or processing failures
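
To read the summary programmatically, the `RUN_SUMMARY` record can be fetched from the run's default key-value store. The sketch below assumes the run object is the dictionary returned by the official Python client (as in the Python example in the API section) and that the record key is exactly `RUN_SUMMARY`; the helper name `fetch_run_summary` is hypothetical.

```python
from typing import Any, Optional


def fetch_run_summary(client: Any, run: dict) -> Optional[Any]:
    """Read RUN_SUMMARY from the run's default key-value store, or None if absent."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("RUN_SUMMARY")
    return record["value"] if record else None


# Usage with the official client (needs network access and a real token):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   run = client.actor("thescrapelab/pastebin-osint-scraper").call(run_input={})
#   print(fetch_run_summary(client, run))
```

Passing the client in as an argument keeps the helper easy to test with a stub object.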

### Why it is cheap

- no browser sessions
- pure HTTP workflow
- metadata requests are optional
- lightweight default memory settings
- raw text can stay out of the dataset unless you explicitly need it

### Reliability features

- Apify Proxy support
- retry handling for temporary blocks and upstream issues
- proxy session rotation on retries
- lightweight concurrency tuned for HTTP scraping
- challenge-page detection
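
The retry, session rotation, and challenge-detection items above can be illustrated with a generic pattern. This is a sketch under stated assumptions, not the Actor's implementation: the `fetch` callable, the `session_N` naming, the backoff schedule, and the `CHALLENGE_MARKERS` strings are all hypothetical.

```python
import time
from typing import Callable, Optional

# Hypothetical markers; the Actor's real challenge detection is not documented here.
CHALLENGE_MARKERS = ("just a moment", "captcha")


def looks_like_challenge(body: str) -> bool:
    """Heuristic check for an anti-bot challenge page instead of paste content."""
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def fetch_with_retries(
    fetch: Callable[[str, str], str],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call fetch(url, session_id), using a new session ID on every attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        # A fresh ID per attempt stands in for rotating the proxy session.
        session_id = f"session_{attempt}"
        try:
            body = fetch(url, session_id)
            if not looks_like_challenge(body):
                return body
            last_error = RuntimeError("challenge page detected")
        except OSError as exc:  # network-level failures
            last_error = exc
        if attempt < max_attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error
```

The `fetch` callable is injected so the retry logic can be exercised without network access.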

### Important limitation

This Actor uses Pastebin's public archive page as its discovery source.

That means:

- it does not backfill historical pastes
- it does not access private or deleted pastes
- it can only collect as many public items as the archive page exposes at run time

If Pastebin exposes fewer rows than `maxPastesPerRun`, the Actor records that clearly in `RUN_SUMMARY` with `archive_source_capped` and `archive_source_note`.
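
A monitoring script could surface this condition from an already-loaded summary. The sketch below assumes `RUN_SUMMARY` is a JSON object in which `archive_source_capped` is a boolean and `archive_source_note` is a string; those types, and the helper name `describe_archive_cap`, are assumptions.

```python
def describe_archive_cap(summary: dict) -> str:
    """Return a one-line note on whether the archive exposed fewer rows than requested."""
    if summary.get("archive_source_capped"):
        note = summary.get("archive_source_note") or "no note provided"
        return f"Archive capped: {note}"
    return "Archive was not capped."


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(describe_archive_cap({"archive_source_capped": True, "archive_source_note": "example note"}))
    print(describe_archive_cap({"archive_source_capped": False}))
```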

### Local development

Use Python 3.11:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3.11 -m unittest discover -s tests -v
```

Run locally with:

```bash
python3.11 -m src
```

For real scraping runs, use Apify or `apify run` so proxy and storage behavior match production.

### Project files

- [main.py](./main.py): core Actor logic
- [example_input.json](./example_input.json): sample input
- [Dockerfile](./Dockerfile): Apify Python runtime image
- [requirements.txt](./requirements.txt): Python dependencies
- [`.actor/actor.json`](./.actor/actor.json): actor definition
- [`.actor/input_schema.json`](./.actor/input_schema.json): input schema
- [`.actor/output_schema.json`](./.actor/output_schema.json): output schema
- [`.actor/dataset_schema.json`](./.actor/dataset_schema.json): dataset schema

# Actor input Schema

## `maxPastesPerRun` (type: `integer`):

How many recent Pastebin archive entries the actor should inspect in this run before filters and result limits are applied. Higher values help when you use strict keywords or regex filters.

## `maxResults` (type: `integer`):

Maximum number of dataset items the actor should save in this run. This lets you inspect many pastes but return only a small result set.

## `fetchDetailMetadata` (type: `boolean`):

When enabled, the actor makes one extra Pastebin page request for each saved item to enrich it with author and publication timestamp. Leave this off for the cheapest high-volume runs.

## `keywords` (type: `array`):

Optional text terms to look for. If you enter any keywords, the actor saves only pastes that contain at least one of them.

## `regexPatterns` (type: `array`):

Optional Python regex patterns used to extract emails, keys, URLs, or other structured matches from saved pastes.

## Actor input object example

```json
{
  "maxPastesPerRun": 25,
  "maxResults": 10,
  "fetchDetailMetadata": false,
  "keywords": [
    "password",
    "api_key",
    "BEGIN RSA PRIVATE KEY"
  ],
  "regexPatterns": [
    "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
    "AKIA[0-9A-Z]{16}"
  ]
}
```
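
When the input is assembled in code, the limits from the OpenAPI specification in this document (`minimum` 1 and `maximum` 500 for both integer fields, defaults of 25 and 3) can be checked before a run is started. This is a client-side sketch; the helper name `build_run_input` is hypothetical, and the platform's own validation may differ in its error reporting.

```python
from typing import Optional


def build_run_input(
    max_pastes_per_run: int = 25,
    max_results: int = 3,
    fetch_detail_metadata: bool = False,
    keywords: Optional[list[str]] = None,
    regex_patterns: Optional[list[str]] = None,
) -> dict:
    """Build an input object, enforcing the 1..500 bounds from the input schema."""
    for name, value in (("maxPastesPerRun", max_pastes_per_run), ("maxResults", max_results)):
        if not 1 <= value <= 500:
            raise ValueError(f"{name} must be between 1 and 500, got {value}")
    return {
        "maxPastesPerRun": max_pastes_per_run,
        "maxResults": max_results,
        "fetchDetailMetadata": fetch_detail_metadata,
        "keywords": list(keywords or []),
        "regexPatterns": list(regex_patterns or []),
    }


if __name__ == "__main__":
    print(build_run_input(max_pastes_per_run=125, max_results=10, keywords=["password"]))
```

The returned dictionary can be passed as `run_input` to the Python client call shown in the API section.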

# Actor output Schema

## `datasetItems` (type: `string`):

Structured Pastebin records saved during the actor run.

## `runSummary` (type: `string`):

Human-friendly JSON summary of what the actor discovered, processed, saved, skipped, and failed during the run.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("thescrapelab/pastebin-osint-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("thescrapelab/pastebin-osint-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call thescrapelab/pastebin-osint-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=thescrapelab/pastebin-osint-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Pastebin Public Archive OSINT Scraper",
        "description": "Monitor recent public Pastebin pastes, filter them with keywords or regex, and export structured OSINT-ready results on Apify without browser automation.",
        "version": "0.2",
        "x-build-id": "3XItAOpmjFHELwdTV"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/thescrapelab~pastebin-osint-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-thescrapelab-pastebin-osint-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/thescrapelab~pastebin-osint-scraper/runs": {
            "post": {
                "operationId": "runs-sync-thescrapelab-pastebin-osint-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/thescrapelab~pastebin-osint-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-thescrapelab-pastebin-osint-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "maxPastesPerRun": {
                        "title": "Pastes to check",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "How many recent Pastebin archive entries the actor should inspect in this run before filters and result limits are applied. Higher values help when you use strict keywords or regex filters.",
                        "default": 25
                    },
                    "maxResults": {
                        "title": "Maximum results to save",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum number of dataset items the actor should save in this run. This lets you inspect many pastes but return only a small result set.",
                        "default": 3
                    },
                    "fetchDetailMetadata": {
                        "title": "Fetch author and date",
                        "type": "boolean",
                        "description": "When enabled, the actor makes one extra Pastebin page request for each saved item to enrich it with author and publication timestamp. Leave this off for the cheapest high-volume runs.",
                        "default": false
                    },
                    "keywords": {
                        "title": "Keywords",
                        "type": "array",
                        "description": "Optional text terms to look for. If you enter any keywords, the actor saves only pastes that contain at least one of them.",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    },
                    "regexPatterns": {
                        "title": "Regex patterns",
                        "type": "array",
                        "description": "Optional Python regex patterns used to extract emails, keys, URLs, or other structured matches from saved pastes.",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
