Pricing

from $0.99 / 1,000 results

Pastebin Keyword Search & OSINT Scraper

Search public Pastebin archive data by keyword or regex, auto-expand into syntax archives, and stream matching OSINT results or a no-match search summary.

Developer

Inus Grobler

Last modified

6 days ago

What You Can Do

Search discovered public Pastebin pastes for keywords such as company names, domains, usernames, and product names, or for literal terms such as password, token, or api_key.
Build a cumulative URL-only Pastebin index with scheduled runs so keyword searches can fall back to URLs collected over time when live data has no match.
Extract emails, URLs, API keys, wallet strings, private-key markers, or custom indicators with regex patterns.
Monitor the rolling public archive on a schedule and collect matching results over time.
Add Pastebin syntax archives such as bash, python, javascript, json, sql, xml, or text to discover additional public paste IDs, including some older public pastes.
Choose recent, expanded, or deep discovery. Deep discovery reads Pastebin's public language archive list and can scan many more syntax archive pages.
Optionally add externally indexed Pastebin URLs from search result pages when you need a best-effort historical fallback.
Keep larger runs memory-safe with result streaming, raw-paste size caps, regex timeouts, and configurable request timeouts.

Keyword Search

This actor supports keyword search across the public Pastebin data it discovers.

Add one or more values to keywords. The actor fetches the raw text for each discovered public paste and saves only pastes that contain at least one keyword. Keyword matching is case-insensitive by default.
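
The matching rule described above can be approximated locally to check which of your keywords would match a given text. This is a minimal sketch, assuming plain substring matching; the function name match_keywords and its case_sensitive parameter are hypothetical and are not part of the actor's input.

```python
def match_keywords(raw_text: str, keywords: list[str], case_sensitive: bool = False) -> list[str]:
    """Return the keywords found in raw_text, preserving input order."""
    # Assumption: a keyword matches when it appears anywhere in the text.
    haystack = raw_text if case_sensitive else raw_text.casefold()
    matched = []
    for keyword in keywords:
        needle = keyword if case_sensitive else keyword.casefold()
        if needle and needle in haystack:
            matched.append(keyword)
    return matched


sample = "Contact security@Example.com for the public report."
print(match_keywords(sample, ["example.com", "api_key", "password"]))  # ['example.com']
```

A paste would be kept when the returned list is non-empty, which mirrors the "at least one keyword" rule.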

This is not a complete historical Pastebin search engine. Pastebin does not expose a full public historical index through the archive pages. The actor can search the current public archive, common syntax archives, an optional deep sweep of Pastebin's public language archives, and optional externally indexed Pastebin URLs from search result pages. It cannot search private, deleted, expired, password-protected, or unlisted pastes that Pastebin does not expose.

If a keyword run checks live public pastes and finds no matches, it can search the scheduled URL index next, up to maxUrlIndexEntriesToSearch. If it still finds no matches, the actor saves one search_summary dataset record by default. That record explains how many paste IDs were discovered and processed, which keywords were searched, which archive pages or fallback sources were used, and how to widen the run.
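
The fallback order in the paragraph above (live archive first, then the URL index, then a summary record) can be pictured as follows. This is an illustrative sketch, not the actor's code: search_with_fallback and its two callable parameters are hypothetical, and only the field names record_type, searched_keywords, and discovery_sources come from this page.

```python
from typing import Callable, Iterable


def search_with_fallback(
    keywords: list[str],
    search_live: Callable[[list[str]], Iterable[dict]],
    search_url_index: Callable[[list[str]], Iterable[dict]],
    use_url_index: bool = True,
    save_no_match_summary: bool = True,
) -> list[dict]:
    """Try live data first, then the URL index, then emit one summary record."""
    sources = ["live_archive"]
    records = list(search_live(keywords))
    if not records and use_url_index:
        # The URL index is only consulted when live data produced no match.
        sources.append("url_index")
        records = list(search_url_index(keywords))
    if not records and save_no_match_summary:
        return [{
            "record_type": "search_summary",
            "searched_keywords": list(keywords),
            "discovery_sources": sources,
        }]
    return records


result = search_with_fallback(["example.com"], lambda k: [], lambda k: [])
print(result[0]["record_type"], result[0]["discovery_sources"])
```

With both searches returning nothing, the sketch yields a single search_summary record that lists both sources.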

Quick Start

Run the actor with the default settings for a small smoke test, or use this input:

{
  "maxPastesPerRun": 25,
  "maxResults": 3,
  "keywords": ["example.com", "api_key", "password"]
}

For a keyword-monitoring run:

{
  "runMode": "scrape",
  "maxPastesPerRun": 100,
  "maxResults": 10,
  "fetchDetailMetadata": false,
  "keywords": ["example.com", "api_key", "password"],
  "discoveryMode": "expanded",
  "regexPatterns": [
    "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
    "AKIA[0-9A-Z]{16}"
  ]
}
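
Before spending a run on them, the two regexPatterns values above can be tried locally with Python's re module. In this sketch the labels email and aws_access_key_id and the helper extract_matches are illustrative names only. Note that the doubled backslash in the JSON input becomes a single backslash in a Python raw string.

```python
import re

# Same patterns as the JSON input above, written as Python raw strings.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "aws_access_key_id": r"AKIA[0-9A-Z]{16}",
}


def extract_matches(text: str, patterns: dict[str, str] = PATTERNS) -> dict[str, list[str]]:
    """Return every non-overlapping match for each named pattern."""
    return {name: re.findall(pattern, text) for name, pattern in patterns.items()}


sample = "Contact security@example.com, key AKIAIOSFODNN7EXAMPLE"
print(extract_matches(sample))
```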

For a scheduled URL collection run:

{
  "runMode": "url_index",
  "maxPastesPerRun": 100,
  "urlIndexMaxRecordsPerRun": 100,
  "urlIndexDatasetName": "pastebin-url-index",
  "urlIndexStateStoreName": "pastebin-url-index-state",
  "urlIndexDeduplicate": true,
  "urlIndexRecentIdLimit": 25,
  "urlIndexSaveToDefaultDataset": true,
  "discoveryMode": "recent",
  "syntaxArchives": [],
  "autoExpandSyntaxArchives": false,
  "searchEngineDiscovery": false
}

Create an Apify Schedule with this input to append new paste_url records to the named pastebin-url-index dataset. Scrape-mode runs read from that same named dataset by default: when keywords are supplied and live archive discovery finds no matches, the run searches the indexed URLs next. Results can therefore come from live Pastebin archive pages or from URLs collected by earlier scheduled runs.

Input Fields

Field	Description
`runMode`	`scrape` fetches raw paste text and applies filters/extraction. `url_index` saves lightweight Pastebin URL records only.
`maxPastesPerRun`	Maximum public archive entries to inspect before filtering and result limits are applied.
`maxResults`	Maximum matching paste records to stream in scrape mode.
`urlIndexMaxRecordsPerRun`	Maximum new `paste_url` records to append in URL index mode.
`urlIndexDatasetName`	Named dataset where scheduled URL index runs append cumulative URL records.
`urlIndexStateStoreName`	Named key-value store used to remember recently indexed paste IDs and avoid duplicates.
`urlIndexDeduplicate`	Skips URLs already seen in the URL index state store.
`urlIndexRecentIdLimit`	Number of recent paste IDs kept for dedupe state.
`urlIndexSaveToDefaultDataset`	Also saves URL index records to the current run dataset for easier inspection.
`useUrlIndex`	In scrape mode, searches previously collected indexed URLs only after the live archive search finds no keyword matches.
`maxUrlIndexEntriesToSearch`	Maximum recent URL index records to search as a fallback. Higher values search further back but cost more.
`fetchDetailMetadata`	Fetches the paste page for each saved result to enrich it with author, title, and posted date. Leave off for cheaper high-volume runs.
`keywords`	Optional words or phrases. If provided, only pastes containing at least one keyword are saved.
`discoveryMode`	Discovery depth. `recent` checks only the rolling archive, `expanded` adds common syntax archives, and `deep` scans Pastebin language archives up to `maxSyntaxArchivesToScan`.
`syntaxArchives`	Optional Pastebin syntax archive slugs such as `bash`, `python`, `javascript`, `json`, `sql`, `xml`, `yaml`, or `text`. Manual values override automatic archive selection.
`maxSyntaxArchivesToScan`	Maximum number of language archive pages to scan in `deep` mode. Higher values broaden discovery but increase runtime.
`autoExpandSyntaxArchives`	Automatically includes common syntax archives for keyword runs that request more than the main rolling archive exposes. This is useful when `maxPastesPerRun` is high and `syntaxArchives` is empty.
`saveNoMatchSummary`	Saves one `search_summary` dataset record when a keyword run finds no matching paste records.
`searchEngineDiscovery`	Optional best-effort fallback that searches externally indexed Pastebin URLs with multiple query variants for each keyword and adds discovered paste IDs to the run.
`maxSearchResultsPerKeyword`	Maximum externally indexed Pastebin URLs to add per keyword when `searchEngineDiscovery` is enabled.
`noMatchStopAfterPastes`	Cost guard for keyword searches. Stops early if zero paste records have matched after this many processed pastes. Set to `0` to disable.
`stopAfterPastesWithoutNewMatch`	Cost guard for sparse keyword searches. After the first saved match, stops if this many additional processed pastes do not produce another saved match. Set to `0` to disable.
`regexPatterns`	Optional Python-compatible regex patterns used to extract emails, keys, URLs, or other structured matches from saved pastes.
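
The two cost guards in the table, noMatchStopAfterPastes and stopAfterPastesWithoutNewMatch, can be read as one stopping rule. The function below is a sketch of that rule as described here, not the actor's implementation; should_stop_early and its parameter names are hypothetical.

```python
def should_stop_early(
    processed: int,
    saved_matches: int,
    processed_since_last_match: int,
    no_match_stop_after: int = 1000,
    stop_after_without_new_match: int = 1000,
) -> bool:
    """Decide whether a keyword run should stop before using its full budget."""
    if saved_matches == 0:
        # Guard 1: nothing has matched yet. A limit of 0 disables the guard.
        return no_match_stop_after > 0 and processed >= no_match_stop_after
    # Guard 2: at least one match was saved, but none recently. 0 disables it.
    return (
        stop_after_without_new_match > 0
        and processed_since_last_match >= stop_after_without_new_match
    )


print(should_stop_early(processed=1000, saved_matches=0, processed_since_last_match=1000))  # True
print(should_stop_early(processed=1500, saved_matches=1, processed_since_last_match=200))   # False
```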

Recommended Settings

Cheap First Test

maxPastesPerRun: 25
maxResults: 3
fetchDetailMetadata: false
keywords: []
discoveryMode: expanded
syntaxArchives: []
maxSyntaxArchivesToScan: 75
autoExpandSyntaxArchives: true
saveNoMatchSummary: true
searchEngineDiscovery: false
maxSearchResultsPerKeyword: 20
noMatchStopAfterPastes: 1000
stopAfterPastesWithoutNewMatch: 1000
regexPatterns: []

Scheduled Pastebin Monitoring

maxPastesPerRun: 50 to 150
maxResults: set to the number of matches you want per run
fetchDetailMetadata: false for lowest cost, true when author/date matters
keywords: use company names, domains, product names, or indicator terms
useUrlIndex: true to search scheduled URL index records only when live archive data has no keyword matches
maxUrlIndexEntriesToSearch: 100 for low-cost fallback searches, 500 to 1000 when you intentionally want more historical coverage
discoveryMode: expanded
autoExpandSyntaxArchives: true
saveNoMatchSummary: true
searchEngineDiscovery: false for archive-only monitoring, true for best-effort indexed URL discovery
noMatchStopAfterPastes: 500 to 1000 to avoid expensive zero-match runs
stopAfterPastesWithoutNewMatch: 500 to 1000 to stop sparse runs after an early match

Scheduled URL Index

runMode: url_index
maxPastesPerRun: 50 to 150
urlIndexMaxRecordsPerRun: match or slightly exceed maxPastesPerRun
urlIndexDatasetName: keep a stable name such as pastebin-url-index
urlIndexStateStoreName: keep a stable name such as pastebin-url-index-state
urlIndexDeduplicate: true
discoveryMode: recent for latest rolling archive collection
autoExpandSyntaxArchives: false for lowest-cost latest-URL checks
Schedule interval: every few minutes if you want a denser URL history
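
To picture how urlIndexDeduplicate and urlIndexRecentIdLimit might interact, the sketch below keeps a bounded set of recently seen paste IDs. The RecentIdCache class is hypothetical; the actor's real state format in the key-value store is not documented on this page.

```python
from collections import OrderedDict


class RecentIdCache:
    """Remember at most `limit` recently seen paste IDs."""

    def __init__(self, limit: int = 25) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self._ids: OrderedDict[str, None] = OrderedDict()

    def add_if_new(self, paste_id: str) -> bool:
        """Return True if paste_id was not remembered and is now recorded."""
        if paste_id in self._ids:
            self._ids.move_to_end(paste_id)  # refresh recency, report duplicate
            return False
        self._ids[paste_id] = None
        while len(self._ids) > self.limit:
            self._ids.popitem(last=False)  # forget the oldest ID
        return True


cache = RecentIdCache(limit=2)
print([cache.add_if_new(pid) for pid in ["AbCd1234", "EfGh5678", "AbCd1234"]])  # [True, True, False]
```

In this sketch an ID that has been evicted is treated as new again, so the limit should comfortably exceed the number of IDs that reappear between consecutive scheduled runs.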

Older Public Paste Discovery

Set discoveryMode to deep to scan Pastebin's public language archive list.
Keep maxSyntaxArchivesToScan at 75 for a broad first pass, or increase it up to 266 for the widest syntax archive sweep.
Add targeted syntaxArchives values such as bash, python, javascript, json, sql, xml, yaml, or text when you know the likely paste category.
Increase maxPastesPerRun when you use deep discovery. The actor can inspect up to 5000 discovered paste IDs per run.
Enable searchEngineDiscovery as an extra historical fallback when archive pages do not find enough matches.
Keep noMatchStopAfterPastes enabled for broad keyword searches. Set it to 0 only when you intentionally want to exhaust the full run budget even if nothing matches.
Keep stopAfterPastesWithoutNewMatch enabled for broad keyword searches. This prevents one early match from causing the actor to scan the entire remaining budget while looking for another match.
Keep maxResults modest when scanning many syntax archives.
Use keyword filters so the actor saves only relevant records from older archive pages.

Syntax archives and the scheduled URL index can reveal older public paste IDs than the main rolling archive, but they are not a full Pastebin history. The URL index only reaches back to URLs collected since the schedule started.
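
The deep-discovery advice above can be collected into a small input builder. This is a sketch under stated assumptions: build_deep_discovery_input is a hypothetical helper, and the ceilings of 266 syntax archives and 5000 paste IDs are the figures quoted in this section.

```python
MAX_SYNTAX_ARCHIVES = 266   # widest syntax archive sweep mentioned above
MAX_PASTES_PER_RUN = 5000   # most paste IDs the actor inspects per run


def build_deep_discovery_input(
    keywords: list[str],
    max_pastes: int = 1000,
    max_archives: int = 75,
    max_results: int = 10,
) -> dict:
    """Build a scrape-mode input for deep discovery with both cost guards on."""
    if not keywords:
        raise ValueError("deep discovery without keywords would save unfiltered records")
    return {
        "runMode": "scrape",
        "discoveryMode": "deep",
        "keywords": list(keywords),
        # Clamp requested values to the documented ceilings.
        "maxPastesPerRun": max(1, min(max_pastes, MAX_PASTES_PER_RUN)),
        "maxSyntaxArchivesToScan": max(1, min(max_archives, MAX_SYNTAX_ARCHIVES)),
        "maxResults": max_results,
        "noMatchStopAfterPastes": 1000,
        "stopAfterPastesWithoutNewMatch": 1000,
    }


run_input = build_deep_discovery_input(["example.com"], max_pastes=9000, max_archives=300)
print(run_input["maxPastesPerRun"], run_input["maxSyntaxArchivesToScan"])  # 5000 266
```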

Output

The main output is the default Apify dataset. In URL index mode, records are also appended to the configured named dataset so scheduled runs build one cumulative URL history. In scrape mode, indexed URLs are searched as a bounded fallback when useUrlIndex is enabled, keywords are provided, and the live archive search saves no matching paste records. Each saved record can include:

record_type
paste_id
url
raw_url
title
author
date_posted
raw_text
raw_text_preview
raw_text_length
syntax
archive_relative_time
archive_source
matched_keywords
matched_keyword_count
extracted_regex_data
regex_matches_flat
regex_match_count
detail_metadata_requested
source
discovered_at
fetched_at

For paste records, record_type is paste. URL index records use record_type paste_url and contain only metadata such as paste_id, url, raw_url, title, syntax, archive_source, archive_relative_time, and discovered_at. If a keyword run finds no matching paste records, the actor can save a search_summary record with fields such as summary_message, searched_keywords, discovery_sources, archive_entries_seen, processed_entries, effective_syntax_archives, and suggested_next_steps.

Example matching paste row:

{
  "record_type": "paste",
  "paste_id": "AbCd1234",
  "url": "https://pastebin.com/AbCd1234",
  "raw_url": "https://pastebin.com/raw/AbCd1234",
  "title": "Public monitoring example",
  "author": "guest",
  "raw_text_preview": "Contact security@example.com for the public report.",
  "raw_text_length": 58,
  "syntax": "text",
  "matched_keywords": ["example.com"],
  "matched_keyword_count": 1,
  "source": "pastebin",
  "fetched_at": "2026-07-23T12:00:00.000Z"
}
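
Because one dataset can hold paste, paste_url, and search_summary records, downstream code usually needs to branch on record_type. The sketch below groups already-downloaded items; group_by_record_type is a hypothetical helper and the sample items are made up for illustration.

```python
from collections import defaultdict


def group_by_record_type(items: list[dict]) -> dict[str, list[dict]]:
    """Bucket dataset items by their record_type field."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # Items without a record_type land in an "unknown" bucket.
        grouped[item.get("record_type", "unknown")].append(item)
    return dict(grouped)


items = [
    {"record_type": "paste", "paste_id": "AbCd1234", "matched_keywords": ["example.com"]},
    {"record_type": "paste_url", "paste_id": "EfGh5678"},
    {"record_type": "search_summary", "searched_keywords": ["api_key"]},
]
grouped = group_by_record_type(items)
print({record_type: len(rows) for record_type, rows in grouped.items()})
```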

The actor also writes a RUN_SUMMARY record to the default key-value store. It reports how many archive entries were seen, selected, processed, saved, filtered out, skipped as oversized, or failed. In URL index mode it also reports the named URL index dataset, duplicate count, and dedupe state store.
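
RUN_SUMMARY can be read after a run with the Apify Python client. The helper below is a sketch: summarize_skips is a hypothetical name, and it only looks up the RUN_SUMMARY fields that this page mentions by name. It expects an ApifyClient instance and the run dictionary returned by call().

```python
def summarize_skips(client, run: dict) -> dict:
    """Return the RUN_SUMMARY fields that explain capped or skipped pastes."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("RUN_SUMMARY")
    if record is None:
        # The record is missing, for example when the run failed before writing it.
        return {}
    summary = record["value"] or {}
    fields = (
        "archive_source_capped",
        "archive_source_note",
        "oversized_pastes",
        "skipped_binary_pastes",
        "raw_fetch_failures",
    )
    # Fields absent from the summary are reported as None.
    return {name: summary.get(name) for name in fields}
```

For example, summarize_skips(client, run) can be called right after client.actor(...).call(...) to see why a run returned fewer rows than requested.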

Why It Is Cost Efficient

Uses plain HTTP requests instead of browser automation.
URL index mode saves only small metadata records and does not fetch raw paste text.
Streams dataset items as each paste finishes instead of buffering the whole run in memory.
Never holds more raw paste text in memory than the configured raw-paste size cap allows.
Lets you disable extra metadata page requests.
Uses regex timeouts, regex match caps, and raw response size limits to avoid runaway runs.

Python API Example

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

run_input = {
    "runMode": "scrape",
    "maxPastesPerRun": 100,
    "maxResults": 10,
    "fetchDetailMetadata": False,
    "keywords": ["example.com", "api_key", "password"],
    "discoveryMode": "deep",
    "syntaxArchives": [],
    "maxSyntaxArchivesToScan": 75,
    "autoExpandSyntaxArchives": True,
    "saveNoMatchSummary": True,
    "useUrlIndex": True,
    "maxUrlIndexEntriesToSearch": 100,
    "searchEngineDiscovery": True,
    "maxSearchResultsPerKeyword": 20,
    "noMatchStopAfterPastes": 1000,
    "stopAfterPastesWithoutNewMatch": 1000,
    "regexPatterns": [
        r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        r"AKIA[0-9A-Z]{16}",
    ]
}

run = client.actor("thescrapelab/pastebin-osint-scraper").call(run_input=run_input)

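for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # A no-match run can contain a single search_summary record without paste fields.
    if item.get("record_type") != "paste":
        print(item.get("record_type"), item.get("summary_message"))
        continue
    print(item["paste_id"], item["url"], item.get("matched_keywords"), item.get("regex_matches_flat"))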

For URL indexing, use runMode="url_index" and keep the same urlIndexDatasetName on every scheduled run:

# useUrlIndex and maxUrlIndexEntriesToSearch are scrape-mode settings, so they are omitted here.
run_input = {
    "runMode": "url_index",
    "maxPastesPerRun": 100,
    "urlIndexMaxRecordsPerRun": 100,
    "urlIndexDatasetName": "pastebin-url-index",
    "urlIndexStateStoreName": "pastebin-url-index-state",
    "urlIndexDeduplicate": True,
    "discoveryMode": "recent",
    "autoExpandSyntaxArchives": False,
}

client.actor("thescrapelab/pastebin-osint-scraper").call(run_input=run_input)

url_index_dataset = client.datasets().get_or_create(name="pastebin-url-index")

for item in client.dataset(url_index_dataset["id"]).iterate_items():
    print(item["discovered_at"], item["paste_id"], item["url"])

Limitations

The actor only uses public Pastebin surfaces.
It cannot access private, deleted, expired, password-protected, or unlisted pastes that Pastebin does not expose.
The main archive is a rolling page, so it only shows what Pastebin exposes at run time.
Pastebin's native search page is not treated as a reliable public source because unauthenticated requests can redirect to login or challenge pages.
Pastebin does not expose a usable public sitemap index at the standard /sitemap.xml location.
Search-engine discovery is best effort. Search engines may throttle, omit, reorder, or stop returning Pastebin results.
Syntax archives and deep discovery can help discover some older public paste IDs, but they do not provide complete historical backfill.
Pastebin may expose fewer rows than requested. When that happens, the actor records the reason in RUN_SUMMARY using archive_source_capped and archive_source_note.

Use this actor only for lawful monitoring, research, and security workflows where you are allowed to process the resulting data.

Troubleshooting

Problem	What to try
No matching paste records	Use broader keywords, run on a schedule, enable `saveNoMatchSummary`, or try `discoveryMode` set to `deep` for a wider public syntax-archive sweep.
The run is slower than expected	Lower `maxPastesPerRun`, keep `fetchDetailMetadata` off, and keep `searchEngineDiscovery` off unless you need indexed URL fallback.
Too many irrelevant results	Add keywords, add regex patterns, or lower `maxResults` for a shorter triage list.
Pastebin exposes fewer rows than requested	Check `RUN_SUMMARY.archive_source_note`. The public archive often exposes fewer rows than a large requested limit.
Some raw pastes are skipped	Check `oversized_pastes`, `skipped_binary_pastes`, and `raw_fetch_failures` in `RUN_SUMMARY`. The actor skips very large or non-text content to keep costs predictable.

Pricing

The current Store price appears before you run the Actor. Small tests use low memory, stream results as they are found, and avoid extra metadata requests unless you enable them. URL index mode writes lightweight URL records; schedule it with modest limits such as 50 to 150 URLs per run.

For large keyword runs, keep maxResults and the no-match guards enabled so the run stops after it has enough useful results or after it has checked enough non-matching pastes to avoid wasting compute.

FAQ

Can this actor search all historical Pastebin pastes?

No. It searches public sources that Pastebin or search engines expose during the run, plus any URL index you build with scheduled runs. It does not access private, deleted, expired, password-protected, or unlisted pastes.

Can I monitor my company name or domain?

Yes. Add your company name, domains, product names, usernames, or indicator terms to keywords, then schedule the actor.

Does it use a browser?

No. It uses HTTP requests, which keeps memory and compute costs lower than a browser-based scraper.

Does it save full paste text?

By default it saves a preview and metadata, not the full raw body. The direct raw_url is included when you need to inspect the paste.

What is the best setup for continuous monitoring?

Run scrape mode on a schedule with focused keywords. Optionally add a separate scheduled url_index run to build a cumulative URL history that later keyword runs can search as a fallback.

What keywords should I start with?

Start with your domains, brand names, product names, known usernames, and high-signal terms such as api_key, token, password, or private-key markers.
