Pastebin Public Archive OSINT Scraper

Pricing

from $0.99 / 1,000 results

Monitor recent public Pastebin pastes, filter them with keywords or regex, and export structured OSINT-ready results on Apify without browser automation.

Developer

Inus Grobler

Maintained by Community

Scrape the current public Pastebin archive, filter it for useful signals, and export clean results to Apify without using a browser.

This actor is built for people who want recent public Pastebin data in a format they can actually use. It is lightweight, stateless, cheap to run, and easy to schedule.

What this actor does

  • checks the current public Pastebin archive
  • fetches the raw text for each selected paste
  • optionally filters by keywords
  • optionally extracts emails, URLs, keys, or other indicators with regex
  • saves structured results to the dataset
  • saves a plain-language RUN_SUMMARY so you can see what happened in the run
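
To make the filtering and extraction steps above concrete, here is a minimal sketch of how they could work on one paste. It is not the actor's actual code. The function names are hypothetical, and the case-insensitive keyword match and the de-duplication of regex matches are assumptions. Only the output field names come from this README.

```python
import re


def match_keywords(text, keywords):
    """Return the keywords found in text (assumed: case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def extract_matches(text, patterns):
    """Return regex matches as a flat list, de-duplicated, in order of first appearance."""
    seen = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            value = match.group(0)
            if value not in seen:
                seen.append(value)
    return seen


def process_paste(text, keywords, patterns):
    """Return a result dict, or None when keywords are given and none of them match."""
    matched = match_keywords(text, keywords)
    if keywords and not matched:
        return None  # filtered out
    flat = extract_matches(text, patterns)
    return {
        "matched_keywords": matched,
        "regex_matches_flat": flat,
        "regex_match_count": len(flat),
    }


# Made-up sample text, for illustration only.
sample = "db password: hunter2, contact admin@example.com or ops@example.com"
result = process_paste(sample, ["password", "token"], [r"[\w.+-]+@[\w-]+\.[\w.-]+"])
```

With an empty keyword list every paste passes the filter, which matches the "optional" wording above.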

Best use cases

  • OSINT monitoring
  • credential and leak discovery
  • threat-intelligence enrichment
  • scheduled monitoring of recent public pastes
  • low-cost high-volume checks without browser automation

What to expect

  • Every run is independent. The actor does not remember previous runs.
  • It uses plain HTTP requests, not Playwright or Puppeteer.
  • It is designed to keep Compute Unit usage low.
  • It works against the public archive page, so it can only collect what Pastebin exposes at that moment.

Quick start

If you just want to confirm it works, run it with the default settings or use this input:

{
    "maxPastesPerRun": 25,
    "maxResults": 3,
    "fetchDetailMetadata": false,
    "keywords": [],
    "regexPatterns": []
}

That gives you a fast, inexpensive test run with a small result set.

Input fields

| Field | What it means |
| --- | --- |
| maxPastesPerRun | How many recent archive entries the actor should inspect in this run. |
| maxResults | Maximum number of matching items to save to the dataset. |
| fetchDetailMetadata | If enabled, the actor makes one extra request per saved item to fetch the author and posted date. |
| keywords | Optional list of words or phrases. If you provide keywords, only pastes containing at least one of them are saved. |
| regexPatterns | Optional Python regex patterns used to extract structured matches from each saved paste. |

Cheap first test

  • maxPastesPerRun: 25
  • maxResults: 3
  • fetchDetailMetadata: false
  • keywords: []
  • regexPatterns: []

Broad low-cost monitoring

  • maxPastesPerRun: 50 to 125
  • maxResults: keep this lower if you only want a shortlist
  • fetchDetailMetadata: false

High-signal filtering

  • add keywords such as password, token, private key, or brand-specific terms
  • add regex patterns for emails, domains, URLs, API keys, or wallet strings
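
As an illustration of a high-signal configuration, the sketch below builds an input with keywords and regex patterns and checks that every pattern compiles before you paste the JSON into the actor. The specific patterns and limits are examples chosen for this sketch, not recommendations from the actor, and `invalid_patterns` is a hypothetical helper.

```python
import json
import re

# Example input; the keyword and pattern choices are illustrative assumptions.
run_input = {
    "maxPastesPerRun": 100,
    "maxResults": 20,
    "fetchDetailMetadata": False,
    "keywords": ["password", "token", "private key"],
    "regexPatterns": [
        r"[\w.+-]+@[\w-]+\.[\w.-]+",  # email-like strings
        r"https?://[^\s\"'<>]+",       # URLs
        r"\bAKIA[0-9A-Z]{16}\b",       # AWS-style access key IDs
    ],
}


def invalid_patterns(patterns):
    """Return (pattern, error message) pairs for patterns Python's re cannot compile."""
    bad = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            bad.append((pattern, str(exc)))
    return bad


problems = invalid_patterns(run_input["regexPatterns"])
input_json = json.dumps(run_input, indent=2)  # paste this into the actor input
```

Checking patterns locally first is cheap and avoids spending a run on a typo such as an unbalanced parenthesis.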

Output

The actor gives you two main outputs:

  • dataset items with the saved Pastebin records
  • a RUN_SUMMARY record in the default key-value store

Useful dataset fields include:

  • paste_id
  • url
  • raw_url
  • title
  • author
  • date_posted
  • raw_text_preview
  • matched_keywords
  • regex_matches_flat
  • regex_match_count
  • detail_metadata_requested
  • fetched_at
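
Once a run finishes, the dataset items can be post-processed like any list of records. The sketch below assumes each item is a dict in which `regex_match_count` is an integer and `matched_keywords` is a list of strings. Those types, the helper names, and the sample records are assumptions made for illustration.

```python
from collections import Counter


def shortlist(items, min_matches=1):
    """Keep items with at least min_matches regex matches, most matches first."""
    hits = [item for item in items if item.get("regex_match_count", 0) >= min_matches]
    return sorted(hits, key=lambda item: item["regex_match_count"], reverse=True)


def keyword_counts(items):
    """Count how often each keyword appears across all items."""
    counts = Counter()
    for item in items:
        counts.update(item.get("matched_keywords") or [])
    return counts


# Made-up records that use field names from the list above.
items = [
    {"paste_id": "aaa111", "matched_keywords": ["password"], "regex_match_count": 0},
    {"paste_id": "bbb222", "matched_keywords": ["password", "token"], "regex_match_count": 3},
    {"paste_id": "ccc333", "matched_keywords": [], "regex_match_count": 1},
]

top = shortlist(items)
counts = keyword_counts(items)
```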

The RUN_SUMMARY helps explain the run at a glance. It includes:

  • how many archive entries were seen
  • how many were selected for the run
  • how many were processed and saved
  • how many were filtered out
  • whether the source page exposed fewer items than you requested
  • counts for fetch, metadata, or processing failures

Why it is cheap

  • no browser sessions
  • pure HTTP workflow
  • metadata requests are optional
  • lightweight default memory settings
  • raw text can stay out of the dataset unless you explicitly need it
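
A rough way to reason about run size is to count HTTP requests. The sketch below assumes one request for the archive page, one raw-text request per inspected paste, and one extra request per saved item when fetchDetailMetadata is enabled. The archive-page request and the absence of retries are assumptions. The price estimate uses the listed starting price of $0.99 per 1,000 results, so treat it as a lower bound rather than a quote.

```python
def estimate_requests(pastes_inspected, results_saved, fetch_detail_metadata):
    """Estimate HTTP requests for one run, ignoring retries (assumed request model)."""
    archive_requests = 1  # assumption: a single archive page fetch
    metadata_requests = results_saved if fetch_detail_metadata else 0
    return archive_requests + pastes_inspected + metadata_requests


def estimate_result_cost(results_saved, usd_per_thousand=0.99):
    """Estimate the result charge in USD at the listed starting price."""
    return round(results_saved * usd_per_thousand / 1000, 5)


# Quick-start settings: 25 pastes inspected, up to 3 results, no metadata.
quick_requests = estimate_requests(25, 3, False)
with_metadata = estimate_requests(25, 3, True)
quick_cost = estimate_result_cost(3)
```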

Reliability features

  • Apify Proxy support
  • retry handling for temporary blocks and upstream issues
  • proxy session rotation on retries
  • lightweight concurrency tuned for HTTP scraping
  • challenge-page detection
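
The retry and session-rotation behavior listed above can be sketched generically as follows. This is not the actor's implementation. `TemporaryBlock`, `fetch_with_retries`, the backoff schedule, and the way a session ID is generated are all hypothetical. The fetch function is injected so the example runs without network access.

```python
import time
import uuid


class TemporaryBlock(Exception):
    """Raised by a fetch function when the response looks like a block or challenge page."""


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, session_id), retrying with exponential backoff and a new session each time."""
    last_error = None
    for attempt in range(max_attempts):
        session_id = uuid.uuid4().hex[:12]  # fresh proxy session per attempt
        try:
            return fetch(url, session_id)
        except TemporaryBlock as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise last_error


# Demo with a fake fetch that is blocked twice and then succeeds.
calls = []
delays = []


def flaky_fetch(url, session_id):
    calls.append(session_id)
    if len(calls) < 3:
        raise TemporaryBlock("challenge page")
    return "raw paste text"


body = fetch_with_retries(flaky_fetch, "https://pastebin.com/raw/EXAMPLE", sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo instant. In a real run the default `time.sleep` would apply the delays.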

Important limitation

This actor uses Pastebin's public archive page as its discovery source.

That means:

  • it does not backfill historical pastes
  • it does not access private or deleted pastes
  • it can only collect as many public items as the archive page exposes at run time

If Pastebin exposes fewer rows than maxPastesPerRun, the actor records that clearly in RUN_SUMMARY with archive_source_capped and archive_source_note.
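
If you post-process RUN_SUMMARY in your own tooling, a check like the one below can flag capped runs. It assumes the summary is available as a dict in which `archive_source_capped` is a boolean and `archive_source_note` is a string. The field names come from this README, while their types and the helper name are assumptions.

```python
def describe_archive_cap(summary, requested):
    """Return a one-line message saying whether the archive limited the run."""
    if summary.get("archive_source_capped"):
        note = summary.get("archive_source_note") or "no note provided"
        return f"Archive exposed fewer than the {requested} requested entries: {note}"
    return f"Archive supplied all {requested} requested entries."


# Made-up summary values, for illustration only.
capped_summary = {
    "archive_source_capped": True,
    "archive_source_note": "archive page listed 50 rows",
}
message = describe_archive_cap(capped_summary, requested=125)
```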

Local development

Use Python 3.11:

python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3.11 -m unittest discover -s tests -v

Run locally with:

python3.11 -m src

For real scraping runs, run the actor on the Apify platform or with the apify run CLI command, so that proxy and storage behavior match production.

Project files

  • ./main.py: core actor logic
  • ./example_input.json: sample input
  • ./Dockerfile: Apify Python runtime image
  • ./requirements.txt: Python dependencies
  • ./.actor/actor.json: actor definition
  • ./.actor/input_schema.json: input schema
  • ./.actor/output_schema.json: output schema
  • ./.actor/dataset_schema.json: dataset schema