Pastebin Public Archive OSINT Scraper
Pricing: from $0.99 / 1,000 results
Monitor recent public Pastebin pastes, filter them with keywords or regex, and export structured OSINT-ready results on Apify without browser automation.
Developer: Inus Grobler
Scrape the current public Pastebin archive, filter it for useful signals, and export clean results to Apify without using a browser.
This actor is built for people who want recent public Pastebin data in a format they can actually use. It is lightweight, stateless, cheap to run, and easy to schedule.
What this actor does
- checks the current public Pastebin archive
- fetches the raw text for each selected paste
- optionally filters by keywords
- optionally extracts emails, URLs, keys, or other indicators with regex
- saves structured results to the dataset
- saves a plain-language RUN_SUMMARY so you can see what happened in the run
Best use cases
- OSINT monitoring
- credential and leak discovery
- threat-intelligence enrichment
- scheduled monitoring of recent public pastes
- low-cost high-volume checks without browser automation
What to expect
- Every run is independent. The actor does not remember previous runs.
- It uses plain HTTP requests, not Playwright or Puppeteer.
- It is designed to keep Compute Unit usage low.
- It works against the public archive page, so it can only collect what Pastebin exposes at that moment.
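To illustrate what "plain HTTP requests" means in practice, here is a minimal sketch of fetching one paste's raw text with only the Python standard library. It is not the actor's own code: the function names, the timeout, and the User-Agent string are assumptions made for the example, and the sketch deliberately leaves out the proxy and retry handling the actor provides.

```python
import urllib.request

# Pastebin serves the plain text of a public paste under /raw/<paste_id>.
RAW_URL_TEMPLATE = "https://pastebin.com/raw/{paste_id}"


def raw_url_for(paste_id: str) -> str:
    """Build the raw-text URL for a paste ID (hypothetical helper)."""
    return RAW_URL_TEMPLATE.format(paste_id=paste_id.strip("/"))


def fetch_raw_text(paste_id: str, timeout: float = 15.0) -> str:
    """Fetch a paste's raw text with a single HTTP GET, no browser involved."""
    request = urllib.request.Request(
        raw_url_for(paste_id),
        # Hypothetical User-Agent; the actor's real headers are not documented here.
        headers={"User-Agent": "example-osint-client/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_raw_text directly from a local machine may be rate-limited or blocked, which is one reason the hosted actor routes requests through a proxy.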
Quick start
If you just want to confirm it works, run it with the default settings or use this input:
{
  "maxPastesPerRun": 25,
  "maxResults": 3,
  "fetchDetailMetadata": false,
  "keywords": [],
  "regexPatterns": []
}
That gives you a fast, inexpensive test run with a small result set.
Input fields
| Field | What it means |
|---|---|
| maxPastesPerRun | How many recent archive entries the actor should inspect in this run. |
| maxResults | Maximum number of matching items to save to the dataset. |
| fetchDetailMetadata | If enabled, the actor makes one extra request per saved item to fetch the author and posted date. |
| keywords | Optional list of words or phrases. If you provide keywords, only pastes containing at least one of them are saved. |
| regexPatterns | Optional Python regex patterns used to extract structured matches from each saved paste. |
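The keywords and regexPatterns semantics described in the table can be sketched roughly as follows. This is an illustration, not the actor's implementation: the function names are hypothetical, and case-insensitive keyword matching and de-duplicated regex matches are assumptions the table does not confirm.

```python
import re


def matched_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords found in the text (case-insensitive, an assumption)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def should_save(text: str, keywords: list[str]) -> bool:
    """With no keywords, every paste passes; otherwise at least one must match."""
    return not keywords or bool(matched_keywords(text, keywords))


def extract_matches(text: str, patterns: list[str]) -> list[str]:
    """Collect unique full-pattern matches, keeping first-seen order."""
    found: list[str] = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            value = match.group(0)
            if value not in found:
                found.append(value)
    return found


sample = "Login: admin@example.com password=hunter2"
print(should_save(sample, ["token"]))
print(extract_matches(sample, [r"[\w.+-]+@[\w-]+\.[\w.-]+"]))
```

Note the division of labor: keywords decide whether a paste is saved at all, while regex patterns only extract indicators from pastes that are saved.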
Recommended settings
Cheap first test
- maxPastesPerRun: 25
- maxResults: 3
- fetchDetailMetadata: false
- keywords: []
- regexPatterns: []
Broad low-cost monitoring
- maxPastesPerRun: 50 to 125
- maxResults: keep this lower if you only want a shortlist
- fetchDetailMetadata: false
High-signal filtering
- add keywords such as password, token, private key, or brand-specific terms
- add regex patterns for emails, domains, URLs, API keys, or wallet strings
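As a starting point for high-signal filtering, the sketch below checks a few candidate patterns against a made-up sample locally and then prints an input object that uses them. The patterns are deliberately simple illustrations, not the actor's built-in rules; the sample text, the dictionary names, and the assumption that regexPatterns takes a plain list of pattern strings are all part of the example.

```python
import json
import re

# Illustrative patterns only; tune them to your own targets before relying on them.
CANDIDATE_PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "url": r"https?://[^\s\"'<>]+",
    "aws_access_key_id": r"\bAKIA[0-9A-Z]{16}\b",
}

# Made-up sample text for a local dry run.
SAMPLE = "contact ops@example.org, see https://example.org/login key AKIAABCDEFGHIJKLMNOP"


def check_patterns(patterns: dict[str, str], sample: str) -> dict[str, list[str]]:
    """Run every pattern against the sample and return the matches per pattern name."""
    return {name: re.findall(pattern, sample) for name, pattern in patterns.items()}


results = check_patterns(CANDIDATE_PATTERNS, SAMPLE)

# Input object using the field names from the input table.
actor_input = {
    "keywords": ["password", "token", "private key"],
    "regexPatterns": list(CANDIDATE_PATTERNS.values()),
}
print(json.dumps(actor_input, indent=2))
```

Testing patterns locally first is cheaper than discovering a typo after a paid run. The patterns avoid capturing groups so that re.findall returns whole matches.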
Output
The actor gives you two main outputs:
- dataset items with the saved Pastebin records
- a RUN_SUMMARY record in the default key-value store
Useful dataset fields include:
- paste_id
- url
- raw_url
- title
- author
- date_posted
- raw_text_preview
- matched_keywords
- regex_matches_flat
- regex_match_count
- detail_metadata_requested
- fetched_at
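Once the dataset is exported, the fields above lend themselves to simple triage. The following sketch ranks items by regex_match_count and keeps those with any keyword or regex hit. The field names come from the list above, but their types (a list for matched_keywords, an integer for regex_match_count) and the sample values are assumptions.

```python
def shortlist(items: list[dict], min_regex_matches: int = 1) -> list[dict]:
    """Keep items with a keyword hit or enough regex matches, most matches first."""
    hits = [
        item
        for item in items
        if item.get("matched_keywords")
        or (item.get("regex_match_count") or 0) >= min_regex_matches
    ]
    return sorted(hits, key=lambda item: item.get("regex_match_count") or 0, reverse=True)


# Hypothetical exported items, trimmed to the fields used here.
items = [
    {"paste_id": "aaa111", "matched_keywords": [], "regex_match_count": 0},
    {"paste_id": "bbb222", "matched_keywords": ["token"], "regex_match_count": 1},
    {"paste_id": "ccc333", "matched_keywords": [], "regex_match_count": 4},
]

for item in shortlist(items):
    print(item["paste_id"], item["regex_match_count"])
```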
The RUN_SUMMARY helps explain the run at a glance. It includes:
- how many archive entries were seen
- how many were selected for the run
- how many were processed and saved
- how many were filtered out
- whether the source page exposed fewer items than you requested
- counts for fetch, metadata, or processing failures
Why it is cheap
- no browser sessions
- pure HTTP workflow
- metadata requests are optional
- lightweight default memory settings
- raw text can stay out of the dataset unless you explicitly need it
Reliability features
- Apify Proxy support
- retry handling for temporary blocks and upstream issues
- proxy session rotation on retries
- lightweight concurrency tuned for HTTP scraping
- challenge-page detection
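The retry and session-rotation behavior listed above can be pictured with a sketch like the one below. It is a generic pattern under stated assumptions, not the actor's code: the TemporaryBlock exception, the session ID format, the attempt count, and the exponential backoff schedule are all hypothetical, and the fetch callable stands in for whatever performs the proxied request.

```python
import random
import time
from typing import Callable, Optional


class TemporaryBlock(Exception):
    """Hypothetical error for a retryable outcome (block, challenge page, upstream 5xx)."""


def fetch_with_retries(
    fetch: Callable[[str, str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, session_id), using a fresh session ID on every attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        # A new session ID per attempt stands in for proxy session rotation.
        session_id = f"session_{random.randrange(10**8):08d}"
        try:
            return fetch(url, session_id)
        except TemporaryBlock as error:
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, then 2x, then 4x, and so on.
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Injecting sleep and fetch as parameters keeps the retry logic testable without real network calls or real waiting.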
Important limitation
This actor uses Pastebin's public archive page as its discovery source.
That means:
- it does not backfill historical pastes
- it does not access private or deleted pastes
- it can only collect as many public items as the archive page exposes at run time
If Pastebin exposes fewer rows than maxPastesPerRun, the actor records that clearly in RUN_SUMMARY with archive_source_capped and archive_source_note.
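A downstream script might surface that condition as in the sketch below. The keys archive_source_capped and archive_source_note are the ones named above; treating them as a boolean and a string, and the helper name itself, are assumptions.

```python
def describe_archive_cap(summary: dict, requested: int) -> str:
    """Turn the cap-related RUN_SUMMARY fields into a one-line status message."""
    if summary.get("archive_source_capped"):
        note = summary.get("archive_source_note") or "no note provided"
        return f"Archive exposed fewer than the {requested} requested rows: {note}"
    return f"Archive exposed at least the {requested} requested rows."


# Hypothetical summary values for illustration.
example_summary = {
    "archive_source_capped": True,
    "archive_source_note": "archive page listed 50 rows",
}
print(describe_archive_cap(example_summary, requested=125))
```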
Local development
Use Python 3.11:
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3.11 -m unittest discover -s tests -v
Run locally with:
python3.11 -m src
For real scraping runs, run the actor on the Apify platform or with apify run so that proxy and storage behavior match production.
Project files
- ./main.py: core actor logic
- ./example_input.json: sample input
- ./Dockerfile: Apify Python runtime image
- ./requirements.txt: Python dependencies
- ./.actor/actor.json: actor definition
- ./.actor/input_schema.json: input schema
- ./.actor/output_schema.json: output schema
- ./.actor/dataset_schema.json: dataset schema