Pastebin Keyword Search & OSINT Scraper
Pricing
from $0.99 / 1,000 results
Search public Pastebin archive data by keyword or regex, auto-expand into syntax archives, and stream matching OSINT results or a no-match search summary.
Developer
Inus Grobler
Maintained by Community
Monitor public Pastebin archive data on Apify. This actor discovers public paste IDs from Pastebin archive pages, can collect a low-storage URL-only index on a schedule, then uses that indexed URL history during normal client scrape runs to search further back in time. It fetches raw paste text when needed, filters by keywords, extracts indicators with regex, and streams structured records to an Apify dataset.
Use it for lightweight Pastebin keyword search, leak monitoring, threat intelligence enrichment, and scheduled OSINT checks without running a browser.
What You Can Do
- Search discovered public Pastebin pastes for keywords such as company names, domains, usernames, product names, password, token, or api_key.
- Build a cumulative URL-only Pastebin index with scheduled runs so keyword searches can fall back to URLs collected over time when live data has no match.
- Extract emails, URLs, API keys, wallet strings, private-key markers, or custom indicators with regex patterns.
- Monitor the rolling public archive on a schedule and collect matching results over time.
- Add Pastebin syntax archives such as bash, python, javascript, json, sql, xml, or text to discover additional public paste IDs, including some older public pastes.
- Choose recent, expanded, or deep discovery. Deep discovery reads Pastebin's public language archive list and can scan many more syntax archive pages.
- Optionally add externally indexed Pastebin URLs from search result pages when you need a best-effort historical fallback.
- Keep larger runs memory-safe with result streaming, raw-paste size caps, regex timeouts, and configurable request timeouts.
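To make the regex extraction item above concrete, here is a minimal sketch of pulling indicators out of one paste. It uses the two patterns from the keyword-monitoring input in Quick Start; the helper name `extract_indicators`, the per-pattern cap, and the return shape are assumptions for illustration, not the actor's internals.

```python
import re

# Patterns copied from the keyword-monitoring input example in this README.
PATTERNS = [
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",  # email addresses
    r"AKIA[0-9A-Z]{16}",  # AWS-style access key IDs
]


def extract_indicators(text, patterns, max_matches_per_pattern=100):
    """Return (matches per pattern, flat de-duplicated list) for one paste.

    Hypothetical helper: the cap and the return shape are assumptions.
    """
    per_pattern = {}
    flat = []
    for pattern in patterns:
        matches = []
        for match in re.finditer(pattern, text):
            matches.append(match.group(0))
            if len(matches) >= max_matches_per_pattern:
                break  # cap matches so one noisy paste cannot dominate
        per_pattern[pattern] = matches
        for value in matches:
            if value not in flat:
                flat.append(value)
    return per_pattern, flat


sample = "contact admin@example.com, key AKIAABCDEFGHIJKLMNOP, again admin@example.com"
per_pattern, flat = extract_indicators(sample, PATTERNS)
print(flat)
```

The flat list is handy for quick triage, while the per-pattern mapping keeps track of which pattern produced each hit.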
Keyword Search
Yes, this actor can do keyword search on Pastebin data it discovers.
Add one or more values to keywords. The actor fetches the raw text for each discovered public paste and saves only pastes that contain at least one keyword. Keyword matching is case-insensitive by default.
This is not a complete historical Pastebin search engine. Pastebin does not expose a full public historical index through the archive pages. The actor can search the current public archive, common syntax archives, an optional deep sweep of Pastebin's public language archives, and optional externally indexed Pastebin URLs from search result pages. It cannot search private, deleted, expired, password-protected, or unlisted pastes that Pastebin does not expose.
If a keyword run checks live public pastes and finds no matches, it can search the scheduled URL index next, up to maxUrlIndexEntriesToSearch. If it still finds no matches, the actor saves one search_summary dataset record by default. That record explains how many paste IDs were discovered and processed, which keywords were searched, which archive pages or fallback sources were used, and how to widen the run.
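The search order described above (live pastes first, then a bounded slice of the URL index, then a summary record) can be sketched as follows. This is an illustration under stated assumptions: the function names, the `{paste_id: raw_text}` inputs, and the `found_in` key are hypothetical, and the real actor fetches text over HTTP rather than receiving it in dicts.

```python
def match_keywords(text, keywords):
    """Return the keywords found in text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def search_with_fallback(live_texts, indexed_texts, keywords, max_index_entries=100):
    """Search live pastes first, then at most max_index_entries indexed pastes.

    live_texts and indexed_texts are hypothetical {paste_id: raw_text} dicts.
    """
    bounded_index = dict(list(indexed_texts.items())[:max_index_entries])
    for found_in, texts in (("live", live_texts), ("url_index", bounded_index)):
        results = []
        for paste_id, text in texts.items():
            matched = match_keywords(text, keywords)
            if matched:
                results.append({
                    "record_type": "paste",
                    "paste_id": paste_id,
                    "matched_keywords": matched,
                    "found_in": found_in,  # hypothetical key, for illustration
                })
        if results:
            return results  # stop at the first source that yields matches
    # Nothing matched anywhere: emit a single summary record instead.
    return [{"record_type": "search_summary", "searched_keywords": list(keywords)}]


live = {"a1": "nothing relevant here"}
index = {"b2": "Login for Example.COM", "c3": "unrelated text"}
print(search_with_fallback(live, index, ["example.com"]))
```

Because the index is only consulted when the live search is empty, a run with live matches never pays for the fallback.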
Quick Start
Run the actor with the default settings for a small smoke test, or use this input:
{
  "runMode": "scrape",
  "maxPastesPerRun": 25,
  "maxResults": 3,
  "urlIndexMaxRecordsPerRun": 250,
  "urlIndexDatasetName": "pastebin-url-index",
  "urlIndexStateStoreName": "pastebin-url-index-state",
  "urlIndexDeduplicate": true,
  "urlIndexRecentIdLimit": 200000,
  "urlIndexSaveToDefaultDataset": true,
  "useUrlIndex": true,
  "maxUrlIndexEntriesToSearch": 100,
  "fetchDetailMetadata": false,
  "keywords": [],
  "discoveryMode": "expanded",
  "syntaxArchives": [],
  "maxSyntaxArchivesToScan": 75,
  "autoExpandSyntaxArchives": true,
  "saveNoMatchSummary": true,
  "searchEngineDiscovery": false,
  "maxSearchResultsPerKeyword": 20,
  "noMatchStopAfterPastes": 1000,
  "stopAfterPastesWithoutNewMatch": 1000,
  "regexPatterns": [],
  "maxConcurrency": 4,
  "requestTimeoutSecs": 20,
  "maxRawTextBytes": 1000000
}
For a keyword-monitoring run:
{
  "runMode": "scrape",
  "maxPastesPerRun": 100,
  "maxResults": 10,
  "fetchDetailMetadata": false,
  "keywords": ["example.com", "api_key", "password"],
  "discoveryMode": "deep",
  "syntaxArchives": [],
  "maxSyntaxArchivesToScan": 75,
  "autoExpandSyntaxArchives": true,
  "saveNoMatchSummary": true,
  "useUrlIndex": true,
  "maxUrlIndexEntriesToSearch": 100,
  "searchEngineDiscovery": true,
  "maxSearchResultsPerKeyword": 20,
  "noMatchStopAfterPastes": 1000,
  "stopAfterPastesWithoutNewMatch": 1000,
  "regexPatterns": [
    "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
    "AKIA[0-9A-Z]{16}"
  ],
  "maxConcurrency": 4,
  "requestTimeoutSecs": 20,
  "maxRawTextBytes": 1000000
}
For a scheduled URL collection run:
{
  "runMode": "url_index",
  "maxPastesPerRun": 100,
  "urlIndexMaxRecordsPerRun": 100,
  "urlIndexDatasetName": "pastebin-url-index",
  "urlIndexStateStoreName": "pastebin-url-index-state",
  "urlIndexDeduplicate": true,
  "urlIndexRecentIdLimit": 200000,
  "urlIndexSaveToDefaultDataset": true,
  "discoveryMode": "recent",
  "syntaxArchives": [],
  "autoExpandSyntaxArchives": false,
  "searchEngineDiscovery": false,
  "maxConcurrency": 4,
  "requestTimeoutSecs": 20
}
Create an Apify Schedule for this input to append new paste_url records to the named pastebin-url-index dataset. Normal scrape-mode client runs use that same named dataset by default when keywords or regex patterns are supplied, so results can come from both live Pastebin archive pages and URLs collected by previous scheduled runs.
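Scheduled index runs rely on remembering recently seen paste IDs so the same URL is not appended twice. The sketch below shows one way a bounded "recent IDs" memory could behave, in the spirit of urlIndexDeduplicate and urlIndexRecentIdLimit. The class `RecentIdSet` is hypothetical, and how the actor actually stores its dedupe state is not documented here.

```python
from collections import OrderedDict


class RecentIdSet:
    """Remember at most `limit` paste IDs, evicting the oldest first.

    Hypothetical illustration of a bounded dedupe state.
    """

    def __init__(self, limit):
        self.limit = limit
        self._ids = OrderedDict()

    def add_if_new(self, paste_id):
        """Return True if paste_id was not seen recently, and record it."""
        if paste_id in self._ids:
            return False
        self._ids[paste_id] = None
        if len(self._ids) > self.limit:
            self._ids.popitem(last=False)  # drop the oldest remembered ID
        return True


seen = RecentIdSet(limit=2)
incoming = ["aaa", "bbb", "aaa", "ccc", "aaa"]
new_urls = [pid for pid in incoming if seen.add_if_new(pid)]
print(new_urls)
```

With a limit of 2, the final "aaa" is treated as new again because it was evicted when "ccc" arrived. A bounded memory trades perfect deduplication for a fixed storage cost, which is why a generous limit suits long-running schedules.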
Input Fields
| Field | Description |
|---|---|
| runMode | scrape fetches raw paste text and applies filters/extraction. url_index saves lightweight Pastebin URL records only. |
| maxPastesPerRun | Maximum public archive entries to inspect before filtering and result limits are applied. |
| maxResults | Maximum matching paste records to stream in scrape mode. |
| urlIndexMaxRecordsPerRun | Maximum new paste_url records to append in URL index mode. |
| urlIndexDatasetName | Named dataset where scheduled URL index runs append cumulative URL records. |
| urlIndexStateStoreName | Named key-value store used to remember recently indexed paste IDs and avoid duplicates. |
| urlIndexDeduplicate | Skips URLs already seen in the URL index state store. |
| urlIndexRecentIdLimit | Number of recent paste IDs kept for dedupe state. |
| urlIndexSaveToDefaultDataset | Also saves URL index records to the current run dataset for easier inspection. |
| useUrlIndex | In scrape mode, searches previously collected indexed URLs only after the live archive search finds no keyword matches. |
| maxUrlIndexEntriesToSearch | Maximum recent URL index records to search as a fallback. Higher values search further back but cost more. |
| fetchDetailMetadata | Fetches the paste page for each saved result to enrich it with author, title, and posted date. Leave off for cheaper high-volume runs. |
| keywords | Optional words or phrases. If provided, only pastes containing at least one keyword are saved. |
| discoveryMode | Discovery depth. recent checks only the rolling archive, expanded adds common syntax archives, and deep scans Pastebin language archives up to maxSyntaxArchivesToScan. |
| syntaxArchives | Optional Pastebin syntax archive slugs such as bash, python, javascript, json, sql, xml, yaml, or text. Manual values override automatic archive selection. |
| maxSyntaxArchivesToScan | Maximum number of language archive pages to scan in deep mode. Higher values broaden discovery but increase runtime. |
| autoExpandSyntaxArchives | Automatically includes common syntax archives for keyword runs that request more than the main rolling archive exposes. This is useful when maxPastesPerRun is high and syntaxArchives is empty. |
| saveNoMatchSummary | Saves one search_summary dataset record when a keyword run finds no matching paste records. |
| searchEngineDiscovery | Optional best-effort fallback that searches externally indexed Pastebin URLs with multiple query variants for each keyword and adds discovered paste IDs to the run. |
| maxSearchResultsPerKeyword | Maximum externally indexed Pastebin URLs to add per keyword when searchEngineDiscovery is enabled. |
| noMatchStopAfterPastes | Cost guard for keyword searches. Stops early if zero paste records have matched after this many processed pastes. Set to 0 to disable. |
| stopAfterPastesWithoutNewMatch | Cost guard for sparse keyword searches. After the first saved match, stops if this many additional processed pastes do not produce another saved match. Set to 0 to disable. |
| regexPatterns | Optional Python-compatible regex patterns used to extract emails, keys, URLs, or other structured matches from saved pastes. |
| maxConcurrency | Maximum raw paste fetches running at the same time. Lower values reduce memory and blocking risk. |
| requestTimeoutSecs | Maximum seconds to wait for one Pastebin/archive/proxy request before retrying. |
| maxRawTextBytes | Largest raw paste response held in memory. Larger pastes are skipped and counted as oversized_pastes in RUN_SUMMARY. |
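The two cost guards in the table interact: noMatchStopAfterPastes applies while nothing has been saved, and stopAfterPastesWithoutNewMatch applies after the first saved match. A minimal sketch of that decision, assuming the hypothetical counters `processed`, `saved`, and `since_last_match` (the actor's real bookkeeping may differ):

```python
def should_stop(processed, saved, since_last_match,
                no_match_stop_after=1000, stop_after_without_new_match=1000):
    """Return True when either cost guard would end the run early.

    processed: pastes processed so far
    saved: matching records saved so far
    since_last_match: pastes processed since the most recent saved match
    A guard value of 0 disables that guard, as in the field descriptions.
    """
    if saved == 0:
        # Zero-match guard: give up once enough pastes produced nothing.
        return no_match_stop_after > 0 and processed >= no_match_stop_after
    # Sparse-match guard: give up once the last match is far enough behind.
    return (stop_after_without_new_match > 0
            and since_last_match >= stop_after_without_new_match)


print(should_stop(processed=1000, saved=0, since_last_match=1000))
print(should_stop(processed=1500, saved=1, since_last_match=200))
```

Keeping both guards enabled bounds the cost of a run in the two common failure shapes: a keyword that never matches, and a keyword that matches once and then goes quiet.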
Recommended Settings
Cheap First Test
- maxPastesPerRun: 25
- maxResults: 3
- fetchDetailMetadata: false
- keywords: []
- discoveryMode: expanded
- syntaxArchives: []
- maxSyntaxArchivesToScan: 75
- autoExpandSyntaxArchives: true
- saveNoMatchSummary: true
- searchEngineDiscovery: false
- maxSearchResultsPerKeyword: 20
- noMatchStopAfterPastes: 1000
- stopAfterPastesWithoutNewMatch: 1000
- regexPatterns: []
- maxConcurrency: 4
- requestTimeoutSecs: 20
- maxRawTextBytes: 1000000
Scheduled Pastebin Monitoring
- maxPastesPerRun: 50 to 150
- maxResults: set to the number of matches you want per run
- fetchDetailMetadata: false for lowest cost, true when author/date matters
- keywords: use company names, domains, product names, or indicator terms
- useUrlIndex: true to search scheduled URL index records only when live archive data has no keyword matches
- maxUrlIndexEntriesToSearch: 100 for low-cost fallback searches, 500 to 1000 when you intentionally want more historical coverage
- discoveryMode: expanded
- autoExpandSyntaxArchives: true
- saveNoMatchSummary: true
- searchEngineDiscovery: false for archive-only monitoring, true for best-effort indexed URL discovery
- noMatchStopAfterPastes: 500 to 1000 to avoid expensive zero-match runs
- stopAfterPastesWithoutNewMatch: 500 to 1000 to stop sparse runs after an early match
- maxConcurrency: 2 to 4
Scheduled URL Index
- runMode: url_index
- maxPastesPerRun: 50 to 150
- urlIndexMaxRecordsPerRun: match or slightly exceed maxPastesPerRun
- urlIndexDatasetName: keep a stable name such as pastebin-url-index
- urlIndexStateStoreName: keep a stable name such as pastebin-url-index-state
- urlIndexDeduplicate: true
- discoveryMode: recent for latest rolling archive collection
- autoExpandSyntaxArchives: false for lowest-cost latest-URL checks
- Schedule interval: every few minutes if you want a denser URL history
Older Public Paste Discovery
- Set discoveryMode to deep to scan Pastebin's public language archive list.
- Keep maxSyntaxArchivesToScan at 75 for a broad first pass, or increase it up to 266 for the widest syntax archive sweep.
- Add targeted syntaxArchives values such as bash, python, javascript, json, sql, xml, yaml, or text when you know the likely paste category.
- Increase maxPastesPerRun when you use deep discovery. The actor can inspect up to 5000 discovered paste IDs per run.
- Enable searchEngineDiscovery as an extra historical fallback when archive pages do not find enough matches.
- Keep noMatchStopAfterPastes enabled for broad keyword searches. Set it to 0 only when you intentionally want to exhaust the full run budget even if nothing matches.
- Keep stopAfterPastesWithoutNewMatch enabled for broad keyword searches. This prevents one early match from causing the actor to scan the entire remaining budget while looking for another match.
- Keep maxResults modest when scanning many syntax archives.
- Use keyword filters so the actor saves only relevant records from older archive pages.
Syntax archives and the scheduled URL index can reveal older public paste IDs than the main rolling archive, but they are not a full Pastebin history. The URL index only reaches back to URLs collected since the schedule started.
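If you build deep-discovery inputs in code, it can help to clamp values to the limits mentioned in this section (266 syntax archives, 5000 paste IDs per run). The helper below is a hypothetical convenience, not part of the actor; its name, defaults, and the choice to require at least one keyword are assumptions.

```python
MAX_SYNTAX_ARCHIVES = 266  # widest sweep mentioned in this section
MAX_PASTES_PER_RUN = 5000  # per-run inspection ceiling mentioned in this section


def build_deep_input(keywords, max_archives=75, max_pastes=1000):
    """Build a scrape-mode input for deep discovery, clamped to the limits above.

    Hypothetical helper; requiring a keyword reflects the advice to filter
    older archive pages, not a rule enforced by the actor.
    """
    if not keywords:
        raise ValueError("provide at least one keyword for deep discovery")
    return {
        "runMode": "scrape",
        "discoveryMode": "deep",
        "keywords": list(keywords),
        "maxSyntaxArchivesToScan": max(1, min(max_archives, MAX_SYNTAX_ARCHIVES)),
        "maxPastesPerRun": max(1, min(max_pastes, MAX_PASTES_PER_RUN)),
        "maxResults": 10,
        "noMatchStopAfterPastes": 1000,
        "stopAfterPastesWithoutNewMatch": 1000,
    }


deep_input = build_deep_input(["example.com"], max_archives=500, max_pastes=9000)
print(deep_input["maxSyntaxArchivesToScan"], deep_input["maxPastesPerRun"])
```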
Output
The main output is the default Apify dataset. In URL index mode, records are also appended to the configured named dataset so scheduled runs build one cumulative URL history. In scrape mode, indexed URLs are searched as a bounded fallback when useUrlIndex is enabled, keywords are provided, and the live archive search saves no matching paste records. Each saved record can include:
- record_type
- paste_id
- url
- raw_url
- title
- author
- date_posted
- raw_text
- raw_text_preview
- raw_text_length
- syntax
- archive_relative_time
- archive_source
- matched_keywords
- matched_keyword_count
- extracted_regex_data
- regex_matches_flat
- regex_match_count
- detail_metadata_requested
- source
- discovered_at
- fetched_at
For paste records, record_type is paste. URL index records use record_type paste_url and contain only metadata such as paste_id, url, raw_url, title, syntax, archive_source, archive_relative_time, and discovered_at. If a keyword run finds no matching paste records, the actor can save a search_summary record with fields such as summary_message, searched_keywords, discovery_sources, archive_entries_seen, processed_entries, effective_syntax_archives, and suggested_next_steps.
The actor also writes a RUN_SUMMARY record to the default key-value store. It reports how many archive entries were seen, selected, processed, saved, filtered out, skipped as oversized, or failed. In URL index mode it also reports the named URL index dataset, duplicate count, and dedupe state store.
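Because one dataset can contain paste, paste_url, and search_summary records, downstream code usually branches on record_type before reading other fields. A minimal sketch, using only field names listed above; the function name `summarize_items` and the sample values are made up for illustration.

```python
def summarize_items(items):
    """Group dataset items by record_type and keep the most useful fields."""
    report = {"pastes": [], "urls": [], "summaries": []}
    for item in items:
        record_type = item.get("record_type")
        if record_type == "paste":
            report["pastes"].append(
                (item.get("paste_id"), item.get("matched_keywords", []))
            )
        elif record_type == "paste_url":
            report["urls"].append(item.get("url"))
        elif record_type == "search_summary":
            report["summaries"].append(item.get("summary_message"))
    return report


# Sample items with invented values, shaped like the records described above.
sample_items = [
    {"record_type": "paste", "paste_id": "abc123", "matched_keywords": ["password"]},
    {"record_type": "paste_url", "paste_id": "def456", "url": "https://pastebin.com/def456"},
    {"record_type": "search_summary", "summary_message": "No matches found."},
]
print(summarize_items(sample_items))
```

Using `.get()` throughout keeps the loop safe when a record type omits a field, for example a search_summary record without a paste_id.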
Why It Is Cost Efficient
- Uses plain HTTP requests instead of browser automation.
- URL index mode saves only small metadata records and does not fetch raw paste text.
- Streams dataset items as each paste finishes instead of buffering the whole run in memory.
- Keeps full raw text out of memory beyond the configured raw-paste size cap.
- Lets you disable extra metadata page requests.
- Uses regex timeouts, regex match caps, and raw response size limits to avoid runaway runs.
- Lets you tune maxConcurrency, requestTimeoutSecs, and maxRawTextBytes for larger jobs.
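The raw-paste size cap mentioned above can be illustrated with a chunked read that gives up as soon as the limit is exceeded, so an oversized body is never fully held in memory. This is a generic sketch over any binary file-like object; `read_capped` is a hypothetical name, and the actor's own HTTP handling is not shown in this README.

```python
import io


def read_capped(stream, max_bytes, chunk_size=65536):
    """Read a binary stream in chunks, returning None if it exceeds max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        total += len(chunk)
        if total > max_bytes:
            return None  # a caller would count this as an oversized paste
        chunks.append(chunk)
    return b"".join(chunks)


small = read_capped(io.BytesIO(b"a" * 10), max_bytes=10, chunk_size=4)
too_big = read_capped(io.BytesIO(b"a" * 11), max_bytes=10, chunk_size=4)
print(small, too_big)
```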
Python API Example
import os

from apify_client import ApifyClient

# Read the API token from the environment rather than hard-coding it.
client = ApifyClient(os.environ["APIFY_TOKEN"])

run_input = {
    "runMode": "scrape",
    "maxPastesPerRun": 100,
    "maxResults": 10,
    "fetchDetailMetadata": False,
    "keywords": ["example.com", "api_key", "password"],
    "discoveryMode": "deep",
    "syntaxArchives": [],
    "maxSyntaxArchivesToScan": 75,
    "autoExpandSyntaxArchives": True,
    "saveNoMatchSummary": True,
    "useUrlIndex": True,
    "maxUrlIndexEntriesToSearch": 100,
    "searchEngineDiscovery": True,
    "maxSearchResultsPerKeyword": 20,
    "noMatchStopAfterPastes": 1000,
    "stopAfterPastesWithoutNewMatch": 1000,
    "regexPatterns": [
        r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        r"AKIA[0-9A-Z]{16}",
    ],
    "maxConcurrency": 4,
    "requestTimeoutSecs": 20,
    "maxRawTextBytes": 1_000_000,
}

# Start the actor and wait for the run to finish.
run = client.actor("thescrapelab/pastebin-osint-scraper").call(run_input=run_input)

# Use .get() for every field: a no-match search_summary record may lack paste_id and url.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("paste_id"), item.get("url"), item.get("matched_keywords"), item.get("regex_matches_flat"))
For URL indexing, use runMode="url_index" and keep the same urlIndexDatasetName on every scheduled run:
run_input = {
    "runMode": "url_index",
    "maxPastesPerRun": 100,
    "urlIndexMaxRecordsPerRun": 100,
    "urlIndexDatasetName": "pastebin-url-index",
    "urlIndexStateStoreName": "pastebin-url-index-state",
    "urlIndexDeduplicate": True,
    "useUrlIndex": True,
    "maxUrlIndexEntriesToSearch": 100,
    "discoveryMode": "recent",
    "autoExpandSyntaxArchives": False,
}

# Run one URL index pass; scheduled runs reuse the same named dataset.
client.actor("thescrapelab/pastebin-osint-scraper").call(run_input=run_input)

# Open the cumulative named dataset and list the indexed URLs.
url_index_dataset = client.datasets().get_or_create(name="pastebin-url-index")
for item in client.dataset(url_index_dataset["id"]).iterate_items():
    print(item["discovered_at"], item["paste_id"], item["url"])
Limitations
- The actor only uses public Pastebin surfaces.
- It cannot access private, deleted, expired, password-protected, or unlisted pastes that Pastebin does not expose.
- The main archive is a rolling page, so it only shows what Pastebin exposes at run time.
- Pastebin's native search page is not treated as a reliable public source because unauthenticated requests can redirect to login or challenge pages.
- Pastebin does not expose a usable public sitemap index at the standard /sitemap.xml location.
- Search-engine discovery is best effort. Search engines may throttle, omit, reorder, or stop returning Pastebin results.
- Syntax archives and deep discovery can help discover some older public paste IDs, but they do not provide complete historical backfill.
- Pastebin may expose fewer rows than requested. When that happens, the actor records the reason in RUN_SUMMARY using archive_source_capped and archive_source_note.
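To check whether a run hit one of these limits, you can read RUN_SUMMARY from the run's default key-value store and inspect the fields named above. The sketch below is hedged: only oversized_pastes, archive_source_capped, and archive_source_note are named in this README, treating archive_source_capped as a truthy flag is an assumption, and both function names are hypothetical.

```python
def describe_run_summary(summary):
    """Turn a RUN_SUMMARY-style dict into short human-readable notes.

    Assumes archive_source_capped is a truthy flag; that is not confirmed
    by this README.
    """
    notes = []
    oversized = summary.get("oversized_pastes", 0)
    if oversized:
        notes.append(f"{oversized} paste(s) skipped for exceeding maxRawTextBytes")
    if summary.get("archive_source_capped"):
        reason = summary.get("archive_source_note") or "no note provided"
        notes.append(f"archive returned fewer rows than requested: {reason}")
    return notes


def fetch_run_summary(client, run):
    """Load RUN_SUMMARY from a finished run's default key-value store.

    `client` is an apify_client.ApifyClient and `run` is the dict returned
    by client.actor(...).call(...). Returns {} if the record is missing.
    """
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("RUN_SUMMARY")
    return record["value"] if record else {}


# Invented sample values, shaped like the fields described above.
example = {
    "oversized_pastes": 2,
    "archive_source_capped": True,
    "archive_source_note": "only 50 rows exposed",
}
print(describe_run_summary(example))
```

In a real workflow you would pass the result of `fetch_run_summary(client, run)` to `describe_run_summary` after the actor call completes.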
Use this actor only for lawful monitoring, research, and security workflows where you are allowed to process the resulting data.
