Wayback Machine Bulk Lookup
Pricing
Pay per event
Look up Wayback Machine snapshots for any URL or list of URLs. Returns capture timeline, optional snapshot markdown, and live-vs-snapshot diff. Date range filtering, capture limit, bulk input. Built for OSINT, journalism, SEO link-rot recovery, and legal evidence.
Developer
BowTiedRaccoon
Maintained by Community
Look up Wayback Machine (archive.org) snapshots for any URL or list of URLs. Returns the full capture timeline, optional snapshot content converted from HTML to markdown, and a live-vs-snapshot text diff. Built for OSINT analysts, journalists verifying sources, SEO teams recovering from link rot, and legal teams collecting evidence.
What this actor does
For each input URL, the actor:
- Queries the Wayback CDX API to retrieve the snapshot index, filtered to your date range and capped at your capture limit
- Optionally fetches snapshot HTML for each capture and converts it to markdown (for reading or archiving)
- Optionally fetches the current live URL and computes a line-level text diff against the most recent snapshot (to detect page changes)
Each output record contains the full snapshot timeline plus optional diff and content fields.
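The first step above can be illustrated with a minimal sketch of a CDX query. This is not the actor's own code: the helper names `build_cdx_url` and `parse_cdx_rows` are hypothetical, and the sketch assumes the public CDX endpoint with its `url`, `output`, `limit`, `from`, and `to` query parameters and its JSON output format (a header row followed by data rows).

```python
from datetime import date
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, date_from=None, date_to=None, capture_limit=100):
    """Build a CDX query URL for one page, with optional ISO date bounds."""
    params = {"url": url, "output": "json", "limit": capture_limit}
    if date_from:
        # The CDX API takes compact timestamps such as 20230101.
        params["from"] = date.fromisoformat(date_from).strftime("%Y%m%d")
    if date_to:
        params["to"] = date.fromisoformat(date_to).strftime("%Y%m%d")
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn CDX JSON output (header row plus data rows) into capture dicts."""
    if not rows:
        return []
    header, *data = rows
    captures = []
    for row in data:
        record = dict(zip(header, row))
        captures.append({
            "timestamp": record["timestamp"],
            "archiveUrl": (
                "https://web.archive.org/web/"
                f"{record['timestamp']}/{record['original']}"
            ),
            "status": record["statuscode"],
            "mimetype": record["mimetype"],
        })
    return captures
```

The URL returned by `build_cdx_url` could be fetched with any HTTP client and the decoded JSON body passed to `parse_cdx_rows`. Note that the raw CDX timestamp is a compact 14-digit string; converting it to ISO 8601, as the actor's output does, is left out of this sketch.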
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array of strings | none | Required. URLs to look up in the Wayback Machine |
| maxItems | integer | none | Maximum total output records across all URLs |
| dateFrom | string | none | Earliest snapshot date to include (ISO date, e.g. 2020-01-01) |
| dateTo | string | none | Latest snapshot date to include (ISO date, e.g. 2024-12-31) |
| captureLimit | integer | 100 | Max snapshots per URL |
| fetchSnapshotContent | boolean | false | Download snapshot HTML and convert to markdown |
| diffWithLive | boolean | false | Compute text diff between latest snapshot and current live URL |
| proxyConfiguration | object | none | Optional proxy config (usually not needed for Wayback) |
Example input:
{
  "urls": [
    "https://example.com/news/2024-article",
    "https://example.com/about"
  ],
  "dateFrom": "2023-01-01",
  "dateTo": "2024-12-31",
  "captureLimit": 50,
  "diffWithLive": true
}
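When the input is assembled programmatically, a small client-side check can catch malformed dates before a run starts. The sketch below is an illustration only: `build_run_input` is a hypothetical helper, and the checks it performs (non-empty `urls`, valid ISO dates, `dateFrom` not later than `dateTo`) are assumptions, since the actor's own validation rules are not documented here.

```python
from datetime import date


def build_run_input(urls, date_from=None, date_to=None, capture_limit=100,
                    fetch_snapshot_content=False, diff_with_live=False):
    """Assemble an input object using the field names from the input table."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    run_input = {
        "urls": list(urls),
        "captureLimit": int(capture_limit),
        "fetchSnapshotContent": bool(fetch_snapshot_content),
        "diffWithLive": bool(diff_with_live),
    }
    # date.fromisoformat raises ValueError on malformed dates.
    start = date.fromisoformat(date_from) if date_from else None
    end = date.fromisoformat(date_to) if date_to else None
    if start and end and start > end:
        raise ValueError("dateFrom must not be later than dateTo")
    if start:
        run_input["dateFrom"] = start.isoformat()
    if end:
        run_input["dateTo"] = end.isoformat()
    return run_input
```

The returned dictionary can be serialized with `json.dumps` and pasted into the console, or passed to whichever client is used to start the run.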
Output
One record per input URL.
| Field | Type | Description |
|---|---|---|
| url | string | The input URL |
| snapshotCount | number | Number of snapshots found in the date range |
| firstCaptured | string | Earliest snapshot timestamp (ISO 8601) |
| lastCaptured | string | Latest snapshot timestamp (ISO 8601) |
| captures | array | Snapshot entries, each a JSON-encoded string with timestamp, archiveUrl, status, mimetype, and optionally contentMarkdown |
| diff | object | { addedLines, removedLines, changedRatio }; only present when diffWithLive=true |
| liveStatus | number | Current HTTP status of the live URL; only present when diffWithLive=true |
| finalLiveUrl | string | Final URL after redirects |
| status | string | success, timeout, or error |
| errorMsg | string | Error details on failure, null on success |
Example output record:
{
  "url": "https://example.com/news/2024-article",
  "snapshotCount": 14,
  "firstCaptured": "2024-03-12T08:42:00Z",
  "lastCaptured": "2026-04-29T22:11:00Z",
  "captures": [
    "{\"timestamp\":\"2026-04-29T22:11:00Z\",\"archiveUrl\":\"https://web.archive.org/web/20260429221100/https://example.com/news/2024-article\",\"status\":200,\"mimetype\":\"text/html\"}"
  ],
  "diff": { "addedLines": 12, "removedLines": 3, "changedRatio": 0.04 },
  "liveStatus": 200,
  "finalLiveUrl": "https://example.com/news/2024-article",
  "status": "success",
  "errorMsg": null
}
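Because each entry in `captures` is a JSON-encoded string rather than a nested object, consumers need a second decoding pass. A minimal sketch, using a trimmed copy of the example record; `decode_captures` is a hypothetical helper name:

```python
import json


def decode_captures(record):
    """Return the record's captures as dicts instead of JSON-encoded strings."""
    decoded = []
    for entry in record.get("captures", []):
        # Tolerate entries that are already objects, in case the shape differs.
        decoded.append(json.loads(entry) if isinstance(entry, str) else entry)
    return decoded


record = {
    "url": "https://example.com/news/2024-article",
    "captures": [
        '{"timestamp":"2026-04-29T22:11:00Z",'
        '"archiveUrl":"https://web.archive.org/web/20260429221100/'
        'https://example.com/news/2024-article",'
        '"status":200,"mimetype":"text/html"}'
    ],
}

captures = decode_captures(record)
print(captures[0]["archiveUrl"])
```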
Dataset views
The actor produces two dataset views in the Apify console:
- Capture Timeline: url, snapshotCount, firstCaptured, lastCaptured, captures
- Live vs Snapshot Diff: url, liveStatus, diff, lastCaptured
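To make the `diff` fields concrete, here is one way a line-level diff with the same field names could be computed using Python's standard `difflib`. This is a sketch under assumptions, not the actor's implementation: the helper name `line_diff_stats` is hypothetical, and the definition of `changedRatio` used here (one minus the `SequenceMatcher` similarity ratio) is a guess, since the actor's exact formula is not documented.

```python
import difflib


def line_diff_stats(snapshot_text, live_text):
    """Count added and removed lines between a snapshot and the live page."""
    snapshot_lines = snapshot_text.splitlines()
    live_lines = live_text.splitlines()
    matcher = difflib.SequenceMatcher(
        a=snapshot_lines, b=live_lines, autojunk=False
    )
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in the snapshot
        if tag in ("replace", "insert"):
            added += j2 - j1    # lines present only on the live page
    return {
        "addedLines": added,
        "removedLines": removed,
        # Assumed definition: 0.0 means identical, 1.0 means nothing in common.
        "changedRatio": round(1.0 - matcher.ratio(), 4),
    }
```

A modified line counts once as removed and once as added, which is the usual convention for line-level diffs.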
Rate limits and performance
The actor respects Wayback Machine's rate limits:
- CDX API queries: at most ~9 requests/second (110 ms minimum delay between requests)
- Snapshot content fetches: at most ~1.4 requests/second (700 ms minimum delay between requests)
For large batches with fetchSnapshotContent=true, expect longer runtimes. The default timeout is 2 hours. Start with a small captureLimit (e.g. 10) to estimate runtime before running at full scale.
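A rough lower bound on runtime follows directly from the minimum delays above. The sketch below assumes requests are issued sequentially, one CDX query per URL, every URL has at least `captureLimit` captures, and network latency is zero, so real runs will take longer. `estimate_min_runtime_seconds` is a hypothetical helper, not part of the actor.

```python
CDX_DELAY_MS = 110       # minimum delay between CDX queries
SNAPSHOT_DELAY_MS = 700  # minimum delay between snapshot content fetches


def estimate_min_runtime_seconds(url_count, capture_limit,
                                 fetch_snapshot_content=False):
    """Lower bound on runtime implied by the minimum request delays alone."""
    total_ms = url_count * CDX_DELAY_MS
    if fetch_snapshot_content:
        # Worst case: every URL yields capture_limit snapshots to download.
        total_ms += url_count * capture_limit * SNAPSHOT_DELAY_MS
    return total_ms / 1000


full = estimate_min_runtime_seconds(1000, 100, fetch_snapshot_content=True)
small = estimate_min_runtime_seconds(1000, 10, fetch_snapshot_content=True)
print(f"captureLimit=100: {full / 3600:.1f} h, captureLimit=10: {small / 3600:.1f} h")
```

Under these assumptions, 1,000 URLs with captureLimit 100 need at least about 19.5 hours of content fetching, far beyond the 2-hour default timeout, while captureLimit 10 brings the bound just under 2 hours.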
Use cases
- OSINT / research: Check whether a source URL existed, when it was captured, and how its content has changed
- Journalism: Verify archived versions of articles or government pages for fact-checking
- SEO / link-rot recovery: Find archived versions of dead inbound links and plan redirects or outreach
- Legal evidence: Retrieve timestamped snapshots of web pages for documentation
- Web archiving: Bulk-check coverage for a list of URLs before deeper archiving work