Wayback Machine Archive Scraper
Pricing: $1.00 / 1,000 snapshots retrieved
Fetch historical snapshots of any webpage from the Internet Archive. Perfect for digital forensics and tracking deleted content.
Developer: Andok
Last modified: 7 days ago
Wayback Machine Scraper for Historical Snapshots
Retrieve historical web page snapshots from the Internet Archive for compliance checks, competitive due diligence, and content recovery. Feed it a list of URLs and get back every archived snapshot with timestamps, status codes, and archive links — or optionally fetch the full HTML of the latest snapshot. Built on the official Wayback CDX API for accurate, structured results.
Features
- Bulk URL processing — check snapshot history for dozens of URLs in a single run
- Date range filtering — narrow results to a specific time window with `from` and `to` parameters
- Deduplication — collapse identical snapshots by digest to reduce noise
- Status code filtering — only return snapshots with specific HTTP status codes (default: 200)
- HTML retrieval — optionally fetch the archived HTML content for the most recent snapshot
- Concurrent processing — configurable parallelism for faster batch runs
- Structured metadata — every snapshot includes timestamp, original URL, MIME type, and archive URL
Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | Yes | `["https://example.com"]` | List of URLs to look up in the Wayback Machine |
| `url` | string | No | — | Single URL (backwards compatible, merged with `urls`) |
| `from` | string | No | — | Start date for snapshot range (format: YYYY or YYYYMMDDhhmmss) |
| `to` | string | No | — | End date for snapshot range (format: YYYY or YYYYMMDDhhmmss) |
| `limit` | integer | No | 50 | Maximum snapshots to return per URL (1-5000) |
| `collapse` | string | No | `digest` | Collapse parameter to deduplicate snapshots (e.g. `digest`, `timestamp:8`) |
| `filterStatus` | string | No | `statuscode:200` | HTTP status filter for snapshots (e.g. `statuscode:200`) |
| `includeHtml` | boolean | No | `false` | Fetch the archived HTML content for the latest snapshot (experimental) |
| `timeoutSeconds` | integer | No | 20 | Per-request timeout in seconds (1-120) |
| `concurrency` | integer | No | 5 | Number of URLs to process in parallel (1-25) |
Input Example
{"urls": ["https://example.com", "https://news.ycombinator.com"],"from": "2023","to": "2025","limit": 10,"includeHtml": false}
Output
Each dataset item represents one input URL with its snapshot history. Key fields:
- `inputUrl` (string) — the URL that was looked up
- `snapshotCount` (number) — total number of matching snapshots found
- `snapshots` (array) — list of snapshot objects with `timestamp`, `original`, `statuscode`, `mimetype`, `length`, and `archiveUrl`
- `latestSnapshot` (object) — the most recent snapshot, or `null` if none found
- `latestHtml` (string) — archived HTML content (only when `includeHtml` is enabled)
- `checkedAt` (string) — ISO timestamp of when the check was performed
- `error` (string) — error message if the lookup failed, otherwise `null`
Output Example
{"inputUrl": "https://example.com","snapshotCount": 3,"snapshots": [{"timestamp": "20250110153022","original": "https://example.com","statuscode": 200,"mimetype": "text/html","length": 1256,"archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com"}],"latestSnapshot": {"timestamp": "20250110153022","original": "https://example.com","statuscode": 200,"mimetype": "text/html","length": 1256,"archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com"},"latestHtml": null,"checkedAt": "2025-01-20T12:00:00.000Z","error": null}
Pricing
| Event | Cost |
|---|---|
| Snapshot Retrieved | Pay-per-event (see actor pricing page) |
Use Cases
- Compliance & legal — retrieve historical versions of terms of service, privacy policies, or product pages
- Competitive due diligence — review how a competitor's website evolved over time before a deal or partnership
- Content recovery — recover lost or deleted web pages from the Internet Archive
- SEO auditing — check when a page was last crawled and compare historical content changes
- Brand monitoring — verify historical claims or track how a brand's messaging changed
- Research & journalism — access archived versions of news articles or government pages
Related Actors
| Actor | What it adds |
|---|---|
| Google News Scraper | Monitor current news coverage alongside historical archive lookups |
| Broken Links Checker | Find dead links on your site, then recover them via Wayback Machine |
| Sitemap Extractor | Extract all URLs from a sitemap to feed into bulk Wayback lookups |
Notes
- The Wayback Machine CDX API is free but may throttle under heavy load. Use the `concurrency` setting conservatively for large batches.
- The `includeHtml` option is experimental and may fail for very large pages or pages with complex JavaScript rendering.
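Bounded parallelism of the kind the `concurrency` setting describes is commonly implemented with a semaphore. A standalone asyncio sketch of the pattern — illustrative only, not the actor's implementation; `demo_worker` stands in for a real CDX request:

```python
import asyncio

async def gather_limited(items, worker, concurrency=5):
    """Run worker(item) for every item, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(guarded(i) for i in items))

async def demo_worker(url):
    await asyncio.sleep(0)  # stand-in for an HTTP request to the CDX API
    return f"checked {url}"

results = asyncio.run(
    gather_limited(["https://example.com", "https://news.ycombinator.com"],
                   demo_worker, concurrency=2))
print(results)
```

Keeping the semaphore small (the actor defaults to 5) is what makes a large batch polite toward the archive's servers.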