Wayback Machine Archive Scraper

Pricing

$1.00 / 1,000 snapshots retrieved

Fetch historical snapshots of any webpage from the Internet Archive. Perfect for digital forensics and tracking deleted content.


Developer

Andok


Maintained by Community


Wayback Machine Scraper for Historical Snapshots

Retrieve historical web page snapshots from the Internet Archive for compliance checks, competitive due diligence, and content recovery. Feed it a list of URLs and get back every archived snapshot with timestamps, status codes, and archive links — or optionally fetch the full HTML of the latest snapshot. Built on the official Wayback CDX API for accurate, structured results.

Features

  • Bulk URL processing — check snapshot history for dozens of URLs in a single run
  • Date range filtering — narrow results to a specific time window with the `from` and `to` parameters
  • Deduplication — collapse identical snapshots by digest to reduce noise
  • Status code filtering — only return snapshots with specific HTTP status codes (default: 200)
  • HTML retrieval — optionally fetch the archived HTML content for the most recent snapshot
  • Concurrent processing — configurable parallelism for faster batch runs
  • Structured metadata — every snapshot includes timestamp, original URL, MIME type, and archive URL
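Under the hood, results come from the official Wayback CDX API. The sketch below shows how the input parameters described above map onto a CDX query and how a snapshot's replay URL is composed; the function names are illustrative, not the actor's internals:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_url(url, from_=None, to=None, limit=50,
                  collapse="digest", filter_status="statuscode:200"):
    """Build a Wayback CDX API query mirroring the actor's defaults."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_:
        params["from"] = from_
    if to:
        params["to"] = to
    if collapse:
        params["collapse"] = collapse
    if filter_status:
        params["filter"] = filter_status
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def archive_url(timestamp, original):
    """Compose the web.archive.org replay URL for one snapshot row."""
    return f"https://web.archive.org/web/{timestamp}/{original}"

print(build_cdx_url("https://example.com", from_="2023", to="2025", limit=10))
```

Fetching the resulting URL returns the raw snapshot rows that the actor then normalizes into the structured output documented below.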

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `urls` | array | Yes | `["https://example.com"]` | List of URLs to look up in the Wayback Machine |
| `url` | string | No | | Single URL (backwards compatible, merged with `urls`) |
| `from` | string | No | | Start date for snapshot range (format: `YYYY` or `YYYYMMDDhhmmss`) |
| `to` | string | No | | End date for snapshot range (format: `YYYY` or `YYYYMMDDhhmmss`) |
| `limit` | integer | No | `50` | Maximum snapshots to return per URL (1-5000) |
| `collapse` | string | No | `digest` | Collapse parameter to deduplicate snapshots (e.g. `digest`, `timestamp:8`) |
| `filterStatus` | string | No | `statuscode:200` | HTTP status filter for snapshots (e.g. `statuscode:200`) |
| `includeHtml` | boolean | No | `false` | Fetch the archived HTML content for the latest snapshot (experimental) |
| `timeoutSeconds` | integer | No | `20` | Per-request timeout in seconds (1-120) |
| `concurrency` | integer | No | `5` | Number of URLs to process in parallel (1-25) |

Input Example

```json
{
  "urls": ["https://example.com", "https://news.ycombinator.com"],
  "from": "2023",
  "to": "2025",
  "limit": 10,
  "includeHtml": false
}
```
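The ranges documented in the input table can be checked client-side before starting a run. A minimal sketch, assuming the constraints listed above; `validate_input` is a hypothetical helper, not part of the actor:

```python
def validate_input(run_input):
    """Check a run input object against the documented constraints.

    Illustrative only; the actor performs its own validation.
    """
    errors = []
    urls = list(run_input.get("urls", []))
    if run_input.get("url"):  # backwards-compatible single URL, merged with urls
        urls.append(run_input["url"])
    if not urls:
        errors.append("at least one URL is required")
    if not 1 <= run_input.get("limit", 50) <= 5000:
        errors.append("limit must be between 1 and 5000")
    if not 1 <= run_input.get("timeoutSeconds", 20) <= 120:
        errors.append("timeoutSeconds must be between 1 and 120")
    if not 1 <= run_input.get("concurrency", 5) <= 25:
        errors.append("concurrency must be between 1 and 25")
    return errors

print(validate_input({"urls": ["https://example.com"], "limit": 10}))  # []
```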

Output

Each dataset item represents one input URL with its snapshot history. Key fields:

  • inputUrl (string) — the URL that was looked up
  • snapshotCount (number) — total number of matching snapshots found
  • snapshots (array) — list of snapshot objects with timestamp, original, statuscode, mimetype, length, and archiveUrl
  • latestSnapshot (object) — the most recent snapshot, or null if none found
  • latestHtml (string) — archived HTML content (only when includeHtml is enabled)
  • checkedAt (string) — ISO timestamp of when the check was performed
  • error (string) — error message if the lookup failed, otherwise null

Output Example

```json
{
  "inputUrl": "https://example.com",
  "snapshotCount": 3,
  "snapshots": [
    {
      "timestamp": "20250110153022",
      "original": "https://example.com",
      "statuscode": 200,
      "mimetype": "text/html",
      "length": 1256,
      "archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com"
    }
  ],
  "latestSnapshot": {
    "timestamp": "20250110153022",
    "original": "https://example.com",
    "statuscode": 200,
    "mimetype": "text/html",
    "length": 1256,
    "archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com"
  },
  "latestHtml": null,
  "checkedAt": "2025-01-20T12:00:00.000Z",
  "error": null
}
```
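Dataset items in this shape are straightforward to post-process. The sketch below parses the Wayback timestamp format (`YYYYMMDDhhmmss`, UTC) into a readable summary line; `summarize_item` is an illustrative helper, not actor output:

```python
from datetime import datetime, timezone

def summarize_item(item):
    """Turn one dataset item into a one-line summary, or None if it errored."""
    if item.get("error"):
        return None
    latest = item.get("latestSnapshot")
    if not latest:
        return f"{item['inputUrl']}: no snapshots"
    # Wayback timestamps are YYYYMMDDhhmmss in UTC
    ts = datetime.strptime(latest["timestamp"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return (f"{item['inputUrl']}: {item['snapshotCount']} snapshot(s), "
            f"latest {ts:%Y-%m-%d %H:%M} UTC -> {latest['archiveUrl']}")

item = {
    "inputUrl": "https://example.com",
    "snapshotCount": 3,
    "latestSnapshot": {
        "timestamp": "20250110153022",
        "archiveUrl": "https://web.archive.org/web/20250110153022/https://example.com",
    },
    "error": None,
}
print(summarize_item(item))
```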

Pricing

| Event | Cost |
| --- | --- |
| Snapshot Retrieved | Pay-per-event (see the actor's pricing page) |

Use Cases

  • Compliance & legal — retrieve historical versions of terms of service, privacy policies, or product pages
  • Competitive due diligence — review how a competitor's website evolved over time before a deal or partnership
  • Content recovery — recover lost or deleted web pages from the Internet Archive
  • SEO auditing — check when a page was last crawled and compare historical content changes
  • Brand monitoring — verify historical claims or track how a brand's messaging changed
  • Research & journalism — access archived versions of news articles or government pages

Works Well With

| Actor | What it adds |
| --- | --- |
| Google News Scraper | Monitor current news coverage alongside historical archive lookups |
| Broken Links Checker | Find dead links on your site, then recover them via the Wayback Machine |
| Sitemap Extractor | Extract all URLs from a sitemap to feed into bulk Wayback lookups |

Notes

  • The Wayback Machine CDX API is free but may throttle under heavy load. Use the concurrency setting conservatively for large batches.
  • The includeHtml option is experimental and may fail for very large pages or pages with complex JavaScript rendering.
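If you do hit CDX throttling in your own tooling around the actor, retrying with exponential backoff is the usual client-side pattern. A minimal sketch; `with_backoff` is illustrative and not part of the actor, whose supported knobs are `timeoutSeconds` and `concurrency`:

```python
import time

def with_backoff(fetch, retries=3, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on any exception.

    Waits base_delay, then 2x, 4x, ... between attempts; re-raises after
    the final attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Keeping `base_delay` at a second or more is polite to the free CDX endpoint; combine it with a conservative `concurrency` setting for large batches.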