Wayback Snapshots — CSV, Date-Filter, Bulk JSON
Wayback Machine snapshots CSV/JSON — per snapshot: timestamp, status, MIME, size, archive URL — date-filterable. CDX API, no key. 21+ runs. For competitor history-tracking + SEO recovery + brand archaeology. spinov001@gmail.com · blog.spinov.online · t.me/scraping_ai


Wayback Machine Scraper — Historical Snapshot Index

Pull the full index of archived snapshots for any URL from the Internet Archive's official CDX Server API. No headless browser, no API key, no HTML scraping that breaks on archive.org UI changes.

Verified against src/main.js — every output field below is what the actor actually pushes to the dataset.


What you get per snapshot (10 fields)

| Field | Example | Source |
|---|---|---|
| url | https://example.com/ | original URL captured |
| timestamp | 20240615083000 | raw CDX timestamp, YYYYMMDDhhmmss |
| dateISO | 2024-06-15T08:30:00Z | parsed ISO 8601 form of timestamp (raw string echoed back if length < 14) |
| statusCode | 200 | HTTP status at capture time (parsed int; null on non-numeric) |
| mimeType | text/html | MIME type reported by the archive |
| size | 45230 | response size in bytes (parsed int; null on non-numeric) |
| digest | BASE32SHA1... | content hash (CDX digest field) — use for deduplication |
| archiveUrl | https://web.archive.org/web/20240615083000/https://example.com/ | direct link to the archived copy |
| inputUrl | https://example.com | the URL you passed in (pre-normalisation) |
| scrapedAt | 2026-04-29T10:30:00.000Z | actor capture timestamp |

The actor returns the snapshot index — pointers to archived copies. To fetch the cached HTML body itself, follow archiveUrl from your own pipeline (one extra GET per snapshot).
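For orientation, here is what a single dataset record looks like when assembled from the example values in the field table above (the values are illustrative; the field names match what the actor pushes):

```python
# One dataset record, built from the example column of the field table.
snapshot = {
    "url": "https://example.com/",
    "timestamp": "20240615083000",
    "dateISO": "2024-06-15T08:30:00Z",
    "statusCode": 200,
    "mimeType": "text/html",
    "size": 45230,
    "digest": "BASE32SHA1...",
    "archiveUrl": "https://web.archive.org/web/20240615083000/https://example.com/",
    "inputUrl": "https://example.com",
    "scrapedAt": "2026-04-29T10:30:00.000Z",
}

# archiveUrl is just the capture timestamp plus the original URL
# on web.archive.org, so the two fields always agree:
assert snapshot["timestamp"] in snapshot["archiveUrl"]
assert snapshot["archiveUrl"].endswith(snapshot["url"])
```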


Inputs (from .actor/input_schema.json — 4 visible fields)

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| urls | string[] | [] (prefilled ["google.com"]) | required, ≥1 | URLs to look up (with or without scheme; leading http(s):// and trailing / stripped) |
| maxSnapshotsPerUrl | integer | 20 | 1–1000 | Cap on snapshots returned per URL (sent as CDX limit) |
| fromDate | string | '' | YYYY-MM-DD | Lower bound; dashes stripped before sending as CDX from |
| toDate | string | '' | YYYY-MM-DD | Upper bound; dashes stripped before sending as CDX to |

Hidden default — collapseBy is fixed at timestamp:8 (collapse to ~one snapshot per day)

The code reads a collapseBy parameter that is NOT exposed in the UI form (intentionally omitted from .actor/input_schema.json). UI users always get the day-level default. SDK / API users CAN override by passing collapseBy in the run input — the value flows straight to CDX:

  • timestamp:8 (default) — one snapshot per calendar day (leading 8 digits of timestamp)
  • timestamp:10 — one per hour
  • timestamp:6 — one per month
  • digest — collapse identical content (Wayback content hash)
  • '' (empty) — no collapse, all snapshots

If you need this exposed in the UI form, request a custom build (see Apify-as-a-Service tiers below).
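The parameter names above (url, limit, from, to, collapse) are the real CDX query parameters. A minimal sketch of how such a query string can be assembled, using the same dash-stripping and collapse defaults described in this README (the helper itself is illustrative, not the actor's code):

```python
from urllib.parse import urlencode

def build_cdx_query(url, limit=20, from_date="", to_date="", collapse_by="timestamp:8"):
    """Assemble a CDX query like the one described above (illustrative sketch)."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_date:
        params["from"] = from_date.replace("-", "")  # CDX expects YYYYMMDD
    if to_date:
        params["to"] = to_date.replace("-", "")
    if collapse_by:
        params["collapse"] = collapse_by  # timestamp:8 = roughly one row per day
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(build_cdx_query("example.com", from_date="2020-01-01"))
```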


Use cases

  • Competitive intelligence — see when a competitor changed their pricing or copy, then GET the archiveUrl to read the old version
  • Legal / compliance — pull a date-bounded list of snapshots as evidence of a site's historical state
  • SEO research — feed dateISO + statusCode + digest into a diff pipeline to detect when major redesigns happened
  • Content recovery — find the latest pre-deletion snapshot of a page that 404s today
  • Journalism / fact-checking — verify what was published on a specific date
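For the SEO-research case, the digest field does most of the work: a redesign or content change shows up as a new content hash between consecutive captures. A small sketch of that diff pipeline (the sample history is made up):

```python
# Illustrative change-point detection over the actor's output: yield every
# snapshot whose `digest` differs from the previous capture of the same URL.
def change_points(snapshots):
    prev_digest = None
    for snap in sorted(snapshots, key=lambda s: s["timestamp"]):
        if snap["digest"] != prev_digest:
            yield snap
        prev_digest = snap["digest"]

history = [  # hypothetical three-month capture history
    {"timestamp": "20240101000000", "digest": "AAA", "dateISO": "2024-01-01T00:00:00Z"},
    {"timestamp": "20240201000000", "digest": "AAA", "dateISO": "2024-02-01T00:00:00Z"},
    {"timestamp": "20240301000000", "digest": "BBB", "dateISO": "2024-03-01T00:00:00Z"},
]
for snap in change_points(history):
    print(snap["dateISO"])  # prints the first capture and the March change
```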

Quick start

  1. Click Try for free above.
  2. Provide input (UI form — 4 visible fields):

```json
{
  "urls": ["https://example.com"],
  "maxSnapshotsPerUrl": 50,
  "fromDate": "2020-01-01",
  "toDate": "2025-12-31"
}
```

  3. Run. Results in Storage → Dataset (JSON / CSV / Excel).

Python (apify-client) — SDK can also override the hidden collapseBy

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("knotless_cadence/wayback-machine-scraper").call(
    run_input={
        "urls": ["https://example.com"],
        "maxSnapshotsPerUrl": 50,
        "fromDate": "2020-01-01",
        "toDate": "2025-12-31",
        # optional, SDK-only — the UI form does not expose this:
        "collapseBy": "digest",  # or "timestamp:10", "timestamp:8", ""
    }
)
for snap in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(snap["dateISO"], snap["statusCode"], snap["archiveUrl"])
```

Fetch the cached HTML body for any snapshot

```python
import requests

html = requests.get(snap["archiveUrl"]).text
```
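Note that a plain GET against archiveUrl returns the page wrapped in the Wayback Machine's replay toolbar. The Wayback Machine documents an id_ modifier appended to the timestamp segment that returns the original bytes instead; a small sketch of rewriting archiveUrl accordingly (assumes the standard /web/&lt;timestamp&gt;/ URL form the actor emits):

```python
# Rewrite an archiveUrl to request the raw archived body (no replay toolbar)
# by inserting the Wayback `id_` modifier after the timestamp.
def raw_archive_url(archive_url):
    prefix = "https://web.archive.org/web/"
    timestamp, _, original = archive_url[len(prefix):].partition("/")
    return f"{prefix}{timestamp}id_/{original}"

print(raw_archive_url("https://web.archive.org/web/20240615083000/https://example.com/"))
# https://web.archive.org/web/20240615083000id_/https://example.com/
```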

How it works

The actor calls https://web.archive.org/cdx/search/cdx with the parameters from your input, parses the JSON response (first row = headers, rest = data), and pushes each row as a flat dataset entry. There's no headless browser, no Cheerio, no HTML parsing — only the CDX API. That's why this scraper doesn't break when archive.org rebuilds its UI.
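Concretely, "first row = headers, rest = data" means the CDX JSON output is a list of lists, so each data row zips against the header row. A sketch using a canned response rather than a live HTTP call (the row values are illustrative):

```python
# CDX output=json shape: row 0 is the header row, every later row is one capture.
cdx_response = [
    ["original", "timestamp", "statuscode", "mimetype", "length", "digest"],
    ["https://example.com/", "20240615083000", "200", "text/html", "45230", "ABC123"],
]
headers, *rows = cdx_response
snapshots = [dict(zip(headers, row)) for row in rows]
print(snapshots[0]["timestamp"])  # 20240615083000
```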


Honest limitations (read before bulk runs)

  • Single fetch attempt per URL — no retry. A non-2xx CDX response throws and is caught at the outer loop; that URL produces zero rows for the run.
  • A single CDX HTTP error halts the entire batch. The for (const url of urls) loop is wrapped in ONE outer try/catch. If URL #5 of 100 returns 503, the actor logs the error and exits — URLs #6 through #100 are NEVER processed. Workaround: split large batches into multiple runs of ≤25 URLs, or request a per-URL try/catch custom build.
  • No proxy. Direct fetch from the Apify worker IP. CDX rate-limits per IP; bulk runs across many URLs in quick succession may hit 429 and (per the bullet above) kill the whole batch. Throttle yourself, or request a proxy-routed custom build.
  • collapseBy defaults to day-level for everyone using the UI. Same-day captures collapse to one row. SDK callers can override; UI callers cannot (until custom build).
  • Date filter is calendar-day, not timestamped. fromDate/toDate are date-only at CDX side.
  • maxSnapshotsPerUrl capped at 1000 by the input schema. Beyond that you need cursor pagination, which this actor does not implement.
  • No JS rendering, no HTML body fetch. This is a snapshot INDEX. Body fetch is one extra GET against archiveUrl from your own code.
  • statusCode / size may be null. If CDX returns a non-numeric value, parseInt falls back to null rather than failing the row.
  • dateISO echoes raw timestamp when timestamp length is <14 characters (CDX edge case for partial captures).
  • Empty urls = [] is silently accepted — actor logs No URLs provided. and exits 0.
  • Independent project — not affiliated with the Internet Archive. CDX is a free public API; this actor wraps it for schema-validated output and dedupe via digest.
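The batch-splitting workaround from the list above can be done client-side in a few lines: chunk the URL list into groups of at most 25 and start one actor run per chunk, so a single CDX error only loses that chunk (the commented-out call assumes the apify-client setup shown earlier):

```python
# Split a large URL list into chunks of <=25 for separate actor runs.
def chunked(items, size=25):
    for i in range(0, len(items), size):
        yield items[i:i + size]

urls = [f"https://site-{n}.example" for n in range(60)]  # hypothetical batch
batches = list(chunked(urls))
print(len(batches))  # 60 URLs -> 3 runs of <=25 each

# for batch in batches:
#     client.actor("knotless_cadence/wayback-machine-scraper").call(
#         run_input={"urls": batch, "maxSnapshotsPerUrl": 50})
```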

More scrapers by the same developer

  • Trustpilot Review Scraper — 951 successful runs, full review archives past the 200-result UI cap → apify.com/knotless_cadence/trustpilot-review-scraper
  • Reddit Discussion Scraper — posts, comments, subreddits, no API key → apify.com/knotless_cadence/reddit-discussion-scraper
  • Google News Scraper — track news mentions and media coverage → apify.com/knotless_cadence/google-news-scraper
  • Email Extractor Pro — bulk email extraction from websites → apify.com/knotless_cadence/email-extractor-pro
  • SEO Audit Tool — 15 on-page SEO signals, site-wide summary record → apify.com/knotless_cadence/seo-audit-tool
  • Website Tech Stack Detector — apify.com/knotless_cadence/website-tech-stack-detector
  • Robots.txt Analyzer — apify.com/knotless_cadence/robots-txt-analyzer

Browse the rest at apify.com/knotless_cadence (31 public, 78 total in portfolio).


Proof of delivery

22 lifetime runs on this actor — but the broader portfolio is what backs every pilot:

  • 31 published / 78 total Apify scrapers across socials, B2B, dev tools.
  • Flagship: Trustpilot Review Scraper — 951 lifetime runs, 0 bot-detection failures across 30 days.
  • Recent paid series: $150 / 3-article postmortem for a client in the proxy industry (March 2026, delivered).
  • Code-honest READMEs: every claim in this readme is verified against src/. No "supports X" without proof.

Pilot pricing locked through May 2026:

  • 1 case-study article (1100w+, code blocks): $50
  • 3-article series: $150
  • Custom build (this actor → your variant: multi-year diff harvests, full HTML snapshot rehydration, change-point detection across timestamps): from $50 depending on schema delta.

Reply sample to spinov001@gmail.com — get 2 published case-study articles within 24h. No commitment.


Need a custom build?

Apify-as-a-Service tiers:

  • Pilot — $97: 1 actor configured for your inputs + Slack/email delivery on schedule, 7-day support
  • Standard — $297: 3 actors + custom output schema + dedupe on digest + S3/Sheets sync, 30-day support
  • Premium — $797: unlimited actors + dedicated proxy pool + 1:1 calls + per-URL retry/retry-on-429 + cursor pagination + body-fetch wired through, 90-day support + 1 modification round

Email: spinov001@gmail.com
Blog (case studies + writeups): https://blog.spinov.online
Telegram channel (scraping & data engineering tips): https://t.me/scraping_ai


Honest disclosure

  • Public archive only — no auth, no scraping behind a login
  • Independent project — not affiliated with the Internet Archive
  • This actor returns the index of archived snapshots; downloading the cached HTML body itself is a one-line follow-up against archiveUrl from your code
  • Maintained by the same author who runs apify.com/knotless_cadence (78 actors, 31 public). Recent paid client: 3-article series for a proxy-industry company ($150)