# Wayback Machine Scraper — Historical Snapshot Index
Pull the full index of archived snapshots for any URL from the Internet Archive's official CDX Server API. No headless browser, no API key, no HTML scraping that breaks on archive.org UI changes.
Verified against `src/main.js` — every output field below is what the actor actually pushes to the dataset.
## What you get per snapshot (10 fields)

| Field | Example | Source |
|---|---|---|
| `url` | `https://example.com/` | original URL captured |
| `timestamp` | `20240615083000` | raw CDX timestamp, `YYYYMMDDhhmmss` |
| `dateISO` | `2024-06-15T08:30:00Z` | parsed ISO 8601 form of `timestamp` (raw string echoed back if length < 14) |
| `statusCode` | `200` | HTTP status at capture time (parsed int; `null` on non-numeric) |
| `mimeType` | `text/html` | MIME type reported by the archive |
| `size` | `45230` | response size in bytes (parsed int; `null` on non-numeric) |
| `digest` | `BASE32SHA1...` | content hash (CDX digest field) — use for deduplication (sketch below) |
| `archiveUrl` | `https://web.archive.org/web/20240615083000/https://example.com/` | direct link to the archived copy |
| `inputUrl` | `https://example.com` | the URL you passed in (pre-normalisation) |
| `scrapedAt` | `2026-04-29T10:30:00.000Z` | actor capture timestamp |
The actor returns the snapshot index — pointers to archived copies. To fetch the cached HTML body itself, follow `archiveUrl` from your own pipeline (one extra GET per snapshot).
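The `digest` column makes deduplication a one-liner. Below is a minimal sketch, assuming `snapshots` is a list of items exported from this actor's dataset:

```python
def dedupe_by_digest(snapshots):
    """Keep one snapshot per unique content hash; identical captures share a digest."""
    seen = set()
    unique = []
    for snap in snapshots:
        if snap["digest"] not in seen:
            seen.add(snap["digest"])
            unique.append(snap)
    return unique
```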
## Inputs (from `.actor/input_schema.json` — 4 visible fields)

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `urls` | string[] | `[]` (prefilled `["google.com"]`) | required, ≥1 | URLs to look up, with or without scheme; leading `http(s)://` and trailing `/` are stripped (normalisation sketched below) |
| `maxSnapshotsPerUrl` | integer | `20` | 1–1000 | Cap on snapshots returned per URL (sent as CDX `limit`) |
| `fromDate` | string | `''` | `YYYY-MM-DD` | Lower bound; dashes stripped before sending as CDX `from` |
| `toDate` | string | `''` | `YYYY-MM-DD` | Upper bound; dashes stripped before sending as CDX `to` |
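For illustration, here is a minimal sketch of the normalisation the table describes. The helper names are assumptions for this README; the actor's actual implementation lives in `src/main.js`:

```python
import re

def normalize_url(url: str) -> str:
    """Strip a leading http(s):// and a trailing slash, per the `urls` note above."""
    return re.sub(r"^https?://", "", url).rstrip("/")

def normalize_date(date: str) -> str:
    """Strip dashes: '2020-01-01' becomes the CDX-side form '20200101'."""
    return date.replace("-", "")

assert normalize_url("https://example.com/") == "example.com"
assert normalize_date("2020-01-01") == "20200101"
```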
### Hidden default — `collapseBy` is fixed at `timestamp:8` (collapse to ~one snapshot per day)

The code reads a `collapseBy` parameter that is **not** exposed in the UI form (intentionally omitted from `.actor/input_schema.json`). UI users always get the day-level default. SDK / API users **can** override it by passing `collapseBy` in the run input — the value flows straight to CDX:

- `timestamp:8` (default) — one snapshot per calendar day (leading 8 digits of the timestamp)
- `timestamp:10` — one per hour
- `timestamp:6` — one per month
- `digest` — collapse identical content (Wayback content hash)
- `''` (empty) — no collapse, all snapshots

If you need this exposed in the UI form, request a custom build (see the Apify-as-a-Service tiers below).
## Use cases

- Competitive intelligence — see when a competitor changed their pricing or copy, then GET the `archiveUrl` to read the old version
- Legal / compliance — pull a date-bounded list of snapshots as evidence of a site's historical state
- SEO research — feed `dateISO` + `statusCode` + `digest` into a diff pipeline to detect when major redesigns happened (see the sketch after this list)
- Content recovery — find the latest pre-deletion snapshot of a page that 404s today
- Journalism / fact-checking — verify what was published on a specific date
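To make the diff-pipeline idea concrete, here is a minimal sketch that flags captures whose content hash differs from the previous one, a rough redesign/edit signal. It assumes a JSON export of this actor's dataset; `dataset_items.json` is an illustrative file name:

```python
import json

def change_points(snapshots):
    """Yield (dateISO, digest) whenever the content hash changes between captures."""
    prev = None
    # Fixed-length YYYYMMDDhhmmss strings sort chronologically.
    for snap in sorted(snapshots, key=lambda s: s["timestamp"]):
        if snap["digest"] != prev:
            yield snap["dateISO"], snap["digest"]
            prev = snap["digest"]

with open("dataset_items.json") as f:
    for date_iso, digest in change_points(json.load(f)):
        print(date_iso, digest)
```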
## Quick start

1. Click **Try for free** above.
2. Provide input (UI form — 4 visible fields):

   ```json
   {
     "urls": ["https://example.com"],
     "maxSnapshotsPerUrl": 50,
     "fromDate": "2020-01-01",
     "toDate": "2025-12-31"
   }
   ```

3. Run. Results land in **Storage → Dataset** (JSON / CSV / Excel).
### Python (apify-client) — the SDK can also override the hidden `collapseBy`

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("knotless_cadence/wayback-machine-scraper").call(run_input={
    "urls": ["https://example.com"],
    "maxSnapshotsPerUrl": 50,
    "fromDate": "2020-01-01",
    "toDate": "2025-12-31",
    # optional, SDK-only; the UI form does not expose this:
    "collapseBy": "digest",  # or "timestamp:10", "timestamp:8", ""
})

for snap in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(snap["dateISO"], snap["statusCode"], snap["archiveUrl"])
```
### Fetch the cached HTML body for any snapshot

```python
import requests

html = requests.get(snap["archiveUrl"]).text
```
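One practical note: `archiveUrl` serves the page with the Wayback toolbar and rewritten links. The standard Wayback `id_` URL flag after the timestamp returns the raw archived bytes instead. A sketch, assuming your pipeline can handle unrewritten (possibly relative) links:

```python
import requests

# Append "id_" to the timestamp segment to skip the Wayback toolbar
# and link rewriting (a standard Wayback URL flag).
raw_url = snap["archiveUrl"].replace(snap["timestamp"], snap["timestamp"] + "id_")
raw_html = requests.get(raw_url, timeout=30).text
```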
## How it works

The actor calls `https://web.archive.org/cdx/search/cdx` with the parameters from your input, parses the JSON response (first row = headers, rest = data), and pushes each row as a flat dataset entry. There's no headless browser, no Cheerio, no HTML parsing — only the CDX API. That's why this scraper doesn't break when archive.org rebuilds its UI.
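For readers who want to see the shape of that round trip, here is a minimal standalone sketch of an equivalent CDX query. Parameter names follow the public CDX API; the actor's exact request construction lives in `src/main.js`:

```python
import json
import urllib.parse
import urllib.request

params = {
    "url": "example.com",
    "output": "json",
    "limit": "50",              # maxSnapshotsPerUrl
    "from": "20200101",         # fromDate with dashes stripped
    "to": "20251231",           # toDate with dashes stripped
    "collapse": "timestamp:8",  # the hidden day-level default
}
query = urllib.parse.urlencode(params)

with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{query}") as resp:
    rows = json.load(resp)

# First row is the header, the rest are data rows.
header, data = rows[0], rows[1:]
for row in data:
    snap = dict(zip(header, row))
    print(snap["timestamp"], snap["statuscode"], snap["original"])
```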
## Honest limitations (read before bulk runs)

- Single fetch attempt per URL — no retry. A non-2xx CDX response throws and is caught at the outer loop; that URL produces zero rows for the run.
- A single CDX HTTP error halts the entire batch. The `for (const url of urls)` loop is wrapped in **one** outer `try/catch`. If URL #5 of 100 returns 503, the actor logs the error and exits — URLs #6 through #100 are **never** processed. Workaround: split large batches into multiple runs of ≤25 URLs (see the sketch after this list), or request a per-URL try/catch custom build.
- No proxy. Direct fetch from the Apify worker IP. CDX rate-limits per IP; bulk runs across many URLs in quick succession may hit 429 and (per the bullet above) kill the whole batch. Throttle yourself, or request a proxy-routed custom build.
- `collapseBy` defaults to day-level for everyone using the UI. Same-day captures collapse to one row. SDK callers can override; UI callers cannot (until a custom build).
- Date filter is calendar-day, not timestamped. `fromDate`/`toDate` are date-only on the CDX side.
- `maxSnapshotsPerUrl` is capped at 1000 by the input schema. Beyond that you need cursor pagination, which this actor does not implement.
- No JS rendering, no HTML body fetch. This is a snapshot **index**. Body fetch is one extra GET against `archiveUrl` from your own code.
- `statusCode`/`size` may be `null`. If CDX returns a non-numeric value, `parseInt` falls back to `null` rather than failing the row.
- `dateISO` echoes the raw `timestamp` when the timestamp length is <14 characters (a CDX edge case for partial captures).
- An empty `urls = []` is silently accepted — the actor logs `No URLs provided.` and exits 0.
- Independent project — not affiliated with the Internet Archive. CDX is a free public API; this actor wraps it for schema-validated output and dedupe via `digest`.
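Given the all-or-nothing batch behaviour above, a client-side chunking sketch keeps a single CDX failure from costing you the whole list. Chunk size and URLs are illustrative:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
all_urls = ["example.com", "example.org"]  # your full URL list

# Run ≤25 URLs at a time: a CDX 503/429 now only kills one chunk, not the batch.
CHUNK = 25
for i in range(0, len(all_urls), CHUNK):
    run = client.actor("knotless_cadence/wayback-machine-scraper").call(
        run_input={"urls": all_urls[i:i + CHUNK], "maxSnapshotsPerUrl": 50}
    )
    for snap in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(snap["archiveUrl"])
```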
## Related actors

- Trustpilot Review Scraper — 951 successful runs, full review archives past the 200-result UI cap → apify.com/knotless_cadence/trustpilot-review-scraper
- Reddit Discussion Scraper — posts, comments, subreddits, no API key → apify.com/knotless_cadence/reddit-discussion-scraper
- Google News Scraper — track news mentions and media coverage → apify.com/knotless_cadence/google-news-scraper
- Email Extractor Pro — bulk email extraction from websites → apify.com/knotless_cadence/email-extractor-pro
- SEO Audit Tool — 15 on-page SEO signals, site-wide summary record → apify.com/knotless_cadence/seo-audit-tool
- Website Tech Stack Detector → apify.com/knotless_cadence/website-tech-stack-detector
- Robots.txt Analyzer → apify.com/knotless_cadence/robots-txt-analyzer

Browse the rest at apify.com/knotless_cadence (31 public, 78 total in portfolio).
## Proof of delivery

22 lifetime runs on this actor — but the broader portfolio is what backs every pilot:

- 31 published / 78 total Apify scrapers across socials, B2B, and dev tools.
- Flagship: Trustpilot Review Scraper — 951 lifetime runs, 0 bot-detection failures across 30 days.
- Recent paid series: $150 / 3-article postmortem for a client in the proxy industry (March 2026, delivered).
- Code-honest READMEs: every claim in this README is verified against `src/`. No "supports X" without proof.

Pilot pricing locked through May 2026:

- 1 case-study article (1100w+, code blocks): $50
- 3-article series: $150
- Custom build (this actor → your variant: multi-year diff harvests, full HTML snapshot rehydration, change-point detection across timestamps): from $50, depending on the schema delta.

Email "sample" to spinov001@gmail.com — get 2 published case-study articles within 24h. No commitment.
## Need a custom build?

Apify-as-a-Service tiers:

- Pilot — $97: 1 actor configured for your inputs + Slack/email delivery on a schedule, 7-day support
- Standard — $297: 3 actors + custom output schema + dedupe on `digest` + S3/Sheets sync, 30-day support
- Premium — $797: unlimited actors + dedicated proxy pool + 1:1 calls + per-URL retry / retry-on-429 + cursor pagination + body fetch wired through, 90-day support + 1 modification round

Email: spinov001@gmail.com · Blog (case studies + writeups): https://blog.spinov.online · Telegram channel (scraping & data engineering tips): https://t.me/scraping_ai
## Honest disclosure

- Public archive only — no auth, no scraping behind a login
- Independent project — not affiliated with the Internet Archive
- This actor returns the index of archived snapshots; downloading the cached HTML body itself is a one-line follow-up against `archiveUrl` from your code
- Maintained by the same author who runs apify.com/knotless_cadence (78 actors, 31 public). Recent paid client: 3-article series for a proxy-industry company ($150)