Wayback Snapshots — CSV, Date-Filter, Bulk JSON
Wayback Machine snapshots CSV/JSON — per snapshot: timestamp, status, MIME, size, archive URL — date-filterable. CDX API, no key. 21+ runs. For competitor history-tracking + SEO recovery + brand archaeology. spinov001@gmail.com · blog.spinov.online · t.me/scraping_ai


Wayback Machine Scraper — Historical Snapshot Index

Pull the full index of archived snapshots for any URL from the Internet Archive's official CDX Server API. No headless browser, no API key, no HTML scraping that breaks on archive.org UI changes.

Verified against src/main.js — every output field below is what the actor actually pushes to the dataset.


What you get per snapshot (10 fields)

| Field | Example | Source |
|---|---|---|
| url | https://example.com/ | original URL captured |
| timestamp | 20240615083000 | raw CDX timestamp, YYYYMMDDhhmmss |
| dateISO | 2024-06-15T08:30:00Z | parsed ISO 8601 form of timestamp (raw string echoed back if length < 14) |
| statusCode | 200 | HTTP status at capture time (parsed int; null on non-numeric) |
| mimeType | text/html | MIME type reported by the archive |
| size | 45230 | response size in bytes (parsed int; null on non-numeric) |
| digest | BASE32SHA1... | content hash (CDX digest field) — use for deduplication |
| archiveUrl | https://web.archive.org/web/20240615083000/https://example.com/ | direct link to the archived copy |
| inputUrl | https://example.com | the URL you passed in (pre-normalisation) |
| scrapedAt | 2026-04-29T10:30:00.000Z | actor capture timestamp |

The actor returns the snapshot index — pointers to archived copies. To fetch the cached HTML body itself, follow archiveUrl from your own pipeline (one extra GET per snapshot).
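For orientation, here is what a single dataset record looks like when assembled from the example values in the field table above (the values are illustrative; the field names match what the actor pushes):

```python
# One dataset record, built from the example column of the field table.
snapshot = {
    "url": "https://example.com/",
    "timestamp": "20240615083000",
    "dateISO": "2024-06-15T08:30:00Z",
    "statusCode": 200,
    "mimeType": "text/html",
    "size": 45230,
    "digest": "BASE32SHA1...",
    "archiveUrl": "https://web.archive.org/web/20240615083000/https://example.com/",
    "inputUrl": "https://example.com",
    "scrapedAt": "2026-04-29T10:30:00.000Z",
}

# archiveUrl is just the capture timestamp plus the original URL
# on web.archive.org, so the two fields always agree:
assert snapshot["timestamp"] in snapshot["archiveUrl"]
assert snapshot["archiveUrl"].endswith(snapshot["url"])
```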


Inputs (from .actor/input_schema.json — 4 visible fields)

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| urls | string[] | [] (prefilled ["google.com"]) | required, ≥1 | URLs to look up (with or without scheme; leading http(s):// and trailing / stripped) |
| maxSnapshotsPerUrl | integer | 20 | 1–1000 | Cap on snapshots returned per URL (sent as CDX limit) |
| fromDate | string | '' | YYYY-MM-DD | Lower bound; dashes stripped before sending as CDX from |
| toDate | string | '' | YYYY-MM-DD | Upper bound; dashes stripped before sending as CDX to |

Hidden default — collapseBy is fixed at timestamp:8 (collapse to ~one snapshot per day)

The code reads a collapseBy parameter that is NOT exposed in the UI form (intentionally omitted from .actor/input_schema.json). UI users always get the day-level default. SDK / API users CAN override by passing collapseBy in the run input — the value flows straight to CDX:

  • timestamp:8 (default) — one snapshot per calendar day (leading 8 digits of timestamp)
  • timestamp:10 — one per hour
  • timestamp:6 — one per month
  • digest — collapse identical content (Wayback content hash)
  • '' (empty) — no collapse, all snapshots

If you need this exposed in the UI form, request a custom build (see Apify-as-a-Service tiers below).
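The parameter names above (url, limit, from, to, collapse) are the real CDX query parameters. A minimal sketch of how such a query string can be assembled, using the same dash-stripping and collapse defaults described in this README (the helper itself is illustrative, not the actor's code):

```python
from urllib.parse import urlencode

def build_cdx_query(url, limit=20, from_date="", to_date="", collapse_by="timestamp:8"):
    """Assemble a CDX query like the one described above (illustrative sketch)."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_date:
        params["from"] = from_date.replace("-", "")  # CDX expects YYYYMMDD
    if to_date:
        params["to"] = to_date.replace("-", "")
    if collapse_by:
        params["collapse"] = collapse_by  # timestamp:8 = roughly one row per day
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(build_cdx_query("example.com", from_date="2020-01-01"))
```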


Use cases

  • Competitive intelligence — see when a competitor changed their pricing or copy, then GET the archiveUrl to read the old version
  • Legal / compliance — pull a date-bounded list of snapshots as evidence of a site's historical state
  • SEO research — feed dateISO + statusCode + digest into a diff pipeline to detect when major redesigns happened
  • Content recovery — find the latest pre-deletion snapshot of a page that 404s today
  • Journalism / fact-checking — verify what was published on a specific date
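For the SEO-research case, the digest field does most of the work: a redesign or content change shows up as a new content hash between consecutive captures. A small sketch of that diff pipeline (the sample history is made up):

```python
# Illustrative change-point detection over the actor's output: yield every
# snapshot whose `digest` differs from the previous capture of the same URL.
def change_points(snapshots):
    prev_digest = None
    for snap in sorted(snapshots, key=lambda s: s["timestamp"]):
        if snap["digest"] != prev_digest:
            yield snap
        prev_digest = snap["digest"]

history = [  # hypothetical three-month capture history
    {"timestamp": "20240101000000", "digest": "AAA", "dateISO": "2024-01-01T00:00:00Z"},
    {"timestamp": "20240201000000", "digest": "AAA", "dateISO": "2024-02-01T00:00:00Z"},
    {"timestamp": "20240301000000", "digest": "BBB", "dateISO": "2024-03-01T00:00:00Z"},
]
for snap in change_points(history):
    print(snap["dateISO"])  # prints the first capture and the March change
```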

Quick start

  1. Click Try for free above.
  2. Provide input (UI form — 4 visible fields):

```json
{
  "urls": ["https://example.com"],
  "maxSnapshotsPerUrl": 50,
  "fromDate": "2020-01-01",
  "toDate": "2025-12-31"
}
```

  3. Run. Results in Storage → Dataset (JSON / CSV / Excel).

Python (apify-client) — SDK can also override the hidden collapseBy

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("knotless_cadence/wayback-machine-scraper").call(
    run_input={
        "urls": ["https://example.com"],
        "maxSnapshotsPerUrl": 50,
        "fromDate": "2020-01-01",
        "toDate": "2025-12-31",
        # optional, SDK-only — the UI form does not expose this:
        "collapseBy": "digest",  # or "timestamp:10", "timestamp:8", ""
    }
)
for snap in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(snap["dateISO"], snap["statusCode"], snap["archiveUrl"])
```

Fetch the cached HTML body for any snapshot

```python
import requests

html = requests.get(snap["archiveUrl"]).text
```
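Note that a plain GET against archiveUrl returns the page wrapped in the Wayback Machine's replay toolbar. The Wayback Machine documents an id_ modifier appended to the timestamp segment that returns the original bytes instead; a small sketch of rewriting archiveUrl accordingly (assumes the standard /web/&lt;timestamp&gt;/ URL form the actor emits):

```python
# Rewrite an archiveUrl to request the raw archived body (no replay toolbar)
# by inserting the Wayback `id_` modifier after the timestamp.
def raw_archive_url(archive_url):
    prefix = "https://web.archive.org/web/"
    timestamp, _, original = archive_url[len(prefix):].partition("/")
    return f"{prefix}{timestamp}id_/{original}"

print(raw_archive_url("https://web.archive.org/web/20240615083000/https://example.com/"))
# https://web.archive.org/web/20240615083000id_/https://example.com/
```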

How it works

The actor calls https://web.archive.org/cdx/search/cdx with the parameters from your input, parses the JSON response (first row = headers, rest = data), and pushes each row as a flat dataset entry. There's no headless browser, no Cheerio, no HTML parsing — only the CDX API. That's why this scraper doesn't break when archive.org rebuilds its UI.
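Concretely, "first row = headers, rest = data" means the CDX JSON output is a list of lists, so each data row zips against the header row. A sketch using a canned response rather than a live HTTP call (the row values are illustrative):

```python
# CDX output=json shape: row 0 is the header row, every later row is one capture.
cdx_response = [
    ["original", "timestamp", "statuscode", "mimetype", "length", "digest"],
    ["https://example.com/", "20240615083000", "200", "text/html", "45230", "ABC123"],
]
headers, *rows = cdx_response
snapshots = [dict(zip(headers, row)) for row in rows]
print(snapshots[0]["timestamp"])  # 20240615083000
```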


Honest limitations (read before bulk runs)

  • Single fetch attempt per URL — no retry. A non-2xx CDX response throws and is caught at the outer loop; that URL produces zero rows for the run.
  • A single CDX HTTP error halts the entire batch. The for (const url of urls) loop is wrapped in ONE outer try/catch. If URL #5 of 100 returns 503, the actor logs the error and exits — URLs #6 through #100 are NEVER processed. Workaround: split large batches into multiple runs of ≤25 URLs, or request a per-URL try/catch custom build.
  • No proxy. Direct fetch from the Apify worker IP. CDX rate-limits per IP; bulk runs across many URLs in quick succession may hit 429 and (per the bullet above) kill the whole batch. Throttle yourself, or request a proxy-routed custom build.
  • collapseBy defaults to day-level for everyone using the UI. Same-day captures collapse to one row. SDK callers can override; UI callers cannot (until custom build).
  • Date filter is calendar-day, not timestamped. fromDate/toDate are date-only at CDX side.
  • maxSnapshotsPerUrl capped at 1000 by the input schema. Beyond that you need cursor pagination, which this actor does not implement.
  • No JS rendering, no HTML body fetch. This is a snapshot INDEX. Body fetch is one extra GET against archiveUrl from your own code.
  • statusCode / size may be null. If CDX returns a non-numeric value, parseInt falls back to null rather than failing the row.
  • dateISO echoes raw timestamp when timestamp length is <14 characters (CDX edge case for partial captures).
  • Empty urls = [] is silently accepted — actor logs No URLs provided. and exits 0.
  • Independent project — not affiliated with the Internet Archive. CDX is a free public API; this actor wraps it for schema-validated output and dedupe via digest.
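The batch-splitting workaround from the list above can be done client-side in a few lines: chunk the URL list into groups of at most 25 and start one actor run per chunk, so a single CDX error only loses that chunk (the commented-out call assumes the apify-client setup shown earlier):

```python
# Split a large URL list into chunks of <=25 for separate actor runs.
def chunked(items, size=25):
    for i in range(0, len(items), size):
        yield items[i:i + size]

urls = [f"https://site-{n}.example" for n in range(60)]  # hypothetical batch
batches = list(chunked(urls))
print(len(batches))  # 60 URLs -> 3 runs of <=25 each

# for batch in batches:
#     client.actor("knotless_cadence/wayback-machine-scraper").call(
#         run_input={"urls": batch, "maxSnapshotsPerUrl": 50})
```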

More scrapers by the same developer

  • Trustpilot Review Scraper — 951 successful runs, full review archives past the 200-result UI cap → apify.com/knotless_cadence/trustpilot-review-scraper
  • Reddit Discussion Scraper — posts, comments, subreddits, no API key → apify.com/knotless_cadence/reddit-discussion-scraper
  • Google News Scraper — track news mentions and media coverage → apify.com/knotless_cadence/google-news-scraper
  • Email Extractor Pro — bulk email extraction from websites → apify.com/knotless_cadence/email-extractor-pro
  • SEO Audit Tool — 15 on-page SEO signals, site-wide summary record → apify.com/knotless_cadence/seo-audit-tool
  • Website Tech Stack Detector — apify.com/knotless_cadence/website-tech-stack-detector
  • Robots.txt Analyzer — apify.com/knotless_cadence/robots-txt-analyzer

Browse the rest at apify.com/knotless_cadence (31 public, 78 total in portfolio).


Proof of delivery

22 lifetime runs on this actor — but the broader portfolio is what backs every pilot:

  • 31 published / 78 total Apify scrapers across socials, B2B, dev tools.
  • Flagship: Trustpilot Review Scraper — 951 lifetime runs, 0 bot-detection failures across 30 days.
  • Recent paid series: $150 / 3-article postmortem for a client in the proxy industry (March 2026, delivered).
  • Code-honest READMEs: every claim in this readme is verified against src/. No "supports X" without proof.

Pilot pricing locked through May 2026:

  • 1 case-study article (1100w+, code blocks): $50
  • 3-article series: $150
  • Custom build (this actor → your variant: multi-year diff harvests, full HTML snapshot rehydration, change-point detection across timestamps): from $50 depending on schema delta.

Reply sample to spinov001@gmail.com — get 2 published case-study articles within 24h. No commitment.


Need a custom build?

Apify-as-a-Service tiers:

  • Pilot — $97: 1 actor configured for your inputs + Slack/email delivery on schedule, 7-day support
  • Standard — $297: 3 actors + custom output schema + dedupe on digest + S3/Sheets sync, 30-day support
  • Premium — $797: unlimited actors + dedicated proxy pool + 1:1 calls + per-URL retry/retry-on-429 + cursor pagination + body-fetch wired through, 90-day support + 1 modification round

Email: spinov001@gmail.com
Blog (case studies + writeups): https://blog.spinov.online
Telegram channel (scraping & data engineering tips): https://t.me/scraping_ai


Honest disclosure

  • Public archive only — no auth, no scraping behind a login
  • Independent project — not affiliated with the Internet Archive
  • This actor returns the index of archived snapshots; downloading the cached HTML body itself is a one-line follow-up against archiveUrl from your code
  • Maintained by the same author who runs apify.com/knotless_cadence (78 actors, 31 public). Recent paid client: 3-article series for a proxy-industry company ($150)