Wayback Snapshots — CSV, Date-Filter, Bulk JSON

Wayback Machine snapshots in CSV/JSON — per snapshot: timestamp, status, MIME, size, archive URL — date-filterable + collapse-by-day. Uses CDX API, no API key. Built for competitive intel, SEO recovery, content audits. spinov001@gmail.com · t.me/scraping_ai

Pricing: Pay per usage
Developer: Alex (Maintained by Community)
Actor stats: 0 bookmarked · 5 total users · 0 monthly active users · last modified 3 hours ago

Wayback Machine Scraper — Extract Historical Website Snapshots

Retrieve archived versions of any website from the Internet Archive. See how any URL looked at any point in history — no API key, no rate-limit headaches, no HTML parsing breakage.

Why This Scraper?

Most "Wayback" scrapers parse the web.archive.org HTML directly, which breaks whenever the archive UI changes. This actor uses the official CDX Server API that archive.org exposes for programmatic access — the same endpoint used by researchers and journalists worldwide. That means:

  • Never breaks on UI changes — CDX API is stable and documented
  • No authentication — public archive, no credentials needed
  • Bulk lookups — submit hundreds of URLs in a single run
  • Structured output — clean JSON/CSV, ready for analysis pipelines
  • Full HTML retrieval — optionally pull the cached page body, not just metadata
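For reference, the CDX Server API this actor wraps can also be exercised directly. A minimal sketch (parameter names `from`, `to`, `filter`, and `collapse` per the public CDX Server API; the helper functions are illustrative, not part of this actor):

```python
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_params(url, from_date=None, to_date=None, per_day=True, status_ok=True, limit=10):
    """Build query parameters for a Wayback CDX Server API lookup."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_date:
        params["from"] = from_date          # inclusive start, YYYY[MM[DD]]
    if to_date:
        params["to"] = to_date              # inclusive end
    if status_ok:
        params["filter"] = "statuscode:200"  # successful captures only
    if per_day:
        params["collapse"] = "timestamp:8"   # at most one snapshot per day
    return params

def parse_cdx(rows):
    """CDX JSON output is array-of-arrays; row 0 is the header."""
    if not rows:
        return []
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:]]

if __name__ == "__main__":
    resp = requests.get(CDX_ENDPOINT, params=cdx_params("example.com", "2020", "2024"), timeout=30)
    for rec in parse_cdx(resp.json()):
        print(rec["timestamp"], rec["statuscode"], rec["original"])
```

The `collapse=timestamp:8` trick is what "collapse-by-day" means: snapshots sharing the same 8-character `YYYYMMDD` timestamp prefix are deduplicated server-side.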

Features

  • Historical snapshots — full timeline of cached versions per URL
  • Date filtering — narrow to a year, month, or custom range
  • Bulk processing — 100s of URLs per run, automatic deduplication
  • Content extraction — pull cached HTML/text, not just metadata
  • Status code filtering — skip 404/redirect snapshots, keep only 200
  • MIME filtering — HTML only, or include images/PDFs/JSON
  • Proxy support — uses Apify Proxy for reliable access at volume
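Putting those options together, a run input might look like the sketch below. Only `urls`, `fromDate`, `toDate`, and `extractContent` are confirmed elsewhere on this page; the remaining field names are illustrative guesses — check the actor's input schema for the exact names.

```python
# Representative actor input. Fields marked "assumed name" are hypothetical
# illustrations of the filtering features, not verified schema keys.
run_input = {
    "urls": ["https://example.com", "https://example.org"],
    "fromDate": "2023-01-01",
    "toDate": "2023-12-31",
    "extractContent": True,       # also pull cached HTML, not just metadata
    "statusFilter": [200],        # assumed name: skip 404s and redirects
    "mimeFilter": ["text/html"],  # assumed name: HTML snapshots only
    "collapseByDay": True,        # assumed name: one snapshot per day
}
```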

Output Data

```json
{
  "url": "https://example.com",
  "snapshotDate": "2024-06-15T08:30:00Z",
  "statusCode": 200,
  "mimeType": "text/html",
  "contentLength": 45230,
  "archiveUrl": "https://web.archive.org/web/20240615083000/https://example.com",
  "title": "Example Domain",
  "htmlContent": "<!doctype html>..."
}
```

Use Cases

  • Competitive intelligence — track how competitors changed pricing, messaging, and features over months or years
  • Legal / compliance evidence — document historical website state for disputes, IP claims, or regulatory filings
  • SEO research — analyze how page structure, titles, meta tags, and internal linking evolved
  • Content recovery — rescue pages that were deleted, redesigned, or moved
  • Brand monitoring — visualize a company's public image shift over time
  • Journalism & fact-checking — verify what was published on a date, with an auditable source

Integration Examples

Python

```python
import time
import requests

TOKEN = "YOUR_APIFY_TOKEN"
BASE = "https://api.apify.com/v2"

# Start the actor run
response = requests.post(
    f"{BASE}/acts/knotless_cadence~wayback-machine-scraper/runs",
    params={"token": TOKEN},
    json={"urls": ["https://example.com"], "fromDate": "2020-01-01", "toDate": "2025-12-31"},
)
run_id = response.json()["data"]["id"]

# Poll until the run finishes
while True:
    status = requests.get(f"{BASE}/actor-runs/{run_id}", params={"token": TOKEN}).json()["data"]["status"]
    if status not in ("READY", "RUNNING"):
        break
    time.sleep(5)

# Fetch the dataset items
items = requests.get(
    f"{BASE}/actor-runs/{run_id}/dataset/items",
    params={"token": TOKEN},
).json()
for snap in items:
    print(f"[{snap['snapshotDate']}] {snap['title']}")
```

n8n Workflow

  1. HTTP Request → POST https://api.apify.com/v2/acts/knotless_cadence~wayback-machine-scraper/runs
  2. Wait → 60 seconds (or loop-poll on run status)
  3. HTTP Request → GET .../dataset/items
  4. Slack / Google Sheets / Postgres → push snapshots to your system

Pricing

| Volume | Estimated Cost |
|---|---|
| 100 URLs, metadata only | ~$0.30 |
| 100 URLs + full HTML | ~$0.80 |
| 500 URLs + full HTML | ~$3.00 |

Runs on Apify's free tier (with datacenter-proxy limitations).

FAQ

Q: Is this legal? A: Yes. The Wayback Machine is a public archive operated by the non-profit Internet Archive. This actor reads publicly-available data and respects robots.txt and archive.org's access guidelines.

Q: Why use this over scraping web.archive.org's HTML directly? A: The CDX Server API is the officially-supported programmatic interface. HTML scrapers break every time the UI is tweaked — this one doesn't.

Q: Can I pull the full page HTML, not just metadata? A: Yes. Set `extractContent: true` in the input schema.

Q: How many snapshots are available for a given URL? A: Depends on the site. Popular sites (e.g., nytimes.com) may have tens of thousands. Small sites may have a handful.

Q: How fast is it? A: Typically 100 URL lookups (metadata only) in under 2 minutes. Adding full HTML extraction adds ~3-5 seconds per snapshot.

Q: Can I filter by status code? A: Yes. Filter to 200 only to skip redirects and 404s.
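If you pull snapshots without server-side filtering, an equivalent client-side pass over the output records shown above is straightforward. A sketch (the `snapshotDate` and `statusCode` field names match the Output Data example; the helper itself is illustrative):

```python
def collapse_by_day(snapshots):
    """Keep the first 200-status snapshot per calendar day.

    Assumes `snapshots` is sorted oldest-first and each record carries
    an ISO-8601 `snapshotDate` and an integer `statusCode`.
    """
    seen_days = set()
    kept = []
    for snap in snapshots:
        day = snap["snapshotDate"][:10]  # "YYYY-MM-DD" prefix
        if snap.get("statusCode") == 200 and day not in seen_days:
            seen_days.add(day)
            kept.append(snap)
    return kept
```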


Need a Custom Scraper or Data Pipeline?

Get a tailored scraper built for YOUR use case in 48 hours — $100 pilot rate, or $150 for a 3-article series if you also need written deliverables.


Email: spinov001@gmail.com
Portfolio: 78 published Apify actors — Trustpilot 249+ runs, Reddit 72+, Google News 32+, Email Extractor 19+
Tips & tutorials: t.me/scraping_ai