Wayback Snapshots — CSV, Date-Filter, Bulk JSON
Pricing
Pay per usage
Wayback Machine snapshots in CSV/JSON — per snapshot: timestamp, status, MIME, size, archive URL — date-filterable + collapse-by-day. Uses CDX API, no API key. Built for competitive intel, SEO recovery, content audits. spinov001@gmail.com · t.me/scraping_ai
Developer: Alex
Wayback Machine Scraper — Extract Historical Website Snapshots
Retrieve archived versions of any website from the Internet Archive. See how any URL looked at any point in history — no API key, no rate-limit headaches, no HTML parsing breakage.
Why This Scraper?
Most "Wayback" scrapers scrape the web.archive.org HTML directly, which breaks whenever the archive UI changes. This actor uses the official CDX Server API that archive.org exposes for programmatic access — the same endpoint used by researchers and journalists worldwide. That means:
- ✅ Never breaks on UI changes — CDX API is stable and documented
- ✅ No authentication — public archive, no credentials needed
- ✅ Bulk lookups — submit hundreds of URLs in a single run
- ✅ Structured output — clean JSON/CSV, ready for analysis pipelines
- ✅ Full HTML retrieval — optionally pull the cached page body, not just metadata
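To see what the actor wraps, here is a minimal sketch of a raw CDX Server API query and the shape of its JSON response. The endpoint and parameters (`url`, `output`, `from`, `to`, `collapse`, `limit`) are part of the documented CDX API; the response row shown is an illustrative sample, not live data.

```python
from urllib.parse import urlencode

# Build a CDX Server API query — the same public endpoint this actor uses.
params = {
    "url": "example.com",
    "output": "json",           # first row of the response names the columns
    "from": "20200101",
    "to": "20251231",
    "collapse": "timestamp:8",  # collapse to at most one snapshot per day
    "limit": 10,
}
query = "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
print(query)

# A CDX JSON response is a list of rows; row 0 is the header (sample shown):
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20240615083000", "https://example.com/", "text/html", "200", "ABC123", "45230"],
]
header, rows = sample[0], sample[1:]
snapshots = [dict(zip(header, row)) for row in rows]
print(snapshots[0]["timestamp"])
```

The actor adds the bulk handling, retries, proxying, and output normalization on top of this endpoint.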
Features
- Historical snapshots — full timeline of cached versions per URL
- Date filtering — narrow to a year, month, or custom range
- Bulk processing — 100s of URLs per run, automatic deduplication
- Content extraction — pull cached HTML/text, not just metadata
- Status code filtering — skip 404/redirect snapshots, keep only 200s
- MIME filtering — HTML only, or include images/PDFs/JSON
- Proxy support — uses Apify Proxy for reliable access at volume
Output Data
```json
{
  "url": "https://example.com",
  "snapshotDate": "2024-06-15T08:30:00Z",
  "statusCode": 200,
  "mimeType": "text/html",
  "contentLength": 45230,
  "archiveUrl": "https://web.archive.org/web/20240615083000/https://example.com",
  "title": "Example Domain",
  "htmlContent": "<!doctype html>..."
}
```
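Dataset items in this shape drop straight into a CSV for spreadsheet review. A small sketch, using only the fields from the record above (with `htmlContent` omitted, since large HTML bodies don't belong in a CSV column):

```python
import csv
import io

# Items shaped like the output record above (sample data, not a live run).
items = [
    {
        "url": "https://example.com",
        "snapshotDate": "2024-06-15T08:30:00Z",
        "statusCode": 200,
        "mimeType": "text/html",
        "contentLength": 45230,
        "archiveUrl": "https://web.archive.org/web/20240615083000/https://example.com",
        "title": "Example Domain",
    }
]

fields = ["url", "snapshotDate", "statusCode", "mimeType",
          "contentLength", "archiveUrl", "title"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```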
Use Cases
- Competitive intelligence — track how competitors changed pricing, messaging, and features over months or years
- Legal / compliance evidence — document historical website state for disputes, IP claims, or regulatory filings
- SEO research — analyze how page structure, titles, meta tags, and internal linking evolved
- Content recovery — rescue pages that were deleted, redesigned, or moved
- Brand monitoring — visualize a company's public image shift over time
- Journalism & fact-checking — verify what was published on a date, with an auditable source
Integration Examples
Python
```python
import requests

response = requests.post(
    "https://api.apify.com/v2/acts/knotless_cadence~wayback-machine-scraper/runs",
    params={"token": "YOUR_APIFY_TOKEN"},
    json={"urls": ["https://example.com"], "fromDate": "2020-01-01", "toDate": "2025-12-31"},
)
run_id = response.json()["data"]["id"]

# Poll for completion, then fetch dataset
items = requests.get(
    f"https://api.apify.com/v2/actor-runs/{run_id}/dataset/items",
    params={"token": "YOUR_APIFY_TOKEN"},
).json()
for snap in items:
    print(f"[{snap['snapshotDate']}] {snap['title']}")
```
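The Python snippet above notes "poll for completion" without showing it. One way to sketch that loop, with the status lookup injected as a function so the polling logic stands alone (the endpoint in the comment is the standard Apify run-status call; exact terminal status names should be checked against the Apify API docs):

```python
import time

# Apify run statuses that mean the run is finished (assumed set — verify
# against the Apify API reference).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def wait_for_run(get_status, interval=5, timeout=300):
    """Call get_status() every `interval` seconds until a terminal state."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still {status} after {timeout}s")
        time.sleep(interval)
        waited += interval

# Against the real API, get_status would be something like:
# get_status = lambda: requests.get(
#     f"https://api.apify.com/v2/actor-runs/{run_id}",
#     params={"token": "YOUR_APIFY_TOKEN"},
# ).json()["data"]["status"]
```

Only fetch the dataset items once `wait_for_run` returns `"SUCCEEDED"`.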
n8n Workflow
- HTTP Request → POST https://api.apify.com/v2/acts/knotless_cadence~wayback-machine-scraper/runs
- Wait → 60 seconds (or loop-poll on run status)
- HTTP Request → GET .../dataset/items
- Slack / Google Sheets / Postgres → push snapshots to your system
Pricing
| Volume | Estimated Cost |
|---|---|
| 100 URLs, metadata only | ~$0.30 |
| 100 URLs + full HTML | ~$0.80 |
| 500 URLs + full HTML | ~$3.00 |
Runs on Apify's free tier (with datacenter-proxy limitations).
FAQ
Q: Is this legal? A: Yes. The Wayback Machine is a public archive operated by the non-profit Internet Archive. This actor reads publicly-available data and respects robots.txt and archive.org's access guidelines.
Q: Why use this over scraping web.archive.org's HTML directly? A: The CDX Server API is the officially-supported programmatic interface. HTML scrapers break every time the UI is tweaked — this one doesn't.
Q: Can I pull the full page HTML, not just metadata? A: Yes. Set `extractContent: true` in the actor input.
Q: How many snapshots are available for a given URL? A: Depends on the site. Popular sites (e.g., nytimes.com) may have tens of thousands. Small sites may have a handful.
Q: How fast is it? A: Typically 100 URL lookups (metadata only) in under 2 minutes. Adding full HTML extraction adds ~3-5 seconds per snapshot.
Q: Can I filter by status code? A: Yes. Filter to 200 only to skip redirects and 404s.
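Putting the FAQ answers together, a run input might look like the following. The `urls`, `fromDate`, and `toDate` fields come from the Python example above and `extractContent` from the FAQ; field names for status-code and MIME filtering are not shown here, so check the actor's input schema for the exact keys.

```json
{
  "urls": [
    "https://example.com",
    "https://example.org"
  ],
  "fromDate": "2020-01-01",
  "toDate": "2025-12-31",
  "extractContent": true
}
```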
Related Actors
- Trustpilot Review Scraper — 249+ runs, ratings & sentiment
- Reddit Scraper Pro — 72+ runs, posts & comment trees via Reddit JSON API
- Google News Scraper — Track news mentions and media coverage
- Email Extractor Pro — Bulk email extraction from websites
Need a Custom Scraper or Data Pipeline?
Get a tailored scraper built for YOUR use case in 48 hours — $100 pilot rate, or $150 for a 3-article series if you also need written deliverables.
Email: spinov001@gmail.com
Portfolio: 78 published Apify actors — Trustpilot 249+ runs, Reddit 72+, Google News 32+, Email Extractor 19+
Tips & tutorials: t.me/scraping_ai