📚 Wayback Machine Checker
Check if URLs are archived on the Wayback Machine and find closest snapshots by date. Essential for compliance, legal evidence, and content restoration.
Store Quickstart
Start with the Quickstart template to verify 3 archived URLs. For bulk verification, use Portfolio Archive Check with up to 500 URLs.
Key Features
- 📚 Official Internet Archive API — Uses archive.org/wayback/available endpoint
- 📅 Closest-snapshot lookup — Find archived version nearest to any date
- 🔍 Availability check — Know if a URL was ever archived
- 📊 Snapshot count — Total archived versions per URL
- ⚡ Bulk processing — Up to 500 URLs per run
- 🔑 No API key needed — Free, open Internet Archive service
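Under the hood, the actor queries the public archive.org/wayback/available endpoint. As a rough illustration (a minimal sketch using only the Python standard library; function names here are illustrative, not part of the actor), this is how a query URL is built and how the API's `archived_snapshots.closest` response is read:

```python
from urllib.parse import urlencode

# Build a query URL for the public availability endpoint.
def build_availability_url(url, timestamp=None):
    params = {"url": url}
    if timestamp:  # YYYYMMDD narrows the lookup to the snapshot closest to that date
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

# Extract the closest snapshot (if any) from a decoded JSON response.
def parse_closest(response_json):
    closest = response_json.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None  # URL was never archived (or is excluded)
    return {"url": closest["url"], "timestamp": closest["timestamp"]}
```

An unarchived URL comes back with an empty `archived_snapshots` object, which is why `parse_closest` returns `None` rather than raising.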
Use Cases
| Who | Why |
|---|---|
| Compliance teams | Legal evidence preservation for regulated industries |
| Journalists | Verify historical versions of web pages that may have been edited |
| SEO recovery | Restore content from accidentally deleted pages |
| Brand protection | Track archived versions of competitor sites over time |
| Academic research | Cite archived web sources in publications |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to check in archive (max 500) |
| closest | string | (optional) | Target snapshot date in YYYY-MM-DD format |
| checkAvailability | boolean | true | Return availability details |
Input Example
{
  "urls": ["https://example.com/old-article", "https://deleted-site.com"],
  "closest": "2020-01-01",
  "checkAvailability": true
}
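The input constraints above (required `urls`, at most 500 entries, `closest` in YYYY-MM-DD) can be checked before submitting a run. This is a hypothetical pre-flight helper, not part of the actor itself:

```python
from datetime import datetime

# Hypothetical pre-flight check mirroring the documented input constraints.
def validate_input(run_input):
    urls = run_input.get("urls")
    if not urls:
        raise ValueError("urls is required")
    if len(urls) > 500:
        raise ValueError("at most 500 URLs per run")
    closest = run_input.get("closest")
    if closest:
        # Raises ValueError unless the date is valid YYYY-MM-DD
        datetime.strptime(closest, "%Y-%m-%d")
    return run_input
```

Validating locally avoids paying the flat start fee for a run that the actor would reject.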
Output
| Field | Type | Description |
|---|---|---|
| url | string | URL queried |
| archived | boolean | Whether the URL has any snapshots in the Wayback Machine |
| closestSnapshotUrl | string | URL of the closest snapshot to the requested date |
| closestSnapshotDate | string | Date of the closest snapshot (YYYYMMDDhhmmss) |
| totalSnapshots | integer | Approximate total snapshots ever taken |
| firstSnapshotDate | string | Date of the earliest known snapshot |
| lastSnapshotDate | string | Date of the most recent snapshot |
Output Example
{
  "url": "https://example.com/old-article",
  "archived": true,
  "closestSnapshotUrl": "https://web.archive.org/web/20200115000000/https://example.com/old-article",
  "closestSnapshotDate": "20200115000000",
  "totalSnapshots": 23
}
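A common post-processing step is pulling out the URLs that were never archived, since those are the ones that need manual attention. A minimal sketch over dataset items shaped like the Output table above (the helper name is illustrative):

```python
# Collect URLs that have no snapshots from a list of dataset items.
def unarchived_urls(items):
    return [item["url"] for item in items if not item.get("archived")]
```

Feeding this the full dataset gives you a shortlist of pages that cannot be recovered from the archive.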
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL
curl -X POST "https://api.apify.com/v2/acts/taroyamada~wayback-machine-checker/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/old-article", "https://deleted-site.com"],
    "closest": "2020-01-01",
    "checkAvailability": true
  }'
Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/wayback-machine-checker").call(run_input={
    "urls": ["https://example.com/old-article", "https://deleted-site.com"],
    "closest": "2020-01-01",
    "checkAvailability": True,
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
JavaScript / Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/wayback-machine-checker').call({
    urls: ['https://example.com/old-article', 'https://deleted-site.com'],
    closest: '2020-01-01',
    checkAvailability: true,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
Tips & Limitations
- Use closest: "2020-01-01" (YYYY-MM-DD) to find the snapshot nearest a specific historical date.
- Great for verifying when a page was first published or last modified.
- Combine with Broken Link Checker to recover content from dead pages via archive links.
- Wayback Machine is free but rate-limits aggressive callers — keep concurrency low.
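The last tip, keeping concurrency low, can be as simple as checking URLs sequentially with a fixed pause. A minimal sketch (the `check` callable stands in for whatever lookup you run per URL; both names are illustrative):

```python
import time

# Visit URLs one at a time with a pause between requests,
# to stay under the archive's rate limits.
def check_politely(urls, check, delay=1.0):
    results = []
    for url in urls:
        results.append(check(url))
        time.sleep(delay)
    return results
```

A one-second delay keeps a 500-URL run under ten minutes while staying well clear of aggressive-caller throttling.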
FAQ
How far back can I check?
Internet Archive has snapshots back to 1996. Coverage depends on whether a URL was crawled.
Why is my URL 'not available'?
Either it was never archived, or Internet Archive excluded it (due to robots.txt or removal request).
Is this the same as running curl to archive.org?
Yes, but with bulk processing, error handling, and structured output for datasets.
Can I archive new URLs?
This actor only reads from the archive. To save NEW pages, use archive.org's /save/ endpoint.
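For reference, the save endpoint mentioned above is just a URL prefix: requesting `https://web.archive.org/save/<url>` (in a browser or via HTTP GET) asks the Wayback Machine to capture that page. A trivial sketch of building such a URL (no request is made here):

```python
# Build a 'Save Page Now' URL; opening it triggers a new capture
# of the target page on the Wayback Machine.
def save_page_now_url(url):
    return "https://web.archive.org/save/" + url
```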
Why is archived false for my URL?
The Internet Archive may not have crawled that URL yet, or robots.txt blocked it at the time.
Can I trigger a new snapshot?
Not via this actor. Use the Wayback Machine 'Save Page Now' feature manually.
Related Actors
URL/Link Tools cluster — explore related Apify tools:
- 🔗 URL Health Checker — Bulk-check HTTP status codes, redirects, SSL validity, and response times for thousands of URLs.
- 🔗 Broken Link Checker — Crawl websites to find broken links, 404 errors, and dead URLs.
- 🔗 URL Unshortener — Expand bit.ly and other shortened links to reveal their final destinations.
- 🏷️ Meta Tag Analyzer — Analyze meta tags, Open Graph, Twitter Cards, JSON-LD, and hreflang for any URL.
- Sitemap Analyzer API | sitemap.xml SEO Audit — Analyze sitemap.xml files for SEO issues.
- Schema.org Validator API | JSON-LD + Microdata — Validate JSON-LD and Microdata across multiple pages, score markup quality, and flag missing or malformed Schema.org markup.
- Site Governance Monitor | Robots, Sitemap & Schema — Recurring robots.txt, sitemap, and schema checks.
- RDAP Domain Monitor API | Ownership + Expiry — Monitor domain registration data via RDAP and track expiry, registrar, nameserver, and ownership changes in structured rows.
- Domain Security Audit API | SSL Expiry, DMARC, Domain Expiry — Summary-first portfolio monitor for SSL expiry, DMARC/SPF/DKIM, domain expiry/ownership, and security headers with remediation-ready outputs.
Cost
Pay Per Event:
actor-start: $0.01 (flat fee per run)
dataset-item: $0.003 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01
No subscription required — you only pay for what you use.
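The pricing formula above (flat start fee plus a per-item charge) can be checked with a couple of lines of Python, using the rates listed in this section:

```python
# Pay-per-event pricing from the Cost section above.
ACTOR_START_USD = 0.01   # flat fee per run
PER_ITEM_USD = 0.003     # per output item

def estimate_cost(item_count):
    return ACTOR_START_USD + item_count * PER_ITEM_USD
```

For example, a 1,000-item run costs $0.01 + 1,000 × $0.003 = $3.01, matching the worked example above.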
