Wayback Machine Scraper

Scrape Wayback Machine archive snapshots for any URL or domain. Get archived URLs, timestamps, status codes, and MIME types. Export to JSON, CSV, or Excel.

Pricing: Pay per usage

Developer: Glass Ventures (maintained by Community)

Scrape archived snapshots from the Wayback Machine (Archive.org) for any URL or domain. Extract archive URLs, timestamps, HTTP status codes, MIME types, and content sizes.

What does Wayback Machine Scraper do?

Wayback Machine Scraper uses the official Wayback Machine CDX API to retrieve historical snapshots of any website. It lets you discover every archived version of a page, filter by date range, content type, and HTTP status code.
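To make the CDX lookup more concrete, here is a minimal sketch of how a query URL for the public CDX endpoint could be assembled. It is not the actor's own code: the function name `build_cdx_url` and its arguments are hypothetical, and the mapping from the actor's filters to the CDX parameters (`matchType`, `from`, `to`, `filter`, `limit`) is an assumption.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_type="exact", date_from=None, date_to=None,
                  status_code=None, mime_type=None, limit=None):
    """Assemble a CDX query URL (illustrative sketch, not the actor's code)."""
    # matchType "exact" looks up one URL; "domain" also covers subdomains.
    params = [("url", url), ("matchType", match_type), ("output", "json")]
    if date_from:
        params.append(("from", date_from))  # 1-14 digits, e.g. "2020" or "20200101"
    if date_to:
        params.append(("to", date_to))
    if status_code:
        params.append(("filter", f"statuscode:{status_code}"))
    if mime_type:
        params.append(("filter", f"mimetype:{mime_type}"))
    if limit:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("example.com", match_type="domain",
                    date_from="2020", status_code="200", limit=10))
```

The sketch only builds the URL; fetching it and paging through large result sets are left out.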

Whether you need to track how a website changed over time, recover lost content, monitor competitor website changes, or build a historical dataset of web pages, this actor makes it easy. It handles pagination, rate limiting, and exports data in JSON, CSV, or Excel format.

The Wayback Machine (Archive.org) has archived over 800 billion web pages since 1996. This actor gives you structured access to that massive archive without writing any code.

Use Cases

  • SEO specialists -- Track historical changes to competitor pages, find old URLs for redirect mapping, discover deleted content
  • Researchers -- Build datasets of how websites evolved over time, study web history trends
  • Content recovery -- Find and recover deleted or changed web pages from the archive
  • Compliance teams -- Document historical versions of terms of service, privacy policies, or regulatory pages
  • Developers -- Programmatically access Wayback Machine data via API for integration into tools and pipelines

Features

  • Search by exact URL or entire domain (wildcard matching)
  • Filter snapshots by date range (from/to)
  • Filter by MIME type (HTML, JSON, CSS, JavaScript, images)
  • Filter by HTTP status code (200, 301, 404, etc.)
  • Bulk processing of multiple URLs and domains
  • Proxy support with automatic rotation
  • Handles rate limiting and large datasets automatically
  • Export to JSON, CSV, or Excel, or connect via the API

How much will it cost?

The Wayback Machine CDX API is free and public. The only cost is Apify platform compute time.

| Results | Estimated Cost |
| --- | --- |
| 1,000 | ~$0.01 |
| 10,000 | ~$0.05 |
| 100,000 | ~$0.25 |

| Cost Component | Per 10,000 Results |
| --- | --- |
| Platform compute | ~$0.05 |
| Proxy (optional) | ~$0.00 |
| Total | ~$0.05 |
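The estimates above do not scale linearly: the effective price per 1,000 results falls as the run gets larger. The short sketch below does that arithmetic using only the figures from the table; the helper name `cost_per_thousand` is hypothetical, and actual costs depend on the platform.

```python
# Estimates copied from the table above: number of results -> approximate cost in USD.
ESTIMATES = {1_000: 0.01, 10_000: 0.05, 100_000: 0.25}


def cost_per_thousand(results, total_cost):
    """Return the effective cost in USD per 1,000 results."""
    return total_cost / (results / 1_000)


for results, cost in ESTIMATES.items():
    rate = cost_per_thousand(results, cost)
    print(f"{results:>7,} results: ~${rate:.4f} per 1,000")
```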

How to use

  1. Go to the Wayback Machine Scraper page on Apify Store
  2. Click "Start" or "Try for free"
  3. Enter URLs to look up in the archive, or domain names for full-domain search
  4. Optionally set date range filters, MIME type, and status code filters
  5. Set the maximum number of items
  6. Click "Start" and wait for the results

Input parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| startUrls | array | Website URLs to look up in the Wayback Machine | - |
| domains | array | Domain names for full-domain archive search | - |
| dateFrom | string | Only include snapshots after this date | - |
| dateTo | string | Only include snapshots before this date | - |
| mimeTypeFilter | string | Filter by content type (text/html, application/json, all) | all |
| statusCodeFilter | string | Filter by HTTP status code (e.g., "200") | - |
| maxItems | number | Maximum snapshot records to return | 1000 |
| proxyConfig | object | Proxy settings (optional) | - |
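As an illustration of how these parameters fit together, the following sketch assembles an input object in Python. The helper `build_run_input` is hypothetical. It assumes that `domains` is a list of plain strings and that at least one URL or domain is required; neither is stated above. The date format is not documented here, so date strings are passed through unchanged.

```python
def build_run_input(start_urls=None, domains=None, date_from=None, date_to=None,
                    mime_type_filter="all", status_code_filter=None, max_items=1000):
    """Assemble an actor input dict from the documented parameters (sketch)."""
    # Assumption: the actor needs at least one URL or domain to search for.
    if not start_urls and not domains:
        raise ValueError("Provide at least one URL or domain")

    run_input = {"mimeTypeFilter": mime_type_filter, "maxItems": max_items}
    if start_urls:
        # Same shape as in the API examples: a list of {"url": ...} objects.
        run_input["startUrls"] = [{"url": url} for url in start_urls]
    if domains:
        run_input["domains"] = list(domains)  # assumed to be plain strings
    if date_from:
        run_input["dateFrom"] = date_from  # format not documented; passed as-is
    if date_to:
        run_input["dateTo"] = date_to
    if status_code_filter:
        run_input["statusCodeFilter"] = str(status_code_filter)
    return run_input


print(build_run_input(start_urls=["https://www.example.com"],
                      status_code_filter=200, max_items=50))
```

Optional fields are left out when unset so that the actor's own defaults apply.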

Output

The actor produces a dataset with the following fields:

{
    "originalUrl": "https://www.example.com",
    "archiveUrl": "https://web.archive.org/web/20230115120000/https://www.example.com",
    "timestamp": "20230115120000",
    "statusCode": "200",
    "mimeType": "text/html",
    "length": "1256",
    "archivedDate": "2023-01-15T12:00:00.000Z",
    "scrapedAt": "2026-04-23T10:30:00.000Z"
}

| Field | Type | Description |
| --- | --- | --- |
| originalUrl | string | The original URL that was archived |
| archiveUrl | string | Full Wayback Machine URL to view the snapshot |
| timestamp | string | Raw Wayback Machine timestamp (YYYYMMDDHHmmss) |
| statusCode | string | HTTP status code of the archived response |
| mimeType | string | Content type of the archived resource |
| length | string | Size of the archived resource in bytes |
| archivedDate | string | ISO 8601 date when the snapshot was taken |
| scrapedAt | string | ISO 8601 timestamp when data was extracted |
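Every field is delivered as a string, so downstream code usually converts a few of them before use. The sketch below shows one way to do that in Python. The helper names are hypothetical, and the fallback to `None` for non-numeric values is a defensive assumption rather than documented behavior.

```python
from datetime import datetime


def to_int(value):
    """Convert a numeric string to int; return None for anything else."""
    return int(value) if isinstance(value, str) and value.isdigit() else None


def parse_snapshot(item):
    """Turn one dataset item into a dict with typed values (sketch)."""
    return {
        "original_url": item["originalUrl"],
        "archive_url": item["archiveUrl"],
        # archivedDate is ISO 8601 with milliseconds and a trailing "Z" (UTC).
        "archived_at": datetime.strptime(item["archivedDate"], "%Y-%m-%dT%H:%M:%S.%f%z"),
        "status_code": to_int(item["statusCode"]),
        "mime_type": item["mimeType"],
        "length_bytes": to_int(item["length"]),
    }


sample = {
    "originalUrl": "https://www.example.com",
    "archiveUrl": "https://web.archive.org/web/20230115120000/https://www.example.com",
    "timestamp": "20230115120000",
    "statusCode": "200",
    "mimeType": "text/html",
    "length": "1256",
    "archivedDate": "2023-01-15T12:00:00.000Z",
    "scrapedAt": "2026-04-23T10:30:00.000Z",
}
print(parse_snapshot(sample))
```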

Integrations

Connect Wayback Machine Scraper with other tools:

  • Apify API -- REST API for programmatic access
  • Webhooks -- get notified when a run finishes
  • Zapier / Make -- connect to 5,000+ apps
  • Google Sheets -- export directly to spreadsheets

API Example (Node.js)

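import { ApifyClient } from 'apify-client';

// Replace YOUR_TOKEN with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });
// Start the actor and wait for the run to finish.
const run = await client.actor('YOUR_USERNAME/wayback-machine-scraper').call({
    startUrls: [{ url: 'https://www.example.com' }],
    maxItems: 100,
});
// Read the snapshot records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();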

API Example (Python)

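from apify_client import ApifyClient

# Replace YOUR_TOKEN with your Apify API token.
client = ApifyClient('YOUR_TOKEN')
# Start the actor and wait for the run to finish.
run = client.actor('YOUR_USERNAME/wayback-machine-scraper').call(run_input={
    'startUrls': [{'url': 'https://www.example.com'}],
    'maxItems': 100,
})
# Read the snapshot records from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items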

API Example (cURL)

curl "https://api.apify.com/v2/acts/YOUR_USERNAME~wayback-machine-scraper/runs" \
    -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_TOKEN" \
    -d '{"startUrls": [{"url": "https://www.example.com"}], "maxItems": 100}'
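Once the items have been retrieved through the API, they can also be written to CSV locally. The following is a minimal sketch using only the Python standard library. The helper `items_to_csv` is hypothetical, and the column order simply follows the output fields documented above.

```python
import csv
import io

# Column order follows the documented output fields.
FIELDS = ["originalUrl", "archiveUrl", "timestamp", "statusCode",
          "mimeType", "length", "archivedDate", "scrapedAt"]


def items_to_csv(items, fields=FIELDS):
    """Serialize dataset items (a list of dicts) to CSV text."""
    buffer = io.StringIO(newline="")
    # Unknown keys are ignored; missing keys become empty cells.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


example_items = [{
    "originalUrl": "https://www.example.com",
    "archiveUrl": "https://web.archive.org/web/20230115120000/https://www.example.com",
    "timestamp": "20230115120000",
    "statusCode": "200",
    "mimeType": "text/html",
    "length": "1256",
}]
print(items_to_csv(example_items))
```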

Tips and tricks

  • Start with a small maxItems (10-50) to test before running large scrapes
  • Use date filters (dateFrom/dateTo) to narrow results for popular sites with thousands of snapshots
  • Domain-wide searches can return very large datasets -- always set a maxItems limit
  • Set statusCodeFilter to "200" to get only successful snapshots (skipping redirects and errors)
  • The Wayback Machine API can be slow for domains with millions of snapshots -- be patient
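After a run, a common follow-up to these tips is to reduce the results to the most recent successful snapshot for each URL. The sketch below shows one way to do that on the returned items. The helper `latest_ok_snapshots` is hypothetical, and the string comparison assumes full 14-digit timestamps.

```python
def latest_ok_snapshots(items):
    """Map each originalUrl to its most recent snapshot with status 200."""
    latest = {}
    for item in items:
        if item.get("statusCode") != "200":
            continue  # skip redirects, errors, and records without a status
        url = item["originalUrl"]
        # 14-digit timestamps sort chronologically when compared as strings.
        if url not in latest or item["timestamp"] > latest[url]["timestamp"]:
            latest[url] = item
    return latest


snapshots = [
    {"originalUrl": "https://www.example.com", "timestamp": "20210101000000", "statusCode": "200"},
    {"originalUrl": "https://www.example.com", "timestamp": "20230115120000", "statusCode": "200"},
    {"originalUrl": "https://www.example.com", "timestamp": "20240101000000", "statusCode": "301"},
]
print(latest_ok_snapshots(snapshots))
```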

FAQ

Q: Does this actor require login credentials? A: No. The Wayback Machine CDX API is completely free and public.

Q: How fast is the scraping? A: Typically 1,000-10,000 results per minute depending on the API response time. Large domain searches may take longer.

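Q: What should I do if I get rate limited? A: Enable proxy configuration (proxyConfig) to rotate IPs automatically. Lowering concurrency can also help, although maxConcurrency does not appear in the input parameters listed above, so check the actor's input schema before relying on it.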
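If you call the actor or the archive from your own scripts, retrying with exponential backoff is a general client-side technique for coping with rate limits. It is not described in this README as something the actor does. The sketch below is a generic illustration; the helper `call_with_backoff` and its parameters are hypothetical.

```python
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on failure, doubling the wait after each attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:  # in real code, catch only rate-limit or network errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default


# Usage with a stand-in function that fails twice before succeeding.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []
result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as an argument keeps the example testable without real waiting.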

Q: Can I get the actual page content from the archive? A: This actor returns snapshot metadata (URLs, dates, status codes). Use the archiveUrl field to access the actual archived page content.
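As a rough sketch of that follow-up step, the code below downloads the content behind an archiveUrl with the Python standard library. The helper names are hypothetical. The `id_` suffix after the timestamp is a Wayback Machine URL convention for requesting the original bytes without the replay toolbar; it is not covered in this README, so treat it as an assumption.

```python
import re
from urllib.request import urlopen


def raw_snapshot_url(archive_url):
    """Insert the id_ modifier after the timestamp of a Wayback Machine URL."""
    return re.sub(r"(/web/\d{1,14})/", r"\1id_/", archive_url, count=1)


def fetch_snapshot(archive_url, timeout=30):
    """Download the archived content as bytes (performs a network request)."""
    with urlopen(raw_snapshot_url(archive_url), timeout=timeout) as response:
        return response.read()


archive_url = "https://web.archive.org/web/20230115120000/https://www.example.com"
print(raw_snapshot_url(archive_url))
```

Only the URL rewrite runs in the example; call `fetch_snapshot` sparingly and respect Archive.org's Terms of Service.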

Q: Why are some snapshots missing? A: The Wayback Machine does not archive every page on every visit. Some pages may have been excluded by robots.txt or simply not crawled.

The Wayback Machine (Archive.org) provides a public API specifically designed for programmatic access to archive data. This actor uses only the official CDX API. Always review and respect Archive.org's Terms of Service. For more information, see Apify's blog on web scraping legality.

Limitations

  • The CDX API may rate-limit requests for very high-volume queries
  • Domain-wide searches for popular domains (e.g., google.com) can return millions of records -- use date filters and maxItems
  • The actor returns snapshot metadata, not the actual archived page content
  • Some timestamps may have reduced precision (date only, no time)
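For the reduced-precision case, one option is to pad short timestamps with zeros before parsing them. The sketch below assumes a short timestamp is a prefix of the usual YYYYMMDDHHmmss form with at least the date part present; the helper `parse_wayback_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_wayback_timestamp(ts):
    """Parse a Wayback timestamp of 8, 10, 12, or 14 digits into a UTC datetime."""
    if not ts.isdigit() or len(ts) not in (8, 10, 12, 14):
        raise ValueError(f"Unsupported timestamp: {ts!r}")
    # Missing hour, minute, or second digits default to zero.
    padded = ts.ljust(14, "0")
    return datetime.strptime(padded, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


print(parse_wayback_timestamp("20230115120000").isoformat())
print(parse_wayback_timestamp("20230115").isoformat())
```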

Changelog

  • v0.1 (2026-04-23) -- Initial release