# Wayback Machine Scraper (`glassventures/wayback-machine-scraper`) Actor

Scrape Wayback Machine archive snapshots for any URL or domain. Get archived URLs, timestamps, status codes, MIME types. Export to JSON, CSV, Excel.

- **URL**: https://apify.com/glassventures/wayback-machine-scraper.md
- **Developed by:** [Glass Ventures](https://apify.com/glassventures) (community)
- **Categories:** Developer tools, SEO tools, Automation
- **Stats:** 1 total users, 0 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Wayback Machine Scraper

Scrape archived snapshots from the Wayback Machine (Archive.org) for any URL or domain. Extract archive URLs, timestamps, HTTP status codes, MIME types, and content sizes.

### What does Wayback Machine Scraper do?

Wayback Machine Scraper uses the official Wayback Machine CDX API to retrieve historical snapshots of any website. It lets you discover every archived version of a page, filter by date range, content type, and HTTP status code.
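The Actor's internal requests are not documented here, so the following is only a minimal sketch of what a query against the public CDX endpoint can look like. The `build_cdx_url` helper and its argument names are hypothetical; the query parameters it emits (`url`, `output`, `matchType`, `from`, `to`, `filter`, `limit`) come from the public CDX server API, and the Actor may combine them differently.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_domain=False, date_from=None, date_to=None,
                  status_code=None, mime_type=None, limit=None):
    """Build a CDX query URL. Nothing is sent over the network here."""
    params = [("url", url), ("output", "json")]
    if match_domain:
        # matchType=domain also matches subdomains of the given host.
        params.append(("matchType", "domain"))
    if date_from:
        params.append(("from", date_from))  # timestamp prefix, e.g. "2020"
    if date_to:
        params.append(("to", date_to))
    if status_code:
        params.append(("filter", f"statuscode:{status_code}"))
    if mime_type:
        params.append(("filter", f"mimetype:{mime_type}"))
    if limit:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("example.com", date_from="2020", date_to="2021",
                    status_code="200", limit=5))
```

Opening the printed URL in a browser is a quick way to see the raw rows that a snapshot listing is built from.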

Whether you need to track how a website changed over time, recover lost content, monitor competitor website changes, or build a historical dataset of web pages, this Actor makes it easy. It handles pagination and rate limiting, and exports data in JSON, CSV, or Excel format.

The Wayback Machine (Archive.org) has archived over 800 billion web pages since 1996. This Actor gives you structured access to that massive archive without writing any code.

### Use Cases

- **SEO specialists** -- Track historical changes to competitor pages, find old URLs for redirect mapping, discover deleted content
- **Researchers** -- Build datasets of how websites evolved over time, study web history trends
- **Content recovery** -- Find and recover deleted or changed web pages from the archive
- **Compliance teams** -- Document historical versions of terms of service, privacy policies, or regulatory pages
- **Developers** -- Programmatically access Wayback Machine data via API for integration into tools and pipelines

### Features

- Search by exact URL or entire domain (wildcard matching)
- Filter snapshots by date range (from/to)
- Filter by MIME type (HTML, JSON, CSS, JavaScript, images)
- Filter by HTTP status code (200, 301, 404, etc.)
- Bulk processing of multiple URLs and domains
- Proxy support with automatic rotation
- Handles rate limiting and large datasets automatically
- Exports to JSON, CSV, Excel, or connect via API

### How much will it cost?

The Wayback Machine CDX API is free and public. The only cost is Apify platform compute time.

| Results | Estimated Cost |
|---------|---------------|
| 1,000   | ~$0.01        |
| 10,000  | ~$0.05        |
| 100,000 | ~$0.25        |

| Cost Component | Per 10,000 Results |
|----------------|-------------------|
| Platform compute | ~$0.05 |
| Proxy (optional) | ~$0.00 |
| **Total** | **~$0.05** |

### How to use

1. Go to the Wayback Machine Scraper page on Apify Store
2. Click "Start" or "Try for free"
3. Enter URLs to look up in the archive, or domain names for full-domain search
4. Optionally set date range filters, MIME type, and status code filters
5. Set the maximum number of items
6. Click "Start" and wait for the results

### Input parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| startUrls | array | Website URLs to look up in the Wayback Machine | - |
| domains | array | Domain names for full-domain archive search | - |
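| dateFrom | string | Only include snapshots archived on or after this date | - |
| dateTo | string | Only include snapshots archived on or before this date | - |
| mimeTypeFilter | string | Filter by content type (`all`, `text/html`, `application/json`, `text/css`, `application/javascript`, `image/jpeg`, `image/png`) | all |
| statusCodeFilter | string | Filter by HTTP status code (e.g., "200"); leave empty for all status codes | - |
| maxItems | integer | Maximum snapshot records to return (0 for unlimited) | 1000 |
| maxConcurrency | integer | Maximum number of URLs processed in parallel (1-20) | 5 |
| debugMode | boolean | Enables verbose logging for troubleshooting | false |
| proxyConfig | object | Proxy settings (optional) | - |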

### Output

The Actor produces a dataset with the following fields:

```json
{
    "originalUrl": "https://www.example.com",
    "archiveUrl": "https://web.archive.org/web/20230115120000/https://www.example.com",
    "timestamp": "20230115120000",
    "statusCode": "200",
    "mimeType": "text/html",
    "length": "1256",
    "archivedDate": "2023-01-15T12:00:00.000Z",
    "scrapedAt": "2026-04-23T10:30:00.000Z"
}
````

| Field | Type | Description |
|-------|------|-------------|
| originalUrl | string | The original URL that was archived |
| archiveUrl | string | Full Wayback Machine URL to view the snapshot |
| timestamp | string | Raw Wayback Machine timestamp (YYYYMMDDHHmmss) |
| statusCode | string | HTTP status code of the archived response |
| mimeType | string | Content type of the archived resource |
| length | string | Size of the archived resource in bytes |
| archivedDate | string | ISO 8601 date when the snapshot was taken |
| scrapedAt | string | ISO 8601 timestamp when data was extracted |
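To make the relationship between the fields concrete, here is a small sketch of how `archiveUrl` and `archivedDate` can be derived from `originalUrl` and the raw `timestamp`. The `to_snapshot_record` helper is hypothetical and is not the Actor's own code; it assumes a full 14-digit timestamp in UTC and reproduces the shape of the example record above.

```python
from datetime import datetime, timezone


def to_snapshot_record(original_url, timestamp, status_code, mime_type, length):
    """Assemble a record shaped like the Actor's output from raw CDX-style values."""
    # Wayback timestamps are YYYYMMDDHHmmss; this sketch treats them as UTC.
    archived = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return {
        "originalUrl": original_url,
        "archiveUrl": f"https://web.archive.org/web/{timestamp}/{original_url}",
        "timestamp": timestamp,
        "statusCode": status_code,
        "mimeType": mime_type,
        "length": length,
        "archivedDate": archived.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
    }


record = to_snapshot_record(
    "https://www.example.com", "20230115120000", "200", "text/html", "1256"
)
print(record["archiveUrl"])
print(record["archivedDate"])
```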

### Integrations

Connect Wayback Machine Scraper with other tools:

- **Apify API** -- REST API for programmatic access
- **Webhooks** -- get notified when a run finishes
- **Zapier / Make** -- connect to 5,000+ apps
- **Google Sheets** -- export directly to spreadsheets

#### API Example (Node.js)

```javascript
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('glassventures/wayback-machine-scraper').call({
    startUrls: [{ url: 'https://www.example.com' }],
    maxItems: 100,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
```

#### API Example (Python)

```python
from apify_client import ApifyClient
client = ApifyClient('YOUR_TOKEN')
run = client.actor('glassventures/wayback-machine-scraper').call(run_input={
    'startUrls': [{'url': 'https://www.example.com'}],
    'maxItems': 100,
})
items = client.dataset(run['defaultDatasetId']).list_items().items
```

#### API Example (cURL)

```bash
curl "https://api.apify.com/v2/acts/glassventures~wayback-machine-scraper/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"startUrls": [{"url": "https://www.example.com"}], "maxItems": 100}'
```

### Tips and tricks

- Start with a small `maxItems` (10-50) to test before running large scrapes
- Use date filters (`dateFrom`/`dateTo`) to narrow results for popular sites with thousands of snapshots
- Domain-wide searches can return very large datasets -- always set a `maxItems` limit
- Filter by `statusCode: "200"` to only get successful snapshots (skip redirects and errors)
- The Wayback Machine API can be slow for domains with millions of snapshots -- be patient
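Once a small test run has finished, a quick summary of the dataset helps decide how to narrow the next run. The sketch below keeps only snapshots with `statusCode` "200" and counts them per year using the first four digits of `timestamp`. The `successful_snapshots_per_year` helper is hypothetical; it only assumes items shaped like the output example.

```python
from collections import Counter


def successful_snapshots_per_year(items):
    """Count snapshots with statusCode "200", grouped by archive year."""
    ok = [item for item in items if item.get("statusCode") == "200"]
    # The first four characters of a Wayback timestamp are the year.
    counts = Counter(item["timestamp"][:4] for item in ok)
    return dict(sorted(counts.items()))


sample_items = [
    {"statusCode": "200", "timestamp": "20230115120000"},
    {"statusCode": "301", "timestamp": "20230116120000"},
    {"statusCode": "200", "timestamp": "20210101000000"},
    {"statusCode": "200", "timestamp": "20230301000000"},
]
print(successful_snapshots_per_year(sample_items))
```

A year with an unexpectedly large count is a good candidate for a tighter `dateFrom`/`dateTo` window.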

### FAQ

**Q: Does this Actor require login credentials?**
A: No. The Wayback Machine CDX API is completely free and public.

**Q: How fast is the scraping?**
A: Typically 1,000-10,000 results per minute depending on the API response time. Large domain searches may take longer.

**Q: What should I do if I get rate limited?**
A: Enable proxy configuration to rotate IPs automatically. You can also lower `maxConcurrency` so fewer URLs are processed in parallel.

**Q: Can I get the actual page content from the archive?**
A: This Actor returns snapshot metadata (URLs, dates, status codes). Use the `archiveUrl` field to access the actual archived page content.

**Q: Why are some snapshots missing?**
A: The Wayback Machine does not archive every page on every visit. Some pages may have been excluded by robots.txt or simply not crawled.
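For the page-content question above, one possible follow-up step is to download the page behind each `archiveUrl` yourself. This is a minimal sketch using only the Python standard library; `fetch_archived_page` is a hypothetical helper, not part of the Actor, and the `opener` argument exists only so the function can be exercised without network access.

```python
import urllib.request


def fetch_archived_page(archive_url, opener=urllib.request.urlopen, timeout=30):
    """Download the archived page behind `archiveUrl` and return it as text."""
    with opener(archive_url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_archived_page(item["archiveUrl"])` for each dataset item would fetch the snapshots one by one; pausing between requests keeps the load on Archive.org modest.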

### Is it legal to scrape the Wayback Machine?

The Wayback Machine (Archive.org) provides a public API specifically designed for programmatic access to archive data. This Actor uses only the official CDX API. Always review and respect Archive.org's Terms of Service. For more information, see [Apify's blog on web scraping legality](https://blog.apify.com/is-web-scraping-legal/).

### Related Actors

- [Website Content Crawler](https://apify.com/apify/website-content-crawler) -- Crawl and extract content from live websites
- [Google Cache Scraper](https://apify.com/apify/google-cache-scraper) -- Access Google's cached versions of web pages

### Limitations

- The CDX API may rate-limit requests for very high-volume queries
- Domain-wide searches for popular domains (e.g., google.com) can return millions of records -- use date filters and `maxItems`
- The Actor returns snapshot metadata, not the actual archived page content
- Some timestamps may have reduced precision (date only, no time)
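One way to keep domain-wide searches manageable is to split a long period into one run per calendar year, each with its own `maxItems` cap. The sketch below only prepares the input objects. The helper names are hypothetical, and the `YYYY-MM-DD` date format is an assumption: the input schema only says `dateFrom` and `dateTo` are strings, so check which format the Actor accepts before relying on it.

```python
from datetime import date


def yearly_windows(start, end):
    """Split [start, end] into per-calendar-year (dateFrom, dateTo) pairs."""
    windows = []
    for year in range(start.year, end.year + 1):
        window_start = max(start, date(year, 1, 1))
        window_end = min(end, date(year, 12, 31))
        # ASSUMPTION: dates are passed to the Actor as YYYY-MM-DD strings.
        windows.append((window_start.isoformat(), window_end.isoformat()))
    return windows


def build_inputs(domain, start, end, max_items_per_run):
    """Create one Actor input object per yearly window."""
    return [
        {
            "domains": [domain],
            "dateFrom": date_from,
            "dateTo": date_to,
            "maxItems": max_items_per_run,
        }
        for date_from, date_to in yearly_windows(start, end)
    ]


for run_input in build_inputs("example.com", date(2021, 6, 1), date(2023, 3, 15), 5000):
    print(run_input)
```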

### Changelog

- **v0.1** (2026-04-23) -- Initial release

# Actor input Schema

## `startUrls` (type: `array`):

Website URLs to look up in the Wayback Machine archive. The Actor will find all archived snapshots for each URL.

## `domains` (type: `array`):

Domain names to search across the entire domain (e.g., "example.com"). Uses wildcard matching to find all archived pages under the domain.

## `dateFrom` (type: `string`):

Only include snapshots archived on or after this date.

## `dateTo` (type: `string`):

Only include snapshots archived on or before this date.

## `mimeTypeFilter` (type: `string`):

Filter snapshots by content type.

## `statusCodeFilter` (type: `string`):

Filter by HTTP status code (e.g., "200" for successful responses, "301" for redirects). Leave empty for all status codes.

## `maxItems` (type: `integer`):

Maximum number of snapshot records to return. Use 0 for unlimited.

## `maxConcurrency` (type: `integer`):

Maximum number of URLs processed in parallel.

## `debugMode` (type: `boolean`):

Enables verbose logging for troubleshooting.

## `proxyConfig` (type: `object`):

Proxy settings. The Wayback Machine API is public, but proxies can help avoid rate limits at scale.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.example.com"
    }
  ],
  "mimeTypeFilter": "all",
  "maxItems": 1000,
  "maxConcurrency": 5,
  "debugMode": false
}
```
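Before starting a run, it can be useful to check an input object locally against the constraints listed in the input schema. The following is a rough sketch, not a full JSON Schema validator; `validate_input` is a hypothetical helper, and the limits it checks (the `mimeTypeFilter` values, `maxItems` of at least 0, `maxConcurrency` between 1 and 20) are copied from the OpenAPI specification in this document.

```python
ALLOWED_MIME_TYPES = {
    "all", "text/html", "application/json", "text/css",
    "application/javascript", "image/jpeg", "image/png",
}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_input(run_input):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for entry in run_input.get("startUrls", []):
        if not isinstance(entry, dict) or "url" not in entry:
            problems.append(f"startUrls entry needs a 'url' key: {entry!r}")
    mime = run_input.get("mimeTypeFilter", "all")
    if mime not in ALLOWED_MIME_TYPES:
        problems.append(f"unsupported mimeTypeFilter: {mime!r}")
    max_items = run_input.get("maxItems", 1000)
    if not _is_int(max_items) or max_items < 0:
        problems.append("maxItems must be an integer >= 0")
    concurrency = run_input.get("maxConcurrency", 5)
    if not _is_int(concurrency) or not 1 <= concurrency <= 20:
        problems.append("maxConcurrency must be an integer between 1 and 20")
    return problems


example = {
    "startUrls": [{"url": "https://www.example.com"}],
    "mimeTypeFilter": "all",
    "maxItems": 1000,
    "maxConcurrency": 5,
    "debugMode": False,
}
print(validate_input(example))
```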

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.example.com"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("glassventures/wayback-machine-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://www.example.com" }] }

# Run the Actor and wait for it to finish
run = client.actor("glassventures/wayback-machine-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.example.com"
    }
  ]
}' |
apify call glassventures/wayback-machine-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=glassventures/wayback-machine-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Wayback Machine Scraper",
        "description": "Scrape Wayback Machine archive snapshots for any URL or domain. Get archived URLs, timestamps, status codes, MIME types. Export to JSON, CSV, Excel.",
        "version": "0.1",
        "x-build-id": "e7ngMNc9mnqHMBhgT"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/glassventures~wayback-machine-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-glassventures-wayback-machine-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/glassventures~wayback-machine-scraper/runs": {
            "post": {
                "operationId": "runs-sync-glassventures-wayback-machine-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/glassventures~wayback-machine-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-glassventures-wayback-machine-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "Website URLs to look up in the Wayback Machine archive. The actor will find all archived snapshots for each URL.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "domains": {
                        "title": "Domains",
                        "type": "array",
                        "description": "Domain names to search across the entire domain (e.g., \"example.com\"). Uses wildcard matching to find all archived pages under the domain.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "dateFrom": {
                        "title": "Date From",
                        "type": "string",
                        "description": "Only include snapshots archived on or after this date."
                    },
                    "dateTo": {
                        "title": "Date To",
                        "type": "string",
                        "description": "Only include snapshots archived on or before this date."
                    },
                    "mimeTypeFilter": {
                        "title": "MIME Type Filter",
                        "enum": [
                            "all",
                            "text/html",
                            "application/json",
                            "text/css",
                            "application/javascript",
                            "image/jpeg",
                            "image/png"
                        ],
                        "type": "string",
                        "description": "Filter snapshots by content type.",
                        "default": "all"
                    },
                    "statusCodeFilter": {
                        "title": "Status Code Filter",
                        "type": "string",
                        "description": "Filter by HTTP status code (e.g., \"200\" for successful responses, \"301\" for redirects). Leave empty for all status codes."
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of snapshot records to return. Use 0 for unlimited.",
                        "default": 1000
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum number of URLs processed in parallel.",
                        "default": 5
                    },
                    "debugMode": {
                        "title": "Debug Mode",
                        "type": "boolean",
                        "description": "Enables verbose logging for troubleshooting.",
                        "default": false
                    },
                    "proxyConfig": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Proxy settings. The Wayback Machine API is public, but proxies can help avoid rate limits at scale."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
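For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification can be used without a client library. The sketch below only builds the request with the Python standard library. `build_sync_request` is a hypothetical helper; the path, the `token` query parameter, and the JSON request body follow the specification above.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/glassventures~wayback-machine-scraper/run-sync-get-dataset-items"


def build_sync_request(token, run_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://www.example.com"}], "maxItems": 10},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Because the call waits for the run to finish, a small `maxItems` keeps the response time short.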
