# Wayback Machine Toolkit (`logical_vivacity/wayback-machine-toolkit`) Actor

A practical toolkit on top of the public web archive. Goes well beyond raw snapshot listings: extract clean article markdown from any past capture, diff two points in time, audit a list of URLs for link rot, and detect content changes across pages between dates.

- **URL**: https://apify.com/logical_vivacity/wayback-machine-toolkit.md
- **Developed by:** [Logical Vivacity](https://apify.com/logical_vivacity) (community)
- **Categories:** Automation, Developer tools, SEO tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.10 / result

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
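As a rough budgeting aid, the per-result price can be turned into a lower-bound estimate. This is a minimal sketch: `estimate_min_cost` is a hypothetical helper, and it assumes the listed "from $0.10 / result" is the lowest price any event can have, so real charges may be higher.

```python
from decimal import Decimal

# Assumption: "from $0.10 / result" is a minimum; actual event prices
# may be higher depending on the mode or your plan.
PRICE_PER_RESULT_USD = Decimal("0.10")


def estimate_min_cost(result_count: int, price: Decimal = PRICE_PER_RESULT_USD) -> Decimal:
    """Return a lower-bound cost estimate in USD for a number of results."""
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    return price * result_count


print(estimate_min_cost(250))  # prints 25.00
```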

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Wayback Machine Toolkit

A practical toolkit on top of the public web archive. Goes well beyond raw snapshot listings: extract clean article markdown from any past capture, diff two points in time, audit a list of URLs for link rot, and detect content changes across pages between dates.

### Why this Actor

The free archive APIs tell you *that* something was captured. This Actor tells you *what changed*, *what's still readable*, and *what you can recover*. Five focused modes, one consistent interface, structured output ready for a database or spreadsheet.

### Killer features

- **Diff two snapshots** of any URL and get a unified text diff plus a similarity score, computed on cleaned prose (no HTML noise).
- **Link rot audit**: feed a list of URLs, get back which ones are dead in the wild, which are still archived, and the exact archive URL you can swap in. Recover broken citations, broken backlinks, and lost references in bulk.
- **Change detection across many URLs** between two dates: one summary record per URL with a `changed: bool` and similarity score. Watch competitor pages, policy pages, pricing pages, or your own content for silent edits.
- **Clean content extraction** from archived HTML: title, author, date, language, word count, and markdown body — not raw page source.
- **Snapshot index** lookups (the basic mode) for compatibility with audit and forensics workflows.

### Modes

| Mode | What it does |
|---|---|
| `snapshots` | Lists archive index entries for each URL within an optional date range. |
| `content` | Fetches the archived HTML at a target date and returns cleaned markdown + structured metadata. |
| `diff` | For each URL, compares two snapshots and returns a unified text diff plus stats. |
| `link-rot` | For each URL, checks current reachability AND archive availability. Flags dead-but-recoverable links. |
| `change-detection` | For each URL, summarises whether content changed between two dates (similarity ratio + `changed` flag). |

### Inputs

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | `array<string>` | yes | — | URLs to process. |
| `mode` | enum | yes | `snapshots` | One of `snapshots`, `content`, `diff`, `link-rot`, `change-detection`. |
| `fromDate` | string | for `diff`, `change-detection` | — | Lower bound. `YYYY-MM-DD` or `YYYYMMDD[hhmmss]`. |
| `toDate` | string | for `diff`, `change-detection` | — | Upper bound. Same formats. |
| `targetDate` | string | optional for `content` | newest | Which snapshot to fetch in `content` mode. |
| `maxSnapshotsPerUrl` | integer | no | `100` | Cap for `snapshots` mode. |
| `userAgent` | string | no | `Apify Actor wayback-machine` | The archive prefers a descriptive UA, ideally with contact info. |
| `concurrency` | integer (1-10) | no | `5` | Parallelism for the bulk modes. |
| `includeDiffText` | boolean | no | `false` | If true, `change-detection` records include the full unified diff text. |
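The constraints in the table above can be checked client-side before starting a run. The following is a minimal sketch, not part of the Actor: `build_input` is a hypothetical helper, and it assumes that `targetDate` accepts the same two date formats as `fromDate` and `toDate`, which the table does not state explicitly.

```python
import re

VALID_MODES = {"snapshots", "content", "diff", "link-rot", "change-detection"}
MODES_NEEDING_RANGE = {"diff", "change-detection"}
# Accepts YYYY-MM-DD, YYYYMMDD, or YYYYMMDDhhmmss.
DATE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}|\d{8}(\d{6})?)$")


def build_input(urls, mode="snapshots", **options):
    """Validate and assemble a run input dict (client-side sanity check only)."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode in MODES_NEEDING_RANGE:
        missing = [key for key in ("fromDate", "toDate") if key not in options]
        if missing:
            raise ValueError(f"{mode} mode requires {', '.join(missing)}")
    # Assumption: targetDate uses the same formats as fromDate/toDate.
    for key in ("fromDate", "toDate", "targetDate"):
        if key in options and not DATE_PATTERN.match(options[key]):
            raise ValueError(f"{key} has an unsupported date format: {options[key]!r}")
    concurrency = options.get("concurrency", 5)
    if not 1 <= concurrency <= 10:
        raise ValueError("concurrency must be between 1 and 10")
    return {"urls": list(urls), "mode": mode, **options}
```

For example, `build_input(["https://example.com/pricing"], "diff", fromDate="2024-01-01", toDate="20240601")` returns a dict ready to pass as the run input, while omitting either date raises `ValueError`.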

### Output samples

#### `snapshots`

```json
{
  "url": "https://example.com",
  "snapshot_url": "https://web.archive.org/web/20200101000000/https://example.com/",
  "timestamp": "20200101000000",
  "status_code": "200",
  "mime_type": "text/html",
  "digest": "ABCDEF1234567890ABCDEF1234567890"
}
````
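Records shaped like the sample above can be post-processed to find distinct versions of a page. This is a sketch under two assumptions that the README does not confirm: `timestamp` is in UTC, and two captures with the same `digest` carry identical content. Both helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(ts: str) -> datetime:
    """Convert a 14-digit timestamp (YYYYMMDDhhmmss) to an aware datetime.

    Assumption: archive timestamps are expressed in UTC.
    """
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def distinct_versions(records):
    """Keep the first capture of each run of identical digests, oldest first.

    Assumption: equal digests mean the captured content did not change.
    """
    ordered = sorted(records, key=lambda record: record["timestamp"])
    kept, last_digest = [], None
    for record in ordered:
        if record["digest"] != last_digest:
            kept.append(record)
            last_digest = record["digest"]
    return kept
```

Sorting on the raw `timestamp` string works because the fixed-width format orders lexicographically the same way it orders chronologically.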

#### `content`

```json
{
  "url": "https://example.com/post",
  "snapshot_url": "https://web.archive.org/web/20230615120000/https://example.com/post",
  "timestamp": "20230615120000",
  "title": "How we shipped X",
  "byline": "Jane Doe",
  "date": "2023-06-14",
  "language": "en",
  "text": "Plain text body...",
  "markdown": "## How we shipped X\n\nPlain markdown body...",
  "word_count": 842,
  "status_code": 200
}
```

#### `diff`

```json
{
  "url": "https://example.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "added_lines": 12,
  "removed_lines": 7,
  "changed_chars": 318,
  "similarity_ratio": 0.9421,
  "diff_unified": "--- https://example.com/pricing@20240101000000\n+++ https://example.com/pricing@20240601000000\n@@ ...\n-Old plan: $9/mo\n+New plan: $12/mo\n"
}
```
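To make the fields in this record concrete, here is one way such numbers could be derived from two pieces of cleaned text using Python's standard `difflib`. This is an illustrative sketch only; the README does not say which diff algorithm or similarity measure the Actor uses, so the values it produces may differ from the Actor's output. `summarize_diff` is a hypothetical name.

```python
import difflib


def summarize_diff(old_text: str, new_text: str, label: str = "page"):
    """Compute a unified diff plus simple stats for two text versions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    diff_lines = list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"{label}@old",
            tofile=f"{label}@new",
            lineterm="",
        )
    )
    # The first two lines are the ---/+++ file headers; skip them when counting.
    body = diff_lines[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    # Character-level similarity in [0, 1]; 1.0 means identical text.
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return {
        "added_lines": added,
        "removed_lines": removed,
        "similarity_ratio": round(ratio, 4),
        "diff_unified": "\n".join(diff_lines),
    }
```

Calling `summarize_diff("Old plan: $9/mo\nFree trial", "New plan: $12/mo\nFree trial")` reports one added and one removed line, with the unchanged `Free trial` line appearing as context in `diff_unified`.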

#### `link-rot`

```json
{
  "url": "https://www.geocities.com/SiliconValley/",
  "current_status_code": null,
  "current_reachable": false,
  "current_error": "name resolution failure",
  "last_archived_at": "20091026152611",
  "last_archived_status": "200",
  "archived_alternatives_count": 318,
  "recommended_archive_url": "http://web.archive.org/web/20091026152611/http://www.geocities.com/SiliconValley/",
  "recoverable_from_archive": true
}
```
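Records like the one above can drive a bulk link repair. The sketch below assumes the field names shown in the sample; `build_replacements` and `rewrite_links` are hypothetical helpers, and the substitution is a plain string match that does not understand HTML or Markdown link syntax.

```python
import re


def build_replacements(records):
    """Map each dead-but-archived URL to its suggested archive URL."""
    return {
        record["url"]: record["recommended_archive_url"]
        for record in records
        if not record.get("current_reachable")
        and record.get("recoverable_from_archive")
        and record.get("recommended_archive_url")
    }


def rewrite_links(text, replacements):
    """Swap dead URLs for archive URLs in a single pass.

    Longest URLs are tried first so a short URL does not clobber a longer
    one that starts with it. A URL that is a prefix of an unrelated longer
    URL in the text can still match; this sketch does not guard against that.
    """
    if not replacements:
        return text
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(url) for url in ordered))
    return pattern.sub(lambda match: replacements[match.group(0)], text)
```

Doing the substitution in one regex pass avoids re-replacing text inside an archive URL that was just inserted.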

#### `change-detection`

```json
{
  "url": "https://competitor.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "similarity_ratio": 0.8732,
  "added_lines": 18,
  "removed_lines": 11,
  "changed_chars": 612,
  "changed": true
}
```
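When watching many pages, the summary records can be triaged so the largest edits surface first. A minimal sketch, assuming the field names in the sample above; `triage_changes` and the `0.95` default threshold are illustrative choices, not Actor behaviour.

```python
def triage_changes(records, max_similarity=0.95):
    """Return changed records at or below a similarity threshold, most-changed first.

    Records without a numeric similarity_ratio (for example, URLs with no
    archived capture) are skipped.
    """
    flagged = [
        record
        for record in records
        if record.get("changed")
        and isinstance(record.get("similarity_ratio"), (int, float))
        and record["similarity_ratio"] <= max_similarity
    ]
    return sorted(flagged, key=lambda record: record["similarity_ratio"])
```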

### Limitations

- The public web archive's index and availability APIs rate-limit aggressive callers. Keep `concurrency` modest and provide a descriptive `userAgent` (ideally with contact info) for large runs.
- Coverage depends on whether each URL was crawled and archived. Missing URLs return a record with empty fields and an `error` note rather than failing the whole run.
- Diff and change-detection compare *cleaned prose* extracted from each snapshot. Boilerplate (nav, footer) is mostly excluded, which is what you usually want — but small structural-only edits may not register.
- `content` mode requests the archive's raw (`id_`) capture flavour to avoid the archive's own UI rewriting; binary or non-HTML captures will produce empty extracted text.
- Live link checks in `link-rot` follow redirects and fall back from `HEAD` to `GET` on `405`. Some hostile origins block both; those are reported as `current_reachable: false` with the underlying error string.
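The `HEAD`-then-`GET` fallback described in the last point can be sketched as follows. This is not the Actor's code: `check_link` and `urllib_send` are hypothetical names, and treating any status below 400 as reachable is an assumption. The transport is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def urllib_send(method, url, timeout=10.0):
    """Send one request with the standard library and return the status code.

    urlopen follows redirects by default; HTTP error statuses are returned
    rather than raised, while network failures propagate as exceptions.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def check_link(url, send=urllib_send):
    """Try HEAD first and retry with GET only when the server answers 405."""
    try:
        status = send("HEAD", url)
        if status == 405:
            status = send("GET", url)
    except Exception as exc:  # DNS failures, timeouts, TLS errors, ...
        return {
            "url": url,
            "current_status_code": None,
            "current_reachable": False,
            "current_error": str(exc),
        }
    return {
        "url": url,
        "current_status_code": status,
        # Assumption: any status below 400 counts as reachable.
        "current_reachable": status < 400,
        "current_error": None,
    }
```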

### Licensing

This Actor is MIT-licensed. It uses permissively-licensed open-source components; see `LICENSE` for the preserved upstream copyright notices.

# Actor input Schema

## `urls` (type: `array`):

List of URLs to process. Required for all modes.

## `mode` (type: `string`):

Operation to run. 'snapshots' indexes captures. 'content' extracts clean markdown from an archived snapshot. 'diff' compares two snapshots of one URL. 'link-rot' checks live status and archive availability for a list of URLs. 'change-detection' summarises whether each URL changed between two dates.

## `fromDate` (type: `string`):

Lower-bound date for 'diff' and 'change-detection'. Accepts `YYYY-MM-DD` or `YYYYMMDD[hhmmss]`.

## `toDate` (type: `string`):

Upper-bound date for 'diff' and 'change-detection'. Accepts `YYYY-MM-DD` or `YYYYMMDD[hhmmss]`.

## `targetDate` (type: `string`):

Target date for 'content' mode (which snapshot to fetch). Defaults to the newest available capture if omitted.

## `maxSnapshotsPerUrl` (type: `integer`):

Cap on records per URL in 'snapshots' mode.

## `userAgent` (type: `string`):

Custom User-Agent header. The archive prefers a descriptive UA with contact info.

## `concurrency` (type: `integer`):

Parallelism for bulk modes ('link-rot', 'change-detection').

## `includeDiffText` (type: `boolean`):

If true, 'change-detection' records include the full unified diff text. Off by default to keep records compact.

## Actor input object example

```json
{
  "urls": [
    "https://example.com"
  ],
  "mode": "snapshots",
  "maxSnapshotsPerUrl": 100,
  "userAgent": "Apify Actor wayback-machine",
  "concurrency": 5,
  "includeDiffText": false
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input (`urls` and `mode` are both required by the input schema)
const input = {
    "urls": [
        "https://example.com"
    ],
    "mode": "snapshots"
};

// Run the Actor and wait for it to finish
const run = await client.actor("logical_vivacity/wayback-machine-toolkit").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input (`urls` and `mode` are both required by the input schema)
run_input = { "urls": ["https://example.com"], "mode": "snapshots" }

# Run the Actor and wait for it to finish
run = client.actor("logical_vivacity/wayback-machine-toolkit").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://example.com"
  ],
  "mode": "snapshots"
}' |
apify call logical_vivacity/wayback-machine-toolkit --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=logical_vivacity/wayback-machine-toolkit",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Wayback Machine Toolkit",
        "description": "A practical toolkit on top of the public web archive. Goes well beyond raw snapshot listings: extract clean article markdown from any past capture, diff two points in time, audit a list of URLs for link rot, and detect content changes across pages between",
        "version": "0.1",
        "x-build-id": "vaPW1kZSBVcyAxMZW"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/logical_vivacity~wayback-machine-toolkit/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-logical_vivacity-wayback-machine-toolkit",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/logical_vivacity~wayback-machine-toolkit/runs": {
            "post": {
                "operationId": "runs-sync-logical_vivacity-wayback-machine-toolkit",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/logical_vivacity~wayback-machine-toolkit/run-sync": {
            "post": {
                "operationId": "run-sync-logical_vivacity-wayback-machine-toolkit",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls",
                    "mode"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs",
                        "type": "array",
                        "description": "List of URLs to process. Required for all modes.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "snapshots",
                            "content",
                            "diff",
                            "link-rot",
                            "change-detection"
                        ],
                        "type": "string",
                        "description": "Operation to run. 'snapshots' indexes captures. 'content' extracts clean markdown from an archived snapshot. 'diff' compares two snapshots of one URL. 'link-rot' checks live status and archive availability for a list of URLs. 'change-detection' summarises whether each URL changed between two dates.",
                        "default": "snapshots"
                    },
                    "fromDate": {
                        "title": "From date",
                        "type": "string",
                        "description": "Lower-bound date for 'diff' and 'change-detection'. Accepts YYYY-MM-DD or YYYYMMDD[hhmmss]."
                    },
                    "toDate": {
                        "title": "To date",
                        "type": "string",
                        "description": "Upper-bound date for 'diff' and 'change-detection'. Accepts YYYY-MM-DD or YYYYMMDD[hhmmss]."
                    },
                    "targetDate": {
                        "title": "Target date",
                        "type": "string",
                        "description": "Target date for 'content' mode (which snapshot to fetch). Defaults to the newest available capture if omitted."
                    },
                    "maxSnapshotsPerUrl": {
                        "title": "Max snapshots per URL",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Cap on records per URL in 'snapshots' mode.",
                        "default": 100
                    },
                    "userAgent": {
                        "title": "User agent",
                        "type": "string",
                        "description": "Custom User-Agent header. The archive prefers a descriptive UA with contact info.",
                        "default": "Apify Actor wayback-machine"
                    },
                    "concurrency": {
                        "title": "Concurrency",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Parallelism for bulk modes ('link-rot', 'change-detection').",
                        "default": 5
                    },
                    "includeDiffText": {
                        "title": "Include diff text",
                        "type": "boolean",
                        "description": "If true, 'change-detection' records include the full unified diff text. Off by default to keep records compact.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
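For projects that do not use an Apify client, the `run-sync-get-dataset-items` path in the specification above can be called directly. The sketch below only builds the request with the standard library; `build_run_sync_request` is a hypothetical helper. It passes the token as a query parameter, as the specification declares, and assumes the successful response body is JSON (the specification itself only documents `200 OK`).

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/logical_vivacity~wayback-machine-toolkit/run-sync-get-dataset-items"


def build_run_sync_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the Actor and waits for dataset items."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (performs a real, billable run, so it is left commented out):
# request = build_run_sync_request("<YOUR_API_TOKEN>", {"urls": ["https://example.com"], "mode": "snapshots"})
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)  # assumption: the body is a JSON array of records
```

Keeping request construction separate from sending makes it easy to log or inspect the exact URL and payload before spending credits on a run.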
