Wayback Machine Toolkit
Pricing
from $0.10 / result
Developer: Logical Vivacity
Maintained by Community
Last modified: 17 days ago
A practical toolkit on top of the public web archive. Goes well beyond raw snapshot listings: extract clean article markdown from any past capture, diff two points in time, audit a list of URLs for link rot, and detect content changes across pages between dates.
Why this actor
The free archive APIs tell you that something was captured. This actor tells you what changed, what's still readable, and what you can recover. Five focused modes, one consistent interface, structured output ready for a database or spreadsheet.
Killer features
- Diff two snapshots of any URL and get a unified text diff plus a similarity score, computed on cleaned prose (no HTML noise).
- Link rot audit: feed a list of URLs, get back which ones are dead in the wild, which are still archived, and the exact archive URL you can swap in. Recover broken citations, broken backlinks, and lost references in bulk.
- Change detection across many URLs between two dates: one summary record per URL with a `changed` boolean and similarity score. Watch competitor pages, policy pages, pricing pages, or your own content for silent edits.
- Clean content extraction from archived HTML: title, author, date, language, word count, and markdown body — not raw page source.
- Snapshot index lookups (the basic mode) for compatibility with audit and forensics workflows.
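The diff-plus-similarity output described above can be approximated with Python's standard `difflib`; the sketch below assumes the two inputs are already cleaned prose (the actor's actual cleaning pipeline is not shown here):

```python
import difflib

def compare_snapshots(old_text: str, new_text: str) -> dict:
    """Compare two cleaned-prose snapshots: unified text diff + similarity ratio."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    # Unified diff over lines, labelled like the actor's "url@timestamp" headers
    diff = "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile="snapshot@from",
                                        tofile="snapshot@to"))
    # Character-level similarity in [0, 1]; 1.0 means identical text
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return {"diff_unified": diff, "similarity_ratio": round(ratio, 4)}

result = compare_snapshots("Old plan: $9/mo\n", "New plan: $12/mo\n")
```

`SequenceMatcher.ratio()` is what produces a number like the `similarity_ratio` fields in the output samples below; the actor may use a different scorer internally.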
Modes
| Mode | What it does |
|---|---|
| `snapshots` | Lists archive index entries for each URL within an optional date range. |
| `content` | Fetches the archived HTML at a target date and returns cleaned markdown + structured metadata. |
| `diff` | For each URL, compares two snapshots and returns a unified text diff plus stats. |
| `link-rot` | For each URL, checks current reachability AND archive availability. Flags dead-but-recoverable links. |
| `change-detection` | For each URL, summarises whether content changed between two dates (similarity ratio + changed flag). |
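For orientation, the `snapshots` mode corresponds to the archive's public CDX index API. A hand-rolled query can be built with the standard library alone (parameter names are the real CDX ones; the date values are illustrative):

```python
from urllib.parse import urlencode

def cdx_query_url(url: str, from_date: str = "", to_date: str = "",
                  limit: int = 100) -> str:
    """Build a Wayback CDX index query that returns JSON rows of captures."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_date:
        params["from"] = from_date   # lower bound, YYYYMMDD[hhmmss]
    if to_date:
        params["to"] = to_date       # upper bound, same format
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

query = cdx_query_url("example.com", from_date="20200101", to_date="20201231")
```

Fetching that URL returns rows with the same fields as the `snapshots` output sample below (timestamp, status code, MIME type, digest).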
Inputs
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array<string> | yes | — | URLs to process. |
| `mode` | enum | yes | `snapshots` | One of `snapshots`, `content`, `diff`, `link-rot`, `change-detection`. |
| `fromDate` | string | for `diff`, `change-detection` | — | Lower bound. `YYYY-MM-DD` or `YYYYMMDD[hhmmss]`. |
| `toDate` | string | for `diff`, `change-detection` | — | Upper bound. Same formats. |
| `targetDate` | string | optional for `content` | newest | Which snapshot to fetch in `content` mode. |
| `maxSnapshotsPerUrl` | integer | no | 100 | Cap for `snapshots` mode. |
| `userAgent` | string | no | `Apify Actor wayback-machine` | The archive prefers a descriptive UA, ideally with contact info. |
| `concurrency` | integer (1-10) | no | 5 | Parallelism for the bulk modes. |
| `includeDiffText` | boolean | no | false | If true, `change-detection` records include the full unified diff text. |
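Putting the fields together, a change-detection run over two URLs might use an input like this (all values illustrative):

```json
{
  "urls": ["https://competitor.com/pricing", "https://example.com/terms"],
  "mode": "change-detection",
  "fromDate": "2024-01-01",
  "toDate": "2024-06-01",
  "concurrency": 3,
  "includeDiffText": false,
  "userAgent": "my-audit-bot (contact: ops@example.com)"
}
```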
Output samples
snapshots
```json
{
  "url": "https://example.com",
  "snapshot_url": "https://web.archive.org/web/20200101000000/https://example.com/",
  "timestamp": "20200101000000",
  "status_code": "200",
  "mime_type": "text/html",
  "digest": "ABCDEF1234567890ABCDEF1234567890"
}
```
content
```json
{
  "url": "https://example.com/post",
  "snapshot_url": "https://web.archive.org/web/20230615120000/https://example.com/post",
  "timestamp": "20230615120000",
  "title": "How we shipped X",
  "byline": "Jane Doe",
  "date": "2023-06-14",
  "language": "en",
  "text": "Plain text body...",
  "markdown": "# How we shipped X\n\nPlain markdown body...",
  "word_count": 842,
  "status_code": 200
}
```
diff
```json
{
  "url": "https://example.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "added_lines": 12,
  "removed_lines": 7,
  "changed_chars": 318,
  "similarity_ratio": 0.9421,
  "diff_unified": "--- https://example.com/pricing@20240101000000\n+++ https://example.com/pricing@20240601000000\n@@ ...\n-Old plan: $9/mo\n+New plan: $12/mo\n"
}
```
link-rot
```json
{
  "url": "https://www.geocities.com/SiliconValley/",
  "current_status_code": null,
  "current_reachable": false,
  "current_error": "name resolution failure",
  "last_archived_at": "20091026152611",
  "last_archived_status": "200",
  "archived_alternatives_count": 318,
  "recommended_archive_url": "http://web.archive.org/web/20091026152611/http://www.geocities.com/SiliconValley/",
  "recoverable_from_archive": true
}
```
change-detection
```json
{
  "url": "https://competitor.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "similarity_ratio": 0.8732,
  "added_lines": 18,
  "removed_lines": 11,
  "changed_chars": 612,
  "changed": true
}
```
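The `changed` flag in a record like the one above can be understood as a threshold over the similarity ratio. A minimal sketch — the 0.995 cutoff is an assumption for illustration, not the actor's documented value:

```python
def summarize_change(url, from_ts, to_ts, similarity_ratio, threshold=0.995):
    """One summary record per URL: changed if similarity drops below threshold."""
    return {
        "url": url,
        "from_timestamp": from_ts,
        "to_timestamp": to_ts,
        "similarity_ratio": round(similarity_ratio, 4),
        "changed": similarity_ratio < threshold,
    }

record = summarize_change("https://competitor.com/pricing",
                          "20240101000000", "20240601000000", 0.8732)
```

A near-1.0 threshold means trivial whitespace churn in otherwise identical prose still counts as unchanged, while any substantive edit flips the flag.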
Limitations
- The public web archive's index and availability APIs rate-limit aggressive callers. Keep `concurrency` modest and provide a descriptive `userAgent` (ideally with contact info) for large runs.
- Coverage depends on whether each URL was crawled and archived. Missing URLs return a record with empty fields and an `error` note rather than failing the whole run.
- Diff and change-detection compare cleaned prose extracted from each snapshot. Boilerplate (nav, footer) is mostly excluded, which is usually what you want — but small structural-only edits may not register.
- `content` mode requests the archive's raw (`id_`) capture flavour to avoid the archive's own UI rewriting; binary or non-HTML captures will produce empty extracted text.
- Live link checks in `link-rot` follow redirects and fall back from `HEAD` to `GET` on 405. Some hostile origins block both; those are reported as `current_reachable: false` with the underlying error string.
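The HEAD-to-GET fallback described in the last bullet can be sketched with an injectable request function, which keeps the retry logic testable without network access (the function shape and error handling here are assumptions, not the actor's internals):

```python
def check_reachable(url, do_request):
    """Probe a URL for the link-rot check.

    Tries HEAD first and falls back to GET when the origin answers
    405 Method Not Allowed. do_request(method, url) returns an HTTP
    status code and may raise OSError on network-level failures.
    """
    try:
        status = do_request("HEAD", url)
        if status == 405:          # origin rejects HEAD; retry with GET
            status = do_request("GET", url)
        return {"current_status_code": status,
                "current_reachable": 200 <= status < 400,
                "current_error": None}
    except OSError as exc:         # DNS failure, refused connection, timeout
        return {"current_status_code": None,
                "current_reachable": False,
                "current_error": str(exc)}
```

In a real run `do_request` would wrap something like `urllib.request` with redirect-following enabled; injecting it as a parameter lets the 405 fallback be exercised with stub callables.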
Licensing
This actor is MIT-licensed. It uses permissively-licensed open-source components; see LICENSE for the preserved upstream copyright notices.


