
Wayback Machine Toolkit

Pricing: from $0.10 / result



Rating: 0.0 (0 reviews)

Developer: Logical Vivacity (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user

Last modified: 17 days ago

A practical toolkit on top of the public web archive. Goes well beyond raw snapshot listings: extract clean article markdown from any past capture, diff two points in time, audit a list of URLs for link rot, and detect content changes across pages between dates.

Why this actor

The free archive APIs tell you that something was captured. This actor tells you what changed, what's still readable, and what you can recover. Five focused modes, one consistent interface, structured output ready for a database or spreadsheet.

Killer features

  • Diff two snapshots of any URL and get a unified text diff plus a similarity score, computed on cleaned prose (no HTML noise).
  • Link rot audit: feed a list of URLs, get back which ones are dead in the wild, which are still archived, and the exact archive URL you can swap in. Recover broken citations, broken backlinks, and lost references in bulk.
  • Change detection across many URLs between two dates: one summary record per URL with a boolean changed flag and a similarity score. Watch competitor pages, policy pages, pricing pages, or your own content for silent edits.
  • Clean content extraction from archived HTML: title, author, date, language, word count, and markdown body — not raw page source.
  • Snapshot index lookups (the basic mode) for compatibility with audit and forensics workflows.
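
The diff and similarity scoring described above can be approximated with Python's standard difflib; a minimal sketch of the idea (the actor's own text cleaning and scoring may differ):

```python
import difflib

def diff_snapshots(old_text: str, new_text: str) -> dict:
    """Compare two cleaned-prose snapshots: unified diff, line stats,
    and a similarity ratio, mirroring the diff-mode output fields."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff_lines = list(difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new"))
    return {
        "diff_unified": "".join(diff_lines),
        # Skip the "+++"/"---" file headers when counting changed lines.
        "added_lines": sum(1 for l in diff_lines
                           if l.startswith("+") and not l.startswith("+++")),
        "removed_lines": sum(1 for l in diff_lines
                             if l.startswith("-") and not l.startswith("---")),
        "similarity_ratio": round(
            difflib.SequenceMatcher(None, old_text, new_text).ratio(), 4),
    }
```

difflib.SequenceMatcher returns 1.0 for identical texts and trends toward 0.0 as they diverge, which matches the similarity_ratio semantics in the output samples below.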

Modes

| Mode | What it does |
| --- | --- |
| snapshots | Lists archive index entries for each URL within an optional date range. |
| content | Fetches the archived HTML at a target date and returns cleaned markdown + structured metadata. |
| diff | For each URL, compares two snapshots and returns a unified text diff plus stats. |
| link-rot | For each URL, checks current reachability AND archive availability. Flags dead-but-recoverable links. |
| change-detection | For each URL, summarises whether content changed between two dates (similarity ratio + changed flag). |

Inputs

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | array&lt;string&gt; | yes | — | URLs to process. |
| mode | enum | yes | snapshots | One of snapshots, content, diff, link-rot, change-detection. |
| fromDate | string | for diff, change-detection | — | Lower bound. YYYY-MM-DD or YYYYMMDD[hhmmss]. |
| toDate | string | for diff, change-detection | — | Upper bound. Same formats. |
| targetDate | string | optional for content | newest | Which snapshot to fetch in content mode. |
| maxSnapshotsPerUrl | integer | no | 100 | Cap for snapshots mode. |
| userAgent | string | no | Apify Actor wayback-machine | The archive prefers a descriptive UA, ideally with contact info. |
| concurrency | integer (1-10) | no | 5 | Parallelism for the bulk modes. |
| includeDiffText | boolean | no | false | If true, change-detection records include the full unified diff text. |
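
For example, a link-rot audit over a couple of URLs might use an input like this (all values are illustrative):

```json
{
  "urls": [
    "https://example.com/old-post",
    "https://example.com/pricing"
  ],
  "mode": "link-rot",
  "concurrency": 5,
  "userAgent": "my-audit-bot (contact: ops@example.com)"
}
```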

Output samples

snapshots

```json
{
  "url": "https://example.com",
  "snapshot_url": "https://web.archive.org/web/20200101000000/https://example.com/",
  "timestamp": "20200101000000",
  "status_code": "200",
  "mime_type": "text/html",
  "digest": "ABCDEF1234567890ABCDEF1234567890"
}
```
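
The timestamp field uses the archive's 14-digit YYYYMMDDhhmmss format — the same compact form fromDate and toDate accept. A small sketch for turning it into a datetime (shorter date-only forms are zero-padded):

```python
from datetime import datetime

def parse_archive_timestamp(ts: str) -> datetime:
    """Parse an archive timestamp such as "20200101000000".
    Date-only inputs like "20200101" are padded to midnight."""
    ts = ts.ljust(14, "0")  # "20200101" -> "20200101000000"
    return datetime.strptime(ts, "%Y%m%d%H%M%S")
```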

content

```json
{
  "url": "https://example.com/post",
  "snapshot_url": "https://web.archive.org/web/20230615120000/https://example.com/post",
  "timestamp": "20230615120000",
  "title": "How we shipped X",
  "byline": "Jane Doe",
  "date": "2023-06-14",
  "language": "en",
  "text": "Plain text body...",
  "markdown": "# How we shipped X\n\nPlain markdown body...",
  "word_count": 842,
  "status_code": 200
}
```
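
The snapshot_url values follow the archive's standard replay-URL layout, timestamp then original URL; a helper along these lines (illustrative — the actor builds these internally) also shows the raw id_ flavour mentioned under Limitations:

```python
def snapshot_url(url: str, timestamp: str, raw: bool = False) -> str:
    """Build a web.archive.org replay URL for a capture.
    raw=True requests the id_ flavour: the capture as archived,
    without the archive's own UI rewriting."""
    flavour = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flavour}/{url}"
```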

diff

```json
{
  "url": "https://example.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "added_lines": 12,
  "removed_lines": 7,
  "changed_chars": 318,
  "similarity_ratio": 0.9421,
  "diff_unified": "--- https://example.com/pricing@20240101000000\n+++ https://example.com/pricing@20240601000000\n@@ ...\n-Old plan: $9/mo\n+New plan: $12/mo\n"
}
```

link-rot

```json
{
  "url": "https://www.geocities.com/SiliconValley/",
  "current_status_code": null,
  "current_reachable": false,
  "current_error": "name resolution failure",
  "last_archived_at": "20091026152611",
  "last_archived_status": "200",
  "archived_alternatives_count": 318,
  "recommended_archive_url": "http://web.archive.org/web/20091026152611/http://www.geocities.com/SiliconValley/",
  "recoverable_from_archive": true
}
```

change-detection

```json
{
  "url": "https://competitor.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "similarity_ratio": 0.8732,
  "added_lines": 18,
  "removed_lines": 11,
  "changed_chars": 612,
  "changed": true
}
```
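
One plausible way to derive the changed flag from the similarity ratio — the actor's actual threshold is internal, and 0.98 here is purely an assumption for illustration:

```python
import difflib

def detect_change(old_text: str, new_text: str,
                  threshold: float = 0.98) -> dict:
    """Flag a page as changed when cleaned-prose similarity drops
    below `threshold` (0.98 is illustrative, not the actor's value)."""
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return {"similarity_ratio": round(ratio, 4),
            "changed": ratio < threshold}
```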

Limitations

  • The public web archive's index and availability APIs rate-limit aggressive callers. Keep concurrency modest and provide a descriptive userAgent (ideally with contact info) for large runs.
  • Coverage depends on whether each URL was crawled and archived. Missing URLs return a record with empty fields and an error note rather than failing the whole run.
  • Diff and change-detection compare cleaned prose extracted from each snapshot. Boilerplate (nav, footer) is mostly excluded, which is what you usually want — but small structural-only edits may not register.
  • content mode requests the archive's raw (id_) capture flavour to avoid the archive's own UI rewriting; binary or non-HTML captures will produce empty extracted text.
  • Live link checks in link-rot follow redirects and fall back from HEAD to GET on 405. Some hostile origins block both; those are reported as current_reachable: false with the underlying error string.
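
The HEAD-then-GET fallback described in the last point can be sketched with an injected transport so the logic is testable offline (the send callable is a stand-in here; the real actor uses its own redirect-following HTTP client):

```python
def check_reachability(url: str, send) -> dict:
    """Probe a live URL. `send(method, url)` returns a status code or
    raises on network failure. HEAD is tried first; a 405 triggers a
    GET retry, mirroring origins that reject HEAD."""
    try:
        status = send("HEAD", url)
        if status == 405:  # origin rejects HEAD; retry with GET
            status = send("GET", url)
        return {"current_status_code": status,
                "current_reachable": status < 400,
                "current_error": None}
    except Exception as exc:
        return {"current_status_code": None,
                "current_reachable": False,
                "current_error": str(exc)}
```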

Licensing

This actor is MIT-licensed. It uses permissively-licensed open-source components; see LICENSE for the preserved upstream copyright notices.