
Wayback Machine Toolkit

Pricing: from $0.10 / result



Rating: 0.0 (0 reviews)

Developer: Logical Vivacity (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user

Last modified: 17 days ago

A practical toolkit on top of the public web archive. Goes well beyond raw snapshot listings: extract clean article markdown from any past capture, diff two points in time, audit a list of URLs for link rot, and detect content changes across pages between dates.

Why this actor

The free archive APIs tell you that something was captured. This actor tells you what changed, what's still readable, and what you can recover. Five focused modes, one consistent interface, structured output ready for a database or spreadsheet.

Killer features

  • Diff two snapshots of any URL and get a unified text diff plus a similarity score, computed on cleaned prose (no HTML noise).
  • Link rot audit: feed a list of URLs, get back which ones are dead in the wild, which are still archived, and the exact archive URL you can swap in. Recover broken citations, broken backlinks, and lost references in bulk.
  • Change detection across many URLs between two dates: one summary record per URL with a boolean changed flag and a similarity score. Watch competitor pages, policy pages, pricing pages, or your own content for silent edits.
  • Clean content extraction from archived HTML: title, author, date, language, word count, and markdown body — not raw page source.
  • Snapshot index lookups (the basic mode) for compatibility with audit and forensics workflows.
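
The diff and similarity scoring described above can be approximated with Python's standard difflib; a minimal sketch of the idea (the actor's own text cleaning and scoring may differ):

```python
import difflib

def diff_snapshots(old_text: str, new_text: str) -> dict:
    """Compare two cleaned-prose snapshots: unified diff, line stats,
    and a similarity ratio, mirroring the diff-mode output fields."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff_lines = list(difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new"))
    return {
        "diff_unified": "".join(diff_lines),
        # Skip the "+++"/"---" file headers when counting changed lines.
        "added_lines": sum(1 for l in diff_lines
                           if l.startswith("+") and not l.startswith("+++")),
        "removed_lines": sum(1 for l in diff_lines
                             if l.startswith("-") and not l.startswith("---")),
        "similarity_ratio": round(
            difflib.SequenceMatcher(None, old_text, new_text).ratio(), 4),
    }
```

difflib.SequenceMatcher returns 1.0 for identical texts and trends toward 0.0 as they diverge, which matches the similarity_ratio semantics in the output samples below.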

Modes

| Mode | What it does |
| --- | --- |
| snapshots | Lists archive index entries for each URL within an optional date range. |
| content | Fetches the archived HTML at a target date and returns cleaned markdown + structured metadata. |
| diff | For each URL, compares two snapshots and returns a unified text diff plus stats. |
| link-rot | For each URL, checks current reachability AND archive availability. Flags dead-but-recoverable links. |
| change-detection | For each URL, summarises whether content changed between two dates (similarity ratio + changed flag). |

Inputs

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | array&lt;string&gt; | yes | — | URLs to process. |
| mode | enum | yes | snapshots | One of snapshots, content, diff, link-rot, change-detection. |
| fromDate | string | for diff, change-detection | — | Lower bound. YYYY-MM-DD or YYYYMMDD[hhmmss]. |
| toDate | string | for diff, change-detection | — | Upper bound. Same formats. |
| targetDate | string | optional for content | newest | Which snapshot to fetch in content mode. |
| maxSnapshotsPerUrl | integer | no | 100 | Cap for snapshots mode. |
| userAgent | string | no | Apify Actor wayback-machine | The archive prefers a descriptive UA, ideally with contact info. |
| concurrency | integer (1-10) | no | 5 | Parallelism for the bulk modes. |
| includeDiffText | boolean | no | false | If true, change-detection records include the full unified diff text. |
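
For example, a link-rot audit over a couple of URLs might use an input like this (all values are illustrative):

```json
{
  "urls": [
    "https://example.com/old-post",
    "https://example.com/pricing"
  ],
  "mode": "link-rot",
  "concurrency": 5,
  "userAgent": "my-audit-bot (contact: ops@example.com)"
}
```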

Output samples

snapshots

```json
{
  "url": "https://example.com",
  "snapshot_url": "https://web.archive.org/web/20200101000000/https://example.com/",
  "timestamp": "20200101000000",
  "status_code": "200",
  "mime_type": "text/html",
  "digest": "ABCDEF1234567890ABCDEF1234567890"
}
```
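
The timestamp field uses the archive's 14-digit YYYYMMDDhhmmss format — the same compact form fromDate and toDate accept. A small sketch for turning it into a datetime (shorter date-only forms are zero-padded):

```python
from datetime import datetime

def parse_archive_timestamp(ts: str) -> datetime:
    """Parse an archive timestamp such as "20200101000000".
    Date-only inputs like "20200101" are padded to midnight."""
    ts = ts.ljust(14, "0")  # "20200101" -> "20200101000000"
    return datetime.strptime(ts, "%Y%m%d%H%M%S")
```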

content

```json
{
  "url": "https://example.com/post",
  "snapshot_url": "https://web.archive.org/web/20230615120000/https://example.com/post",
  "timestamp": "20230615120000",
  "title": "How we shipped X",
  "byline": "Jane Doe",
  "date": "2023-06-14",
  "language": "en",
  "text": "Plain text body...",
  "markdown": "# How we shipped X\n\nPlain markdown body...",
  "word_count": 842,
  "status_code": 200
}
```
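
The snapshot_url values follow the archive's standard replay-URL layout, timestamp then original URL; a helper along these lines (illustrative — the actor builds these internally) also shows the raw id_ flavour mentioned under Limitations:

```python
def snapshot_url(url: str, timestamp: str, raw: bool = False) -> str:
    """Build a web.archive.org replay URL for a capture.
    raw=True requests the id_ flavour: the capture as archived,
    without the archive's own UI rewriting."""
    flavour = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flavour}/{url}"
```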

diff

```json
{
  "url": "https://example.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "added_lines": 12,
  "removed_lines": 7,
  "changed_chars": 318,
  "similarity_ratio": 0.9421,
  "diff_unified": "--- https://example.com/pricing@20240101000000\n+++ https://example.com/pricing@20240601000000\n@@ ...\n-Old plan: $9/mo\n+New plan: $12/mo\n"
}
```

link-rot

```json
{
  "url": "https://www.geocities.com/SiliconValley/",
  "current_status_code": null,
  "current_reachable": false,
  "current_error": "name resolution failure",
  "last_archived_at": "20091026152611",
  "last_archived_status": "200",
  "archived_alternatives_count": 318,
  "recommended_archive_url": "http://web.archive.org/web/20091026152611/http://www.geocities.com/SiliconValley/",
  "recoverable_from_archive": true
}
```

change-detection

```json
{
  "url": "https://competitor.com/pricing",
  "from_timestamp": "20240101000000",
  "to_timestamp": "20240601000000",
  "similarity_ratio": 0.8732,
  "added_lines": 18,
  "removed_lines": 11,
  "changed_chars": 612,
  "changed": true
}
```
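
One plausible way to derive the changed flag from the similarity ratio — the actor's actual threshold is internal, and 0.98 here is purely an assumption for illustration:

```python
import difflib

def detect_change(old_text: str, new_text: str,
                  threshold: float = 0.98) -> dict:
    """Flag a page as changed when cleaned-prose similarity drops
    below `threshold` (0.98 is illustrative, not the actor's value)."""
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return {"similarity_ratio": round(ratio, 4),
            "changed": ratio < threshold}
```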

Limitations

  • The public web archive's index and availability APIs rate-limit aggressive callers. Keep concurrency modest and provide a descriptive userAgent (ideally with contact info) for large runs.
  • Coverage depends on whether each URL was crawled and archived. Missing URLs return a record with empty fields and an error note rather than failing the whole run.
  • Diff and change-detection compare cleaned prose extracted from each snapshot. Boilerplate (nav, footer) is mostly excluded, which is what you usually want — but small structural-only edits may not register.
  • content mode requests the archive's raw (id_) capture flavour to avoid the archive's own UI rewriting; binary or non-HTML captures will produce empty extracted text.
  • Live link checks in link-rot follow redirects and fall back from HEAD to GET on 405. Some hostile origins block both; those are reported as current_reachable: false with the underlying error string.
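
The HEAD-then-GET fallback described in the last point can be sketched with an injected transport so the logic is testable offline (the send callable is a stand-in here; the real actor uses its own redirect-following HTTP client):

```python
def check_reachability(url: str, send) -> dict:
    """Probe a live URL. `send(method, url)` returns a status code or
    raises on network failure. HEAD is tried first; a 405 triggers a
    GET retry, mirroring origins that reject HEAD."""
    try:
        status = send("HEAD", url)
        if status == 405:  # origin rejects HEAD; retry with GET
            status = send("GET", url)
        return {"current_status_code": status,
                "current_reachable": status < 400,
                "current_error": None}
    except Exception as exc:
        return {"current_status_code": None,
                "current_reachable": False,
                "current_error": str(exc)}
```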

Licensing

This actor is MIT-licensed. It uses permissively-licensed open-source components; see LICENSE for the preserved upstream copyright notices.