Pricing

Pay per event

Wayback Machine CDX Bulk Extractor

Bulk extract archived snapshot metadata from the Wayback Machine CDX API. Get every crawled URL, timestamp, HTTP status code, MIME type, and content digest for any domain or URL pattern. Export to JSON, CSV, or Excel.

Developer

Stas Persiianenko

What does it do?

Wayback Machine CDX Bulk Extractor uses the Internet Archive's CDX (Capture Index) API to extract complete snapshot metadata for any domain, URL, or wildcard pattern. For every capture the Wayback Machine holds for the matching URLs, you get the timestamp, HTTP status code, MIME type, content digest, file size, and a direct replay link, all exported to a structured dataset.

Unlike manually browsing https://web.archive.org/, this actor programmatically paginates through millions of CDX records, applies server-side and client-side filters, and exports clean, structured data at scale.

Who is it for?

👩‍💻 SEO professionals & digital marketers — audit historical URL structures, find old redirects, identify pages that returned 404 errors over time, and recover lost link equity.

🔍 Web archivists & researchers — build comprehensive inventories of how a website evolved, what pages existed at what timestamps, and which content versions were captured.

🛡️ Security analysts — discover exposed endpoints, track historical subdomain activity, or detect when sensitive paths were briefly indexed.

📊 Data journalists & OSINT investigators — reconstruct a site's history, verify when specific pages first appeared, or find crawl evidence of content that has since been removed.

🧑‍💻 Developers & QA engineers — validate archive coverage for migration projects, check historical status code patterns, or build link-checking tools with historical context.

Why use it?

✅ No scraping restrictions — the CDX API is public, free, and built for bulk access
✅ Handles millions of records — automatic pagination via resumeKey with no manual intervention
✅ Flexible filtering — narrow by date range, HTTP status codes, MIME types, or collapse duplicates by URL/content
✅ Zero proxy cost — Internet Archive's CDX API requires no proxies, so every run is extremely cheap
✅ Full Wayback Machine replay URLs — each record includes a direct link to view the archived snapshot
✅ Domain-wide coverage — a single input query can retrieve snapshots for all subdomains

What data does it extract?

Each snapshot record in the output dataset contains:

Field	Type	Description
`urlKey`	string	Canonical URL key in SURT format (e.g., `com,example)/path`)
`timestamp`	string	Capture timestamp in YYYYMMDDHHmmss format
`originalUrl`	string	The original URL as crawled
`mimeType`	string	MIME type of the captured content (e.g., `text/html`, `application/pdf`)
`statusCode`	string	HTTP status code at capture time (e.g., `200`, `301`, `404`)
`digest`	string	SHA-1 content digest for deduplication
`length`	number	Compressed size of the stored WARC record in bytes
`waybackUrl`	string	Full Wayback Machine replay URL (when enabled)
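
The `timestamp` and `digest` fields are plain strings, so they usually need a small conversion step before analysis. The sketch below is one way to do it. It assumes that timestamps are in UTC and that the digest is a Base32-encoded SHA-1 (consistent with the 32-character sample value shown in the Output section); the helper names `parse_cdx_timestamp` and `digest_to_hex` are hypothetical and not part of the actor.

```python
import base64
from datetime import datetime, timezone


def parse_cdx_timestamp(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDHHmmss timestamp (assumed to be UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def digest_to_hex(digest: str) -> str:
    """Decode a Base32 digest into a hex string (assumes Base32-encoded SHA-1)."""
    return base64.b32decode(digest).hex()


captured_at = parse_cdx_timestamp("20230415120000")
print(captured_at.isoformat())  # 2023-04-15T12:00:00+00:00
print(len(digest_to_hex("JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH")))  # 40 hex characters
```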

How much does it cost to extract Wayback Machine snapshots?

This actor uses pay-per-event (PPE) pricing — you only pay for the snapshots you extract.

Start event: $0.005 (one-time per run)
Per snapshot: $0.000046 (FREE tier) — effectively $0.046 per 1,000 snapshots

Example costs:

Snapshot count	Estimated cost
1,000	~$0.051
10,000	~$0.465
100,000	~$4.605
1,000,000	~$46.005

Because the CDX API is public and no proxy is used, this actor has near-zero infrastructure cost. Most of what you pay goes directly toward supporting the service. Higher subscription tiers (BRONZE through DIAMOND) get significant per-snapshot discounts.
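
To budget a run in advance, the figures in the table above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the FREE-tier prices quoted in this section; the function name `estimate_cost` is hypothetical, and discounted tiers are not modelled because their rates are not listed here.

```python
START_EVENT_USD = 0.005      # one-time charge per run (from the pricing above)
PER_SNAPSHOT_USD = 0.000046  # FREE-tier price per extracted snapshot


def estimate_cost(snapshots: int) -> float:
    """Estimated FREE-tier cost in USD for a run extracting `snapshots` records."""
    return round(START_EVENT_USD + snapshots * PER_SNAPSHOT_USD, 3)


for count in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{count:>9,} snapshots: ~${estimate_cost(count):.3f}")
```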

How to use it

Step 1 — Enter your target URL

Type a domain, exact URL, or wildcard pattern in the URL or domain field:

example.com — all pages on example.com (use with domain matchType)
https://example.com/blog/ — all blog pages (use with prefix matchType)
*.example.com/* — all subdomains (use with domain matchType)

Step 2 — Choose a match type

Match Type	What it returns
`exact`	Only the exact URL specified
`prefix`	All URLs that start with the given URL
`host`	All URLs on the same hostname
`domain`	All URLs on the host AND all of its subdomains

Step 3 — Set limits and filters (optional)

Max snapshots — cap the total records extracted (0 = unlimited)
From/To date — narrow to a specific time window (YYYYMMDD format)
Filter status codes — include only specific HTTP codes (e.g., [200, 301])
Exclude status codes — remove specific HTTP codes (e.g., [404, 500])
Filter MIME types — include only specific content types (e.g., ["text/html"])
Collapse — deduplicate by URL key, content digest, year, month, or day
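
For orientation, the options above correspond closely to query parameters of the public CDX server (`url`, `matchType`, `from`, `to`, `filter`, `collapse`, `limit`, `output`). The sketch below shows how an input object could be turned into a CDX query URL. It is an illustration only: how the actor actually builds its requests, and which filters it applies on the client side instead, is not documented here, and `build_cdx_url` is a hypothetical helper.

```python
import re
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(actor_input: dict) -> str:
    """Translate actor-style input into a CDX query URL (assumed mapping)."""
    params = [
        ("url", actor_input["url"]),
        ("matchType", actor_input.get("matchType", "domain")),
        ("output", "json"),
    ]
    if actor_input.get("fromDate"):
        params.append(("from", actor_input["fromDate"]))
    if actor_input.get("toDate"):
        params.append(("to", actor_input["toDate"]))
    include = actor_input.get("filterStatusCodes") or []
    if include:
        # CDX filters take the form field:regex, so several codes become one alternation.
        params.append(("filter", "statuscode:(" + "|".join(map(str, include)) + ")"))
    for code in actor_input.get("excludeStatusCodes") or []:
        # A leading "!" negates a CDX filter.
        params.append(("filter", f"!statuscode:{code}"))
    mimes = actor_input.get("filterMimeTypes") or []
    if mimes:
        params.append(("filter", "mimetype:(" + "|".join(re.escape(m) for m in mimes) + ")"))
    if actor_input.get("collapse"):
        params.append(("collapse", actor_input["collapse"]))
    if actor_input.get("maxSnapshots"):  # 0 means unlimited, so no limit is sent
        params.append(("limit", str(actor_input["maxSnapshots"])))
    return CDX_ENDPOINT + "?" + urlencode(params)


print(build_cdx_url({
    "url": "example.com",
    "matchType": "domain",
    "fromDate": "20200101",
    "filterStatusCodes": [200, 301],
    "excludeStatusCodes": [404],
    "maxSnapshots": 1000,
}))
```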

Step 4 — Run and export

Click Run and wait for extraction to complete. Export results as JSON, CSV, XML, or Excel directly from the dataset tab.

Input schema

{
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 1000,
    "fromDate": "20200101",
    "toDate": "20231231",
    "filterStatusCodes": [200],
    "excludeStatusCodes": [404],
    "filterMimeTypes": ["text/html"],
    "pageSize": 10000,
    "collapse": "urlkey",
    "outputWaybackUrl": true
}

Parameter	Type	Default	Description
`url`	string	required	URL, domain, or wildcard to query
`matchType`	enum	`domain`	URL matching strategy
`maxSnapshots`	integer	1000	Max records (0 = unlimited)
`fromDate`	string	—	Start date YYYYMMDD
`toDate`	string	—	End date YYYYMMDD
`filterStatusCodes`	array	`[]`	Include only these HTTP codes
`excludeStatusCodes`	array	`[]`	Exclude these HTTP codes
`filterMimeTypes`	array	`[]`	Include only these MIME types
`pageSize`	integer	10000	Records per CDX API page
`collapse`	enum	—	Deduplication strategy
`outputWaybackUrl`	boolean	`true`	Include Wayback Machine replay URL

Output

Sample output record:

{
    "urlKey": "com,example)/",
    "timestamp": "20230415120000",
    "originalUrl": "https://example.com/",
    "mimeType": "text/html",
    "statusCode": "200",
    "digest": "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH",
    "length": 1248,
    "waybackUrl": "https://web.archive.org/web/20230415120000/https://example.com/"
}
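
The record above can be related back to the raw CDX response. With `output=json`, the CDX server returns a list of rows whose first row is a header (`urlkey`, `timestamp`, `original`, `mimetype`, `statuscode`, `digest`, `length`). The following sketch maps such rows to the field names used in this dataset. It is an assumed reconstruction rather than the actor's source code, and `rows_to_records` is a hypothetical name.

```python
def rows_to_records(rows, output_wayback_url=True):
    """Convert CDX JSON rows (header row first) into dataset-style records."""
    if not rows:
        return []
    header, *data = rows
    records = []
    for row in data:
        raw = dict(zip(header, row))
        record = {
            "urlKey": raw["urlkey"],
            "timestamp": raw["timestamp"],
            "originalUrl": raw["original"],
            "mimeType": raw["mimetype"],
            "statusCode": raw["statuscode"],
            "digest": raw["digest"],
            # Non-numeric placeholders such as "-" become None.
            "length": int(raw["length"]) if raw["length"].isdigit() else None,
        }
        if output_wayback_url:
            record["waybackUrl"] = (
                f"https://web.archive.org/web/{raw['timestamp']}/{raw['original']}"
            )
        records.append(record)
    return records


sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20230415120000", "https://example.com/", "text/html", "200",
     "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH", "1248"],
]
print(rows_to_records(sample_rows)[0]["waybackUrl"])
```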

Tips & tricks

💡 Use collapse=urlkey to find all unique URLs — this returns only the first capture per URL, giving you a clean list of unique pages the Wayback Machine ever visited.

💡 Use collapse=digest to find unique content versions — skip duplicate captures that archived the same byte-identical content.

💡 Use matchType=domain for full subdomain coverage — this is the broadest option and will include www.example.com, blog.example.com, etc.

💡 Use date filters for historical analysis — narrow to a specific year to audit what a site looked like in that period.

💡 Use filterStatusCodes: [200] for successful captures only. This removes redirects, errors, and crawl artefacts so you can focus on pages that were archived with a 200 response.

💡 CDX API notes:

The CDX API returns the warc/revisit MIME type for revisit records, where the Wayback Machine re-crawled a URL but stored only a reference to an earlier capture because the content was identical. Use filterMimeTypes: ["text/html"] to exclude these.
Status code - in CDX output means the capture type is a revisit (no real HTTP response).
Large domains (e.g., major news sites) can have tens of millions of snapshots — set a reasonable maxSnapshots to avoid very long runs.
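
The collapse tips above are easier to reason about with a concrete model. The CDX server documentation describes collapsing as dropping adjacent rows that share the same value in the chosen field, keeping the first row of each run. The sketch below reproduces that behaviour on already-downloaded records. It is an illustration of the semantics, not the actor's implementation, and `collapse_adjacent` is a hypothetical helper.

```python
from itertools import groupby


def collapse_adjacent(records, field):
    """Keep the first record of every run of adjacent records sharing `field`."""
    return [next(group) for _, group in groupby(records, key=lambda r: r[field])]


captures = [
    {"urlKey": "com,example)/", "timestamp": "20200101000000", "digest": "AAA"},
    {"urlKey": "com,example)/", "timestamp": "20200201000000", "digest": "AAA"},
    {"urlKey": "com,example)/", "timestamp": "20200301000000", "digest": "BBB"},
    {"urlKey": "com,example)/about", "timestamp": "20200101000000", "digest": "CCC"},
]

print(len(collapse_adjacent(captures, "urlKey")))  # 2 unique URLs
print(len(collapse_adjacent(captures, "digest")))  # 3 distinct content versions
```

Because CDX results are ordered by URL key and then by timestamp, captures of the same URL sit next to each other, which is why adjacent-only collapsing on `urlKey` yields one row per unique URL.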

Integrations

Export to Google Sheets

After the run, click Export → Google Sheets in the dataset view. Use the data to build URL history timelines, pivot tables by status code over time, or visualize crawl density by year.
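
If you prefer to prepare the pivot data before exporting, the same aggregation can be done in a few lines. This is a minimal sketch that counts captures per year and status code from dataset items; it assumes the items use the `timestamp` and `statusCode` fields described above, and the function name is hypothetical.

```python
from collections import Counter


def captures_by_year_and_status(items):
    """Count captures per (year, statusCode) pair."""
    return Counter((item["timestamp"][:4], item["statusCode"]) for item in items)


items = [
    {"timestamp": "20200105120000", "statusCode": "200"},
    {"timestamp": "20200712093000", "statusCode": "200"},
    {"timestamp": "20200712093500", "statusCode": "404"},
    {"timestamp": "20210301000000", "statusCode": "301"},
]

for (year, status), count in sorted(captures_by_year_and_status(items).items()):
    print(year, status, count)
```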

Combine with SEO tools

Export the snapshot list as CSV and import into Screaming Frog or Ahrefs to cross-reference current URL statuses against historical captures — a powerful way to identify redirect chains and 404 link equity leaks.

Archive monitoring workflow

Schedule this actor to run weekly on a specific domain and compare new captures to a previous run. Any new statusCode=404 entries indicate recently broken pages. Connect to Google Sheets or a webhook to get automated alerts.
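
The comparison step in this workflow can be sketched as follows. The example assumes you have already loaded the items of the previous and current runs into two lists, and it identifies a capture by its `urlKey` and `timestamp`. The helper name `new_404_captures` is hypothetical, and alerting is left out.

```python
def new_404_captures(previous_items, current_items):
    """Return 404 captures in the current run that were absent from the previous run."""
    known = {(item["urlKey"], item["timestamp"]) for item in previous_items}
    return [
        item
        for item in current_items
        if item["statusCode"] == "404"
        and (item["urlKey"], item["timestamp"]) not in known
    ]


previous = [
    {"urlKey": "com,example)/a", "timestamp": "20240101000000", "statusCode": "200"},
    {"urlKey": "com,example)/b", "timestamp": "20240101000000", "statusCode": "404"},
]
current = previous + [
    {"urlKey": "com,example)/a", "timestamp": "20240108000000", "statusCode": "404"},
    {"urlKey": "com,example)/c", "timestamp": "20240108000000", "statusCode": "200"},
]

for capture in new_404_captures(previous, current):
    print(capture["urlKey"], capture["timestamp"])
```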

Automated redirects audit

Run with filterStatusCodes: [301, 302] and export the list of all historical redirects on a domain. Cross-reference with your current redirect rules to find redirect chains or outdated configurations.

API usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/wayback-machine-cdx-extractor').call({
    url: 'example.com',
    matchType: 'domain',
    maxSnapshots: 5000,
    filterStatusCodes: [200],
    collapse: 'urlkey',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} unique URLs`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

run = client.actor("automation-lab/wayback-machine-cdx-extractor").call(run_input={
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 5000,
    "filterStatusCodes": [200],
    "collapse": "urlkey",
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Extracted {len(items)} unique URLs")

cURL

curl -X POST 'https://api.apify.com/v2/acts/automation-lab~wayback-machine-cdx-extractor/runs' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_TOKEN' \
  -d '{
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 1000
  }'

Use with AI agents via MCP

Wayback Machine CDX Bulk Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"
        }
    }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

"Extract all snapshots of example.com from 2020 with status code 200"
"Get the unique URLs archived for blog.example.com using collapse by urlkey"
"Find all 404 pages archived for example.com in the last 5 years"

Learn more in the Apify MCP documentation.

Is it legal to use the Wayback Machine CDX API?

Yes. The Internet Archive CDX API is a public API explicitly provided by the Internet Archive for programmatic access to its index data. Its documentation is freely accessible at https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server.

This actor does not bypass any authentication or rate limiting mechanisms. It accesses only the public CDX search endpoint, which is designed for exactly this type of bulk query. The Internet Archive actively encourages researchers, archivists, and developers to use their APIs.

FAQ

How many snapshots can I extract?

There is no hard limit — set maxSnapshots to 0 for unlimited extraction. Major domains like news sites or social networks may have tens of millions of snapshots. For performance, the actor paginates automatically using the CDX resumeKey mechanism.
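
To illustrate the pagination mentioned above, here is a rough sketch of a resumeKey loop. It assumes the JSON page layout produced when `showResumeKey=true` is requested: a header row, the data rows, then an empty row followed by a one-element row holding the resume key when more results exist. That layout, the `fetch_page` callable, and the function names are assumptions made for this example, not the actor's actual code.

```python
def split_page(page):
    """Split a decoded JSON page into (data_rows, resume_key or None)."""
    resume_key = None
    if len(page) >= 2 and page[-2] == [] and len(page[-1]) == 1:
        resume_key = page[-1][0]
        page = page[:-2]
    return page[1:], resume_key  # page[0] is the header row


def iterate_snapshots(fetch_page, max_snapshots=0):
    """Yield rows page by page; `fetch_page(resume_key)` returns one decoded page."""
    resume_key = None
    yielded = 0
    while True:
        rows, resume_key = split_page(fetch_page(resume_key))
        for row in rows:
            yield row
            yielded += 1
            if max_snapshots and yielded >= max_snapshots:
                return  # 0 means unlimited, so this only triggers for positive caps
        if not resume_key:
            return


# A fake two-page response standing in for real HTTP calls.
HEADER = ["urlkey", "timestamp"]
FAKE_PAGES = {
    None: [HEADER, ["com,example)/", "2020"], ["com,example)/a", "2021"], [], ["key-1"]],
    "key-1": [HEADER, ["com,example)/b", "2022"]],
}

print(list(iterate_snapshots(FAKE_PAGES.get)))
```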

Why do some records show status code `-`?

The CDX API uses - for "revisit" records, where the Wayback Machine re-crawled a page but only stored a reference to a previous capture (because the content was identical). These are real crawl events but don't have a traditional HTTP response code. Filter them out with excludeStatusCodes: [-1] or use filterStatusCodes: [200] to get only real successful captures.

Why are some MIME types `warc/revisit`?

Same as above — revisit records use warc/revisit as their MIME type. Use filterMimeTypes: ["text/html"] to exclude these if you only want full content captures.
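
If you would rather keep revisit records in the dataset and separate them afterwards, a client-side check is straightforward. This sketch relies only on the two markers described in this FAQ (a `-` status code or a `warc/revisit` MIME type); `is_revisit` is a hypothetical helper.

```python
def is_revisit(record):
    """True when a record carries either revisit marker described above."""
    return record.get("mimeType") == "warc/revisit" or record.get("statusCode") == "-"


records = [
    {"originalUrl": "https://example.com/", "mimeType": "text/html", "statusCode": "200"},
    {"originalUrl": "https://example.com/", "mimeType": "warc/revisit", "statusCode": "-"},
]

full_captures = [r for r in records if not is_revisit(r)]
print(len(full_captures))  # 1
```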

The API returned 503 errors during my run. What happened?

The Internet Archive CDX API occasionally returns 503 errors under load. This actor automatically retries up to 3 times with exponential backoff before failing. If you consistently get 503s, try reducing pageSize from 10000 to 1000.
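
The retry behaviour described above follows a common pattern, sketched here for readers who call the CDX API directly. Only the limit of 3 retries comes from this page. The 1-second base delay, the doubling factor, the `ServiceUnavailable` exception, and the injectable `sleep` argument are assumptions chosen for the example.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the caller's request function when the API answers 503."""


def call_with_retries(request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `request()`, retrying on ServiceUnavailable with exponential backoff."""
    attempt = 0
    while True:
        try:
            return request()
        except ServiceUnavailable:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s with the defaults
            attempt += 1


# Demo: a request that fails twice with 503 and then succeeds.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ServiceUnavailable()
    return "ok"


recorded_delays = []
print(call_with_retries(flaky_request, sleep=recorded_delays.append), recorded_delays)
```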

How do I get all unique URLs (not every snapshot)?

Use collapse: "urlkey" — this returns only the first capture per unique URL, giving you a clean inventory of every URL the Wayback Machine ever crawled on your domain.

Can I use wildcard patterns?

Yes. Enter patterns like *.example.com/* as the URL with matchType: "domain" to match all subdomains and paths.

Related actors

Broken Link Checker — find broken links on live websites
Canonical URL Checker — validate canonical tags and redirect chains
AAAA Record Checker — bulk DNS lookup for IPv6 records
