Wayback Machine CDX Bulk Extractor

Bulk extract archived snapshot metadata from the Wayback Machine CDX API. Get every crawled URL, timestamp, HTTP status code, MIME type, and content digest for any domain or URL pattern. Export to JSON, CSV, or Excel.

Developer: Stas Persiianenko (maintained by Community)

What does it do?

Wayback Machine CDX Bulk Extractor uses the Internet Archive's CDX (Capture Index) API to extract complete snapshot metadata for any domain, URL, or wildcard pattern. For every capture the Wayback Machine holds for that target, you get the timestamp, HTTP status code, MIME type, content digest, compressed record size, and a direct replay link, all exported to a structured dataset.

Unlike manually browsing https://web.archive.org/, this actor programmatically paginates through millions of CDX records, applies server-side and client-side filters, and exports clean, structured data at scale.


Who is it for?

👩‍💻 SEO professionals & digital marketers: audit historical URL structures, find old redirects, identify pages that returned 404 errors over time, and recover lost link equity.

🔍 Web archivists & researchers: build comprehensive inventories of how a website evolved, what pages existed at what timestamps, and which content versions were captured.

🛡️ Security analysts: discover exposed endpoints, track historical subdomain activity, or detect when sensitive paths were briefly indexed.

📊 Data journalists & OSINT investigators: reconstruct a site's history, verify when specific pages first appeared, or find crawl evidence of content that has since been removed.

🧑‍💻 Developers & QA engineers: validate archive coverage for migration projects, check historical status code patterns, or build link-checking tools with historical context.


Why use it?

  • ✅ No scraping restrictions: the CDX API is public, free, and built for bulk access
  • ✅ Handles millions of records: automatic pagination via resumeKey with no manual intervention
  • ✅ Flexible filtering: narrow by date range, HTTP status codes, MIME types, or collapse duplicates by URL/content
  • ✅ Zero proxy cost: the Internet Archive's CDX API requires no proxies, which keeps every run cheap
  • ✅ Full Wayback Machine replay URLs: each record includes a direct link to view the archived snapshot
  • ✅ Domain-wide coverage: a single input query can retrieve snapshots for all subdomains

What data does it extract?

Each snapshot record in the output dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| urlKey | string | Canonical URL key in SURT format (e.g., com,example)/path) |
| timestamp | string | Capture timestamp in YYYYMMDDHHmmss format |
| originalUrl | string | The original URL as crawled |
| mimeType | string | MIME type of the captured content (e.g., text/html, application/pdf) |
| statusCode | string | HTTP status code at capture time (e.g., 200, 301, 404) |
| digest | string | SHA-1 content digest for deduplication |
| length | number | Compressed size of the stored WARC record in bytes |
| waybackUrl | string | Full Wayback Machine replay URL (when enabled) |
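
As a rough illustration of how these fields fit together, the sketch below parses a timestamp and rebuilds a replay URL from timestamp and originalUrl. The helper names are hypothetical, and treating the timestamp as UTC is an assumption; the URL pattern is taken from the sample record in the Output section.

```python
from datetime import datetime, timezone


def parse_cdx_timestamp(timestamp: str) -> datetime:
    """Parse a 14-digit YYYYMMDDHHmmss timestamp (assumed here to be UTC)."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def build_wayback_url(timestamp: str, original_url: str) -> str:
    """Rebuild a replay URL in the same shape as the waybackUrl field."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


record = {"timestamp": "20230415120000", "originalUrl": "https://example.com/"}
print(parse_cdx_timestamp(record["timestamp"]).isoformat())
# 2023-04-15T12:00:00+00:00
print(build_wayback_url(record["timestamp"], record["originalUrl"]))
# https://web.archive.org/web/20230415120000/https://example.com/
```

This can be handy when outputWaybackUrl is disabled and replay links are needed for only a few records.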

How much does it cost to extract Wayback Machine snapshots?

This actor uses pay-per-event (PPE) pricing: you only pay for the snapshots you extract.

  • Start event: $0.005 (one-time per run)
  • Per snapshot: $0.000046 (FREE tier), effectively $0.046 per 1,000 snapshots

Example costs:

| Snapshot count | Estimated cost |
| --- | --- |
| 1,000 | ~$0.051 |
| 10,000 | ~$0.465 |
| 100,000 | ~$4.605 |
| 1,000,000 | ~$46.005 |
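
The example costs follow directly from the two FREE-tier prices listed above: one start event plus the per-snapshot price times the snapshot count. A small sketch of that arithmetic (the function name is hypothetical, and discounted tiers are not modelled):

```python
START_EVENT_USD = 0.005      # one-time charge per run
PER_SNAPSHOT_USD = 0.000046  # FREE-tier price per extracted snapshot


def estimate_cost(snapshot_count: int) -> float:
    """Estimate the FREE-tier cost of a run in USD, rounded to 3 decimals."""
    return round(START_EVENT_USD + snapshot_count * PER_SNAPSHOT_USD, 3)


for count in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{count:>9,} snapshots: ~${estimate_cost(count):.3f}")
```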

Because the CDX API is public and no proxy is used, this actor has near-zero infrastructure cost. Most of what you pay goes directly toward supporting the service. Higher subscription tiers (BRONZE through DIAMOND) get significant per-snapshot discounts.


How to use it

Step 1: Enter your target URL

Type a domain, exact URL, or wildcard pattern in the URL or domain field:

  • example.com: all pages on example.com (use with domain matchType)
  • https://example.com/blog/: all blog pages (use with prefix matchType)
  • *.example.com/*: all subdomains (use with domain matchType)

Step 2: Choose a match type

| Match Type | What it returns |
| --- | --- |
| exact | Only the exact URL specified |
| prefix | All URLs that start with the given URL |
| host | All URLs on the same hostname |
| domain | All URLs on the host AND all of its subdomains |
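
To make the four match types concrete, here is a simplified sketch of the comparison each one implies. It is an illustration only, not the CDX server's implementation: real matching works on canonicalised SURT keys, while this hypothetical `matches` helper compares only lowercased hostnames and paths and ignores schemes and query strings.

```python
from urllib.parse import urlsplit


def _split(url: str) -> tuple[str, str]:
    """Return (lowercased host, path) for a URL, tolerating a missing scheme."""
    parts = urlsplit(url if "://" in url else f"//{url}")
    return (parts.hostname or "").lower(), parts.path or "/"


def matches(candidate: str, target: str, match_type: str) -> bool:
    """Approximate whether `candidate` falls under `target` for a match type."""
    c_host, c_path = _split(candidate)
    t_host, t_path = _split(target)
    if match_type == "exact":
        return (c_host, c_path) == (t_host, t_path)
    if match_type == "prefix":
        return c_host == t_host and c_path.startswith(t_path)
    if match_type == "host":
        return c_host == t_host
    if match_type == "domain":
        # The host itself, or any subdomain of it.
        return c_host == t_host or c_host.endswith("." + t_host)
    raise ValueError(f"unknown match type: {match_type}")


print(matches("https://blog.example.com/post", "example.com", "host"))    # False
print(matches("https://blog.example.com/post", "example.com", "domain"))  # True
```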

Step 3: Set limits and filters (optional)

  • Max snapshots: cap the total records extracted (0 = unlimited)
  • From/To date: narrow to a specific time window (YYYYMMDD format)
  • Filter status codes: include only specific HTTP codes (e.g., [200, 301])
  • Exclude status codes: remove specific HTTP codes (e.g., [404, 500])
  • Filter MIME types: include only specific content types (e.g., ["text/html"])
  • Collapse: deduplicate by URL key, content digest, year, month, or day

Step 4: Run and export

Click Run and wait for extraction to complete. Export results as JSON, CSV, XML, or Excel directly from the dataset tab.


Input schema

{
  "url": "example.com",
  "matchType": "domain",
  "maxSnapshots": 1000,
  "fromDate": "20200101",
  "toDate": "20231231",
  "filterStatusCodes": [200],
  "excludeStatusCodes": [404],
  "filterMimeTypes": ["text/html"],
  "pageSize": 10000,
  "collapse": "urlkey",
  "outputWaybackUrl": true
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL, domain, or wildcard to query |
| matchType | enum | domain | URL matching strategy |
| maxSnapshots | integer | 1000 | Max records (0 = unlimited) |
| fromDate | string | (none) | Start date YYYYMMDD |
| toDate | string | (none) | End date YYYYMMDD |
| filterStatusCodes | array | [] | Include only these HTTP codes |
| excludeStatusCodes | array | [] | Exclude these HTTP codes |
| filterMimeTypes | array | [] | Include only these MIME types |
| pageSize | integer | 10000 | Records per CDX API page |
| collapse | enum | (none) | Deduplication strategy |
| outputWaybackUrl | boolean | true | Include Wayback Machine replay URL |
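
For orientation, the sketch below shows one plausible way these inputs could translate into a CDX search request. The CDX parameters used (url, matchType, from, to, limit, collapse, filter, output, showResumeKey) are documented by the CDX server, but the mapping itself is an assumption: the actor may apply some filters client-side instead. The function name is hypothetical, and the code only builds the URL without sending a request.

```python
import re
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def _alternation(values) -> str:
    """Join values into a regex alternation such as (200|301)."""
    return "(" + "|".join(re.escape(str(v)) for v in values) + ")"


def build_cdx_url(actor_input: dict) -> str:
    """Build a CDX query URL from actor-style input (assumed mapping)."""
    params = [("url", actor_input["url"]), ("output", "json"),
              ("showResumeKey", "true")]
    simple = {"matchType": "matchType", "fromDate": "from", "toDate": "to",
              "collapse": "collapse", "pageSize": "limit"}
    for input_name, cdx_name in simple.items():
        if actor_input.get(input_name):
            params.append((cdx_name, str(actor_input[input_name])))
    # Repeated filter parameters narrow the result further, so several
    # allowed values for one field are folded into a single regex.
    if actor_input.get("filterStatusCodes"):
        params.append(("filter", "statuscode:" + _alternation(actor_input["filterStatusCodes"])))
    if actor_input.get("excludeStatusCodes"):
        params.append(("filter", "!statuscode:" + _alternation(actor_input["excludeStatusCodes"])))
    if actor_input.get("filterMimeTypes"):
        params.append(("filter", "mimetype:" + _alternation(actor_input["filterMimeTypes"])))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url({"url": "example.com", "matchType": "domain",
                     "fromDate": "20200101", "filterStatusCodes": [200]}))
```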

Output

Sample output record:

{
  "urlKey": "com,example)/",
  "timestamp": "20230415120000",
  "originalUrl": "https://example.com/",
  "mimeType": "text/html",
  "statusCode": "200",
  "digest": "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH",
  "length": 1248,
  "waybackUrl": "https://web.archive.org/web/20230415120000/https://example.com/"
}

Tips & tricks

💡 Use collapse=urlkey to find all unique URLs: this returns only the first capture per URL, giving you a clean list of unique pages the Wayback Machine ever visited.

💡 Use collapse=digest to find unique content versions: skip duplicate captures that archived the same byte-identical content.

💡 Use matchType=domain for full subdomain coverage: this is the broadest option and will include www.example.com, blog.example.com, etc.

💡 Use date filters for historical analysis: narrow to a specific year to audit what a site looked like in that period.

💡 Use filterStatusCodes: [200] for successful captures only: this removes redirects, errors, and crawl artefacts.

💡 CDX API notes:

  • The CDX API sometimes returns the warc/revisit MIME type. These are revisit records: the content matched an earlier capture, so only a reference to that capture was stored instead of a new copy. Use filterMimeTypes: ["text/html"] to exclude them.
  • Status code - in CDX output means the capture type is a revisit (no real HTTP response).
  • Large domains (e.g., major news sites) can have tens of millions of snapshots, so set a reasonable maxSnapshots to avoid very long runs.
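
If revisit records are already in an exported dataset, they can also be separated afterwards. The sketch below is a hypothetical post-processing helper that applies the two markers described in the notes above (the warc/revisit MIME type and the - status code) to records shaped like the actor's output.

```python
def is_revisit(record: dict) -> bool:
    """True when a record carries either revisit marker."""
    return record.get("mimeType") == "warc/revisit" or record.get("statusCode") == "-"


def split_revisits(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (full captures, revisit records)."""
    full = [r for r in records if not is_revisit(r)]
    revisits = [r for r in records if is_revisit(r)]
    return full, revisits


records = [
    {"originalUrl": "https://example.com/", "mimeType": "text/html", "statusCode": "200"},
    {"originalUrl": "https://example.com/", "mimeType": "warc/revisit", "statusCode": "-"},
]
full, revisits = split_revisits(records)
print(len(full), len(revisits))  # 1 1
```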

Integrations

Export to Google Sheets

After the run, click Export → Google Sheets in the dataset view. Use the data to build URL history timelines, pivot tables by status code over time, or visualize crawl density by year.
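
The same status-code-by-year pivot can be prototyped on exported records before building it in a spreadsheet. This is a minimal sketch with a hypothetical function name; it assumes records shaped like the sample output, where the first four characters of timestamp are the year.

```python
from collections import Counter


def captures_by_year_and_status(records: list[dict]) -> Counter:
    """Count captures per (year, statusCode) pair."""
    return Counter((r["timestamp"][:4], r["statusCode"]) for r in records)


records = [
    {"timestamp": "20200101000000", "statusCode": "200"},
    {"timestamp": "20200615120000", "statusCode": "200"},
    {"timestamp": "20210301080000", "statusCode": "404"},
]
for (year, status), count in sorted(captures_by_year_and_status(records).items()):
    print(year, status, count)
```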

Combine with SEO tools

Export the snapshot list as CSV and import into Screaming Frog or Ahrefs to cross-reference current URL statuses against historical captures. This is a useful way to identify redirect chains and 404 link equity leaks.

Archive monitoring workflow

Schedule this actor to run weekly on a specific domain and compare new captures to a previous run. Any new statusCode=404 entries indicate recently broken pages. Connect to Google Sheets or a webhook to get automated alerts.
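
The comparison step could look like the sketch below: captures that are present in the current run, absent from the previous one, and recorded with status 404. The function name is hypothetical, and identifying a capture by its (urlKey, timestamp) pair is an assumption about what counts as the same capture.

```python
def new_404_captures(previous_run: list[dict], current_run: list[dict]) -> list[dict]:
    """Return 404 captures in current_run that previous_run did not contain."""
    seen = {(r["urlKey"], r["timestamp"]) for r in previous_run}
    return [
        r for r in current_run
        if r["statusCode"] == "404" and (r["urlKey"], r["timestamp"]) not in seen
    ]


previous = [{"urlKey": "com,example)/a", "timestamp": "20240101000000", "statusCode": "200"}]
current = previous + [
    {"urlKey": "com,example)/a", "timestamp": "20240108000000", "statusCode": "404"},
    {"urlKey": "com,example)/b", "timestamp": "20240108000000", "statusCode": "200"},
]
for capture in new_404_captures(previous, current):
    print(capture["urlKey"], capture["timestamp"])
```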

Automated redirects audit

Run with filterStatusCodes: [301, 302] and export the list of all historical redirects on a domain. Cross-reference with your current redirect rules to find redirect chains or outdated configurations.


API usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/wayback-machine-cdx-extractor').call({
    url: 'example.com',
    matchType: 'domain',
    maxSnapshots: 5000,
    filterStatusCodes: [200],
    collapse: 'urlkey',
});

// Read the extracted records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} unique URLs`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("automation-lab/wayback-machine-cdx-extractor").call(run_input={
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 5000,
    "filterStatusCodes": [200],
    "collapse": "urlkey",
})

# Read the extracted records from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Extracted {len(items)} unique URLs")

cURL

curl -X POST 'https://api.apify.com/v2/acts/automation-lab~wayback-machine-cdx-extractor/runs' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_TOKEN' \
  -d '{
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 1000
  }'

Use with AI agents via MCP

Wayback Machine CDX Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"
    }
  }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

  • "Extract all snapshots of example.com from 2020 with status code 200"
  • "Get the unique URLs archived for blog.example.com using collapse by urlkey"
  • "Find all 404 pages archived for example.com in the last 5 years"

Learn more in the Apify MCP documentation.


Is it legal to use the Wayback Machine CDX API?

Yes. The Internet Archive CDX API is a public API explicitly provided by the Internet Archive for programmatic access to its index data. It is documented and freely accessible at https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server.

This actor does not bypass any authentication or rate limiting mechanisms. It accesses only the public CDX search endpoint, which is designed for exactly this type of bulk query. The Internet Archive actively encourages researchers, archivists, and developers to use their APIs.


FAQ

How many snapshots can I extract?

There is no hard limit: set maxSnapshots to 0 for unlimited extraction. Major domains like news sites or social networks may have tens of millions of snapshots. For performance, the actor paginates automatically using the CDX resumeKey mechanism.
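
The general shape of resumeKey pagination can be sketched as follows. This is not the actor's code. It assumes the CDX JSON layout in which each page starts with a header row and, when more results remain, ends with an empty row followed by a one-element row holding the resume key. `fetch_page` is a hypothetical callable standing in for the HTTP request, so the loop can be exercised without network access.

```python
from typing import Callable, Iterator, Optional


def split_page(rows: list[list[str]]) -> tuple[list[dict], Optional[str]]:
    """Split one parsed JSON page into (records, resume key or None)."""
    resume_key = None
    # Assumed trailer: [] followed by ["<resume key>"].
    if len(rows) >= 2 and rows[-2] == [] and len(rows[-1]) == 1:
        resume_key = rows[-1][0]
        rows = rows[:-2]
    if not rows:
        return [], resume_key
    header, *data = rows
    return [dict(zip(header, row)) for row in data], resume_key


def paginate(fetch_page: Callable[[Optional[str]], list[list[str]]]) -> Iterator[dict]:
    """Yield records page by page until no resume key is returned."""
    resume_key = None
    while True:
        records, resume_key = split_page(fetch_page(resume_key))
        yield from records
        if resume_key is None:
            break


# Two fake pages keyed by the resume key that requests them.
fake_pages = {
    None: [["urlkey", "timestamp"], ["com,example)/", "20200101000000"], [], ["key-1"]],
    "key-1": [["urlkey", "timestamp"], ["com,example)/a", "20210101000000"]],
}
all_records = list(paginate(lambda key: fake_pages[key]))
print(len(all_records))  # 2
```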

Why do some records show status code -?

The CDX API uses - for "revisit" records, where the Wayback Machine re-crawled a page but only stored a reference to a previous capture (because the content was identical). These are real crawl events but don't have a traditional HTTP response code. Filter them out with excludeStatusCodes: [-1] or use filterStatusCodes: [200] to get only real successful captures.

Why are some MIME types warc/revisit?

Same as above: revisit records use warc/revisit as their MIME type. Use filterMimeTypes: ["text/html"] to exclude these if you only want full content captures.

The API returned 503 errors during my run. What happened?

The Internet Archive CDX API occasionally returns 503 errors under load. This actor automatically retries up to 3 times with exponential backoff before failing. If you consistently get 503s, try reducing pageSize from 10000 to 1000.
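
For readers building their own client, retrying with exponential backoff looks roughly like the sketch below. It mirrors the behaviour described above (up to 3 retries), but the base delay, the doubling factor, and every name in it are assumptions, not the actor's actual values. `ServiceUnavailable` is a hypothetical exception standing in for an HTTP 503 response.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical marker for an HTTP 503 response."""


def call_with_backoff(request: Callable[[], T], max_retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call request(), retrying on ServiceUnavailable with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    raise AssertionError("unreachable")


# Demo: fail twice, then succeed, recording delays instead of sleeping.
outcomes = iter([ServiceUnavailable(), ServiceUnavailable(), "ok"])
delays: list[float] = []


def flaky_request() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```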

How do I get all unique URLs (not every snapshot)?

Use collapse: "urlkey". This returns only the first capture per unique URL, giving you a clean inventory of every URL the Wayback Machine ever crawled on your domain.
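
To see what collapsing does, the sketch below reproduces the idea client-side: for each run of adjacent records that share a field value, only the first is kept. This follows the CDX server's documented behaviour of collapsing adjacent rows; because results are ordered by URL key, collapsing on it yields one record per URL. The function name is hypothetical. The prefix_len argument mimics the CDX timestamp:N form and uses the actor's timestamp field name; mapping 4, 6, and 8 to the year, month, and day options is an assumption.

```python
from itertools import groupby
from typing import Optional


def collapse(records: list[dict], field: str, prefix_len: Optional[int] = None) -> list[dict]:
    """Keep the first record of each run of adjacent records with equal keys."""
    def key(record: dict) -> str:
        value = record[field]
        return value[:prefix_len] if prefix_len else value

    return [next(group) for _, group in groupby(records, key=key)]


records = [
    {"urlKey": "com,example)/", "timestamp": "20200101000000"},
    {"urlKey": "com,example)/", "timestamp": "20200301000000"},
    {"urlKey": "com,example)/about", "timestamp": "20210101000000"},
]
print(len(collapse(records, "urlKey")))        # 2
print(len(collapse(records, "timestamp", 4)))  # 2
```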

Can I use wildcard patterns?

Yes. Enter patterns like *.example.com/* as the URL with matchType: "domain" to match all subdomains and paths.