Wayback Machine CDX Bulk Extractor
Pricing
Pay per event
Bulk extract archived snapshot metadata from the Wayback Machine CDX API. Get every crawled URL, timestamp, HTTP status code, MIME type, and content digest for any domain or URL pattern. Export to JSON, CSV, or Excel.
Developer
Stas Persiianenko
Maintained by Community
What does it do?
Wayback Machine CDX Bulk Extractor uses the Internet Archive's CDX (Capture Index) API to extract complete snapshot metadata for any domain, URL, or wildcard pattern. For every archived page the Wayback Machine has ever crawled, you get the timestamp, HTTP status code, MIME type, content digest, file size, and a direct replay link, all exported to a structured dataset in seconds.
Unlike manually browsing https://web.archive.org/, this actor programmatically paginates through millions of CDX records, applies server-side and client-side filters, and exports clean, structured data at scale.
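The pagination described above can be sketched against the public CDX endpoint. This is a minimal illustration, not the actor's actual code. The endpoint and the `showResumeKey`/`resumeKey` parameters follow the CDX server documentation, while the helper names `build_cdx_url` and `split_page` are hypothetical. It assumes JSON output, where the server appends an empty row and then a one-element row holding the resume key when more results remain.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_type="domain", limit=10000, resume_key=None):
    """Build one CDX query URL; resume_key continues from a previous page."""
    params = {
        "url": url,
        "matchType": match_type,
        "output": "json",
        "limit": limit,
        "showResumeKey": "true",
    }
    if resume_key:
        params["resumeKey"] = resume_key
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def split_page(rows):
    """Split a decoded JSON page into (header, records, resume_key).

    Assumption: a trailing [] row followed by a one-element row marks
    the resume key; resume_key is None on the last page.
    """
    if not rows:
        return [], [], None
    resume_key = None
    if len(rows) >= 2 and rows[-2] == [] and len(rows[-1]) == 1:
        resume_key = rows[-1][0]
        rows = rows[:-2]
    return rows[0], rows[1:], resume_key
```

A caller would fetch `build_cdx_url(...)`, pass the decoded JSON to `split_page`, and repeat with the returned key until it comes back as `None`.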
Who is it for?
- SEO professionals & digital marketers: audit historical URL structures, find old redirects, identify pages that returned 404 errors over time, and recover lost link equity.
- Web archivists & researchers: build comprehensive inventories of how a website evolved, what pages existed at what timestamps, and which content versions were captured.
- Security analysts: discover exposed endpoints, track historical subdomain activity, or detect when sensitive paths were briefly indexed.
- Data journalists & OSINT investigators: reconstruct a site's history, verify when specific pages first appeared, or find crawl evidence of content that has since been removed.
- Developers & QA engineers: validate archive coverage for migration projects, check historical status code patterns, or build link-checking tools with historical context.
Why use it?
- No scraping restrictions: the CDX API is public, free, and built for bulk access
- Handles millions of records: automatic pagination via resumeKey with no manual intervention
- Flexible filtering: narrow by date range, HTTP status codes, MIME types, or collapse duplicates by URL/content
- Zero proxy cost: Internet Archive's CDX API requires no proxies, so every run is extremely cheap
- Full Wayback Machine replay URLs: each record includes a direct link to view the archived snapshot
- Domain-wide coverage: a single input query can retrieve snapshots for all subdomains
What data does it extract?
Each snapshot record in the output dataset contains:
| Field | Type | Description |
|---|---|---|
| `urlKey` | string | Canonical URL key in SURT format (e.g., `com,example)/path`) |
| `timestamp` | string | Capture timestamp in `YYYYMMDDHHmmss` format |
| `originalUrl` | string | The original URL as crawled |
| `mimeType` | string | MIME type of the captured content (e.g., `text/html`, `application/pdf`) |
| `statusCode` | string | HTTP status code at capture time (e.g., `200`, `301`, `404`) |
| `digest` | string | SHA-1 content digest for deduplication |
| `length` | number | Compressed size of the stored WARC record in bytes |
| `waybackUrl` | string | Full Wayback Machine replay URL (when enabled) |
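As a small illustration of working with these fields, the sketch below parses the 14-digit `timestamp` and rebuilds a replay link in the same shape as `waybackUrl`. The function names are hypothetical, and the code assumes capture timestamps are in UTC.

```python
from datetime import datetime, timezone


def parse_cdx_timestamp(timestamp):
    """Convert a YYYYMMDDHHmmss string to a timezone-aware UTC datetime."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def build_wayback_url(timestamp, original_url):
    """Compose a replay link from a capture timestamp and the crawled URL."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


record = {"timestamp": "20230415120000", "originalUrl": "https://example.com/"}
captured_at = parse_cdx_timestamp(record["timestamp"])
replay_url = build_wayback_url(record["timestamp"], record["originalUrl"])
```

This is handy when `outputWaybackUrl` was disabled for a run and replay links are needed afterwards.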
How much does it cost to extract Wayback Machine snapshots?
This actor uses pay-per-event (PPE) pricing: you pay a small start fee per run plus a fee for each snapshot you extract.
- Start event: $0.005 (one-time per run)
- Per snapshot: $0.000046 (FREE tier), which works out to $0.046 per 1,000 snapshots
Example costs:
| Snapshot count | Estimated cost |
|---|---|
| 1,000 | ~$0.051 |
| 10,000 | ~$0.465 |
| 100,000 | ~$4.605 |
| 1,000,000 | ~$46.005 |
Because the CDX API is public and no proxy is used, this actor has near-zero infrastructure cost. Most of what you pay goes directly toward supporting the service. Higher subscription tiers (BRONZE through DIAMOND) get significant per-snapshot discounts.
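The example costs in the table follow directly from the two FREE-tier prices listed above. A minimal sketch of the arithmetic (the function name `estimate_cost` is hypothetical, and discounted tiers are not modelled because their prices are not listed here):

```python
from decimal import Decimal

START_FEE = Decimal("0.005")        # one-time start event per run
PER_SNAPSHOT = Decimal("0.000046")  # FREE-tier price per snapshot


def estimate_cost(snapshot_count):
    """Return the estimated run cost in USD for a given snapshot count."""
    return START_FEE + PER_SNAPSHOT * snapshot_count


for count in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{count:>9,} snapshots: ${estimate_cost(count):.3f}")
```

`Decimal` is used instead of floats so that the tiny per-snapshot price does not pick up rounding noise.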
How to use it
Step 1: Enter your target URL
Type a domain, exact URL, or wildcard pattern in the URL or domain field:
- `example.com`: all pages on example.com (use with `domain` matchType)
- `https://example.com/blog/`: all blog pages (use with `prefix` matchType)
- `*.example.com/*`: all subdomains (use with `domain` matchType)
Step 2: Choose a match type
| Match Type | What it returns |
|---|---|
| `exact` | Only the exact URL specified |
| `prefix` | All URLs that start with the given URL |
| `host` | All URLs on the same hostname |
| `domain` | All URLs on the host AND all of its subdomains |
Step 3: Set limits and filters (optional)
- Max snapshots: cap the total records extracted (0 = unlimited)
- From/To date: narrow to a specific time window (YYYYMMDD format)
- Filter status codes: include only specific HTTP codes (e.g., `[200, 301]`)
- Exclude status codes: remove specific HTTP codes (e.g., `[404, 500]`)
- Filter MIME types: include only specific content types (e.g., `["text/html"]`)
- Collapse: deduplicate by URL key, content digest, year, month, or day
Step 4: Run and export
Click Run and wait for extraction to complete. Export results as JSON, CSV, XML, or Excel directly from the dataset tab.
Input schema

    {
        "url": "example.com",
        "matchType": "domain",
        "maxSnapshots": 1000,
        "fromDate": "20200101",
        "toDate": "20231231",
        "filterStatusCodes": [200],
        "excludeStatusCodes": [404],
        "filterMimeTypes": ["text/html"],
        "pageSize": 10000,
        "collapse": "urlkey",
        "outputWaybackUrl": true
    }

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL, domain, or wildcard to query |
| `matchType` | enum | `domain` | URL matching strategy |
| `maxSnapshots` | integer | `1000` | Max records (0 = unlimited) |
| `fromDate` | string | none | Start date YYYYMMDD |
| `toDate` | string | none | End date YYYYMMDD |
| `filterStatusCodes` | array | `[]` | Include only these HTTP codes |
| `excludeStatusCodes` | array | `[]` | Exclude these HTTP codes |
| `filterMimeTypes` | array | `[]` | Include only these MIME types |
| `pageSize` | integer | `10000` | Records per CDX API page |
| `collapse` | enum | none | Deduplication strategy |
| `outputWaybackUrl` | boolean | `true` | Include Wayback Machine replay URL |
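When building the input programmatically, it can help to catch malformed dates or match types before starting a paid run. The helper below is a hypothetical client-side sketch (`build_run_input` is not part of the actor or the Apify client), and the checks are assumptions derived from the parameter table above rather than the actor's own validation rules.

```python
from datetime import datetime

MATCH_TYPES = {"exact", "prefix", "host", "domain"}


def _check_date(name, value):
    """Require exactly eight digits forming a real calendar date."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"{name} must be in YYYYMMDD format")
    datetime.strptime(value, "%Y%m%d")  # raises ValueError for e.g. 20231345


def build_run_input(url, match_type="domain", max_snapshots=1000,
                    from_date=None, to_date=None, **extra):
    """Assemble an input dict, validating the fields that are easy to get wrong."""
    if not url:
        raise ValueError("url is required")
    if match_type not in MATCH_TYPES:
        raise ValueError(f"matchType must be one of {sorted(MATCH_TYPES)}")
    if max_snapshots < 0:
        raise ValueError("maxSnapshots must be 0 (unlimited) or positive")
    run_input = {"url": url, "matchType": match_type, "maxSnapshots": max_snapshots}
    for name, value in (("fromDate", from_date), ("toDate", to_date)):
        if value is not None:
            _check_date(name, value)
            run_input[name] = value
    if from_date and to_date and from_date > to_date:
        raise ValueError("fromDate must not be later than toDate")
    run_input.update(extra)  # e.g. collapse="urlkey", filterStatusCodes=[200]
    return run_input
```

The resulting dict can be passed as the run input in the API examples further down.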
Output
Sample output record:

    {
        "urlKey": "com,example)/",
        "timestamp": "20230415120000",
        "originalUrl": "https://example.com/",
        "mimeType": "text/html",
        "statusCode": "200",
        "digest": "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH",
        "length": 1248,
        "waybackUrl": "https://web.archive.org/web/20230415120000/https://example.com/"
    }

Tips & tricks
- Use collapse=urlkey to find all unique URLs: this returns only the first capture per URL, giving you a clean list of unique pages the Wayback Machine ever visited.
- Use collapse=digest to find unique content versions: skip duplicate captures that archived the same byte-identical content.
- Use matchType=domain for full subdomain coverage: this is the broadest option and will include www.example.com, blog.example.com, etc.
- Use date filters for historical analysis: narrow to a specific year to audit what a site looked like in that period.
- Filter to status code 200 for successful captures only: remove redirects, errors, and crawl artefacts.

CDX API notes:
- The CDX API sometimes returns the `warc/revisit` MIME type for records where the content was identical to an earlier capture, so only a reference to that capture was stored. Use `filterMimeTypes: ["text/html"]` to exclude these.
- Status code `-` in CDX output means the capture is a revisit record (no real HTTP response).
- Large domains (e.g., major news sites) can have tens of millions of snapshots. Set a reasonable `maxSnapshots` to avoid very long runs.
Integrations
Export to Google Sheets
After the run, click Export → Google Sheets in the dataset view. Use the data to build URL history timelines, pivot tables by status code over time, or visualize crawl density by year.
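The same pivot can be computed in code before exporting. The following is a minimal sketch that counts captures per year and status code from dataset items shaped like the sample output record. The function name is hypothetical.

```python
from collections import Counter


def captures_by_year_and_status(items):
    """Count captures per (year, statusCode) pair."""
    counts = Counter()
    for item in items:
        year = item["timestamp"][:4]  # timestamps start with YYYY
        counts[(year, item["statusCode"])] += 1
    return counts


sample_items = [
    {"timestamp": "20220101000000", "statusCode": "200"},
    {"timestamp": "20220615120000", "statusCode": "200"},
    {"timestamp": "20230415120000", "statusCode": "404"},
]
density = captures_by_year_and_status(sample_items)
```

Each key in `density` maps to one cell of a year-by-status pivot table.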
Combine with SEO tools
Export the snapshot list as CSV and import into Screaming Frog or Ahrefs to cross-reference current URL statuses against historical captures. This is a powerful way to identify redirect chains and 404 link equity leaks.
Archive monitoring workflow
Schedule this actor to run weekly on a specific domain and compare new captures to a previous run. Any new statusCode=404 entries indicate recently broken pages. Connect to Google Sheets or a webhook to get automated alerts.
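One way to do the comparison, sketched under the assumption that both runs' dataset items have already been loaded as lists of dicts (the function name `new_404_urls` is hypothetical):

```python
def new_404_urls(previous_items, current_items):
    """Return URLs with a 404 capture in the current run but not the previous one."""
    def urls_with_404(items):
        # statusCode is a string in the output dataset
        return {item["originalUrl"] for item in items if item["statusCode"] == "404"}

    return sorted(urls_with_404(current_items) - urls_with_404(previous_items))


previous = [
    {"originalUrl": "https://example.com/a", "statusCode": "200"},
    {"originalUrl": "https://example.com/old", "statusCode": "404"},
]
current = previous + [
    {"originalUrl": "https://example.com/a", "statusCode": "404"},
]
newly_broken = new_404_urls(previous, current)
```

The resulting list is what you would send to a sheet or webhook as the alert payload.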
Automated redirects audit
Run with filterStatusCodes: [301, 302] and export the list of all historical redirects on a domain. Cross-reference with your current redirect rules to find redirect chains or outdated configurations.
API usage
Node.js

    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_TOKEN' });

    const run = await client.actor('automation-lab/wayback-machine-cdx-extractor').call({
        url: 'example.com',
        matchType: 'domain',
        maxSnapshots: 5000,
        filterStatusCodes: [200],
        collapse: 'urlkey',
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Extracted ${items.length} unique URLs`);

Python

    from apify_client import ApifyClient

    client = ApifyClient("YOUR_TOKEN")

    run = client.actor("automation-lab/wayback-machine-cdx-extractor").call(run_input={
        "url": "example.com",
        "matchType": "domain",
        "maxSnapshots": 5000,
        "filterStatusCodes": [200],
        "collapse": "urlkey",
    })

    items = client.dataset(run["defaultDatasetId"]).list_items().items
    print(f"Extracted {len(items)} unique URLs")

cURL

    curl -X POST 'https://api.apify.com/v2/acts/automation-lab~wayback-machine-cdx-extractor/runs' \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer YOUR_TOKEN' \
      -d '{"url": "example.com", "matchType": "domain", "maxSnapshots": 1000}'

Use with AI agents via MCP
Wayback Machine CDX Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Setup for Claude Code

    claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"

Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:

    {
        "mcpServers": {
            "apify": {
                "url": "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"
            }
        }
    }

Your AI assistant will use OAuth to authenticate with your Apify account on first use.
Example prompts
Once connected, try asking your AI assistant:
- "Extract all snapshots of example.com from 2020 with status code 200"
- "Get the unique URLs archived for blog.example.com using collapse by urlkey"
- "Find all 404 pages archived for example.com in the last 5 years"
Learn more in the Apify MCP documentation.
Is it legal to use the Wayback Machine CDX API?
Yes. The Internet Archive CDX API is a public API explicitly provided by the Internet Archive for programmatic access to its index data. It is documented and freely accessible at https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server.
This actor does not bypass any authentication or rate limiting mechanisms. It accesses only the public CDX search endpoint, which is designed for exactly this type of bulk query. The Internet Archive actively encourages researchers, archivists, and developers to use their APIs.
FAQ
How many snapshots can I extract?
There is no hard limit: set maxSnapshots to 0 for unlimited extraction. Major domains like news sites or social networks may have tens of millions of snapshots. For performance, the actor paginates automatically using the CDX resumeKey mechanism.
Why do some records show status code -?
The CDX API uses - for "revisit" records, where the Wayback Machine re-crawled a page but only stored a reference to a previous capture (because the content was identical). These are real crawl events but don't have a traditional HTTP response code. Filter them out with excludeStatusCodes: [-1] or use filterStatusCodes: [200] to get only real successful captures.
Why are some MIME types warc/revisit?
These are the same revisit records described in the previous answer: they use warc/revisit as their MIME type. Use filterMimeTypes: ["text/html"] to exclude these if you only want full content captures.
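If a run has already finished without these filters, revisit records can also be dropped client-side. This is a minimal sketch over dataset items shaped like the sample output record. The function name `drop_revisits` is hypothetical.

```python
def drop_revisits(items):
    """Keep only records that carry a real HTTP response."""
    return [
        item for item in items
        if item.get("mimeType") != "warc/revisit" and item.get("statusCode") != "-"
    ]


sample_items = [
    {"originalUrl": "https://example.com/", "mimeType": "text/html", "statusCode": "200"},
    {"originalUrl": "https://example.com/", "mimeType": "warc/revisit", "statusCode": "-"},
]
full_captures = drop_revisits(sample_items)
```

Checking both fields keeps the filter robust if only one of them marks a record as a revisit.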
The API returned 503 errors during my run. What happened?
The Internet Archive CDX API occasionally returns 503 errors under load. This actor automatically retries up to 3 times with exponential backoff before failing. If you consistently get 503s, try reducing pageSize from 10000 to 1000.
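The retry behaviour described above follows a common pattern. The sketch below illustrates that pattern only and is not the actor's source: the `ServiceUnavailable` exception, the `fetch` callable, and the one-second base delay are all assumptions made for the example.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the hypothetical fetch callable on an HTTP 503."""


def fetch_with_retries(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on ServiceUnavailable retry up to max_retries times.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    The last failure is re-raised once the retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter makes the backoff schedule easy to test without actually waiting.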
How do I get all unique URLs (not every snapshot)?
Use collapse: "urlkey". This returns only the first capture per unique URL, giving you a clean inventory of every URL the Wayback Machine ever crawled on your domain.
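For a dataset that was extracted without collapsing, a similar inventory can be derived afterwards. This client-side sketch approximates the effect by keeping the earliest capture per `urlKey`. It is an illustration with a hypothetical function name, not the CDX server's collapse implementation.

```python
def first_capture_per_url(items):
    """Keep the earliest capture for each urlKey."""
    earliest = {}
    for item in items:
        key = item["urlKey"]
        # Fixed-width YYYYMMDDHHmmss strings sort chronologically as text
        if key not in earliest or item["timestamp"] < earliest[key]["timestamp"]:
            earliest[key] = item
    return list(earliest.values())


sample_items = [
    {"urlKey": "com,example)/", "timestamp": "20230415120000"},
    {"urlKey": "com,example)/", "timestamp": "20200101000000"},
    {"urlKey": "com,example)/blog", "timestamp": "20210101000000"},
]
unique_urls = first_capture_per_url(sample_items)
```

Note that collapsing on the server is cheaper, since collapsed records are never extracted or billed.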
Can I use wildcard patterns?
Yes. Enter patterns like *.example.com/* as the URL with matchType: "domain" to match all subdomains and paths.
Related actors
- Broken Link Checker: find broken links on live websites
- Canonical URL Checker: validate canonical tags and redirect chains
- AAAA Record Checker: bulk DNS lookup for IPv6 records