🕰️ Wayback Machine CDX Scraper
🚀 Export every archived URL the Internet Archive holds for any domain or URL prefix. Filter by date range, status, MIME, and uniqueness. No API key, no registration.
🕒 Last updated: 2026-05-01 · 📊 10 fields per record · 🕰️ archives back to 1996 · 🌐 billions of snapshots · 🔓 free public CDX index
The Wayback Machine CDX Scraper queries the public Internet Archive CDX index for a domain or URL prefix and returns every snapshot the Wayback Machine has on file. Each record includes the URL key, raw timestamp, ISO timestamp, original URL, MIME type, HTTP status, content digest, byte length, and a direct snapshot link you can open in any browser.
The Wayback Machine has been running since 1996 and now holds more than 800 billion web pages. It is the canonical historical record of the public web, used by lawyers for evidence, by SEO teams for content recovery, and by journalists for accountability work. This Actor handles CDX query syntax, pagination, and filters server-side so you skip writing the parser yourself.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| SEO teams, web archivists, OSINT researchers, journalists, security analysts, legal teams | Lost-content recovery, redirect audits, brand history, competitor evolution, link reclamation, evidence collection |
📋 What the Wayback Machine CDX Scraper does
Five filtering workflows in a single run:
- 🌐 Full domain export. Submit a domain or URL prefix and pull every snapshot the archive holds.
- 📐 Match-type control. `exact` for one URL, `prefix` for a path tree, `host` for one hostname, `domain` for the host plus subdomains.
- 📅 Date range. `fromDate` and `toDate` timestamps in YYYYMMDD format restrict results to a specific window.
- 🌐 MIME and status filter. Restrict to `text/html` or 200-only snapshots when auditing a redirect map.
- 🔁 Unique URLs. `uniqueOnly` collapses by URL key so you get one row per distinct URL instead of one per capture.
Each row reports the CDX URL key, original URL, raw timestamp, ISO timestamp, MIME type, HTTP status, content digest, byte length, and a direct snapshot link in web.archive.org/web/{ts}/{url} form.
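The snapshot link and ISO timestamp can be derived from the raw CDX values alone. A minimal Python sketch of that derivation (the helper names are illustrative, not part of the Actor):

```python
from datetime import datetime, timezone

def snapshot_url(timestamp: str, original: str) -> str:
    """Build a direct Wayback snapshot link from a raw CDX timestamp and URL."""
    return f"https://web.archive.org/web/{timestamp}/{original}"

def to_iso(timestamp: str) -> str:
    """Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss) to ISO 8601."""
    dt = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")

# snapshot_url("20070531101538", "http://www.apify.com:80/")
#   → "https://web.archive.org/web/20070531101538/http://www.apify.com:80/"
```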
💡 Why it matters: the CDX index is the cheapest historical web record available. When a competitor pivots, when a regulator demands evidence of a marketing claim, or when an SEO team needs to recover a deleted blog, the Wayback Machine is usually the only public source. Building your own pipeline against the CDX endpoint means handling pagination tokens and timestamp formats; this Actor handles all of that.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `maxItems` | integer | `10` | Snapshots to return. Free plan caps at 10, paid plan at 1,000,000. |
| `urlOrDomain` | string | `"apify.com"` | Domain or URL prefix to look up. |
| `matchType` | string | `"domain"` | One of `exact`, `prefix`, `host`, or `domain`. |
| `fromDate` | string | empty | Earliest timestamp. Examples: `2020`, `202001`, `20200115`. |
| `toDate` | string | empty | Latest timestamp. |
| `statusCode` | string | empty | HTTP status filter, e.g. `200`. |
| `mimeType` | string | empty | MIME type filter, e.g. `text/html`. |
| `collapse` | string | empty | CDX collapse field, e.g. `urlkey`. |
| `uniqueOnly` | boolean | `false` | Shortcut for `collapse=urlkey`. |
Example: every HTML snapshot of the apify.com homepage.

```json
{
  "maxItems": 100,
  "urlOrDomain": "apify.com",
  "matchType": "exact",
  "mimeType": "text/html",
  "statusCode": "200"
}
```
Example: every unique URL ever captured under a competitor's blog.

```json
{
  "maxItems": 1000,
  "urlOrDomain": "example.com/blog",
  "matchType": "prefix",
  "uniqueOnly": true,
  "fromDate": "2020"
}
```
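Under the hood, inputs like these map onto the query parameters of the public CDX endpoint at `web.archive.org/cdx/search/cdx`. A rough Python sketch of that translation, assuming the documented `matchType`, `from`, `to`, `filter`, `collapse`, and `limit` parameters (the function name is illustrative):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url_or_domain: str, match_type: str = "domain",
                    from_date: str = "", to_date: str = "",
                    status_code: str = "", mime_type: str = "",
                    unique_only: bool = False, limit: int = 100) -> str:
    """Translate Actor-style inputs into a raw CDX query URL."""
    params = [("url", url_or_domain), ("matchType", match_type),
              ("output", "json"), ("limit", str(limit))]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    if status_code:
        params.append(("filter", f"statuscode:{status_code}"))
    if mime_type:
        params.append(("filter", f"mimetype:{mime_type}"))
    if unique_only:
        params.append(("collapse", "urlkey"))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

The second example above would translate to `build_cdx_query("example.com/blog", match_type="prefix", from_date="2020", unique_only=True, limit=1000)`; the Actor additionally handles pagination and response parsing.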
⚠️ Good to Know: very broad queries on busy domains can return millions of rows. Always set `maxItems` and ideally a date window. The CDX endpoint can serve multi-million-row responses, but they take minutes to download.
📊 Output
Each snapshot record contains 10 fields. Download as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔑 `urlkey` | string | `"com,apify)/"` |
| ⏱️ `timestamp` | string | `"20070531101538"` |
| 🔗 `original` | string | `"http://www.apify.com:80/"` |
| 📄 `mimetype` | string \| null | `"text/html"` |
| ✅ `statusCode` | integer \| null | `200` |
| 🔐 `digest` | string \| null | `"EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6"` |
| 📦 `length` | integer \| null | `1013` |
| 🌐 `snapshotUrl` | string | `"https://web.archive.org/web/20070531101538/..."` |
| 📅 `timestampIso` | ISO 8601 \| null | `"2007-05-31T10:15:38.000Z"` |
| 🕒 `scrapedAt` | ISO 8601 | `"2026-05-01T00:47:14.231Z"` |
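If you query the CDX endpoint directly with `output=json`, the response is a header row followed by data rows, with `-` marking absent values. A sketch of turning those rows into records shaped like the schema above (`scrapedAt` is omitted since it is run metadata; the helper name is illustrative):

```python
from datetime import datetime, timezone

def parse_cdx_rows(rows):
    """Convert CDX output=json (header row + data rows) into schema-shaped records."""
    header, data = rows[0], rows[1:]
    records = []
    for raw in data:
        rec = dict(zip(header, raw))
        status, length, ts = rec.get("statuscode"), rec.get("length"), rec["timestamp"]
        iso = (datetime.strptime(ts, "%Y%m%d%H%M%S")
               .replace(tzinfo=timezone.utc)
               .strftime("%Y-%m-%dT%H:%M:%S.000Z")) if len(ts) == 14 else None
        records.append({
            "urlkey": rec["urlkey"],
            "timestamp": ts,
            "original": rec["original"],
            "mimetype": rec.get("mimetype") if rec.get("mimetype") != "-" else None,
            "statusCode": int(status) if status and status.isdigit() else None,
            "digest": rec.get("digest") if rec.get("digest") != "-" else None,
            "length": int(length) if length and length.isdigit() else None,
            "snapshotUrl": f"https://web.archive.org/web/{ts}/{rec['original']}",
            "timestampIso": iso,
        })
    return records
```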
📦 Sample records
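An illustrative record assembled from the example values in the schema above:

```json
{
  "urlkey": "com,apify)/",
  "timestamp": "20070531101538",
  "original": "http://www.apify.com:80/",
  "mimetype": "text/html",
  "statusCode": 200,
  "digest": "EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6",
  "length": 1013,
  "snapshotUrl": "https://web.archive.org/web/20070531101538/http://www.apify.com:80/",
  "timestampIso": "2007-05-31T10:15:38.000Z",
  "scrapedAt": "2026-05-01T00:47:14.231Z"
}
```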
✨ Why choose this Actor
| Capability | |
|---|---|
| 🆓 | Free public source. Reads the Internet Archive CDX endpoint, no auth needed. |
| 🕰️ | Decades of history. Archive starts 1996, with continuous coverage of major sites. |
| 📐 | Match-type control. Exact URL, prefix tree, host, or full domain in a single input. |
| 📅 | Flexible date windows. Year, month, day precision via fromDate and toDate. |
| 🔁 | Unique-URL collapse. One row per URL key when you only need a content map. |
| 🌐 | Direct snapshot links. Each row carries a ready-to-open Wayback URL. |
| 🛡️ | Pagination handled. CDX returns paged responses; the Actor walks them all. |
📊 The Internet Archive reports more than 800 billion web pages indexed across the Wayback Machine.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| Manual CDX queries | Free | Full | Live | Manual | Engineer hours |
| Paid web archive APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Static archive dumps | Free | Snapshot only | Stale | None | Self-host parser |
| ⭐ Wayback Machine CDX Scraper (this Actor) | Pay-per-event | Full | Live | Match type, dates, status, MIME | None |
Same CDX endpoint the Internet Archive itself exposes, wrapped in a clean filter UI.
🚀 How to use
- 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
- 🔍 Open the Actor. Search for "Wayback Machine CDX" in the Apify Store.
- ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
- ▶️ Click Start. A 100-snapshot run typically completes in 10 to 40 seconds.
- 📥 Download. Export as CSV, Excel, JSON, or XML.
⏱️ Total time from sign-up to first dataset: under five minutes.
💼 Business use cases
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🔌 Automating Wayback Machine CDX Scraper
Run this Actor on a schedule, from your codebase, or inside another tool:
- Node.js SDK: see Apify JavaScript client for programmatic runs.
- Python SDK: see Apify Python client for the same flow in Python.
- HTTP API: see Apify API docs for raw REST integration.
Schedule daily, weekly, or monthly runs from the Apify Console. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
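A minimal Python sketch of a programmatic run with the Apify Python client. The actor ID and helper names are placeholders, not the Actor's published ID; install the client with `pip install apify-client`:

```python
def build_run_input(url_or_domain: str, match_type: str = "domain",
                    max_items: int = 100, **filters) -> dict:
    """Assemble the Actor's input object; extra filters (fromDate, statusCode, ...) pass through."""
    return {"urlOrDomain": url_or_domain, "matchType": match_type,
            "maxItems": max_items, **filters}

def run_scraper(token: str, actor_id: str, run_input: dict):
    """Start a run and yield dataset items. Requires a valid Apify API token."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()

# Example (placeholder token and actor ID):
# for item in run_scraper("MY_APIFY_TOKEN", "parseforge/wayback-cdx",
#                         build_run_input("apify.com", match_type="exact")):
#     print(item["snapshotUrl"])
```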
❓ Frequently Asked Questions
🔌 Integrate with any app
- Make - drop run results into 1,800+ apps.
- Zapier - trigger automations off completed runs.
- Slack - post run summaries to a channel.
- Google Sheets - sync each run into a spreadsheet.
- Webhooks - notify your own services on run finish.
- Airbyte - load runs into Snowflake, BigQuery, or Postgres.
🔗 Recommended Actors
- 🌐 Common Crawl Index Scraper - second-largest public web archive, complementary coverage.
- 🅱️ Bing Search Scraper - find current rank for the URLs you recover.
- 🦆 DuckDuckGo Search Scraper - alternative SERP signal alongside Wayback history.
- 📚 Wikipedia Pageviews Scraper - cross-reference brand mentions over time.
- 🐙 GitHub Trending Repos Scraper - track adjacent developer-attention shifts.
💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.
🆘 Need Help? Open our contact form and we'll route the question to the right person.
Internet Archive and Wayback Machine are trademarks of Internet Archive, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Internet Archive. It uses only the public CDX index endpoint and respects all published rate limits.