Wayback Machine CDX URL List Scraper
Pricing
from $8.25 / 1,000 items
Wayback Machine CDX URL List Scraper
Pull every archived URL the Internet Archive has captured for any domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and direct snapshot links. Filter by date range, status, MIME, and uniqueness. Export to JSON, CSV, or Excel for SEO recovery and competitive research.
Pricing
from $8.25 / 1,000 items
Rating
0.0
(0)
Developer
ParseForge
Maintained by CommunityActor stats
0
Bookmarked
6
Total users
2
Monthly active users
8 days ago
Last modified
Categories
Share

🕰️ Wayback Machine CDX Scraper
🚀 Export every archived URL the Internet Archive holds for any domain or URL prefix. Filter by date range, status, MIME, and uniqueness. No API key, no registration.
🕒 Last updated: 2026-05-01 · 📊 10 fields per record · 🕰️ archives back to 1996 · 🌐 billions of snapshots · 🔓 free public CDX index
The Wayback Machine CDX Scraper queries the public Internet Archive CDX index for a domain or URL prefix and returns every snapshot the Wayback Machine has on file. Each record includes the URL key, raw timestamp, ISO timestamp, original URL, MIME type, HTTP status, content digest, byte length, and a direct snapshot link you can open in any browser.
The Wayback Machine has been running since 1996 and now holds more than 800 billion web pages. It is the canonical historical record of the public web, used by lawyers for evidence, by SEO teams for content recovery, and by journalists for accountability work. This Actor handles CDX query syntax, pagination, and filters server-side so you skip writing the parser yourself.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| SEO teams, web archivists, OSINT researchers, journalists, security analysts, legal teams | Lost-content recovery, redirect audits, brand history, competitor evolution, link reclamation, evidence collection |
📋 What the Wayback Machine CDX Scraper does
Five filtering workflows in a single run:
- 🌐 Full domain export. Submit a domain or URL prefix and pull every snapshot the archive holds.
- 📐 Match-type control.
exactfor one URL,prefixfor a path tree,hostfor one hostname,domainfor the host plus subdomains. - 📅 Date range.
fromandtotimestamps in YYYYMMDD format restrict to a specific window. - 🌐 MIME and status filter. Restrict to
text/htmlor200-only snapshots when auditing a redirect map. - 🔁 Unique URLs.
uniqueOnlycollapses by URL key so you get one row per distinct URL instead of one per capture.
Each row reports the CDX URL key, original URL, raw timestamp, ISO timestamp, MIME type, HTTP status, content digest, byte length, and a direct snapshot link in web.archive.org/web/{ts}/{url} form.
💡 Why it matters: the CDX index is the cheapest historical web record available. When a competitor pivots, when a regulator demands evidence of a marketing claim, or when an SEO team needs to recover a deleted blog, the Wayback Machine is usually the only public source. Building your own pipeline against the CDX endpoint means handling pagination tokens and timestamp formats; this Actor handles all of that.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
maxItems | integer | 10 | Snapshots to return. Free plan caps at 10, paid plan at 1,000,000. |
urlOrDomain | string | "apify.com" | Domain or URL prefix to look up. |
matchType | string | "domain" | exact, prefix, host, or domain. |
fromDate | string | empty | Earliest timestamp. Examples: 2020, 202001, 20200115. |
toDate | string | empty | Latest timestamp. |
statusCode | string | empty | HTTP status filter, e.g. 200. |
mimeType | string | empty | MIME type filter, e.g. text/html. |
collapse | string | empty | CDX collapse field, e.g. urlkey. |
uniqueOnly | boolean | false | Shortcut for collapse=urlkey. |
Example: every HTML snapshot of apify.com homepage.
{"maxItems": 100,"urlOrDomain": "apify.com","matchType": "exact","mimeType": "text/html","statusCode": "200"}
Example: every unique URL ever captured under a competitor blog.
{"maxItems": 1000,"urlOrDomain": "example.com/blog","matchType": "prefix","uniqueOnly": true,"fromDate": "2020"}
⚠️ Good to Know: very broad queries on busy domains can return millions of rows. Always set
maxItemsand ideally a date window. The CDX endpoint accepts multi-million-row responses but they take minutes to download.
📊 Output
Each snapshot record contains 10 fields. Download as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
🔑 urlkey | string | "com,apify)/" |
⏱️ timestamp | string | "20070531101538" |
🔗 original | string | "http://www.apify.com:80/" |
📄 mimetype | string | null | "text/html" |
✅ statusCode | integer | null | 200 |
🔐 digest | string | null | "EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6" |
📦 length | integer | null | 1013 |
🌐 snapshotUrl | string | "https://web.archive.org/web/20070531101538/..." |
📅 timestampIso | ISO 8601 | null | "2007-05-31T10:15:38.000Z" |
🕒 scrapedAt | ISO 8601 | "2026-05-01T00:47:14.231Z" |
📦 Sample records
✨ Why choose this Actor
| Capability | |
|---|---|
| 🆓 | Free public source. Reads the Internet Archive CDX endpoint, no auth needed. |
| 🕰️ | Decades of history. Archive starts 1996, with continuous coverage of major sites. |
| 📐 | Match-type control. Exact URL, prefix tree, host, or full domain in a single input. |
| 📅 | Flexible date windows. Year, month, day precision via fromDate and toDate. |
| 🔁 | Unique-URL collapse. One row per URL key when you only need a content map. |
| 🌐 | Direct snapshot links. Each row carries a ready-to-open Wayback URL. |
| 🛡️ | Pagination handled. CDX returns paged responses; the Actor walks them all. |
📊 The Internet Archive reports more than 800 billion web pages indexed across the Wayback Machine.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| Manual CDX queries | Free | Full | Live | Manual | Engineer hours |
| Paid web archive APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Static archive dumps | Free | Snapshot only | Stale | None | Self-host parser |
| ⭐ Wayback Machine CDX Scraper (this Actor) | Pay-per-event | Full | Live | Match type, dates, status, MIME | None |
Same CDX endpoint the Internet Archive itself exposes, wrapped in a clean filter UI.
🚀 How to use
- 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
- 🔍 Open the Actor. Search for "Wayback Machine CDX" in the Apify Store.
- ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
- ▶️ Click Start. A 100-snapshot run typically completes in 10 to 40 seconds.
- 📥 Download. Export as CSV, Excel, JSON, or XML.
⏱️ Total time from sign-up to first dataset: under five minutes.
💼 Business use cases
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🔌 Automating Wayback Machine CDX Scraper
Run this Actor on a schedule, from your codebase, or inside another tool:
- Node.js SDK: see Apify JavaScript client for programmatic runs.
- Python SDK: see Apify Python client for the same flow in Python.
- HTTP API: see Apify API docs for raw REST integration.
Schedule daily, weekly, or monthly runs from the Apify Console. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge actor in the AI of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔍 Perplexity
- 🅒 Copilot
❓ Frequently Asked Questions
📅 How far back does the data go?
The Internet Archive started crawling in 1996. Many large sites have continuous coverage from then on; smaller sites were added later as the archive expanded.
🔍 What is the difference between match types?
exact matches one URL only. prefix matches a URL plus everything beneath it. host matches one hostname. domain matches the host plus all subdomains.
🔁 What does uniqueOnly do?
It collapses results by URL key, so you get one row per distinct URL instead of one row per capture. Useful when building a content map of a domain.
🌐 Can I get the actual page content?
Each row includes a snapshotUrl you can open in a browser to view the captured page. Pulling the rendered HTML in bulk is a separate workflow.
📦 How many snapshots can I pull at once?
Free plan caps at 10. Paid plans go up to 1,000,000. Very broad queries on busy sites can return millions of rows; always set a date window for those.
🔠 How do I format the URL?
Pass the bare domain (apify.com) or a URL prefix (example.com/blog). The CDX endpoint normalizes and splits the URL into a key automatically.
📅 What date format does the API expect?
YYYYMMDD or any prefix of it. 2020 matches everything in 2020. 202001 matches January 2020. 20200115 matches a specific day.
💼 Can I use this for commercial work?
Yes. The CDX index is part of the Internet Archive's public API surface. Always cite the Internet Archive when republishing snapshots.
💳 Do I need a paid Apify plan?
The free plan returns up to 10 snapshots per run. Paid plans return up to 1,000,000.
⚠️ What if a run returns no rows?
Most often the URL has not been crawled or is filtered out by status/MIME. Try with no status filter and a wider date window. Open a contact form and include the run URL if the issue persists.
🔁 How fresh is the data?
Live. Each run hits the Internet Archive CDX endpoint at run time.
⚖️ Is this legal?
Yes. The Internet Archive publishes the CDX index for exactly this kind of programmatic access. The Actor respects the published rate limits.
🔌 Integrate with any app
- Make - drop run results into 1,800+ apps.
- Zapier - trigger automations off completed runs.
- Slack - post run summaries to a channel.
- Google Sheets - sync each run into a spreadsheet.
- Webhooks - notify your own services on run finish.
- Airbyte - load runs into Snowflake, BigQuery, or Postgres.
🔗 Recommended Actors
- 🌐 Common Crawl Index Scraper - second-largest public web archive, complementary coverage.
- 🅱️ Bing Search Scraper - find current rank for the URLs you recover.
- 🦆 DuckDuckGo Search Scraper - alternative SERP signal alongside Wayback history.
- 📚 Wikipedia Pageviews Scraper - cross-reference brand mentions over time.
- 🐙 GitHub Trending Repos Scraper - track adjacent developer-attention shifts.
💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.
🆘 Need Help? Open our contact form and we'll route the question to the right person.
Internet Archive and Wayback Machine are trademarks of Internet Archive, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Internet Archive. It uses only the public CDX index endpoint and respects all published rate limits.