Pricing

from $8.25 / 1,000 items

Wayback Machine CDX URL List Scraper

Pull every archived URL the Internet Archive has captured for any domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and direct snapshot links. Filter by date range, status, MIME, and uniqueness. Export to JSON, CSV, or Excel for SEO recovery and competitive research.

Developer

ParseForge

8 days ago

Last modified

🕰️ Wayback Machine CDX Scraper

🚀 Export every archived URL the Internet Archive holds for any domain or URL prefix. Filter by date range, status, MIME, and uniqueness. No API key, no registration.

🕒 Last updated: 2026-05-01 · 📊 10 fields per record · 🕰️ archives back to 1996 · 🌐 billions of snapshots · 🔓 free public CDX index

The Wayback Machine CDX Scraper queries the public Internet Archive CDX index for a domain or URL prefix and returns every snapshot the Wayback Machine has on file. Each record includes the URL key, raw timestamp, ISO timestamp, original URL, MIME type, HTTP status, content digest, byte length, a direct snapshot link you can open in any browser, and the time the record was scraped.

The Wayback Machine has been running since 1996 and now holds more than 800 billion web pages. It is the canonical historical record of the public web, used by lawyers for evidence, by SEO teams for content recovery, and by journalists for accountability work. This Actor handles CDX query syntax, pagination, and filters server-side so you skip writing the parser yourself.

🎯 Target Audience	💡 Primary Use Cases
SEO teams, web archivists, OSINT researchers, journalists, security analysts, legal teams	Lost-content recovery, redirect audits, brand history, competitor evolution, link reclamation, evidence collection

📋 What the Wayback Machine CDX Scraper does

Five filtering workflows in a single run:

🌐 Full domain export. Submit a domain or URL prefix and pull every snapshot the archive holds.
📐 Match-type control. exact for one URL, prefix for a path tree, host for one hostname, domain for the host plus subdomains.
📅 Date range. fromDate and toDate timestamps in YYYYMMDD format (or any prefix of it, such as YYYY or YYYYMM) restrict results to a specific window.
🌐 MIME and status filter. Restrict to text/html or 200-only snapshots when auditing a redirect map.
🔁 Unique URLs. uniqueOnly collapses by URL key so you get one row per distinct URL instead of one per capture.

Each row reports the CDX URL key, original URL, raw timestamp, ISO timestamp, MIME type, HTTP status, content digest, byte length, a direct snapshot link in web.archive.org/web/{ts}/{url} form, and the time the row was scraped.
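As a rough illustration of how the two derived fields relate to the raw CDX values, the sketch below converts a 14-digit timestamp to ISO 8601 and assembles a snapshot link. The function names are hypothetical and not part of the Actor; the sample values are the ones used elsewhere on this page.

```python
from datetime import datetime, timezone


def to_iso(raw_ts: str) -> str:
    """Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss, UTC) to ISO 8601."""
    dt = datetime.strptime(raw_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # CDX timestamps have second precision, so milliseconds are always .000
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def snapshot_url(raw_ts: str, original: str) -> str:
    """Build the web.archive.org/web/{ts}/{url} link for one capture."""
    return f"https://web.archive.org/web/{raw_ts}/{original}"


print(to_iso("20070531101538"))
print(snapshot_url("20070531101538", "http://www.apify.com:80/"))
```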

💡 Why it matters: the CDX index is the cheapest historical web record available. When a competitor pivots, when a regulator demands evidence of a marketing claim, or when an SEO team needs to recover a deleted blog, the Wayback Machine is usually the only public source. Building your own pipeline against the CDX endpoint means handling pagination tokens and timestamp formats; this Actor handles all of that.

🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.

⚙️ Input

Input	Type	Default	Behavior
`maxItems`	integer	`10`	Snapshots to return. Free plan caps at 10, paid plan at 1,000,000.
`urlOrDomain`	string	`"apify.com"`	Domain or URL prefix to look up.
`matchType`	string	`"domain"`	`exact`, `prefix`, `host`, or `domain`.
`fromDate`	string	empty	Earliest timestamp. Examples: `2020`, `202001`, `20200115`.
`toDate`	string	empty	Latest timestamp.
`statusCode`	string	empty	HTTP status filter, e.g. `200`.
`mimeType`	string	empty	MIME type filter, e.g. `text/html`.
`collapse`	string	empty	CDX collapse field, e.g. `urlkey`.
`uniqueOnly`	boolean	`false`	Shortcut for `collapse=urlkey`.

Example: up to 100 HTML snapshots of the apify.com homepage that returned HTTP 200.

{
    "maxItems": 100,
    "urlOrDomain": "apify.com",
    "matchType": "exact",
    "mimeType": "text/html",
    "statusCode": "200"
}

Example: up to 1,000 unique URLs captured under a competitor blog since 2020.

{
    "maxItems": 1000,
    "urlOrDomain": "example.com/blog",
    "matchType": "prefix",
    "uniqueOnly": true,
    "fromDate": "2020"
}
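For readers curious how inputs like the two above could translate into a raw CDX request, here is a minimal sketch. The mapping is an assumption about what a wrapper like this does, not the Actor's actual code; `build_cdx_url` is a hypothetical helper. The query parameters themselves (`url`, `matchType`, `from`, `to`, `filter`, `collapse`, `limit`, `output`) are the ones the public CDX server accepts.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(actor_input: dict) -> str:
    """Translate Actor-style input into a CDX query URL (assumed mapping)."""
    params = [
        ("url", actor_input["urlOrDomain"]),
        ("matchType", actor_input.get("matchType", "domain")),
        ("output", "json"),
        ("limit", str(actor_input.get("maxItems", 10))),
    ]
    if actor_input.get("fromDate"):
        params.append(("from", actor_input["fromDate"]))
    if actor_input.get("toDate"):
        params.append(("to", actor_input["toDate"]))
    # CDX filters take the form field:regex and may be repeated.
    if actor_input.get("statusCode"):
        params.append(("filter", f"statuscode:{actor_input['statusCode']}"))
    if actor_input.get("mimeType"):
        params.append(("filter", f"mimetype:{actor_input['mimeType']}"))
    # uniqueOnly is treated as shorthand for collapse=urlkey.
    collapse = actor_input.get("collapse") or (
        "urlkey" if actor_input.get("uniqueOnly") else ""
    )
    if collapse:
        params.append(("collapse", collapse))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


url = build_cdx_url({
    "maxItems": 1000,
    "urlOrDomain": "example.com/blog",
    "matchType": "prefix",
    "uniqueOnly": True,
    "fromDate": "2020",
})
print(url)
print(parse_qs(urlsplit(url).query))
```

Building the query from a list of pairs rather than a dict keeps repeated `filter` parameters intact.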

⚠️ Good to Know: very broad queries on busy domains can return millions of rows. Always set maxItems and, ideally, a date window. The CDX endpoint can serve multi-million-row responses, but they take minutes to download.

📊 Output

Each snapshot record contains 10 fields. Download as CSV, Excel, JSON, or XML.

🧾 Schema

Field	Type	Example
🔑 `urlkey`	string	`"com,apify)/"`
⏱️ `timestamp`	string	`"20070531101538"`
🔗 `original`	string	`"http://www.apify.com:80/"`
📄 `mimetype`	string | null	`"text/html"`
✅ `statusCode`	integer | null	`200`
🔐 `digest`	string | null	`"EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6"`
📦 `length`	integer | null	`1013`
🌐 `snapshotUrl`	string	`"https://web.archive.org/web/20070531101538/..."`
📅 `timestampIso`	ISO 8601 | null	`"2007-05-31T10:15:38.000Z"`
🕒 `scrapedAt`	ISO 8601	`"2026-05-01T00:47:14.231Z"`
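The nullable types in the schema can be pictured with a small sketch that turns raw CDX JSON output (a header row followed by data rows) into typed records. It assumes the CDX convention of writing `-` for a missing value; `rows_to_records` and the second sample row are illustrative only, and the derived fields (`snapshotUrl`, `timestampIso`, `scrapedAt`) are left out for brevity.

```python
from typing import Optional


def _none_if_dash(value: str) -> Optional[str]:
    """CDX uses "-" as a placeholder for a missing value (assumption)."""
    return None if value == "-" else value


def _int_or_none(value: str) -> Optional[int]:
    return int(value) if value.isdigit() else None


def rows_to_records(cdx_json: list) -> list:
    """Turn CDX output=json (header row + data rows) into typed dicts."""
    if not cdx_json:
        return []
    header, *rows = cdx_json
    records = []
    for row in rows:
        raw = dict(zip(header, row))
        records.append({
            "urlkey": raw["urlkey"],
            "timestamp": raw["timestamp"],
            "original": raw["original"],
            "mimetype": _none_if_dash(raw["mimetype"]),
            "statusCode": _int_or_none(raw["statuscode"]),
            "digest": _none_if_dash(raw["digest"]),
            "length": _int_or_none(raw["length"]),
        })
    return records


sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,apify)/", "20070531101538", "http://www.apify.com:80/", "text/html",
     "200", "EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6", "1013"],
    # Illustrative row with missing status and length.
    ["com,apify)/", "20080101000000", "http://www.apify.com:80/", "-", "-", "-", "-"],
]
for record in rows_to_records(sample):
    print(record)
```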

📦 Sample records

✨ Why choose this Actor

	Capability
🆓	Free public source. Reads the Internet Archive CDX endpoint, no auth needed.
🕰️	Decades of history. Archive starts 1996, with continuous coverage of major sites.
📐	Match-type control. Exact URL, prefix tree, host, or full domain in a single input.
📅	Flexible date windows. Year, month, day precision via `fromDate` and `toDate`.
🔁	Unique-URL collapse. One row per URL key when you only need a content map.
🌐	Direct snapshot links. Each row carries a ready-to-open Wayback URL.
🛡️	Pagination handled. CDX returns paged responses; the Actor walks them all.
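To give a sense of what walking paged CDX responses involves, the sketch below iterates over pages until a row limit is reached. It is a generic pattern under stated assumptions, not the Actor's implementation: `walk_pages` is a hypothetical helper, and `fake_fetch` stands in for an HTTP request that would pass a page number to the CDX endpoint.

```python
from typing import Callable, Iterator, List


def walk_pages(
    fetch_page: Callable[[int], List[list]],
    num_pages: int,
    max_items: int,
) -> Iterator[list]:
    """Yield rows page by page, stopping once max_items rows were emitted."""
    emitted = 0
    for page in range(num_pages):
        if emitted >= max_items:
            return  # limit reached, skip fetching further pages
        for row in fetch_page(page):
            yield row
            emitted += 1
            if emitted >= max_items:
                return


def fake_fetch(page: int) -> List[list]:
    """Stand-in for one paged HTTP request; returns two rows per page."""
    return [[f"row-{page}-0"], [f"row-{page}-1"]]


print(list(walk_pages(fake_fetch, num_pages=3, max_items=5)))
```

Checking the limit before each fetch avoids requesting a page whose rows would all be discarded.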

📊 The Internet Archive reports more than 800 billion web pages indexed across the Wayback Machine.

📈 How it compares to alternatives

Approach	Cost	Coverage	Refresh	Filters	Setup
Manual CDX queries	Free	Full	Live	Manual	Engineer hours
Paid web archive APIs	$$$ subscription	Partial	Daily	Built-in	Account setup
Static archive dumps	Free	Snapshot only	Stale	None	Self-host parser
⭐ Wayback Machine CDX Scraper (this Actor)	Pay-per-event	Full	Live	Match type, dates, status, MIME	None

Same CDX endpoint the Internet Archive itself exposes, wrapped in a clean filter UI.

🚀 How to use

🆓 Create a free Apify account. Sign up here and get $5 in free credit.
🔍 Open the Actor. Search for "Wayback Machine CDX" in the Apify Store.
⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
▶️ Click Start. A 100-snapshot run typically completes in 10 to 40 seconds.
📥 Download. Export as CSV, Excel, JSON, or XML.

⏱️ Total time from sign-up to first dataset: under five minutes.

💼 Business use cases

📈 SEO & content recovery

Recover deleted blog posts and product pages
Audit historical redirect chains for migration QA
Reclaim broken backlinks pointing to dead URLs
Pull old metadata for content rebuild projects
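For content recovery, one plausible post-processing step is to pick the most recent successful capture of each URL from the exported records. The sketch below assumes records shaped like the schema above; `latest_ok_snapshots` and the sample rows are illustrative and not part of the Actor.

```python
def latest_ok_snapshots(records: list) -> dict:
    """Map each urlkey to its most recent capture with HTTP status 200."""
    latest = {}
    for rec in records:
        if rec.get("statusCode") != 200:
            continue
        key = rec["urlkey"]
        # 14-digit timestamps sort chronologically as plain strings.
        if key not in latest or rec["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rec
    return latest


# Made-up sample rows for illustration only.
records = [
    {"urlkey": "com,example)/blog/a", "timestamp": "20190101000000", "statusCode": 200},
    {"urlkey": "com,example)/blog/a", "timestamp": "20210101000000", "statusCode": 200},
    {"urlkey": "com,example)/blog/a", "timestamp": "20230101000000", "statusCode": 404},
    {"urlkey": "com,example)/blog/b", "timestamp": "20200101000000", "statusCode": 301},
]
for key, rec in latest_ok_snapshots(records).items():
    print(key, rec["timestamp"])
```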

🛡️ Brand & competitive

Trace how a competitor's positioning evolved
Document past marketing claims for legal review
Detect domain ownership changes via WHOIS plus archive
Monitor design and copy iterations across years

⚖️ Legal & compliance

Collect evidence-grade snapshots of past pages
Preserve disputed content before it disappears
Track regulatory disclosure timelines on public sites
Verify warranty or pricing terms at a specific date

📰 Journalism & OSINT

Investigate deleted statements from public figures
Pull historical versions of government pages
Track edits to disputed Wikipedia-adjacent sources
Cite stable timestamped URLs in reporting

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

Empirical datasets for papers, thesis work, and coursework
Longitudinal studies tracking changes across snapshots
Reproducible research with cited, versioned data pulls
Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

Side projects, portfolio demos, and indie app launches
Data visualizations, dashboards, and infographics
Content research for bloggers, YouTubers, and podcasters
Hobbyist collections and personal trackers

🤝 Non-profit and civic

Transparency reporting and accountability projects
Advocacy campaigns backed by public-interest data
Community-run databases for local issues
Investigative journalism on public records

🧪 Experimentation

Prototype AI and machine-learning pipelines with real data
Validate product-market hypotheses before engineering spend
Train small domain-specific models on niche corpora
Test dashboard concepts with live input

🔌 Automating Wayback Machine CDX Scraper

Run this Actor on a schedule, from your codebase, or inside another tool:

Node.js SDK: see Apify JavaScript client for programmatic runs.
Python SDK: see Apify Python client for the same flow in Python.
HTTP API: see Apify API docs for raw REST integration.

Schedule daily, weekly, or monthly runs from the Apify Console. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
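A minimal sketch of a programmatic run with the Apify Python client follows. The Actor ID and token are placeholders to replace with your own values; `fetch_snapshots` is a hypothetical helper name. The client is passed in as an argument so the helper can be exercised without network access.

```python
def fetch_snapshots(client, actor_id: str, run_input: dict) -> list:
    """Run the Actor, wait for it to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    # Imported here so the helper above stays usable without the package.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token
    items = fetch_snapshots(
        client,
        "<username>/<actor-name>",  # placeholder: copy the ID from the Actor page
        {
            "urlOrDomain": "example.com/blog",
            "matchType": "prefix",
            "uniqueOnly": True,
            "maxItems": 100,
        },
    )
    for item in items:
        print(item["timestamp"], item["snapshotUrl"])
```

Call `main()` once the placeholders are filled in.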

❓ Frequently Asked Questions

📅 How far back does the data go?

The Internet Archive started crawling in 1996. Many large sites have continuous coverage from then on; smaller sites were added later as the archive expanded.

🔍 What is the difference between match types?

exact matches one URL only. prefix matches a URL plus everything beneath it. host matches one hostname. domain matches the host plus all subdomains.
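The four match types can be approximated locally as shown below. This is a simplified sketch for intuition only: the real CDX server compares canonicalized URL keys, while this hypothetical `matches` helper just drops the scheme, port, and a leading `www.` and ignores query strings.

```python
from urllib.parse import urlsplit


def _split(url: str):
    """Return (host, path) with scheme, port, and a leading www. dropped."""
    parts = urlsplit(url if "://" in url else f"http://{url}")
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path or "/"


def matches(capture_url: str, target: str, match_type: str) -> bool:
    """Simplified check of whether a captured URL falls under a target."""
    c_host, c_path = _split(capture_url)
    t_host, t_path = _split(target)
    if match_type == "exact":
        return (c_host, c_path) == (t_host, t_path)
    if match_type == "prefix":
        return c_host == t_host and c_path.startswith(t_path)
    if match_type == "host":
        return c_host == t_host
    if match_type == "domain":
        return c_host == t_host or c_host.endswith("." + t_host)
    raise ValueError(f"unknown match type: {match_type}")


for kind in ("exact", "prefix", "host", "domain"):
    print(kind, matches("https://blog.example.com/post", "example.com", kind))
```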

🔁 What does uniqueOnly do?

It collapses results by URL key, so you get one row per distinct URL instead of one row per capture. Useful when building a content map of a domain.
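The effect of collapsing can be sketched in a few lines. CDX collapsing works on adjacent rows, and results arrive sorted by URL key, so keeping the first row of each run yields one row per URL. `collapse_by_urlkey` and the sample rows are illustrative only.

```python
def collapse_by_urlkey(rows: list) -> list:
    """Keep the first row for each run of adjacent rows sharing a urlkey."""
    collapsed = []
    previous_key = None
    for row in rows:
        if row["urlkey"] != previous_key:
            collapsed.append(row)
            previous_key = row["urlkey"]
    return collapsed


# Made-up rows, sorted by urlkey then timestamp as CDX results are.
rows = [
    {"urlkey": "com,example)/", "timestamp": "20200101000000"},
    {"urlkey": "com,example)/", "timestamp": "20210101000000"},
    {"urlkey": "com,example)/blog", "timestamp": "20200615000000"},
    {"urlkey": "com,example)/blog", "timestamp": "20220615000000"},
]
for row in collapse_by_urlkey(rows):
    print(row["urlkey"], row["timestamp"])
```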

🌐 Can I get the actual page content?

Each row includes a snapshotUrl you can open in a browser to view the captured page. Pulling the rendered HTML in bulk is a separate workflow.

📦 How many snapshots can I pull at once?

Free plan caps at 10. Paid plans go up to 1,000,000. Very broad queries on busy sites can return millions of rows; always set a date window for those.

🔠 How do I format the URL?

Pass the bare domain (apify.com) or a URL prefix (example.com/blog). The CDX endpoint normalizes and splits the URL into a key automatically.
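To show roughly what such a key looks like, the sketch below derives one from a URL: the host labels are reversed and joined with commas, then the path follows a closing parenthesis. This is a deliberately simplified approximation; the real canonicalization has more rules, and `simple_urlkey` is a hypothetical name.

```python
from urllib.parse import urlsplit


def simple_urlkey(url: str) -> str:
    """Approximate a CDX URL key: reversed host labels, then ')' and the path."""
    parts = urlsplit(url if "://" in url else f"http://{url}")
    host = (parts.hostname or "").lower()  # hostname excludes the port
    if host.startswith("www."):
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))
    path = (parts.path or "/").lower()
    key = f"{reversed_host}){path}"
    if parts.query:
        key += f"?{parts.query.lower()}"
    return key


print(simple_urlkey("http://www.apify.com:80/"))
print(simple_urlkey("example.com/blog"))
```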

📅 What date format does the API expect?

YYYYMMDD or any prefix of it. 2020 matches everything in 2020. 202001 matches January 2020. 20200115 matches a specific day.
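One way to picture prefix matching is to expand a prefix into the full 14-digit range it covers. The sketch below assumes a prefix is padded down to its earliest moment when used as a lower bound and up to its latest moment when used as an upper bound; `expand_date_prefix` is a hypothetical helper.

```python
import calendar


def expand_date_prefix(prefix: str) -> tuple:
    """Return (earliest, latest) 14-digit timestamps covered by a prefix."""
    if not prefix.isdigit() or len(prefix) not in (4, 6, 8):
        raise ValueError("expected YYYY, YYYYMM, or YYYYMMDD")
    year = int(prefix[:4])
    if len(prefix) >= 6:
        first_month = last_month = int(prefix[4:6])
    else:
        first_month, last_month = 1, 12
    if len(prefix) == 8:
        first_day = last_day = int(prefix[6:8])
    else:
        first_day = 1
        # monthrange returns (weekday of day 1, number of days in month)
        last_day = calendar.monthrange(year, last_month)[1]
    start = f"{year:04d}{first_month:02d}{first_day:02d}000000"
    end = f"{year:04d}{last_month:02d}{last_day:02d}235959"
    return start, end


for prefix in ("2020", "202001", "20200115"):
    print(prefix, expand_date_prefix(prefix))
```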

💼 Can I use this for commercial work?

Yes. The CDX index is part of the Internet Archive's public API surface. Always cite the Internet Archive when republishing snapshots.

💳 Do I need a paid Apify plan?

The free plan returns up to 10 snapshots per run. Paid plans return up to 1,000,000.

⚠️ What if a run returns no rows?

Most often the URL has not been crawled, or every capture is excluded by the status or MIME filter. Try again with no status filter and a wider date window. If the issue persists, use the contact form and include the run URL.

🔁 How fresh is the data?

Live. Each run hits the Internet Archive CDX endpoint at run time.

⚖️ Is this legal?

Yes. The Internet Archive publishes the CDX index for exactly this kind of programmatic access. The Actor respects the published rate limits.

🔌 Integrate with any app

Make - drop run results into 1,800+ apps.
Zapier - trigger automations off completed runs.
Slack - post run summaries to a channel.
Google Sheets - sync each run into a spreadsheet.
Webhooks - notify your own services on run finish.
Airbyte - load runs into Snowflake, BigQuery, or Postgres.

🔗 Recommended Actors

🌐 Common Crawl Index Scraper - second-largest public web archive, complementary coverage.
🅱️ Bing Search Scraper - find current rank for the URLs you recover.
🦆 DuckDuckGo Search Scraper - alternative SERP signal alongside Wayback history.
📚 Wikipedia Pageviews Scraper - cross-reference brand mentions over time.
🐙 GitHub Trending Repos Scraper - track adjacent developer-attention shifts.

💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.

🆘 Need Help? Open our contact form and we'll route the question to the right person.

Internet Archive and Wayback Machine are trademarks of Internet Archive, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Internet Archive. It uses only the public CDX index endpoint and respects all published rate limits.
