Wayback Machine CDX URL List Scraper

Pricing: from $8.25 / 1,000 items
Pull every archived URL the Internet Archive has captured for any domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and direct snapshot links. Filter by date range, status, MIME, and uniqueness. Export to JSON, CSV, or Excel for SEO recovery and competitive research.


Rating: 0.0 (0)

Developer: ParseForge (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified a day ago



🕰️ Wayback Machine CDX Scraper

🚀 Export every archived URL the Internet Archive holds for any domain or URL prefix. Filter by date range, status, MIME, and uniqueness. No API key, no registration.

🕒 Last updated: 2026-05-01 · 📊 10 fields per record · 🕰️ archives back to 1996 · 🌐 billions of snapshots · 🔓 free public CDX index

The Wayback Machine CDX Scraper queries the public Internet Archive CDX index for a domain or URL prefix and returns every snapshot the Wayback Machine has on file. Each record includes the URL key, raw timestamp, ISO timestamp, original URL, MIME type, HTTP status, content digest, byte length, and a direct snapshot link you can open in any browser.

The Wayback Machine has been running since 1996 and now holds more than 800 billion web pages. It is the canonical historical record of the public web, used by lawyers for evidence, by SEO teams for content recovery, and by journalists for accountability work. This Actor handles CDX query syntax, pagination, and filters server-side so you skip writing the parser yourself.

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| SEO teams, web archivists, OSINT researchers, journalists, security analysts, legal teams | Lost-content recovery, redirect audits, brand history, competitor evolution, link reclamation, evidence collection |

📋 What the Wayback Machine CDX Scraper does

Five filtering workflows in a single run:

  • 🌐 Full domain export. Submit a domain or URL prefix and pull every snapshot the archive holds.
  • 📐 Match-type control. exact for one URL, prefix for a path tree, host for one hostname, domain for the host plus subdomains.
  • 📅 Date range. from and to timestamps in YYYYMMDD format restrict to a specific window.
  • 🌐 MIME and status filter. Restrict to text/html or 200-only snapshots when auditing a redirect map.
  • 🔁 Unique URLs. uniqueOnly collapses by URL key so you get one row per distinct URL instead of one per capture.

Each row reports the CDX URL key, original URL, raw timestamp, ISO timestamp, MIME type, HTTP status, content digest, byte length, and a direct snapshot link in web.archive.org/web/{ts}/{url} form.
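The ISO timestamp and snapshot link are both derived deterministically from the raw 14-digit CDX timestamp and the original URL. A minimal sketch of that mapping in Python, using only the standard library:

```python
from datetime import datetime, timezone

def to_iso(ts: str) -> str:
    """Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss, UTC) to ISO 8601."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")

def snapshot_link(ts: str, original: str) -> str:
    """Build the direct Wayback link in web.archive.org/web/{ts}/{url} form."""
    return f"https://web.archive.org/web/{ts}/{original}"

print(to_iso("20070531101538"))        # → 2007-05-31T10:15:38.000Z
print(snapshot_link("20070531101538", "http://www.apify.com:80/"))
```

The same two helpers are useful in reverse when you want to group snapshots by calendar day or month after export.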

💡 Why it matters: the CDX index is the cheapest historical web record available. When a competitor pivots, when a regulator demands evidence of a marketing claim, or when an SEO team needs to recover a deleted blog, the Wayback Machine is usually the only public source. Building your own pipeline against the CDX endpoint means handling pagination tokens and timestamp formats; this Actor handles all of that.
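For comparison, here is what a raw query against the public CDX endpoint looks like. This is a minimal sketch using the standard library; the parameter names (`matchType`, `from`, `to`, `filter`, `collapse`) follow the published CDX server API, and the Actor does the equivalent work plus pagination for you:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_url(url, match_type="domain", from_ts=None, to_ts=None,
                  status=None, mime=None, collapse=None, limit=100):
    """Assemble a raw CDX query URL mirroring this Actor's filter inputs."""
    params = [("url", url), ("matchType", match_type),
              ("output", "json"), ("limit", str(limit))]
    if from_ts:
        params.append(("from", from_ts))
    if to_ts:
        params.append(("to", to_ts))
    if status:  # CDX filters take field:value form and are repeatable
        params.append(("filter", f"statuscode:{status}"))
    if mime:
        params.append(("filter", f"mimetype:{mime}"))
    if collapse:
        params.append(("collapse", collapse))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Every unique 200/text-html capture of apify.com in 2020:
print(build_cdx_url("apify.com", from_ts="20200101", to_ts="20201231",
                    status="200", mime="text/html", collapse="urlkey"))
```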


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Snapshots to return. Free plan caps at 10, paid plan at 1,000,000. |
| urlOrDomain | string | "apify.com" | Domain or URL prefix to look up. |
| matchType | string | "domain" | exact, prefix, host, or domain. |
| fromDate | string | empty | Earliest timestamp. Examples: 2020, 202001, 20200115. |
| toDate | string | empty | Latest timestamp. |
| statusCode | string | empty | HTTP status filter, e.g. 200. |
| mimeType | string | empty | MIME type filter, e.g. text/html. |
| collapse | string | empty | CDX collapse field, e.g. urlkey. |
| uniqueOnly | boolean | false | Shortcut for collapse=urlkey. |

Example: every HTML snapshot of the apify.com homepage.

{
"maxItems": 100,
"urlOrDomain": "apify.com",
"matchType": "exact",
"mimeType": "text/html",
"statusCode": "200"
}

Example: every unique URL ever captured under a competitor blog.

{
"maxItems": 1000,
"urlOrDomain": "example.com/blog",
"matchType": "prefix",
"uniqueOnly": true,
"fromDate": "2020"
}

⚠️ Good to Know: very broad queries on busy domains can return millions of rows. Always set maxItems and ideally a date window. The CDX endpoint accepts multi-million-row responses but they take minutes to download.
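Programmatic runs use the same input shape. A minimal sketch with the apify-client Python package; the actor ID below is a placeholder, so copy the real one from the Actor's page in the Apify Console:

```python
# Input for the second example above: every unique URL under a blog since 2020.
run_input = {
    "maxItems": 1000,
    "urlOrDomain": "example.com/blog",
    "matchType": "prefix",
    "uniqueOnly": True,
    "fromDate": "2020",
}

def run_scraper(token: str, actor_id: str = "parseforge/wayback-machine-cdx-scraper"):
    """Start a run and stream its dataset rows (actor_id is a placeholder)."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Stream rows without loading the whole dataset into memory
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()
```

Call it as `for row in run_scraper("YOUR_APIFY_TOKEN"): ...` once you have an API token.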


📊 Output

Each snapshot record contains 10 fields. Download as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🔑 urlkey | string | "com,apify)/" |
| ⏱️ timestamp | string | "20070531101538" |
| 🔗 original | string | "http://www.apify.com:80/" |
| 📄 mimetype | string \| null | "text/html" |
| statusCode | integer \| null | 200 |
| 🔐 digest | string \| null | "EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6" |
| 📦 length | integer \| null | 1013 |
| 🌐 snapshotUrl | string | "https://web.archive.org/web/20070531101538/..." |
| 📅 timestampIso | ISO 8601 \| null | "2007-05-31T10:15:38.000Z" |
| 🕒 scrapedAt | ISO 8601 | "2026-05-01T00:47:14.231Z" |

📦 Sample records
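An illustrative record assembled from the schema's own example values (the snapshotUrl is completed using the web.archive.org/web/{ts}/{url} form described above):

```json
{
  "urlkey": "com,apify)/",
  "timestamp": "20070531101538",
  "original": "http://www.apify.com:80/",
  "mimetype": "text/html",
  "statusCode": 200,
  "digest": "EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6",
  "length": 1013,
  "snapshotUrl": "https://web.archive.org/web/20070531101538/http://www.apify.com:80/",
  "timestampIso": "2007-05-31T10:15:38.000Z",
  "scrapedAt": "2026-05-01T00:47:14.231Z"
}
```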


✨ Why choose this Actor

  • 🆓 Free public source. Reads the Internet Archive CDX endpoint, no auth needed.
  • 🕰️ Decades of history. Archive starts in 1996, with continuous coverage of major sites.
  • 📐 Match-type control. Exact URL, prefix tree, host, or full domain in a single input.
  • 📅 Flexible date windows. Year, month, day precision via fromDate and toDate.
  • 🔁 Unique-URL collapse. One row per URL key when you only need a content map.
  • 🌐 Direct snapshot links. Each row carries a ready-to-open Wayback URL.
  • 🛡️ Pagination handled. CDX returns paged responses; the Actor walks them all.

📊 The Internet Archive reports more than 800 billion web pages indexed across the Wayback Machine.
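For context on what the Actor normalizes away: the CDX endpoint's JSON output is row-oriented, with a column-header row first and value rows after it. A sketch of turning that shape into the dict records this Actor emits (sample data only, no network call):

```python
def parse_cdx_rows(rows):
    """Convert CDX row-oriented JSON (header row first) into dicts."""
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]

# Shape of a raw CDX JSON response, reduced to one capture:
sample = [
    ["urlkey", "timestamp", "original", "mimetype",
     "statuscode", "digest", "length"],
    ["com,apify)/", "20070531101538", "http://www.apify.com:80/",
     "text/html", "200", "EE6FCHP3MKBC3EV5D5Q4WQJNZNVUTNU6", "1013"],
]
records = parse_cdx_rows(sample)
```

Note that the raw index returns every value as a string; the Actor additionally casts statusCode and length to integers and adds the derived fields.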


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| Manual CDX queries | Free | Full | Live | Manual | Engineer hours |
| Paid web archive APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Static archive dumps | Free | Snapshot only | Stale | None | Self-host parser |
| ⭐ Wayback Machine CDX Scraper (this Actor) | Pay-per-event | Full | Live | Match type, dates, status, MIME | None |

The same CDX endpoint the Internet Archive itself exposes, wrapped in a clean filter UI.


🚀 How to use

  1. 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
  2. 🔍 Open the Actor. Search for "Wayback Machine CDX" in the Apify Store.
  3. ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
  4. ▶️ Click Start. A 100-snapshot run typically completes in 10 to 40 seconds.
  5. 📥 Download. Export as CSV, Excel, JSON, or XML.

⏱️ Total time from sign-up to first dataset: under five minutes.


💼 Business use cases

📈 SEO & content recovery

  • Recover deleted blog posts and product pages
  • Audit historical redirect chains for migration QA
  • Reclaim broken backlinks pointing to dead URLs
  • Pull old metadata for content rebuild projects
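For the lost-content workflow, the exported records can be narrowed to recovery candidates by keeping the newest 200/text-html capture per distinct URL. A sketch over this Actor's output fields, shown on inline sample data:

```python
def recovery_candidates(records, path_prefix="/blog"):
    """Keep the newest 200 text/html snapshot per distinct urlkey."""
    best = {}
    for r in records:
        if r.get("statusCode") != 200 or r.get("mimetype") != "text/html":
            continue
        if path_prefix not in r["original"]:
            continue
        prev = best.get(r["urlkey"])
        if prev is None or r["timestamp"] > prev["timestamp"]:
            best[r["urlkey"]] = r
    return sorted(best.values(), key=lambda r: r["urlkey"])

sample = [
    {"urlkey": "com,example)/blog/post", "timestamp": "20200101000000",
     "original": "http://example.com/blog/post", "mimetype": "text/html", "statusCode": 200},
    {"urlkey": "com,example)/blog/post", "timestamp": "20210101000000",
     "original": "http://example.com/blog/post", "mimetype": "text/html", "statusCode": 200},
    {"urlkey": "com,example)/blog/gone", "timestamp": "20210101000000",
     "original": "http://example.com/blog/gone", "mimetype": "text/html", "statusCode": 404},
]
candidates = recovery_candidates(sample)  # one row: the 2021 capture of /blog/post
```

Lexicographic comparison works here because the 14-digit timestamps are fixed-width.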

🛡️ Brand & competitive

  • Trace how a competitor's positioning evolved
  • Document past marketing claims for legal review
  • Detect domain ownership changes via WHOIS plus archive
  • Monitor design and copy iterations across years

⚖️ Legal & compliance

  • Collect evidence-grade snapshots of past pages
  • Preserve disputed content before it disappears
  • Track regulatory disclosure timelines on public sites
  • Verify warranty or pricing terms at a specific date

📰 Journalism & OSINT

  • Investigate deleted statements from public figures
  • Pull historical versions of government pages
  • Track edits to disputed Wikipedia-adjacent sources
  • Cite stable timestamped URLs in reporting

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating Wayback Machine CDX Scraper

Run this Actor on a schedule, from your codebase, or inside another tool:

Schedule daily, weekly, or monthly runs from the Apify Console. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
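For a lightweight pipeline, the Apify API also exposes the latest run's dataset directly over HTTP. A sketch that builds that export URL; `user~actor-name` is a placeholder ID in the API's tilde form, so substitute the real one:

```python
def last_run_csv_url(actor_id: str, token: str) -> str:
    """URL of the most recent run's dataset as CSV via the Apify API.

    actor_id uses the API's tilde form, e.g. "user~actor-name".
    """
    return (f"https://api.apify.com/v2/acts/{actor_id}"
            f"/runs/last/dataset/items?format=csv&token={token}")

url = last_run_csv_url("user~actor-name", "YOUR_APIFY_TOKEN")
```

Fetching that URL on a schedule (cron, a Sheets importer, or a downstream ETL job) keeps a spreadsheet or warehouse in sync without writing any integration code.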




🔌 Integrate with any app

  • Make - drop run results into 1,800+ apps.
  • Zapier - trigger automations off completed runs.
  • Slack - post run summaries to a channel.
  • Google Sheets - sync each run into a spreadsheet.
  • Webhooks - notify your own services on run finish.
  • Airbyte - load runs into Snowflake, BigQuery, or Postgres.

💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.


🆘 Need Help? Open our contact form and we'll route the question to the right person.


Internet Archive and Wayback Machine are trademarks of Internet Archive, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Internet Archive. It uses only the public CDX index endpoint and respects all published rate limits.