Wayback Machine Search
Pricing
from $1.00 / 1,000 snapshot fetches
Search Internet Archive Wayback Machine for historical web snapshots. Find cached pages, recover deleted content, track website changes over time. Filter by date range, HTTP status, MIME type. Collapse duplicates. Fetch archived page content. No API key needed.
Developer: ryan clinton
Search the Internet Archive's Wayback Machine for historical snapshots of any website. Retrieve archived page metadata -- including timestamps, URLs, MIME types, HTTP status codes, and content hashes -- for up to 10,000 snapshots per run. Optionally fetch the full text content of archived pages with built-in polite rate limiting. This actor queries the Internet Archive CDX API, requires no API key, and outputs structured JSON data ready for compliance audits, competitive research, SEO analysis, digital forensics, and content recovery.
Why use Wayback Machine Search?
The Internet Archive has been capturing web pages since 1996 and holds hundreds of billions of snapshots. Manually browsing the Wayback Machine is tedious and impractical at scale. This actor gives you programmatic, structured access to the CDX index so you can:
- Query at scale -- retrieve thousands of snapshot records in a single run instead of clicking through the Wayback Machine interface one page at a time.
- Filter precisely -- narrow results by date range, HTTP status code, MIME type, and URL matching strategy to get exactly what you need.
- Deduplicate intelligently -- collapse results by content digest or time interval to remove redundant snapshots and focus on meaningful changes.
- Fetch archived content -- optionally pull the actual text of archived pages for content analysis, all with polite rate limiting built in.
- Integrate anywhere -- consume clean JSON output via the Apify API, webhooks, or direct integrations with Google Sheets, Slack, Zapier, and more.
Key features
- Four match types -- search by exact URL, URL prefix, host, or entire domain including all subdomains.
- Date range filtering -- restrict results to a specific time window using YYYYMMDD or YYYY format.
- Status code filtering -- retrieve only successful pages (200), redirects (301/302), or any other HTTP status.
- MIME type filtering -- focus on HTML pages, images, PDFs, or any other content type.
- Smart deduplication -- collapse by content digest (unique content only) or by timestamp granularity (monthly, daily, or hourly).
- Content extraction -- fetch and strip HTML from archived pages, returning clean text up to 50,000 characters per snapshot.
- Polite crawling -- built-in 500ms delay between content fetches with a custom User-Agent to respect the Internet Archive's servers.
- Batch processing -- results are pushed to the dataset in 1,000-item chunks for memory efficiency.
- No API key required -- the Internet Archive CDX API is completely free and open.
- ISO 8601 timestamps -- raw Wayback timestamps (YYYYMMDDHHMMSS) are automatically converted to standard ISO 8601 format.
- Direct archive URLs -- every result includes a clickable Wayback Machine link to view the archived page in your browser.
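The ISO 8601 conversion described above is straightforward to reproduce locally. A minimal sketch (not the actor's actual code) using only the standard library:

```python
from datetime import datetime, timezone

def wayback_to_iso(ts: str) -> str:
    """Convert a raw Wayback timestamp (YYYYMMDDHHMMSS) to ISO 8601."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(wayback_to_iso("20231015143022"))  # -> 2023-10-15T14:30:22Z
```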
How to use Wayback Machine Search
From the Apify Console:
- Navigate to Wayback Machine Search on the Apify Store.
- Click Try for free to open the actor in your Apify Console.
- Enter the URL or domain you want to search in the URL field.
- Choose a Match Type -- use "Exact URL" for a single page, "URL Prefix" for a path and its children, "Same Host" for an entire hostname, or "All Subdomains" for a domain and all its subdomains.
- Optionally set date ranges, status code filters, MIME type filters, and deduplication preferences.
- Set your Max Results (default 500, up to 10,000).
- Click Start and wait for the run to finish.
- View, download, or export your results from the Dataset tab in JSON, CSV, or Excel format.
Via the Apify API:
You can start the actor programmatically using the Apify API with Python, JavaScript, cURL, or any HTTP client. See the API & Integration section below for ready-to-use code examples.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | "apify.com" | URL or domain to search for snapshots |
| matchType | string | No | "exact" | URL matching strategy: exact, prefix, host, or domain |
| dateFrom | string | No | -- | Start date filter in YYYYMMDD or YYYY format |
| dateTo | string | No | -- | End date filter in YYYYMMDD or YYYY format |
| statusFilter | string | No | -- | HTTP status code filter (e.g., "200") |
| mimeFilter | string | No | -- | MIME type filter (e.g., "text/html") |
| collapseBy | string | No | -- | Deduplication: digest, timestamp:6, timestamp:8, or timestamp:10 |
| maxResults | integer | No | 500 | Maximum number of snapshots to return (1--10,000) |
| includeContent | boolean | No | false | Fetch and include archived page text content |
| maxContentFetch | integer | No | 10 | Maximum pages to fetch content for (1--100) |
Example JSON input:
```json
{
  "url": "apify.com",
  "matchType": "domain",
  "dateFrom": "2020",
  "dateTo": "2024",
  "statusFilter": "200",
  "mimeFilter": "text/html",
  "collapseBy": "timestamp:8",
  "maxResults": 1000,
  "includeContent": false
}
```
Tips:
- Use matchType: "domain" to search across all subdomains (e.g., blog.example.com, docs.example.com).
- Set collapseBy: "digest" to see only snapshots where the content actually changed -- this filters out identical captures.
- Use collapseBy: "timestamp:6" to get one snapshot per month, which is useful for tracking gradual changes over long time periods.
- Filter by statusFilter: "200" to exclude error pages and redirects from results.
- Enable includeContent with a low maxContentFetch value first to test before scaling up, since content fetching is significantly slower.
- You can use partial dates -- "2023" is equivalent to "20230101" for the start date.
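These filters correspond directly to query parameters of the public CDX endpoint. A sketch of assembling such a query URL yourself (the helper function is hypothetical; the parameter names follow the CDX API's documented conventions):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, match_type="exact", date_from=None, date_to=None,
                    status=None, mime=None, collapse=None, limit=500):
    """Assemble a CDX API query URL from actor-style filter options."""
    params = [("url", url), ("matchType", match_type),
              ("output", "json"), ("limit", str(limit))]
    if date_from:
        params.append(("from", date_from))
    if date_to:
        params.append(("to", date_to))
    if status:
        # filter= may repeat, so params are built as a list of pairs
        params.append(("filter", f"statuscode:{status}"))
    if mime:
        params.append(("filter", f"mimetype:{mime}"))
    if collapse:
        params.append(("collapse", collapse))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

query = build_cdx_query("example.com", match_type="domain",
                        date_from="2020", date_to="2024",
                        status="200", collapse="timestamp:8")
print(query)
```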
Output
Each run produces a dataset of snapshot records in JSON format. Below is an example of a single output item.
Example output:
```json
{
  "originalUrl": "https://apify.com/",
  "timestamp": "20231015143022",
  "archiveDate": "2023-10-15T14:30:22Z",
  "archiveUrl": "https://web.archive.org/web/20231015143022/https://apify.com/",
  "mimeType": "text/html",
  "statusCode": "200",
  "contentDigest": "QXHG7V5BDNP3WKZLIOEM6RVATS2YUHJ4",
  "contentLength": 48523,
  "content": null
}
```
When includeContent is enabled, the content field contains the extracted plain text of the archived page (up to 50,000 characters) with HTML tags, scripts, and styles stripped.
Output fields:
| Field | Type | Description |
|---|---|---|
| originalUrl | string | The original URL that was archived |
| timestamp | string | Raw Wayback timestamp in YYYYMMDDHHMMSS format |
| archiveDate | string | ISO 8601 formatted date (e.g., 2023-10-15T14:30:22Z) |
| archiveUrl | string | Direct link to view the snapshot on the Wayback Machine |
| mimeType | string | Content MIME type at the time of archiving (e.g., text/html) |
| statusCode | string | HTTP status code recorded at archive time (e.g., 200, 301) |
| contentDigest | string | Unique content hash -- identical digests mean identical page content |
| contentLength | number or null | Size of the archived content in bytes, or null if unavailable |
| content | string or null | Extracted plain text of the page, or null if content fetching is disabled |
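Because identical digests mean identical content, you can also deduplicate downloaded results locally instead of using collapseBy. A small sketch with made-up records:

```python
def dedupe_by_digest(snapshots):
    """Keep only the first snapshot for each unique contentDigest."""
    seen = set()
    unique = []
    for snap in snapshots:
        digest = snap["contentDigest"]
        if digest not in seen:
            seen.add(digest)
            unique.append(snap)
    return unique

records = [
    {"timestamp": "20230101000000", "contentDigest": "AAA"},
    {"timestamp": "20230201000000", "contentDigest": "AAA"},  # unchanged page
    {"timestamp": "20230301000000", "contentDigest": "BBB"},  # content changed
]
print(len(dedupe_by_digest(records)))  # -> 2
```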
Use cases
- Website change tracking -- monitor how a competitor's pricing page, product descriptions, or marketing copy has evolved over months or years.
- Compliance and legal evidence -- retrieve timestamped proof of what content appeared on a website at a specific date for legal proceedings or regulatory audits.
- SEO historical analysis -- analyze how a site's title tags, meta descriptions, and content structure have changed and correlate with search ranking shifts.
- Brand monitoring -- verify historical claims made on a company's website, track rebranding efforts, or document terms of service changes over time.
- Domain research -- investigate the history of a domain before purchasing it to check what content it previously hosted.
- Academic research -- study the evolution of web content, language, design trends, or information availability for digital humanities and media studies.
- Digital forensics -- recover deleted or modified web content for investigations, journalism, or fact-checking.
- Content recovery -- retrieve lost blog posts, documentation, or product pages from websites that have gone offline or restructured their URLs.
- Competitive intelligence -- track how competitors have changed their feature pages, pricing tiers, or messaging strategy over time.
- Link rot detection -- identify archived versions of pages that are no longer available at their original URLs.
- Security analysis -- investigate historical versions of domains to detect defacements, phishing page deployments, or unauthorized content changes.
API & Integration
You can run this actor programmatically using the Apify API. Use actor ID rT8Qt6fe3ygVyVMdb or the full slug ryanclinton/wayback-machine-search.
Python:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run_input = {
    "url": "example.com",
    "matchType": "domain",
    "dateFrom": "2020",
    "dateTo": "2024",
    "statusFilter": "200",
    "collapseBy": "timestamp:8",
    "maxResults": 1000,
}

run = client.actor("rT8Qt6fe3ygVyVMdb").call(run_input=run_input)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['archiveDate']} -- {item['originalUrl']}")
    print(f"  Archive: {item['archiveUrl']}")
```
JavaScript:
```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const input = {
  url: "example.com",
  matchType: "domain",
  dateFrom: "2020",
  dateTo: "2024",
  statusFilter: "200",
  collapseBy: "timestamp:8",
  maxResults: 1000,
};

const run = await client.actor("rT8Qt6fe3ygVyVMdb").call(input);
const { items } = await client.dataset(run.defaultDatasetId).listItems();

items.forEach((item) => {
  console.log(`${item.archiveDate} -- ${item.originalUrl}`);
  console.log(`  Archive: ${item.archiveUrl}`);
});
```
cURL:
```bash
curl -X POST "https://api.apify.com/v2/acts/rT8Qt6fe3ygVyVMdb/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "example.com",
    "matchType": "domain",
    "dateFrom": "2020",
    "dateTo": "2024",
    "statusFilter": "200",
    "collapseBy": "timestamp:8",
    "maxResults": 1000
  }'
```
Integrations:
This actor works with all standard Apify platform integrations, including:
- Webhooks -- trigger external services when a run completes.
- Google Sheets -- export snapshot data directly to a spreadsheet for collaborative analysis.
- Slack -- receive notifications with run summaries and result counts.
- Zapier / Make (Integromat) -- connect to thousands of apps and build multi-step automation workflows.
- GitHub / GitLab -- trigger runs from CI/CD pipelines for automated archival monitoring.
- Amazon S3 / Google Cloud Storage -- export datasets to cloud storage for long-term retention.
- Python and Node.js SDKs -- use the official Apify Python SDK or Apify JavaScript SDK to integrate directly into your applications.
How it works
The actor queries the Internet Archive's publicly available CDX (capture index) API, which indexes every snapshot stored in the Wayback Machine. The CDX API returns raw index data -- timestamps, URLs, status codes, content hashes, and sizes -- without requiring you to load full archived pages.
- Input validation -- the actor validates the URL and constructs a CDX API query with the specified match type, date range, filters, and collapse parameters.
- CDX API request -- a single HTTP request is sent to web.archive.org/cdx/search/cdx with JSON output format. The API returns all matching snapshot index records.
- Timestamp parsing -- raw Wayback timestamps in YYYYMMDDHHMMSS format are converted to ISO 8601 dates, and direct archive URLs are constructed for each snapshot.
- Content fetching (optional) -- if enabled, the actor sequentially fetches each archived page, strips HTML tags, scripts, and styles, and extracts plain text limited to 50,000 characters per page. A 500ms delay is enforced between requests to respect the Archive's servers.
- Batch push -- results are pushed to the Apify dataset in chunks of 1,000 items for memory efficiency.
```
+------------+     +----------------+     +-------------------+
| Input URL  | --> | CDX API Query  | --> | Timestamp Parsing |
+------------+     +----------------+     +-------------------+
                                                   |
                                                   v
+------------+     +----------------+     +-------------------+
|  Dataset   | <-- |   Batch Push   | <-- |   Content Fetch   |
+------------+     +----------------+     |    (optional)     |
                                          +-------------------+
```
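The content-extraction step can be approximated with the standard library's HTMLParser. This is a simplified sketch of the technique, not the actor's exact implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text(html, limit=50_000):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace, then truncate like the actor's 50,000-char cap
    text = " ".join(" ".join(parser.parts).split())
    return text[:limit]

print(extract_text("<html><script>var x=1;</script><p>Hello world</p></html>"))
# -> Hello world
```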
Performance & cost
This actor uses the Internet Archive CDX API, which is completely free and requires no API key. The only cost is Apify platform compute usage.
| Scenario | Estimated results | Content fetch | Run time | Apify platform cost |
|---|---|---|---|---|
| Single URL, metadata only | 100--500 | Off | 5--15 seconds | ~$0.001 |
| Domain search, metadata only | 1,000--5,000 | Off | 10--30 seconds | ~$0.005 |
| Single URL + content extraction | 100 snapshots, 10 fetched | On | 30--60 seconds | ~$0.005 |
| Large domain search | 10,000 | Off | 30--90 seconds | ~$0.01 |
| Large domain + content extraction | 10,000 snapshots, 100 fetched | On | 2--5 minutes | ~$0.02 |
Performance depends primarily on the Internet Archive CDX API response time and, when content fetching is enabled, the 500ms delay between fetches. Metadata-only runs complete quickly since they require only a single API call. The Apify Free plan gives you $5 of platform credits each month, which is enough for thousands of Wayback Machine searches.
Limitations
- Maximum 10,000 results per run -- the CDX API and actor enforce a 10,000 snapshot limit. For domains with millions of snapshots, use date ranges and filters to split results across multiple runs.
- Content fetching is slow -- each page is fetched individually with a mandatory 500ms delay between requests. Fetching 100 pages adds approximately 50 seconds to the run time.
- Content is plain text only -- HTML tags, scripts, and styles are stripped during extraction. The actor does not preserve formatting, images, or interactive elements.
- 50,000 character content limit -- extracted text is truncated at 50,000 characters per page to prevent excessively large datasets.
- CDX API availability -- the Internet Archive's servers can experience downtime or rate limiting during periods of heavy traffic. Runs may fail or return partial results during outages.
- No JavaScript rendering -- archived content is fetched as raw HTML. Pages that rely heavily on client-side JavaScript rendering may yield incomplete text extraction.
- Historical coverage gaps -- not every page change is captured. The Wayback Machine crawls on its own schedule, so gaps between snapshots may exist, especially for smaller or newer websites.
- Redirect snapshots -- some results may correspond to redirects (301/302) rather than final page content. Use statusFilter: "200" to filter these out.
Responsible use
This actor accesses the Internet Archive's public CDX API, which is a free community resource. To ensure sustainable access for everyone:
- Respect rate limits -- the actor includes a built-in 500ms delay between content fetches. Do not modify or bypass this delay.
- Use filters -- apply date ranges, status codes, MIME types, and collapse strategies to minimize the volume of data requested from the Archive's servers.
- Avoid excessive runs -- schedule runs at reasonable intervals rather than querying the same URLs repeatedly in rapid succession.
- Respect copyright -- the Wayback Machine provides access to historical web content for reference and research purposes. Archived content remains subject to its original copyright. Do not use extracted content in ways that violate intellectual property rights.
- Credit the source -- when using archived data in publications, reports, or applications, credit the Internet Archive and the Wayback Machine as the data source.
FAQ
Does this actor require an API key? No. The Internet Archive CDX API is free and open to the public. No registration or API key is needed.
What is the difference between the match types?
- Exact -- matches only the specific URL you provide (e.g., example.com/about).
- Prefix -- matches the URL and anything that starts with it (e.g., example.com/blog also matches example.com/blog/post-1, example.com/blog/post-2).
- Host -- matches all pages on the same host (e.g., example.com matches example.com/about, example.com/contact).
- Domain -- matches all subdomains too (e.g., example.com also matches blog.example.com, docs.example.com).
What does "collapse by digest" mean? Every time the Wayback Machine captures a page, it computes a content hash (digest). Collapsing by digest removes duplicate snapshots where the page content did not change between captures, leaving only unique versions.
What does collapsing by timestamp do?
- timestamp:6 -- keeps one snapshot per month (YYYYMM granularity).
- timestamp:8 -- keeps one snapshot per day (YYYYMMDD granularity).
- timestamp:10 -- keeps one snapshot per hour (YYYYMMDDHH granularity).
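Timestamp collapsing amounts to grouping on a prefix of the raw YYYYMMDDHHMMSS timestamp and keeping one snapshot per group -- the first 6 digits identify the month, 8 the day, 10 the hour. A local sketch of the same idea:

```python
def collapse_by_timestamp(snapshots, digits=6):
    """Keep the first snapshot per timestamp prefix (6=month, 8=day, 10=hour)."""
    kept = {}
    for snap in snapshots:
        key = snap["timestamp"][:digits]
        kept.setdefault(key, snap)
    return list(kept.values())

snaps = [
    {"timestamp": "20230115120000"},
    {"timestamp": "20230120090000"},  # same month as the first snapshot
    {"timestamp": "20230201000000"},
]
print(len(collapse_by_timestamp(snaps, digits=6)))  # -> 2
```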
How far back does the data go? The Wayback Machine has been archiving the web since 1996. However, coverage varies significantly by site. Popular websites may have daily snapshots, while smaller sites may have only a handful of captures across their entire history.
Why are some content fields null?
The content field is null by default unless you enable the includeContent option. Even with content fetching enabled, only the first N snapshots have content fetched (controlled by maxContentFetch). Content may also be null if the archived page could not be retrieved from the Wayback Machine servers.
Can I search for non-HTML content like PDFs or images?
Yes. Use the mimeFilter parameter to target specific content types. Set it to application/pdf for PDFs, image/jpeg for JPEG images, image/png for PNGs, or any other valid MIME type.
Can I use this actor on a schedule? Yes. You can set up a recurring schedule on the Apify platform to run this actor daily, weekly, or at any custom interval. Combine it with the Website Change Monitor actor for both historical and ongoing website tracking.
What if the CDX API returns an error?
The Internet Archive's servers occasionally experience high load or temporary outages. If a run fails, wait a few minutes and try again. Reducing maxResults or adding more specific filters can also help reduce server load.
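If you call the CDX endpoint yourself, a simple retry loop with exponential backoff handles these transient failures. A hedged sketch -- the fetch function here is a stand-in for whatever HTTP client you use, not part of the actor:

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Example with a flaky stand-in fetcher that fails once, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("temporary outage")
    return {"status": "ok"}

print(with_retries(flaky_fetch, base_delay=0.01))  # -> {'status': 'ok'}
```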
Is there a limit on how many snapshots I can retrieve? The actor supports up to 10,000 snapshots per run, which is the practical limit for the CDX API. For URLs with more than 10,000 snapshots, use date range filtering to split your search across multiple runs.
What happens if a URL has no archived snapshots? The actor will return an empty dataset with zero results. Not all websites or pages have been crawled by the Internet Archive -- smaller, newer, or robots.txt-blocked sites may have limited or no coverage.
Related actors
| Actor | Description |
|---|---|
| Internet Archive Search | Search the Internet Archive's general collections -- books, audio, video, and software -- beyond just web snapshots. |
| Website Change Monitor | Monitor live websites for content changes in real time with configurable check intervals and diff detection. |
| WHOIS Domain Lookup | Look up domain registration details including registrar, creation date, expiration, and nameservers. |
| Website Content to Markdown | Convert any live web page into clean Markdown format for documentation, analysis, or content migration. |
| SSL Certificate Search | Search Certificate Transparency logs to discover SSL certificates issued for a domain and its subdomains. |
| DNS Record Lookup | Query DNS records (A, AAAA, MX, TXT, NS, CNAME, SOA) for any domain to investigate infrastructure and hosting. |
