Wayback Machine Search
Query the Internet Archive's Wayback Machine for historical snapshots of any URL or domain — with flexible match modes, date and status filtering, content deduplication, and optional archived-text retrieval.
What this actor does
The Internet Archive's Wayback Machine has captured hundreds of billions of web pages since 1996. This actor uses the Wayback Machine's free, public search API to enumerate every archived capture of a given URL or domain and return it as structured data — something the Wayback Machine's own browser UI can't easily do at scale.
You give the actor a URL and a match mode (exact URL, path prefix, full host, or entire domain) and it returns a row per archived snapshot with the capture timestamp, playback URL, HTTP status, MIME type, and content fingerprint. You can filter by date range, by HTTP status (e.g. find every archived 404 on a site), or by MIME type (e.g. enumerate every archived PDF). And because many pages are captured hundreds of times with identical content, the actor can deduplicate by content fingerprint so you get one row per actual change.
For historical-content research, the actor can optionally download each snapshot and extract the readable text of the archived page, giving you the raw material for diffs, longitudinal SEO studies, or training datasets.
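For orientation, the free endpoint behind this kind of lookup is the Wayback Machine's public CDX API (the actor's `cdx_fetch_failed` error code points the same way). Below is a minimal sketch of an equivalent raw query, assuming the actor maps its inputs onto the standard CDX parameters; it is illustrative, not the actor's internal code:

```python
# Rough sketch of the kind of CDX query this actor wraps.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

params = {
    "url": "example.com",
    "matchType": "domain",              # exact | prefix | host | domain
    "from": "2020",                     # YYYY, YYYYMM, or YYYYMMDD
    "to": "2024",
    "filter": ["statuscode:200", "mimetype:text/html"],  # repeated filter params
    "collapse": "digest",               # drop adjacent identical captures
    "limit": 500,
    "output": "json",
}

resp = requests.get(CDX_ENDPOINT, params=params, timeout=60)
resp.raise_for_status()
rows = resp.json()

# The first row is the column header (urlkey, timestamp, original, ...).
header, captures = rows[0], rows[1:]
for row in captures[:5]:
    snap = dict(zip(header, row))
    # A playback URL can be rebuilt from the timestamp plus original URL.
    print(f"https://web.archive.org/web/{snap['timestamp']}/{snap['original']}")
```

On top of this raw listing, the actor adds retries, structured records, and the optional archived-text extraction described above.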
Key features
- Four URL match modes: `exact`, `prefix`, `host`, `domain`
- Date-range filtering (`YYYY`, `YYYYMM`, or `YYYYMMDD` precision)
- HTTP-status filtering — find every archived 200, 301, 404, etc.
- MIME-type filtering — restrict to `text/html`, `application/pdf`, images, and so on
- Content deduplication by fingerprint, or time-bucketing (one snapshot per month / day / hour)
- Optional archived-text retrieval — pull the readable text of selected snapshots
- No API key, no login, no proxy — the Wayback Machine's public search API is free and open
- Zero-null output — empty fields are omitted from each record (sketched just below)
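The last point is worth a note: records never carry empty fields. Conceptually the behaviour is just a filter over each record before it is written — an illustrative sketch, not the actor's internal code:

```python
# Illustrative only: "zero-null output" means empty values are dropped
# from each record before it is pushed to the dataset.
def strip_empty(record: dict) -> dict:
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}

print(strip_empty({"statusCode": 200, "mimeType": "", "content": None}))
# -> {'statusCode': 200}
```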
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | — | Required. The URL or domain to look up (e.g. `apify.com`, `https://nytimes.com/2024/01/15/story.html`). |
| `matchType` | enum | `exact` | How to match the URL: `exact` (this URL only), `prefix` (URL and anything beneath it), `host` (same hostname only), `domain` (host plus all subdomains). |
| `dateFrom` | string | — | Earliest snapshot date, inclusive. Accepts `YYYY`, `YYYYMM`, or `YYYYMMDD` (format check sketched below). |
| `dateTo` | string | — | Latest snapshot date, inclusive. Same formats. |
| `statusFilter` | string | — | Keep only snapshots with this HTTP status (e.g. `200`, `404`). |
| `mimeFilter` | string | — | Keep only snapshots with this MIME type (e.g. `text/html`, `application/pdf`). |
| `collapseBy` | enum | `digest` | `digest` (drop adjacent duplicates by content fingerprint), `monthly`, `daily`, `hourly`, or `none`. |
| `maxResults` | integer | `500` | Maximum snapshot records to return (1–10,000). |
| `includeContent` | boolean | `false` | If `true`, download each snapshot and extract its readable text. |
| `maxContentFetch` | integer | `10` | When `includeContent` is on, cap the number of snapshots to download (0–500). |
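The two date fields accept exactly three precisions. If you build inputs programmatically, a quick client-side shape check can catch mistakes early; the regex below is an assumption matching the formats listed in the table, and it checks shape only, not calendar validity:

```python
import re

# YYYY, YYYYMM, or YYYYMMDD — the three precisions dateFrom/dateTo accept.
DATE_RE = re.compile(r"\d{4}(\d{2}(\d{2})?)?")

for value in ("2024", "202401", "20240115", "2024-01-15"):
    ok = DATE_RE.fullmatch(value) is not None
    print(f"{value}: {'ok' if ok else 'rejected'}")
# "2024-01-15" is rejected: dashed dates are not among the accepted formats.
```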
Example input — every HTML page ever archived under a domain, deduplicated
{"url": "example.com","matchType": "domain","mimeFilter": "text/html","statusFilter": "200","collapseBy": "digest","maxResults": 2000}
Example input — track a single page through 2024 with monthly granularity
{"url": "https://news.ycombinator.com/","matchType": "exact","dateFrom": "2024","dateTo": "2024","collapseBy": "monthly"}
Example input — archive the readable text of a page across changes
{"url": "apify.com","matchType": "exact","includeContent": true,"maxContentFetch": 20,"collapseBy": "digest"}
Output
One record per snapshot:
{"originalUrl": "https://apify.com/","timestamp": "20240115123045","archiveDate": "2024-01-15T12:30:45+00:00","archiveUrl": "https://web.archive.org/web/20240115123045/https://apify.com/","mimeType": "text/html","statusCode": 200,"contentDigest": "SHA1:ABC123...","contentLength": 45678,"content": "Apify is the platform where developers build, deploy, and publish...","scrapedAt": "2026-04-24T12:00:00+00:00"}
Field descriptions
- `originalUrl` — the URL that was archived
- `timestamp` — the Wayback Machine's raw capture timestamp (`YYYYMMDDHHMMSS`)
- `archiveDate` — ISO-8601 UTC rendering of `timestamp` for convenience (conversion sketched below)
- `archiveUrl` — direct playback URL in the Wayback Machine viewer
- `mimeType` — MIME type reported when the snapshot was captured
- `statusCode` — HTTP status code at capture time
- `contentDigest` — content fingerprint (used for deduplication)
- `contentLength` — response body size in bytes
- `content` — extracted readable page text (only when `includeContent=true` and the fetch succeeded; capped at 500 KB)
- `scrapedAt` — ISO timestamp of this run
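Since `archiveDate` is derived from `timestamp`, either field alone lets you reconstruct the other — and the playback URL — when post-processing records:

```python
from datetime import datetime, timezone

ts = "20240115123045"  # raw Wayback capture timestamp (YYYYMMDDHHMMSS)

# ISO-8601 UTC rendering, matching the archiveDate field.
dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
print(dt.isoformat())  # 2024-01-15T12:30:45+00:00

# Playback URL, matching the archiveUrl field.
print(f"https://web.archive.org/web/{ts}/https://apify.com/")
```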
If no snapshots match, a single diagnostic record is emitted with `type: "wayback_search_error"` and a `reason` of `no_snapshots`, `cdx_fetch_failed`, or `invalid_input`.
Use cases
- Content audit / SEO — recover every historical version of a site's pages to diff copy, titles, or schema changes over time
- Broken-link recovery — enumerate every archived 404 on your domain so you can redirect or restore the missing URLs
- Competitive intelligence — see how a competitor's landing page, pricing, or product catalogue has evolved month by month
- Journalism and due diligence — reconstruct a web page, press release, or statement as it existed on a specific date
- Training-data curation — pull the text of older captures of reference sites for ML datasets
FAQ
Do I need a Wayback Machine account or API key? No. The Wayback Machine's search API is a free public endpoint. The actor uses no credentials.
Does it use proxies? No. The search API works fine from Apify's datacenter IPs.
What's the difference between `matchType=host` and `matchType=domain`?
`host` matches one exact hostname — `www.example.com` won't return `blog.example.com`. `domain` matches that host and all of its subdomains (`blog.example.com`, `api.example.com`, and so on).
Why is `collapseBy=digest` the default?
Most pages are captured many times with identical content. Collapsing by fingerprint drops those duplicates so you get one record per actual change rather than hundreds of identical captures.
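Concretely, the collapse keeps the first capture of each consecutive run of identical fingerprints — the same semantics as grouping adjacent rows by digest. A sketch of that behaviour (field names as in the Output section):

```python
from itertools import groupby

captures = [
    {"timestamp": "20240101000000", "contentDigest": "SHA1:AAA"},
    {"timestamp": "20240102000000", "contentDigest": "SHA1:AAA"},  # unchanged -> dropped
    {"timestamp": "20240103000000", "contentDigest": "SHA1:BBB"},  # changed -> kept
]

# Keep the first record of each run of identical adjacent digests.
collapsed = [next(group) for _, group in groupby(captures, key=lambda c: c["contentDigest"])]
print([c["timestamp"] for c in collapsed])  # ['20240101000000', '20240103000000']
```

Because only adjacent duplicates collapse, a page that alternates between two versions still yields a record per flip — which is usually what you want for change tracking.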
When should I use the time-bucket collapses (monthly / daily / hourly)?
Use these when you want an evenly spaced sample over time — one snapshot per month is great for a long-term trend, one per day for detailed change tracking.
Why does my record sometimes lack a `content` field?
Either `includeContent` was off, `maxContentFetch` was exhausted, or the archived page replayed with a non-200 status. The actor never emits an empty `content` field.
What are the rate limits? The Wayback Machine's search API doesn't publish a hard limit. The actor retries transient 429 / 5xx responses and paces archived-page downloads with a small polite delay.
Can I search by keyword or full-text?
No. The Wayback Machine's public search API works by URL, not full-text. Use the `includeContent` option to retrieve page text and then filter downstream (see the sketch below).
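That downstream filtering can be a one-liner once records carry `content`. A sketch, with field names as in the Output section and an illustrative sample record:

```python
def keyword_filter(items: list[dict], keyword: str) -> list[dict]:
    """Keep snapshot records whose extracted text mentions the keyword."""
    kw = keyword.lower()
    return [it for it in items if kw in it.get("content", "").lower()]

# e.g. over records fetched with includeContent=true:
sample = [
    {"archiveDate": "2024-01-15T12:30:45+00:00", "content": "Pricing starts at $49."},
    {"archiveDate": "2024-03-02T08:00:00+00:00"},  # no content field -> skipped
]
print(keyword_filter(sample, "pricing"))
```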
Known limitations
- URL-based search only. You can't query by keyword — only by URL, host, or domain.
- Archive coverage is not complete. Not every page on the internet is captured, and capture frequency varies wildly by site popularity.
- `maxResults` is capped at 10,000. For very large domains, narrow the scope with a date range or use more specific match types.
- Archived text extraction captures readable text only — not images, interactive widgets, or JavaScript-rendered content that wasn't present in the stored HTML.
- Content is capped at 500 KB per archived page to keep dataset rows manageable; anything longer is truncated.
- Playback occasionally redirects. When the Wayback Machine redirects a playback URL to a different snapshot, the `statusCode` field reflects the original capture status, not the redirect chain.