Wayback HTML Page History Extractor

Created by

Stas Persiianenko

Actor

Wayback Machine CDX Bulk Extractor

Extract archived HTML pages for a URL prefix from the Wayback Machine with HTTP 200 filters, timestamps, digests, and replay links.

Wayback Machine CDX Bulk Extractor (automation-lab/wayback-machine-cdx-extractor)

Input

URL or domain (required): https://example.com/blog/

Match type: prefix

Max snapshots: 5000

From date (YYYYMMDD): 20180101

To date (YYYYMMDD): 20251231

Filter by status codes: 200

Exclude status codes

Filter by MIME types: text/html

Page size: 10000

Collapse duplicates: digest

Include Wayback Machine URL: true
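The input above resembles a query against the public Wayback Machine CDX server. As a rough sketch only, the following Python shows how such settings could be turned into a CDX query URL. The function name `build_cdx_url` and the mapping from the Actor's input fields to CDX parameters are assumptions made for illustration; the Actor's actual implementation is not documented here, and its "Page size" setting is not modeled.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_type="prefix", date_from=None, date_to=None,
                  status_codes=(), mime_types=(), collapse=None, limit=None):
    """Build a CDX query URL (hypothetical helper; the field mapping is assumed)."""
    params = [("url", url), ("matchType", match_type), ("output", "json")]
    if date_from:
        params.append(("from", date_from))
    if date_to:
        params.append(("to", date_to))
    # CDX filters take the form field:regex; several values become one alternation.
    if status_codes:
        params.append(("filter", "statuscode:" + "|".join(str(c) for c in status_codes)))
    if mime_types:
        params.append(("filter", "mimetype:" + "|".join(mime_types)))
    if collapse:
        params.append(("collapse", collapse))
    if limit:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


# Values taken from the example input above.
query_url = build_cdx_url(
    "https://example.com/blog/",
    match_type="prefix",
    date_from="20180101",
    date_to="20251231",
    status_codes=[200],
    mime_types=["text/html"],
    collapse="digest",
    limit=5000,
)
print(query_url)
```

Building the parameters as a list of pairs rather than a dict allows the `filter` key to appear more than once, which the CDX server accepts.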

Output fields

Original URL

Timestamp

Status Code

MIME Type

Content Digest

Size (bytes)

URL Key

Wayback URL
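The output fields above correspond closely to the columns the CDX server returns in its JSON output, where the first row is a header. The sketch below is an assumption about how a replay link could be derived, not the Actor's confirmed logic. It turns such rows into records and adds a Wayback URL built from the timestamp and original URL. The helper name `rows_to_records`, the `wayback_url` key, and the sample row are invented for illustration; the row is not real archive data.

```python
def rows_to_records(rows):
    """Convert CDX JSON rows (header row first) into dicts with a replay link."""
    if not rows:
        return []
    header, *data = rows
    records = []
    for row in data:
        record = dict(zip(header, row))
        # Replay links follow the pattern /web/<timestamp>/<original URL>.
        record["wayback_url"] = (
            f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
        )
        records.append(record)
    return records


# Made-up sample row in the CDX JSON layout (values are illustrative only).
sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/blog", "20200101000000", "https://example.com/blog/",
     "text/html", "200", "EXAMPLEDIGEST", "1234"],
]

records = rows_to_records(sample_rows)
print(records[0]["wayback_url"])
```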

How it works

01. Sign up on Apify

Create your Apify account to access the Wayback Machine CDX Bulk Extractor.

02. Start the run

The Actor starts automatically, using the input you provide.

03. Receive the output

Monitor progress in real time. You will be notified as soon as your dataset is complete and ready for review.

04. Integrate into your workflow

The final output is delivered in JSON, CSV, or Excel format, ready to be plugged into your workflow.

Integrate the Actor directly into your workflow

Choose from the 100+ integration options we provide, or integrate via the API.

Webhook

n8n

Make

Zapier

Airbyte

Keboola

IFTTT

HubSpot

GDrive

Gmail

Apify MCP

GitHub

Slack

LangChain

LlamaIndex

Flowise

Pinecone

OpenAI

Mastra

Clay
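For the "integrate via the API" route, a minimal sketch using only the Python standard library might look like the following. It prepares a POST request to Apify's synchronous run endpoint, which returns dataset items when the run finishes. The input keys (`url`, `matchType`, `maxSnapshots`) are hypothetical placeholders, since the Actor's real input schema is not shown on this page, and the token is a placeholder as well.

```python
import json
from urllib.request import Request

# In Apify API paths, the Actor's "username/name" is written with a tilde.
ACTOR_ID = "automation-lab~wayback-machine-cdx-extractor"


def build_run_request(token, run_input):
    """Prepare (but do not send) a request that runs the Actor synchronously."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical input keys; check the Actor's input schema for the real names.
run_input = {
    "url": "https://example.com/blog/",
    "matchType": "prefix",
    "maxSnapshots": 5000,
}

request = build_run_request("<YOUR_APIFY_TOKEN>", run_input)
print(request.get_method(), request.full_url)

# To execute the run, pass the request to urllib.request.urlopen and
# decode the JSON response body, for example:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       items = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before spending any platform credits.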