Wayback Machine Archive Scraper
Pricing: from $1.00 / 1,000 dataset items
Fetch historical snapshots of any webpage from the Internet Archive. Perfect for digital forensics and tracking deleted content.
Developer: Andok
Last modified: 16 hours ago
Archive.org Wayback Machine Scraper
Retrieve historical snapshots of websites from the Internet Archive's Wayback Machine. Perfect for historical research, content analysis, and tracking website changes over time.
Features
✅ Historical Snapshots - Get timestamped snapshots of any URL from the Internet Archive
✅ Flexible Filtering - Filter by date ranges, HTTP status codes, and content types
✅ Batch Processing - Process multiple URLs in a single run
✅ Content Retrieval - Optionally download the actual archived content
✅ Smart Deduplication - Collapse similar snapshots to reduce noise
✅ Cost Control - Built-in charge limits to control usage costs
Use Cases
- Website Evolution Analysis - Track how websites changed over time
- Content Research - Access historical versions of articles, pages, and documents
- Competitive Analysis - Study competitor website changes and strategies
- Digital Archaeology - Recover lost or deleted web content
- SEO Research - Analyze historical SEO strategies and content changes
- Legal & Compliance - Document website states for legal purposes
Input Configuration
Required Parameters
- URLs - Array of website URLs to scrape from the Wayback Machine
Optional Parameters
- From Date - Start date filter (YYYYMMDD format, e.g., "20200101")
- To Date - End date filter (YYYYMMDD format, e.g., "20240101")
- Limit - Maximum snapshots per URL (default: 100)
- Status Filter - HTTP status code filter (default: 200)
- Collapse - Snapshot deduplication rule (default: "timestamp:8", which groups by the first 8 timestamp digits, i.e., one snapshot per day)
- Fetch Content - Download archived content (default: false)
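The parameters above line up closely with the query parameters of the underlying CDX API. As a rough sketch (the mapping is an assumption; the actor's internals are not shown here), the input fields could translate to a CDX request URL like this:

```python
# Sketch: assumed mapping from the actor's input fields to Internet
# Archive CDX API query parameters (from, to, limit, filter, collapse).
from urllib.parse import urlencode

def build_cdx_query(url, from_date=None, to_date=None,
                    limit=100, status_filter=200, collapse="timestamp:8"):
    """Build a CDX API query string for one target URL."""
    params = {
        "url": url,
        "output": "json",                        # ask for JSON rows
        "limit": limit,                          # max snapshots per URL
        "filter": f"statuscode:{status_filter}", # e.g. only HTTP 200
        "collapse": collapse,                    # dedupe similar snapshots
    }
    if from_date:
        params["from"] = from_date  # YYYYMMDD
    if to_date:
        params["to"] = to_date      # YYYYMMDD
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(build_cdx_query("https://example.com", from_date="20200101"))
```

The `build_cdx_query` helper is hypothetical, but the CDX parameter names (`from`, `to`, `limit`, `filter`, `collapse`, `output=json`) are the API's real ones.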
Example Input
```json
{
  "urls": [
    "https://example.com",
    "https://another-site.com"
  ],
  "from": "20200101",
  "to": "20240101",
  "limit": 50,
  "statusFilter": 200,
  "collapse": "timestamp:8",
  "fetchContent": false
}
```
Output Format
For each URL processed, the actor returns:
```json
{
  "url": "https://example.com",
  "totalSnapshots": 245,
  "firstSnapshot": "20100315141205",
  "lastSnapshot": "20240115082344",
  "snapshots": [
    {
      "timestamp": "20100315141205",
      "archiveUrl": "https://web.archive.org/web/20100315141205/https://example.com",
      "statusCode": 200,
      "mimeType": "text/html",
      "length": 15234,
      "content": "<!DOCTYPE html>..."
    }
  ]
}
```
The `content` field is included only when `fetchContent` is true.
Technical Details
This actor uses the official Internet Archive CDX API:
- API Endpoint - http://web.archive.org/cdx/search/cdx
- No Authentication Required - free public API
- Rate Limiting - Respects Archive.org's usage guidelines
- Data Format - Processes CDX JSON responses efficiently
Snapshot Data Fields
- timestamp - When the snapshot was taken (YYYYMMDDhhmmss)
- archiveUrl - Direct link to the archived version
- statusCode - HTTP response code (200, 404, etc.)
- mimeType - Content type (text/html, application/pdf, etc.)
- length - Size of archived content in bytes
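The CDX API's `output=json` format returns rows as lists, with the first row acting as a header (`urlkey`, `timestamp`, `original`, `mimetype`, `statuscode`, `digest`, `length`). As a sketch of how one such row could be converted into the snapshot record shape shown above (the `row_to_snapshot` helper is illustrative, not the actor's actual code):

```python
# Sketch: convert one CDX JSON row into the snapshot record shape
# documented above. CDX output rows are positional; the first response
# row is the header that names each column.
def row_to_snapshot(header, row):
    rec = dict(zip(header, row))
    return {
        "timestamp": rec["timestamp"],
        # Archived copies live under web.archive.org/web/<timestamp>/<url>
        "archiveUrl": f"https://web.archive.org/web/{rec['timestamp']}/{rec['original']}",
        "statusCode": int(rec["statuscode"]),
        "mimeType": rec["mimetype"],
        "length": int(rec["length"]),
    }

header = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"]
row = ["com,example)/", "20100315141205", "https://example.com/", "text/html", "200", "ABC123", "15234"]
print(row_to_snapshot(header, row)["archiveUrl"])
```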
Performance & Costs
- Free API Access - Uses Archive.org's free CDX API
- Efficient Processing - Processes URLs in sequence to respect rate limits
- Memory Optimized - Streams large datasets without memory issues
- Charge Controls - Built-in limits prevent unexpected costs
Common Patterns
Daily Website Monitoring
```json
{
  "urls": ["https://competitor.com"],
  "collapse": "timestamp:8",
  "limit": 365
}
```
Content Recovery
```json
{
  "urls": ["https://deleted-site.com"],
  "fetchContent": true,
  "statusFilter": 200
}
```
Historical Analysis
```json
{
  "urls": ["https://company.com"],
  "from": "20100101",
  "to": "20241231",
  "collapse": "timestamp:10"
}
```
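The `collapse: "timestamp:N"` patterns above all follow one idea: keep only the first snapshot whose leading N timestamp digits haven't been seen yet, so `timestamp:8` (YYYYMMDD) keeps one snapshot per day and `timestamp:10` (YYYYMMDDhh) one per hour. A minimal sketch of that behavior (this illustrates the concept; the real deduplication happens server-side in the CDX API):

```python
# Sketch: what collapse="timestamp:N" does conceptually — drop every
# snapshot whose first N timestamp digits match an earlier snapshot.
def collapse_timestamps(timestamps, n=8):
    seen, kept = set(), []
    for ts in timestamps:
        key = ts[:n]          # e.g. n=8 -> "YYYYMMDD" (one per day)
        if key not in seen:
            seen.add(key)
            kept.append(ts)
    return kept

stamps = ["20240115082344", "20240115193012", "20240116110500"]
print(collapse_timestamps(stamps, 8))  # second same-day snapshot is dropped
```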
Getting Started
- Add URLs - List the websites you want to retrieve from the Wayback Machine
- Set Filters - Configure date ranges and content filters as needed
- Run Actor - Process and download your historical snapshot data
- Analyze Results - Use the timestamped snapshots for your research
Perfect for researchers, analysts, SEO professionals, and anyone needing access to historical web content.
Built with the Apify platform for reliable, scalable web data extraction.