Pricing

$25.00/month + usage

Website Extractor

Download complete websites and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets. Includes source code. Intended for research purposes.

Rating

0.0

(0)

Developer

mikolabs

Actor stats

Last modified: 3 months ago

Scrap Any Website with Source Code

Download complete websites and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets. Includes source code.

Features

✅ Complete Website Downloads - Downloads entire websites with all assets and source code
✅ ZIP Archive Output - Automatically creates compressed ZIP files with full source code
✅ Configurable Depth - Control how deep to follow links (1-10 levels)
✅ Rate Limiting - Respect servers with configurable download rates
✅ Domain Filtering - Stay on same domain or follow external links
✅ Content Selection - Choose to download images, videos, or just HTML/CSS/JS
✅ Robots.txt Support - Optionally respect website's robots.txt
✅ Progress Tracking - Real-time logging of scraping progress
✅ Statistics - File counts, sizes, and compression ratios

Input Configuration

Required

Website URL (`url`) - The URL to scrape (must include http:// or https://)

Optional

Parameter	Type	Default	Description
`depth`	Integer	2	How many links deep to follow (1-10)
`stayOnDomain`	Boolean	true	Only download from the same domain
`externalDepth`	Integer	0	How deep to follow external links
`connections`	Integer	4	Number of simultaneous downloads
`maxRate`	Integer	0	Max download rate in KB/s (0 = unlimited)
`maxSize`	Integer	0	Max total size in MB (0 = unlimited)
`maxTime`	Integer	0	Max scraping time in seconds (0 = unlimited)
`retries`	Integer	2	Number of retry attempts on error
`timeout`	Integer	30	Connection timeout in seconds
`getImages`	Boolean	true	Download image files
`getVideos`	Boolean	true	Download video files
`followRobots`	Boolean	true	Respect robots.txt
`outputName`	String	null	Custom output name (auto-generated if empty)
`cleanup`	Boolean	true	Remove source files after creating ZIP
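
As a rough illustration of how the defaults in the table combine with user input, here is a minimal Python sketch. The helper name `normalize_input` and its error messages are hypothetical and are not part of the Actor; only the parameter names, defaults, and the 1-10 depth range come from the table above.

```python
from urllib.parse import urlparse

# Defaults copied from the table of optional parameters above.
DEFAULTS = {
    "depth": 2,
    "stayOnDomain": True,
    "externalDepth": 0,
    "connections": 4,
    "maxRate": 0,      # KB/s, 0 = unlimited
    "maxSize": 0,      # MB, 0 = unlimited
    "maxTime": 0,      # seconds, 0 = unlimited
    "retries": 2,
    "timeout": 30,     # seconds
    "getImages": True,
    "getVideos": True,
    "followRobots": True,
    "outputName": None,
    "cleanup": True,
}


def normalize_input(raw: dict) -> dict:
    """Hypothetical helper: validate the URL and fill in documented defaults."""
    parsed = urlparse(raw.get("url", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must include http:// or https://")

    # User-supplied values override the defaults.
    config = {**DEFAULTS, **raw}
    if not 1 <= config["depth"] <= 10:
        raise ValueError("depth must be between 1 and 10")
    return config


config = normalize_input({"url": "https://example.com", "depth": 3})
print(config["depth"], config["connections"])
```

Whether the Actor itself rejects out-of-range values or silently clamps them is not stated here, so treat the checks as an assumption.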

Output

The Actor provides two types of output:

1. Dataset

Statistics and metadata for each scrape:

{
  "url": "https://example.com",
  "outputName": "example.com_20241205_130000",
  "zipFile": "example.com_20241205_130000.zip",
  "fileCount": 156,
  "totalSize": 5242880,
  "zipSize": 2621440,
  "compressionRatio": 50.0,
  "timestamp": "2024-12-05T13:00:00.000Z",
  "config": { ... },
  "status": "success"
}
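
The statistics fields above can be reproduced with a short Python sketch that zips a directory and measures the result. This is not the Actor's code: the function name `zip_with_stats` is hypothetical, and `compressionRatio` is assumed to mean the percentage of space saved, which matches the sample record (5242880 bytes down to 2621440 bytes gives 50.0).

```python
import tempfile
import zipfile
from pathlib import Path


def zip_with_stats(source_dir: Path, zip_path: Path) -> dict:
    """Zip every file under source_dir and return dataset-style statistics."""
    files = [p for p in sorted(source_dir.rglob("*")) if p.is_file()]
    total_size = sum(p.stat().st_size for p in files)

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to the site root so the archive unpacks cleanly.
            archive.write(path, arcname=path.relative_to(source_dir).as_posix())

    zip_size = zip_path.stat().st_size
    # Assumption: ratio = percentage of space saved by compression.
    ratio = round((1 - zip_size / total_size) * 100, 1) if total_size else 0.0
    return {
        "zipFile": zip_path.name,
        "fileCount": len(files),
        "totalSize": total_size,
        "zipSize": zip_size,
        "compressionRatio": ratio,
    }


# Demo on a throwaway directory; the archive is written outside the zipped folder.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    site.mkdir()
    (site / "index.html").write_text("<html>" + "hello " * 1000 + "</html>")
    (site / "style.css").write_text("body { margin: 0; }\n" * 200)
    stats = zip_with_stats(site, Path(tmp) / "site.zip")
    print(stats)
```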

2. Key-Value Store

The complete website as a ZIP archive. Access it via:

Apify Console: Storage → Key-Value Store → [filename].zip
API: https://api.apify.com/v2/key-value-stores/{storeId}/keys/{filename}.zip
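
For scripted downloads, the API URL pattern above can be filled in programmatically. The following Python sketch only builds the URL; `zip_record_url` is a hypothetical helper, `STORE_ID` is a placeholder, and any authentication a private store may require is not covered here.

```python
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def zip_record_url(store_id: str, output_name: str) -> str:
    """Build the Key-Value Store record URL for a given output name."""
    key = f"{output_name}.zip"
    # Percent-encode both path segments in case they contain reserved characters.
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/keys/{quote(key, safe='')}"
    )


# STORE_ID is a placeholder; use the store ID shown for your run.
print(zip_record_url("STORE_ID", "example.com_20241205_130000"))
```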

Usage Examples

Example 1: Basic Website Backup

{
  "url": "https://example.com",
  "depth": 2,
  "stayOnDomain": true
}

Downloads the website up to 2 levels deep, staying on the same domain.

Example 2: Deep Archive with External Links

{
  "url": "https://example.com",
  "depth": 5,
  "externalDepth": 1,
  "stayOnDomain": false
}

Downloads 5 levels deep on the starting domain and follows external links one level deep.

Example 3: Fast Scrape (HTML/CSS/JS Only)

{
  "url": "https://example.com",
  "depth": 3,
  "getImages": false,
  "getVideos": false,
  "connections": 8
}

Fast scraping without images or videos, using 8 parallel connections.

Example 4: Rate-Limited Polite Scrape

{
  "url": "https://example.com",
  "depth": 2,
  "maxRate": 500,
  "connections": 2,
  "followRobots": true
}

Polite scraping with rate limiting and respecting robots.txt.
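
To make "respecting robots.txt" concrete, the sketch below uses Python's standard `urllib.robotparser` to check which URLs a sample rule set allows. It is only an illustration: HTTrack applies its own robots.txt handling, and the rules and the `allowed` helper shown here are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules invented for this example; a real site serves its own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the sample rules permit fetching the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/report.html"))
```

With `followRobots` set to true, paths like the second one would be expected to stay out of the archive.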

Example 5: Time-Limited Scrape

{
  "url": "https://example.com",
  "depth": 10,
  "maxTime": 300,
  "maxSize": 100
}

Stops after 5 minutes or 100 MB, whichever comes first.

How It Works

1. Input Validation - Validates the URL and configuration
2. HTTrack Execution - Runs HTTrack with configured parameters to download website source code
3. Progress Monitoring - Logs progress in real-time
4. Pre-ZIP Cleanup - Removes HTTrack cache files and index files before archiving
5. ZIP Creation - Creates a compressed archive of all website files and source code
6. Storage - Saves ZIP to Key-Value Store and stats to Dataset
7. Post-ZIP Cleanup - Optionally removes temporary files after ZIP creation
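
The HTTrack Execution step can be pictured as translating the input configuration into an HTTrack command line. The Actor's actual mapping is not documented here, so the Python sketch below is an assumption: `build_httrack_args` is a hypothetical helper, the flags are HTTrack's short options as listed in its own help output (verify with `httrack --help`), and the unit conversions (KB/s to bytes/s, MB to bytes) are guesses at how the documented units would be applied.

```python
import shlex


def build_httrack_args(config: dict, output_dir: str) -> list[str]:
    """Hypothetical mapping from Actor input to an HTTrack command line."""
    args = ["httrack", config["url"], "-O", output_dir]
    args.append(f"-r{config.get('depth', 2)}")            # mirror depth
    args.append(f"-%e{config.get('externalDepth', 0)}")   # external links depth
    args.append(f"-c{config.get('connections', 4)}")      # simultaneous connections
    args.append(f"-R{config.get('retries', 2)}")          # retries on error
    args.append(f"-T{config.get('timeout', 30)}")         # timeout in seconds
    # -s2 follows robots.txt, -s0 ignores it.
    args.append("-s2" if config.get("followRobots", True) else "-s0")

    # A value of 0 means "unlimited", so the limit flags are simply omitted.
    if config.get("maxRate"):
        args.append(f"-A{config['maxRate'] * 1024}")          # KB/s -> bytes/s
    if config.get("maxSize"):
        args.append(f"-M{config['maxSize'] * 1024 * 1024}")   # MB -> bytes
    if config.get("maxTime"):
        args.append(f"-E{config['maxTime']}")                 # seconds
    return args


polite = {
    "url": "https://example.com",
    "depth": 2,
    "maxRate": 500,
    "connections": 2,
    "followRobots": True,
}
print(shlex.join(build_httrack_args(polite, "/tmp/mirror")))
```

The resulting list could be passed to `subprocess.run`; that call is left out so the sketch runs without HTTrack installed.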

Technical Details

Based On

HTTrack 3.49+ - Industry-standard website copier
Python 3.11 - Modern async Python runtime
Apify SDK 2.7+ - For Actor integration and storage

Limitations

Some JavaScript-heavy SPAs may not download completely
Websites with aggressive bot protection may block scraping
Dynamic content loaded after page load may be missed
Maximum recommended depth is 5-6 for most websites

Performance

Small websites (< 100 pages): 1-5 minutes
Medium websites (100-1000 pages): 5-30 minutes
Large websites (1000+ pages): 30+ minutes

Performance depends on:

Website size and structure
Number of connections
Network speed
Rate limiting settings

Legal and Ethical Considerations

⚠️ Important: Always ensure you have permission to scrape websites.

✅ Respect robots.txt files (enabled by default)
✅ Don't overload servers (use rate limiting)
✅ Check website Terms of Service
✅ Don't scrape copyrighted content without permission
✅ Use reasonable connection limits (2-8)

Troubleshooting

Scraping Takes Too Long

Reduce depth to 1 or 2
Disable getVideos and getImages
Increase connections (but be respectful)
Set maxTime or maxSize limits

ZIP File Too Large

Reduce depth
Disable getVideos
Set maxSize limit
Use maxTime to stop early

Website Blocks Scraping

Enable followRobots
Reduce connections to 2-4
Add rate limiting with maxRate
Increase timeout if connections are slow

Missing Content

Increase depth
Enable externalDepth if content is on other domains
Check whether the website relies heavily on JavaScript (dynamically loaded content may not be captured)
Enable getImages and getVideos if needed

Development

Local Testing

# Install dependencies
pip install -r requirements.txt

# Run locally
apify run

Building

# Build Docker image
docker build -t httrack-scraper .

# Run container
docker run httrack-scraper

Support

For issues or questions:

Check Actor logs for detailed error messages
Review HTTrack documentation: https://www.httrack.com/
Contact Apify support through the platform

License

This Actor uses HTTrack, which is licensed under GPL v3.

Version History

1.0 - Initial release with full HTTrack integration, source code download, and ZIP archive output
