Scrape Any Website with Source Code
Download complete websites and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets. Includes source code.
Features
✅ Complete Website Downloads - Downloads entire websites with all assets and source code
✅ ZIP Archive Output - Automatically creates compressed ZIP files with full source code
✅ Configurable Depth - Control how deep to follow links (1-10 levels)
✅ Rate Limiting - Respect servers with configurable download rates
✅ Domain Filtering - Stay on same domain or follow external links
✅ Content Selection - Choose to download images, videos, or just HTML/CSS/JS
✅ Robots.txt Support - Optionally respect website's robots.txt
✅ Progress Tracking - Real-time logging of scraping progress
✅ Statistics - File counts, sizes, and compression ratios
Input Configuration
Required
- Website URL - The URL to scrape (must include http:// or https://)
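The smallest valid input is therefore just the URL; every other field falls back to the defaults listed below:

```json
{
  "url": "https://example.com"
}
```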
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| depth | Integer | 2 | How many links deep to follow (1-10) |
| stayOnDomain | Boolean | true | Only download from the same domain |
| externalDepth | Integer | 0 | How deep to follow external links |
| connections | Integer | 4 | Number of simultaneous downloads |
| maxRate | Integer | 0 | Max download rate in KB/s (0 = unlimited) |
| maxSize | Integer | 0 | Max total size in MB (0 = unlimited) |
| maxTime | Integer | 0 | Max scraping time in seconds (0 = unlimited) |
| retries | Integer | 2 | Number of retry attempts on error |
| timeout | Integer | 30 | Connection timeout in seconds |
| getImages | Boolean | true | Download image files |
| getVideos | Boolean | true | Download video files |
| followRobots | Boolean | true | Respect robots.txt |
| outputName | String | null | Custom output name (auto-generated if empty) |
| cleanup | Boolean | true | Remove source files after creating ZIP |
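For reference, an input that spells out every optional parameter at its documented default looks like this (outputName is omitted, since an empty value triggers auto-generation):

```json
{
  "url": "https://example.com",
  "depth": 2,
  "stayOnDomain": true,
  "externalDepth": 0,
  "connections": 4,
  "maxRate": 0,
  "maxSize": 0,
  "maxTime": 0,
  "retries": 2,
  "timeout": 30,
  "getImages": true,
  "getVideos": true,
  "followRobots": true,
  "cleanup": true
}
```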
Output
The Actor provides two types of output:
1. Dataset
Statistics and metadata for each scrape:
{"url": "https://example.com","outputName": "example.com_20241205_130000","zipFile": "example.com_20241205_130000.zip","fileCount": 156,"totalSize": 5242880,"zipSize": 2621440,"compressionRatio": 50.0,"timestamp": "2024-12-05T13:00:00.000Z","config": { ... },"status": "success"}
2. Key-Value Store
The complete website as a ZIP archive. Access it via:
- Apify Console: Storage → Key-Value Store → [filename].zip
- API: https://api.apify.com/v2/key-value-stores/{storeId}/records/{filename}.zip
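As a sketch, both outputs can also be fetched with the Apify Python client (apify-client); the API token and run ID below are placeholders:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_API_TOKEN>")   # placeholder token

# The run object of a finished scrape carries both storage IDs.
run = client.run("<RUN_ID>").get()         # placeholder run ID

# 1. Dataset: statistics for each scrape.
items = client.dataset(run["defaultDatasetId"]).list_items().items
stats = items[0]
print(stats["zipFile"], stats["fileCount"], stats["compressionRatio"])

# 2. Key-Value Store: the ZIP archive itself (the record value is bytes for ZIP files).
record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record(stats["zipFile"])
with open(stats["zipFile"], "wb") as f:
    f.write(record["value"])
```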
Usage Examples
Example 1: Basic Website Backup
{"url": "https://example.com","depth": 2,"stayOnDomain": true}
Downloads the website up to 2 levels deep, staying on the same domain.
Example 2: Deep Archive with External Links
{"url": "https://example.com","depth": 5,"externalDepth": 1,"stayOnDomain": false}
Downloads 5 levels deep and follows external links 1 level.
Example 3: Fast Scrape (HTML/CSS/JS Only)
{"url": "https://example.com","depth": 3,"getImages": false,"getVideos": false,"connections": 8}
Fast scraping without images or videos, using 8 parallel connections.
Example 4: Rate-Limited Polite Scrape
{"url": "https://example.com","depth": 2,"maxRate": 500,"connections": 2,"followRobots": true}
Polite scraping with rate limiting and respecting robots.txt.
Example 5: Time-Limited Scrape
{"url": "https://example.com","depth": 10,"maxTime": 300,"maxSize": 100}
Stops after 5 minutes or 100 MB, whichever comes first.
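These inputs can also be submitted programmatically. A minimal sketch using the Apify Python client with Example 4's input; the actor ID and token are placeholders:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_API_TOKEN>")  # placeholder token

# Start the Actor with Example 4's "polite" input and wait for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input={
    "url": "https://example.com",
    "depth": 2,
    "maxRate": 500,
    "connections": 2,
    "followRobots": True,
})
print("Run finished with status:", run["status"])
```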
How It Works
- Input Validation - Validates the URL and configuration
- HTTrack Execution - Runs HTTrack with the configured parameters to download the website source code (see the command sketch after this list)
- Progress Monitoring - Logs progress in real-time
- Pre-ZIP Cleanup - Removes HTTrack cache files and index files before archiving
- ZIP Creation - Creates a compressed archive of all website files and source code
- Storage - Saves ZIP to Key-Value Store and stats to Dataset
- Post-ZIP Cleanup - Optionally removes temporary files after ZIP creation
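As an illustration of the HTTrack execution step, a wrapper along these lines could translate the Actor's input fields into HTTrack command-line flags. The flag mapping below is an assumption based on the HTTrack manual, not the Actor's actual code:

```python
import subprocess

def build_httrack_cmd(config: dict, output_dir: str) -> list[str]:
    """Translate Actor input fields into HTTrack CLI flags (illustrative subset)."""
    cmd = [
        "httrack", config["url"],
        "-O", output_dir,                                  # output directory
        f"-r{config.get('depth', 2)}",                     # mirror depth
        f"-%e{config.get('externalDepth', 0)}",            # external link depth
        f"-c{config.get('connections', 4)}",               # simultaneous connections
        f"-T{config.get('timeout', 30)}",                  # connection timeout (seconds)
        f"-R{config.get('retries', 2)}",                   # retries on error
        "-s2" if config.get("followRobots", True) else "-s0",  # robots.txt policy
    ]
    if config.get("maxRate", 0):
        cmd.append(f"-A{config['maxRate'] * 1024}")        # HTTrack takes bytes/s; input is KB/s
    if config.get("maxSize", 0):
        cmd.append(f"-M{config['maxSize'] * 1024 * 1024}") # HTTrack takes bytes; input is MB
    if config.get("maxTime", 0):
        cmd.append(f"-E{config['maxTime']}")               # max mirror time in seconds
    if config.get("stayOnDomain", True):
        host = config["url"].split("/")[2]                 # naive host extraction
        cmd.append(f"+*{host}/*")                          # filter: keep crawl on the start host
    return cmd

# Example: mirror example.com with default settings into /tmp/mirror
subprocess.run(build_httrack_cmd({"url": "https://example.com"}, "/tmp/mirror"), check=True)
```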
Technical Details
Based On
- HTTrack 3.49+ - Industry-standard website copier
- Python 3.11 - Modern async Python runtime
- Apify SDK 2.7+ - For Actor integration and storage
Limitations
- Some JavaScript-heavy SPAs may not download completely
- Websites with aggressive bot protection may block scraping
- Dynamic content loaded after page load may be missed
- Maximum recommended depth is 5-6 for most websites
Performance
- Small websites (< 100 pages): 1-5 minutes
- Medium websites (100-1000 pages): 5-30 minutes
- Large websites (1000+ pages): 30+ minutes
Performance depends on:
- Website size and structure
- Number of connections
- Network speed
- Rate limiting settings
Legal and Ethical Considerations
⚠️ Important: Always ensure you have permission to scrape websites.
- ✅ Respect robots.txt files (enabled by default)
- ✅ Don't overload servers (use rate limiting)
- ✅ Check website Terms of Service
- ✅ Don't scrape copyrighted content without permission
- ✅ Use reasonable connection limits (2-8)
Troubleshooting
Scraping Takes Too Long
- Reduce depth to 1 or 2
- Disable getVideos and getImages
- Increase connections (but be respectful)
- Set maxTime or maxSize limits
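Combining those tweaks, a constrained input could look like:

```json
{
  "url": "https://example.com",
  "depth": 1,
  "getImages": false,
  "getVideos": false,
  "connections": 8,
  "maxTime": 600
}
```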
ZIP File Too Large
- Reduce depth
- Disable getVideos
- Set a maxSize limit
- Use maxTime to stop early
Website Blocks Scraping
- Enable followRobots
- Reduce connections to 2-4
- Add rate limiting with maxRate
- Increase timeout if connections are slow
Missing Content
- Increase depth
- Enable externalDepth if content is on other domains
- Check if the website uses heavy JavaScript (may not work)
- Enable getImages and getVideos if needed
Development
Local Testing
```bash
# Install dependencies
pip install -r requirements.txt

# Run locally
apify run
```
Building
```bash
# Build Docker image
docker build -t httrack-scraper .

# Run container
docker run httrack-scraper
```
Support
For issues or questions:
- Check Actor logs for detailed error messages
- Review HTTrack documentation: https://www.httrack.com/
- Contact Apify support through the platform
License
This Actor uses HTTrack, which is licensed under GPL v3.
Version History
- 1.0 - Initial release with full HTTrack integration, source code download, and ZIP archive output