Zip Download Extraction Scraper avatar
Zip Download Extraction Scraper

Pricing

from $0.01 / 1,000 results

Go to Apify Store
Zip Download Extraction Scraper

Zip Download Extraction Scraper

Download ZIP files from URLs and automatically extract their contents with advanced features like retry logic, password protection, duplicate handling, and real-time progress tracking.

Pricing

from $0.01 / 1,000 results

Rating

5.0

(1)

Developer

anuj upadhyay

anuj upadhyay

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

0

Monthly active users

3 days ago

Last modified

Share

πŸ“¦ ZIP Download & Extraction Actor

The most efficient, reliable, and developer-friendly ZIP extraction solution on Apify


🌟 Why Choose This Actor?

The ZIP Download & Extraction Actor delivers enterprise-grade performance with these advanced capabilities:

Performance & Reliability: Built optimized for high-throughput processing with intelligent retry logic and exponential backoff handling.

Cost-Effective: Pay-per-use model with no monthly rentals. Scale up or down based on your actual needs without being locked into expensive subscriptions.

Lightning-Fast Extraction: Download and extract ZIP files from any URL with blazing-fast performance. Process gigabytes of archived data in seconds, not minutes, with intelligent chunking and optimization.

Precision Targeting & Advanced Filtering: Extract only what you need with file type filtering, password protection support, and smart duplicate handling. Get precisely the files you need, when you need them.

Rich, Structured Metadata: Extract complete file manifests including paths, sizes, types, and timestamps. Our advanced parsing ensures you get clean, structured data ready for immediate use.

Enterprise-Grade Configuration & Flexibility: Built for developers and businesses who demand reliability. Highly configurable with intuitive controls, comprehensive error handling, and robust logging. Focus on your business logic while we handle the complexity of ZIP extraction.

No Hidden Costs: We do not charge monthly rentals. Our actor operates on a transparent pay-per-use model based on compute units consumed.


πŸš€ Features

Core Capabilities

  • Universal ZIP Support: Download from any accessible URL with automatic retry logic
  • Password Protection: Full support for encrypted archives (PKWARE, AES-128/192/256)
  • Batch Processing: Process multiple ZIP files in a single run
  • Smart Pagination: Automatic handling of large archives with progress tracking
  • File Type Filtering: Extract only specific file extensions to save time and storage

Data Quality

  • Clean Output: Structured JSON metadata for production-ready integration
  • Integrity Validation: CRC checksum verification and ZIP structure validation
  • Path Security: Built-in path traversal protection against malicious archives
  • Detailed Manifests: Complete file metadata including size, type, and timestamps
  • Summary Reports: Comprehensive processing statistics and error tracking

πŸ“– Usage Examples

Basic Extraction Example

Extract a single ZIP file with default settings.

{
"urls": [
{"url": "https://example.com/archive.zip"}
]
}

Advanced Extraction Example

Batch process multiple password-protected ZIPs with file filtering and custom timeout.

{
"urls": [
{"url": "https://example.com/backup1.zip"},
{"url": "https://example.com/backup2.zip"},
{"url": "https://example.com/backup3.zip"}
],
"password": "MySecurePassword",
"file_type_filter": "pdf,docx,xlsx",
"handle_duplicates": "rename",
"timeout": 600,
"keep_zip": false
}

Memory-Only Processing Example

Extract metadata without storing files (perfect for inventory/audit).

{
"urls": [
{"url": "https://example.com/large-archive.zip"}
],
"extract_to_memory": true,
"file_type_filter": "csv,json"
}

πŸ” Input Configuration

Input Parameters

ParameterTypeRequiredDefaultDescription
urlsArrayβœ…Sample ZIPArray of URL objects: [{"url": "https://..."}]
extract_to_memoryBoolean❌falseDelete files after processing (metadata only mode)
keep_zipBoolean❌falseRetain downloaded ZIP file after extraction
passwordString❌nullPassword for encrypted archives
handle_duplicatesString❌"rename"Strategy: rename | skip | overwrite
timeoutNumber❌300Download timeout in seconds
file_type_filterString❌""Comma-separated extensions (e.g., "pdf,jpg,png")

Duplicate Handling Strategies

StrategyBehaviorExample
renameAppends numeric suffixfile.txt β†’ file_1.txt, file_2.txt
skipKeeps first occurrenceFirst file.txt kept, others ignored
overwriteKeeps last occurrenceLatest file.txt overwrites previous

πŸ“Š Output Format

Individual ZIP Result Structure

{
"success": true,
"url": "https://example.com/archive.zip",
"filename": "archive.zip",
"files_extracted": 42,
"bytes_downloaded": 5242880,
"processing_time_seconds": 12.34,
"extracted_files": [
{
"path": "subfolder/document.pdf",
"size": 1048576,
"type": ".pdf",
"modified": "2024-12-20T15:30:00"
}
],
"skipped_files": 0,
"corrupted_files": 0,
"timestamp": "2024-12-20T15:35:00"
}

Summary Report Structure

{
"report_type": "summary",
"total_urls_processed": 5,
"successful_extractions": 4,
"failed_extractions": 1,
"total_bytes_downloaded": 52428800,
"total_files_extracted": 215,
"total_skipped_files": 3,
"total_corrupted_files": 0,
"total_errors": 1,
"processing_duration_seconds": 67.89,
"timestamp": "2024-12-27T10:30:00Z"
}

Error Output Structure

{
"success": false,
"url": "https://example.com/broken.zip",
"error": "Invalid or corrupted zip file",
"error_type": "BadZipFile",
"timestamp": "2024-12-27T10:30:00Z"
}

🎯 Use Cases

πŸ“Š Data Extraction & ETL Pipelines

Automate extraction of data files (CSV, JSON, Parquet) from scheduled backups, API downloads, and database exports. Benefits: Eliminate manual steps, schedule automated ingestion, integrate seamlessly with data workflows.

πŸ’Ύ Backup & Disaster Recovery

Process bulk backup archives for cloud restoration, integrity validation, and selective file recovery. Benefits: Rapid disaster recovery, automated testing, selective restoration.

🌐 Web Scraping Workflows

Extract downloadable content from websites including bulk documents, archives, and datasets. Benefits: Automate repetitive downloads, process batches overnight, chain with other Apify actors.

πŸ“„ Document Processing

Extract and process documents at scale: PDF archives, image collections, invoices, and legal documents. Benefits: Batch processing, automated categorization, OCR/AI integration ready.

πŸ—„οΈ Database Dumps

Handle large database export archives: SQL dumps, MongoDB exports, PostgreSQL/MySQL backups. Benefits: Automated decompression, GB+ file support, enterprise reliability.

🎬 Media Asset Management

Process bulk media archives: photo galleries, video collections, audio libraries, design packages. Benefits: Fast large-file extraction, type filtering, organized structure.


πŸ”§ Advanced Features

πŸ”„ Automatic Retry with Exponential Backoff

Failed downloads automatically retry with increasing wait times (2s β†’ 4s β†’ 8s β†’ 16s). Maximum 3 attempts before marking as failed. Handles temporary network issues gracefully.

βœ… ZIP Integrity Validation

Multi-stage verification: pre-extraction structure checks, CRC checksum validation, size verification, and early corruption detection.

πŸ›‘οΈ Path Traversal Protection

Security features prevent malicious ZIPs from writing outside extraction directory. Blocks ../ patterns, absolute paths, and sanitizes all filenames.

πŸ” Password-Protected Archives

Supports Traditional PKWARE, AES-128, AES-192, and AES-256 encryption. Simply provide the password in input configuration.

πŸ“Š Real-Time Progress Tracking

Monitor extraction with detailed breakdowns: download progress with speed metrics, extraction progress every 10%, file counts, and error notifications.

πŸ—‚οΈ File Type Filtering

Extract only specific file types to save time and storage. Supports any file extension. Benefits: Faster processing, reduced storage, focused extraction.


πŸ› οΈ Troubleshooting

❌ "HTTP 404: URL not found"

Problem: ZIP file URL returns 404 error
Solution: Verify URL accessibility, test in browser, check authentication requirements, ensure direct ZIP link (not download page)

⏱️ "Timeout downloading ZIP"

Problem: Download exceeded timeout limit
Solution: Increase timeout parameter (default: 300s). For files >500MB, try timeout: 900 or higher. Check network stability.

πŸ” "Bad password for encrypted file"

Problem: Incorrect password provided
Solution: Verify case-sensitive password, check for spaces, ensure supported encryption (PKWARE/AES), test locally first.

πŸ—œοΈ "Invalid or corrupted ZIP file"

Problem: ZIP file damaged or invalid
Solution: Download locally and verify, check with unzip -t, ensure complete download, verify actual ZIP content (not HTML error).

πŸ“ "Extraction size exceeds limit"

Problem: Extracted content exceeds size limits
Solution: Use file_type_filter for selective extraction, enable extract_to_memory for metadata-only, split large archives, contact support for limits.

πŸ”„ "Multiple extraction failures"

Problem: Several ZIPs failing extraction
Solution: Check Actor logs for errors, verify URL accessibility, test single known-good ZIP first, ensure storage space, check network connectivity.


πŸ’¬ Support & Community

πŸ› Report Bugs

Found a bug? Create an issue

Please include: ZIP URL, complete error message, expected vs actual behavior, file size/count, input configuration JSON

πŸ’‘ Request Features

Have an idea? Submit a feature request

Popular requests: 7z/rar/tar.gz support, cloud storage integration, content scanning, regex filtering, enhanced analytics

πŸ“š Documentation & Resources

πŸ‘₯ Community Support


Performance Benchmarks

Download: Network limited (10-100 Mbps), 8KB chunked, automatic retry, parallel processing
Extraction: ~50-200 files/second, SSD-backed, optimized for bulk operations
Resources: ~256MB base + file sizes, temporary storage, low-moderate CPU, bandwidth dependent
Scalability: Process 1000+ files/run, handle multi-GB archives, concurrent processing, automatic management

Optimization Tips

  1. Use file type filtering to extract only needed files
  2. Enable extract_to_memory for metadata-only mode
  3. Batch multiple URLs in single run for efficiency
  4. Increase timeout for large files to avoid re-runs
  5. Use appropriate duplicate strategy to minimize processing

License

This Actor is licensed under the LICENSE - free for commercial and personal use.

Privacy & Security

Data Handling: No ZIP content logged/stored, isolated temporary processing, extracted files deleted after run, only metadata saved to datasets.

Security: Path traversal protection, ZIP bomb protection, CRC validation, isolated execution, no external transmission.

Platform: SOC 2 Type II, GDPR compliant, ISO 27001 certified, regular security audits.

Read Apify Security Documentation

You are responsible for content you download/extract. Respect intellectual property rights, ensure access permissions, comply with source website terms, use only for lawful purposes. Tool provided "as-is" without warranty.

Responsible Use

βœ… Do: Extract permitted files, legitimate automation, respect rate limits, comply with laws
❌ Don't: Download copyrighted content without permission, bypass access controls, overload servers, malicious use


🀝 Contributing

We welcome contributions! Report bugs, suggest features, submit code improvements.

Quick Start:

  1. Fork repository: git clone https://github.com/anuj123upadhyay/zip-extractor-actor.git
  2. Create branch: git checkout -b feature/your-feature-name
  3. Make changes (follow code style, add tests, update docs)
  4. Test locally: apify run
  5. Submit pull request with clear description

Other ways to help: Fix documentation typos, add usage examples, translate content, create tutorials, star repository, share with others.


πŸ“ Changelog

v1.0 (Current - December 2024)

πŸŽ‰ Initial Release

  • ZIP download and extraction functionality
  • Password-protected archive support
  • Multiple URL batch processing
  • Smart duplicate file handling
  • File type filtering
  • Real-time progress tracking
  • Comprehensive error handling
  • Detailed output manifests

πŸ› Bug Fixes: URL extraction from requestListSources, empty input handling, version format compatibility, improved retry logic

πŸ”’ Security: Path traversal protection, ZIP integrity validation, CRC verification, enhanced error reporting

⚑ Performance: Optimized Docker image (Python 3.11), memory-efficient downloads, faster extraction, reduced footprint

Upcoming Features (Roadmap)

  • Support for 7z, tar.gz, rar formats
  • Direct cloud storage integration (S3, GCS, Azure)
  • Content validation and virus scanning
  • Advanced regex filtering
  • Enhanced analytics and reporting
  • Incremental extraction support

πŸŽ‰ Ready to Get Started?

Check out the Github repo for quickstart examples and advanced configurations.


Made with ❀️ for the Apify community

Transform your ZIP extraction workflows with the most reliable and efficient solution on the market. Last Updated: 2024.12.27

πŸš€ Run on Apify Β· ⭐ Star on GitHub Β· πŸ’¬ Get Support