HTML/Website Media Scraper
Pricing
$8.00/month + usage
The Advanced HTML/Website Media Scraper is a comprehensive media extraction tool that extracts images, videos, audio, documents, archives, e-books, fonts, apps, and contact information from websites. It features advanced filtering, proxy support, and detailed analytics.
Rating
4.8
(3)
Developer

$crypt
Actor stats
9
Bookmarked
247
Total users
8
Monthly active users
4 days ago
Last modified
HTML/Website Media Scraper v2.0
Overview
The Advanced Website Media Scraping Tool is a comprehensive utility designed to extract various media files and information from multiple websites. This enhanced version supports a wide range of media types including images, videos, audio, documents, archives, e-books, fonts, applications, and contact information. It provides advanced filtering options, detailed analytics, and flexible output formats.
Supported Media Types
The actor supports 90+ file formats across 8 media categories, plus contact information.
Video Files
- mp4, webm, mkv, mov, avi, flv, 3gp, ogv, m4v, wmv, mpg, mpeg, f4v
Audio Files
- mp3, wav, ogg, aac, flac, m4a, wma, opus, aiff, au, ra
Image Files
- jpg, jpeg, png, gif, webp, svg, bmp, apng, heic, heif, tiff, ico, avif
Document Files
- pdf, doc, docx, ppt, pptx, xls, xlsx, csv, odt, ods, odp, rtf, md, txt, json, xml
Archive Files
- zip, rar, tar, gz, 7z, bz2, xz, lz, lzma, cab, deb, rpm
E-book Files
- epub, mobi, azw3, fb2, lit, pdb, prc, azw, kf8
Font Files
- ttf, otf, woff, woff2, eot, pfb, pfm, afm
Application Files
- apk, xapk, ipa, exe, msi, dmg, pkg, deb, rpm, snap, appimage
Contact Information (Extra)
- Email addresses, phone numbers, social media profiles (Twitter, Facebook, Instagram, LinkedIn)
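As a rough illustration of how the categories above could be applied, the sketch below maps a URL's file extension to its category. The extension lists are copied from this section; the category keys, the `classify` helper, and the lookup logic are assumptions for illustration, not the actor's actual implementation.

```python
import posixpath
from urllib.parse import urlparse

# Extension lists copied from the categories above.
# The dictionary keys and the lookup itself are hypothetical.
MEDIA_EXTENSIONS = {
    "videos": "mp4 webm mkv mov avi flv 3gp ogv m4v wmv mpg mpeg f4v".split(),
    "audios": "mp3 wav ogg aac flac m4a wma opus aiff au ra".split(),
    "images": "jpg jpeg png gif webp svg bmp apng heic heif tiff ico avif".split(),
    "documents": "pdf doc docx ppt pptx xls xlsx csv odt ods odp rtf md txt json xml".split(),
    "archives": "zip rar tar gz 7z bz2 xz lz lzma cab deb rpm".split(),
    "ebooks": "epub mobi azw3 fb2 lit pdb prc azw kf8".split(),
    "fonts": "ttf otf woff woff2 eot pfb pfm afm".split(),
    "apps": "apk xapk ipa exe msi dmg pkg deb rpm snap appimage".split(),
}


def classify(url):
    """Return every category whose extension list contains the URL's extension."""
    path = urlparse(url).path  # ignores the query string and fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return [cat for cat, exts in MEDIA_EXTENSIONS.items() if ext in exts]


print(classify("https://example.com/files/report.PDF?dl=1"))
print(classify("https://example.com/pkg/tool.deb"))
```

Note that deb and rpm are listed under both Archive Files and Application Files, so a lookup like this returns two categories for them. How the actor resolves that overlap is not documented here.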
Pricing Model
This actor uses a pay-per-event pricing model where you only pay for media items successfully extracted and processed.
Pricing: $0.001 per media item (images, videos, documents, etc.)
What counts as a billable event:
- Each image, video, audio file, document, archive, e-book, font, app, or contact successfully extracted
- SVG elements converted to data URLs
- Canvas elements with extracted metadata
- Duplicate items are removed before billing (you don't pay for duplicates)
- Failed extractions or inaccessible media don't count
Billing Examples:
- Extract 50 images from a website = $0.05
- Process 200 media items across multiple sites = $0.20
- Large batch job with 1,000+ media items = $1.00+
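The arithmetic behind these examples can be sketched as below. The $0.001 rate comes from this section; the `estimate_cost` helper and the choice to treat items with the same URL as duplicates are assumptions, since the actor's actual duplicate key is not documented here. The estimate covers usage only and excludes the monthly fee.

```python
from decimal import Decimal

PRICE_PER_ITEM = Decimal("0.001")  # per-item rate stated above


def estimate_cost(item_urls):
    """Estimate usage cost in USD, counting each distinct URL once."""
    unique_items = set(item_urls)  # assumption: duplicates share the same URL
    return len(unique_items) * PRICE_PER_ITEM


images = [f"https://example.com/img/{i}.jpg" for i in range(50)]
print(estimate_cost(images))           # 50 unique items
print(estimate_cost(images + images))  # same 50 items listed twice
```

Both calls give the same amount, because the repeated items are collapsed before the count is multiplied by the rate.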
Transparent Billing:
- Real-time billing statistics in logs
- Detailed billing summary at job completion
- No hidden fees or minimum charges
- Pay only for successful extractions
Key Features
- Comprehensive Media Detection: Automatically identifies and extracts all supported media types
- Advanced Filtering: Filter by file size, type, dimensions, and custom criteria
- Contact Extraction: Automatically finds email addresses, phone numbers, and social media profiles
- Background Image Detection: Extracts images from CSS background-image properties
- Lazy Loading Support: Detects images with data-src and data-lazy-src attributes
- Performance Analytics: Detailed statistics and performance monitoring
- Error Handling: Robust error handling with retry logic and blocked URL detection
- Flexible Output: Configurable output with summary statistics and metadata
- Proxy Support: Built-in proxy support for websites that block bots
- Rate Limiting: Respectful crawling with configurable delays
- Batch Processing: Efficient processing of large URL lists with progress tracking and resumption
- Duplicate Detection: Advanced algorithms to identify and remove duplicate media
- Media Validation: Health checks and accessibility validation for media files
- Custom Selectors: User-defined CSS selectors for specialized media extraction
- SVG Conversion: Convert SVG elements to data URLs with security sanitization (currently creates data URLs only; full rasterization planned)
- Canvas Detection: Identify and extract canvas element metadata (conversion creates placeholders; full rasterization planned)
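To make the lazy-loading and background-image features more concrete, here is a minimal sketch using only Python's standard library. It collects `src`, `data-src`, and `data-lazy-src` from `img` tags and `url(...)` values from inline `background-image` styles. The `ImageFinder` class and the regular expression are illustrative assumptions; the actor's own detection may differ and may also cover external stylesheets.

```python
import re
from html.parser import HTMLParser

# Matches url(...) inside an inline "background-image" declaration.
BG_URL = re.compile(r"background-image\s*:\s*url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", re.I)
IMG_ATTRS = ("src", "data-src", "data-lazy-src")


class ImageFinder(HTMLParser):
    """Collect (source attribute, URL) pairs from img tags and inline styles."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            for name in IMG_ATTRS:
                if attrs.get(name):
                    self.found.append((name, attrs[name]))
        # Inline styles on any element may carry a background image.
        for match in BG_URL.finditer(attrs.get("style") or ""):
            self.found.append(("background-image", match.group(1)))


html = '''
<img src="placeholder.gif" data-src="/real/photo.jpg">
<div style="background-image: url('/img/hero.png')"></div>
'''
finder = ImageFinder()
finder.feed(html)
for source, url in finder.found:
    print(source, url)
```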
Input Configuration
The actor accepts a comprehensive configuration object with the following sections:
Basic Settings
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxRequestsPerCrawl": 100,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
Media Type Selection
{
  "mediaTypes": ["images", "videos", "audios", "documents", "archives", "ebooks", "fonts", "apps", "contacts"]
}
Image Options
{
  "imageOptions": {
    "includeBackgroundImages": true,
    "minImageSize": 50,
    "includeDataUrls": false
  }
}
File Filtering
{
  "fileFilters": {
    "maxFileSize": 100,
    "allowedExtensions": [],
    "blockedExtensions": []
  }
}
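One plausible reading of these filter fields is sketched below. It assumes that `maxFileSize` is given in megabytes, that an empty `allowedExtensions` list means "allow everything", and that `blockedExtensions` takes precedence. None of these semantics are confirmed by this page, and `passes_filters` is a hypothetical helper.

```python
def passes_filters(extension, size_bytes, filters):
    """Return True if a file with this extension and size would be kept."""
    ext = extension.lower().lstrip(".")
    allowed = {e.lower() for e in filters.get("allowedExtensions", [])}
    blocked = {e.lower() for e in filters.get("blockedExtensions", [])}
    # Assumption: maxFileSize is expressed in megabytes; 0 or missing = no limit.
    max_bytes = filters.get("maxFileSize", 0) * 1024 * 1024

    if ext in blocked:
        return False
    if allowed and ext not in allowed:
        return False
    if max_bytes and size_bytes > max_bytes:
        return False
    return True


filters = {"maxFileSize": 100, "allowedExtensions": [], "blockedExtensions": ["exe"]}
print(passes_filters("pdf", 5 * 1024 * 1024, filters))
print(passes_filters(".EXE", 10, filters))
```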
Contact Extraction
{
  "contactExtraction": {
    "extractContacts": true,
    "includeEmails": true,
    "includePhones": true,
    "includeSocialMedia": true
  }
}
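The sketch below shows how these switches might drive a simple pattern-based extractor. The regular expressions are deliberately loose and are assumptions, not the actor's own patterns; the phone pattern in particular will also match other digit runs such as dates. The `"phone"` and `"social"` type names are hypothetical; only `"email"` appears in the output sample on this page.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # loose: also matches dates, IDs
SOCIAL = re.compile(
    r"https?://(?:www\.)?(?:twitter|facebook|instagram|linkedin)\.com/[\w./-]+", re.I
)


def extract_contacts(text, options):
    """Return contact entries from text, honouring the contactExtraction flags."""
    if not options.get("extractContacts", True):
        return []
    found = []
    if options.get("includeEmails", True):
        found += [("email", m.group(0)) for m in EMAIL.finditer(text)]
    if options.get("includePhones", True):
        found += [("phone", m.group(0)) for m in PHONE.finditer(text)]
    if options.get("includeSocialMedia", True):
        found += [("social", m.group(0)) for m in SOCIAL.finditer(text)]
    # Drop repeats while keeping first-seen order.
    return [{"type": kind, "data": value} for kind, value in dict.fromkeys(found)]


text = ("Write to contact@example.com or call +1 (555) 010-2000. "
        "Follow https://twitter.com/example for updates.")
for contact in extract_contacts(text, {"extractContacts": True}):
    print(contact)
```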
Advanced Crawling Options
{
  "crawlingOptions": {
    "respectRobotsTxt": true,
    "userAgent": "",
    "maxRetries": 3
  }
}
Batch Processing (New!)
{
  "batchProcessing": {
    "enableBatchProcessing": true,
    "batchSize": 10,
    "concurrency": 3,
    "delayBetweenBatches": 1000,
    "maxRetries": 3,
    "failureThreshold": 0.5,
    "enableProgressTracking": true,
    "resumeFromLastBatch": true
  }
}
URL List Management
{
  "urlListManagement": {
    "enableDeduplication": true,
    "enableValidation": true,
    "maxUrlsPerBatch": 1000,
    "blockedDomains": ["spam.com"],
    "allowedDomains": ["trusted.com"],
    "urlPatterns": {
      "includePatterns": [".*\\.jpg$", ".*\\.png$"],
      "excludePatterns": [".*admin.*"]
    }
  }
}
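The domain and pattern rules above could be evaluated roughly as follows. This is a sketch under assumed semantics: blocked domains always lose, a non-empty `allowedDomains` list acts as a whitelist that also covers subdomains, a URL must match at least one include pattern when any are given, and any exclude match rejects it. `url_allowed` is a hypothetical helper, not part of the actor.

```python
import re
from urllib.parse import urlparse


def url_allowed(url, cfg):
    """Apply the blockedDomains/allowedDomains/urlPatterns rules to one URL."""
    host = (urlparse(url).hostname or "").lower()

    def in_domains(domains):
        # Assumption: a rule for "trusted.com" also covers "cdn.trusted.com".
        return any(host == d or host.endswith("." + d) for d in domains)

    if in_domains(cfg.get("blockedDomains", [])):
        return False
    allowed = cfg.get("allowedDomains", [])
    if allowed and not in_domains(allowed):
        return False

    patterns = cfg.get("urlPatterns", {})
    if any(re.search(p, url) for p in patterns.get("excludePatterns", [])):
        return False
    include = patterns.get("includePatterns", [])
    if include and not any(re.search(p, url) for p in include):
        return False
    return True


cfg = {
    "blockedDomains": ["spam.com"],
    "allowedDomains": ["trusted.com"],
    "urlPatterns": {
        "includePatterns": [r".*\.jpg$", r".*\.png$"],
        "excludePatterns": [r".*admin.*"],
    },
}
print(url_allowed("https://cdn.trusted.com/logo.png", cfg))
print(url_allowed("https://trusted.com/admin/logo.png", cfg))
```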
Output Format
The actor provides structured output with detailed information for each media type:
{
  "URL": "https://example.com/page",
  "domain": "example.com",
  "timestamp": "2024-03-07T10:30:00.000Z",
  "total_media": 25,
  "images": [
    {
      "id": "abc123",
      "url": "https://example.com/page",
      "src": "https://example.com/image.jpg",
      "alt": "Description",
      "type": "image",
      "others": {
        "width": "800",
        "height": "600",
        "fileExtension": "jpg",
        "elementTag": "img"
      }
    }
  ],
  "documents": [
    {
      "id": "def456",
      "url": "https://example.com/page",
      "src": "https://example.com/document.pdf",
      "alt": "Important Document",
      "type": "document",
      "others": {
        "fileExtension": "pdf"
      }
    }
  ],
  "contacts": [
    {
      "id": "ghi789",
      "url": "https://example.com/page",
      "type": "email",
      "data": "contact@example.com",
      "alt": "Email address"
    }
  ],
  "summary": {
    "imageCount": 15,
    "videoCount": 3,
    "audioCount": 2,
    "documentCount": 4,
    "contactCount": 1,
    "totalSize": 5242880,
    "averageFileSize": 209715,
    "mostCommonType": "image"
  }
}
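A consumer of this output might flatten one record into simple (type, value) pairs, as in the sketch below. It relies only on fields visible in the sample above (`type`, `src`, and `data` for contacts); the trimmed `record` and the `iter_items` helper are illustrative, and real records may contain additional list fields.

```python
# Trimmed copy of the sample record above, as a Python dict.
record = {
    "URL": "https://example.com/page",
    "images": [{"type": "image", "src": "https://example.com/image.jpg"}],
    "documents": [{"type": "document", "src": "https://example.com/document.pdf"}],
    "contacts": [{"type": "email", "data": "contact@example.com"}],
    "summary": {"imageCount": 15},
}


def iter_items(record):
    """Yield (type, value) for every entry in the record's list-valued fields."""
    for value in record.values():
        if isinstance(value, list):
            for item in value:
                # Media entries carry "src"; contact entries carry "data".
                yield item["type"], item.get("src") or item.get("data")


for kind, value in iter_items(record):
    print(kind, value)
```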
Batch Processing for Large URL Lists
The actor automatically switches to batch processing mode when you provide more URLs than the configured batch size. This provides several benefits:
Automatic URL Processing:
- Validates and normalizes all URLs
- Removes duplicates automatically
- Filters by domain and pattern rules
- Prioritizes URLs for optimal processing order
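The validation, normalization, and duplicate-removal steps listed above can be sketched as follows. The specific normalization rules (lowercasing the host, dropping the fragment, defaulting an empty path to `/`) are assumptions; the actor's own rules are not documented on this page, and `normalize` and `prepare` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Return a canonical form of the URL, or None if it is not a valid http(s) URL."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    path = parts.path or "/"
    # Drop the fragment; it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def prepare(urls):
    """Validate, normalize, and de-duplicate URLs while keeping first-seen order."""
    normalized = (normalize(u) for u in urls)
    return list(dict.fromkeys(u for u in normalized if u is not None))


print(prepare([
    "HTTP://Example.com",
    "http://example.com/#top",
    "not a url",
    "https://example.com/a?b=1",
]))
```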
Intelligent Batching:
- Processes URLs in configurable batch sizes
- Controls concurrency to avoid overwhelming servers
- Implements delays between batches for respectful crawling
- Automatic retry logic with exponential backoff
Progress Tracking & Resumption:
- Real-time progress monitoring with ETA calculations
- Automatic progress saving for long-running jobs
- Resume from last completed batch if interrupted
- Failure threshold monitoring to stop on excessive errors
Performance Optimization:
- Memory-efficient processing of large URL lists
- Configurable concurrency limits
- Batch size optimization based on URL count
- Performance metrics and timing analysis
Example Batch Configuration:
{
  "startUrls": [
    { "url": "https://site1.com" },
    { "url": "https://site2.com" }
    // ... 1000+ URLs
  ],
  "batchProcessing": {
    "enableBatchProcessing": true,
    "batchSize": 20,
    "concurrency": 5,
    "delayBetweenBatches": 2000,
    "failureThreshold": 0.3
  }
}
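The interplay of batch size, retries with exponential backoff, and the failure threshold might look roughly like the sketch below. It processes URLs sequentially and ignores `concurrency`, assumes `delayBetweenBatches` is in milliseconds and `failureThreshold` is a fraction of processed URLs, and uses a stand-in `fetch` callable. All helper names are hypothetical.

```python
import time


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_retry(url, fetch, max_retries, sleep):
    """Call fetch(url), retrying with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(2 ** attempt)


def run_batches(urls, fetch, cfg, sleep=time.sleep):
    """Process URLs batch by batch; stop once the failure rate passes the threshold."""
    results, failures, done = [], 0, 0
    for batch in chunked(urls, cfg["batchSize"]):
        for url in batch:
            try:
                results.append(fetch_with_retry(url, fetch, cfg.get("maxRetries", 3), sleep))
            except Exception:
                failures += 1
            done += 1
        if failures / done > cfg["failureThreshold"]:
            break  # too many errors, abort the remaining batches
        sleep(cfg["delayBetweenBatches"] / 1000)  # assumption: milliseconds
    return results, failures


cfg = {"batchSize": 2, "maxRetries": 1, "delayBetweenBatches": 0, "failureThreshold": 0.3}
# str.upper stands in for a real fetch; the no-op sleep keeps the demo instant.
print(run_batches(["a", "b", "c"], str.upper, cfg, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets the delays be replaced with a no-op in tests and demos.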
Analytics & Monitoring
The actor provides comprehensive analytics stored in the key-value store:
- Performance Statistics: Request counts, success rates, memory usage
- Error Logs: Detailed error tracking with timestamps and stack traces
- Blocked URLs: List of URLs that returned access denied or bot detection
- Configuration: Runtime settings and applied filters
- Batch Processing Stats: Progress tracking, failure rates, processing times
- URL Processing Stats: Validation results, duplicate removal, domain distribution
Use Cases
- Aggregate media from multiple sources for content creation
- Research competitor visual strategies and branding
- Collect images for machine learning training datasets
Digital Asset Management
- Monitor brand asset usage across platforms
- Identify unauthorized use of copyrighted materials
- Track visual trends and user-generated content
Business Intelligence
- Analyze competitor product catalogs and pricing
- Monitor industry visual trends and marketing strategies
- Extract contact information for lead generation
Academic & Research
- Collect visual data for academic research projects
- Analyze media usage patterns across different domains
- Study digital communication and visual culture
E-commerce & Marketing
- Monitor competitor product images and descriptions
- Analyze visual merchandising strategies
- Extract product information and specifications
Best Practices
- Respectful Crawling: Use appropriate delays and respect robots.txt
- Proxy Usage: Enable proxies for sites with bot detection
- Filter Configuration: Set appropriate file size limits to avoid large downloads
- Media Type Selection: Only extract the media types you need to improve performance
- Error Monitoring: Check the key-value store for blocked URLs and errors
Migration Notes
Removed Features
- requestDelay: This property has been removed because it is deprecated in newer versions of Crawlee. Rate limiting is now handled automatically by the crawler's internal mechanisms and session management.
Alternative Rate Limiting
If you need to control request rates, consider:
- Using proxy rotation to distribute requests
- Implementing custom delays in request handlers
- Configuring AutoscaledPool settings for concurrency control
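As a language-neutral illustration of the "custom delays in request handlers" option, the sketch below enforces a minimum interval between calls. It is written in Python rather than against the Crawlee API, and the `Throttle` class is hypothetical; in a real handler you would call `wait()` before each outgoing request.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.2)
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()  # the second iteration pauses for roughly 0.2 seconds
    print("requesting", url)
```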
For customization requests, additional features, or technical support, please contact us at hlymrk8@gmail.com. We respond to all inquiries within one business day.
- Report Bug Or Issue: Report On Apify