HTML/Website Media Scraper
The Advanced HTML/Website Media Scraper is a comprehensive media-extraction tool that pulls images, videos, audio, documents, archives, e-books, fonts, apps, and contact information from websites, with advanced filtering, proxy support, and detailed analytics.

Pricing: $8.00/month + usage

Rating: 4.8 (3)

Developer: $crypt (Maintained by Community)

Actor stats

  • Bookmarked: 9
  • Total users: 247
  • Monthly active users: 8
  • Last modified: 4 days ago

HTML/Website Media Scraper v2.0

Overview

The Advanced HTML/Website Media Scraper is a comprehensive utility designed to extract media files and related information from multiple websites. This enhanced version supports a wide range of media types, including images, videos, audio, documents, archives, e-books, fonts, applications, and contact information, and provides advanced filtering options, detailed analytics, and flexible output formats.

Supported Media Types

Support for 90+ file formats across 8 media categories.

Video Files

  • mp4, webm, mkv, mov, avi, flv, 3gp, ogv, m4v, wmv, mpg, mpeg, f4v

Audio Files

  • mp3, wav, ogg, aac, flac, m4a, wma, opus, aiff, au, ra

Image Files

  • jpg, jpeg, png, gif, webp, svg, bmp, apng, heic, heif, tiff, ico, avif

Document Files

  • pdf, doc, docx, ppt, pptx, xls, xlsx, csv, odt, ods, odp, rtf, md, txt, json, xml

Archive Files

  • zip, rar, tar, gz, 7z, bz2, xz, lz, lzma, cab, deb, rpm

E-book Files

  • epub, mobi, azw3, fb2, lit, pdb, prc, azw, kf8

Font Files

  • ttf, otf, woff, woff2, eot, pfb, pfm, afm

Application Files

  • apk, xapk, ipa, exe, msi, dmg, pkg, deb, rpm, snap, appimage

Contact Information (Extra)

  • Email addresses, phone numbers, social media profiles (Twitter, Facebook, Instagram, LinkedIn)

Pricing Model

This actor uses a pay-per-event pricing model where you only pay for media items successfully extracted and processed.

Pricing: $0.001 per media item (images, videos, documents, etc.)

What counts as a billable event:

  • Each image, video, audio file, document, archive, e-book, font, app, or contact successfully extracted
  • SVG elements converted to data URLs
  • Canvas elements with extracted metadata
  • Duplicate items are removed before billing (you don't pay for duplicates)
  • Failed extractions or inaccessible media don't count

Billing Examples:

  • Extract 50 images from a website = $0.05
  • Process 200 media items across multiple sites = $0.20
  • Large batch job with 1,000+ media items = $1.00+
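
At $0.001 per item, the examples above are easy to verify; a quick sketch of the arithmetic:

```python
PRICE_PER_ITEM = 0.001  # USD per successfully extracted media item

def estimated_cost(unique_items: int) -> float:
    """Cost in USD after duplicate removal; failed extractions are not billed."""
    return round(unique_items * PRICE_PER_ITEM, 2)

# 50 images -> $0.05, 200 items -> $0.20, 1,000 items -> $1.00
```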

Transparent Billing:

  • Real-time billing statistics in logs
  • Detailed billing summary at job completion
  • No hidden fees or minimum charges
  • Pay only for successful extractions

Key Features

  • Comprehensive Media Detection: Automatically identifies and extracts all supported media types
  • Advanced Filtering: Filter by file size, type, dimensions, and custom criteria
  • Contact Extraction: Automatically finds email addresses, phone numbers, and social media profiles
  • Background Image Detection: Extracts images from CSS background-image properties
  • Lazy Loading Support: Detects images with data-src and data-lazy-src attributes
  • Performance Analytics: Detailed statistics and performance monitoring
  • Error Handling: Robust error handling with retry logic and blocked URL detection
  • Flexible Output: Configurable output with summary statistics and metadata
  • Proxy Support: Built-in proxy support for bot-blocking websites
  • Rate Limiting: Respectful crawling with configurable delays
  • Batch Processing: Efficient processing of large URL lists with progress tracking and resumption
  • Duplicate Detection: Advanced algorithms to identify and remove duplicate media
  • Media Validation: Health checks and accessibility validation for media files
  • Custom Selectors: User-defined CSS selectors for specialized media extraction
  • SVG Conversion: Convert SVG elements to data URLs with security sanitization (currently creates data URLs only; full rasterization planned)
  • Canvas Detection: Identify and extract canvas element metadata (conversion creates placeholders; full rasterization planned)
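
The background-image and lazy-loading detection described above can be approximated with the standard library. A sketch, assuming the data-src and data-lazy-src attribute names mentioned in the feature list (the actor's actual selector logic is not published):

```python
import re
from html.parser import HTMLParser

# Matches CSS background-image: url(...) declarations in inline styles.
BG_IMAGE_RE = re.compile(r"background-image\s*:\s*url\(['\"]?([^'\")]+)['\"]?\)")

class ImageCollector(HTMLParser):
    """Collect <img> sources, including lazy-loaded and CSS background images."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Assumption: check lazy-loading attributes before the plain src.
            for attr in ("data-src", "data-lazy-src", "src"):
                if attrs.get(attr):
                    self.sources.append(attrs[attr])
                    break
        self.sources.extend(BG_IMAGE_RE.findall(attrs.get("style", "")))

parser = ImageCollector()
parser.feed('<img data-src="/a.jpg"><div style="background-image: url(/b.png)"></div>')
```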

Input Configuration

The actor accepts a comprehensive configuration object with the following sections:

Basic Settings

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxRequestsPerCrawl": 100,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Media Type Selection

{
  "mediaTypes": [
    "images",
    "videos",
    "audios",
    "documents",
    "archives",
    "ebooks",
    "fonts",
    "apps",
    "contacts"
  ]
}

Image Options

{
  "imageOptions": {
    "includeBackgroundImages": true,
    "minImageSize": 50,
    "includeDataUrls": false
  }
}

File Filtering

{
  "fileFilters": {
    "maxFileSize": 100,
    "allowedExtensions": [],
    "blockedExtensions": []
  }
}

Contact Extraction

{
  "contactExtraction": {
    "extractContacts": true,
    "includeEmails": true,
    "includePhones": true,
    "includeSocialMedia": true
  }
}
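
Contact extraction of the kind configured above typically relies on pattern matching. A minimal sketch; the regexes are illustrative simplifications, not the actor's actual patterns:

```python
import re

# Simplified patterns for illustration only; production matching needs
# locale-aware phone rules and stricter email validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|facebook|instagram|linkedin)\.com/[\w./-]+"
)

def extract_contacts(text: str) -> dict:
    """Pull emails, phone numbers, and social profile URLs from page text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
        "social": SOCIAL_RE.findall(text),
    }

contacts = extract_contacts(
    "Mail contact@example.com, call +1 (555) 123-4567, or visit https://twitter.com/apify"
)
```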

Advanced Crawling Options

{
  "crawlingOptions": {
    "respectRobotsTxt": true,
    "userAgent": "",
    "maxRetries": 3
  }
}

Batch Processing (New!)

{
  "batchProcessing": {
    "enableBatchProcessing": true,
    "batchSize": 10,
    "concurrency": 3,
    "delayBetweenBatches": 1000,
    "maxRetries": 3,
    "failureThreshold": 0.5,
    "enableProgressTracking": true,
    "resumeFromLastBatch": true
  }
}
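
Read the parameters above as: split the URL list into chunks of batchSize, process up to concurrency URLs at once, wait delayBetweenBatches milliseconds between chunks, and abort once the failure ratio exceeds failureThreshold. The splitting step can be sketched as follows (assuming nothing about the actor's internals):

```python
def split_into_batches(urls: list[str], batch_size: int = 10) -> list[list[str]]:
    """Split a URL list into consecutive batches of at most batch_size URLs."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# 25 URLs with batchSize 10 -> batches of 10, 10, and 5
batches = split_into_batches(
    [f"https://example.com/p{i}" for i in range(25)], batch_size=10
)
```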

URL List Management

{
  "urlListManagement": {
    "enableDeduplication": true,
    "enableValidation": true,
    "maxUrlsPerBatch": 1000,
    "blockedDomains": ["spam.com"],
    "allowedDomains": ["trusted.com"],
    "urlPatterns": {
      "includePatterns": [".*\\.jpg$", ".*\\.png$"],
      "excludePatterns": [".*admin.*"]
    }
  }
}
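
The deduplication, domain, and pattern rules above map naturally onto set and regex operations. An illustrative sketch (field names follow the config above; the actor's actual filtering order is an assumption):

```python
import re
from urllib.parse import urlparse

def filter_urls(urls, blocked_domains=(), include_patterns=(), exclude_patterns=()):
    """Deduplicate (order-preserving), then apply domain and regex pattern rules."""
    seen, result = set(), []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        if urlparse(url).netloc in blocked_domains:
            continue
        if include_patterns and not any(re.match(p, url) for p in include_patterns):
            continue
        if any(re.match(p, url) for p in exclude_patterns):
            continue
        result.append(url)
    return result

kept = filter_urls(
    [
        "https://trusted.com/a.jpg",
        "https://trusted.com/a.jpg",   # duplicate, dropped
        "https://spam.com/b.jpg",      # blocked domain
        "https://trusted.com/admin/c.jpg",  # matches exclude pattern
    ],
    blocked_domains={"spam.com"},
    include_patterns=[r".*\.jpg$"],
    exclude_patterns=[r".*admin.*"],
)
```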

Output Format

The actor provides structured output with detailed information for each media type:

{
  "URL": "https://example.com/page",
  "domain": "example.com",
  "timestamp": "2024-03-07T10:30:00.000Z",
  "total_media": 25,
  "images": [
    {
      "id": "abc123",
      "url": "https://example.com/page",
      "src": "https://example.com/image.jpg",
      "alt": "Description",
      "type": "image",
      "others": {
        "width": "800",
        "height": "600",
        "fileExtension": "jpg",
        "elementTag": "img"
      }
    }
  ],
  "documents": [
    {
      "id": "def456",
      "url": "https://example.com/page",
      "src": "https://example.com/document.pdf",
      "alt": "Important Document",
      "type": "document",
      "others": {
        "fileExtension": "pdf"
      }
    }
  ],
  "contacts": [
    {
      "id": "ghi789",
      "url": "https://example.com/page",
      "type": "email",
      "data": "contact@example.com",
      "alt": "Email address"
    }
  ],
  "summary": {
    "imageCount": 15,
    "videoCount": 3,
    "audioCount": 2,
    "documentCount": 4,
    "contactCount": 1,
    "totalSize": 5242880,
    "averageFileSize": 209715,
    "mostCommonType": "image"
  }
}
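
The summary block can be derived from the per-item records. A sketch of how such fields might be computed (the actor's exact rounding and tie-breaking rules are assumptions):

```python
from collections import Counter

def summarize(items):
    """Compute summary statistics like those in the output above.
    Each item is a dict with at least a 'type' key; 'size' in bytes is optional."""
    counts = Counter(item["type"] for item in items)
    sizes = [item["size"] for item in items if "size" in item]
    total = sum(sizes)
    return {
        "imageCount": counts.get("image", 0),
        "documentCount": counts.get("document", 0),
        "contactCount": counts.get("email", 0) + counts.get("phone", 0),
        "totalSize": total,
        "averageFileSize": total // len(sizes) if sizes else 0,
        "mostCommonType": counts.most_common(1)[0][0] if counts else None,
    }

summary = summarize([
    {"type": "image", "size": 1000},
    {"type": "image", "size": 3000},
    {"type": "document", "size": 2000},
    {"type": "email"},  # contacts carry no file size
])
```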

Batch Processing for Large URL Lists

The actor automatically switches to batch processing mode when you provide more URLs than the configured batch size. This provides several benefits:

Automatic URL Processing:

  • Validates and normalizes all URLs
  • Removes duplicates automatically
  • Filters by domain and pattern rules
  • Prioritizes URLs for optimal processing order

Intelligent Batching:

  • Processes URLs in configurable batch sizes
  • Controls concurrency to avoid overwhelming servers
  • Implements delays between batches for respectful crawling
  • Automatic retry logic with exponential backoff
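
Exponential backoff, mentioned in the last bullet above, typically doubles the wait after each failed attempt. A sketch assuming a 1-second base delay and a 60-second cap (the actor's actual timings are not documented):

```python
def backoff_delay(attempt: int, base_seconds: float = 1.0, cap_seconds: float = 60.0) -> float:
    """Delay in seconds before retry N (0-indexed): base * 2^attempt, capped."""
    return min(base_seconds * (2 ** attempt), cap_seconds)

# attempts 0, 1, 2, 3 -> 1s, 2s, 4s, 8s; large attempts are capped at 60s
```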

Progress Tracking & Resumption:

  • Real-time progress monitoring with ETA calculations
  • Automatic progress saving for long-running jobs
  • Resume from last completed batch if interrupted
  • Failure threshold monitoring to stop on excessive errors

Performance Optimization:

  • Memory-efficient processing of large URL lists
  • Configurable concurrency limits
  • Batch size optimization based on URL count
  • Performance metrics and timing analysis

Example Batch Configuration:

{
  "startUrls": [
    { "url": "https://site1.com" },
    { "url": "https://site2.com" }
    // ... 1000+ URLs
  ],
  "batchProcessing": {
    "enableBatchProcessing": true,
    "batchSize": 20,
    "concurrency": 5,
    "delayBetweenBatches": 2000,
    "failureThreshold": 0.3
  }
}

Analytics & Monitoring

The actor provides comprehensive analytics stored in the key-value store:

  • Performance Statistics: Request counts, success rates, memory usage
  • Error Logs: Detailed error tracking with timestamps and stack traces
  • Blocked URLs: List of URLs that returned access denied or bot detection
  • Configuration: Runtime settings and applied filters
  • Batch Processing Stats: Progress tracking, failure rates, processing times
  • URL Processing Stats: Validation results, duplicate removal, domain distribution

Use Cases

Content Creation & Research

  • Aggregate media from multiple sources for content creation
  • Research competitor visual strategies and branding
  • Collect images for machine learning training datasets

Digital Asset Management

  • Monitor brand asset usage across platforms
  • Identify unauthorized use of copyrighted materials
  • Track visual trends and user-generated content

Business Intelligence

  • Analyze competitor product catalogs and pricing
  • Monitor industry visual trends and marketing strategies
  • Extract contact information for lead generation

Academic & Research

  • Collect visual data for academic research projects
  • Analyze media usage patterns across different domains
  • Study digital communication and visual culture

E-commerce & Marketing

  • Monitor competitor product images and descriptions
  • Analyze visual merchandising strategies
  • Extract product information and specifications

Best Practices

  1. Respectful Crawling: Use appropriate delays and respect robots.txt
  2. Proxy Usage: Enable proxies for sites with bot detection
  3. Filter Configuration: Set appropriate file size limits to avoid large downloads
  4. Media Type Selection: Only extract the media types you need to improve performance
  5. Error Monitoring: Check the key-value store for blocked URLs and errors

Migration Notes

Removed Features

  • requestDelay: This property has been deprecated in newer versions of Crawlee. Rate limiting is now handled automatically by the crawler's internal mechanisms and session management.

Alternative Rate Limiting

If you need to control request rates, consider:

  • Using proxy rotation to distribute requests
  • Implementing custom delays in request handlers
  • Configuring AutoscaledPool settings for concurrency control

For customization requests, additional features, or technical support, please contact us at hlymrk8@gmail.com. We respond to all inquiries within one business day.