HTML/Website Media Scraper
The Advanced HTML/Website Media Scraper is a comprehensive media-extraction tool that pulls images, videos, audio, documents, archives, e-books, fonts, apps, and contact information from websites, with advanced filtering, proxy support, and detailed analytics.

Pricing: $8.00/month + usage

Rating: 4.8 (3)

Developer: $crypt (Maintained by Community)

Actor stats

  • Bookmarked: 9
  • Total users: 247
  • Monthly active users: 8
  • Last modified: 4 days ago

HTML/Website Media Scraper v2.0

Overview

The Advanced HTML/Website Media Scraper is a comprehensive utility designed to extract media files and related information from multiple websites. This enhanced version supports a wide range of media types, including images, videos, audio, documents, archives, e-books, fonts, applications, and contact information, and provides advanced filtering options, detailed analytics, and flexible output formats.

Supported Media Types

Support for 90+ file formats across 8 media categories.

Video Files

  • mp4, webm, mkv, mov, avi, flv, 3gp, ogv, m4v, wmv, mpg, mpeg, f4v

Audio Files

  • mp3, wav, ogg, aac, flac, m4a, wma, opus, aiff, au, ra

Image Files

  • jpg, jpeg, png, gif, webp, svg, bmp, apng, heic, heif, tiff, ico, avif

Document Files

  • pdf, doc, docx, ppt, pptx, xls, xlsx, csv, odt, ods, odp, rtf, md, txt, json, xml

Archive Files

  • zip, rar, tar, gz, 7z, bz2, xz, lz, lzma, cab, deb, rpm

E-book Files

  • epub, mobi, azw3, fb2, lit, pdb, prc, azw, kf8

Font Files

  • ttf, otf, woff, woff2, eot, pfb, pfm, afm

Application Files

  • apk, xapk, ipa, exe, msi, dmg, pkg, deb, rpm, snap, appimage

Contact Information (Extra)

  • Email addresses, phone numbers, social media profiles (Twitter, Facebook, Instagram, LinkedIn)

Pricing Model

This actor uses a pay-per-event pricing model where you only pay for media items successfully extracted and processed.

Pricing: $0.001 per media item (images, videos, documents, etc.)

What counts as a billable event:

  • Each image, video, audio file, document, archive, e-book, font, app, or contact successfully extracted
  • SVG elements converted to data URLs
  • Canvas elements with extracted metadata
  • Duplicate items are removed before billing (you don't pay for duplicates)
  • Failed extractions or inaccessible media don't count

Billing Examples:

  • Extract 50 images from a website = $0.05
  • Process 200 media items across multiple sites = $0.20
  • Large batch job with 1,000+ media items = $1.00+
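
At $0.001 per item, the examples above are easy to verify; a quick sketch of the arithmetic:

```python
PRICE_PER_ITEM = 0.001  # USD per successfully extracted media item

def estimated_cost(unique_items: int) -> float:
    """Cost in USD after duplicate removal; failed extractions are not billed."""
    return round(unique_items * PRICE_PER_ITEM, 2)

# 50 images -> $0.05, 200 items -> $0.20, 1,000 items -> $1.00
```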

Transparent Billing:

  • Real-time billing statistics in logs
  • Detailed billing summary at job completion
  • No hidden fees or minimum charges
  • Pay only for successful extractions

Key Features

  • Comprehensive Media Detection: Automatically identifies and extracts all supported media types
  • Advanced Filtering: Filter by file size, type, dimensions, and custom criteria
  • Contact Extraction: Automatically finds email addresses, phone numbers, and social media profiles
  • Background Image Detection: Extracts images from CSS background-image properties
  • Lazy Loading Support: Detects images with data-src and data-lazy-src attributes
  • Performance Analytics: Detailed statistics and performance monitoring
  • Error Handling: Robust error handling with retry logic and blocked URL detection
  • Flexible Output: Configurable output with summary statistics and metadata
  • Proxy Support: Built-in proxy support for bot-blocking websites
  • Rate Limiting: Respectful crawling with configurable delays
  • Batch Processing: Efficient processing of large URL lists with progress tracking and resumption
  • Duplicate Detection: Advanced algorithms to identify and remove duplicate media
  • Media Validation: Health checks and accessibility validation for media files
  • Custom Selectors: User-defined CSS selectors for specialized media extraction
  • SVG Conversion: Convert SVG elements to data URLs with security sanitization (currently creates data URLs only; full rasterization planned)
  • Canvas Detection: Identify and extract canvas element metadata (conversion creates placeholders; full rasterization planned)
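
The background-image and lazy-loading detection described above can be approximated with the standard library. A sketch, assuming the data-src and data-lazy-src attribute names mentioned in the feature list (the actor's actual selector logic is not published):

```python
import re
from html.parser import HTMLParser

# Matches CSS background-image: url(...) declarations in inline styles.
BG_IMAGE_RE = re.compile(r"background-image\s*:\s*url\(['\"]?([^'\")]+)['\"]?\)")

class ImageCollector(HTMLParser):
    """Collect <img> sources, including lazy-loaded and CSS background images."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Assumption: check lazy-loading attributes before the plain src.
            for attr in ("data-src", "data-lazy-src", "src"):
                if attrs.get(attr):
                    self.sources.append(attrs[attr])
                    break
        self.sources.extend(BG_IMAGE_RE.findall(attrs.get("style", "")))

parser = ImageCollector()
parser.feed('<img data-src="/a.jpg"><div style="background-image: url(/b.png)"></div>')
```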

Input Configuration

The actor accepts a comprehensive configuration object with the following sections:

Basic Settings

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxRequestsPerCrawl": 100,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Media Type Selection

{
  "mediaTypes": [
    "images",
    "videos",
    "audios",
    "documents",
    "archives",
    "ebooks",
    "fonts",
    "apps",
    "contacts"
  ]
}

Image Options

{
  "imageOptions": {
    "includeBackgroundImages": true,
    "minImageSize": 50,
    "includeDataUrls": false
  }
}

File Filtering

{
  "fileFilters": {
    "maxFileSize": 100,
    "allowedExtensions": [],
    "blockedExtensions": []
  }
}

Contact Extraction

{
  "contactExtraction": {
    "extractContacts": true,
    "includeEmails": true,
    "includePhones": true,
    "includeSocialMedia": true
  }
}
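
Contact extraction of the kind configured above typically relies on pattern matching. A minimal sketch; the regexes are illustrative simplifications, not the actor's actual patterns:

```python
import re

# Simplified patterns for illustration only; production matching needs
# locale-aware phone rules and stricter email validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|facebook|instagram|linkedin)\.com/[\w./-]+"
)

def extract_contacts(text: str) -> dict:
    """Pull emails, phone numbers, and social profile URLs from page text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
        "social": SOCIAL_RE.findall(text),
    }

contacts = extract_contacts(
    "Mail contact@example.com, call +1 (555) 123-4567, or visit https://twitter.com/apify"
)
```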

Advanced Crawling Options

{
  "crawlingOptions": {
    "respectRobotsTxt": true,
    "userAgent": "",
    "maxRetries": 3
  }
}

Batch Processing (New!)

{
  "batchProcessing": {
    "enableBatchProcessing": true,
    "batchSize": 10,
    "concurrency": 3,
    "delayBetweenBatches": 1000,
    "maxRetries": 3,
    "failureThreshold": 0.5,
    "enableProgressTracking": true,
    "resumeFromLastBatch": true
  }
}
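
Read the parameters above as: split the URL list into chunks of batchSize, process up to concurrency URLs at once, wait delayBetweenBatches milliseconds between chunks, and abort once the failure ratio exceeds failureThreshold. The splitting step can be sketched as follows (assuming nothing about the actor's internals):

```python
def split_into_batches(urls: list[str], batch_size: int = 10) -> list[list[str]]:
    """Split a URL list into consecutive batches of at most batch_size URLs."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# 25 URLs with batchSize 10 -> batches of 10, 10, and 5
batches = split_into_batches(
    [f"https://example.com/p{i}" for i in range(25)], batch_size=10
)
```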

URL List Management

{
  "urlListManagement": {
    "enableDeduplication": true,
    "enableValidation": true,
    "maxUrlsPerBatch": 1000,
    "blockedDomains": ["spam.com"],
    "allowedDomains": ["trusted.com"],
    "urlPatterns": {
      "includePatterns": [".*\\.jpg$", ".*\\.png$"],
      "excludePatterns": [".*admin.*"]
    }
  }
}
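
The deduplication, domain, and pattern rules above map naturally onto set and regex operations. An illustrative sketch (field names follow the config above; the actor's actual filtering order is an assumption):

```python
import re
from urllib.parse import urlparse

def filter_urls(urls, blocked_domains=(), include_patterns=(), exclude_patterns=()):
    """Deduplicate (order-preserving), then apply domain and regex pattern rules."""
    seen, result = set(), []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        if urlparse(url).netloc in blocked_domains:
            continue
        if include_patterns and not any(re.match(p, url) for p in include_patterns):
            continue
        if any(re.match(p, url) for p in exclude_patterns):
            continue
        result.append(url)
    return result

kept = filter_urls(
    [
        "https://trusted.com/a.jpg",
        "https://trusted.com/a.jpg",   # duplicate, dropped
        "https://spam.com/b.jpg",      # blocked domain
        "https://trusted.com/admin/c.jpg",  # matches exclude pattern
    ],
    blocked_domains={"spam.com"},
    include_patterns=[r".*\.jpg$"],
    exclude_patterns=[r".*admin.*"],
)
```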

Output Format

The actor provides structured output with detailed information for each media type:

{
  "URL": "https://example.com/page",
  "domain": "example.com",
  "timestamp": "2024-03-07T10:30:00.000Z",
  "total_media": 25,
  "images": [
    {
      "id": "abc123",
      "url": "https://example.com/page",
      "src": "https://example.com/image.jpg",
      "alt": "Description",
      "type": "image",
      "others": {
        "width": "800",
        "height": "600",
        "fileExtension": "jpg",
        "elementTag": "img"
      }
    }
  ],
  "documents": [
    {
      "id": "def456",
      "url": "https://example.com/page",
      "src": "https://example.com/document.pdf",
      "alt": "Important Document",
      "type": "document",
      "others": {
        "fileExtension": "pdf"
      }
    }
  ],
  "contacts": [
    {
      "id": "ghi789",
      "url": "https://example.com/page",
      "type": "email",
      "data": "contact@example.com",
      "alt": "Email address"
    }
  ],
  "summary": {
    "imageCount": 15,
    "videoCount": 3,
    "audioCount": 2,
    "documentCount": 4,
    "contactCount": 1,
    "totalSize": 5242880,
    "averageFileSize": 209715,
    "mostCommonType": "image"
  }
}
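
The summary block can be derived from the per-item records. A sketch of how such fields might be computed (the actor's exact rounding and tie-breaking rules are assumptions):

```python
from collections import Counter

def summarize(items):
    """Compute summary statistics like those in the output above.
    Each item is a dict with at least a 'type' key; 'size' in bytes is optional."""
    counts = Counter(item["type"] for item in items)
    sizes = [item["size"] for item in items if "size" in item]
    total = sum(sizes)
    return {
        "imageCount": counts.get("image", 0),
        "documentCount": counts.get("document", 0),
        "contactCount": counts.get("email", 0) + counts.get("phone", 0),
        "totalSize": total,
        "averageFileSize": total // len(sizes) if sizes else 0,
        "mostCommonType": counts.most_common(1)[0][0] if counts else None,
    }

summary = summarize([
    {"type": "image", "size": 1000},
    {"type": "image", "size": 3000},
    {"type": "document", "size": 2000},
    {"type": "email"},  # contacts carry no file size
])
```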

Batch Processing for Large URL Lists

The actor automatically switches to batch processing mode when you provide more URLs than the configured batch size. This provides several benefits:

Automatic URL Processing:

  • Validates and normalizes all URLs
  • Removes duplicates automatically
  • Filters by domain and pattern rules
  • Prioritizes URLs for optimal processing order

Intelligent Batching:

  • Processes URLs in configurable batch sizes
  • Controls concurrency to avoid overwhelming servers
  • Implements delays between batches for respectful crawling
  • Automatic retry logic with exponential backoff
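
Exponential backoff, mentioned in the last bullet above, typically doubles the wait after each failed attempt. A sketch assuming a 1-second base delay and a 60-second cap (the actor's actual timings are not documented):

```python
def backoff_delay(attempt: int, base_seconds: float = 1.0, cap_seconds: float = 60.0) -> float:
    """Delay in seconds before retry N (0-indexed): base * 2^attempt, capped."""
    return min(base_seconds * (2 ** attempt), cap_seconds)

# attempts 0, 1, 2, 3 -> 1s, 2s, 4s, 8s; large attempts are capped at 60s
```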

Progress Tracking & Resumption:

  • Real-time progress monitoring with ETA calculations
  • Automatic progress saving for long-running jobs
  • Resume from last completed batch if interrupted
  • Failure threshold monitoring to stop on excessive errors

Performance Optimization:

  • Memory-efficient processing of large URL lists
  • Configurable concurrency limits
  • Batch size optimization based on URL count
  • Performance metrics and timing analysis

Example Batch Configuration:

{
  "startUrls": [
    { "url": "https://site1.com" },
    { "url": "https://site2.com" }
    // ... 1000+ URLs
  ],
  "batchProcessing": {
    "enableBatchProcessing": true,
    "batchSize": 20,
    "concurrency": 5,
    "delayBetweenBatches": 2000,
    "failureThreshold": 0.3
  }
}

Analytics & Monitoring

The actor provides comprehensive analytics stored in the key-value store:

  • Performance Statistics: Request counts, success rates, memory usage
  • Error Logs: Detailed error tracking with timestamps and stack traces
  • Blocked URLs: List of URLs that returned access denied or bot detection
  • Configuration: Runtime settings and applied filters
  • Batch Processing Stats: Progress tracking, failure rates, processing times
  • URL Processing Stats: Validation results, duplicate removal, domain distribution

Use Cases

Content Creation & Research

  • Aggregate media from multiple sources for content creation
  • Research competitor visual strategies and branding
  • Collect images for machine learning training datasets

Digital Asset Management

  • Monitor brand asset usage across platforms
  • Identify unauthorized use of copyrighted materials
  • Track visual trends and user-generated content

Business Intelligence

  • Analyze competitor product catalogs and pricing
  • Monitor industry visual trends and marketing strategies
  • Extract contact information for lead generation

Academic & Research

  • Collect visual data for academic research projects
  • Analyze media usage patterns across different domains
  • Study digital communication and visual culture

E-commerce & Marketing

  • Monitor competitor product images and descriptions
  • Analyze visual merchandising strategies
  • Extract product information and specifications

Best Practices

  1. Respectful Crawling: Use appropriate delays and respect robots.txt
  2. Proxy Usage: Enable proxies for sites with bot detection
  3. Filter Configuration: Set appropriate file size limits to avoid large downloads
  4. Media Type Selection: Only extract the media types you need to improve performance
  5. Error Monitoring: Check the key-value store for blocked URLs and errors

Migration Notes

Removed Features

  • requestDelay: This property has been deprecated in newer versions of Crawlee. Rate limiting is now handled automatically by the crawler's internal mechanisms and session management.

Alternative Rate Limiting

If you need to control request rates, consider:

  • Using proxy rotation to distribute requests
  • Implementing custom delays in request handlers
  • Configuring AutoscaledPool settings for concurrency control

For customization requests, additional features, or technical support, please contact us at hlymrk8@gmail.com. We respond to all inquiries within one business day.