Bulk Image Downloader: 22-Field Metadata, SHA-256 & ZIP
Pricing
from $2.00 / 1,000 images
Download every image from any webpage or direct image URL. Smart srcset picks the highest-resolution variant. 22 metadata fields per image: width, height, format, SHA-256, dedup flag, EXIF, provenance. ZIP and S3 outputs, webhooks, MCP-ready. $2.00 per 1k.
Developer: GetAScraper (maintained by Community)
Download every image from any webpage or direct image URL in one call. Each image comes with 22 metadata fields and a SHA-256 content hash, with optional EXIF stripping and WebP-to-PNG conversion, plus ZIP and S3 outputs. Pricing is $2.00 per 1,000 results, about 70% less than the top Store alternative, and the first 50 images of every run are free.
This Actor is a generic image downloader. It works on any public URL. Pass it a list of webpages and it discovers every image via HTML <img>, <picture>, srcset, og:image, and twitter:image. Pass it a list of direct image URLs and it downloads them directly. It picks the highest-resolution variant from any srcset automatically, hashes every image body with SHA-256 for dedup, and strips EXIF or converts WebP to PNG on demand. Results are exported as a structured dataset, a ZIP archive, or an S3 upload. A single run processes up to 10,000 URLs (the maxUrls default) at up to 10 concurrent downloads.
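The srcset selection described above can be sketched as follows. This is a minimal illustration, not the Actor's actual implementation: the function name `pick_largest_srcset_candidate` is hypothetical, and the naive comma split assumes image URLs that do not themselves contain commas.

```python
from typing import Optional


def pick_largest_srcset_candidate(srcset: str) -> Optional[str]:
    """Return the URL whose descriptor (width `w` or density `x`) is largest."""
    best_url: Optional[str] = None
    best_key = (-1, -1.0)
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        # A candidate without a descriptor is treated as 1x.
        descriptor = parts[1] if len(parts) > 1 else "1x"
        kind = descriptor[-1]
        if kind not in ("w", "x"):
            continue
        try:
            value = float(descriptor[:-1])
        except ValueError:
            continue
        # Assumption: width descriptors outrank density descriptors if both appear.
        key = (1 if kind == "w" else 0, value)
        if key > best_key:
            best_key, best_url = key, url
    return best_url


print(pick_largest_srcset_candidate("small.jpg 480w, large.jpg 1600w, medium.jpg 800w"))
```

A production parser would follow the full HTML srcset parsing rules, which handle commas inside URLs and malformed descriptors.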
What can you do with it?
- You are building an AI training dataset. Pull thousands of product photos, real estate shots, or stock images for CLIP, DINOv2, or SigLIP. Auto-hash for dedup means you never train on the same image twice.
- You are a scraper developer. Hand the Actor a list of image URLs returned by your catalog scraper (REI, IndiaMART, eBay, Poshmark) and get back a ZIP of the binaries plus a clean metadata dataset. One Actor replaces three.
- You are an e-commerce operator. Mirror product image catalogs. Detect when a competitor swaps an image. Track pricing-page visual changes over time.
- You are an archivist or work in a newsroom. Grab every image from a story page in one call. Use the per-URL ZIP mode to keep sources separated.
- You are a research analyst. Pull the full visual corpus of any public site for content analysis, brand tracking, or visual trend reports.
- You are a builder integrating via webhook. The Actor POSTs a JSON summary on completion. Pipe the dataset URL into your BigQuery, Sheets, or n8n pipeline.
How to use it
- Open the Actor in the Apify Store and click "Try for free".
- Paste your URLs. Mix webpages (the Actor parses the HTML) and direct image links (it downloads straight) in a single list.
- Pick your options. Turn on SHA-256 dedup, EXIF strip, format conversion, or ZIP output as needed.
- Click Start. The Actor fetches each URL, discovers or downloads the images, and pushes metadata to the dataset and binaries to the key-value store.
- Download your results. Pull the dataset as JSON, CSV, or Excel. Grab the image binaries from the key-value store (links in the dataset's kv_url column). Or use the single-click ZIP download.
Input
| Field | Type | Required | Description |
|---|---|---|---|
| `urls` | array | Yes | List of URLs. Each can be a webpage (HTML is parsed for images) or a direct image link. Mix freely. |
| `mode` | enum | No | `auto` (recommended, detects by extension), `page` (force HTML parse), or `direct` (force image URL). |
| `includeSrcset` | boolean | No | Discover images from `srcset`, `<picture><source>`, and lazy `data-src`. Default true. |
| `includeOgTags` | boolean | No | Discover Open Graph and Twitter Card images. Default true. |
| `minWidth` | integer | No | Skip images narrower than this (pixels). Default 0. |
| `minHeight` | integer | No | Skip images shorter than this (pixels). Default 0. |
| `minSizeBytes` | integer | No | Skip images smaller than this (bytes). Filters tracking pixels. Default 0. |
| `maxImagesPerUrl` | integer | No | Cap images per source URL. Default 1000. |
| `maxUrls` | integer | No | Cap total URLs processed. Default 10000. |
| `dedupByHash` | boolean | No | Compute SHA-256 of each image body and skip duplicates. Default true. |
| `stripExif` | boolean | No | Re-encode JPEGs without EXIF metadata. Default false. |
| `convertFormat` | enum | No | `none`, `webp-to-png`, or `png-to-jpg`. Default `none`. |
| `filenamePattern` | string | No | Templated filename using `{slug}`, `{hash}`, `{ext}`, `{idx}`, `{source}`. Default `{slug}-{hash}.{ext}`. |
| `outputFormat` | array | No | `dataset` (always), `kv-store` (binaries), `zip` (single archive), `zipPerUrl` (one ZIP per source), `s3` (upload to bucket), `webhook` (POST summary on completion). |
| `s3Bucket` | string | No | Required when `outputFormat` includes `s3`. Uses standard `AWS_*` env vars for credentials. |
| `webhookUrl` | string | No | URL to receive a JSON run summary on completion. |
| `maxConcurrency` | integer | No | Max parallel image downloads. Default 10. |
| `downloadTimeoutMs` | integer | No | Per-image fetch timeout in milliseconds. Default 15000. |
| `imageCheckMaxRetries` | integer | No | Retries per failed image. Default 3. |
| `proxyConfiguration` | object | No | Optional proxy. Default off. Use residential if source sites are hotlink-protected. |
| `failFast` | boolean | No | Stop on first error. Default false. |
| `debugLogging` | boolean | No | Verbose per-image tracing. Default false. |
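As an illustration of how the fields above fit together, the sketch below assembles a run input from the defaults stated in the table and enforces the one cross-field rule the table documents (`s3Bucket` is required when `outputFormat` includes `s3`). The helper `build_run_input` and the constant `DOCUMENTED_DEFAULTS` are hypothetical names used for this example only; fields whose default is not stated in the table (`mode`, `outputFormat`) are deliberately left out of the defaults.

```python
from typing import Any, Dict, List

# Defaults copied from the input table; fields without a stated default are omitted.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "includeSrcset": True,
    "includeOgTags": True,
    "minWidth": 0,
    "minHeight": 0,
    "minSizeBytes": 0,
    "maxImagesPerUrl": 1000,
    "maxUrls": 10000,
    "dedupByHash": True,
    "stripExif": False,
    "convertFormat": "none",
    "filenamePattern": "{slug}-{hash}.{ext}",
    "maxConcurrency": 10,
    "downloadTimeoutMs": 15000,
    "imageCheckMaxRetries": 3,
    "failFast": False,
    "debugLogging": False,
}


def build_run_input(urls: List[str], **overrides: Any) -> Dict[str, Any]:
    """Merge documented defaults with caller overrides and check the s3 rule."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    run_input: Dict[str, Any] = {"urls": list(urls), **DOCUMENTED_DEFAULTS, **overrides}
    if "s3" in run_input.get("outputFormat", []) and not run_input.get("s3Bucket"):
        raise ValueError("s3Bucket is required when outputFormat includes 's3'")
    return run_input


run_input = build_run_input(
    ["https://example.com/gallery"],
    minWidth=400,
    outputFormat=["dataset", "kv-store", "zip"],
)
print(run_input["minWidth"], run_input["maxConcurrency"])
```

The resulting dictionary can be serialized to JSON and pasted into the Actor's input editor, or passed as the run input when starting the Actor through an API client.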
Output
The Actor pushes one row to the dataset per downloaded image. Binaries are written to the default key-value store under keys of the form IMG-{filename}. Use the dataset's kv_url column to download each binary.
{
  "filename": "picsum-photos-800x600-a1b2c3d4e5f67890.jpg",
  "source_url": "https://example.com/gallery",
  "image_url": "https://picsum.photos/800/600.jpg",
  "kv_store_key": "IMG-picsum-photos-800x600-a1b2c3d4e5f67890.jpg",
  "kv_url": "https://api.apify.com/v2/key-value-stores/abc/records/IMG-picsum-photos-800x600-a1b2c3d4e5f67890.jpg",
  "content_type": "image/jpeg",
  "size_bytes": 54321,
  "width": 800,
  "height": 600,
  "format": "jpeg",
  "sha256": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
  "is_duplicate": false,
  "exif_stripped": false,
  "from_srcset": true,
  "from_picture_source": false,
  "from_og_tag": false,
  "from_twitter_tag": false,
  "from_data_attr": false,
  "from_direct_url": false,
  "downloaded_at": "2026-06-20T12:34:56.000Z",
  "duration_ms": 423,
  "http_status": 200,
  "error": null
}
You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
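Once the dataset is exported as JSON, the rows can be filtered on the `error` and `is_duplicate` fields shown in the sample above. The snippet below is a small sketch of that post-processing; the function name `usable_rows` and the two-row export are made up for illustration.

```python
import json
from typing import Any, Dict, List


def usable_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep rows that downloaded without error and are not flagged as duplicates."""
    return [
        row for row in rows
        if row.get("error") is None and not row.get("is_duplicate", False)
    ]


# Hypothetical two-row export; a real export carries every field from the sample row.
export = json.loads("""
[
  {"filename": "a.jpg", "size_bytes": 54321, "is_duplicate": false, "error": null},
  {"filename": "b.jpg", "size_bytes": 0, "is_duplicate": false, "error": "404"}
]
""")

kept = usable_rows(export)
print([row["filename"] for row in kept], sum(row["size_bytes"] for row in kept))
```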
Output data fields
| Field | Description |
|---|---|
| `filename` | Final filename (per `filenamePattern`). |
| `source_url` | The page URL the image was discovered on (or its direct URL). |
| `image_url` | Final resolved image URL (after srcset expansion, redirects). |
| `kv_store_key` | Key in the run's key-value store (`IMG-...`). |
| `kv_url` | Signed download URL for the binary (24-hour default). |
| `content_type` | MIME type (e.g. `image/jpeg`, `image/webp`). |
| `size_bytes` | Downloaded size in bytes. |
| `width` | Image width in pixels (from sharp metadata). |
| `height` | Image height in pixels (from sharp metadata). |
| `format` | Normalized format: jpeg, png, webp, gif, svg, avif, bmp, ico, other. |
| `sha256` | Content hash (when `dedupByHash=true`). |
| `is_duplicate` | True if hash matched a previously-seen image in this run. |
| `exif_stripped` | True if JPEG was re-encoded to remove EXIF. |
| `from_srcset` | True if discovered via srcset / picture / data-srcset. |
| `from_picture_source` | True if discovered via `<picture><source>`. |
| `from_og_tag` | True if discovered via `<meta og:image>`. |
| `from_twitter_tag` | True if discovered via `<meta twitter:image>`. |
| `from_data_attr` | True if discovered via lazy data-src / data-srcset. |
| `from_direct_url` | True if the URL was treated as a direct image (mode=direct/auto). |
| `downloaded_at` | ISO timestamp of the download. |
| `duration_ms` | Time to fetch + process, in milliseconds. |
| `http_status` | HTTP response code (0 on network error). |
| `error` | Per-image error string (404, timeout, below-min-size-N, etc.) or null. |
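The `sha256` and `is_duplicate` fields can be reproduced locally with a few lines of standard-library code. The sketch below illustrates the general technique of content-hash dedup within one run; it is not the Actor's own code, and `tag_duplicates` is a hypothetical name.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def tag_duplicates(images: Iterable[Tuple[str, bytes]]) -> List[Dict[str, object]]:
    """Hash each image body and flag bodies already seen earlier in the same run."""
    seen: Set[str] = set()
    rows: List[Dict[str, object]] = []
    for image_url, body in images:
        digest = hashlib.sha256(body).hexdigest()
        rows.append({
            "image_url": image_url,
            "sha256": digest,
            "is_duplicate": digest in seen,
        })
        seen.add(digest)
    return rows


rows = tag_duplicates([
    ("https://example.com/a.jpg", b"same-bytes"),
    ("https://example.com/b.jpg", b"same-bytes"),
    ("https://example.com/c.jpg", b"other-bytes"),
])
print([row["is_duplicate"] for row in rows])
```

Because the hash covers the response body rather than the URL, the same image served from two different URLs is still detected as a duplicate.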
Pricing
$2.00 per 1,000 results. The first 50 results of every run are free. There is no monthly fee and no proxy surcharge.
| Volume | What you pay |
|---|---|
| 50 images (free trial) | $0.00 |
| 1,000 images | $2.00 |
| 10,000 images | $20.00 |
| 100,000 images | $200.00 |
For comparison, the next-most-popular bulk image downloader on the Store (onescales/bulk-image-downloader) charges $7.00 per 1,000 URLs and only ships image bytes (no width, no height, no hash, no format). At $2.00 per 1,000, this Actor costs about 70% less and returns a much richer metadata schema.
For scheduled or standby runs, pricing drops to $1.00 per 1,000 results (50% off). Volume runs of more than 50,000 images are eligible for $1.50 per 1,000.
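The pricing rules above reduce to simple arithmetic, sketched below. The function `estimate_cost` and its parameters are hypothetical. Note one assumption: the volume table lists $2.00 for 1,000 images, which matches `free_results=0`, so whether the 50 free results are deducted from larger runs should be confirmed against an actual invoice.

```python
from decimal import Decimal


def estimate_cost(results: int, rate_per_1000: str = "2.00", free_results: int = 50) -> Decimal:
    """Estimate the run cost in USD, rounded to cents."""
    billable = max(0, results - free_results)
    cost = Decimal(billable) * Decimal(rate_per_1000) / Decimal(1000)
    return cost.quantize(Decimal("0.01"))


print(estimate_cost(50))                       # within the free allowance
print(estimate_cost(10_000, free_results=0))   # standard rate, table-style figure
print(estimate_cost(100_000, "1.50", 0))       # volume rate for runs above 50,000 images
```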
Tips and advanced options
- Set `includeSrcset` to false if you only want the page's primary images. This skips lazy `data-src` and responsive variants, which is faster on heavy pages.
- Use `minSizeBytes` to filter tracking pixels. A typical tracking pixel is under 1 KB. Set `minSizeBytes: 2000` to skip them.
- Use `minWidth` and `minHeight` to focus on useful images. Set `minWidth: 400` to skip thumbnails and avatars.
- Pick the right output mode. `zip` for a single archive, `zipPerUrl` to keep source pages separated, `s3` to push directly to your training bucket.
- Pair with a catalog scraper. Run one of our catalog scrapers (REI, IndiaMART, eBay) first, then feed the image URLs to this Actor for a complete e-commerce dataset.
- Schedule weekly runs to refresh your image corpus. Most product catalogs update slowly; daily is overkill.
- Use SHA-256 dedup across runs. Hashes are stable, so a daily run that re-discovers the same images will mark them as `is_duplicate: true` and skip the KV write.
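The size and dimension filters from the tips above combine as a simple conjunction: an image is kept only if it meets every threshold. The sketch below shows that logic with a hypothetical `passes_filters` helper; the parameter names mirror the `minWidth`, `minHeight`, and `minSizeBytes` inputs.

```python
def passes_filters(
    width: int,
    height: int,
    size_bytes: int,
    min_width: int = 0,
    min_height: int = 0,
    min_size_bytes: int = 0,
) -> bool:
    """Return True when the image meets all three minimum thresholds."""
    return (
        width >= min_width
        and height >= min_height
        and size_bytes >= min_size_bytes
    )


# A 1x1 tracking pixel of 43 bytes is skipped with minSizeBytes: 2000.
print(passes_filters(1, 1, 43, min_size_bytes=2000))
# An 800x600 photo passes minWidth: 400 and minSizeBytes: 2000.
print(passes_filters(800, 600, 54321, min_width=400, min_size_bytes=2000))
```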
FAQ
Is this Actor legal to use?
The Actor downloads images that are publicly accessible. You are responsible for ensuring your use case complies with the source site's Terms of Service and applicable copyright laws. Do not use the Actor to bypass access controls, scrape private content, or violate copyright.
Why does it work on any site?
The Actor is generic. It fetches the URL you give it, parses the HTML for image tags, and downloads the images it finds. There is no per-site configuration.
Does it execute JavaScript?
No. Single-page apps that render images via React/Vue hydration will return an empty image list. If your target site is a SPA, use a Playwright-based scraper first to get the image URLs, then pass them to this Actor with mode: 'direct'.
Do I need a proxy?
No. Most public sites serve images to any client, so the default (useApifyProxy: false) is usually sufficient. If your source site is hotlink-protected, opt in to a residential proxy via the proxyConfiguration field.
What is the largest image it can handle?
Sharp auto-streams, so peak memory is around 5x the size of the largest single image. A 50MB image is fine. A 500MB image may cause memory pressure on smaller container sizes.
Does the EXIF strip work on PNG or WebP?
No, EXIF strip is JPEG-only. PNG metadata stripping is a v2 feature.
How does the free trial work?
Every new Apify user gets $5 of platform credit. That is enough to run this Actor many times. The first 50 results of every run are free, so you can evaluate the data quality before spending anything.
Can I get a single ZIP of all images?
Yes. Set outputFormat: ['dataset', 'kv-store', 'zip']. The ZIP is written to OUT-images.zip and is also linked in the dataset summary.
Can I push directly to S3?
Yes. Set outputFormat: ['dataset', 's3'], fill in s3Bucket, and set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION as Apify Secrets. Each image uploads to s3://{bucket}/images/{filename}.
Can I get a webhook on completion?
Yes. Set outputFormat: ['dataset', 'webhook'] and fill in webhookUrl. The Actor POSTs a JSON summary with run stats (counts, errors, total size) to the URL when the run finishes.
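On the receiving side, a webhook endpoint only needs to accept a POST and decode the JSON body. The sketch below uses Python's standard `http.server` module. The exact field names of the summary are not specified in this document, so the code makes no assumption about them and simply logs whichever top-level keys arrive; `parse_summary` and `SummaryHandler` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Any, Dict


def parse_summary(body: bytes) -> Dict[str, Any]:
    """Decode the POSTed body and require a JSON object at the top level."""
    summary = json.loads(body.decode("utf-8"))
    if not isinstance(summary, dict):
        raise ValueError("expected a JSON object")
    return summary


class SummaryHandler(BaseHTTPRequestHandler):
    """Accepts the completion POST and logs the keys it contains."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            summary = parse_summary(self.rfile.read(length))
        except ValueError:
            # Covers malformed JSON, bad encoding, and non-object payloads.
            self.send_response(400)
            self.end_headers()
            return
        print("run summary keys:", sorted(summary))
        self.send_response(204)
        self.end_headers()
```

To try it locally, serve the handler with `http.server.HTTPServer(("127.0.0.1", 8080), SummaryHandler).serve_forever()` and expose the port through a tunnel so that `webhookUrl` can reach it.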
Disclaimers and support
- Disclaimer: This Actor retrieves publicly accessible images. Make sure your usage complies with the source site's terms of service and applicable copyright laws. The Actor is a generic utility and does not bypass authentication, paywalls, or access controls.
- Support: Open an issue from the Issues tab for bug reports or feature requests. Custom scrapers and integration help are available on request.