Dataset Image Downloader & Uploader

lukaskrivka/images-download-upload

Download image files from image URLs in your datasets and save them to a Zip file, a key-value store, or directly to your AWS S3 bucket.

Overview

images-download-upload is an Apify actor that can download and upload any number of image files from any data that includes image URLs. It can load the data from Apify datasets or Apify key-value stores. It can run both on the Apify platform and locally. It is built with the Apify SDK and the request npm package.

Changelog

Check CHANGELOG.md for detailed information.

Version-2 - 2019-12-02

Limits

It is better to split the downloading of more than 200k images into several runs due to memory constraints.
Keep in mind that if you don't have enough proxies, some websites can block you fairly quickly (even though images usually aren't that well protected).

Usage

If you want to run the actor on the Apify platform, open the actor's page in the library and click Try for free, which will create a new task in your account. Alternatively, you can start it directly through the API. When using public actors, you don't need to build them since everything is done by the author. You only need to provide an input and then you can run them. Keep in mind that usage is always charged to whoever runs the actor. You can also run it on a schedule or have it called by a webhook.

If, on the other hand, you want to run the actor locally, open the actor's GitHub page and clone it to your computer. See the Apify CLI documentation for how to run it locally.
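
As a rough illustration of starting the actor through the API, the sketch below prepares a POST request against the run endpoint that also appears in the Webhooks section. The helper name buildRunRequest and all placeholder values are assumptions made for this example, and the request is only built here, not sent.

```javascript
// Hypothetical helper: prepares (but does not send) a request that starts the actor.
const ACTOR_ID = 'lukaskrivka~images-download-upload';

function buildRunRequest(token, input) {
    const url = `https://api.apify.com/v2/acts/${ACTOR_ID}/runs?token=${encodeURIComponent(token)}`;
    return {
        url,
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            // The JSON request body is used as the actor input.
            body: JSON.stringify(input),
        },
    };
}

// Placeholder values; replace them with your own token and dataset ID.
const request = buildRunRequest('<YOUR_API_TOKEN>', {
    datasetId: '<YOUR_DATASET_ID>',
    pathToImageUrls: 'images',
    uploadTo: 'key-value-store',
});
console.log(request.url);
```

To actually start a run on Node.js 18 or newer, the prepared request could be passed to fetch(request.url, request.options).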

Input

Most Apify actors require a JSON input and this one is no exception. The input consists of one object with multiple options. For brevity, the options are split into categories here, but they all belong to one object.

Main Options:

datasetId: <string> Apify ID of the dataset where the data are located. Either this or storeInput has to be provided.
pathToImageUrls: <string> Path from the item to the array of image URLs or to a single image URL string. "" means the data are an array of image URLs, detail/images means it will search for images under this nested property. See Data and image paths. Default: "" (root array).
fileNameFunction: <string> Stringified function. It returns the name of the image file. See Input functions.

Input/Output options:

limit: <number> Max items to load from the dataset. Use with offset to paginate over the data (can reduce memory requirement of large loads).
offset: <number> How many items to skip from the dataset. Use with limit to paginate over the data (can reduce memory requirement of large loads).
outputTo: <string> Useful when you want to transform the data as you download images. Can be one of no-output, key-value-store or dataset. Default: dataset.
storeInput: <string> Use this if you want to load the data from a key-value store instead of a dataset. Notation: storeId-recordKey, e.g. kWdGzuXuKfYkrntWw-OUTPUT.
outputDatasetId: <string> The ID or name of the dataset you want to output the results to, e.g. my-dataset. Only relevant if outputTo is set to dataset.
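
To make the limit and offset pairing concrete, here is a minimal sketch that splits a large dataset into several runs. The helper name planRuns and the numbers are hypothetical; each returned pair would be merged into the input of one run.

```javascript
// Hypothetical helper: splits `totalItems` into { offset, limit } pairs, one per run.
function planRuns(totalItems, itemsPerRun) {
    if (!Number.isInteger(itemsPerRun) || itemsPerRun <= 0) {
        throw new Error('itemsPerRun must be a positive integer');
    }
    const runs = [];
    for (let offset = 0; offset < totalItems; offset += itemsPerRun) {
        // The last run only takes whatever items are left.
        runs.push({ offset, limit: Math.min(itemsPerRun, totalItems - offset) });
    }
    return runs;
}

console.log(planRuns(250000, 100000));
// [ { offset: 0, limit: 100000 }, { offset: 100000, limit: 100000 }, { offset: 200000, limit: 50000 } ]
```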

Image upload options

uploadTo: <string> Where you want to upload the image files. Valid options are: key-value-store, s3 or no-upload.
uploadStoreName: <string> Fill this only if uploadTo is key-value-store. Name of the key-value store where the images will be uploaded. An empty field means they will be uploaded to the default key-value store.
s3Bucket: <string> Name of the S3 bucket where you want to upload the image files. You need to set uploadTo to s3.
s3AccessKeyId: <string> Your S3 access key id. You need to set uploadTo to s3.
s3SecretAccessKey: <string> Your S3 secret access key. You need to set uploadTo to s3.
s3CheckIfAlreadyThere: <boolean> If set to true it will check your S3 bucket if the image file is already there before uploading. Reading is much cheaper than writing so this is useful to save money if you do a lot of reuploads. Default: false.

Transforming functions

preDownloadFunction: <string> Stringified function. It can help you prepare the data for the image download. For example, you can mark some items so that they are not downloaded. See Input functions.
postDownloadFunction: <string> Stringified function. It can help you process the data after the image download. For example, you can remove items whose images were not downloaded (failed for any reason). See Input functions.

Image quality check options

imageCheckMaxRetries: <number> Maximum number of retries if the image download fails. It doesn't retry on 404 or on images that are too small.
imageCheckType: <string> Sets a checker of image quality and size. none downloads everything, content-type checks whether the file has an image-like content type, and image-size checks whether the image is big enough. Default: content-type.
imageCheckMinSize: <number> Minimal size of the image in KB. Smaller images are not downloaded. Only useful if imageCheckType is set to image-size.
imageCheckMinWidth: <number> Minimal width of the image in pixels. Smaller images are not downloaded. Only useful if imageCheckType is set to image-size.
imageCheckMinHeight: <number> Minimal height of the image in pixels. Smaller images are not downloaded. Only useful if imageCheckType is set to image-size.

Miscellaneous options

proxyConfiguration: <object> Select proxies to be used. Default: { "useApifyProxy": true }.
maxConcurrency: <number> You can limit the number of parallel downloads. Useful when the target website is blocking you. Default: scales to the maximum that your memory/CPU can handle.
downloadTimeout: <number> Maximum time to wait for each image download, in milliseconds. Default: 7000.
batchSize: <number> Number of items loaded from the dataset in one batch. Each batch manages its own state. Useful for splitting runs with hundreds of thousands of images. Default: 10000.
convertWebpToPng: <boolean> If true, it will automatically convert all images in webp format to png. Be careful, setting it to true will significantly increase the size of the image files. Default: false.
stateFields: <array> Array of state fields you want to be present in the state object. Useful if you want a cleaner log or lower memory usage.
noDownloadRun: <boolean> If true, the actor will not download and upload the images. Useful for checking duplicates or transformations. Default: false.
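
For orientation, the sketch below assembles one possible input object from the options listed above. All IDs, names, credentials and numeric values are placeholders chosen for this example, not recommendations. It also shows one way to turn a function into the string form that the function options expect.

```javascript
// A regular function that is stringified before it goes into the JSON input.
const fileNameFunction = ({ url, md5 }) => md5(url);

// Example values only: IDs, bucket name and credentials are placeholders.
const input = {
    datasetId: '<YOUR_DATASET_ID>',
    pathToImageUrls: 'images',
    fileNameFunction: fileNameFunction.toString(),
    limit: 1000,
    offset: 0,
    outputTo: 'dataset',
    uploadTo: 's3',
    s3Bucket: '<YOUR_BUCKET>',
    s3AccessKeyId: '<YOUR_ACCESS_KEY_ID>',
    s3SecretAccessKey: '<YOUR_SECRET_ACCESS_KEY>',
    s3CheckIfAlreadyThere: true,
    imageCheckType: 'image-size',
    imageCheckMinSize: 10,
    imageCheckMaxRetries: 3,
    proxyConfiguration: { useApifyProxy: true },
    downloadTimeout: 7000,
};

console.log(JSON.stringify(input, null, 2));
```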

Data and image paths

The data where the image URLs are located needs to be saved on Apify storage either in key-value store or dataset. If you don't have the data already there, you can simply upload them with a single API call for key-value store or dataset.

The provided data should be an array (which is always the case for datasets). The images can be located anywhere in the nested objects as long as the location is consistent across all items. pathToImageUrls uses the object-path library to locate the images and can point either to a single image URL or to an array of image URLs.

A few examples:

Data can be just a plain array of image URLs. In this case you don't need to fill pathToImageUrls at all.

[
    "https://n.nordstrommedia.com/id/cf6c6151-4380-44aa-ad73-85e4b2140383.jpeg",
    "https://n.nordstrommedia.com/id/6c03833f-c5f1-43d8-9d20-fb29834c7798.jpeg"
]

If you scrape some e-commerce website, you will usually have items that have the images inside. In this example pathToImageUrls would be images.

[{
  "title": "wide sleeved blouse",
  "price": 790,
  "url": "https://www.farfetch.com/shopping/women/rosetta-getty-wide-sleeved-blouse-item-12997948.aspx",
  "images": [
    "https://cdn-images.farfetch-contents.com/12/99/79/48/12997948_13710943_1000.jpg",
    "https://cdn-images.farfetch-contents.com/12/99/79/48/12997948_13710944_1000.jpg",
    "https://cdn-images.farfetch-contents.com/12/99/79/48/12997948_13710945_1000.jpg"
  ]
},
{
  "title": "Nagoya jumpsuit",
  "price": 996,
  "url": "https://www.farfetch.com/shopping/women/le-kasha-nagoya-jumpsuit-item-12534697.aspx",
  "images": [
    "https://cdn-images.farfetch-contents.com/12/53/46/97/12534697_11885527_1000.jpg",
    "https://cdn-images.farfetch-contents.com/12/53/46/97/12534697_11885539_1000.jpg",
    "https://cdn-images.farfetch-contents.com/12/53/46/97/12534697_11885553_1000.jpg"
  ]
}]

Image URLs can also be deeply nested. In this case it is also just a single URL instead of an array, so pathToImageUrls will be images.0.src.

[{
  "retailer": "walmart",
  "url": "https://www.walmart.com/ip/Pull-On-Treggings/462482210",
  "title": "Pull-On Treggings",
  "retailPrice": 40,
  "images": [
    {
      "src": "https://i5.walmartimages.com/asr/88dcf47d-052b-4a06-815f-82c071ca2e50_1.ea5c31e512197ac22b3c7e7c1959aa84.jpeg?odnHeight=450&odnWidth=450&odnBg=FFFFFF"
    }
  ]
}]
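
To show how a dotted path such as images.0.src maps onto an item, here is a simplified stand-in for the lookup. It is a sketch for illustration only: the actor relies on the object-path library, and the helper names getByPath and collectImageUrls as well as the example.com URLs are made up for this example.

```javascript
// Minimal stand-in for a dotted-path lookup (not the object-path library itself).
function getByPath(item, path) {
    if (path === '') return item;
    return path.split('.').reduce(
        (value, key) => (value === null || value === undefined ? undefined : value[key]),
        item,
    );
}

// Normalizes the lookup result to an array of URLs, whether the path
// points to a single URL string or to an array of URLs.
function collectImageUrls(item, pathToImageUrls) {
    const found = getByPath(item, pathToImageUrls);
    if (Array.isArray(found)) return found;
    return typeof found === 'string' ? [found] : [];
}

const nested = { title: 'Pull-On Treggings', images: [{ src: 'https://example.com/a.jpeg' }] };
console.log(collectImageUrls(nested, 'images.0.src'));
console.log(collectImageUrls({ images: ['https://example.com/b.jpg'] }, 'images'));
console.log(collectImageUrls('https://example.com/c.jpg', ''));
```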

Input functions

For more advanced data preparation and post-processing, you can use any of the 3 input functions. Let's look at each of them and their use-cases.

fileNameFunction

fileNameFunction is the only one of the three that is always executed and has a default form. It names each image file no matter where it is stored.

It receives an object as an argument with the following properties, which you can (but don't need to) use. They should cover all use-cases for filename creation:

url: <string> URL of the image.
md5: <function> Simple function that takes a string and produces a hash.
state: <object> Reference to the entire state object.
item: <object> The item object where the image URL is located.
iterationIndex: <number> Index of the current iteration (batch). Look at internals for more info. Starts at 0.
indexInArray: <number> If the images were inside an array, this is the index of the current image in the array.
input: <object> Original input of the actor.

By default fileNameFunction simply produces a hash of the image URL:

({ url, md5 }) => md5(url)

So your image file would be named something like 78e731027d8fd50ed642340b7c9a63b3.

Example use-cases: Create a folder on S3 and simply use index numbers as filenames:

({ url, md5, state }) => `images/${state[url].imageIndex}`

A more complicated filename that depends on other attributes of the item:

({ item }) => `${item.retailer}_${item.retailerProductId}_${item.color}.jpg`
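
As a further sketch, a filename function could keep the file extension found in the URL while still hashing the URL itself. The md5 helper defined here is only a local stand-in (assumed to return a hex digest) so the function can be tried outside the actor, and the example.com URL is made up. Whether the global URL class is available to a stringified function inside the actor is also an assumption.

```javascript
const crypto = require('crypto');

// Local stand-in for the `md5` helper the actor passes in (assumed to return a hex digest).
const md5 = (text) => crypto.createHash('md5').update(text).digest('hex');

// Names the file by URL hash and keeps the extension found in the URL path, if any.
const fileNameFunction = ({ url, md5 }) => {
    const { pathname } = new URL(url);
    const match = pathname.match(/\.([a-z0-9]+)$/i);
    return match ? `${md5(url)}.${match[1].toLowerCase()}` : md5(url);
};

console.log(fileNameFunction({ url: 'https://example.com/photos/cat.JPG?width=450', md5 }));
```

The query string is ignored when looking for the extension because only the path part of the URL is inspected.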

preDownloadFunction

preDownloadFunction is useful when you need to process the data before downloading them. You can get rid of items that are corrupted or not interesting.

It receives an object as an argument with the following properties, which you can (but don't need to) use:

data: <array> Initial data loaded from the dataset or key-value store you provided.
iterationIndex: <number> Index of the current iteration (batch). Look at internals for more info. Starts at 0.
input: <object> Original input of the actor.

skipDownload: If you add a skipDownload: true property to any item, its images won't be downloaded. The data will stay as they are.

Example use-cases: Do not download images of items that are not new:

({ data }) => data.map((item) => {
    // Mark every item that is not new so its images are skipped
    if (item.status !== 'NEW') {
        item.skipDownload = true;
    }
    return item;
})
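
Before stringifying a function, it can help to try it locally on a few sample items. The sketch below does that for a variant that skips items without any image URLs. The sample data and the assumption that images live under item.images are made up for illustration.

```javascript
// Marks items that have no image URLs so that the actor skips them.
const preDownloadFunction = ({ data }) => data.map((item) => {
    const hasImages = Array.isArray(item.images) && item.images.length > 0;
    return hasImages ? item : { ...item, skipDownload: true };
});

// Made-up sample data for a quick local check.
const sample = [
    { title: 'with images', images: ['https://example.com/1.jpg'] },
    { title: 'without images', images: [] },
];

const prepared = preDownloadFunction({ data: sample, iterationIndex: 0, input: {} });
console.log(prepared.map((item) => Boolean(item.skipDownload))); // [ false, true ]
```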

postDownloadFunction

postDownloadFunction allows you to change the data after the downloading process has finished. Its main advantage is that you know whether the images were properly downloaded.

It receives an object as an argument with the following properties, which you can (but don't need to) use:

data: <array> The data that you get from your input or that were passed on by preDownloadFunction if you specified it.
state: <object> State object that has image URLs of the current batch as keys and their info as values. Look below for more details about the state object.
fileNameFunction: <function> Filename function that you specified or its default implementation.
md5: <function> Simple function that takes a string and produces a hash.
iterationIndex: <number> Index of the current iteration (batch). Look at internals for more info. Starts at 0.
input: <object> Original input of the actor.

Example use-cases: Remove all image URLs that were not properly downloaded/uploaded. If the item has no downloaded/uploaded image, remove it completely. The download can be hard blocked by the website (even after multiple retries), but it can also fail a check you configured, e.g. because the image is too small.

({ data, state, fileNameFunction, md5 }) => {
    // We go over all the items and build a new array of processed items
    return data.reduce((newData, item) => {
        // We keep only the images that were downloaded/uploaded
        const downloadedImages = item.images.filter((imageUrl) => {
            return state[imageUrl] && state[imageUrl].imageUploaded;
        });

        // If there is no downloaded image, we remove the item from the data
        if (downloadedImages.length === 0) {
            return newData;
        }

        // At the end we assign only properly downloaded/uploaded images and pass the item to our processed data
        return newData.concat({ ...item, images: downloadedImages });
    }, []);
}
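
Another possible use, sketched under assumptions, is to record the stored file name of every uploaded image on its item. The example assumes that item.images is an array of URLs and that fileNameFunction accepts the same argument shape as described in the fileNameFunction section. The storedImages property, the stand-in md5 and the sample data are hypothetical.

```javascript
// Adds a hypothetical `storedImages` property with the file names of uploaded images.
const postDownloadFunction = ({ data, state, fileNameFunction, md5 }) => data.map((item) => {
    const storedImages = item.images
        .filter((url) => state[url] && state[url].imageUploaded)
        .map((url) => fileNameFunction({ url, md5, state, item }));
    return { ...item, storedImages };
});

// Local check with made-up state and a stand-in hash function.
const md5 = (text) => `hash${text.length}`;
const state = {
    'https://example.com/a.jpg': { imageUploaded: true },
    'https://example.com/bb.jpg': { imageUploaded: false },
};
const data = [{ title: 'demo', images: Object.keys(state) }];

const [processed] = postDownloadFunction({
    data,
    state,
    md5,
    fileNameFunction: ({ url, md5 }) => md5(url),
});
console.log(processed.storedImages);
```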

State

The actor processes the input data in batches to lower memory needs. The default batch size is 10000 items. Each batch has its own data and state and the data are fully processed before the next batch starts to get processed.

The state is an object whose keys are image URLs. Its values depend on whether the image URL has been processed or not. Initially the images are loaded just with indexes like this:

{
  "https://images-na.ssl-images-amazon.com/images/I/716chGzGflL._UL1500_.jpg": {
    "itemIndex": 328,
    "imageIndex": 1982
  },
  "https://images-na.ssl-images-amazon.com/images/I/81ySn0IS0zL._UL1500_.jpg": {
    "itemIndex": 328,
    "imageIndex": 1983
  },
  "https://images-na.ssl-images-amazon.com/images/I/71plznRyJ9L._UL1500_.jpg": {
    "itemIndex": 328,
    "imageIndex": 1984
  }
}

After download/upload the state has much richer information that you can use in postDownloadFunction to determine what to do next.

{
    "https://i.ebayimg.com/images/g/FDgAAOSwJd1b5NKF/s-l1600.jpg": {
        "imageIndex": 0,
        "itemIndex": 0,
        "duplicateIndexes": [
            43,
            46,
            49
        ],
        "imageUploaded": true,
        "errors": [],
        "retryCount": 0,
        "contentType": "image/jpeg",
        "sizes": {
            "sizeInKB": 346
        },
        "time": {
            "downloading": 1959,
            "processing": 0,
            "uploading": 9
        }
    },
    ...
}
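
As an illustration of working with this richer state, the sketch below counts uploaded and not uploaded images and adds up the duplicate and downloading-time fields. The helper name summarizeState and the second sample entry are made up. The time values are summed as reported, without assuming a unit.

```javascript
// Hypothetical helper: summarizes a state object of the shape shown above.
function summarizeState(state) {
    const summary = { uploaded: 0, notUploaded: 0, duplicates: 0, downloadingTime: 0 };
    for (const info of Object.values(state)) {
        if (info.imageUploaded) summary.uploaded += 1;
        else summary.notUploaded += 1;
        // Fields may be missing on entries that were not processed yet.
        summary.duplicates += (info.duplicateIndexes || []).length;
        summary.downloadingTime += (info.time && info.time.downloading) || 0;
    }
    return summary;
}

const sampleState = {
    'https://example.com/ok.jpg': {
        imageUploaded: true,
        duplicateIndexes: [43, 46, 49],
        time: { downloading: 1959, processing: 0, uploading: 9 },
    },
    'https://example.com/pending.jpg': { itemIndex: 1, imageIndex: 1 },
};

console.log(summarizeState(sampleState));
// { uploaded: 1, notUploaded: 1, duplicates: 3, downloadingTime: 1959 }
```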

Webhooks

Very often you want to run an image download/upload after every run of your scraping/automation actor. Webhooks are the solution for this. The default datasetId will be passed automatically to this actor's run, so you don't need to set it up in the payload template (internally the actor transforms the resource.defaultDatasetId from the webhook into just datasetId for its own input).

The webhook from your scraping/automation run can call the Images Download & Upload actor either directly or as a task. If you call the actor directly, you have to fill in the payload template with the appropriate input and add this as the URL: https://api.apify.com/v2/acts/lukaskrivka~images-download-upload/runs?token=<YOUR_API_TOKEN> Be aware that this is dangerous: if you don't specify an exact version, your integration will break when the actor's author updates it. Use tasks for webhooks instead!

I strongly recommend creating a task with a predefined input that does not change between runs, since the only changing part is usually datasetId. You will not need to fill in the payload template and your webhook URL will then look like this: https://api.apify.com/v2/actor-tasks/<YOUR-TASK-ID>/runs?token=<YOUR_API_TOKEN>
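
The datasetId handling described above can be pictured with the following sketch. It is an illustration only and not the actor's source code: the function name resolveInput, the sample payload and the rule that an explicitly set datasetId wins are all assumptions.

```javascript
// Illustration only: copies resource.defaultDatasetId from a webhook payload
// into datasetId when the input does not already set one.
function resolveInput(rawInput) {
    const { resource, ...rest } = rawInput;
    if (!rest.datasetId && resource && resource.defaultDatasetId) {
        return { ...rest, datasetId: resource.defaultDatasetId };
    }
    return rest;
}

// Made-up payload in the shape a webhook might deliver.
const webhookPayload = {
    resource: { defaultDatasetId: 'abc123' },
    pathToImageUrls: 'images',
};
console.log(resolveInput(webhookPayload));
// { pathToImageUrls: 'images', datasetId: 'abc123' }
```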

Developer

Lukáš Křivka

Actor metrics

26 monthly users
10 stars
99.5% runs succeeded
8.8 days response time
Created in Nov 2018
Modified about 1 month ago

Categories

Developer tools

Automation

Integrations
