# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY  . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

INPUT_SCHEMA.json

{
    "title": "XMLS To Dataset input",
    "description": "Enter the XML URLs you want to be downloaded.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "sources": {
            "title": "URLs of XML files",
            "type": "array",
            "description": "Enter the XML URLs you want to be downloaded.",
            "prefill": [
                {
                    "url": "https://www.w3schools.com/xml/plant_catalog.xml"
                },
                {
                    "url": "https://www.w3schools.com/xml/cd_catalog.xml"
                }
            ],
            "editor": "requestListSources"
        },
        "proxy": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Select proxies to be used to download XML files.",
            "prefill": { "useApifyProxy": true },
            "editor": "proxy"
        }
    },
    "required": ["sources", "proxy"]
}

main.js

1const Apify = require('apify');
2const util = require('util')
3const parseString = require('xml2js').parseString;
4const _ = require('underscore');
5
6const parseStringPromised = util.promisify(parseString);
7
8Apify.main(async () => {
9    const {
10        sources,
11        proxy,
12    } = await Apify.getValue('INPUT');
13        
14    const proxyConfiguration = await Apify.createProxyConfiguration(proxy);
15    const requestList = await Apify.openRequestList('urls', sources);    
16
17    const crawler = new Apify.BasicCrawler({
18        requestList,
19
20        handleRequestFunction: async ({ request }) => {
21            const { body, statusCode } =  await Apify.utils.requestAsBrowser({
22                url: request.url,
23                proxyUrl: proxyConfiguration.newUrl(),
24            });            
25
26            if (statusCode >= 300) throw new Error(`Request failed with statusCode=${statusCode}`);
27
28            await Apify.pushData({
29                data: await parseStringPromised(body),
30                request,
31            });
32        },
33
34        handleFailedRequestFunction: async ({ request }) => {
35            await Apify.pushData({
36                failed: true,
37                request,
38            });
39        },
40    });
41    
42    await crawler.run();
43});

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "latest",
        "xml2js": "latest",
        "underscore": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}

Pay-as-you-go API / JSON scraper

pocesar/pay-as-you-go-api-json-scraper

Scrape as pay-as-you-go any API / JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output. Enables you to follow pagination recursively from the payload without the need to visit the HTML page.

Paulo Cesar

Dataset(s) To Schema

zuzka/dataset-to-schema

Takes a Dataset ID(s) and outputs a JSON schema of the contents of the dataset into key value store.

Zuzka Pelechová

Error Messages Deduplication

petrpatek/get-debug-items-from-dataset

Filter items from the dataset with `#debug` fieldName and saves them to the dataset and deduplicates `errorMessages` so you don't have to go through all the errors.

Petr Pátek

API / JSON scraper

pocesar/json-downloader

Scrape any API / JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output. Enables you to follow pagination recursively from the payload without the need to visit the HTML page.

Paulo Cesar

540

Dataset Processor in Python

drobnikj/dataset-processor-python

This actor utilizes Python to process the dataset.

Jakub Drobník

Append to dataset

valek.josef/append-to-dataset

Utility actor that allows you to build a single large dataset from individual default datasets of other actor runs.

Josef Válek

Zip Download of KV Store

useful-tools/downloadKvStoreZip

Creates a zip file from all items in the key-value store, zips them, and downloads them in a unique file. On input, you can specify the number of the files or keep it null to download them all. The key-value store must be under your Apify account.