Actor picture

XMLs To Dataset

mtrunkat/xmls-to-dataset

Go to actor anytime you need to download XML files and store them in the dataset.

No credit card required

Author's avatarMarek Trunkát
  • Modified
  • Users31
  • Runs365

Dockerfile

# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY  . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

INPUT_SCHEMA.json

{
    "title": "XMLS To Dataset input",
    "description": "Enter the XML URLs you want to be downloaded.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "sources": {
            "title": "URLs of XML files",
            "type": "array",
            "description": "Enter the XML URLs you want to be downloaded.",
            "prefill": [
                {
                    "url": "https://www.w3schools.com/xml/plant_catalog.xml"
                },
                {
                    "url": "https://www.w3schools.com/xml/cd_catalog.xml"
                }
            ],
            "editor": "requestListSources"
        },
        "proxy": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Select proxies to be used to download XML files.",
            "prefill": { "useApifyProxy": true },
            "editor": "proxy"
        }
    },
    "required": ["sources", "proxy"]
}

README.md

This actor simply takes URLs of XML files hosted elsewhere, downloads them, and saves their contents into the dataset.

main.js

const Apify = require('apify');
const util = require('util')
const parseString = require('xml2js').parseString;
const _ = require('underscore');

const parseStringPromised = util.promisify(parseString);

Apify.main(async () => {
    const {
        sources,
        proxy,
    } = await Apify.getValue('INPUT');
        
    const proxyConfiguration = await Apify.createProxyConfiguration(proxy);
    const requestList = await Apify.openRequestList('urls', sources);    

    const crawler = new Apify.BasicCrawler({
        requestList,

        handleRequestFunction: async ({ request }) => {
            const { body, statusCode } =  await Apify.utils.requestAsBrowser({
                url: request.url,
                proxyUrl: proxyConfiguration.newUrl(),
            });            

            if (statusCode >= 300) throw new Error(`Request failed with statusCode=${statusCode}`);

            await Apify.pushData({
                data: await parseStringPromised(body),
                request,
            });
        },

        handleFailedRequestFunction: async ({ request }) => {
            await Apify.pushData({
                failed: true,
                request,
            });
        },
    });
    
    await crawler.run();
});

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "latest",
        "xml2js": "latest",
        "underscore": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}