Algolia Webcrawler

jancurn/algolia-webcrawler

Crawls a website using one or more sitemaps and imports the data to an Algolia search index. The text content is identified using simple CSS selectors.

Author: Jan Čurn
  • Users: 57
  • Runs: 312

Dockerfile

FROM apify/actor-node-basic

# First, copy package.json since it affects NPM install
COPY package.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much, and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Lastly, copy remaining files and directories with the source code.
# This way, quick build will not need to reinstall packages on a simple change.
COPY . ./

# Specify how to run the source code
CMD npm start

README.md

# Algolia Webcrawler

Crawls a website using one or more sitemaps and imports the data
to an [Algolia](https://www.algolia.com) search index. The text content is identified using
simple CSS selectors.

The actor simply runs the
[algolia-webcrawler](https://www.npmjs.com/package/algolia-webcrawler)
NPM package on the Apify cloud, so you don't need to deploy it yourself.
You can easily run it via the Apify API or a scheduler.
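
For example, here is a minimal sketch that starts the actor through the Apify API
(the actor is addressed as jancurn~algolia-webcrawler; the API token and input values
are placeholders):

const https = require('https');

// Start the actor via the Apify API v2 "run actor" endpoint.
// The request body becomes the actor's INPUT.
const input = JSON.stringify({ /* algolia-webcrawler config, see below */ });
const req = https.request({
    method: 'POST',
    hostname: 'api.apify.com',
    path: '/v2/acts/jancurn~algolia-webcrawler/runs?token=<YOUR_API_TOKEN>',
    headers: { 'Content-Type': 'application/json' },
}, (res) => {
    console.log(`Actor run started, HTTP status: ${res.statusCode}`);
});
req.end(input);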

As input, the actor accepts the JSON configuration required by algolia-webcrawler.
For details, see the
[configuration options](https://www.npmjs.com/package/algolia-webcrawler#configuration-options)
in the package documentation.
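
For illustration, a minimal input could look as follows; the field names follow the
algolia-webcrawler documentation and all values below are placeholders:

{
    "app": "My website",
    "cred": {
        "appid": "YOUR_ALGOLIA_APP_ID",
        "apikey": "YOUR_ALGOLIA_API_KEY"
    },
    "indexname": "my_index",
    "sitemaps": [
        { "url": "https://www.example.com/sitemap.xml", "index": "my_index" }
    ]
}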

main.js

const fs = require('fs');
const tmp = require('tmp');
const Apify = require('apify');

// Hack to circumvent strange error exit code masking in algolia-webcrawler
// (see https://github.com/DeuxHuitHuit/algolia-webcrawler/blob/master/app.js#L29)
process.on('exit', (code) => {
    console.log('Exiting the process with code ' + code);
    process.exit(code);
});

(async function () {
    try {
        // Get input of your actor
        const input = await Apify.getValue('INPUT');
        console.log('Input fetched:');
        console.dir(input);
        
        // From algolia-webcrawler docs:
        // "At the bare minimum, you can edit config.json to set values for the following options:
        //  'app', 'cred', 'indexname' and at least one 'sitemap' object. If you have multiple sitemaps,
        //  please list them all: sub-sitemaps will not be crawled."
        if (!input || !input.app || !input.cred || !input.indexname || !input.sitemaps) {
            console.error('The input must be a JSON config file with fields as required by the algolia-webcrawler package.');
            console.error('For details, see https://www.npmjs.com/package/algolia-webcrawler');
            process.exit(33);
        }
        
        const tmpobj = tmp.fileSync({ prefix: 'algolia-input-', postfix: '.json' });
        console.log(`Writing input JSON to file ${tmpobj.name}`);
        fs.writeFileSync(tmpobj.name, JSON.stringify(input, null, 2));
        
        console.log(`Emulating command: node algolia-webcrawler --config ${tmpobj.name}`);
        process.argv[2] = '--config';
        process.argv[3] = tmpobj.name;
        // Requiring the package executes its entry script, which picks up
        // the --config argument from process.argv and starts the crawl
        require('algolia-webcrawler');
    } catch (e) {
        console.error(e.stack || e);
        process.exit(34);
    }
})();

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.3",
        "tmp": "^0.1.0",
        "algolia-webcrawler": "^3.2.0"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}