Algolia Webcrawler

Pricing

Pay per usage

Developed by

Jan Čurn

Maintained by Community

Crawls a website using one or more sitemaps and imports the data into an Algolia search index. The text content is identified using simple CSS selectors.

0.0 (0)

4

Total users: 80

Monthly users: 1

Runs succeeded: 0%

Last modified: 4 years ago

Dockerfile

FROM apify/actor-node-basic
# First, copy package.json since it affects NPM install
COPY package.json ./
# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
&& npm install --only=prod --no-optional \
&& echo "Installed NPM packages:" \
&& npm list \
&& echo "Node.js version:" \
&& node --version \
&& echo "NPM version:" \
&& npm --version
# Lastly, copy remaining files and directories with the source code.
# This way, quick build will not need to reinstall packages on a simple change.
COPY . ./
# Specify how to run the source code
CMD npm start

main.js

const fs = require('fs');
const tmp = require('tmp');
const Apify = require('apify');

// Hack to circumvent strange error exit code masking in algolia-webcrawler
// (see https://github.com/DeuxHuitHuit/algolia-webcrawler/blob/master/app.js#L29)
process.on('exit', (code) => {
    console.log('Exiting the process with code ' + code);
    process.exit(code);
});

(async function () {
    try {
        // Get the actor input
        const input = await Apify.getValue('INPUT');
        console.log('Input fetched:');
        console.dir(input);

        // From algolia-webcrawler docs:
        // "At the bare minimum, you can edit config.json to set a values to the following options:
        // 'app', 'cred', 'indexname' and at least one 'sitemap' object. If you have multiple sitemaps,
        // please list them all: sub-sitemaps will not be crawled."
        if (!input || !input.app || !input.cred || !input.index || !input.sitemaps) {
            console.error('The input must be a JSON config file with fields as required by algolia-webcrawler package.');
            console.error('For details, see https://www.npmjs.com/package/algolia-webcrawler');
            process.exit(33);
        }

        // Write the input to a temporary JSON file that serves as the crawler config
        const tmpobj = tmp.fileSync({ prefix: 'algolia-input-', postfix: '.json' });
        console.log(`Writing input JSON to file ${tmpobj.name}`);
        fs.writeFileSync(tmpobj.name, JSON.stringify(input, null, 2));

        // Set the command-line arguments before loading the package,
        // so that it picks up the temporary config file
        console.log(`Emulating command: node algolia-webcrawler --config ${tmpobj.name}`);
        process.argv[2] = '--config';
        process.argv[3] = tmpobj.name;
        require('algolia-webcrawler');
    } catch (e) {
        console.error(e.stack || e);
        process.exit(34);
    }
})();
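
The input check in main.js only reports that something is missing, not which field. As a minimal sketch, assuming the same four top-level fields that main.js tests for ('app', 'cred', 'index', 'sitemaps'), a helper could list the missing ones. The names REQUIRED_FIELDS and findMissingFields are hypothetical and not part of the actor or of algolia-webcrawler, and the example input uses empty placeholder values rather than a real configuration.

```javascript
// Hypothetical helper, not part of the original actor.
// Mirrors the truthiness check in main.js, but reports which fields are missing.
const REQUIRED_FIELDS = ['app', 'cred', 'index', 'sitemaps'];

function findMissingFields(input) {
    // A missing or non-object input lacks every required field
    if (!input || typeof input !== 'object') return [...REQUIRED_FIELDS];
    return REQUIRED_FIELDS.filter((field) => !input[field]);
}

// Placeholder input: the nested structure of 'cred', 'index' and 'sitemaps'
// is defined by the algolia-webcrawler docs and is deliberately left empty here.
const exampleInput = { app: 'My crawler', cred: {}, index: {} };

const missing = findMissingFields(exampleInput);
if (missing.length > 0) {
    console.error(`Missing input fields: ${missing.join(', ')}`);
}
```

In the actor, such a check could replace the single condition before process.exit(33), so that the log names the fields that need to be added.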

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.3",
        "tmp": "^0.1.0",
        "algolia-webcrawler": "^3.2.0"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}