Algolia Webcrawler

jancurn/algolia-webcrawler

Crawls a website using one or more sitemaps and imports the data to an Algolia search index. The text content is identified using simple CSS selectors.

Author: Jan Čurn
  • Users: 57
  • Runs: 312

Dockerfile

FROM apify/actor-node-basic

# First, copy package.json since it affects NPM install
COPY package.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much, and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Lastly, copy remaining files and directories with the source code.
# This way, quick build will not need to reinstall packages on a simple change.
COPY . ./

# Specify how to run the source code
CMD npm start

README.md

# Algolia Webcrawler

Crawls a website using one or more sitemaps and imports the data
to an [Algolia](https://www.algolia.com) search index. The text content is identified using
simple CSS selectors.

The actor simply runs the
[algolia-webcrawler](https://www.npmjs.com/package/algolia-webcrawler)
NPM package on the Apify cloud, so you don't need to deploy it yourself.
You can easily run it via the Apify API or a scheduler.
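
For example, here is a minimal sketch that starts the actor through the Apify API
(the actor is addressed as jancurn~algolia-webcrawler; the API token and input values
are placeholders):

const https = require('https');

// Start the actor via the Apify API v2 "run actor" endpoint.
// The request body becomes the actor's INPUT.
const input = JSON.stringify({ /* algolia-webcrawler config, see below */ });
const req = https.request({
    method: 'POST',
    hostname: 'api.apify.com',
    path: '/v2/acts/jancurn~algolia-webcrawler/runs?token=<YOUR_API_TOKEN>',
    headers: { 'Content-Type': 'application/json' },
}, (res) => {
    console.log(`Actor run started, HTTP status: ${res.statusCode}`);
});
req.end(input);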

As input, the actor accepts the JSON configuration required by algolia-webcrawler.
For details, see the
[configuration options](https://www.npmjs.com/package/algolia-webcrawler#configuration-options)
in the package documentation.
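
For illustration, a minimal input could look as follows; the field names follow the
algolia-webcrawler documentation and all values below are placeholders:

{
    "app": "My website",
    "cred": {
        "appid": "YOUR_ALGOLIA_APP_ID",
        "apikey": "YOUR_ALGOLIA_API_KEY"
    },
    "indexname": "my_index",
    "sitemaps": [
        { "url": "https://www.example.com/sitemap.xml", "index": "my_index" }
    ]
}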

main.js

const fs = require('fs');
const tmp = require('tmp');
const Apify = require('apify');

// Hack to circumvent strange error exit code masking in algolia-webcrawler
// (see https://github.com/DeuxHuitHuit/algolia-webcrawler/blob/master/app.js#L29)
process.on('exit', (code) => {
    console.log('Exiting the process with code ' + code);
    process.exit(code);
});

(async function () {
    try {
        // Get input of your actor
        const input = await Apify.getValue('INPUT');
        console.log('Input fetched:');
        console.dir(input);
        
        // From algolia-webcrawler docs:
        // "At the bare minimum, you can edit config.json to set values for the following options:
        //  'app', 'cred', 'indexname' and at least one 'sitemap' object. If you have multiple sitemaps,
        //  please list them all: sub-sitemaps will not be crawled."
        if (!input || !input.app || !input.cred || !input.indexname || !input.sitemaps) {
            console.error('The input must be a JSON config file with fields as required by the algolia-webcrawler package.');
            console.error('For details, see https://www.npmjs.com/package/algolia-webcrawler');
            process.exit(33);
        }
        
        const tmpobj = tmp.fileSync({ prefix: 'algolia-input-', postfix: '.json' });
        console.log(`Writing input JSON to file ${tmpobj.name}`);
        fs.writeFileSync(tmpobj.name, JSON.stringify(input, null, 2));
        
        console.log(`Emulating command: node algolia-webcrawler --config ${tmpobj.name}`);
        process.argv[2] = '--config';
        process.argv[3] = tmpobj.name;
        // Requiring the package executes its entry script, which picks up
        // the --config argument from process.argv and starts the crawl
        require('algolia-webcrawler');
    } catch (e) {
        console.error(e.stack || e);
        process.exit(34);
    }
})();

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.3",
        "tmp": "^0.1.0",
        "algolia-webcrawler": "^3.2.0"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}