The Indexer crawls a website using Puppeteer (headless Chrome) and indexes the selected pages into an Algolia index. It is designed to run as an Apify actor.

Usage

You can find instructions on how to run it in the Apify cloud on its Apify Store page. If you want to run it in your environment, you can use the Apify CLI.

Input

The input of the actor is JSON with the following parameters.

Field	Type	Description
algoliaAppId	String	Your Algolia Application ID
algoliaApiKey	String	Your Algolia API key
algoliaIndexName	String	Your Algolia index name
crawlerName	String	Crawler name. The actor adds, updates, and removes pages in the index under this name, so a single index can hold pages from more than one website.
startUrls	Array	URLs where the crawler starts crawling.
selectors	Array	Selectors of the elements whose text content you want to index. The key is the name of the attribute and the value is the CSS selector.
waitForElement	String	Selector of an element to wait for on each page.
additionalPageAttrs	Object	Additional attributes you want to attach to each record in the index.
skipIndexUpdate	Boolean	Option to switch off updating the Algolia index.
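
As an illustration of the table above, the following sketch builds an input object in Python and checks each field against the type listed in the table. All values are placeholders. The shape of the `startUrls` entries (objects with a `url` key) and of the `selectors` entries (objects with `key` and `value`) is an assumption based on the field descriptions, not something this README confirms. The helper `type_mismatches` is hypothetical and not part of the actor.

```python
import json

# Example input using the fields from the table above. All values are placeholders.
actor_input = {
    "algoliaAppId": "YOUR_APP_ID",
    "algoliaApiKey": "YOUR_API_KEY",
    "algoliaIndexName": "my-index",
    "crawlerName": "example-docs",
    # Assumption: each start URL is an object with a "url" key.
    "startUrls": [{"url": "https://example.com/docs"}],
    # Assumption: each entry pairs an attribute name (key) with a CSS selector (value).
    "selectors": [
        {"key": "title", "value": "h1"},
        {"key": "body", "value": "article"},
    ],
    "waitForElement": "article",
    "additionalPageAttrs": {"site": "example-docs"},
    # Switches off updating the Algolia index for this run.
    "skipIndexUpdate": True,
}

# Python types corresponding to the Type column of the table.
EXPECTED_TYPES = {
    "algoliaAppId": str,
    "algoliaApiKey": str,
    "algoliaIndexName": str,
    "crawlerName": str,
    "startUrls": list,
    "selectors": list,
    "waitForElement": str,
    "additionalPageAttrs": dict,
    "skipIndexUpdate": bool,
}


def type_mismatches(data):
    """Return the names of present fields whose value has the wrong type."""
    return [
        name
        for name, expected in EXPECTED_TYPES.items()
        if name in data and not isinstance(data[name], expected)
    ]


# Serialized form that can be passed to the actor as its JSON input.
input_json = json.dumps(actor_input, indent=2)
```

Setting `skipIndexUpdate` to true, as in this sketch, may be a convenient way to try out selectors without touching the Algolia index.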

Advanced

A few parameters are not shown in the UI. They change how the crawler behaves, and you can set them through the API or when running the actor in a local environment.

Field	Type	Description
pageFunction	String	Overrides the default pageFunction.
pseudoUrls	Array	Overrides the default pseudoUrls.
clickableElements	String	Overrides the default clickableElements.
keepUrlFragment	Boolean	Option to switch on enqueueing URLs together with their URL fragments.
omitSearchParamsFromUrl	Boolean	Option to switch off enqueueing URLs with their search parameters.
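
To make the last two options more concrete, here is a minimal sketch of how they could affect a URL before it is enqueued. It is one plausible reading of the descriptions above, not the actor's actual implementation. The function `normalize_url` is hypothetical, and the assumption that both options are off by default is not confirmed by this README.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, keep_url_fragment=False, omit_search_params=False):
    """Sketch of URL handling under the two flags (assumed semantics).

    keep_url_fragment: keep the "#fragment" part instead of dropping it.
    omit_search_params: drop the "?query" part.
    """
    parts = urlsplit(url)
    query = "" if omit_search_params else parts.query
    fragment = parts.fragment if keep_url_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


url = "https://example.com/docs?page=2#intro"

# Assumed default: the fragment is dropped, search parameters are kept.
default_form = normalize_url(url)

# With keepUrlFragment switched on, "#intro" makes this a distinct URL.
with_fragment = normalize_url(url, keep_url_fragment=True)

# With omitSearchParamsFromUrl switched on, "?page=2" is removed.
without_query = normalize_url(url, omit_search_params=True)
```

Under this reading, keeping fragments would suit single-page applications that route on the fragment, while omitting search parameters would help avoid enqueueing the same page many times under different query strings.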

Debug indexed pages

You can find all the pages that will be indexed in the default dataset for a specific actor run.
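
The dataset can also be read programmatically. The sketch below builds a request URL for the dataset items endpoint of the Apify API and downloads the items as JSON using only the Python standard library. The dataset ID and token are placeholders, and the helper names `dataset_items_url` and `fetch_items` are hypothetical. The structure of each returned item depends on the actor and is not described in this README.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, token=None, limit=None):
    """Build the URL for downloading items of a dataset as JSON."""
    params = {"format": "json", "clean": "true"}
    if limit is not None:
        params["limit"] = str(limit)
    if token:
        # Needed only when the dataset is not publicly readable.
        params["token"] = token
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id, token=None, limit=None):
    """Download dataset items; each item should correspond to one stored page."""
    with urlopen(dataset_items_url(dataset_id, token, limit)) as response:
        return json.load(response)


# Placeholder ID; use the default dataset ID of your own actor run.
example_url = dataset_items_url("YOUR_DATASET_ID", limit=10)
```

Calling `fetch_items("YOUR_DATASET_ID", token="YOUR_API_TOKEN", limit=10)` with real values should return a list of page records to inspect before they reach the index.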
