Algolia.com Scraper

jancurn/algolia-webcrawler

Crawls a website using one or more sitemaps and imports the data into an Algolia search index. The text content is identified using simple CSS selectors. The actor simply runs the algolia-webcrawler NPM package (https://www.npmjs.com/package/algolia-webcrawler) on the Apify cloud, so you don't need to deploy it yourself, and you can run it using the API or the scheduler. As input, the actor accepts the JSON configuration required by algolia-webcrawler. For details, see https://www.npmjs.com/package/algolia-webcrawler#configuration-options

Used 11 times

To run the actor, you'll need a free Apify account. Open the actor in the Apify console, modify the actor input configuration, click Run, and get your results.

API

To run the actor from your code, send an HTTP POST request to the following API endpoint:

https://api.apify.com/v2/acts/jancurn~algolia-webcrawler/runs?token=<YOUR_API_TOKEN>

The POST payload is passed to the actor as its INPUT, together with the Content-Type header (usually application/json). The actor is started with the default options; you can override them using URL query parameters.

Example
curl "https://api.apify.com/v2/acts/jancurn~algolia-webcrawler/runs?token=<YOUR_API_TOKEN>" \
-d '{
    "app": "My app",
    "cred": {
        "appid": "APP ID",
        "apikey": "API KEY"
    },
    "delayBetweenRequests": 0,
    "oldentries": 0,
    "maxRecordSize": 10000,
    "index": {
        "name": "INDEX NAME",
        "settings": null,
        "attributesToIndex": null,
        "attributesForFaceting": null
    },
    "sitemaps": [
        {
            "url": "http://www.example.com/sitemap.txt",
            "lang": "EN"
        }
    ],
    "http": {
        "auth": null
    },
    "selectors": {
        "title": null,
        "description": null,
        "image": null,
        "text": null,
        "some-key": ".some-key"
    },
    "exclusions": {
        "text": ".some-css-selector",
        "my-key": ".my-key"
    },
    "formatters": {
        "title": "The string to remove from the title of the page. Can also be an array of strings",
        "my-key": ".my-key"
    },
    "types": {
        "integer": null,
        "float": null,
        "json": null
    },
    "defaults": {
        "key1": null,
        "key2": null
    },
    "plugins": [],
    "blacklist": [
        "http://www.example.com/skip-this-url"
    ]
}' \
-H 'Content-Type: application/json' \
-X POST

To use the API, replace <YOUR_API_TOKEN> with the API token of your Apify account.
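
If you call the endpoint regularly, it may be more convenient to keep the token in an environment variable and the configuration in a file instead of pasting both into the command. The following is a minimal sketch of that idea, not an official client. The names APIFY_TOKEN, input.json, build_run_url and start_run are assumptions made for this illustration, and the memory and timeout query parameters are assumed to be among the options the run endpoint accepts; check the API reference before relying on them.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, APIFY_TOKEN and input.json are illustrative assumptions.

ACTOR_ID='jancurn~algolia-webcrawler'

# Build the run URL. The optional 2nd and 3rd arguments are appended as the
# memory and timeout query parameters (assumed to be supported by the endpoint).
build_run_url() {
    local token="$1" memory="${2:-}" timeout="${3:-}"
    local url="https://api.apify.com/v2/acts/${ACTOR_ID}/runs?token=${token}"
    if [ -n "$memory" ]; then
        url="${url}&memory=${memory}"
    fi
    if [ -n "$timeout" ]; then
        url="${url}&timeout=${timeout}"
    fi
    printf '%s\n' "$url"
}

# POST the JSON configuration stored in a file as the actor INPUT.
# Usage: start_run <input-file> [memory] [timeout]
start_run() {
    local input_file="$1"
    if [ -z "${APIFY_TOKEN:-}" ]; then
        echo 'APIFY_TOKEN is not set' >&2
        return 1
    fi
    local url
    url="$(build_run_url "$APIFY_TOKEN" "${2:-}" "${3:-}")"
    curl --fail --silent --show-error \
        -X POST \
        -H 'Content-Type: application/json' \
        --data-binary "@${input_file}" \
        "$url"
}
```

With the JSON configuration from the example saved as input.json and APIFY_TOKEN exported in your shell, running start_run input.json would send the same request as the curl command above; passing two more arguments adds the assumed memory and timeout overrides to the URL.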

For more information, view the list of actor API endpoints or the full API reference.

Scheduler

Do you need to run the actor periodically? You can easily create a schedule that will run the actor any time you want.