Cheerio Scraper


Developed and maintained by Apify

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript rendering.

Rating: 4.7 (11)
Pricing: Pay per usage


Total users: 7.9K
Monthly users: 848
Runs succeeded: >99%
Last modified: 11 days ago
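
For a quick start, the sketch below shows one way to run this Actor from your own code using the apify-client NPM package and read the results from the run's dataset. It is a minimal, hedged example: it assumes an Apify API token in the APIFY_TOKEN environment variable, and the input fields used are taken from the Actor's INPUT_SCHEMA.json listed further below.

// Minimal sketch: run Cheerio Scraper via apify-client and fetch its results.
// Assumes process.env.APIFY_TOKEN holds a valid Apify API token.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('apify/cheerio-scraper').call({
    startUrls: [{ url: 'https://crawlee.dev/js' }],
    proxyConfiguration: { useApifyProxy: true },
    // The page function is passed as a string, exactly as in the input schema.
    pageFunction: `async function pageFunction(context) {
        const { $, request } = context;
        return { url: request.url, pageTitle: $('title').first().text() };
    }`,
});

// Results are stored in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);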

.dockerignore

# configurations
.idea
# crawlee and apify storage folders
apify_storage
crawlee_storage
storage
# installed files
node_modules

CHANGELOG.md

# Change Log

## 3.0.15 (2024-10-25)

- Updated Crawlee version to v3.11.5 and SDK to v3.2.6
- Updated Node.js to v22

## 3.0.14 (2024-04-09)

- Updated Crawlee version to v3.8.0.
- Updated to use the new request queue in the scraper.

## 3.0.11 (2023-08-22)

- Updated Crawlee version to v3.5.2.
- Updated Node.js version to v18.
- Added new options:
  - **Exclude Glob Patterns** (`excludes`): Glob patterns to match links in the page that you want to exclude from being enqueued.

## 3.0 (`version-3`)

- Rewritten from the Apify SDK to Crawlee; see the [v3 migration guide](https://sdk.apify.com/docs/upgrading/upgrading-to-v3) for more details.
- Proxy usage is now required.

## 2.0 (`version-2`)

The main difference between v1 and v2 of the scrapers is the upgrade of the SDK to v2, which requires Node.js v15.10+. SDK v2 uses HTTP/2 to make requests in cheerio-scraper, and HTTP/2 support in older Node.js versions was too buggy, so we decided to drop support for them. If you need to run on an older Node.js version, use SDK v1.

Please refer to the SDK 1.0 migration guide for more details about functional changes in the SDK. SDK v2 essentially only changes the required Node.js version and has no other breaking changes.

- The deprecated `useRequestQueue` option has been removed
  - `RequestQueue` will always be used
- The deprecated `context.html` getter has been removed from `cheerio-scraper`
  - use `context.body` instead
- The deprecated `prepareRequestFunction` input option has been removed
  - use `pre/postNavigationHooks` instead
- Removed `puppeteerPool`/`autoscaledPool` from the `crawlingContext` object
  - `puppeteerPool` was replaced by `browserPool`
  - `autoscaledPool` and `browserPool` are available on the `crawler` property of the `crawlingContext` object
- The custom "Key-value store name" option in Advanced configuration is now respected; previously, the default store was always used
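
To illustrate the v2 changes above: raw HTML now comes from `context.body` instead of the removed `context.html` getter, and request tweaks that used to live in `prepareRequestFunction` belong in the Pre-navigation hooks input. The sketch below is illustrative only; the header value is an assumption, not something prescribed by the changelog.

// Page function for v2+: read the raw HTML via context.body.
async function pageFunction(context) {
    const { $, body, request, log } = context;
    const pageTitle = $('title').first().text();
    log.info('Page scraped', { url: request.url, htmlLength: body.length, pageTitle });
    return { url: request.url, pageTitle };
}

// Pre-navigation hooks input, replacing prepareRequestFunction:
[
    async (crawlingContext, requestAsBrowserOptions) => {
        // Add a custom header before the HTTP request is made (illustrative).
        crawlingContext.request.headers = {
            ...crawlingContext.request.headers,
            'accept-language': 'en-US',
        };
    },
]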

Dockerfile

FROM apify/actor-node:22 AS builder
COPY package*.json ./
RUN npm install --include=dev --audit=false
COPY . ./
RUN npm run build
FROM apify/actor-node:22
COPY --from=builder /usr/src/app/dist ./dist
COPY package*.json ./
RUN rm -rf node_modules \
&& npm --quiet set progress=false \
&& npm install --omit=dev --omit=optional \
&& echo "Installed NPM packages:" \
&& (npm list --omit=dev --all || true) \
&& echo "Node.js version:" \
&& node --version \
&& echo "NPM version:" \
&& npm --version \
&& rm -r ~/.npm
COPY . ./
ENV APIFY_DISABLE_OUTDATED_WARNING=1
CMD npm run start:prod --silent

INPUT_SCHEMA.json

{
"title": "Cheerio Scraper Input",
"type": "object",
"description": "Cheerio Scraper loads <b>Start URLs</b> using raw HTTP requests, parses the HTML using the <a href='https://cheerio.js.org' target='_blank' rel='noopener noreferrer'>Cheerio</a> library and then executes <b>Page function</b> for each page to extract data from it. To follow links and scrape additional pages, set <b>Link selector</b> with <b>Pseudo-URLs</b> and/or <b>Glob patterns</b> to specify which links to follow. Alternatively, you can manually enqueue new links in the <b>Page function</b>. For details, see the Actor's <a href='https://apify.com/apify/cheerio-scraper' target='_blank' rel='noopener'>README</a> or the <a href='https://docs.apify.com/academy/apify-scrapers/cheerio-scraper' target='_blank' rel='noopener'>Web scraping tutorial</a> in the Apify documentation.",
"schemaVersion": 1,
"properties": {
"startUrls": {
"sectionCaption": "Basic configuration",
"title": "Start URLs",
"type": "array",
"description": "A static list of URLs to scrape. <br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> section in the README.",
"prefill": [{ "url": "https://crawlee.dev/js" }],
"editor": "requestListSources"
},
"keepUrlFragments": {
"title": "URL #fragments identify unique pages",
"type": "boolean",
"description": "Indicates that URL fragments (e.g. <code>http://example.com<b>#fragment</b></code>) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.",
"default": false,
"groupCaption": "Options"
},
"respectRobotsTxtFile": {
"title": "Respect the robots.txt file",
"type": "boolean",
"description": "If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.",
"default": false,
"prefill": true
},
"globs": {
"title": "Glob Patterns",
"type": "array",
"description": "Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.",
"editor": "globs",
"default": [],
"prefill": [
{
"glob": "https://crawlee.dev/js/*/*"
}
]
},
"pseudoUrls": {
"title": "Pseudo-URLs",
"type": "array",
"description": "Specifies what kind of URLs found by the <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with <b>regular expressions</b> enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. <br><br>If <b>Pseudo-URLs</b> are omitted, the Actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.",
"editor": "pseudoUrls",
"default": [],
"prefill": []
},
"excludes": {
"title": "Exclude Glob Patterns",
"type": "array",
"description": "Glob patterns to match links in the page that you want to exclude from being enqueued.",
"editor": "globs",
"default": [],
"prefill": [
{
"glob": "/**/*.{png,jpg,jpeg,pdf}"
}
]
},
"linkSelector": {
"title": "Link selector",
"type": "string",
"description": "A CSS selector stating which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. To filter the links added to the queue, use the <b>Pseudo-URLs</b> and/or <b>Glob patterns</b> field.<br><br>If the <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README.",
"editor": "textfield",
"prefill": "a[href]"
},
"pageFunction": {
"title": "Page function",
"type": "string",
"description": "A JavaScript function that is executed for every page loaded server-side in Node.js 12. Use it to scrape data from the page, perform actions or add new URLs to the request queue.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#page-function' target='_blank' rel='noopener'>Page function</a> in README.",
"prefill": "async function pageFunction(context) {\n const { $, request, log } = context;\n\n // The \"$\" property contains the Cheerio object which is useful\n // for querying DOM elements and extracting data from them.\n const pageTitle = $('title').first().text();\n\n // The \"request\" property contains various information about the web page loaded. \n const url = request.url;\n \n // Use \"log\" object to print information to Actor log.\n log.info('Page scraped', { url, pageTitle });\n\n // Return an object with the data extracted from the page.\n // It will be stored to the resulting dataset.\n return {\n url,\n pageTitle\n };\n}",
"editor": "javascript"
},
"proxyConfiguration": {
"sectionCaption": "Proxy and HTTP configuration",
"title": "Proxy configuration",
"type": "object",
"description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
"prefill": { "useApifyProxy": true },
"default": { "useApifyProxy": true },
"editor": "proxy"
},
"proxyRotation": {
"title": "Proxy rotation",
"type": "string",
"description": "This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.",
"default": "RECOMMENDED",
"editor": "select",
"enum": ["RECOMMENDED", "PER_REQUEST", "UNTIL_FAILURE"],
"enumTitles": [
"Use recommended settings",
"Rotate proxy after each request",
"Use one proxy until failure"
]
},
"sessionPoolName": {
"title": "Session pool name",
"type": "string",
"description": "<b>Use only english alphanumeric characters dashes and underscores.</b> A session is a representation of a user. It has it's own IP and cookies which are then used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window.",
"editor": "textfield",
"minLength": 3,
"maxLength": 200,
"pattern": "[0-9A-z-]"
},
"initialCookies": {
"title": "Initial cookies",
"type": "array",
"description": "A JSON array with cookies that will be send with every HTTP request made by the Cheerio Scraper, in the format accepted by the <a href='https://www.npmjs.com/package/tough-cookie' target='_blank' rel='noopener noreferrer'>tough-cookie</a> NPM package. This option is useful for transferring a logged-in session from an external web browser. For details how to do this, read this <a href='https://help.apify.com/en/articles/1444249-log-in-to-website-by-transferring-cookies-from-web-browser-legacy' target='_blank' rel='noopener'>help article</a>.",
"default": [],
"prefill": [],
"editor": "json"
},
"additionalMimeTypes": {
"title": "Additional MIME types",
"type": "array",
"description": "A JSON array specifying additional MIME content types of web pages to support. By default, Cheerio Scraper supports the <code>text/html</code> and <code>application/xhtml+xml</code> content types, and skips all other resources. For details, see <a href='https://apify.com/apify/cheerio-scraper#content-types' target='_blank' rel='noopener'>Content types</a> in README.",
"editor": "json",
"default": [],
"prefill": []
},
"suggestResponseEncoding": {
"title": "Suggest response encoding",
"type": "string",
"description": "The scraper automatically determines response encoding from the response headers. If the headers are invalid or information is missing, malformed responses may be produced. Use the Suggest response encoding option to provide a fall-back encoding to the Scraper for cases where it could not be determined.",
"editor": "textfield"
},
"forceResponseEncoding": {
"title": "Force response encoding",
"type": "boolean",
"description": "If enabled, the suggested response encoding will be used even if a valid response encoding is provided by the target website. Use this only when you've inspected the responses thoroughly and are sure that they are the ones doing it wrong.",
"default": false
},
"ignoreSslErrors": {
"title": "Ignore SSL errors",
"type": "boolean",
"description": "If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.",
"default": false,
"groupCaption": "Security"
},
"preNavigationHooks": {
"sectionCaption": "Advanced configuration",
"title": "Pre-navigation hooks",
"type": "string",
"description": "Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `requestAsBrowserOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate.",
"prefill": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"requestAsBrowserOptions\" which are passed to the `requestAsBrowser()`\n// function the crawler calls to navigate..\n[\n async (crawlingContext, requestAsBrowserOptions) => {\n // ...\n }\n]",
"editor": "javascript"
},
"postNavigationHooks": {
"title": "Post-navigation hooks",
"type": "string",
"description": "Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter.",
"prefill": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n async (crawlingContext) => {\n // ...\n },\n]",
"editor": "javascript"
},
"maxRequestRetries": {
"title": "Max request retries",
"type": "integer",
"description": "The maximum number of times the scraper will retry to load each web page on error, in case of a page load error or an exception thrown by the <b>Page function</b>.<br><br>If set to <code>0</code>, the page will be considered failed right after the first error.",
"minimum": 0,
"default": 3,
"unit": "retries"
},
"maxPagesPerCrawl": {
"title": "Max pages per run",
"type": "integer",
"description": "The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It is always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.",
"minimum": 0,
"default": 0,
"unit": "pages"
},
"maxResultsPerCrawl": {
"title": "Max result records",
"type": "integer",
"description": "The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached. <br><br>If set to <code>0</code>, there is no limit.",
"minimum": 0,
"default": 0,
"unit": "results"
},
"maxCrawlingDepth": {
"title": "Max crawling depth",
"type": "integer",
"description": "Specifies how many links away from the <b>Start URLs</b> the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using <code>context.enqueuePage()</code> in <b>Page function</b> are not subject to the maximum depth constraint. <br><br>If set to <code>0</code>, there is no limit.",
"minimum": 0,
"default": 0
},
"maxConcurrency": {
"title": "Max concurrency",
"type": "integer",
"description": "Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.",
"minimum": 1,
"default": 50
},
"pageLoadTimeoutSecs": {
"title": "Page load timeout",
"type": "integer",
"description": "The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried (subject to <b>Max page retries</b>), similarly as with other page load errors.",
"minimum": 1,
"default": 60,
"unit": "seconds"
},
"pageFunctionTimeoutSecs": {
"title": "Page function timeout",
"type": "integer",
"description": "The maximum amount of time the scraper will wait for the <b>Page function</b> to execute, in seconds. It is always a good idea to set this limit, to ensure that unexpected behavior in page function will not get the scraper stuck.",
"minimum": 1,
"default": 60,
"unit": "seconds"
},
"debugLog": {
"title": "Enable debug log",
"type": "boolean",
"description": "If enabled, the Actor log will include debug messages. Beware that this can be quite verbose. Use <code>context.log.debug('message')</code> to log your own debug messages from the <b>Page function</b>.",
"default": false,
"groupCaption": "Logging"
},
"customData": {
"title": "Custom data",
"type": "object",
"description": "A custom JSON object that is passed to the <b>Page function</b> as <code>context.customData</code>. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.",
"default": {},
"prefill": {},
"editor": "json"
},
"datasetName": {
"title": "Dataset name",
"type": "string",
"description": "Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.",
"editor": "textfield"
},
"keyValueStoreName": {
"title": "Key-value store name",
"type": "string",
"description": "Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used.",
"editor": "textfield"
},
"requestQueueName": {
"title": "Request queue name",
"type": "string",
"description": "Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used.",
"editor": "textfield"
}
},
"required": ["startUrls", "pageFunction", "proxyConfiguration"]
}
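
To show how these fields fit together, here is a hedged example of an input object; the field names come from the schema above, while the URLs, glob patterns and limits are placeholder values. It could be passed to the Actor, for example via the apify-client call sketched earlier.

// Example Cheerio Scraper input; concrete values are illustrative placeholders.
const input = {
    startUrls: [{ url: 'https://crawlee.dev/js' }],
    linkSelector: 'a[href]',
    globs: [{ glob: 'https://crawlee.dev/js/*/*' }],
    excludes: [{ glob: '/**/*.{png,jpg,jpeg,pdf}' }],
    maxCrawlingDepth: 2,
    maxPagesPerCrawl: 100,
    proxyConfiguration: { useApifyProxy: true },
    pageFunction: `async function pageFunction(context) {
        const { $, request } = context;
        return { url: request.url, pageTitle: $('title').first().text() };
    }`,
};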

package.json

{
"name": "actor-cheerio-scraper",
"version": "3.1.0",
"private": true,
"description": "Crawl web pages using HTTP requests and Cheerio",
"type": "module",
"dependencies": {
"@apify/scraper-tools": "^1.1.4",
"@crawlee/cheerio": "^3.13.4",
"apify": "^3.2.6"
},
"devDependencies": {
"@apify/tsconfig": "^0.1.0",
"@types/node": "^22.7.4",
"tsx": "^4.19.1",
"typescript": "~5.8.0"
},
"peerDependencies": {
"cheerio": "^1.0.0-rc.12"
},
"scripts": {
"start": "npm run start:dev",
"start:prod": "node dist/main.js",
"start:dev": "tsx src/main.ts",
"build": "tsc"
},
"repository": {
"type": "git",
"url": "https://github.com/apify/apify-sdk-js"
},
"author": {
"name": "Apify Technologies",
"email": "support@apify.com",
"url": "https://apify.com"
},
"contributors": [
"Marek Trunkat <marek@apify.com>",
"Ondra Urban <ondra@apify.com>"
],
"license": "Apache-2.0",
"homepage": "https://github.com/apify/apify-sdk-js"
}

tsconfig.json

{
"extends": "@apify/tsconfig",
"compilerOptions": {
"outDir": "dist",
"module": "ESNext",
"allowJs": true,
"skipLibCheck": true
},
"include": ["src"]
}

.actor/actor.json

{
"actorSpecification": 1,
"name": "cheerio-scraper",
"version": "0.1",
"buildTag": "latest"
}

src/main.ts

import { runActor } from '@apify/scraper-tools';

import { CrawlerSetup } from './internals/crawler_setup.js';

runActor(CrawlerSetup);

src/internals/consts.ts

import type {
    Dictionary,
    GlobInput,
    ProxyConfigurationOptions,
    PseudoUrlInput,
    RegExpInput,
    RequestOptions,
    Session,
} from '@crawlee/cheerio';

export const enum ProxyRotation {
    Recommended = 'RECOMMENDED',
    PerRequest = 'PER_REQUEST',
    UntilFailure = 'UNTIL_FAILURE',
}

/**
 * Replicates the INPUT_SCHEMA with JavaScript types for quick reference
 * and IDE type check integration.
 */
export interface Input {
    startUrls: RequestOptions[];
    globs: GlobInput[];
    regexps: RegExpInput[];
    excludes: GlobInput[];
    pseudoUrls: PseudoUrlInput[];
    keepUrlFragments: boolean;
    respectRobotsTxtFile: boolean;
    linkSelector?: string;
    pageFunction: string;
    preNavigationHooks?: string;
    postNavigationHooks?: string;
    proxyConfiguration: ProxyConfigurationOptions;
    proxyRotation: ProxyRotation;
    sessionPoolName?: string;
    initialCookies: Parameters<Session['setCookies']>[0];
    additionalMimeTypes: string[];
    suggestResponseEncoding?: string;
    forceResponseEncoding: boolean;
    ignoreSslErrors: boolean;
    maxRequestRetries: number;
    maxPagesPerCrawl: number;
    maxResultsPerCrawl: number;
    maxCrawlingDepth: number;
    maxConcurrency: number;
    pageLoadTimeoutSecs: number;
    pageFunctionTimeoutSecs: number;
    debugLog: boolean;
    customData: Dictionary;
    datasetName?: string;
    keyValueStoreName?: string;
    requestQueueName?: string;
}
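
One field worth highlighting is `customData`: per the input schema, whatever object you pass there is exposed to the Page function as `context.customData`. A minimal sketch (the `label` property is an illustrative assumption, not part of the schema):

async function pageFunction(context) {
    const { request, customData } = context;
    // customData comes straight from the "Custom data" input field.
    return { url: request.url, label: customData.label };
}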

src/internals/crawler_setup.ts

import { readFile } from 'node:fs/promises';
import type { IncomingMessage } from 'node:http';
import { dirname } from 'node:path';
import { fileURLToPath, URL } from 'node:url';

import type {
    AutoscaledPool,
    Awaitable,
    CheerioCrawlerOptions,
    CheerioCrawlingContext,
    Dictionary,
    ProxyConfiguration,
    Request,
} from '@crawlee/cheerio';
import {
    CheerioCrawler,
    Dataset,
    KeyValueStore,
    log,
    RequestList,
    RequestQueueV2,
} from '@crawlee/cheerio';
import type { ApifyEnv } from 'apify';
import { Actor } from 'apify';
import { load } from 'cheerio';

import type {
    CrawlerSetupOptions,
    RequestMetadata,
} from '@apify/scraper-tools';
import {
    constants as scraperToolsConstants,
    createContext,
    tools,
} from '@apify/scraper-tools';

import type { Input } from './consts.js';
import { ProxyRotation } from './consts.js';

const { SESSION_MAX_USAGE_COUNTS, META_KEY } = scraperToolsConstants;
const SCHEMA = JSON.parse(
    await readFile(new URL('../../INPUT_SCHEMA.json', import.meta.url), 'utf8'),
);

const MAX_EVENT_LOOP_OVERLOADED_RATIO = 0.9;
const SESSION_STORE_NAME = 'APIFY-CHEERIO-SCRAPER-SESSION-STORE';
const REQUEST_QUEUE_INIT_FLAG_KEY = 'REQUEST_QUEUE_INITIALIZED';

/**
 * Holds all the information necessary for constructing a crawler
 * instance and creating a context for a pageFunction invocation.
 */
export class CrawlerSetup implements CrawlerSetupOptions {
    name = 'Cheerio Scraper';
    rawInput: string;
    env: ApifyEnv;
    /**
     * Used to store data that persist navigations
     */
    globalStore = new Map();
    requestQueue: RequestQueueV2;
    keyValueStore: KeyValueStore;
    customData: unknown;
    input: Input;
    maxSessionUsageCount: number;
    evaledPageFunction: (...args: unknown[]) => unknown;
    evaledPreNavigationHooks: ((...args: unknown[]) => Awaitable<void>)[];
    evaledPostNavigationHooks: ((...args: unknown[]) => Awaitable<void>)[];
    datasetName?: string;
    keyValueStoreName?: string;
    requestQueueName?: string;

    crawler!: CheerioCrawler;
    dataset!: Dataset;
    pagesOutputted!: number;
    proxyConfiguration?: ProxyConfiguration;
    private initPromise: Promise<void>;

    constructor(input: Input) {
        // Set log level early to prevent missed messages.
        if (input.debugLog) log.setLevel(log.LEVELS.DEBUG);

        // Keep this as string to be immutable.
        this.rawInput = JSON.stringify(input);

        // Attempt to load page function from disk if not present on input.
        tools.maybeLoadPageFunctionFromDisk(
            input,
            dirname(fileURLToPath(import.meta.url)),
        );

        // Validate INPUT if not running on Apify Cloud Platform.
        if (!Actor.isAtHome()) tools.checkInputOrThrow(input, SCHEMA);

        this.input = input;
        this.env = Actor.getEnv();

        // Validations
        this.input.pseudoUrls.forEach((purl) => {
            if (!tools.isPlainObject(purl)) {
                throw new Error(
                    'The pseudoUrls Array must only contain Objects.',
                );
            }
            if (purl.userData && !tools.isPlainObject(purl.userData)) {
                throw new Error(
                    'The userData property of a pseudoUrl must be an Object.',
                );
            }
        });

        this.input.initialCookies.forEach((cookie) => {
            if (!tools.isPlainObject(cookie)) {
                throw new Error(
                    'The initialCookies Array must only contain Objects.',
                );
            }
        });

        // solving proxy rotation settings
        this.maxSessionUsageCount =
            SESSION_MAX_USAGE_COUNTS[this.input.proxyRotation];

        // Functions need to be evaluated.
        this.evaledPageFunction = tools.evalFunctionOrThrow(
            this.input.pageFunction,
        );

        if (this.input.preNavigationHooks) {
            this.evaledPreNavigationHooks = tools.evalFunctionArrayOrThrow(
                this.input.preNavigationHooks,
                'preNavigationHooks',
            );
        } else {
            this.evaledPreNavigationHooks = [];
        }

        if (this.input.postNavigationHooks) {
            this.evaledPostNavigationHooks = tools.evalFunctionArrayOrThrow(
                this.input.postNavigationHooks,
                'postNavigationHooks',
            );
        } else {
            this.evaledPostNavigationHooks = [];
        }

        // Named storages
        this.datasetName = this.input.datasetName;
        this.keyValueStoreName = this.input.keyValueStoreName;
        this.requestQueueName = this.input.requestQueueName;

        // Initialize async operations.
        this.crawler = null!;
        this.requestQueue = null!;
        this.dataset = null!;
        this.keyValueStore = null!;
        this.proxyConfiguration = null!;
        this.initPromise = this._initializeAsync();
    }

    private async _initializeAsync() {
        // RequestList
        const startUrls = this.input.startUrls.map((req) => {
            req.useExtendedUniqueKey = true;
            req.keepUrlFragment = this.input.keepUrlFragments;
            return req;
        });

        // KeyValueStore
        this.keyValueStore = await KeyValueStore.open(this.keyValueStoreName);

        // RequestQueue
        this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

        if (
            !(await this.keyValueStore.recordExists(
                REQUEST_QUEUE_INIT_FLAG_KEY,
            ))
        ) {
            const requests: Request[] = [];
            for await (const request of await RequestList.open(
                null,
                startUrls,
            )) {
                if (
                    this.input.maxResultsPerCrawl > 0 &&
                    requests.length >= 1.5 * this.input.maxResultsPerCrawl
                ) {
                    break;
                }
                requests.push(request);
            }

            const { waitForAllRequestsToBeAdded } =
                await this.requestQueue.addRequestsBatched(requests);

            void waitForAllRequestsToBeAdded.then(async () => {
                await this.keyValueStore.setValue(
                    REQUEST_QUEUE_INIT_FLAG_KEY,
                    '1',
                );
            });
        }

        // Dataset
        this.dataset = await Dataset.open(this.datasetName);
        const info = await this.dataset.getInfo();
        this.pagesOutputted = info?.itemCount ?? 0;

        // Proxy configuration
        this.proxyConfiguration = (await Actor.createProxyConfiguration(
            this.input.proxyConfiguration,
        )) as any as ProxyConfiguration;
    }

    /**
     * Resolves to a `CheerioCrawler` instance.
     */
    async createCrawler() {
        await this.initPromise;

        const options: CheerioCrawlerOptions = {
            proxyConfiguration: this.proxyConfiguration,
            requestHandler: this._requestHandler.bind(this),
            preNavigationHooks: [],
            postNavigationHooks: [],
            requestQueue: this.requestQueue,
            navigationTimeoutSecs: this.input.pageLoadTimeoutSecs,
            requestHandlerTimeoutSecs: this.input.pageFunctionTimeoutSecs,
            ignoreSslErrors: this.input.ignoreSslErrors,
            failedRequestHandler: this._failedRequestHandler.bind(this),
            respectRobotsTxtFile: this.input.respectRobotsTxtFile,
            maxRequestRetries: this.input.maxRequestRetries,
            maxRequestsPerCrawl: this.input.maxPagesPerCrawl,
            additionalMimeTypes: this.input.additionalMimeTypes,
            autoscaledPoolOptions: {
                maxConcurrency: this.input.maxConcurrency,
                systemStatusOptions: {
                    // Cheerio does a lot of sync operations, so we need to
                    // give it some time to do its job.
                    maxEventLoopOverloadedRatio:
                        MAX_EVENT_LOOP_OVERLOADED_RATIO,
                },
            },
            useSessionPool: true,
            persistCookiesPerSession: true,
            sessionPoolOptions: {
                persistStateKeyValueStoreId: this.input.sessionPoolName
                    ? SESSION_STORE_NAME
                    : undefined,
                persistStateKey: this.input.sessionPoolName,
                sessionOptions: {
                    maxUsageCount: this.maxSessionUsageCount,
                },
            },
            experiments: {
                requestLocking: true,
            },
        };

        this._createNavigationHooks(options);

        if (this.input.proxyRotation === ProxyRotation.UntilFailure) {
            options.sessionPoolOptions!.maxPoolSize = 1;
        }

        if (this.input.suggestResponseEncoding) {
            if (this.input.forceResponseEncoding) {
                options.forceResponseEncoding =
                    this.input.suggestResponseEncoding;
            } else {
                options.suggestResponseEncoding =
                    this.input.suggestResponseEncoding;
            }
        }

        this.crawler = new CheerioCrawler(options);

        return this.crawler;
    }

    private _createNavigationHooks(options: CheerioCrawlerOptions) {
        options.preNavigationHooks!.push(async ({ request, session }) => {
            // Normalize headers
            request.headers = Object.entries(request.headers ?? {}).reduce(
                (newHeaders, [key, value]) => {
                    newHeaders[key.toLowerCase()] = value;
                    return newHeaders;
                },
                {} as Dictionary<string>,
            );

            // Add initial cookies, if any.
            if (this.input.initialCookies && this.input.initialCookies.length) {
                const cookiesToSet = session
                    ? tools.getMissingCookiesFromSession(
                          session,
                          this.input.initialCookies,
                          request.url,
                      )
                    : this.input.initialCookies;
                if (cookiesToSet?.length) {
                    // setting initial cookies that are not already in the session and page
                    session?.setCookies(cookiesToSet, request.url);
                }
            }
        });

        options.preNavigationHooks!.push(
            ...this._runHookWithEnhancedContext(this.evaledPreNavigationHooks),
        );
        options.postNavigationHooks!.push(
            ...this._runHookWithEnhancedContext(this.evaledPostNavigationHooks),
        );
    }

    private _runHookWithEnhancedContext(
        hooks: ((...args: unknown[]) => Awaitable<void>)[],
    ) {
        return hooks.map((hook) => (ctx: Dictionary, ...args: unknown[]) => {
            const { customData } = this.input;
            return hook({ ...ctx, Apify: Actor, Actor, customData }, ...args);
        });
    }

    private async _failedRequestHandler({ request }: CheerioCrawlingContext) {
        const lastError =
            request.errorMessages[request.errorMessages.length - 1];
        const errorMessage = lastError ? lastError.split('\n')[0] : 'no error';
        log.error(
            `Request ${request.url} failed and will not be retried anymore. Marking as failed.\nLast Error Message: ${errorMessage}`,
        );
        return this._handleResult(request, undefined, undefined, true);
    }

    /**
     * First of all, it initializes the state that is exposed to the user via
     * `pageFunction` context.
     *
     * Then it invokes the user provided `pageFunction` with the prescribed context
     * and saves its return value.
     *
     * Finally, it makes decisions based on the current state and post-processes
     * the data returned from the `pageFunction`.
     */
    private async _requestHandler(crawlingContext: CheerioCrawlingContext) {
        const { request, response, $, crawler } = crawlingContext;
        const pageFunctionArguments: Dictionary = {};

        // We must use properties and descriptors not to trigger getters / setters.
        const props = Object.getOwnPropertyDescriptors(crawlingContext);
        ['json', 'body'].forEach((key) => {
            props[key].configurable = true;
        });
        Object.defineProperties(pageFunctionArguments, props);

        pageFunctionArguments.cheerio = load([]);
        pageFunctionArguments.response = {
            status: response!.statusCode,
            headers: response!.headers,
        };

        Object.defineProperties(
            this,
            Object.getOwnPropertyDescriptors(pageFunctionArguments),
        );

        /**
         * PRE-PROCESSING
         */
        // Make sure that an object containing internal metadata
        // is present on every request.
        tools.ensureMetaData(request);

        // Abort the crawler if the maximum number of results was reached.
        const aborted = await this._handleMaxResultsPerCrawl(
            crawler.autoscaledPool,
        );
        if (aborted) return;

        // Setup and create Context.
        const contextOptions = {
            crawlerSetup: {
                rawInput: this.rawInput,
                env: this.env,
                globalStore: this.globalStore,
                requestQueue: this.requestQueue,
                keyValueStore: this.keyValueStore,
                customData: this.input.customData,
            },
            pageFunctionArguments,
        };
        const { context, state } = createContext(contextOptions);

        /**
         * USER FUNCTION INVOCATION
         */
        const pageFunctionResult = await this.evaledPageFunction(context);

        /**
         * POST-PROCESSING
         */
        // Enqueue more links if Pseudo URLs, a link selector and cheerio instance are available,
        // unless the user invoked the `skipLinks()` context function
        // or maxCrawlingDepth would be exceeded.
        if (!state.skipLinks && !!$) await this._handleLinks(crawlingContext);

        // Save the `pageFunction`s result to the default dataset.
        await this._handleResult(
            request,
            response,
            pageFunctionResult as Dictionary,
        );
    }

    private async _handleMaxResultsPerCrawl(autoscaledPool?: AutoscaledPool) {
        if (
            !this.input.maxResultsPerCrawl ||
            this.pagesOutputted < this.input.maxResultsPerCrawl
        )
            return false;
        if (!autoscaledPool) return false;
        log.info(
            `User set limit of ${this.input.maxResultsPerCrawl} results was reached. Finishing the crawl.`,
        );
        await autoscaledPool.abort();
        return true;
    }

    private async _handleLinks({
        request,
        enqueueLinks,
    }: CheerioCrawlingContext) {
        if (!(this.input.linkSelector && this.requestQueue)) return;
        const currentDepth = (request.userData![META_KEY] as RequestMetadata)
            .depth;
        const hasReachedMaxDepth =
            this.input.maxCrawlingDepth &&
            currentDepth >= this.input.maxCrawlingDepth;
        if (hasReachedMaxDepth) {
            log.debug(
                `Request ${request.url} reached the maximum crawling depth of ${currentDepth}.`,
            );
            return;
        }

        await enqueueLinks({
            selector: this.input.linkSelector,
            pseudoUrls: this.input.pseudoUrls,
            globs: this.input.globs,
            exclude: this.input.excludes,
            transformRequestFunction: (requestOptions) => {
                requestOptions.userData ??= {};
                requestOptions.userData[META_KEY] = {
                    parentRequestId: request.id || request.uniqueKey,
                    depth: currentDepth + 1,
                };

                requestOptions.useExtendedUniqueKey = true;
                requestOptions.keepUrlFragment = this.input.keepUrlFragments;
                return requestOptions;
            },
        });
    }

    private async _handleResult(
        request: Request,
        response?: IncomingMessage,
        pageFunctionResult?: Dictionary,
        isError?: boolean,
    ) {
        const payload = tools.createDatasetPayload(
            request,
            response,
            pageFunctionResult,
            isError,
        );
        await this.dataset.pushData(payload);
        this.pagesOutputted++;
    }
}
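
Because `_runHookWithEnhancedContext()` spreads `Apify`, `Actor` and `customData` into the first argument of every user-supplied hook, hooks entered in the preNavigationHooks / postNavigationHooks input fields can use them directly. A hedged sketch of a post-navigation hook (the key-value store key format is an illustrative assumption):

[
    async ({ request, response, Actor, customData }) => {
        // Record failed responses to the default key-value store.
        if (response && response.statusCode >= 400) {
            await Actor.setValue(`ERROR-${request.id}`, {
                url: request.url,
                statusCode: response.statusCode,
                customData,
            });
        }
    },
]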