Pricing

Pay per usage

Try for free

Go to Apify Store

Cheerio Scraper

Try for free

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using a Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Pricing

Pay per usage

Rating

4.6

(32)

Developer

Apify

Actor stats

304

Bookmarked

18K

Total users

1.2K

Monthly active users

12 days ago

Last modified

Categories

Developer tools

Open source

You can access the Cheerio Scraper programmatically from your own applications by using the Apify API. You can also choose the language preference from below. To use the Apify API, you’ll need an Apify account and your API token, found in Integrations settings in Apify Console.

Python

JavaScript

CLI

OpenAPI

HTTP

MCP

{
  "openapi": "3.0.1",
  "info": {
    "version": "3.0",
    "x-build-id": "gTZUgxTMzuh0BBr4b"
  },
  "servers": [
    {
      "url": "https://api.apify.com/v2"
    }
  ],
  "paths": {
    "/acts/apify~cheerio-scraper/run-sync-get-dataset-items": {
      "post": {
        "operationId": "run-sync-get-dataset-items-apify-cheerio-scraper",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    },
    "/acts/apify~cheerio-scraper/runs": {
      "post": {
        "operationId": "runs-sync-apify-cheerio-scraper",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor and returns information about the initiated run in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/runsResponseSchema"
                }
              }
            }
          }
        }
      }
    },
    "/acts/apify~cheerio-scraper/run-sync": {
      "post": {
        "operationId": "run-sync-apify-cheerio-scraper",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "inputSchema": {
        "type": "object",
        "required": [
          "startUrls",
          "pageFunction",
          "proxyConfiguration"
        ],
        "properties": {
          "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "A static list of URLs to scrape. <br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> section in the README.",
            "items": {
              "type": "object",
              "required": [
                "url"
              ],
              "properties": {
                "url": {
                  "type": "string",
                  "title": "URL of a web page",
                  "format": "uri"
                }
              }
            }
          },
          "keepUrlFragments": {
            "title": "URL #fragments identify unique pages",
            "type": "boolean",
            "description": "Indicates that URL fragments (e.g. <code>http://example.com<b>#fragment</b></code>) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.",
            "default": false
          },
          "respectRobotsTxtFile": {
            "title": "Respect the robots.txt file",
            "type": "boolean",
            "description": "If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.",
            "default": false
          },
          "globs": {
            "title": "Glob Patterns",
            "type": "array",
            "description": "Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.",
            "default": [],
            "items": {
              "type": "object",
              "required": [
                "glob"
              ],
              "properties": {
                "glob": {
                  "type": "string",
                  "title": "Glob of a web page"
                }
              }
            }
          },
          "pseudoUrls": {
            "title": "Pseudo-URLs",
            "type": "array",
            "description": "Specifies what kind of URLs found by the <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with <b>regular expressions</b> enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. <br><br>If <b>Pseudo-URLs</b> are omitted, the Actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.",
            "default": [],
            "items": {
              "type": "object",
              "required": [
                "purl"
              ],
              "properties": {
                "purl": {
                  "type": "string",
                  "title": "Pseudo-URL of a web page"
                }
              }
            }
          },
          "excludes": {
            "title": "Exclude Glob Patterns",
            "type": "array",
            "description": "Glob patterns to match links in the page that you want to exclude from being enqueued.",
            "default": [],
            "items": {
              "type": "object",
              "required": [
                "glob"
              ],
              "properties": {
                "glob": {
                  "type": "string",
                  "title": "Glob of a web page"
                }
              }
            }
          },
          "linkSelector": {
            "title": "Link selector",
            "type": "string",
            "description": "A CSS selector stating which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. To filter the links added to the queue, use the <b>Pseudo-URLs</b> and/or <b>Glob patterns</b> field.<br><br>If the <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README."
          },
          "pageFunction": {
            "title": "Page function",
            "type": "string",
            "description": "A JavaScript function that is executed for every page loaded server-side in Node.js 12. Use it to scrape data from the page, perform actions or add new URLs to the request queue.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#page-function' target='_blank' rel='noopener'>Page function</a> in README."
          },
          "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
            "default": {
              "useApifyProxy": true
            }
          },
          "proxyRotation": {
            "title": "Proxy rotation",
            "enum": [
              "RECOMMENDED",
              "PER_REQUEST",
              "UNTIL_FAILURE"
            ],
            "type": "string",
            "description": "This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.",
            "default": "RECOMMENDED"
          },
          "sessionPoolName": {
            "title": "Session pool name",
            "pattern": "[0-9A-z-]",
            "minLength": 3,
            "maxLength": 200,
            "type": "string",
            "description": "<b>Use only english alphanumeric characters dashes and underscores.</b> A session is a representation of a user. It has it's own IP and cookies which are then used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window."
          },
          "initialCookies": {
            "title": "Initial cookies",
            "type": "array",
            "description": "A JSON array with cookies that will be send with every HTTP request made by the Cheerio Scraper, in the format accepted by the <a href='https://www.npmjs.com/package/tough-cookie' target='_blank' rel='noopener noreferrer'>tough-cookie</a> NPM package. This option is useful for transferring a logged-in session from an external web browser. For details how to do this, read this <a href='https://help.apify.com/en/articles/1444249-log-in-to-website-by-transferring-cookies-from-web-browser-legacy' target='_blank' rel='noopener'>help article</a>."
          },
          "additionalMimeTypes": {
            "title": "Additional MIME types",
            "type": "array",
            "description": "A JSON array specifying additional MIME content types of web pages to support. By default, Cheerio Scraper supports the <code>text/html</code> and <code>application/xhtml+xml</code> content types, and skips all other resources. For details, see <a href='https://apify.com/apify/cheerio-scraper#content-types' target='_blank' rel='noopener'>Content types</a> in README.",
            "default": []
          },
          "suggestResponseEncoding": {
            "title": "Suggest response encoding",
            "type": "string",
            "description": "The scraper automatically determines response encoding from the response headers. If the headers are invalid or information is missing, malformed responses may be produced. Use the Suggest response encoding option to provide a fall-back encoding to the Scraper for cases where it could not be determined."
          },
          "forceResponseEncoding": {
            "title": "Force response encoding",
            "type": "boolean",
            "description": "If enabled, the suggested response encoding will be used even if a valid response encoding is provided by the target website. Use this only when you've inspected the responses thoroughly and are sure that they are the ones doing it wrong.",
            "default": false
          },
          "ignoreSslErrors": {
            "title": "Ignore SSL errors",
            "type": "boolean",
            "description": "If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.",
            "default": false
          },
          "preNavigationHooks": {
            "title": "Pre-navigation hooks",
            "type": "string",
            "description": "Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `requestAsBrowserOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate."
          },
          "postNavigationHooks": {
            "title": "Post-navigation hooks",
            "type": "string",
            "description": "Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter."
          },
          "maxRequestRetries": {
            "title": "Max request retries",
            "minimum": 0,
            "type": "integer",
            "description": "The maximum number of times the scraper will retry to load each web page on error, in case of a page load error or an exception thrown by the <b>Page function</b>.<br><br>If set to <code>0</code>, the page will be considered failed right after the first error.",
            "default": 3
          },
          "maxPagesPerCrawl": {
            "title": "Max pages per run",
            "minimum": 0,
            "type": "integer",
            "description": "The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It is always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.",
            "default": 0
          },
          "maxResultsPerCrawl": {
            "title": "Max result records",
            "minimum": 0,
            "type": "integer",
            "description": "The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached. <br><br>If set to <code>0</code>, there is no limit.",
            "default": 0
          },
          "maxCrawlingDepth": {
            "title": "Max crawling depth",
            "minimum": 0,
            "type": "integer",
            "description": "Specifies how many links away from the <b>Start URLs</b> the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using <code>context.enqueuePage()</code> in <b>Page function</b> are not subject to the maximum depth constraint. <br><br>If set to <code>0</code>, there is no limit.",
            "default": 0
          },
          "maxConcurrency": {
            "title": "Max concurrency",
            "minimum": 1,
            "type": "integer",
            "description": "Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.",
            "default": 50
          },
          "pageLoadTimeoutSecs": {
            "title": "Page load timeout",
            "minimum": 1,
            "type": "integer",
            "description": "The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried (subject to <b>Max page retries</b>), similarly as with other page load errors.",
            "default": 60
          },
          "pageFunctionTimeoutSecs": {
            "title": "Page function timeout",
            "minimum": 1,
            "type": "integer",
            "description": "The maximum amount of time the scraper will wait for the <b>Page function</b> to execute, in seconds. It is always a good idea to set this limit, to ensure that unexpected behavior in page function will not get the scraper stuck.",
            "default": 60
          },
          "debugLog": {
            "title": "Enable debug log",
            "type": "boolean",
            "description": "If enabled, the Actor log will include debug messages. Beware that this can be quite verbose. Use <code>context.log.debug('message')</code> to log your own debug messages from the <b>Page function</b>.",
            "default": false
          },
          "customData": {
            "title": "Custom data",
            "type": "object",
            "description": "A custom JSON object that is passed to the <b>Page function</b> as <code>context.customData</code>. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.",
            "default": {}
          },
          "datasetName": {
            "title": "Dataset name",
            "type": "string",
            "description": "Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used."
          },
          "keyValueStoreName": {
            "title": "Key-value store name",
            "type": "string",
            "description": "Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used."
          },
          "requestQueueName": {
            "title": "Request queue name",
            "type": "string",
            "description": "Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used."
          }
        }
      },
      "runsResponseSchema": {
        "type": "object",
        "properties": {
          "data": {
            "type": "object",
            "properties": {
              "id": {
                "type": "string"
              },
              "actId": {
                "type": "string"
              },
              "userId": {
                "type": "string"
              },
              "startedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "finishedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "status": {
                "type": "string",
                "example": "READY"
              },
              "meta": {
                "type": "object",
                "properties": {
                  "origin": {
                    "type": "string",
                    "example": "API"
                  },
                  "userAgent": {
                    "type": "string"
                  }
                }
              },
              "stats": {
                "type": "object",
                "properties": {
                  "inputBodyLen": {
                    "type": "integer",
                    "example": 2000
                  },
                  "rebootCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "restartCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "resurrectCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "computeUnits": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "options": {
                "type": "object",
                "properties": {
                  "build": {
                    "type": "string",
                    "example": "latest"
                  },
                  "timeoutSecs": {
                    "type": "integer",
                    "example": 300
                  },
                  "memoryMbytes": {
                    "type": "integer",
                    "example": 1024
                  },
                  "diskMbytes": {
                    "type": "integer",
                    "example": 2048
                  }
                }
              },
              "buildId": {
                "type": "string"
              },
              "defaultKeyValueStoreId": {
                "type": "string"
              },
              "defaultDatasetId": {
                "type": "string"
              },
              "defaultRequestQueueId": {
                "type": "string"
              },
              "buildNumber": {
                "type": "string",
                "example": "1.0.0"
              },
              "containerUrl": {
                "type": "string"
              },
              "usage": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "integer",
                    "example": 1
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "usageTotalUsd": {
                "type": "number",
                "example": 0.00005
              },
              "usageUsd": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "number",
                    "example": 0.00005
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Cheerio Scraper - HTML scraping tool OpenAPI definition

OpenAPI is a standard for designing and describing RESTful APIs, allowing developers to define API structure, endpoints, and data formats in a machine-readable way. It simplifies API development, integration, and documentation.

OpenAPI is effective when used with AI agents and GPTs by standardizing how these systems interact with various APIs, for reliable integrations and efficient communication.

By defining machine-readable API specifications, OpenAPI allows AI models like GPTs to understand and use varied data sources, improving accuracy. This accelerates development, reduces errors, and provides context-aware responses, making OpenAPI a core component for AI applications.

You can download the OpenAPI definitions for Cheerio Scraper from the options below:

OpenAPI.json

If you’d like to learn more about how OpenAPI powers GPTs, read our blog post.

You can also check out our other API clients:

Cheerio Scraper API in Python

Cheerio Scraper API in JavaScript

Cheerio Scraper API through CLI

Cheerio Scraper API

Apify Store Scraper

applora/apify-store-scraper

A powerful Apify Actor designed to extract comprehensive data from the Apify Store. This scraper can discover available Actors, collect detailed Actor information, and extract key metadata, making it perfect for market research and competitor analysis.

Applora

5.0

BeautifulSoup Scraper

apify/beautifulsoup-scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Apify

5.0

Vanilla JS Scraper

mstephen190/vanilla-js-scraper

Scrape the web using familiar JavaScript methods! Crawls websites using raw HTTP requests, parses the HTML with the JSDOM package, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a non jQuery alternative to CheerioScraper.

Matthias Stephens

525

IndiaMART Product Scraper

jungle_synthesizer/indiamart-scraper

Scrape product listings and supplier data from IndiaMART's B2B export marketplace. Extract product names, prices, MOQ, supplier details, locations, GST/IEC verification status, ratings, and export history. Search by product keyword with automatic pagination.

BowTiedRaccoon

Indiamart Product Scraper

scrapeai/indiamart-product-scraper

This Apify actor retrieves product and supplier data through the IndiaMart search API. Search by query and city to collect structured product information including name, price, company, contact details, and location. Perfect for B2B lead generation, market research, and sourcing products from India.

ScrapeAI

5.0

IndiaMART Scraper

parseforge/indiamart-scraper

Scrape product listings and supplier data from IndiaMART, India's largest B2B marketplace. Get product names, prices, supplier details, locations, ratings, and contact info. Search by product keyword with automatic pagination.

ParseForge

IndiaMART Scraper - Suppliers, Exporters, Prices & GST

haketa/indiamart-scraper

IndiaMART scraper & API: find Indian B2B suppliers and products by keyword and export company, product, price, contact email and phone, city and profile URL. India sourcing, manufacturer and supplier discovery plus B2B lead generation — fast, no login.

Haketa

Python Scraper

sovanza.inc/python-scraper

Python Scraper extracts web page data using Requests and BeautifulSoup. It collects titles, meta tags, headings, links, images, Open Graph data, text snippets, and custom CSS selector fields, with exports to JSON, CSV, Excel, XML, or HTML.

Sovanza

5.0

Shopify Store Leads Scraper

parsebird/shopify-store-leads-scraper

Scrape Shopify store leads by keyword or category. Extract emails, phone numbers, addresses, ratings, social links, and sample products. Filter by location, price, and shipping. Export as JSON, CSV, Excel.