BeautifulSoup Scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Start URLs


A static list of URLs to scrape.
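As a sketch, the Start URLs input might be provided like this (the `startUrls` field name and object shape follow Apify's usual input convention and are an assumption here; the URLs are placeholders):

```json
{
  "startUrls": [
    { "url": "https://example.com" },
    { "url": "https://example.com/category/books" }
  ]
}
```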

Max crawling depth


Specifies how many links away from the Start URLs the scraper will descend. Note that pages added using context.request_queue in the Page function are not subject to the maximum depth constraint.

Default value of this property is 1

Request timeout


The maximum duration (in seconds) for the request to complete before timing out. The timeout value is passed to the httpx.AsyncClient object.

Default value of this property is 10

Link selector


A CSS selector specifying which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Link patterns field.

If the Link selector is empty, page links are ignored. You can also work with page links and the request queue directly in the Page function.
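To illustrate what a link selector matches, here is a minimal sketch using BeautifulSoup directly; the selector and HTML are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<nav><a href="/about">About</a></nav>
<div class="products">
  <a href="/product/1">Widget</a>
  <a href="/product/2">Gadget</a>
  <a name="anchor-without-href">skip me</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A selector like "div.products a[href]" limits enqueued links to
# <a> elements that have an href inside the products container.
links = [a["href"] for a in soup.select("div.products a[href]")]
print(links)  # ['/product/1', '/product/2']
```

Note that the navigation link and the href-less anchor are excluded, so only product links would be enqueued.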

Link patterns


Link patterns (regular expressions) to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the link patterns will cause the scraper to enqueue all links matched by the Link selector.
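The filtering step can be sketched with Python's re module; whether the Actor anchors the pattern or searches within the URL is not specified here, so this sketch uses re.search and invented URLs:

```python
import re

# Candidate links matched by the Link selector (illustrative).
candidates = [
    "https://example.com/product/1",
    "https://example.com/cart",
    "https://example.com/product/2?ref=home",
]

# A link pattern is a regular expression; only matching links are enqueued.
pattern = re.compile(r"https://example\.com/product/\d+")

enqueued = [url for url in candidates if pattern.search(url)]
print(enqueued)  # keeps only the two product URLs
```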

Page function


A Python function that is executed for every page. Use it to scrape data from the page, perform actions, or add new URLs to the request queue. The page function has its own naming scope, and you can import any installed modules. Typically you would obtain the data from the context.soup object and return it. The identifier page_function can't be changed. Asynchronous functions are supported. For more information about the context object passed to the page_function, see the Actor's documentation.
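A minimal page function might look like the sketch below. The exact shape of the context object is assumed from the description above (only its soup attribute is used), and the FakeContext class exists purely to simulate a call; in a real run the Actor invokes the function for you:

```python
from bs4 import BeautifulSoup

def page_function(context):
    # context.soup is the parsed page (a BeautifulSoup object).
    soup = context.soup
    title = soup.title.get_text(strip=True) if soup.title else None
    first_h1 = soup.h1.get_text(strip=True) if soup.h1 else None
    # Return the scraped record; the Actor collects returned data.
    return {"title": title, "h1": first_h1}

# Simulated call for illustration only.
class FakeContext:
    soup = BeautifulSoup(
        "<html><head><title>Shop</title></head>"
        "<body><h1>Books</h1></body></html>",
        "html.parser",
    )

print(page_function(FakeContext()))  # {'title': 'Shop', 'h1': 'Books'}
```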

BeautifulSoup features


The value of BeautifulSoup features argument. From BeautifulSoup docs: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
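Naming a parser explicitly, as the docs recommend, looks like this ("html.parser" is Python's built-in parser; "lxml" and "html5lib" require separate installation):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

# Pin the parser instead of letting Beautiful Soup pick whichever one
# happens to be installed, so results are stable across environments.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())  # Hello
```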

BeautifulSoup from_encoding


The value of BeautifulSoup from_encoding argument. From BeautifulSoup docs: A string indicating the encoding of the document to be parsed. Pass this in if Beautiful Soup is guessing wrongly about the document's encoding.
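For example, forcing a known encoding on a byte string (a sketch; the markup here is invented):

```python
from bs4 import BeautifulSoup

# Bytes encoded as Latin-1; encoding detection could misread the 0xE9 byte.
raw = "<p>café</p>".encode("latin-1")

# from_encoding overrides Beautiful Soup's encoding guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
print(soup.p.get_text())  # café
```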

BeautifulSoup exclude_encodings


The value of BeautifulSoup exclude_encodings argument. From BeautifulSoup docs: A list of strings indicating encodings known to be wrong. Pass this in if you don't know the document's encoding but you know Beautiful Soup's guess is wrong.

Proxy configuration


Specifies proxy servers that will be used by the scraper in order to hide its origin.

Default value of this property is {"useApifyProxy":true}
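The default shown above corresponds to an input fragment like the following (the `proxyConfiguration` field name follows Apify's usual input convention and is an assumption here):

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```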

Maintained by Apify
Actor metrics
  • 34 monthly users
  • 96.1% runs succeeded
  • 0.0 days response time
  • Created in Jul 2023
  • Modified 5 months ago