BeautifulSoup Scraper

apify/beautifulsoup-scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Beautifulsoup Scraper is a ready-made solution for crawling websites using plain HTTP requests. It passes each HTTP response to a function you define, where you can use the Beautiful Soup Python library to extract any data from it. Because it sends plain HTTP requests rather than driving a browser, it is fast.

Beautiful Soup is a Python library for parsing HTML and XML documents. It provides an interface for navigating and manipulating the document structure, and its search functions let you find elements by tag, attribute, or CSS class.

Beautifulsoup Scraper is ideal for scraping web pages that do not rely on client-side JavaScript to serve their content. Beautiful Soup itself is a parser library and does not execute JavaScript.

Usage

To get started with Beautifulsoup Scraper, you only need two things. First, tell the scraper which web pages it should load. Second, tell it how to extract data from each page.

The scraper starts by loading the pages specified in the Start URLs field. You can make the scraper follow page links on the fly by setting a Link selector and Link patterns, which together tell the scraper which links it should add to the crawling queue. This is useful for the recursive crawling of entire websites, e.g. to find all products in an online store.

To tell the scraper how to extract data from web pages, you need to provide a Page function. This is Python code that is executed for every web page loaded. The Beautiful Soup library is assumed to be used for data extraction.

In summary, Beautifulsoup Scraper works as follows:

1. Adds each Start URL to the crawling queue.
2. Fetches the first URL from the queue and constructs a DOM from the fetched HTML string.
3. Executes the Page function on the loaded page and saves its results.
4. Optionally, finds all links on the page using the Link selector. If a link matches any of the Link patterns and has not yet been visited, adds it to the queue.
5. If there are more items in the queue, repeats step 2; otherwise, finishes.
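
The steps above can be sketched as a small queue-driven loop. This is only an illustration of the described behaviour, not the Actor's actual implementation: the function name `crawl`, its callable parameters, the in-memory `SITE` dictionary, and the regex-based `find_hrefs` helper (a crude stand-in for the Link selector) are all assumptions made for the example.

```python
import re
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_urls: Iterable[str],
    fetch: Callable[[str], str],
    page_function: Callable[[str, str], dict],
    extract_links: Callable[[str], Iterable[str]],
    link_patterns: Iterable[str] = (),
) -> list[dict]:
    """Breadth-first crawl mirroring the five steps described in the text."""
    patterns = [re.compile(p) for p in link_patterns]
    queue = deque(start_urls)  # step 1: seed the queue with the Start URLs
    seen = set(queue)          # URLs that were already queued or visited
    results = []

    while queue:               # step 5: continue while the queue has items
        url = queue.popleft()  # step 2: take the next URL and fetch it
        html = fetch(url)
        results.append(page_function(url, html))  # step 3: extract and save

        for link in extract_links(html):          # step 4: enqueue new links
            if link not in seen and any(p.search(link) for p in patterns):
                seen.add(link)
                queue.append(link)
    return results


# Hypothetical two-page site held in memory, so no network access is needed.
SITE = {
    "https://example.com/": '<a href="https://example.com/a">A</a> '
                            '<a href="https://other.test/">X</a>',
    "https://example.com/a": '<a href="https://example.com/">home</a>',
}


def find_hrefs(html: str) -> list[str]:
    """Crude stand-in for a Link selector: collect every href value."""
    return re.findall(r'href="([^"]+)"', html)


pages = crawl(
    start_urls=["https://example.com/"],
    fetch=SITE.__getitem__,
    page_function=lambda url, html: {"url": url, "length": len(html)},
    extract_links=find_hrefs,
    link_patterns=[r"^https://example\.com/"],
)
print([page["url"] for page in pages])
```

In this sketch the off-site link is skipped because it matches no pattern, and the link back to the start page is skipped because it was already seen, so the loop ends after two pages.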

Limitations

The Actor does not employ a full-featured web browser such as Chromium or Firefox, so it will not be sufficient for web pages that render their content dynamically using client-side JavaScript. To scrape such sites, you might prefer to use Web Scraper (apify/web-scraper), which loads pages in a full browser and renders dynamic content.

In the Page function you can only use Python modules that are already installed in this Actor. If you require other modules for your scraping, you'll need to develop a completely new Actor, or open an issue or pull request in the GitHub repository at github.com/apify/actor-beautifulsoup-scraper.

Input configuration

As input, the Beautifulsoup Scraper Actor accepts a number of configurations. These can be entered either manually in the user interface in Apify Console, or programmatically in a JSON object using the Apify API. For a complete list of input fields and their types, please visit the Input tab.

Page function

The Page function (pageFunction) field contains a Python script with a single function that enables the user to extract data from the web page, access its DOM, add new URLs to the request queue, and otherwise control Beautifulsoup Scraper's operation.

Example:

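from typing import Any

# Context is supplied by the Actor when it runs this code.
def page_function(context: Context) -> Any:
    # URL of the request that produced this page
    url = context.request["url"]
    # Text of the <title> element, or None if the page has no title
    title = context.soup.title.string if context.soup.title else None
    return {"url": url, "title": title}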

Context

The code runs in Python 3.11 and the page_function accepts a single argument context of type Context. It is a dataclass with the following fields:

soup of type BeautifulSoup with the parsed HTTP payload,
request of type dict with the HTTP request data,
request_queue of type apify.storages.RequestQueue (RequestQueue) for the interaction with the HTTP request queue,
response of type httpx.Response with the HTTP response data.
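
As a rough illustration of how these fields might be used together, the sketch below defines a local stand-in for `Context` and a Page function that reads `soup` and `request`. The stand-in dataclass is hypothetical and exists only so the snippet can be loaded outside the Actor; inside the Actor the real `Context` is provided for you. Collecting the `h2` headings is just an example extraction.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Context:
    """Hypothetical local stand-in with the four fields listed above."""
    soup: Any           # bs4.BeautifulSoup with the parsed page
    request: dict       # HTTP request data, e.g. {"url": ...}
    request_queue: Any  # apify.storages.RequestQueue
    response: Any       # httpx.Response


def page_function(context: Context) -> Any:
    """Return the page URL, its title, and the text of every <h2> heading."""
    soup = context.soup
    title = soup.title.string if soup.title else None
    headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
    return {"url": context.request["url"], "title": title, "headings": headings}
```

To try the function locally, you could build `soup` yourself with `BeautifulSoup(html, "html.parser")` from the `bs4` package and pass `None` for the fields the function does not touch.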

Proxy configuration

The Proxy configuration (proxyConfiguration) option enables you to set proxies that will be used by the scraper in order to prevent its detection by target web pages. You can use both the Apify Proxy and custom HTTP or SOCKS5 proxy servers.

A proxy is required to run the scraper. The following table lists the available options for the proxy configuration setting:

Apify Proxy (automatic)	The scraper will load all web pages using the Apify Proxy in automatic mode. In this mode, the proxy uses all proxy groups that are available to the user. For each new web page, it automatically selects the proxy that hasn't been used in the longest time for the specific hostname in order to reduce the chance of detection by the web page. You can view the list of available proxy groups on the Proxy page in Apify Console.
Apify Proxy (selected groups)	The scraper will load all web pages using the Apify Proxy with specific groups of target proxy servers.
Custom proxies	The scraper will use a custom list of proxy servers. The proxies must be specified in the scheme://user:password@host:port format. Multiple proxies should be separated by a space or new line. The URL scheme can be either http or socks5. The user and password might be omitted, but the port must always be present.

Example:

http://bob:password@proxy1.example.com:8000
http://bob:password@proxy2.example.com:8000

The proxy configuration can be set programmatically when calling the Actor using the API by setting the proxyConfiguration field. It accepts a JSON object with the following structure:

{
    // Indicates whether to use the Apify Proxy or not.
    "useApifyProxy": Boolean,

    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, the Apify Proxy will use automatic mode.
    "apifyProxyGroups": String[],

    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}
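
A minimal sketch of building this object in Python before sending it to the API might look as follows. The helper names `validate_proxy_url` and `build_proxy_configuration` are made up for this example, the checks simply restate the format rules given above (scheme `http` or `socks5`, port always present), and `RESIDENTIAL` is used only as an example group name; check the Proxy page in Apify Console for the groups available to you.

```python
import json
from typing import Optional, Sequence
from urllib.parse import urlsplit


def validate_proxy_url(url: str) -> str:
    """Check a custom proxy URL against the scheme://user:password@host:port rules."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "socks5"):
        raise ValueError(f"unsupported scheme in {url!r}")
    try:
        port = parts.port  # raises ValueError if the port is not a valid number
    except ValueError:
        port = None
    if not parts.hostname or port is None:
        raise ValueError(f"host and port are required in {url!r}")
    return url


def build_proxy_configuration(
    use_apify_proxy: bool,
    groups: Optional[Sequence[str]] = None,
    proxy_urls: Optional[Sequence[str]] = None,
) -> dict:
    """Assemble a dict shaped like the proxyConfiguration structure shown above."""
    config: dict = {"useApifyProxy": use_apify_proxy}
    if use_apify_proxy and groups:
        # Omitting the groups leaves the Apify Proxy in automatic mode.
        config["apifyProxyGroups"] = list(groups)
    if proxy_urls:
        config["proxyUrls"] = [validate_proxy_url(u) for u in proxy_urls]
    return config


# Apify Proxy with selected groups (the group name is an example).
selected = build_proxy_configuration(True, groups=["RESIDENTIAL"])

# Custom proxies only.
custom = build_proxy_configuration(
    False,
    proxy_urls=[
        "http://bob:password@proxy1.example.com:8000",
        "socks5://proxy2.example.com:1080",
    ],
)
print(json.dumps(custom, indent=4))
```

Validating the URLs on the client side is optional; it only surfaces a missing port or unsupported scheme before the run starts.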

Results

The scraping results returned by Page function are stored in the default dataset associated with the Actor run, from where you can export them to formats such as JSON, XML, CSV, or Excel.

To download the results, call the Get dataset items API endpoint:

https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json

where [DATASET_ID] is the ID of the Actor run's dataset, which you can find in the Run object returned when starting the Actor. Alternatively, you'll find the download links for the results in Apify Console.

To skip the #error and #debug metadata fields from the results and not include empty result records, simply add the clean=true query parameter to the API URL, or select the Clean items option when downloading the dataset in Apify Console.

To get the results in other formats, set the format query parameter to xml, xlsx, csv, html, etc. For more information, see Datasets in the documentation or the Get dataset items endpoint in the Apify API reference.
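
As a small illustration, the download URL with the format and clean parameters described above can be assembled like this. The helper name `dataset_items_url` and the dataset ID `abc123` are placeholders invented for the example.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2/datasets"


def dataset_items_url(dataset_id: str, fmt: str = "json", clean: bool = False) -> str:
    """Build the Get dataset items URL for the given dataset and export format."""
    params = {"format": fmt}
    if clean:
        # Skips the #error and #debug fields and empty result records.
        params["clean"] = "true"
    return f"{API_BASE}/{quote(dataset_id, safe='')}/items?{urlencode(params)}"


# "abc123" is a placeholder; use the dataset ID from your Run object.
print(dataset_items_url("abc123", fmt="csv", clean=True))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; depending on how the dataset is shared, the request may also need to be authenticated.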

Developer

Apify

Actor Metrics

19 monthly users
4 stars
>99% runs succeeded
Created in Jul 2023
Modified 21 days ago

Categories

Developer tools
