BeautifulSoup Scraper
Pricing
Pay per usage
Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.
Developer
Apify
Maintained by Apify
BeautifulSoup Scraper crawls websites using plain HTTP requests (no browser) and lets you extract data from each page with your own Python code, powered by the BeautifulSoup library. It's the Python alternative to Cheerio Scraper and is ideal for sites that don't rely on client-side JavaScript.
How it works
You give the scraper two things: where to start and how to extract data.
- It adds your Start URLs to the crawling queue.
- It fetches each URL and builds a BeautifulSoup DOM from the HTML.
- It runs your Page function on the page and stores the returned data.
- Optionally, it follows links matching your Link selector / Link patterns and enqueues them for recursive crawling.
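The steps above can be sketched as a small breadth-first loop. This is only an illustration of the general technique, not the Actor's actual implementation: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `crawl`, `fetch`, `should_follow`, and `max_pages` are hypothetical. The `page_function` here takes `(url, html)`, which differs from the context object the Actor passes.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Any, Callable, Iterable
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags (a stand-in for a link selector)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def crawl(
    start_urls: Iterable[str],
    fetch: Callable[[str], str],
    page_function: Callable[[str, str], Any],
    should_follow: Callable[[str], bool] = lambda url: False,
    max_pages: int = 100,
) -> list:
    """Hypothetical crawl loop: queue, fetch, extract, optionally enqueue links."""
    queue = deque(start_urls)          # step 1: start URLs go into the queue
    seen = set(queue)                  # avoid visiting the same URL twice
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)              # step 2: fetch the page
        results.append(page_function(url, html))  # step 3: extract and store
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:   # step 4: enqueue matching links
            absolute = urljoin(url, href)
            if absolute not in seen and should_follow(absolute):
                seen.add(absolute)
                queue.append(absolute)
    return results
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP client; with the default `should_follow`, no links are followed, which corresponds to crawling a plain list of URLs.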
Page function
Python code run for every page. It receives a BeautifulSoupCrawlingContext and returns the data to store:
    from typing import Any

    from crawlee.crawlers import BeautifulSoupCrawlingContext


    def page_function(context: BeautifulSoupCrawlingContext) -> Any:
        return {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
The code runs on Python 3.14 and may only import modules already installed in the Actor.
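As a slightly larger illustration, a page function might also collect headings and count links. This is a sketch under assumptions: the output field names (`headings`, `link_count`) and the choice of `h2` are made up for the example, and the `crawlee` import is placed behind `TYPE_CHECKING` only so the snippet can be loaded outside the Actor. It relies on the standard BeautifulSoup methods `find_all` and `get_text`.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Only needed for type hints; inside the Actor a plain import works too.
    from crawlee.crawlers import BeautifulSoupCrawlingContext


def page_function(context: 'BeautifulSoupCrawlingContext') -> Any:
    soup = context.soup

    # Same guard as in the basic example: pages without <title> yield None.
    title = soup.title.string if soup.title else None

    # Text of every <h2>, with surrounding whitespace removed.
    headings = [h.get_text(strip=True) for h in soup.find_all('h2')]

    # Only anchors that actually carry an href attribute.
    links = [a['href'] for a in soup.find_all('a', href=True)]

    return {
        'url': context.request.url,
        'title': title,
        'headings': headings,     # illustrative field name
        'link_count': len(links), # illustrative field name
    }
```

Returning a plain dictionary keeps each stored record flat, which tends to export cleanly to tabular formats such as CSV.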
Proxy configuration
A proxy is required. Set proxyConfiguration to use Apify Proxy (automatic or selected groups) or your own custom proxy URLs:
    {
        "useApifyProxy": true,    // use Apify Proxy
        "apifyProxyGroups": [],   // optional: specific groups
        "proxyUrls": []           // or custom "scheme://user:pass@host:port" URLs
    }
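When the run input is assembled in Python, a small helper can build this object and catch malformed custom proxy URLs early. The helper name `build_proxy_configuration` is hypothetical, and the sketch assumes that custom URLs are sent with `useApifyProxy` set to `false`; check the Actor's input schema before relying on that.

```python
from typing import Optional, Sequence
from urllib.parse import urlparse


def build_proxy_configuration(
    groups: Optional[Sequence[str]] = None,
    custom_urls: Optional[Sequence[str]] = None,
) -> dict:
    """Hypothetical helper that builds a proxyConfiguration input object."""
    if custom_urls:
        if groups:
            raise ValueError('Use either Apify Proxy groups or custom URLs, not both')
        for url in custom_urls:
            parts = urlparse(url)
            # Expect the "scheme://user:pass@host:port" shape from the docs above.
            if not parts.scheme or not parts.netloc:
                raise ValueError(f'Not a valid proxy URL: {url!r}')
        # Assumption: custom proxies are used instead of Apify Proxy.
        return {'useApifyProxy': False, 'proxyUrls': list(custom_urls)}

    # Empty groups list means automatic Apify Proxy selection.
    return {'useApifyProxy': True, 'apifyProxyGroups': list(groups or [])}
```

Calling `build_proxy_configuration()` with no arguments yields the automatic Apify Proxy setting, and the result can be serialized with `json.dumps` as the `proxyConfiguration` field of the run input.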
Output
Results returned by your page function land in the run's default dataset. Download them as JSON, CSV, XML, or Excel from Apify Console, or via the API:
https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json&clean=true
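A minimal sketch of building that URL and downloading the items with the standard library follows. The function names are hypothetical, only the `format` and `clean` parameters shown above are used, and the sketch assumes the dataset is readable without extra authentication.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_BASE = 'https://api.apify.com/v2'


def dataset_items_url(dataset_id: str, fmt: str = 'json', clean: bool = True) -> str:
    """Build the dataset items URL for the given dataset ID."""
    query = urlencode({'format': fmt, 'clean': 'true' if clean else 'false'})
    # quote() guards against IDs containing characters unsafe in a URL path.
    return f'{API_BASE}/datasets/{quote(dataset_id, safe="")}/items?{query}'


def fetch_items(dataset_id: str) -> list:
    """Download and parse the dataset items as JSON (performs a network call)."""
    with urlopen(dataset_items_url(dataset_id), timeout=30) as response:
        return json.load(response)
```

Keeping URL construction separate from the download makes it easy to switch `fmt` to another export format and hand the URL to a different tool.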
Limitations
The Actor uses raw HTTP requests, so it can't render JavaScript. For dynamic sites use Web Scraper instead. To add Python modules not bundled here, open an issue or PR at github.com/apify/actor-beautifulsoup-scraper.