BeautifulSoup Scraper

apify/beautifulsoup-scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Start URLs

startUrls (array, required)

A static list of URLs to scrape.

Max crawling depth

maxCrawlingDepth (integer, optional)

Specifies how many links away from the Start URLs the scraper will descend. Note that pages added using context.request_queue in the Page function are not subject to the maximum depth constraint.

Default value of this property is 1
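
To illustrate what "links away from the Start URLs" means, here is a minimal sketch of a depth-limited breadth-first crawl over an in-memory link graph. It is not the Actor's implementation: the crawl function, the LINKS graph, and the convention that Start URLs sit at depth 0 are all assumptions made for the example.

```python
from collections import deque


def crawl(start_urls, links, max_depth):
    """Return {url: depth} for every page reachable within max_depth links.

    `links` is a hypothetical mapping of page URL -> URLs linked from it.
    Start URLs are assumed to be at depth 0.
    """
    seen = {url: 0 for url in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # links found on pages at the depth limit are not followed
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen


# Hypothetical site structure used only for this illustration.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
}

visited = crawl(["https://example.com/"], LINKS, max_depth=1)
print(visited)
```

With max_depth=1, the two pages linked directly from the start page are visited, while https://example.com/a/1 (two links away) is not.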

Request timeout

requestTimeout (integer, optional)

The maximum duration (in seconds) for the request to complete before timing out. The timeout value is passed to the httpx.AsyncClient object.

Default value of this property is 10

Link selector

linkSelector (string, optional)

A CSS selector stating which links on the page (<a> elements with href attribute) shall be followed and added to the request queue. To filter the links added to the queue, use the Link patterns field.

If the Link selector is empty, the page links are ignored. You can still work with the page links and the request queue in the Page function.

Link patterns

linkPatterns (array, optional)

Link patterns (regular expressions) to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the link patterns will cause the scraper to enqueue all links matched by the Link selector.
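
The interplay between the two fields can be sketched as a filter applied to the links that the Link selector already matched. This is only an illustration: filter_links is a hypothetical helper, and the use of re.search (rather than re.match or re.fullmatch) is an assumption, since the text above does not say how the Actor applies the patterns.

```python
import re


def filter_links(hrefs, patterns):
    """Keep the hrefs that match at least one regular expression.

    An empty pattern list keeps every href, mirroring the rule that
    omitting Link patterns enqueues all links matched by the Link selector.
    """
    if not patterns:
        return list(hrefs)
    compiled = [re.compile(pattern) for pattern in patterns]
    return [href for href in hrefs if any(rx.search(href) for rx in compiled)]


# Hypothetical links, as if already picked out by the Link selector.
hrefs = [
    "https://example.com/blog/1",
    "https://example.com/about",
    "https://example.com/blog/2",
]

blog_links = filter_links(hrefs, [r"/blog/\d+$"])
all_links = filter_links(hrefs, [])
print(blog_links)
```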

Page function

pageFunction (string, required)

A Python function that is executed for every page. Use it to scrape data from the page, perform actions, or add new URLs to the request queue. The page function has its own naming scope, and you can import any installed module. Typically, you obtain the data from the context.soup object and return it. The function must be named page_function; this identifier cannot be changed. For more information about the context object passed to page_function, see github.com/apify/actor-beautifulsoup-scraper#context. Asynchronous functions are supported.
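
A page function might look roughly like the sketch below. Only the name page_function and the context.soup attribute come from the description above; the shape of context.request (a dict with a "url" key) is an assumption, so check the linked context documentation before relying on it. The stand-in context at the bottom exists only so that the sketch runs on its own, without the Actor or BeautifulSoup installed.

```python
from types import SimpleNamespace
from typing import Any


def page_function(context: Any) -> Any:
    # Assumption: context.request is a dict that carries the page URL.
    url = context.request["url"]
    # context.soup is the parsed page; .title is None when there is no <title>.
    title_tag = context.soup.title
    title = title_tag.string if title_tag is not None else None
    return {"url": url, "title": title}


# Stand-in for the real context object, for local experimentation only.
fake_context = SimpleNamespace(
    request={"url": "https://example.com/"},
    soup=SimpleNamespace(title=SimpleNamespace(string="Example Domain")),
)

result = page_function(fake_context)
print(result)
```

Returning a plain dict keeps the function easy to test in isolation, as the stand-in context shows.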

BeautifulSoup features

soupFeatures (string, optional)

The value of the BeautifulSoup features argument. From the BeautifulSoup docs: Desirable features of the parser to be used. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or it may be the type of markup to be used ("html", "html5", "xml"). It's recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.

BeautifulSoup from_encoding

soupFromEncoding (string, optional)

The value of the BeautifulSoup from_encoding argument. From the BeautifulSoup docs: A string indicating the encoding of the document to be parsed. Pass this in if Beautiful Soup is guessing wrongly about the document's encoding.

BeautifulSoup exclude_encodings

soupExcludeEncodings (array, optional)

The value of the BeautifulSoup exclude_encodings argument. From the BeautifulSoup docs: A list of strings indicating encodings known to be wrong. Pass this in if you don't know the document's encoding but you know Beautiful Soup's guess is wrong.
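
The three BeautifulSoup fields above correspond to keyword arguments of the BeautifulSoup constructor. The sketch below calls the constructor directly to show their effect; it assumes the beautifulsoup4 package is installed, and the sample bytes are invented for the example. How the Actor itself forwards these inputs is not shown here.

```python
from bs4 import BeautifulSoup

# Invented sample: a Latin-1 encoded page where 0xE9 is "é".
raw = b"<html><head><title>Caf\xe9</title></head><body></body></html>"

# Name a specific parser and state the encoding explicitly.
soup = BeautifulSoup(raw, features="html.parser", from_encoding="latin-1")
title = soup.title.string
print(title)

# Alternatively, when the encoding is unknown but one guess is known to be
# wrong, rule that guess out and let Beautiful Soup try the remaining ones.
soup_guess = BeautifulSoup(raw, features="html.parser", exclude_encodings=["utf-8"])
print(soup_guess.original_encoding)
```

The encoding chosen in the second call depends on which detection libraries are installed, so treat its output as informational.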

Proxy configuration

proxyConfiguration (object, required)

Specifies proxy servers that will be used by the scraper in order to hide its origin.

Default value of this property is {"useApifyProxy":true}
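
Putting the fields together, a complete input could be assembled as in the sketch below. The field names and the defaults for maxCrawlingDepth, requestTimeout, and proxyConfiguration come from the descriptions above; the {"url": ...} shape of the startUrls entries, the example selector and pattern, and the page function body are assumptions chosen for illustration.

```python
import json

# Source of the page function, passed to the Actor as a string.
# Assumption: context.request is a dict with a "url" key.
PAGE_FUNCTION_SOURCE = '''
def page_function(context):
    title_tag = context.soup.title
    return {
        "url": context.request["url"],
        "title": title_tag.string if title_tag is not None else None,
    }
'''

actor_input = {
    # Required fields
    "startUrls": [{"url": "https://example.com/"}],  # entry shape is assumed
    "pageFunction": PAGE_FUNCTION_SOURCE,
    "proxyConfiguration": {"useApifyProxy": True},  # documented default
    # Optional fields
    "maxCrawlingDepth": 1,  # documented default
    "requestTimeout": 10,  # documented default, in seconds
    "linkSelector": "a[href]",
    "linkPatterns": [r".*example\.com.*"],
    "soupFeatures": "html.parser",
}

input_json = json.dumps(actor_input, indent=2)
print(input_json)
```

Building the input as a Python dict and serializing it with json.dumps avoids hand-escaping the multi-line page function.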

Developer
Maintained by Apify

Actor Metrics

  • 23 monthly users

  • 4 stars

  • 94% runs succeeded

  • Created in Jul 2023

  • Modified 2 months ago
