Pricing

from $2.50 / 1,000 scraped pages

Beautiful Soup Cloud Runner

Beautiful Soup Cloud Runner runs Python BS4 scraping tasks on Apify. Use CSS extraction rules or custom scripts to scrape static HTML pages, follow links, use proxies, save CSV exports, trigger webhooks, and export compact datasets.

Rating: 5.0 (1)

Developer: Sovanza

Last modified: 2 months ago

Beautiful Soup Cloud Runner – Run Python BS4 Scrapers in Apify Cloud

Host, schedule, and run Python web scraping tasks built with Beautiful Soup in the Apify cloud. Provide URLs and CSS extraction rules for instant scraping, or supply your own Python script with an entry function. This Actor is designed for developers and data teams who need a general-purpose runner for HTML parsing workflows, scheduled jobs, and Apify platform integration.

Overview

This Beautiful Soup Cloud Runner executes Python scraping and parsing tasks using HTTP fetching and Beautiful Soup (bs4). It supports two modes:

builtin — declarative CSS extraction rules on startUrls, with optional link crawling via the Apify request queue (maxDepth).
customScript — dynamically load a user Python module or inline source and call an entry function (default run).

Output is compact: empty or missing fields are omitted so each dataset row contains only what was extracted for that page.
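
The compaction rule above can be illustrated with a small helper. This is a minimal sketch, not the Actor's actual code: the name compact_item and the exact definition of "empty" (None, empty string, empty list, empty dict) are assumptions made for illustration.

```python
from typing import Any


def _is_empty(value: Any) -> bool:
    # Assumed definition of "empty" for this sketch; 0 and False are kept.
    return value is None or value == "" or value == [] or value == {}


def compact_item(item: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of item without empty values, recursing into nested dicts."""
    result: dict[str, Any] = {}
    for key, value in item.items():
        if isinstance(value, dict):
            value = compact_item(value)
        if not _is_empty(value):
            result[key] = value
    return result
```

Note that numeric zero is preserved, so a field such as depth stays in the row for seed pages.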

Key benefits

Run Beautiful Soup scrapers in the cloud without managing servers
Scrape with CSS extraction rules — no code required for simple tasks
Load custom Python scripts for advanced parsing logic
Follow links up to a configurable crawl depth via Apify request queue
Export clean datasets in JSON, CSV, or Excel via Apify
Integrate with schedules, webhooks, and the Apify API

Core features

Built-in extraction — CSS rules for text, html, and attr fields
Link crawling — optional maxDepth with Apify request queue (same pattern as the official BeautifulSoup template)
Custom scripts — load scriptModule or inline scriptSource; call entryFunction (default run)
Proxy support — Apify proxy via proxyConfiguration
Rate limiting — requestDelaySecs between HTTP requests
Retries — configurable per-URL retry policy (maxRetries, retryDelaySecs)
Dataset output — Actor.push_data() for each scraped page
CSV export — optional OUTPUT.csv in the key-value store
Webhook callback — POST run summary to webhookCallbackUrl when finished
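
The retry policy in the list above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's implementation: fetch_with_retries is a hypothetical helper, and it assumes maxRetries counts extra attempts after the first failure and that retryDelaySecs is a fixed sleep between attempts.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 2,
    retry_delay_secs: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on failure, retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return fetch(url)
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(retry_delay_secs)  # fixed delay before the next attempt
```

The fetch and sleep callables are injected so the policy can be exercised without network access or real waiting.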

How to Use Beautiful Soup Cloud Runner on Apify

Using the Actor

Open the Actor on the Apify platform and go to the Input tab.
Configure input (see below): set mode to builtin or customScript, add startUrls, define extract rules or a script path, and enable proxy if needed.
Start the run. The Actor fetches pages, parses HTML with Beautiful Soup, and pushes compact items to the default dataset.
Open the Dataset tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.
Schedule or integrate (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.

Input Configuration

Full schema: INPUT_SCHEMA.json. Example:

{
  "mode": "builtin",
  "startUrls": [
    "https://example.com/"
  ],
  "maxDepth": 0,
  "maxRequestsPerCrawl": 100,
  "sameOriginOnly": true,
  "extract": [
    { "name": "title", "selector": "title", "type": "text", "all": false },
    { "name": "h1", "selector": "h1", "type": "text", "all": false },
    { "name": "links", "selector": "a", "type": "attr", "attr": "href", "all": true }
  ],
  "includeLinks": true,
  "includeHtml": false,
  "parser": "lxml",
  "requestDelaySecs": 0,
  "maxRetries": 2,
  "retryDelaySecs": 3,
  "timeoutSecs": 60,
  "saveCsvToKeyValueStore": false,
  "webhookCallbackUrl": "",
  "cookies": "",
  "headers": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}

mode (optional): builtin (default) for declarative CSS extraction, or customScript to load a user Python script.
startUrls (required in builtin mode): One or more URLs to fetch and parse with Beautiful Soup.
maxDepth (optional): Follow links via the request queue up to this depth (default 0 = start URLs only, max 10). Links are limited to the seed host while sameOriginOnly is true. Builtin mode only.
maxRequestsPerCrawl (optional): Safety cap on total pages per run (default 100, max 10000).
sameOriginOnly (optional): When crawling, enqueue only URLs on the same host as the seed URL (default true).
extract (optional): CSS extraction rules — each rule has name, selector, type (text / html / attr), optional attr, and all (default rules extract title, h1, and links).
includeLinks (optional): Include absolute href links array in each dataset item (default false).
includeHtml (optional): Include full page HTML in each item — can be large (default false).
parser (optional): Beautiful Soup parser backend — lxml (default), html.parser, or html5lib.
requestDelaySecs (optional): Minimum delay between HTTP requests for rate limiting (default 0).
maxRetries (optional): Retry count per URL on HTTP failure (default 2, max 10).
retryDelaySecs (optional): Sleep between retries (default 3).
timeoutSecs (optional): Per-request HTTP timeout in seconds (default 60).
saveCsvToKeyValueStore (optional): Write combined CSV to default key-value store as OUTPUT.csv (default false).
webhookCallbackUrl (optional): POST JSON run summary to this URL when the Actor finishes.
scriptModule (optional): Path to Python file inside the Actor (e.g. scripts/example_titles_links.py). Required in customScript mode unless scriptSource is set.
scriptSource (optional): Inline Python source code — overrides scriptModule. Must define the entry function. Secret input.
entryFunction (optional): Function name to call in your script (default run). Signature: def run(context) or async def run(context).
scriptArgs (optional): Arbitrary JSON passed to your script via context.script_args.
cookies (optional): Raw Cookie header value for authenticated sessions. Secret input — encrypted at rest, not copied to dataset rows.
headers (optional): Extra HTTP headers as JSON object. Secret input.
proxyConfiguration (optional): Apify proxy settings; residential is recommended for blocked sites.
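
To make the extract rule fields (name, selector, type, attr, all) concrete, here is a minimal sketch of how such rules could be applied with Beautiful Soup. The helper names apply_rules and _read_value are hypothetical, and the handling of missing matches is an assumption; the Actor's own logic may differ. The sketch uses html.parser so that lxml is not required.

```python
from typing import Any

from bs4 import BeautifulSoup


def _read_value(element, rule: dict[str, Any]):
    """Read one value from a matched element according to the rule type."""
    kind = rule.get("type", "text")
    if kind == "text":
        return element.get_text(strip=True)
    if kind == "html":
        return str(element)  # outer HTML of the matched element
    if kind == "attr":
        return element.get(rule["attr"])
    raise ValueError(f"Unknown rule type: {kind}")


def apply_rules(html: str, rules: list[dict[str, Any]], parser: str = "html.parser") -> dict[str, Any]:
    """Apply CSS extraction rules and return only the fields that matched."""
    soup = BeautifulSoup(html, parser)
    extracted: dict[str, Any] = {}
    for rule in rules:
        take_all = rule.get("all", False)
        elements = soup.select(rule["selector"])
        if not take_all:
            elements = elements[:1]  # first match only
        values = [v for v in (_read_value(el, rule) for el in elements) if v]
        if values:
            extracted[rule["name"]] = values if take_all else values[0]
    return extracted
```

With all set to true a rule yields a list; otherwise it yields the first matching value, and rules with no match are left out of the result.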

Custom script mode example

Reference the bundled example script:

{
  "mode": "customScript",
  "startUrls": ["https://example.com/"],
  "scriptModule": "scripts/example_titles_links.py",
  "entryFunction": "run",
  "scriptArgs": { "maxPages": 5 }
}

Your entry function receives a context object with fetch(), parse_html(), push_data(), start_urls, and script_args. See scripts/example_titles_links.py.
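
As a rough illustration of what an entry function might look like, the sketch below uses only the context members named above. The exact signatures are assumptions: fetch(url) is assumed to return the response body as text, parse_html(html) a BeautifulSoup object, and push_data(item) to accept a dict. The bundled example script remains the authoritative reference.

```python
def run(context):
    """Hypothetical entry function: push the title and link count of each start URL."""
    # script_args comes from the scriptArgs input; maxPages is read defensively.
    max_pages = int((context.script_args or {}).get("maxPages", 5))
    for url in list(context.start_urls)[:max_pages]:
        html = context.fetch(url)        # assumed: returns HTML text
        soup = context.parse_html(html)  # assumed: returns a BeautifulSoup object
        title = soup.title.get_text(strip=True) if soup.title is not None else None
        links = [a["href"] for a in soup.select("a[href]")]
        context.push_data({"url": url, "title": title, "linkCount": len(links)})
```

Because the function touches the outside world only through context, it can be exercised locally with a stub object before it is uploaded.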

Authentication & sensitive input

Fields that can hold credentials use Apify secret input (isSecret: true), following the same pattern as the Python Scraper and Instagram Reels Scraper actors:

cookies — raw Cookie header value
headers — custom HTTP headers JSON
scriptSource — inline Python source code

Secret values are encrypted in storage and are not written into dataset rows or logs.

Environment variables (optional)

APIFY_LOG_LEVEL: Log verbosity (default INFO via apify.json).
APIFY_TOKEN: Required when using Apify proxy from your local machine.

Run locally

INPUT.json is gitignored. Copy INPUT.example.json to INPUT.json, set APIFY_TOKEN if using Apify proxy from your machine, then:

cd beautiful-soup-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py

Results are written to storage/datasets/default/ when running outside the Apify platform.
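
After a local run, the items can be loaded back for inspection. The sketch below assumes the local dataset directory holds one JSON file per item and that any file whose name starts with a double underscore is metadata; both are assumptions about the local storage layout, and load_local_dataset is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any


def load_local_dataset(directory: str = "storage/datasets/default") -> list[dict[str, Any]]:
    """Load dataset items from a local run, in file-name order."""
    items: list[dict[str, Any]] = []
    for path in sorted(Path(directory).glob("*.json")):
        if path.name.startswith("__"):
            continue  # assumed metadata file, not a dataset item
        items.append(json.loads(path.read_text(encoding="utf-8")))
    return items
```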

Run tests:

python -m unittest discover -s tests -v

Push to Apify:

apify login
apify push

Output

Results are stored in the Actor's default dataset. Each item is a compact JSON object: fields that are empty or unknown are not included.

Typical fields (when data is available):

URLs: inputUrl, finalUrl.
Status: status (OK / ERROR), httpStatus.
Page metadata: pageTitle, depth (crawl depth from seed).
Extracted data: extracted (object with fields from CSS rules or custom script).
Links: links (when includeLinks is enabled).
HTML: html (when includeHtml is enabled).
Meta: timestamp.
Errors: error on failure rows.

Example item (illustrative — real items only include keys that have values):

{
  "inputUrl": "https://example.com/",
  "finalUrl": "https://example.com/",
  "status": "OK",
  "httpStatus": 200,
  "pageTitle": "Example Domain",
  "depth": 0,
  "extracted": {
    "title": "Example Domain",
    "h1": "Example Domain",
    "links": ["https://www.iana.org/domains/example"]
  },
  "links": ["https://www.iana.org/domains/example"],
  "timestamp": "2026-06-05T12:00:00Z"
}

Example error row:

{
  "inputUrl": "https://example.com/bad-page",
  "finalUrl": "https://example.com/bad-page",
  "status": "ERROR",
  "httpStatus": null,
  "pageTitle": null,
  "extracted": {},
  "error": "Client error '404 Not Found' for url 'https://example.com/bad-page'",
  "timestamp": "2026-06-05T12:00:00Z"
}
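
Because compact rows do not all share the same keys, downstream code should read fields defensively. The sketch below separates OK rows from ERROR rows and writes a CSV with a fixed column list; split_by_status and rows_to_csv are hypothetical helpers, not part of the Actor.

```python
import csv
import io
from typing import Any, Iterable


def split_by_status(rows: Iterable[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate successful rows from failure rows using the status field."""
    rows = list(rows)
    ok = [row for row in rows if row.get("status") == "OK"]
    failed = [row for row in rows if row.get("status") == "ERROR"]
    return ok, failed


def rows_to_csv(rows: Iterable[dict[str, Any]], columns: list[str]) -> str:
    """Write rows to CSV text; omitted keys become empty cells, extra keys are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```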

➡️ Output is structured for pipelines, warehouses, or spreadsheet export via Apify.

Use Cases

General HTML scraping: extract titles, links, and custom fields from static pages with Beautiful Soup.
Scheduled monitoring: run scrapers on a cron via Apify Schedules and track changes over time.
Custom parser hosting: deploy reusable Python BS4 modules without building a new Actor from scratch.
Workflow integration: chain with other Actors using webhooks, API calls, or Zapier/Make.
Multi-page crawling: follow same-origin links up to maxDepth for sitemap-style extraction.

Integrations & API

Run and fetch results through the Apify API
Use Python, Node.js, or HTTP clients against run and dataset endpoints
Connect Zapier, Make, Google Sheets, and other Apify integrations
Webhooks and schedules for recurring runs
Optional webhookCallbackUrl input for a custom POST at end of run
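
As one way to call the Actor from plain Python, the sketch below builds a request for the Apify API endpoint that runs an Actor synchronously and returns its dataset items. The actor ID is a placeholder in the username~actor-name form and must be replaced with the real one; build_run_request and run_and_fetch_items are hypothetical helper names. Synchronous runs suit short jobs only; longer crawls are better started asynchronously and polled.

```python
import json
import urllib.request
from typing import Any

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict[str, Any]) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_and_fetch_items(actor_id: str, token: str, run_input: dict[str, Any], timeout: float = 300.0) -> Any:
    """Send the request and decode the JSON response (performs a real network call)."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```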

Why Choose This Actor?

General-purpose cloud runner for any Beautiful Soup scraping task on Apify
Zero-code mode with declarative CSS extraction rules
Custom script mode for advanced parsing without a separate Actor build
Compact JSON — no noise from empty fields
Built for Apify datasets, request queues, proxies, exports, and API access
Same operational patterns as Selenium Cloud Runner and Instagram Reels Scraper in this repo

FAQ

How does Beautiful Soup Cloud Runner work?

It reads input JSON, fetches pages over HTTP (with optional Apify proxy), parses HTML with Beautiful Soup, applies CSS extraction rules or runs your custom Python entry function, and pushes structured rows to the default dataset via Actor.push_data().

When should I use builtin vs customScript?

Use builtin for quick CSS-based scraping without writing code. Use customScript when you need custom parsing logic, multi-step flows, or reusable Python modules.

Can I scrape multiple pages in one run?

Yes. Add multiple URLs to startUrls, or set maxDepth > 0 to follow links via the Apify request queue (builtin mode, same-origin by default).

Does this support JavaScript-rendered pages?

No. Beautiful Soup parses static HTML from HTTP responses. For JavaScript-heavy sites, use Selenium Cloud Runner in this repo.

Can I run untrusted scripts?

Only run scripts you trust. customScript mode executes arbitrary Python code in the Actor environment.

What is the request queue used for?

When maxDepth > 0, the Actor enqueues discovered links in the Apify request queue and processes them one by one — the same pattern as the official Apify BeautifulSoup template.
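
The queue-based crawl can be pictured as a breadth-first traversal with a depth limit. The sketch below is a simplified in-memory stand-in, not the Apify request queue: crawl and get_links are hypothetical names, and get_links is expected to return the href values found on a page.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urljoin, urlsplit


def crawl(
    start_urls: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 0,
    max_requests: int = 100,
    same_origin_only: bool = True,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) in processing order."""
    queue: deque[tuple[str, int, str]] = deque()
    seen: set[str] = set()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0, urlsplit(url).netloc))
    visited: list[tuple[str, int]] = []
    while queue and len(visited) < max_requests:
        url, depth, seed_host = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth limit
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if same_origin_only and urlsplit(link).netloc != seed_host:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1, seed_host))
    return visited
```

With max_depth set to 0 only the start URLs are processed, which mirrors the documented default.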

Why am I getting empty extracted fields?

Possible reasons: incorrect CSS selector, page returned an error/challenge page, or content is loaded dynamically via JavaScript (use Selenium instead).

What formats can I download?

JSON, CSV, and Excel from the Apify dataset UI, plus full access via the Apify API. Enable saveCsvToKeyValueStore for a combined OUTPUT.csv in the key-value store.

Can I integrate this into automation workflows?

Yes. Use Apify schedules, platform webhooks, webhookCallbackUrl, or the Apify API to trigger runs and consume dataset output.

Is web scraping legal?

Only you can ensure compliance. Use public data responsibly, and respect each site's Terms of Service, its robots.txt rules, and local law.

SEO Keywords

beautiful soup cloud runner
beautiful soup scraper apify
python web scraper apify
bs4 scraper
html parser scraper
apify beautiful soup
python scraping actor
web scraping cloud runner
css extraction scraper
apify request queue scraper

Actor permissions

This Actor is intended to work with limited permissions: it reads your input and writes to its default dataset, and it uses Apify proxy and the default key-value store only as configured. It does not require broad access to unrelated account data.

To set limited permissions in Apify Console:

Open your Actor on the Apify platform.
Go to Source or Settings.
Open Review permissions / Permissions.
Choose Limited permissions and save.

Limitations

Beautiful Soup only parses static HTML — no JavaScript rendering.
Site HTML structure changes may break CSS selectors; update rules or scripts as needed.
Heavy crawling may require higher Apify memory, proxy budgets, and respectful requestDelaySecs.
customScript mode executes user code — only run trusted scripts.
Some targets block datacenter IPs — use residential proxy when needed.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Get Started

Add your start URLs, define extraction rules or a custom script, enable proxy if needed, and start your first run on Apify. 🚀
