Beautiful Soup Cloud Runner

Pricing

from $2.50 / 1,000 scraped pages

Beautiful Soup Cloud Runner runs Python BS4 scraping tasks on Apify. Use CSS extraction rules or custom scripts to scrape static HTML pages, follow links, use proxies, save CSV exports, trigger webhooks, and export compact datasets.

Rating: 5.0 (1)

Developer: Sovanza (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 11 days ago

Beautiful Soup Cloud Runner – Run Python BS4 Scrapers in Apify Cloud

Host, schedule, and run Python web scraping tasks built with Beautiful Soup in the Apify cloud. Provide URLs and CSS extraction rules for instant scraping, or supply your own Python script with an entry function. This actor is designed for developers and data teams who need a general-purpose runner for HTML parsing workflows, scheduled jobs, and Apify platform integration.

Overview

This Beautiful Soup Cloud Runner executes Python scraping and parsing tasks using HTTP fetching and Beautiful Soup (bs4). It supports two modes:

  • builtin — declarative CSS extraction rules on startUrls, with optional link crawling via the Apify request queue (maxDepth).
  • customScript — dynamically load a user Python module or inline source and call an entry function (default run).

Output is compact: empty or missing fields are omitted so each dataset row contains only what was extracted for that page.
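
The compact-output rule can be illustrated with a small helper. This is a minimal sketch, not the Actor's actual code: the function name `compact` and the exact definition of "empty" (None, empty string, empty list, empty object) are assumptions made for illustration.

```python
def compact(item: dict) -> dict:
    """Return a copy of item without empty values (assumed definition of 'empty')."""
    empty_values = (None, "", [], {})
    result = {}
    for key, value in item.items():
        if isinstance(value, dict):
            value = compact(value)  # clean nested objects such as 'extracted'
        if value in empty_values:
            continue  # drops None, "", [] and {} but keeps 0 and False
        result[key] = value
    return result


row = compact({
    "inputUrl": "https://example.com/",
    "pageTitle": "",
    "depth": 0,
    "extracted": {"title": "Example Domain", "h1": None},
})
print(row)
```

Note that a depth of 0 survives the filter, which matters because start URLs are reported at depth 0.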

Key benefits

  • Run Beautiful Soup scrapers in the cloud without managing servers
  • Scrape with CSS extraction rules — no code required for simple tasks
  • Load custom Python scripts for advanced parsing logic
  • Follow links up to a configurable crawl depth via Apify request queue
  • Export clean datasets in JSON, CSV, or Excel via Apify
  • Integrate with schedules, webhooks, and the Apify API

Core features

  • Built-in extraction: CSS rules for text, html, and attr fields
  • Link crawling: optional maxDepth with Apify request queue (same pattern as the official BeautifulSoup template)
  • Custom scripts: load scriptModule or inline scriptSource; call entryFunction (default run)
  • Proxy support: Apify proxy via proxyConfiguration
  • Rate limiting: requestDelaySecs between HTTP requests
  • Retries: configurable per-URL retry policy (maxRetries, retryDelaySecs)
  • Dataset output: Actor.push_data() for each scraped page
  • CSV export: optional OUTPUT.csv in the key-value store
  • Webhook callback: POST run summary to webhookCallbackUrl when finished
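
How the rate-limiting and retry settings interact can be sketched as follows. This is an illustration under stated assumptions, not the Actor's implementation: it assumes maxRetries counts extra attempts after the first try, and the helper names `fetch_with_retries`, `fetch_all`, and `flaky_fetch` are hypothetical. The fetch function and sleep are injected so the sketch runs without network access.

```python
import time


def fetch_with_retries(fetch, url, max_retries=2, retry_delay_secs=3.0, sleep=time.sleep):
    """Call fetch(url); on failure retry up to max_retries more times."""
    attempts = max(0, max_retries) + 1  # first try plus the configured retries
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(retry_delay_secs)


def fetch_all(urls, fetch, request_delay_secs=0.0, sleep=time.sleep, **retry_options):
    """Fetch URLs one by one, pausing request_delay_secs between them."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0 and request_delay_secs > 0:
            sleep(request_delay_secs)  # rate limiting between consecutive URLs
        pages[url] = fetch_with_retries(fetch, url, sleep=sleep, **retry_options)
    return pages


# Demo with a fake fetcher that fails twice for the first URL, then succeeds.
failures_left = {"https://example.com/": 2}


def flaky_fetch(url):
    if failures_left.get(url, 0) > 0:
        failures_left[url] -= 1
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"


waits = []  # records each pause instead of really sleeping
pages = fetch_all(
    ["https://example.com/", "https://example.com/about"],
    flaky_fetch,
    request_delay_secs=1,
    sleep=waits.append,
    max_retries=2,
    retry_delay_secs=3,
)
print(waits)
```

In the demo the first URL needs two retry pauses of 3 seconds, and one 1-second rate-limit pause separates the two URLs.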

How to Use Beautiful Soup Cloud Runner on Apify

Using the Actor

  1. Open the Actor on the Apify platform and go to the Input tab.
  2. Configure input (see below): set mode to builtin or customScript, add startUrls, define extract rules or a script path, and enable proxy if needed.
  3. Start the run. The Actor fetches pages, parses HTML with Beautiful Soup, and pushes compact items to the default dataset.
  4. Open the Dataset tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.
  5. Schedule or integrate (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.

Input Configuration

Full schema: INPUT_SCHEMA.json. Example:

{
  "mode": "builtin",
  "startUrls": [
    "https://example.com/"
  ],
  "maxDepth": 0,
  "maxRequestsPerCrawl": 100,
  "sameOriginOnly": true,
  "extract": [
    { "name": "title", "selector": "title", "type": "text", "all": false },
    { "name": "h1", "selector": "h1", "type": "text", "all": false },
    { "name": "links", "selector": "a", "type": "attr", "attr": "href", "all": true }
  ],
  "includeLinks": true,
  "includeHtml": false,
  "parser": "lxml",
  "requestDelaySecs": 0,
  "maxRetries": 2,
  "retryDelaySecs": 3,
  "timeoutSecs": 60,
  "saveCsvToKeyValueStore": false,
  "webhookCallbackUrl": "",
  "cookies": "",
  "headers": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}

  • mode (optional): builtin (default) for declarative CSS extraction, or customScript to load a user Python script.
  • startUrls (required in builtin mode): One or more URLs to fetch and parse with Beautiful Soup.
  • maxDepth (optional): Follow same-origin links via request queue up to this depth (default 0 = start URLs only, max 10). Builtin mode only.
  • maxRequestsPerCrawl (optional): Safety cap on total pages per run (default 100, max 10000).
  • sameOriginOnly (optional): When crawling, enqueue only URLs on the same host as the seed URL (default true).
  • extract (optional): CSS extraction rules — each rule has name, selector, type (text / html / attr), optional attr, and all (default rules extract title, h1, and links).
  • includeLinks (optional): Include absolute href links array in each dataset item (default false).
  • includeHtml (optional): Include full page HTML in each item — can be large (default false).
  • parser (optional): Beautiful Soup parser backend — lxml (default), html.parser, or html5lib.
  • requestDelaySecs (optional): Minimum delay between HTTP requests for rate limiting (default 0).
  • maxRetries (optional): Retry count per URL on HTTP failure (default 2, max 10).
  • retryDelaySecs (optional): Sleep between retries (default 3).
  • timeoutSecs (optional): Per-request HTTP timeout in seconds (default 60).
  • saveCsvToKeyValueStore (optional): Write combined CSV to default key-value store as OUTPUT.csv (default false).
  • webhookCallbackUrl (optional): POST JSON run summary to this URL when the Actor finishes.
  • scriptModule (optional): Path to Python file inside the Actor (e.g. scripts/example_titles_links.py). Required in customScript mode unless scriptSource is set.
  • scriptSource (optional): Inline Python source code — overrides scriptModule. Must define the entry function. Secret input.
  • entryFunction (optional): Function name to call in your script (default run). Signature: run(context) or async run(context).
  • scriptArgs (optional): Arbitrary JSON passed to your script via context.script_args.
  • cookies (optional): Raw Cookie header value for authenticated sessions. Secret input — encrypted at rest, not copied to dataset rows.
  • headers (optional): Extra HTTP headers as JSON object. Secret input.
  • proxyConfiguration (optional): Apify proxy settings; residential is recommended for blocked sites.
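
To make the extract rules concrete, here is a hedged sketch of how rules of this shape could be applied with Beautiful Soup. The helper names `apply_rules` and `read_value` are hypothetical and the Actor's real logic may differ. The sketch uses the html.parser backend so it needs no extra dependency, whereas the Actor defaults to lxml.

```python
from bs4 import BeautifulSoup


def read_value(element, rule):
    """Read one value from a matched element according to the rule type."""
    kind = rule.get("type", "text")
    if kind == "attr":
        return element.get(rule["attr"])  # None when the attribute is missing
    if kind == "html":
        return element.decode_contents()  # inner HTML of the element
    return element.get_text(strip=True)


def apply_rules(html, rules, parser="html.parser"):
    """Apply CSS extraction rules and return only the fields that matched."""
    soup = BeautifulSoup(html, parser)
    extracted = {}
    for rule in rules:
        want_all = rule.get("all", False)
        elements = soup.select(rule["selector"])
        if not want_all:
            elements = elements[:1]  # first match only
        values = [value for value in (read_value(el, rule) for el in elements) if value]
        if values:  # skip fields with no data, in the spirit of compact output
            extracted[rule["name"]] = values if want_all else values[0]
    return extracted


html = (
    "<html><head><title>Example Domain</title></head><body>"
    "<h1>Example Domain</h1>"
    '<a href="https://www.iana.org/domains/example">More information</a>'
    "<a>anchor without href</a></body></html>"
)
rules = [
    {"name": "title", "selector": "title", "type": "text", "all": False},
    {"name": "links", "selector": "a", "type": "attr", "attr": "href", "all": True},
    {"name": "price", "selector": ".price", "type": "text", "all": False},
]
extracted = apply_rules(html, rules)
print(extracted)
```

The `price` rule matches nothing, so that key is absent from the result rather than set to null.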

Custom script mode example

Reference the bundled example script:

{
  "mode": "customScript",
  "startUrls": ["https://example.com/"],
  "scriptModule": "scripts/example_titles_links.py",
  "entryFunction": "run",
  "scriptArgs": { "maxPages": 5 }
}

Your entry function receives a context object with fetch(), parse_html(), push_data(), start_urls, and script_args. See scripts/example_titles_links.py.
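
A custom entry function might look like the sketch below. The attribute names on context come from the description above, but their exact signatures are not documented here, so the sketch assumes that fetch(url) returns the page HTML as text, that parse_html(html) returns a BeautifulSoup object, and that start_urls is a list of URL strings. The stub context exists only so the function can be dry-run locally; the bundled example script remains the authoritative reference.

```python
from types import SimpleNamespace


def run(context):
    """Entry function: push one row with the page title for each start URL."""
    args = context.script_args or {}
    max_pages = args.get("maxPages", 5)
    for url in list(context.start_urls)[:max_pages]:
        html = context.fetch(url)        # assumption: returns the page HTML as text
        soup = context.parse_html(html)  # assumption: returns a BeautifulSoup object
        title = soup.title.get_text(strip=True) if soup.title else None
        context.push_data({"url": url, "title": title})


# Stand-in context for a local dry run; the real one is supplied by the Actor.
pushed = []
stub_title = SimpleNamespace(get_text=lambda strip=False: "Example Domain")
stub_context = SimpleNamespace(
    start_urls=["https://example.com/", "https://example.org/"],
    script_args={"maxPages": 1},
    fetch=lambda url: "<title>Example Domain</title>",
    parse_html=lambda html: SimpleNamespace(title=stub_title),
    push_data=pushed.append,
)
run(stub_context)
print(pushed)
```

Because maxPages is 1 in the stub, only the first start URL is processed.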

Authentication & sensitive input

Fields that can hold credentials use Apify secret input (isSecret: true), following the same pattern as the Python Scraper and Instagram Reels Scraper actors:

  • cookies — raw Cookie header value
  • headers — custom HTTP headers JSON
  • scriptSource — inline Python source code

Secret values are encrypted in storage and are not written into dataset rows or logs.

Environment variables (optional)

Variable          Purpose
APIFY_LOG_LEVEL   Log verbosity (default INFO via apify.json).
APIFY_TOKEN       Required when using Apify proxy from your local machine.

Run locally

INPUT.json is gitignored. Copy INPUT.example.json to INPUT.json, set APIFY_TOKEN if using Apify proxy from your machine, then:

cd beautiful-soup-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py

Results are written to storage/datasets/default/ when running outside the Apify platform.
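
After a local run, the items can be loaded back for inspection. The sketch below assumes the usual Apify local storage layout of one JSON file per item, plus metadata files whose names start with two underscores; both the layout and the file names in the demo are assumptions, so check your own storage directory. The helper name `load_local_items` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_local_items(dataset_dir):
    """Load every item JSON file in a local dataset directory, in file-name order."""
    items = []
    for path in sorted(Path(dataset_dir).glob("*.json")):
        if path.name.startswith("__"):
            continue  # skip metadata files (assumed naming)
        items.append(json.loads(path.read_text(encoding="utf-8")))
    return items


# Demo against a throwaway directory; after a real run, pass
# "storage/datasets/default" instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "000000001.json").write_text(json.dumps({"status": "OK"}), encoding="utf-8")
    Path(tmp, "000000002.json").write_text(json.dumps({"status": "ERROR"}), encoding="utf-8")
    Path(tmp, "__metadata__.json").write_text("{}", encoding="utf-8")
    demo_items = load_local_items(tmp)

print(demo_items)
```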

Run tests:

python -m unittest discover -s tests -v

Push to Apify:

apify login
apify push

Output

Results are stored in the Actor's default dataset. Each item is a compact JSON object: fields that are empty or unknown are not included.

Typical fields (when data is available):

  • URLs: inputUrl, finalUrl.
  • Status: status (OK / ERROR), httpStatus.
  • Page metadata: pageTitle, depth (crawl depth from seed).
  • Extracted data: extracted (object with fields from CSS rules or custom script).
  • Links: links (when includeLinks is enabled).
  • HTML: html (when includeHtml is enabled).
  • Meta: timestamp.
  • Errors: error on failure rows.

Example item (illustrative — real items only include keys that have values):

{
  "inputUrl": "https://example.com/",
  "finalUrl": "https://example.com/",
  "status": "OK",
  "httpStatus": 200,
  "pageTitle": "Example Domain",
  "depth": 0,
  "extracted": {
    "title": "Example Domain",
    "h1": "Example Domain",
    "links": ["https://www.iana.org/domains/example"]
  },
  "links": ["https://www.iana.org/domains/example"],
  "timestamp": "2026-06-05T12:00:00Z"
}

Example error row:

{
  "inputUrl": "https://example.com/bad-page",
  "finalUrl": "https://example.com/bad-page",
  "status": "ERROR",
  "httpStatus": null,
  "pageTitle": null,
  "extracted": {},
  "error": "Client error '404 Not Found' for url 'https://example.com/bad-page'",
  "timestamp": "2026-06-05T12:00:00Z"
}

➡️ Output is structured for pipelines, warehouses, or spreadsheet export via Apify.

Use Cases

  • General HTML scraping: extract titles, links, and custom fields from static pages with Beautiful Soup.
  • Scheduled monitoring: run scrapers on a cron via Apify Schedules and track changes over time.
  • Custom parser hosting: deploy reusable Python BS4 modules without building a new Actor from scratch.
  • Workflow integration: chain with other Actors using webhooks, API calls, or Zapier/Make.
  • Multi-page crawling: follow same-origin links up to maxDepth for sitemap-style extraction.

Integrations & API

  • Run and fetch results through the Apify API
  • Use Python, Node.js, or HTTP clients against run and dataset endpoints
  • Connect Zapier, Make, Google Sheets, and other Apify integrations
  • Webhooks and schedules for recurring runs
  • Optional webhookCallbackUrl input for a custom POST at end of run
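
For orientation, a POST of a JSON run summary to a callback URL can be built with the standard library as sketched below. The summary fields and the receiver URL are hypothetical, since the actual payload sent to webhookCallbackUrl is not specified here. The request is only constructed, not sent, so the sketch runs offline.

```python
import json
import urllib.request


def build_callback_request(callback_url, summary):
    """Build a JSON POST request carrying a run summary."""
    body = json.dumps(summary).encode("utf-8")
    return urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_callback_request(
    "https://example.com/hooks/scrape-finished",  # hypothetical receiver URL
    {"status": "SUCCEEDED", "itemCount": 3},      # hypothetical summary fields
)
# urllib.request.urlopen(request, timeout=10) would send it; left out on purpose.
print(request.get_method(), request.full_url)
```

A receiver on your side only needs to accept a POST with a JSON body at the configured URL.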

Why Choose This Actor?

  • General-purpose cloud runner for any Beautiful Soup scraping task on Apify
  • Zero-code mode with declarative CSS extraction rules
  • Custom script mode for advanced parsing without a separate Actor build
  • Compact JSON — no noise from empty fields
  • Built for Apify datasets, request queues, proxies, exports, and API access
  • Same operational patterns as Selenium Cloud Runner and Instagram Reels Scraper in this repo

FAQ

How does Beautiful Soup Cloud Runner work?

It reads input JSON, fetches pages over HTTP (with optional Apify proxy), parses HTML with Beautiful Soup, applies CSS extraction rules or runs your custom Python entry function, and pushes structured rows to the default dataset via Actor.push_data().

When should I use builtin vs customScript?

Use builtin for quick CSS-based scraping without writing code. Use customScript when you need custom parsing logic, multi-step flows, or reusable Python modules.

Can I scrape multiple pages in one run?

Yes. Add multiple URLs to startUrls, or set maxDepth > 0 to follow links via the Apify request queue (builtin mode, same-origin by default).

Does this support JavaScript-rendered pages?

No. Beautiful Soup parses static HTML from HTTP responses. For JavaScript-heavy sites, use Selenium Cloud Runner in this repo.

Can I run untrusted scripts?

Only run scripts you trust. customScript mode executes arbitrary Python code in the Actor environment.

What is the request queue used for?

When maxDepth > 0, the Actor enqueues discovered links in the Apify request queue and processes them one by one — the same pattern as the official Apify BeautifulSoup template.
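
The queue-driven crawl can be pictured with a local stand-in. The sketch below uses collections.deque in place of the Apify request queue and a dictionary in place of real pages, so it is an illustration of depth-limited, same-host crawling rather than the Actor's code. The function name `crawl` and the `get_links` callback are hypothetical.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def crawl(start_urls, get_links, max_depth=0, max_requests=100, same_origin_only=True):
    """Breadth-first crawl; returns (url, depth) pairs in processing order."""
    queue = deque((url, 0, urlsplit(url).netloc) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_requests:
        url, depth, seed_host = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth limit
        for href in get_links(url):
            link = urljoin(url, href).split("#")[0]  # absolute URL, fragment removed
            if link in seen:
                continue  # each URL is processed at most once
            if same_origin_only and urlsplit(link).netloc != seed_host:
                continue  # keep the crawl on the seed URL's host
            seen.add(link)
            queue.append((link, depth + 1, seed_host))
    return visited


# Fake site: page URL -> hrefs found on that page.
site = {
    "https://example.com/": ["/a", "/b", "https://other.example.org/"],
    "https://example.com/a": ["/c", "/"],
}


def links_of(url):
    return site.get(url, [])


depth_one = crawl(["https://example.com/"], links_of, max_depth=1)
depth_two = crawl(["https://example.com/"], links_of, max_depth=2)
capped = crawl(["https://example.com/"], links_of, max_depth=2, max_requests=2)
print(depth_one)
```

With max_depth=1 the page /c is never reached, and the external host is skipped in every run because same_origin_only is left at its default.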

Why am I getting empty extracted fields?

Possible reasons: incorrect CSS selector, page returned an error/challenge page, or content is loaded dynamically via JavaScript (use Selenium instead).

What formats can I download?

JSON, CSV, and Excel from the Apify dataset UI, plus full access via the Apify API. Enable saveCsvToKeyValueStore for a combined OUTPUT.csv in the key-value store.
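
Because compact items can carry different keys, a combined CSV needs the union of all keys as its header. The following sketch shows one way to do that with the standard library; the function name `items_to_csv` and the choice to JSON-encode nested values are assumptions, and the real OUTPUT.csv layout may differ.

```python
import csv
import io
import json


def items_to_csv(items):
    """Serialize dataset items to CSV text, using the union of keys as columns."""
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Nested objects and arrays are stored as JSON strings in a single cell.
        row = {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in item.items()
        }
        writer.writerow(row)
    return buffer.getvalue()


csv_text = items_to_csv([
    {"inputUrl": "https://example.com/", "status": "OK", "httpStatus": 200},
    {"inputUrl": "https://example.com/bad-page", "status": "ERROR", "error": "404 Not Found"},
])
print(csv_text)
```

Missing keys become empty cells, so rows stay aligned even when an item omits a field.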

Can I integrate this into automation workflows?

Yes. Use Apify schedules, platform webhooks, webhookCallbackUrl, or the Apify API to trigger runs and consume dataset output.

Only you can ensure compliance. Use public data responsibly, respect each site's Terms of Service, robots guidance, and local law.

Actor permissions

This Actor is intended to work with limited permissions: it reads your input and writes to its default dataset (and uses Apify proxy/KV as configured). It does not require broad access to unrelated account data.

To set limited permissions in Apify Console:

  1. Open your Actor on the Apify platform.
  2. Go to Source or Settings.
  3. Open Review permissions / Permissions.
  4. Choose Limited permissions and save.

Limitations

  • Beautiful Soup only parses static HTML — no JavaScript rendering.
  • Site HTML structure changes may break CSS selectors; update rules or scripts as needed.
  • Heavy crawling may require higher Apify memory, proxy budgets, and respectful requestDelaySecs.
  • customScript mode executes user code — only run trusted scripts.
  • Some targets block datacenter IPs — use residential proxy when needed.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Get Started

Add your start URLs, define extraction rules or a custom script, enable proxy if needed, and start your first run on Apify. 🚀