Beautiful Soup Cloud Runner
Pricing
from $2.50 / 1,000 scraped pages
Beautiful Soup Cloud Runner runs Python BS4 scraping tasks on Apify. Use CSS extraction rules or custom scripts to scrape static HTML pages, follow links, use proxies, save CSV exports, trigger webhooks, and export compact datasets.
Developer
Sovanza
Maintained by Community
Beautiful Soup Cloud Runner – Run Python BS4 Scrapers in Apify Cloud
Host, schedule, and run Python web scraping tasks built with Beautiful Soup in the Apify cloud. Provide URLs and CSS extraction rules for instant scraping, or supply your own Python script with an entry function. This actor is designed for developers and data teams who need a general-purpose runner for HTML parsing workflows, scheduled jobs, and Apify platform integration.
Overview
This Beautiful Soup Cloud Runner executes Python scraping and parsing tasks using HTTP fetching and Beautiful Soup (bs4). It supports two modes:
- `builtin`: declarative CSS extraction rules on `startUrls`, with optional link crawling via the Apify request queue (`maxDepth`).
- `customScript`: dynamically load a user Python module or inline source and call an entry function (default `run`).
Output is compact: empty or missing fields are omitted so each dataset row contains only what was extracted for that page.
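The compaction rule above can be sketched in a few lines. This is only an illustration of the idea, not the Actor's actual code; the helper name `compact` is hypothetical, and treating `None`, empty strings, and empty containers as "empty" is an assumption.

```python
def compact(item):
    """Return a copy of item without None, empty strings, or empty containers.

    Hypothetical helper: numeric zero and False are kept, so a field such as
    depth == 0 survives compaction.
    """
    result = {}
    for key, value in item.items():
        if isinstance(value, dict):
            value = compact(value)  # compact nested objects such as "extracted"
        if value is None or value == "" or value == [] or value == {}:
            continue
        result[key] = value
    return result


row = compact({
    "inputUrl": "https://example.com/",
    "status": "OK",
    "httpStatus": 200,
    "pageTitle": "",
    "depth": 0,
    "extracted": {"title": "Example Domain", "h1": None},
    "links": [],
})
print(row)
```

Here `pageTitle`, `links`, and the nested `h1` are dropped, while `depth` stays because zero is a real value rather than a missing one.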
Key benefits
- Run Beautiful Soup scrapers in the cloud without managing servers
- Scrape with CSS extraction rules — no code required for simple tasks
- Load custom Python scripts for advanced parsing logic
- Follow links up to a configurable crawl depth via Apify request queue
- Export clean datasets in JSON, CSV, or Excel via Apify
- Integrate with schedules, webhooks, and the Apify API
Core features
- Built-in extraction: CSS rules for `text`, `html`, and `attr` fields
- Link crawling: optional `maxDepth` with Apify request queue (same pattern as the official BeautifulSoup template)
- Custom scripts: load `scriptModule` or inline `scriptSource`; call `entryFunction` (default `run`)
- Proxy support: Apify proxy via `proxyConfiguration`
- Rate limiting: `requestDelaySecs` between HTTP requests
- Retries: configurable per-URL retry policy (`maxRetries`, `retryDelaySecs`)
- Dataset output: `Actor.push_data()` for each scraped page
- CSV export: optional `OUTPUT.csv` in the key-value store
- Webhook callback: POST run summary to `webhookCallbackUrl` when finished
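To make the retry settings more concrete, here is a minimal sketch of a per-URL retry loop. It is not taken from the Actor's source: the function name `fetch_with_retries` is hypothetical, and it assumes `maxRetries` counts retries in addition to the first attempt (so the default of 2 means up to three attempts).

```python
import time


def fetch_with_retries(fetch, url, max_retries=2, retry_delay_secs=3.0, sleep=time.sleep):
    """Call fetch(url), retrying up to max_retries more times on failure.

    fetch is any callable that returns a response or raises on failure.
    sleep is injectable so the delay can be skipped in tests.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real runner would likely catch HTTP errors only
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay_secs)  # corresponds to retryDelaySecs
    raise last_error


# Usage with a fake fetcher that fails twice and then succeeds.
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "<html>ok</html>"


html = fetch_with_retries(flaky_fetch, "https://example.com/", retry_delay_secs=0)
```

When every attempt fails, the last exception is re-raised; under this assumption that is the point where a runner would record an `ERROR` row.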
How to Use Beautiful Soup Cloud Runner on Apify
Using the Actor
- Open the Actor on the Apify platform and go to the Input tab.
- Configure input (see below): set `mode` to `builtin` or `customScript`, add `startUrls`, define `extract` rules or a script path, and enable proxy if needed.
- Start the run. The Actor fetches pages, parses HTML with Beautiful Soup, and pushes compact items to the default dataset.
- Open the Dataset tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.
- Schedule or integrate (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.
Input Configuration
Full schema: INPUT_SCHEMA.json. Example:
{
  "mode": "builtin",
  "startUrls": ["https://example.com/"],
  "maxDepth": 0,
  "maxRequestsPerCrawl": 100,
  "sameOriginOnly": true,
  "extract": [
    { "name": "title", "selector": "title", "type": "text", "all": false },
    { "name": "h1", "selector": "h1", "type": "text", "all": false },
    { "name": "links", "selector": "a", "type": "attr", "attr": "href", "all": true }
  ],
  "includeLinks": true,
  "includeHtml": false,
  "parser": "lxml",
  "requestDelaySecs": 0,
  "maxRetries": 2,
  "retryDelaySecs": 3,
  "timeoutSecs": 60,
  "saveCsvToKeyValueStore": false,
  "webhookCallbackUrl": "",
  "cookies": "",
  "headers": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}
- `mode` (optional): `builtin` (default) for declarative CSS extraction, or `customScript` to load a user Python script.
- `startUrls` (required in builtin mode): One or more URLs to fetch and parse with Beautiful Soup.
- `maxDepth` (optional): Follow same-origin links via request queue up to this depth (default `0` = start URLs only, max `10`). Builtin mode only.
- `maxRequestsPerCrawl` (optional): Safety cap on total pages per run (default `100`, max `10000`).
- `sameOriginOnly` (optional): When crawling, enqueue only URLs on the same host as the seed URL (default `true`).
- `extract` (optional): CSS extraction rules. Each rule has `name`, `selector`, `type` (`text` / `html` / `attr`), optional `attr`, and `all`. The default rules extract title, h1, and links.
- `includeLinks` (optional): Include absolute href links array in each dataset item (default `false`).
- `includeHtml` (optional): Include full page HTML in each item; this can be large (default `false`).
- `parser` (optional): Beautiful Soup parser backend: `lxml` (default), `html.parser`, or `html5lib`.
- `requestDelaySecs` (optional): Minimum delay between HTTP requests for rate limiting (default `0`).
- `maxRetries` (optional): Retry count per URL on HTTP failure (default `2`, max `10`).
- `retryDelaySecs` (optional): Sleep between retries (default `3`).
- `timeoutSecs` (optional): Per-request HTTP timeout in seconds (default `60`).
- `saveCsvToKeyValueStore` (optional): Write combined CSV to default key-value store as `OUTPUT.csv` (default `false`).
- `webhookCallbackUrl` (optional): POST JSON run summary to this URL when the Actor finishes.
- `scriptModule` (optional): Path to Python file inside the Actor (e.g. `scripts/example_titles_links.py`). Required in customScript mode unless `scriptSource` is set.
- `scriptSource` (optional): Inline Python source code; overrides `scriptModule`. Must define the entry function. Secret input.
- `entryFunction` (optional): Function name to call in your script (default `run`). Signature: `run(context)` or `async run(context)`.
- `scriptArgs` (optional): Arbitrary JSON passed to your script via `context.script_args`.
- `cookies` (optional): Raw Cookie header value for authenticated sessions. Secret input: encrypted at rest, not copied to dataset rows.
- `headers` (optional): Extra HTTP headers as JSON object. Secret input.
- `proxyConfiguration` (optional): Apify proxy settings; residential is recommended for blocked sites.
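As a rough illustration of how `extract` rules could map onto Beautiful Soup calls, the sketch below applies a list of rules to an HTML string. It is an assumption about the behaviour, not the Actor's implementation: the function name `apply_rules` is hypothetical, `html` is taken to mean the element's outer HTML, and empty values are skipped to match the compact output. It uses `html.parser` so it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup


def apply_rules(html, rules, parser="html.parser"):
    """Apply declarative CSS rules to an HTML document and return a dict."""
    soup = BeautifulSoup(html, parser)
    extracted = {}
    for rule in rules:
        want_all = rule.get("all", False)
        nodes = soup.select(rule["selector"])
        if not want_all:
            nodes = nodes[:1]  # only the first match when "all" is false
        values = []
        for node in nodes:
            kind = rule.get("type", "text")
            if kind == "text":
                value = node.get_text(strip=True)
            elif kind == "html":
                value = str(node)  # assumption: outer HTML of the element
            else:  # "attr"
                value = node.get(rule["attr"])
            if value:
                values.append(value)
        if values:
            extracted[rule["name"]] = values if want_all else values[0]
    return extracted


page = (
    "<html><head><title>Example Domain</title></head><body>"
    "<h1>Example Domain</h1>"
    '<a href="https://www.iana.org/domains/example">More information</a>'
    "<a>no href here</a>"
    "</body></html>"
)
rules = [
    {"name": "title", "selector": "title", "type": "text", "all": False},
    {"name": "h1", "selector": "h1", "type": "text", "all": False},
    {"name": "links", "selector": "a", "type": "attr", "attr": "href", "all": True},
    {"name": "missing", "selector": "h2", "type": "text", "all": False},
]
result = apply_rules(page, rules)
```

The `missing` rule matches nothing, so its key is left out of `result` instead of appearing as an empty value.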
Custom script mode example
Reference the bundled example script:
{
  "mode": "customScript",
  "startUrls": ["https://example.com/"],
  "scriptModule": "scripts/example_titles_links.py",
  "entryFunction": "run",
  "scriptArgs": { "maxPages": 5 }
}
Your entry function receives a context object with fetch(), parse_html(), push_data(), start_urls, and script_args. See scripts/example_titles_links.py.
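A custom script might look roughly like the sketch below. The member names on `context` come from the description above, but their exact signatures are not documented here, so this assumes that `fetch(url)` returns HTML text, `parse_html(html)` returns a Beautiful Soup object, and `push_data(item)` accepts a dict, all called synchronously. Treat the bundled example script as the reference.

```python
def run(context):
    """Entry function: collect the title and links of each start URL.

    Assumed (not confirmed) context API:
      context.fetch(url)        -> HTML text
      context.parse_html(html)  -> BeautifulSoup object
      context.push_data(item)   -> stores one dataset row
    """
    args = context.script_args or {}
    max_pages = args.get("maxPages", 5)  # same key as in the input example

    for url in context.start_urls[:max_pages]:
        html = context.fetch(url)
        soup = context.parse_html(html)
        title = soup.title.get_text(strip=True) if soup.title else None
        links = [a["href"] for a in soup.find_all("a", href=True)]
        context.push_data({"url": url, "title": title, "links": links})
```

Keeping all I/O behind `context` means the same function can be exercised locally with a small stub object before it is run on the platform.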
Authentication & sensitive input
Fields that can hold credentials use Apify secret input (isSecret: true), following the same pattern as the Python Scraper and Instagram Reels Scraper actors:
- `cookies`: raw Cookie header value
- `headers`: custom HTTP headers JSON
- `scriptSource`: inline Python source code
Secret values are encrypted in storage and are not written into dataset rows or logs.
Environment variables (optional)
| Variable | Purpose |
|---|---|
| `APIFY_LOG_LEVEL` | Log verbosity (default `INFO` via `apify.json`). |
| `APIFY_TOKEN` | Required when using Apify proxy from your local machine. |
Run locally
INPUT.json is gitignored. Copy INPUT.example.json to INPUT.json, set APIFY_TOKEN if using Apify proxy from your machine, then:
cd beautiful-soup-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py
Results are written to storage/datasets/default/ when running outside the Apify platform.
Run tests:
python -m unittest discover -s tests -v
Push to Apify:
apify login
apify push
Output
Results are stored in the Actor's default dataset. Each item is a compact JSON object: fields that are empty or unknown are not included.
Typical fields (when data is available):
- URLs: `inputUrl`, `finalUrl`.
- Status: `status` (`OK` / `ERROR`), `httpStatus`.
- Page metadata: `pageTitle`, `depth` (crawl depth from seed).
- Extracted data: `extracted` (object with fields from CSS rules or custom script).
- Links: `links` (when `includeLinks` is enabled).
- HTML: `html` (when `includeHtml` is enabled).
- Meta: `timestamp`.
- Errors: `error` on failure rows.
Example item (illustrative — real items only include keys that have values):
{
  "inputUrl": "https://example.com/",
  "finalUrl": "https://example.com/",
  "status": "OK",
  "httpStatus": 200,
  "pageTitle": "Example Domain",
  "depth": 0,
  "extracted": {
    "title": "Example Domain",
    "h1": "Example Domain",
    "links": ["https://www.iana.org/domains/example"]
  },
  "links": ["https://www.iana.org/domains/example"],
  "timestamp": "2026-06-05T12:00:00Z"
}
Example error row:
{
  "inputUrl": "https://example.com/bad-page",
  "finalUrl": "https://example.com/bad-page",
  "status": "ERROR",
  "httpStatus": null,
  "pageTitle": null,
  "extracted": {},
  "error": "Client error '404 Not Found' for url 'https://example.com/bad-page'",
  "timestamp": "2026-06-05T12:00:00Z"
}
➡️ Output is structured for pipelines, warehouses, or spreadsheet export via Apify.
Use Cases
- General HTML scraping: extract titles, links, and custom fields from static pages with Beautiful Soup.
- Scheduled monitoring: run scrapers on a cron via Apify Schedules and track changes over time.
- Custom parser hosting: deploy reusable Python BS4 modules without building a new Actor from scratch.
- Workflow integration: chain with other Actors using webhooks, API calls, or Zapier/Make.
- Multi-page crawling: follow same-origin links up to `maxDepth` for sitemap-style extraction.
Integrations & API
- Run and fetch results through the Apify API
- Use Python, Node.js, or HTTP clients against run and dataset endpoints
- Connect Zapier, Make, Google Sheets, and other Apify integrations
- Webhooks and schedules for recurring runs
- Optional `webhookCallbackUrl` input for a custom POST at end of run
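For the Python route, a run can be started and its dataset read with the official `apify-client` package. The sketch below is one possible shape, not part of this project: the helper names `build_run_input` and `run_actor` are hypothetical, and the Actor ID must be supplied by you since it is not stated in this README.

```python
import os


def build_run_input(start_urls, max_depth=0):
    """Build a minimal builtin-mode input using fields from the input schema."""
    return {
        "mode": "builtin",
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "extract": [
            {"name": "title", "selector": "title", "type": "text", "all": False},
        ],
    }


def run_actor(actor_id, run_input):
    """Start the Actor, wait for it to finish, and return its dataset items.

    Requires `pip install apify-client` and an APIFY_TOKEN environment variable.
    """
    from apify_client import ApifyClient  # imported lazily so the helper above has no dependency

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["https://example.com/"])
# items = run_actor("<your-actor-id>", run_input)  # placeholder ID, fill in your own
```

The final call is left commented out because it needs a real token and Actor ID; `call()` blocks until the run finishes, which suits short jobs better than long crawls.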
Why Choose This Actor?
- General-purpose cloud runner for any Beautiful Soup scraping task on Apify
- Zero-code mode with declarative CSS extraction rules
- Custom script mode for advanced parsing without a separate Actor build
- Compact JSON — no noise from empty fields
- Built for Apify datasets, request queues, proxies, exports, and API access
- Same operational patterns as Selenium Cloud Runner and Instagram Reels Scraper in this repo
FAQ
How does Beautiful Soup Cloud Runner work?
It reads input JSON, fetches pages over HTTP (with optional Apify proxy), parses HTML with Beautiful Soup, applies CSS extraction rules or runs your custom Python entry function, and pushes structured rows to the default dataset via Actor.push_data().
When should I use builtin vs customScript?
Use builtin for quick CSS-based scraping without writing code. Use customScript when you need custom parsing logic, multi-step flows, or reusable Python modules.
Can I scrape multiple pages in one run?
Yes. Add multiple URLs to startUrls, or set maxDepth > 0 to follow links via the Apify request queue (builtin mode, same-origin by default).
Does this support JavaScript-rendered pages?
No. Beautiful Soup parses static HTML from HTTP responses. For JavaScript-heavy sites, use Selenium Cloud Runner in this repo.
Can I run untrusted scripts?
Only run scripts you trust. customScript mode executes arbitrary Python code in the Actor environment.
What is the request queue used for?
When maxDepth > 0, the Actor enqueues discovered links in the Apify request queue and processes them one by one — the same pattern as the official Apify BeautifulSoup template.
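The depth-limited crawl described above can be approximated with an in-memory queue. This is a simplified stand-in for the Apify request queue, written only to show the logic; the names `crawl` and `get_links` are hypothetical, and "same origin" is reduced to comparing hosts, as the `sameOriginOnly` description suggests.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def same_host(url, seed):
    """True when url is on the same host as the seed URL."""
    return urlsplit(url).netloc == urlsplit(seed).netloc


def crawl(seed, get_links, max_depth=1, max_requests=100, same_origin_only=True):
    """Breadth-first crawl from seed; returns the (url, depth) pairs visited.

    get_links(url) stands in for "fetch the page and return its hrefs".
    """
    queue = deque([(seed, 0)])
    seen = {seed}  # deduplicate, as a request queue would
    visited = []
    while queue and len(visited) < max_requests:  # maxRequestsPerCrawl cap
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not enqueue links found beyond maxDepth
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if same_origin_only and not same_host(link, seed):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny fake site: page URL -> hrefs found on that page.
site = {
    "https://example.com/": ["/a", "https://other.example/x", "/b"],
    "https://example.com/a": ["/c"],
}
pages = crawl("https://example.com/", lambda url: site.get(url, []), max_depth=1)
```

With `max_depth=1` the crawl visits the seed plus `/a` and `/b`, skips the off-site link, and never reaches `/c` because it sits at depth 2.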
Why am I getting empty extracted fields?
Possible reasons: incorrect CSS selector, page returned an error/challenge page, or content is loaded dynamically via JavaScript (use Selenium instead).
What formats can I download?
JSON, CSV, and Excel from the Apify dataset UI, plus full access via the Apify API. Enable saveCsvToKeyValueStore for a combined OUTPUT.csv in the key-value store.
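Because rows are compact, different rows can carry different keys, so a combined CSV needs the union of all keys as its header. The sketch below shows one way to do that with the standard library; it is not necessarily how `OUTPUT.csv` is produced, and the helper name `rows_to_csv` and the choice to JSON-encode nested values are assumptions.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialize compact dataset rows to CSV text with a union-of-keys header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Nested objects and lists are JSON-encoded so each fits in one cell.
        flat = {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in row.items()
        }
        writer.writerow(flat)
    return buffer.getvalue()


csv_text = rows_to_csv([
    {"inputUrl": "https://example.com/", "status": "OK"},
    {"inputUrl": "https://example.com/bad-page", "status": "ERROR", "error": "404"},
])
```

Rows that lack a column, such as `error` on successful pages, get an empty cell through `restval`.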
Can I integrate this into automation workflows?
Yes. Use Apify schedules, platform webhooks, webhookCallbackUrl, or the Apify API to trigger runs and consume dataset output.
Is web scraping legal?
Compliance is your responsibility. Use public data responsibly and respect each site's Terms of Service, robots.txt rules, and local law.
SEO Keywords
beautiful soup cloud runner
beautiful soup scraper apify
python web scraper apify
bs4 scraper
html parser scraper
apify beautiful soup
python scraping actor
web scraping cloud runner
css extraction scraper
apify request queue scraper
Actor permissions
This Actor is intended to work with limited permissions: it reads your input and writes to its default dataset, and it uses Apify proxy and the default key-value store only as configured. It does not require broad access to unrelated account data.
To set limited permissions in Apify Console:
- Open your Actor on the Apify platform.
- Go to Source or Settings.
- Open Review permissions / Permissions.
- Choose Limited permissions and save.
Limitations
- Beautiful Soup only parses static HTML — no JavaScript rendering.
- Site HTML structure changes may break CSS selectors; update rules or scripts as needed.
- Heavy crawling may require higher Apify memory, proxy budgets, and respectful `requestDelaySecs`.
- `customScript` mode executes user code: only run trusted scripts.
- Some targets block datacenter IPs: use residential proxy when needed.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Get Started
Add your start URLs, define extraction rules or a custom script, enable proxy if needed, and start your first run on Apify. 🚀


