# Beautiful Soup Cloud Runner (`sovanza.inc/beautiful-soup-cloud-runner`) Actor

Beautiful Soup Cloud Runner runs Python BS4 scraping tasks on Apify. Use CSS extraction rules or custom scripts to scrape static HTML pages, follow links, use proxies, save CSV exports, trigger webhooks, and export compact datasets.

- **URL:** https://apify.com/sovanza.inc/beautiful-soup-cloud-runner.md
- **Developed by:** [Sovanza](https://apify.com/sovanza.inc) (community)
- **Categories:** Developer tools, Other, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating:** 5.00 out of 5 stars

## Pricing

From $2.50 / 1,000 scraped pages

This Actor is priced per event: you pay a fixed price for specific events instead of being charged for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Beautiful Soup Cloud Runner – Run Python BS4 Scrapers in Apify Cloud

Host, schedule, and run **Python web scraping tasks built with Beautiful Soup** in the Apify cloud. Provide URLs and CSS extraction rules for instant scraping, or supply your own Python script with an entry function. This Actor is designed for developers and data teams who need a **general-purpose runner** for HTML parsing workflows, **scheduled jobs**, and **Apify platform integration**.

### Overview

This Beautiful Soup Cloud Runner executes Python scraping and parsing tasks using **HTTP fetching** and **Beautiful Soup** (`bs4`). It supports two modes:

- **builtin** — declarative CSS extraction rules on `startUrls`, with optional link crawling via the Apify request queue (`maxDepth`).
- **customScript** — dynamically load a user Python module or inline source and call an entry function (default `run`).

**Output is compact:** empty or missing fields are omitted so each dataset row contains only what was extracted for that page.
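As a rough illustration of the compact-output rule, the sketch below drops empty values from a row before it would be stored. It assumes that "empty" means `None`, an empty string, or an empty list/dict, and that `0` and `False` are kept; the helper name `compact_item` is hypothetical and not taken from the Actor's source.

```python
def compact_item(item: dict) -> dict:
    """Return a copy of item without empty or missing values.

    Assumption: None, "", [], {} and () count as empty; 0 and False are kept.
    """
    def is_empty(value) -> bool:
        if value is None:
            return True
        return isinstance(value, (str, list, dict, tuple)) and len(value) == 0

    return {key: value for key, value in item.items() if not is_empty(value)}


row = compact_item({
    "inputUrl": "https://example.com/",
    "pageTitle": "Example Domain",
    "depth": 0,       # kept: zero is a real value
    "links": [],      # dropped: empty list
    "error": None,    # dropped: missing value
})
print(row)
```

Downstream code should therefore read optional keys with `item.get("key")` rather than `item["key"]`.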

### Key benefits

- Run **Beautiful Soup scrapers in the cloud** without managing servers  
- Scrape with **CSS extraction rules** — no code required for simple tasks  
- Load **custom Python scripts** for advanced parsing logic  
- Follow links up to a configurable **crawl depth** via Apify request queue  
- Export clean datasets in **JSON, CSV, or Excel** via Apify  
- Integrate with **schedules, webhooks, and the Apify API**  

### Core features

- **Built-in extraction** — CSS rules for `text`, `html`, and `attr` fields  
- **Link crawling** — optional `maxDepth` with Apify **request queue** (same pattern as the official BeautifulSoup template)  
- **Custom scripts** — load `scriptModule` or inline `scriptSource`; call `entryFunction` (default `run`)  
- **Proxy support** — Apify proxy via `proxyConfiguration`  
- **Rate limiting** — `requestDelaySecs` between HTTP requests  
- **Retries** — configurable per-URL retry policy (`maxRetries`, `retryDelaySecs`)  
- **Dataset output** — `Actor.push_data()` for each scraped page  
- **CSV export** — optional `OUTPUT.csv` in the key-value store  
- **Webhook callback** — POST run summary to `webhookCallbackUrl` when finished  
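
The retry settings can be pictured with a minimal sketch. It assumes that `maxRetries` counts additional attempts after the first one and that `retryDelaySecs` is a fixed pause between attempts; the function `fetch_with_retries` and its `fetch` and `sleep` parameters are hypothetical and only stand in for whatever the Actor does internally.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 2,
    retry_delay_secs: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on failure, retry up to max_retries more times."""
    last_error = None
    for attempt in range(max(0, max_retries) + 1):
        if attempt > 0:
            sleep(retry_delay_secs)  # fixed pause before each retry
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
    raise last_error


calls = []

def flaky_fetch(url: str) -> str:
    """Hypothetical fetcher that fails twice, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html><title>ok</title></html>"


delays = []  # record the pauses instead of really sleeping
html = fetch_with_retries(flaky_fetch, "https://example.com/",
                          max_retries=2, retry_delay_secs=3, sleep=delays.append)
print(len(calls), delays)  # 3 [3, 3]
```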

### How to Use Beautiful Soup Cloud Runner on Apify

#### Using the Actor

1. **Open the Actor** on the Apify platform and go to the **Input** tab.  
2. **Configure input** (see below): set `mode` to `builtin` or `customScript`, add `startUrls`, define `extract` rules or a script path, and enable proxy if needed.  
3. **Start** the run. The Actor fetches pages, parses HTML with Beautiful Soup, and **pushes compact items** to the default dataset.  
4. **Open the Dataset** tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.  
5. **Schedule or integrate** (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.  

#### Input Configuration

Full schema: `INPUT_SCHEMA.json`. Example:

```json
{
  "mode": "builtin",
  "startUrls": [
    "https://example.com/"
  ],
  "maxDepth": 0,
  "maxRequestsPerCrawl": 100,
  "sameOriginOnly": true,
  "extract": [
    { "name": "title", "selector": "title", "type": "text", "all": false },
    { "name": "h1", "selector": "h1", "type": "text", "all": false },
    { "name": "links", "selector": "a", "type": "attr", "attr": "href", "all": true }
  ],
  "includeLinks": true,
  "includeHtml": false,
  "parser": "lxml",
  "requestDelaySecs": 0,
  "maxRetries": 2,
  "retryDelaySecs": 3,
  "timeoutSecs": 60,
  "saveCsvToKeyValueStore": false,
  "webhookCallbackUrl": "",
  "cookies": "",
  "headers": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}
````

- **`mode`** (optional): `builtin` (default) for declarative CSS extraction, or `customScript` to load a user Python script.
- **`startUrls`** (required in builtin mode): One or more URLs to fetch and parse with Beautiful Soup.
- **`maxDepth`** (optional): Follow same-origin links via request queue up to this depth (default `0` = start URLs only, max `10`). Builtin mode only.
- **`maxRequestsPerCrawl`** (optional): Safety cap on total pages per run (default `100`, max `10000`).
- **`sameOriginOnly`** (optional): When crawling, enqueue only URLs on the same host as the seed URL (default `true`).
- **`extract`** (optional): CSS extraction rules — each rule has `name`, `selector`, `type` (`text` / `html` / `attr`), optional `attr`, and `all` (default rules extract title, h1, and links).
- **`includeLinks`** (optional): Include absolute href links array in each dataset item (default `false`).
- **`includeHtml`** (optional): Include full page HTML in each item — can be large (default `false`).
- **`parser`** (optional): Beautiful Soup parser backend — `lxml` (default), `html.parser`, or `html5lib`.
- **`requestDelaySecs`** (optional): Minimum delay between HTTP requests for rate limiting (default `0`).
- **`maxRetries`** (optional): Retry count per URL on HTTP failure (default `2`, max `10`).
- **`retryDelaySecs`** (optional): Sleep between retries (default `3`).
- **`timeoutSecs`** (optional): Per-request HTTP timeout in seconds (default `60`).
- **`saveCsvToKeyValueStore`** (optional): Write combined CSV to default key-value store as `OUTPUT.csv` (default `false`).
- **`webhookCallbackUrl`** (optional): POST JSON run summary to this URL when the Actor finishes.
- **`scriptModule`** (optional): Path to Python file inside the Actor (e.g. `scripts/example_titles_links.py`). Required in customScript mode unless `scriptSource` is set.
- **`scriptSource`** (optional): Inline Python source code — overrides `scriptModule`. Must define the entry function. **Secret input**.
- **`entryFunction`** (optional): Function name to call in your script (default `run`). Signature: `run(context)` or `async run(context)`.
- **`scriptArgs`** (optional): Arbitrary JSON passed to your script via `context.script_args`.
- **`cookies`** (optional): Raw Cookie header value for authenticated sessions. **Secret input** — encrypted at rest, not copied to dataset rows.
- **`headers`** (optional): Extra HTTP headers as JSON object. **Secret input**.
- **`proxyConfiguration`** (optional): Apify proxy settings; **residential** is recommended for blocked sites.
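
Before starting a run, it can help to sanity-check `extract` rules on the client side. The sketch below is one possible check based only on the field descriptions above; it assumes that `type` falls back to `text` when omitted and that `attr` is needed whenever `type` is `attr`. The function `validate_rule` is hypothetical, and the Actor's own validation may differ.

```python
VALID_TYPES = {"text", "html", "attr"}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems found in one extraction rule (empty if none)."""
    problems = []
    for key in ("name", "selector"):
        if not isinstance(rule.get(key), str) or not rule[key].strip():
            problems.append(f"'{key}' must be a non-empty string")
    rule_type = rule.get("type", "text")  # assumption: 'text' when omitted
    if rule_type not in VALID_TYPES:
        problems.append(f"'type' must be one of {sorted(VALID_TYPES)}")
    if rule_type == "attr" and not rule.get("attr"):
        problems.append("'attr' is needed when type is 'attr'")
    if not isinstance(rule.get("all", False), bool):
        problems.append("'all' must be true or false")
    return problems


rules = [
    {"name": "title", "selector": "title", "type": "text", "all": False},
    {"name": "links", "selector": "a", "type": "attr", "all": True},  # no "attr"
]
for rule in rules:
    print(rule["name"], validate_rule(rule))
```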

##### Custom script mode example

Reference the bundled example script:

```json
{
  "mode": "customScript",
  "startUrls": ["https://example.com/"],
  "scriptModule": "scripts/example_titles_links.py",
  "entryFunction": "run",
  "scriptArgs": { "maxPages": 5 }
}
```

Your entry function receives a `context` object with `fetch()`, `parse_html()`, `push_data()`, `start_urls`, and `script_args`. See `scripts/example_titles_links.py`.
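
A minimal entry function might look like the sketch below. The exact signatures of the `context` helpers are not documented here, so the code assumes that `context.fetch(url)` returns the page HTML as text, `context.parse_html(html)` returns a Beautiful Soup object, and `context.push_data(item)` stores one dataset row. Treat it as a starting point and check the bundled example script for the real contract.

```python
def run(context):
    """Entry function: collect the <title> of each start URL."""
    script_args = context.script_args or {}
    max_pages = int(script_args.get("maxPages", 5))

    for url in context.start_urls[:max_pages]:
        html = context.fetch(url)        # assumed to return the page HTML as text
        soup = context.parse_html(html)  # assumed to return a BeautifulSoup object
        title = soup.title.get_text(strip=True) if soup.title else None
        context.push_data({"inputUrl": url, "extracted": {"title": title}})
```

With the `scriptArgs` from the input above (`{"maxPages": 5}`), this sketch would process at most five start URLs.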

##### Authentication & sensitive input

Fields that can hold credentials use Apify **secret input** (`isSecret: true`), following the same pattern as the Python Scraper and Instagram Reels Scraper Actors:

- **`cookies`** — raw Cookie header value
- **`headers`** — custom HTTP headers JSON
- **`scriptSource`** — inline Python source code

Secret values are encrypted in storage and are **not** written into dataset rows or logs.

##### Environment variables (optional)

| Variable | Purpose |
|----------|---------|
| `APIFY_LOG_LEVEL` | Log verbosity (default `INFO` via `apify.json`). |
| `APIFY_TOKEN` | Required when using Apify proxy from your local machine. |

##### Run locally

`INPUT.json` is gitignored. Copy `INPUT.example.json` to `INPUT.json`, set `APIFY_TOKEN` if using Apify proxy from your machine, then:

```bash
cd beautiful-soup-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py
```

Results are written to `storage/datasets/default/` when running outside the Apify platform.
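
To inspect local results, something like the following sketch could be used. It assumes the local storage layer writes one JSON file per dataset item into that directory and may add metadata files whose names start with `__`; both points are assumptions about the storage layout, and `load_local_items` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_local_items(dataset_dir: str = "storage/datasets/default") -> list[dict]:
    """Load dataset items written by a local run, in file-name order."""
    directory = Path(dataset_dir)
    if not directory.is_dir():
        return []  # nothing has been written yet
    items = []
    for path in sorted(directory.glob("*.json")):
        if path.name.startswith("__"):
            continue  # skip metadata files, if the storage layer writes any
        items.append(json.loads(path.read_text(encoding="utf-8")))
    return items


if __name__ == "__main__":
    for item in load_local_items():
        print(item.get("status"), item.get("inputUrl"))
```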

Run tests:

```bash
python -m unittest discover -s tests -v
```

Push to Apify:

```bash
apify login
apify push
```

### Output

Results are stored in the Actor's **default dataset**. Each item is a **compact** JSON object: fields that are empty or unknown are **not** included.

Typical fields (when data is available):

- **URLs:** `inputUrl`, `finalUrl`.
- **Status:** `status` (`OK` / `ERROR`), `httpStatus`.
- **Page metadata:** `pageTitle`, `depth` (crawl depth from seed).
- **Extracted data:** `extracted` (object with fields from CSS rules or custom script).
- **Links:** `links` (when `includeLinks` is enabled).
- **HTML:** `html` (when `includeHtml` is enabled).
- **Meta:** `timestamp`.
- **Errors:** `error` on failure rows.

Example item (illustrative — real items only include keys that have values):

```json
{
  "inputUrl": "https://example.com/",
  "finalUrl": "https://example.com/",
  "status": "OK",
  "httpStatus": 200,
  "pageTitle": "Example Domain",
  "depth": 0,
  "extracted": {
    "title": "Example Domain",
    "h1": "Example Domain",
    "links": ["https://www.iana.org/domains/example"]
  },
  "links": ["https://www.iana.org/domains/example"],
  "timestamp": "2026-06-05T12:00:00Z"
}
```

Example error row:

```json
{
  "inputUrl": "https://example.com/bad-page",
  "finalUrl": "https://example.com/bad-page",
  "status": "ERROR",
  "httpStatus": null,
  "pageTitle": null,
  "extracted": {},
  "error": "Client error '404 Not Found' for url 'https://example.com/bad-page'",
  "timestamp": "2026-06-05T12:00:00Z"
}
```
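
When consuming the dataset, a common first step is to separate successful rows from failures. The sketch below relies only on the `status` and `error` fields described above; `split_by_status` is a hypothetical helper, and rows without a `status` key are treated as failures by assumption.

```python
def split_by_status(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split dataset rows into (ok_rows, failed_rows) using the status field."""
    ok_rows = [item for item in items if item.get("status") == "OK"]
    failed_rows = [item for item in items if item.get("status") != "OK"]
    return ok_rows, failed_rows


items = [
    {"inputUrl": "https://example.com/", "status": "OK", "httpStatus": 200},
    {"inputUrl": "https://example.com/bad-page", "status": "ERROR",
     "error": "Client error '404 Not Found' for url 'https://example.com/bad-page'"},
]
ok_rows, failed_rows = split_by_status(items)
print(len(ok_rows), "ok,", len(failed_rows), "failed")
for row in failed_rows:
    # Optional keys are read with .get() because compact rows omit empty fields.
    print(row["inputUrl"], "->", row.get("error", "unknown error"))
```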

➡️ Output is structured for pipelines, warehouses, or spreadsheet export via Apify.

### Use Cases

- **General HTML scraping:** extract titles, links, and custom fields from static pages with Beautiful Soup.
- **Scheduled monitoring:** run scrapers on a cron via Apify Schedules and track changes over time.
- **Custom parser hosting:** deploy reusable Python BS4 modules without building a new Actor from scratch.
- **Workflow integration:** chain with other Actors using webhooks, API calls, or Zapier/Make.
- **Multi-page crawling:** follow same-origin links up to `maxDepth` for sitemap-style extraction.

### Integrations & API

- Run and fetch results through the **Apify API**
- Use **Python**, **Node.js**, or HTTP clients against run and dataset endpoints
- Connect **Zapier**, **Make**, **Google Sheets**, and other Apify integrations
- **Webhooks** and **schedules** for recurring runs
- Optional **`webhookCallbackUrl`** input for a custom POST at end of run

### Why Choose This Actor?

- **General-purpose cloud runner** for any Beautiful Soup scraping task on Apify
- **Zero-code mode** with declarative CSS extraction rules
- **Custom script mode** for advanced parsing without a separate Actor build
- **Compact JSON** — no noise from empty fields
- Built for **Apify** datasets, request queues, proxies, exports, and API access
- Same operational patterns as **Selenium Cloud Runner** and **Instagram Reels Scraper** in this repo

### FAQ

#### How does Beautiful Soup Cloud Runner work?

It reads input JSON, fetches pages over HTTP (with optional Apify proxy), parses HTML with Beautiful Soup, applies CSS extraction rules or runs your custom Python entry function, and pushes structured rows to the default dataset via `Actor.push_data()`.

#### When should I use builtin vs customScript?

Use **builtin** for quick CSS-based scraping without writing code. Use **customScript** when you need custom parsing logic, multi-step flows, or reusable Python modules.

#### Can I scrape multiple pages in one run?

Yes. Add multiple URLs to `startUrls`, or set `maxDepth` > 0 to follow links via the Apify request queue (builtin mode, same-origin by default).

#### Does this support JavaScript-rendered pages?

No. Beautiful Soup parses static HTML from HTTP responses. For JavaScript-heavy sites, use **Selenium Cloud Runner** in this repo.

#### Can I run untrusted scripts?

Only run scripts you trust. `customScript` mode executes arbitrary Python code in the Actor environment.

#### What is the request queue used for?

When `maxDepth` > 0, the Actor enqueues discovered links in the Apify request queue and processes them one by one — the same pattern as the [official Apify BeautifulSoup template](https://apify.com/templates/python-beautifulsoup).
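
The crawl behaviour can be approximated with a small breadth-first sketch. Here a `collections.deque` stands in for the Apify request queue, and `get_links` is a hypothetical callback that returns the hrefs found on a page; neither reflects the Actor's actual code. The parameters mirror `maxDepth`, `maxRequestsPerCrawl`, and `sameOriginOnly`.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl(start_urls, get_links, max_depth=0, max_requests=100, same_origin_only=True):
    """Breadth-first crawl; returns the (url, depth) pairs in processing order."""
    queue = deque((url, 0, urlsplit(url).netloc) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_requests:
        url, depth, seed_host = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth limit
        for href in get_links(url):
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if same_origin_only and urlsplit(link).netloc != seed_host:
                continue  # different host than the seed URL
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1, seed_host))
    return visited


# Hypothetical link graph used instead of real HTTP requests.
site = {
    "https://example.com/": ["/a", "https://other.example/x"],
    "https://example.com/a": ["/b", "/"],
    "https://example.com/b": [],
}
print(crawl(["https://example.com/"], lambda url: site.get(url, []), max_depth=1))
# [('https://example.com/', 0), ('https://example.com/a', 1)]
```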

#### Why am I getting empty extracted fields?

Possible reasons: incorrect CSS selector, page returned an error/challenge page, or content is loaded dynamically via JavaScript (use Selenium instead).

#### What formats can I download?

**JSON**, **CSV**, and **Excel** from the Apify dataset UI, plus full access via the **Apify API**. Enable `saveCsvToKeyValueStore` for a combined `OUTPUT.csv` in the key-value store.
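
Because rows are compact, different items can carry different keys, which matters when flattening them into CSV. The sketch below shows one way to handle that with the standard `csv` module: columns are the union of all keys and missing values become empty cells. The layout of the Actor's real `OUTPUT.csv` is not documented here, so this is an assumption-based illustration and `items_to_csv` is a hypothetical helper.

```python
import csv
import io
import json


def items_to_csv(items: list[dict]) -> str:
    """Serialize compact items to CSV; missing keys become empty cells."""
    columns = []
    for item in items:
        for key in item:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Nested values (dicts, lists) are stored as JSON strings in one cell.
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in item.items()
        })
    return buffer.getvalue()


csv_text = items_to_csv([
    {"inputUrl": "https://example.com/", "status": "OK"},
    {"inputUrl": "https://example.com/bad-page", "status": "ERROR", "error": "404 Not Found"},
])
print(csv_text)
```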

#### Can I integrate this into automation workflows?

Yes. Use Apify schedules, platform webhooks, `webhookCallbackUrl`, or the Apify API to trigger runs and consume dataset output.

#### Is web scraping legal?

Only you can ensure compliance. Use public data responsibly, and respect each site's Terms of Service, its robots.txt guidance, and local law.

### Actor permissions

This Actor is intended to work with **limited permissions**: it reads your input and writes to its **default dataset** (and uses Apify Proxy and the key-value store as configured). It does not require broad access to unrelated account data.

**To set limited permissions in Apify Console:**

1. Open your Actor on the Apify platform.
2. Go to **Source** or **Settings**.
3. Open **Review permissions** / **Permissions**.
4. Choose **Limited permissions** and save.

### Limitations

- Beautiful Soup only parses **static HTML** — no JavaScript rendering.
- Site HTML structure changes may break CSS selectors; update rules or scripts as needed.
- Heavy crawling may require higher Apify memory, proxy budgets, and respectful `requestDelaySecs`.
- `customScript` mode executes user code — only run trusted scripts.
- Some targets block datacenter IPs — use **residential proxy** when needed.
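
For readers who want to see what a "respectful" delay means in practice, here is a minimal sketch of a minimum-gap limiter in the spirit of `requestDelaySecs`. The class `MinDelay` is hypothetical; it is not the Actor's implementation, and the injectable `clock` and `sleep` parameters exist only to make the sketch easy to test.

```python
import time


class MinDelay:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, delay_secs, clock=time.monotonic, sleep=time.sleep):
        self.delay_secs = delay_secs
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay_secs - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # pause only for the missing part of the gap
                now = self._clock()
        self._last = now


limiter = MinDelay(0.1)
for url in ["https://example.com/", "https://example.com/a"]:
    limiter.wait()  # the second call pauses for roughly 0.1 s
    print("would fetch", url)
```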

### License

This project is licensed under the MIT License - see the LICENSE file for details.

### Get Started

Add your start URLs, define extraction rules or a custom script, enable proxy if needed, and start your first run on Apify. 🚀

# Actor input Schema

## `mode` (type: `string`):

builtin: scrape URLs with declarative CSS extraction rules (no custom code). customScript: load and execute a user Python script with Beautiful Soup.

## `startUrls` (type: `array`):

One or more URLs to fetch and parse with Beautiful Soup. Required for builtin mode; optional for customScript (passed to your script via `context.start_urls`).

## `maxDepth` (type: `integer`):

When greater than 0, follow same-origin links via the Apify request queue up to this depth (builtin mode only). 0 = scrape start URLs only.

## `maxRequestsPerCrawl` (type: `integer`):

Safety cap on total pages processed in one run (builtin mode with link following).

## `sameOriginOnly` (type: `boolean`):

When following links, enqueue only URLs on the same host as the seed URL.

## `extract` (type: `array`):

CSS-based extraction rules applied in builtin mode. Each rule extracts text, HTML, or attributes from matched elements.

## `includeLinks` (type: `boolean`):

Include absolute href links found on each page in the dataset item.

## `includeHtml` (type: `boolean`):

If enabled, store full page HTML in each dataset item (can be large).

## `parser` (type: `string`):

Parser backend passed to BeautifulSoup(html, parser). lxml is fastest; html.parser needs no extra deps.

## `requestDelaySecs` (type: `number`):

Minimum delay between HTTP requests (rate limiting).

## `maxRetries` (type: `integer`):

How many times to retry a URL if the HTTP request fails.

## `retryDelaySecs` (type: `number`):

Sleep between retries.

## `timeoutSecs` (type: `integer`):

Per-request HTTP timeout.

## `saveCsvToKeyValueStore` (type: `boolean`):

If enabled, writes a CSV summary of dataset items to the default key-value store as OUTPUT.csv.

## `webhookCallbackUrl` (type: `string`):

Optional URL to POST a JSON run summary when the Actor finishes (success or partial failure). Useful for workflow integration.

## `scriptModule` (type: `string`):

Path to a Python module inside the Actor (e.g. `scripts/example_titles_links.py`). Required in customScript mode unless `scriptSource` is provided.

## `scriptSource` (type: `string`):

Optional inline Python source code. When set, overrides scriptModule. Must define the entry function. Stored as a secret input and never written to the dataset.

## `entryFunction` (type: `string`):

Name of the function to call in your script (default: run). Signature: run(context) or async run(context).

## `scriptArgs` (type: `object`):

Arbitrary JSON passed to your script via `context.script_args`.

## `cookies` (type: `string`):

Optional raw Cookie header value for authenticated sessions. Stored as a secret input and never written to the dataset.

## `headers` (type: `object`):

Optional extra HTTP headers (JSON object). Stored as a secret input.

## `proxyConfiguration` (type: `object`):

Apify proxy settings. Residential proxy is recommended for blocked sites.

## Actor input object example

```json
{
  "mode": "builtin",
  "startUrls": [
    "https://example.com/"
  ],
  "maxDepth": 0,
  "maxRequestsPerCrawl": 100,
  "sameOriginOnly": true,
  "extract": [
    {
      "name": "title",
      "selector": "title",
      "type": "text",
      "all": false
    },
    {
      "name": "h1",
      "selector": "h1",
      "type": "text",
      "all": false
    },
    {
      "name": "links",
      "selector": "a",
      "type": "attr",
      "attr": "href",
      "all": true
    }
  ],
  "includeLinks": true,
  "includeHtml": false,
  "parser": "lxml",
  "requestDelaySecs": 0,
  "maxRetries": 2,
  "retryDelaySecs": 3,
  "timeoutSecs": 60,
  "saveCsvToKeyValueStore": false,
  "entryFunction": "run",
  "scriptArgs": {},
  "headers": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "US"
  }
}
```

# Actor output Schema

## `pages` (type: `string`):

Structured Beautiful Soup extraction results per URL in the default dataset.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "mode": "builtin",
    "startUrls": [
        "https://example.com/"
    ],
    "maxDepth": 0,
    "maxRequestsPerCrawl": 100,
    "sameOriginOnly": true,
    "includeLinks": true,
    "includeHtml": false,
    "parser": "lxml",
    "requestDelaySecs": 0,
    "maxRetries": 2,
    "retryDelaySecs": 3,
    "timeoutSecs": 60,
    "saveCsvToKeyValueStore": false,
    "webhookCallbackUrl": "",
    "scriptModule": "",
    "scriptSource": "",
    "entryFunction": "run",
    "scriptArgs": {},
    "cookies": "",
    "headers": {}
};

// Run the Actor and wait for it to finish
const run = await client.actor("sovanza.inc/beautiful-soup-cloud-runner").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "mode": "builtin",
    "startUrls": ["https://example.com/"],
    "maxDepth": 0,
    "maxRequestsPerCrawl": 100,
    "sameOriginOnly": True,
    "includeLinks": True,
    "includeHtml": False,
    "parser": "lxml",
    "requestDelaySecs": 0,
    "maxRetries": 2,
    "retryDelaySecs": 3,
    "timeoutSecs": 60,
    "saveCsvToKeyValueStore": False,
    "webhookCallbackUrl": "",
    "scriptModule": "",
    "scriptSource": "",
    "entryFunction": "run",
    "scriptArgs": {},
    "cookies": "",
    "headers": {},
}

# Run the Actor and wait for it to finish
run = client.actor("sovanza.inc/beautiful-soup-cloud-runner").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
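
If installing the client library is not an option, the REST endpoint listed in the OpenAPI specification below can be called directly. The sketch builds the request with the standard library only and does not send it; `build_run_request` is a hypothetical helper, and the endpoint path, `token` query parameter, and JSON body follow the specification in this document.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/sovanza.inc~beautiful-soup-cloud-runner/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"mode": "builtin", "startUrls": ["https://example.com/"]},
)
print(request.get_method(), request.full_url)

# To actually run the Actor (this consumes paid events), send the request:
# with urllib.request.urlopen(request, timeout=300) as response:
#     items = json.load(response)
```

The synchronous endpoint keeps the HTTP connection open until the run finishes, so it suits short runs; for long crawls, starting the run and polling the dataset is usually the safer choice.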

## CLI example

```bash
echo '{
  "mode": "builtin",
  "startUrls": [
    "https://example.com/"
  ],
  "maxDepth": 0,
  "maxRequestsPerCrawl": 100,
  "sameOriginOnly": true,
  "includeLinks": true,
  "includeHtml": false,
  "parser": "lxml",
  "requestDelaySecs": 0,
  "maxRetries": 2,
  "retryDelaySecs": 3,
  "timeoutSecs": 60,
  "saveCsvToKeyValueStore": false,
  "webhookCallbackUrl": "",
  "scriptModule": "",
  "scriptSource": "",
  "entryFunction": "run",
  "scriptArgs": {},
  "cookies": "",
  "headers": {}
}' |
apify call sovanza.inc/beautiful-soup-cloud-runner --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=sovanza.inc/beautiful-soup-cloud-runner",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Beautiful Soup Cloud Runner",
        "description": "Beautiful Soup Cloud Runner runs Python BS4 scraping tasks on Apify. Use CSS extraction rules or custom scripts to scrape static HTML pages, follow links, use proxies, save CSV exports, trigger webhooks, and export compact datasets.",
        "version": "0.0",
        "x-build-id": "DpkGKa4OeLZ4i0c9q"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/sovanza.inc~beautiful-soup-cloud-runner/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-sovanza.inc-beautiful-soup-cloud-runner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/sovanza.inc~beautiful-soup-cloud-runner/runs": {
            "post": {
                "operationId": "runs-sync-sovanza.inc-beautiful-soup-cloud-runner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/sovanza.inc~beautiful-soup-cloud-runner/run-sync": {
            "post": {
                "operationId": "run-sync-sovanza.inc-beautiful-soup-cloud-runner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "mode": {
                        "title": "Run mode",
                        "enum": [
                            "builtin",
                            "customScript"
                        ],
                        "type": "string",
                        "description": "builtin: scrape URLs with declarative CSS extraction rules (no custom code). customScript: load and execute a user Python script with Beautiful Soup.",
                        "default": "builtin"
                    },
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "One or more URLs to fetch and parse with Beautiful Soup. Required for builtin mode; optional for customScript (passed to your script via context.start_urls).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxDepth": {
                        "title": "Max crawl depth",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "When greater than 0, follow same-origin links via the Apify request queue up to this depth (builtin mode only). 0 = scrape start URLs only.",
                        "default": 0
                    },
                    "maxRequestsPerCrawl": {
                        "title": "Max requests per crawl",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Safety cap on total pages processed in one run (builtin mode with link following).",
                        "default": 100
                    },
                    "sameOriginOnly": {
                        "title": "Same-origin links only",
                        "type": "boolean",
                        "description": "When following links, enqueue only URLs on the same host as the seed URL.",
                        "default": true
                    },
                    "extract": {
                        "title": "Extraction rules",
                        "type": "array",
                        "description": "CSS-based extraction rules applied in builtin mode. Each rule extracts text, HTML, or attributes from matched elements.",
                        "default": [
                            {
                                "name": "title",
                                "selector": "title",
                                "type": "text",
                                "all": false
                            },
                            {
                                "name": "h1",
                                "selector": "h1",
                                "type": "text",
                                "all": false
                            },
                            {
                                "name": "links",
                                "selector": "a",
                                "type": "attr",
                                "attr": "href",
                                "all": true
                            }
                        ]
                    },
                    "includeLinks": {
                        "title": "Include page links",
                        "type": "boolean",
                        "description": "Include absolute href links found on each page in the dataset item.",
                        "default": false
                    },
                    "includeHtml": {
                        "title": "Include full HTML",
                        "type": "boolean",
                        "description": "If enabled, store full page HTML in each dataset item (can be large).",
                        "default": false
                    },
                    "parser": {
                        "title": "Beautiful Soup parser",
                        "enum": [
                            "lxml",
                            "html.parser",
                            "html5lib"
                        ],
                        "type": "string",
                        "description": "Parser backend passed to BeautifulSoup(html, parser). lxml is fastest; html.parser is built into Python and needs no extra dependencies.",
                        "default": "lxml"
                    },
                    "requestDelaySecs": {
                        "title": "Request delay (seconds)",
                        "minimum": 0,
                        "maximum": 60,
                        "type": "number",
                        "description": "Minimum delay between HTTP requests (rate limiting).",
                        "default": 0
                    },
                    "maxRetries": {
                        "title": "Max retries per URL",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "How many times to retry a URL if the HTTP request fails.",
                        "default": 2
                    },
                    "retryDelaySecs": {
                        "title": "Retry delay (seconds)",
                        "minimum": 0,
                        "maximum": 60,
                        "type": "number",
                        "description": "Delay, in seconds, to wait between retry attempts.",
                        "default": 3
                    },
                    "timeoutSecs": {
                        "title": "Request timeout (seconds)",
                        "minimum": 5,
                        "maximum": 300,
                        "type": "integer",
                        "description": "Per-request HTTP timeout, in seconds.",
                        "default": 60
                    },
                    "saveCsvToKeyValueStore": {
                        "title": "Save CSV (key-value store)",
                        "type": "boolean",
                        "description": "If enabled, writes a CSV summary of dataset items to the default key-value store as OUTPUT.csv.",
                        "default": false
                    },
                    "webhookCallbackUrl": {
                        "title": "Webhook callback URL",
                        "type": "string",
                        "description": "Optional URL to POST a JSON run summary when the Actor finishes (success or partial failure). Useful for workflow integration."
                    },
                    "scriptModule": {
                        "title": "Script module path",
                        "type": "string",
                        "description": "Path to a Python module inside the Actor (e.g. scripts/example_titles_links.py). Required in customScript mode unless scriptSource is provided."
                    },
                    "scriptSource": {
                        "title": "Inline script source (optional)",
                        "type": "string",
                        "description": "Optional inline Python source code. When set, overrides scriptModule. Must define the entry function. Stored as a secret input and never written to the dataset."
                    },
                    "entryFunction": {
                        "title": "Entry function name",
                        "type": "string",
                        "description": "Name of the function to call in your script (default: run). Signature: run(context) or async run(context).",
                        "default": "run"
                    },
                    "scriptArgs": {
                        "title": "Script arguments",
                        "type": "object",
                        "description": "Arbitrary JSON passed to your script via context.script_args.",
                        "default": {}
                    },
                    "cookies": {
                        "title": "Cookie header (optional)",
                        "type": "string",
                        "description": "Optional raw Cookie header value for authenticated sessions. Stored as a secret input and never written to the dataset."
                    },
                    "headers": {
                        "title": "Request headers (optional)",
                        "type": "object",
                        "description": "Optional extra HTTP headers (JSON object). Stored as a secret input."
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Apify proxy settings. Residential proxy is recommended for blocked sites.",
                        "default": {
                            "useApifyProxy": true,
                            "apifyProxyGroups": [
                                "RESIDENTIAL"
                            ],
                            "apifyProxyCountry": "US"
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
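
The input properties in the schema above carry numeric bounds and an enumeration for `parser`. As a rough illustration, the sketch below checks a builtin-mode input against those limits before a run is started. The helper name `check_input` and the example URL are hypothetical; the bounds are copied from the schema, and the Actor itself may validate input differently. The mode property is left out of the example so that its default, `builtin`, applies.

```python
# Bounds copied from the input schema above: (minimum, maximum).
NUMERIC_BOUNDS = {
    "maxDepth": (0, 10),
    "maxRequestsPerCrawl": (1, 10000),
    "requestDelaySecs": (0, 60),
    "maxRetries": (0, 10),
    "retryDelaySecs": (0, 60),
    "timeoutSecs": (5, 300),
}
# Allowed values of the "parser" property, also from the schema.
PARSERS = {"lxml", "html.parser", "html5lib"}


def check_input(actor_input: dict) -> list:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    errors = []

    # startUrls is described as required for builtin mode.
    urls = actor_input.get("startUrls")
    if (
        not isinstance(urls, list)
        or not urls
        or not all(isinstance(u, str) for u in urls)
    ):
        errors.append("startUrls must be a non-empty array of strings in builtin mode")

    # Only properties that are actually present are range-checked;
    # missing ones fall back to the schema defaults.
    for key, (low, high) in NUMERIC_BOUNDS.items():
        if key in actor_input and not low <= actor_input[key] <= high:
            errors.append(f"{key} must be between {low} and {high}")

    if actor_input.get("parser", "lxml") not in PARSERS:
        errors.append(f"parser must be one of {sorted(PARSERS)}")

    return errors


# Example input: follow links one level deep and extract the title and all hrefs.
example_input = {
    "startUrls": ["https://example.com/"],  # placeholder URL
    "maxDepth": 1,
    "maxRequestsPerCrawl": 50,
    "extract": [
        {"name": "title", "selector": "title", "type": "text", "all": False},
        {"name": "links", "selector": "a", "type": "attr", "attr": "href", "all": True},
    ],
    "parser": "html.parser",
    "timeoutSecs": 30,
}

print(check_input(example_input))
print(check_input({"startUrls": [], "timeoutSecs": 2}))
```

The first call reports no problems; the second reports the empty `startUrls` array and the `timeoutSecs` value below the schema minimum of 5.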

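The `runsResponseSchema` above wraps the run object in a top-level `data` property. A minimal sketch of pulling out the fields most integrations need, assuming the response has already been decoded from JSON; the function name `summarize_run` and every ID in the sample are placeholders, not real values.

```python
import json


def summarize_run(response: dict) -> dict:
    """Hypothetical helper: pick the commonly used fields from a run response."""
    data = response.get("data", {})
    return {
        "run_id": data.get("id"),
        "status": data.get("status"),
        # Scraped items are stored in the run's default dataset.
        "dataset_id": data.get("defaultDatasetId"),
        # With saveCsvToKeyValueStore enabled, OUTPUT.csv is written to this store.
        "key_value_store_id": data.get("defaultKeyValueStoreId"),
        "cost_usd": data.get("usageTotalUsd"),
    }


# Sample response shaped like runsResponseSchema; all values are placeholders.
sample_response = json.loads(
    """
    {
        "data": {
            "id": "RUN_ID_PLACEHOLDER",
            "status": "SUCCEEDED",
            "defaultDatasetId": "DATASET_ID_PLACEHOLDER",
            "defaultKeyValueStoreId": "STORE_ID_PLACEHOLDER",
            "usageTotalUsd": 0.00005
        }
    }
    """
)

summary = summarize_run(sample_response)
print(summary)
```

Using `dict.get` throughout means a partial or empty response yields `None` values instead of raising `KeyError`, which keeps the helper usable while a run is still starting.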