# AI-Ready Web Content Crawler (LLM/RAG Optimized) (`brilliant_gum/web-content-crawler`) Actor

Deep-crawl websites and extract LLM-ready Markdown with OG tags, JSON-LD, author, dates, token estimates, native RAG chunking, language filtering, content-hash dedup, and per-page error reporting. Enforced timeouts. Zero silent failures.

- **URL**: https://apify.com/brilliant\_gum/web-content-crawler.md
- **Developed by:** [Yuliia Kulakova](https://apify.com/brilliant_gum) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $20.00 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are a software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

````bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```bash

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI-Ready Web Content Crawler

[![AI-Ready Web Content Crawler](https://i.imgur.com/SOsH8uT.png)](https://apify.com/brilliant_gum/web-content-crawler)

Crawl any website and get clean, structured Markdown ready for your AI pipeline. Built for developers building RAG applications, fine-tuning datasets, and AI-powered content workflows.

---

### What you get

Every page you crawl is returned as a clean, structured record with:

- **Clean Markdown** — nav, ads, footers, cookie banners automatically removed
- **Plain text** — stripped version for embeddings and search indexes
- **Rich metadata** — title, author, publish date, Open Graph, Twitter Card, JSON-LD structured data, language, canonical URL, hreflang
- **Token estimate** — per-page token count so you know your LLM costs upfront
- **Content type** — automatically classified as article, documentation, product, or landing page
- **RAG-ready chunks** — split at semantic boundaries (headings, paragraphs) with configurable overlap
- **Link graph** — internal links, external links, and PDF links per page
- **Crawl analytics** — word counts, token totals, language distribution, depth distribution

---

### Quick start

Just paste a URL and click Run. That's it.

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }]
}
````

The crawler will crawl up to 100 pages at depth 5, extract clean Markdown with full metadata, and return everything as structured JSON.

***

### Use cases

#### Build a RAG knowledge base

Crawl your documentation site and get chunks ready to embed — no post-processing needed.

```json
{
  "startUrls": [{ "url": "https://docs.yoursite.com" }],
  "maxCrawlPages": 500,
  "languageFilter": ["en"],
  "chunkContent": true,
  "chunkSize": 1500,
  "chunkOverlap": 150,
  "deduplicateByContent": true
}
```

Each page comes with a `chunks` array. Each chunk includes text, position, and token estimate. Feed directly to OpenAI, Pinecone, Weaviate, or any vector database.

#### Monitor competitor content

Track what your competitors publish, when they update it, and how they structure it.

```json
{
  "startUrls": [{ "url": "https://blog.competitor.com" }],
  "globs": ["https://blog.competitor.com/posts/**"],
  "excludeGlobs": ["**/tag/**", "**/author/**"],
  "extractMetadata": true,
  "extractLinks": true,
  "maxCrawlPages": 200
}
```

Get author names, publish dates, content types, and full link graphs for every article.

#### Scrape a static site fast

Don't need JavaScript rendering? Switch to Cheerio mode for 3-5x faster crawling at lower cost.

```json
{
  "startUrls": [{ "url": "https://static-site.com" }],
  "crawlerType": "cheerio",
  "maxConcurrency": 10,
  "maxCrawlPages": 1000
}
```

#### Crawl behind authentication

Pass session cookies and crawl pages that require login.

```json
{
  "startUrls": [{ "url": "https://app.example.com/dashboard" }],
  "initialCookies": [
    { "name": "session", "value": "abc123", "domain": "app.example.com", "path": "/" }
  ],
  "maxCrawlDepth": 3
}
```

***

### Why this crawler?

#### Built-in proxy with automatic fallback

Every request goes through a residential proxy. If it gets blocked, the crawler automatically switches to a backup proxy and retries. You don't configure anything — it just works.

#### Filtered pages don't burn your budget

Language filter, content length filter, and deduplication all run *before* counting against your page limit. If you set `maxCrawlPages: 100` and 30 pages get filtered, you still get 100 real pages.

#### No silent failures

Other crawlers show "SUCCEEDED" with an empty dataset. This crawler tracks every failed URL with a reason (CAPTCHA, 403, timeout, proxy error) and stores them in the key-value store. You always know what happened.

#### Graceful timeout handling

Apify hard-kills actors after 1 hour. This crawler monitors the remaining time and stops gracefully 90 seconds before the limit — no partial records, no data loss.

#### Smart content extraction

Uses Mozilla Readability (the same engine behind Firefox Reader View) to extract article content. Automatically removes navigation, ads, sidebars, cookie banners, and other noise. Falls back to raw HTML extraction when Readability can't parse the page.

***

### Output example

```json
{
  "url": "https://example.com/blog/ai-trends",
  "metadata": {
    "title": "Top AI Trends for 2025",
    "author": "Jane Doe",
    "publishDate": "2025-01-15T10:00:00.000Z",
    "languageCode": "en",
    "contentType": "article",
    "wordCount": 1842,
    "tokenEstimate": 2456,
    "ogImage": "https://example.com/img/ai-trends.jpg",
    "jsonLd": [{ "@type": "Article", "..." : "..." }]
  },
  "markdown": "## Top AI Trends for 2025\n\nClean article content...",
  "text": "Top AI Trends for 2025. Clean article content...",
  "chunks": [
    {
      "chunkIndex": 0,
      "text": "## Top AI Trends...",
      "tokenEstimate": 461
    }
  ],
  "depth": 1,
  "httpStatusCode": 200
}
```

#### Free analytics with every run

The last record in your dataset is a crawl summary — total words, tokens, pages by language, pages by content type, pages by depth. Use it to estimate LLM costs or monitor content changes over time.

***

### Crawler engines

| Engine | Best for | Speed |
|---|---|---|
| **Playwright Chrome** (default) | JavaScript-heavy sites, SPAs, bot-protected pages | Standard |
| **Playwright Firefox** | Sites that block Chrome specifically | Standard |
| **Cheerio** | Static HTML sites, blogs, documentation | 3-5x faster |

***

### Key features at a glance

| Feature | Details |
|---|---|
| **Output format** | Markdown + plain text + metadata JSON |
| **RAG chunking** | Semantic splits with configurable size and overlap |
| **Metadata** | OG tags, JSON-LD, author, dates, Twitter Card, hreflang |
| **Token estimate** | Per page and total across the crawl |
| **Content type** | Auto-classified: article, documentation, product, landing |
| **Language filter** | Filter by ISO 639-1 codes without wasting page budget |
| **Deduplication** | URL + canonical + optional content-hash (MD5) |
| **Link extraction** | Internal, external, and PDF links per page |
| **Error tracking** | Every failed URL logged with reason in KV store |
| **Proxy** | Built-in residential with automatic fallback |
| **Timeout safety** | Graceful stop 90s before Apify hard-kill |
| **Cookie banners** | Auto-dismissed before extraction |
| **Authentication** | Cookie injection for logged-in crawling |

***

### Pricing

Pay per page crawled. No monthly fees. No hidden costs.

| What you pay for | Price |
|---|---|
| Page crawled | **$0.02** per page |
| Apify platform usage | [Standard compute costs](https://apify.com/pricing) |

Crawl 100 pages = $2. Crawl 1,000 pages = $20.

***

### Input reference

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | *required* | One or more seed URLs |
| `maxCrawlDepth` | integer | `5` | Max link depth from seed (0 = seed only) |
| `maxCrawlPages` | integer | `100` | Max pages saved (filtered pages don't count) |
| `crawlerType` | select | `playwright:chrome` | Rendering engine |
| `globs` | string\[] | — | Only crawl matching URL patterns |
| `excludeGlobs` | string\[] | — | Skip matching URL patterns |
| `useSitemaps` | boolean | `false` | Auto-discover URLs from sitemap.xml |
| `htmlTransformer` | select | `readability` | Content extraction method |
| `languageFilter` | string\[] | — | Only save pages in these languages |
| `contentMinLength` | integer | `100` | Skip pages with fewer characters |
| `deduplicateByContent` | boolean | `false` | Skip duplicate content (MD5 hash) |
| `chunkContent` | boolean | `false` | Enable RAG chunking |
| `chunkSize` | integer | `2000` | Target chunk size in characters |
| `chunkOverlap` | integer | `200` | Overlap between chunks |
| `extractMetadata` | boolean | `true` | Extract rich metadata |
| `extractLinks` | boolean | `false` | Extract page links |
| `saveMarkdown` | boolean | `true` | Include Markdown in output |
| `saveText` | boolean | `true` | Include plain text in output |
| `saveHtml` | boolean | `false` | Save cleaned HTML to KV store |
| `aggressivePrune` | boolean | `false` | Remove sidebars, comments, widgets |
| `dismissCookieBanners` | boolean | `true` | Auto-click cookie consent dialogs |
| `maxConcurrency` | integer | `3` | Parallel requests |
| `requestTimeoutSecs` | integer | `60` | Hard timeout per page |

***

### FAQ

**Is this compatible with apify/website-content-crawler?**
Yes. Same output format (`url`, `crawl`, `metadata`, `markdown`, `text`). You can switch without changing your pipeline.

**Can I crawl JavaScript-rendered pages?**
Yes. The default Playwright Chrome engine renders JavaScript, handles SPAs, and bypasses basic bot protection.

**How do I crawl only specific sections of a site?**
Use `globs` to include patterns (e.g. `https://example.com/blog/**`) and `excludeGlobs` to exclude patterns (e.g. `**/tag/**`).

**What happens if a page is blocked?**
The crawler detects CAPTCHA and bot-wall pages, retries with a fresh session, and logs the failure. Blocked pages don't count against your page limit.

**Can I use this for multiple languages?**
Yes. Set `languageFilter` to `["en", "de", "fr"]` to keep only those languages. Pages in other languages are skipped but don't waste your budget.

**How does chunking work?**
Content is split at semantic boundaries (headings, paragraph breaks, code blocks). Each chunk includes position data and a token estimate. Configure `chunkSize` and `chunkOverlap` to match your embedding model's context window.

# Actor input Schema

## `startUrls` (type: `array`):

One or more seed URLs. The crawler stays within each URL's domain by default. Supports plain strings or {url, userData} objects.

## `crawlerType` (type: `string`):

Rendering engine. playwright:chrome (default) handles JS and bot protection. cheerio is 3-5x faster and cheaper for static HTML sites.

## `ignoreSslErrors` (type: `boolean`):

Ignore SSL/TLS certificate errors (e.g. self-signed certificates).

## `maxCrawlDepth` (type: `integer`):

Maximum link depth from the seed URL. Depth 0 = only the seed URL. Depth 1 = seed + directly linked pages.

## `maxCrawlPages` (type: `integer`):

Maximum number of pages to save to the dataset. Filtered pages (language, content length, duplicates) do NOT count.

## `maxConcurrency` (type: `integer`):

Maximum parallel browser/HTTP requests. Lower = less chance of rate-limiting.

## `requestTimeoutSecs` (type: `integer`):

Hard timeout per page. The crawl is aborted if a page takes longer. This is strictly enforced (unlike some other crawlers).

## `globs` (type: `array`):

Only crawl URLs matching these glob patterns (e.g. 'https://example.com/blog/\*\*'). Leave empty to crawl all pages within the seed domain.

## `excludeGlobs` (type: `array`):

Skip URLs matching these glob patterns (e.g. '**/\*.pdf', '**/tag/\*\*').

## `useSitemaps` (type: `boolean`):

Auto-fetch {domain}/sitemap.xml and enqueue all discovered URLs. Faster for large sites.

## `htmlTransformer` (type: `string`):

How to extract page content. 'readability' uses Mozilla Readability (Firefox Reader View) for clean article extraction. 'raw' uses the full body HTML.

## `removeElementsCssSelector` (type: `string`):

Remove matching elements before extraction. Default: nav, footer, header, aside, cookie banners, ads. Override with your own selectors.

## `keepElementsCssSelector` (type: `string`):

When set, extract ONLY content inside matching elements (removes everything else first).

## `includeImages` (type: `boolean`):

Keep image Markdown (![alt](url)) in output. Disabled by default to reduce noise for LLM use cases.

## `aggressivePrune` (type: `boolean`):

Remove additional noisy elements: sidebars, social buttons, share widgets, comment sections, related-article widgets.

## `waitForSelector` (type: `string`):

Wait until this element appears before extracting content. Useful for heavily dynamic pages.

## `dismissCookieBanners` (type: `boolean`):

Automatically click Accept on common cookie consent dialogs before extracting content.

## `expandClickableElements` (type: `boolean`):

Auto-click 'Read more', 'Show more' buttons and expand collapsed sections before extracting content.

## `languageFilter` (type: `array`):

Only save pages in these languages (ISO 639-1 codes, e.g. \['en', 'de']). Pages in other languages are skipped but do NOT count against maxCrawlPages. Leave empty for all languages.

## `contentMinLength` (type: `integer`):

Skip pages with fewer than this many characters of extracted text. Prevents saving empty pages, 403 error pages, redirects, etc.

## `deduplicateByContent` (type: `boolean`):

Skip pages with identical content (MD5 of text). Goes beyond URL/canonical deduplication — catches pages with different URLs but the same body.

## `saveMarkdown` (type: `boolean`):

Include the 'markdown' field in each output record.

## `saveText` (type: `boolean`):

Include the 'text' (plain text) field in each output record.

## `saveHtml` (type: `boolean`):

Save the cleaned page HTML to the key-value store and include the URL in 'htmlUrl'.

## `saveScreenshots` (type: `boolean`):

Capture a screenshot of each page and save to the key-value store. Adds 'screenshotUrl' to each record.

## `extractMetadata` (type: `boolean`):

Extract OG tags, JSON-LD structured data, author, publish date, modified date, language, hreflang, Twitter Card, word count, reading time, token estimate, and content type classification.

## `extractLinks` (type: `boolean`):

Include arrays of internal links, external links, and PDF links found on each page.

## `chunkContent` (type: `boolean`):

Split Markdown into semantic chunks and add a 'chunks' array to each record. Each chunk includes text, position, and token estimate. Eliminates need for LangChain/LlamaIndex post-processing.

## `chunkSize` (type: `integer`):

Target maximum characters per chunk. Chunks respect paragraph boundaries.

## `chunkOverlap` (type: `integer`):

Characters of overlap between consecutive chunks for context continuity.

## `initialCookies` (type: `array`):

Cookies to inject before crawling (e.g. for authenticated sessions). Each entry: {name, value, domain, path}.

## `customHeaders` (type: `object`):

Additional headers to send with every request (e.g. {"Authorization": "Bearer token"}).

## `userAgent` (type: `string`):

Override the browser's User-Agent string.

## `requestsPerMinute` (type: `integer`):

Maximum requests per minute to be polite to the target server. 0 = no limit.

## `proxyConfiguration` (type: `object`):

Proxy settings. Built-in residential proxy is used by default. Override here only if you need a specific proxy group (e.g. RESIDENTIAL for heavily protected sites).

## `debugMode` (type: `boolean`):

Add 'extractedBy' field to each record showing whether Readability or raw extraction was used.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://example.com"
    }
  ],
  "crawlerType": "playwright:chrome",
  "ignoreSslErrors": false,
  "maxCrawlDepth": 1,
  "maxCrawlPages": 3,
  "maxConcurrency": 3,
  "requestTimeoutSecs": 60,
  "globs": [
    "https://example.com/blog/**",
    "https://example.com/docs/**"
  ],
  "excludeGlobs": [
    "**/*.pdf",
    "**/tag/**",
    "**/author/**"
  ],
  "useSitemaps": false,
  "htmlTransformer": "readability",
  "removeElementsCssSelector": ".sidebar, .comments, #footer, .cookie-banner",
  "keepElementsCssSelector": "article, .post-content, main",
  "includeImages": false,
  "aggressivePrune": false,
  "waitForSelector": ".content-loaded, #main-article",
  "dismissCookieBanners": true,
  "expandClickableElements": false,
  "languageFilter": [
    "en",
    "de"
  ],
  "contentMinLength": 100,
  "deduplicateByContent": false,
  "saveMarkdown": true,
  "saveText": true,
  "saveHtml": false,
  "saveScreenshots": false,
  "extractMetadata": true,
  "extractLinks": false,
  "chunkContent": false,
  "chunkSize": 2000,
  "chunkOverlap": 200,
  "requestsPerMinute": 0,
  "debugMode": false
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://docs.apify.com/academy/web-scraping-for-beginners"
        }
    ],
    "maxCrawlDepth": 1,
    "maxCrawlPages": 3
};

// Run the Actor and wait for it to finish
const run = await client.actor("brilliant_gum/web-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }],
    "maxCrawlDepth": 1,
    "maxCrawlPages": 3,
}

# Run the Actor and wait for it to finish
run = client.actor("brilliant_gum/web-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://docs.apify.com/academy/web-scraping-for-beginners"
    }
  ],
  "maxCrawlDepth": 1,
  "maxCrawlPages": 3
}' |
apify call brilliant_gum/web-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=brilliant_gum/web-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI-Ready Web Content Crawler (LLM/RAG Optimized)",
        "description": "Deep-crawl websites and extract LLM-ready Markdown with OG tags, JSON-LD, author, dates, token estimates, native RAG chunking, language filtering, content-hash dedup, and per-page error reporting. Enforced timeouts. Zero silent failures.",
        "version": "1.0",
        "x-build-id": "Z27Y7rakQsHoV7goa"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/brilliant_gum~web-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-brilliant_gum-web-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/brilliant_gum~web-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-brilliant_gum-web-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/brilliant_gum~web-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-brilliant_gum-web-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "One or more seed URLs. The crawler stays within each URL's domain by default. Supports plain strings or {url, userData} objects.",
                        "items": {
                            "type": "object"
                        }
                    },
                    "crawlerType": {
                        "title": "Crawler Type",
                        "enum": [
                            "playwright:chrome",
                            "playwright:firefox",
                            "playwright:adaptive",
                            "cheerio",
                            "jsdom"
                        ],
                        "type": "string",
                        "description": "Rendering engine. playwright:chrome (default) handles JS and bot protection. cheerio is 3-5x faster and cheaper for static HTML sites.",
                        "default": "playwright:chrome"
                    },
                    "ignoreSslErrors": {
                        "title": "Ignore SSL Errors",
                        "type": "boolean",
                        "description": "Ignore SSL/TLS certificate errors (e.g. self-signed certificates).",
                        "default": false
                    },
                    "maxCrawlDepth": {
                        "title": "Max Crawl Depth",
                        "minimum": 0,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Maximum link depth from the seed URL. Depth 0 = only the seed URL. Depth 1 = seed + directly linked pages.",
                        "default": 5
                    },
                    "maxCrawlPages": {
                        "title": "Max Saved Pages",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Maximum number of pages to save to the dataset. Filtered pages (language, content length, duplicates) do NOT count.",
                        "default": 100
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum parallel browser/HTTP requests. Lower = less chance of rate-limiting.",
                        "default": 3
                    },
                    "requestTimeoutSecs": {
                        "title": "Request Timeout (seconds)",
                        "minimum": 5,
                        "maximum": 300,
                        "type": "integer",
                        "description": "Hard timeout per page. The crawl is aborted if a page takes longer. This is strictly enforced (unlike some other crawlers).",
                        "default": 60
                    },
                    "globs": {
                        "title": "Include URL Patterns (globs)",
                        "type": "array",
                        "description": "Only crawl URLs matching these glob patterns (e.g. 'https://example.com/blog/**'). Leave empty to crawl all pages within the seed domain.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "excludeGlobs": {
                        "title": "Exclude URL Patterns (globs)",
                        "type": "array",
                        "description": "Skip URLs matching these glob patterns (e.g. '**/*.pdf', '**/tag/**').",
                        "items": {
                            "type": "string"
                        }
                    },
                    "useSitemaps": {
                        "title": "Discover via sitemap.xml",
                        "type": "boolean",
                        "description": "Auto-fetch {domain}/sitemap.xml and enqueue all discovered URLs. Faster for large sites.",
                        "default": false
                    },
                    "htmlTransformer": {
                        "title": "Content Extraction Method",
                        "enum": [
                            "readability",
                            "raw"
                        ],
                        "type": "string",
                        "description": "How to extract page content. 'readability' uses Mozilla Readability (Firefox Reader View) for clean article extraction. 'raw' uses the full body HTML.",
                        "default": "readability"
                    },
                    "removeElementsCssSelector": {
                        "title": "Remove Elements (CSS Selector)",
                        "type": "string",
                        "description": "Remove matching elements before extraction. Default: nav, footer, header, aside, cookie banners, ads. Override with your own selectors."
                    },
                    "keepElementsCssSelector": {
                        "title": "Keep Only Elements (CSS Selector)",
                        "type": "string",
                        "description": "When set, extract ONLY content inside matching elements (removes everything else first)."
                    },
                    "includeImages": {
                        "title": "Include Images in Markdown",
                        "type": "boolean",
                        "description": "Keep image Markdown (![alt](url)) in output. Disabled by default to reduce noise for LLM use cases.",
                        "default": false
                    },
                    "aggressivePrune": {
                        "title": "Aggressive Pruning",
                        "type": "boolean",
                        "description": "Remove additional noisy elements: sidebars, social buttons, share widgets, comment sections, related-article widgets.",
                        "default": false
                    },
                    "waitForSelector": {
                        "title": "Wait for CSS Selector (Playwright only)",
                        "type": "string",
                        "description": "Wait until this element appears before extracting content. Useful for heavily dynamic pages."
                    },
                    "dismissCookieBanners": {
                        "title": "Auto-dismiss Cookie Banners (Playwright only)",
                        "type": "boolean",
                        "description": "Automatically click Accept on common cookie consent dialogs before extracting content.",
                        "default": true
                    },
                    "expandClickableElements": {
                        "title": "Expand Clickable Elements (Playwright only)",
                        "type": "boolean",
                        "description": "Auto-click 'Read more', 'Show more' buttons and expand collapsed sections before extracting content.",
                        "default": false
                    },
                    "languageFilter": {
                        "title": "Language Filter",
                        "type": "array",
                        "description": "Only save pages in these languages (ISO 639-1 codes, e.g. ['en', 'de']). Pages in other languages are skipped but do NOT count against maxCrawlPages. Leave empty for all languages.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "contentMinLength": {
                        "title": "Minimum Content Length (chars)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip pages with fewer than this many characters of extracted text. Prevents saving empty pages, 403 error pages, redirects, etc.",
                        "default": 100
                    },
                    "deduplicateByContent": {
                        "title": "Deduplicate by Content Hash",
                        "type": "boolean",
                        "description": "Skip pages with identical content (MD5 of text). Goes beyond URL/canonical deduplication — catches pages with different URLs but the same body.",
                        "default": false
                    },
                    "saveMarkdown": {
                        "title": "Save Markdown",
                        "type": "boolean",
                        "description": "Include the 'markdown' field in each output record.",
                        "default": true
                    },
                    "saveText": {
                        "title": "Save Plain Text",
                        "type": "boolean",
                        "description": "Include the 'text' (plain text) field in each output record.",
                        "default": true
                    },
                    "saveHtml": {
                        "title": "Save Cleaned HTML",
                        "type": "boolean",
                        "description": "Save the cleaned page HTML to the key-value store and include the URL in 'htmlUrl'.",
                        "default": false
                    },
                    "saveScreenshots": {
                        "title": "Save Screenshots (Playwright only)",
                        "type": "boolean",
                        "description": "Capture a screenshot of each page and save to the key-value store. Adds 'screenshotUrl' to each record.",
                        "default": false
                    },
                    "extractMetadata": {
                        "title": "Extract Rich Metadata",
                        "type": "boolean",
                        "description": "Extract OG tags, JSON-LD structured data, author, publish date, modified date, language, hreflang, Twitter Card, word count, reading time, token estimate, and content type classification.",
                        "default": true
                    },
                    "extractLinks": {
                        "title": "Extract Links",
                        "type": "boolean",
                        "description": "Include arrays of internal links, external links, and PDF links found on each page.",
                        "default": false
                    },
                    "chunkContent": {
                        "title": "Chunk Content for LLM/RAG",
                        "type": "boolean",
                        "description": "Split Markdown into semantic chunks and add a 'chunks' array to each record. Each chunk includes text, position, and token estimate. Eliminates need for LangChain/LlamaIndex post-processing.",
                        "default": false
                    },
                    "chunkSize": {
                        "title": "Chunk Size (characters)",
                        "minimum": 100,
                        "maximum": 50000,
                        "type": "integer",
                        "description": "Target maximum characters per chunk. Chunks respect paragraph boundaries.",
                        "default": 2000
                    },
                    "chunkOverlap": {
                        "title": "Chunk Overlap (characters)",
                        "minimum": 0,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Characters of overlap between consecutive chunks for context continuity.",
                        "default": 200
                    },
                    "initialCookies": {
                        "title": "Initial Cookies",
                        "type": "array",
                        "description": "Cookies to inject before crawling (e.g. for authenticated sessions). Each entry: {name, value, domain, path}.",
                        "items": {
                            "type": "object"
                        }
                    },
                    "customHeaders": {
                        "title": "Custom HTTP Headers",
                        "type": "object",
                        "description": "Additional headers to send with every request (e.g. {\"Authorization\": \"Bearer token\"})."
                    },
                    "userAgent": {
                        "title": "Custom User-Agent",
                        "type": "string",
                        "description": "Override the browser's User-Agent string."
                    },
                    "requestsPerMinute": {
                        "title": "Requests per Minute (rate limit)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum requests per minute to be polite to the target server. 0 = no limit.",
                        "default": 0
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Proxy settings. Built-in residential proxy is used by default. Override here only if you need a specific proxy group (e.g. RESIDENTIAL for heavily protected sites)."
                    },
                    "debugMode": {
                        "title": "Debug Mode",
                        "type": "boolean",
                        "description": "Add 'extractedBy' field to each record showing whether Readability or raw extraction was used.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
