# Website Content Crawler (`parseforge/website-content-crawler`) Actor

Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!

- **URL**: https://apify.com/parseforge/website-content-crawler.md
- **Developed by:** [ParseForge](https://apify.com/parseforge) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $20.00 / 1,000 results

This Actor is paid per event: you are charged a fixed price for specific events rather than for Apify platform usage.
Because this Actor supports Apify Store discounts, the price decreases the higher your subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools that run on the Apify platform, covering all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
"Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

![ParseForge Banner](https://github.com/ParseForge/apify-assets/blob/ad35ccc13ddd068b9d6cba33f323962e39aed5b2/banner.jpg?raw=true)

## 🕸️ Website Content Crawler

> 🚀 **Crawl an entire website and export clean Markdown in seconds.** Seed from sitemaps, respect robots.txt, and fall back to a real browser for JavaScript-heavy pages. No API key, no registration, no manual pipeline code.

> 🕒 **Last updated:** 2026-04-21 · **📊 18 fields** per page · **🗺️ Sitemap auto-seed** · **🤖 Robots-aware** · **🌐 HTTP + browser fallback**

The **Website Content Crawler** walks any website from a starting URL, following internal links up to a configurable depth. It parses `sitemap.xml` and `sitemap_index.xml` to discover thousands of URLs instantly, respects robots.txt, and can switch to a headless browser when HTTP-only fetching returns thin content. Every crawled page comes back as clean Markdown plus 17 metadata fields, ready for RAG pipelines, knowledge bases, and content audits.

Built-in include and exclude regex filters let you narrow the crawl to `/docs/`, skip `/auth/`, or ignore query-heavy URLs. Concurrency defaults to 10 parallel fetches, so a 100-page crawl typically finishes in about a minute. The output uses a consistent schema across HTTP and browser modes, so downstream consumers never have to know which fetch strategy was used.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| AI app teams, knowledge engineers, SEO specialists, documentation writers, research scientists, content archivists | RAG knowledge bases, docs mirroring, SEO audits, competitor content analysis, research corpus assembly |

---

### 📋 What the Website Content Crawler does

Six crawl workflows in a single run:

- 🗺️ **Sitemap auto-seed.** Parses `sitemap.xml` and index files to discover every public URL in seconds.
- 🤖 **Robots.txt aware.** Respects disallow rules for the `*` and `apify` user-agents.
- 🌐 **Browser fallback.** Uses Playwright when a page returns thin content, handling JavaScript-heavy sites automatically.
- 📝 **Markdown extraction.** Clean headings, paragraphs, lists, blockquotes, and code blocks. Navigation and footers stripped.
- 🔗 **Link analytics.** Counts internal and outbound links per page for site-structure analysis.
- 🚦 **Include/exclude patterns.** Regex filters to control which URLs enter the queue.

Every page ships with title, description, language, author, publishedTime, siteName, og:image, link counts, HTTP status, response time, depth, parent URL, and a timestamp.

> 💡 **Why it matters:** RAG pipelines, SEO audits, and knowledge bases all start with a clean crawl. Doing it yourself means writing link discovery, sitemap parsers, robots.txt logic, and a Markdown cleaner. This Actor ships all of that pre-packaged.

---

### 🎬 Full Demo

_🚧 Coming soon: a 3-minute walkthrough showing sitemap seeding and browser fallback in action._

---

### ⚙️ Input

<table>
<thead>
<tr><th>Input</th><th>Type</th><th>Default</th><th>Behavior</th></tr>
</thead>
<tbody>
<tr><td><code>startUrls</code></td><td>array of URLs</td><td>required</td><td>One or more starting URLs for the crawl.</td></tr>
<tr><td><code>maxDepth</code></td><td>integer</td><td><code>2</code></td><td>Link hops from the start URLs (0 = start URLs only).</td></tr>
<tr><td><code>maxItems</code></td><td>integer</td><td><code>10</code></td><td>Pages returned. Free plan caps at 10, paid plan at 1,000,000.</td></tr>
<tr><td><code>sameDomain</code></td><td>boolean</td><td><code>true</code></td><td>Stay within the starting domain.</td></tr>
<tr><td><code>includeSubdomains</code></td><td>boolean</td><td><code>true</code></td><td>Follow subdomains of the root host.</td></tr>
<tr><td><code>renderingType</code></td><td>string</td><td><code>"http"</code></td><td><code>http</code>, <code>browser</code>, or <code>auto</code> (browser fallback when HTTP content is thin).</td></tr>
<tr><td><code>useSitemap</code></td><td>boolean</td><td><code>true</code></td><td>Seed queue from <code>sitemap.xml</code>.</td></tr>
<tr><td><code>respectRobotsTxt</code></td><td>boolean</td><td><code>true</code></td><td>Skip URLs disallowed by robots.txt.</td></tr>
<tr><td><code>includeUrlPatterns</code></td><td>array of regex</td><td><code>[]</code></td><td>Only URLs matching any pattern are crawled.</td></tr>
<tr><td><code>excludeUrlPatterns</code></td><td>array of regex</td><td><code>[]</code></td><td>URLs matching any pattern are skipped.</td></tr>
</tbody>
</table>

**Example: crawl documentation with sitemap seeding.**

```json
{
    "startUrls": [{ "url": "https://docs.apify.com" }],
    "maxDepth": 3,
    "maxItems": 500,
    "useSitemap": true,
    "respectRobotsTxt": true,
    "renderingType": "auto"
}
````

**Example: blog crawl with URL filters.**

```json
{
    "startUrls": [{ "url": "https://example.com" }],
    "maxDepth": 5,
    "maxItems": 200,
    "includeUrlPatterns": ["/blog/"],
    "excludeUrlPatterns": ["/tag/", "/page/"]
}
```
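
Assuming the patterns are evaluated as unanchored regex searches (which the input names suggest, but which is not confirmed by the source), the filter logic can be sketched like this. `url_allowed` is a hypothetical helper, not the Actor's actual code:

```python
import re

def url_allowed(url, include_patterns, exclude_patterns):
    """A URL is crawled when it matches at least one include pattern
    (if any are given) and matches no exclude pattern."""
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    return True

# The blog-crawl filters from the example above
include = ["/blog/"]
exclude = ["/tag/", "/page/"]

print(url_allowed("https://example.com/blog/post-1", include, exclude))  # True
print(url_allowed("https://example.com/blog/tag/ai", include, exclude))  # False
print(url_allowed("https://example.com/about", include, exclude))        # False
```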

> ⚠️ **Good to Know:** concurrency is capped at 10 parallel fetches to stay polite. Use browser mode only when HTTP-only fetching returns thin content, because browser rendering is roughly 3x slower per page.

***

### 📊 Output

Each record contains **18 fields**. Download the dataset as CSV, Excel, JSON, or XML.

#### 🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | `"https://docs.apify.com/platform/actors"` |
| 🪜 `depth` | number | `1` |
| 🏠 `parentUrl` | string \| null | `"https://docs.apify.com"` |
| 🏷️ `title` | string \| null | `"Actors \| Apify Documentation"` |
| 📝 `description` | string \| null | `"Learn how Apify Actors package scrapers."` |
| 📃 `markdown` | string | `"## Actors\n\nAn Actor is..."` |
| 💬 `text` | string | `"Actors An Actor is..."` |
| 🔢 `wordCount` | number | `860` |
| 🌍 `language` | string \| null | `"en"` |
| 🧑 `author` | string \| null | `"Apify"` |
| 📅 `publishedTime` | ISO 8601 \| null | `"2024-08-15T00:00:00Z"` |
| 🏢 `siteName` | string \| null | `"Apify Documentation"` |
| 🖼️ `imageUrl` | string \| null | `"https://.../og.png"` |
| ↗️ `outboundLinks` | number | `14` |
| ↘️ `internalLinks` | number | `42` |
| 🟢 `httpStatus` | number | `200` |
| ⏱️ `responseTimeMs` | number | `210` |
| 🕒 `crawledAt` | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| ❗ `error` | string \| null | `"Timeout"` on failure |
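
Once downloaded as JSON, records with this shape are easy to post-process. A minimal sketch using sample records (not live Actor output):

```python
# Sample records shaped like the schema above (not live Actor output)
records = [
    {"url": "https://docs.apify.com/platform/actors",
     "wordCount": 860, "httpStatus": 200, "error": None},
    {"url": "https://docs.apify.com/broken-link",
     "wordCount": 0, "httpStatus": 404, "error": "HTTP 404"},
]

# Keep only successfully crawled pages, e.g. before a RAG ingest step
ok_pages = [r for r in records if r["httpStatus"] == 200 and not r.get("error")]
total_words = sum(r["wordCount"] for r in ok_pages)
print(len(ok_pages), total_words)  # 1 860
```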

#### 📦 Sample records

<details>
<summary><strong>📚 Docs page with rich metadata</strong></summary>

```json
{
    "url": "https://docs.apify.com/platform/actors",
    "depth": 1,
    "parentUrl": "https://docs.apify.com",
    "title": "Actors | Apify Documentation",
    "description": "Learn how Apify Actors package scrapers and automation into reusable tools.",
    "markdown": "## Actors\n\nAn **Actor** is a serverless program...",
    "text": "Actors An Actor is a serverless program...",
    "wordCount": 860,
    "language": "en",
    "author": "Apify",
    "publishedTime": "2024-08-15T00:00:00Z",
    "siteName": "Apify Documentation",
    "imageUrl": "https://docs.apify.com/og.png",
    "outboundLinks": 14,
    "internalLinks": 42,
    "httpStatus": 200,
    "responseTimeMs": 210,
    "crawledAt": "2026-04-21T12:00:00.000Z"
}
```

</details>

<details>
<summary><strong>🌱 Root landing page (depth 0)</strong></summary>

```json
{
    "url": "https://docs.apify.com",
    "depth": 0,
    "parentUrl": null,
    "title": "Apify Documentation",
    "description": "Documentation for the Apify platform.",
    "markdown": "## Apify Documentation\n\nWelcome...",
    "text": "Apify Documentation Welcome...",
    "wordCount": 312,
    "language": "en",
    "author": null,
    "publishedTime": null,
    "siteName": "Apify",
    "imageUrl": "https://docs.apify.com/og.png",
    "outboundLinks": 8,
    "internalLinks": 65,
    "httpStatus": 200,
    "responseTimeMs": 195,
    "crawledAt": "2026-04-21T12:00:00.000Z"
}
```

</details>

<details>
<summary><strong>🚧 Page that returned an error</strong></summary>

```json
{
    "url": "https://docs.apify.com/broken-link",
    "depth": 2,
    "parentUrl": "https://docs.apify.com/platform",
    "title": null,
    "description": null,
    "markdown": "",
    "text": "",
    "wordCount": 0,
    "language": null,
    "author": null,
    "publishedTime": null,
    "siteName": null,
    "imageUrl": null,
    "outboundLinks": 0,
    "internalLinks": 0,
    "httpStatus": 404,
    "responseTimeMs": 135,
    "crawledAt": "2026-04-21T12:00:00.000Z",
    "error": "HTTP 404"
}
```

</details>

***

### ✨ Why choose this Actor

| | Capability |
|---|---|
| 🗺️ | **Sitemap auto-seeding.** Discovers thousands of URLs from `sitemap.xml` instantly. |
| 🤖 | **Robots-aware.** Respects disallow rules out of the box. |
| 🌐 | **HTTP plus browser.** Auto fallback to Playwright when JavaScript matters. |
| 📝 | **Clean Markdown.** Strips nav, footer, aside, and scripts. Preserves content structure. |
| 🔗 | **Link graph.** Counts internal and outbound links per page for site analysis. |
| ⚡ | **Fast.** 100 pages in under a minute with HTTP concurrency of 10. |
| 🚫 | **No credentials.** Runs on any publicly accessible site. |

> 📊 Clean crawling is the difference between a RAG pipeline that answers correctly and one that returns garbled navigation text. This Actor does the cleaning for you.

***

### 📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| **⭐ Website Content Crawler** *(this Actor)* | $5 free credit, then pay-per-use | Any public site | **Live per run** | depth, patterns, sitemap, robots | ⚡ 2 min |
| Generic open-source spiders | Free | Raw HTML | Your schedule | Manual coding | 🐢 Days |
| Cloud crawler platforms | $$$+/month | Full enterprise | Managed | Visual rules | 🕒 Hours |
| DIY Playwright scripts | Free | Your code | Your maintenance | Whatever you build | 🐢 Days |

Pick this Actor when you want a clean, RAG-ready crawl with sitemap discovery and zero infrastructure.

***

### 🚀 How to use

1. 📝 **Sign up.** [Create a free account with $5 credit](https://console.apify.com/sign-up?fpr=vmoqkp) (takes 2 minutes).
2. 🌐 **Open the Actor.** Go to the Website Content Crawler page on the Apify Store.
3. 🎯 **Set input.** Pick one or more start URLs, a depth limit, and `maxItems`.
4. 🚀 **Run it.** Click **Start** and let the Actor walk the site.
5. 📥 **Download.** Grab your results in the **Dataset** tab as CSV, Excel, JSON, or XML.

> ⏱️ Total time from signup to downloaded dataset: **3-5 minutes.** No coding required.

***

### 💼 Business use cases

<table>
<tr>
<td width="50%" valign="top">

#### 🧠 AI Knowledge Bases

- Feed product docs into a vector database
- Sync internal wikis into a RAG index
- Refresh chatbot context on a schedule
- Build training corpora from public sites

</td>
<td width="50%" valign="top">

#### 📈 SEO & Content Audits

- Inventory every public page on a site
- Map internal and outbound link structure
- Detect orphan and 404 pages
- Compare competitor content footprints

</td>
</tr>
<tr>
<td width="50%" valign="top">

#### 📚 Documentation Mirroring

- Archive documentation for offline use
- Snapshot support portals for compliance
- Monitor API reference changes over time
- Build plain-Markdown docs archives

</td>
<td width="50%" valign="top">

#### 🧑‍🔬 Research Corpora

- Extract text datasets from academic sites
- Gather news archives by domain
- Build language modeling corpora
- Snapshot regulatory content for analysis

</td>
</tr>
</table>

***

### 🔌 Automating Website Content Crawler

Control the scraper programmatically for scheduled runs and pipeline integrations:

- 🟢 **Node.js.** Install the `apify-client` NPM package.
- 🐍 **Python.** Use the `apify-client` PyPI package.
- 📚 See the [Apify API documentation](https://docs.apify.com/api/v2) for full details.

The [Apify Schedules feature](https://docs.apify.com/platform/schedules) lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases aligned with the source site.
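
A schedule can also be created programmatically. This is a hedged sketch against the Schedules REST endpoint using only the standard library: the field names (`cronExpression`, `isEnabled`, `actions`) follow the Schedules API docs, and the `actorId` value is illustrative (the API may require the Actor's internal ID rather than its store name), so verify against the current API reference before relying on it.

```python
import json
import os
import urllib.request

# Payload for a daily 06:00 UTC refresh; field names per the Schedules API
payload = {
    "name": "daily-docs-crawl",
    "cronExpression": "0 6 * * *",
    "isEnabled": True,
    "isExclusive": True,
    # actorId is illustrative; the API may expect the Actor's internal ID
    "actions": [{"type": "RUN_ACTOR", "actorId": "parseforge~website-content-crawler"}],
}

token = os.environ.get("APIFY_TOKEN")
if token:  # only call the API when a real token is configured
    req = urllib.request.Request(
        f"https://api.apify.com/v2/schedules?token={token}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["data"]["id"])
```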

***

### ❓ Frequently Asked Questions

<details>
<summary><strong>🧩 How does it work?</strong></summary>

Pass a start URL. The Actor parses sitemap.xml (if enabled), walks the site breadth-first up to your depth limit, fetches each page in parallel, and returns clean Markdown plus metadata. Robots.txt rules are respected by default.
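
The crawl order can be sketched as a depth-limited breadth-first walk. This is a simplified illustration, not the Actor's code; `bfs_order` and the `links` map are hypothetical:

```python
from collections import deque

def bfs_order(start, links, max_depth):
    """Visit pages breadth-first; `links` maps each URL to the
    internal links discovered on that page."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth < max_depth:  # only enqueue children within the depth limit
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

site = {"/": ["/docs", "/blog"], "/docs": ["/docs/api"]}
print(bfs_order("/", site, max_depth=1))
# [('/', 0), ('/docs', 1), ('/blog', 1)]
```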

</details>

<details>
<summary><strong>📏 How accurate is the content extraction?</strong></summary>

Extraction works best for `article`, `main`, and `role="main"` containers. Navigation, footer, aside, and script tags are stripped. Single-page apps that render content entirely in JavaScript need browser mode.

</details>

<details>
<summary><strong>🔁 How does the browser fallback decide when to switch?</strong></summary>

In `auto` mode, HTTP fetch runs first. If the response body is under 2 KB or throws an error, the Actor retries the same URL with a headless Chromium browser.
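
That decision rule can be sketched as follows; `needs_browser` is a hypothetical helper illustrating the described behavior, and the 2 KB threshold is the one stated above:

```python
THIN_CONTENT_BYTES = 2 * 1024  # the 2 KB threshold described above

def needs_browser(http_body, http_error):
    """Retry with a headless browser when the HTTP fetch errored
    or returned a body under the thin-content threshold."""
    if http_error or http_body is None:
        return True
    return len(http_body) < THIN_CONTENT_BYTES

print(needs_browser(b"<html>tiny</html>", http_error=False))  # True (thin body)
print(needs_browser(b"x" * 4096, http_error=False))           # False
```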

</details>

<details>
<summary><strong>🎯 Can I restrict the crawl to specific URL patterns?</strong></summary>

Yes. `includeUrlPatterns` only allows URLs matching any regex. `excludeUrlPatterns` skips any URL matching any regex.

</details>

<details>
<summary><strong>⏰ Can I schedule regular runs?</strong></summary>

Yes. Use Apify Schedules to crawl the same site on any cron interval and keep downstream content fresh.

</details>

<details>
<summary><strong>⚖️ Is it legal to crawl a site?</strong></summary>

Respecting robots.txt is default behavior. Check each site's terms of service. Some sites explicitly prohibit crawling for commercial use.

</details>

<details>
<summary><strong>💼 Can I use the data commercially?</strong></summary>

Publicly available web content can usually be crawled for research and internal tooling. Commercial redistribution or republishing may require a license from the source.

</details>

<details>
<summary><strong>💳 Do I need a paid Apify plan to use this Actor?</strong></summary>

No. The free plan covers testing (10 pages per run). A paid plan lifts the limit and speeds up concurrency.

</details>

<details>
<summary><strong>🔁 What happens if a run fails?</strong></summary>

Apify retries transient errors automatically. Partial datasets from failed runs are preserved, so you never lose crawled data.

</details>

<details>
<summary><strong>🔍 Does it respect rate limits?</strong></summary>

Concurrency is capped at 10 parallel fetches to stay polite. If a site rate-limits, individual pages will return `error` fields but the run continues.

</details>

<details>
<summary><strong>🌐 Can I crawl multiple domains in one run?</strong></summary>

Yes. Pass multiple start URLs. Set `sameDomain: false` to follow outbound links across domains.

</details>

<details>
<summary><strong>🆘 What if I need help?</strong></summary>

Our team is available through the Apify platform and the Tally form below.

</details>

***

### 🔌 Integrate with any app

Website Content Crawler connects to any cloud service via [Apify integrations](https://apify.com/integrations):

- [**Make**](https://docs.apify.com/platform/integrations/make) - Automate multi-step workflows
- [**Zapier**](https://docs.apify.com/platform/integrations/zapier) - Connect with 5,000+ apps
- [**Slack**](https://docs.apify.com/platform/integrations/slack) - Get run notifications
- [**Airbyte**](https://docs.apify.com/platform/integrations/airbyte) - Pipe content into your warehouse
- [**GitHub**](https://docs.apify.com/platform/integrations/github) - Trigger runs from commits
- [**Google Drive**](https://docs.apify.com/platform/integrations/drive) - Export Markdown to Docs

You can also use webhooks to push freshly crawled content into vector databases and RAG pipelines.
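
A webhook receiver for that flow might look like the sketch below. It assumes the default `ACTOR.RUN.SUCCEEDED` webhook payload carries the run object under `resource` with a `defaultDatasetId`; `handle_run_succeeded` and `dataset_items_url` are hypothetical helper names, and private datasets additionally need a `token` query parameter:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def dataset_items_url(dataset_id):
    # Public Datasets API; append &token=... for private datasets
    return f"{API_BASE}/datasets/{dataset_id}/items?format=json&clean=true"

def handle_run_succeeded(webhook_payload):
    """Fetch the finished run's dataset items for downstream indexing,
    e.g. pushing page records into a vector database."""
    dataset_id = webhook_payload["resource"]["defaultDatasetId"]
    with urllib.request.urlopen(dataset_items_url(dataset_id)) as resp:
        return json.load(resp)
```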

***

### 🔗 Recommended Actors

- [**🤖 RAG Web Browser**](https://apify.com/parseforge/rag-web-browser) - Search or fetch URLs with LLM-ready output
- [**📰 Smart Article Extractor**](https://apify.com/parseforge/article-extractor) - Extract clean article text from news sites
- [**🔍 Google Search Scraper**](https://apify.com/parseforge/google-search-scraper) - SERP results with rank and description
- [**📧 Contact Info Scraper**](https://apify.com/parseforge/contact-info-scraper) - Emails, phones, and socials from URLs
- [**📸 URL Screenshot Tool**](https://apify.com/parseforge/screenshot-url) - Full-page screenshots as PNG, JPEG, or PDF

> 💡 **Pro Tip:** browse the complete [ParseForge collection](https://apify.com/parseforge) for more AI-ready web tools.

***

**🆘 Need Help?** [**Open our contact form**](https://tally.so/r/BzdKgA) to request a new scraper, propose a custom data project, or report an issue.

***

> **⚠️ Disclaimer:** this Actor is an independent tool and is not affiliated with any website or crawler framework. Only publicly accessible pages are crawled. Robots.txt rules are respected by default. Always honor the terms of service of the sites you crawl.

# Actor input Schema

## `startUrls` (type: `array`):

Starting URLs to crawl.

## `maxDepth` (type: `integer`):

How many link hops from start URLs.

## `maxItems` (type: `integer`):

Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000

## `sameDomain` (type: `boolean`):

Restrict crawl to the starting domain

## `includeSubdomains` (type: `boolean`):

Allow subdomains when crawling

## `includeUrlPatterns` (type: `array`):

Regex patterns to include URLs

## `excludeUrlPatterns` (type: `array`):

Regex patterns to exclude URLs

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://apify.com/docs"
    }
  ],
  "maxDepth": 2,
  "maxItems": 10,
  "sameDomain": true,
  "includeSubdomains": true
}
```

# Actor output Schema

## `results` (type: `string`):

Complete dataset

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://apify.com/docs"
        }
    ],
    "maxDepth": 2,
    "maxItems": 10,
    "sameDomain": true,
    "includeSubdomains": true
};

// Run the Actor and wait for it to finish
const run = await client.actor("parseforge/website-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://apify.com/docs" }],
    "maxDepth": 2,
    "maxItems": 10,
    "sameDomain": True,
    "includeSubdomains": True,
}

# Run the Actor and wait for it to finish
run = client.actor("parseforge/website-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://apify.com/docs"
    }
  ],
  "maxDepth": 2,
  "maxItems": 10,
  "sameDomain": true,
  "includeSubdomains": true
}' |
apify call parseforge/website-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parseforge/website-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Crawler",
        "description": "Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!",
        "version": "1.0",
        "x-build-id": "ItgHoX53ESkJqaZKe"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parseforge~website-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parseforge-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parseforge~website-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-parseforge-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parseforge~website-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-parseforge-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "Starting URLs to crawl.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxDepth": {
                        "title": "Max Depth",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "How many link hops from start URLs."
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000"
                    },
                    "sameDomain": {
                        "title": "Stay on Same Domain",
                        "type": "boolean",
                        "description": "Restrict crawl to the starting domain"
                    },
                    "includeSubdomains": {
                        "title": "Include Subdomains",
                        "type": "boolean",
                        "description": "Allow subdomains when crawling"
                    },
                    "includeUrlPatterns": {
                        "title": "Include URL Patterns (regex)",
                        "type": "array",
                        "description": "Regex patterns to include URLs",
                        "items": {
                            "type": "string"
                        }
                    },
                    "excludeUrlPatterns": {
                        "title": "Exclude URL Patterns (regex)",
                        "type": "array",
                        "description": "Regex patterns to exclude URLs",
                        "items": {
                            "type": "string"
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
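The `runsResponseSchema` above describes the run object returned when the Actor is started via the API. As a minimal sketch of consuming such a response, the snippet below pulls out the fields most integrations care about (run ID, status, dataset ID, per-service charges). The sample payload and the `summarize_run` helper are illustrative assumptions for this sketch, not part of the Actor or the Apify client:

```python
# Illustrative run response shaped like runsResponseSchema above.
# Values are made up for the example, not from a real API call.
sample_response = {
    "data": {
        "id": "run-id",
        "status": "SUCCEEDED",
        "defaultDatasetId": "dataset-id",
        "usageTotalUsd": 0.00005,
        "usageUsd": {
            "KEY_VALUE_STORE_WRITES": 0.00005,
            "DATASET_WRITES": 0.0,
        },
    }
}


def summarize_run(response: dict) -> dict:
    """Extract the fields most integrations need from a run response."""
    run = response["data"]
    return {
        "id": run.get("id"),
        "status": run.get("status"),
        # Results land in the run's default dataset.
        "datasetId": run.get("defaultDatasetId"),
        # Sum the per-service USD charges as a cross-check of usageTotalUsd.
        "usageUsd": sum(run.get("usageUsd", {}).values()),
    }


summary = summarize_run(sample_response)
```

In a real integration you would obtain the run object from the API (for example via the JavaScript or Python client shown above) and read the same fields, then fetch results from the dataset referenced by `defaultDatasetId`.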
