# Website Content Crawler (`scrapemint/website-content-crawler`) Actor

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, and a link graph, with optional auto chunk splitting for vector databases. Pay per page.

- **URL**: https://apify.com/scrapemint/website-content-crawler.md
- **Developed by:** [Kennedy Mutisya](https://apify.com/scrapemint) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly user, 0.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you only pay for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools that run on the Apify platform, covering all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Website Content Crawler — Markdown, Token Counts & RAG Chunks

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries a token estimate, JSON LD metadata, and a link graph. Optional auto chunk splitting drops your data straight into a vector database. Pay per page.

**Built for** AI engineers feeding RAG pipelines, LLM application teams indexing documentation, vector database operators ingesting knowledge bases, and content teams converting websites to clean Markdown for fine tuning.

**Keywords this Actor ranks for:** website to markdown, website crawler for LLM, RAG pipeline crawler, scrape website to JSON, website content scraper API, llamaindex web scraper, langchain web crawler, vector database ingestion, AI training data crawler, documentation to markdown, website to RAG chunks, html to markdown converter API, knowledge base crawler.

---

### Why this Actor

| Other crawlers | **This Actor** |
|---|---|
| Raw HTML or plain text only | Markdown, plain text, AND cleaned HTML in one row |
| One extractor, take it or leave it | Three extractors race; the highest scoring result wins and the row is tagged with the winner |
| Manual chunking on your side | Auto chunks at paragraph boundaries with token aware overlap |
| No token info | Every row ships an estimated GPT and Claude token count |
| Sitemap configuration required | Auto discovers sitemap.xml, sitemap_index.xml, and robots.txt |
| PII passes through to your index | Optional one click PII redaction (emails, phones, SSN, IBAN) |
| Link graph data missing | Every row carries internal vs external link counts and 25 samples |

---

### How it works

```mermaid
flowchart LR
    A[Start URLs] --> B[Auto sitemap discovery<br/>sitemap.xml + robots.txt]
    A --> C[Adaptive crawler<br/>Playwright or Cheerio]
    B --> C
    C --> D[Strip nav header footer<br/>ads modals cookies]
    D --> E[Race three extractors<br/>Readability vs main vs body]
    E --> F[HTML to Markdown<br/>code blocks tables links]
    F --> G[Token count + chunk split]
    G --> H[(JSON CSV API<br/>vector database)]
```

Three extractors run on every page. Mozilla Readability, a custom main content detector, and a body fallback each return text plus a content score. The highest scoring result wins and the row is tagged with which extractor produced it, so you can audit quality on a per row basis.
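
As a rough mental model, the race reduces to "run every extractor, keep the best score." The sketch below is a toy illustration with two naive stand-in extractors; the Actor's actual Readability port and scoring logic are more involved.

```python
import re

# Toy extractors standing in for the real Readability, main-content,
# and body-fallback implementations. Each returns (text, score).
def extract_body(html: str) -> tuple[str, float]:
    text = re.sub(r"<[^>]+>", " ", html)      # strip all tags naively
    return text, len(text.split()) * 0.5      # low baseline score

def extract_main(html: str) -> tuple[str, float]:
    m = re.search(r"<main[^>]*>(.*?)</main>", html, re.DOTALL)
    if not m:
        return "", 0.0
    text = re.sub(r"<[^>]+>", " ", m.group(1))
    return text, len(text.split()) * 1.0      # reward targeted extraction

def extract_best(html: str) -> dict:
    candidates = {
        "main": extract_main(html),
        "body": extract_body(html),
    }
    # Keep the highest scoring result and tag the row with the winner,
    # which is what makes per row quality audits possible.
    winner = max(candidates, key=lambda name: candidates[name][1])
    text, score = candidates[winner]
    return {"extractor": winner, "contentScore": score, "text": text.strip()}
```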

***

### What you get per row

```mermaid
flowchart LR
    R[Page row] --> R1[Identity<br/>url loadedUrl title depth]
    R --> R2[Content<br/>markdown text html]
    R --> R3[Tokens<br/>estGpt chars]
    R --> R4[Metadata<br/>author publishedAt JSON LD]
    R --> R5[Link graph<br/>internal external samples]
    R --> R6[Extractor<br/>winner + score]
```

Toggle `chunkOutput` and the same row format is split into RAG ready chunks. Each chunk row has `chunkIndex`, `totalChunks`, the chunk markdown, and a token count, ready to push straight into Pinecone, Qdrant, Weaviate, or a Postgres pgvector table.
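
A sketch of that hand-off to Qdrant, assuming `apify-client` and `qdrant-client` are installed. The dataset ID, collection name, and the toy 8-dimensional `embed` stand-in are placeholders; swap in your run's dataset ID and a real embedding model.

```python
from apify_client import ApifyClient
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

apify = ApifyClient("<YOUR_API_TOKEN>")
qdrant = QdrantClient(url="http://localhost:6333")

DIM = 8  # toy dimension; real embedding models use hundreds to thousands

def embed(text: str) -> list[float]:
    # Deterministic stand-in so the sketch runs; replace with a real model.
    return [float(ord(c)) for c in text[:DIM].ljust(DIM)]

qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
)

# Each chunk row becomes one point; url and chunkIndex ride along as payload.
points = [
    PointStruct(
        id=i,
        vector=embed(item["markdown"]),
        payload={"url": item["url"], "chunkIndex": item.get("chunkIndex")},
    )
    for i, item in enumerate(apify.dataset("YOUR_DATASET_ID").iterate_items())
]
qdrant.upsert(collection_name="docs", points=points)
```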

***

### Quick start

**Index a documentation site for RAG**

```json
{
  "startUrls": ["https://docs.example.com/"],
  "maxPages": 500,
  "maxDepth": 5,
  "chunkOutput": true,
  "chunkSize": 1000,
  "chunkOverlap": 100
}
```

**Convert a blog to clean Markdown**

```json
{
  "startUrls": ["https://blog.example.com/"],
  "includeUrlPatterns": ["**/posts/**", "**/blog/**"],
  "outputFormats": ["markdown", "text"],
  "maxPages": 200
}
```

**GDPR safe RAG ingestion (PII redacted)**

```json
{
  "startUrls": ["https://support.example.com/"],
  "redactPII": true,
  "chunkOutput": true,
  "removeFluff": true,
  "minContentLength": 200
}
```

**Index a knowledge base with PDF download**

```json
{
  "startUrls": ["https://kb.example.com/"],
  "downloadFiles": true,
  "downloadFileTypes": ["pdf", "docx"],
  "maxPages": 1000
}
```

***

### Sample output

**Page row**

```json
{
  "url": "https://docs.apify.com/academy/scraping-basics-javascript",
  "loadedUrl": "https://docs.apify.com/academy/scraping-basics-javascript",
  "title": "Web scraping basics for JavaScript devs",
  "depth": 0,
  "extractor": "readability",
  "contentScore": 42.8,
  "markdown": "**Learn how to use JavaScript to extract information from websites...**\n\nIn this course we'll use JavaScript to create...",
  "text": "Learn how to use JavaScript to extract information from websites...",
  "tokens": { "estGpt": 1508, "chars": 6030 },
  "metadata": {
    "title": "Web scraping basics for JavaScript devs",
    "description": "Learn how to extract information from websites in this hands on course.",
    "author": null,
    "publishedAt": "2024-09-12T00:00:00.000Z",
    "modifiedAt": "2025-08-04T00:00:00.000Z",
    "language": "en",
    "jsonLdTypes": ["TechArticle"]
  },
  "links": { "outbound": 57, "internal": 43, "external": 14, "crawlable": 25, "samples": ["..."] },
  "crawledAt": "2026-04-25T16:00:00.000Z"
}
```

**Chunk row** (when `chunkOutput` is on)

```json
{
  "url": "https://docs.apify.com/academy/scraping-basics-javascript",
  "title": "Web scraping basics for JavaScript devs",
  "chunkIndex": 0,
  "totalChunks": 4,
  "markdown": "First 1000 token slice of the page...",
  "tokens": { "estGpt": 998, "chars": 3992 },
  "metadata": { "..." }
}
```

**File row** (when `downloadFiles` is on)

```json
{
  "url": "https://docs.example.com/whitepaper.pdf",
  "kind": "file",
  "extension": "pdf",
  "sizeBytes": 482194,
  "keyValueStoreKey": "https___docs_example_com_whitepaper_pdf-1714053000000.pdf"
}
```

***

### Who uses this

| Role | Use case |
|---|---|
| AI engineer | Index docs, knowledge bases, and blogs into a RAG pipeline. Use chunk output to skip a chunking step. |
| LLM app team | Convert customer documentation into Markdown for prompt context or fine tuning datasets. |
| Vector database operator | Pipe each chunk row straight into Pinecone, Qdrant, Weaviate, or pgvector. |
| Content team | Mirror an old website into clean Markdown for migration to a new CMS. |
| Compliance team | Redact PII at ingest time with `redactPII: true`. No post processing on your side. |
| Researcher | Pull every page from a site with metadata, then run analysis on the link graph. |

***

### Input reference

| Field | Type | What it does |
|---|---|---|
| `startUrls` | string[] | Required. Entry URLs for the crawl. |
| `crawlerType` | enum | adaptive, playwright, or cheerio. |
| `maxPages` | integer | Hard cap across all start URLs. 0 means unlimited. |
| `maxDepth` | integer | Link hops from start URL. 0 means seed only. |
| `useSitemap` | boolean | Auto discover sitemap.xml and robots.txt. |
| `respectRobotsTxt` | boolean | Skip URLs disallowed by robots.txt. |
| `includeUrlPatterns` | string[] | Glob patterns. Pages must match at least one. |
| `excludeUrlPatterns` | string[] | Glob patterns. Pages matching any are skipped. |
| `stayOnDomain` | boolean | Stay on the registrable domain of the start URL. |
| `stayOnSubdomain` | boolean | Stricter than stayOnDomain. Same hostname only. |
| `removeFluff` | boolean | Strip nav, footer, ads, and modals before extracting. |
| `extractor` | enum | auto, readability, main, or body. |
| `outputFormats` | string\[] | Any of markdown, text, html. |
| `minContentLength` | integer | Drop pages below this many characters. |
| `chunkOutput` | boolean | Split pages into RAG chunks and push one row per chunk. |
| `chunkSize` | integer | Target tokens per chunk. |
| `chunkOverlap` | integer | Tokens of overlap between consecutive chunks. |
| `redactPII` | boolean | Redact emails, phones, SSN, IBAN before output. |
| `extractMetadata` | boolean | Pull JSON LD, OpenGraph, author, publish dates. |
| `extractLinks` | boolean | Per row link graph counts and 25 samples. |
| `infiniteScroll` | boolean | Stage scroll to render lazy content. Playwright only. |
| `waitForSelector` | string | Wait for a CSS selector before extraction. Playwright only. |
| `cookies` | object[] | Cookies to set for pages behind a login. |
| `downloadFiles` | boolean | Save linked PDF, DOC, XLS files to the key value store. |
| `concurrency` | integer | Pages processed in parallel. |
| `proxyConfiguration` | object | Apify proxy. Datacenter is fine for most sites. |

***

### API call

```bash
curl -X POST \
  "https://api.apify.com/v2/acts/YOUR_USER~website-content-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": ["https://docs.example.com/"],
    "maxPages": 500,
    "chunkOutput": true,
    "chunkSize": 1000
  }'
```
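
The response contains a run object whose `defaultDatasetId` points at the results; once the run finishes, you can page through rows via the dataset items endpoint (the dataset ID below is a placeholder):

```bash
curl "https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=json&clean=true&token=YOUR_TOKEN"
```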

***

### Pricing

The first few rows per run are free so you can validate output before paying. After that, one charge per dataset row pushed. Auto chunking, token estimation, link graph, PII redaction, and metadata extraction are all included at no extra cost. File downloads count as one row each.

***

### FAQ

#### Why is this better than the official Website Content Crawler?

This Actor races three extractors and tags the winner per row, ships token estimates on every row, auto chunks for RAG with a single toggle, redacts PII at the source, and adds a link graph (internal vs external counts plus samples) without extra config.

#### Will this Actor scrape JavaScript heavy sites?

Yes. Set `crawlerType` to `playwright` or leave it on `adaptive`. The browser pool ships fingerprinted Chrome with anti detection patches. Use `infiniteScroll: true` for sites that load content as you scroll, and `waitForSelector` to wait for a specific element before extraction.
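
For example, a minimal Playwright configuration for a lazy loading page (URL and selector are placeholders):

```json
{
  "startUrls": ["https://app.example.com/feed"],
  "crawlerType": "playwright",
  "infiniteScroll": true,
  "waitForSelector": ".article-body"
}
```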

#### How accurate is the token count?

Token counts use a 4 chars per token estimate for prose and 3 chars per token for fenced code blocks, calibrated against GPT and Claude tokenizers. Real tokenizer counts will be within 5 to 10 percent on English content. Set `chunkSize` slightly under your model limit to leave headroom.
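
If you want to reproduce the estimate outside the Actor, a minimal sketch of the stated heuristic looks like this (ratios taken from the description above; the Actor's exact implementation may differ):

```python
import re

def estimate_tokens(markdown: str) -> int:
    """Rough token estimate: ~4 chars/token for prose, ~3 for fenced code."""
    code_chars = sum(
        len(block) for block in re.findall(r"```.*?```", markdown, flags=re.DOTALL)
    )
    prose_chars = len(markdown) - code_chars
    return round(prose_chars / 4 + code_chars / 3)
```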

#### Does the chunk splitter respect paragraph boundaries?

Yes. The splitter walks paragraphs and packs them into chunks until the token budget is reached. Long paragraphs that exceed the chunk size are split at sentence boundaries. Adjacent chunks share `chunkOverlap` tokens for context continuity during retrieval.
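
A simplified sketch of that packing loop, using a crude chars-to-tokens estimate and omitting the sentence-level split for oversized paragraphs:

```python
def pack_paragraphs(paragraphs: list[str], chunk_size: int = 1000,
                    overlap: int = 100) -> list[str]:
    est = lambda s: max(1, len(s) // 4)  # crude chars-to-tokens estimate
    chunks: list[str] = []
    current: list[str] = []
    tokens = 0
    for para in paragraphs:
        if current and tokens + est(para) > chunk_size:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward until ~overlap tokens are kept,
            # so adjacent chunks share context for retrieval.
            tail: list[str] = []
            tail_tokens = 0
            for prev in reversed(current):
                tail.insert(0, prev)
                tail_tokens += est(prev)
                if tail_tokens >= overlap:
                    break
            current, tokens = tail, tail_tokens
        current.append(para)
        tokens += est(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```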

#### How does PII redaction work?

Set `redactPII: true` and emails, phone numbers, US Social Security numbers, and IBAN bank account numbers are replaced with `[REDACTED_*]` tokens before output. This applies to both Markdown and plain text fields. Useful for GDPR safe RAG indexing of customer support content.
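
Conceptually the pass is a set of pattern substitutions, as in the sketch below; these regexes are illustrative simplifications, not the Actor's exact patterns.

```python
import re

# Illustrative patterns only; real-world PII detection is stricter.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```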

#### Can I crawl pages behind a login?

Yes. Pass authentication cookies in the `cookies` field. Format is an array of `{name, value, domain}` objects. The crawler sets these on every browser context before navigating.
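
For example (cookie name, value, and domain are placeholders for your own session cookie):

```json
{
  "startUrls": ["https://app.example.com/docs/"],
  "cookies": [
    { "name": "session_id", "value": "<YOUR_SESSION_COOKIE>", "domain": ".example.com" }
  ]
}
```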

#### Does it download PDF files for indexing?

Yes. Set `downloadFiles: true` and choose extensions in `downloadFileTypes`. PDFs, DOC, DOCX, XLS, XLSX, and CSV files are saved to the key value store with one dataset row per file pointing at the storage key.

#### Can I run this on a schedule?

Yes. Use the Apify scheduler for hourly, daily, or weekly runs. Combine with a sitemap to capture only new pages, or run a full crawl on a fixed cadence to refresh your vector database.

#### Is the data in the dataset compatible with LangChain or LlamaIndex?

Yes. The Markdown output, page URL, and metadata fields map directly to LangChain Document and LlamaIndex Node schemas. Use the Apify dataset reader from either framework, or pull the dataset via API and feed your own pipeline.
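
A minimal sketch of the LangChain side, assuming `langchain-core` is installed; the dataset ID is a placeholder for the run's `defaultDatasetId`:

```python
from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient("<YOUR_API_TOKEN>")

# Map each page row to a LangChain Document: markdown becomes page_content,
# the page URL and extracted metadata become the Document metadata.
docs = [
    Document(
        page_content=item.get("markdown", ""),
        metadata={"source": item["url"], **(item.get("metadata") or {})},
    )
    for item in client.dataset("YOUR_DATASET_ID").iterate_items()
]
```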

***

### Related actors

- **TripAdvisor Property Rank Tracker** — daily rank, rating, and competitor signals for hotels and restaurants
- **LinkedIn Profile & Company Post Tracker** — public LinkedIn posts without a cookie
- **LinkedIn Hiring Tracker & Salary Intelligence** — parsed salary, tech stack, seniority on every job row
- **Google Maps Scraper** — local business data with reviews
- **Reddit Brand Monitor & Lead Finder** — subreddit mentions and high intent leads

# Actor input Schema

## `startUrls` (type: `array`):

URLs to start the crawl from. Each URL is treated as the entry point for that website.

## `crawlerType` (type: `string`):

Adaptive switches between a real browser and raw HTTP based on whether the page renders content via JavaScript. Use Playwright for heavy JS sites, Cheerio for static pages.

## `maxPages` (type: `integer`):

Hard cap on pages crawled per run across all start URLs. Set to 0 for unlimited.

## `maxDepth` (type: `integer`):

How many link hops away from the start URL the crawler is allowed to follow. 0 means only the start URLs.

## `useSitemap` (type: `boolean`):

Auto discover `sitemap.xml`, `sitemap_index.xml`, and robots.txt sitemap entries to seed the crawl. Recommended for large sites.

## `respectRobotsTxt` (type: `boolean`):

Skip URLs disallowed by the site's robots.txt. Turn off for internal scraping work where you have a contract.

## `includeUrlPatterns` (type: `array`):

Glob patterns. Pages must match at least one of these to be crawled. Empty means everything on the same domain. Examples: `**/docs/**`, `**/blog/*`.

## `excludeUrlPatterns` (type: `array`):

Glob patterns. Pages matching any of these are skipped. Examples: `**/login/**`, `**/api/**`, `**/*.zip`.

## `stayOnDomain` (type: `boolean`):

Only follow links on the same registrable domain as the start URL (e.g. apify.com and docs.apify.com both count when on).

## `stayOnSubdomain` (type: `boolean`):

Only follow links on the exact same hostname as the start URL. Stricter than stayOnDomain.

## `removeFluff` (type: `boolean`):

Strip nav, footer, header, aside, ads, cookie banners, and modals before extracting content. Recommended for AI pipelines.

## `extractor` (type: `string`):

Auto picks the best result from Readability, a custom main detector, and a body fallback. Force one if your pipeline needs consistency.

## `outputFormats` (type: `array`):

Each row carries the formats you select. Markdown is the default for AI pipelines. Plain text is best for token tight LLM contexts. HTML is the cleaned post extraction HTML.

## `minContentLength` (type: `integer`):

Drop pages whose extracted content is shorter than this many characters. Useful for filtering out empty templates and 404 pages.

## `chunkOutput` (type: `boolean`):

Push one row per chunk instead of one row per page. Each chunk row carries url, chunkIndex, totalChunks, markdown, tokens, and the page metadata. Built for vector database ingestion.

## `chunkSize` (type: `integer`):

Target token count per chunk when chunkOutput is on. 800 to 1000 works well for most embedding models.

## `chunkOverlap` (type: `integer`):

Tokens of overlap between consecutive chunks. Helps preserve context at chunk boundaries during retrieval.

## `redactPII` (type: `boolean`):

Replace emails, phone numbers, IBANs, and US Social Security numbers with `[REDACTED]` tokens before output. Useful for GDPR safe RAG indexing.

## `extractMetadata` (type: `boolean`):

Pull JSON LD article and product schemas, OpenGraph tags, author, publish date, and modified date for every page.

## `extractLinks` (type: `boolean`):

Each row carries outbound link count split by internal vs external, plus a sample of up to 25 link URLs.

## `infiniteScroll` (type: `boolean`):

Scroll the page in stages so lazy loaded content renders before extraction. Playwright crawler only.

## `waitForSelector` (type: `string`):

Optional CSS selector. The crawler waits for this element before extracting content. Playwright crawler only.

## `cookies` (type: `array`):

Cookies to set before crawling. Use this for pages behind a login. Format: array of {name, value, domain}.

## `downloadFiles` (type: `boolean`):

Save linked PDF, DOC, DOCX, XLS, XLSX, and CSV files to the key value store. Useful for indexing knowledge bases and research libraries.

## `downloadFileTypes` (type: `array`):

File extensions the crawler should download when downloadFiles is on.

## `concurrency` (type: `integer`):

Pages processed in parallel. Eight is a safe default. Drop to two or three for sites with strict rate limits.

## `requestTimeoutSecs` (type: `integer`):

Per page timeout. Long pages or slow sites may need 60 to 90 seconds.

## `proxyConfiguration` (type: `object`):

Apify proxy. Datacenter is fine for most documentation sites. Use residential for sites with anti scraping protection.

## Actor input object example

```json
{
  "startUrls": [
    "https://docs.apify.com/academy/scraping-basics-javascript"
  ],
  "crawlerType": "adaptive",
  "maxPages": 25,
  "maxDepth": 3,
  "useSitemap": true,
  "respectRobotsTxt": true,
  "includeUrlPatterns": [],
  "excludeUrlPatterns": [
    "**/login/**",
    "**/signup/**",
    "**/cart/**"
  ],
  "stayOnDomain": true,
  "stayOnSubdomain": false,
  "removeFluff": true,
  "extractor": "auto",
  "outputFormats": [
    "markdown",
    "text"
  ],
  "minContentLength": 100,
  "chunkOutput": false,
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "redactPII": false,
  "extractMetadata": true,
  "extractLinks": true,
  "infiniteScroll": false,
  "waitForSelector": "",
  "cookies": [],
  "downloadFiles": false,
  "downloadFileTypes": [
    "pdf",
    "doc",
    "docx"
  ],
  "concurrency": 8,
  "requestTimeoutSecs": 45,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://docs.apify.com/academy/scraping-basics-javascript"
    ],
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ]
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("scrapemint/website-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": ["https://docs.apify.com/academy/scraping-basics-javascript"],
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}

# Run the Actor and wait for it to finish
run = client.actor("scrapemint/website-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    "https://docs.apify.com/academy/scraping-basics-javascript"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}' |
apify call scrapemint/website-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=scrapemint/website-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Crawler",
        "description": "Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.",
        "version": "0.1",
        "x-build-id": "g4fnZZKLCNWKg5PeM"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/scrapemint~website-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-scrapemint-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/scrapemint~website-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-scrapemint-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/scrapemint~website-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-scrapemint-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "URLs to start the crawl from. Each URL is treated as the entry point for that website.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "crawlerType": {
                        "title": "Crawler type",
                        "enum": [
                            "adaptive",
                            "playwright",
                            "cheerio"
                        ],
                        "type": "string",
                        "description": "Adaptive switches between a real browser and raw HTTP based on whether the page renders content via JavaScript. Use Playwright for heavy JS sites, Cheerio for static pages.",
                        "default": "adaptive"
                    },
                    "maxPages": {
                        "title": "Max pages",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Hard cap on pages crawled per run across all start URLs. Set to 0 for unlimited.",
                        "default": 25
                    },
                    "maxDepth": {
                        "title": "Max link depth",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "How many link hops away from the start URL the crawler is allowed to follow. 0 means only the start URLs.",
                        "default": 3
                    },
                    "useSitemap": {
                        "title": "Use sitemap",
                        "type": "boolean",
                        "description": "Auto discover sitemap.xml, sitemap_index.xml, and robots.txt sitemap entries to seed the crawl. Recommended for large sites.",
                        "default": true
                    },
                    "respectRobotsTxt": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "Skip URLs disallowed by the site's robots.txt. Turn off for internal scraping work where you have a contract.",
                        "default": true
                    },
                    "includeUrlPatterns": {
                        "title": "Include URL patterns",
                        "type": "array",
                        "description": "Glob patterns. Pages must match at least one of these to be crawled. Empty means everything on the same domain. Examples: '**/docs/**', '**/blog/*'.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "excludeUrlPatterns": {
                        "title": "Exclude URL patterns",
                        "type": "array",
                        "description": "Glob patterns. Pages matching any of these are skipped. Examples: '**/login/**', '**/api/**', '**/*.zip'.",
                        "default": [
                            "**/login/**",
                            "**/signup/**",
                            "**/cart/**"
                        ],
                        "items": {
                            "type": "string"
                        }
                    },
                    "stayOnDomain": {
                        "title": "Stay on the same domain",
                        "type": "boolean",
                        "description": "Only follow links on the same registrable domain as the start URL (e.g. apify.com and docs.apify.com both count when on).",
                        "default": true
                    },
                    "stayOnSubdomain": {
                        "title": "Stay on the same subdomain",
                        "type": "boolean",
                        "description": "Only follow links on the exact same hostname as the start URL. Stricter than stayOnDomain.",
                        "default": false
                    },
                    "removeFluff": {
                        "title": "Remove navigation, footers, ads, and modals",
                        "type": "boolean",
                        "description": "Strip nav, footer, header, aside, ads, cookie banners, and modals before extracting content. Recommended for AI pipelines.",
                        "default": true
                    },
                    "extractor": {
                        "title": "Main content extractor",
                        "enum": [
                            "auto",
                            "readability",
                            "main",
                            "body"
                        ],
                        "type": "string",
                        "description": "Auto picks the best result from Readability, a custom main detector, and a body fallback. Force one if your pipeline needs consistency.",
                        "default": "auto"
                    },
                    "outputFormats": {
                        "title": "Output formats per page",
                        "type": "array",
                        "description": "Each row carries the formats you select. Markdown is the default for AI pipelines. Plain text is best for token tight LLM contexts. HTML is the cleaned post extraction HTML.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "markdown",
                                "text",
                                "html"
                            ],
                            "enumTitles": [
                                "Markdown",
                                "Plain text",
                                "Cleaned HTML"
                            ]
                        },
                        "default": [
                            "markdown",
                            "text"
                        ]
                    },
                    "minContentLength": {
                        "title": "Minimum content length",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Drop pages whose extracted content is shorter than this many characters. Useful for filtering out empty templates and 404 pages.",
                        "default": 100
                    },
                    "chunkOutput": {
                        "title": "Auto split into RAG chunks",
                        "type": "boolean",
                        "description": "Push one row per chunk instead of one row per page. Each chunk row carries url, chunkIndex, totalChunks, markdown, tokens, and the page metadata. Built for vector database ingestion.",
                        "default": false
                    },
                    "chunkSize": {
                        "title": "Chunk size in tokens",
                        "minimum": 100,
                        "maximum": 8192,
                        "type": "integer",
                        "description": "Target token count per chunk when chunkOutput is on. 800 to 1000 works well for most embedding models.",
                        "default": 1000
                    },
                    "chunkOverlap": {
                        "title": "Chunk overlap in tokens",
                        "minimum": 0,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Tokens of overlap between consecutive chunks. Helps preserve context at chunk boundaries during retrieval.",
                        "default": 100
                    },
                    "redactPII": {
                        "title": "Redact PII",
                        "type": "boolean",
                        "description": "Replace emails, phone numbers, IBANs, and US Social Security numbers with [REDACTED] tokens before output. Useful for GDPR safe RAG indexing.",
                        "default": false
                    },
                    "extractMetadata": {
                        "title": "Extract metadata",
                        "type": "boolean",
                        "description": "Pull JSON LD article and product schemas, OpenGraph tags, author, publish date, and modified date for every page.",
                        "default": true
                    },
                    "extractLinks": {
                        "title": "Extract link graph",
                        "type": "boolean",
                        "description": "Each row carries outbound link count split by internal vs external, plus a sample of up to 25 link URLs.",
                        "default": true
                    },
                    "infiniteScroll": {
                        "title": "Trigger infinite scroll",
                        "type": "boolean",
                        "description": "Scroll the page in stages so lazy loaded content renders before extraction. Playwright crawler only.",
                        "default": false
                    },
                    "waitForSelector": {
                        "title": "Wait for selector",
                        "type": "string",
                        "description": "Optional CSS selector. The crawler waits for this element before extracting content. Playwright crawler only.",
                        "default": ""
                    },
                    "cookies": {
                        "title": "Cookies",
                        "type": "array",
                        "description": "Cookies to set before crawling. Use this for pages behind a login. Format: array of {name, value, domain}.",
                        "default": []
                    },
                    "downloadFiles": {
                        "title": "Download linked files",
                        "type": "boolean",
                        "description": "Save linked PDF, DOC, DOCX, XLS, XLSX, and CSV files to the key value store. Useful for indexing knowledge bases and research libraries.",
                        "default": false
                    },
                    "downloadFileTypes": {
                        "title": "File extensions to download",
                        "type": "array",
                        "description": "File extensions the crawler should download when downloadFiles is on.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "pdf",
                                "doc",
                                "docx",
                                "xls",
                                "xlsx",
                                "csv",
                                "txt",
                                "json",
                                "xml"
                            ]
                        },
                        "default": [
                            "pdf",
                            "doc",
                            "docx"
                        ]
                    },
                    "concurrency": {
                        "title": "Concurrency",
                        "minimum": 1,
                        "maximum": 64,
                        "type": "integer",
                        "description": "Pages processed in parallel. Eight is a safe default. Drop to two or three for sites with strict rate limits.",
                        "default": 8
                    },
                    "requestTimeoutSecs": {
                        "title": "Request timeout in seconds",
                        "minimum": 5,
                        "maximum": 600,
                        "type": "integer",
                        "description": "Per page timeout. Long pages or slow sites may need 60 to 90 seconds.",
                        "default": 45
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Apify proxy. Datacenter is fine for most documentation sites. Use residential for sites with anti scraping protection.",
                        "default": {
                            "useApifyProxy": true
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
