# Website Content Scraper (`qaseemiqbal/website-content-scraper`) Actor

Extract clean Markdown, plain text, linked files, and RAG-ready chunks from websites, documentation, help centers, knowledge bases, and authenticated portals. Preserve structure, metadata, URLs, and crawl context for AI search, training, and retrieval workflows.

- **URL:** https://apify.com/qaseemiqbal/website-content-scraper.md
- **Developed by:** [Muhammad Qaseem Iqbal](https://apify.com/qaseemiqbal) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating:** No ratings yet

## Pricing

From $0.10 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation, available as a [Markdown index](https://docs.apify.com/llms.txt) and as [Markdown full text](https://docs.apify.com/llms-full.txt).


# README

## Website Content Scraper

Website Content Scraper turns websites into clean, structured content that you can use in AI apps, search tools, knowledge bases, documentation workflows, and data exports.

Give it a website URL and it will crawl the site, remove common clutter, extract useful page text, create clean Markdown, download supported files, and prepare smaller text chunks that are easier for AI tools to search and understand.

It works well for:

- documentation sites
- help centers and knowledge bases
- blogs and article libraries
- product websites
- public portals
- authenticated pages when you provide cookies or headers

### Why use Website Content Scraper?

Most websites are designed for people, not for AI systems or clean data exports. A page can include menus, banners, cookie popups, repeated footers, scripts, and links that are not useful for your final dataset.

This Actor helps by collecting the useful content and organizing it into records you can export as JSON, CSV, Excel, XML, or other Apify dataset formats.

Common use cases include:

- Build a chatbot that answers questions from your website or docs.
- Create a search index for internal or customer-facing support.
- Export documentation pages to Markdown or plain text.
- Feed website content into a vector database or AI workflow.
- Track changed, unchanged, or deleted pages across repeat crawls.
- Download and parse linked documents such as PDFs, spreadsheets, and JSON files.

### Main features

- Crawl one page, one section, or a larger website.
- Extract clean text and Markdown from web pages.
- Create AI-ready chunks, which are smaller pieces of content for search and chatbot systems.
- Download and parse linked files, including PDF, DOCX, XLSX, CSV, TSV, Markdown, JSON, XML, and text files.
- Discover extra URLs from sitemaps and `llms.txt` files.
- Respect `robots.txt` by default.
- Use fast crawling for simple sites and browser crawling for JavaScript-heavy pages.
- Crawl pages behind login when you provide cookies or request headers.
- Save run summaries, skipped URL diagnostics, and sync manifests.
- Support incremental recrawls, so you can skip unchanged content in scheduled runs.

### How it works

Website Content Scraper works in four simple steps:

1. **Find pages**

   The Actor starts from the URLs you provide. It follows links that are in scope, can read sitemaps, and can use `llms.txt` files when available.

2. **Clean the page**

   It removes common noise such as navigation, scripts, repeated layout content, and other page clutter where possible.

3. **Extract content**

   It saves the page as clean text, Markdown, and optionally cleaned HTML. It can also download and parse supported linked files.

4. **Prepare results**

   It writes page records, file records, and AI-ready chunks to the dataset. You can export the data or connect it to another workflow.

### Quick start

For your first run, start small. You can increase the limits after you check the results.

```json
{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "crawlerType": "cheerio",
  "crawlScope": "startUrlPath",
  "maxCrawlPages": 25,
  "maxResults": 25,
  "discoverSitemaps": false,
  "discoverLlmsTxt": false,
  "discoverLlmsFullTxt": false,
  "saveMarkdown": true,
  "saveText": false,
  "createChunks": false,
  "saveFiles": false,
  "parseFiles": false,
  "maxFiles": 0,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
````

For many documentation and help sites, `cheerio` is the best first choice because it is fast and cost-efficient. Turn on sitemap discovery, chunks, file parsing, or browser rendering only when the first small run shows that you need them.

For AI search or chatbot workflows, use the `rag` preset or enable `createChunks`. For linked PDFs, spreadsheets, or JSON files, enable `saveFiles`, `parseFiles`, and set `maxFiles` to a small number first.
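
As a rough illustration of the advice above, the following Python sketch starts from a small first-run input and layers the chunk and file options on top. The helper name `build_rag_input` and the `max_files=5` default are assumptions made for this example; the input field names come from the settings described in this README.

```python
from copy import deepcopy

# Small first-run input, mirroring the quick start above.
QUICK_START_INPUT = {
    "startUrls": [{"url": "https://docs.apify.com/"}],
    "crawlerType": "cheerio",
    "crawlScope": "startUrlPath",
    "maxCrawlPages": 25,
    "maxResults": 25,
    "saveMarkdown": True,
    "createChunks": False,
    "saveFiles": False,
    "parseFiles": False,
    "maxFiles": 0,
}


def build_rag_input(base, include_files=False, max_files=5):
    """Return a copy of `base` with chunking (and optionally file parsing) enabled.

    `build_rag_input` is a hypothetical helper, not part of the Actor or the client.
    """
    run_input = deepcopy(base)  # leave the original input untouched
    run_input["createChunks"] = True
    if include_files:
        run_input["saveFiles"] = True
        run_input["parseFiles"] = True
        run_input["maxFiles"] = max_files  # keep this small for the first test
    return run_input


rag_input = build_rag_input(QUICK_START_INPUT, include_files=True)
```

Keeping the small input as a separate base makes it easy to compare a cheap page-only run with a chunked run on the same URLs before raising any limits.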

### Example output

The dataset contains different types of records. The most important field is `recordType`.

#### Page record

A `page` record represents one crawled web page.

```json
{
  "recordType": "page",
  "url": "https://docs.example.com/getting-started",
  "title": "Getting started",
  "markdown": "# Getting started\n\nThis guide explains...",
  "text": "Getting started\n\nThis guide explains...",
  "contentQuality": {
    "confidence": 0.98,
    "wordCount": 1240,
    "isThin": false
  }
}
```

#### Chunk record

A `chunk` record is a smaller piece of a page or file. These records are useful for AI search, chatbots, and retrieval workflows.

```json
{
  "recordType": "chunk",
  "url": "https://docs.example.com/getting-started",
  "title": "Getting started",
  "headingPath": ["Getting started", "Install"],
  "text": "Install the package and configure your project...",
  "tokenEstimate": 420
}
```

#### File record

A `file` record represents a downloaded or parsed file linked from a page.

```json
{
  "recordType": "file",
  "url": "https://docs.example.com/api/openapi.json",
  "title": "JSON",
  "metadata": {
    "contentType": "application/json",
    "byteLength": 968704
  }
}
```

### Understanding the results

Use `recordType` to filter the dataset:

| Record type | What it means | When to use it |
| --- | --- | --- |
| `page` | A full crawled web page | Markdown export, content review, documentation migration |
| `chunk` | A smaller text section | AI search, chatbots, vector databases, RAG workflows |
| `file` | A downloaded or parsed linked file | File archives, API specs, PDFs, spreadsheets |
| `skipped` | A URL skipped by the Actor | Debugging crawl limits or URL scope |
| `tombstone` | A previously seen item that disappeared | Incremental sync and delete handling |

Apify dataset views select useful columns, but they do not filter rows by type. For page-only, chunk-only, or file-only exports, filter by `recordType`.
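
One way to do that filtering after download is sketched below in Python. It assumes the dataset items are already loaded as a list of dictionaries; the helper name `group_by_record_type` and the sample items are illustrative only.

```python
from collections import defaultdict


def group_by_record_type(items):
    """Group dataset items (dicts) by their `recordType` field."""
    groups = defaultdict(list)
    for item in items:
        # Items without a recordType are collected under "unknown".
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)


# Sample items shaped like the example records above.
sample_items = [
    {"recordType": "page", "url": "https://docs.example.com/getting-started"},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
    {"recordType": "file", "url": "https://docs.example.com/api/openapi.json"},
]

groups = group_by_record_type(sample_items)
pages = groups.get("page", [])
chunks = groups.get("chunk", [])
```

With the official Python client, the items could come from `client.dataset(run["defaultDatasetId"]).iterate_items()`, as in the Python example in the API section.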

### Input settings explained

| Setting | Plain-language description |
| --- | --- |
| `startUrls` | The page or website section where the crawl starts. |
| `crawlScope` | Controls which links are allowed. `startUrlPath` is safest for one docs section or blog section. |
| `maxCrawlPages` | Maximum number of page requests the crawler will process. |
| `maxResults` | Maximum number of page records saved to the dataset. |
| `crawlerType` | Choose fast crawling, adaptive crawling, or browser crawling. |
| `maxBrowserFallbacks` | Caps how many pages adaptive mode may retry in a browser. |
| `discoverSitemaps` | Finds more URLs from sitemap files. Leave off for the cheapest first run. |
| `discoverLlmsTxt` | Finds URLs from `llms.txt` files when a site provides them. Leave off unless you need extra discovery. |
| `discoverLlmsFullTxt` | Also reads `llms-full.txt`; keep off unless you want a larger crawl. |
| `saveMarkdown` | Saves page content in Markdown format. |
| `saveText` | Saves page content as plain text. Turn off when Markdown is enough. |
| `createChunks` | Splits content into smaller AI-friendly records. Useful for RAG, but creates more dataset rows. |
| `saveFiles` | Downloads supported linked files. Leave off unless you need file archives. |
| `parseFiles` | Extracts text from supported linked files. Leave off unless you need PDF, spreadsheet, or document text. |
| `maxFiles` | Limits how many linked files are processed. |
| `cookies` | Secret cookie string for logged-in pages. |
| `requestHeaders` | Secret custom headers for authenticated or special requests. |

### Crawler types

| Crawler type | Best for |
| --- | --- |
| `cheerio` | Fast crawling of static pages, docs, blogs, and help centers. |
| `adaptive` | Starts fast and falls back to browser rendering when needed. |
| `playwright-firefox` | Pages that need a real browser, JavaScript, or login flows. |
| `playwright-chromium` | Browser crawling with Chromium. |

Browser crawling is more powerful, but usually slower and more expensive. Start with `cheerio` unless the website content does not appear in the results.

### AI and chatbot use cases

This Actor is especially useful when you want AI to answer questions from website content.

Examples:

- Customer support chatbot trained on a help center.
- Internal assistant that searches company documentation.
- Product copilot that answers questions from API docs.
- Custom GPT knowledge files created from website pages.
- Vector database ingestion for tools such as Pinecone, Qdrant, Weaviate, or similar systems.

If you are not familiar with the term RAG, it simply means giving an AI model relevant information from your own content before it answers a question. The `chunk` records are designed for that kind of workflow.
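
The sketch below shows one possible way to turn `chunk` records into id/text/metadata dictionaries before embedding them. It uses only the fields shown in the example chunk record. The Actor's own chunk ID field is not shown in that example, so the ID here is derived locally and is an assumption, as are the helper names and the output layout, which would need adapting to the vector database in use.

```python
import hashlib


def chunk_to_document(chunk, position):
    """Convert one `chunk` record into an id/text/metadata dict for embedding."""
    # Hypothetical ID: hash of the URL plus the chunk's position in the export.
    raw_id = f"{chunk['url']}#{position}"
    return {
        "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
        "text": chunk["text"],
        "metadata": {
            "url": chunk["url"],
            "title": chunk.get("title", ""),
            "section": " > ".join(chunk.get("headingPath", [])),
            "tokenEstimate": chunk.get("tokenEstimate"),
        },
    }


def prepare_documents(items):
    """Keep only chunk records and convert them, preserving dataset order."""
    chunks = [item for item in items if item.get("recordType") == "chunk"]
    return [chunk_to_document(chunk, i) for i, chunk in enumerate(chunks)]


sample_items = [
    {"recordType": "page", "url": "https://docs.example.com/getting-started"},
    {
        "recordType": "chunk",
        "url": "https://docs.example.com/getting-started",
        "title": "Getting started",
        "headingPath": ["Getting started", "Install"],
        "text": "Install the package and configure your project...",
        "tokenEstimate": 420,
    },
]

documents = prepare_documents(sample_items)
```

Carrying the URL and heading path in the metadata lets a chatbot cite the page and section an answer came from.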

### Incremental crawling

If you run the Actor on a schedule, you may not want to process the same unchanged content every time.

Use incremental mode to track what changed:

```json
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "incrementalMode": "readWriteState",
  "stateKey": "docs-production",
  "skipUnchanged": true,
  "emitDeletedRecords": true
}
```

The Actor stores content hashes in the key-value store. On future runs, it can identify new, changed, unchanged, and deleted content.
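
To make the idea concrete, here is a minimal Python sketch of hash-based change detection. It is an illustration of the general technique, not the Actor's implementation: the `{url: hash}` state layout, the use of SHA-256, and the function names are all assumptions.

```python
import hashlib


def content_hash(text):
    """Hash page text so that any content change produces a different value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_changes(previous, current):
    """Compare two {url: hash} mappings and bucket URLs by status."""
    result = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for url, digest in current.items():
        if url not in previous:
            result["new"].append(url)
        elif previous[url] != digest:
            result["changed"].append(url)
        else:
            result["unchanged"].append(url)
    # URLs stored by an earlier run but not seen now are candidates for tombstones.
    result["deleted"] = [url for url in previous if url not in current]
    return result


previous_state = {
    "https://docs.example.com/a": content_hash("Intro v1"),
    "https://docs.example.com/b": content_hash("Setup"),
    "https://docs.example.com/c": content_hash("Old page"),
}
current_state = {
    "https://docs.example.com/a": content_hash("Intro v2"),
    "https://docs.example.com/b": content_hash("Setup"),
    "https://docs.example.com/d": content_hash("New page"),
}

status = classify_changes(previous_state, current_state)
```

In this picture, `skipUnchanged` corresponds to not writing the "unchanged" bucket, and `emitDeletedRecords` corresponds to writing `tombstone` records for the "deleted" bucket.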

### Authenticated websites

For private pages or customer portals, provide cookies or request headers in the input.

These fields are marked as secret inputs:

- `cookies`
- `requestHeaders`

They are not written to dataset records or logs. You can also provide `loginValidationUrl` to check that authentication works before the crawl continues.
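
A cautious way to supply these values from your own code is to read them from environment variables rather than hard-coding them. The Python sketch below assumes two hypothetical variable names, `PORTAL_COOKIES` and `PORTAL_API_TOKEN`, and a hypothetical helper `build_authenticated_input`; the bearer-token header is only an example of what `requestHeaders` might contain.

```python
import os


def build_authenticated_input(start_url, login_validation_url=None, env=None):
    """Build an Actor input that takes secrets from environment variables."""
    env = os.environ if env is None else env
    run_input = {
        "startUrls": [{"url": start_url}],
        "crawlScope": "startUrlPath",
        "maxCrawlPages": 25,
        "maxResults": 25,
    }
    # PORTAL_COOKIES and PORTAL_API_TOKEN are hypothetical variable names.
    cookies = env.get("PORTAL_COOKIES")
    if cookies:
        run_input["cookies"] = cookies  # a Cookie header string
    token = env.get("PORTAL_API_TOKEN")
    if token:
        run_input["requestHeaders"] = {"Authorization": f"Bearer {token}"}
    if login_validation_url:
        run_input["loginValidationUrl"] = login_validation_url
    return run_input


example_input = build_authenticated_input(
    "https://portal.example.com/docs/",
    login_validation_url="https://portal.example.com/account",
    env={"PORTAL_COOKIES": "session=abc123"},
)
```

Passing `env` explicitly, as in the example call, keeps the helper easy to test without touching real credentials.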

### How much does it cost?

The cost depends on:

- how many pages you crawl,
- how many files you download or parse,
- whether you use browser crawling,
- how much data is written to datasets and key-value stores.

Tips to control cost:

- Start with `maxCrawlPages` and `maxResults` set to 25.
- Keep `discoverLlmsFullTxt` off unless you need it.
- Keep `discoverSitemaps` and `discoverLlmsTxt` off for the first test run.
- Use `cheerio` for static sites.
- Use `createChunks` only when you need AI search or chatbot-ready records.
- Keep `saveFiles` and `parseFiles` off unless linked files matter.
- Turn off `saveHtml` and `saveScreenshots` unless you need them.
- Set `maxFiles` to a small number, such as 5 or 10, before processing many files.

### Troubleshooting

#### I only got navigation or very little text

Try `adaptive` or a Playwright crawler. The page may need JavaScript rendering. You can also use `keepElementsCssSelector` to tell the Actor which part of the page to keep.

#### I got too many pages

Use a narrower URL in `startUrls`, keep `crawlScope` set to `startUrlPath`, or add patterns to `excludeUrlGlobs`.

#### I did not get enough pages

Increase `maxCrawlPages`, `maxResults`, and `maxCrawlDepth`. Also enable `discoverSitemaps`, which is off in the quick-start input.

#### My files are missing

Make sure `saveFiles` and `parseFiles` are enabled, and increase `maxFiles` if the site links to many files.

#### Some pages have low confidence scores

Low scores are common for index pages, category pages, and navigation-heavy pages. For AI workflows, the detailed content pages and `chunk` records are usually more useful.
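
If you want to set low-scoring pages aside for review, a small filter over the `contentQuality` field can help. The Python sketch below relies on the `confidence` and `isThin` fields from the example page record; the `0.5` threshold, the helper name `split_by_quality`, and the second sample record are made up for illustration.

```python
def split_by_quality(items, min_confidence=0.5):
    """Separate page records into likely-useful and needs-review groups."""
    keep, review = [], []
    for item in items:
        if item.get("recordType") != "page":
            continue  # chunks, files, and other records are ignored here
        quality = item.get("contentQuality") or {}
        confidence = quality.get("confidence", 0.0)
        if quality.get("isThin") or confidence < min_confidence:
            review.append(item)
        else:
            keep.append(item)
    return keep, review


sample_items = [
    {
        "recordType": "page",
        "url": "https://docs.example.com/getting-started",
        "contentQuality": {"confidence": 0.98, "wordCount": 1240, "isThin": False},
    },
    {
        # Invented values standing in for a navigation-heavy index page.
        "recordType": "page",
        "url": "https://docs.example.com/",
        "contentQuality": {"confidence": 0.31, "wordCount": 40, "isThin": True},
    },
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
]

keep, review = split_by_quality(sample_items)
```

The right threshold depends on the site, so it is worth reading a few records from the review group before discarding them.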

#### The website blocks the crawler

Try a browser crawler and configure proxies in Apify. Some sites require stronger crawling settings than simple HTTP crawling.

### Limitations

- Legacy `.doc` files can be downloaded but are not text-extracted.
- Very large files may be skipped based on `fileMaxSizeMb`.
- Browser crawling is slower and may cost more than fast HTTP crawling.
- `llms.txt` and `llms-full.txt` are used for discovery, not saved as normal file records.
- Results depend on the structure and accessibility of the target website.

### Best practices

- Test with a small crawl before running a large one.
- Review a few `page` records to confirm the extracted text looks right.
- Use `chunk` records for chatbot and vector database workflows.
- Use `page` records for full Markdown or text exports.
- Use `skipped` records to understand why URLs were not saved.
- Save a tested input as an Apify Task for repeat use.

# Actor input Schema

## `startUrls` (type: `array`):

One or more pages, website sections, or documents to crawl.

## `preset` (type: `string`):

Applies sensible defaults for common workflows. Explicit fields still override the preset.

## `crawlerType` (type: `string`):

Choose whether to use fast HTTP crawling, browser rendering, or adaptive HTTP with browser fallback. Fast HTTP parser is the cheapest option.

## `maxBrowserFallbacks` (type: `integer`):

Maximum number of pages adaptive mode may retry in a browser. Lower values keep runs cheaper.

## `crawlScope` (type: `string`):

Controls which discovered URLs are considered in scope for crawling.

## `includeUrlGlobs` (type: `array`):

Only URLs matching at least one glob are crawled when provided, or when crawl scope is customGlobs.

## `excludeUrlGlobs` (type: `array`):

URLs matching any glob are skipped.

## `maxCrawlPages` (type: `integer`):

Maximum number of page requests the crawler will process before stopping.

## `maxCrawlDepth` (type: `integer`):

Maximum link depth from the start URLs. Use 0 to crawl only the start URLs.

## `maxResults` (type: `integer`):

Maximum number of page records to save to the default dataset.

## `maxConcurrency` (type: `integer`):

Maximum number of requests processed in parallel. Lower concurrency is gentler and can reduce memory usage.

## `requestHandlerTimeoutSecs` (type: `integer`):

Maximum time allowed for each page or file request handler.

## `respectRobotsTxtFile` (type: `boolean`):

Recommended for broad public crawls.

## `discoverSitemaps` (type: `boolean`):

Discover URLs from sitemap.xml files and sitemap declarations in robots.txt. Turn on for broader crawls.

## `discoverLlmsTxt` (type: `boolean`):

Discover URLs from llms.txt and llms-full.txt files when they are available. Turn on when you want extra discovery.

## `discoverLlmsFullTxt` (type: `boolean`):

Also fetch llms-full.txt when discovering LLM-oriented links. This can be large, so it is disabled by default.

## `discoveryOnly` (type: `boolean`):

Emit discovered URL decisions without saving page content.

## `saveMarkdown` (type: `boolean`):

Save cleaned Markdown for each page record.

## `saveText` (type: `boolean`):

Save cleaned plain text for each page record.

## `saveHtml` (type: `boolean`):

Save cleaned HTML for each page record. Useful for debugging and migration workflows.

## `saveScreenshots` (type: `boolean`):

Save screenshots for browser-rendered pages to the default key-value store.

## `createChunks` (type: `boolean`):

Create RAG-ready chunk records with stable IDs, hashes, and token estimates.

## `outputMode` (type: `string`):

Choose whether to emit full page content, chunk records, or both.

## `chunkTargetTokens` (type: `integer`):

Target token count for each generated RAG chunk.

## `chunkOverlapTokens` (type: `integer`):

Approximate token overlap between adjacent chunks when splitting long sections.

## `chunkMaxChars` (type: `integer`):

Hard character limit for a generated chunk.

## `incrementalMode` (type: `string`):

Controls whether previous crawl state is read and updated in the key-value store.

## `stateKey` (type: `string`):

Stable key used to store URL and content hashes between scheduled runs.

## `skipUnchanged` (type: `boolean`):

Skip writing unchanged page and file records when incremental state shows the content did not change.

## `emitUnchangedRecords` (type: `boolean`):

Write unchanged records to the dataset instead of only counting them in the summary.

## `emitDeletedRecords` (type: `boolean`):

Emit tombstone records for documents that existed in prior state but were not seen in this run.

## `saveFiles` (type: `boolean`):

Download supported linked files to the default key-value store. Leave off for the cheapest page-only crawls.

## `parseFiles` (type: `boolean`):

Extract text from supported linked files and emit file/chunk records. Leave off unless you need linked documents.

## `fileMaxSizeMb` (type: `integer`):

Maximum file size that can be downloaded or parsed.

## `maxFiles` (type: `integer`):

Maximum number of linked files to process during the run.

## `removeElementsCssSelector` (type: `string`):

Additional CSS selectors to remove before extraction.

## `keepElementsCssSelector` (type: `string`):

When set, extraction keeps only matching elements.

## `waitForSelector` (type: `string`):

Browser modes wait for this selector before extracting content.

## `dynamicContentWaitSecs` (type: `integer`):

Extra wait time in browser mode before extracting content.

## `scrollToBottom` (type: `boolean`):

Scroll pages in browser mode to load infinite or lazy-loaded content.

## `clickElementsCssSelector` (type: `string`):

CSS selector for expandable elements to click before extraction in browser mode.

## `closeCookieModals` (type: `boolean`):

Attempt to close common cookie consent modals in browser mode.

## `blockMedia` (type: `boolean`):

Block image, font, and media requests in browser mode to reduce cost and speed up crawls.

## `requestHeaders` (type: `object`):

Optional headers for authenticated or signed requests. Values are treated as secrets.

## `cookies` (type: `string`):

Cookie header string for authenticated crawls. Stored as a secret input and never emitted.

## `loginValidationUrl` (type: `string`):

Optional URL to validate authenticated access before the crawl.

## `debugMode` (type: `boolean`):

Enable extra diagnostics such as skipped URL records and debug artifacts.

## `saveSkippedUrls` (type: `boolean`):

Write skipped URL diagnostics to the default dataset.

## `integrationTarget` (type: `string`):

Optional post-run integration handoff target.

## `customWebhookUrl` (type: `string`):

Webhook URL used when the custom webhook integration target is selected.

## `proxyConfiguration` (type: `object`):

Proxy settings used by Crawlee and Apify when fetching pages.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://docs.apify.com/"
    }
  ],
  "preset": "simple",
  "crawlerType": "cheerio",
  "maxBrowserFallbacks": 2,
  "crawlScope": "startUrlPath",
  "includeUrlGlobs": [],
  "excludeUrlGlobs": [],
  "maxCrawlPages": 25,
  "maxCrawlDepth": 3,
  "maxResults": 25,
  "maxConcurrency": 4,
  "requestHandlerTimeoutSecs": 45,
  "respectRobotsTxtFile": true,
  "discoverSitemaps": false,
  "discoverLlmsTxt": false,
  "discoverLlmsFullTxt": false,
  "discoveryOnly": false,
  "saveMarkdown": true,
  "saveText": false,
  "saveHtml": false,
  "saveScreenshots": false,
  "createChunks": false,
  "outputMode": "fullContent",
  "chunkTargetTokens": 700,
  "chunkOverlapTokens": 40,
  "chunkMaxChars": 5000,
  "incrementalMode": "disabled",
  "stateKey": "default",
  "skipUnchanged": true,
  "emitUnchangedRecords": false,
  "emitDeletedRecords": false,
  "saveFiles": false,
  "parseFiles": false,
  "fileMaxSizeMb": 10,
  "maxFiles": 0,
  "removeElementsCssSelector": "",
  "keepElementsCssSelector": "",
  "waitForSelector": "",
  "dynamicContentWaitSecs": 0,
  "scrollToBottom": false,
  "clickElementsCssSelector": "",
  "closeCookieModals": false,
  "blockMedia": true,
  "loginValidationUrl": "",
  "debugMode": false,
  "saveSkippedUrls": false,
  "integrationTarget": "none",
  "customWebhookUrl": "",
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# Actor output Schema

## `results` (type: `string`):

All page, chunk, file, diagnostic, and tombstone records in the default dataset.

## `pages` (type: `string`):

Default dataset with page-oriented columns selected. Filter by recordType=page for page-only exports.

## `chunks` (type: `string`):

Default dataset with chunk-oriented columns selected. Filter by recordType=chunk for chunk-only exports.

## `summary` (type: `string`):

JSON summary stored in the default key-value store.

## `syncManifest` (type: `string`):

Incremental sync manifest stored in the default key-value store.

## `artifacts` (type: `string`):

All key-value store records, including large content, files, and screenshots.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://docs.apify.com/"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("qaseemiqbal/website-content-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://docs.apify.com/" }] }

# Run the Actor and wait for it to finish
run = client.actor("qaseemiqbal/website-content-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://docs.apify.com/"
    }
  ]
}' |
apify call qaseemiqbal/website-content-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=qaseemiqbal/website-content-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Scraper",
        "description": "Extract clean Markdown, plain text, linked files, and RAG-ready chunks from websites, documentation, help centers, knowledge bases, and authenticated portals. Preserve structure, metadata, URLs, and crawl context for AI search, training, and retrieval workflows.",
        "version": "0.1",
        "x-build-id": "EUKMqJlsWpqLZFmG5"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/qaseemiqbal~website-content-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-qaseemiqbal-website-content-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/qaseemiqbal~website-content-scraper/runs": {
            "post": {
                "operationId": "runs-sync-qaseemiqbal-website-content-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/qaseemiqbal~website-content-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-qaseemiqbal-website-content-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "One or more pages, website sections, or documents to crawl.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "preset": {
                        "title": "Workflow preset",
                        "enum": [
                            "simple",
                            "rag",
                            "docsMigration",
                            "debug",
                            "incremental",
                            "authenticatedPortal"
                        ],
                        "type": "string",
                        "description": "Applies sensible defaults for common workflows. Explicit fields still override the preset.",
                        "default": "simple"
                    },
                    "crawlerType": {
                        "title": "Crawler type",
                        "enum": [
                            "adaptive",
                            "cheerio",
                            "playwright-firefox",
                            "playwright-chromium"
                        ],
                        "type": "string",
                        "description": "Choose whether to use fast HTTP crawling, browser rendering, or adaptive HTTP with browser fallback. Fast HTTP parser is the cheapest option.",
                        "default": "cheerio"
                    },
                    "maxBrowserFallbacks": {
                        "title": "Maximum adaptive browser fallbacks",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of pages adaptive mode may retry in a browser. Lower values keep runs cheaper.",
                        "default": 2
                    },
                    "crawlScope": {
                        "title": "Crawl scope",
                        "enum": [
                            "startUrlPath",
                            "sameHostname",
                            "sameDomain",
                            "customGlobs",
                            "exactUrlsOnly"
                        ],
                        "type": "string",
                        "description": "Controls which discovered URLs are considered in scope for crawling.",
                        "default": "startUrlPath"
                    },
                    "includeUrlGlobs": {
                        "title": "Include URL globs",
                        "type": "array",
                        "description": "Only URLs matching at least one glob are crawled when provided, or when crawl scope is customGlobs.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "excludeUrlGlobs": {
                        "title": "Exclude URL globs",
                        "type": "array",
                        "description": "URLs matching any glob are skipped.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxCrawlPages": {
                        "title": "Maximum pages to crawl",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of page requests the crawler will process before stopping.",
                        "default": 25
                    },
                    "maxCrawlDepth": {
                        "title": "Maximum crawl depth",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum link depth from the start URLs. Use 0 to crawl only the start URLs.",
                        "default": 3
                    },
                    "maxResults": {
                        "title": "Maximum saved page records",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of page records to save to the default dataset.",
                        "default": 25
                    },
                    "maxConcurrency": {
                        "title": "Maximum concurrency",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Maximum number of requests processed in parallel. Lower concurrency is gentler and can reduce memory usage.",
                        "default": 4
                    },
                    "requestHandlerTimeoutSecs": {
                        "title": "Request timeout",
                        "minimum": 10,
                        "type": "integer",
                        "description": "Maximum time allowed for each page or file request handler.",
                        "default": 45
                    },
                    "respectRobotsTxtFile": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "Recommended for broad public crawls.",
                        "default": true
                    },
                    "discoverSitemaps": {
                        "title": "Discover sitemaps",
                        "type": "boolean",
                        "description": "Discover URLs from sitemap.xml files and sitemap declarations in robots.txt. Turn on for broader crawls.",
                        "default": false
                    },
                    "discoverLlmsTxt": {
                        "title": "Discover llms.txt",
                        "type": "boolean",
                        "description": "Discover URLs from llms.txt and llms-full.txt files when they are available. Turn on when you want extra discovery.",
                        "default": false
                    },
                    "discoverLlmsFullTxt": {
                        "title": "Discover llms-full.txt",
                        "type": "boolean",
                        "description": "Also fetch llms-full.txt when discovering LLM-oriented links. This can be large, so it is disabled by default.",
                        "default": false
                    },
                    "discoveryOnly": {
                        "title": "Discovery-only dry run",
                        "type": "boolean",
                        "description": "Emit discovered URL decisions without saving page content.",
                        "default": false
                    },
                    "saveMarkdown": {
                        "title": "Save Markdown",
                        "type": "boolean",
                        "description": "Save cleaned Markdown for each page record.",
                        "default": true
                    },
                    "saveText": {
                        "title": "Save plain text",
                        "type": "boolean",
                        "description": "Save cleaned plain text for each page record.",
                        "default": false
                    },
                    "saveHtml": {
                        "title": "Save cleaned HTML",
                        "type": "boolean",
                        "description": "Save cleaned HTML for each page record. Useful for debugging and migration workflows.",
                        "default": false
                    },
                    "saveScreenshots": {
                        "title": "Save screenshots",
                        "type": "boolean",
                        "description": "Save screenshots for browser-rendered pages to the default key-value store.",
                        "default": false
                    },
                    "createChunks": {
                        "title": "Create RAG chunks",
                        "type": "boolean",
                        "description": "Create RAG-ready chunk records with stable IDs, hashes, and token estimates.",
                        "default": false
                    },
                    "outputMode": {
                        "title": "Output mode",
                        "enum": [
                            "fullContentAndChunks",
                            "fullContent",
                            "chunksOnly"
                        ],
                        "type": "string",
                        "description": "Choose whether to emit full page content, chunk records, or both.",
                        "default": "fullContent"
                    },
                    "chunkTargetTokens": {
                        "title": "Chunk target tokens",
                        "minimum": 100,
                        "type": "integer",
                        "description": "Target token count for each generated RAG chunk.",
                        "default": 700
                    },
                    "chunkOverlapTokens": {
                        "title": "Chunk overlap tokens",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Approximate token overlap between adjacent chunks when splitting long sections.",
                        "default": 40
                    },
                    "chunkMaxChars": {
                        "title": "Chunk maximum characters",
                        "minimum": 500,
                        "type": "integer",
                        "description": "Hard character limit for a generated chunk.",
                        "default": 5000
                    },
                    "incrementalMode": {
                        "title": "Incremental mode",
                        "enum": [
                            "disabled",
                            "readState",
                            "readWriteState"
                        ],
                        "type": "string",
                        "description": "Controls whether previous crawl state is read and updated in the key-value store.",
                        "default": "disabled"
                    },
                    "stateKey": {
                        "title": "State key",
                        "type": "string",
                        "description": "Stable key used to store URL and content hashes between scheduled runs.",
                        "default": "default"
                    },
                    "skipUnchanged": {
                        "title": "Skip unchanged content",
                        "type": "boolean",
                        "description": "Skip writing unchanged page and file records when incremental state shows the content did not change.",
                        "default": true
                    },
                    "emitUnchangedRecords": {
                        "title": "Emit unchanged records",
                        "type": "boolean",
                        "description": "Write unchanged records to the dataset instead of only counting them in the summary.",
                        "default": false
                    },
                    "emitDeletedRecords": {
                        "title": "Emit deletion tombstones",
                        "type": "boolean",
                        "description": "Emit tombstone records for documents that existed in prior state but were not seen in this run.",
                        "default": false
                    },
                    "saveFiles": {
                        "title": "Download linked files",
                        "type": "boolean",
                        "description": "Download supported linked files to the default key-value store. Leave off for the cheapest page-only crawls.",
                        "default": false
                    },
                    "parseFiles": {
                        "title": "Parse linked files",
                        "type": "boolean",
                        "description": "Extract text from supported linked files and emit file/chunk records. Leave off unless you need linked documents.",
                        "default": false
                    },
                    "fileMaxSizeMb": {
                        "title": "Maximum file size in MB",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum file size that can be downloaded or parsed.",
                        "default": 10
                    },
                    "maxFiles": {
                        "title": "Maximum files",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of linked files to process during the run.",
                        "default": 0
                    },
                    "removeElementsCssSelector": {
                        "title": "Remove elements CSS selector",
                        "type": "string",
                        "description": "Additional CSS selectors to remove before extraction.",
                        "default": ""
                    },
                    "keepElementsCssSelector": {
                        "title": "Keep elements CSS selector",
                        "type": "string",
                        "description": "When set, extraction keeps only matching elements.",
                        "default": ""
                    },
                    "waitForSelector": {
                        "title": "Wait for selector",
                        "type": "string",
                        "description": "Browser modes wait for this selector before extracting content.",
                        "default": ""
                    },
                    "dynamicContentWaitSecs": {
                        "title": "Dynamic content wait",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Extra wait time in browser mode before extracting content.",
                        "default": 0
                    },
                    "scrollToBottom": {
                        "title": "Scroll to bottom",
                        "type": "boolean",
                        "description": "Scroll pages in browser mode to load infinite or lazy-loaded content.",
                        "default": false
                    },
                    "clickElementsCssSelector": {
                        "title": "Click expandable elements selector",
                        "type": "string",
                        "description": "CSS selector for expandable elements to click before extraction in browser mode.",
                        "default": ""
                    },
                    "closeCookieModals": {
                        "title": "Close cookie modals",
                        "type": "boolean",
                        "description": "Attempt to close common cookie consent modals in browser mode.",
                        "default": false
                    },
                    "blockMedia": {
                        "title": "Block media requests",
                        "type": "boolean",
                        "description": "Block image, font, and media requests in browser mode to reduce cost and speed up crawls.",
                        "default": true
                    },
                    "requestHeaders": {
                        "title": "Request headers",
                        "type": "object",
                        "description": "Optional headers for authenticated or signed requests. Values are treated as secrets."
                    },
                    "cookies": {
                        "title": "Cookies",
                        "type": "string",
                        "description": "Cookie header string for authenticated crawls. Stored as a secret input and never emitted."
                    },
                    "loginValidationUrl": {
                        "title": "Login validation URL",
                        "type": "string",
                        "description": "Optional URL to validate authenticated access before the crawl.",
                        "default": ""
                    },
                    "debugMode": {
                        "title": "Debug mode",
                        "type": "boolean",
                        "description": "Enable extra diagnostics such as skipped URL records and debug artifacts.",
                        "default": false
                    },
                    "saveSkippedUrls": {
                        "title": "Save skipped URL diagnostics",
                        "type": "boolean",
                        "description": "Write skipped URL diagnostics to the default dataset.",
                        "default": false
                    },
                    "integrationTarget": {
                        "title": "Integration handoff",
                        "enum": [
                            "none",
                            "customWebhook"
                        ],
                        "type": "string",
                        "description": "Optional post-run integration handoff target.",
                        "default": "none"
                    },
                    "customWebhookUrl": {
                        "title": "Custom webhook URL",
                        "type": "string",
                        "description": "Webhook URL used when the custom webhook integration target is selected.",
                        "default": ""
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Proxy settings used by Crawlee and Apify when fetching pages.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
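
The `inputSchema` above constrains several fields with enums and numeric bounds. As a rough illustration, the following Python sketch checks a hand-picked subset of those constraints locally before a run is started. The helper name `validate_input`, the `ENUMS` and `INT_BOUNDS` tables, and the `https://example.com/docs` start URL are assumptions made for this example; they are not part of the Actor or the Apify client, and the platform performs its own validation against the full schema.

```python
from typing import Any

# Constraints copied by hand from the inputSchema above.
# This is an illustrative subset, not the full schema.
ENUMS = {
    "preset": {"simple", "rag", "docsMigration", "debug", "incremental", "authenticatedPortal"},
    "crawlerType": {"adaptive", "cheerio", "playwright-firefox", "playwright-chromium"},
    "crawlScope": {"startUrlPath", "sameHostname", "sameDomain", "customGlobs", "exactUrlsOnly"},
    "outputMode": {"fullContentAndChunks", "fullContent", "chunksOnly"},
    "incrementalMode": {"disabled", "readState", "readWriteState"},
}

# field name -> (minimum, maximum or None when the schema sets no maximum)
INT_BOUNDS = {
    "maxCrawlPages": (1, None),
    "maxCrawlDepth": (0, None),
    "maxResults": (1, None),
    "maxConcurrency": (1, 100),
    "requestHandlerTimeoutSecs": (10, None),
    "chunkTargetTokens": (100, None),
    "chunkOverlapTokens": (0, None),
    "chunkMaxChars": (500, None),
}


def validate_input(run_input: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems: list[str] = []

    # startUrls is the only required field: an array of objects with a "url" string.
    start_urls = run_input.get("startUrls")
    if not isinstance(start_urls, list):
        problems.append("startUrls is required and must be an array")
    else:
        for i, item in enumerate(start_urls):
            if not isinstance(item, dict) or not isinstance(item.get("url"), str):
                problems.append(f"startUrls[{i}] must be an object with a string 'url'")

    # Enum fields are optional, but a provided value must be one of the listed options.
    for field, allowed in ENUMS.items():
        if field in run_input and run_input[field] not in allowed:
            problems.append(f"{field} must be one of {sorted(allowed)}")

    # Integer fields must respect the schema's minimum and, where present, maximum.
    for field, (lo, hi) in INT_BOUNDS.items():
        if field not in run_input:
            continue
        value = run_input[field]
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{field} must be an integer")
        elif value < lo or (hi is not None and value > hi):
            problems.append(f"{field} is out of range")

    return problems


# Hypothetical input for a chunk-only RAG crawl, using field names from the schema.
rag_input = {
    "startUrls": [{"url": "https://example.com/docs"}],
    "preset": "rag",
    "crawlerType": "cheerio",
    "maxCrawlPages": 25,
    "createChunks": True,
    "outputMode": "chunksOnly",
    "chunkTargetTokens": 700,
    "chunkOverlapTokens": 40,
}

print(validate_input(rag_input))
```

Once the list of problems is empty, a dictionary like `rag_input` can be passed as the run input through the official Python client, for example `ApifyClient(token).actor("qaseemiqbal/website-content-scraper").call(run_input=rag_input)`, where `token` stands for your own API token.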
