# AI Training Data Curator (`omarchydev/ai-training-data-curator`) Actor

Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.

- **URL**: https://apify.com/omarchydev/ai-training-data-curator.md
- **Developed by:** [Omarchy Dev](https://apify.com/omarchydev) (community)
- **Categories:** AI, Agents, Other
- **Stats:** 7 total users, 0 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.01 / 1,000 results

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Training Data Curator

Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website, or process your own documents, with automatic quality scoring, deduplication, and format conversion.

### Features

- **Smart Content Extraction**: Automatically detects and extracts main content, filtering out navigation, ads, and boilerplate
- **Bring Your Own Data (BYOD)**: Process your own text documents without crawling - perfect for existing datasets
- **Quality Scoring**: Scores each document based on vocabulary diversity, sentence structure, and content density
- **Deduplication**: Uses MinHash/Jaccard similarity to remove near-duplicate content
- **Flexible Crawling**: Single page, same domain, same subdomain, or follow all links
- **Document Chunking**: Split long documents into training-ready chunks with configurable overlap
- **Multiple Output Formats**: JSONL (OpenAI compatible), JSON, Parquet, CSV, or HuggingFace Datasets format
- **Language Filtering**: Filter content by language (ISO 639-1 codes)
- **Privacy Features**: Optionally remove emails and URLs from extracted text

### Use Cases

- **LLM Fine-tuning**: Collect domain-specific training data for fine-tuning language models
- **RAG Systems**: Build high-quality document collections for retrieval-augmented generation
- **Knowledge Bases**: Create clean text corpora from documentation sites
- **Research**: Gather datasets from academic or technical resources
- **Data Cleaning**: Clean and deduplicate existing text datasets for ML training

### Input Configuration

#### Mode Selection

The Actor supports two modes. Provide **either** `start_urls` (for crawling) **or** `documents` (for BYOD):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `start_urls` | array | - | URLs to start crawling from (Crawl mode) |
| `documents` | array | - | Your own documents to process (BYOD mode) |
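
The either/or rule above can be checked on the client before a run is started. The sketch below is a minimal illustration; `detect_mode` is a hypothetical helper, and how the Actor itself reacts to an input containing both fields is not documented here.

```python
def detect_mode(actor_input: dict) -> str:
    """Return "crawl" or "byod" depending on which input field is populated.

    Hypothetical client-side check; the Actor's own handling of
    ambiguous input is an unknown and is not modelled here.
    """
    has_urls = bool(actor_input.get("start_urls"))
    has_docs = bool(actor_input.get("documents"))
    if has_urls and has_docs:
        raise ValueError("Provide either start_urls or documents, not both")
    if not has_urls and not has_docs:
        raise ValueError("Provide start_urls (crawl) or documents (BYOD)")
    return "crawl" if has_urls else "byod"
```

Calling `detect_mode({"documents": ["some text"]})` returns `"byod"`.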

#### BYOD (Bring Your Own Data) Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `documents` | array | - | Array of text strings or objects with a `text` field |
| `byod_text_field` | string | `text` | Field name containing text in document objects |
| `max_byod_documents` | integer | 500 | Maximum documents to process (hard limit) |
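
To make the BYOD settings concrete, here is a rough sketch of how mixed string/object documents could be normalized into one shape. `normalize_documents` is a hypothetical helper, and two behaviours in it are assumptions: entries beyond the limit are silently dropped, and entries without a usable text field are skipped.

```python
def normalize_documents(documents, text_field="text", max_documents=500):
    """Turn plain strings and objects into uniform {"text": ...} records.

    Assumptions (not confirmed by the Actor): documents past the limit
    are dropped, and malformed entries are skipped rather than rejected.
    """
    normalized = []
    for doc in documents[:max_documents]:
        if isinstance(doc, str):
            normalized.append({"text": doc})
        elif isinstance(doc, dict) and isinstance(doc.get(text_field), str):
            # Keep any extra keys (for example source_id) next to the text.
            extras = {k: v for k, v in doc.items() if k != text_field}
            normalized.append({"text": doc[text_field], **extras})
    return normalized
```

With `text_field="body"`, an entry such as `{"body": "...", "source_id": "doc_001"}` comes out as `{"text": "...", "source_id": "doc_001"}`.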

#### Crawl Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `start_urls` | array | - | URLs to start crawling from |
| `crawl_mode` | string | `same_domain` | `single_page`, `same_domain`, `same_subdomain`, or `all_links` |
| `max_pages` | integer | 100 | Maximum pages to crawl |
| `max_depth` | integer | 3 | Maximum link depth from start URLs |
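
The four `crawl_mode` values can be read as a scope test applied to each discovered link. The sketch below is one plausible interpretation, not the Actor's confirmed behaviour: it assumes `same_subdomain` means an identical hostname and `same_domain` means the last two hostname labels match. That second rule is deliberately naive and is wrong for suffixes such as `co.uk`. `in_scope` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    return (urlsplit(url).hostname or "").lower()


def _naive_domain(host: str) -> str:
    # Naive approximation: the last two labels (fails for e.g. "co.uk").
    return ".".join(host.split(".")[-2:])


def in_scope(start_url: str, candidate_url: str, crawl_mode: str) -> bool:
    """Decide whether a discovered link should be followed (assumed semantics)."""
    start, candidate = _host(start_url), _host(candidate_url)
    if crawl_mode == "single_page":
        return False  # never follow links
    if crawl_mode == "same_subdomain":
        return candidate == start
    if crawl_mode == "same_domain":
        return _naive_domain(candidate) == _naive_domain(start)
    if crawl_mode == "all_links":
        return True
    raise ValueError(f"Unknown crawl_mode: {crawl_mode!r}")
```

Under these assumptions, a crawl starting at `https://docs.python.org/3/` would follow a link to `https://www.python.org/about` in `same_domain` mode but not in `same_subdomain` mode.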

#### Content Extraction

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content_selectors` | array | `["article", "main", ".content"]` | CSS selectors for main content |
| `exclude_selectors` | array | `["nav", "header", "footer", ".sidebar"]` | CSS selectors to exclude |
| `min_word_count` | integer | 100 | Minimum words per document |
| `max_word_count` | integer | 50000 | Maximum words per document (longer documents are truncated) |

#### Quality & Deduplication

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `deduplicate` | boolean | true | Remove duplicate/near-duplicate content |
| `dedup_threshold_percent` | integer | 85 | Similarity threshold in percent (50-100) |
| `quality_filter` | boolean | true | Filter low-quality content |
| `min_quality_score_percent` | integer | 50 | Minimum quality score in percent (0-100) |
| `language_filter` | array | `["en"]` | Languages to include (ISO codes) |
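
To show what a similarity threshold of 0.85 (85%) means in practice, here is a small sketch of near-duplicate removal. It computes exact Jaccard similarity over word 3-grams instead of a MinHash estimate, so it illustrates the idea rather than the Actor's implementation; the shingle size and the greedy keep-first strategy are assumptions.

```python
def shingles(text: str, size: int = 3) -> set:
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold: float = 0.85):
    """Keep the first document of every near-duplicate group."""
    kept, kept_shingles = [], []
    for text in texts:
        current = shingles(text)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # too similar to a document that was already kept
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

This pairwise comparison is quadratic in the number of documents. MinHash exists to approximate the same similarity cheaply on larger collections.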

#### Output Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `output_format` | string | `jsonl` | `jsonl`, `json`, or `csv` (the values accepted by the input schema) |
| `text_field_name` | string | `text` | Name of the text field in output |
| `include_metadata` | boolean | true | Include URL, title, date metadata |
| `include_raw_html` | boolean | false | Also save original HTML |

#### Chunking

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `chunk_documents` | boolean | false | Split documents into chunks |
| `chunk_size` | integer | 512 | Target chunk size in tokens |
| `chunk_overlap` | integer | 64 | Overlap between chunks |

#### Text Cleaning

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `clean_html` | boolean | true | Remove HTML tags |
| `normalize_whitespace` | boolean | true | Collapse multiple spaces/newlines |
| `remove_urls` | boolean | false | Strip embedded URLs |
| `remove_emails` | boolean | true | Strip email addresses |
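
For orientation, the cleaning flags could map onto regular-expression passes roughly like the sketch below. The patterns are simplified assumptions (real-world URL and email matching has many edge cases) and are not taken from the Actor's source; `clean_text` is a hypothetical helper.

```python
import re

# Simplified patterns, chosen for readability rather than completeness.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_text(text, remove_urls=False, remove_emails=True, normalize_whitespace=True):
    """Apply the optional cleaning steps in a fixed order."""
    if remove_urls:
        text = URL_RE.sub("", text)
    if remove_emails:
        text = EMAIL_RE.sub("", text)
    if normalize_whitespace:
        text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
        text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
        text = text.strip()
    return text
```

Whitespace normalization runs last so that gaps left behind by removed URLs or emails are collapsed as well.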

#### Performance

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `use_proxies` | boolean | false | Use Apify Proxy for requests |
| `max_concurrency` | integer | 10 | Parallel requests |
| `request_delay_ms` | integer | 500 | Delay between requests |
| `respect_robots_txt` | boolean | true | Follow robots.txt rules |

### Output Format

Each document in the output contains:

```json
{
  "text": "The cleaned document text content...",
  "doc_id": "abc123def456",
  "source_url": "https://example.com/page",
  "word_count": 1523,
  "quality_score": 0.847,
  "language": "en",
  "title": "Page Title",
  "description": "Meta description",
  "content_type": "documentation",
  "scraped_at": "2024-01-15T10:30:00Z"
}
````

If chunking is enabled, additional fields are included:

```json
{
  "chunk_index": 0,
  "total_chunks": 5,
  "parent_doc_id": "abc123def456"
}
```
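
When working with chunked output, the chunk fields allow the pieces of a document to be put back in order. The sketch below groups JSONL records by `parent_doc_id` and sorts them by `chunk_index`. It assumes every chunk record carries both fields, and `group_chunks` is a hypothetical helper.

```python
import json
from collections import defaultdict


def group_chunks(jsonl_text: str) -> dict:
    """Group chunk records by parent document, ordered by chunk_index."""
    groups = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        groups[record["parent_doc_id"]].append(record)
    return {
        parent_id: sorted(records, key=lambda r: r["chunk_index"])
        for parent_id, records in groups.items()
    }
```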

### Quality Metrics

The quality scorer evaluates documents based on:

- **Word count**: Penalizes very short documents
- **Sentence length**: Flags very short (fragments) or very long sentences
- **Vocabulary diversity**: Ratio of unique words to total words
- **Boilerplate ratio**: Detection of common web boilerplate patterns
- **Character composition**: Penalizes excessive uppercase, digits, or special characters

When `quality_filter` is enabled, documents scoring below the configured minimum (`min_quality_score_percent` in the input schema, default 50) are automatically filtered out.
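
As a rough illustration of how such signals can be combined into one number, the sketch below averages four simple components. The components, thresholds, and equal weighting are assumptions made for this example and are not the Actor's actual formula; boilerplate detection and the uppercase check are left out entirely.

```python
import re


def quality_score(text: str, min_words: int = 100) -> float:
    """Toy quality score in [0, 1]; weights and thresholds are illustrative."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)

    length_score = min(len(words) / min_words, 1.0)              # penalize short texts
    diversity_score = len(set(words)) / len(words)               # unique / total words
    sentence_score = 1.0 if 5 <= avg_sentence_len <= 40 else 0.5  # fragments or run-ons
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)

    return round((length_score + diversity_score + sentence_score + alpha_ratio) / 4, 3)
```

Repetitive text such as the same word repeated many times scores low on vocabulary diversity, which pulls the average down.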

### Example Input

#### Crawl Python Documentation

```json
{
  "start_urls": [
    { "url": "https://docs.python.org/3/tutorial/" }
  ],
  "crawl_mode": "same_subdomain",
  "max_pages": 500,
  "content_selectors": [".document", ".body"],
  "exclude_selectors": [".sphinxsidebar", ".related", "footer"],
  "output_format": "jsonl",
  "chunk_documents": true,
  "chunk_size": 1024
}
```

#### Build Knowledge Base from Blog

```json
{
  "start_urls": [
    { "url": "https://example.com/blog/" }
  ],
  "crawl_mode": "same_domain",
  "max_pages": 100,
  "content_selectors": ["article", ".post-content"],
  "quality_filter": true,
  "min_quality_score_percent": 60,
  "deduplicate": true,
  "output_format": "jsonl"
}
```

#### BYOD: Process Your Own Documents

```json
{
  "documents": [
    "This is a plain text document that will be processed...",
    {
      "text": "This document has metadata attached to it...",
      "source_id": "doc_001",
      "metadata": {
        "title": "My Document",
        "author": "John Doe",
        "language": "en"
      }
    }
  ],
  "deduplicate": true,
  "quality_filter": true,
  "min_quality_score_percent": 50,
  "output_format": "jsonl"
}
```

#### BYOD: Clean Existing Dataset

```json
{
  "documents": [
    {"text": "First document from your dataset..."},
    {"text": "Second document from your dataset..."},
    {"text": "Third document from your dataset..."}
  ],
  "byod_text_field": "text",
  "deduplicate": true,
  "dedup_threshold_percent": 85,
  "chunk_documents": true,
  "chunk_size": 512,
  "output_format": "jsonl"
}
```

### Tips for Best Results

1. **Use specific content selectors**: Better extraction with precise CSS selectors for your target site
2. **Set appropriate word counts**: Filter out navigation pages and indexes with `min_word_count`
3. **Enable deduplication**: Prevents training on repetitive content (common on content farms)
4. **Adjust quality threshold**: Lower for technical content, higher for prose
5. **Use chunking for long documents**: Better for training context windows
6. **Start small**: Test with `max_pages: 20` before large crawls

### Pricing

- **$0.01 per document** - charged for each cleaned document (both crawled and BYOD)

Additional costs:

- **Proxy**: ~$0.001-0.005 per request (if enabled)
- **Storage**: ~$0.0001 per document

### Support

- [Apify Documentation](https://docs.apify.com)
- [Report Issues](https://github.com/your-repo/issues)
- [Crawlee Documentation](https://crawlee.dev/python/docs)

# Actor input Schema

## `start_urls` (type: `array`):

URLs to start crawling from. Leave empty if using BYOD mode.

## `documents` (type: `array`):

Your own documents to process. Provide text strings or objects with a `text` field. Use this instead of Start URLs for BYOD mode.

## `byod_text_field` (type: `string`):

Field name containing text in BYOD documents (default: 'text')

## `max_byod_documents` (type: `integer`):

Maximum documents to process in BYOD mode (prevents memory issues)

## `crawl_mode` (type: `string`):

How extensively to crawl: `single_page`, `same_domain`, `same_subdomain`, or `all_links`

## `max_pages` (type: `integer`):

Maximum number of pages to crawl

## `max_depth` (type: `integer`):

Maximum link depth from start URLs

## `content_selectors` (type: `array`):

CSS selectors for main content elements (e.g., article, main, .content)

## `exclude_selectors` (type: `array`):

CSS selectors for elements to exclude (e.g., nav, footer, .sidebar)

## `min_word_count` (type: `integer`):

Minimum words per document

## `max_word_count` (type: `integer`):

Maximum words per document (longer documents will be truncated)

## `deduplicate` (type: `boolean`):

Remove near-duplicate documents using fuzzy matching

## `dedup_threshold_percent` (type: `integer`):

Similarity threshold for deduplication (50-100%, higher = stricter)

## `quality_filter` (type: `boolean`):

Filter out low-quality documents based on quality score

## `min_quality_score_percent` (type: `integer`):

Minimum quality score to include document (0-100%)

## `language_filter` (type: `array`):

Language codes to include (e.g., en, es, fr). Empty = all languages

## `chunk_documents` (type: `boolean`):

Split documents into smaller chunks for training

## `chunk_size` (type: `integer`):

Target tokens per chunk when chunking is enabled

## `chunk_overlap` (type: `integer`):

Overlap tokens between consecutive chunks

## `output_format` (type: `string`):

Format for output data

## `text_field_name` (type: `string`):

Field name for text content in output

## `include_metadata` (type: `boolean`):

Include document metadata (title, author, URL, etc.)

## `include_raw_html` (type: `boolean`):

Include raw HTML in output (increases data size)

## `remove_urls` (type: `boolean`):

Remove URLs from extracted text

## `remove_emails` (type: `boolean`):

Remove email addresses from extracted text (recommended for privacy)

## `normalize_whitespace` (type: `boolean`):

Normalize whitespace and remove excessive blank lines

## `use_proxies` (type: `boolean`):

Use Apify proxy for requests

## `max_concurrency` (type: `integer`):

Maximum concurrent page requests

## `request_delay_ms` (type: `integer`):

Delay between requests in milliseconds

## Actor input object example

```json
{
  "start_urls": [
    {
      "url": "https://docs.python.org/3/"
    }
  ],
  "documents": [
    {
      "text": "Your document text here...",
      "source_id": "doc_001"
    }
  ],
  "byod_text_field": "text",
  "max_byod_documents": 500,
  "crawl_mode": "same_domain",
  "max_pages": 100,
  "max_depth": 3,
  "content_selectors": [
    "article",
    "main",
    ".content"
  ],
  "exclude_selectors": [
    "nav",
    "footer",
    "aside",
    ".sidebar"
  ],
  "min_word_count": 100,
  "max_word_count": 50000,
  "deduplicate": true,
  "dedup_threshold_percent": 85,
  "quality_filter": true,
  "min_quality_score_percent": 50,
  "language_filter": [
    "en"
  ],
  "chunk_documents": false,
  "chunk_size": 512,
  "chunk_overlap": 64,
  "output_format": "jsonl",
  "text_field_name": "text",
  "include_metadata": true,
  "include_raw_html": false,
  "remove_urls": false,
  "remove_emails": true,
  "normalize_whitespace": true,
  "use_proxies": false,
  "max_concurrency": 10,
  "request_delay_ms": 500
}
```
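
Before submitting an input object like the one above, the integer fields can be checked against the `minimum`/`maximum` bounds listed in the OpenAPI specification further down. The sketch below copies those bounds into a table; `validate_ranges` is a hypothetical helper, and the platform performs its own validation regardless.

```python
# (minimum, maximum) pairs as listed in the Actor's input schema.
RANGES = {
    "max_byod_documents": (1, 500),
    "max_pages": (1, 10000),
    "max_depth": (1, 10),
    "min_word_count": (10, 10000),
    "max_word_count": (100, 500000),
    "dedup_threshold_percent": (50, 100),
    "min_quality_score_percent": (0, 100),
    "chunk_size": (64, 8192),
    "chunk_overlap": (0, 512),
    "max_concurrency": (1, 50),
    "request_delay_ms": (0, 10000),
}


def validate_ranges(actor_input: dict) -> list:
    """Return human-readable errors for out-of-range integer fields."""
    errors = []
    for field, (low, high) in RANGES.items():
        if field not in actor_input:
            continue  # omitted fields fall back to their defaults
        value = actor_input[field]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            errors.append(f"{field} must be an integer in [{low}, {high}], got {value!r}")
    return errors
```

For example, `validate_ranges({"dedup_threshold_percent": 0.85})` reports one error, because the schema expects an integer percentage such as `85`.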

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "start_urls": [
        {
            "url": "https://docs.python.org/3/"
        }
    ],
    "documents": [
        {
            "text": "Your document text here...",
            "source_id": "doc_001"
        }
    ],
    "content_selectors": [
        "article",
        "main",
        ".content"
    ],
    "exclude_selectors": [
        "nav",
        "footer",
        "aside",
        ".sidebar"
    ],
    "language_filter": [
        "en"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("omarchydev/ai-training-data-curator").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "start_urls": [{ "url": "https://docs.python.org/3/" }],
    "documents": [{
            "text": "Your document text here...",
            "source_id": "doc_001",
        }],
    "content_selectors": [
        "article",
        "main",
        ".content",
    ],
    "exclude_selectors": [
        "nav",
        "footer",
        "aside",
        ".sidebar",
    ],
    "language_filter": ["en"],
}

# Run the Actor and wait for it to finish
run = client.actor("omarchydev/ai-training-data-curator").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
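
If installing the client library is not an option, the `run-sync-get-dataset-items` endpoint described in the OpenAPI specification can be called with the Python standard library alone. This is a minimal sketch: the function names are hypothetical, the response is assumed to be a JSON body, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "omarchydev~ai-training-data-curator"


def build_run_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_run_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without spending a run.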

## CLI example

```bash
echo '{
  "start_urls": [
    {
      "url": "https://docs.python.org/3/"
    }
  ],
  "documents": [
    {
      "text": "Your document text here...",
      "source_id": "doc_001"
    }
  ],
  "content_selectors": [
    "article",
    "main",
    ".content"
  ],
  "exclude_selectors": [
    "nav",
    "footer",
    "aside",
    ".sidebar"
  ],
  "language_filter": [
    "en"
  ]
}' |
apify call omarchydev/ai-training-data-curator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=omarchydev/ai-training-data-curator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Ai Training Data Curator",
        "description": "Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.",
        "version": "1.1",
        "x-build-id": "5uAj1uayEGy3AmThH"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/omarchydev~ai-training-data-curator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-omarchydev-ai-training-data-curator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/omarchydev~ai-training-data-curator/runs": {
            "post": {
                "operationId": "runs-sync-omarchydev-ai-training-data-curator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/omarchydev~ai-training-data-curator/run-sync": {
            "post": {
                "operationId": "run-sync-omarchydev-ai-training-data-curator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "start_urls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "URLs to start crawling from. Leave empty if using BYOD mode.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "documents": {
                        "title": "Documents (BYOD)",
                        "type": "array",
                        "description": "Your own documents to process. Provide text strings or objects with 'text' field. Use this instead of Start URLs for BYOD mode."
                    },
                    "byod_text_field": {
                        "title": "Text Field Name (BYOD)",
                        "type": "string",
                        "description": "Field name containing text in BYOD documents (default: 'text')",
                        "default": "text"
                    },
                    "max_byod_documents": {
                        "title": "Max BYOD Documents",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum documents to process in BYOD mode (prevents memory issues)",
                        "default": 500
                    },
                    "crawl_mode": {
                        "title": "Crawl Mode",
                        "enum": [
                            "single_page",
                            "same_domain",
                            "same_subdomain",
                            "all_links"
                        ],
                        "type": "string",
                        "description": "How extensively to crawl: single_page, same_domain, same_subdomain, all_links",
                        "default": "same_domain"
                    },
                    "max_pages": {
                        "title": "Maximum Pages",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl",
                        "default": 100
                    },
                    "max_depth": {
                        "title": "Maximum Depth",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum link depth from start URLs",
                        "default": 3
                    },
                    "content_selectors": {
                        "title": "Content Selectors",
                        "type": "array",
                        "description": "CSS selectors for main content elements (e.g., article, main, .content)",
                        "items": {
                            "type": "string"
                        }
                    },
                    "exclude_selectors": {
                        "title": "Exclude Selectors",
                        "type": "array",
                        "description": "CSS selectors for elements to exclude (e.g., nav, footer, .sidebar)",
                        "items": {
                            "type": "string"
                        }
                    },
                    "min_word_count": {
                        "title": "Minimum Word Count",
                        "minimum": 10,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Minimum words per document",
                        "default": 100
                    },
                    "max_word_count": {
                        "title": "Maximum Word Count",
                        "minimum": 100,
                        "maximum": 500000,
                        "type": "integer",
                        "description": "Maximum words per document (longer documents will be truncated)",
                        "default": 50000
                    },
                    "deduplicate": {
                        "title": "Enable Deduplication",
                        "type": "boolean",
                        "description": "Remove near-duplicate documents using fuzzy matching",
                        "default": true
                    },
                    "dedup_threshold_percent": {
                        "title": "Deduplication Threshold (%)",
                        "minimum": 50,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Similarity threshold for deduplication (50-100%, higher = stricter)",
                        "default": 85
                    },
                    "quality_filter": {
                        "title": "Enable Quality Filter",
                        "type": "boolean",
                        "description": "Filter out low-quality documents based on quality score",
                        "default": true
                    },
                    "min_quality_score_percent": {
                        "title": "Minimum Quality Score (%)",
                        "minimum": 0,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Minimum quality score to include document (0-100%)",
                        "default": 50
                    },
                    "language_filter": {
                        "title": "Language Filter",
                        "type": "array",
                        "description": "Language codes to include (e.g., en, es, fr). Empty = all languages",
                        "items": {
                            "type": "string"
                        }
                    },
                    "chunk_documents": {
                        "title": "Chunk Documents",
                        "type": "boolean",
                        "description": "Split documents into smaller chunks for training",
                        "default": false
                    },
                    "chunk_size": {
                        "title": "Chunk Size",
                        "minimum": 64,
                        "maximum": 8192,
                        "type": "integer",
                        "description": "Target tokens per chunk when chunking is enabled",
                        "default": 512
                    },
                    "chunk_overlap": {
                        "title": "Chunk Overlap",
                        "minimum": 0,
                        "maximum": 512,
                        "type": "integer",
                        "description": "Overlap tokens between consecutive chunks",
                        "default": 64
                    },
                    "output_format": {
                        "title": "Output Format",
                        "enum": [
                            "jsonl",
                            "json",
                            "csv"
                        ],
                        "type": "string",
                        "description": "Format for output data",
                        "default": "jsonl"
                    },
                    "text_field_name": {
                        "title": "Text Field Name",
                        "type": "string",
                        "description": "Field name for text content in output",
                        "default": "text"
                    },
                    "include_metadata": {
                        "title": "Include Metadata",
                        "type": "boolean",
                        "description": "Include document metadata (title, author, URL, etc.)",
                        "default": true
                    },
                    "include_raw_html": {
                        "title": "Include Raw HTML",
                        "type": "boolean",
                        "description": "Include raw HTML in output (increases data size)",
                        "default": false
                    },
                    "remove_urls": {
                        "title": "Remove URLs from Text",
                        "type": "boolean",
                        "description": "Remove URLs from extracted text",
                        "default": false
                    },
                    "remove_emails": {
                        "title": "Remove Emails from Text",
                        "type": "boolean",
                        "description": "Remove email addresses from extracted text (recommended for privacy)",
                        "default": true
                    },
                    "normalize_whitespace": {
                        "title": "Normalize Whitespace",
                        "type": "boolean",
                        "description": "Normalize whitespace and remove excessive blank lines",
                        "default": true
                    },
                    "use_proxies": {
                        "title": "Use Proxies",
                        "type": "boolean",
                        "description": "Use Apify proxy for requests",
                        "default": false
                    },
                    "max_concurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 50,
                        "type": "integer",
                        "description": "Maximum concurrent page requests",
                        "default": 10
                    },
                    "request_delay_ms": {
                        "title": "Request Delay (ms)",
                        "minimum": 0,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Delay between requests in milliseconds",
                        "default": 500
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "number",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "number",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
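
The input schema above sets numeric bounds and an enum that are cheap to check locally before starting a run. The following is a minimal sketch of such a check, not part of the Actor or of any Apify client. The helper name `build_run_input` and the `INTEGER_FIELDS` table are hypothetical; the bounds, defaults, and allowed `output_format` values are copied from the schema.

```python
# Hypothetical helper: bounds and defaults are copied from the input schema above.
# Each entry maps a field name to (minimum, maximum, default).
INTEGER_FIELDS = {
    "chunk_size": (64, 8192, 512),
    "chunk_overlap": (0, 512, 64),
    "max_concurrency": (1, 50, 10),
    "request_delay_ms": (0, 10000, 500),
}
OUTPUT_FORMATS = ("jsonl", "json", "csv")


def build_run_input(**overrides):
    """Merge overrides onto the schema defaults and reject out-of-range values."""
    run_input = {name: default for name, (_, _, default) in INTEGER_FIELDS.items()}
    run_input["output_format"] = "jsonl"
    run_input.update(overrides)

    for name, (minimum, maximum, _) in INTEGER_FIELDS.items():
        value = run_input[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not minimum <= value <= maximum:
            raise ValueError(
                f"{name} must be between {minimum} and {maximum}, got {value}"
            )

    if run_input["output_format"] not in OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {OUTPUT_FORMATS}, "
            f"got {run_input['output_format']!r}"
        )
    return run_input
```

For example, `build_run_input(chunk_size=1024, output_format="csv")` returns a dictionary that could be passed as the run input, while `build_run_input(chunk_size=32)` raises `ValueError` because 32 is below the schema minimum of 64.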

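The schema describes `chunk_size` as the target tokens per chunk and `chunk_overlap` as the overlap tokens between consecutive chunks. One common way to realize those two settings is a sliding window. The sketch below illustrates the idea only; it is an assumption about how such parameters typically interact, not the Actor's actual implementation, and the function name `chunk_tokens` is hypothetical. The tokenizer the Actor uses is not stated, so the function takes an already tokenized list.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens.

    Illustrative sliding window only; the Actor's real chunking may differ.
    """
    if chunk_overlap >= chunk_size:
        # Assumption: the window must advance, so the overlap has to be smaller.
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults (512 and 64), each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear as the first 64 tokens of the next.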