AI Training Data Curator
Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.
Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website OR process your own documents with automatic quality scoring, deduplication, and format conversion.
Features
- Smart Content Extraction: Automatically detects and extracts main content, filtering out navigation, ads, and boilerplate
- Bring Your Own Data (BYOD): Process your own text documents without crawling - perfect for existing datasets
- Quality Scoring: Scores each document based on vocabulary diversity, sentence structure, and content density
- Deduplication: Uses MinHash/Jaccard similarity to remove near-duplicate content
- Flexible Crawling: Single page, same domain, same subdomain, or follow all links
- Document Chunking: Split long documents into training-ready chunks with configurable overlap
- Multiple Output Formats: JSONL (OpenAI compatible), JSON, Parquet, CSV, or HuggingFace Datasets format
- Language Filtering: Filter content by language (ISO 639-1 codes)
- Privacy Features: Optionally remove emails and URLs from extracted text
Use Cases
- LLM Fine-tuning: Collect domain-specific training data for fine-tuning language models
- RAG Systems: Build high-quality document collections for retrieval-augmented generation
- Knowledge Bases: Create clean text corpora from documentation sites
- Research: Gather datasets from academic or technical resources
- Data Cleaning: Clean and deduplicate existing text datasets for ML training
Input Configuration
Mode Selection
The actor supports two modes - provide either start_urls (for crawling) or documents (for BYOD):
| Field | Type | Default | Description |
|---|---|---|---|
| start_urls | array | - | URLs to start crawling from (Crawl mode) |
| documents | array | - | Your own documents to process (BYOD mode) |
BYOD (Bring Your Own Data) Settings
| Field | Type | Default | Description |
|---|---|---|---|
| documents | array | - | Array of text strings or objects with a text field |
| byod_text_field | string | text | Field name containing text in document objects |
| max_byod_documents | integer | 500 | Maximum documents to process (hard limit) |
Crawl Settings
| Field | Type | Default | Description |
|---|---|---|---|
| start_urls | array | - | URLs to start crawling from |
| crawl_mode | string | same_domain | single_page, same_domain, same_subdomain, or all_links |
| max_pages | integer | 100 | Maximum pages to crawl |
| max_depth | integer | 3 | Maximum link depth from start URLs |
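How a discovered link is scoped against the start URL is easiest to see in code. A minimal sketch of the four modes (illustrative only; the actor's internals are not published, and the naive last-two-labels domain check here would mis-handle suffixes like .co.uk):

```python
from urllib.parse import urlparse

def in_scope(start_url: str, link: str, crawl_mode: str) -> bool:
    """Illustrative link-scoping for the four crawl modes."""
    start, target = urlparse(start_url), urlparse(link)
    if crawl_mode == "single_page":
        return False  # never follow links; only the start URLs are fetched
    if crawl_mode == "all_links":
        return True   # follow everything, still bounded by max_pages/max_depth
    if crawl_mode == "same_subdomain":
        return target.netloc == start.netloc  # exact host match
    # same_domain: compare the last two host labels (naive registrable domain)
    return target.netloc.split(".")[-2:] == start.netloc.split(".")[-2:]

start = "https://docs.python.org/3/tutorial/"
print(in_scope(start, "https://peps.python.org/pep-0008/", "same_domain"))     # True
print(in_scope(start, "https://peps.python.org/pep-0008/", "same_subdomain"))  # False
```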
Content Extraction
| Field | Type | Default | Description |
|---|---|---|---|
| content_selectors | array | ["article", "main", ".content"] | CSS selectors for main content |
| exclude_selectors | array | ["nav", "header", "footer", ".sidebar"] | CSS selectors to exclude |
| min_word_count | integer | 100 | Minimum words per document |
| max_word_count | integer | 50000 | Maximum words per document |
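Conceptually, exclude_selectors is applied before content_selectors, so boilerplate nested inside an article body is dropped too. A sketch of that extraction order (BeautifulSoup used for illustration; the actor's extraction code is not published):

```python
from bs4 import BeautifulSoup

def extract_main_text(html: str, content_selectors: list, exclude_selectors: list) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove excluded regions first so their text never reaches the output.
    for sel in exclude_selectors:
        for node in soup.select(sel):
            node.decompose()
    # First content selector that matches anything wins; fall back to the full page.
    for sel in content_selectors:
        nodes = soup.select(sel)
        if nodes:
            return "\n\n".join(n.get_text(" ", strip=True) for n in nodes)
    return soup.get_text(" ", strip=True)
```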
Quality & Deduplication
| Field | Type | Default | Description |
|---|---|---|---|
| deduplicate | boolean | true | Remove duplicate/near-duplicate content |
| dedup_threshold | number | 0.85 | Similarity threshold (0.5-1.0) |
| quality_filter | boolean | true | Filter low-quality content |
| min_quality_score | number | 0.5 | Minimum quality score (0.0-1.0) |
| language_filter | array | ["en"] | Languages to include (ISO 639-1 codes) |
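To make dedup_threshold concrete: it is a Jaccard similarity cut-off, which MinHash approximates cheaply at scale. An exact-Jaccard sketch of the filtering decision (simplified; word-trigram shingling is an illustrative choice, and real MinHash replaces the set comparison with hashed signatures):

```python
def shingles(text: str, k: int = 3) -> set:
    """Word k-grams; k=3 is an illustrative choice."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

docs = [
    "machine learning models need large volumes of clean deduplicated text",
    "machine learning models need large volumes of clean deduplicated text data",
    "an entirely different sentence about something else",
]
kept = []
for doc in docs:
    sig = shingles(doc)
    if all(jaccard(sig, shingles(k)) < 0.85 for k in kept):  # dedup_threshold
        kept.append(doc)
print(kept)  # the near-duplicate second document is dropped
```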
Output Settings
| Field | Type | Default | Description |
|---|---|---|---|
| output_format | string | jsonl | jsonl, json, parquet, csv, or huggingface |
| text_field_name | string | text | Name of the text field in output |
| include_metadata | boolean | true | Include URL, title, date metadata |
| include_raw_html | boolean | false | Also save original HTML |
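Whichever format you pick, downstream loading is usually one call. For example, a downloaded JSONL file can go straight into a Hugging Face dataset (the filename is illustrative):

```python
from datasets import load_dataset

# The "json" loader treats .jsonl input as JSON Lines: one record per line.
ds = load_dataset("json", data_files="curated.jsonl", split="train")
print(ds.column_names)  # e.g. ["text", "doc_id", "source_url", "quality_score", ...]
```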
Chunking
| Field | Type | Default | Description |
|---|---|---|---|
| chunk_documents | boolean | false | Split documents into chunks |
| chunk_size | integer | 512 | Target chunk size in tokens |
| chunk_overlap | integer | 64 | Overlap between chunks |
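Overlap means consecutive chunks share their boundary tokens, so context is not cut dead at chunk edges. A sliding-window sketch (a plain token list stands in for the actor's tokenizer, which is not documented):

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, chunk_overlap: int = 64):
    """Each window starts (chunk_size - chunk_overlap) tokens after the previous
    one, so neighbouring chunks share chunk_overlap tokens."""
    step = chunk_size - chunk_overlap
    for start in range(0, max(len(tokens) - chunk_overlap, 1), step):
        yield tokens[start:start + chunk_size]

doc = ["tok"] * 1200
print([len(c) for c in chunk_tokens(doc)])  # [512, 512, 304] -> total_chunks = 3
```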
Text Cleaning
| Field | Type | Default | Description |
|---|---|---|---|
| clean_html | boolean | true | Remove HTML tags |
| normalize_whitespace | boolean | true | Collapse multiple spaces/newlines |
| remove_urls | boolean | false | Strip embedded URLs |
| remove_emails | boolean | true | Strip email addresses |
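These are standard normalization passes. Roughly, in order (the regexes here are illustrative, not the actor's exact patterns):

```python
import re

def clean_text(text: str, remove_urls: bool = False, remove_emails: bool = True) -> str:
    if remove_emails:
        text = re.sub(r"\S+@\S+\.\S+", "", text)  # strip email addresses
    if remove_urls:
        text = re.sub(r"https?://\S+", "", text)  # strip embedded URLs
    # normalize_whitespace: collapse runs of spaces/tabs/newlines
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Contact  me@example.com \n or https://example.com", remove_urls=True))
# -> "Contact or"
```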
Performance
| Field | Type | Default | Description |
|---|---|---|---|
| use_proxies | boolean | false | Use residential proxies |
| max_concurrency | integer | 10 | Parallel requests |
| request_delay_ms | integer | 500 | Delay between requests (milliseconds) |
| respect_robots_txt | boolean | true | Follow robots.txt rules |
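Concurrency and delay trade crawl speed for politeness: one caps in-flight requests, the other spaces them out. The pattern, sketched with aiohttp (the actor's crawler internals are not published):

```python
import asyncio
import aiohttp

async def fetch_all(urls, max_concurrency=10, request_delay_ms=500):
    sem = asyncio.Semaphore(max_concurrency)  # at most N requests in flight
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with sem:
                await asyncio.sleep(request_delay_ms / 1000)  # per-request delay
                async with session.get(url) as resp:
                    return await resp.text()
        return await asyncio.gather(*(fetch(u) for u in urls))

# pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```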
Output Format
Each document in the output contains:
{"text": "The cleaned document text content...","doc_id": "abc123def456","source_url": "https://example.com/page","word_count": 1523,"quality_score": 0.847,"language": "en","title": "Page Title","description": "Meta description","content_type": "documentation","scraped_at": "2024-01-15T10:30:00Z"}
If chunking is enabled, additional fields are included:
{"chunk_index": 0,"total_chunks": 5,"parent_doc_id": "abc123def456"}
Quality Metrics
The quality scorer evaluates documents based on:
- Word count: Penalizes very short documents
- Sentence length: Flags very short (fragments) or very long sentences
- Vocabulary diversity: Ratio of unique words to total words
- Boilerplate ratio: Detection of common web boilerplate patterns
- Character composition: Penalizes excessive uppercase, digits, or special characters
Documents with scores below min_quality_score are automatically filtered out.
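The exact formula is not published. A simplified sketch that folds the signals above into a single 0-1 score (weights and thresholds are assumptions, and boilerplate detection is omitted for brevity):

```python
import re

def quality_score(text: str) -> float:
    words = text.split()
    if not words:
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)
    diversity = len({w.lower() for w in words}) / len(words)           # vocabulary diversity
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)  # character composition
    length = min(len(words) / 100, 1.0)                                # penalize very short docs
    sent = 1.0 if 8 <= avg_sentence_len <= 40 else 0.6                 # fragments or run-ons
    return 0.3 * diversity + 0.3 * alpha + 0.2 * length + 0.2 * sent
```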
Example Input
Crawl Python Documentation
{"start_urls": [{ "url": "https://docs.python.org/3/tutorial/" }],"crawl_mode": "same_subdomain","max_pages": 500,"content_selectors": [".document", ".body"],"exclude_selectors": [".sphinxsidebar", ".related", "footer"],"output_format": "jsonl","chunk_documents": true,"chunk_size": 1024}
Build Knowledge Base from Blog
{"start_urls": [{ "url": "https://example.com/blog/" }],"crawl_mode": "same_domain","max_pages": 100,"content_selectors": ["article", ".post-content"],"quality_filter": true,"min_quality_score": 0.6,"deduplicate": true,"output_format": "parquet"}
BYOD: Process Your Own Documents
{"documents": ["This is a plain text document that will be processed...",{"text": "This document has metadata attached to it...","source_id": "doc_001","metadata": {"title": "My Document","author": "John Doe","language": "en"}}],"deduplicate": true,"quality_filter": true,"min_quality_score": 0.5,"output_format": "jsonl"}
BYOD: Clean Existing Dataset
{"documents": [{"text": "First document from your dataset..."},{"text": "Second document from your dataset..."},{"text": "Third document from your dataset..."}],"byod_text_field": "text","deduplicate": true,"dedup_threshold": 0.85,"chunk_documents": true,"chunk_size": 512,"output_format": "jsonl"}
Tips for Best Results
- Use specific content selectors: Better extraction with precise CSS selectors for your target site
- Set appropriate word counts: Filter out navigation pages and indexes with min_word_count
- Enable deduplication: Prevents training on repetitive content (common on content farms)
- Adjust quality threshold: Lower for technical content, higher for prose
- Use chunking for long documents: Better for training context windows
- Start small: Test with max_pages: 20 before large crawls
Pricing
- $0.01 per document - charged for each cleaned document (both crawled and BYOD)
Additional costs:
- Proxy: ~$0.001-0.005 per request (if enabled)
- Storage: ~$0.0001 per document
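At these rates, a run yielding 1,000 cleaned documents comes to about $10.00 in per-document charges plus roughly $0.10 for storage; proxy fees (about $1-5 per 1,000 requests) apply only when use_proxies is enabled.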