AI Training Data Curator

Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.

Pricing: from $0.01 / 1,000 results
Developer: Eliud Munyala (Maintained by Community)

Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website OR process your own documents with automatic quality scoring, deduplication, and format conversion.

Features

  • Smart Content Extraction: Automatically detects and extracts main content, filtering out navigation, ads, and boilerplate
  • Bring Your Own Data (BYOD): Process your own text documents without crawling - perfect for existing datasets
  • Quality Scoring: Scores each document based on vocabulary diversity, sentence structure, and content density
  • Deduplication: Uses MinHash/Jaccard similarity to remove near-duplicate content (a sketch follows this list)
  • Flexible Crawling: Single page, same domain, same subdomain, or follow all links
  • Document Chunking: Split long documents into training-ready chunks with configurable overlap
  • Multiple Output Formats: JSONL (OpenAI compatible), JSON, Parquet, CSV, or HuggingFace Datasets format
  • Language Filtering: Filter content by language (ISO 639-1 codes)
  • Privacy Features: Optionally remove emails and URLs from extracted text
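
The deduplication feature is based on MinHash/Jaccard similarity. As a rough sketch of how that technique works (illustrative only; the actor's internals are not published, and the `datasketch` library used here is an assumption):

```python
# Near-duplicate removal with MinHash + LSH, the technique named above.
# Illustrative sketch, not the actor's code. Requires: pip install datasketch
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over a document's word tokens."""
    sig = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        sig.update(token.encode("utf8"))
    return sig

docs = {
    "doc1": "Curate high quality deduplicated training data for LLM fine tuning with this actor.",
    "doc2": "Curate high quality deduplicated training data for LLM fine tuning with this actor today.",
    "doc3": "A completely different sentence about cooking pasta at home.",
}

# The threshold mirrors the dedup_threshold default of 0.85
lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for doc_id, text in docs.items():
    sig = minhash_of(text)
    if lsh.query(sig):   # a near-duplicate is already indexed: drop this one
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)  # doc2 is typically dropped as a near-duplicate of doc1
```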

Use Cases

  • LLM Fine-tuning: Collect domain-specific training data for fine-tuning language models
  • RAG Systems: Build high-quality document collections for retrieval-augmented generation
  • Knowledge Bases: Create clean text corpora from documentation sites
  • Research: Gather datasets from academic or technical resources
  • Data Cleaning: Clean and deduplicate existing text datasets for ML training

Input Configuration

Mode Selection

The actor supports two modes; provide either start_urls (Crawl mode) or documents (BYOD mode):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| start_urls | array | - | URLs to start crawling from (Crawl mode) |
| documents | array | - | Your own documents to process (BYOD mode) |

BYOD (Bring Your Own Data) Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| documents | array | - | Array of text strings or objects with a text field |
| byod_text_field | string | text | Field name containing text in document objects |
| max_byod_documents | integer | 500 | Maximum documents to process (hard limit) |

Crawl Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| start_urls | array | - | URLs to start crawling from |
| crawl_mode | string | same_domain | single_page, same_domain, same_subdomain, or all_links |
| max_pages | integer | 100 | Maximum pages to crawl |
| max_depth | integer | 3 | Maximum link depth from start URLs |
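
For clarity, the four crawl_mode values scope candidate links roughly as follows. This is an illustrative sketch of the documented behaviour, not the actor's own code, and the eTLD+1 handling is deliberately naive (it assumes absolute URLs and two-label registrable domains):

```python
from urllib.parse import urlparse

def in_scope(start_url: str, link: str, crawl_mode: str) -> bool:
    """Decide whether a discovered link falls inside the chosen crawl_mode."""
    start, target = urlparse(start_url), urlparse(link)
    if crawl_mode == "single_page":
        return link == start_url
    if crawl_mode == "same_subdomain":
        return target.hostname == start.hostname        # e.g. docs.python.org only
    if crawl_mode == "same_domain":
        root = ".".join(start.hostname.split(".")[-2:])  # naive eTLD+1
        return bool(target.hostname) and target.hostname.endswith(root)
    return crawl_mode == "all_links"                     # follow everything

print(in_scope("https://docs.python.org/3/", "https://www.python.org/x", "same_domain"))     # True
print(in_scope("https://docs.python.org/3/", "https://www.python.org/x", "same_subdomain"))  # False
```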

Content Extraction

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| content_selectors | array | ["article", "main", ".content"] | CSS selectors for main content |
| exclude_selectors | array | ["nav", "header", "footer", ".sidebar"] | CSS selectors to exclude |
| min_word_count | integer | 100 | Minimum words per document |
| max_word_count | integer | 50000 | Maximum words per document |

Quality & Deduplication

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| deduplicate | boolean | true | Remove duplicate/near-duplicate content |
| dedup_threshold | number | 0.85 | Similarity threshold (0.5-1.0) |
| quality_filter | boolean | true | Filter low-quality content |
| min_quality_score | number | 0.5 | Minimum quality score (0.0-1.0) |
| language_filter | array | ["en"] | Languages to include (ISO codes) |
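
As a point of reference, the language_filter check can be reproduced offline with a detector such as langdetect (an assumed stand-in; the actor does not document which detector it uses). It returns the same ISO 639-1 codes the option expects:

```python
# Offline illustration of ISO 639-1 language filtering, as performed by
# the language_filter option. Requires: pip install langdetect
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic across runs

docs = [
    "This page documents the configuration options in detail.",
    "Cette page documente les options de configuration en détail.",
]

allowed = {"en"}  # mirrors language_filter: ["en"]
kept = [d for d in docs if detect(d) in allowed]
print(kept)  # only the English document survives
```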

Output Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| output_format | string | jsonl | jsonl, json, parquet, csv, or huggingface |
| text_field_name | string | text | Name of the text field in output |
| include_metadata | boolean | true | Include URL, title, date metadata |
| include_raw_html | boolean | false | Also save original HTML |

Chunking

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| chunk_documents | boolean | false | Split documents into chunks |
| chunk_size | integer | 512 | Target chunk size in tokens |
| chunk_overlap | integer | 64 | Overlap between chunks |
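
The sliding-window arithmetic behind chunk_size and chunk_overlap looks roughly like the sketch below (whitespace tokens stand in for the actor's tokenizer, which is not documented):

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Yield overlapping windows of tokens, mirroring the defaults above."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break

words = ("token " * 1200).split()
chunks = list(chunk_tokens(words))
print(len(chunks), len(chunks[0]), len(chunks[1]))
# 3 chunks of <=512 words; consecutive chunks share 64 words
```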

Text Cleaning

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| clean_html | boolean | true | Remove HTML tags |
| normalize_whitespace | boolean | true | Collapse multiple spaces/newlines |
| remove_urls | boolean | false | Strip embedded URLs |
| remove_emails | boolean | true | Strip email addresses |

Performance

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| use_proxies | boolean | false | Use residential proxies |
| max_concurrency | integer | 10 | Parallel requests |
| request_delay_ms | integer | 500 | Delay between requests |
| respect_robots_txt | boolean | true | Follow robots.txt rules |

Output Format

Each document in the output contains:

```json
{
  "text": "The cleaned document text content...",
  "doc_id": "abc123def456",
  "source_url": "https://example.com/page",
  "word_count": 1523,
  "quality_score": 0.847,
  "language": "en",
  "title": "Page Title",
  "description": "Meta description",
  "content_type": "documentation",
  "scraped_at": "2024-01-15T10:30:00Z"
}
```

If chunking is enabled, additional fields are included:

```json
{
  "chunk_index": 0,
  "total_chunks": 5,
  "parent_doc_id": "abc123def456"
}
```
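
Records in the JSONL output map directly onto common fine-tuning formats. A hedged example converting them to OpenAI's chat fine-tuning layout (the file names and prompt wording are placeholders to adapt to your task):

```python
import json

# Map the actor's JSONL output onto OpenAI's chat fine-tuning format.
# The "text" and "title" fields match the output schema shown above.
with open("dataset.jsonl") as src, open("openai_train.jsonl", "w") as dst:
    for line in src:
        doc = json.loads(line)
        example = {
            "messages": [
                {"role": "system", "content": "You are a documentation assistant."},
                {"role": "user", "content": f"Write about: {doc.get('title', 'untitled')}"},
                {"role": "assistant", "content": doc["text"]},
            ]
        }
        dst.write(json.dumps(example) + "\n")
```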

Quality Metrics

The quality scorer evaluates documents based on:

  • Word count: Penalizes very short documents
  • Sentence length: Flags very short (fragments) or very long sentences
  • Vocabulary diversity: Ratio of unique words to total words
  • Boilerplate ratio: Detection of common web boilerplate patterns
  • Character composition: Penalizes excessive uppercase, digits, or special characters

Documents with scores below min_quality_score are automatically filtered out.
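
The exact formula is not published, but a toy heuristic combining the same signals gives a feel for how these scores behave:

```python
import re

def quality_score(text: str) -> float:
    """Toy heuristic combining the signals above; the actor's actual
    formula is not published. Returns a score in [0, 1]."""
    words = text.split()
    if len(words) < 20:
        return 0.0  # very short documents are penalized outright
    diversity = len({w.lower() for w in words}) / len(words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = len(words) / max(len(sentences), 1)
    length_ok = 1.0 if 5 <= avg_len <= 40 else 0.5  # flag fragments or run-ons
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return round(0.4 * diversity + 0.3 * length_ok + 0.3 * alpha, 3)

# Repetition drags vocabulary diversity down, so this scores only ~0.67
print(quality_score("The quick brown fox jumps over the lazy dog. " * 5))
```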

Example Input

Crawl Python Documentation

```json
{
  "start_urls": [
    { "url": "https://docs.python.org/3/tutorial/" }
  ],
  "crawl_mode": "same_subdomain",
  "max_pages": 500,
  "content_selectors": [".document", ".body"],
  "exclude_selectors": [".sphinxsidebar", ".related", "footer"],
  "output_format": "jsonl",
  "chunk_documents": true,
  "chunk_size": 1024
}
```

Build Knowledge Base from Blog

```json
{
  "start_urls": [
    { "url": "https://example.com/blog/" }
  ],
  "crawl_mode": "same_domain",
  "max_pages": 100,
  "content_selectors": ["article", ".post-content"],
  "quality_filter": true,
  "min_quality_score": 0.6,
  "deduplicate": true,
  "output_format": "parquet"
}
```

BYOD: Process Your Own Documents

```json
{
  "documents": [
    "This is a plain text document that will be processed...",
    {
      "text": "This document has metadata attached to it...",
      "source_id": "doc_001",
      "metadata": {
        "title": "My Document",
        "author": "John Doe",
        "language": "en"
      }
    }
  ],
  "deduplicate": true,
  "quality_filter": true,
  "min_quality_score": 0.5,
  "output_format": "jsonl"
}
```

BYOD: Clean Existing Dataset

```json
{
  "documents": [
    { "text": "First document from your dataset..." },
    { "text": "Second document from your dataset..." },
    { "text": "Third document from your dataset..." }
  ],
  "byod_text_field": "text",
  "deduplicate": true,
  "dedup_threshold": 0.85,
  "chunk_documents": true,
  "chunk_size": 512,
  "output_format": "jsonl"
}
```
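
Beyond the Apify console, the actor can be run programmatically like any other actor, for example with the official Apify Python client (replace the token and actor ID placeholders with your own):

```python
# Requires: pip install apify-client
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("<ACTOR_ID>").call(run_input={
    "start_urls": [{"url": "https://docs.python.org/3/tutorial/"}],
    "crawl_mode": "same_subdomain",
    "max_pages": 20,          # start small, as recommended below
    "output_format": "jsonl",
})

# Each item is one cleaned document in the schema shown above
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["quality_score"], item["source_url"])
```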

Tips for Best Results

  1. Use specific content selectors: Precise CSS selectors tailored to your target site yield much cleaner extraction
  2. Set appropriate word counts: Use min_word_count to filter out navigation pages and indexes
  3. Enable deduplication: Prevents training on repetitive content (common on content farms)
  4. Adjust the quality threshold: Lower it for technical content, raise it for prose
  5. Use chunking for long documents: Keeps examples within your model's training context window
  6. Start small: Test with max_pages: 20 before launching large crawls

Pricing

  • $0.01 per document - charged for each cleaned document (both crawled and BYOD)

Additional costs:

  • Proxy: ~$0.001-0.005 per request (if enabled)
  • Storage: ~$0.0001 per document

Support