AI Training Data Curator

Curate high-quality training datasets for AI/ML models. Extract, clean & format text data from websites, papers & forums. Perfect for LLM training, RAG systems & research.


Scrape and curate high-quality training data for AI/ML models from legally reusable sources: public domain, Creative Commons, and open-access content.

Why This Actor?

Building AI models requires massive amounts of clean training data. This actor helps you:

  • Legally source training data from permissive sources
  • Clean and normalize text for model training
  • Remove duplicates to improve data quality
  • Export in AI-ready formats (JSONL, CSV)
  • Track provenance with full metadata

Supported Sources

| Source | License | Best For |
| --- | --- | --- |
| Wikipedia | CC-BY-SA 4.0 | General knowledge, encyclopedic content |
| arXiv | CC-BY / CC0 | Academic papers, research abstracts |
| Project Gutenberg | Public Domain | Classic literature, books |
| PubMed Central | Open Access | Medical/scientific papers |
| CourtListener | Public Domain | Legal documents, court opinions |
| GovInfo.gov | Public Domain | US government documents |
| Stack Overflow | CC-BY-SA 4.0 | Technical Q&A, programming |
| Wikimedia Commons | Various CC | Image descriptions, captions |
| Common Crawl | Mixed | Web content (verify licenses) |

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataType | string | "articles" | Type: articles, academic, legal, technical, conversational |
| source | string | "wikipedia" | Data source (see table above) |
| topic | string | "" | Topic filter (e.g., "machine learning") |
| language | string | "en" | Target language (en, de, fr, es, ru, zh, ja, any) |
| outputFormat | string | "jsonl" | Output format: jsonl, csv, json |
| maxItems | integer | 1000 | Maximum items to collect |
| minWordCount | integer | 100 | Minimum words per document |
| maxWordCount | integer | 0 | Maximum words (0 = no limit) |
| cleanText | boolean | true | Clean HTML, normalize whitespace |
| removeDuplicates | boolean | true | Filter near-duplicate content |
| includeMetadata | boolean | true | Include source metadata |
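
For programmatic runs, the same parameters can be passed through the official apify-client package. A minimal sketch; the token and actor ID below are placeholders, so use the values from your Apify console and this page:

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "dataType": "articles",
    "source": "wikipedia",
    "topic": "machine learning",
    "language": "en",
    "maxItems": 500,
    "outputFormat": "jsonl",
}

# Actor ID is a placeholder -- replace with the ID shown on this page
run = client.actor("vhub-systems/ai-training-data-curator").call(run_input=run_input)

# Stream the collected items from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["wordCount"])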

Output Format

Each item includes:

{
  "text": "The cleaned content text...",
  "source": "Wikipedia",
  "url": "https://en.wikipedia.org/wiki/...",
  "topic": "machine learning",
  "wordCount": 1523,
  "language": "en",
  "license": "CC-BY-SA 4.0",
  "author": "Various",
  "title": "Article Title",
  "scrapedAt": "2024-01-15T10:30:00.000Z",
  "dataType": "articles"
}
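
A quick way to sanity-check a downloaded JSONL file against this schema is to tally a couple of its fields (field names as shown above):

import json
from collections import Counter

licenses = Counter()
total_words = 0
count = 0
with open('output.jsonl', 'r') as f:
    for line in f:
        item = json.loads(line)
        licenses[item["license"]] += 1
        total_words += item["wordCount"]
        count += 1

print(f"{count} items, {total_words / max(count, 1):.0f} words on average")
print(licenses.most_common())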

Example Usage

Academic Papers (arXiv)

{
  "dataType": "academic",
  "source": "arxiv",
  "topic": "transformer neural networks",
  "maxItems": 5000,
  "outputFormat": "jsonl"
}

Legal Documents (CourtListener)

{
  "dataType": "legal",
  "source": "courtlistener",
  "topic": "intellectual property",
  "maxItems": 1000,
  "minWordCount": 500
}

Technical Q&A

{
  "dataType": "technical",
  "source": "stackoverflow",
  "topic": "python",
  "maxItems": 10000,
  "language": "en"
}

Data Quality Features

Text Cleaning

  • Removes HTML tags and formatting
  • Normalizes whitespace
  • Fixes encoding issues
  • Removes control characters
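
The actor's internal cleaning code isn't published; a comparable pass using only the Python standard library looks like this (a sketch, not the actor's implementation):

import html
import re
import unicodedata

def clean_text(raw: str) -> str:
    text = re.sub(r'<[^>]+>', ' ', raw)        # strip HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = unicodedata.normalize('NFC', text)  # repair common encoding artifacts
    # drop control characters, keeping ordinary whitespace
    text = ''.join(ch for ch in text
                   if ch in '\n\t ' or not unicodedata.category(ch).startswith('C'))
    return re.sub(r'\s+', ' ', text).strip()   # normalize whitespace

print(clean_text('<p>Hello&nbsp;&amp;\x00 world</p>'))  # -> 'Hello & world'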

Deduplication

  • Uses content hashing to detect duplicates
  • Catches near-duplicates (same first 1000 chars)
  • Configurable via removeDuplicates parameter
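
The near-duplicate rule above (hashing the first 1000 characters) is easy to reproduce when post-processing the actor's output; a sketch:

import hashlib
import json

seen, kept, total = set(), [], 0
with open('output.jsonl', 'r') as f:
    for line in f:
        total += 1
        item = json.loads(line)
        # hash only the first 1000 chars so copies with different tails still collide
        key = hashlib.sha256(item["text"][:1000].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(item)

print(f"kept {len(kept)} of {total} items")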

Language Detection

  • Automatic language detection
  • Filters content by target language
  • Supports: en, de, fr, es, ru, zh, ja
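
Which detector the actor uses isn't specified. If you want to re-check languages on your side, the langdetect package produces comparable two-letter labels:

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is probabilistic; seed it for stable labels

def is_language(text: str, lang: str = "en") -> bool:
    # classifying a 500-char sample is this sketch's speed/accuracy trade-off
    return detect(text[:500]) == lang

print(is_language("Transformers are a neural network architecture."))  # True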

Word Count Filtering

  • Set minimum/maximum word counts
  • Filter out too-short snippets
  • Avoid overly long documents
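
The same bounds can also be applied after the fact, mirroring the minWordCount/maxWordCount convention from the parameter table:

def within_bounds(item: dict, min_words: int = 100, max_words: int = 0) -> bool:
    # max_words == 0 means "no upper limit", as in the actor's input schema
    n = item["wordCount"]
    return n >= min_words and (max_words == 0 or n <= max_words)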

Licensing

All supported sources are either:

  • Public Domain - No copyright restrictions
  • Creative Commons - Permissive reuse licenses (CC-BY, CC-BY-SA, CC0)
  • Open Access - Explicitly allow reuse

Always verify the specific license for your use case. Some sources (like Common Crawl) contain mixed-license content.
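
One practical way to enforce this is an allow-list over the license field in the output. A sketch; the license strings here are examples taken from the tables above, so adjust them to what your run actually returns and to your own legal requirements:

import json

ALLOWED = {"Public Domain", "CC0", "CC-BY", "CC-BY-SA 4.0", "Open Access"}

with open('output.jsonl') as f, open('vetted.jsonl', 'w') as out:
    for line in f:
        if json.loads(line).get("license") in ALLOWED:
            out.write(line)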

Compliance Tips

  1. Attribution: CC-BY licenses require attribution
  2. Share-Alike: CC-BY-SA requires that derivative works use the same license
  3. Documentation: Keep provenance metadata for audit trails
  4. Regional Laws: Check local data protection regulations (GDPR, etc.)
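
For tip 1, the provenance metadata shown earlier is enough to generate an attribution file automatically; a sketch:

import json

with open('output.jsonl') as f, open('ATTRIBUTION.txt', 'w') as out:
    for line in f:
        item = json.loads(line)
        if item["license"].startswith("CC-BY"):
            out.write(f'"{item["title"]}" by {item["author"]} '
                      f'({item["url"]}), licensed {item["license"]}\n')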

Rate Limits

The actor respects rate limits for each source:

  • arXiv: 3-second delays between requests
  • PubMed: 1-second delays
  • Stack Exchange: Built-in throttling
  • Wikipedia: Concurrent request limits
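
If you build your own follow-up fetchers against the same sources, a per-source delay table reproduces the arXiv and PubMed behavior above; the 1-second default for other sources is this sketch's assumption, not the actor's rule:

import time
import requests

SOURCE_DELAYS = {"arxiv": 3.0, "pubmed": 1.0}  # seconds, per the list above

def polite_get(session: requests.Session, url: str, source: str) -> requests.Response:
    time.sleep(SOURCE_DELAYS.get(source, 1.0))  # assumed default for unlisted sources
    return session.get(url, timeout=30)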

Tips for Best Results

  1. Be Specific: Use topic parameter to focus on relevant content
  2. Set Word Limits: Filter out short/long content with word count params
  3. Use JSONL: Best format for streaming into ML pipelines
  4. Enable Metadata: Useful for filtering and analysis later
  5. Start Small: Test with 100-1000 items before large runs

Integration Examples

Python (Hugging Face Datasets)

from datasets import Dataset
import json

# Load the JSONL output produced by the actor
data = []
with open('output.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

dataset = Dataset.from_list(data)
dataset.push_to_hub("my-training-data")

OpenAI Fine-tuning

# Convert to OpenAI chat fine-tuning format
with open('openai_training.jsonl', 'w') as out:
    for item in data:
        out.write(json.dumps({
            "messages": [
                {"role": "user", "content": item['title'] or "Summarize:"},
                {"role": "assistant", "content": item['text']}
            ]
        }) + '\n')

Changelog

v1.0.0

  • Initial release
  • Support for 10+ data sources
  • Text cleaning and deduplication
  • Multiple output formats
  • Language detection and filtering

Support

For issues or feature requests, please open an issue on the actor's GitHub repository or contact the author.

License

MIT License - Free for commercial and non-commercial use.