AI Training Data Curator

Curate high-quality training datasets for AI/ML models. Extract, clean, and format text data from websites, papers, and forums. Perfect for LLM training, RAG systems, and research.
Scrape and curate high-quality training data for AI/ML models from legal, public domain, and Creative Commons sources.
Why This Actor?
Building AI models requires massive amounts of clean training data. This actor helps you:
- Legally source training data from permissive sources
- Clean and normalize text for model training
- Remove duplicates to improve data quality
- Export in AI-ready formats (JSONL, CSV)
- Track provenance with full metadata
Supported Sources
| Source | License | Best For |
|---|---|---|
| Wikipedia | CC-BY-SA 4.0 | General knowledge, encyclopedic content |
| arXiv | CC-BY / CC0 | Academic papers, research abstracts |
| Project Gutenberg | Public Domain | Classic literature, books |
| PubMed Central | Open Access | Medical/scientific papers |
| CourtListener | Public Domain | Legal documents, court opinions |
| GovInfo.gov | Public Domain | US government documents |
| Stack Overflow | CC-BY-SA 4.0 | Technical Q&A, programming |
| Wikimedia Commons | Various CC | Image descriptions, captions |
| Common Crawl | Mixed | Web content (verify licenses) |
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataType` | string | `"articles"` | Type: `articles`, `academic`, `legal`, `technical`, or `conversational` |
| `source` | string | `"wikipedia"` | Data source (see table above) |
| `topic` | string | `""` | Topic filter (e.g., `"machine learning"`) |
| `language` | string | `"en"` | Target language (`en`, `de`, `fr`, `es`, `ru`, `zh`, `ja`, `any`) |
| `outputFormat` | string | `"jsonl"` | Output format: `jsonl`, `csv`, or `json` |
| `maxItems` | integer | `1000` | Maximum items to collect |
| `minWordCount` | integer | `100` | Minimum words per document |
| `maxWordCount` | integer | `0` | Maximum words (`0` = no limit) |
| `cleanText` | boolean | `true` | Clean HTML, normalize whitespace |
| `removeDuplicates` | boolean | `true` | Filter near-duplicate content |
| `includeMetadata` | boolean | `true` | Include source metadata |
Output Format
Each item includes:
```json
{
  "text": "The cleaned content text...",
  "source": "Wikipedia",
  "url": "https://en.wikipedia.org/wiki/...",
  "topic": "machine learning",
  "wordCount": 1523,
  "language": "en",
  "license": "CC-BY-SA 4.0",
  "author": "Various",
  "title": "Article Title",
  "scrapedAt": "2024-01-15T10:30:00.000Z",
  "dataType": "articles"
}
```
Example Usage
Academic Papers (arXiv)
```json
{
  "dataType": "academic",
  "source": "arxiv",
  "topic": "transformer neural networks",
  "maxItems": 5000,
  "outputFormat": "jsonl"
}
```
Legal Documents
```json
{
  "dataType": "legal",
  "source": "courtlistener",
  "topic": "intellectual property",
  "maxItems": 1000,
  "minWordCount": 500
}
```
Technical Q&A
```json
{
  "dataType": "technical",
  "source": "stackoverflow",
  "topic": "python",
  "maxItems": 10000,
  "language": "en"
}
```
Data Quality Features
Text Cleaning
- Removes HTML tags and formatting
- Normalizes whitespace
- Fixes encoding issues
- Removes control characters
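A minimal sketch of what such a cleaning pass can look like (the function and exact regexes here are illustrative, not the actor's internal code):

```python
import html
import re

def clean_text(raw: str) -> str:
    """Illustrative cleaning pass: strip tags, decode entities, normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)                    # remove HTML tags
    text = html.unescape(text)                             # fix entities like &amp;
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)   # drop control characters
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace
```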
Deduplication
- Uses content hashing to detect duplicates
- Catches near-duplicates (same first 1000 chars)
- Configurable via the `removeDuplicates` parameter
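For intuition, hashing the first 1,000 characters to catch near-duplicates can work like this sketch (assumed logic, not the actor's exact implementation):

```python
import hashlib

def dedupe(docs: list[dict]) -> list[dict]:
    """Keep the first document seen for each content fingerprint."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        # Hash only the first 1000 characters so lightly padded copies still collide
        fingerprint = hashlib.sha256(doc["text"][:1000].encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique
```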
Language Detection
- Automatic language detection
- Filters content by target language
- Supports: en, de, fr, es, ru, zh, ja
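The actor's detector is not specified; as a stand-in, the same filtering step can be sketched with the third-party `langdetect` package:

```python
# pip install langdetect
from langdetect import detect, LangDetectException

def filter_language(docs: list[dict], target: str = "en") -> list[dict]:
    """Keep documents whose detected language matches the target ("any" keeps all)."""
    kept = []
    for doc in docs:
        try:
            if target == "any" or detect(doc["text"]) == target:
                kept.append(doc)
        except LangDetectException:
            continue  # too short or ambiguous to classify reliably
    return kept
```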
Word Count Filtering
- Set minimum/maximum word counts
- Filter out too-short snippets
- Avoid overly long documents
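The `maxWordCount` semantics are worth spelling out; a sketch of the check, assuming `0` disables the upper bound as documented above:

```python
def within_word_count(text: str, min_words: int = 100, max_words: int = 0) -> bool:
    """Mirror the minWordCount/maxWordCount filters; max_words == 0 means no upper limit."""
    n = len(text.split())
    return n >= min_words and (max_words == 0 or n <= max_words)
```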
Legal Considerations
All supported sources fall into one of these categories:
- Public Domain - No copyright restrictions
- Creative Commons - Permissive reuse licenses (CC-BY, CC-BY-SA, CC0)
- Open Access - Reuse is explicitly allowed
Always verify the specific license for your use case. Some sources (like Common Crawl) contain mixed-license content.
Compliance Tips
- Attribution: CC-BY licenses require attribution
- Share-Alike: CC-BY-SA requires that derivative works use the same license
- Documentation: Keep provenance metadata for audit trails
- Regional Laws: Check local data protection regulations (GDPR, etc.)
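Because every output item carries provenance metadata, an attribution string can be generated directly from it; a sketch against the output schema shown above:

```python
def attribution_line(item: dict) -> str:
    """Build a simple CC-BY style credit line from an item's provenance fields."""
    return (f'"{item["title"]}" by {item["author"]}, via {item["source"]} '
            f'({item["url"]}), licensed under {item["license"]}')
```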
Rate Limits
The actor respects rate limits for each source:
- arXiv: 3-second delays between requests
- PubMed: 1-second delays
- Stack Exchange: Built-in throttling
- Wikipedia: Concurrent request limits
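The actor enforces these delays internally; for intuition, per-source throttling can be approximated like this (delay values taken from the list above, function name illustrative):

```python
import time

SOURCE_DELAYS = {"arxiv": 3.0, "pubmed": 1.0}  # seconds between requests
_last_request: dict[str, float] = {}

def wait_for_slot(source: str) -> None:
    """Sleep just long enough to honor the per-source minimum delay."""
    delay = SOURCE_DELAYS.get(source, 0.0)
    elapsed = time.monotonic() - _last_request.get(source, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    _last_request[source] = time.monotonic()
```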
Tips for Best Results
- Be Specific: Use the `topic` parameter to focus on relevant content
- Set Word Limits: Filter out too-short or too-long content with the word-count parameters
- Use JSONL: Best format for streaming into ML pipelines
- Enable Metadata: Useful for filtering and analysis later
- Start Small: Test with 100-1000 items before large runs
Integration Examples
Python (Hugging Face Datasets)
```python
from datasets import Dataset
import json

# Load JSONL output
data = []
with open('output.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

dataset = Dataset.from_list(data)
dataset.push_to_hub("my-training-data")
```
OpenAI Fine-tuning
```python
# Convert to OpenAI format
with open('openai_training.jsonl', 'w') as out:
    for item in data:
        out.write(json.dumps({
            "messages": [
                {"role": "user", "content": item['title'] or "Summarize:"},
                {"role": "assistant", "content": item['text']}
            ]
        }) + '\n')
```
Changelog
v1.0.0
- Initial release
- Support for 10+ data sources
- Text cleaning and deduplication
- Multiple output formats
- Language detection and filtering
Support
For issues or feature requests, please open an issue on the actor's GitHub repository or contact the author.
License
MIT License - Free for commercial and non-commercial use.