AI Training Data Curator
Pricing
Pay per usage
Crawl websites and extract clean training data for LLMs. Quality scoring, deduplication, PII detection, markdown output. Built for fine-tuning and RAG pipelines.
Maximum number of pages to crawl.
Maximum link-following depth from start URLs.
Minimum text length (characters) to keep a page.
Minimum quality score (0-1) a page must reach to be included in the output.
Detect and flag pages containing personally identifiable information (PII).
Redact detected PII from content (replaces with [EMAIL], [PHONE], etc.).
Output format for extracted content.
Include metadata (URL, title, timestamps) with content.
Skip near-duplicate pages based on content similarity.
URL patterns to exclude (e.g., /login, /cart, /admin).
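
To make the filtering options above more concrete, here is a minimal Python sketch of how URL exclusion, a minimum text length, near-duplicate skipping, and PII redaction might fit together. It is not the tool's actual implementation: the function names, the regular expressions, the substring-based pattern matching, the word-shingle Jaccard similarity, and the default thresholds are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns: illustrative only, real PII detectors are far more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace matches with placeholder tokens; return (redacted_text, found_any)."""
    redacted, n_email = EMAIL_RE.subn("[EMAIL]", text)
    redacted, n_phone = PHONE_RE.subn("[PHONE]", redacted)
    return redacted, (n_email + n_phone) > 0


def is_excluded(url, patterns):
    """True if any exclude pattern occurs as a substring of the URL path (assumed matching rule)."""
    path = urlparse(url).path
    return any(p in path for p in patterns)


def shingles(text, k=5):
    """Set of overlapping k-word sequences used to compare page content."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets, in the range 0 to 1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def curate(pages, exclude, min_length=100, sim_threshold=0.8):
    """Filter (url, text) pairs: exclusion, length, near-duplicates, then redaction."""
    kept, seen = [], []
    for url, text in pages:
        if is_excluded(url, exclude) or len(text) < min_length:
            continue
        page_shingles = shingles(text)
        if any(jaccard(page_shingles, s) >= sim_threshold for s in seen):
            continue  # near-duplicate of a page already kept
        seen.append(page_shingles)
        clean, has_pii = redact_pii(text)
        kept.append({"url": url, "text": clean, "has_pii": has_pii})
    return kept
```

Comparing every page against all previously kept pages is quadratic in the number of pages, which is acceptable for a sketch but would likely be replaced by something like MinHash or SimHash in a crawl of any real size.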