AI Training Data Curator
Pricing
Pay per usage
Crawl websites and extract clean training data for LLMs. Quality scoring, deduplication, PII detection, markdown output. Built for fine-tuning and RAG pipelines.
Maximum number of pages to crawl.
Maximum link-following depth from start URLs.
Minimum text length (characters) to keep a page.
Minimum quality score (0-1) a page must reach to be included in the output.
Detect and flag pages containing personally identifiable information (PII).
Redact detected PII from content (replaces with [EMAIL], [PHONE], etc.).
Output format for extracted content.
Include metadata (URL, title, timestamps) with content.
Skip near-duplicate pages based on content similarity.
URL patterns to exclude (e.g., /login, /cart, /admin).
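To make the filtering options above concrete, here is a minimal Python sketch of how URL exclusion, the minimum text length, near-duplicate skipping, and PII redaction could fit together. It is an illustration under stated assumptions, not the tool's actual implementation: the setting names (`min_text_length`, `exclude_patterns`, `similarity_threshold`, `redact_pii`), the regular expressions, and the word-shingle Jaccard similarity are all hypothetical choices. Quality scoring is left out because the listing does not describe how the score is computed.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical settings; field names are illustrative, not the tool's real input keys.
@dataclass
class CuratorSettings:
    min_text_length: int = 200
    redact_pii: bool = True
    similarity_threshold: float = 0.9  # assumed cutoff for "near-duplicate"
    exclude_patterns: List[str] = field(
        default_factory=lambda: ["/login", "/cart", "/admin"]
    )

# Deliberately simple patterns; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace emails and phone numbers with placeholders; report whether any were found."""
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, (n_email + n_phone) > 0

def is_excluded(url, patterns):
    """Treat each pattern as a plain substring of the URL (an assumption)."""
    return any(p in url for p in patterns)

def shingles(text, k=3):
    """Set of overlapping k-word sequences used to compare page content."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets, in the range 0..1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def curate(pages, settings):
    """Filter (url, text) pairs and return the kept records as dicts."""
    kept, seen = [], []
    for url, text in pages:
        if is_excluded(url, settings.exclude_patterns):
            continue
        if len(text) < settings.min_text_length:
            continue
        page_shingles = shingles(text)
        if any(jaccard(page_shingles, s) >= settings.similarity_threshold for s in seen):
            continue  # near-duplicate of a page that was already kept
        seen.append(page_shingles)
        has_pii = False
        if settings.redact_pii:
            text, has_pii = redact_pii(text)
        kept.append({"url": url, "text": text, "pii_detected": has_pii})
    return kept
```

Comparing every page against all previously kept pages is quadratic in the number of pages, which is fine for a sketch; a larger crawl would more likely use a technique such as MinHash with locality-sensitive hashing.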