AI Training Data Curator
Pricing
from $10.00 / 1,000 pages curated
Crawl websites to extract quality-scored, deduplicated text for LLM fine-tuning and RAG. Built-in PII detection, content fingerprinting, and JSONL/Markdown/plain output formats.
Developer
ryan clinton
Crawl any website and extract clean, structured text data ready for LLM fine-tuning, RAG pipelines, and AI model training. This actor handles the entire curation pipeline in a single run -- crawling pages, stripping boilerplate HTML, scoring content quality across six weighted factors, deduplicating near-identical pages via trigram fingerprinting, scanning for personally identifiable information, and exporting results in your choice of JSONL, Markdown, or plain text. Whether you are building a domain-specific corpus for GPT fine-tuning or populating a vector database for retrieval-augmented generation, AI Training Data Curator turns raw websites into production-ready training data with no manual cleanup required.
Why use AI Training Data Curator?
Building high-quality training datasets from web content is tedious and error-prone. You have to strip navigation chrome and ad blocks, filter out thin pages with no real substance, detect and handle duplicate content that inflates dataset size without adding value, and scan for PII that could create compliance issues downstream. Most teams cobble this together with fragile Python scripts, custom regex, and manual spot-checks -- a process that breaks every time a site changes its layout and scales poorly beyond a few hundred pages.
AI Training Data Curator solves all of these problems in a single configurable actor. It uses priority-ordered CSS selectors to find the main content area on any page layout, applies a six-factor quality scoring model to filter out low-value pages automatically, runs trigram-based fingerprinting to catch near-duplicate content even when URLs differ, and detects five categories of PII with optional automatic redaction. You get a clean, scored, deduplicated dataset with rich metadata -- ready to feed directly into your fine-tuning job, embedding pipeline, or vector store -- without writing a single line of preprocessing code.
Key features
- Intelligent noise removal -- strips 21 categories of boilerplate HTML including navigation bars, headers, footers, sidebars, ads, cookie banners, modals, comments, and widget containers before extracting content
- Priority-ordered content selection -- tries 12 CSS selectors in priority order (from `main article` down to `[role="main"]`) to isolate the actual content area, with a `body` fallback for unconventional layouts
- Six-factor quality scoring -- every page receives a 0-to-1 quality score based on content length, text-to-HTML ratio, paragraph structure, sentence quality, vocabulary diversity, and metadata completeness
- Trigram-based deduplication -- generates content fingerprints from the first 500 characters using sorted trigram hashes, flagging pages with 80%+ similarity as duplicates per domain
- PII detection and redaction -- scans for email addresses, US phone numbers, Social Security Numbers, credit card numbers, and IP addresses with optional automatic redaction to placeholder tokens like `[EMAIL]` and `[PHONE]`
- HTML-to-Markdown conversion -- converts headers (h1-h6), code blocks, inline code, bold, italic, links, and lists into clean Markdown formatting while collapsing excessive whitespace
- Three output formats -- export as JSONL with full metadata fields, Markdown with YAML frontmatter, or stripped plain text depending on your downstream pipeline
- Configurable crawl scope -- control maximum pages (up to 10,000), crawl depth (up to 20 levels), minimum content length, quality score threshold, and URL exclusion patterns
- Rich per-page metadata -- each output record includes title, description, author, published date, language, word count, content hash, crawl depth, and scrape timestamp
- Proxy support -- use Apify datacenter or residential proxies, or provide custom proxy configuration for geo-restricted or rate-limited sites
How to use AI Training Data Curator
Using Apify Console
- Navigate to the actor -- go to AI Training Data Curator on Apify and click "Try for free" or "Start".
- Enter your start URLs -- add one or more website URLs in the Start URLs field. The actor follows same-origin internal links automatically, so a single homepage URL often covers an entire site.
- Configure crawl limits and quality thresholds -- set the maximum pages, crawl depth, minimum content length, and minimum quality score. For a typical documentation site, 100-500 pages at depth 3 with a 0.3 quality threshold works well.
- Set PII and output options -- enable PII detection to flag pages containing personal data, optionally enable PII removal to redact with placeholder tokens, and choose your preferred output format (JSONL, Markdown, or plain text).
- Run and export -- click "Start" and wait for the run to complete. Download your curated dataset from the Dataset tab as JSON, CSV, JSONL, XML, or Excel. Feed the results directly into your fine-tuning script, vector database, or data pipeline.
Using the API
You can start the actor programmatically via the Apify API, Python SDK, or JavaScript SDK. See the API & Integration section below for complete code examples in Python, JavaScript, and cURL.
Input parameters
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `startUrls` | string[] | Yes | -- | Starting URLs to crawl and extract training data from |
| `maxPages` | integer | No | 100 | Maximum number of pages to crawl (1--10,000) |
| `maxCrawlDepth` | integer | No | 3 | Maximum link-following depth from start URLs (0--20) |
| `minContentLength` | integer | No | 200 | Minimum text length in characters to keep a page |
| `minQualityScore` | number | No | 0.3 | Minimum quality score (0--1) to include in output |
| `detectPII` | boolean | No | true | Detect and flag pages containing personally identifiable information |
| `removePII` | boolean | No | false | Redact detected PII with placeholder tokens like `[EMAIL]`, `[PHONE]` |
| `outputFormat` | string | No | "jsonl" | Output format: `jsonl`, `markdown`, or `plain` |
| `includeMetadata` | boolean | No | true | Include metadata (URL, title, timestamps) with extracted content |
| `deduplicateContent` | boolean | No | true | Skip near-duplicate pages based on trigram content similarity |
| `excludePatterns` | string[] | No | [] | URL patterns to exclude (e.g., `/login`, `/cart`, `/admin`) |
| `proxy` | object | No | -- | Proxy configuration for crawling |
Example input
```json
{
  "startUrls": ["https://docs.example.com", "https://blog.example.com"],
  "maxPages": 500,
  "maxCrawlDepth": 3,
  "minContentLength": 300,
  "minQualityScore": 0.5,
  "detectPII": true,
  "removePII": true,
  "outputFormat": "jsonl",
  "includeMetadata": true,
  "deduplicateContent": true,
  "excludePatterns": ["/login", "/signup", "/admin", "/tag/", "/page/"]
}
```
Tips for input
- Start small -- run with 20-50 pages first to verify content quality and tune thresholds before launching a full crawl
- Raise quality score for fine-tuning -- set `minQualityScore` to 0.5 or higher when building LLM training corpora to ensure only well-structured, substantive content passes the filter
- Use exclude patterns generously -- add paths like `/login`, `/signup`, `/cart`, `/admin`, `/tag/`, `/page/` to filter out authentication pages, shopping cart pages, and paginated archive listings
- Depth 0 for curated lists -- set `maxCrawlDepth` to 0 if you provide an explicit list of URLs and do not want the actor to follow any links
- Combine PII detection with removal -- enable both `detectPII` and `removePII` for production datasets to reduce compliance risk while still tracking which PII types were found
Output
Each crawled page that passes the quality and length filters produces one output record. Below is a realistic example of a single output item.
```json
{
  "url": "https://docs.example.com/guides/getting-started",
  "title": "Getting Started Guide - Example Docs",
  "description": "Learn how to set up and configure Example in under 5 minutes.",
  "author": "Jane Smith",
  "publishedDate": "2024-11-15T10:30:00Z",
  "language": "en",
  "content": "# Getting Started Guide\n\nThis guide walks you through setting up Example from scratch. You will install the CLI, configure your project, and deploy your first application in under five minutes.\n\n## Prerequisites\n\nBefore you begin, make sure you have the following installed:\n\n- Node.js 18 or later\n- npm or yarn package manager\n- A free Example account\n\n## Installation\n\nInstall the Example CLI globally using npm:\n\n```\nnpm install -g @example/cli\n```\n\nVerify the installation by running:\n\n```\nexample --version\n```\n\n## Creating Your First Project\n\nRun the init command to scaffold a new project:\n\n```\nexample init my-project\ncd my-project\n```\n\nThis creates a project directory with the default configuration files and a sample application. Open `example.config.js` to customize your settings.\n\n## Deploying\n\nWhen you are ready, deploy with a single command:\n\n```\nexample deploy\n```\n\nYour application will be live at `https://my-project.example.com` within seconds.",
  "contentLength": 847,
  "wordCount": 138,
  "qualityScore": 0.792,
  "qualityFactors": {
    "contentLength": 0.15,
    "textToHtmlRatio": 0.213,
    "paragraphCount": 0.12,
    "sentenceQuality": 0.13,
    "vocabularyDiversity": 0.079,
    "metadataPresent": 0.1
  },
  "piiDetected": false,
  "piiTypes": [],
  "isDuplicate": false,
  "duplicateOf": null,
  "metadata": {
    "crawlDepth": 1,
    "scrapedAt": "2025-01-20T14:32:17.445Z",
    "contentHash": "a3f2c1b8"
  }
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | The final loaded URL of the crawled page |
| `title` | string | Page title extracted from `<title>`, Open Graph tags, or first `<h1>` |
| `description` | string or null | Meta description from `<meta name="description">` or Open Graph |
| `author` | string or null | Author from `<meta name="author">`, `[rel="author"]`, or author CSS class |
| `publishedDate` | string or null | ISO 8601 publish date from article meta tags or `<time>` elements |
| `language` | string or null | Language code from the `<html lang>` attribute |
| `content` | string | Cleaned, formatted text content (format depends on the `outputFormat` setting) |
| `contentLength` | integer | Character count of the cleaned content |
| `wordCount` | integer | Word count of the cleaned content |
| `qualityScore` | number | Composite quality score from 0 to 1 |
| `qualityFactors` | object | Breakdown of the six individual quality factor scores |
| `piiDetected` | boolean | Whether any PII patterns were found in the content |
| `piiTypes` | string[] | List of PII types detected (e.g., `["email", "phone"]`) |
| `isDuplicate` | boolean | Whether the page was flagged as a near-duplicate (always false in output, since duplicates are skipped) |
| `duplicateOf` | string or null | URL of the original page if a duplicate was detected |
| `metadata` | object | Crawl metadata including `crawlDepth`, `scrapedAt` timestamp, and `contentHash` |
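The output-field schema above maps directly onto simple post-processing. A minimal sketch for filtering a downloaded JSONL export locally (the filename and the 0.5 threshold here are illustrative choices, not actor defaults):

```python
import json

def load_curated(path, min_quality=0.5):
    """Load a JSONL export, keeping high-quality records without PII.

    Assumes one JSON object per line with the qualityScore and
    piiDetected fields documented above.
    """
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["qualityScore"] >= min_quality and not record["piiDetected"]:
                kept.append(record)
    return kept
```

Because filtering happens after download, you can rerun with a different threshold without re-crawling.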
Use cases
- LLM fine-tuning datasets -- crawl documentation sites, technical blogs, or niche knowledge bases to build domain-specific corpora for fine-tuning GPT, LLaMA, Mistral, Claude, or other large language models
- RAG pipeline ingestion -- extract and clean website content to populate vector databases like Pinecone, Weaviate, ChromaDB, or Qdrant for retrieval-augmented generation workflows
- Knowledge base construction -- convert sprawling company wikis, help centers, or support documentation into structured, deduplicated text for internal AI assistants
- Academic NLP research -- collect structured text corpora from institutional websites, open-access journals, or government portals for computational linguistics and natural language processing experiments
- Content quality auditing -- use the six-factor quality scoring breakdown to benchmark content depth, vocabulary richness, and structural quality across competitor sites or your own properties
- PII compliance screening -- audit web-scraped datasets for personally identifiable information before using them in AI training, or automatically redact PII during extraction to meet privacy requirements
- Dataset deduplication -- clean up existing web crawl outputs by running them through the trigram fingerprinting pipeline to identify and remove near-duplicate pages that inflate dataset size
- Competitive intelligence corpus -- build structured datasets from competitor documentation, product pages, and blog content for market analysis and strategic planning
- Open-source training data -- crawl publicly available government websites, Wikipedia sections, or Creative Commons content to assemble openly licensed training datasets
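For the fine-tuning use cases above, curated records still need reshaping into your trainer's example format. A hedged sketch converting records into chat-style examples -- the system prompt, the user-message template, and the OpenAI-style `messages` layout are illustrative assumptions, not part of the actor's output:

```python
import json

def to_finetune_examples(records, system_prompt="You are a documentation assistant."):
    """Yield chat-style training examples built from curated records.

    The prompt wording and message layout are placeholders; adapt them
    to whatever your training framework expects.
    """
    for record in records:
        yield {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Explain: {record['title']}"},
                {"role": "assistant", "content": record["content"]},
            ]
        }

def write_finetune_jsonl(records, path):
    """Write one training example per line, ready for upload."""
    with open(path, "w", encoding="utf-8") as f:
        for example in to_finetune_examples(records):
            f.write(json.dumps(example) + "\n")
```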
API & Integration
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run_input = {
    "startUrls": ["https://docs.example.com"],
    "maxPages": 500,
    "maxCrawlDepth": 3,
    "minQualityScore": 0.5,
    "detectPII": True,
    "removePII": True,
    "outputFormat": "jsonl",
    "deduplicateContent": True,
}

run = client.actor("1cYb1W8Ik1Vk4hTcW").call(run_input=run_input)

dataset_items = client.dataset(run["defaultDatasetId"]).list_items().items
for item in dataset_items:
    print(f"{item['title']} -- quality: {item['qualityScore']}, words: {item['wordCount']}")
```
JavaScript
```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("1cYb1W8Ik1Vk4hTcW").call({
  startUrls: ["https://docs.example.com"],
  maxPages: 500,
  maxCrawlDepth: 3,
  minQualityScore: 0.5,
  detectPII: true,
  removePII: true,
  outputFormat: "jsonl",
  deduplicateContent: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
  console.log(`${item.title} -- quality: ${item.qualityScore}, words: ${item.wordCount}`);
});
```
cURL
```bash
# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/1cYb1W8Ik1Vk4hTcW/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": ["https://docs.example.com"], "maxPages": 500, "maxCrawlDepth": 3, "minQualityScore": 0.5, "detectPII": true, "removePII": true, "outputFormat": "jsonl"}'

# Retrieve results (replace DATASET_ID with the actual dataset ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"
```
Integrations
- Apify API -- trigger runs and retrieve datasets programmatically via REST endpoints
- Python SDK -- call from training scripts, Jupyter notebooks, or data pipelines using `apify-client`
- JavaScript SDK -- integrate with Node.js ETL pipelines using `apify-client`
- Zapier -- trigger crawls from events and route curated data to Google Sheets, Airtable, or Slack
- Make (Integromat) -- build automated workflows piping curated data to downstream systems
- Google Sheets -- export datasets for manual review, labeling, or annotation
- Webhooks -- receive POST notifications at your endpoint when a run completes
How it works
AI Training Data Curator processes web content through a six-stage pipeline.
1. Crawl -- the CheerioCrawler visits each start URL and follows same-origin internal links up to the configured `maxCrawlDepth`. It runs with 10 concurrent requests, a 60-second handler timeout, and a 30-second navigation timeout. URLs matching `excludePatterns` are skipped.
2. Extract -- for each page, 21 noise selectors remove navigation bars, headers, footers, sidebars, ads, cookie banners, modals, comments, and widgets. The actor then tries 12 content selectors in priority order to isolate the main content area, falling back to `<body>` if none match.
3. Convert -- the extracted HTML is converted to clean Markdown-formatted text. Headers (h1-h6), code blocks, inline code, bold, italic, links, and lists are preserved as Markdown syntax. Excessive whitespace and blank lines are collapsed.
4. Score -- each page receives a quality score from 0 to 1 based on six weighted factors: content length (0--0.25), text-to-HTML ratio (0--0.25), paragraph count (0--0.15), sentence quality (0--0.15), vocabulary diversity (0--0.10), and metadata completeness (0--0.10). Pages below `minQualityScore` are discarded.
5. Deduplicate -- trigram fingerprints are generated from the first 500 characters of each page. The top 20 sorted trigram hashes form each page's fingerprint. Pages with 80%+ fingerprint overlap against already-processed pages from the same domain are flagged as duplicates and skipped.
6. PII scan and output -- if enabled, five regex patterns scan for emails, phone numbers, SSNs, credit card numbers, and IP addresses. Detected PII is either flagged or redacted with placeholder tokens. The final content is formatted according to the chosen output format and pushed to the dataset with full metadata.
AI Training Data Curator Pipeline

```
+----------+   +-----------+   +-----------+   +-----------+
|  CRAWL   |-->|  EXTRACT  |-->|  CONVERT  |-->|   SCORE   |
| Start    |   | Remove 21 |   | HTML to   |   | 6-factor  |
| URLs +   |   | noise     |   | Markdown  |   | quality   |
| follow   |   | selectors |   | text      |   | 0-to-1    |
| links    |   | + find    |   |           |   | filter    |
+----------+   | main      |   +-----------+   +-----------+
               | content   |                         |
               +-----------+                         v
+-----------+   +----------+   +-----------+   +-----------+
| Dataset   |<--|  OUTPUT  |<--| PII SCAN  |<--|   DEDUP   |
| with full |   | JSONL /  |   | Detect or |   | Trigram   |
| metadata  |   | Markdown |   | redact 5  |   | finger-   |
+-----------+   | / Plain  |   | PII types |   | printing  |
                +----------+   +-----------+   +-----------+
```
Performance & cost
| Scenario | Pages | Estimated time | Estimated cost |
|---|---|---|---|
| Small documentation site | 50 | ~1 minute | Free tier |
| Medium blog or knowledge base | 500 | ~5 minutes | ~$0.05 |
| Large documentation portal | 2,000 | ~15 minutes | ~$0.15 |
| Enterprise multi-site crawl | 10,000 | ~60 minutes | ~$0.75 |
The actor uses 512 MB memory by default. The Apify Free plan includes $5/month of platform credits, which is enough for thousands of pages per month. CheerioCrawler (server-side HTML parsing) is significantly faster and cheaper than browser-based crawling since it does not render JavaScript or load images, stylesheets, or fonts. Actual costs depend on page size, proxy usage, and the number of pages that pass quality filters.
Limitations
- No JavaScript rendering -- the actor uses CheerioCrawler, which parses raw HTML without executing JavaScript. Single-page applications built with React, Angular, Vue, or similar frameworks may yield little or no content. For JS-heavy sites, pre-render the pages with a browser-based scraper first.
- US-format phone detection only -- the phone number PII pattern is tuned for US phone formats (e.g., `(555) 123-4567`, `+1-555-123-4567`). International phone formats with different digit groupings may not be detected.
- English-centric sentence scoring -- the sentence quality factor assumes English-style punctuation (periods, exclamation marks, question marks) for sentence boundary detection. Content in languages with different sentence structures may receive inaccurate sentence quality scores.
- First-500-character fingerprinting -- deduplication fingerprints are generated from only the first 500 characters of content. Pages that share an identical introduction but diverge significantly afterward may be incorrectly flagged as duplicates.
- No image or table extraction -- the actor extracts text content only. Images, charts, diagrams, and complex HTML tables are not included in the output.
- Same-origin link following -- the crawler only follows links within the same origin as each start URL. Cross-domain links are not followed, even if they point to related content.
- 10,000 page maximum -- the `maxPages` parameter caps at 10,000 pages per run. For larger crawls, split across multiple runs with different start URLs.
Responsible use
- Respect robots.txt and terms of service -- always verify that the websites you crawl permit automated access. The actor follows standard HTTP conventions, but compliance with a site's terms of use is your responsibility.
- Avoid overloading target servers -- the actor runs with 10 concurrent requests by default. For small or fragile servers, reduce the crawl scope or add a proxy to distribute load across IP addresses.
- Handle PII responsibly -- if your training dataset may contain personal information, enable both `detectPII` and `removePII` to redact sensitive data before using the dataset in model training. Review flagged PII types and consider manual inspection for high-sensitivity use cases.
- Attribute content sources -- the output includes the source URL and metadata for every page. When using extracted content for AI training or publication, respect the original content's copyright and licensing terms.
- Review quality before training -- automated quality scoring filters out low-value pages, but it is not a substitute for human review. Spot-check your curated dataset to verify that content quality, accuracy, and relevance meet your requirements before using it to train models.
FAQ
What types of websites work best?
Documentation sites, technical blogs, knowledge bases, news archives, government portals, and content-heavy websites with well-structured HTML produce the best results. The actor excels at sites where content is delivered as server-rendered HTML rather than loaded dynamically via JavaScript.
How does the quality scoring system work?
Each page is scored from 0 to 1 based on six weighted factors: content length (0--0.25, full score at 2,000+ characters), text-to-HTML ratio (0--0.25), paragraph count (0--0.15, full score at 10+ substantial paragraphs), sentence quality (0--0.15, ideal range of 10--25 words per sentence), vocabulary diversity (0--0.10, ratio of unique to total words), and metadata completeness (0--0.10, based on presence of title, description, and author).
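The weighting can be approximated in code. The factor caps in this sketch match the documented weights, but the scaling inside each factor is an educated guess -- the actor's exact internal formulas are not published:

```python
def quality_score(text, html_length, title=None, description=None, author=None):
    """Approximate the documented six-factor quality score.

    Factor caps (0.25/0.25/0.15/0.15/0.10/0.10) follow the docs;
    the per-factor scaling below is illustrative, not the actor's code.
    """
    words = text.split()
    paragraphs = [p for p in text.split("\n\n") if len(p) > 100]
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]

    length_score = min(len(text) / 2000, 1.0) * 0.25            # full at 2,000+ chars
    ratio_score = min(len(text) / max(html_length, 1), 1.0) * 0.25
    para_score = min(len(paragraphs) / 10, 1.0) * 0.15          # full at 10+ paragraphs
    avg_words = len(words) / max(len(sentences), 1)
    sent_score = 0.15 if 10 <= avg_words <= 25 else 0.075       # ideal 10-25 words/sentence
    vocab_score = (len({w.lower() for w in words}) / max(len(words), 1)) * 0.10
    meta_score = sum(0.10 / 3 for field in (title, description, author) if field)

    return round(length_score + ratio_score + para_score
                 + sent_score + vocab_score + meta_score, 3)
```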
What PII types are detected?
The actor scans for five categories: email addresses, US-format phone numbers, Social Security Numbers, credit card numbers (16-digit patterns with optional separators), and IPv4 addresses. When removePII is enabled, each match is replaced with a placeholder token such as [EMAIL], [PHONE], [SSN], [CREDIT_CARD], or [IP_ADDRESS].
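The five categories can be approximated with regular expressions. The patterns below are illustrative stand-ins -- the actor's actual regexes are not published and are likely stricter:

```python
import re

# Illustrative PII patterns for the five documented categories;
# treat these as approximations, not the actor's real regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_pii(text):
    """Replace each match with its placeholder token (e.g. [EMAIL])
    and return the redacted text plus the list of types found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label.lower())
            text = pattern.sub(f"[{label}]", text)
    return text, found
```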
How does deduplication work?
The actor generates a fingerprint for each page by extracting character trigrams from the first 500 characters of cleaned content, hashing each trigram, sorting the hashes, and keeping the top 20 as the fingerprint. When a new page's fingerprint overlaps 80% or more with an existing fingerprint from the same domain, the page is skipped as a near-duplicate.
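The described scheme can be sketched as follows. The hash function and the Jaccard-style overlap metric are assumptions for illustration; the docs specify only the 500-character window, top-20 hashes, and 80% threshold:

```python
import hashlib

def fingerprint(text, window=500, keep=20):
    """Build a trigram fingerprint from the first `window` characters."""
    snippet = text[:window].lower()
    trigrams = {snippet[i:i + 3] for i in range(len(snippet) - 2)}
    hashes = sorted(hashlib.md5(t.encode()).hexdigest() for t in trigrams)
    return set(hashes[:keep])

def is_near_duplicate(fp_a, fp_b, threshold=0.8):
    """Flag pages whose fingerprints overlap at the documented 80% level.

    Jaccard similarity is a guess at the overlap metric the actor uses.
    """
    if not fp_a or not fp_b:
        return False
    return len(fp_a & fp_b) / len(fp_a | fp_b) >= threshold
```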
Can I crawl multiple websites in one run?
Yes. Add multiple URLs to startUrls. The crawler follows internal links within the same origin as each start URL independently, so you can combine a documentation site and a blog in a single run without cross-contamination.
How do I feed the output into a vector database?
Export results as JSONL. Each record's content field contains the cleaned text suitable for embedding, while title, url, qualityScore, and metadata provide context for chunking and retrieval. Load the data into Pinecone, Weaviate, ChromaDB, or Qdrant using LangChain, LlamaIndex, or direct API calls.
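Embedding pipelines typically want smaller chunks than a full page. A minimal character-based chunker over the exported records (the `chunk_size` and `overlap` values are arbitrary starting points, not recommendations from the actor):

```python
def chunk_records(records, chunk_size=800, overlap=100):
    """Split each record's content into overlapping chunks for embedding.

    Sizes are character counts; tune them for your embedding model's
    context window. Source metadata is carried along for retrieval.
    """
    chunks = []
    for record in records:
        text = record["content"]
        start = 0
        while start < len(text):
            chunks.append({
                "text": text[start:start + chunk_size],
                "source_url": record["url"],
                "title": record["title"],
                "quality": record["qualityScore"],
            })
            start += chunk_size - overlap
    return chunks
```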
Does this handle JavaScript-rendered pages?
No. The actor uses CheerioCrawler, which parses the raw HTML response without executing JavaScript. For React SPAs, Next.js apps with client-side rendering, or Angular applications, you would need to pre-render the pages using a browser-based tool first and then pass the resulting URLs to this actor.
What is the difference between JSONL, Markdown, and plain text output formats?
JSONL (the default) preserves Markdown formatting in the content field alongside all metadata fields -- best for structured data pipelines. Markdown output adds YAML frontmatter with the title and URL above the content -- useful for documentation systems. Plain text strips all Markdown formatting (headers, bold, italic, code fences, links) for simple text-only workflows.
How do I increase output quality for fine-tuning?
Set minQualityScore to 0.5 or higher, increase minContentLength to 500 or more, and enable deduplicateContent. This combination filters out thin pages, low-quality content, and duplicates, leaving only substantive, well-structured text suitable for model training.
Can I exclude specific sections of a website?
Yes. Use excludePatterns to skip URLs containing specific path segments. For example, adding /api/, /admin/, /login/, and /tag/ prevents the crawler from wasting requests on API documentation, admin panels, authentication pages, and tag archive pages.
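Whether these patterns are plain substrings or globs is not specified in the docs, so this sketch assumes simple substring matching:

```python
def is_excluded(url, exclude_patterns):
    """Skip a URL when any exclude pattern appears in it (substring match,
    an assumption about how excludePatterns is applied)."""
    return any(pattern in url for pattern in exclude_patterns)
```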
How much does a typical run cost?
A 500-page crawl of a documentation site takes roughly 5 minutes and costs approximately $0.05 in Apify platform credits. The Apify Free plan includes $5/month, enough for approximately 50,000 pages per month. Actual costs vary based on page size and proxy usage.
Is the data suitable for commercial model training?
The actor extracts and cleans web content, but it does not assess or modify the copyright status of that content. Whether the data is suitable for commercial training depends on the source material's licensing terms. Always verify that you have the right to use the content for your intended purpose.
Related actors
| Actor | Description |
|---|---|
| Website Content to Markdown | Simple website content extraction and Markdown conversion without quality scoring or deduplication |
| Website Contact Scraper | Extract emails, phone numbers, and social media links from websites alongside page content |
| Website Change Monitor | Monitor websites for content changes to keep training datasets up to date as sources evolve |
| Website Tech Stack Detector | Identify which sites use server-rendered HTML (ideal for this actor) versus JavaScript frameworks |
| Semantic Scholar Paper Search | Search academic papers to find URLs for crawling research corpora and scientific training data |
| Wikipedia Article Search | Search and extract Wikipedia articles for general-knowledge training datasets |