AI Training Data Curator

Pricing

from $3.00 / 1,000 saved training pages

Turn any public website into a clean LLM training dataset. Crawl docs, blogs, and help centers, extract readable text, filter by language, remove duplicates, and export JSON, JSONL, or CSV for fine-tuning, RAG, and AI workflows. No coding required.

Rating

0.0

(0)

Developer

Vamsi Krishna

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: a day ago

AI Training Data Curator — Turn websites into LLM training data

Build clean datasets from any public website for LLM fine-tuning, RAG (retrieval-augmented generation), and AI model training — without writing scrapers or cleaning HTML by hand.

Paste your URLs, run the Actor, and download structured records with page title, body text, language, and word count — ready for JSON, JSONL, or CSV export.


What this Actor does

  1. Crawls pages from your start URLs (and optional sitemap)
  2. Extracts clean article-style text — navigation menus and boilerplate are filtered out
  3. Filters by language and minimum text length
  4. Removes near-duplicate pages automatically
  5. Exports one row per page to your Apify dataset
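
Steps 3 and 4 above can be illustrated with a minimal sketch. This is not the Actor's actual implementation: the function names, the 5-word shingle size, the 0.8 Jaccard threshold, and measuring minimum length in characters are all assumptions chosen for illustration. It also assumes each page already carries a detected language code.

```python
import re

def shingles(text, k=5):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def curate(pages, language="en", min_text_length=200, threshold=0.8):
    """Keep pages in the target language, above the minimum length,
    and not too similar to a page that was already kept."""
    kept, seen = [], []
    for page in pages:
        text = page["text"]
        if page.get("language") != language:
            continue  # language filter
        if len(text) < min_text_length:
            continue  # minimum text length (characters, by assumption)
        signature = shingles(text)
        if any(jaccard(signature, other) >= threshold for other in seen):
            continue  # near-duplicate of an earlier page
        seen.append(signature)
        kept.append({**page, "wordCount": len(text.split())})
    return kept

# Hypothetical sample pages: one original, one near-duplicate,
# one in another language, and one navigation stub.
base = " ".join(f"word{i}" for i in range(60))
sample = [
    {"url": "https://example.com/a", "language": "en", "text": base},
    {"url": "https://example.com/b", "language": "en", "text": base + " extra"},
    {"url": "https://example.com/c", "language": "de", "text": base},
    {"url": "https://example.com/d", "language": "en", "text": "Menu"},
]
result = curate(sample)
print([page["url"] for page in result])
```

Pairwise comparison is quadratic in the number of kept pages, which is fine for a small illustration; a crawler working at scale would more likely use a hashing scheme such as MinHash or SimHash.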

You get training-ready text, not raw HTML.


Who is it for?

  • AI & ML teams preparing fine-tuning or pre-training corpora
  • RAG builders indexing documentation, blogs, or knowledge bases
  • Researchers collecting web text datasets at scale
  • Product teams turning help centers or marketing sites into searchable AI content

Quick start (3 steps)

  1. Start URLs — Add one or more website URLs to crawl (or paste a sitemap URL).
  2. Max pages — Choose how many pages to collect (default: 10).
  3. Run — Open the Dataset tab when finished and download as JSON, JSONL, CSV, or Excel.

Tip: For a single site section, use Crawl strategy → seeds-only. To follow internal links, use recurse and keep Stay within seed domains enabled.


Ready-made tasks (Apify Console)

Create these public tasks under Actor → Tasks to improve Store discoverability:

| Task | Input highlights |
| --- | --- |
| Quick demo: single page | maxPages: 1, seeds-only, https://httpbin.org/html |
| Documentation crawl | recurse, stayWithinDomain: true, language: en, maxPages: 100 |
| Blog archive from sitemap | sitemapUrl + maxPages: 500, deduplicate: true |

Store monetization (Apify Console)

Under Actor → Monetization:

  1. Turn off “Pay per event + platform usage” so users see predictable pricing.
  2. Set per-page event price to cover platform cost (current runs are ~$0.0005/page compute + margin).
  3. Enable Store discounts with ~10% step-down per tier (Free → Bronze → Silver → Gold).
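
The arithmetic behind steps 2 and 3 can be worked through in a few lines. The base price and the approximate compute cost come from this page; treating the ~10% step-down as compounding from tier to tier is an assumption, and the helper names are illustrative only.

```python
BASE_PRICE_PER_1000 = 3.00       # listed price per 1,000 saved pages
COMPUTE_COST_PER_PAGE = 0.0005   # approximate compute cost quoted in step 2
TIERS = ["Free", "Bronze", "Silver", "Gold"]
STEP_DOWN = 0.10                 # assumed to compound at each tier

def tier_prices(base, tiers, step):
    """Price per 1,000 pages for each tier, reduced by `step` per tier."""
    return {tier: round(base * (1 - step) ** i, 4) for i, tier in enumerate(tiers)}

def margin_per_page(price_per_1000, cost_per_page):
    """Revenue per page minus compute cost per page."""
    return round(price_per_1000 / 1000 - cost_per_page, 6)

prices = tier_prices(BASE_PRICE_PER_1000, TIERS, STEP_DOWN)
margins = {tier: margin_per_page(price, COMPUTE_COST_PER_PAGE)
           for tier, price in prices.items()}

for tier in TIERS:
    print(f"{tier}: ${prices[tier]:.3f} per 1,000 pages, "
          f"${margins[tier]:.6f} margin per page")
```

Under these assumptions the per-page price stays above the quoted compute cost at every tier; if the discount is meant to be a flat 10% of the base price per tier instead, the lower tiers come out slightly cheaper.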

What you get (output)

Each saved page becomes one dataset record:

| Field | What it contains |
| --- | --- |
| url | Page address |
| title | Page title |
| text | Clean extracted body text |
| language | Detected language (e.g. en) |
| wordCount | Number of words |
| author | Author name, when detected |
| publishedDate | Publish date, when detected |

Example record:

{
  "url": "https://example.com/blog/getting-started",
  "title": "Getting Started with RAG",
  "text": "Retrieval-augmented generation combines search with large language models...",
  "language": "en",
  "wordCount": 842,
  "author": "Jane Doe",
  "publishedDate": "2025-03-15"
}
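
Many training pipelines only need a few of these fields. The sketch below shows one way to reshape records like the example above into a slimmer JSONL string. The function name and the default field selection are illustrative assumptions, not part of the Actor.

```python
import json

def to_jsonl(records, fields=("url", "title", "text")):
    """Serialize records to JSONL, keeping only the selected fields.

    Fields missing from a record are written as null.
    """
    lines = []
    for record in records:
        row = {name: record.get(name) for name in fields}
        # ensure_ascii=False keeps non-English text readable in the output
        lines.append(json.dumps(row, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)

records = [
    {
        "url": "https://example.com/blog/getting-started",
        "title": "Getting Started with RAG",
        "text": "Retrieval-augmented generation combines search with large language models...",
        "language": "en",
        "wordCount": 842,
        "author": "Jane Doe",
        "publishedDate": "2025-03-15",
    }
]

jsonl = to_jsonl(records)
first_row = json.loads(jsonl.splitlines()[0])
print(sorted(first_row))
```

Each output line is a complete JSON object, so the result can be streamed line by line without loading the whole dataset into memory.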

Common settings

| Setting | What it does |
| --- | --- |
| Start URLs | Where crawling begins |
| Sitemap URL | Optional: load many URLs from sitemap.xml |
| Max pages | Stop after this many pages (1–100,000) |
| Language filter | Keep only pages in one language (e.g. en) |
| Minimum text length | Skip very short pages (menus, stubs) |
| Remove duplicates | Drop near-duplicate content (recommended: on) |
| Crawl strategy | recurse = follow links; seeds-only = only listed URLs |
| Stay within seed domains | Do not leave the original website |
| Export rejected pages | Optional second dataset showing filtered-out URLs |
| Proxy configuration | Use if the site blocks automated access |

Example input

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "language": "en",
  "minTextLength": 200,
  "deduplicate": true,
  "crawlStrategy": "seeds-only",
  "stayWithinDomain": true
}
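
When runs are automated, it can help to build and check this input in code before submitting it. The following is a minimal sketch: the key names and the 1–100,000 range for maxPages come from this page, while the helper itself, its argument names, and the default of staying within seed domains are assumptions. The Actor's own input schema remains the authority on what is accepted.

```python
ALLOWED_STRATEGIES = {"recurse", "seeds-only"}

def build_input(start_urls, *, crawl_strategy, max_pages=10, language=None,
                min_text_length=None, deduplicate=True,
                stay_within_domain=True, sitemap_url=None):
    """Build an input dictionary for the Actor and check basic constraints."""
    if not start_urls and not sitemap_url:
        raise ValueError("provide at least one start URL or a sitemap URL")
    if not 1 <= max_pages <= 100_000:
        raise ValueError("maxPages must be between 1 and 100,000")
    if crawl_strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"crawlStrategy must be one of {sorted(ALLOWED_STRATEGIES)}")

    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "deduplicate": deduplicate,
        "crawlStrategy": crawl_strategy,
        "stayWithinDomain": stay_within_domain,
    }
    # Optional settings are only included when explicitly set.
    if language is not None:
        run_input["language"] = language
    if min_text_length is not None:
        run_input["minTextLength"] = min_text_length
    if sitemap_url is not None:
        run_input["sitemapUrl"] = sitemap_url
    return run_input

run_input = build_input(
    ["https://docs.example.com"],
    crawl_strategy="seeds-only",
    max_pages=100,
    language="en",
    min_text_length=200,
)
print(run_input)
```

Note that with seeds-only, only the listed URLs are fetched, so maxPages acts as an upper bound rather than a target; switch to recurse to follow internal links up to that limit.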

Download your dataset

After a run completes:

  1. Go to the run in Apify Console
  2. Open the Dataset tab
  3. Click Export → choose JSON, JSONL, CSV, or Excel

Load the exported file into Hugging Face Datasets, a vector database, or your own ML workflow. For OpenAI fine-tuning, first reshape the records into the chat-format JSONL that the fine-tuning API expects.


FAQ

Can I use this for RAG?
Yes. The output is clean text per URL — ideal for chunking and embedding into Pinecone, Weaviate, Chroma, or similar vector stores.
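
As a rough illustration of the chunking step, the sketch below splits a record's text into overlapping word windows before embedding. The chunk size, overlap, and chunk ID format are arbitrary assumptions, and the embedding and vector store calls are deliberately left out because they depend on the provider you choose.

```python
def chunk_text(text, chunk_words=200, overlap=40):
    """Split text into windows of `chunk_words` words that overlap by `overlap` words."""
    if chunk_words <= 0:
        raise ValueError("chunk_words must be positive")
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be >= 0 and smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def chunk_record(record, **kwargs):
    """Turn one dataset record into chunk dictionaries that keep the source URL."""
    return [
        {
            "id": f'{record["url"]}#chunk-{index}',
            "url": record["url"],
            "title": record.get("title"),
            "text": chunk,
        }
        for index, chunk in enumerate(chunk_text(record["text"], **kwargs))
    ]

demo = chunk_text(" ".join(str(n) for n in range(10)), chunk_words=4, overlap=1)
print(demo)
```

Keeping the URL and title on every chunk lets a retrieval step cite the original page alongside the matched passage.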

Does it work on documentation sites and blogs?
Yes. It is designed for article-style pages: docs, blogs, news, help centers, and marketing content.

Does it remove duplicate pages?
Yes, by default. Near-duplicate pages are detected and only one copy is kept.

What formats can I export?
JSON, JSONL, CSV, and Excel from the Apify dataset. JSONL is common for LLM training pipelines.

Do I need to code?
No. Configure inputs in the Apify UI and download results. API access is available if you want to automate runs.

Is crawling always allowed?
No. You must have permission to access and use the content you crawl. See Legal notice below.


Legal notice

You are responsible for complying with each website's terms of service and applicable copyright law. Only crawl sites you are allowed to access, respect robots.txt, and use extracted data in line with applicable regulations.