Pricing

from $0.008 / actor start

AI Dataset Converter - Website to Training Data

Crawl websites and convert content into AI-ready formats: RAG chunks, fine-tuning JSONL, Q&A pairs, clean Markdown. Token-aware chunking, quality scoring, deduplication. No external LLM API needed.

Developer

Boztek LTD

Last modified

2 months ago

AI Dataset Converter — Website to AI Training Data

Convert any website into AI-ready datasets for RAG pipelines, LLM fine-tuning, and Q&A training. Token-aware chunking, quality scoring, content deduplication — all without external API calls.

What does AI Dataset Converter do?

AI Dataset Converter crawls websites and transforms their content into structured, token-aware datasets optimized for AI/ML workflows:

RAG Chunks — Embedding-ready JSON with configurable chunk size and overlap
Fine-tuning JSONL — OpenAI-compatible messages[] format
Q&A Pairs — Automatically extracted from FAQ pages and heading structures
Clean Markdown — Boilerplate-free content with full page metadata

Every chunk includes the cl100k_base (GPT-4 compatible) token count, a 0.0–1.0 quality score, source URL, language, and canonical URL — ready to ingest into Pinecone, Qdrant, Weaviate, LangChain, LlamaIndex, or any vector store.

Why AI Dataset Converter?

Feature	Website Content Crawler	AI Dataset Converter
Output	Raw Markdown / text	Structured AI-ready formats
Chunking	Manual	Token-aware, configurable
Token counting	—	cl100k_base (GPT-4)
Q&A extraction	—	5 rule-based strategies
Quality scoring	—	0.0–1.0 per page
Deduplication	URL-based	Content fingerprinting
Fine-tuning format	—	OpenAI JSONL
External LLM cost	None	None

How much does it cost?

AI Dataset Converter uses pay-per-event pricing at approximately $0.002 per output item (chunk, Q&A pair, or page). Platform compute units are included.

Use case	Pages	Output items	Estimated cost
Small docs site	50	~250 chunks	~$0.50
Medium blog	500	~2,500 chunks	~$5.00
Large docs + FAQ	2,000	~12,000 items	~$24.00
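
The estimates in the table follow from simple arithmetic on the approximate per-item rate. A minimal sketch, assuming the rate stays near $0.002 per output item and leaving out any per-start fee; `estimate_cost` is a hypothetical helper, not part of the actor:

```python
def estimate_cost(output_items: int, rate_per_item: float = 0.002) -> float:
    """Approximate run cost in USD for a given number of output items.

    rate_per_item is the approximate figure quoted above; actual billing may differ.
    """
    if output_items < 0:
        raise ValueError("output_items must be non-negative")
    return round(output_items * rate_per_item, 2)


# Reproduce the three rows of the table above.
for label, items in [
    ("Small docs site", 250),
    ("Medium blog", 2_500),
    ("Large docs + FAQ", 12_000),
]:
    print(f"{label}: ~${estimate_cost(items):.2f}")
```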

Apify's free plan provides $5 of platform credit per month — enough to test on small sites.

Output formats

1. RAG Chunks (`rag-chunks`)

One JSON item per chunk with embedding-ready text plus rich metadata:

{
  "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
  "source_url": "https://docs.example.com/getting-started",
  "canonical_url": "https://docs.example.com/getting-started",
  "text": "Getting started with Example SDK...",
  "markdown": "# Getting Started\n\nWelcome to...",
  "chunk_index": 0,
  "total_chunks": 3,
  "token_count": 487,
  "char_count": 1843,
  "page_title": "Getting Started",
  "page_description": "Quick start guide",
  "page_language": "en",
  "page_author": "Docs Team",
  "page_date": "2026-04-12T00:00:00.000Z",
  "quality_score": 0.85,
  "content_type": "documentation",
  "crawled_at": "2026-05-12T08:30:00.000Z",
  "actor_version": "1.0.0"
}
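
Most vector stores take an ID, the text to embed, and a flat metadata dictionary. A minimal sketch of that mapping, assuming the field names shown above; `to_vector_record`, its `min_quality` threshold, and the choice of metadata fields are illustrative and not something the actor prescribes:

```python
from typing import Optional


def to_vector_record(chunk: dict, min_quality: float = 0.5) -> Optional[dict]:
    """Map one RAG chunk to an {id, text, metadata} record.

    Returns None when the chunk's quality_score is below min_quality.
    """
    if chunk.get("quality_score", 0.0) < min_quality:
        return None
    return {
        "id": chunk["chunk_id"],
        "text": chunk["text"],
        "metadata": {
            "source_url": chunk["source_url"],
            "page_title": chunk.get("page_title", ""),
            "chunk_index": chunk["chunk_index"],
            "token_count": chunk["token_count"],
        },
    }


# A trimmed copy of the sample item above.
sample_chunk = {
    "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
    "source_url": "https://docs.example.com/getting-started",
    "text": "Getting started with Example SDK...",
    "chunk_index": 0,
    "token_count": 487,
    "page_title": "Getting Started",
    "quality_score": 0.85,
}

record = to_vector_record(sample_chunk)
print(record["id"], record["metadata"]["token_count"])
```

The resulting record can then be embedded and upserted with whichever client library your vector store provides.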

2. Fine-tuning JSONL (`fine-tuning-jsonl`)

OpenAI-compatible messages[] format. Prompts are synthesized by rules, not by an LLM:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant that provides information about Example Documentation." },
    { "role": "user",   "content": "What is the chunk size?" },
    { "role": "assistant", "content": "The chunk size is the target number of tokens per output chunk..." }
  ],
  "_metadata": {
    "source_url": "https://docs.example.com/chunking",
    "chunk_id": "...",
    "token_count": 412,
    "quality_score": 0.81
  }
}
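
Some training pipelines expect each JSONL line to carry only the `messages` key. A minimal sketch that drops `_metadata` and checks the role sequence, assuming the record shape shown above; `to_training_line` is a hypothetical helper, and whether your target service tolerates extra keys is something to verify against its own documentation:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def to_training_line(record: dict) -> str:
    """Return one JSONL line containing only the messages array."""
    messages = record["messages"]
    roles = [message["role"] for message in messages]
    if not roles:
        raise ValueError("messages must not be empty")
    if not set(roles) <= ALLOWED_ROLES:
        raise ValueError(f"unexpected roles: {sorted(set(roles) - ALLOWED_ROLES)}")
    if roles[-1] != "assistant":
        raise ValueError("the last message should come from the assistant")
    # ensure_ascii=False keeps non-ASCII page text readable in the output file.
    return json.dumps({"messages": messages}, ensure_ascii=False)


sample_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the chunk size?"},
        {"role": "assistant", "content": "The chunk size is the target number of tokens per output chunk..."},
    ],
    "_metadata": {"source_url": "https://docs.example.com/chunking", "token_count": 412},
}

line = to_training_line(sample_record)
print(line)
```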

3. Q&A Pairs (`qa-pairs`)

Extracted from FAQ pages and heading structures using five rule-based strategies:

{
  "question": "Can I cancel my subscription?",
  "answer": "Yes, you can cancel anytime from the billing settings page in your account.",
  "source_url": "https://example.com/help/faq",
  "extraction_method": "faq_html",
  "confidence": 0.95,
  "token_count": 28,
  "page_title": "FAQ"
}

Extraction strategies (in descending order of confidence):

faq_schema — JSON-LD FAQPage schema (confidence 1.0)
faq_html — <details><summary> elements (0.95)
dt_dd — Definition lists <dl>/<dt>/<dd> (0.90)
accordion — aria-controls / data-toggle patterns (0.85)
heading_paragraph — <h2>/<h3> + following content (0.5–0.9)
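
To illustrate the highest-confidence strategy, here is a minimal sketch of reading Q&A pairs from a schema.org FAQPage JSON-LD block. It is not the actor's implementation: `extract_faq_schema` is a hypothetical function, and it assumes the JSON-LD string has already been pulled out of the page's `<script type="application/ld+json">` tag:

```python
import json


def extract_faq_schema(json_ld: str, source_url: str) -> list:
    """Return Q&A dicts for every Question inside FAQPage nodes."""
    data = json.loads(json_ld)
    nodes = data if isinstance(data, list) else [data]
    pairs = []
    for node in nodes:
        if not isinstance(node, dict) or node.get("@type") != "FAQPage":
            continue
        entities = node.get("mainEntity", [])
        if isinstance(entities, dict):  # a single question may not be wrapped in a list
            entities = [entities]
        for entity in entities:
            answer = entity.get("acceptedAnswer") or {}
            if isinstance(answer, list):
                answer = answer[0] if answer else {}
            question_text = (entity.get("name") or "").strip()
            answer_text = (answer.get("text") or "").strip()
            if question_text and answer_text:
                pairs.append({
                    "question": question_text,
                    "answer": answer_text,
                    "source_url": source_url,
                    "extraction_method": "faq_schema",
                    "confidence": 1.0,
                })
    return pairs


sample_json_ld = json.dumps({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "Can I cancel my subscription?",
        "acceptedAnswer": {"@type": "Answer", "text": "Yes, you can cancel anytime."},
    }],
})

faq_pairs = extract_faq_schema(sample_json_ld, "https://example.com/help/faq")
print(faq_pairs)
```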

4. Clean Markdown (`markdown`)

Full-page Markdown with boilerplate removed and complete metadata.

Input options

Option	Type	Default	Description
`startUrls`	array	required	Initial URLs to crawl
`maxPages`	integer	100	Maximum number of pages (0 = unlimited)
`maxDepth`	integer	5	Link-follow depth from start URLs
`crawlerType`	string	`adaptive`	`adaptive` / `cheerio` / `playwright`
`includeGlobs` / `excludeGlobs`	array	`[]`	URL pattern filters
`outputFormat`	string	`rag-chunks`	`rag-chunks` / `fine-tuning-jsonl` / `qa-pairs` / `markdown` / `all`
`chunkSize`	integer	512	Target tokens per chunk
`chunkOverlap`	integer	50	Token overlap between chunks
`extractQAPairs`	boolean	`true`	Run Q&A extraction strategies
`language`	string	`""`	Language filter (ISO 639-1 code)
`minContentLength`	integer	100	Skip pages shorter than this (chars)
`minQualityScore`	number	0.3	Skip pages below this score (0.0–1.0)
`removeDuplicates`	boolean	`true`	Content-fingerprint deduplication
`removeBoilerplate`	boolean	`true`	Strip nav/footer/cookie banners
`proxyConfiguration`	object	Apify Proxy	Proxy settings
`maxConcurrency`	integer	10	Parallel page processing
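
An input object built from the options above might look like the following. This is a sketch only: the URLs and glob are placeholders, the `{"url": ...}` shape of each start URL is an assumption, and `check_input` is a hypothetical sanity check rather than validation the actor is known to perform:

```python
import json

ALLOWED_FORMATS = {"rag-chunks", "fine-tuning-jsonl", "qa-pairs", "markdown", "all"}
ALLOWED_CRAWLERS = {"adaptive", "cheerio", "playwright"}

run_input = {
    # Assumption: each start URL is an object with a "url" key.
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 200,
    "maxDepth": 3,
    "crawlerType": "adaptive",
    "excludeGlobs": ["https://docs.example.com/changelog/**"],
    "outputFormat": "rag-chunks",
    "chunkSize": 512,
    "chunkOverlap": 50,
    "language": "en",
    "minQualityScore": 0.3,
}


def check_input(cfg: dict) -> list:
    """Return a list of problems found in cfg; an empty list means none were found."""
    problems = []
    if not cfg.get("startUrls"):
        problems.append("startUrls is required")
    if cfg.get("outputFormat", "rag-chunks") not in ALLOWED_FORMATS:
        problems.append("unknown outputFormat")
    if cfg.get("crawlerType", "adaptive") not in ALLOWED_CRAWLERS:
        problems.append("unknown crawlerType")
    if cfg.get("chunkOverlap", 50) >= cfg.get("chunkSize", 512):
        problems.append("chunkOverlap should be smaller than chunkSize")
    if not 0.0 <= cfg.get("minQualityScore", 0.3) <= 1.0:
        problems.append("minQualityScore must be within 0.0-1.0")
    return problems


print(json.dumps(run_input, indent=2))
print(check_input(run_input))
```

Options left out of `run_input` fall back to the defaults listed in the table.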

Use cases

Build RAG chatbots — Crawl documentation → chunk → embed in Pinecone/Qdrant/Weaviate
Fine-tune LLMs — Convert knowledge bases to OpenAI training format
Create Q&A datasets — Extract FAQ data for customer-support AI
Feed AI agents — Provide structured web knowledge to autonomous agents

Integrations

Output is plain JSON / JSONL and works with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, MongoDB Atlas, OpenAI fine-tuning, and any tool that accepts JSON.

Quality scoring (heuristic, no LLM)

Each page receives a 0.0–1.0 score computed from:

Content length (25%) — Pages between 500 and 10000 chars score highest
Text density (25%) — Ratio of extracted text to original HTML
Paragraph count (15%) — ≥3 paragraphs preferred
Heading presence (10%) — At least one <h1>–<h6>
Link density (10%) — Low anchor-text ratio preferred
Repetition (15%) — Unique-sentence ratio
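
The weights above combine into a single score as a weighted sum. A minimal sketch, assuming each signal has already been normalized to a 0.0–1.0 sub-score; how the actor derives those sub-scores is not described here, so the `signals` input and the `quality_score` function are hypothetical:

```python
WEIGHTS = {
    "content_length": 0.25,
    "text_density": 0.25,
    "paragraph_count": 0.15,
    "heading_presence": 0.10,
    "link_density": 0.10,
    "repetition": 0.15,
}


def quality_score(signals: dict) -> float:
    """Weighted sum of per-signal sub-scores, each clamped to [0, 1].

    Signals missing from the input count as 0.0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        sub_score = min(1.0, max(0.0, float(signals.get(name, 0.0))))
        total += weight * sub_score
    return round(total, 2)


# A page that does well on every signal except link density.
example_signals = {
    "content_length": 1.0,
    "text_density": 0.8,
    "paragraph_count": 1.0,
    "heading_presence": 1.0,
    "link_density": 0.2,
    "repetition": 1.0,
}
print(quality_score(example_signals))
```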

Pages scoring below minQualityScore are dropped before chunking and token counting.

Token-aware chunking

Chunks are produced with a recursive splitter that respects natural boundaries:

Split by paragraph (\n\n)
If a paragraph exceeds chunkSize, split by sentence
If a sentence exceeds chunkSize, split by token
Apply chunkOverlap by prepending the last N tokens of the previous chunk
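
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the actor's code: tokens are approximated by whitespace-separated words instead of cl100k_base tokens, small units are packed greedily into chunks (a detail the steps do not specify), and all function names are hypothetical:

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace-separated words, NOT cl100k_base.
    return len(text.split())


def split_oversized(paragraph: str, chunk_size: int) -> list:
    """Split a too-long paragraph by sentence, then by token if a sentence is too long."""
    pieces = []
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
        tokens = sentence.split()
        if len(tokens) <= chunk_size:
            pieces.append(sentence)
        else:
            for start in range(0, len(tokens), chunk_size):
                pieces.append(" ".join(tokens[start:start + chunk_size]))
    return pieces


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list:
    # Steps 1-3: break the text into units no larger than chunk_size.
    units = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if count_tokens(paragraph) <= chunk_size:
            units.append(paragraph)
        else:
            units.extend(split_oversized(paragraph, chunk_size))

    # Assumption: pack consecutive units greedily until chunk_size would be exceeded.
    chunks, current, current_len = [], [], 0
    for unit in units:
        unit_len = count_tokens(unit)
        if current and current_len + unit_len > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(unit)
        current_len += unit_len
    if current:
        chunks.append(" ".join(current))

    # Step 4: prepend the last chunk_overlap tokens of the previous chunk.
    result = []
    for index, chunk in enumerate(chunks):
        if index > 0 and chunk_overlap > 0:
            tail = chunks[index - 1].split()[-chunk_overlap:]
            chunk = " ".join(tail) + " " + chunk
        result.append(chunk)
    return result


sample_text = "one two three. four five six.\n\nseven eight nine ten eleven twelve thirteen."
for piece in chunk_text(sample_text, chunk_size=4, chunk_overlap=1):
    print(repr(piece))
```

In this sketch the overlap is added after packing, so a chunk can exceed `chunk_size` by up to `chunk_overlap` tokens.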

Token counts are computed with js-tiktoken using the cl100k_base encoding, the same encoding used by GPT-4 and the text-embedding-3-* models.

Limitations

No LLM-based extraction (by design — keeps cost predictable)
Q&A extraction works best on structured pages (FAQ, docs with headings)
Login-protected content not supported without cookie injection
JavaScript-heavy SPAs may need crawlerType: "playwright" for full rendering
