AI Dataset Converter - Website to Training Data

Pricing

from $0.008 / actor start

Crawl websites and convert content into AI-ready formats: RAG chunks, fine-tuning JSONL, Q&A pairs, clean Markdown. Token-aware chunking, quality scoring, deduplication. No external LLM API needed.

Developer

Boztek LTD

Maintained by Community

AI Dataset Converter — Website to AI Training Data

Convert any website into AI-ready datasets for RAG pipelines, LLM fine-tuning, and Q&A training. Token-aware chunking, quality scoring, content deduplication — all without external API calls.

What does AI Dataset Converter do?

AI Dataset Converter crawls websites and transforms their content into structured, token-aware datasets optimized for AI/ML workflows:

  • RAG Chunks — Embedding-ready JSON with configurable chunk size and overlap
  • Fine-tuning JSONL — OpenAI-compatible messages[] format
  • Q&A Pairs — Automatically extracted from FAQ pages and heading structures
  • Clean Markdown — Boilerplate-free content with full page metadata

Every chunk includes the cl100k_base (GPT-4 compatible) token count, a 0.0–1.0 quality score, source URL, language, and canonical URL — ready to ingest into Pinecone, Qdrant, Weaviate, LangChain, LlamaIndex, or any vector store.

Why AI Dataset Converter?

| Feature | Website Content Crawler | AI Dataset Converter |
| --- | --- | --- |
| Output | Raw Markdown / text | Structured AI-ready formats |
| Chunking | Manual | Token-aware, configurable |
| Token counting | | cl100k_base (GPT-4) |
| Q&A extraction | | 5 rule-based strategies |
| Quality scoring | | 0.0–1.0 per page |
| Deduplication | URL-based | Content fingerprinting |
| Fine-tuning format | | OpenAI JSONL |
| External LLM cost | None | None |

How much does it cost?

AI Dataset Converter uses pay-per-event pricing at approximately $0.002 per output item (chunk, Q&A pair, or page). Platform compute units are included.

| Use case | Pages | Output items | Estimated cost |
| --- | --- | --- | --- |
| Small docs site | 50 | ~250 chunks | ~$0.50 |
| Medium blog | 500 | ~2,500 chunks | ~$5.00 |
| Large docs + FAQ | 2,000 | ~12,000 items | ~$24.00 |

Apify's free plan provides $5 of platform credit per month — enough to test on small sites.
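
The estimates in the table follow from multiplying the item count by the approximate per-item rate. A minimal sketch of that arithmetic, assuming a flat $0.002 per item and ignoring any per-start fee; `estimate_cost` is a hypothetical helper, not part of the actor:

```python
# Approximate flat rate taken from the pricing paragraph above (an assumption:
# real billing may differ and may include a separate per-start fee).
PRICE_PER_ITEM_USD = 0.002


def estimate_cost(output_items: int, price_per_item: float = PRICE_PER_ITEM_USD) -> float:
    """Return the estimated charge in USD for a number of output items."""
    if output_items < 0:
        raise ValueError("output_items must be non-negative")
    return round(output_items * price_per_item, 2)


# Reproduce the three rows of the pricing table.
for label, items in [
    ("Small docs site", 250),
    ("Medium blog", 2_500),
    ("Large docs + FAQ", 12_000),
]:
    print(f"{label}: ~${estimate_cost(items):.2f}")
```

Because the item count depends on chunkSize and on how much content each page holds, the page count alone is only a rough predictor of cost.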

Output formats

1. RAG Chunks (rag-chunks)

One JSON item per chunk with embedding-ready text plus rich metadata:

{
  "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
  "source_url": "https://docs.example.com/getting-started",
  "canonical_url": "https://docs.example.com/getting-started",
  "text": "Getting started with Example SDK...",
  "markdown": "# Getting Started\n\nWelcome to...",
  "chunk_index": 0,
  "total_chunks": 3,
  "token_count": 487,
  "char_count": 1843,
  "page_title": "Getting Started",
  "page_description": "Quick start guide",
  "page_language": "en",
  "page_author": "Docs Team",
  "page_date": "2026-04-12T00:00:00.000Z",
  "quality_score": 0.85,
  "content_type": "documentation",
  "crawled_at": "2026-05-12T08:30:00.000Z",
  "actor_version": "1.0.0"
}
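
Before embedding, items like the one above usually need to be reduced to an ID, a text field, and a small metadata payload. The following is a sketch of that mapping using only the field names shown in the sample; the `to_vector_records` helper, the 0.5 threshold, and the record layout are assumptions, since each vector store defines its own upsert format:

```python
def to_vector_records(items, min_quality=0.5):
    """Map rag-chunks items to generic records, skipping low-quality chunks.

    The {"id", "text", "metadata"} layout is illustrative only; adapt it to
    the upsert format of the vector store actually in use.
    """
    records = []
    for item in items:
        if item.get("quality_score", 0.0) < min_quality:
            continue  # drop chunks below the caller's own threshold
        records.append({
            "id": item["chunk_id"],
            "text": item["text"],
            "metadata": {
                "source_url": item["source_url"],
                "page_title": item.get("page_title"),
                "chunk_index": item["chunk_index"],
                "token_count": item["token_count"],
            },
        })
    return records


# Two hand-written items that mimic the sample output above.
sample_items = [
    {"chunk_id": "a1", "text": "Getting started with Example SDK...",
     "source_url": "https://docs.example.com/getting-started",
     "page_title": "Getting Started", "chunk_index": 0,
     "token_count": 487, "quality_score": 0.85},
    {"chunk_id": "a2", "text": "Cookie notice",
     "source_url": "https://docs.example.com/cookies",
     "page_title": "Cookies", "chunk_index": 0,
     "token_count": 3, "quality_score": 0.2},
]

records = to_vector_records(sample_items)
print(len(records), records[0]["id"])
```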

2. Fine-tuning JSONL (fine-tuning-jsonl)

OpenAI-compatible messages[] format. Prompts are synthesized by rules, not by an LLM:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant that provides information about Example Documentation." },
    { "role": "user", "content": "What is the chunk size?" },
    { "role": "assistant", "content": "The chunk size is the target number of tokens per output chunk..." }
  ],
  "_metadata": {
    "source_url": "https://docs.example.com/chunking",
    "chunk_id": "...",
    "token_count": 412,
    "quality_score": 0.81
  }
}
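
When preparing a training file from records like this, it can help to validate each record and keep only the messages array, one JSON object per line. This is a minimal sketch: `to_training_line` is a hypothetical helper, and dropping `_metadata` is an assumption for trainers that expect only the messages key, not a documented requirement of this actor:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def to_training_line(record: dict) -> str:
    """Validate one record and serialize only its messages as a JSONL line."""
    messages = record["messages"]
    for message in messages:
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message['role']!r}")
        if not message["content"].strip():
            raise ValueError("empty message content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should be the assistant reply")
    # _metadata is intentionally left out of the serialized line.
    return json.dumps({"messages": messages}, ensure_ascii=False)


record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the chunk size?"},
        {"role": "assistant", "content": "The target number of tokens per chunk."},
    ],
    "_metadata": {"source_url": "https://docs.example.com/chunking", "quality_score": 0.81},
}

line = to_training_line(record)
print(line)
```

Keeping `_metadata` in a separate file keyed by chunk_id preserves provenance without putting extra keys in the training data.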

3. Q&A Pairs (qa-pairs)

Extracted from FAQ pages using five rule-based strategies:

{
  "question": "Can I cancel my subscription?",
  "answer": "Yes, you can cancel anytime from the billing settings page in your account.",
  "source_url": "https://example.com/help/faq",
  "extraction_method": "faq_html",
  "confidence": 0.95,
  "token_count": 28,
  "page_title": "FAQ"
}

Extraction strategies (in confidence order):

  1. faq_schema: JSON-LD FAQPage schema (confidence 1.0)
  2. faq_html: <details><summary> elements (0.95)
  3. dt_dd: Definition lists <dl>/<dt>/<dd> (0.90)
  4. accordion: aria-controls / data-toggle patterns (0.85)
  5. heading_paragraph: <h2>/<h3> + following content (0.5–0.9)
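
To illustrate the highest-confidence strategy, the sketch below reads a schema.org FAQPage JSON-LD block and emits items shaped like the Q&A sample above. It is an illustration of the general technique, not the actor's code; `extract_faq_schema` is a hypothetical name, and it assumes the JSON-LD text has already been pulled out of the page's script tag:

```python
import json


def extract_faq_schema(json_ld_text: str, source_url: str) -> list:
    """Return Q&A dicts from a schema.org FAQPage JSON-LD document."""
    data = json.loads(json_ld_text)
    nodes = data if isinstance(data, list) else [data]
    pairs = []
    for node in nodes:
        if not isinstance(node, dict) or node.get("@type") != "FAQPage":
            continue
        entities = node.get("mainEntity", [])
        if isinstance(entities, dict):  # a single Question is also valid
            entities = [entities]
        for entity in entities:
            answer = entity.get("acceptedAnswer") or {}
            if isinstance(answer, list):
                answer = answer[0] if answer else {}
            question = (entity.get("name") or "").strip()
            text = (answer.get("text") or "").strip()
            if question and text:
                pairs.append({
                    "question": question,
                    "answer": text,
                    "source_url": source_url,
                    "extraction_method": "faq_schema",
                    "confidence": 1.0,  # value listed for faq_schema above
                })
    return pairs


json_ld = json.dumps({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "Can I cancel my subscription?",
        "acceptedAnswer": {"@type": "Answer", "text": "Yes, at any time."},
    }],
})

pairs = extract_faq_schema(json_ld, "https://example.com/help/faq")
print(pairs[0]["question"])
```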

4. Clean Markdown (markdown)

Full-page Markdown with boilerplate removed and complete metadata.

Input options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | Initial URLs to crawl |
| maxPages | integer | 100 | Maximum number of pages (0 = unlimited) |
| maxDepth | integer | 5 | Link-follow depth from start URLs |
| crawlerType | string | adaptive | adaptive / cheerio / playwright |
| includeGlobs / excludeGlobs | array | [] | URL pattern filters |
| outputFormat | string | rag-chunks | rag-chunks / fine-tuning-jsonl / qa-pairs / markdown / all |
| chunkSize | integer | 512 | Target tokens per chunk |
| chunkOverlap | integer | 50 | Token overlap between chunks |
| extractQAPairs | boolean | true | Run Q&A extraction strategies |
| language | string | "" | Language filter (ISO 639-1 code) |
| minContentLength | integer | 100 | Skip pages shorter than this (chars) |
| minQualityScore | number | 0.3 | Skip pages below this score (0.0–1.0) |
| removeDuplicates | boolean | true | Content-fingerprint deduplication |
| removeBoilerplate | boolean | true | Strip nav/footer/cookie banners |
| proxyConfiguration | object | Apify Proxy | Proxy settings |
| maxConcurrency | integer | 10 | Parallel page processing |
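
A run input can be assembled from these options as a plain dictionary. The sketch below uses the defaults from the table; `build_run_input` is a hypothetical helper, the `{"url": ...}` shape for startUrls is the request-list form common to Apify actors and is assumed rather than confirmed for this one, the overlap check is a sanity rule of this sketch, and proxyConfiguration is omitted because its object shape is not stated here:

```python
import copy

# Defaults copied from the input options table.
DEFAULTS = {
    "maxPages": 100,
    "maxDepth": 5,
    "crawlerType": "adaptive",
    "includeGlobs": [],
    "excludeGlobs": [],
    "outputFormat": "rag-chunks",
    "chunkSize": 512,
    "chunkOverlap": 50,
    "extractQAPairs": True,
    "language": "",
    "minContentLength": 100,
    "minQualityScore": 0.3,
    "removeDuplicates": True,
    "removeBoilerplate": True,
    "maxConcurrency": 10,
}
CRAWLER_TYPES = {"adaptive", "cheerio", "playwright"}
OUTPUT_FORMATS = {"rag-chunks", "fine-tuning-jsonl", "qa-pairs", "markdown", "all"}


def build_run_input(start_urls, **overrides):
    """Merge overrides into the documented defaults and run basic checks."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    run_input = copy.deepcopy(DEFAULTS)
    run_input.update(overrides)
    if run_input["crawlerType"] not in CRAWLER_TYPES:
        raise ValueError("crawlerType must be adaptive, cheerio or playwright")
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError("unsupported outputFormat")
    if not 0 <= run_input["chunkOverlap"] < run_input["chunkSize"]:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    # Assumed shape: a list of {"url": ...} objects.
    run_input["startUrls"] = [{"url": url} for url in start_urls]
    return run_input


run_input = build_run_input(["https://docs.example.com"], chunkSize=256, chunkOverlap=32)
print(run_input["outputFormat"], run_input["chunkSize"])
```

With the apify-client package, a dictionary like this would typically be passed as `run_input` to an actor call; the actor ID is not given in this text, so that step is left out.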

Use cases

  1. Build RAG chatbots — Crawl documentation → chunk → embed in Pinecone/Qdrant/Weaviate
  2. Fine-tune LLMs — Convert knowledge bases to OpenAI training format
  3. Create Q&A datasets — Extract FAQ data for customer-support AI
  4. Feed AI agents — Provide structured web knowledge to autonomous agents

Integrations

Output is plain JSON / JSONL and works with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, MongoDB Atlas, OpenAI fine-tuning, and any tool that accepts JSON.

Quality scoring (heuristic, no LLM)

Each page receives a 0.0–1.0 score computed from:

  • Content length (25%): Pages between 500 and 10,000 chars score highest
  • Text density (25%): Ratio of extracted text to original HTML
  • Paragraph count (15%): ≥3 paragraphs preferred
  • Heading presence (10%): At least one heading (<h1> to <h6>)
  • Link density (10%): Low anchor-text ratio preferred
  • Repetition (15%): Unique-sentence ratio

Pages scoring below minQualityScore are filtered out before chunking and token counting.
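
The weights above sum to 100%, so the page score can be read as a weighted average of six sub-scores. The sketch below shows only that combination step; how each sub-score is derived is not specified here, so the function assumes they arrive already normalized to the 0.0–1.0 range, and the key names are illustrative:

```python
# Weights from the list above; the dictionary keys are illustrative names.
WEIGHTS = {
    "content_length": 0.25,
    "text_density": 0.25,
    "paragraph_count": 0.15,
    "heading_presence": 0.10,
    "link_density": 0.10,
    "repetition": 0.15,
}


def quality_score(sub_scores: dict) -> float:
    """Combine normalized sub-scores (0.0 to 1.0 each) into one page score."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        clamped = min(1.0, max(0.0, sub_scores[name]))  # guard against out-of-range input
        total += weight * clamped
    return round(total, 4)


example = {
    "content_length": 1.0,
    "text_density": 0.8,
    "paragraph_count": 1.0,
    "heading_presence": 1.0,
    "link_density": 0.5,
    "repetition": 0.9,
}
score = quality_score(example)
print(score, score >= 0.3)  # compare against the default minQualityScore
```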

Token-aware chunking

Chunks are produced with a recursive splitter that respects natural boundaries:

  1. Split by paragraph (\n\n)
  2. If a paragraph exceeds chunkSize, split by sentence
  3. If a sentence exceeds chunkSize, split by token
  4. Apply chunkOverlap by prepending the last N tokens of the previous chunk

Token counts are computed with js-tiktoken using the cl100k_base encoding, the same encoding used by GPT-4 and the text-embedding-3-* models.
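
The four steps can be sketched as follows. This is an illustration of the general technique, not the actor's implementation: it uses whitespace-separated words as a stand-in for cl100k_base tokens, a simple punctuation regex for sentence boundaries, and it reserves room for the overlap so that no chunk exceeds chunk_size, which is a design choice of this sketch:

```python
import re


def tokenize(text: str) -> list:
    """Stand-in tokenizer: whitespace words, NOT real cl100k_base tokens."""
    return text.split()


def split_units(text: str, limit: int) -> list:
    """Split into token lists of at most `limit`: paragraph, then sentence, then token."""
    units = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        tokens = tokenize(paragraph)
        if not tokens:
            continue
        if len(tokens) <= limit:
            units.append(tokens)  # step 1: the paragraph fits as-is
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):  # step 2
            sentence_tokens = tokenize(sentence)
            for start in range(0, len(sentence_tokens), limit):  # step 3
                units.append(sentence_tokens[start:start + limit])
    return units


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list:
    """Pack units into chunks, prepending the tail of the previous chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    budget = chunk_size - chunk_overlap  # room for new tokens per chunk
    bodies, current = [], []
    for unit in split_units(text, budget):
        if current and len(current) + len(unit) > budget:
            bodies.append(current)
            current = []
        current = current + unit
    if current:
        bodies.append(current)

    chunks = []
    for body in bodies:
        # Step 4: last N tokens of the previous chunk, if there is one.
        overlap = chunks[-1][-chunk_overlap:] if chunks and chunk_overlap else []
        chunks.append(overlap + body)
    # Joining with spaces drops the original paragraph breaks.
    return [" ".join(chunk) for chunk in chunks]


text = "one two three four five six. seven eight nine ten.\n\neleven twelve"
for chunk in chunk_text(text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Swapping `tokenize` for a real cl100k_base encoder would make the counts comparable to the token_count field in the output, at the cost of an extra dependency.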

Limitations

  • No LLM-based extraction (by design — keeps cost predictable)
  • Q&A extraction works best on structured pages (FAQ, docs with headings)
  • Login-protected content is not supported without cookie injection
  • JavaScript-heavy SPAs may need crawlerType: "playwright" for full rendering