AI Dataset Converter - Website to Training Data
Pricing
from $0.008 / actor start
Crawl websites and convert content into AI-ready formats: RAG chunks, fine-tuning JSONL, Q&A pairs, clean Markdown. Token-aware chunking, quality scoring, deduplication. No external LLM API needed.
Rating
0.0
(0)
Developer
Boztek LTD
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 17 hours ago
AI Dataset Converter — Website to AI Training Data
Convert any website into AI-ready datasets for RAG pipelines, LLM fine-tuning, and Q&A training. Token-aware chunking, quality scoring, content deduplication — all without external API calls.
What does AI Dataset Converter do?
AI Dataset Converter crawls websites and transforms their content into structured, token-aware datasets optimized for AI/ML workflows:
- RAG Chunks: embedding-ready JSON with configurable chunk size and overlap
- Fine-tuning JSONL: OpenAI-compatible `messages[]` format
- Q&A Pairs: automatically extracted from FAQ pages and heading structures
- Clean Markdown: boilerplate-free content with full page metadata
Every chunk includes its cl100k_base (GPT-4 compatible) token count, a 0.0–1.0 quality score, the source URL, the page language, and the canonical URL, so it can be ingested into Pinecone, Qdrant, Weaviate, or another vector store, either directly or through LangChain or LlamaIndex.
Why AI Dataset Converter?
| Feature | Website Content Crawler | AI Dataset Converter |
|---|---|---|
| Output | Raw Markdown / text | Structured AI-ready formats |
| Chunking | Manual | Token-aware, configurable |
| Token counting | — | cl100k_base (GPT-4) |
| Q&A extraction | — | 5 rule-based strategies |
| Quality scoring | — | 0.0–1.0 per page |
| Deduplication | URL-based | Content fingerprinting |
| Fine-tuning format | — | OpenAI JSONL |
| External LLM cost | None | None |
How much does it cost?
AI Dataset Converter uses pay-per-event pricing at approximately $0.002 per output item (chunk, Q&A pair, or page). Platform compute units are included.
| Use case | Pages | Output items | Estimated cost |
|---|---|---|---|
| Small docs site | 50 | ~250 chunks | ~$0.50 |
| Medium blog | 500 | ~2,500 chunks | ~$5.00 |
| Large docs + FAQ | 2,000 | ~12,000 items | ~$24.00 |
Apify's free plan provides $5 of platform credit per month — enough to test on small sites.
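As a rough check of the table above, the arithmetic can be sketched in a few lines. This is only an estimate under the stated rate of about $0.002 per output item; the function names are hypothetical, and any per-start fee (the store listing mentions "from $0.008 / actor start") is deliberately left out.

```python
from decimal import Decimal

# Approximate per-item rate quoted in this README (may change).
PRICE_PER_ITEM_USD = Decimal("0.002")


def estimate_cost(output_items: int) -> Decimal:
    """Estimated charge in USD for a number of output items."""
    return output_items * PRICE_PER_ITEM_USD


def items_within_budget(budget_usd: str) -> int:
    """How many output items fit in a budget, ignoring any per-start fee."""
    return int(Decimal(budget_usd) // PRICE_PER_ITEM_USD)


if __name__ == "__main__":
    for items in (250, 2500, 12000):
        print(items, "items cost about $", estimate_cost(items))
    print("Items covered by $5 of credit:", items_within_budget("5"))
```

Decimal arithmetic is used here only to avoid floating-point noise in the printed amounts.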
Output formats
1. RAG Chunks (rag-chunks)
One JSON item per chunk with embedding-ready text plus rich metadata:
```json
{
  "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
  "source_url": "https://docs.example.com/getting-started",
  "canonical_url": "https://docs.example.com/getting-started",
  "text": "Getting started with Example SDK...",
  "markdown": "# Getting Started\n\nWelcome to...",
  "chunk_index": 0,
  "total_chunks": 3,
  "token_count": 487,
  "char_count": 1843,
  "page_title": "Getting Started",
  "page_description": "Quick start guide",
  "page_language": "en",
  "page_author": "Docs Team",
  "page_date": "2026-04-12T00:00:00.000Z",
  "quality_score": 0.85,
  "content_type": "documentation",
  "crawled_at": "2026-05-12T08:30:00.000Z",
  "actor_version": "1.0.0"
}
```
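To show how items of this shape might be prepared for a vector store, here is a minimal sketch. It assumes the items have already been downloaded as a list of dictionaries with the fields shown above; the `to_vector_records` helper, the record layout, and the 0.5 threshold are illustrative choices, not part of the actor.

```python
from typing import Dict, List


def to_vector_records(items: List[Dict], min_quality: float = 0.5) -> List[Dict]:
    """Keep chunks at or above min_quality and reshape them into
    id / text / metadata records, a layout many vector stores accept."""
    records = []
    for item in items:
        if item.get("quality_score", 0.0) < min_quality:
            continue  # skip low-scoring chunks
        records.append({
            "id": item["chunk_id"],
            "text": item["text"],  # the string you would embed
            "metadata": {
                "source_url": item["source_url"],
                "page_title": item.get("page_title"),
                "chunk_index": item["chunk_index"],
                "token_count": item["token_count"],
            },
        })
    return records


if __name__ == "__main__":
    sample = [{
        "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
        "source_url": "https://docs.example.com/getting-started",
        "text": "Getting started with Example SDK...",
        "chunk_index": 0,
        "token_count": 487,
        "page_title": "Getting Started",
        "quality_score": 0.85,
    }]
    print(to_vector_records(sample))
```

The embedding call and the upsert into a specific store are left out because they depend on the client library you use.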
2. Fine-tuning JSONL (fine-tuning-jsonl)
OpenAI-compatible `messages[]` format. Prompts are synthesized by rules, without an LLM:
```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant that provides information about Example Documentation." },
    { "role": "user", "content": "What is the chunk size?" },
    { "role": "assistant", "content": "The chunk size is the target number of tokens per output chunk..." }
  ],
  "_metadata": {
    "source_url": "https://docs.example.com/chunking",
    "chunk_id": "...",
    "token_count": 412,
    "quality_score": 0.81
  }
}
```
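Before uploading, you may want a training file that contains only the `messages` array. The sketch below assumes the extra `_metadata` key should be stripped, since a fine-tuning service might not accept unknown top-level keys; check your provider's format rules. The helper name and the optional quality filter are hypothetical.

```python
import json
from typing import Dict, List


def to_training_lines(items: List[Dict], min_quality: float = 0.0) -> List[str]:
    """Return one JSON string per example, keeping only the messages array."""
    lines = []
    for item in items:
        meta = item.get("_metadata", {})
        # Items without a score are kept (treated as 1.0); this is an assumption.
        if meta.get("quality_score", 1.0) < min_quality:
            continue
        lines.append(json.dumps({"messages": item["messages"]}, ensure_ascii=False))
    return lines


if __name__ == "__main__":
    sample = [{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the chunk size?"},
            {"role": "assistant", "content": "The target number of tokens per chunk."},
        ],
        "_metadata": {"quality_score": 0.81},
    }]
    # Writing "\n".join(lines) to a .jsonl file gives one example per line.
    print("\n".join(to_training_lines(sample, min_quality=0.5)))
```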
3. Q&A Pairs (qa-pairs)
Extracted from FAQ pages using five rule-based strategies:
```json
{
  "question": "Can I cancel my subscription?",
  "answer": "Yes, you can cancel anytime from the billing settings page in your account.",
  "source_url": "https://example.com/help/faq",
  "extraction_method": "faq_html",
  "confidence": 0.95,
  "token_count": 28,
  "page_title": "FAQ"
}
```
Extraction strategies (in confidence order):
- `faq_schema`: JSON-LD `FAQPage` schema (confidence 1.0)
- `faq_html`: `<details>` / `<summary>` elements (0.95)
- `dt_dd`: definition lists `<dl>` / `<dt>` / `<dd>` (0.90)
- `accordion`: `aria-controls` / `data-toggle` patterns (0.85)
- `heading_paragraph`: `<h2>` / `<h3>` plus the following content (0.5–0.9)
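To illustrate what a rule-based strategy like `faq_schema` could look like, here is a minimal sketch that reads a schema.org `FAQPage` JSON-LD block and emits pairs in the output shape shown above. It is not the actor's implementation: the function name is hypothetical, and it assumes the JSON-LD text has already been pulled out of the page's `<script type="application/ld+json">` tag.

```python
import json
from typing import Dict, List


def extract_faq_schema(json_ld: str, source_url: str) -> List[Dict]:
    """Extract question/answer pairs from a schema.org FAQPage JSON-LD string."""
    data = json.loads(json_ld)
    nodes = data if isinstance(data, list) else [data]
    pairs = []
    for node in nodes:
        if not isinstance(node, dict) or node.get("@type") != "FAQPage":
            continue
        entities = node.get("mainEntity", [])
        if isinstance(entities, dict):  # a single Question instead of a list
            entities = [entities]
        for entity in entities:
            accepted = entity.get("acceptedAnswer") or {}
            if isinstance(accepted, list):
                accepted = accepted[0] if accepted else {}
            question = (entity.get("name") or "").strip()
            answer = (accepted.get("text") or "").strip()
            if question and answer:
                pairs.append({
                    "question": question,
                    "answer": answer,
                    "source_url": source_url,
                    "extraction_method": "faq_schema",
                    "confidence": 1.0,
                })
    return pairs


if __name__ == "__main__":
    sample = json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [{
            "@type": "Question",
            "name": "Can I cancel my subscription?",
            "acceptedAnswer": {"@type": "Answer", "text": "Yes, you can cancel anytime."},
        }],
    })
    print(extract_faq_schema(sample, "https://example.com/help/faq"))
```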
4. Clean Markdown (markdown)
Full-page Markdown with boilerplate removed and complete metadata.
Input options
| Option | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Initial URLs to crawl |
| `maxPages` | integer | 100 | Maximum number of pages (0 = unlimited) |
| `maxDepth` | integer | 5 | Link-follow depth from start URLs |
| `crawlerType` | string | adaptive | adaptive / cheerio / playwright |
| `includeGlobs` / `excludeGlobs` | array | [] | URL pattern filters |
| `outputFormat` | string | rag-chunks | rag-chunks / fine-tuning-jsonl / qa-pairs / markdown / all |
| `chunkSize` | integer | 512 | Target tokens per chunk |
| `chunkOverlap` | integer | 50 | Token overlap between chunks |
| `extractQAPairs` | boolean | true | Run Q&A extraction strategies |
| `language` | string | "" | Language filter (ISO 639-1 code) |
| `minContentLength` | integer | 100 | Skip pages shorter than this (chars) |
| `minQualityScore` | number | 0.3 | Skip pages below this score (0.0–1.0) |
| `removeDuplicates` | boolean | true | Content-fingerprint deduplication |
| `removeBoilerplate` | boolean | true | Strip nav/footer/cookie banners |
| `proxyConfiguration` | object | Apify Proxy | Proxy settings |
| `maxConcurrency` | integer | 10 | Parallel page processing |
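The options above can be assembled into a run input before starting the actor. The sketch below is a hypothetical helper, not part of the actor: it fills in the documented defaults and applies a few sanity checks. Two things are assumptions: that `startUrls` takes a list of `{"url": ...}` objects (a common Apify convention), and that `chunkOverlap` must be smaller than `chunkSize`.

```python
from typing import Dict, List

OUTPUT_FORMATS = {"rag-chunks", "fine-tuning-jsonl", "qa-pairs", "markdown", "all"}
CRAWLER_TYPES = {"adaptive", "cheerio", "playwright"}


def build_run_input(start_urls: List[str], **overrides) -> Dict:
    """Build an input object from the documented defaults plus overrides."""
    if not start_urls:
        raise ValueError("startUrls is required")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],  # assumed shape
        "maxPages": 100,
        "maxDepth": 5,
        "crawlerType": "adaptive",
        "outputFormat": "rag-chunks",
        "chunkSize": 512,
        "chunkOverlap": 50,
        "minQualityScore": 0.3,
        "removeDuplicates": True,
        "removeBoilerplate": True,
    }
    run_input.update(overrides)
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError("unknown outputFormat")
    if run_input["crawlerType"] not in CRAWLER_TYPES:
        raise ValueError("unknown crawlerType")
    # Assumed sanity check: the overlap has to leave room for new tokens.
    if not 0 <= run_input["chunkOverlap"] < run_input["chunkSize"]:
        raise ValueError("chunkOverlap must be >= 0 and smaller than chunkSize")
    if not 0.0 <= run_input["minQualityScore"] <= 1.0:
        raise ValueError("minQualityScore must be between 0.0 and 1.0")
    return run_input


if __name__ == "__main__":
    print(build_run_input(
        ["https://docs.example.com"],
        outputFormat="qa-pairs",
        crawlerType="playwright",  # for JavaScript-heavy sites
    ))
```

The resulting dictionary can be pasted into the input editor as JSON or passed to whichever API client you use to start the run.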
Use cases
- Build RAG chatbots — Crawl documentation → chunk → embed in Pinecone/Qdrant/Weaviate
- Fine-tune LLMs — Convert knowledge bases to OpenAI training format
- Create Q&A datasets — Extract FAQ data for customer-support AI
- Feed AI agents — Provide structured web knowledge to autonomous agents
Integrations
Output is plain JSON / JSONL and works with LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, MongoDB Atlas, OpenAI fine-tuning, and any tool that accepts JSON.
Quality scoring (heuristic, no LLM)
Each page receives a 0.0–1.0 score computed from:
- Content length (25%): pages between 500 and 10000 chars score highest
- Text density (25%): ratio of extracted text to original HTML
- Paragraph count (15%): ≥3 paragraphs preferred
- Heading presence (10%): at least one `<h1>`–`<h6>`
- Link density (10%): low anchor-text ratio preferred
- Repetition (15%): unique-sentence ratio
Pages scoring below `minQualityScore` are dropped before chunking and token counting.
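The weighting above amounts to a weighted sum of per-signal scores. The sketch below shows only that combination step, using the listed percentages; how each signal is turned into a 0.0–1.0 value is not specified here, so the inputs are assumed to be already normalized and the names are hypothetical.

```python
from typing import Dict

# Weights taken from the list above; they sum to 1.0.
WEIGHTS = {
    "content_length": 0.25,
    "text_density": 0.25,
    "paragraph_count": 0.15,
    "heading_presence": 0.10,
    "link_density": 0.10,
    "repetition": 0.15,
}


def quality_score(signals: Dict[str, float]) -> float:
    """Combine normalized signals (each 0.0 to 1.0) into one page score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, signals.get(name, 0.0)))  # clamp to [0, 1]
        total += weight * value
    return round(total, 4)


if __name__ == "__main__":
    page = {
        "content_length": 1.0,
        "text_density": 0.6,
        "paragraph_count": 1.0,
        "heading_presence": 1.0,
        "link_density": 0.5,
        "repetition": 0.8,
    }
    score = quality_score(page)
    print(score, "kept" if score >= 0.3 else "filtered out")
```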
Token-aware chunking
Chunks are produced with a recursive splitter that respects natural boundaries:
- Split by paragraph (`\n\n`)
- If a paragraph exceeds `chunkSize`, split by sentence
- If a sentence exceeds `chunkSize`, split by token
- Apply `chunkOverlap` by prepending the last N tokens of the previous chunk
Token counts are computed with js-tiktoken using the cl100k_base encoding, the same encoding used by GPT-4 and the text-embedding-3-* models.
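The splitting steps above can be sketched as follows. This is an illustration, not the actor's code. It makes several assumptions: whitespace-separated words stand in for cl100k_base tokens, sentences are found with a simple punctuation regex, small pieces are merged greedily up to the budget, and the overlap is taken from the previous chunk's own content.

```python
import re
from typing import List


def encode(text: str) -> List[str]:
    """Stand-in tokenizer: whitespace words, not real cl100k_base tokens."""
    return text.split()


def decode(tokens: List[str]) -> str:
    return " ".join(tokens)


def split_units(text: str, limit: int) -> List[List[str]]:
    """Split into token lists of at most `limit` tokens: by paragraph,
    then by sentence, then by fixed token windows."""
    units = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        tokens = encode(paragraph)
        if not tokens:
            continue
        if len(tokens) <= limit:
            units.append(tokens)
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            sentence_tokens = encode(sentence)
            for start in range(0, len(sentence_tokens), limit):
                units.append(sentence_tokens[start:start + limit])
    return units


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> List[str]:
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    budget = chunk_size - chunk_overlap  # room left once the overlap is added
    bodies: List[List[str]] = []
    current: List[str] = []
    for unit in split_units(text, budget):
        if current and len(current) + len(unit) > budget:
            bodies.append(current)
            current = []
        current = current + unit  # greedy merge of small units
    if current:
        bodies.append(current)
    chunks = []
    for index, body in enumerate(bodies):
        # Prepend the last N tokens of the previous chunk body as overlap.
        prefix = bodies[index - 1][-chunk_overlap:] if index > 0 and chunk_overlap else []
        chunks.append(decode(prefix + body))
    return chunks


if __name__ == "__main__":
    text = "one two three four five six. seven eight nine ten.\n\nalpha beta gamma."
    for chunk in chunk_text(text, chunk_size=6, chunk_overlap=2):
        print(len(encode(chunk)), chunk)
```

Reserving `chunk_size - chunk_overlap` tokens for new content keeps every chunk, overlap included, within `chunk_size`. With a real tokenizer you would swap `encode` and `decode` for the library's equivalents.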
Limitations
- No LLM-based extraction (by design — keeps cost predictable)
- Q&A extraction works best on structured pages (FAQ, docs with headings)
- Login-protected content is not supported without cookie injection
- JavaScript-heavy SPAs may need `crawlerType: "playwright"` for full rendering