Q&A Dataset Extractor for LLM Fine-Tuning
Under maintenance
Pricing
from $3.00 / 1,000 q&a pairs
Crawl any website, documentation, or FAQ and turn it into clean, deduplicated question-answer pairs in OpenAI / Alpaca / plain JSONL format, ready for fine-tuning and RAG.
Developer
Deniz Schlösser
Maintained by Community
Turn any website, documentation portal, or FAQ into clean, deduplicated question-answer pairs — ready to fine-tune a model or power a RAG / support chatbot.
Most scrapers stop at raw HTML or Markdown and leave the hard part to you: turning a pile of text into actual training examples. This Actor goes the whole way. It crawls your source, extracts the real content (no nav, ads, or boilerplate), splits it into context-preserving chunks, and uses Claude to generate grounded, self-contained Q&A pairs in the exact JSONL format your training pipeline expects.
You bring a source URL. You get back a fine-tuning-ready dataset.
Why this Actor
- 🎯 Training-ready output, not raw text. Get `{question, answer}` pairs in OpenAI, Alpaca, or plain JSONL, ready to drop straight into a fine-tuning job.
- 🧹 Clean extraction. Mozilla Readability strips navigation, sidebars, cookie banners, and ads. The model never sees junk, so your dataset doesn't either.
- 🔒 Grounded answers, no hallucination. Every answer is constrained to the crawled content: the prompt forbids inventing facts.
- ♻️ Automatic deduplication. Near-identical questions across pages are collapsed, so you don't pay to train on the same thing twice.
- 💸 Bring your own Claude key (BYOK). You control model choice and token spend. Default `claude-haiku-4-5` keeps costs at roughly $1 per 1,000 pairs in tokens.
- 🧪 Free dry run. Preview crawling and chunking with zero LLM cost before you spend anything.
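The listing does not say how near-identical questions are detected. As a rough illustration only, the sketch below assumes two questions count as duplicates when they match after lowercasing and stripping punctuation and extra whitespace. The names `normalize_question` and `dedupe_pairs` are hypothetical and are not part of the Actor.

```python
import re
import string

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_question(question: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = question.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", stripped).strip()


def dedupe_pairs(pairs):
    """Keep the first pair seen for each normalized question (assumed rule)."""
    seen = set()
    unique = []
    for pair in pairs:
        key = normalize_question(pair["question"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(pair)
    return unique


# Made-up pairs: the first two differ only in case and punctuation.
pairs = [
    {"question": "What does the Actor output?", "answer": "JSONL Q&A pairs."},
    {"question": "what does the actor output", "answer": "Q&A pairs in JSONL."},
    {"question": "Which formats are supported?", "answer": "OpenAI, Alpaca, plain."},
]
unique_pairs = dedupe_pairs(pairs)
```

A real near-duplicate filter might compare embeddings or edit distance instead; exact matching on a normalized key is simply the cheapest variant to reason about.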
What it does

    Start URL(s)
       │  1. Crawl → follows same-domain links up to your page limit
       ▼
    Clean Markdown → Mozilla Readability + Turndown (main content only)
       │  2. Chunk → paragraph-aware, with overlap to preserve context
       ▼
    Content chunks
       │  3. Generate → Claude produces up to N grounded Q&A pairs per chunk
       ▼
    Q&A pairs
       │  4. Deduplicate → collapses repeated questions
       ▼
    JSONL dataset → OpenAI / Alpaca / plain, with source_url + source_title

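To make the chunking step more concrete, here is a minimal sketch of paragraph-aware chunking with character overlap. It is an assumption about how such a step could work, not the Actor's code: `chunk_text` is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and the overlap is capped at half the chunk size, which is one reading of the input reference.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list:
    """Pack whole paragraphs into chunks of roughly chunk_size characters.

    chunk_size is a soft limit: a paragraph is never split, so an oversized
    paragraph (or overlap plus paragraph) can exceed it.
    """
    # Assumed rule: the overlap may not exceed half the chunk size.
    overlap = min(chunk_overlap, chunk_size // 2)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size or not current:
            # Paragraph still fits, or it is the first one in this chunk.
            current = candidate
            continue
        chunks.append(current)
        # Start the next chunk with the tail of the previous one as context.
        tail = current[-overlap:] if overlap else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks


# Three 30-character paragraphs, chunked with small limits for illustration.
sample = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
chunks = chunk_text(sample, chunk_size=70, chunk_overlap=10)
```

With these toy values the first two paragraphs share a chunk, and the second chunk begins with the last 10 characters of the first before continuing with the third paragraph.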
Example
Input:

    {
      "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }],
      "maxPagesToCrawl": 5,
      "maxQuestionsPerChunk": 3,
      "outputFormat": "openai",
      "model": "claude-haiku-4-5",
      "anthropicApiKey": "sk-ant-..."
    }

Output (one dataset item, openai format):

    {
      "messages": [
        {
          "role": "user",
          "content": "What is the main project you'll build in this JavaScript web scraping course?"
        },
        {
          "role": "assistant",
          "content": "In this course, you'll create an application for watching prices. It will be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, and more."
        }
      ],
      "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
      "source_title": "Web scraping basics for JavaScript devs | Academy | Apify Documentation"
    }

Every item carries source_url and source_title so you can trace, filter, or cite each example.
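Because each record carries `source_url`, a downloaded JSONL export can be bucketed per page before filtering or citing. The sketch below assumes one JSON object per line; `group_by_source` is a hypothetical helper, and the sample records (including the `example.com` URL) are made up for illustration.

```python
import json
from collections import defaultdict


def group_by_source(jsonl_text: str) -> dict:
    """Parse JSONL text and bucket records by their source_url field."""
    grouped = defaultdict(list)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        grouped[record["source_url"]].append(record)
    return dict(grouped)


def _record(question, answer, url, title):
    """Build one openai-format record for the sample data."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        "source_url": url,
        "source_title": title,
    }


academy_url = "https://docs.apify.com/academy/web-scraping-for-beginners"
sample_jsonl = "\n".join(
    json.dumps(r)
    for r in [
        _record("Q1", "A1", academy_url, "Academy"),
        _record("Q2", "A2", academy_url, "Academy"),
        _record("Q3", "A3", "https://example.com/docs/pricing", "Pricing"),
    ]
)
grouped = group_by_source(sample_jsonl)
```

Grouping this way makes it straightforward to drop every pair from a page that extracted poorly, or to cap the number of examples per source.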
Output formats
Pick the shape your training pipeline expects with the outputFormat setting:
| Format | Shape | Use it for |
|---|---|---|
| `openai` | `{ "messages": [{ "role": "user", ... }, { "role": "assistant", ... }] }` | OpenAI fine-tuning, chat-format SFT |
| `alpaca` | `{ "instruction": "...", "input": "", "output": "..." }` | Llama / Mistral / open-model instruction tuning |
| `plain` | `{ "question": "...", "answer": "..." }` | RAG eval sets, custom pipelines, embeddings |
The output format is independent of the model that generates the pairs — Claude produces pairs in whichever training shape you choose.
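If a dataset was exported in `plain` format and a different shape is needed later, the conversion is mechanical. The following sketch assumes the question maps to the user message (or `instruction`) and the answer to the assistant message (or `output`), as the shapes in the table suggest. `to_openai` and `to_alpaca` are hypothetical helpers, not functions provided by the Actor.

```python
def _with_source(record: dict, pair: dict) -> dict:
    """Copy the traceability fields across when they are present."""
    for key in ("source_url", "source_title"):
        if key in pair:
            record[key] = pair[key]
    return record


def to_openai(pair: dict) -> dict:
    """Convert a plain {question, answer} record to the chat shape."""
    record = {
        "messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]
    }
    return _with_source(record, pair)


def to_alpaca(pair: dict) -> dict:
    """Convert a plain {question, answer} record to the Alpaca shape."""
    record = {"instruction": pair["question"], "input": "", "output": pair["answer"]}
    return _with_source(record, pair)


# Made-up plain record for illustration.
plain = {
    "question": "Which output formats are available?",
    "answer": "openai, alpaca, and plain.",
    "source_url": "https://example.com/docs",
}
openai_record = to_openai(plain)
alpaca_record = to_alpaca(plain)
```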
Use cases
🤖 Build a support chatbot from your docs. Point it at your help center or product docs and generate a Q&A set to fine-tune or seed a RAG index — so the bot answers in your product's own words.
🎓 Fine-tune a domain expert model. Turn a knowledge base, wiki, or set of guides into thousands of instruction examples for a specialized model, without hand-writing them.
📚 Create RAG evaluation sets. Generate grounded question-answer pairs to benchmark retrieval quality — does your RAG system actually find the right answer?
🌍 Localize training data. Use customInstructions (e.g. "Write all questions and answers in German") to produce datasets in any language your source covers.
Input reference
| Field | Description | Default |
|---|---|---|
| `startUrls` | Pages to crawl (follows same-domain links) | none (required) |
| `maxPagesToCrawl` | Hard limit on pages crawled | `10` |
| `maxQuestionsPerChunk` | Q&A pairs generated per content chunk | `3` |
| `chunkSize` / `chunkOverlap` | Chunk size and overlap in characters (overlap auto-capped at half the chunk size) | `4000` / `200` |
| `outputFormat` | `openai`, `alpaca`, or `plain` | `openai` |
| `model` | `claude-haiku-4-5`, `claude-sonnet-4-6`, or `claude-opus-4-8` | `claude-haiku-4-5` |
| `anthropicApiKey` | Your Anthropic (Claude) API key, `sk-ant-...` (secret) | none |
| `customInstructions` | Extra guidance, e.g. "answer in German" or "focus on pricing" | none |
| `dryRun` | Skip the LLM and just output chunks (free preview) | `false` |
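When inputs are generated from a script, it can help to apply the documented defaults and reject typos before starting a run. The sketch below is a client-side convenience only: `build_run_input` is a hypothetical helper, the validation rules are assumptions drawn from the table above, and the Actor may validate its input differently.

```python
ALLOWED_FORMATS = ("openai", "alpaca", "plain")
ALLOWED_MODELS = ("claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-8")

# Defaults as listed in the input reference table.
DEFAULTS = {
    "maxPagesToCrawl": 10,
    "maxQuestionsPerChunk": 3,
    "chunkSize": 4000,
    "chunkOverlap": 200,
    "outputFormat": "openai",
    "model": "claude-haiku-4-5",
    "dryRun": False,
}


def build_run_input(start_urls, anthropic_api_key=None, **overrides):
    """Merge overrides into the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULTS) - {"customInstructions"}
    if unknown:
        raise ValueError(f"Unknown input fields: {sorted(unknown)}")

    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        **DEFAULTS,
        **overrides,
    }
    if run_input["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}")
    if run_input["model"] not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")
    if anthropic_api_key is not None:
        run_input["anthropicApiKey"] = anthropic_api_key
    return run_input


# A free preview run against a placeholder docs site.
preview_input = build_run_input(
    ["https://docs.example.com"], dryRun=True, maxPagesToCrawl=5
)
```

The resulting dictionary has the same shape as the JSON input shown in the example, so it could be serialized and pasted into the Actor's input editor.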
Choosing a model
| Model | Token cost (your key) | Best for |
|---|---|---|
| Claude Haiku 4.5 (default) | ~$1 / 1,000 pairs | High-volume datasets, the cost-conscious default |
| Claude Sonnet 4.6 | ~$3 / 1,000 pairs | Nuanced answers, technical or complex sources |
| Claude Opus 4.8 | ~$5 / 1,000 pairs | Highest quality where it matters most |
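For budgeting, the approximate token costs above can be combined with the listing's "from $3.00 / 1,000 Q&A pairs" price. The sketch below assumes the two charges simply add up (the Actor price billed by the platform, the tokens billed by Anthropic) and treats the listed figures as rough per-1,000-pair rates. `estimate_cost` is a hypothetical helper, and actual costs will vary with page length and pricing tier.

```python
# Listed starting price for the Actor, per 1,000 pairs.
PLATFORM_USD_PER_1K = 3.00

# Approximate token cost per 1,000 pairs, from the table above.
TOKEN_USD_PER_1K = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def estimate_cost(pairs: int, model: str = "claude-haiku-4-5") -> dict:
    """Return a rough USD breakdown for generating `pairs` Q&A pairs."""
    thousands = pairs / 1000
    platform = round(thousands * PLATFORM_USD_PER_1K, 2)
    tokens = round(thousands * TOKEN_USD_PER_1K[model], 2)
    return {"platform": platform, "tokens": tokens, "total": round(platform + tokens, 2)}


haiku_estimate = estimate_cost(5000)
sonnet_estimate = estimate_cost(5000, "claude-sonnet-4-6")
```

Under these assumptions, 5,000 pairs would come to about $20 with the default model and about $30 with Sonnet.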
Quick start
- Add one or more Start URLs (e.g. your docs site).
- Paste your Anthropic (Claude) API key (`sk-ant-...`). Get one at console.anthropic.com.
- (Optional) Set `dryRun: true` first to preview crawling and chunking for free.
- Run, then download the dataset as JSONL and feed it to your fine-tuning job.
Notes & responsible use
- Bring your own key. This Actor calls the Claude API with the key you provide; you are billed by Anthropic for token usage directly.
- Answers are grounded. The prompt forbids inventing facts — answers are constrained to the crawled content.
- Respect each site's rules. Only crawl content you are permitted to use. Honor the target site's Terms of Service and `robots` directives, and respect copyright and data-protection law (e.g. GDPR) for any personal data you process.
Questions or a source that doesn't extract cleanly? Open an issue on the Actor page — feedback shapes the roadmap.