Q&A Dataset Extractor for LLM Fine-Tuning

Under maintenance

Pricing

from $3.00 / 1,000 q&a pairs

Crawl any website, documentation, or FAQ and turn it into clean, deduplicated question-answer pairs in OpenAI, Alpaca, or plain JSONL format, ready for fine-tuning and RAG.

Rating

0.0

(0)

Developer

Deniz Schlösser

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 8 days ago

Turn any website, documentation portal, or FAQ into clean, deduplicated question-answer pairs — ready to fine-tune a model or power a RAG / support chatbot.

Most scrapers stop at raw HTML or Markdown and leave the hard part to you: turning a pile of text into actual training examples. This Actor goes the whole way. It crawls your source, extracts the real content (no nav, ads, or boilerplate), splits it into context-preserving chunks, and uses Claude to generate grounded, self-contained Q&A pairs in the exact JSONL format your training pipeline expects.

You bring a source URL. You get back a fine-tuning-ready dataset.


Why this Actor

  • 🎯 Training-ready output, not raw text. Get {question, answer} pairs in OpenAI, Alpaca, or plain JSONL — drop them straight into a fine-tuning job.
  • 🧹 Clean extraction. Mozilla Readability strips navigation, sidebars, cookie banners, and ads. The model never sees junk, so your dataset doesn't either.
  • 🔒 Grounded answers. The generation prompt restricts every answer to the crawled content and forbids inventing facts, which reduces the risk of hallucination.
  • ♻️ Automatic deduplication. Near-identical questions across pages are collapsed, so you don't pay to train on the same thing twice.
  • 💸 Bring your own Claude key (BYOK). You control model choice and token spend. The default model, claude-haiku-4-5, keeps token costs at roughly $1 per 1,000 pairs.
  • 🧪 Free dry run. Preview crawling and chunking with zero LLM cost before you spend anything.
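
The deduplication bullet above can be illustrated with a minimal sketch. The Actor's actual matching logic is not documented here, so this assumes a simple approach: questions are normalized (case, punctuation, whitespace) and the first pair for each normalized question is kept. The function names are hypothetical.

```python
import re
import unicodedata


def normalize_question(question: str) -> str:
    """Build a comparison key: lowercase, punctuation removed, whitespace collapsed."""
    text = unicodedata.normalize("NFKC", question).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())


def dedupe_pairs(pairs: list[dict]) -> list[dict]:
    """Keep the first pair seen for each normalized question, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for pair in pairs:
        key = normalize_question(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique
```

This only catches questions that differ in case, punctuation, or spacing. Catching paraphrases would need a fuzzier comparison, such as embedding similarity.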

What it does

Start URL(s)
  ↓
1. Crawl: follows same-domain links up to your page limit
  ↓
Clean Markdown (Mozilla Readability + Turndown, main content only)
  ↓
2. Chunk: paragraph-aware, with overlap to preserve context
  ↓
Content chunks
  ↓
3. Generate: Claude produces up to N grounded Q&A pairs per chunk
  ↓
4. Deduplicate: collapses repeated questions
  ↓
JSONL dataset (OpenAI / Alpaca / plain, with source_url + source_title)
Example

Input:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }],
  "maxPagesToCrawl": 5,
  "maxQuestionsPerChunk": 3,
  "outputFormat": "openai",
  "model": "claude-haiku-4-5",
  "anthropicApiKey": "sk-ant-..."
}

Output (one dataset item, openai format):

{
  "messages": [
    {
      "role": "user",
      "content": "What is the main project you'll build in this JavaScript web scraping course?"
    },
    {
      "role": "assistant",
      "content": "In this course, you'll create an application for watching prices. It will be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, and more."
    }
  ],
  "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "source_title": "Web scraping basics for JavaScript devs | Academy | Apify Documentation"
}

Every item carries source_url and source_title so you can trace, filter, or cite each example.
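
As one way to use these provenance fields, the sketch below parses a downloaded JSONL export, counts pairs per source page, and filters by URL prefix. The helper names and the file name in the comment are hypothetical; only the source_url field comes from the output described above.

```python
import json
from collections import Counter
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def pairs_per_source(items: list[dict]) -> Counter:
    """Count how many Q&A pairs came from each source page."""
    return Counter(item["source_url"] for item in items)


def keep_sources(items: list[dict], url_prefix: str) -> list[dict]:
    """Keep only the pairs whose source_url starts with url_prefix."""
    return [item for item in items if item["source_url"].startswith(url_prefix)]


# Typical use with an exported file (hypothetical file name):
#   with open("dataset.jsonl", encoding="utf-8") as f:
#       items = parse_jsonl(f)
#   print(pairs_per_source(items).most_common(5))
```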


Output formats

Pick the shape your training pipeline expects with the outputFormat setting:

| Format | Shape | Use it for |
|---|---|---|
| openai | { "messages": [{ "role": "user", ... }, { "role": "assistant", ... }] } | OpenAI fine-tuning, chat-format SFT |
| alpaca | { "instruction": "...", "input": "", "output": "..." } | Llama / Mistral / open-model instruction tuning |
| plain | { "question": "...", "answer": "..." } | RAG eval sets, custom pipelines, embeddings |

The output format is independent of the model that generates the pairs — Claude produces pairs in whichever training shape you choose.
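
Because the three shapes hold the same question and answer, an existing dataset can be reshaped locally instead of regenerated. The sketch below shows one possible converter; the function names are hypothetical, and it assumes each openai item holds exactly one user and one assistant message, as in the example output.

```python
def extract_pair(item: dict) -> tuple[str, str]:
    """Read the question and answer out of any of the three documented shapes."""
    if "messages" in item:  # openai shape
        by_role = {m["role"]: m["content"] for m in item["messages"]}
        return by_role["user"], by_role["assistant"]
    if "instruction" in item:  # alpaca shape
        return item["instruction"], item["output"]
    return item["question"], item["answer"]  # plain shape


def reshape(item: dict, target: str) -> dict:
    """Convert one dataset item to the target shape, keeping provenance fields."""
    question, answer = extract_pair(item)
    if target == "openai":
        result = {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
    elif target == "alpaca":
        result = {"instruction": question, "input": "", "output": answer}
    elif target == "plain":
        result = {"question": question, "answer": answer}
    else:
        raise ValueError(f"unsupported format: {target!r}")
    for key in ("source_url", "source_title"):
        if key in item:
            result[key] = item[key]
    return result
```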


Use cases

🤖 Build a support chatbot from your docs. Point it at your help center or product docs and generate a Q&A set to fine-tune or seed a RAG index — so the bot answers in your product's own words.

🎓 Fine-tune a domain expert model. Turn a knowledge base, wiki, or set of guides into thousands of instruction examples for a specialized model, without hand-writing them.

📚 Create RAG evaluation sets. Generate grounded question-answer pairs to benchmark retrieval quality — does your RAG system actually find the right answer?

🌍 Localize training data. Use customInstructions (e.g. "Write all questions and answers in German") to produce datasets in any language your source covers.
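
For the RAG evaluation use case, a simple metric is the share of questions for which the retriever returns the page the pair was generated from. The sketch below assumes plain-format items and a caller-supplied retrieve(question, k) function that returns a list of URLs; both the function name and that retriever interface are hypothetical.

```python
from typing import Callable


def hit_rate_at_k(
    items: list[dict],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose source_url is among the top-k retrieved URLs."""
    if not items:
        return 0.0
    hits = sum(
        1 for item in items if item["source_url"] in retrieve(item["question"], k)
    )
    return hits / len(items)
```

This measures retrieval only. Judging whether the final generated answer matches the reference answer needs a separate comparison step.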


Input reference

| Field | Description | Default |
|---|---|---|
| startUrls | Pages to crawl (follows same-domain links) | (required) |
| maxPagesToCrawl | Hard limit on pages crawled | 10 |
| maxQuestionsPerChunk | Q&A pairs generated per content chunk | 3 |
| chunkSize / chunkOverlap | Chunk size and overlap in characters (overlap is automatically capped at half) | 4000 / 200 |
| outputFormat | openai, alpaca, or plain | openai |
| model | claude-haiku-4-5, claude-sonnet-4-6, or claude-opus-4-8 | claude-haiku-4-5 |
| anthropicApiKey | Your Anthropic (Claude) API key, sk-ant-... (secret) | |
| customInstructions | Extra guidance, e.g. "answer in German" or "focus on pricing" | |
| dryRun | Skip the LLM and output chunks only (free preview) | false |
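
When inputs are assembled in code, a small helper can apply the documented defaults and catch typos before a run starts. This is a sketch, not part of the Actor: build_input is a hypothetical name, the checks are illustrative, and the overlap cap is assumed to mean half of chunkSize.

```python
ALLOWED_FORMATS = {"openai", "alpaca", "plain"}

# Defaults copied from the input reference table.
DEFAULTS = {
    "maxPagesToCrawl": 10,
    "maxQuestionsPerChunk": 3,
    "chunkSize": 4000,
    "chunkOverlap": 200,
    "outputFormat": "openai",
    "model": "claude-haiku-4-5",
    "dryRun": False,
}


def build_input(start_urls: list[str], **overrides) -> dict:
    """Merge overrides into the documented defaults and run basic sanity checks."""
    if not start_urls:
        raise ValueError("startUrls is required")
    run_input = {**DEFAULTS, **overrides}
    if run_input["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported outputFormat: {run_input['outputFormat']!r}")
    # Mirror the cap locally (assumption: half of chunkSize).
    run_input["chunkOverlap"] = min(
        run_input["chunkOverlap"], run_input["chunkSize"] // 2
    )
    run_input["startUrls"] = [{"url": url} for url in start_urls]
    return run_input
```

For example, build_input(["https://example.com/docs"], dryRun=True) yields an input for a free preview run with all other fields at their defaults.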

Choosing a model

| Model | Token cost (your key) | Best for |
|---|---|---|
| Claude Haiku 4.5 (default) | ~$1 / 1,000 pairs | High-volume datasets, the cost-conscious default |
| Claude Sonnet 4.6 | ~$3 / 1,000 pairs | Nuanced answers, technical or complex sources |
| Claude Opus 4.8 | ~$5 / 1,000 pairs | Highest quality where it matters most |
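
The total cost of a run combines the Actor price with the token cost billed by Anthropic. The arithmetic below is a rough estimate only: it assumes the listed "from $3.00 / 1,000 pairs" Actor price applies as a flat rate and treats the approximate token figures from the table as exact. The function and constant names are hypothetical.

```python
ACTOR_PRICE_PER_1K = 3.00  # listed starting price, assumed flat

# Approximate token costs per 1,000 pairs, taken from the table above.
TOKEN_COST_PER_1K = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def estimate_cost(pairs: int, model: str = "claude-haiku-4-5") -> dict:
    """Estimate Actor fee, token spend, and total in USD for a number of pairs."""
    thousands = pairs / 1000
    actor = round(thousands * ACTOR_PRICE_PER_1K, 2)
    tokens = round(thousands * TOKEN_COST_PER_1K[model], 2)
    return {"actor": actor, "tokens": tokens, "total": round(actor + tokens, 2)}
```

Under these assumptions, 10,000 pairs with the default model come to about $30 in Actor fees plus about $10 in tokens, roughly $40 in total.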

Quick start

  1. Add one or more Start URLs (e.g. your docs site).
  2. Paste your Anthropic (Claude) API key (sk-ant-...). Get one at console.anthropic.com.
  3. (Optional) Set dryRun: true first to preview crawling and chunking for free.
  4. Run, then download the dataset as JSONL and feed it to your fine-tuning job.
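
Before handing the downloaded dataset to a fine-tuning job, it is common to hold out a validation set. The sketch below shows one simple way to do that; the function names and the 10% default are assumptions, not something the Actor provides.

```python
import json
import random


def split_dataset(
    items: list[dict], validation_fraction: float = 0.1, seed: int = 0
) -> tuple[list[dict], list[dict]]:
    """Shuffle deterministically, then split into (train, validation)."""
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def to_jsonl(items: list[dict]) -> str:
    """Serialize items as JSONL: one JSON object per line."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)
```

A random split can place pairs from the same page on both sides. Splitting by source_url instead avoids that overlap when it matters for your evaluation.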

Notes & responsible use

  • Bring your own key. This Actor calls the Claude API with the key you provide; you are billed by Anthropic for token usage directly.
  • Answers are grounded. The prompt forbids inventing facts — answers are constrained to the crawled content.
  • Respect each site's rules. Only crawl content you are permitted to use. Honor the target site's Terms of Service and robots directives, and respect copyright and data-protection law (e.g. GDPR) for any personal data you process.

Questions or a source that doesn't extract cleanly? Open an issue on the Actor page — feedback shapes the roadmap.