Deprecated

Pricing

from $3.00 / 1,000 q&a pairs

Q&A Dataset Extractor for LLM Fine-Tuning

Crawl any website, documentation or FAQ and turn it into clean, deduplicated question-answer pairs in OpenAI / Alpaca / plain JSONL format - ready for fine-tuning and RAG.

Developer

Deniz Schlösser

Last modified

2 months ago

Why this Actor

🎯 Training-ready output, not raw text. Get {question, answer} pairs in OpenAI, Alpaca, or plain JSONL — drop them straight into a fine-tuning job.
🧹 Clean extraction. Mozilla Readability strips navigation, sidebars, cookie banners, and ads. The model never sees junk, so your dataset doesn't either.
🔒 Grounded answers. The prompt restricts every answer to the crawled content and forbids inventing facts, which limits hallucination.
♻️ Automatic deduplication. Near-identical questions across pages are collapsed, so you don't pay to train on the same thing twice.
💸 Bring your own Claude key (BYOK). You control model choice and token spend. Default claude-haiku-4-5 keeps costs at roughly $1 per 1,000 pairs in tokens.
🧪 Free dry run. Preview crawling and chunking with zero LLM cost before you spend anything.
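
The deduplication step mentioned in the list above is not documented in detail, so the following is only a minimal sketch of how near-identical questions could be collapsed. It assumes pairs in the `plain` shape and uses normalized text plus a similarity cutoff; `normalize`, `dedupe_pairs`, and the `threshold` value are hypothetical names and settings, not part of the Actor.

```python
import re
from difflib import SequenceMatcher


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe_pairs(pairs: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first pair from each group of near-identical questions.

    `threshold` is an assumed similarity cutoff, not a documented setting.
    """
    kept: list[dict] = []
    seen: list[str] = []
    for pair in pairs:
        key = normalize(pair["question"])
        is_duplicate = any(
            SequenceMatcher(None, key, other).ratio() >= threshold
            for other in seen
        )
        if not is_duplicate:
            kept.append(pair)
            seen.append(key)
    return kept
```

The pairwise comparison is quadratic in the number of questions, which is fine for a few thousand pairs; a larger dataset would likely need hashing or embedding-based clustering instead.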

What it does

Start URL(s)
   │  1. Crawl       → follows same-domain links up to your page limit
   ▼
Clean Markdown      → Mozilla Readability + Turndown (main content only)
   │  2. Chunk       → paragraph-aware, with overlap to preserve context
   ▼
Content chunks
   │  3. Generate    → Claude produces up to N grounded Q&A pairs per chunk
   ▼
Q&A pairs
   │  4. Deduplicate → collapses repeated questions
   ▼
JSONL dataset       → OpenAI / Alpaca / plain, with source_url + source_title
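
Step 2 of the pipeline above (paragraph-aware chunking with overlap) can be illustrated with a short sketch. This is not the Actor's implementation: it assumes paragraphs are separated by blank lines, measures sizes in characters, and caps the overlap at half the chunk size as the input reference describes. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks on paragraph boundaries, with character overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Cap the overlap at half the chunk size.
    overlap = min(max(chunk_overlap, 0), chunk_size // 2)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        # Keep growing the chunk while it fits. A single oversized
        # paragraph is kept whole rather than split mid-sentence.
        if len(candidate) <= chunk_size or not current:
            current = candidate
            continue
        chunks.append(current)
        # Carry the tail of the finished chunk into the next one for context.
        # Simplification: tail plus paragraph may slightly exceed chunk_size.
        tail = current[-overlap:] if overlap else ""
        current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps each answer's supporting text together, and the overlap gives the model context that would otherwise be cut off at a chunk edge.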

Example

Input:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }],
  "maxPagesToCrawl": 5,
  "maxQuestionsPerChunk": 3,
  "outputFormat": "openai",
  "model": "claude-haiku-4-5",
  "anthropicApiKey": "sk-ant-..."
}

Output (one dataset item, openai format):

{
  "messages": [
    {
      "role": "user",
      "content": "What is the main project you'll build in this JavaScript web scraping course?"
    },
    {
      "role": "assistant",
      "content": "In this course, you'll create an application for watching prices. It will be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, and more."
    }
  ],
  "source_url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "source_title": "Web scraping basics for JavaScript devs | Academy | Apify Documentation"
}

Every item carries source_url and source_title so you can trace, filter, or cite each example.
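
As an illustration of tracing examples back to their pages, the sketch below groups a downloaded JSONL export by `source_url`. It assumes one JSON object per line with a `source_url` field, as in the example item; the helper name `group_by_source` is hypothetical.

```python
import json
from collections import defaultdict


def group_by_source(jsonl_text: str) -> dict[str, list[dict]]:
    """Group dataset items by the page they were generated from."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the export
        item = json.loads(line)
        groups[item["source_url"]].append(item)
    return dict(groups)
```

With the items grouped this way, dropping every example from one low-quality page is a single dictionary deletion before the data goes to training.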

Output formats

Pick the shape your training pipeline expects with the outputFormat setting:

| Format | Shape | Use it for |
| --- | --- | --- |
| `openai` | `{ "messages": [{ "role": "user", ... }, { "role": "assistant", ... }] }` | OpenAI fine-tuning, chat-format SFT |
| `alpaca` | `{ "instruction": "...", "input": "", "output": "..." }` | Llama / Mistral / open-model instruction tuning |
| `plain` | `{ "question": "...", "answer": "..." }` | RAG eval sets, custom pipelines, embeddings |

The output format is independent of the model that generates the pairs — Claude produces pairs in whichever training shape you choose.
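
To make the three shapes concrete, here is a small sketch that maps one question-answer pair to each format. The field names come from the table above; the function `to_training_record` is a hypothetical helper, useful for example when converting a `plain` export to another shape after the fact.

```python
def to_training_record(question: str, answer: str, output_format: str = "openai") -> dict:
    """Build one record in the requested training shape."""
    if output_format == "openai":
        return {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
    if output_format == "alpaca":
        # The "input" field stays empty: the question carries all the context.
        return {"instruction": question, "input": "", "output": answer}
    if output_format == "plain":
        return {"question": question, "answer": answer}
    raise ValueError(f"Unknown output format: {output_format!r}")
```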

Use cases

🤖 Build a support chatbot from your docs. Point it at your help center or product docs and generate a Q&A set to fine-tune or seed a RAG index — so the bot answers in your product's own words.

🎓 Fine-tune a domain expert model. Turn a knowledge base, wiki, or set of guides into thousands of instruction examples for a specialized model, without hand-writing them.

📚 Create RAG evaluation sets. Generate grounded question-answer pairs to benchmark retrieval quality — does your RAG system actually find the right answer?

🌍 Localize training data. Use customInstructions (e.g. "Write all questions and answers in German") to produce datasets in any language your source covers.

Input reference

| Field | Description | Default |
| --- | --- | --- |
| `startUrls` | Pages to crawl (follows same-domain links) | required |
| `maxPagesToCrawl` | Hard limit on pages crawled | `10` |
| `maxQuestionsPerChunk` | Maximum Q&A pairs generated per content chunk | `3` |
| `chunkSize` / `chunkOverlap` | Chunk size and overlap in characters (overlap is automatically capped at half the chunk size) | `4000` / `200` |
| `outputFormat` | `openai` \| `alpaca` \| `plain` | `openai` |
| `model` | `claude-haiku-4-5` \| `claude-sonnet-4-6` \| `claude-opus-4-8` | `claude-haiku-4-5` |
| `anthropicApiKey` | Your Anthropic (Claude) API key, `sk-ant-...` (secret) | none |
| `customInstructions` | Extra guidance, e.g. "answer in German" or "focus on pricing" | none |
| `dryRun` | Skip the LLM and just output chunks for a free preview | `false` |
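
When assembling the input programmatically, it can help to merge overrides onto the documented defaults and reject unsupported values before starting a run. The sketch below does only that, using the field names and defaults from the table; `build_run_input` is a hypothetical helper, and the Actor may validate its input differently.

```python
DEFAULTS = {
    "maxPagesToCrawl": 10,
    "maxQuestionsPerChunk": 3,
    "chunkSize": 4000,
    "chunkOverlap": 200,
    "outputFormat": "openai",
    "model": "claude-haiku-4-5",
    "dryRun": False,
}
OPTIONAL_FIELDS = {"anthropicApiKey", "customInstructions"}
OUTPUT_FORMATS = {"openai", "alpaca", "plain"}
MODELS = {"claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-8"}


def build_run_input(start_urls: list[str], **overrides) -> dict:
    """Merge overrides onto the documented defaults and validate enums."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"Unknown input fields: {sorted(unknown)}")

    run_input = {**DEFAULTS, **overrides}
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"Unsupported outputFormat: {run_input['outputFormat']!r}")
    if run_input["model"] not in MODELS:
        raise ValueError(f"Unsupported model: {run_input['model']!r}")

    # startUrls uses the list-of-objects shape shown in the example input.
    run_input["startUrls"] = [{"url": url} for url in start_urls]
    return run_input
```

For a free preview, `build_run_input(["https://example.com/docs"], dryRun=True)` yields an input that skips the LLM step entirely.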

Choosing a model

| Model | Token cost (your key) | Best for |
| --- | --- | --- |
| Claude Haiku 4.5 (default) | ~$1 / 1,000 pairs | High-volume datasets, the cost-conscious default |
| Claude Sonnet 4.6 | ~$3 / 1,000 pairs | Nuanced answers, technical or complex sources |
| Claude Opus 4.8 | ~$5 / 1,000 pairs | Highest quality where it matters most |
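
A rough budget can be worked out from the figures above. The sketch assumes the listed Actor price ("from $3.00 / 1,000 q&a pairs") is charged in addition to the approximate token cost billed by Anthropic, and treats both as flat per-pair rates; real costs depend on source length and current pricing. `estimate_cost` is a hypothetical helper.

```python
# Approximate token cost in USD per 1,000 pairs, from the table above.
TOKEN_COST_PER_1000 = {
    "claude-haiku-4-5": 1.0,
    "claude-sonnet-4-6": 3.0,
    "claude-opus-4-8": 5.0,
}
# Listed starting price of the Actor in USD per 1,000 pairs (a lower bound).
ACTOR_FEE_PER_1000 = 3.0


def estimate_cost(pairs: int, model: str = "claude-haiku-4-5") -> dict[str, float]:
    """Estimate Actor fee, token cost, and total in USD for a number of pairs."""
    thousands = pairs / 1000
    actor_fee = thousands * ACTOR_FEE_PER_1000
    token_cost = thousands * TOKEN_COST_PER_1000[model]
    return {
        "actor_fee": actor_fee,
        "token_cost": token_cost,
        "total": actor_fee + token_cost,
    }
```

Under these assumptions, 10,000 pairs with the default model come to about $30 in Actor fees plus about $10 in tokens, roughly $40 in total.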

Quick start

1. Add one or more Start URLs (e.g. your docs site).
2. Paste your Anthropic (Claude) API key (sk-ant-...). Get one at console.anthropic.com.
3. (Optional) Set dryRun: true first to preview crawling and chunking for free.
4. Run, then download the dataset as JSONL and feed it to your fine-tuning job.

Notes & responsible use

Bring your own key. This Actor calls the Claude API with the key you provide; you are billed by Anthropic for token usage directly.
Answers are grounded. The prompt forbids inventing facts — answers are constrained to the crawled content.
Respect each site's rules. Only crawl content you are permitted to use. Honor the target site's Terms of Service and robots directives, and respect copyright and data-protection law (e.g. GDPR) for any personal data you process.

Questions or a source that doesn't extract cleanly? Open an issue on the Actor page — feedback shapes the roadmap.
