Reddit RAG Dataset — LLM Training Data from Posts & Comments

Pricing

from $2.00 / 1,000 results

Build clean LLM and RAG datasets from Reddit. Export posts with full comment threads as ready-to-chunk text, HTML and Markdown — only text-bearing records with parent/child thread structure. No login or developer token needed.

Developer

Black Falcon Data

Maintained by Community

What does Reddit RAG Dataset do?

Reddit RAG Dataset exports Reddit posts and their full nested comment threads as clean, ready-to-chunk text for LLM training, RAG knowledge bases, and semantic search corpora. Point it at any subreddit, keyword search, or specific post URL and it returns only text-bearing records; empty and link-only posts are dropped automatically. Each record carries three body formats (plain text, HTML, and Markdown) plus thread structure (postId, parentId, depth), scores, authors, timestamps, and community. No Reddit account or login is required.

New to Apify? Sign up free and use the included $5 monthly platform credit to test this actor.

Key features

  • 🤖 RAG-ready triple-format output — every record is exported as clean text, HTML, and Markdown so you can chunk and embed without any preprocessing step.
  • 🚫 Text-bearing records only, no empty rows — link-only posts and records without body text are dropped automatically, giving you a clean dataset with zero empty rows.
  • 🌳 Full thread structure preserved — comments carry postId, parentId, and depth so you can rebuild the conversation tree for context-aware chunking and hierarchical retrieval.
  • 🔎 Any subreddit, search query, or thread — seed the corpus from a subreddit feed, a keyword search, or specific post URLs and mix inputs in one run.
  • ⚖️ Scale with maxItems and maxComments — cap total records and comments per post independently to control dataset size and cost precisely.
  • 🤝 AI / MCP / automation-friendly — structured JSON output with stable field names integrates directly into LangChain, LlamaIndex, MCP tools, and custom agent pipelines.
  • 🔑 No login or API key required — works on any public subreddit or search term without a Reddit account or developer token.

What data can you extract from reddit.com?

Only records that carry body text are returned — empty posts, link-only posts, and records without a descriptionText are filtered out automatically, so your dataset contains no empty rows.

Every record carries:

  • Body in three formats: descriptionText (clean plain text), descriptionHtml (raw HTML), and descriptionMarkdown (Markdown). Choose one format or keep all three for multi-pipeline flexibility.
  • Thread structure: postId, parentId, and depth on every comment, so you can reconstruct the full conversation tree for context-aware chunking and retrieval.
  • Standard metadata: score, author, community, createdAt, and canonical url.
  • Item type: post or comment (itemType field), so you can separate top-level posts from replies in your pipeline.

Posts from subreddit feeds carry additional fields: title, upvoteRatio, numComments, awardCount, and postType. Search hits are lighter discovery records (id, url, title, subreddit, nsfw) — their comment threads are still fetched unless you skip them.
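
As a rough illustration of how the thread-structure fields could be used downstream, the Python sketch below regroups a flat list of exported records into nested trees. It relies only on the field names listed above (itemType, id, postId, parentId, score). How a top-level comment encodes its parent is an assumption here: the sketch treats a missing parentId as "reply to the post" and falls back to postId. The function name build_thread_tree and the sample ids are hypothetical.

```python
from collections import defaultdict


def build_thread_tree(records):
    """Regroup flat post/comment records into nested trees keyed by post id."""
    children = defaultdict(list)  # parent id -> comments that reply to it
    posts = {}
    for rec in records:
        if rec.get("itemType") == "post":
            posts[rec["id"]] = rec
        else:
            # Assumption: a top-level comment points at the post via parentId,
            # or has no parentId at all, in which case postId is used instead.
            parent = rec.get("parentId") or rec.get("postId")
            children[parent].append(rec)

    def attach(node):
        # Highest-scored replies first; a missing score counts as 0.
        replies = sorted(
            children.get(node["id"], []),
            key=lambda c: c.get("score") or 0,
            reverse=True,
        )
        return {**node, "replies": [attach(c) for c in replies]}

    return {post_id: attach(post) for post_id, post in posts.items()}


# Hypothetical sample records, shaped like the fields described above.
sample_records = [
    {"itemType": "post", "id": "t3_a", "descriptionText": "Question"},
    {"itemType": "comment", "id": "t1_b", "postId": "t3_a", "parentId": "t3_a", "score": 3},
    {"itemType": "comment", "id": "t1_c", "postId": "t3_a", "parentId": "t3_a", "score": 9},
    {"itemType": "comment", "id": "t1_d", "postId": "t3_a", "parentId": "t1_b", "score": 1},
]
tree = build_thread_tree(sample_records)
```

Each node keeps its original fields and gains a replies list, so a chunker can walk the tree and prepend parent text to a reply when more context is needed.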

Input

Configure the actor through the input schema in Apify Console.

Key parameters:

  • startUrls — Reddit URLs to scrape — subreddits, post pages, user profiles, community pages, or search result pages. Each URL determines what type of content is fetched.
  • searchTerms — Search Reddit for these terms. Each entry becomes an independent search. Search posts are lightweight discovery records (plus their comments) — see Search Type.
  • searchType — Type of results to return when using Search Terms. Post results are lightweight discovery records — id, url, title, subreddit and NSFW flag — plus their comment threads; scrape a result's URL directly for its full post fields (author, body, score, timestamp). (default: "posts")
  • sort — Sort order for posts and search results. (default: "hot")
  • time — Restrict subreddit-feed results to a time window (applies to Top sort on feeds; search is not time-windowed). (default: "all")
  • includeNSFW — Include posts and communities marked as NSFW (18+). (default: false)
  • postDateLimit — Skip posts older than this ISO-8601 date (e.g. "2024-01-01"). Applies to subreddit feeds and post URLs; search results carry no date and are not filtered. Leave blank for no date limit.
  • maxItems — Maximum total records to save across all sources (posts, comments, users, communities). (default: 100)
  • maxComments — Maximum number of comments to collect from each post page. (default: 200)
  • includeCollapsed — Expand and include comments that are initially collapsed (controversial or low-score). Enables deeper thread coverage, up to the comment and depth limits you set. (default: true)
  • commentDepth — Maximum reply nesting depth to collect (1 = top-level only). (default: 10)
  • skipComments — Do not collect comments from post pages — output posts only. (default: false)
  • ...and 4 more parameters

Input examples

RAG dataset from a subreddit — Pull text-bearing posts and comment threads from a subreddit to build a domain-specific RAG corpus.

→ Posts with body text from r/MachineLearning, each followed by its nested comments — ready to chunk and embed.

{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/MachineLearning/"
    }
  ],
  "maxItems": 100,
  "maxComments": 200
}

Topic dataset via keyword search — Search Reddit for a specific topic and collect the top posts with their threads.

→ Top posts matching the query, each with comments — suitable for a focused fine-tuning corpus.

{
  "searchTerms": [
    "retrieval augmented generation"
  ],
  "searchType": "posts",
  "sort": "top",
  "maxItems": 200
}

Markdown-only export for chunking — Return only the Markdown body format to keep dataset size small when piping straight into a text splitter.

→ Posts and comments from r/LocalLLaMA with descriptionMarkdown populated and other body formats omitted.

{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/LocalLLaMA/"
    }
  ],
  "descriptionFormat": "markdown",
  "maxItems": 100,
  "maxComments": 300
}

Deep single-thread capture — Pull one post and its entire comment tree for analysis or fine-tuning on a specific discussion.

→ One post record and all its nested comments with full thread structure (parentId, depth).

{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/MachineLearning/comments/1abc234/example_discussion/"
    }
  ],
  "includeCollapsed": true,
  "commentDepth": 10,
  "maxComments": 500
}

Output

Each run produces a dataset of structured Reddit records. Results can be downloaded as JSON, CSV, or Excel from the Dataset tab in Apify Console.

Example Reddit record

{
  "itemType": "post",
  "id": "t3_1ttjtwv",
  "url": "https://www.reddit.com/r/programming/comments/1ttjtwv/your_process_memory_is_a_file_the/",
  "title": "Your process' memory is a file: The underappreciated gem that is /proc/<pid>/mem",
  "body": null,
  "bodyHtml": null,
  "contentHref": "https://lcamtuf.substack.com/p/weekend-trivia-your-process-memory",
  "postType": "link",
  "language": "en",
  "score": 129,
  "upvoteRatio": 0.9708029197080292,
  "numComments": 1,
  "awardCount": 0,
  "author": "mttd",
  "authorId": "t2_6gkbb",
  "community": "r/programming",
  "communityId": "t5_2fwo",
  "createdAt": "2026-06-01T08:32:12.581+02:00",
  "icon": "https://www.redditstatic.com/avatars/defaults/v2/avatar_default_7.png",
  "nsfw": false
}

Example post record

{
  "itemType": "post",
  "id": "t3_1ml2x7a",
  "url": "https://www.reddit.com/r/MachineLearning/comments/1ml2x7a/rag_with_reddit_threads/",
  "title": "Anyone successfully built a RAG pipeline over Reddit threads?",
  "community": "MachineLearning",
  "author": "embeddings_fan",
  "score": 312,
  "upvoteRatio": 0.97,
  "numComments": 84,
  "awardCount": 2,
  "postType": "self",
  "createdAt": "2026-05-14T09:22:11.000Z",
  "language": "en",
  "description": "I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often tangential. Curious how others...",
  "descriptionText": "I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often tangential. Curious how others...",
  "descriptionHtml": "<div class=\"py-0\"><p dir=\"auto\">I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often...",
  "descriptionMarkdown": "I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often tangential. Curious how others...",
  "nsfw": false
}
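
One possible way to turn a record like the one above into embeddable passages is sketched below in Python. It is a simple character-window splitter, not something the actor provides: the function name record_to_chunks, the window sizes, and the chunkIndex key are all illustrative assumptions. Only the record field names (descriptionMarkdown, descriptionText, id, itemType, url, community, postId, parentId) come from this page.

```python
def record_to_chunks(record, max_chars=800, overlap=100):
    """Split one record's body into overlapping character windows.

    Prefers descriptionMarkdown and falls back to descriptionText.
    Each chunk carries a copy of the record's identifying metadata.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    text = (record.get("descriptionMarkdown") or record.get("descriptionText") or "").strip()
    metadata = {
        key: record.get(key)
        for key in ("id", "itemType", "url", "community", "postId", "parentId")
    }

    chunks = []
    step = max_chars - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + max_chars],
            "chunkIndex": len(chunks),
            **metadata,
        })
        if start + max_chars >= len(text):
            break  # this window already reached the end of the body
    return chunks
```

A production pipeline would more likely split on sentence or Markdown boundaries and count tokens rather than characters; the point here is only that the id, postId, and parentId metadata can travel with every chunk so retrieved passages can be traced back to their thread.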

How to scrape reddit.com

  1. Go to Reddit RAG Dataset in Apify Console.
  2. Configure the input.
  3. Set maxItems to control how many results you need.
  4. Click Start and wait for the run to finish.
  5. Export the dataset as JSON, CSV, or Excel.

Use cases

  • Build RAG knowledge bases from subreddit discussions — index posts and comment threads as chunked, embedded passages for semantic retrieval.
  • Assemble LLM fine-tuning and training corpora from real human conversations, filtered to text-bearing records only.
  • Generate embeddings and semantic search indexes for topic research using community-sourced text at scale.
  • Build topic research corpora on any subject by combining subreddit feeds and keyword searches in one run.
  • Feed structured Reddit threads into AI agents and MCP tools that need grounded, human-written context.
  • Archive community discussions for longitudinal analysis, sentiment tracking, or academic research.

How much does it cost to scrape reddit.com?

Reddit RAG Dataset uses pay-per-event pricing. You pay a small fee when the run starts and then for each result that is actually produced.

  • Run start: $0.008 per run
  • Per result: $0.002 per Reddit record

Example costs:

  • 10 results: $0.028
  • 25 results: $0.058
  • 100 results: $0.21
  • 200 results: $0.41
  • 500 results: $1.01
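
The example costs above follow from the two listed fees: one run-start fee plus the per-result fee times the number of records. The short Python sketch below reproduces that arithmetic. The function name estimate_cost is illustrative, and the estimate covers only the two event fees stated in this section.

```python
from decimal import Decimal

RUN_START_FEE = Decimal("0.008")   # per run, as listed above
PER_RESULT_FEE = Decimal("0.002")  # per Reddit record, as listed above


def estimate_cost(num_results, runs=1):
    """Return the estimated event cost in USD for the given number of results."""
    return RUN_START_FEE * runs + PER_RESULT_FEE * num_results


for n in (10, 25, 100, 200, 500):
    print(f"{n} results: ${estimate_cost(n)}")
```

For 100 results this gives 0.008 + 100 × 0.002 = 0.208, which the list above shows rounded to $0.21.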

FAQ

How many results can I get from reddit.com?

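The number of results depends on your inputs (subreddit, search terms, or post URLs) and on how many text-bearing posts and comments are publicly available for them on reddit.com. Use the maxItems parameter to cap the total number of records returned per run, and maxComments to cap comments per post.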

Can I integrate Reddit RAG Dataset with other apps?

Yes. Reddit RAG Dataset works with Apify's integrations to connect with tools like Zapier, Make, Google Sheets, Slack, and more. You can also use webhooks to trigger actions when a run completes.

Can I use Reddit RAG Dataset with the Apify API?

Yes. You can start runs, manage inputs, and retrieve results programmatically through the Apify API. Client libraries are available for JavaScript, Python, and other languages.
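
A minimal sketch of a programmatic run with the Python client (the apify-client package) might look like the following. The input keys come from the parameter list on this page. The helper names build_run_input and run_actor are hypothetical, the actor ID must be copied from the actor's page because it is not stated here, and the dictionary-style access to the run's defaultDatasetId assumes the 1.x client behaviour.

```python
def build_run_input(subreddit_urls=(), search_terms=(), max_items=100, max_comments=200):
    """Assemble an input object using the parameter names documented above."""
    run_input = {"maxItems": max_items, "maxComments": max_comments}
    if subreddit_urls:
        run_input["startUrls"] = [{"url": url} for url in subreddit_urls]
    if search_terms:
        run_input["searchTerms"] = list(search_terms)
    return run_input


def run_actor(token, actor_id, run_input):
    """Start a run, wait for it to finish, and return the dataset items as a list.

    actor_id is a placeholder: use the ID shown on the actor's page.
    """
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Assumes the run is returned as a dict (apify-client 1.x).
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


example_input = build_run_input(
    subreddit_urls=["https://www.reddit.com/r/MachineLearning/"],
    search_terms=["retrieval augmented generation"],
)
```

Keeping input construction separate from the network call makes it easy to inspect or log the exact input before spending credit on a run.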

Can I use Reddit RAG Dataset through an MCP Server?

Yes. Apify provides an MCP Server that lets AI assistants and agents call this actor directly. Use a single descriptionFormat and excludeEmptyFields to keep payloads manageable for LLM context windows.

This actor extracts publicly available data from reddit.com. Web scraping of public information is generally considered legal, but you should always review the target site's terms of service and ensure your use case complies with applicable laws and regulations, including GDPR where relevant.

Your feedback

If you have questions, need a feature, or found a bug, please open an issue on the actor's page in Apify Console. Your feedback helps us improve.

Getting started with Apify

New to Apify? Create a free account with $5 credit — no credit card required.

  1. Sign up — $5 platform credit included
  2. Open this actor and configure your input
  3. Click Start — export results as JSON, CSV, or Excel

Need more later? See Apify pricing.