# Reddit RAG Dataset — LLM Training Data from Posts & Comments (`blackfalcondata/reddit-rag-dataset`) Actor

Build clean LLM and RAG datasets from Reddit. Export posts with full comment threads as ready-to-chunk text, HTML and Markdown — only text-bearing records with parent/child thread structure. No login or developer token needed.

- **URL**: https://apify.com/blackfalcondata/reddit-rag-dataset.md
- **Developed by:** [Black Falcon Data](https://apify.com/blackfalcondata) (community)
- **Categories:** AI, Lead generation, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $2.00 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### What does Reddit RAG Dataset do?

Reddit RAG Dataset exports Reddit posts and their full nested comment threads as clean, ready-to-chunk text for LLM training, RAG knowledge bases, and semantic search corpora. Point it at any subreddit, keyword search, or specific post URL and get only text-bearing records (empty and link-only posts are dropped automatically). Each record carries its body in three formats (plain text, HTML, and Markdown) plus thread structure (`postId`, `parentId`, `depth`), scores, authors, timestamps, and community. No Reddit account or login required.

**New to Apify?** [Sign up free](https://console.apify.com/sign-up?fpr=1h3gvi) and use the included $5 monthly platform credit to test this Actor.

### Key features

<!-- KEY_FEATURES:START -->
- **🤖 RAG-ready triple-format output** — every record is exported as clean text, HTML, and Markdown so you can chunk and embed without any preprocessing step.
- **🚫 Text-bearing records only, no empty rows** — link-only posts and records without body text are dropped automatically, giving you a clean dataset with zero empty rows.
- **🌳 Full thread structure preserved** — comments carry `postId`, `parentId`, and `depth` so you can rebuild the conversation tree for context-aware chunking and hierarchical retrieval.
- **🔎 Any subreddit, search query, or thread** — seed the corpus from a subreddit feed, a keyword search, or specific post URLs and mix inputs in one run.
- **⚖️ Scale with maxItems and maxComments** — cap total records and comments per post independently to control dataset size and cost precisely.
- **🤝 AI / MCP / automation-friendly** — structured JSON output with stable field names integrates directly into LangChain, LlamaIndex, MCP tools, and custom agent pipelines.
- **🔑 No login or API key required** — works on any public subreddit or search term without a Reddit account or developer token.
<!-- KEY_FEATURES:END -->

### What data can you extract from reddit.com?

Only records that carry body text are returned — empty posts, link-only posts, and records without a `descriptionText` are filtered out automatically, so your dataset contains no empty rows.

Every record carries:

- **Body in three formats** — `descriptionText` (clean plain text), `descriptionHtml` (raw HTML), and `descriptionMarkdown` (Markdown) — choose one format or keep all three for multi-pipeline flexibility.
- **Thread structure** — `postId`, `parentId`, and `depth` on every comment, so you can reconstruct the full conversation tree for context-aware chunking and retrieval.
- **Standard metadata** — `score`, `author`, `community`, `createdAt`, and canonical `url`.
- **Item type** — `post` or `comment` (`itemType` field), so you can separate top-level posts from replies in your pipeline.

Posts from subreddit feeds carry additional fields: `title`, `upvoteRatio`, `numComments`, `awardCount`, and `postType`. Search hits are lighter discovery records (`id`, `url`, `title`, `subreddit`, `nsfw`) — their comment threads are still fetched unless you skip them.
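
The thread fields listed above are enough to rebuild a conversation tree on the client side. The following is a minimal sketch, not the Actor's own code. It assumes that every comment has an `id` and that its `parentId` equals the `id` of the post or comment it replies to; the README does not spell out the exact ID format, so verify this against your own dataset. The sample records and the helper names `build_thread` and `render_thread` are hypothetical.

```python
from collections import defaultdict


def build_thread(records):
    """Group comment records under their parent, keyed by parent ID.

    Assumption: a comment's `parentId` equals the `id` of the post or
    comment it replies to. Check this against your own dataset.
    """
    children = defaultdict(list)
    for rec in records:
        if rec.get("itemType") == "comment":
            children[rec.get("parentId")].append(rec)
    return children


def render_thread(root_id, children, indent=0):
    """Return the replies under `root_id` as indented lines, depth-first."""
    lines = []
    for comment in children.get(root_id, []):
        lines.append("  " * indent + comment.get("descriptionText", ""))
        lines.extend(render_thread(comment["id"], children, indent + 1))
    return lines


# Hypothetical records, shaped like the fields listed above.
sample = [
    {"itemType": "post", "id": "p1", "descriptionText": "Post body"},
    {"itemType": "comment", "id": "c1", "postId": "p1", "parentId": "p1",
     "depth": 1, "descriptionText": "Top-level reply"},
    {"itemType": "comment", "id": "c2", "postId": "p1", "parentId": "c1",
     "depth": 2, "descriptionText": "Nested reply"},
]

tree = build_thread(sample)
print("\n".join(render_thread("p1", tree)))
```

Keeping a parent's text next to each reply when you chunk is one way to preserve context for nested comments that would be ambiguous on their own.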


### Input

Configure the Actor through the input schema in Apify Console.

Key parameters:

- **`startUrls`** — Reddit URLs to scrape — subreddits, post pages, user profiles, community pages, or search result pages. Each URL determines what type of content is fetched.
- **`searchTerms`** — Search Reddit for these terms. Each entry becomes an independent search. Search posts are lightweight discovery records (plus their comments) — see Search Type.
- **`searchType`** — Type of results to return when using Search Terms. Post results are lightweight discovery records — id, url, title, subreddit and NSFW flag — plus their comment threads; scrape a result's URL directly for its full post fields (author, body, score, timestamp). (default: `"posts"`)
- **`sort`** — Sort order for posts and search results. (default: `"hot"`)
- **`time`** — Restrict subreddit-feed results to a time window (applies to Top sort on feeds; search is not time-windowed). (default: `"all"`)
- **`includeNSFW`** — Include posts and communities marked as NSFW (18+). (default: `false`)
- **`postDateLimit`** — Skip posts older than this ISO-8601 date (e.g. "2024-01-01"). Applies to subreddit feeds and post URLs; search results carry no date and are not filtered. Leave blank for no date limit.
- **`maxItems`** — Maximum total records to save across all sources (posts, comments, users, communities). (default: `100`)
- **`maxComments`** — Maximum number of comments to collect from each post page. (default: `200`)
- **`includeCollapsed`** — Expand and include comments that are initially collapsed (controversial or low-score). Enables deeper thread coverage, up to the comment and depth limits you set. (default: `true`)
- **`commentDepth`** — Maximum reply nesting depth to collect (1 = top-level only). (default: `10`)
- **`skipComments`** — Do not collect comments from post pages — output posts only. (default: `false`)
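- ...and 4 more parameters (`descriptionFormat`, `excludeEmptyFields`, `includeRunMetadata`, `appConnector`), described in the Actor input schema below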
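
When you build inputs in code, it can help to reject bad values before starting a paid run. The sketch below checks a few parameters against the allowed values and minimums published in the OpenAPI specification further down this page. The helper name `build_input` is hypothetical, and the "at least one source" check is a local sanity check, not a documented rule of the Actor.

```python
# Allowed values copied from the `sort` and `time` enums in the OpenAPI spec.
SORT_OPTIONS = {"relevance", "hot", "top", "new", "comments"}
TIME_OPTIONS = {"hour", "day", "week", "month", "year", "all"}


def build_input(start_urls=(), search_terms=(), sort="hot", time="all",
                max_items=100, max_comments=200, comment_depth=10):
    """Build an Actor input dict, rejecting values the schema would not accept."""
    if not start_urls and not search_terms:
        # Local sanity check (assumption): a run without sources has nothing to fetch.
        raise ValueError("provide at least one start URL or search term")
    if sort not in SORT_OPTIONS:
        raise ValueError(f"sort must be one of {sorted(SORT_OPTIONS)}")
    if time not in TIME_OPTIONS:
        raise ValueError(f"time must be one of {sorted(TIME_OPTIONS)}")
    if max_items < 1:
        raise ValueError("maxItems has a minimum of 1")
    if max_comments < 0:
        raise ValueError("maxComments has a minimum of 0")
    if comment_depth < 1:
        raise ValueError("commentDepth has a minimum of 1")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "searchTerms": list(search_terms),
        "sort": sort,
        "time": time,
        "maxItems": max_items,
        "maxComments": max_comments,
        "commentDepth": comment_depth,
    }


run_input = build_input(
    start_urls=["https://www.reddit.com/r/MachineLearning/"],
    sort="top",
    time="month",
)
print(run_input)
```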

### Input examples

**RAG dataset from a subreddit** — Pull text-bearing posts and comment threads from a subreddit to build a domain-specific RAG corpus.

→ Posts with body text from r/MachineLearning, each followed by its nested comments — ready to chunk and embed.

```json
{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/MachineLearning/"
    }
  ],
  "maxItems": 100,
  "maxComments": 200
}
````

**Topic dataset via keyword search** — Search Reddit for a specific topic and collect the top posts with their threads.

→ Top posts matching the query, each with comments — suitable for a focused fine-tuning corpus.

```json
{
  "searchTerms": [
    "retrieval augmented generation"
  ],
  "searchType": "posts",
  "sort": "top",
  "maxItems": 200
}
```

**Markdown-only export for chunking** — Return only the Markdown body format to keep dataset size small when piping straight into a text splitter.

→ Posts and comments from r/LocalLLaMA with `descriptionMarkdown` populated and other body formats omitted.

```json
{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/LocalLLaMA/"
    }
  ],
  "descriptionFormat": "markdown",
  "maxItems": 100,
  "maxComments": 300
}
```
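
Once records arrive with `descriptionMarkdown` populated, a text splitter can be as simple as packing paragraphs into size-limited chunks. This is a minimal sketch of one possible approach, not something the Actor does for you. The helper names and the output keys `text`, `sourceId`, and `chunkIndex` are hypothetical; only `id`, `url`, and `descriptionMarkdown` come from the record format described in this README.

```python
def chunk_markdown(text, max_chars=1000):
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that case.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def record_to_chunks(record, max_chars=1000):
    """Turn one dataset record into chunk dicts that keep their provenance."""
    body = record.get("descriptionMarkdown") or ""
    return [
        {"text": chunk, "sourceId": record.get("id"),
         "url": record.get("url"), "chunkIndex": index}
        for index, chunk in enumerate(chunk_markdown(body, max_chars))
    ]


# Hypothetical record with a three-paragraph Markdown body.
record = {
    "id": "t3_example",
    "url": "https://www.reddit.com/r/LocalLLaMA/",
    "descriptionMarkdown": "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.",
}
for chunk in record_to_chunks(record, max_chars=40):
    print(chunk["chunkIndex"], repr(chunk["text"]))
```

Splitting on paragraph boundaries keeps each chunk readable on its own; a token-based splitter from your embedding stack may be a better fit if you need strict size limits.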

**Deep single-thread capture** — Pull one post and its entire comment tree for analysis or fine-tuning on a specific discussion.

→ One post record and all its nested comments with full thread structure (parentId, depth).

```json
{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/MachineLearning/comments/1abc234/example_discussion/"
    }
  ],
  "includeCollapsed": true,
  "commentDepth": 10,
  "maxComments": 500
}
```

### Output

Each run produces a dataset of structured Reddit records. Results can be downloaded as JSON, CSV, or Excel from the Dataset tab in Apify Console.

### Example Reddit record

```json
{
  "itemType": "post",
  "id": "t3_1ttjtwv",
  "url": "https://www.reddit.com/r/programming/comments/1ttjtwv/your_process_memory_is_a_file_the/",
  "title": "Your process' memory is a file: The underappreciated gem that is /proc/<pid>/mem",
  "body": null,
  "bodyHtml": null,
  "contentHref": "https://lcamtuf.substack.com/p/weekend-trivia-your-process-memory",
  "postType": "link",
  "language": "en",
  "score": 129,
  "upvoteRatio": 0.9708029197080292,
  "numComments": 1,
  "awardCount": 0,
  "author": "mttd",
  "authorId": "t2_6gkbb",
  "community": "r/programming",
  "communityId": "t5_2fwo",
  "createdAt": "2026-06-01T08:32:12.581+02:00",
  "icon": "https://www.redditstatic.com/avatars/defaults/v2/avatar_default_7.png",
  "nsfw": false
}
```

### Example post record

```json
{
  "itemType": "post",
  "id": "t3_1ml2x7a",
  "url": "https://www.reddit.com/r/MachineLearning/comments/1ml2x7a/rag_with_reddit_threads/",
  "title": "Anyone successfully built a RAG pipeline over Reddit threads?",
  "community": "MachineLearning",
  "author": "embeddings_fan",
  "score": 312,
  "upvoteRatio": 0.97,
  "numComments": 84,
  "awardCount": 2,
  "postType": "self",
  "createdAt": "2026-05-14T09:22:11.000Z",
  "language": "en",
  "description": "I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often tangential. Curious how others...",
  "descriptionText": "I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often tangential. Curious how others...",
  "descriptionHtml": "<div class=\"py-0\"><p dir=\"auto\">I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often...",
  "descriptionMarkdown": "I've been experimenting with chunking Reddit threads for a domain-specific RAG system. The challenge is that top-level comments add context but nested replies are often tangential. Curious how others...",
  "nsfw": false
}
```

### How to scrape reddit.com

1. Go to [Reddit RAG Dataset](https://apify.com/blackfalcondata/reddit-rag-dataset?fpr=1h3gvi) in Apify Console.
2. Configure the input.
3. Set `maxItems` to control how many results you need.
4. Click **Start** and wait for the run to finish.
5. Export the dataset as JSON, CSV, or Excel.

### Use cases

- Build RAG knowledge bases from subreddit discussions — index posts and comment threads as chunked, embedded passages for semantic retrieval.
- Assemble LLM fine-tuning and training corpora from real human conversations, filtered to text-bearing records only.
- Generate embeddings and semantic search indexes for topic research using community-sourced text at scale.
- Build topic research corpora on any subject by combining subreddit feeds and keyword searches in one run.
- Feed structured Reddit threads into AI agents and MCP tools that need grounded, human-written context.
- Archive community discussions for longitudinal analysis, sentiment tracking, or academic research.

### How much does it cost to scrape reddit.com?

Reddit RAG Dataset uses [pay-per-event](https://docs.apify.com/platform/actors/paid-actors/pay-per-event) pricing. You pay a small fee when the run starts and then for each result that is actually produced.

- **Run start:** $0.008 per run
- **Per result:** $0.002 per Reddit record

Example costs:

- 10 results: **$0.028**
- 25 results: **$0.058**
- 100 results: **$0.21**
- 200 results: **$0.41**
- 500 results: **$1.01**
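
The example costs follow from the two prices above: cost = $0.008 + $0.002 × results, rounded to the nearest cent for the larger figures. The sketch below reproduces that arithmetic so you can budget other run sizes. The function name `estimate_event_cost` is hypothetical, and the result covers only the pay-per-event charges, not Apify platform usage, which is billed separately.

```python
from decimal import Decimal

RUN_START_FEE = Decimal("0.008")   # per run, from the price list above
PER_RESULT_FEE = Decimal("0.002")  # per Reddit record, from the price list above


def estimate_event_cost(results, runs=1):
    """Return the pay-per-event charge in USD, excluding platform usage."""
    return RUN_START_FEE * runs + PER_RESULT_FEE * results


for count in (10, 25, 100, 200, 500):
    print(f"{count} results: ${estimate_event_cost(count)}")
```

`Decimal` is used instead of floats so that the sums stay exact at the tenth-of-a-cent level.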

### FAQ

#### How many results can I get from reddit.com?

The number of results depends on your inputs and on how many public posts and comments match them on reddit.com. Use the `maxItems` parameter to cap the total number of records saved per run.

#### Can I integrate Reddit RAG Dataset with other apps?

Yes. Reddit RAG Dataset works with Apify's [integrations](https://apify.com/integrations?fpr=1h3gvi) to connect with tools like Zapier, Make, Google Sheets, Slack, and more. You can also use webhooks to trigger actions when a run completes.

#### Can I use Reddit RAG Dataset with the Apify API?

Yes. You can start runs, manage inputs, and retrieve results programmatically through the [Apify API](https://docs.apify.com/api/v2). Official client libraries are available for JavaScript/TypeScript and Python; projects in other languages can call the REST API directly.
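
If you prefer plain HTTP over a client library, the request can be assembled with the Python standard library alone. The sketch below builds a POST to the `run-sync-get-dataset-items` path listed in the OpenAPI specification on this page, with the token passed as a query parameter as that spec describes. The helper name `build_sync_request` is hypothetical, and the commented-out call assumes the response body is JSON; the request is only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "blackfalcondata~reddit-rag-dataset"


def build_sync_request(token, actor_input):
    """Build (but do not send) a POST to the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://www.reddit.com/r/programming/"}], "maxItems": 5},
)
print(request.get_method(), request.full_url)

# Sending the request starts a real run and incurs charges:
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)  # assumption: the body is a JSON list of records
```

A synchronous call holds the connection open until the run finishes, so for large runs the asynchronous `/runs` endpoint or the official clients are likely a better fit.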

#### Can I use Reddit RAG Dataset through an MCP Server?

Yes. Apify provides an [MCP Server](https://apify.com/apify/actors-mcp-server?fpr=1h3gvi) that lets AI assistants and agents call this Actor directly. Set `descriptionFormat` to a single format and enable `excludeEmptyFields` to keep payloads manageable for LLM context windows.

#### Is it legal to scrape reddit.com?

This Actor extracts publicly available data from reddit.com. Web scraping of public information is generally considered legal, but you should always review the target site's terms of service and ensure your use case complies with applicable laws and regulations, including GDPR where relevant.

#### Your feedback

If you have questions, need a feature, or found a bug, please [open an issue](https://apify.com/blackfalcondata/reddit-rag-dataset/issues?fpr=1h3gvi) on the Actor's page in Apify Console. Your feedback helps us improve.

### You might also like

- [Reddit Email Scraper — Extract Emails from Posts & Comments](https://apify.com/blackfalcondata/reddit-email-scraper?fpr=1h3gvi) — Extract email addresses and contact details from Reddit posts, comments and user profiles. Search.
- [Reddit Lead Scraper — Emails, Socials & Contact Info](https://apify.com/blackfalcondata/reddit-lead-scraper?fpr=1h3gvi) — Turn Reddit into a B2B lead list. Keep only records that expose a contact signal — email, social.
- [Reddit Scraper 💰 $1.25/1K — Posts & Full Comment Threads](https://apify.com/blackfalcondata/reddit-scraper?fpr=1h3gvi) — Scrape Reddit posts with their full nested comment threads, user profiles, and community pages.
- [Reddit Sentiment Scraper — Analyze Posts & Comments](https://apify.com/blackfalcondata/reddit-sentiment-scraper?fpr=1h3gvi) — Scrape Reddit and score every post and comment for sentiment — positive, negative or neutral with a.
- [YouTube Scraper $2/1K — Videos, Channels, Comments, Transcripts](https://apify.com/blackfalcondata/youtube-data-scraper?fpr=1h3gvi) — Scrape YouTube videos, channels, comments, and transcripts in one tool — by keyword or by video,.

### Getting started with Apify

New to Apify? [Create a free account with $5 credit](https://console.apify.com/sign-up?fpr=1h3gvi) — no credit card required.

1. Sign up — $5 platform credit included
2. Open this Actor and configure your input
3. Click **Start** — export results as JSON, CSV, or Excel

Need more later? [See Apify pricing](https://apify.com/pricing?fpr=1h3gvi).

# Actor input Schema

## `startUrls` (type: `array`):

Reddit URLs to scrape — subreddits, post pages, user profiles, community pages, or search result pages. Each URL determines what type of content is fetched.

## `searchTerms` (type: `array`):

Search Reddit for these terms. Each entry becomes an independent search. Search posts are lightweight discovery records (plus their comments) — see Search Type.

## `searchType` (type: `string`):

Type of results to return when using Search Terms. Post results are lightweight discovery records — id, url, title, subreddit and NSFW flag — plus their comment threads; scrape a result's URL directly for its full post fields (author, body, score, timestamp).

## `sort` (type: `string`):

Sort order for posts and search results.

## `time` (type: `string`):

Restrict subreddit-feed results to a time window (applies to Top sort on feeds; search is not time-windowed).

## `includeNSFW` (type: `boolean`):

Include posts and communities marked as NSFW (18+).

## `postDateLimit` (type: `string`):

Skip posts older than this ISO-8601 date (e.g. "2024-01-01"). Applies to subreddit feeds and post URLs; search results carry no date and are not filtered. Leave blank for no date limit.

## `maxItems` (type: `integer`):

Maximum total records to save across all sources (posts, comments, users, communities).

## `maxComments` (type: `integer`):

Maximum number of comments to collect from each post page.

## `includeCollapsed` (type: `boolean`):

Expand and include comments that are initially collapsed (controversial or low-score). Enables deeper thread coverage, up to the comment and depth limits you set.

## `commentDepth` (type: `integer`):

Maximum reply nesting depth to collect (1 = top-level only).

## `skipComments` (type: `boolean`):

Do not collect comments from post pages — output posts only.

## `descriptionFormat` (type: `string`):

Controls which body/description fields are included in output. "all" emits text + HTML + markdown variants.

## `excludeEmptyFields` (type: `boolean`):

Strip null and empty fields from output records to reduce payload size.

## `includeRunMetadata` (type: `boolean`):

Append a single run-summary record at the end of the dataset (run ID, timing, item counts). It is marked itemType="runMetadata" and is added in addition to your matching records — filter on itemType to exclude it.
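
Filtering on `itemType` can be done in a few lines once the dataset is downloaded. The sketch below is one way to separate posts, comments, and the run-summary record. The helper name `split_records` and the `other` bucket are hypothetical, and the sample items are made up for illustration.

```python
def split_records(items):
    """Separate posts, comments, and the optional run-summary record."""
    buckets = {"post": [], "comment": [], "runMetadata": [], "other": []}
    for item in items:
        kind = item.get("itemType")
        # Anything with an unexpected or missing itemType lands in "other".
        buckets[kind if kind in buckets else "other"].append(item)
    return buckets


# Hypothetical dataset items; only `itemType` matters for the split.
items = [
    {"itemType": "post", "id": "p1"},
    {"itemType": "comment", "id": "c1"},
    {"itemType": "runMetadata"},
]
buckets = split_records(items)
corpus = buckets["post"] + buckets["comment"]  # records to chunk and embed
print(len(corpus), len(buckets["runMetadata"]))
```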

## `appConnector` (type: `string`):

Optional. Pick a connected app under Settings → API & Integrations to receive your scraped Reddit results. Notion is supported today (a run-summary page); other MCP connectors are best-effort as Apify expands its catalog.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/programming/"
    }
  ],
  "searchTerms": [],
  "searchType": "posts",
  "sort": "hot",
  "time": "all",
  "includeNSFW": false,
  "maxItems": 5,
  "maxComments": 200,
  "includeCollapsed": true,
  "commentDepth": 10,
  "skipComments": false,
  "descriptionFormat": "all",
  "excludeEmptyFields": false,
  "includeRunMetadata": false
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.reddit.com/r/programming/"
        }
    ],
    "searchTerms": [],
    "maxItems": 5,
    "includeCollapsed": false,
    "descriptionFormat": "all",
    "excludeEmptyFields": false
};

// Run the Actor and wait for it to finish
const run = await client.actor("blackfalcondata/reddit-rag-dataset").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://www.reddit.com/r/programming/" }],
    "searchTerms": [],
    "maxItems": 5,
    "includeCollapsed": False,
    "descriptionFormat": "all",
    "excludeEmptyFields": False,
}

# Run the Actor and wait for it to finish
run = client.actor("blackfalcondata/reddit-rag-dataset").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.reddit.com/r/programming/"
    }
  ],
  "searchTerms": [],
  "maxItems": 5,
  "includeCollapsed": false,
  "descriptionFormat": "all",
  "excludeEmptyFields": false
}' |
apify call blackfalcondata/reddit-rag-dataset --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=blackfalcondata/reddit-rag-dataset",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Reddit RAG Dataset — LLM Training Data from Posts & Comments",
        "description": "Build clean LLM and RAG datasets from Reddit. Export posts with full comment threads as ready-to-chunk text, HTML and Markdown — only text-bearing records with parent/child thread structure. No login or developer token needed.",
        "version": "0.1",
        "x-build-id": "y9Te8kjNwRC0PEZKA"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/blackfalcondata~reddit-rag-dataset/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-blackfalcondata-reddit-rag-dataset",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/blackfalcondata~reddit-rag-dataset/runs": {
            "post": {
                "operationId": "runs-sync-blackfalcondata-reddit-rag-dataset",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/blackfalcondata~reddit-rag-dataset/run-sync": {
            "post": {
                "operationId": "run-sync-blackfalcondata-reddit-rag-dataset",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "🔗 Start URLs",
                        "type": "array",
                        "description": "Reddit URLs to scrape — subreddits, post pages, user profiles, community pages, or search result pages. Each URL determines what type of content is fetched.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "searchTerms": {
                        "title": "🔎 Search Terms",
                        "type": "array",
                        "description": "Search Reddit for these terms. Each entry becomes an independent search. Search posts are lightweight discovery records (plus their comments) — see Search Type.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "searchType": {
                        "title": "🔎 Search Type",
                        "enum": [
                            "posts"
                        ],
                        "type": "string",
                        "description": "Type of results to return when using Search Terms. Post results are lightweight discovery records — id, url, title, subreddit and NSFW flag — plus their comment threads; scrape a result's URL directly for its full post fields (author, body, score, timestamp).",
                        "default": "posts"
                    },
                    "sort": {
                        "title": "📊 Sort",
                        "enum": [
                            "relevance",
                            "hot",
                            "top",
                            "new",
                            "comments"
                        ],
                        "type": "string",
                        "description": "Sort order for posts and search results.",
                        "default": "hot"
                    },
                    "time": {
                        "title": "🕒 Time Filter",
                        "enum": [
                            "hour",
                            "day",
                            "week",
                            "month",
                            "year",
                            "all"
                        ],
                        "type": "string",
                        "description": "Restrict subreddit-feed results to a time window (applies to Top sort on feeds; search is not time-windowed).",
                        "default": "all"
                    },
                    "includeNSFW": {
                        "title": "🔞 Include NSFW",
                        "type": "boolean",
                        "description": "Include posts and communities marked as NSFW (18+).",
                        "default": false
                    },
                    "postDateLimit": {
                        "title": "📅 Post Date Limit",
                        "type": "string",
                        "description": "Skip posts older than this ISO-8601 date (e.g. \"2024-01-01\"). Applies to subreddit feeds and post URLs; search results carry no date and are not filtered. Leave blank for no date limit."
                    },
                    "maxItems": {
                        "title": "🔢 Max Items",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum total records to save across all sources (posts, comments, users, communities).",
                        "default": 100
                    },
                    "maxComments": {
                        "title": "💬 Max Comments Per Post",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of comments to collect from each post page.",
                        "default": 200
                    },
                    "includeCollapsed": {
                        "title": "💬 Include Collapsed Comments",
                        "type": "boolean",
                        "description": "Expand and include comments that are initially collapsed (controversial or low-score). Enables deeper thread coverage, up to the comment and depth limits you set.",
                        "default": true
                    },
                    "commentDepth": {
                        "title": "🌳 Comment Depth",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum reply nesting depth to collect (1 = top-level only).",
                        "default": 10
                    },
                    "skipComments": {
                        "title": "⏭️ Skip Comments",
                        "type": "boolean",
                        "description": "Do not collect comments from post pages — output posts only.",
                        "default": false
                    },
                    "descriptionFormat": {
                        "title": "📄 Description Format",
                        "enum": [
                            "all",
                            "text",
                            "html",
                            "markdown"
                        ],
                        "type": "string",
                        "description": "Controls which body/description fields are included in output. \"all\" emits text + HTML + markdown variants.",
                        "default": "all"
                    },
                    "excludeEmptyFields": {
                        "title": "🧹 Exclude Empty Fields",
                        "type": "boolean",
                        "description": "Strip null and empty fields from output records to reduce payload size.",
                        "default": false
                    },
                    "includeRunMetadata": {
                        "title": "📋 Include Run Metadata",
                        "type": "boolean",
                        "description": "Append a single run-summary record at the end of the dataset (run ID, timing, item counts). It is marked itemType=\"runMetadata\" and is added in addition to your matching records — filter on itemType to exclude it.",
                        "default": false
                    },
                    "appConnector": {
                        "title": "Send results to Notion (or another connected app)",
                        "type": "string",
                        "description": "Optional. Pick a connected app under Settings → API & Integrations to receive your scraped Reddit results. Notion is supported today (a run-summary page); other MCP connectors are best-effort as Apify expands its catalog."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
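
The input constraints listed in the schema above (enums, minimums, defaults) can be checked locally before a run is started. The following is a minimal sketch, not part of the Actor: `build_run_input` is a hypothetical helper, it covers only the fields shown in this part of the schema, and any other keys are passed through unchecked. The date check uses Python's `datetime.fromisoformat`, which may reject some valid ISO-8601 forms on older Python versions.

```python
from datetime import datetime

# Allowed values, defaults and minimums copied from the input schema above.
TIME_VALUES = {"hour", "day", "week", "month", "year", "all"}
FORMAT_VALUES = {"all", "text", "html", "markdown"}
DEFAULTS = {
    "time": "all",
    "includeNSFW": False,
    "maxItems": 100,
    "maxComments": 200,
    "includeCollapsed": True,
    "commentDepth": 10,
    "skipComments": False,
    "descriptionFormat": "all",
    "excludeEmptyFields": False,
    "includeRunMetadata": False,
}
MINIMUMS = {"maxItems": 1, "maxComments": 0, "commentDepth": 1}


def build_run_input(**overrides):
    """Hypothetical helper: merge overrides into the schema defaults and validate."""
    run_input = {**DEFAULTS, **overrides}

    if run_input["time"] not in TIME_VALUES:
        raise ValueError(f"time must be one of {sorted(TIME_VALUES)}")
    if run_input["descriptionFormat"] not in FORMAT_VALUES:
        raise ValueError(f"descriptionFormat must be one of {sorted(FORMAT_VALUES)}")

    for key, minimum in MINIMUMS.items():
        value = run_input[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
            raise ValueError(f"{key} must be an integer >= {minimum}")

    limit = run_input.get("postDateLimit")
    if limit:
        # Raises ValueError if the string is not a parseable ISO date.
        datetime.fromisoformat(limit)

    return run_input


if __name__ == "__main__":
    print(build_run_input(time="week", maxItems=50, postDateLimit="2024-01-01"))
```

Validating locally only catches mistakes earlier; the platform's own validation of the input remains the authority.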

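A run response shaped like `runsResponseSchema` nests everything under `data`, and the `includeRunMetadata` option appends one record marked `itemType="runMetadata"` to the dataset. The sketch below shows one way to read the fields a caller usually needs and to separate that summary record from the scraped records. Both function names and the sample values are hypothetical, and the code works on plain dictionaries rather than calling any client library.

```python
def summarize_run(response):
    """Pick a few fields out of a response shaped like runsResponseSchema."""
    data = response.get("data", {})
    return {
        "runId": data.get("id"),
        "status": data.get("status"),
        "datasetId": data.get("defaultDatasetId"),
        "costUsd": data.get("usageTotalUsd"),
    }


def split_run_metadata(items):
    """Separate scraped records from the optional runMetadata summary record."""
    records = [item for item in items if item.get("itemType") != "runMetadata"]
    metadata = next(
        (item for item in items if item.get("itemType") == "runMetadata"),
        None,  # no summary record when includeRunMetadata is false
    )
    return records, metadata


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_response = {
        "data": {
            "id": "run-123",
            "status": "READY",
            "defaultDatasetId": "dataset-456",
            "usageTotalUsd": 0.00005,
        }
    }
    sample_items = [
        {"itemType": "post", "title": "Example post"},
        {"itemType": "comment", "body": "Example comment"},
        {"itemType": "runMetadata", "itemCount": 2},
    ]
    print(summarize_run(sample_response))
    records, metadata = split_run_metadata(sample_items)
    print(len(records), metadata)
```

Filtering on `itemType` keeps the summary record out of downstream chunking or embedding steps while still making it available for logging.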