# Reddit Scraper (`prodiger/reddit-scraper`) Actor

Extract posts, comments, user profiles, and search results from Reddit. Pure HTTP, no API key required.

- **URL**: https://apify.com/prodiger/reddit-scraper.md
- **Developed by:** [Arnas](https://apify.com/prodiger) (community)
- **Categories:** Social media, AI, Lead generation
- **Stats:** 4 total users, 2 monthly users, 98.6% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.60 / 1,000 posts

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Since this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Reddit Scraper

A no-API-key Reddit scraper covering posts, comments, users, and search — input/output/pricing snapshot-compatible with `automation-lab/reddit-scraper` as of the published version date. Pure HTTP, no headless browser, three output formats (default JSON, OpenAI fine-tune JSONL, RAG-ready markdown).

### What does Reddit Scraper do?

Reddit Scraper extracts structured data from Reddit at $1.15 per 1,000 posts (FREE tier, scaling down to $0.28 per 1,000 at DIAMOND). Reddit has 1.7 billion monthly visits and 100,000+ active communities, making it the largest public discussion platform on the web. This actor scrapes posts, comments, search results, and user profiles from any public subreddit. Just paste any Reddit URL or enter a search query and get clean JSON, CSV, or Excel output. No Reddit account or API key needed.

It supports subreddit listings (hot, new, top, rising), individual posts with nested comments, user submission history, and full-text search across all of Reddit or within a specific subreddit.

Built on pure HTTP requests (no browser), it runs fast and keeps costs low.

### Who is Reddit Scraper for?

- **Researchers** — collect public opinion data, survey sentiment on topics, build datasets for academic studies
- **Market analysts** — track brand mentions, product feedback, and competitor discussions across subreddits
- **SEO & content marketers** — discover trending topics, find content ideas, monitor keyword discussions
- **AI/ML engineers** — gather training data, build sentiment analysis datasets, feed LLM pipelines with real conversations
- **Journalists** — monitor communities for breaking stories, track public reactions to events
- **Product managers** — collect user feedback from product subreddits, track feature requests and bug reports
- **Lead generation teams** — find potential customers asking for solutions your product solves
- **Social listening agencies** — monitor Reddit alongside other platforms for brand and reputation tracking

### Why use Reddit Scraper?

- **Posts + comments in one actor** — no need to run separate scrapers
- **All input types** — subreddits, posts, users, search queries, or just paste any Reddit URL
- **Pure HTTP** — no browser, low memory, fast execution
- **Clean, AI-ready output** — three formats including OpenAI fine-tune JSONL and RAG markdown
- **Pagination built in** — scrape hundreds or thousands of posts automatically
- **Pay only for results** — pay-per-event pricing, no monthly subscription
- **No API key required** — works without Reddit developer credentials
- **Keyword filtering** — filter results to only keep posts matching specific terms (filtered posts are not charged)

### What data can you extract from Reddit?

#### Post fields

| Field | Description |
|-------|-------------|
| `title` | Post title |
| `author` | Reddit username |
| `subreddit` | Subreddit name |
| `score` | Net upvotes |
| `upvoteRatio` | Upvote percentage (0-1) |
| `numComments` | Comment count |
| `createdAt` | ISO 8601 timestamp |
| `url` | Full Reddit URL |
| `selfText` | Post body text |
| `link` | External link (for link posts) |
| `domain` | Link domain |
| `isVideo`, `isSelf`, `isNSFW`, `isSpoiler`, `isStickied` | Content flags |
| `linkFlairText` | Post flair |
| `totalAwards` | Award count |
| `subredditSubscribers` | Subreddit size |
| `imageUrls` | Extracted image URLs (gallery- and preview-aware, HTML-decoded) |
| `thumbnail` | Thumbnail URL |
| `scrapedAt` | When this actor scraped it |

#### Comment fields

| Field | Description |
|-------|-------------|
| `author` | Commenter username |
| `body` | Comment text |
| `score` | Net upvotes |
| `createdAt` | ISO 8601 timestamp |
| `depth` | Nesting level (0 = top-level) |
| `isSubmitter` | Whether commenter is the post author |
| `parentId` | Parent comment/post ID |
| `replies` | Number of direct replies |
| `postId` | Parent post ID |
| `postTitle` | Parent post title |
| `scrapedAt` | When this actor scraped it |
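
The flat comment records can be rebuilt into threads client-side from `parentId`. A minimal sketch, assuming the field names in the table above and Reddit's standard ID prefixes (`t3_` = parent is the post, `t1_` = parent is a comment); the sample records are illustrative:

```python
def build_thread(comments):
    """Group flat comment dicts into a tree keyed by comment id."""
    nodes = {c["id"]: {**c, "children": []} for c in comments}
    roots = []
    for c in comments:
        parent = c["parentId"]
        if parent.startswith("t1_") and parent[3:] in nodes:
            # parent is another comment we scraped -> attach as a reply
            nodes[parent[3:]]["children"].append(nodes[c["id"]])
        else:
            # "t3_..." parents are the post itself -> top-level comment
            roots.append(nodes[c["id"]])
    return roots

comments = [
    {"id": "aaa", "parentId": "t3_1qw5kwf", "body": "top-level", "depth": 0},
    {"id": "bbb", "parentId": "t1_aaa", "body": "reply", "depth": 1},
]
thread = build_thread(comments)
```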

### How much does it cost to scrape Reddit?

Pay-per-event pricing — you pay only for what you scrape. No monthly subscription. The tier price applied to your run depends on your Apify subscription plan.

| Event | FREE | BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
|-------|------|--------|--------|------|----------|---------|
| Actor start | $0.003/run | $0.003/run | $0.003/run | $0.003/run | $0.003/run | $0.003/run |
| Per post | $0.00115 | $0.001 | $0.00078 | $0.0006 | $0.0004 | $0.00028 |
| Per comment | $0.000575 | $0.0005 | $0.00039 | $0.0003 | $0.0002 | $0.00014 |

At the FREE tier, that's roughly **$1.15 per 1,000 posts** or **$0.58 per 1,000 comments**.
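
For budgeting, the event table translates into a simple estimate. A sketch with the FREE-tier rates copied from the table above (an approximation, not billing logic):

```python
ACTOR_START = 0.003     # $ per run
PER_POST = 0.00115      # $ per post, FREE tier
PER_COMMENT = 0.000575  # $ per comment, FREE tier

def estimate_cost(posts, comments=0):
    """Approximate run cost in dollars at the FREE tier."""
    return ACTOR_START + posts * PER_POST + comments * PER_COMMENT

cost_100_posts = estimate_cost(100)              # ~0.118, matching the ~$0.12 row
cost_post_with_comments = estimate_cost(1, 200)  # ~0.119
```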

#### AI format pricing note

When `outputFormat` is `jsonl-finetune` or `rag-markdown`, you are charged only for posts (not comments). Comments are bundled into the post record at no extra charge — cost-effective for large-scale training-set collection.

#### Real-world cost examples (FREE tier)

| Input | Results | Approx. duration | Approx. cost |
|-------|---------|------------------|--------------|
| 1 subreddit, 100 posts | 100 posts | ~30s | ~$0.12 |
| 5 subreddits, 50 posts each | 250 posts | ~75s | ~$0.30 |
| 1 post + 200 comments | 201 items | ~10s | ~$0.12 |
| Search "AI", 100 results | 100 posts | ~30s | ~$0.12 |
| 1 subreddit, 5 posts + 3 comments each | 20 items | ~15s | ~$0.02 |

> *Times reflect typical observed runs on RESIDENTIAL proxy in 2026. Actual times depend on Reddit response latency and proxy session warm-up. The deployment runbook captures observed numbers per release; check there for current measurements.*

### How to scrape Reddit

1. Open the actor input page.
2. Add Reddit URLs to the **Reddit URLs** field — any of these formats work:
   - `https://www.reddit.com/r/technology/`
   - `https://www.reddit.com/r/AskReddit/comments/abc123/post-title/`
   - `https://www.reddit.com/user/spez/`
   - `r/technology` or just `technology`
3. Or enter a **Search query** to search across Reddit.
4. Set **Max posts per source** to control how many posts to scrape.
5. Enable **Include comments** if you also want comment data.
6. Click **Start** and wait for results.
7. Download your data as JSON, CSV, or Excel from the Dataset tab.

#### Example input

```json
{
    "urls": ["https://www.reddit.com/r/technology/"],
    "maxPostsPerSource": 100,
    "sort": "hot",
    "includeComments": false
}
```

#### Scraping a specific post with comments

```json
{
    "urls": ["https://www.reddit.com/r/technology/comments/abc123/some-post-title/"],
    "includeComments": true,
    "maxCommentsPerPost": 50,
    "commentDepth": 3
}
```

#### Searching Reddit with keyword filtering

```json
{
    "searchQuery": "best project management tools",
    "searchSubreddit": "productivity",
    "sort": "relevance",
    "timeFilter": "month",
    "maxPostsPerSource": 50,
    "filterKeywords": ["Notion", "Asana", "Monday"]
}
```

### Input parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `urls` | string\[] | — | Reddit URLs to scrape (subreddits, posts, users, search URLs). Shortcut forms `r/x` and bare `x` are accepted as subreddit names. |
| `searchQuery` | string | — | Search Reddit for this query. Either `urls` or `searchQuery` must be provided. |
| `searchSubreddit` | string | — | Limit search to a specific subreddit. |
| `sort` | enum | `hot` | Sort order: `hot`, `new`, `top`, `rising`, `relevance`. |
| `timeFilter` | enum | `week` | Time filter for `top` and `relevance`: `hour`, `day`, `week`, `month`, `year`, `all`. |
| `maxPostsPerSource` | integer | `100` | Max posts per subreddit/search/user. `0` = unlimited (capped by Reddit's ~1000-item-per-listing ceiling). |
| `includeComments` | boolean | `false` | Also scrape comments for each post. |
| `maxCommentsPerPost` | integer | `100` | Max comments per post. **Hard-capped at 1000** to bound proxy/compute cost. |
| `commentDepth` | integer | `3` | Max reply nesting depth (1-10). |
| `filterKeywords` | string\[] | `[]` | Only keep posts containing at least one keyword (case-insensitive title/body match). Filtered posts are not charged. |
| `outputFormat` | enum | `default` | `default` (standard JSON), `jsonl-finetune` (OpenAI chat-format SFT records), `rag-markdown` (vector-DB-ready markdown documents). |
| `maxRequestRetries` | integer | `5` | Retry attempts for failed requests (1-10). |
| `proxyConfiguration` | object | `RESIDENTIAL` | Apify proxy. **RESIDENTIAL is the default** — Reddit aggressively blocks datacenter IP ranges in 2026. Override only if you understand the trade-off. |

### Output examples

#### Default format (post)

```json
{
    "type": "post",
    "id": "1qw5kwf",
    "title": "Reddit AMA highlights from this week",
    "author": "Sandstorm400",
    "subreddit": "technology",
    "score": 18009,
    "upvoteRatio": 0.92,
    "numComments": 1363,
    "createdAt": "2026-02-05T00:04:58.000Z",
    "url": "https://www.reddit.com/r/technology/comments/1qw5kwf/...",
    "permalink": "/r/technology/comments/1qw5kwf/...",
    "selfText": "",
    "link": "https://example.com/article",
    "domain": "example.com",
    "isVideo": false,
    "isSelf": false,
    "isNSFW": false,
    "isSpoiler": false,
    "isStickied": false,
    "thumbnail": "https://external-preview.redd.it/...",
    "linkFlairText": "Society",
    "totalAwards": 0,
    "subredditSubscribers": 17101887,
    "imageUrls": [],
    "scrapedAt": "2026-04-18T12:33:50.000Z"
}
```

#### Default format (comment)

```json
{
    "type": "comment",
    "id": "m3abc12",
    "postId": "1qw5kwf",
    "postTitle": "Reddit AMA highlights from this week",
    "author": "commenter123",
    "body": "Phone addiction in teens is a serious issue.",
    "score": 542,
    "createdAt": "2026-02-05T01:15:00.000Z",
    "permalink": "https://www.reddit.com/r/technology/comments/1qw5kwf/.../m3abc12",
    "depth": 0,
    "isSubmitter": false,
    "parentId": "t3_1qw5kwf",
    "replies": 12,
    "scrapedAt": "2026-04-18T12:33:52.000Z"
}
```

#### `jsonl-finetune` format (one record per post, comments bundled in `assistant`)

```json
{
    "type": "finetune",
    "messages": [
        { "role": "system", "content": "You are analyzing a Reddit discussion. Summarize the key viewpoints from the community." },
        { "role": "user", "content": "r/MachineLearning — What's the best approach for few-shot learning in 2026?\n\nI'm building a text classifier with only 50 labeled examples per class..." },
        { "role": "assistant", "content": "1. [user_a, score: 482] Try SetFit — fine-tunes sentence transformers on a handful of examples.\n2. [user_b, score: 217] LLM + chain-of-thought prompting with 5-10 examples..." }
    ],
    "metadata": {
        "postId": "abc123",
        "subreddit": "MachineLearning",
        "score": 1240,
        "upvoteRatio": 0.97,
        "numComments": 87,
        "createdAt": "2026-03-15T10:22:00.000Z",
        "url": "https://www.reddit.com/r/MachineLearning/...",
        "domain": "reddit.com",
        "isNSFW": false,
        "linkFlairText": null
    }
}
```

#### `rag-markdown` format (one self-contained markdown doc per post)

```json
{
    "type": "rag-chunk",
    "chunkId": "reddit-MachineLearning-abc123",
    "markdown": "## What's the best approach for few-shot learning in 2026?\n\n**Subreddit:** r/MachineLearning  \n**Author:** u/researcher42  \n**Score:** 1240 (97% upvoted)  \n\n### Top Comments\n\n#### u/user_a (score: 482)\n\nTry SetFit...\n",
    "metadata": {
        "source": "reddit",
        "postId": "abc123",
        "subreddit": "MachineLearning",
        "title": "What's the best approach for few-shot learning in 2026?",
        "author": "researcher42",
        "score": 1240,
        "upvoteRatio": 0.97,
        "numComments": 87,
        "createdAt": "2026-03-15T10:22:00.000Z",
        "url": "https://www.reddit.com/r/MachineLearning/comments/abc123/...",
        "domain": "reddit.com",
        "isNSFW": false,
        "linkFlairText": null
    }
}
```

> **Note about Console preview:** the dataset preview in Apify Console renders the `default` post schema. When `outputFormat` is `jsonl-finetune` or `rag-markdown`, the records still land in the `posts` dataset but with their respective shapes — most preview columns will appear empty. Use the JSON or JSONL exports to inspect the actual records.

### How to monitor Reddit for brand mentions

1. Search for brand-name mentions:
   ```json
   { "searchQuery": "YourBrandName", "sort": "new", "timeFilter": "day", "maxPostsPerSource": 100 }
   ```
2. Monitor specific product subreddits with comments enabled:
   ```json
   {
       "urls": ["https://www.reddit.com/r/YourProductSubreddit/new/", "https://www.reddit.com/r/CompetitorSubreddit/new/"],
       "maxPostsPerSource": 50,
       "includeComments": true,
       "maxCommentsPerPost": 20
   }
   ```
3. Schedule the actor to run daily via Apify's built-in scheduler.
4. Pipe the output to Slack/email/your CRM via Apify integrations.
5. Filter for sentiment signals with `filterKeywords`.
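
After download, a client-side pass can sort mentions into rough sentiment buckets. A sketch with hypothetical signal-word lists (matching is case-insensitive, mirroring `filterKeywords` behavior):

```python
NEGATIVE = {"broken", "refund", "cancel"}        # hypothetical signal words
POSITIVE = {"love", "recommend", "switched to"}  # hypothetical signal words

def classify(post):
    """Crude keyword-based sentiment label for a scraped post record."""
    text = (post.get("title", "") + " " + post.get("selfText", "")).lower()
    if any(w in text for w in NEGATIVE):
        return "negative"
    if any(w in text for w in POSITIVE):
        return "positive"
    return "neutral"

post = {"title": "Asking for a refund", "selfText": "Support was unhelpful"}
label = classify(post)
```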

### How to use Reddit data for AI and LLM workflows

#### `jsonl-finetune` for supervised fine-tuning

Each record is a single training example in OpenAI chat format (system / user / assistant). The assistant message is the top K comments by score (K = `min(maxCommentsPerPost, available)`), formatted as a ranked list with score + author.

Use cases: fine-tuning domain-specific chatbots, building Q\&A models on niche topics, creating instruction-tuning datasets from community knowledge.
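
To turn the dataset into an uploadable training file, keep only each record's `messages` key and write one JSON object per line. A sketch assuming the record shape shown above (the filename is arbitrary):

```python
import json

def to_finetune_jsonl(records, path):
    """Strip actor metadata, keeping OpenAI chat-format training lines."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps({"messages": rec["messages"]}) + "\n")

records = [{
    "messages": [{"role": "user", "content": "hi"}],
    "metadata": {"postId": "abc123"},
}]
to_finetune_jsonl(records, "train.jsonl")
```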

#### `rag-markdown` for vector-DB ingestion

Each record is a self-contained markdown document with a metadata header plus the top K comments as `#### u/<author> (score: N)` sections. The `chunkId` follows the stable format `reddit-{subreddit}-{postId}` for deduplication and upsert into Pinecone, Weaviate, Chroma, or Qdrant.

Use cases: building domain-specific RAG systems, enriching knowledge bases with community knowledge, powering chatbots that answer from Reddit discussions.

#### Quality filtering tips

When building training data or RAG corpora from Reddit, filter for quality:

- **Score threshold** — keep only `metadata.score >= 50`
- **Upvote ratio** — `metadata.upvoteRatio >= 0.85` excludes controversial posts
- **NSFW filter** — `metadata.isNSFW === false` for general-purpose datasets
- **Time filter** — `sort: "top"` + `timeFilter: "year"` gets the highest-quality content from the past year
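
The thresholds above can be applied client-side after download. A sketch over the AI-format records (the cutoffs are the suggestions from the list; tune to taste):

```python
def keep(record, min_score=50, min_ratio=0.85):
    """Quality gate for AI-format records, per the tips above."""
    m = record["metadata"]
    return (m["score"] >= min_score
            and m["upvoteRatio"] >= min_ratio
            and not m["isNSFW"])

records = [
    {"metadata": {"score": 1240, "upvoteRatio": 0.97, "isNSFW": False}},
    {"metadata": {"score": 12, "upvoteRatio": 0.55, "isNSFW": False}},
]
filtered = [r for r in records if keep(r)]
```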

### Reddit data export: CSV, Excel, JSON

Apify datasets support JSON, CSV, Excel, XML, and HTML export formats. Use the dataset export buttons in Console or the dataset URL programmatically:

```sh
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv" \
  -H "Authorization: Bearer YOUR_API_TOKEN" > reddit_posts.csv
```

### How to scrape Reddit without getting blocked

The actor handles Reddit's rate limits and anti-bot protections automatically:

- **Reactive 429 backoff** — when Reddit returns 429, the actor honors the `Retry-After` header (or falls back to exponential backoff) before retrying. Persistent blocks trigger session retirement (= new proxy IP).
- **Real-Chrome User-Agent** — Reddit filters generic UAs; the actor sends a real-Chrome-shaped UA + the actor's identifier suffix.
- **RESIDENTIAL proxy default** — Reddit blocks datacenter IPs aggressively in 2026; residential is the practical baseline.
- **Pure HTTP** — no browser fingerprinting to mismatch.

For very high-volume scraping (tens of thousands of posts per hour), split the workload across multiple smaller runs.
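
Splitting the workload is just partitioning the `urls` input across separate runs. A sketch (the batch size and subreddit names are arbitrary):

```python
def batches(urls, size=10):
    """Partition a URL list into chunks, one chunk per actor run."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

subreddits = [f"https://www.reddit.com/r/sub{i}/" for i in range(25)]
run_inputs = [{"urls": b, "maxPostsPerSource": 500} for b in batches(subreddits)]
```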

### How to use Reddit Scraper with the API

#### Node.js

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('prodiger/reddit-scraper').call({
    urls: ['https://www.reddit.com/r/technology/'],
    maxPostsPerSource: 100,
    sort: 'hot',
    includeComments: false,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

run = client.actor('prodiger/reddit-scraper').call(run_input={
    'urls': ['https://www.reddit.com/r/technology/'],
    'maxPostsPerSource': 100,
    'sort': 'hot',
    'includeComments': False,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```

#### cURL

```sh
curl "https://api.apify.com/v2/acts/prodiger~reddit-scraper/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "urls": ["https://www.reddit.com/r/technology/"],
    "maxPostsPerSource": 100,
    "sort": "hot"
  }'
```

### Use with AI agents via MCP

Reddit Scraper is available as a tool for AI assistants that support the Model Context Protocol (MCP).

#### Setup for Claude Code

```sh
claude mcp add --transport http apify "https://mcp.apify.com?tools=prodiger/reddit-scraper"
```

#### Setup for Claude Desktop, Cursor, or VS Code

```json
{
    "mcpServers": {
        "apify": { "url": "https://mcp.apify.com?tools=prodiger/reddit-scraper" }
    }
}
```

Your AI assistant will use OAuth to authenticate with your Apify account on first use. Then ask:

- "Get the top 100 posts from r/technology this month"
- "Scrape comments from this Reddit thread"
- "Search Reddit for discussions about 'AI coding'"

### Integrations

- **Google Sheets** — auto-export Reddit posts and comments to a spreadsheet
- **Slack / Discord** — get notifications when scraping finishes or when posts match keywords
- **Zapier / Make** — trigger workflows on new Reddit data
- **Webhooks** — send results to your own API endpoint
- **Scheduled runs** — daily/weekly subreddit monitoring via Apify scheduler
- **Data warehouses** — pipe data to BigQuery, Snowflake, or PostgreSQL

### Is it legal to scrape Reddit?

Scraping publicly available data from Reddit is generally considered legal:

- **Public data only.** This actor only accesses publicly available posts and comments. It does not log in, bypass authentication, or access private content.
- **Legal precedent.** The US Ninth Circuit ruling in *hiQ Labs v. LinkedIn* (2022) established that scraping publicly available data does not violate the Computer Fraud and Abuse Act (CFAA).
- **No personal-data extraction.** Usernames are public pseudonyms on Reddit; the actor does not attempt to deanonymize users or collect private information.
- **Terms of Service.** Reddit's ToS restricts automated access, but ToS violations are a contractual matter, not a criminal one.
- **GDPR.** If you scrape data that includes EU users, ensure your use case complies with GDPR. Aggregated, anonymized analysis is generally safe; storing individual user data for profiling may require additional compliance steps.

This information is for educational purposes and does not constitute legal advice.

### FAQ

**Can I scrape any subreddit?** Yes, as long as it's public. Private subreddits return 403 and are skipped.

**Does it scrape NSFW content?** Yes, NSFW posts are included by default with the `isNSFW: true` flag. Filter them client-side if you want them excluded.

**How many posts can I scrape?** Set `maxPostsPerSource: 0` for unlimited, capped by Reddit's ~1000-post pagination ceiling per listing. For more, use search with multiple time filters (e.g., `timeFilter: month` then `year`) to access older content.

**What happens if Reddit rate-limits me?** The actor reads `Retry-After` headers when present, falls back to exponential backoff otherwise, and retires sessions on persistent blocks. No configuration needed.

**Can I export to CSV or Excel?** Yes. Apify datasets support JSON, CSV, Excel, XML, and HTML formats.

**The scraper returns fewer posts than I expected.** Reddit's pagination caps at ~1000 posts per listing. For deeper history, use search with different time filters.

**I'm getting 403 errors for a subreddit.** The subreddit is private, quarantined, or banned. Check in an incognito browser — if you can't see it there, the scraper can't either.

**Can I use `filterKeywords` to narrow search results?** Yes. Set `filterKeywords` to an array of terms; only posts whose title or body contains at least one keyword are kept. Filtered posts are not charged.

**How do I scrape a user's post history?** Paste the user's profile URL (e.g., `https://www.reddit.com/user/spez/`) into the URLs field.

**Does it handle deleted or removed posts?** Deleted authors appear as `[deleted]`; removed bodies appear as `[removed]`. Records are emitted, not dropped.

**What proxy should I use?** RESIDENTIAL is the default and recommended for any sustained run. DATACENTER is offered as a low-cost option but Reddit blocks datacenter IPs aggressively in 2026 — expect 429s and dropped sources for runs above ~100 posts on DATACENTER.

**Do AI formats (`jsonl-finetune`, `rag-markdown`) require comments?** They work without comments, but the assistant message / comments section will be a placeholder. For meaningful AI training data, set `includeComments: true`.

# Actor input Schema

## `urls` (type: `array`):

Reddit URLs to scrape. Accepts subreddit listings (https://www.reddit.com/r/technology/), individual posts (https://www.reddit.com/r/x/comments/abc123/...), user profiles (https://www.reddit.com/user/spez/), and search URLs (https://www.reddit.com/search/?q=...). Shortcut forms 'r/technology' and bare 'technology' are accepted as subreddit names.

## `searchQuery` (type: `string`):

Optional. Search Reddit for this query. Combine with 'searchSubreddit' to limit search to a single subreddit. Either 'urls' or 'searchQuery' must be provided.

## `searchSubreddit` (type: `string`):

When 'searchQuery' is set, restrict search to this subreddit (without the r/ prefix).

## `sort` (type: `string`):

Listing sort order. 'relevance' applies to search only.

## `timeFilter` (type: `string`):

Time window. Applies to sort=top and sort=relevance.

## `maxPostsPerSource` (type: `integer`):

Maximum number of posts to scrape per source (subreddit, user, search). Set to 0 for unlimited (capped by Reddit's ~1000-item-per-listing ceiling).

## `includeComments` (type: `boolean`):

Also scrape comments for each post. Increases run time and cost. Required for 'jsonl-finetune' and 'rag-markdown' output formats to have content to bundle.

## `maxCommentsPerPost` (type: `integer`):

Maximum comments to scrape per post. Hard-capped at 1000 to bound proxy/compute cost.

## `commentDepth` (type: `integer`):

Maximum reply nesting depth (1 = top-level comments only).

## `filterKeywords` (type: `array`):

Only keep posts whose title or body contains at least one of these terms (case-insensitive). Leave empty to keep all posts. Filtered posts are not charged.

## `outputFormat` (type: `string`):

default = standard JSON. jsonl-finetune = OpenAI chat-format records for LLM SFT. rag-markdown = self-contained markdown documents for vector-DB ingestion. AI formats bundle comments into the post record at no extra charge.

## `maxRequestRetries` (type: `integer`):

Number of retry attempts for failed requests (429, 5xx, network errors).

## `proxyConfiguration` (type: `object`):

Apify proxy. RESIDENTIAL is the default — Reddit aggressively blocks datacenter IP ranges in 2026, so DATACENTER will likely fail on sustained runs. Override only if you understand the trade-off.

## Actor input object example

```json
{
  "urls": [
    {
      "url": "https://www.reddit.com/r/technology/"
    }
  ],
  "sort": "hot",
  "timeFilter": "week",
  "maxPostsPerSource": 100,
  "includeComments": false,
  "maxCommentsPerPost": 100,
  "commentDepth": 3,
  "outputFormat": "default",
  "maxRequestRetries": 5,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}
```

# Actor output Schema

## `posts` (type: `string`):

Scraped posts (or AI-format records when outputFormat=jsonl-finetune | rag-markdown).

## `comments` (type: `string`):

Scraped comments (only populated when outputFormat=default and includeComments=true).

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        {
            "url": "https://www.reddit.com/r/technology/"
        }
    ],
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ]
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("prodiger/reddit-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": [{ "url": "https://www.reddit.com/r/technology/" }],
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}

# Run the Actor and wait for it to finish
run = client.actor("prodiger/reddit-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    {
      "url": "https://www.reddit.com/r/technology/"
    }
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}' |
apify call prodiger/reddit-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=prodiger/reddit-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Reddit Scraper",
        "description": "Extract posts, comments, user profiles, and search results from Reddit. Pure HTTP, no API key required.",
        "version": "0.2",
        "x-build-id": "h6NMQn4zR7PGPe4ew"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/prodiger~reddit-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-prodiger-reddit-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/prodiger~reddit-scraper/runs": {
            "post": {
                "operationId": "runs-sync-prodiger-reddit-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/prodiger~reddit-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-prodiger-reddit-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT record from the run's default key-value store in the response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "urls": {
                        "title": "Reddit URLs",
                        "type": "array",
                        "description": "Reddit URLs to scrape. Accepts subreddit listings (https://www.reddit.com/r/technology/), individual posts (https://www.reddit.com/r/x/comments/abc123/...), user profiles (https://www.reddit.com/user/spez/), and search URLs (https://www.reddit.com/search/?q=...). Shortcut forms 'r/technology' and bare 'technology' are accepted as subreddit names.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "searchQuery": {
                        "title": "Search query",
                        "type": "string",
                        "description": "Optional. Search Reddit for this query. Combine with 'searchSubreddit' to limit search to a single subreddit. Either 'urls' or 'searchQuery' must be provided."
                    },
                    "searchSubreddit": {
                        "title": "Search subreddit (optional)",
                        "type": "string",
                        "description": "When 'searchQuery' is set, restrict search to this subreddit (without the r/ prefix)."
                    },
                    "sort": {
                        "title": "Sort order",
                        "enum": [
                            "hot",
                            "new",
                            "top",
                            "rising",
                            "relevance"
                        ],
                        "type": "string",
                        "description": "Listing sort order. 'relevance' applies to search only.",
                        "default": "hot"
                    },
                    "timeFilter": {
                        "title": "Time filter",
                        "enum": [
                            "hour",
                            "day",
                            "week",
                            "month",
                            "year",
                            "all"
                        ],
                        "type": "string",
                        "description": "Time window. Applies to sort=top and sort=relevance.",
                        "default": "week"
                    },
                    "maxPostsPerSource": {
                        "title": "Max posts per source",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Maximum number of posts to scrape per source (subreddit, user, search). Set to 0 for unlimited (capped by Reddit's ~1000-item-per-listing ceiling).",
                        "default": 100
                    },
                    "includeComments": {
                        "title": "Include comments",
                        "type": "boolean",
                        "description": "Also scrape comments for each post. Increases run time and cost. Must be enabled for the 'jsonl-finetune' and 'rag-markdown' output formats, which bundle comment content into each post record.",
                        "default": false
                    },
                    "maxCommentsPerPost": {
                        "title": "Max comments per post",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum comments to scrape per post. Hard-capped at 1000 to bound proxy/compute cost.",
                        "default": 100
                    },
                    "commentDepth": {
                        "title": "Comment depth",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum reply nesting depth (1 = top-level comments only).",
                        "default": 3
                    },
                    "filterKeywords": {
                        "title": "Filter keywords (optional)",
                        "type": "array",
                        "description": "Only keep posts whose title or body contains at least one of these terms (case-insensitive). Leave empty to keep all posts. Filtered posts are not charged.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "outputFormat": {
                        "title": "Output format",
                        "enum": [
                            "default",
                            "jsonl-finetune",
                            "rag-markdown"
                        ],
                        "type": "string",
                        "description": "default = standard JSON. jsonl-finetune = OpenAI chat-format records for LLM SFT. rag-markdown = self-contained markdown documents for vector-DB ingestion. AI formats bundle comments into the post record at no extra charge.",
                        "default": "default"
                    },
                    "maxRequestRetries": {
                        "title": "Max request retries",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Number of retry attempts for failed requests (429, 5xx, network errors).",
                        "default": 5
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Apify proxy. RESIDENTIAL is the default — Reddit aggressively blocks datacenter IP ranges in 2026, so DATACENTER will likely fail on sustained runs. Override only if you understand the trade-off.",
                        "default": {
                            "useApifyProxy": true,
                            "apifyProxyGroups": [
                                "RESIDENTIAL"
                            ]
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
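
The `inputSchema` above maps directly to the JSON body you POST to the endpoints in `paths`. A minimal sketch in Python that builds such a payload and the `run-sync-get-dataset-items` URL without sending anything (the token value is a placeholder, and the input values are illustrative, not defaults):

```python
import json
from urllib.parse import urlencode

# Sample input conforming to the inputSchema above.
actor_input = {
    "urls": [{"url": "https://www.reddit.com/r/technology/"}],
    "sort": "top",
    "timeFilter": "week",
    "maxPostsPerSource": 50,
    "includeComments": True,
    "maxCommentsPerPost": 100,
    "outputFormat": "default",
}

# Synchronous endpoint: runs the Actor and returns dataset items directly.
base = "https://api.apify.com/v2"
path = "/acts/prodiger~reddit-scraper/run-sync-get-dataset-items"
query = urlencode({"token": "<YOUR_APIFY_TOKEN>"})
url = f"{base}{path}?{query}"

body = json.dumps(actor_input)
print(url)
```

You could pass `url` and `body` to any HTTP client (`requests.post(url, data=body, headers={"Content-Type": "application/json"})`); for production use, the official `apify-client` library from the integration section above wraps these endpoints for you.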
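
A few of the schema's constraints are worth checking client-side before paying for a run. A small validator sketch, with field names and thresholds taken from the `minimum`/`maximum`/`enum` values in the schema above (the helper itself is not part of the Actor):

```python
VALID_SORTS = {"hot", "new", "top", "rising", "relevance"}
VALID_TIME_FILTERS = {"hour", "day", "week", "month", "year", "all"}

def validate_input(inp: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    # The schema requires at least one source of posts.
    if not inp.get("urls") and not inp.get("searchQuery"):
        errors.append("either 'urls' or 'searchQuery' must be provided")
    if inp.get("sort", "hot") not in VALID_SORTS:
        errors.append("invalid sort")
    if inp.get("timeFilter", "week") not in VALID_TIME_FILTERS:
        errors.append("invalid timeFilter")
    # 0 means unlimited; 100000 is the schema's maximum.
    if not (0 <= inp.get("maxPostsPerSource", 100) <= 100000):
        errors.append("maxPostsPerSource out of range")
    # Hard-capped at 1000 per the schema.
    if not (1 <= inp.get("maxCommentsPerPost", 100) <= 1000):
        errors.append("maxCommentsPerPost out of range")
    return errors

print(validate_input({"searchQuery": "rust async"}))  # → []
print(validate_input({"maxPostsPerSource": 200000}))  # two problems reported
```

Running this before submitting saves a failed run: the Actor would reject an out-of-range value anyway, but catching it locally is free.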
