# Reddit Posts Scraper API - Subreddit, Search, User & Thread JSON
Fast first run — the default input returns a small 10-post sample from public subreddit JSON. No OAuth required for public listings, search, users, and direct post IDs. Reddit may still rate-limit or block some runs. Pay-per-use after — $0.003/post, with a single clean schema covering subreddits, search, user streams, and full comment trees in one input.
Scrape public Reddit posts and comments without setting up OAuth apps. This actor wraps Reddit's public .json endpoints behind a single Apify-platform input with proxy rotation, incremental dataset writes, and a clean canonical output schema.
Perfect for AI agents, RAG pipelines, market research, competitive intelligence, trend tracking, and content monitoring — returns clean Markdown that plugs straight into LangChain document loaders or Claude MCP tools.
## How it works
This actor supports four input sources, used in priority order (first non-empty wins):
- `postIds` — direct fetch of specific posts by ID (e.g. `1k0abc`). Returns full content + optional comment tree.
- `users` — each user's submitted-posts stream.
- `search` — Reddit-wide keyword search. Supports Reddit's native operators: `author:`, `subreddit:`, `site:`, etc.
- `subreddits` — one or more subreddit listings with `sort` (hot/new/top/rising/controversial) and `timeframe` (hour/day/week/month/year/all).
Posts are pushed to the dataset incrementally — even if the run is aborted, everything fetched so far is already stored.
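To make the precedence concrete, here is a tiny illustrative helper (not part of the actor) that mirrors the documented order:

```python
def resolve_source(run_input: dict) -> str:
    """Mirror the documented priority order: first non-empty source wins.
    Illustrative only — the field names come from the input table below."""
    for mode in ("postIds", "users", "search", "subreddits"):
        if run_input.get(mode):  # a non-empty string or list counts as set
            return mode
    return "subreddits"  # the default prefill applies when nothing is set

assert resolve_source({"search": "notebooklm", "subreddits": ["SaaS"]}) == "search"
```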
## Why use this instead of alternatives?
| Feature | Reddit Posts Scraper | Reddit PRAW (self-hosted) | Reddit official API (paid) | Other Apify scrapers |
|---|---|---|---|---|
| Auth required | No | Yes (OAuth app) | Yes (app + tier limits) | Varies |
| IP rotation | Yes — Apify proxy | Manual | N/A | Varies |
| Output formats | full / minimal / rag-markdown | custom code | custom code | full only |
| Comment trees | Optional, depth-limited | Manual pagination | Manual pagination | Varies |
| Search | Yes — same input | Yes | Yes | Often separate actor |
| User streams | Yes — same input | Yes | Yes | Often separate actor |
| Direct post-ID fetch | Yes — same input | Yes | Yes | Rarely |
| MCP compatible | Yes (PPE = agent-friendly) | No | No | Rarely |
| Pay model | PPE — only what you scrape | Free but self-host | Token quotas | Varies |
**Key advantage:** one input schema instead of four separate actors. Drop in subreddit names OR a search query OR user handles OR specific post IDs and the actor just works.
## Input examples
### Trending in two subreddits (default prefill)

```json
{
  "subreddits": ["MachineLearning", "webscraping"],
  "sort": "hot",
  "maxItems": 10
}
```
### Top of the week, with comments, for RAG

```json
{
  "subreddits": ["LocalLLaMA"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 50,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}
```
### Search across all of Reddit for a product mention

```json
{
  "search": "notebooklm vs chatgpt",
  "sort": "new",
  "maxItems": 100,
  "minScore": 2
}
```
### Pull specific posts by ID (faster than listing)

```json
{
  "postIds": ["1k0abc", "1jzxy9"],
  "includeComments": true
}
```
### A user's submitted posts

```json
{
  "users": ["spez"],
  "maxItems": 50
}
```
## Input parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `subreddits` | Array[String] | `["MachineLearning","webscraping"]` | Subreddit names without `r/` |
| `search` | String | — | Reddit-wide search query (overrides `subreddits`) |
| `users` | Array[String] | — | Usernames without `u/` |
| `postIds` | Array[String] | — | Direct post IDs (highest priority) |
| `sort` | Enum | `hot` | hot / new / top / rising / controversial |
| `timeframe` | Enum | `day` | For top/controversial: hour / day / week / month / year / all |
| `maxItems` | Integer | 10 | Max posts across all sources (1–10,000) |
| `includeComments` | Boolean | false | Fetch comment tree per post |
| `maxComments` | Integer | 20 | Per-post comment cap (1–500) |
| `maxCommentDepth` | Integer | 3 | Reply-tree depth (0–10) |
| `outputFormat` | Enum | `full` | full / minimal / rag |
| `skipNsfw` | Boolean | false | Drop posts with `over_18=true` |
| `minScore` | Integer | 0 | Drop posts below this upvote count |
| `proxyConfiguration` | Object | Apify proxy | Proxy rotation (recommended) |
## Choosing the right source mode

Use `subreddits` when you know exactly where the conversation happens. This is the best mode for routine monitoring because subreddit listings are predictable and easy to schedule.

Use `search` when you are discovering conversations across Reddit. Search is useful for brand names, competitor names, product categories, and buyer-intent phrases, but Reddit search can be less complete than focused subreddit monitoring.

Use `users` when you want posts from known accounts. This is useful for creator/influencer research, executive monitoring, and tracking official company accounts.

Use `postIds` when you already have URLs or IDs and need clean JSON, Markdown, or comments for specific threads.

Priority order matters: if `postIds` is set, the actor uses post IDs first. If `users` is set, it uses user streams before search/subreddit listings. Keep one source mode per run for the cleanest reporting.
## Recommended workflows

### Brand and competitor monitoring
- Start with `search` for your brand and 2-3 competitors.
- Export dataset rows to Google Sheets or BigQuery.
- Filter by `score`, `num_comments`, and subreddit.
- Add the best subreddits to a scheduled `subreddits` run.
- Use `includeComments=true` only for posts that need deeper analysis.
Example:

```json
{
  "search": "\"example product\" OR \"competitor product\"",
  "sort": "new",
  "maxItems": 100,
  "minScore": 1,
  "outputFormat": "full"
}
```
### Buyer-intent research

Use Reddit search operators and RAG output to collect real purchase language:

```json
{
  "search": "subreddit:SaaS \"looking for\" \"CRM\"",
  "sort": "new",
  "maxItems": 100,
  "outputFormat": "rag"
}
```
Then send the `markdown` field into your LLM workflow — see the sketch after this list — to extract:
- pain points
- feature requests
- products mentioned
- objections
- exact customer language
- competitor alternatives
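As a starting point, here is a hedged sketch of that step. It assumes the OpenAI Python SDK purely for illustration — any LLM client works — and the model name and prompt are placeholders, not part of the actor:

```python
# Illustrative only: the OpenAI SDK, model name, and prompt are assumptions.
from apify_client import ApifyClient
from openai import OpenAI

apify = ApifyClient("YOUR_APIFY_TOKEN")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

run = apify.actor("tugelbay/reddit-posts-scraper").call(run_input={
    "search": 'subreddit:SaaS "looking for" "CRM"',
    "sort": "new",
    "maxItems": 100,
    "outputFormat": "rag",
})

for item in apify.dataset(run["defaultDatasetId"]).iterate_items():
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any extraction-capable model works
        messages=[
            {"role": "system", "content": "Extract pain points, feature requests, "
             "products mentioned, objections, and exact customer language as JSON."},
            {"role": "user", "content": item["markdown"]},
        ],
    )
    print(resp.choices[0].message.content)
```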
### Weekly subreddit digest

```json
{
  "subreddits": ["LocalLLaMA", "MachineLearning", "artificial"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 75,
  "outputFormat": "minimal"
}
```
This is the simplest scheduled monitoring pattern. It avoids comments by default and keeps cost predictable.
### RAG dataset for an agent

```json
{
  "subreddits": ["webscraping", "dataengineering"],
  "sort": "top",
  "timeframe": "month",
  "maxItems": 100,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}
```
Use this when you want thread context, not just post metadata.
## Output format

Each post is one dataset item. The field set depends on `outputFormat`:

### full (default)

```json
{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "author": "ml_practitioner",
  "score": 412,
  "upvote_ratio": 0.97,
  "num_comments": 89,
  "created_utc": "2026-04-15T12:34:56+00:00",
  "url": "https://arxiv.org/abs/2504.01234",
  "permalink": "https://www.reddit.com/r/MachineLearning/comments/1k0abc/...",
  "selftext": "",
  "selftext_markdown": "",
  "link_flair_text": "Discussion",
  "over_18": false,
  "stickied": false,
  "is_self": false,
  "is_video": false,
  "is_gallery": false,
  "thumbnail": "https://b.thumbs.redditmedia.com/...",
  "media_url": null,
  "gallery_urls": []
}
```
### minimal

```json
{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "author": "ml_practitioner",
  "score": 412,
  "num_comments": 89,
  "created_utc": "2026-04-15T12:34:56+00:00",
  "url": "https://arxiv.org/abs/2504.01234",
  "permalink": "https://www.reddit.com/r/...",
  "selftext_markdown": ""
}
```
### rag

A single `markdown` field per post, ready for vector-DB ingestion:

```json
{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "permalink": "https://www.reddit.com/r/MachineLearning/comments/1k0abc/...",
  "created_utc": "2026-04-15T12:34:56+00:00",
  "score": 412,
  "markdown": "# [D] Why RAG is harder than it looks\n\n**r/MachineLearning** · u/ml_practitioner · score 412 · 2026-04-15T12:34:56+00:00\n[link](https://www.reddit.com/r/...)\n\n..."
}
```
## Output field reference

| Field | Type | Description |
|---|---|---|
| `id` | string | Reddit post ID without the `t3_` prefix. |
| `subreddit` | string | Subreddit name. |
| `title` | string | Post title. |
| `author` | string/null | Reddit username when public. |
| `score` | integer | Reddit score at fetch time. |
| `upvote_ratio` | number/null | Upvote ratio when available. |
| `num_comments` | integer | Comment count reported by Reddit. |
| `created_utc` | string | UTC timestamp converted to ISO format. |
| `url` | string/null | External URL for link posts, or null for self posts. |
| `permalink` | string | Canonical Reddit permalink. |
| `selftext` | string/null | Plain self-post body where present. |
| `selftext_markdown` | string/null | Markdown body where Reddit exposes it. |
| `link_flair_text` | string/null | Link flair label. |
| `author_flair_text` | string/null | Author flair label. |
| `over_18` | boolean | NSFW marker from Reddit. |
| `spoiler` | boolean | Spoiler marker. |
| `stickied` | boolean | Whether the post is stickied. |
| `locked` | boolean | Whether comments are locked. |
| `is_self` | boolean | Whether it is a self/text post. |
| `is_video` | boolean | Whether Reddit marks it as video. |
| `is_gallery` | boolean | Whether gallery media is present. |
| `thumbnail` | string/null | Thumbnail URL when available. |
| `media_url` | string/null | Main media URL when available. |
| `gallery_urls` | array | Gallery image URLs. |
| `comments` | array | Present when `includeComments=true`; each item has author/body/score/depth. |
| `markdown` | string/null | Present when `outputFormat=rag`. |
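A quick way to sanity-check these fields from Python — the dataset ID and token are placeholders, and the field names come straight from the table above:

```python
from datetime import datetime
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Replace with run["defaultDatasetId"] from a finished run.
for item in client.dataset("YOUR_DATASET_ID").iterate_items():
    created = datetime.fromisoformat(item["created_utc"])       # ISO 8601 per the table
    author = item["author"] or "[deleted]"                      # author is nullable
    body = item["selftext_markdown"] or item["selftext"] or ""  # both are nullable
    print(f"{created:%Y-%m-%d} r/{item['subreddit']} u/{author} "
          f"[{item['score']}] {item['title']} ({len(body)} chars)")
```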
## Comment extraction
Comments are optional because every post with comments requires an additional request and more parsing. Leave comments off for feed scanning. Turn them on only when the comment discussion matters.
Recommended settings:
| Goal | `includeComments` | `maxComments` | `maxCommentDepth` |
|---|---|---|---|
| Feed monitoring | false | 20 | 3 |
| RAG summaries | true | 20 | 2 |
| Deep thread analysis | true | 100 | 3 |
| Large historical run | false | 20 | 3 |
Deep comment trees can be noisy and expensive. For most research workflows, top-level and near-top-level comments carry enough context.
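If you do enable comments, a small post-processing helper like this (illustrative; the field names are from the output reference above) keeps only the shallow, high-score part of each tree:

```python
def top_comments(item: dict, max_depth: int = 1, limit: int = 10) -> list[dict]:
    """Keep top-level and near-top-level comments, highest score first."""
    comments = item.get("comments") or []  # present when includeComments=true
    shallow = [c for c in comments if c["depth"] <= max_depth]
    return sorted(shallow, key=lambda c: c["score"], reverse=True)[:limit]
```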
## Cost estimation

Approximate costs for typical runs:
| Use case | Input | Approx. cost |
|---|---|---|
| 10 posts from 2 subs, no comments | default | ~$0.03 |
| 100 posts from 5 subs, no comments | maxItems=100 | ~$0.30 |
| 100 posts with top-20 comments each | includeComments=true | ~$0.40 |
| 1,000 posts from search, RAG output | search mode, maxItems=1000 | ~$3.00 |
Billed as PPE (pay-per-event): one `reddit-post` event per item written, plus a small actor-start overhead.
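For budgeting, the arithmetic is simple enough to sketch. The per-post rate comes from the pricing above; the start-overhead figure here is a placeholder, not a published number — check the actor's PPE tab for the real value:

```python
POST_RATE = 0.003        # $ per reddit-post event (from the pricing above)
START_OVERHEAD = 0.005   # $ per run — placeholder, not a published number

def estimate_cost(posts: int, runs: int = 1) -> float:
    """Back-of-envelope PPE estimate: events written plus run starts."""
    return posts * POST_RATE + runs * START_OVERHEAD

print(f"${estimate_cost(1000):.2f}")  # roughly $3 for a 1,000-post run
```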
## Scheduling and automation
The most useful Reddit workflows are scheduled:
- hourly: brand crisis or launch monitoring
- daily: product category monitoring
- weekly: trend and content research
- monthly: RAG corpus refresh
Recommended Task setup:

```json
{
  "subreddits": ["webscraping", "dataengineering"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 100,
  "outputFormat": "minimal",
  "skipNsfw": true
}
```
Attach a webhook to send finished datasets to Slack, Google Sheets, BigQuery, or your own API.
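A minimal receiver sketch, assuming FastAPI (an arbitrary choice) and Apify's default webhook payload, which includes the finished run under `resource`:

```python
from apify_client import ApifyClient
from fastapi import FastAPI, Request

app = FastAPI()
client = ApifyClient("YOUR_APIFY_TOKEN")

@app.post("/apify-webhook")
async def on_run_finished(request: Request):
    payload = await request.json()
    dataset_id = payload["resource"]["defaultDatasetId"]  # default payload shape
    items = client.dataset(dataset_id).list_items().items
    # Forward `items` to Slack, Google Sheets, BigQuery, or your own API here.
    return {"received": len(items)}
```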
## Integrations

### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("tugelbay/reddit-posts-scraper").call(run_input={
    "subreddits": ["LocalLLaMA"],
    "sort": "top",
    "timeframe": "week",
    "maxItems": 50,
    "outputFormat": "rag",
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["markdown"])
```
### JavaScript

```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_APIFY_TOKEN" });

const run = await client.actor("tugelbay/reddit-posts-scraper").call({
    search: "notebooklm vs chatgpt",
    sort: "new",
    maxItems: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
```
### LangChain document loader

`run` is the run object returned by the Python example above.

```python
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"],
        metadata={"subreddit": item["subreddit"], "url": item["permalink"]},
    ),
)
docs = loader.load()
```
### Claude MCP / Apify MCP Server
Works out of the box. Any Claude/GPT agent using the Apify MCP Server can call this actor as a tool and pipe the output straight into its context.
## Data quality checklist
Before using results for analysis:
- Filter out posts with very low `score` if you only want meaningful discussions.
- Use `skipNsfw=true` for business/brand monitoring.
- Prefer `sort=new` for alerting and `sort=top` for research.
- Keep one source mode per run so downstream reports are easier to interpret.
- Store the `permalink` field so analysts can inspect the original thread.
- For sentiment or topic extraction, use `outputFormat=rag` so the title, body, and comments stay together.
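The same checklist as a post-processing step, sketched with pandas (an arbitrary choice; the thresholds are illustrative):

```python
import pandas as pd
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
items = client.dataset("YOUR_DATASET_ID").list_items().items

df = pd.DataFrame(items)
clean = df[
    (df["score"] >= 5)       # low-score cutoff is illustrative; tune per subreddit
    & (~df["over_18"])       # belt and braces on top of skipNsfw=true
][["title", "score", "num_comments", "permalink"]]  # keep permalink for auditing
```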
## Use cases
- RAG knowledge base — scrape relevant subreddit threads into a vector DB
- Competitive intelligence — monitor mentions of your product or competitors
- Trend research — top posts of the week across 20 subreddits, one run
- Content gap analysis — aggregate questions people ask in niche subs
- Academic data collection — snapshots of subreddit discourse over time
- AI agent tooling — on-demand Reddit search as an MCP tool
- Brand safety monitoring — NSFW filter built in
- Influencer research — pull recent public submitted posts for known users
## FAQ
**Do I need a Reddit account or OAuth app?** No. This actor uses Reddit's public `.json` endpoints — the exact same data Reddit serves to logged-out visitors.

**Why do I need a proxy?** Reddit rate-limits by IP. At small volumes you may be fine without one, but larger runs should keep the default residential proxy enabled to reduce block/rate-limit risk.

**Can I get removed posts or deleted comments?** No. If Reddit has removed content from the public API, this actor sees the same `[deleted]` placeholder. Use Pushshift for historical archives.

**How fresh is the data?** Live — each run hits Reddit directly. The default sort is `hot`, which reflects the current front page at run time.

**What about NSFW content?** Set `skipNsfw: true` to filter posts marked `over_18`. The filter applies per post, not per individual comment.

**Can it handle quarantined or private subreddits?** Quarantined subs may return partial results. Private subs are not accessible without OAuth.

**Can I scrape comments only?** Use `postIds` with `includeComments=true` for the specific threads you care about. The actor still returns the post row as the parent item.

**Can I scrape Reddit profiles?** Yes — use `users`. It fetches submitted posts for each public username.

**Can I search inside one subreddit?** Yes. Either set `subreddits` and choose a listing sort, or use Reddit search syntax such as `subreddit:LocalLLaMA "fine tuning"`.

**Why do results differ from Reddit's UI?** Reddit personalizes and experiments with ranking. The actor reads public JSON endpoints at run time, so output can differ from a logged-in browser session.

**Does it bypass Reddit restrictions?** No. It reads public pages/data. Private, removed, deleted, or login-only content is out of scope.
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Empty dataset | Subreddit name typo or banned sub | Check the name on Reddit first |
| HTTP 403 in logs | Reddit temporarily blocked the proxy IP | Leave `proxyConfiguration` on — sessions auto-rotate |
| Missing comments | `includeComments: false` | Set it to `true` |
| Posts look truncated | `outputFormat: minimal` | Switch to `full` or `rag` |
| Search misses posts | Reddit search ranking/coverage varies | Monitor key subreddits directly when possible |
| Run takes too long | Comments enabled on many posts | Lower `maxItems`, `maxComments`, or comment depth |
| Duplicate topics | Same story cross-posted across subreddits | Deduplicate downstream by URL/title/permalink |
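For the cross-posting case in the last row, a minimal downstream dedup helper (the key choice — URL first, title as a fallback — is a judgment call):

```python
def dedupe(items: list[dict]) -> list[dict]:
    """Keep the highest-scoring copy of each story.

    Link posts share their external `url` across cross-posts; self posts
    have url=null per the schema, so fall back to a normalized title.
    """
    best: dict[str, dict] = {}
    for item in items:
        key = item["url"] or item["title"].strip().lower()
        if key not in best or item["score"] > best[key]["score"]:
            best[key] = item
    return list(best.values())
```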
## Limitations
- No OAuth-only data — private subs, user inbox, friends, subscribed feed
- No historical archives — Reddit JSON returns live data only; for posts older than a few weeks on active subs, pagination may stop early
- Comment depth capped — default 3, max 10 (Reddit itself caps around 10)
- Public-data only — no inbox, private communities, mod-only data, or logged-in recommendations
- Search is not exhaustive — Reddit search can miss posts; direct subreddit monitoring is more predictable
## Privacy and compliance
This actor is designed for public Reddit content. Do not use it to collect private messages, bypass access controls, or infer sensitive personal data. For business reporting, store only the fields you need and respect deletion/removal signals from Reddit.
## Changelog
- 0.1.10 (2026-04-26) — reduced first-run default to 10 posts, added quality contract, and expanded README with workflow, output, scheduling, and troubleshooting guidance.
- 0.1 (2026-04-19) — initial release: subreddits, search, users, postIds; optional comment trees; full / minimal / rag output.