Pricing

from $2.00 / 1,000 reddit posts

Reddit Public Data Scraper - Comments & RAG

Extract public Reddit posts, comments, search results, user streams, sentiment, quality scores, and RAG Markdown with no OAuth setup for public data workflows. Guide: https://konabayev.com/tools/reddit-posts-scraper/?utm_source=apify_info&utm_medium=referral&utm_campaign=reddit-posts-scraper

Pricing

from $2.00 / 1,000 reddit posts

Rating

0.0

(0)

Developer

Tugelbay Konabayev

Actor stats

Bookmarked

Total users

Monthly active users

3 days ago

Last modified

Direct answers for search and AI agents

Can I scrape Reddit comments without the official Reddit API? Yes. Provide postIds or run a subreddit/search crawl with includeComments=true; the actor returns the post plus a structured comments array.

Can I use this in a Reddit data API workflow? Yes. Run it through the Apify API and export JSON dataset rows for posts, comments, authors, scores, engagement/source fields, timestamps, media URLs, permalinks, and optional Markdown for RAG.

Can I monitor Reddit search results? Yes. Use the search field for Reddit-wide queries such as a brand, competitor, category, or buyer-intent phrase. Use sort=new for alerts and sort=top for research.

Can I run Reddit sentiment analysis? Yes. Enable includeComments and keep includeSentiment=true; each parsed comment receives a sentiment label suitable for filtering or dashboarding.

Production lead and research recipes

Use these inputs when you want paid, production-size runs rather than a small sample.

Scrape Reddit comments for sentiment analysis

{
  "search": "\"best CRM\" OR \"switching CRM\"",
  "sort": "new",
  "maxItems": 250,
  "includeComments": true,
  "maxComments": 30,
  "maxCommentDepth": 2,
  "includeSentiment": true,
  "outputFormat": "full"
}

Reddit data API for weekly market research

{
  "subreddits": ["SaaS", "startups", "marketing"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 500,
  "skipNsfw": true,
  "outputFormat": "minimal"
}

Reddit RAG corpus for AI agents

{
  "subreddits": ["LocalLLaMA", "MachineLearning", "artificial"],
  "sort": "top",
  "timeframe": "month",
  "maxItems": 300,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}

How it works

This actor supports four input sources, processed in priority order until maxItems is reached:

postIds — direct fetch of specific posts by ID (e.g. 1k0abc). Returns full content + optional comment tree first.
users — each user's submitted-posts stream fills the remaining result cap.
search — Reddit-wide keyword search fills the remaining result cap. Supports Reddit's native operators: author:, subreddit:, site:, etc.
subreddits — one or more subreddit listings with sort (hot/new/top/rising/controversial) and timeframe (hour/day/week/month/year/all) fill any remaining capacity.

Posts are pushed to the dataset incrementally — even if the run is aborted, everything fetched so far is already stored.

Why use this instead of alternatives?

Feature	This Apify Actor	Self-hosted script	Reddit official API (paid)	Other Apify scrapers
Auth required	No	Yes (OAuth app)	Yes (app + tier limits)	Varies
IP rotation	Yes — Apify proxy	Manual	N/A	Varies
Output formats	full / minimal / rag-markdown	custom code	custom code	full only
Comment trees	Optional, depth-limited	Manual pagination	Manual pagination	Varies
Search	Yes — same input	Yes	Yes	Often separate actor
User streams	Yes — same input	Yes	Yes	Often separate actor
Direct post-ID fetch	Yes — same input	Yes	Yes	Rarely
MCP compatible	Yes (PPE = agent-friendly)	No	No	Rarely
Pay model	PPE — only what you scrape	Free but self-host	Token quotas	Varies

Key advantage: one input schema instead of four separate actors. Drop in subreddit names OR a search query OR user handles OR specific post IDs and the actor just works.

Input examples

{
  "subreddits": ["MachineLearning", "webscraping"],
  "sort": "hot",
  "maxItems": 10
}

Top of the week, with comments, for RAG

{
  "subreddits": ["LocalLLaMA"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 50,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}

Search across all of Reddit for a product mention

{
  "search": "notebooklm vs chatgpt",
  "sort": "new",
  "maxItems": 100,
  "minScore": 2
}

Pull specific posts by ID (faster than listing)

{
  "postIds": ["1k0abc", "1jzxy9"],
  "includeComments": true
}

A user's submitted posts

{
  "users": ["spez"],
  "maxItems": 50
}

Input parameters

Parameter	Type	Default	Description
`subreddits`	Array[String]	`["MachineLearning","webscraping"]`	Subreddit names without `r/`
`search`	String	—	Reddit-wide search query (overrides subreddits)
`users`	Array[String]	—	Usernames without `u/`
`postIds`	Array[String]	—	Direct post IDs (highest priority)
`sort`	Enum	`hot`	hot / new / top / rising / controversial
`timeframe`	Enum	`day`	For `top`/`controversial`: hour / day / week / month / year / all
`maxItems`	Integer	`10`	Max posts across all sources (1–10,000)
`includeComments`	Boolean	`false`	Fetch comment tree per post
`maxComments`	Integer	`20`	Per-post comment cap (1–500)
`maxCommentDepth`	Integer	`3`	Reply-tree depth (0–10)
`outputFormat`	Enum	`full`	full / minimal / rag
`skipNsfw`	Boolean	`false`	Drop posts with `over_18=true`
`minScore`	Integer	`0`	Drop posts below this upvote count
`proxyConfiguration`	Object	Apify proxy	Proxy rotation (recommended)

Choosing the right source mode

Use subreddits when you know exactly where the conversation happens. This is the best mode for routine monitoring because subreddit listings are predictable and easy to schedule.

Use search when you are discovering conversations across Reddit. Search is useful for brand names, competitor names, product categories, and buyer-intent phrases, but Reddit search can be less complete than focused subreddit monitoring.

Use users when you want posts from known accounts. This is useful for creator/influencer research, executive monitoring, and tracking official company accounts.

Use postIds when you already have URLs or IDs and need clean JSON, Markdown, or comments for specific threads.

Priority order matters: if postIds is set, the actor uses post IDs first. If users is set, it uses user streams before search/subreddit listings. Later source modes can still fill the remaining maxItems capacity, but one source mode per run usually produces cleaner reporting.

Recommended workflows

Brand and competitor monitoring

Start with search for your brand and 2-3 competitors.
Export dataset rows to Google Sheets or BigQuery.
Filter by score, num_comments, and subreddit.
Add the best subreddits to a scheduled subreddits run.
Use includeComments=true only for posts that need deeper analysis.

Example:

{
  "search": "\"example product\" OR \"competitor product\"",
  "sort": "new",
  "maxItems": 100,
  "minScore": 1,
  "outputFormat": "full"
}

Buyer-intent research

Use Reddit search operators and RAG output to collect real purchase language:

{
  "search": "subreddit:SaaS \"looking for\" \"CRM\"",
  "sort": "new",
  "maxItems": 100,
  "outputFormat": "rag"
}

Then send the markdown field into your LLM workflow to extract:

pain points
feature requests
products mentioned
objections
exact customer language
competitor alternatives

Weekly subreddit digest

{
  "subreddits": ["LocalLLaMA", "MachineLearning", "artificial"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 75,
  "outputFormat": "minimal"
}

This is the simplest scheduled monitoring pattern. It avoids comments by default and keeps cost predictable.

RAG dataset for an agent

{
  "subreddits": ["webscraping", "dataengineering"],
  "sort": "top",
  "timeframe": "month",
  "maxItems": 100,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}

Use this when you want thread context, not just post metadata.

Output format

Each post is one dataset item. Field set depends on outputFormat:

`full` (default)

{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "author": "ml_practitioner",
  "score": 412,
  "upvote_ratio": 0.97,
  "num_comments": 89,
  "engagementScore": 82,
  "commentDensity": 0.216,
  "isDiscussionHeavy": true,
  "contentType": "link",
  "sourceMode": "subreddit",
  "sourceValue": "MachineLearning",
  "created_utc": "2026-04-15T12:34:56+00:00",
  "url": "https://arxiv.org/abs/2504.01234",
  "permalink": "https://www.reddit.com/r/MachineLearning/comments/1k0abc/...",
  "selftext": "",
  "selftext_markdown": "",
  "link_flair_text": "Discussion",
  "over_18": false,
  "stickied": false,
  "is_self": false,
  "is_video": false,
  "is_gallery": false,
  "thumbnail": "https://b.thumbs.redditmedia.com/...",
  "media_url": null,
  "gallery_urls": []
}

`minimal`

{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "author": "ml_practitioner",
  "score": 412,
  "num_comments": 89,
  "engagementScore": 82,
  "commentDensity": 0.216,
  "isDiscussionHeavy": true,
  "contentType": "link",
  "created_utc": "2026-04-15T12:34:56+00:00",
  "url": "https://arxiv.org/abs/2504.01234",
  "permalink": "https://www.reddit.com/r/...",
  "selftext_markdown": ""
}

`rag`

Single markdown field per post, ready for vector-DB ingestion:

{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "permalink": "https://www.reddit.com/r/MachineLearning/comments/1k0abc/...",
  "created_utc": "2026-04-15T12:34:56+00:00",
  "score": 412,
  "engagementScore": 82,
  "commentDensity": 0.216,
  "isDiscussionHeavy": true,
  "contentType": "link",
  "sourceMode": "subreddit",
  "sourceValue": "MachineLearning",
  "markdown": "# [D] Why RAG is harder than it looks\n\n**r/MachineLearning**  ·  u/ml_practitioner  ·  score 412  ·  2026-04-15T12:34:56+00:00\n[link](https://www.reddit.com/r/...)\n\n..."
}

Output field reference

Field	Type	Description
`id`	string	Reddit post ID without the `t3_` prefix.
`subreddit`	string	Subreddit name.
`title`	string	Post title.
`author`	string/null	Reddit username when public.
`score`	integer	Reddit score at fetch time.
`upvote_ratio`	number/null	Upvote ratio when available.
`num_comments`	integer	Comment count reported by Reddit.
`created_utc`	string	UTC timestamp converted to ISO format.
`url`	string/null	External URL for link posts, or null for self posts.
`permalink`	string	Canonical Reddit permalink.
`selftext`	string/null	Plain self-post body where present.
`selftext_markdown`	string/null	Markdown body where Reddit exposes it.
`link_flair_text`	string/null	Link flair label.
`author_flair_text`	string/null	Author flair label.
`over_18`	boolean	NSFW marker from Reddit.
`spoiler`	boolean	Spoiler marker.
`stickied`	boolean	Whether the post is stickied.
`locked`	boolean	Whether comments are locked.
`is_self`	boolean	Whether it is a self/text post.
`is_video`	boolean	Whether Reddit marks it as video.
`is_gallery`	boolean	Whether gallery media is present.
`thumbnail`	string/null	Thumbnail URL when available.
`media_url`	string/null	Main media URL when available.
`gallery_urls`	array	Gallery image URLs.
`comments`	array	Present when `includeComments=true`; each item has author/body/score/depth.
`markdown`	string/null	Present when `outputFormat=rag`.

Comment extraction

Comments are optional because every post with comments requires an additional request and more parsing. Leave comments off for feed scanning. Turn them on only when the comment discussion matters.

Recommended settings:

Goal	`includeComments`	`maxComments`	`maxCommentDepth`
Feed monitoring	`false`	20	3
RAG summaries	`true`	20	2
Deep thread analysis	`true`	100	3
Large historical run	`false`	20	3

Deep comment trees can be noisy and expensive. For most research workflows, top-level and near-top-level comments carry enough context.

Cost estimation

Approximate prompt-sized run costs at the displayed "from" price:

Use case	Input	Approx. cost
10 posts from 2 subs, no comments	default	~$0.03
100 posts from 5 subs, no comments	`maxItems=100`	~$0.30
100 posts with top-20 comments each	`includeComments=true`	~$0.40
1,000 posts from search, RAG output	search mode, `maxItems=1000`	~$3.00

Billed as PPE (pay-per-event) at the Apify Store's displayed per-result price. Each delivered dataset item represents one post/result; check the live Store pricing box before large scheduled runs because Apify pricing configuration can change independently of this README.

Scheduling and automation

The most useful Reddit workflows are scheduled:

hourly: brand crisis or launch monitoring
daily: product category monitoring
weekly: trend and content research
monthly: RAG corpus refresh

Recommended Task setup:

{
  "subreddits": ["webscraping", "dataengineering"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 100,
  "outputFormat": "minimal",
  "skipNsfw": true
}

Attach a webhook to send finished datasets to Slack, Google Sheets, BigQuery, or your own API.

Integrations

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("tugelbay/reddit-posts-scraper").call(run_input={
    "subreddits": ["LocalLLaMA"],
    "sort": "top",
    "timeframe": "week",
    "maxItems": 50,
    "outputFormat": "rag",
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["markdown"])

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_APIFY_TOKEN" });
const run = await client.actor("tugelbay/reddit-posts-scraper").call({
  search: "notebooklm vs chatgpt",
  sort: "new",
  maxItems: 100,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();

LangChain document loader

from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"],
        metadata={"subreddit": item["subreddit"], "url": item["permalink"]},
    ),
)
docs = loader.load()

Claude MCP / Apify MCP Server

Works out of the box. Any Claude/GPT agent using the Apify MCP Server can call this actor as a tool and pipe the output straight into its context.

Data quality checklist

Before using results for analysis:

Filter out posts with very low score if you only want meaningful discussions.
Use skipNsfw=true for business/brand monitoring.
Prefer sort=new for alerting and sort=top for research.
Keep one source mode per run so downstream reports are easier to interpret.
Store the permalink field so analysts can inspect the original thread.
For sentiment or topic extraction, use outputFormat=rag so the title, body, and comments stay together.

Use cases

RAG knowledge base — scrape relevant subreddit threads into a vector DB
Competitive intelligence — monitor mentions of your product or competitors
Trend research — top posts of the week across 20 subreddits, one run
Content gap analysis — aggregate questions people ask in niche subs
Academic data collection — snapshots of subreddit discourse over time
AI agent tooling — on-demand Reddit search as an MCP tool
Brand safety monitoring — NSFW filter built in
Influencer research — pull recent public submitted posts for known users

FAQ

Do I need a Reddit account or OAuth app? No. This actor uses Reddit's public .json endpoints — the exact same data Reddit serves to logged-out visitors.

Why do I need a proxy? Reddit rate-limits by IP. At small volumes you may be fine without one, but larger runs should keep the default residential proxy enabled to reduce block/rate-limit risk.

Can I get removed posts or deleted comments? No. If Reddit has removed content from the public API, this actor sees the same [deleted] placeholder. Use Pushshift for historical archives.

How fresh is the data? Live — each run hits Reddit directly. Default sort is hot which reflects the current front page at run time.

What about NSFW content? Set skipNsfw: true to filter posts marked over_18. Comment filtering is not per-comment.

Can it handle quarantined or private subreddits? Quarantined subs: partial results. Private subs: no access without OAuth.

Can I scrape comments only? Use postIds with includeComments=true for the specific threads you care about. The actor still returns the post row as the parent item.

Can I scrape Reddit profiles? Yes, use users. It fetches submitted posts for each public username.

Can I search inside one subreddit? Yes. Either set subreddits and choose a listing sort, or use Reddit search syntax such as subreddit:LocalLLaMA "fine tuning".

Why do results differ from Reddit's UI? Reddit personalizes and experiments with ranking. The actor reads public JSON endpoints at run time, so output can differ from a logged-in browser session.

Does it bypass Reddit restrictions? No. It reads public pages/data. Private, removed, deleted, or login-only content is outside scope.

Troubleshooting

Issue	Cause	Fix
Empty dataset	Subreddit name typo or banned sub	Check the name on Reddit first
`HTTP 403` in logs	Reddit temporarily blocked the proxy IP	Leave `proxyConfiguration` on — session auto-rotates
Missing comments	`includeComments: false`	Set to true
Posts look truncated	`outputFormat: minimal`	Switch to `full` or `rag`
Search misses posts	Reddit search ranking/coverage varies	Monitor key subreddits directly when possible
Run takes too long	Comments enabled on many posts	Lower `maxItems`, `maxComments`, or comment depth
Duplicate topics	Same story cross-posted across subreddits	Deduplicate downstream by URL/title/permalink

Limitations

No OAuth-only data — private subs, user inbox, friends, subscribed feed
No historical archives — Reddit JSON returns live data only; for posts older than a few weeks on active subs, pagination may stop early
Comment depth capped — default 3, max 10 (Reddit itself caps around 10)
Public-data only — no inbox, private communities, mod-only data, or logged-in recommendations
Search is not exhaustive — Reddit search can miss posts; direct subreddit monitoring is more predictable

Privacy and compliance

This actor is designed for public Reddit content. Do not use it to collect private messages, bypass access controls, or infer sensitive personal data. For business reporting, store only the fields you need and respect deletion/removal signals from Reddit.

Changelog

0.1.10 (2026-04-26) — reduced first-run default to 10 posts, added quality contract, and expanded README with workflow, output, scheduling, and troubleshooting guidance.
0.1 (2026-04-19) — initial release: subreddits, search, users, postIds; optional comment trees; full / minimal / rag output

Reddit Scraper

janbruinier/jan-reddit-scraper

Scrape posts and comments from Reddit

Jan Bruinier

Reddit Posts Scraper

scrapepilotapi/reddit-posts-scraper

ScrapePilot

Reddit Posts Scraper

scrapeflow/reddit-posts-scraper

ScrapeFlow

Reddit User Posts Scraper

scrapearchitect/reddit-user-posts-scraper

Reddit User Posts Scraper

Scrape Architect

RAG Web Browser API - Search & Extract

tugelbay/rag-web-browser

Google search + public URLs to Markdown/text/HTML for RAG and AI agents. Guide: https://konabayev.com/tools/rag-web-browser/?utm_source=apify_info&utm_medium=referral&utm_campaign=rag-web-browser

Tugelbay Konabayev

Reddit Scraper

gio21/reddit-scraper

Scrape Reddit posts and comments from any subreddit. Extract titles, scores, authors, comments, and more using Reddit's public JSON API.

Gio

Reddit User Profile Posts And Comments Scraper

scrapeflow/reddit-user-profile-posts-and-comments-scraper

ScrapeFlow

Reddit Scraper - Posts, Comments & Subreddits

viralanalyzer/reddit-scraper

Extract Reddit posts, comments, subreddit data, and user profiles.

viralanalyzer

5.0

Reddit User Profile Posts Comments Scraper

scrapers-hub/reddit-user-profile-posts-comments-scraper

Reddit user profile posts comments scraper to extract posts, comments, and activity from Reddit user profiles 💬📊 Ideal for research, sentiment analysis, and audience insights. Fast and efficient.

Scrapers Hub

Reddit Scraper

alex_claw/reddit-scraper

Alex Claw

Reddit Public Data Scraper - Comments & RAG

Direct answers for search and AI agents

Production lead and research recipes

Scrape Reddit comments for sentiment analysis

Reddit data API for weekly market research

Reddit RAG corpus for AI agents

How it works

Why use this instead of alternatives?

Input examples

Trending in two subreddits (default prefill)

Top of the week, with comments, for RAG

Search across all of Reddit for a product mention

Pull specific posts by ID (faster than listing)

A user's submitted posts

Input parameters

Choosing the right source mode

Recommended workflows

Brand and competitor monitoring

Buyer-intent research

Weekly subreddit digest

RAG dataset for an agent

Output format

full (default)

minimal

rag

Output field reference

Comment extraction

Cost estimation

Scheduling and automation

Integrations

Python

JavaScript

LangChain document loader

Claude MCP / Apify MCP Server

Data quality checklist

Use cases

FAQ

Troubleshooting

Limitations

Privacy and compliance

Changelog

You might also like

Reddit Scraper

Reddit Posts Scraper

Reddit Posts Scraper

Reddit User Posts Scraper

RAG Web Browser API - Search & Extract

Reddit Scraper

Reddit User Profile Posts And Comments Scraper

Reddit Scraper - Posts, Comments & Subreddits

Reddit User Profile Posts Comments Scraper

Reddit Scraper

`full` (default)

`minimal`

`rag`