Reddit Posts Scraper API - Subreddit, Search, User & Thread JSON

Fast first run — the default input returns a small 10-post sample from public subreddit JSON. No OAuth required for public listings, search, users, and direct post IDs. Reddit may still rate-limit or block some runs. Pay-per-use after that: $0.003 per post, with a single clean schema covering subreddits, search, user streams, and full comment trees in one input.

Scrape public Reddit posts and comments without setting up OAuth apps. This actor wraps Reddit's public .json endpoints behind a single Apify-platform input with proxy rotation, incremental dataset writes, and a clean canonical output schema.

Perfect for AI agents, RAG pipelines, market research, competitive intelligence, trend tracking, and content monitoring — returns clean Markdown that plugs straight into LangChain document loaders or Claude MCP tools.

How it works

This actor supports four input sources, used in priority order (first non-empty wins):

  1. postIds — direct fetch of specific posts by ID (e.g. 1k0abc). Returns full content + optional comment tree.
  2. users — each user's submitted-posts stream.
  3. search — Reddit-wide keyword search. Supports Reddit's native operators: author:, subreddit:, site:, etc.
  4. subreddits — one or more subreddit listings with sort (hot/new/top/rising/controversial) and timeframe (hour/day/week/month/year/all).

Posts are pushed to the dataset incrementally — even if the run is aborted, everything fetched so far is already stored.
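
For orientation, this is roughly how the four source modes map to Reddit's standard public .json listings. A minimal sketch only; the URL shapes are Reddit's documented public endpoints, not this actor's internals:

from urllib.parse import quote

# Illustrative mapping from source mode to a public Reddit .json listing,
# mirroring the first-non-empty-wins priority above.
def first_endpoint(run_input: dict) -> str:
    if run_input.get("postIds"):
        return f"https://www.reddit.com/comments/{run_input['postIds'][0]}.json"
    if run_input.get("users"):
        return f"https://www.reddit.com/user/{run_input['users'][0]}/submitted.json"
    if run_input.get("search"):
        return f"https://www.reddit.com/search.json?q={quote(run_input['search'])}"
    sort = run_input.get("sort", "hot")
    return f"https://www.reddit.com/r/{run_input['subreddits'][0]}/{sort}.json"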

Why use this instead of alternatives?

| Feature | Reddit Posts Scraper | Reddit PRAW (self-hosted) | Reddit official API (paid) | Other Apify scrapers |
|---|---|---|---|---|
| Auth required | No | Yes (OAuth app) | Yes (app + tier limits) | Varies |
| IP rotation | Yes — Apify proxy | Manual | N/A | Varies |
| Output formats | full / minimal / rag-markdown | custom code | custom code | full only |
| Comment trees | Optional, depth-limited | Manual pagination | Manual pagination | Varies |
| Search | Yes — same input | Yes | Yes | Often separate actor |
| User streams | Yes — same input | Yes | Yes | Often separate actor |
| Direct post-ID fetch | Yes — same input | Yes | Yes | Rarely |
| MCP compatible | Yes (PPE = agent-friendly) | No | No | Rarely |
| Pay model | PPE — only what you scrape | Free but self-hosted | Token quotas | Varies |

Key advantage: one input schema instead of four separate actors. Drop in subreddit names OR a search query OR user handles OR specific post IDs and the actor just works.

Input examples

Default: hot posts from two subreddits

{
  "subreddits": ["MachineLearning", "webscraping"],
  "sort": "hot",
  "maxItems": 10
}

Top of the week, with comments, for RAG

{
  "subreddits": ["LocalLLaMA"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 50,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}

Search across all of Reddit for a product mention

{
  "search": "notebooklm vs chatgpt",
  "sort": "new",
  "maxItems": 100,
  "minScore": 2
}

Pull specific posts by ID (faster than listing)

{
  "postIds": ["1k0abc", "1jzxy9"],
  "includeComments": true
}

A user's submitted posts

{
  "users": ["spez"],
  "maxItems": 50
}

Input parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| subreddits | Array[String] | ["MachineLearning","webscraping"] | Subreddit names without r/ |
| search | String | (none) | Reddit-wide search query (overrides subreddits) |
| users | Array[String] | (none) | Usernames without u/ |
| postIds | Array[String] | (none) | Direct post IDs (highest priority) |
| sort | Enum | hot | hot / new / top / rising / controversial |
| timeframe | Enum | day | For top/controversial: hour / day / week / month / year / all |
| maxItems | Integer | 10 | Max posts across all sources (1–10,000) |
| includeComments | Boolean | false | Fetch comment tree per post |
| maxComments | Integer | 20 | Per-post comment cap (1–500) |
| maxCommentDepth | Integer | 3 | Reply-tree depth (0–10) |
| outputFormat | Enum | full | full / minimal / rag |
| skipNsfw | Boolean | false | Drop posts with over_18=true |
| minScore | Integer | 0 | Drop posts below this upvote count |
| proxyConfiguration | Object | Apify proxy | Proxy rotation (recommended) |

Choosing the right source mode

Use subreddits when you know exactly where the conversation happens. This is the best mode for routine monitoring because subreddit listings are predictable and easy to schedule.

Use search when you are discovering conversations across Reddit. Search is useful for brand names, competitor names, product categories, and buyer-intent phrases, but Reddit search can be less complete than focused subreddit monitoring.

Use users when you want posts from known accounts. This is useful for creator/influencer research, executive monitoring, and tracking official company accounts.

Use postIds when you already have URLs or IDs and need clean JSON, Markdown, or comments for specific threads.

Priority order matters: if postIds is set, the actor uses post IDs first. If users is set, it uses user streams before search/subreddit listings. Keep one source mode per run for the cleanest reporting.
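
A quick input-hygiene check before launching a run can catch accidental mixing of modes. This is a hypothetical helper, not part of the actor:

# Priority order used by the actor: postIds > users > search > subreddits.
def active_sources(run_input: dict) -> list[str]:
    return [k for k in ("postIds", "users", "search", "subreddits") if run_input.get(k)]

modes = active_sources({"search": "notebooklm", "subreddits": ["LocalLLaMA"]})
if len(modes) > 1:
    print(f"warning: only '{modes[0]}' will run; {modes[1:]} will be ignored")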

Brand and competitor monitoring

  1. Start with search for your brand and 2-3 competitors.
  2. Export dataset rows to Google Sheets or BigQuery.
  3. Filter by score, num_comments, and subreddit.
  4. Add the best subreddits to a scheduled subreddits run.
  5. Use includeComments=true only for posts that need deeper analysis.

Example:

{
  "search": "\"example product\" OR \"competitor product\"",
  "sort": "new",
  "maxItems": 100,
  "minScore": 1,
  "outputFormat": "full"
}
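
Steps 2 and 3 of that workflow might look like this in Python. A sketch assuming pandas is installed; YOUR_DATASET_ID is a placeholder for a finished run's dataset:

import pandas as pd
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
rows = client.dataset("YOUR_DATASET_ID").list_items().items

df = pd.DataFrame(rows)
# Keep mentions with real engagement, then rank the subreddits they came from.
hot = df[(df["score"] >= 5) & (df["num_comments"] >= 3)]
print(hot.groupby("subreddit").size().sort_values(ascending=False).head(10))
hot.to_csv("brand_mentions.csv", index=False)  # load this into Sheets or BigQuery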

Buyer-intent research

Use Reddit search operators and RAG output to collect real purchase language:

{
  "search": "subreddit:SaaS \"looking for\" \"CRM\"",
  "sort": "new",
  "maxItems": 100,
  "outputFormat": "rag"
}

Then send the markdown field into your LLM workflow to extract:

  • pain points
  • feature requests
  • products mentioned
  • objections
  • exact customer language
  • competitor alternatives
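
One way to run that extraction, sketched with the Anthropic Python SDK. The model name is a placeholder; swap in whichever LLM client you already use:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_signals(markdown: str) -> str:
    prompt = (
        "From this Reddit thread, list pain points, feature requests, products "
        "mentioned, objections, exact customer language, and competitor alternatives:\n\n"
        + markdown
    )
    reply = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text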

Weekly subreddit digest

{
  "subreddits": ["LocalLLaMA", "MachineLearning", "artificial"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 75,
  "outputFormat": "minimal"
}

This is the simplest scheduled monitoring pattern. It avoids comments by default and keeps cost predictable.

RAG dataset for an agent

{
  "subreddits": ["webscraping", "dataengineering"],
  "sort": "top",
  "timeframe": "month",
  "maxItems": 100,
  "includeComments": true,
  "maxComments": 20,
  "maxCommentDepth": 2,
  "outputFormat": "rag"
}

Use this when you want thread context, not just post metadata.

Output format

Each post is one dataset item. Field set depends on outputFormat:

full (default)

{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "author": "ml_practitioner",
  "score": 412,
  "upvote_ratio": 0.97,
  "num_comments": 89,
  "created_utc": "2026-04-15T12:34:56+00:00",
  "url": "https://arxiv.org/abs/2504.01234",
  "permalink": "https://www.reddit.com/r/MachineLearning/comments/1k0abc/...",
  "selftext": "",
  "selftext_markdown": "",
  "link_flair_text": "Discussion",
  "over_18": false,
  "stickied": false,
  "is_self": false,
  "is_video": false,
  "is_gallery": false,
  "thumbnail": "https://b.thumbs.redditmedia.com/...",
  "media_url": null,
  "gallery_urls": []
}

minimal

{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "author": "ml_practitioner",
  "score": 412,
  "num_comments": 89,
  "created_utc": "2026-04-15T12:34:56+00:00",
  "url": "https://arxiv.org/abs/2504.01234",
  "permalink": "https://www.reddit.com/r/...",
  "selftext_markdown": ""
}

rag

Single markdown field per post, ready for vector-DB ingestion:

{
  "id": "1k0abc",
  "subreddit": "MachineLearning",
  "title": "[D] Why RAG is harder than it looks",
  "permalink": "https://www.reddit.com/r/MachineLearning/comments/1k0abc/...",
  "created_utc": "2026-04-15T12:34:56+00:00",
  "score": 412,
  "markdown": "# [D] Why RAG is harder than it looks\n\n**r/MachineLearning** · u/ml_practitioner · score 412 · 2026-04-15T12:34:56+00:00\n[link](https://www.reddit.com/r/...)\n\n..."
}

Output field reference

| Field | Type | Description |
|---|---|---|
| id | string | Reddit post ID without the t3_ prefix. |
| subreddit | string | Subreddit name. |
| title | string | Post title. |
| author | string/null | Reddit username when public. |
| score | integer | Reddit score at fetch time. |
| upvote_ratio | number/null | Upvote ratio when available. |
| num_comments | integer | Comment count reported by Reddit. |
| created_utc | string | UTC timestamp converted to ISO format. |
| url | string/null | External URL for link posts, or null for self posts. |
| permalink | string | Canonical Reddit permalink. |
| selftext | string/null | Plain self-post body where present. |
| selftext_markdown | string/null | Markdown body where Reddit exposes it. |
| link_flair_text | string/null | Link flair label. |
| author_flair_text | string/null | Author flair label. |
| over_18 | boolean | NSFW marker from Reddit. |
| spoiler | boolean | Spoiler marker. |
| stickied | boolean | Whether the post is stickied. |
| locked | boolean | Whether comments are locked. |
| is_self | boolean | Whether it is a self/text post. |
| is_video | boolean | Whether Reddit marks it as video. |
| is_gallery | boolean | Whether gallery media is present. |
| thumbnail | string/null | Thumbnail URL when available. |
| media_url | string/null | Main media URL when available. |
| gallery_urls | array | Gallery image URLs. |
| comments | array | Present when includeComments=true; each item has author/body/score/depth. |
| markdown | string/null | Present when outputFormat=rag. |

Comment extraction

Comments are optional because every post with comments requires an additional request and more parsing. Leave comments off for feed scanning. Turn them on only when the comment discussion matters.

Recommended settings:

| Goal | includeComments | maxComments | maxCommentDepth |
|---|---|---|---|
| Feed monitoring | false | 20 | 3 |
| RAG summaries | true | 20 | 2 |
| Deep thread analysis | true | 100 | 3 |
| Large historical run | false | 20 | 3 |

Deep comment trees can be noisy and expensive. For most research workflows, top-level and near-top-level comments carry enough context.
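
If you do enable comments, a small downstream trim keeps only the context that matters. A sketch assuming the comments field documented in the field reference above (author/body/score/depth per item):

def trim_comments(post: dict, max_depth: int = 1, min_score: int = 1) -> list[dict]:
    # Keep top-level and first-reply comments with at least one upvote.
    return [
        c for c in post.get("comments", [])
        if c["depth"] <= max_depth and c["score"] >= min_score
    ]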

Cost estimation

Approximate costs for typical run sizes:

| Use case | Input | Approx. cost |
|---|---|---|
| 10 posts from 2 subs, no comments | default | ~$0.03 |
| 100 posts from 5 subs, no comments | maxItems=100 | ~$0.30 |
| 100 posts with top-20 comments each | includeComments=true | ~$0.40 |
| 1,000 posts from search, RAG output | search mode, maxItems=1000 | ~$3.00 |

Billed as PPE (pay-per-event): one reddit-post event per item written + a small actor-start overhead.
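
A back-of-envelope estimator. The per-post price is the one quoted in this README; the start overhead below is a placeholder, not a quoted fee:

def estimate_cost(posts: int, per_post: float = 0.003, start_overhead: float = 0.01) -> float:
    # One reddit-post event per item written, plus a small actor-start overhead.
    return posts * per_post + start_overhead

print(f"${estimate_cost(1000):.2f}")  # roughly $3 for a 1,000-post run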

Scheduling and automation

The most useful Reddit workflows are scheduled:

  • hourly: brand crisis or launch monitoring
  • daily: product category monitoring
  • weekly: trend and content research
  • monthly: RAG corpus refresh

Recommended Task setup:

{
  "subreddits": ["webscraping", "dataengineering"],
  "sort": "top",
  "timeframe": "week",
  "maxItems": 100,
  "outputFormat": "minimal",
  "skipNsfw": true
}

Attach a webhook to send finished datasets to Slack, Google Sheets, BigQuery, or your own API.
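
A minimal receiver for that webhook, sketched with Flask. It assumes Apify's default webhook payload, which nests the run under resource; adjust if you customised the payload template:

from flask import Flask, request
from apify_client import ApifyClient

app = Flask(__name__)
client = ApifyClient("YOUR_APIFY_TOKEN")

@app.post("/apify-webhook")
def on_run_finished():
    dataset_id = request.json["resource"]["defaultDatasetId"]
    for item in client.dataset(dataset_id).iterate_items():
        print(item["title"])  # forward to Slack, Sheets, or your own API here
    return "", 204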

Integrations

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("tugelbay/reddit-posts-scraper").call(run_input={
    "subreddits": ["LocalLLaMA"],
    "sort": "top",
    "timeframe": "week",
    "maxItems": 50,
    "outputFormat": "rag",
})

# Stream dataset items; "markdown" is present because outputFormat is "rag".
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["markdown"])

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_APIFY_TOKEN" });
const run = await client.actor("tugelbay/reddit-posts-scraper").call({
    search: "notebooklm vs chatgpt",
    sort: "new",
    maxItems: 100,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();

LangChain document loader

from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

# Reuses `run` from the Python example above; outputFormat must be "rag"
# so every item carries a markdown field.
loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"],
        metadata={"subreddit": item["subreddit"], "url": item["permalink"]},
    ),
)
docs = loader.load()
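
From there, one way to make the documents searchable is a local FAISS index. A sketch assuming langchain-openai and faiss-cpu are installed; any embeddings class works in place of OpenAIEmbeddings:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

index = FAISS.from_documents(docs, OpenAIEmbeddings())
for doc in index.similarity_search("common scraping blockers", k=5):
    print(doc.metadata["url"])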

Claude MCP / Apify MCP Server

Works out of the box. Any Claude/GPT agent using the Apify MCP Server can call this actor as a tool and pipe the output straight into its context.
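
For Claude Desktop, the server entry typically looks like the snippet below. A sketch only; check Apify's MCP documentation for the current package name and flags:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "tugelbay/reddit-posts-scraper"],
      "env": { "APIFY_TOKEN": "YOUR_APIFY_TOKEN" }
    }
  }
}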

Data quality checklist

Before using results for analysis:

  1. Filter out posts with very low score if you only want meaningful discussions.
  2. Use skipNsfw=true for business/brand monitoring.
  3. Prefer sort=new for alerting and sort=top for research.
  4. Keep one source mode per run so downstream reports are easier to interpret.
  5. Store the permalink field so analysts can inspect the original thread.
  6. For sentiment or topic extraction, use outputFormat=rag so the title, body, and comments stay together.

Use cases

  • RAG knowledge base — scrape relevant subreddit threads into a vector DB
  • Competitive intelligence — monitor mentions of your product or competitors
  • Trend research — top posts of the week across 20 subreddits, one run
  • Content gap analysis — aggregate questions people ask in niche subs
  • Academic data collection — snapshots of subreddit discourse over time
  • AI agent tooling — on-demand Reddit search as an MCP tool
  • Brand safety monitoring — NSFW filter built in
  • Influencer research — pull recent public submitted posts for known users

FAQ

Do I need a Reddit account or OAuth app? No. This actor uses Reddit's public .json endpoints — the exact same data Reddit serves to logged-out visitors.

Why do I need a proxy? Reddit rate-limits by IP. At small volumes you may be fine without one, but larger runs should keep the default residential proxy enabled to reduce block/rate-limit risk.

Can I get removed posts or deleted comments? No. If Reddit has removed content from the public API, this actor sees the same [deleted] placeholder. Third-party archives such as Pushshift historically filled this gap, though access has been restricted since 2023.

How fresh is the data? Live — each run hits Reddit directly. The default sort is hot, which reflects each subreddit's current hot listing at run time.

What about NSFW content? Set skipNsfw: true to filter posts marked over_18. Filtering applies at the post level, not to individual comments.

Can it handle quarantined or private subreddits? Quarantined subs: partial results. Private subs: no access without OAuth.

Can I scrape comments only? Use postIds with includeComments=true for the specific threads you care about. The actor still returns the post row as the parent item.

Can I scrape Reddit profiles? Yes, use users. It fetches submitted posts for each public username.

Can I search inside one subreddit? Yes. Either set subreddits and choose a listing sort, or use Reddit search syntax such as subreddit:LocalLLaMA "fine tuning".
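
For example, a subreddit-scoped search input:

{
  "search": "subreddit:LocalLLaMA \"fine tuning\"",
  "sort": "new",
  "maxItems": 50
}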

Why do results differ from Reddit's UI? Reddit personalizes and experiments with ranking. The actor reads public JSON endpoints at run time, so output can differ from a logged-in browser session.

Does it bypass Reddit restrictions? No. It reads public pages/data. Private, removed, deleted, or login-only content is outside scope.

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Empty dataset | Subreddit name typo or banned sub | Check the name on Reddit first |
| HTTP 403 in logs | Reddit temporarily blocked the proxy IP | Leave proxyConfiguration on — session auto-rotates |
| Missing comments | includeComments: false | Set to true |
| Posts look truncated | outputFormat: minimal | Switch to full or rag |
| Search misses posts | Reddit search ranking/coverage varies | Monitor key subreddits directly when possible |
| Run takes too long | Comments enabled on many posts | Lower maxItems, maxComments, or comment depth |
| Duplicate topics | Same story cross-posted across subreddits | Deduplicate downstream by URL/title/permalink |
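
For the last row, a simple downstream dedupe that keeps the highest-scoring copy of each cross-posted link:

def dedupe(items: list[dict]) -> list[dict]:
    best: dict[str, dict] = {}
    for item in items:
        key = item.get("url") or item["permalink"]  # self posts fall back to permalink
        if key not in best or item["score"] > best[key]["score"]:
            best[key] = item
    return list(best.values())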

Limitations

  • No OAuth-only data — private subs, user inbox, friends, subscribed feed
  • No historical archives — Reddit JSON returns live data only; for posts older than a few weeks on active subs, pagination may stop early
  • Comment depth capped — default 3, max 10 (Reddit itself caps around 10)
  • Public-data only — no inbox, private communities, mod-only data, or logged-in recommendations
  • Search is not exhaustive — Reddit search can miss posts; direct subreddit monitoring is more predictable

Privacy and compliance

This actor is designed for public Reddit content. Do not use it to collect private messages, bypass access controls, or infer sensitive personal data. For business reporting, store only the fields you need and respect deletion/removal signals from Reddit.

Changelog

  • 0.1.10 (2026-04-26) — reduced first-run default to 10 posts, added quality contract, and expanded README with workflow, output, scheduling, and troubleshooting guidance.
  • 0.1 (2026-04-19) — initial release: subreddits, search, users, postIds; optional comment trees; full / minimal / rag output