🧠 Reddit Data Extractor

Pricing

Pay per event

Scrape Reddit data to train AI models. Extract nested comments, post URLs, and user profiles into clean datasets for vector databases and RAG pipelines.

Developer

太郎 山田

Maintained by Community

7 hours ago

Last modified

📡 Reddit All-in-One Scraper

Scrape Reddit subreddits, posts, comments, user profiles, and search results via public JSON endpoints. No API key needed. Clean normalized output for downstream analysis.

Store Quickstart

Start with the Quickstart template (r/javascript hot posts). For deep analysis, use Search + Comments to include comment trees. Use Keyword Filter to narrow results.

Key Features

  • 📡 All source types — Subreddits, post URLs, user profiles, and search queries
  • 💬 Comments with depth control — Nested comment trees with configurable depth
  • 🔍 Search support — Reddit-wide search via search:your query
  • 🏷️ Keyword filtering — Filter posts by title/body keywords
  • 📊 Normalized output — Clean, flat objects designed for analysis pipelines
  • ⏱️ Rate-limit aware — Configurable delays, automatic 429 retry
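The depth control listed above can be pictured as a depth-limited walk over a comment tree. The sketch below is illustrative only: the `replies` and `body` keys are hypothetical names, since this page does not document the shape of comment output, and this is not the actor's actual code. The two limits mirror the commentDepth and maxCommentsPerPost inputs.

```python
from typing import Any


def flatten_comments(
    comments: list[dict[str, Any]],
    max_depth: int = 3,
    max_comments: int = 50,
) -> list[dict[str, Any]]:
    """Walk a nested comment tree depth-first.

    Keeps at most max_depth reply levels (1 = top-level only) and at most
    max_comments entries in total. The "replies" key is an assumption.
    """
    flat: list[dict[str, Any]] = []

    def walk(nodes: list[dict[str, Any]], depth: int) -> None:
        for node in nodes:
            if len(flat) >= max_comments:
                return
            # Copy every field except the nested replies and record the level.
            entry = {k: v for k, v in node.items() if k != "replies"}
            entry["depth"] = depth
            flat.append(entry)
            if depth < max_depth:
                walk(node.get("replies", []), depth + 1)

    walk(comments, 1)
    return flat
```

Flattening in this way keeps each comment as a flat record with an explicit `depth` field, which tends to be easier to load into analysis tools than a nested tree.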

Use Cases

Who | Why
Market researchers | Track brand mentions and sentiment across subreddits
Content creators | Find trending topics and popular discussions
Data scientists | Collect training data with comments and metadata
Community managers | Monitor subreddit activity and user engagement
Competitive analysts | Track competitor mentions and industry trends

Input

Field | Type | Default | Description
sources | array | required | List of Reddit sources. Each can be a subreddit name (e.g. 'javascript'), subreddit URL, post URL, user name (e.g. 'u/sp…
maxPostsPerSource | integer | 25 | Maximum posts to collect from each subreddit, user, or search source.
includeComments | boolean | false | Fetch comments for each post. Increases run time.
maxCommentsPerPost | integer | 50 | Maximum top-level + nested comments to extract per post (when includeComments is on).
commentDepth | integer | 3 | How many reply levels to extract (1 = top-level only).
sort | string | "hot" | Sort order for subreddit and search listings.
time | string | "all" | Time range filter (applies when sort is 'top' or 'controversial').
keywords | array | [] | Only include posts whose title or selftext contains at least one keyword (case-insensitive). Leave empty to include all.
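The sources field mixes several kinds of values in one array. As a rough sketch of how such strings could be told apart, the helper below classifies a source by its prefix. The function name, the returned labels other than "subreddit", the `r/` prefix handling, and the `/comments/` test for post URLs are all assumptions made for illustration; the actor's real parsing rules are not documented on this page.

```python
def classify_source(source: str) -> tuple[str, str]:
    """Return a (kind, value) pair for one entry of the sources array."""
    s = source.strip()
    lowered = s.lower()
    if lowered.startswith("search:"):
        # Everything after the prefix is the search query.
        return "search", s[len("search:"):].strip()
    if lowered.startswith("u/"):
        return "user", s[2:]
    if lowered.startswith(("http://", "https://")):
        # Assumption: post URLs contain a /comments/ path segment.
        if "/comments/" in lowered:
            return "post", s
        return "subreddit", s
    if lowered.startswith("r/"):
        # Assumption: an optional r/ prefix is tolerated on subreddit names.
        s = s[2:]
    return "subreddit", s
```

For instance, `classify_source("search:web scraping")` yields a search entry with the query `web scraping`, while a bare name such as `javascript` is treated as a subreddit.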

Input Example

{
  "sources": ["javascript", "u/spez", "search:web scraping"],
  "maxPostsPerSource": 10,
  "includeComments": false,
  "sort": "hot",
  "keywords": [],
  "delivery": "dataset"
}

Output

Field | Type
meta | object
posts | array
posts[].id | string
posts[].subreddit | string
posts[].title | string
posts[].author | string
posts[].score | number
posts[].upvoteRatio | number
posts[].numComments | number
posts[].createdAt | timestamp
posts[].url | string (url)
posts[].permalink | string (url)
posts[].selftext | string
posts[].isSelf | boolean
posts[].isNsfw | boolean
posts[].isStickied | boolean
posts[].flair | string
posts[].domain | string
posts[].thumbnail | null
posts[].awards | number
posts[].sourceType | string
posts[].sourceValue | string

Output Example

{
  "id": "abc123",
  "subreddit": "javascript",
  "title": "New ESM features in Node 22",
  "author": "devuser",
  "score": 842,
  "upvoteRatio": 0.96,
  "numComments": 127,
  "createdAt": "2026-01-15T12:30:00.000Z",
  "url": "https://example.com/article",
  "permalink": "https://www.reddit.com/r/javascript/comments/abc123/…",
  "selftext": null,
  "isSelf": false,
  "isNsfw": false,
  "flair": "News",
  "sourceType": "subreddit",
  "sourceValue": "javascript"
}
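Since the actor is pitched at vector databases and RAG pipelines, a post like the one above usually needs to be turned into a text-plus-metadata record before embedding. The following is a minimal sketch under assumptions: the `post_to_document` name and the `text`/`metadata` layout are hypothetical and not tied to any particular vector store. It only uses fields shown in the output example.

```python
from datetime import datetime
from typing import Any


def post_to_document(post: dict[str, Any]) -> dict[str, Any]:
    """Convert one post item into a text + metadata record for embedding."""
    title = post.get("title") or ""
    # selftext is null for link posts, so fall back to an empty body.
    body = post.get("selftext") or ""
    text = f"{title}\n\n{body}".strip()

    # createdAt is an ISO 8601 string ending in "Z"; normalize the suffix so
    # datetime.fromisoformat accepts it on older Python versions too.
    created = datetime.fromisoformat(post["createdAt"].replace("Z", "+00:00"))

    return {
        "id": post["id"],
        "text": text,
        "metadata": {
            "subreddit": post.get("subreddit"),
            "author": post.get("author"),
            "score": post.get("score"),
            "permalink": post.get("permalink"),
            "created_at": created.isoformat(),
        },
    }
```

Keeping the Reddit `id` as the record key makes re-runs idempotent in most stores, because a post scraped twice overwrites itself instead of creating a duplicate.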

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

cURL

curl -X POST "https://api.apify.com/v2/acts/taroyamada~reddit-all-in-one-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "sources": ["javascript", "u/spez", "search:web scraping"], "maxPostsPerSource": 10, "includeComments": false, "sort": "hot", "keywords": [], "delivery": "dataset" }'

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/reddit-all-in-one-scraper").call(run_input={
    "sources": ["javascript", "u/spez", "search:web scraping"],
    "maxPostsPerSource": 10,
    "includeComments": False,  # Python boolean, not JSON false
    "sort": "hot",
    "keywords": [],
    "delivery": "dataset",
})

# Print every item from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('taroyamada/reddit-all-in-one-scraper').call({
    sources: ['javascript', 'u/spez', 'search:web scraping'],
    maxPostsPerSource: 10,
    includeComments: false,
    sort: 'hot',
    keywords: [],
    delivery: 'dataset',
});

// Read the items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Tips & Limitations

  • Use snapshotKey to persist seen-item state across runs so only new items are pushed.
  • For high-volume feeds, limit maxItems per run and increase schedule frequency instead.
  • Webhook delivery payloads are compact — parse on receiver side for routing to multiple channels.
  • Combine this actor with article-content-extractor for full-text bodies when feeds are title-only.
  • Run against your own staging feed first to validate filter keywords before production alerts.

FAQ

Does this need a Reddit API key?

No. It uses Reddit's public .json endpoints that don't require authentication.
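Reddit serves most public listing pages as JSON when `.json` is appended to the path. The sketch below shows how such URLs can be built. The helper names are hypothetical, and the exact URLs and query parameters this actor requests are not documented here, so treat it as an illustration of the public endpoints and not of the actor's internals.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.reddit.com"


def listing_url(subreddit: str, sort: str = "hot", limit: int = 25,
                time: str = "all") -> str:
    """Build the public JSON URL for a subreddit listing."""
    params = {"limit": limit}
    # The time range only matters for 'top' and 'controversial' listings.
    if sort in ("top", "controversial"):
        params["t"] = time
    return f"{BASE_URL}/r/{quote(subreddit)}/{sort}.json?{urlencode(params)}"


def search_url(query: str, sort: str = "hot", limit: int = 25) -> str:
    """Build the public JSON URL for a Reddit-wide search."""
    params = {"q": query, "sort": sort, "limit": limit}
    return f"{BASE_URL}/search.json?{urlencode(params)}"
```

These functions only build strings and make no network requests, so they are safe to experiment with before pointing an HTTP client at the result.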

Rate limits?

Reddit rate-limits unauthenticated requests. The actor uses configurable delays (default 1.5s) and retries on 429 responses.
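A delay-and-retry policy of this kind can be sketched as follows. This is not the actor's code: the `fetch_with_retry` name, the retry count, and the doubling backoff are assumptions chosen for illustration, and only the 1.5 s base delay and the retry on HTTP 429 come from the text above. The `fetch` callable is injected so the logic can be exercised without network access.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[], tuple[int, str]],
    delay_s: float = 1.5,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns HTTP 200, retrying only on HTTP 429.

    fetch returns a (status_code, body) pair. Any other status, or running
    out of retries, raises RuntimeError.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429 or attempt == max_retries:
            raise RuntimeError(f"request failed with HTTP {status}")
        # Assumption: double the wait after each rate-limited attempt.
        sleep(delay_s * (2 ** attempt))
    raise RuntimeError("no attempts were made")
```

Passing `sleep` as a parameter is a small design choice that lets tests record the waits instead of actually pausing.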

Can I scrape private subreddits?

No. Only public subreddits are accessible via the public JSON endpoints.

What about NSFW content?

NSFW posts are included in results with isNsfw: true. Filter them in your pipeline if needed.
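Filtering on the `isNsfw` flag downstream can be as small as the sketch below. The `drop_nsfw` name is hypothetical, and treating a missing flag as safe-for-work is an assumption; invert that default if your pipeline needs to be strict.

```python
from typing import Any


def drop_nsfw(posts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the posts that are not flagged with isNsfw: true."""
    # Assumption: a post without the flag is treated as safe-for-work.
    return [post for post in posts if not post.get("isNsfw", False)]
```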

How do I filter by keyword?

Set the keywords input to an array of terms. A post is kept when its title or selftext contains at least one of them (case-insensitive); an empty array includes all posts.
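The matching rule given for the keywords input in the Input table (title or selftext contains at least one keyword, case-insensitive, empty list includes everything) can be reproduced locally, for example to re-filter a dataset with a different keyword list. This is a sketch of that rule as described, with a hypothetical function name, and not the actor's own implementation.

```python
from typing import Any


def matches_keywords(post: dict[str, Any], keywords: list[str]) -> bool:
    """Return True if the post passes the keyword filter."""
    if not keywords:
        # An empty keyword list includes every post.
        return True
    # selftext may be null for link posts, so fall back to an empty string.
    haystack = f"{post.get('title') or ''} {post.get('selftext') or ''}".lower()
    return any(keyword.lower() in haystack for keyword in keywords)
```

Note that this is plain substring matching, so a keyword such as `java` also matches `javascript`; whether the actor behaves the same way is not stated on this page.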

Can this work with paywalled content?

No. This actor only processes publicly accessible content. Paywalled content is out of scope.

Cost

Pay Per Event:

  • actor-start: $0.01 (flat fee per run)
  • dataset-item: $0.001 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.001) = $1.01

No subscription required — you only pay for what you use.
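The pricing formula above is simple enough to budget with a few lines of code. The sketch below encodes the two listed event prices. The `estimate_cost` name and the `runs` parameter are illustrative additions, and `Decimal` is used to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.01")    # flat fee per run
DATASET_ITEM_USD = Decimal("0.001")  # fee per output item


def estimate_cost(items: int, runs: int = 1) -> Decimal:
    """Estimate the pay-per-event cost in USD for a number of items and runs."""
    return runs * ACTOR_START_USD + items * DATASET_ITEM_USD
```

With one run and 1,000 items this reproduces the $1.01 example. Splitting the same 1,000 items across four scheduled runs adds three more start fees, for $1.04.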