🧠 Reddit Data Extractor
Scrape Reddit data to train AI models. Extract nested comments, post URLs, and user profiles into clean datasets for vector databases and RAG pipelines.
Pricing
Pay per event
Developer
太郎 山田
📡 Reddit All-in-One Scraper
Scrape Reddit subreddits, posts, comments, user profiles, and search results via public JSON endpoints. No API key needed. Clean normalized output for downstream analysis.
Quickstart
Start with the Quickstart template (r/javascript hot posts). For deeper analysis, combine a search source with includeComments to capture full comment trees, and use the keywords filter to narrow results.
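The Search + Comments setup can be expressed as a run input like this (field names come from the Input table below; the specific values are illustrative):

```json
{
  "sources": ["search:web scraping", "javascript"],
  "maxPostsPerSource": 25,
  "includeComments": true,
  "maxCommentsPerPost": 50,
  "commentDepth": 3,
  "sort": "hot",
  "keywords": ["scraper", "crawl"]
}
```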
Key Features
- 📡 All source types — Subreddits, post URLs, user profiles, and search queries
- 💬 Comments with depth control — Nested comment trees with configurable depth
- 🔍 Search support — Reddit-wide search via `search:your query`
- 🏷️ Keyword filtering — Filter posts by title/body keywords
- 📊 Normalized output — Clean, flat objects designed for analysis pipelines
- ⏱️ Rate-limit aware — Configurable delays, automatic 429 retry
Use Cases
| Who | Why |
|---|---|
| Market researchers | Track brand mentions and sentiment across subreddits |
| Content creators | Find trending topics and popular discussions |
| Data scientists | Collect training data with comments and metadata |
| Community managers | Monitor subreddit activity and user engagement |
| Competitive analysts | Track competitor mentions and industry trends |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| sources | array | required | List of Reddit sources. Each can be a subreddit name (e.g. 'javascript'), subreddit URL, post URL, user name (e.g. 'u/spez'), or search query (e.g. 'search:web scraping'). |
| maxPostsPerSource | integer | 25 | Maximum posts to collect from each subreddit, user, or search source. |
| includeComments | boolean | false | Fetch comments for each post. Increases run time. |
| maxCommentsPerPost | integer | 50 | Maximum top-level + nested comments to extract per post (when includeComments is on). |
| commentDepth | integer | 3 | How many reply levels to extract (1 = top-level only). |
| sort | string | "hot" | Sort order for subreddit and search listings. |
| time | string | "all" | Time range filter (applies when sort is 'top' or 'controversial'). |
| keywords | array | [] | Only include posts whose title or selftext contains at least one keyword (case-insensitive). Leave empty to include all. |
Input Example

```json
{
  "sources": ["javascript", "u/spez", "search:web scraping"],
  "maxPostsPerSource": 10,
  "includeComments": false,
  "sort": "hot",
  "keywords": [],
  "delivery": "dataset"
}
```
Output
| Field | Type | Description |
|---|---|---|
| meta | object | Run summary metadata. |
| posts | array | Extracted posts. |
| posts[].id | string | Reddit post ID. |
| posts[].subreddit | string | Subreddit name. |
| posts[].title | string | Post title. |
| posts[].author | string | Author username. |
| posts[].score | number | Net upvote score. |
| posts[].upvoteRatio | number | Fraction of votes that are upvotes. |
| posts[].numComments | number | Comment count. |
| posts[].createdAt | timestamp | Post creation time (ISO 8601). |
| posts[].url | string (url) | Linked article or media URL. |
| posts[].permalink | string (url) | Full Reddit permalink. |
| posts[].selftext | string | Self-post body text (null for link posts). |
| posts[].isSelf | boolean | True for self (text) posts. |
| posts[].isNsfw | boolean | True if the post is marked NSFW. |
| posts[].isStickied | boolean | True if stickied in the subreddit. |
| posts[].flair | string | Post flair text. |
| posts[].domain | string | Domain of the linked URL. |
| posts[].thumbnail | string or null | Thumbnail URL, if any. |
| posts[].awards | number | Number of awards received. |
| posts[].sourceType | string | Kind of source the post came from (e.g. subreddit, user, search). |
| posts[].sourceValue | string | The source value the post came from. |
Output Example

```json
{
  "id": "abc123",
  "subreddit": "javascript",
  "title": "New ESM features in Node 22",
  "author": "devuser",
  "score": 842,
  "upvoteRatio": 0.96,
  "numComments": 127,
  "createdAt": "2026-01-15T12:30:00.000Z",
  "url": "https://example.com/article",
  "permalink": "https://www.reddit.com/r/javascript/comments/abc123/…",
  "selftext": null,
  "isSelf": false,
  "isNsfw": false,
  "flair": "News",
  "sourceType": "subreddit",
  "sourceValue": "javascript"
}
```
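Once items are in hand, downstream filtering is plain dict work. A minimal sketch (field names taken from the output table above; the score threshold is illustrative):

```python
def filter_posts(items, min_score=100, include_nsfw=False):
    """Keep posts at or above a score threshold, optionally dropping NSFW ones."""
    kept = []
    for post in items:
        if post.get("score", 0) < min_score:
            continue
        if not include_nsfw and post.get("isNsfw"):
            continue
        kept.append(post)
    return kept

posts = [
    {"id": "abc123", "score": 842, "isNsfw": False},
    {"id": "def456", "score": 12, "isNsfw": False},
    {"id": "ghi789", "score": 500, "isNsfw": True},
]
print([p["id"] for p in filter_posts(posts)])  # → ['abc123']
```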
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL

```shell
curl -X POST "https://api.apify.com/v2/acts/taroyamada~reddit-all-in-one-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sources": ["javascript", "u/spez", "search:web scraping"],
    "maxPostsPerSource": 10,
    "includeComments": false,
    "sort": "hot",
    "keywords": [],
    "delivery": "dataset"
  }'
```
Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("taroyamada/reddit-all-in-one-scraper").call(run_input={
    "sources": ["javascript", "u/spez", "search:web scraping"],
    "maxPostsPerSource": 10,
    "includeComments": False,
    "sort": "hot",
    "keywords": [],
    "delivery": "dataset",
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
JavaScript / Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/reddit-all-in-one-scraper').call({
    sources: ['javascript', 'u/spez', 'search:web scraping'],
    maxPostsPerSource: 10,
    includeComments: false,
    sort: 'hot',
    keywords: [],
    delivery: 'dataset',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Tips & Limitations
- Use `snapshotKey` to persist seen-item state across runs so only new items are pushed.
- For high-volume feeds, limit `maxItems` per run and increase schedule frequency instead.
- Webhook delivery payloads are compact — parse on the receiver side for routing to multiple channels.
- Combine this actor with `article-content-extractor` for full-text bodies when feeds are title-only.
- Run against your own staging feed first to validate filter keywords before production alerts.
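The seen-item pattern from the first tip can be sketched without any platform API. Here the state is a plain JSON file of post IDs — the filename and helpers are illustrative, not part of the actor:

```python
import json
from pathlib import Path

STATE_FILE = Path("seen_ids.json")  # illustrative local state; a key-value store works too

def load_seen():
    """Load the set of previously seen post IDs, or an empty set on first run."""
    return set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

def diff_new(items, seen):
    """Return only items whose id has not been seen, plus the updated id set."""
    new_items = [it for it in items if it["id"] not in seen]
    return new_items, seen | {it["id"] for it in new_items}

seen = {"abc123"}
batch = [{"id": "abc123"}, {"id": "xyz999"}]
new_items, seen = diff_new(batch, seen)
print([it["id"] for it in new_items])  # → ['xyz999']
```

Persist the updated set at the end of each run (e.g. `STATE_FILE.write_text(json.dumps(sorted(seen)))`) so the next run only pushes new items.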
FAQ
Does this need a Reddit API key?
No. It uses Reddit's public .json endpoints that don't require authentication.
Rate limits?
Reddit rate-limits unauthenticated requests. The actor uses configurable delays (default 1.5s) and retries on 429 responses.
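A client hitting these limits can back off along the following lines. The 1.5 s base matches the default delay mentioned above; the retry cap is illustrative:

```python
BASE_DELAY = 1.5   # default request delay, seconds
MAX_RETRIES = 4    # illustrative cap on 429 retries

def backoff_delays(base=BASE_DELAY, retries=MAX_RETRIES):
    """Exponential backoff schedule: sleep this long after each successive 429."""
    return [base * (2 ** attempt) for attempt in range(retries)]

print(backoff_delays())  # → [1.5, 3.0, 6.0, 12.0]
```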
Can I scrape private subreddits?
No. Only public subreddits are accessible via the public JSON endpoints.
What about NSFW content?
NSFW posts are included in results with isNsfw: true. Filter them in your pipeline if needed.
How do I filter by keyword?
Use the `keywords` input: only posts whose title or selftext contains at least one of the listed keywords (case-insensitive) are included. Leave it empty to include all posts.
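The case-insensitive title/selftext match described in the input table can be sketched as:

```python
def matches_keywords(post, keywords):
    """True if any keyword appears in the post's title or selftext (case-insensitive).
    An empty keyword list matches everything, mirroring the `keywords` input default."""
    if not keywords:
        return True
    haystack = f"{post.get('title', '')} {post.get('selftext') or ''}".lower()
    return any(kw.lower() in haystack for kw in keywords)

post = {"title": "New ESM features in Node 22", "selftext": None}
print(matches_keywords(post, ["esm"]))   # → True
print(matches_keywords(post, ["rust"]))  # → False
```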
Can this work with paywalled content?
No — this actor only processes publicly accessible feed/article URLs. Paywalled content is out of scope.
Related Actors
News & Content cluster — explore related Apify tools:
- 📰 Google News Scraper — Scrape Google News articles for any search query via official RSS feed.
- 📰 Article Extractor — Extract clean article content with title, author, publish date, images from news and blog pages.
- 📄 Website Content Extractor — Extract clean main content from any webpage as text, markdown, or HTML.
- 📡 RSS Feed Aggregator — Aggregate multiple RSS and Atom feeds with keyword filtering and deduplication.
- 📰 Hacker News Scraper — Fetch Hacker News top, new, best, ask, show, job stories via official Firebase API.
- 🚨 Reddit Keyword Monitor Alerts — Focused Reddit keyword and subreddit monitor built for recurring alerts, snapshot diffing, and webhook handoff.
Cost
Pay Per Event:
- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.001 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.001) = $1.01
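The pricing arithmetic above as a tiny helper (rates copied from this page; rounded to cents):

```python
ACTOR_START_USD = 0.01  # flat fee per run
PER_ITEM_USD = 0.001    # per dataset item

def run_cost(items):
    """Total pay-per-event cost in USD for one run producing `items` dataset items."""
    return round(ACTOR_START_USD + items * PER_ITEM_USD, 2)

print(run_cost(1000))  # → 1.01
```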
No subscription required — you only pay for what you use.