Reddit All-in-One Scraper
Scrape massive historical datasets across Reddit by extracting subreddits, complex search results, post content, and deep comment trees.
Pricing
Pay per event
Developer: 太郎 山田
Maintained by Community
📡 Reddit All-in-One Scraper
This Reddit all-in-one scraper is built for deep extraction and historical backfilling. Data scientists, AI researchers, and analysts use it to systematically extract search results, full subreddit feeds, and densely nested comment trees. When you need to scrape Reddit to build training data or run large-scale sentiment analysis, it delivers structured data from public subreddits, posts, user profiles, and search queries.
Unlike simple point-and-click tools, this scraper is designed for heavy-duty data collection. You can input hundreds of post URLs, run broad search queries with specific sort and time parameters, or target specific subreddits to map out years of discussions. Users frequently schedule it for weekly backfills so that their local databases contain every relevant thread, score, and author detail. It is a solid starting point for a social intelligence workflow.
Each run captures contextual data, including the post id, title, selftext, author, flair, createdAt timestamp, and score, plus nested comment text when includeComments is enabled. Exporting these results gives AI models the conversational context they need and gives researchers a clear picture of community dynamics. Once your historical backfill is complete and exported, you can shift daily monitoring to a specialized keyword alert tool such as reddit-keyword-monitor-alerts to catch newly published posts.
Store Quickstart
- Start with store-input.example.json or Quickstart — Compact Backfill for a compact initial dataset.
- Then use the research ladder from store-input.templates.json:
  - Quickstart — Compact Backfill
  - Recurring Research Refresh
  - Webhook → Article Cleanup Queue
- Side presets stay available for deeper pulls: Competitor Subreddit Backfill, Search + Comments Research, and User/Profile Backfill.
- Move true net-new recurring alerting to reddit-keyword-monitor-alerts once the backfill dataset is useful.
- Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.
Key Features
- 📡 All source types — Subreddits, post URLs, user profiles, and search queries
- 💬 Comments with depth control — Nested comment trees with configurable depth
- 🔍 Search support — Reddit-wide search via search:your query
- 🏷️ Keyword filtering — Filter posts by title/body keywords
- 📊 Normalized output — Clean, flat objects for research pipelines
- 🤝 Pack handoff — Built for backfill/research before recurring monitoring handoff
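The depth control listed above can be pictured as a bounded, depth-first walk over the comment tree. The following is a minimal sketch, not the actor's actual implementation: the `body` and `replies` keys are a hypothetical comment shape, and `comment_depth` and `max_comments` simply mirror the commentDepth and maxCommentsPerPost inputs.

```python
from typing import Any


def flatten_comments(
    comments: list[dict[str, Any]],
    comment_depth: int = 3,
    max_comments: int = 50,
) -> list[dict[str, Any]]:
    """Flatten a nested comment tree, depth-first, into flat rows.

    Assumed (hypothetical) shape: {"body": str, "replies": [comment, ...]}.
    A depth of 1 keeps top-level comments only.
    """
    flat: list[dict[str, Any]] = []

    def walk(nodes: list[dict[str, Any]], depth: int) -> None:
        for node in nodes:
            if len(flat) >= max_comments:
                return  # overall cap reached, stop collecting
            flat.append({"body": node["body"], "depth": depth})
            if depth < comment_depth:
                walk(node.get("replies", []), depth + 1)

    walk(comments, 1)
    return flat


tree = [
    {"body": "top", "replies": [
        {"body": "reply", "replies": [
            {"body": "nested reply", "replies": []},
        ]},
    ]},
    {"body": "second top", "replies": []},
]
rows = flatten_comments(tree, comment_depth=2)
```

With `comment_depth=2` the third-level reply is skipped while both top-level comments are kept, which is the trade-off to weigh when trimming run time.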
Use Cases
| Who | Why |
|---|---|
| Market researchers | Backfill competitor/category subreddit history |
| Analysts | Pull search + comments datasets for thematic analysis |
| Data teams | Collect profile/subreddit sources for downstream scoring |
| PM/GTM teams | Build context sets, then move to recurring monitor alerts |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| sources | array | required | List of sources: subreddit name/URL, post URL, user (e.g. u/spez), user URL, or search:query. |
| maxPostsPerSource | integer | 25 | Maximum posts to collect from each subreddit, user, or search source. |
| includeComments | boolean | false | Fetch comments for each post. Increases run time. |
| maxCommentsPerPost | integer | 50 | Maximum top-level + nested comments to extract per post (when includeComments is on). |
| commentDepth | integer | 3 | How many reply levels to extract (1 = top-level only). |
| sort | string | "hot" | Sort order for subreddit and search listings. |
| time | string | "all" | Time range filter (applies when sort is 'top' or 'controversial'). |
| keywords | array | [] | Only include posts whose title or selftext contains at least one keyword (case-insensitive). Leave empty to include all. |
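The keywords rule in the table (case-insensitive match on title or selftext, empty list includes everything) can be reproduced client-side, for example to re-filter an exported dataset. This is a sketch of that rule as documented above; the function name `matches_keywords` is illustrative and the actor's own matching logic may differ in detail.

```python
def matches_keywords(post: dict, keywords: list[str]) -> bool:
    """Return True when the post passes the documented keyword rule."""
    if not keywords:
        return True  # an empty list means "include all posts"
    # selftext can be null for link posts, so fall back to an empty string.
    title = (post.get("title") or "").casefold()
    body = (post.get("selftext") or "").casefold()
    return any(kw.casefold() in title or kw.casefold() in body for kw in keywords)


posts = [
    {"title": "New ESM features in Node 22", "selftext": None},
    {"title": "Weekly thread", "selftext": "Anyone tried web SCRAPING with Deno?"},
]
kept = [p for p in posts if matches_keywords(p, ["scraping"])]
```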
Input Example
{
  "sources": ["javascript", "u/spez", "search:web scraping"],
  "maxPostsPerSource": 10,
  "includeComments": false,
  "sort": "hot",
  "keywords": [],
  "delivery": "dataset"
}
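Because a single sources array mixes subreddits, users, post URLs, and searches, it can help to check how each entry is likely to be interpreted before starting a run. The sketch below is an assumption about reasonable parsing rules, not the actor's real parser; the kind labels and the `classify_source` helper are hypothetical.

```python
import re


def classify_source(source: str) -> tuple[str, str]:
    """Map one `sources` entry to an illustrative (kind, value) pair."""
    s = source.strip()
    if s.lower().startswith("search:"):
        return "search", s[len("search:"):].strip()
    # Post URLs carry a /comments/<id> segment.
    m = re.search(r"/comments/([a-z0-9]+)", s, flags=re.IGNORECASE)
    if m:
        return "post", m.group(1)
    # "u/name", "/u/name", or "/user/name" (bare or inside a URL).
    m = re.search(r"(?:^|/)(?:u|user)/([A-Za-z0-9_-]+)", s)
    if m:
        return "user", m.group(1)
    # "r/name" or a subreddit URL.
    m = re.search(r"(?:^|/)r/([A-Za-z0-9_]+)", s)
    if m:
        return "subreddit", m.group(1)
    return "subreddit", s  # bare name such as "javascript"


kinds = [classify_source(s) for s in ["javascript", "u/spez", "search:web scraping"]]
```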
Input Examples
Example: Subreddit research
{
  "sources": ["MachineLearning"],
  "sort": "top",
  "time": "month"
}
Example: User activity history
{
  "sources": ["u/anExampleUser"],
  "maxPostsPerSource": 100
}
Example: Cross-subreddit keyword search
{
  "sources": ["search:claude vs gpt"],
  "sort": "relevance",
  "maxPostsPerSource": 50
}
Output
| Field | Type | Description |
|---|---|---|
| meta | object | |
| posts | array | |
| posts[].id | string | |
| posts[].subreddit | string | |
| posts[].title | string | |
| posts[].author | string | |
| posts[].score | number | |
| posts[].upvoteRatio | number | |
| posts[].numComments | number | |
| posts[].createdAt | timestamp | |
| posts[].url | string (url) | |
| posts[].permalink | string (url) | |
| posts[].selftext | string | |
| posts[].isSelf | boolean | |
| posts[].isNsfw | boolean | |
| posts[].isStickied | boolean | |
| posts[].flair | string | |
| posts[].domain | string | |
| posts[].thumbnail | null | |
| posts[].awards | number | |
| posts[].sourceType | string | |
| posts[].sourceValue | string | |
Output Example
{
  "id": "abc123",
  "subreddit": "javascript",
  "title": "New ESM features in Node 22",
  "author": "devuser",
  "score": 842,
  "upvoteRatio": 0.96,
  "numComments": 127,
  "createdAt": "2026-01-15T12:30:00.000Z",
  "url": "https://example.com/article",
  "permalink": "https://www.reddit.com/r/javascript/comments/abc123/…",
  "selftext": null,
  "isSelf": false,
  "isNsfw": false,
  "flair": "News",
  "sourceType": "subreddit",
  "sourceValue": "javascript"
}
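When loading items like the one above into an analysis pipeline, two details are easy to trip over: createdAt is an ISO 8601 string with a trailing Z, and selftext is null for link posts. A minimal sketch of handling both follows, using a trimmed copy of the example item; the `to_row` helper and its output keys are illustrative, not part of the actor.

```python
from datetime import datetime, timezone

# Trimmed copy of the example item shown above.
item = {
    "id": "abc123",
    "subreddit": "javascript",
    "score": 842,
    "createdAt": "2026-01-15T12:30:00.000Z",
    "selftext": None,
}


def parse_created_at(value: str) -> datetime:
    """Parse the UTC timestamp format used in the example output."""
    # Replace the trailing "Z" with an explicit offset so Python versions
    # before 3.11 also accept the string.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def to_row(post: dict) -> dict:
    """Build a flat row with a real datetime and a never-null text field."""
    return {
        "id": post["id"],
        "subreddit": post["subreddit"],
        "score": post["score"],
        "created": parse_created_at(post["createdAt"]),
        "text": post.get("selftext") or "",  # link posts have no selftext
    }


row = to_row(item)
```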
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL
curl -X POST "https://api.apify.com/v2/acts/taroyamada~reddit-all-in-one-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sources": ["javascript", "u/spez", "search:web scraping"],
    "maxPostsPerSource": 10,
    "includeComments": false,
    "sort": "hot",
    "keywords": [],
    "delivery": "dataset"
  }'
Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/reddit-all-in-one-scraper").call(
    run_input={
        "sources": ["javascript", "u/spez", "search:web scraping"],
        "maxPostsPerSource": 10,
        "includeComments": False,  # Python boolean, not the JSON literal false
        "sort": "hot",
        "keywords": [],
        "delivery": "dataset",
    }
)

# Iterate over the items stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
JavaScript / Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('taroyamada/reddit-all-in-one-scraper').call({
  sources: ['javascript', 'u/spez', 'search:web scraping'],
  maxPostsPerSource: 10,
  includeComments: false,
  sort: 'hot',
  keywords: [],
  delivery: 'dataset',
});

// Fetch the items stored in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
Validation & Cloud Setup
This actor follows shared store-ops conventions:
- npm test — local unit tests
- npm run canary:check — live canary validation against latest Apify run/task
- npm run contract:test:live — live dataset contract check
- npm run apify:cloud:setup — bootstrap/update Apify task + schedule from local config
Tips & Limitations
- This actor is best for research/backfill, not recurring diff alerting.
- For net-new recurring alerts + baseline snapshots, use reddit-keyword-monitor-alerts.
- 429s are common on aggressive pulls; increase delayMs and trim maxPostsPerSource.
- For links discovered in posts, use article-content-extractor for full-page content cleanup.
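The 429 tip above can be turned into a small helper that prepares a gentler input for the next attempt. This is only a sketch: delayMs is named in the tip but its default is not documented here, so the 1000 ms starting value, the `soften_input` name, and the `min_posts` floor are all assumptions.

```python
def soften_input(run_input: dict, factor: int = 2, min_posts: int = 5) -> dict:
    """Return a copy of run_input with a longer delay and a smaller post cap."""
    softened = dict(run_input)
    # 1000 ms is a placeholder starting point, not a documented default.
    softened["delayMs"] = int(run_input.get("delayMs", 1000)) * factor
    # 25 is the documented default for maxPostsPerSource.
    posts = int(run_input.get("maxPostsPerSource", 25))
    softened["maxPostsPerSource"] = max(min_posts, posts // factor)
    return softened


first_try = {"sources": ["javascript"], "maxPostsPerSource": 100, "delayMs": 500}
second_try = soften_input(first_try)
```

Applying the helper again after each rate-limited run backs off geometrically while the floor keeps the post cap from shrinking to zero.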
FAQ
Does this need a Reddit API key?
No. It uses public Reddit .json endpoints without authentication.
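For orientation, public Reddit listings can be requested as JSON by appending .json to the listing path. The sketch below only builds such URLs and makes no request; which exact endpoints and query parameters this actor uses is not stated here, so treat the `listing_url` helper and its parameter choices as assumptions.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.reddit.com"


def listing_url(kind: str, value: str, sort: str = "hot",
                time: str = "all", limit: int = 25) -> str:
    """Build a public .json listing URL for a subreddit or a search query."""
    params: dict[str, object] = {"limit": limit}
    if kind == "search":
        path = "/search.json"
        params.update({"q": value, "sort": sort, "t": time})
    elif kind == "subreddit":
        path = f"/r/{quote(value)}/{sort}.json"
        # The time range only matters for 'top' and 'controversial' sorts.
        if sort in ("top", "controversial"):
            params["t"] = time
    else:
        raise ValueError(f"unsupported kind: {kind}")
    return f"{BASE_URL}{path}?{urlencode(params)}"


top_month = listing_url("subreddit", "MachineLearning", sort="top", time="month")
search = listing_url("search", "web scraping", sort="relevance", limit=10)
```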
Can this replace recurring monitoring?
Not directly. This actor does not maintain monitoring snapshots across runs. Use reddit-keyword-monitor-alerts for net-new recurring alert workflows.
Can I scrape private subreddits?
No. Only public subreddits are accessible via public endpoints.
What is the best pack workflow?
Use this actor to gather research/backfill context, then move recurring alert operations to reddit-keyword-monitor-alerts.
Related Actors
Reddit Intelligence Pack workflow:
- 🚨 Reddit Keyword Monitor Alerts — Hero recurring monitor for net-new alerts + webhook handoff.
- 📰 Article Extractor — Extract linked article text from Reddit URLs.
- 💬 Reddit Scraper (Legacy) — Legacy/proxy-sensitive fallback, not primary entry point.
Cost
Pay Per Event:
- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.001 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.001) = $1.01
No subscription required — you only pay for what you use.
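To budget scheduled backfills, the pricing above can be folded into a small estimator. This sketch uses only the two event prices listed in this section; the `estimate_cost` name and the weekly schedule in the usage lines are illustrative.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.01")    # flat fee per run
DATASET_ITEM_FEE = Decimal("0.001")  # fee per output item


def estimate_cost(items_per_run: int, runs: int = 1) -> Decimal:
    """Estimate the total pay-per-event cost in USD."""
    return runs * (ACTOR_START_FEE + items_per_run * DATASET_ITEM_FEE)


single_run = estimate_cost(1000)         # matches the 1,000-item example above
weekly_year = estimate_cost(250, runs=52)  # hypothetical weekly schedule
```

Decimal is used instead of float so that fractions of a cent add up exactly across many runs.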
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.
Bug report or feature request? Open an issue on the Issues tab of this actor.
