Reddit All-in-One Scraper

Pricing

Pay per event

Scrape massive historical datasets across Reddit by extracting subreddits, complex search results, post content, and deep comment trees.

Rating: 0.0 (0)

Developer: 太郎 山田

Maintained by Community

Actor stats: 1 bookmarked, 3 total users, 2 monthly active users; last modified 15 hours ago

📡 Reddit All-in-One Scraper

Reddit All-in-One Scraper is built for deep extraction and historical backfilling of public Reddit data. Data scientists, AI researchers, and analysts use it to systematically extract search results, full subreddit feeds, and nested comment trees. When you need Reddit data to build training sets or run large-scale sentiment analysis, it delivers structured output from subreddits, post URLs, user profiles, and search queries in a single run.

Unlike simple point-and-click tools, this scraper is designed for heavy-duty data collection. You can input hundreds of post URLs, run broad search queries with a chosen sort order and time range, or target specific subreddits to map out years of discussions. Users frequently schedule weekly backfills so that their local databases stay stocked with relevant threads, scores, and author details. It is a solid starting point for a social intelligence workflow.

Each run captures contextual fields such as id, author, flair, createdAt, and score, plus nested comment text when includeComments is enabled. Exporting these results gives AI models the conversational context they need and gives researchers a clear picture of community dynamics. Once your historical backfill is complete and exported, you can shift daily monitoring to reddit-keyword-monitor-alerts to catch newly published posts.

Store Quickstart

  • Start with store-input.example.json or Quickstart — Compact Backfill for a compact initial dataset.
  • Then use the research ladder from store-input.templates.json:
    1. Quickstart — Compact Backfill
    2. Recurring Research Refresh
    3. Webhook → Article Cleanup Queue
  • Side presets stay available for deeper pulls: Competitor Subreddit Backfill, Search + Comments Research, and User/Profile Backfill.
  • Move true net-new recurring alerting to reddit-keyword-monitor-alerts once the backfill dataset is useful.
  • Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.

Key Features

  • 📡 All source types — Subreddits, post URLs, user profiles, and search queries
  • 💬 Comments with depth control — Nested comment trees with configurable depth
  • 🔍 Search support — Reddit-wide search via search:your query
  • 🏷️ Keyword filtering — Filter posts by title/body keywords
  • 📊 Normalized output — Clean, flat objects for research pipelines
  • 🤝 Pack handoff — Built for backfill/research before recurring monitoring handoff
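
The depth control described above can be pictured as a bounded walk over a reply tree. The following is a minimal sketch, not the actor's implementation: the `prune_comments` helper and the `replies` key are assumptions made for illustration, and the two limits mirror the commentDepth and maxCommentsPerPost inputs.

```python
from typing import Any

Comment = dict[str, Any]


def prune_comments(comments: list[Comment], max_depth: int, max_total: int) -> list[Comment]:
    """Return a copy of the tree cut to max_depth levels and max_total comments.

    Assumption: each comment stores its children under a "replies" key.
    Depth 1 means top-level comments only.
    """
    remaining = max_total

    def walk(nodes: list[Comment], depth: int) -> list[Comment]:
        nonlocal remaining
        kept = []
        for node in nodes:
            if remaining <= 0:
                break  # overall comment budget is used up
            remaining -= 1
            # Descend only while we are above the depth limit.
            children = walk(node.get("replies", []), depth + 1) if depth < max_depth else []
            kept.append({**node, "replies": children})
        return kept

    return walk(comments, 1) if max_depth >= 1 else []


tree = [
    {"id": "a", "replies": [{"id": "b", "replies": [{"id": "c", "replies": []}]}]},
    {"id": "d", "replies": []},
]
top_level_only = prune_comments(tree, max_depth=1, max_total=50)
```

The budget is spent depth-first in this sketch, so a long first thread can consume the whole allowance; a breadth-first walk would be an equally reasonable choice.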

Use Cases

| Who | Why |
|---|---|
| Market researchers | Backfill competitor/category subreddit history |
| Analysts | Pull search + comments datasets for thematic analysis |
| Data teams | Collect profile/subreddit sources for downstream scoring |
| PM/GTM teams | Build context sets, then move to recurring monitor alerts |

Input

| Field | Type | Default | Description |
|---|---|---|---|
| sources | array | required | List of sources: subreddit name/URL, post URL, user (e.g. u/spez), user URL, or search:query. |
| maxPostsPerSource | integer | 25 | Maximum posts to collect from each subreddit, user, or search source. |
| includeComments | boolean | false | Fetch comments for each post. Increases run time. |
| maxCommentsPerPost | integer | 50 | Maximum top-level + nested comments to extract per post (when includeComments is on). |
| commentDepth | integer | 3 | How many reply levels to extract (1 = top-level only). |
| sort | string | "hot" | Sort order for subreddit and search listings. |
| time | string | "all" | Time range filter (applies when sort is 'top' or 'controversial'). |
| keywords | array | [] | Only include posts whose title or selftext contains at least one keyword (case-insensitive). Leave empty to include all. |
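
The keywords rule in the table can be expressed in a few lines. This is a sketch under one assumption: that "contains" means plain substring matching. The `matches_keywords` helper is hypothetical and only illustrates the documented behavior (case-insensitive, title or selftext, empty list keeps everything).

```python
def matches_keywords(post: dict, keywords: list[str]) -> bool:
    """Return True if the post should be kept under the keywords filter."""
    if not keywords:
        return True  # an empty list means "include all posts"
    # selftext can be null for link posts, so fall back to an empty string.
    haystack = f"{post.get('title') or ''}\n{post.get('selftext') or ''}".lower()
    return any(keyword.lower() in haystack for keyword in keywords)


post = {"title": "New ESM features in Node 22", "selftext": None}
keep = matches_keywords(post, ["esm", "deno"])
```

Running the same check locally on exported items is a cheap way to try out stricter keyword lists before starting another run.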

Input Example

{
  "sources": ["javascript", "u/spez", "search:web scraping"],
  "maxPostsPerSource": 10,
  "includeComments": false,
  "sort": "hot",
  "keywords": [],
  "delivery": "dataset"
}
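
Because sources mixes several kinds of strings, it can help to check a list locally before starting a run. The sketch below guesses at how each documented form (subreddit name or URL, post URL, u/name, user URL, search:query) might be told apart. The `classify_source` helper and its regular expressions are assumptions for illustration; the actor's own parsing rules are not documented here.

```python
import re


def classify_source(source: str) -> tuple[str, str]:
    """Return (kind, value) for one entry of the sources array."""
    s = source.strip()
    if s.lower().startswith("search:"):
        return ("search", s[len("search:"):].strip())
    # Post URLs look like .../r/<subreddit>/comments/<post id>/...
    match = re.search(r"reddit\.com/r/[^/]+/comments/([^/?#]+)", s)
    if match:
        return ("post", match.group(1))
    match = re.search(r"reddit\.com/(?:u|user)/([^/?#]+)", s)
    if match:
        return ("user", match.group(1))
    match = re.search(r"reddit\.com/r/([^/?#]+)", s)
    if match:
        return ("subreddit", match.group(1))
    match = re.fullmatch(r"/?u/([A-Za-z0-9_-]+)", s)
    if match:
        return ("user", match.group(1))
    match = re.fullmatch(r"(?:/?r/)?([A-Za-z0-9_]+)", s)
    if match:
        return ("subreddit", match.group(1))
    raise ValueError(f"Unrecognized source: {source!r}")


kinds = [classify_source(s) for s in ["javascript", "u/spez", "search:web scraping"]]
```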

Input Examples

Example: Subreddit research

{
  "sources": ["MachineLearning"],
  "sort": "top",
  "time": "month"
}

Example: User activity history

{
  "sources": ["u/anExampleUser"],
  "maxPostsPerSource": 100
}

Example: Search query

{
  "sources": ["search:claude vs gpt"],
  "sort": "relevance",
  "maxPostsPerSource": 50
}

Output

| Field | Type |
|---|---|
| meta | object |
| posts | array |
| posts[].id | string |
| posts[].subreddit | string |
| posts[].title | string |
| posts[].author | string |
| posts[].score | number |
| posts[].upvoteRatio | number |
| posts[].numComments | number |
| posts[].createdAt | timestamp |
| posts[].url | string (url) |
| posts[].permalink | string (url) |
| posts[].selftext | string |
| posts[].isSelf | boolean |
| posts[].isNsfw | boolean |
| posts[].isStickied | boolean |
| posts[].flair | string |
| posts[].domain | string |
| posts[].thumbnail | null |
| posts[].awards | number |
| posts[].sourceType | string |
| posts[].sourceValue | string |

Output Example

{
  "id": "abc123",
  "subreddit": "javascript",
  "title": "New ESM features in Node 22",
  "author": "devuser",
  "score": 842,
  "upvoteRatio": 0.96,
  "numComments": 127,
  "createdAt": "2026-01-15T12:30:00.000Z",
  "url": "https://example.com/article",
  "permalink": "https://www.reddit.com/r/javascript/comments/abc123/…",
  "selftext": null,
  "isSelf": false,
  "isNsfw": false,
  "flair": "News",
  "sourceType": "subreddit",
  "sourceValue": "javascript"
}
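
Once items like the one above are exported, a first research pass often amounts to parsing the timestamp and ranking posts. The following sketch assumes createdAt always uses the ISO format shown in the example; `parse_created_at` and `top_posts` are hypothetical helpers, not part of the actor.

```python
from datetime import datetime, timezone


def parse_created_at(value: str) -> datetime:
    """Parse a createdAt string such as 2026-01-15T12:30:00.000Z into a UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def top_posts(items: list[dict], min_ratio: float = 0.9, limit: int = 10) -> list[dict]:
    """Keep well-received posts and order them by score, highest first."""
    kept = [item for item in items if (item.get("upvoteRatio") or 0) >= min_ratio]
    return sorted(kept, key=lambda item: item.get("score") or 0, reverse=True)[:limit]


items = [
    {"id": "abc123", "score": 842, "upvoteRatio": 0.96, "createdAt": "2026-01-15T12:30:00.000Z"},
    {"id": "def456", "score": 1200, "upvoteRatio": 0.71, "createdAt": "2026-01-16T08:00:00.000Z"},
    {"id": "ghi789", "score": 950, "upvoteRatio": 0.93, "createdAt": "2026-01-17T09:15:00.000Z"},
]
ranked_ids = [item["id"] for item in top_posts(items)]
```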

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

cURL

curl -X POST "https://api.apify.com/v2/acts/taroyamada~reddit-all-in-one-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{ "sources": ["javascript", "u/spez", "search:web scraping"], "maxPostsPerSource": 10, "includeComments": false, "sort": "hot", "keywords": [], "delivery": "dataset" }'

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/reddit-all-in-one-scraper").call(run_input={
    "sources": ["javascript", "u/spez", "search:web scraping"],
    "maxPostsPerSource": 10,
    "includeComments": False,  # Python booleans are capitalized
    "sort": "hot",
    "keywords": [],
    "delivery": "dataset",
})

# Print every item stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/reddit-all-in-one-scraper').call({
    sources: ['javascript', 'u/spez', 'search:web scraping'],
    maxPostsPerSource: 10,
    includeComments: false,
    sort: 'hot',
    keywords: [],
    delivery: 'dataset',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Validation & Cloud Setup

This actor follows shared store-ops conventions:

  • npm test — local unit tests
  • npm run canary:check — live canary validation against latest Apify run/task
  • npm run contract:test:live — live dataset contract check
  • npm run apify:cloud:setup — bootstrap/update Apify task + schedule from local config

Tips & Limitations

  • This actor is best for research/backfill, not recurring diff alerting.
  • For net-new recurring alerts + baseline snapshots, use reddit-keyword-monitor-alerts.
  • 429s are common on aggressive pulls; increase delayMs and trim maxPostsPerSource.
  • For links discovered in posts, use article-content-extractor for full-page content cleanup.
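
The 429 advice above can be turned into a small input adjustment applied before a retry. This is only a sketch: `soften_input` is a hypothetical helper, the 1000 ms fallback is an assumed value (the default for delayMs is not documented here), and doubling or halving is one arbitrary back-off policy among many.

```python
def soften_input(run_input: dict, fallback_delay_ms: int = 1000) -> dict:
    """Return a gentler copy of the run input: double delayMs, halve maxPostsPerSource."""
    softened = dict(run_input)  # leave the caller's input untouched
    # fallback_delay_ms is an assumption used only when delayMs is not set.
    softened["delayMs"] = 2 * run_input.get("delayMs", fallback_delay_ms)
    # 25 is the documented default for maxPostsPerSource; never drop below 1.
    softened["maxPostsPerSource"] = max(1, run_input.get("maxPostsPerSource", 25) // 2)
    return softened


first_try = {"sources": ["javascript"], "maxPostsPerSource": 10}
second_try = soften_input(first_try)
```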

FAQ

Does this need a Reddit API key?

No. It uses public Reddit .json endpoints without authentication.
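
For orientation, public Reddit listings can be requested by appending .json to a listing path. The sketch below only builds such URLs and makes no network calls. It is an assumption about the general shape of these endpoints, not a description of the exact requests the actor sends, and both helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.reddit.com"
# Per the input table, the time filter applies only to these sort orders.
TIME_AWARE_SORTS = {"top", "controversial"}


def subreddit_listing_url(subreddit: str, sort: str = "hot", time: str = "all", limit: int = 25) -> str:
    """Build the .json URL for one page of a subreddit listing."""
    params = {"limit": limit}
    if sort in TIME_AWARE_SORTS:
        params["t"] = time
    return f"{BASE_URL}/r/{quote(subreddit)}/{sort}.json?{urlencode(params)}"


def search_url(query: str, sort: str = "relevance", time: str = "all", limit: int = 25) -> str:
    """Build the .json URL for a Reddit-wide search."""
    params = {"q": query, "sort": sort, "limit": limit}
    if sort in TIME_AWARE_SORTS:
        params["t"] = time
    return f"{BASE_URL}/search.json?{urlencode(params)}"


hot_url = subreddit_listing_url("javascript", limit=10)
```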

Can this replace recurring monitoring?

Not directly. This actor does not maintain monitoring snapshots across runs. Use reddit-keyword-monitor-alerts for net-new recurring alert workflows.

Can I scrape private subreddits?

No. Only public subreddits are accessible via public endpoints.

What is the best pack workflow?

Use this actor to gather research/backfill context, then move recurring alert operations to reddit-keyword-monitor-alerts.

Cost

Pay Per Event:

  • actor-start: $0.01 (flat fee per run)
  • dataset-item: $0.001 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.001) = $1.01
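
The same arithmetic can be wrapped in a small estimator for budgeting scheduled runs. A minimal sketch, assuming the two event prices listed above are the only charges; `estimate_cost_usd` is a hypothetical helper. Prices are kept as integer thousandths of a dollar to avoid floating-point drift.

```python
ACTOR_START_MILLS = 10  # $0.01 per run, in thousandths of a dollar
DATASET_ITEM_MILLS = 1  # $0.001 per output item


def estimate_cost_usd(items: int, runs: int = 1) -> float:
    """Estimate the pay-per-event cost for a number of output items across runs."""
    total_mills = runs * ACTOR_START_MILLS + items * DATASET_ITEM_MILLS
    return total_mills / 1000


single_run = estimate_cost_usd(1000)             # matches the worked example: 1.01
monthly = estimate_cost_usd(4 * 2500, runs=4)    # four weekly backfills of 2,500 items each
```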

No subscription required — you only pay for what you use.

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.

Bug report or feature request? Open an issue on the Issues tab of this actor.