👾 Reddit Data Extractor

Scrape Reddit data to train AI models or build NLP datasets. Extract posts, comments, and user details via public API endpoints with no browser required.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: 太郎 山田 (Maintained by Community)

Actor stats: 0 bookmarked · 4 total users · 1 monthly active user · last modified an hour ago

💬 Reddit Scraper

Scrape Reddit at scale to build high-quality datasets for AI training, machine learning, and NLP applications. This developer-focused Reddit Data Extractor skips the overhead of a headless browser, turning unstructured community conversations into clean, structured data via public JSON endpoints. Whether you need to gather millions of words for text analysis or train large language models, this tool lets you extract posts, nested comments, and thread details quickly.

Data scientists and AI engineers run this scraper to compile extensive linguistic datasets, analyze user sentiment across specific pages, and track digital subcultures. Instead of struggling with rate limits or complex authentication tools, you can seamlessly integrate this scraper into your existing data pipelines. Schedule it to run nightly to capture the newest discussions, or use search filters to scrape historical top posts for comprehensive analysis.

The extracted results include rich metadata essential for advanced processing. Every run yields precise details such as created_utc, score, author, selftext, and full URL links. Once your scraped data is ready, you can export the results via API to seamlessly feed your vector databases or analytical models.

Store Quickstart

Start with the Quickstart template (1 subreddit, hot, 25 posts). For sentiment analysis, use Deep Scrape with comments enabled.

Key Features

  • 💬 Public Reddit JSON endpoints — Uses old.reddit.com/r/{sub}/{sort}.json
  • 🔀 Multiple sort modes — hot, new, top, rising with time filters
  • 💭 Comments included — Optional nested comment extraction
  • 📊 Post metadata — Score, author, subreddit, created_utc, num_comments
  • 🧩 Self + link posts — Both text posts and URL submissions
  • 🔑 No API key needed — Uses public JSON endpoints

Use Cases

| Who | Why |
| --- | --- |
| Market researchers | Analyze consumer sentiment on brand/product subreddits |
| Crisis monitoring | Track negative mentions in real time |
| Content marketers | Discover trending topics and user pain points |
| Gaming/media analysts | Monitor fan community reactions |
| Academic researchers | Collect Reddit datasets for NLP research |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| subreddits | string[] | (required) | Subreddit names (max 20) |
| sort | string | hot | hot, new, top, rising |
| maxItems | integer | 25 | Max posts per subreddit (1-500) |
| includeComments | boolean | false | Include nested comments |

Input Example

{
  "subreddits": ["programming", "technology"],
  "sort": "hot",
  "maxItems": 50,
  "includeComments": true
}
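A hypothetical client-side check that mirrors the constraints from the input table (max 20 subreddits, maxItems 1-500, four sort modes); the actor performs its own validation server-side:

```python
# Hypothetical pre-flight validation mirroring the documented input schema.
VALID_SORTS = {"hot", "new", "top", "rising"}

def validate_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    subs = run_input.get("subreddits") or []
    if not subs:
        problems.append("subreddits is required")
    if len(subs) > 20:
        problems.append("at most 20 subreddits allowed")
    if run_input.get("sort", "hot") not in VALID_SORTS:
        problems.append("sort must be one of hot, new, top, rising")
    if not 1 <= run_input.get("maxItems", 25) <= 500:
        problems.append("maxItems must be between 1 and 500")
    return problems

print(validate_input({"subreddits": ["programming"], "maxItems": 50}))  # []
```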

Output

| Field | Type | Description |
| --- | --- | --- |
| id | string | Reddit post ID |
| title | string | Post title |
| author | string | Username of poster |
| subreddit | string | Subreddit name |
| url | string | Permalink to post |
| score | integer | Upvote score |
| numComments | integer | Comment count |
| createdUtc | integer | Unix epoch timestamp (UTC) |
| selftext | string | Post body (for text posts) |
| comments | object[] | Top comments (if includeComments enabled) |

Output Example

{
  "title": "New JavaScript framework released",
  "author": "dev_user",
  "score": 1250,
  "url": "https://example.com/framework",
  "selftext": "Detailed writeup inside...",
  "subreddit": "programming",
  "createdUtc": 1712345678,
  "numComments": 342,
  "comments": [{"author": "...", "body": "..."}]
}
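For NLP pipelines, each output item can be flattened into a single text record. This is a sketch assuming the fields shown in the example above; `to_training_record` is a hypothetical helper, not part of the actor's output:

```python
# Sketch: flatten one scraped item (fields as in the output example above)
# into a text record for NLP pipelines. Hypothetical helper.
from datetime import datetime, timezone

def to_training_record(item: dict) -> dict:
    """Combine title, body, and comments into one text field with metadata."""
    body = item.get("selftext", "")
    comments = " ".join(c.get("body", "") for c in item.get("comments", []))
    return {
        "text": f"{item['title']}\n{body}\n{comments}".strip(),
        "subreddit": item["subreddit"],
        # createdUtc is a Unix epoch; convert to ISO for downstream tools
        "created_iso": datetime.fromtimestamp(
            item["createdUtc"], tz=timezone.utc).isoformat(),
        "score": item["score"],
    }
```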

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

cURL

curl -X POST "https://api.apify.com/v2/acts/taroyamada~reddit-data-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{ "subreddits": ["programming", "technology"], "sort": "hot", "maxItems": 50, "includeComments": true }'

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/reddit-data-scraper").call(run_input={
    "subreddits": ["programming", "technology"],
    "sort": "hot",
    "maxItems": 50,
    "includeComments": True,  # Python booleans are capitalized
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/reddit-data-scraper').call({
    subreddits: ['programming', 'technology'],
    sort: 'hot',
    maxItems: 50,
    includeComments: true,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Tips & Limitations

⚠️ Proxy Required on Apify Datacenter

Reddit blocks the majority of Apify's shared datacenter IPs. Without a proxy:

  • Runs on Apify infrastructure will fail with runStatus: all_blocked and exit code 1.
  • The output meta.subredditResults shows which subreddits were blocked vs. successful.

To fix: In the actor's .env, set APIFY_USE_APIFY_PROXY=true and APIFY_PROXY_GROUPS=RESIDENTIAL before running npm run apify:cloud:setup, or set PROXY_URL to your own residential proxy.

Local / home ISP runs work without a proxy; the block only affects datacenter IPs.

If you bootstrap recurring cloud tasks with npm run apify:cloud:setup, set APIFY_USE_APIFY_PROXY=true, APIFY_PROXY_GROUPS=RESIDENTIAL, and APIFY_RESTART_ON_ERROR=false in .env so the cloud run uses Apify's residential proxy and does not auto-retry identical blocked runs.
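Putting the variables above together, a .env for a scheduled cloud run might look like this (the PROXY_URL value is illustrative; substitute your own proxy credentials):

```shell
# Example .env for a scheduled cloud run (variables from the tips above)
APIFY_USE_APIFY_PROXY=true
APIFY_PROXY_GROUPS=RESIDENTIAL
APIFY_RESTART_ON_ERROR=false
# Alternatively, use your own residential proxy instead of Apify's:
# PROXY_URL=http://user:pass@proxy.example.com:8000
```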

Other Tips

  • Use sort: "top" with a time filter for high-quality content discovery.
  • Set includeComments: true for sentiment analysis workflows.
  • Track subreddits in your industry to spot trends and customer pain points.

FAQ

Does Reddit block this?

Yes — Reddit blocks most Apify datacenter IPs. On Apify infrastructure, runs without a proxy will fail with runStatus: all_blocked (exit code 1) and 0 posts. Configure a residential proxy in the actor's Proxy tab or via PROXY_URL env var. Runs from a home/ISP IP work fine. For scheduled cloud runs, set APIFY_RESTART_ON_ERROR=false to avoid repeated retries after a known block.

What is runStatus in the output?

| Value | Meaning |
| --- | --- |
| ok | All subreddits fetched successfully |
| partial | Some subreddits succeeded; others were blocked or errored |
| all_blocked | Every subreddit was blocked; no posts collected (exit code 1) |
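A post-run check might branch on these values like the sketch below. It assumes meta.subredditResults maps each subreddit name to a status string (the exact shape is not documented here, so treat it as an illustration):

```python
# Hypothetical post-run check based on the runStatus values above.
# Assumes subredditResults maps subreddit name -> status string.
def summarize_run(meta: dict) -> str:
    """Return a one-line summary of a run's outcome."""
    status = meta.get("runStatus")
    if status == "ok":
        return "all subreddits fetched"
    if status == "partial":
        blocked = [sub for sub, result in meta.get("subredditResults", {}).items()
                   if result != "ok"]
        return f"partial: blocked/errored -> {', '.join(blocked)}"
    return "all blocked: configure a residential proxy and re-run"
```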

Can I scrape private subreddits?

No. Only public subreddits accessible to unauthenticated users.

How many comments per post?

Top-level comments only, limited to ~200 per post (Reddit API default).

What's the difference from Apify's reddit-scraper-lite?

No DOM dependency, cleaner output schema, proxy fallback built-in, and honest degraded-path reporting when requests are blocked.


Cost

Pay Per Event:

  • actor-start: $0.01 (flat fee per run)
  • dataset-item: $0.003 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01
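The arithmetic above can be wrapped in a tiny estimator (a convenience sketch using the two event prices listed, not an official billing tool):

```python
# The pay-per-event arithmetic above, as a tiny estimator.
ACTOR_START_USD = 0.01   # flat fee per run
PER_ITEM_USD = 0.003     # per output dataset item

def estimate_cost(items: int, runs: int = 1) -> float:
    """Estimated charge in USD for `runs` runs yielding `items` total items."""
    return round(runs * ACTOR_START_USD + items * PER_ITEM_USD, 2)

print(estimate_cost(1000))  # 3.01
```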

No subscription required — you only pay for what you use.

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and supports continued updates.

Bug report or feature request? Open an issue on the Issues tab of this actor.