👾 Reddit Data Extractor
Scrape Reddit data to train AI models or build NLP datasets. Extract posts, comments, and user details via public API endpoints with no browser required.
Pricing: Pay per event
Developer: Taro Yamada
💬 Reddit Scraper
Scrape Reddit at scale to build high-quality datasets for AI training, machine learning, and NLP applications. This developer-focused Reddit Data Extractor bypasses the overhead of a headless web browser, extracting unstructured community conversations and turning them into clean, structured data using public API endpoints. If you need to gather millions of words for text analysis or train large language models, this tool lets you extract posts, nested comments, and thread details with incredible speed.
Data scientists and AI engineers run this scraper to compile extensive linguistic datasets, analyze user sentiment across specific pages, and track digital subcultures. Instead of struggling with rate limits or complex authentication tools, you can seamlessly integrate this scraper into your existing data pipelines. Schedule it to run nightly to capture the newest discussions, or use search filters to scrape historical top posts for comprehensive analysis.
The extracted results include rich metadata essential for advanced processing. Every run yields precise details such as created_utc, score, author, selftext, and full URL links. Once your scraped data is ready, you can export the results via API to seamlessly feed your vector databases or analytical models.
Store Quickstart
Start with the Quickstart template (1 subreddit, hot, 25 posts). For sentiment analysis, use Deep Scrape with comments enabled.
Key Features
- 💬 Official Reddit JSON API — Uses old.reddit.com/r/{sub}/{sort}.json
- 🔀 Multiple sort modes — hot, new, top, rising with time filters
- 💭 Comments included — Optional nested comment extraction
- 📊 Post metadata — Score, author, subreddit, created_utc, num_comments
- 🧩 Self + link posts — Both text posts and URL submissions
- 🔑 No API key needed — Uses public JSON endpoints
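The endpoint pattern from the feature list can be sketched in a few lines of Python. The URL shape and the `data.children[].data` listing structure are the standard public Reddit JSON format; the actor's internal fetch logic is not shown here, so treat `build_listing_url` and `extract_posts` as illustrative helpers rather than the actor's actual code:

```python
def build_listing_url(subreddit: str, sort: str = "hot", limit: int = 25) -> str:
    """Build the public listing URL, e.g. old.reddit.com/r/programming/hot.json."""
    return f"https://old.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"


def extract_posts(listing: dict) -> list[dict]:
    """Pull post objects out of a Reddit listing response
    (listings nest posts under data.children[].data)."""
    return [child["data"] for child in listing.get("data", {}).get("children", [])]


url = build_listing_url("programming", "hot", 50)
print(url)  # https://old.reddit.com/r/programming/hot.json?limit=50
```

No authentication header is needed for these endpoints, which is why the actor can run without an API key.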
Use Cases
| Who | Why |
|---|---|
| Market researchers | Analyze consumer sentiment on brand/product subreddits |
| Crisis monitoring | Track negative mentions in real-time |
| Content marketers | Discover trending topics and user pain points |
| Gaming/media analysts | Monitor fan community reactions |
| Academic researchers | Collect Reddit datasets for NLP research |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| subreddits | string[] | (required) | Subreddit names (max 20) |
| sort | string | hot | hot, new, top, rising |
| maxItems | integer | 25 | Max posts per subreddit (1-500) |
| includeComments | boolean | false | Include nested comments |
Input Example
{
  "subreddits": ["programming", "technology"],
  "sort": "hot",
  "maxItems": 50,
  "includeComments": true
}
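A client-side sanity check that mirrors the constraints in the input table (1-20 subreddits, maxItems between 1 and 500, one of four sort modes) can catch mistakes before you pay for a run. The actor presumably validates input itself; `validate_input` is a hypothetical helper for your own pipeline, not part of the actor:

```python
VALID_SORTS = {"hot", "new", "top", "rising"}


def validate_input(run_input: dict) -> dict:
    """Check a run input against the constraints from the input table
    and fill in the documented defaults."""
    subs = run_input.get("subreddits") or []
    if not 1 <= len(subs) <= 20:
        raise ValueError("subreddits must contain 1-20 names")
    sort = run_input.get("sort", "hot")
    if sort not in VALID_SORTS:
        raise ValueError(f"sort must be one of {sorted(VALID_SORTS)}")
    max_items = run_input.get("maxItems", 25)
    if not 1 <= max_items <= 500:
        raise ValueError("maxItems must be between 1 and 500")
    return {
        "subreddits": subs,
        "sort": sort,
        "maxItems": max_items,
        "includeComments": bool(run_input.get("includeComments", False)),
    }
```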
Output
| Field | Type | Description |
|---|---|---|
| id | string | Reddit post ID |
| title | string | Post title |
| author | string | Username of poster |
| subreddit | string | Subreddit name |
| url | string | Permalink to post |
| score | integer | Upvote score |
| numComments | integer | Comment count |
| createdAt | string | ISO timestamp |
| selftext | string | Post body (for text posts) |
| comments | object[] | Top comments (if includeComments enabled) |
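Downstream NLP or embedding pipelines usually want flat text rows rather than nested post objects. A minimal flattening sketch, assuming the field names from the output table above (`to_text_records` is a hypothetical helper, not part of the actor's output):

```python
def to_text_records(items: list[dict]) -> list[dict]:
    """Flatten scraped posts (and optional nested comments) into
    plain-text rows suitable for sentiment analysis or embeddings."""
    records = []
    for post in items:
        text = f"{post.get('title', '')}\n{post.get('selftext', '')}".strip()
        records.append({"kind": "post", "subreddit": post.get("subreddit"), "text": text})
        for comment in post.get("comments") or []:
            records.append({
                "kind": "comment",
                "subreddit": post.get("subreddit"),
                "text": comment.get("body", ""),
            })
    return records
```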
Output Example
{
  "title": "New JavaScript framework released",
  "author": "dev_user",
  "score": 1250,
  "url": "https://example.com/framework",
  "selftext": "Detailed writeup inside...",
  "subreddit": "programming",
  "createdUtc": 1712345678,
  "numComments": 342,
  "comments": [{"author": "...", "body": "..."}]
}
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL
curl -X POST "https://api.apify.com/v2/acts/taroyamada~reddit-data-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"subreddits": ["programming", "technology"], "sort": "hot", "maxItems": 50, "includeComments": true}'
Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/reddit-data-scraper").call(run_input={
    "subreddits": ["programming", "technology"],
    "sort": "hot",
    "maxItems": 50,
    "includeComments": True,
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
JavaScript / Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/reddit-data-scraper').call({
    subreddits: ['programming', 'technology'],
    sort: 'hot',
    maxItems: 50,
    includeComments: true,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
Tips & Limitations
⚠️ Proxy Required on Apify Datacenter
Reddit blocks the majority of Apify's shared datacenter IPs. Without a proxy:
- Runs on Apify infrastructure will fail with runStatus: all_blocked and exit code 1.
- The output meta.subredditResults shows which subreddits were blocked vs. successful.
To fix: In the actor's .env, set APIFY_USE_APIFY_PROXY=true and APIFY_PROXY_GROUPS=RESIDENTIAL before running npm run apify:cloud:setup, or set PROXY_URL to your own residential proxy.
Local / home ISP runs work without a proxy; the block only affects datacenter IPs.
If you bootstrap recurring cloud tasks with npm run apify:cloud:setup, set APIFY_USE_APIFY_PROXY=true, APIFY_PROXY_GROUPS=RESIDENTIAL, and APIFY_RESTART_ON_ERROR=false in .env so the cloud run uses the internal Apify residential proxy path and does not auto-retry four identical blocked runs.
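Putting those variables together, a minimal .env for proxied scheduled cloud runs would look like this (values taken directly from the settings above):

```shell
APIFY_USE_APIFY_PROXY=true
APIFY_PROXY_GROUPS=RESIDENTIAL
APIFY_RESTART_ON_ERROR=false
```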
Other Tips
- Use sort: "top" with a time filter for high-quality content discovery.
- Set includeComments: true for sentiment analysis workflows.
- Track subreddits in your industry to spot trends and customer pain points.
FAQ
Does Reddit block this?
Yes — Reddit blocks most Apify datacenter IPs. On Apify infrastructure, runs without a proxy
will fail with runStatus: all_blocked (exit code 1) and 0 posts. Configure a residential
proxy in the actor's Proxy tab or via PROXY_URL env var. Runs from a home/ISP IP work fine.
For scheduled cloud runs, set APIFY_RESTART_ON_ERROR=false to avoid repeated retries after a
known block.
What is runStatus in the output?
| Value | Meaning |
|---|---|
| ok | All subreddits fetched successfully |
| partial | Some subreddits succeeded; others were blocked or errored |
| all_blocked | Every subreddit was blocked — no posts collected (exit code 1) |
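In a pipeline you would branch on these values rather than eyeball them. A sketch of that branching, assuming meta.subredditResults maps each subreddit name to a per-subreddit status string with "ok" for success (that shape is an assumption; only the three runStatus values above are documented):

```python
def summarize_run(meta: dict) -> str:
    """Map the actor's runStatus to an operator-facing message.
    ASSUMPTION: meta["subredditResults"] maps subreddit -> status string."""
    status = meta.get("runStatus")
    if status == "ok":
        return "all subreddits fetched"
    if status == "all_blocked":
        return "all subreddits blocked - configure a residential proxy"
    if status == "partial":
        results = meta.get("subredditResults") or {}
        blocked = sorted(s for s, r in results.items() if r != "ok")
        return f"partial run; blocked or errored: {blocked}"
    return f"unknown runStatus: {status!r}"
```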
Can I scrape private subreddits?
No. Only public subreddits accessible to unauthenticated users.
How many comments per post?
Top-level comments only, limited to ~200 per post (Reddit API default).
What's the difference from Apify's reddit-scraper-lite?
No DOM dependency, cleaner output schema, proxy fallback built-in, and honest degraded-path reporting when requests are blocked.
Related Actors
DevOps & Tech Intel cluster — explore related Apify tools:
- 🌐 DNS Propagation Checker — Check DNS propagation across 8 global resolvers (Google, Cloudflare, Quad9, OpenDNS).
- 🔍 Subdomain Finder — Discover subdomains for any domain using Certificate Transparency logs (crt.
- 🧹 CSV Data Cleaner — Clean CSV data: trim whitespace, remove empty rows, deduplicate by columns, sort.
- 📦 NPM Package Analyzer — Analyze npm packages: download stats, dependencies, licenses, deprecation status.
- GitHub Release & Changelog Monitor API — Track GitHub releases, tags, release notes, and changelog drift over time with one summary-first repository row per repo.
- Docs & Changelog Drift Monitor API — Monitor release notes, changelog pages, migration guides, and key docs pages with one summary-first target row per monitored repo, SDK, or product.
- Tech Events Calendar API | Conferences + CFP — Aggregate tech conferences and CFPs across multiple sources into a deduplicated event calendar for DevRel and recruiting workflows.
- 🔒 OSS Vulnerability Monitor — Monitor open-source packages for known security vulnerabilities using OSV and GitHub Security Advisories.
Cost
Pay Per Event:
- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.003 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01
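The arithmetic above is simple enough to fold into a budget check before launching large runs; a one-function sketch using the documented fees:

```python
def estimate_cost(items: int, start_fee: float = 0.01, per_item: float = 0.003) -> float:
    """Pay-per-event cost: flat actor-start fee plus a per-dataset-item fee."""
    return round(start_fee + items * per_item, 2)


print(estimate_cost(1000))  # 3.01
```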
No subscription required — you only pay for what you use.
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.
Bug report or feature request? Open an issue on the Issues tab of this actor.
