Reddit Subreddit Members Scraper
Pricing
from $3.99 / 1,000 results
Rating: 0.0 (0)
Developer: ScraperX (maintained by Community)
Actor stats
Bookmarked: 0
Total users: 1
Monthly active users: 0
Last modified: 2 days ago
Reddit Subreddit Members Scraper — Async Apify Actor with Smart Proxy Fallback
This actor collects Reddit usernames from subreddits, keyword searches, or direct user inputs, then optionally enriches each profile with public metadata. It is fully asynchronous (aiohttp), pushes results to the Apify dataset as they are found, and ships with a multi-step proxy strategy: start direct, fall back to the Apify datacenter proxy if blocked, then escalate to residential and stay there. This README is long-form (≈1500 words) and SEO-optimized for queries like “Reddit scraper”, “Reddit members extractor”, “Apify actor Reddit”, “Reddit subreddit members”, “Reddit user scraper”, “Reddit proxy scraping”, and “async Python Reddit scraper”.
Why This Actor
- Multi-input flexibility: Accepts subreddit URLs (https://www.reddit.com/r/python), r/<sub>, user URLs (https://www.reddit.com/user/spez), u/<user>, or free-text keywords to search Reddit posts.
- Post + comment coverage: Gathers authors from both posts and comments with pagination, respecting maxPosts, maxComments, and sort_order (new, hot, top, rising).
- Optional enrichment: When fetchDetails=true, calls /user/{username}/about.json to add karma, account creation time, gold status, and avatar.
- Proxy escalation: Direct → Apify datacenter → Apify residential (sticky after switch). Logs every transition; adds extra retries on residential.
- Live data safety: Pushes each found user immediately to the Apify dataset so partial results are preserved on crash or block.
- Async and rate-limited: Uses aiohttp with semaphores and jittered sleeps to reduce throttling; rotates User-Agent per request.
- Function-based code: No classes, easier to maintain and extend.
- Production-ready defaults: Conservative timeouts, retry logic, and structured dataset view.
What It Scrapes
- Usernames from subreddit posts and comments.
- Profile URL for each discovered user.
- Optional user details (when enabled): total karma, post karma, comment karma, created UTC, gold flag, avatar URL.
- Source coverage:
  - Subreddit posts (sorted by sort_order)
  - Subreddit comments
  - Reddit search results for provided keywords (post authors)
  - Direct user inputs (usernames/URLs)
Proxy and Anti-Blocking Strategy
- Start direct (no proxy) — fastest and cheapest.
- If blocked (403/429/503/timeout) → switch to Apify datacenter proxy.
- If still blocked → switch to Apify residential proxy, stick to residential for the rest of the run, and allow 3 extra retries there.
- Logged events — every escalation is written to Apify logs with reason and target URL.
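The escalation ladder above can be sketched as a few small functions. This is a minimal illustration, not the actor's actual code: the names TIERS, is_blocked, next_tier, and retry_budget are hypothetical, and the base retry count of 3 is an assumption (the README only states that residential gets 3 extra retries).

```python
# Hypothetical names; the actor's real identifiers may differ.
TIERS = ("direct", "datacenter", "residential")
BLOCK_STATUSES = {403, 429, 503}   # statuses the README treats as "blocked"
BASE_RETRIES = 3                   # assumed base budget, not stated in the README
RESIDENTIAL_EXTRA_RETRIES = 3      # "allow 3 extra retries there"


def is_blocked(status, timed_out=False):
    """Return True when a response (or a timeout) should trigger escalation."""
    return timed_out or status in BLOCK_STATUSES


def next_tier(current):
    """Move one step up the ladder; residential is sticky and never steps back."""
    index = TIERS.index(current)
    return TIERS[min(index + 1, len(TIERS) - 1)]


def retry_budget(tier):
    """Retries allowed on a tier; only residential gets the extra attempts."""
    extra = RESIDENTIAL_EXTRA_RETRIES if tier == "residential" else 0
    return BASE_RETRIES + extra
```

A request loop would call is_blocked after each response and, when it returns True, replace the current tier with next_tier(current) and log the transition together with the reason and target URL.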
Input Parameters (actor.json)
- targets (array, required): Mixed inputs. Supports:
  - Subreddit URLs (https://www.reddit.com/r/python) or r/<subreddit>
  - User URLs (https://www.reddit.com/user/spez) or u/<user>
  - Free-text keywords (treated as Reddit search queries)
- subreddits (array, optional): Plain subreddit names if you prefer not to use targets.
- sort_order (string, enum): new | hot | top | rising (defaults to new).
- maxPosts (integer): 1–1000, default 100. Posts per subreddit or per keyword search.
- maxComments (integer): 0–1000, default 100. Comments per subreddit.
- fetchDetails (boolean): If true, enrich each user via /user/{username}/about.json.
- maxConcurrentUsers (integer): 1–20, default 3. Concurrency for user detail fetches.
- requestDelay (integer): 0–10 seconds, default 1. Added between detail calls; jittered by +0–0.5s.
- proxyConfiguration (object): Apify proxy editor. Default useApifyProxy=false (direct). If Reddit blocks, the actor escalates automatically to datacenter, then residential.
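To make the mixed targets input concrete, here is one way the three input kinds could be told apart. It is a sketch under assumptions: classify_target and both regular expressions are hypothetical, and the actor's own parsing rules may be stricter or looser.

```python
import re

# Hypothetical patterns: an optional reddit.com prefix, then r/<name> or u/<name>.
SUBREDDIT_RE = re.compile(
    r"^(?:https?://(?:www\.)?reddit\.com)?/?r/([A-Za-z0-9_]+)/?$"
)
USER_RE = re.compile(
    r"^(?:https?://(?:www\.)?reddit\.com)?/?(?:u|user)/([A-Za-z0-9_-]+)/?$"
)


def classify_target(item):
    """Return (kind, value), where kind is 'subreddit', 'user', or 'keyword'."""
    # The request-list editor yields {"url": ...} objects; plain strings also work.
    raw = item["url"] if isinstance(item, dict) else item
    text = raw.strip()
    match = SUBREDDIT_RE.match(text)
    if match:
        return ("subreddit", match.group(1))
    match = USER_RE.match(text)
    if match:
        return ("user", match.group(1))
    # Anything that is neither a subreddit nor a user is treated as a search query.
    return ("keyword", text)
```

Applied to the example input in this README, the four targets would classify as two subreddits (python, webscraping), one keyword (asyncio), and one user (spez).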
Example Input (balanced)
{
  "targets": [
    { "url": "https://www.reddit.com/r/python" },
    { "url": "r/webscraping" },
    { "url": "asyncio" },
    { "url": "u/spez" }
  ],
  "sort_order": "new",
  "maxPosts": 50,
  "maxComments": 50,
  "fetchDetails": true,
  "maxConcurrentUsers": 3,
  "requestDelay": 1,
  "proxyConfiguration": { "useApifyProxy": false }
}
Output Schema (dataset)
Each pushed item (basic):
{
  "username": "example_user",
  "userId": "t2_1700000000000",
  "profileUrl": "https://reddit.com/user/example_user"
}
When fetchDetails=true, additional fields appear:
{
  "username": "example_user",
  "userId": "t2_1700000000000",
  "profileUrl": "https://reddit.com/user/example_user",
  "totalKarma": 1234,
  "postKarma": 900,
  "commentKarma": 334,
  "createdUTC": 1600000000,
  "isGold": false,
  "iconImg": "https://styles.redditmedia.com/..."
}
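The enriched fields could be derived from the about.json response roughly as follows. This is a sketch, not the actor's code: build_user_item is a hypothetical helper, and the source keys (total_karma, link_karma, comment_karma, created_utc, is_gold, icon_img under a top-level data object) are an assumption about the payload shape. userId is left out because this README does not say how the placeholder is generated.

```python
def build_user_item(username, about=None):
    """Build a dataset item; `about` is the parsed about.json dict, if fetched."""
    item = {
        "username": username,
        "profileUrl": f"https://reddit.com/user/{username}",
    }
    # Assumed payload shape: {"data": {...profile fields...}}.
    data = (about or {}).get("data", {})
    if data:
        item.update({
            "totalKarma": data.get("total_karma"),
            "postKarma": data.get("link_karma"),
            "commentKarma": data.get("comment_karma"),
            "createdUTC": data.get("created_utc"),
            "isGold": data.get("is_gold"),
            "iconImg": data.get("icon_img"),
        })
    return item
```

Using .get for every field means a profile that returns minimal data still produces an item, with None in place of whatever is missing.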
How It Works (Architecture)
- Async requests: aiohttp with rotating User-Agent and per-request proxy resolution.
- Proxy state machine: Direct → Datacenter → Residential; sticky on residential with extra retries.
- Post and comment pagination: Uses after cursors up to maxPosts/maxComments.
- Keyword search: Queries /search.json with type=link, collects authors from posts.
- Live persistence: Each user is pushed with Actor.push_data(...) immediately; no batching required.
- Concurrency control: Semaphore on user-detail fetches to respect maxConcurrentUsers.
- Jittered delays: Small random sleep after a few requests and between user detail calls to reduce 429s.
- Resilience: Escalates proxies on 403/429/503/timeouts, logs failures, continues collecting partial data.
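Cursor-based pagination with after can be illustrated with the short loop below. It is a sketch under assumptions: collect_authors is a hypothetical name, and fetch_page stands in for whatever aiohttp call the actor makes, so it is injected as a parameter and expected to return a parsed listing of the form {"data": {"children": [...], "after": ...}}.

```python
import asyncio


async def collect_authors(fetch_page, limit, page_size=100):
    """Walk a listing page by page until `limit` items or the last page.

    fetch_page(after, count) is an injected coroutine (hypothetical) that
    returns the parsed listing JSON for one page.
    """
    authors = []
    after = None
    seen = 0
    while seen < limit:
        # Never request more than the per-page cap or the remaining budget.
        listing = await fetch_page(after, min(page_size, limit - seen))
        data = listing.get("data", {})
        children = data.get("children", [])
        if not children:
            break
        for child in children:
            authors.append(child["data"].get("author"))
        seen += len(children)
        after = data.get("after")
        if after is None:  # no cursor means this was the last page
            break
    return authors[:limit]
```

Passing the fetcher in keeps the loop independent of the HTTP layer, so proxy escalation and retries can live inside fetch_page without touching the pagination logic.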
Anti-Blocking Tips
- Keep fetchDetails=false if you only need usernames; this reduces calls and blocks.
- Lower maxPosts/maxComments for aggressive subreddits.
- Increase requestDelay to 2–3s when fetching details at scale.
- Start with direct; allow the actor to escalate naturally. If you expect heavy blocking, set useApifyProxy=true to begin on datacenter.
- Prefer fewer concurrent detail requests (maxConcurrentUsers 2–4) if seeing rate limits.
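How maxConcurrentUsers and requestDelay interact is easier to see in code. The sketch below assumes a hypothetical enrich_users helper and an injected fetch_details coroutine; it shows the general semaphore-plus-jitter pattern rather than the actor's exact implementation.

```python
import asyncio
import random


async def enrich_users(usernames, fetch_details, max_concurrent=3,
                       request_delay=1.0, jitter=0.5):
    """Fetch details for each user with bounded concurrency and jittered pauses."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(name):
        async with semaphore:
            result = await fetch_details(name)
            # Hold the slot for the base delay plus 0..jitter seconds, so at most
            # max_concurrent detail calls are started per delay window.
            await asyncio.sleep(request_delay + random.uniform(0, jitter))
            return result

    # gather preserves input order in its result list.
    return await asyncio.gather(*(worker(name) for name in usernames))
```

With these defaults, throughput is roughly max_concurrent users per request_delay seconds, which is why raising the delay or lowering the concurrency both reduce pressure on Reddit.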
Setup and Local Run
Requirements: Python 3.10+, Apify CLI, Docker (for full build).
cd Reddit-Subreddit-Members-Scraper
pip install -r requirements.txt
apify run
Supply input via INPUT.json or Apify UI. Results appear in the default dataset.
Deploy on Apify
- Push: apify push
- Run in the Apify Console with your input.
- Monitor logs for proxy transitions and counts.
- Download results from the Dataset tab (JSON/CSV/Excel).
Field-by-Field Input Guide
- targets: Best entrypoint; mix subreddits, users, and keywords. Use request-list editor in Apify UI.
- subreddits: Convenience fallback; plain names.
- sort_order: Choose new for freshness, hot for trending, top for high-signal authors, rising for early discoveries.
- maxPosts / maxComments: Balance coverage vs. speed; Reddit caps per page at 100.
- fetchDetails: Enable only when you need karma/created/avatar; otherwise stay off for speed.
- maxConcurrentUsers: Tune to reduce 429s; 3–5 is usually safe.
- requestDelay: Increase if blocked while enriching.
- proxyConfiguration: Leave off by default. The actor will escalate automatically when blocked.
Data Quality Notes
- Deleted or suspended authors are skipped ([deleted]).
- Some profiles may block or return minimal data; enrichment may be partial.
- userId is a generated placeholder (Reddit API hides the true ID through these endpoints).
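Skipping deleted authors might look like the filter below. clean_authors is a hypothetical helper. The de-duplication step, and the choice to compare usernames case-insensitively, are assumptions added for illustration; this README only states that [deleted] authors are skipped.

```python
def clean_authors(authors):
    """Drop missing or [deleted] authors and keep the first occurrence of each."""
    seen = set()
    result = []
    for author in authors:
        if not author or author == "[deleted]":
            continue
        key = author.lower()  # assumption: usernames compare case-insensitively
        if key in seen:
            continue
        seen.add(key)
        result.append(author)
    return result
```

Running the filter before pushing to the dataset keeps one item per user even when the same author appears in both posts and comments.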
Performance Tips
- Use smaller maxPosts and maxComments across more subreddits to diversify results.
- For large runs with details, consider running during off-peak hours and upping delay to 2–3s.
- Keywords can be noisy; combine with subreddits for higher relevance.
Error Handling and Logging
- HTTP 403/429/503/Timeout → proxy escalation with log entry.
- Other HTTP → warning + limited retries.
- Exceptions → logged with stack trace; run continues where possible.
- Completion → summary log to check dataset.
Security and Compliance
- Scrapes only public Reddit endpoints.
- Respect Reddit’s terms and local laws; do not spam users with scraped data.
- Proxies are used solely for block avoidance; residential escalation is logged.
Extending the Actor
- Add score or upvote thresholds: adjust post fetch URL parameters.
- Add top time windows: append t=day/week/month/year/all on top sorting.
- Add subreddit filters: pre-validate allowed subs list.
- Add CSV export: post-process dataset with Apify transformations or client-side script.
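As an example of the time-window extension, a listing URL builder could append t only when sorting by top. build_listing_url and its parameters are hypothetical, and the URL layout (/r/<sub>/<sort>.json with limit and after query parameters) is an assumption about how the actor addresses Reddit's public JSON listings.

```python
from urllib.parse import urlencode

# Windows listed in this README for top sorting.
TIME_WINDOWS = {"day", "week", "month", "year", "all"}


def build_listing_url(subreddit, sort_order="new", limit=100,
                      after=None, time_window=None):
    """Build a subreddit listing URL; `t` is only added for top sorting."""
    params = {"limit": limit}
    if after:
        params["after"] = after
    if sort_order == "top" and time_window:
        if time_window not in TIME_WINDOWS:
            raise ValueError(f"unsupported time window: {time_window}")
        params["t"] = time_window
    return f"https://www.reddit.com/r/{subreddit}/{sort_order}.json?{urlencode(params)}"
```

A new optional input field could carry time_window, leaving existing runs unchanged because the parameter is ignored for every sort order except top.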
Troubleshooting FAQ
Q: I get empty results.
A: Check inputs; ensure subreddits exist. Lower maxPosts/maxComments, set sort_order=new, and allow proxy escalation.
Q: Blocks persist even on residential.
A: Increase requestDelay, reduce maxConcurrentUsers, and lower volume. Consider splitting runs by subreddit batches.
Q: Enrichment is slow.
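A: Disable fetchDetails if you only need usernames. Otherwise lower requestDelay or raise maxConcurrentUsers (up to 20), as long as you are not seeing 429s. Details require one call per user, so run time grows with the number of users found.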
Q: Dataset fields missing.
A: The karma, createdUTC, isGold, and iconImg fields only appear when fetchDetails=true. Basic mode pushes username, userId, and profileUrl only.
SEO Snapshot (keywords covered)
- Reddit scraper, Reddit members scraper, Reddit user extractor, Reddit subreddit members, Reddit profile scraper, Apify Reddit scraper, Python aiohttp Reddit scraper, Reddit proxy scraping, Reddit residential proxy, Reddit datacenter proxy, async Reddit scraper, Reddit dataset export.
Quick Start (TL;DR)
- Provide targets with subreddits/users/keywords.
- Set maxPosts, maxComments, sort_order.
- Decide on fetchDetails.
- Leave proxy off; the actor will escalate if blocked.
- Run and grab results from the dataset.
Changelog
- v0.1: Initial public actor with async fetch, post/comment collection, keyword search, optional user details, and direct→datacenter→residential proxy fallback.
Built to stay resilient, transparent, and efficient for Reddit member discovery on Apify. Run it, watch the logs for proxy events, and export the dataset when done.