Reddit User Profile Posts & Comments Scraper (Apify Actor)
This Apify Actor is a production-grade Reddit user profile scraper that extracts public posts from Reddit profiles (old.reddit.com) and optionally pulls comments for each post. It supports usernames, profile URLs, and keyword searches while handling proxy fallbacks, retry logic, and live dataset saving so results are preserved even if a run is interrupted. The code is fully asynchronous, function-based, and follows Apify best practices for proxy handling, rate limiting, and structured outputs.
What this actor does
- Crawls Reddit user profiles on old.reddit.com and collects posts with detailed metadata (title, score, subreddit, permalink, created time, preview/media, flair, NSFW flags, and more).
- Supports keyword searches via `keyword:<term>` to gather posts matching a query on old Reddit search.
- Optionally fetches top-level comments for each post via Reddit's JSON endpoint, capped by `maxComments`.
- Streams every parsed item to the Apify dataset immediately (live saving) to avoid data loss.
- Implements proxy fallback logic: starts direct, then falls back to Apify datacenter proxy on block, and finally to residential proxy (stays on residential after it’s used). Logs every proxy transition.
- Adds polite randomized delays and retry/backoff behavior to reduce blocks and keep runs stable.
Why choose this Reddit scraper
- Proxy-aware by design: automatic direct → datacenter → residential fallback with clear log messages, plus sticky behavior on residential once triggered.
- Async + resilient: aiohttp-based concurrency with per-request retry/backoff and graceful error handling.
- Rich parsing: Ports robust HTML parsing from a proven standalone script, including thumbnails, flair, preview, media hints, and author flair.
- Comments optionality: Pull only the comments you need, up to a defined cap, to control cost and speed.
- Live dataset writes: Results are saved as soon as they are parsed, protecting your data if a run stops.
- Flexible inputs: Accepts usernames, profile URLs (old or new Reddit), and keyword searches (`keyword:term`) in bulk.
- Sort control: Choose `new`, `hot`, `top`, or `controversial` to match the ordering you need.
- SEO-focused output: Detailed, structured data suitable for analytics, monitoring, sentiment, and content curation pipelines.
Key features at a glance
- Targets: Reddit profiles (submitted posts) and keyword search results on old.reddit.com.
- Inputs: bulk sources, sort selection, post/comment caps, proxy config.
- Outputs: post objects with selftext, flair, preview/media info, engagement fields, and optional comments array.
- Anti-blocking: proxy fallback, jittered delays, 3x retries with exponential-style waits, block detection on status codes 403/429/503.
- Logging: verbose Apify logs for proxy transitions, pagination progress, and parse health (with HTML previews when parsing yields zero posts).
How it works (flow)
- Input parsing: The actor normalizes each `startUrls` entry into one of three kinds: `user` (username or profile URL), `keyword` (`keyword:<term>`), or `url` (treated as a user unless it is clearly a search).
- Proxy setup: Prepares Apify proxy configs for datacenter and residential. Starts in direct mode; on a block, escalates to datacenter, then residential (and remains on residential after first use). If a tier is unavailable, it logs the fact and stays at the current tier.
- Fetching: Uses aiohttp with shared headers that mimic a real browser. Each page fetch includes retry/backoff and block detection (403/429/503).
- Parsing: Reuses the proven HTML parsing functions from the original standalone scraper to extract rich post metadata from `div#thing_t3_*` elements.
- Comments (optional): If `maxComments > 0`, calls the Reddit JSON thread endpoint for each post (capped) and attaches a `comments` array.
- Live saving: Each parsed post is pushed immediately to the dataset view defined in `.actor/actor.json`.
- Pagination and limits: Continues paging until `maxPosts` per target is reached, or no posts are parsed; includes small random delays between pages.
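The input-normalization step can be sketched as follows. This is a minimal illustration with a hypothetical `classify_start_url` helper (the actor's internal names may differ), and it collapses the `url` kind into `user` for brevity:

```python
import re

def classify_start_url(entry: str):
    """Classify a startUrls entry as ('keyword' | 'user', value).

    Hypothetical helper mirroring the normalization described above;
    the real actor also distinguishes a separate 'url' kind.
    """
    entry = entry.strip()
    if entry.lower().startswith("keyword:"):
        return ("keyword", entry.split(":", 1)[1].strip())
    # Profile URLs on old or new Reddit: /user/<name> or /u/<name>
    m = re.search(r"reddit\.com/(?:user|u)/([^/?#]+)", entry)
    if m:
        return ("user", m.group(1))
    # Bare usernames, with or without the u/ prefix
    return ("user", entry[2:] if entry.startswith("u/") else entry)
```

For example, `classify_start_url("keyword:python")` yields a keyword target, while both `u/kn0thing` and a full old-Reddit profile URL normalize to the same username form.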
Input schema (actor.json)
- `startUrls` (array, required): Mixed inputs accepted.
  - Username formats: `u/example`, `example`, `https://old.reddit.com/user/example`, `https://www.reddit.com/u/example`.
  - Keyword format: `keyword:python` (scrapes old Reddit search results for that term).
- `sortOrder` (string): `new` (default) | `hot` | `top` | `controversial`.
- `maxPosts` (integer): Maximum posts per target (default 100, max 500).
- `maxComments` (integer): Maximum comments per post (default 0 to skip).
- `proxyConfiguration` (object): Standard Apify proxy config; the actor starts direct and auto-falls back. Provide an Apify token/proxy password for datacenter/residential usage.
Example input
```json
{
  "startUrls": [
    { "url": "https://old.reddit.com/user/spez" },
    { "url": "u/kn0thing" },
    { "url": "keyword:python" }
  ],
  "sortOrder": "new",
  "maxPosts": 50,
  "maxComments": 5,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
Output format
Each dataset item represents a Reddit post with fields aligned to the original HTML scraper. Highlights:
- Identity: `id`, `name`, `permalink`, `url`, `domain`, `subreddit`, `author`.
- Content: `title`, `selftext`, `selftext_html`, `post_hint`, `is_self`, `over_18`, `spoiler`.
- Engagement: `score`, `ups`, `downs` (0), `num_comments`, `gilded`.
- Flair: `link_flair_text`, `link_flair_css_class`, `link_flair_richtext`, `author_flair_text`, `author_flair_css_class`.
- Media/preview: `thumbnail`, `thumbnail_height`, `thumbnail_width`, `preview`, `media`, `secure_media`, `media_embed`.
- Moderation/status: `stickied`, `locked`, `archived`, `distinguished`, `treatment_tags`.
- Comments: `comments` (array, only if `maxComments > 0`; each has `id`, `author`, `body`, `score`, `created_utc`, `permalink`, `replies_count`).
Dataset view (from .actor/actor.json) shows key columns: author, title, subreddit, score, num_comments, permalink, created_utc, is_self, selftext.
Proxy strategy and anti-blocking
- Start direct (no proxy). If the platform blocks, escalate to datacenter proxy. If blocked again, escalate to residential and stay there.
- Block detection: HTTP 403/429/503 triggers escalation; retries include jittered sleeps (0.8–1.5s).
- Residential retries: up to 3 after switching; then fail fast with an explicit error.
- Logging: every transition is logged (`Switching to datacenter...`, `Switching to residential...`, `Residential proxy retry...`).
- Tips for reliability:
  - Provide `APIFY_TOKEN` so proxy passwords resolve automatically.
  - Prefer residential proxies for higher success on Reddit; keep the `User-Agent` as provided unless you must change it.
  - Use reasonable `maxPosts` and `maxComments` values to limit load.
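The escalation ladder described above can be sketched as a small state machine. This is an illustrative sketch with assumed names; the actor wires equivalent logic into its aiohttp fetch loop:

```python
# Minimal sketch of the direct -> datacenter -> residential escalation.
BLOCK_STATUSES = {403, 429, 503}
TIERS = ["direct", "datacenter", "residential"]

class ProxyLadder:
    def __init__(self):
        self.index = 0  # start in direct (no proxy) mode

    @property
    def tier(self) -> str:
        return TIERS[self.index]

    def on_response(self, status: int) -> bool:
        """Escalate one tier on a blocking status code.

        Residential is sticky: once reached, we never step back down;
        further blocks there fall through to the retry/fail-fast path.
        Returns True if an escalation happened.
        """
        if status in BLOCK_STATUSES and self.index < len(TIERS) - 1:
            self.index += 1
            print(f"Switching to {self.tier}...")
            return True
        return False
```

A successful (non-blocking) status leaves the ladder where it is; a 403/429/503 at the residential tier triggers the capped residential retries rather than another escalation.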
Rate limits and performance
- Concurrency: a modest semaphore for targets (default 3) and connection caps on aiohttp to avoid overloading Reddit.
- Delays: random 0.6–1.4s between pages; small jitter between retries.
- Retries: per-request retry until escalation rules are satisfied; residential retries capped at 3.
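The concurrency pattern above (a small semaphore over targets plus jittered inter-page delays) can be sketched like this; the function and variable names are assumptions, not the actor's actual internals:

```python
import asyncio
import random

async def scrape_target(sem: asyncio.Semaphore, name: str) -> str:
    """Process one target while respecting the global concurrency cap."""
    async with sem:  # at most 3 targets in flight at once
        # Polite jittered delay, as used between page fetches (0.6-1.4 s)
        await asyncio.sleep(random.uniform(0.6, 1.4))
        return f"done:{name}"

async def run_all(targets):
    sem = asyncio.Semaphore(3)  # the default target concurrency
    return await asyncio.gather(*(scrape_target(sem, t) for t in targets))
```

`asyncio.gather` preserves input order, so `asyncio.run(run_all(["a", "b"]))` returns results in the same order as the targets, regardless of which finished first.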
How to run on Apify console
- Go to your actor: Reddit User Profile Posts & Comments Scraper.
- Open the Input tab and paste your JSON (see the example above). Ensure `startUrls` is provided.
- Set `proxyConfiguration` to use Apify Proxy (datacenter or residential). Supplying `APIFY_TOKEN` or the proxy password is required for proxy use.
- Click Run. Watch the live logs for proxy transitions, pagination, and parse status.
- Results appear in the Dataset tab; export as JSON, CSV, or XLSX.
How to run locally
```shell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
# set input using local storage
mkdir -p apify_storage/key_value_stores/default
echo "{ \"startUrls\": [{\"url\": \"https://old.reddit.com/user/spez\"}], \"maxPosts\": 5, \"sortOrder\": \"new\", \"maxComments\": 0, \"proxyConfiguration\": {\"useApifyProxy\": false} }" > apify_storage/key_value_stores/default/INPUT.json
set APIFY_LOCAL_STORAGE_DIR=%CD%\apify_storage
python -m src
```
Notes:
- Without Apify proxy credentials, you may see blocks. Provide `APIFY_TOKEN` or `APIFY_PROXY_PASSWORD` to enable proxies locally.
- The logs will show if parsing finds zero posts; a short HTML preview is emitted to help debug blocks or empty pages.
Field-by-field input guidance
- startUrls (array, required)
  - Usernames: `u/someone`, `someone`, or full profile URLs (old/new Reddit).
  - Keywords: prefix with `keyword:` to trigger search mode on old Reddit.
  - Mix freely in one run; the actor will normalize each entry.
- sortOrder (string)
  - Use `new` for the freshest posts; `top` or `controversial` for engagement-centric pulls; `hot` for trending.
- maxPosts (int)
  - Cap per target; the actor stops when the limit is reached or when no more posts parse.
- maxComments (int)
  - If 0, comments are skipped (fastest). Otherwise, up to this many top-level comments are fetched from Reddit JSON.
- proxyConfiguration (object)
  - Standard Apify proxy settings. The actor auto-escalates; residential is sticky once used. Provide a token/password for proxy auth.
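For `maxComments`, the actor reads each thread's JSON listing (a two-element array of post and comments). Extracting top-level comments from that structure can be sketched as below; `top_level_comments` is a hypothetical helper shaped after the comment fields listed in the output format, not the actor's actual function:

```python
def top_level_comments(thread_json, max_comments: int):
    """Pull up to max_comments top-level comments from a Reddit thread
    JSON response ([post_listing, comment_listing])."""
    out = []
    if max_comments <= 0 or len(thread_json) < 2:
        return out
    for child in thread_json[1]["data"]["children"]:
        if child.get("kind") != "t1":  # skip "more" stubs and non-comments
            continue
        d = child["data"]
        # Reddit encodes empty replies as "" and non-empty ones as a listing dict
        replies = d.get("replies")
        out.append({
            "id": d.get("id"),
            "author": d.get("author"),
            "body": d.get("body"),
            "score": d.get("score"),
            "created_utc": d.get("created_utc"),
            "permalink": d.get("permalink"),
            "replies_count": len(replies["data"]["children"]) if isinstance(replies, dict) else 0,
        })
        if len(out) >= max_comments:
            break
    return out
```

The `kind == "t1"` check is what keeps the output to real top-level comments; "more comments" placeholders (`kind == "more"`) are skipped rather than expanded, which matches the top-level-only behavior noted above.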
Practical use cases (SEO-friendly)
- Reddit brand monitoring scraper: Track brand mentions across user posts.
- Content research: Harvest self posts with full text for sentiment and topic modeling.
- Community analysis: Analyze top posts for specific users or keywords to understand engagement patterns.
- Lead discovery: Identify active authors in niche subreddits via keyword-based pulls.
- Trend tracking: Combine `sortOrder=hot` with keyword search to follow emerging topics.
- Dataset creation: Build structured corpora of Reddit posts and optional comments for LLM fine-tuning or analytics.
Troubleshooting and tips
- Got "Input 'startUrls' is required": Ensure `startUrls` is present and non-empty in your JSON.
- Zero posts parsed: Check the logs; a preview of the HTML is printed. This is likely a block: enable proxies or switch to residential.
- Proxy errors: Provide `APIFY_TOKEN` or `APIFY_PROXY_PASSWORD`. If groups are needed, set `apifyProxyGroups` in `proxyConfiguration`.
- Slow runs: Reduce `maxPosts`/`maxComments` or lower the concurrency (edit the semaphore if you fork).
- Comment depth: Only top-level comments are fetched; a replies count is provided for context.
FAQ
Does this scrape private or suspended accounts?
No. Only publicly accessible pages on old.reddit.com are parsed.
Can it bypass Reddit rate limits?
It uses respectful delays and proxy rotation, but you should keep maxPosts reasonable and prefer residential proxies for reliability.
Why old.reddit.com?
The HTML structure is stable for scraping and matches the original parser logic for consistency with legacy outputs.
What about pagination?
The actor paginates using after tokens from parsed posts and stops when maxPosts is hit or a page yields no posts.
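Building the next page URL from the last post's fullname (e.g. `t3_abc123`) can be sketched as follows; `next_page_url` is a hypothetical helper illustrating the `after`-token pagination described above, not the actor's exact code:

```python
from typing import Optional

def next_page_url(base_url: str, last_fullname: Optional[str], count: int) -> Optional[str]:
    """Return the next old.reddit.com listing URL, or None when the last
    page yielded no posts (no fullname to continue from)."""
    if not last_fullname:
        return None
    sep = "&" if "?" in base_url else "?"
    return f"{base_url}{sep}count={count}&after={last_fullname}"
```

When a page parses zero posts there is no `after` token to continue from, so the helper returns `None` and the loop stops, which is the same stop condition the FAQ answer describes.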
Is the output schema stable?
It mirrors the original script’s rich fields; minor adjustments may occur if Reddit markup changes. The dataset view highlights primary columns.
Changelog (high level)
- Initial Apify actor: async aiohttp scraper, proxy fallback (direct → DC → RES), live dataset pushes, optional comments, detailed logging, and legacy field parity with the standalone scripts.
Support
- For Apify platform questions: https://console.apify.com/support
- For actor-specific issues, review logs (proxy transitions, HTML preview on empty parse). If blocks persist, run with residential proxy and lower request volumes.
Keywords: Reddit user profile scraper, Reddit posts scraper, Reddit comments scraper, old.reddit.com scraper, Apify actor for Reddit, Reddit keyword scraper, Reddit proxy scraping, Reddit dataset export, Reddit sentiment data, Reddit brand monitoring, Reddit content research, Reddit analytics, Reddit lead generation.