Reddit User Profile Posts & Comments Scraper (Apify Actor)
This Apify Actor is a production-grade Reddit user profile scraper that extracts public posts from Reddit profiles (old.reddit.com) and optionally pulls comments for each post. It supports usernames, profile URLs, and keyword searches while handling proxy fallbacks, retry logic, and live dataset saving so results are preserved even if a run is interrupted. The code is fully asynchronous, function-based, and follows Apify best practices for proxy handling, rate limiting, and structured outputs.
What this actor does
- Crawls Reddit user profiles on old.reddit.com and collects posts with detailed metadata (title, score, subreddit, permalink, created time, preview/media, flair, NSFW flags, and more).
- Supports keyword searches via `keyword:<term>` to gather posts matching a query on old Reddit search.
- Optionally fetches top-level comments for each post via Reddit's JSON endpoint, capped by `maxComments`.
- Streams every parsed item to the Apify dataset immediately (live saving) to avoid data loss.
- Implements proxy fallback logic: starts direct, then falls back to Apify datacenter proxy on block, and finally to residential proxy (stays on residential after it’s used). Logs every proxy transition.
- Adds polite randomized delays and retry/backoff behavior to reduce blocks and keep runs stable.
Why choose this Reddit scraper
- Proxy-aware by design: automatic direct → datacenter → residential fallback with clear log messages, plus sticky behavior on residential once triggered.
- Async + resilient: aiohttp-based concurrency with per-request retry/backoff and graceful error handling.
- Rich parsing: Ports robust HTML parsing from a proven standalone script, including thumbnails, flair, preview, media hints, and author flair.
- Comments optionality: Pull only the comments you need, up to a defined cap, to control cost and speed.
- Live dataset writes: Results are saved as soon as they are parsed, protecting your data if a run stops.
- Flexible inputs: Accepts usernames, profile URLs (old or new Reddit), and keyword searches (`keyword:term`) in bulk.
- Sort control: Choose `new`, `hot`, `top`, or `controversial` to match the ordering you need.
- SEO-focused output: Detailed, structured data suitable for analytics, monitoring, sentiment, and content curation pipelines.
Key features at a glance
- Targets: Reddit profiles (submitted posts) and keyword search results on old.reddit.com.
- Inputs: bulk sources, sort selection, post/comment caps, proxy config.
- Outputs: post objects with selftext, flair, preview/media info, engagement fields, and optional comments array.
- Anti-blocking: proxy fallback, jittered delays, 3x retries with exponential-style waits, block detection on status codes 403/429/503.
- Logging: verbose Apify logs for proxy transitions, pagination progress, and parse health (with HTML previews when parsing yields zero posts).
How it works (flow)
- Input parsing: The actor normalizes each `startUrls` entry into one of three kinds: `user` (username or profile URL), `keyword` (`keyword:<term>`), or `url` (treated as a user unless it is clearly a search).
- Proxy setup: Prepares Apify proxy configs for datacenter and residential. Starts in direct mode; on a block, escalates to datacenter, then residential (and remains on residential after first use). If a tier is unavailable, it logs the fact and stays at the current tier.
- Fetching: Uses aiohttp with shared headers that mimic a real browser. Each page fetch includes retry/backoff and block detection (403/429/503).
- Parsing: Reuses the proven HTML parsing functions from the original standalone scraper to extract rich post metadata from `div#thing_t3_*` elements.
- Comments (optional): If `maxComments > 0`, calls the Reddit JSON thread endpoint for each post (capped) and attaches a `comments` array.
- Live saving: Each parsed post is pushed immediately to the dataset view defined in `.actor/actor.json`.
- Pagination and limits: Continues paging until `maxPosts` per target is reached, or no posts are parsed; includes small random delays between pages.
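The input-normalization step can be sketched as follows. This is a minimal illustration with a hypothetical `classify_start_url` helper (the actor's internal names may differ), and it collapses the `url` kind into `user` for brevity:

```python
import re

def classify_start_url(entry: str):
    """Classify a startUrls entry as ('keyword' | 'user', value).

    Hypothetical helper mirroring the normalization described above;
    the real actor also distinguishes a separate 'url' kind.
    """
    entry = entry.strip()
    if entry.lower().startswith("keyword:"):
        return ("keyword", entry.split(":", 1)[1].strip())
    # Profile URLs on old or new Reddit: /user/<name> or /u/<name>
    m = re.search(r"reddit\.com/(?:user|u)/([^/?#]+)", entry)
    if m:
        return ("user", m.group(1))
    # Bare usernames, with or without the u/ prefix
    return ("user", entry[2:] if entry.startswith("u/") else entry)
```

For example, `classify_start_url("keyword:python")` yields a keyword target, while both `u/kn0thing` and a full old-Reddit profile URL normalize to the same username form.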
Input schema (actor.json)
- `startUrls` (array, required): Mixed inputs accepted.
  - Username formats: `u/example`, `example`, `https://old.reddit.com/user/example`, `https://www.reddit.com/u/example`.
  - Keyword format: `keyword:python` (scrapes old Reddit search results for that term).
- `sortOrder` (string): `new` (default) | `hot` | `top` | `controversial`.
- `maxPosts` (integer): Maximum posts per target (default 100, max 500).
- `maxComments` (integer): Maximum comments per post (default 0 to skip).
- `proxyConfiguration` (object): Standard Apify proxy config; the actor starts direct and auto-falls back. Provide an Apify token/proxy password for datacenter/residential usage.
Example input
```json
{
  "startUrls": [
    { "url": "https://old.reddit.com/user/spez" },
    { "url": "u/kn0thing" },
    { "url": "keyword:python" }
  ],
  "sortOrder": "new",
  "maxPosts": 50,
  "maxComments": 5,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
Output format
Each dataset item represents a Reddit post with fields aligned to the original HTML scraper. Highlights:
- Identity: `id`, `name`, `permalink`, `url`, `domain`, `subreddit`, `author`.
- Content: `title`, `selftext`, `selftext_html`, `post_hint`, `is_self`, `over_18`, `spoiler`.
- Engagement: `score`, `ups`, `downs` (0), `num_comments`, `gilded`.
- Flair: `link_flair_text`, `link_flair_css_class`, `link_flair_richtext`, `author_flair_text`, `author_flair_css_class`.
- Media/preview: `thumbnail`, `thumbnail_height`, `thumbnail_width`, `preview`, `media`, `secure_media`, `media_embed`.
- Moderation/status: `stickied`, `locked`, `archived`, `distinguished`, `treatment_tags`.
- Comments: `comments` (array, only if `maxComments > 0`; each has `id`, `author`, `body`, `score`, `created_utc`, `permalink`, `replies_count`).
Dataset view (from .actor/actor.json) shows key columns: author, title, subreddit, score, num_comments, permalink, created_utc, is_self, selftext.
Proxy strategy and anti-blocking
- Start direct (no proxy). If the platform blocks, escalate to datacenter proxy. If blocked again, escalate to residential and stay there.
- Block detection: HTTP 403/429/503 triggers escalation; retries include jittered sleeps (0.8–1.5s).
- Residential retries: up to 3 after switching; then fail fast with an explicit error.
- Logging: every transition is logged (`Switching to datacenter...`, `Switching to residential...`, `Residential proxy retry...`).
- Tips for reliability:
  - Provide `APIFY_TOKEN` so proxy passwords resolve automatically.
  - Prefer residential proxies for higher success on Reddit; keep the `User-Agent` as provided unless you must change it.
  - Use reasonable `maxPosts` and `maxComments` values to limit load.
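The escalation ladder described above can be sketched as a small state machine. This is an illustrative sketch with assumed names; the actor wires equivalent logic into its aiohttp fetch loop:

```python
# Minimal sketch of the direct -> datacenter -> residential escalation.
BLOCK_STATUSES = {403, 429, 503}
TIERS = ["direct", "datacenter", "residential"]

class ProxyLadder:
    def __init__(self):
        self.index = 0  # start in direct (no proxy) mode

    @property
    def tier(self) -> str:
        return TIERS[self.index]

    def on_response(self, status: int) -> bool:
        """Escalate one tier on a blocking status code.

        Residential is sticky: once reached, we never step back down;
        further blocks there fall through to the retry/fail-fast path.
        Returns True if an escalation happened.
        """
        if status in BLOCK_STATUSES and self.index < len(TIERS) - 1:
            self.index += 1
            print(f"Switching to {self.tier}...")
            return True
        return False
```

A successful (non-blocking) status leaves the ladder where it is; a 403/429/503 at the residential tier triggers the capped residential retries rather than another escalation.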
Rate limits and performance
- Concurrency: a modest semaphore for targets (default 3) and connection caps on aiohttp to avoid overloading Reddit.
- Delays: random 0.6–1.4s between pages; small jitter between retries.
- Retries: per-request retry until escalation rules are satisfied; residential retries capped at 3.
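The concurrency pattern above (a small semaphore over targets plus jittered inter-page delays) can be sketched like this; the function and variable names are assumptions, not the actor's actual internals:

```python
import asyncio
import random

async def scrape_target(sem: asyncio.Semaphore, name: str) -> str:
    """Process one target while respecting the global concurrency cap."""
    async with sem:  # at most 3 targets in flight at once
        # Polite jittered delay, as used between page fetches (0.6-1.4 s)
        await asyncio.sleep(random.uniform(0.6, 1.4))
        return f"done:{name}"

async def run_all(targets):
    sem = asyncio.Semaphore(3)  # the default target concurrency
    return await asyncio.gather(*(scrape_target(sem, t) for t in targets))
```

`asyncio.gather` preserves input order, so `asyncio.run(run_all(["a", "b"]))` returns results in the same order as the targets, regardless of which finished first.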
How to run on Apify console
- Go to your actor: Reddit User Profile Posts & Comments Scraper.
- Open the Input tab and paste your JSON (see the example above). Ensure `startUrls` is provided.
- Set `proxyConfiguration` to use Apify Proxy (datacenter or residential). Supplying `APIFY_TOKEN` or the proxy password is required for proxy use.
- Click Run. Watch the live logs for proxy transitions, pagination, and parse status.
- Results appear in the Dataset tab; export as JSON, CSV, or XLSX.
How to run locally
```shell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
# set input using local storage
mkdir -p apify_storage/key_value_stores/default
echo "{ \"startUrls\": [{\"url\": \"https://old.reddit.com/user/spez\"}], \"maxPosts\": 5, \"sortOrder\": \"new\", \"maxComments\": 0, \"proxyConfiguration\": {\"useApifyProxy\": false} }" > apify_storage/key_value_stores/default/INPUT.json
set APIFY_LOCAL_STORAGE_DIR=%CD%\apify_storage
python -m src
```
Notes:
- Without Apify proxy credentials, you may see blocks. Provide `APIFY_TOKEN` or `APIFY_PROXY_PASSWORD` to enable proxies locally.
- The logs will show if parsing finds zero posts; a short HTML preview is emitted to help debug blocks or empty pages.
Field-by-field input guidance
- startUrls (array, required)
  - Usernames: `u/someone`, `someone`, or full profile URLs (old/new Reddit).
  - Keywords: prefix with `keyword:` to trigger search mode on old Reddit.
  - Mix freely in one run; the actor will normalize each entry.
- sortOrder (string)
  - Use `new` for the freshest posts; `top` or `controversial` for engagement-centric pulls; `hot` for trending.
- maxPosts (int)
  - Cap per target; the actor stops when the limit is reached or when no more posts parse.
- maxComments (int)
  - If 0, comments are skipped (fastest). Otherwise, up to this many top-level comments are fetched from Reddit JSON.
- proxyConfiguration (object)
  - Standard Apify proxy settings. The actor auto-escalates; residential is sticky once used. Provide a token/password for proxy auth.
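For `maxComments`, the actor reads each thread's JSON listing (a two-element array of post and comments). Extracting top-level comments from that structure can be sketched as below; `top_level_comments` is a hypothetical helper shaped after the comment fields listed in the output format, not the actor's actual function:

```python
def top_level_comments(thread_json, max_comments: int):
    """Pull up to max_comments top-level comments from a Reddit thread
    JSON response ([post_listing, comment_listing])."""
    out = []
    if max_comments <= 0 or len(thread_json) < 2:
        return out
    for child in thread_json[1]["data"]["children"]:
        if child.get("kind") != "t1":  # skip "more" stubs and non-comments
            continue
        d = child["data"]
        # Reddit encodes empty replies as "" and non-empty ones as a listing dict
        replies = d.get("replies")
        out.append({
            "id": d.get("id"),
            "author": d.get("author"),
            "body": d.get("body"),
            "score": d.get("score"),
            "created_utc": d.get("created_utc"),
            "permalink": d.get("permalink"),
            "replies_count": len(replies["data"]["children"]) if isinstance(replies, dict) else 0,
        })
        if len(out) >= max_comments:
            break
    return out
```

The `kind == "t1"` check is what keeps the output to real top-level comments; "more comments" placeholders (`kind == "more"`) are skipped rather than expanded, which matches the top-level-only behavior noted above.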
Practical use cases (SEO-friendly)
- Reddit brand monitoring scraper: Track brand mentions across user posts.
- Content research: Harvest self posts with full text for sentiment and topic modeling.
- Community analysis: Analyze top posts for specific users or keywords to understand engagement patterns.
- Lead discovery: Identify active authors in niche subreddits via keyword-based pulls.
- Trend tracking: Combine `sortOrder=hot` with keyword search to follow emerging topics.
- Dataset creation: Build structured corpora of Reddit posts and optional comments for LLM fine-tuning or analytics.
Troubleshooting and tips
- Got "Input 'startUrls' is required": Ensure `startUrls` is present and non-empty in your JSON.
- Zero posts parsed: Check the logs; a preview of the HTML is printed. This is likely a block: enable proxies or switch to residential.
- Proxy errors: Provide `APIFY_TOKEN` or `APIFY_PROXY_PASSWORD`. If groups are needed, set `apifyProxyGroups` in `proxyConfiguration`.
- Slow runs: Reduce `maxPosts`/`maxComments` or lower the concurrency (edit the semaphore if you fork).
- Comment depth: Only top-level comments are fetched; a replies count is provided for context.
FAQ
Does this scrape private or suspended accounts?
No. Only publicly accessible pages on old.reddit.com are parsed.
Can it bypass Reddit rate limits?
It uses respectful delays and proxy rotation, but you should keep maxPosts reasonable and prefer residential proxies for reliability.
Why old.reddit.com?
The HTML structure is stable for scraping and matches the original parser logic for consistency with legacy outputs.
What about pagination?
The actor paginates using after tokens from parsed posts and stops when maxPosts is hit or a page yields no posts.
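Building the next page URL from the last post's fullname (e.g. `t3_abc123`) can be sketched as follows; `next_page_url` is a hypothetical helper illustrating the `after`-token pagination described above, not the actor's exact code:

```python
from typing import Optional

def next_page_url(base_url: str, last_fullname: Optional[str], count: int) -> Optional[str]:
    """Return the next old.reddit.com listing URL, or None when the last
    page yielded no posts (no fullname to continue from)."""
    if not last_fullname:
        return None
    sep = "&" if "?" in base_url else "?"
    return f"{base_url}{sep}count={count}&after={last_fullname}"
```

When a page parses zero posts there is no `after` token to continue from, so the helper returns `None` and the loop stops, which is the same stop condition the FAQ answer describes.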
Is the output schema stable?
It mirrors the original script’s rich fields; minor adjustments may occur if Reddit markup changes. The dataset view highlights primary columns.
Changelog (high level)
- Initial Apify actor: async aiohttp scraper, proxy fallback (direct → DC → RES), live dataset pushes, optional comments, detailed logging, and legacy field parity with the standalone scripts.
Support
- For Apify platform questions: https://console.apify.com/support
- For actor-specific issues, review logs (proxy transitions, HTML preview on empty parse). If blocks persist, run with residential proxy and lower request volumes.
Keywords: Reddit user profile scraper, Reddit posts scraper, Reddit comments scraper, old.reddit.com scraper, Apify actor for Reddit, Reddit keyword scraper, Reddit proxy scraping, Reddit dataset export, Reddit sentiment data, Reddit brand monitoring, Reddit content research, Reddit analytics, Reddit lead generation.