Reddit Data Scraper – Scrape Posts, Comments, Upvotes & More

Extract Reddit posts, comments, upvotes, and subreddit data with this powerful Reddit scraper. Ideal for data analysis, lead generation, trend research, and AI datasets. Scrape Reddit data at scale without API limits and export results in JSON, CSV, or Excel format.


Reddit Post Scraper

What is Reddit Post Scraper?

Reddit Post Scraper is a powerful Reddit data extraction tool built on Apify that allows you to scrape posts and subreddit listings at scale using Reddit’s public JSON endpoints (no browser required). It is designed for marketers, researchers, developers, and businesses who want to automate trend analysis, content research, lead generation, and AI dataset creation — without relying on Reddit’s official API.

Why Use This Reddit Scraper?

Use this scraper to:

  • Extract trending posts from any subreddit
  • Analyze discussions, engagement, and content performance
  • Track upvotes, comments, and popularity over time
  • Build datasets for AI, sentiment analysis, and research
  • Automate Reddit data collection workflows

Features

  • Scrape subreddit listings (hot, new, top, rising, etc.) from subreddit URLs or names.
  • Scrape single posts directly by URL.
  • Extract rich post-level data:
    • Post title and body (self text + HTML)
    • Author and subreddit information
    • Score, upvote ratio, number of comments
    • Permalink, link URL, flair, thumbnail, domain
    • Creation time and scrape time
  • Uses Reddit’s JSON API (.json endpoints) — no headless browser needed.
  • Structured output for analytics and automation.

How to Use Reddit Post Scraper on Apify

Using the Actor

To use this actor on Apify, follow these steps:

  1. Go to the Reddit Post Scraper on the Apify platform.

  2. Input Configuration:

    • Provide one or more Reddit URLs (subreddit listings or individual posts), or a subreddit name plus sort mode.
    • Configure how many posts to fetch per URL and your proxy settings for reliability.

Input Configuration

The actor supports multiple input styles. A typical configuration looks like:

{
  "startUrls": [
    { "url": "https://www.reddit.com/r/marketing/top" },
    { "url": "https://www.reddit.com/r/startups/new" }
  ],
  "subreddit": "python",
  "sort": "hot",
  "maxPostsPerUrl": 25,
  "language": "en",
  "proxyCountry": "AUTO_SELECT_PROXY_COUNTRY",
  "proxyUrl": null
}

Common fields:

  • productUrls (optional): Array of Reddit URL strings (subreddit listings or individual posts).
  • startUrls (optional): Array of { "url": "..." } objects used as starting points (subreddits or posts).
  • url (optional): Single Reddit URL (legacy single-URL input).
  • subreddit (optional): Subreddit name only, e.g. "python", "AskReddit".
  • sort (optional): Sort order for subreddit listings (e.g. hot, new, top, rising, controversial).
  • maxPostsPerUrl (optional): Maximum number of posts to fetch per listing URL (typically 1–100, default ~25).
  • language (optional): Language/locale hint for requests (default: en).
  • proxyCountry (optional): Apify proxy country, e.g. AUTO_SELECT_PROXY_COUNTRY, US, GB, DE, FR, JP, CA, IT.
  • proxyUrl (optional): Custom proxy URL (e.g. Webshare). When set, overrides Apify Proxy.
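
The alternative input styles above (startUrls, productUrls, url, subreddit + sort) could be merged into a single URL list along these lines. This is a hypothetical sketch, not the actor's actual code; the field names match the input schema above, but the merge order and the `collect_start_urls` helper are assumptions.

```python
# Hypothetical sketch: collapse the documented input styles into one URL list.
# Field names match the input schema; merge order is an assumption.

def collect_start_urls(actor_input: dict) -> list[str]:
    """Merge the alternative input styles into one de-duplicated URL list."""
    urls: list[str] = []
    # startUrls: array of {"url": "..."} objects
    for entry in actor_input.get("startUrls") or []:
        if entry.get("url"):
            urls.append(entry["url"])
    # productUrls: plain array of URL strings
    urls.extend(actor_input.get("productUrls") or [])
    # url: legacy single-URL input
    if actor_input.get("url"):
        urls.append(actor_input["url"])
    # subreddit + sort: build a listing URL from the subreddit name
    if actor_input.get("subreddit"):
        sort = actor_input.get("sort", "hot")
        urls.append(f"https://www.reddit.com/r/{actor_input['subreddit']}/{sort}")
    # de-duplicate while preserving order
    return list(dict.fromkeys(urls))

example_input = {
    "startUrls": [{"url": "https://www.reddit.com/r/marketing/top"}],
    "subreddit": "python",
    "sort": "hot",
}
print(collect_start_urls(example_input))
# ['https://www.reddit.com/r/marketing/top', 'https://www.reddit.com/r/python/hot']
```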
  3. Run the Actor:

    • Click Start to begin scraping.
    • The actor fetches .json from each Reddit URL and normalizes post data.
  4. Access Your Results:

    • View results in the Dataset tab.
    • Export data in JSON, CSV, or Excel.
    • Access via the Apify API for programmatic workflows.
  5. Schedule Regular Runs (Optional):

    • Schedule the actor to run periodically to track trends and subreddit activity over time.

Output

Each Reddit post becomes one item in the dataset. According to the dataset schema, each item typically includes:

  • url: Full URL to the Reddit post.
  • permalink: Reddit permalink path.
  • source_url: The URL that was scraped (listing or post URL).
  • id: Reddit post ID.
  • title: Post title.
  • author: Post author username.
  • subreddit: Subreddit name.
  • subreddit_name_prefixed: Subreddit with prefix, e.g. r/python.
  • score: Net score (upvotes - downvotes).
  • upvote_ratio: Ratio of upvotes (0–1).
  • num_comments: Number of comments.
  • selftext: Post body (self text).
  • selftext_html: Post body in HTML (if available).
  • link_url: URL of linked content (for link posts).
  • is_self: true if text (self) post.
  • over_18: true if marked NSFW.
  • link_flair_text: Post flair text (if present).
  • thumbnail: Thumbnail image URL (if available).
  • domain: Domain of the linked content (for link posts).
  • created_utc: ISO timestamp of when the post was created (UTC).
  • scraped_at: Timestamp of when the post was scraped.

Example item (simplified):

{
  "url": "https://www.reddit.com/r/marketing/comments/xxxxxx/example_post/",
  "permalink": "/r/marketing/comments/xxxxxx/example_post/",
  "source_url": "https://www.reddit.com/r/marketing/top",
  "id": "xxxxxx",
  "title": "Example Reddit post title",
  "author": "example_user",
  "subreddit": "marketing",
  "subreddit_name_prefixed": "r/marketing",
  "score": 512,
  "upvote_ratio": 0.96,
  "num_comments": 74,
  "selftext": "Post body text...",
  "selftext_html": "<p>Post body text...</p>",
  "link_url": null,
  "is_self": true,
  "over_18": false,
  "link_flair_text": "Discussion",
  "thumbnail": "https://b.thumbs.redditmedia.com/...",
  "domain": "self.marketing",
  "created_utc": "2025-01-01T12:00:00Z",
  "scraped_at": "2025-01-01T12:05:00Z"
}

➡️ Output is clean, structured, and ready for analysis, trend tracking, or automation.

How the Scraper Works

The Reddit Post Scraper uses Reddit’s public JSON endpoints (no browser or official API) to fetch and normalize post data:

  1. URL normalization: For each input URL or subreddit, the actor builds the corresponding .json endpoint.
  2. HTTP requests: It sends HTTP requests with a descriptive User-Agent and optional proxy configuration.
  3. Data extraction: It parses the JSON response, extracts relevant post fields, and converts timestamps to ISO strings.
  4. Dataset writing: Each post is saved as a structured item in the default Apify dataset.
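
Steps 1 and 3 can be sketched as follows. This is an illustrative, simplified version under stated assumptions: the .json endpoint is built by appending ".json" to the normalized URL, and the field names mirror the output schema above; the helper functions `to_json_endpoint` and `normalize_post` are not the actor's actual internals.

```python
# Illustrative sketch of URL normalization (step 1) and post
# normalization with ISO timestamps (step 3). Not the actor's real code.
from datetime import datetime, timezone

def to_json_endpoint(url: str) -> str:
    """Step 1: normalize a listing/post URL to its .json endpoint."""
    return url.rstrip("/") + ".json"

def normalize_post(raw: dict, source_url: str) -> dict:
    """Step 3: extract a subset of the documented fields from one raw post."""
    created = datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc)
    return {
        "id": raw.get("id"),
        "title": raw.get("title"),
        "author": raw.get("author"),
        "subreddit": raw.get("subreddit"),
        "score": raw.get("score"),
        "num_comments": raw.get("num_comments"),
        "permalink": raw.get("permalink"),
        "source_url": source_url,
        # epoch seconds -> ISO 8601 UTC string, as in the output schema
        "created_utc": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

print(to_json_endpoint("https://www.reddit.com/r/python/hot"))
# https://www.reddit.com/r/python/hot.json
```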

Anti-blocking & Reliability

To keep scraping stable on Reddit:

  • Uses a descriptive User-Agent (Sovanza Reddit Post Scraper/1.0).
  • Adds a small delay between requests to reduce rate limiting.
  • Supports Apify Proxy and custom proxyUrl so you can use residential or region-specific IPs.
  • Retries failed requests where appropriate to handle transient issues.
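
A retry-with-delay helper in the spirit of the notes above might look like this. It is a hedged sketch, not the actor's actual retry logic; the `fetch_with_retries` name and the linear backoff are assumptions. The fetch callable is injected so the pattern works with any HTTP client.

```python
# Illustrative retry helper: call fetch(url) up to `retries` times,
# sleeping between attempts. The actor's real retry logic may differ.
import time

HEADERS = {"User-Agent": "Sovanza Reddit Post Scraper/1.0"}

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Retry transient failures (e.g. 429/503) with a simple linear backoff."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as err:  # transient network / rate-limit errors
            last_error = err
            time.sleep(delay * (attempt + 1))  # linear backoff between tries
    raise last_error
```

In practice you would pass something like `lambda u: requests.get(u, headers=HEADERS, timeout=30)` as the `fetch` argument.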

Performance Optimization

  • Processes multiple listing URLs in a single run.
  • Uses lightweight HTTP+JSON (no headless browser), which is faster and cheaper.
  • Lets you control maxPostsPerUrl to tune between speed and depth.

Why Choose This Actor?

  • Scalable Reddit data extraction from subreddits and posts.
  • Extracts rich post data and engagement metrics.
  • No official Reddit API required.
  • Automation-ready via Apify API, scheduling, and webhooks.
  • Produces clean, structured datasets suitable for analytics and AI.

FAQ

How does Reddit Post Scraper work?

It appends .json to Reddit URLs (subreddits or posts), fetches the public Reddit JSON API, normalizes post data, and saves it to an Apify dataset.

Can I scrape multiple subreddits at once?

Yes. You can provide multiple subreddit listing URLs or use multiple startUrls and/or productUrls in a single run.

Does it require Reddit API credentials?

No. It works with publicly available Reddit JSON endpoints and does not use Reddit’s official OAuth API.

Can I extract comments and replies?

This actor focuses on post-level data (title, body, score, metadata). If you need full comment trees, pair it with a dedicated comments scraper or extend this one.

Is the data accurate?

Yes. Data is fetched in real time from Reddit’s public JSON responses.

Can I automate scraping?

Yes. You can use Apify scheduling, webhooks, and the API to run it regularly and integrate it into pipelines.

What formats are supported?

JSON, CSV, Excel via Apify dataset export, plus API output for programmatic access.
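
As a hedged sketch of working with an export programmatically: a JSON dataset export can be converted to CSV with the standard library alone. The column selection here is illustrative, and the inline items stand in for a real download.

```python
# Illustrative: turn a (simulated) JSON dataset export into CSV rows
# using only the standard library. Column choice is an assumption.
import csv
import io
import json

export = json.loads("""[
  {"id": "a1", "title": "First post", "score": 10, "num_comments": 3},
  {"id": "b2", "title": "Second post", "score": 42, "num_comments": 7}
]""")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "title", "score", "num_comments"])
writer.writeheader()
writer.writerows(export)
print(buf.getvalue())
```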

Is it suitable for AI and sentiment analysis?

Yes. The structured text fields (title, selftext, etc.) are ideal for NLP, topic modeling, and sentiment analysis workflows.

Is it legal to scrape Reddit?

Scraping publicly available data is generally allowed, but you should comply with Reddit’s terms of service and all applicable laws.

Actor permissions

This Actor is designed to work with limited permissions. It only reads input and writes to its default dataset; it does not access other user data or require full account access.

To set limited permissions in Apify Console:

  1. Open your Actor on the Apify platform.
  2. Go to the Source tab (or Settings).
  3. Click Review permissions (or open Settings → Permissions).
  4. Select Limited permissions and save.

Using limited permissions builds user trust and can improve your Actor's quality score in the Store.

Anti-blocking Notes

  • Reddit requires a descriptive User-Agent; the actor sends Sovanza Reddit Post Scraper/1.0.
  • A short delay between requests helps reduce the risk of rate limiting.
  • When running on Apify, enable a suitable proxy configuration (often residential) to reduce 429/403 errors, as Reddit may block datacenter IPs.

Limitations

  • Some subreddits or posts may be restricted, removed, or rate-limited.
  • Reddit’s JSON structure can change, which may require actor updates.
  • Large-scale scraping may require appropriate Apify plan limits and careful proxy usage.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Get Started

Start extracting Reddit posts and build powerful datasets for research, marketing, and automation today. 🚀