Reddit Intelligence Scraper (Pay per Event)

Scrape Reddit posts, full comment trees, user profiles, and search results. Features subreddit monitoring with webhook alerts, batch comparison across multiple subreddits, and AI-native markdown output ready for LLM pipelines and vector databases.

Pricing: from $3.00 / 1,000 results
Developer: Eimantas V (Maintained by Community)

Reddit Intelligence Scraper

Extract posts, full comment trees, user profiles, search results, and trending topics from Reddit — with AI-native structured output designed to drop directly into LLM pipelines, vector databases, and RAG systems without preprocessing.

🚀 What Can This Reddit Scraper Extract?

| Data Type | Fields Extracted |
| --- | --- |
| Posts | Title, body (markdown + plain text), score, upvote ratio, awards, flair, author, timestamps, crosspost data |
| Comments | Full nested tree (all depths), per-comment score, author, edited flag, reply count |
| Users | Karma breakdown, account age, post/comment history, profile bio |
| Search Results | Full-text Reddit search with subreddit filtering, sorting, and time windows |
| Subreddit Metadata | Subscriber count, active users, description, creation date, icons |
| Batch Comparison | Side-by-side stats for 10+ subreddits in a single run |

✨ Key Features

  • 🔄 Subreddit monitoring mode — Poll any subreddit for new posts matching keyword filters and deliver alerts via webhook in real time
  • 🌲 Full comment tree traversal — Not just top-level comments. Fetches deeply nested replies via Reddit's morechildren API, up to configurable depth
  • 🤖 AI-native output — Every result includes a _markdown_document field: a clean, structured markdown document ready for LLM context windows or vector embedding
  • 📊 Batch subreddit comparison — Pull top posts from up to 20 subreddits in one run with aggregated stats — ideal for market research and competitive analysis
  • ⚡ Reliable session rotation — Rotates User-Agents, respects X-Ratelimit-* headers, and uses exponential backoff — the #1 failure mode for Reddit scrapers, solved
  • 🔍 Advanced filtering — Filter by flair, keyword, score threshold, date range, NSFW flag, and sort order (hot/new/top/rising)
  • 📋 Schema-versioned output — Every item carries _schema: "reddit-intelligence/v1" so your pipeline always knows what it's getting
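The schema tag makes version handling explicit in downstream code: a pipeline can refuse items it does not recognize instead of silently mis-parsing them. A minimal sketch (the handler body is illustrative, not part of the actor):

```python
def handle_item(item: dict) -> dict:
    """Route a scraped item by its _schema tag so the pipeline fails
    loudly if the actor ever ships a new schema version."""
    schema = item.get("_schema", "")
    if schema == "reddit-intelligence/v1":
        # Known layout: pick out the fields this pipeline cares about.
        return {"id": item["id"], "title": item.get("title"), "score": item.get("score")}
    raise ValueError(f"Unknown schema: {schema!r}")

item = {"_schema": "reddit-intelligence/v1", "id": "abc123", "title": "Example", "score": 10}
print(handle_item(item)["id"])  # abc123
```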

📖 How to Use the Reddit Intelligence Scraper

Step 1 — Choose a mode

| Mode | What it does |
| --- | --- |
| `subreddit` | Scrape posts from one or more subreddits |
| `post` | Scrape a specific post URL with all comments |
| `user` | Scrape a user's profile, posts, and comment history |
| `search` | Full-text Reddit search with filters |
| `batch` | Compare top posts across multiple subreddits |
| `monitor` | Watch subreddits for new posts and deliver webhook alerts |

Step 2 — Configure your run

Scrape the top posts from r/MachineLearning this week:

```json
{
  "mode": "subreddit",
  "subreddits": ["MachineLearning"],
  "sortBy": "top",
  "timeFilter": "week",
  "maxPostsPerSubreddit": 50,
  "includeComments": true,
  "maxCommentsPerPost": 100,
  "outputFormat": "both"
}
```

Compare 5 subreddits for market research:

```json
{
  "mode": "batch",
  "subreddits": ["entrepreneur", "startups", "SaaS", "indiehackers", "smallbusiness"],
  "sortBy": "top",
  "timeFilter": "month",
  "maxPostsPerSubreddit": 10
}
```

Monitor r/ArtificialIntelligence for mentions of "GPT" and alert via webhook:

```json
{
  "mode": "monitor",
  "subreddits": ["ArtificialIntelligence"],
  "keywordFilter": ["GPT", "Claude", "Gemini", "LLM"],
  "monitoringInterval": 5,
  "webhookUrl": "https://your-server.com/webhooks/reddit"
}
```

Scrape a specific post with full comment tree:

```json
{
  "mode": "post",
  "postUrls": ["https://www.reddit.com/r/MachineLearning/comments/abc123/example_post/"],
  "maxCommentsPerPost": 500,
  "commentDepth": 10
}
```

Step 3 — Use the output

Every post result includes a ready-to-use markdown document in the _markdown_document field:

```markdown
# Why GPT-4 is changing enterprise software

**r/MachineLearning** | Score: **4,231** (96% upvoted) | Comments: **312**
Author: u/ml_researcher | Posted: 2024-03-15T14:22:00Z

## Post Content

The shift from rule-based to generative AI...

## Top Comments

### u/ai_engineer (Score: 847)
This is exactly what we're seeing in production...

> #### u/skeptic99 (Score: 234)
> Worth noting the cost implications here...
```

Paste this directly into your LLM prompt or chunk it for RAG.


💰 How Much Does It Cost to Scrape Reddit?

Reddit Intelligence Scraper is priced per result (pay-per-event):

| Task | Approximate Cost |
| --- | --- |
| 1,000 posts (metadata only) | ~$3.00 |
| 1,000 posts with 100 comments each | ~$3.00 |
| User profile (1 user, 25 posts) | ~$0.12 |
| Batch comparison (10 subs × 10 posts) | ~$0.30 |
| Monitor run (24h, low-traffic sub) | ~$1.50–6.00 |

Pricing: $3.00 per 1,000 results. Each post, comment thread, user profile, or search result page counts as one result.

Tip: Disable `includeComments` and set `outputFormat: "json"` for faster runs when you only need post metadata.
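Because pricing is a flat per-result rate, cost estimates are simple multiplication. A quick sketch, with the unit price taken from the pricing above:

```python
PRICE_PER_RESULT = 3.00 / 1000  # $3.00 per 1,000 results

def estimate_cost(num_results: int) -> float:
    """Estimated USD charge for a run producing num_results results."""
    return round(num_results * PRICE_PER_RESULT, 2)

print(estimate_cost(1000))  # 3.0
print(estimate_cost(100))   # 0.3 (e.g. 10 subreddits x 10 posts)
```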


📤 Output Format

Post object (JSON)

```json
{
  "_schema": "reddit-intelligence/v1",
  "_scraped_at": "2024-03-15T14:30:00.000Z",
  "type": "post",
  "id": "abc123",
  "url": "https://www.reddit.com/r/MachineLearning/comments/abc123/...",
  "subreddit": "MachineLearning",
  "title": "Why GPT-4 is changing enterprise software",
  "body_markdown": "The shift from rule-based to generative AI...",
  "body_text": "The shift from rule-based to generative AI...",
  "score": 4231,
  "upvote_ratio": 0.96,
  "num_comments": 312,
  "total_awards_received": 7,
  "flair_text": "Discussion",
  "author": "ml_researcher",
  "created_utc": "2024-03-15T14:22:00.000Z",
  "comments": [...],
  "subreddit_meta": {
    "subscribers": 2800000,
    "active_user_count": 4200,
    ...
  },
  "_markdown_document": "# Why GPT-4 is changing..."
}
```
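Because `comments` is a nested tree, loading it into a flat store (a SQL table, a CSV) usually means flattening it first. A sketch, assuming each comment dict nests its children under a `replies` key (that key name is an assumption, not confirmed by the schema above):

```python
def flatten_comments(comments: list, depth: int = 0) -> list:
    """Depth-first flatten of a nested comment tree into per-comment rows,
    recording nesting depth. Assumes children live under a `replies` key."""
    rows = []
    for c in comments:
        rows.append({"depth": depth, "author": c.get("author"), "score": c.get("score")})
        rows.extend(flatten_comments(c.get("replies", []), depth + 1))
    return rows

tree = [{"author": "ai_engineer", "score": 847,
         "replies": [{"author": "skeptic99", "score": 234, "replies": []}]}]
print(flatten_comments(tree))
```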

Webhook payload (monitor mode)

```json
{
  "event": "keyword_match",
  "timestamp": "2024-03-15T14:35:00.000Z",
  "subreddit": "ArtificialIntelligence",
  "matched_keywords": ["GPT", "LLM"],
  "post": {
    "id": "xyz789",
    "title": "New GPT-4 benchmark results are wild",
    "url": "https://www.reddit.com/r/...",
    "score": 142,
    "body_preview": "Just ran the full MMLU suite..."
  }
}
```
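On the receiving side, a webhook handler only has to parse this JSON and branch on `event`. A minimal sketch of that parsing step (server framework and endpoint are up to you):

```python
import json

def summarize_alert(raw: bytes) -> str:
    """Turn a monitor-mode webhook body into a one-line alert summary."""
    payload = json.loads(raw)
    if payload.get("event") != "keyword_match":
        return "ignored"
    post = payload["post"]
    keywords = ", ".join(payload.get("matched_keywords", []))
    return f"[r/{payload['subreddit']}] {post['title']} (score {post['score']}; matched: {keywords})"

raw = b'{"event": "keyword_match", "subreddit": "ArtificialIntelligence", "matched_keywords": ["GPT"], "post": {"title": "New GPT-4 benchmark results are wild", "score": 142}}'
print(summarize_alert(raw))
```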

🤔 Frequently Asked Questions

Is it legal to scrape Reddit?

Reddit's public data is accessible without authentication. This actor only scrapes publicly available content — the same data accessible in a browser without logging in. Always review Reddit's Terms of Service and ensure your use complies with applicable laws. This tool is intended for research, analytics, and AI training use cases.

Why is the actor not returning all 1000 posts I requested?

Reddit's listing endpoints cap out at roughly 1,000 items per sort order, and the API occasionally returns fewer results than requested; this is a Reddit limitation, not an actor bug. For better coverage, keep maxPostsPerSubreddit high and run additional passes with sortBy: "new" or different timeFilter values.

What's the difference between outputFormat: "json", "markdown", and "both"?

  • json — returns the full structured JSON object, ideal for data pipelines and databases
  • markdown — returns only the _markdown_document field (the AI-ready version), minimal storage
  • both — returns full JSON and the markdown document in every result

Can I use this to feed Reddit data into a vector database?

Yes — this is a primary use case. Use outputFormat: "markdown", split on ## Top Comments to get post and comment chunks, and embed each chunk separately.
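That split step can be sketched directly, assuming the `_markdown_document` layout shown earlier (post content first, then a `## Top Comments` section with one `### u/...` heading per top-level comment):

```python
import re

def chunk_markdown(doc: str) -> list:
    """Split a _markdown_document into one post chunk plus one chunk per
    top-level comment, each ready to embed separately."""
    post, _, comments = doc.partition("## Top Comments")
    chunks = [post.strip()]
    if comments:
        # Each top-level comment opens with a "### u/..." heading.
        chunks += [c.strip() for c in re.split(r"\n(?=### u/)", comments) if c.strip()]
    return chunks

doc = ("# Title\nPost body...\n## Top Comments\n"
       "### u/alice (Score: 10)\nGreat post.\n"
       "### u/bob (Score: 5)\nAgreed.")
print(len(chunk_markdown(doc)))  # 3
```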

Does it handle private or restricted subreddits?

No. This actor only accesses public Reddit content. Private subreddits require OAuth authentication with approved account credentials.

How does the monitoring mode work exactly?

On first run, the actor seeds its state with the current latest posts (no webhook fires). On subsequent polling cycles (default: every 5 minutes), any new post that matches your filters triggers a dataset push and optionally a webhook. State is persisted in Apify Key-Value Store so it survives between runs.
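The seed-then-diff cycle reduces to set arithmetic over post IDs. A sketch with the persisted state modeled as a plain Python set (the actor keeps it in the Key-Value Store):

```python
def diff_new_posts(posts: list, seen_ids: set, keywords: list):
    """Return (matching new posts, updated seen-ID set). On the seeding
    run, pass an empty set and discard the matches, as the actor does."""
    matches = [
        p for p in posts
        if p["id"] not in seen_ids
        and any(k.lower() in p["title"].lower() for k in keywords)
    ]
    return matches, seen_ids | {p["id"] for p in posts}

posts = [{"id": "a1", "title": "New GPT-4 benchmark"}, {"id": "b2", "title": "Weekly thread"}]
matches, seen = diff_new_posts(posts, set(), ["GPT"])
print([p["id"] for p in matches])  # ['a1']
```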

Why use a proxy?

Without a proxy, repeated scraping from a single IP can trigger Reddit's rate limiting (HTTP 429). Apify's residential proxy pool rotates IPs automatically, making your scraper much more reliable at scale.
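The backoff behavior described in the features (honor rate-limit headers when present, otherwise back off exponentially) boils down to a small delay calculation. A sketch of just that calculation:

```python
def retry_delay(attempt: int, retry_after: float = None, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based). A
    server-provided Retry-After value wins; otherwise exponential
    backoff, capped so delays never grow unbounded."""
    if retry_after is not None:
        return min(retry_after, cap)
    return min(2.0 ** attempt, cap)

print([retry_delay(a) for a in range(4)])  # [1.0, 2.0, 4.0, 8.0]
print(retry_delay(2, retry_after=30))      # 30
```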


🔗 Related Actors

  • Web Scraper — General-purpose web scraping
  • Twitter Scraper — Social media monitoring on X/Twitter
  • YouTube Scraper — Video and comment data from YouTube

📬 Support & Feedback

Found a bug or have a feature request? Open an issue or contact us through the Apify platform. We monitor this actor actively and publish updates regularly.