Reddit Thread & Comments Scraper avatar

Reddit Thread & Comments Scraper

Pricing

from $3.00 / 1,000 results

Go to Apify Store
Reddit Thread & Comments Scraper

Reddit Thread & Comments Scraper

Scrape any Reddit post and its complete comment thread — including deeply nested replies — in seconds. Supports bulk URLs, cursor-based pagination for large threads, flat or nested output, score filtering, and depth capping. Perfect for sentiment analysis, AI training data, and community research.

Pricing

from $3.00 / 1,000 results

Rating

0.0

(0)

Developer

Datara

Datara

Maintained by Community

Actor stats

0

Bookmarked

1

Total users

0

Monthly active users

3 days ago

Last modified

Share

Extract any Reddit post and its full comment tree — including nested replies at any depth — in seconds. Supports cursor-based pagination for threads with hundreds or thousands of comments.


What This Actor Does

Given one or more Reddit post URLs, this actor:

  1. Fetches the post metadata (title, author, score, upvote ratio, subscriber count, etc.)
  2. Fetches the full comment tree including nested replies, paginating through all available comment pages
  3. Pushes each post and comment as a clean, structured dataset record

Output records are typed (recordType: "post" or "comment") and immediately usable in spreadsheets, databases, AI pipelines, or downstream automations.


Use Cases

  • Sentiment analysis — analyse how communities respond to products, brands, or announcements
  • AI training data — collect high-quality human conversation threads for LLM fine-tuning or RLHF
  • Community research — surface recurring themes, pain points, and opinions across subreddits
  • Qualitative market research — understand what real users say about your category
  • Content strategy — identify high-scoring discussions to inform editorial direction

Input Fields

FieldTypeDefaultDescription
postUrlstringSingle Reddit post URL to scrape
postUrlsarray[]List of Reddit post URLs for bulk scraping (overrides postUrl)
maxPagesinteger3Max comment pages to fetch per post (cursor pagination, ~25 comments/page)
flattenCommentsbooleantrueOutput comments as individual flat records (true) or with nested replies arrays (false)
includePostRecordbooleantrueInclude the post as a separate dataset record
minCommentScoreinteger0Skip comments below this score threshold
maxCommentDepthinteger10Maximum reply nesting depth to include (0 = top-level only)
maxCommentsPerPostinteger200Cap on total comment records per post (1–5000)

Bulk mode: If postUrls is non-empty, the single postUrl field is ignored. Duplicate URLs are automatically deduplicated.


Single URL Example Input

{
"postUrl": "https://www.reddit.com/r/startups/comments/1abc23/we_just_hit_10k_mrr_heres_what_worked/",
"maxPages": 5,
"flattenComments": true,
"includePostRecord": true,
"minCommentScore": 5,
"maxCommentDepth": 5,
"maxCommentsPerPost": 500
}

Bulk URL Example Input

{
"postUrls": [
"https://www.reddit.com/r/SaaS/comments/1abc11/thoughts_on_pricing_models/",
"https://www.reddit.com/r/entrepreneur/comments/1abc22/bootstrapped_to_1m_ama/",
"https://www.reddit.com/r/startups/comments/1abc33/why_we_shut_down/"
],
"maxPages": 3,
"flattenComments": true,
"includePostRecord": true,
"minCommentScore": 2,
"maxCommentDepth": 10,
"maxCommentsPerPost": 300
}

Output Schema

Post Record (recordType: "post")

FieldTypeDescription
recordTypestringAlways "post"
idstringReddit short ID (e.g. 1lfbo7u)
namestringReddit fullname, prefixed t3_ (e.g. t3_1lfbo7u)
titlestringPost title
authorstringUsername of the poster
authorFullnamestringReddit internal author ID (e.g. t2_16syu27ar1)
subredditstringSubreddit name (without r/)
urlstringFull post URL
scoreintegerNet vote score
upsintegerUpvote count (fuzzy-rounded by Reddit)
downsintegerDownvote count (almost always 0)
upvoteRationumberRatio of upvotes to total votes (0–1)
numCommentsintegerTotal comment count as reported by Reddit
subredditSubscribersintegerSubscriber count of the subreddit
isVideobooleanTrue if the post contains a Reddit-hosted video
totalAwardsReceivedintegerNumber of Reddit awards
createdUtcnumberUnix timestamp (UTC seconds) of post creation
createdAtstringISO 8601 datetime of post creation
scrapedAtstringISO 8601 datetime when the record was scraped

Example post record:

{
"recordType": "post",
"id": "1lfbo7u",
"name": "t3_1lfbo7u",
"title": "What is a thing you love that lots of people hate?",
"author": "Vetro_Nodulare2",
"authorFullname": "t2_16syu27ar1",
"subreddit": "AskReddit",
"url": "https://www.reddit.com/r/AskReddit/comments/1lfbo7u/what_is_a_thing_you_love_that_lots_of_people_hate/",
"score": 47,
"ups": 47,
"downs": 0,
"upvoteRatio": 0.91,
"numComments": 353,
"subredditSubscribers": 56146601,
"isVideo": false,
"totalAwardsReceived": 0,
"createdUtc": 1750341959,
"createdAt": "2025-06-19T14:05:59.000Z",
"scrapedAt": "2025-06-20T09:45:12.000Z"
}

Comment Record (recordType: "comment")

FieldTypeDescription
recordTypestringAlways "comment"
idstringReddit short ID (e.g. mymupxb)
namestringReddit fullname, prefixed t1_ (e.g. t1_mymupxb)
postIdstringShort ID of the parent post
postUrlstringFull URL of the parent post
authorstringUsername of the commenter
authorFullnamestringReddit internal author ID
bodystringPlain-text comment body
scoreintegerNet vote score
upsintegerUpvote count
downsintegerDownvote count
depthintegerNesting depth (0 = top-level, 1 = reply to top-level, etc.)
parentIdstringFullname of the parent (t3_... if replying to post, t1_... if replying to comment)
linkIdstringFullname of the parent post (always t3_...)
subredditstringSubreddit name
urlstringFull URL of this comment
permalinkstringRelative permalink path
gildedintegerNumber of times gilded
stickiedbooleanTrue if pinned by a moderator
lockedbooleanTrue if the comment thread is locked
archivedbooleanTrue if too old to receive votes
controversialityinteger0 or 1; 1 = high vote split
totalAwardsReceivedintegerNumber of Reddit awards
createdUtcnumberUnix timestamp (UTC seconds) of comment creation
createdAtstringISO 8601 datetime of comment creation
scrapedAtstringISO 8601 datetime when scraped
repliesarray[Nested mode only] Child comment records

Example comment record (flat mode):

{
"recordType": "comment",
"id": "mymupxb",
"name": "t1_mymupxb",
"postId": "1lfbo7u",
"postUrl": "https://www.reddit.com/r/AskReddit/comments/1lfbo7u/what_is_a_thing_you_love_that_lots_of_people_hate/",
"author": "Background-Emu-2890",
"authorFullname": "t2_efdlposp6",
"body": "Black cat — I have one and I love her so much!",
"score": 75,
"ups": 75,
"downs": 0,
"depth": 0,
"parentId": "t3_1lfbo7u",
"linkId": "t3_1lfbo7u",
"subreddit": "AskReddit",
"url": "https://www.reddit.com/r/AskReddit/comments/1lfbo7u/what_is_a_thing_you_love_that_lots_of_people_hate/mymupxb/",
"permalink": "/r/AskReddit/comments/1lfbo7u/what_is_a_thing_you_love_that_lots_of_people_hate/mymupxb/",
"gilded": 0,
"stickied": false,
"locked": false,
"archived": false,
"controversiality": 0,
"totalAwardsReceived": 0,
"createdUtc": 1750342221,
"createdAt": "2025-06-19T14:10:21.000Z",
"scrapedAt": "2025-06-20T09:45:12.000Z"
}

Flat vs Nested Comment Output

Flat mode (flattenComments: true, default):

  • Every comment and reply is pushed as its own dataset record
  • Use depth to understand nesting level (0 = top-level)
  • Use parentId to reconstruct the tree (t1_XXX = parent comment, t3_XXX = direct reply to post)
  • Best for: spreadsheet analysis, databases, ML pipelines, CSV export

Nested mode (flattenComments: false):

  • Top-level comment records contain a replies array with child records embedded
  • Each child also contains its own replies array, forming a full tree
  • Best for: JSON tree processing, displaying thread structure

Pagination

The ScrapeCreators API uses cursor-based pagination. Each page returns approximately 25 top-level comments. Set maxPages to control how many pages to fetch:

  • maxPages: 1 — fast, ~25 top-level comments
  • maxPages: 5 — ~125 top-level comments + all their replies
  • maxPages: 20 — comprehensive extraction for large threads

The actor stops pagination early if maxCommentsPerPost is reached or the API signals no more pages available.


Error Handling

  • Failed URLs push an error record (error: true) and processing continues for remaining URLs
  • Posts with no comments push a warning record
  • The dataset always contains at least one record per run

Pricing

This actor uses Pay Per Event (PPE) pricing:

  • $0.30 per 100 records (each post and each comment count as one record)
  • A thread with 1 post + 199 comments = 200 records ≈ $0.60
  • Bulk run: 10 threads × 200 comments = ~2,010 records ≈ $6.03

Support

For questions or feature requests, contact the actor publisher via the Apify Store messaging system.