Reddit Archive Scraper

Pricing

Pay per usage

Extract years of historical Reddit posts and comments from the PullPush archive. Reddit's API caps subreddit listings at ~1000 posts; this Actor pulls months or years of history from many subreddits by date range and keyword. Built for historical backfill, research, and AI datasets.

Developer

ben

Maintained by Community

Reddit Archive Scraper — Historical Posts & Comments (Years of Data)

Pull MONTHS or YEARS of historical Reddit posts and comments from one or many subreddits — by date range and keyword.

This Actor uses the PullPush archive (the public Pushshift successor) to reach data that Reddit's own API simply won't return.

Why this exists

Reddit's official API hard-caps any subreddit listing at ~1000 posts — for an active subreddit that's only a few weeks of history. There is no way around that cap with the official API, in any tool.

This Actor solves that: it reads from the historical archive, so you can backfill a full year (or several) across multiple subreddits in one job.
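The Actor's internals are not described here, so the following is only a sketch of one common way to page through a date-bounded archive: start at the upper bound and move it back to the oldest timestamp seen on each page. The `fetch` callable, its argument order, and the `created_utc` field name are assumptions made for illustration, not the Actor's or PullPush's confirmed interface.

```python
from typing import Callable, Iterator, List

# Hypothetical fetcher: (after_epoch, before_epoch, size) -> posts, newest first.
# Each post is assumed to carry an integer "created_utc" epoch timestamp.
Fetcher = Callable[[int, int, int], List[dict]]


def walk_backward(fetch: Fetcher, after: int, before: int,
                  max_posts: int, page_size: int = 100) -> Iterator[dict]:
    """Yield up to max_posts posts, walking from `before` back toward `after`."""
    yielded = 0
    cursor = before
    while yielded < max_posts and cursor > after:
        page = fetch(after, cursor, min(page_size, max_posts - yielded))
        if not page:
            break  # nothing older is left in this window
        for post in page:
            if yielded >= max_posts:
                return
            yield post
            yielded += 1
        # Move the upper bound back to the oldest timestamp on this page.
        # Caveat: with an exclusive upper bound, posts that share that exact
        # second but fell on the next page would be skipped.
        cursor = min(p["created_utc"] for p in page)
```

Passing the fetcher in as a parameter keeps the paging logic testable against in-memory data, independent of any network call.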

Need live, up-to-the-minute posts and full threaded comment trees instead? Use the companion Reddit Scraper (official API) for fresh data, and this Archive Scraper for deep history. They pair well: archive for backfill, live scraper for ongoing updates.

What you get

Posts: title, selftext (body), author, subreddit, score, upvote_ratio, num_comments, created date (epoch + ISO), permalink, url, domain, flair, is_self/is_video/over_18/locked/stickied/spoiler, awards.

Comments (optional): body, author, subreddit, score, parent_id, link_id, post_id, created date, permalink, is_submitter.

Each row has a type field (post or comment) so you can split them easily.
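As a minimal sketch of that split, assuming the dataset has already been loaded into a list of dictionaries (the function name `split_rows` is made up for this example):

```python
def split_rows(rows):
    """Separate dataset rows into posts and comments using the type field."""
    posts = [r for r in rows if r.get("type") == "post"]
    comments = [r for r in rows if r.get("type") == "comment"]
    return posts, comments


rows = [
    {"type": "post", "id": "1d8bw4c", "title": "Best clone of Cool Water?"},
    {"type": "comment", "post_id": "1d8bw4c", "body": "Try a few samples first."},
]
posts, comments = split_rows(rows)
print(len(posts), len(comments))
```

Rows with a missing or unexpected type value end up in neither list, which makes them easy to spot by comparing counts.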

Input

| Field | Type | Description |
| --- | --- | --- |
| subreddits | array | Subreddits to archive (without r/) |
| searchQuery | string | Optional keyword filter (or search all of Reddit) |
| afterDate | string | Earliest date YYYY-MM-DD (lower bound) |
| beforeDate | string | Latest date YYYY-MM-DD (start point) |
| maxPosts | integer | Max posts across all subreddits |
| includeComments | boolean | Also fetch archived comments per post |
| maxCommentsPerPost | integer | Cap comments per post |

Example: one year of a subreddit

{
  "subreddits": ["FragranceClones"],
  "afterDate": "2024-01-01",
  "beforeDate": "2025-01-01",
  "maxPosts": 10000,
  "includeComments": false
}

Example: keyword across all of Reddit, posts + comments

{
  "searchQuery": "dupe",
  "afterDate": "2024-06-01",
  "maxPosts": 1000,
  "includeComments": true,
  "maxCommentsPerPost": 50
}
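Inputs like the two above can also be assembled in code, which makes it easy to catch a malformed date or a stray r/ prefix before starting a run. This is a sketch only: the helper name `build_input` and its checks are assumptions for illustration, and the Actor may validate its input differently.

```python
from datetime import date


def build_input(max_posts, subreddits=None, search_query=None,
                after_date=None, before_date=None,
                include_comments=False, max_comments_per_post=None):
    """Build an input dict using the field names from the Input table."""
    # date.fromisoformat raises ValueError for anything that is not YYYY-MM-DD.
    after = date.fromisoformat(after_date) if after_date else None
    before = date.fromisoformat(before_date) if before_date else None
    if after and before and after > before:
        raise ValueError("afterDate must not be later than beforeDate")

    run_input = {"maxPosts": max_posts, "includeComments": include_comments}
    if subreddits:
        # The table asks for names without the r/ prefix, so strip it if present.
        run_input["subreddits"] = [
            s[2:] if s.lower().startswith("r/") else s for s in subreddits
        ]
    if search_query:
        run_input["searchQuery"] = search_query
    if after_date:
        run_input["afterDate"] = after_date
    if before_date:
        run_input["beforeDate"] = before_date
    if max_comments_per_post is not None:
        run_input["maxCommentsPerPost"] = max_comments_per_post
    return run_input


print(build_input(10000, subreddits=["r/FragranceClones"],
                  after_date="2024-01-01", before_date="2025-01-01"))
```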

Sample output (post)

{
  "type": "post",
  "id": "1d8bw4c",
  "title": "Best clone of Cool Water?",
  "selftext": "Looking for an affordable alternative...",
  "author": "someuser",
  "subreddit": "fragranceclones",
  "score": 14,
  "num_comments": 8,
  "created_iso": "2024-06-02T10:14:00+00:00",
  "permalink": "https://www.reddit.com/r/fragranceclones/comments/1d8bw4c/..."
}

Use cases

  • Historical backfill — seed a database with years of a subreddit's content
  • Research & sentiment datasets — analyse trends over long time spans
  • AI / RAG training data — large historical corpora by topic
  • Brand / product monitoring — see what was said about a topic over time
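For the trend-analysis use case, a typical first step is to bucket posts by month. The sketch below assumes rows shaped like the sample output, with a `type` field and a `created_iso` timestamp; the function name `posts_per_month` is made up for this example.

```python
from collections import Counter
from datetime import datetime


def posts_per_month(rows):
    """Count posts per calendar month (YYYY-MM), ignoring comment rows."""
    counts = Counter()
    for row in rows:
        if row.get("type") != "post":
            continue
        # created_iso carries a UTC offset, e.g. 2024-06-02T10:14:00+00:00.
        created = datetime.fromisoformat(row["created_iso"])
        counts[created.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


sample = [
    {"type": "post", "created_iso": "2024-06-02T10:14:00+00:00"},
    {"type": "post", "created_iso": "2024-06-20T08:00:00+00:00"},
    {"type": "post", "created_iso": "2024-07-01T00:30:00+00:00"},
    {"type": "comment", "created_iso": "2024-07-01T01:00:00+00:00"},
]
print(posts_per_month(sample))
```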

Cost tips

  • Pay-per-result: you're charged per post/comment returned.
  • Comments are the bulk of the count — keep includeComments off if you only need posts, or cap maxCommentsPerPost.
  • Use afterDate/beforeDate to scope exactly the window you need.
  • Data comes from the public PullPush archive; coverage and freshness depend on that service. For the most recent posts, pair with the live Reddit Scraper.
  • Use data only for lawful purposes and in line with Reddit's and PullPush's terms.
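To see how quickly comments dominate the count, the worst case for a run can be worked out from the input alone. This sketch assumes every returned post and comment counts as one billed result, as the first tip states; the per-1000 price is a parameter you supply from the pricing page, and no real price is assumed here.

```python
def max_billable_rows(max_posts, include_comments=False, max_comments_per_post=0):
    """Upper bound on rows returned: each post plus its capped comments."""
    if not include_comments:
        return max_posts
    return max_posts * (1 + max_comments_per_post)


def max_cost(rows, price_per_1000):
    """Worst-case cost for a given row count; price_per_1000 is a placeholder."""
    return rows / 1000 * price_per_1000


# Posts only, as in the one-year example: at most 10000 rows.
print(max_billable_rows(10000))
# Keyword example with comments capped at 50 per post: at most 51000 rows.
print(max_billable_rows(1000, include_comments=True, max_comments_per_post=50))
```

Real runs usually return fewer rows than this bound, since many posts have fewer comments than the cap.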
