Reddit Historical Archive Scraper

Pricing

from $1.50 / 1,000 results

Access 10+ years of archived Reddit posts and comments via PullPush. Full-text comment search (Reddit can't do this). No login, no proxy. $0.001/item.

Developer

Logiover

Maintained by Community

🗄️ Reddit Historical Archive Scraper — 10+ Years of Reddit Posts & Comments via PullPush

Scrape 10+ years of Reddit history — posts and comments from every public subreddit and user — through one clean Actor. Powered by PullPush.io, the community-maintained open-source successor to Pushshift, this Reddit historical scraper delivers full-text comment-body search (something Reddit's own search literally cannot do), date-range filtering down to the hour, and complete user posting history. No login, no OAuth, no API key, no proxy required. Export structured Reddit data to JSON, CSV or Excel.

It is purpose-built for the jobs Reddit's own API and search are bad at: deep historical research, AI training corpora, academic studies, retrospective brand analysis and sentiment trend backfill. It is not a real-time monitor for brand-new posts.

⚠️ Data freshness: PullPush indexes Reddit content up to approximately May 2025. Use this Actor for historical and retrospective work; use a different tool for live monitoring of posts made this week.

✨ What this Actor does / Key features

  • 📚 10+ years of historical depth — search Reddit content going back to ~2008, far beyond what Reddit's official API exposes.
  • 💬 Reddit-wide comment full-text search — find every comment mentioning a brand, term or phrase across all of Reddit; Reddit's own search only covers post titles and self-text.
  • 🕒 Date-range filtering: afterDate / beforeDate accept ISO timestamps for precise event-window queries.
  • 🎯 Six input modes in one run — subreddits, post IDs (with full comment trees), usernames, post search, comment search, and auto-detected Reddit URLs, all combinable.
  • 🌳 Comment tree reconstruction — depth is rebuilt from the flat archive via parentId traversal (0 = top-level).
  • 🔢 Score & content-type filters: minScore plus a userContent switch for posts-only / comments-only / both.
  • 🛡️ No blocks, no proxy — going through the PullPush archive avoids the HTTP 403s direct Reddit scraping hits.
  • 🧹 Auto-deduplication & polite rate limiting with configurable delay and exponential-backoff retries.
  • 📦 Export-ready output — flat JSON tagged with type: post or type: comment, downloadable as CSV, Excel, JSON or XML.

🔍 Input

| Field | Type | Description |
| --- | --- | --- |
| subreddits | array of strings | Subreddit names (no /r/ prefix) for historical listing scrapes. |
| postIds | array of strings | Reddit post IDs from /comments/XXXX/ URLs; fetches the post plus all archived comments. |
| usernames | array of strings | Reddit usernames (no /u/ prefix); returns full posting history. |
| searchQueries | array of strings | Full-text search across archived Reddit posts (title + self-text). |
| commentSearchQueries | array of strings | Full-text search across archived Reddit comments; unique to this Actor. |
| startUrls | array of strings | Any Reddit URL (subreddit, post, user, search); type is auto-detected. |
| sort | string (enum) | new or top natively supported; hot / rising / controversial / best fall back to new. Default new. |
| afterDate | string (ISO) | Only return items created after this date (e.g. 2023-01-01). |
| beforeDate | string (ISO) | Only return items created before this date (e.g. 2024-12-31). |
| minScore | integer | Only return items with score ≥ this value. |
| userContent | string (enum) | When scraping users: overview (both), submitted (posts only), or comments (comments only). |
| maxItems | integer | Global hard cap across all targets. 0 = unlimited. |
| maxItemsPerTarget | integer | Cap per subreddit / post / user / search. |
| requestDelayMs | integer | Milliseconds between requests. Default 2200 (stays under PullPush's ~30/min limit). |
| maxRetries | integer | Retry attempts with exponential backoff on 429 / 5xx. Default 4. |
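
The retry behaviour behind maxRetries (exponential backoff on 429 / 5xx) can be pictured roughly as follows. This is a minimal sketch, not the Actor's actual code: the doubling factor, the use of the request delay as the base wait, and the names is_retryable, backoff_delays and fetch_with_retries are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Tuple


def is_retryable(status: int) -> bool:
    """Retry on rate limiting (429) and on server errors (5xx)."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(max_retries: int = 4, base_s: float = 2.2,
                   factor: float = 2.0) -> List[float]:
    """Waits before each retry: base, base*factor, base*factor^2, ...
    The doubling factor is an assumption, not a documented value."""
    return [base_s * factor ** attempt for attempt in range(max_retries)]


def fetch_with_retries(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 4, base_s: float = 2.2,
                       sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call fetch() until it returns a non-retryable status or retries run out.

    fetch is any zero-argument callable returning (status_code, body);
    sleep is injectable so the schedule can be checked without waiting.
    """
    delays = backoff_delays(max_retries, base_s)
    status, body = fetch()
    for delay in delays:
        if not is_retryable(status):
            break
        sleep(delay)
        status, body = fetch()
    return status, body
```

With maxRetries at its default of 4, a request is attempted at most five times in this sketch: once initially, then once after each of the four waits.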

🚀 Example input

{
  "subreddits": ["wallstreetbets"],
  "commentSearchQueries": ["GME", "diamond hands"],
  "sort": "top",
  "afterDate": "2021-01-01",
  "beforeDate": "2021-02-28",
  "minScore": 500,
  "maxItems": 5000,
  "maxItemsPerTarget": 2000,
  "requestDelayMs": 2200
}
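
Before starting a long run, it can help to sanity-check an input like the one above locally. The following is a hypothetical pre-flight check, not part of the Actor: check_input and its messages are invented for illustration, and the rules are only those stated in the input table (sort fallbacks, date ordering, non-negative integer caps).

```python
from datetime import date

NATIVE_SORTS = {"new", "top"}
FALLBACK_SORTS = {"hot", "rising", "controversial", "best"}
NON_NEGATIVE_INT_FIELDS = ("maxItems", "maxItemsPerTarget",
                           "requestDelayMs", "maxRetries")

EXAMPLE_INPUT = {
    "subreddits": ["wallstreetbets"],
    "commentSearchQueries": ["GME", "diamond hands"],
    "sort": "top",
    "afterDate": "2021-01-01",
    "beforeDate": "2021-02-28",
    "minScore": 500,
    "maxItems": 5000,
    "maxItemsPerTarget": 2000,
    "requestDelayMs": 2200,
}


def check_input(actor_input: dict) -> list:
    """Return a list of human-readable notes; an empty list means no issues found."""
    notes = []

    sort = actor_input.get("sort", "new")
    if sort in FALLBACK_SORTS:
        notes.append(f"sort '{sort}' falls back to 'new'")
    elif sort not in NATIVE_SORTS:
        notes.append(f"unknown sort '{sort}'")

    after = actor_input.get("afterDate")
    before = actor_input.get("beforeDate")
    if after and before:
        # Only the YYYY-MM-DD part is compared, so same-day hour windows pass.
        if date.fromisoformat(after[:10]) > date.fromisoformat(before[:10]):
            notes.append("afterDate is later than beforeDate")

    for field in NON_NEGATIVE_INT_FIELDS:
        value = actor_input.get(field, 0)
        if not isinstance(value, int) or value < 0:
            notes.append(f"{field} should be a non-negative integer")

    return notes
```

minScore is deliberately not checked for sign, since Reddit scores can be negative.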

📦 Output

Each dataset record carries a type field (post or comment); fields are populated as relevant to each type. Records are flat JSON, exportable to CSV, Excel, JSON or XML, or accessible via the Apify API. Pre-built dataset views are provided for Posts, Comments, Users and Subreddits.

| Field | Description |
| --- | --- |
| type | Record type: post or comment |
| id / fullname | Reddit ID and fullname (t3_… for posts, t1_… for comments) |
| subreddit | Subreddit the item belongs to |
| author | Username ([deleted] if removed by the user) |
| title | Post title (posts only) |
| selftext | Post body text (text posts only) |
| body | Comment text (comments only) |
| url / permalink | External link and direct Reddit permalink |
| score | Score (upvotes − downvotes) at archive time |
| upvoteRatio | Upvote ratio for posts |
| numComments | Comment count for posts |
| createdUtc | ISO timestamp of creation |
| subscribers | Subscriber count (subreddit records) |
| linkKarma / commentKarma | Karma fields (user records) |

Comment records also include a reconstructed depth field. score reflects engagement at archive time, not the current Reddit value. Because Reddit fuzzes the separate upvote and downvote counts, the net score is the more reliable engagement signal.
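
The depth reconstruction mentioned above (parentId traversal over a flat list, 0 = top-level) can be sketched as follows. This is an illustration under assumptions, not the Actor's implementation: it assumes each comment record carries fullname (t1_…) and parentId (t1_… for a reply, t3_… for a top-level comment), and it treats a comment whose parent is missing from the archive as depth 0. The function name compute_depths is hypothetical.

```python
from typing import Dict, List


def compute_depths(comments: List[dict]) -> Dict[str, int]:
    """Map each comment fullname to its depth in the thread (0 = top-level)."""
    parent_of = {c["fullname"]: c.get("parentId") for c in comments}
    depths: Dict[str, int] = {}

    for start in parent_of:
        chain = []
        node = start
        # Climb towards the post until we hit a comment with a known depth,
        # a parent that is not an archived comment, or a loop in bad data.
        while node in parent_of and node not in depths and node not in chain:
            chain.append(node)
            node = parent_of[node]

        # If the climb ended on a comment with a known depth, continue from it;
        # otherwise the topmost comment in the chain is treated as top-level.
        base = depths[node] + 1 if node in depths else 0
        for offset, name in enumerate(reversed(chain)):
            depths[name] = base + offset

    return depths
```

The walk is iterative and memoised, so very deep threads do not hit Python's recursion limit and each ancestor chain is climbed only once.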

💡 Use cases

  • AI / ML engineers — build large structured Reddit training corpora from Q&A subreddits (r/AskHistorians, r/explainlikeimfive) for fine-tuning and RAG pipelines.
  • Brand & reputation analysts — search every historical comment mentioning a brand to reconstruct incidents and sentiment evolution.
  • Academic & computational social science teams — run longitudinal discourse, linguistic and election research with precise date windows.
  • Quant & finance researchers — backfill historical Reddit sentiment signals from r/wallstreetbets, r/CryptoCurrency and r/stocks for backtesting.
  • Journalists & accountability researchers — recover content deleted or removed after archiving, with [deleted] vs [removed] distinguishable.
  • B2B & growth teams — map power users and niche thought leaders via full historical posting history.

❓ Frequently Asked Questions

Is this scraper legal? It accesses only publicly available archived Reddit content via the PullPush.io public archive, which is widely used by researchers and journalists. You are responsible for complying with Reddit's and PullPush's terms and with privacy laws (GDPR, CCPA). Do not use it to re-identify anonymous users or for unsolicited outreach.

Do I need a login, API key or proxy? No. PullPush's archive endpoints are public and anonymous — no OAuth, no API key, no proxy. This also avoids the HTTP 403 blocks that direct Reddit scrapers hit from datacenter IPs.

Why is the data only current through ~May 2025? PullPush is community-maintained on a best-effort basis and its most recent indexing is around May 2025. For newer content use a real-time Reddit scraper; for deep historical data PullPush is the most comprehensive public source.

What's the difference between searchQueries and commentSearchQueries? searchQueries searches post titles and self-text across Reddit. commentSearchQueries searches comment bodies across all of Reddit — a capability Reddit's own search does not offer, and often the more valuable one since most mentions live in comments.

How much data can I get and how fast? At the default 2200 ms delay (~27 requests/min, under PullPush's limit) throughput is roughly 25 items/min, so a 1,000-item run takes about 40 minutes and a 10,000-item run about 400 minutes (6 to 7 hours). For very large datasets, run on a schedule and accumulate.
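
The figures in this answer follow from simple arithmetic, sketched here for planning other run sizes. The 25 items/min rate is the rough throughput quoted above, not a measured guarantee, and both function names are hypothetical.

```python
def requests_per_minute(delay_ms: int = 2200) -> float:
    """Requests that fit into one minute at a fixed delay between requests."""
    return 60_000 / delay_ms


def estimated_minutes(items: int, items_per_minute: float = 25.0) -> float:
    """Approximate run time in minutes at the quoted throughput."""
    return items / items_per_minute


if __name__ == "__main__":
    print(f"{requests_per_minute():.1f} requests/min at the default delay")
    for n in (1_000, 10_000):
        minutes = estimated_minutes(n)
        print(f"{n} items: about {minutes:.0f} min ({minutes / 60:.1f} h)")
```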

Are deleted or removed posts and comments included? Often yes — if PullPush archived the item before deletion. Items deleted by the user show [deleted]; items removed by moderators are frequently still readable. This recoverability is a key research use case.

What output formats are supported and can I integrate it? Every run produces a structured dataset exportable to JSON, CSV, Excel or XML, and accessible via the Apify API and webhooks for use with Google Sheets, data warehouses, or LangChain / LlamaIndex pipelines.
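
As one example of downstream use, an exported JSON dataset can be split into posts and comments using the type field. This is a small sketch with made-up sample records; the field names follow the output table, while split_by_type and the sample values are purely illustrative.

```python
import json
from collections import defaultdict
from typing import Dict, List


def split_by_type(records: List[dict]) -> Dict[str, List[dict]]:
    """Group dataset records by their type field (post or comment)."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get("type", "unknown")].append(record)
    return dict(groups)


# Made-up records standing in for an exported JSON dataset.
EXPORTED_JSON = """
[
  {"type": "post", "id": "abc123", "title": "Example post", "score": 900},
  {"type": "comment", "id": "def456", "body": "Example comment", "score": 650},
  {"type": "comment", "id": "ghi789", "body": "Another comment", "score": 510}
]
"""

groups = split_by_type(json.loads(EXPORTED_JSON))
posts = groups.get("post", [])
comments = groups.get("comment", [])
```

From here, each group can be written to its own table or index, which keeps post-only fields (title, selftext) and comment-only fields (body, depth) from producing sparse columns.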

⏰ Scheduling & integration

Schedule this Actor on Apify to grow a historical Reddit dataset over time, and export results to JSON, CSV or Excel. Use the Apify API and webhooks to push data into Google Sheets, BigQuery, Snowflake, PostgreSQL, or vector databases for semantic search and RAG.


Not affiliated with Reddit, Inc. or PullPush.io. Reddit® is a registered trademark of Reddit, Inc.


Changelog

  • 2026-06-01 — Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.

  • 2026-05-25 — Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.

Last reviewed: 2026-06-01.