Reddit Historical Archive Scraper
Pricing
from $1.50 / 1,000 results
Access 10+ years of archived Reddit posts and comments via PullPush. Full-text comment search (Reddit can't do this). No login, no proxy. $0.001/item.
Developer
Logiover
Maintained by Community
🗄️ Reddit Historical Archive Scraper — 10+ Years of Reddit Posts & Comments via PullPush

Scrape 10+ years of Reddit history — posts and comments from every public subreddit and user — through one clean Actor. Powered by PullPush.io, the community-maintained open-source successor to Pushshift, this Reddit historical scraper delivers full-text comment-body search (something Reddit's own search literally cannot do), date-range filtering down to the hour, and complete user posting history. No login, no OAuth, no API key, no proxy required. Export structured Reddit data to JSON, CSV or Excel.
It is purpose-built for the jobs Reddit's own API and search are bad at: deep historical research, AI training corpora, academic studies, retrospective brand analysis and sentiment trend backfill. It is not a real-time monitor for brand-new posts.
⚠️ Data freshness: PullPush indexes Reddit content up to approximately May 2025. Use this Actor for historical and retrospective work; use a different tool for live monitoring of posts made this week.
✨ What this Actor does / Key features
- 📚 10+ years of historical depth: search Reddit content going back to ~2008, far beyond what Reddit's official API exposes.
- 💬 Reddit-wide comment full-text search: find every comment mentioning a brand, term or phrase across all of Reddit; Reddit's own search only covers post titles and self-text.
- 🕒 Date-range filtering: `afterDate` / `beforeDate` accept ISO timestamps for precise event-window queries.
- 🎯 Six input modes in one run: subreddits, post IDs (with full comment trees), usernames, post search, comment search, and auto-detected Reddit URLs, all combinable.
- 🌳 Comment tree reconstruction: `depth` is rebuilt from the flat archive via `parentId` traversal (0 = top-level).
- 🔢 Score & content-type filters: `minScore` plus a `userContent` switch for posts-only / comments-only / both.
- 🛡️ No blocks, no proxy: going through the PullPush archive avoids the HTTP 403s direct Reddit scraping hits.
- 🧹 Auto-deduplication & polite rate limiting with configurable delay and exponential-backoff retries.
- 📦 Export-ready output: flat JSON tagged with `type: post` or `type: comment`, downloadable as CSV, Excel, JSON or XML.
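The "polite rate limiting" and "exponential-backoff retries" items above can be pictured with a small sketch. It assumes the wait doubles after every failed attempt, starting from the request delay; the Actor's real schedule is not documented here, and `RetryableError`, `backoff_delays` and `call_with_retries` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for a 429 or 5xx response."""


def backoff_delays(base_ms: int = 2200, max_retries: int = 4) -> List[int]:
    """Waits in milliseconds before each retry (assumed doubling schedule)."""
    return [base_ms * 2 ** attempt for attempt in range(max_retries)]


def call_with_retries(
    fetch: Callable[[], T],
    base_ms: int = 2200,
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), sleeping and retrying on RetryableError."""
    for delay_ms in backoff_delays(base_ms, max_retries):
        try:
            return fetch()
        except RetryableError:
            sleep(delay_ms / 1000)
    # Final attempt: any error now propagates to the caller.
    return fetch()
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; the defaults mirror the `requestDelayMs` and `maxRetries` defaults from the input table.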
🔍 Input
| Field | Type | Description |
|---|---|---|
| `subreddits` | array of strings | Subreddit names (no `/r/` prefix) for historical listing scrapes. |
| `postIds` | array of strings | Reddit post IDs from `/comments/XXXX/` URLs; fetches the post plus all archived comments. |
| `usernames` | array of strings | Reddit usernames (no `/u/` prefix); returns full posting history. |
| `searchQueries` | array of strings | Full-text search across archived Reddit posts (title + self-text). |
| `commentSearchQueries` | array of strings | Full-text search across archived Reddit comments (unique to this Actor). |
| `startUrls` | array of strings | Any Reddit URL (subreddit, post, user, search); the type is auto-detected. |
| `sort` | string (enum) | `new` and `top` are natively supported; `hot`, `rising`, `controversial` and `best` fall back to `new`. Default: `new`. |
| `afterDate` | string (ISO) | Only return items created after this date (e.g. `2023-01-01`). |
| `beforeDate` | string (ISO) | Only return items created before this date (e.g. `2024-12-31`). |
| `minScore` | integer | Only return items with score ≥ this value. |
| `userContent` | string (enum) | When scraping users: `overview` (both), `submitted` (posts only), or `comments` (comments only). |
| `maxItems` | integer | Global hard cap across all targets. `0` = unlimited. |
| `maxItemsPerTarget` | integer | Cap per subreddit / post / user / search. |
| `requestDelayMs` | integer | Milliseconds between requests. Default `2200` (stays under PullPush's ~30/min limit). |
| `maxRetries` | integer | Retry attempts with exponential backoff on 429 / 5xx. Default `4`. |
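To illustrate what "auto-detected" could mean for `startUrls`, here is a minimal sketch that classifies a Reddit URL by its path. The patterns and the `classify_reddit_url` helper are assumptions made for this example, not the Actor's actual detection logic.

```python
import re
from typing import Tuple
from urllib.parse import parse_qs, urlparse


def classify_reddit_url(url: str) -> Tuple[str, str]:
    """Return (kind, value), where kind is post, user, search or subreddit."""
    parsed = urlparse(url)
    path = parsed.path

    # Post URLs carry the post ID right after /comments/.
    match = re.match(r"^(?:/r/[^/]+)?/comments/([A-Za-z0-9]+)", path)
    if match:
        return ("post", match.group(1))

    # Both /u/name and /user/name forms are accepted.
    match = re.match(r"^/(?:u|user)/([^/]+)", path)
    if match:
        return ("user", match.group(1))

    # Site-wide or subreddit search pages; the query is in the q parameter.
    if re.match(r"^(?:/r/[^/]+)?/search/?$", path):
        return ("search", parse_qs(parsed.query).get("q", [""])[0])

    match = re.match(r"^/r/([^/]+)", path)
    if match:
        return ("subreddit", match.group(1))

    raise ValueError(f"Unrecognised Reddit URL: {url}")
```

Checking the most specific patterns first (post, then user, then search) keeps a subreddit prefix in a post URL from being mistaken for a subreddit listing.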
🚀 Example input
    {
        "subreddits": ["wallstreetbets"],
        "commentSearchQueries": ["GME", "diamond hands"],
        "sort": "top",
        "afterDate": "2021-01-01",
        "beforeDate": "2021-02-28",
        "minScore": 500,
        "maxItems": 5000,
        "maxItemsPerTarget": 2000,
        "requestDelayMs": 2200
    }
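Before launching a long run it can help to sanity-check the date window. The sketch below converts `afterDate` / `beforeDate` to UTC epoch seconds and rejects an empty window. Treating timezone-less values as UTC is an assumption, and `iso_to_epoch` and `date_window` are hypothetical helper names; how the Actor itself interprets these strings is not stated here.

```python
from datetime import datetime, timezone
from typing import Tuple


def iso_to_epoch(value: str) -> int:
    """Parse an ISO date or timestamp and return UTC epoch seconds."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are meant as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def date_window(run_input: dict) -> Tuple[int, int]:
    """Return (after, before) in epoch seconds, or raise if the window is empty."""
    after = iso_to_epoch(run_input["afterDate"])
    before = iso_to_epoch(run_input["beforeDate"])
    if after >= before:
        raise ValueError("afterDate must be earlier than beforeDate")
    return after, before
```

Under this reading, a date-only `beforeDate` such as `2021-02-28` means midnight at the start of that day, so items posted later on 28 February would fall outside the window. If that day matters, a full timestamp such as `2021-02-28T23:59:59` is the safer choice.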
📦 Output
Each dataset record carries a type field (post or comment); fields are populated as relevant to each type. Records are flat JSON, exportable to CSV, Excel, JSON or XML, or accessible via the Apify API. Pre-built dataset views are provided for Posts, Comments, Users and Subreddits.
| Field | Description |
|---|---|
| `type` | Record type: `post` or `comment` |
| `id` / `fullname` | Reddit ID and fullname (`t3_…` for posts, `t1_…` for comments) |
| `subreddit` | Subreddit the item belongs to |
| `author` | Username (`[deleted]` if removed by the user) |
| `title` | Post title (posts only) |
| `selftext` | Post body text (text posts only) |
| `body` | Comment text (comments only) |
| `url` / `permalink` | External link and direct Reddit permalink |
| `score` | Score (upvotes − downvotes) at archive time |
| `upvoteRatio` | Upvote ratio for posts |
| `numComments` | Comment count for posts |
| `createdUtc` | ISO timestamp of creation |
| `subscribers` | Subscriber count (subreddit records) |
| `linkKarma` / `commentKarma` | Karma fields (user records) |
Comment records also include a reconstructed `depth` field. `score` reflects engagement at archive time, not the current Reddit value; Reddit fuzzes vote counts, so `score` is the reliable engagement signal.
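As an illustration of how a `depth` value can be rebuilt from a flat list of comments, the sketch below walks each comment's parent chain upward. It assumes every comment has an `id` and a `parentId` holding a Reddit fullname (`t3_…` for the post, `t1_…` for another comment), and it treats comments whose parent is missing from the archive as top-level. Both are assumptions, and `reconstruct_depths` is a hypothetical name rather than the Actor's code.

```python
from typing import Dict, List


def reconstruct_depths(comments: List[dict]) -> Dict[str, int]:
    """Map each comment id to its depth (0 = direct reply to the post)."""
    # Assumption: parentId is a fullname, "t3_<post>" or "t1_<comment>".
    parent_of = {"t1_" + c["id"]: c["parentId"] for c in comments}
    depths: Dict[str, int] = {}
    for start in parent_of:
        chain: List[str] = []
        node = start
        # Climb until we reach a post, an unarchived parent,
        # a comment whose depth is already known, or a cycle.
        while node in parent_of and node not in depths and node not in chain:
            chain.append(node)
            node = parent_of[node]
        base = depths[node] + 1 if node in depths else 0
        # chain is ordered leaf to root, so assign depths root-first.
        for offset, name in enumerate(reversed(chain)):
            depths[name] = base + offset
    return {name[3:]: depth for name, depth in depths.items()}
```

The loop is iterative rather than recursive so very deep threads do not hit Python's recursion limit, and already-computed depths are reused so each comment is resolved once.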
💡 Use cases
- AI / ML engineers: build large structured Reddit training corpora from Q&A subreddits (r/AskHistorians, r/explainlikeimfive) for fine-tuning and RAG pipelines.
- Brand & reputation analysts: search every historical comment mentioning a brand to reconstruct incidents and sentiment evolution.
- Academic & computational social science teams: run longitudinal discourse, linguistic and election research with precise date windows.
- Quant & finance researchers: backfill historical Reddit sentiment signals from r/wallstreetbets, r/CryptoCurrency and r/stocks for backtesting.
- Journalists & accountability researchers: recover content deleted or removed after archiving, with `[deleted]` vs `[removed]` distinguishable.
- B2B & growth teams: map power users and niche thought leaders via full historical posting history.
❓ Frequently Asked Questions
Is this scraper legal? It accesses only publicly available archived Reddit content via the PullPush.io public archive, which is widely used by researchers and journalists. You are responsible for complying with Reddit's and PullPush's terms and with privacy laws (GDPR, CCPA). Do not use it to re-identify anonymous users or for unsolicited outreach.
Do I need a login, API key or proxy? No. PullPush's archive endpoints are public and anonymous — no OAuth, no API key, no proxy. This also avoids the HTTP 403 blocks that direct Reddit scrapers hit from datacenter IPs.
Why is the data only current through ~May 2025? PullPush is community-maintained on a best-effort basis and its most recent indexing is around May 2025. For newer content use a real-time Reddit scraper; for deep historical data PullPush is the most comprehensive public source.
What's the difference between searchQueries and commentSearchQueries?
searchQueries searches post titles and self-text across Reddit. commentSearchQueries searches comment bodies across all of Reddit — a capability Reddit's own search does not offer, and often the more valuable one since most mentions live in comments.
How much data can I get and how fast? At the default 2200 ms delay (about 27 requests/min, under PullPush's ~30/min limit) throughput is roughly 25 items/min, so a 1,000-item run takes about 40 minutes and a 10,000-item run roughly 6 to 7 hours. For very large datasets, run on a schedule and accumulate.
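The arithmetic behind those figures can be reproduced with a short sketch. It takes the ~25 items/min throughput quoted above as given rather than measured, and the function names are hypothetical.

```python
def requests_per_minute(delay_ms: int = 2200) -> float:
    """Requests issued per minute at a fixed delay between requests."""
    return 60_000 / delay_ms


def estimated_minutes(items: int, items_per_minute: float = 25.0) -> float:
    """Rough run time in minutes; 25 items/min is the figure quoted above."""
    return items / items_per_minute
```

For example, `estimated_minutes(10_000) / 60` comes to a little under 7 hours, which is why large backfills are better split across scheduled runs.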
Are deleted or removed posts and comments included?
Often yes — if PullPush archived the item before deletion. Items deleted by the user show [deleted]; items removed by moderators are frequently still readable. This recoverability is a key research use case.
What output formats are supported and can I integrate it? Every run produces a structured dataset exportable to JSON, CSV, Excel or XML, and accessible via the Apify API and webhooks for use with Google Sheets, data warehouses, or LangChain / LlamaIndex pipelines.
⏰ Scheduling & integration
Schedule this Actor on Apify to grow a historical Reddit dataset over time, and export results to JSON, CSV or Excel. Use the Apify API and webhooks to push data into Google Sheets, BigQuery, Snowflake, PostgreSQL, or vector databases for semantic search and RAG.
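When scheduled runs accumulate into one dataset, the same post or comment can appear in more than one run. A minimal merge sketch, assuming each record carries the `fullname` (or `type` plus `id`) fields from the output table and that the most recent copy should win; `merge_runs` is a hypothetical helper, not part of the Actor:

```python
from typing import Dict, Iterable, List


def merge_runs(*runs: Iterable[dict]) -> List[dict]:
    """Merge records from several runs, keeping one copy per Reddit item."""
    merged: Dict[str, dict] = {}
    for run in runs:
        for record in run:
            # Prefer the fullname; fall back to type + id if it is absent.
            key = record.get("fullname") or f"{record['type']}:{record['id']}"
            merged[key] = record  # a later run overwrites an earlier copy
    return list(merged.values())
```

Items keep the position of their first appearance, so the merged list stays stable from one scheduled run to the next.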
Not affiliated with Reddit, Inc. or PullPush.io. Reddit® is a registered trademark of Reddit, Inc.
Changelog
- 2026-06-01: Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.
- 2026-05-25: Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.
Last reviewed: 2026-06-01.
