Substack Scraper: Posts, Comments & Newsletter Leaderboards

Pricing

from $0.002 / actor start

Scrape Substack: post archives, full content, comments, author profiles, leaderboards. No login. 6 modes. Half the price of competitors.

Rating: 0.0 (0)

Developer: Charlie Krug (maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified a day ago

Substack Scraper

Scrape Substack's public data with no login and no API key. Pull full post archives, individual posts with clean text, top-level and nested comments, author profiles, ranked publication leaderboards, and keyword search results: six data types in one actor, where most tools cover only one or two. Date filtering, audience filtering, and AI-ready clean-text output are built in.


What makes this different

| Feature | This Actor | Typical competitors |
| --- | --- | --- |
| Comments + replies (depth-first flat list) | ✅ | ❌ rarely offered |
| Subscriber count estimates from Substack's ranking data | ✅ | |
| Clean plain text extracted from body HTML | ✅ (automatic) | ❌ raw HTML only |
| Date range filtering for archive | ✅ | |
| Post type filter (newsletter / podcast / thread) | ✅ | |
| Custom domain support (e.g. platformer.news) | ✅ | partial |
| All 6 data types in one actor | ✅ | 1–2 per actor |
| No login / no cookies required | ✅ | |

Modes

publication — Post archive

Pull a publication's full post list, newest-first. Paginated automatically.

Input: publication (subdomain or URL), limit, audienceFilter, postType, dateFrom, dateTo, includeBody

Output fields per post:

{
  "postId": 140602898,
  "title": "Why Platformer is leaving Substack",
  "subtitle": "Casey explains the move to Ghost",
  "url": "https://platformer.substack.com/p/why-platformer-is-leaving-substack",
  "slug": "why-platformer-is-leaving-substack",
  "postDate": "2024-01-12T18:00:00.000Z",
  "type": "newsletter",
  "audience": "everyone",
  "isPaid": false,
  "reactionCount": 312,
  "commentCount": 54,
  "restackCount": 28,
  "wordcount": 2300,
  "coverImage": "https://...",
  "description": "Casey explains why Platformer is moving to Ghost.",
  "authors": [{ "name": "Casey Newton", "handle": "platformer", "photoUrl": "https://..." }],
  "tags": ["tech", "media"],
  "publication": "platformer",
  "bodyHtml": "<p>Full HTML content...</p>",
  "bodyText": "Full content here in clean plain text..."
}

post — Single post

Full content of one post. Paywalled posts return only the public preview — no paywall bypass.

Input: publication, slug (slug or full post URL)


comments — Post comments ⭐ rarely offered

All top-level comments plus nested replies, depth-first flattened — charged at the same flat per-record rate as everything else, with no per-comment surcharge.

Input: publication, slug, limit

Output fields per comment:

{
  "id": 47125539,
  "postId": 140602898,
  "postSlug": "why-platformer-is-leaving-substack",
  "postTitle": "Why Platformer is leaving Substack",
  "parentId": null,
  "depth": 0,
  "authorName": "Gordon Strause",
  "authorHandle": "gordonstrause",
  "body": "Too bad. I think Substack's policies are the right ones...",
  "date": "2024-01-12T02:18:02.661Z",
  "reactionCount": 132,
  "childCount": 2,
  "isDeleted": false,
  "isPinned": false
}

A direct reply has "depth": 1 and "parentId" set to the id of the comment it answers.
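Because the comment list is flat, rebuilding the conversation tree is left to the consumer. The following is a minimal sketch, assuming each record carries the `id` and `parentId` fields shown above; the function name `build_comment_tree` and the added `replies` key are illustrative and not part of the actor's output.

```python
from typing import Any


def build_comment_tree(flat_comments: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Nest a flat comment list into a tree by following id / parentId links."""
    # Copy every record and give it an empty "replies" list (hypothetical key).
    nodes = {c["id"]: {**c, "replies": []} for c in flat_comments}
    roots = []
    for comment in flat_comments:
        node = nodes[comment["id"]]
        parent = nodes.get(comment["parentId"])
        if parent is None:
            # Top-level comment, or a reply whose parent is not in the dataset
            # (for example, if the run was cut short by `limit`).
            roots.append(node)
        else:
            parent["replies"].append(node)
    return roots


sample = [
    {"id": 1, "parentId": None, "depth": 0},
    {"id": 2, "parentId": 1, "depth": 1},
    {"id": 3, "parentId": 2, "depth": 2},
    {"id": 4, "parentId": None, "depth": 0},
]
tree = build_comment_tree(sample)
```

Keying on `parentId` rather than on list order means the sketch does not depend on the depth-first ordering being preserved after export.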


author — Publication / author profile

Author bio, publication description, custom domain, paid status. Extracted from the publication's own API — no authentication required.

Input: publication (subdomain or URL) or handle

Output:

{
  "name": "Casey Newton",
  "handle": "platformer",
  "subdomain": "platformer",
  "customDomain": "www.platformer.news",
  "bio": "Casey Newton is the founder and editor of Platformer...",
  "photoUrl": "https://...",
  "twitterHandle": "CaseyNewton",
  "publicationName": "Platformer",
  "publicationDescription": "News at the intersection of Silicon Valley and democracy.",
  "publicationLogoUrl": "https://...",
  "hasPaid": false
}

category — Leaderboard ⭐ with subscriber counts

Ranked publications in any of Substack's 32 categories, including subscriber-count estimates taken from Substack's ranking data.

Input: category (slug or id), limit

Valid category slugs include: technology, business, finance, health, science, culture, sports, news, music, crypto, education, literature, fiction, philosophy, climate, travel, parenting, design, art, humor, comics, history, faith, food, film-and-tv, home-garden, international, podcast.

Output:

{
  "name": "The Pragmatic Engineer",
  "subdomain": "pragmaticengineer",
  "customDomain": null,
  "description": "The #1 technology newsletter on Substack...",
  "logo": "https://...",
  "authorName": "Gergely Orosz",
  "authorHandle": "pragmaticengineer",
  "hasPaid": true,
  "subscriberCountEstimate": "1.1M+",
  "rankingScore": 10000,
  "tier": 2,
  "type": "newsletter"
}
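The `subscriberCountEstimate` field is a display string, so sorting or filtering by audience size needs a numeric conversion first. A minimal sketch, assuming the values follow the `1.1M+` / `228K+` pattern seen in this README; the helper name and the handling of any other format are assumptions.

```python
from decimal import Decimal, InvalidOperation

# Suffix multipliers; only K and M appear in this README, B is an assumption.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_subscriber_estimate(value):
    """Turn '1.1M+' or '228K+' into a lower-bound integer, or None if unparseable."""
    if not value:
        return None
    text = value.strip().rstrip("+").replace(",", "").upper()
    if not text:
        return None
    multiplier = 1
    if text[-1] in _MULTIPLIERS:
        multiplier = _MULTIPLIERS[text[-1]]
        text = text[:-1]
    try:
        # Decimal avoids float rounding surprises on values such as 1.1.
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None


rows = [
    {"name": "A", "subscriberCountEstimate": "228K+"},
    {"name": "B", "subscriberCountEstimate": "1.1M+"},
    {"name": "C", "subscriberCountEstimate": None},
]
ranked = sorted(
    rows,
    key=lambda r: parse_subscriber_estimate(r["subscriberCountEstimate"]) or 0,
    reverse=True,
)
```

The trailing `+` marks the figure as a floor, so the parsed number is best treated as a lower bound rather than an exact count.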

search — Keyword search

Search posts and publications by keyword.

⚠️ Note: Substack's search endpoint returns empty results for anonymous requests (no session cookie), so expect this mode to return few or no records. For reliable discovery, use category mode instead.


Filters (publication mode)

| Filter | Input field | Example |
| --- | --- | --- |
| Audience | audienceFilter | free, paid, all |
| Date range | dateFrom, dateTo | 2024-01-01, 2024-12-31 |
| Content type | postType | newsletter, podcast, thread |
| Full body | includeBody | true (adds bodyHtml + bodyText per post) |
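The filters above can be combined into one input object. The sketch below validates the values listed in the table before a run is started; the field names come from this README, while the `"mode"` key, the default values, and the function name are assumptions that should be checked against the actor's input schema.

```python
from datetime import date

AUDIENCES = {"free", "paid", "all"}
POST_TYPES = {"newsletter", "podcast", "thread"}


def build_publication_input(publication, limit=50, audience_filter="all",
                            post_type=None, date_from=None, date_to=None,
                            include_body=False):
    """Build and validate an input dict for publication mode (hypothetical helper)."""
    if audience_filter not in AUDIENCES:
        raise ValueError(f"audienceFilter must be one of {sorted(AUDIENCES)}")
    if post_type is not None and post_type not in POST_TYPES:
        raise ValueError(f"postType must be one of {sorted(POST_TYPES)}")
    # date.fromisoformat raises ValueError on malformed dates.
    start = date.fromisoformat(date_from) if date_from else None
    end = date.fromisoformat(date_to) if date_to else None
    if start and end and start > end:
        raise ValueError("dateFrom must not be later than dateTo")

    run_input = {
        "mode": "publication",  # assumed key name for selecting the mode
        "publication": publication,
        "limit": limit,
        "audienceFilter": audience_filter,
        "includeBody": include_body,
    }
    # Optional filters are only included when set.
    if post_type:
        run_input["postType"] = post_type
    if date_from:
        run_input["dateFrom"] = date_from
    if date_to:
        run_input["dateTo"] = date_to
    return run_input


example = build_publication_input(
    "platformer", limit=3, audience_filter="free",
    date_from="2024-01-01", date_to="2024-12-31", include_body=True,
)
```

Validating locally catches a mistyped filter before the actor-start fee is charged.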

Use cases

Newsletter research & competitive intelligence: Use publication mode to pull a competitor's full archive. Analyze posting frequency, topic mix (via tags), and engagement trends (reactionCount, commentCount) over time.

Lead generation for agencies: Use category mode to pull the top 200 tech newsletters sorted by Substack's own ranking, with subscriber-count estimates (1.1M+, 228K+, etc.) and contact handles. Export to CSV for outreach.

AI training data: Use publication with includeBody: true + audienceFilter: free for large batches of high-quality long-form text. bodyText is already clean, so no HTML stripping is needed. A 1,000-post archive of a major publication costs ~$0.30.

Audience sentiment analysis: Use comments mode to pull every comment + reply thread for a specific post. Analyze sentiment, top commenters, and reaction counts. The depth field lets you reconstruct the full conversation tree.

Author discovery: Combine category and author modes: pull the top 50 tech newsletters, then loop over each subdomain with author mode to get bios, Twitter handles, and paid status for a complete contact list.
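For the newsletter research use case, posting frequency and engagement trends can be computed directly from exported post records. A minimal sketch, assuming the `postDate`, `reactionCount`, and `commentCount` fields shown in the publication output; the function name and the shape of the result are illustrative.

```python
from collections import defaultdict


def monthly_engagement(posts):
    """Group posts by YYYY-MM and total their reaction and comment counts."""
    buckets = defaultdict(lambda: {"posts": 0, "reactions": 0, "comments": 0})
    for post in posts:
        month = post["postDate"][:7]  # "2024-01-12T18:00:00.000Z" -> "2024-01"
        bucket = buckets[month]
        bucket["posts"] += 1
        # Treat missing or null counts as zero.
        bucket["reactions"] += post.get("reactionCount") or 0
        bucket["comments"] += post.get("commentCount") or 0
    return {
        month: {**b, "avgReactions": b["reactions"] / b["posts"]}
        for month, b in sorted(buckets.items())
    }


sample_posts = [
    {"postDate": "2024-01-12T18:00:00.000Z", "reactionCount": 312, "commentCount": 54},
    {"postDate": "2024-01-20T18:00:00.000Z", "reactionCount": 100, "commentCount": 6},
    {"postDate": "2024-02-02T18:00:00.000Z", "reactionCount": 40, "commentCount": None},
]
summary = monthly_engagement(sample_posts)
```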


Pricing — $0.30 per 1,000 records

Matches the lowest per-record price among the Substack scrapers compared below, with one flat rate for every data type.

| Event | Price |
| --- | --- |
| Actor start (once per run) | $0.002 |
| Per record returned | $0.0003 |

| What you get | Records | Cost |
| --- | --- | --- |
| Quick 50-post archive | 50 | ~$0.02 |
| Full archive of a 500-post newsletter | 500 | ~$0.15 |
| 1,000-post AI training dataset | 1,000 | ~$0.30 |
| Top 200 tech newsletters (category mode) | 200 | ~$0.06 |
| All comments on a popular post | 500 | ~$0.15 |
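The figures in the table follow from the two event prices: one start fee per run plus a flat fee per record. A small sketch of that arithmetic, using the prices listed above; the function name is illustrative.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.002")   # charged once per run
PER_RECORD_USD = Decimal("0.0003")   # charged per record returned


def estimate_run_cost(records, runs=1):
    """Estimated cost in USD for `records` records spread over `runs` runs."""
    return runs * ACTOR_START_USD + records * PER_RECORD_USD


for n in (50, 200, 500, 1000):
    print(f"{n:>5} records: ${estimate_run_cost(n)}")
```

Splitting one job into many small runs adds a start fee each time, so batching records into fewer runs keeps the total closer to the per-record rate.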

How we compare

Per 1,000 records, verified live on Apify Store (June 2026):

| Actor | Price / 1K | Comments | Data types |
| --- | --- | --- | --- |
| This actor | $0.30 | ✅ flat rate | 6 (posts, content, comments, authors, leaderboards, search) |
| sourabhbgp/substack-scraper | $0.30 | ➕ extra per-comment fee | 3 |
| benthepythondev/newsletter-scraper | $1.00 | | 1 |
| easyapi/substack-* | $2.99–4.99 each | | 1 per actor (6 separate actors to match this one) |
| automation-lab/substack-scraper | higher | | 1 |

Same lowest price as the nearest rival — but comments are included at the flat rate (others surcharge), and it's all six data types in one actor instead of six separate purchases.

Why we can price this low: pure JSON API — no Puppeteer, no proxy, no JS rendering. The extra request budget (for includeBody) is the Actor's cost, not yours.


Run locally

# Unit tests (95 tests, no network required)
python3 tests/test_substack.py
# Quick live test — 3 posts from Platformer
python3 -c "
from src.substack import archive
import json
rows = archive('platformer', limit=3)
print(json.dumps(rows, indent=2))
"
# Full Actor run (requires: pip install apify)
apify run

Publish to Apify

npm i -g apify-cli
apify login # paste your API token
apify push

Full step-by-step for pay-per-event pricing → PUBLISH.md


Scraping Substack's public, no-login post data falls under the same legal framework as other public-web scraping. In Meta Platforms v. Bright Data (2024), a US federal court held that Meta's terms of service did not prohibit scraping publicly accessible, logged-out content, which supports the position that this kind of scraping is defensible. This actor:

  • Never bypasses paywalls — paywalled posts return only the public preview
  • Never requires login — all data is publicly accessible without authentication
  • Is polite — built-in rate limiting (0.3s between archive pages, 0.5s between body fetches)
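The rate limiting described in the last bullet can be approximated in client code that calls the same endpoints. This is a sketch of the general technique, not the actor's own implementation; the delay values come from the bullet above, and the helper name and the injectable `sleep` parameter are assumptions made for testability.

```python
import time

ARCHIVE_PAGE_DELAY_S = 0.3  # pause between archive pages
BODY_FETCH_DELAY_S = 0.5    # pause between post body fetches


def throttled(items, delay_s, sleep=time.sleep):
    """Yield items one by one, pausing delay_s between consecutive items."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(delay_s)
        yield item


# Usage: record the pauses instead of sleeping, to show the behaviour.
pauses = []
slugs = ["post-a", "post-b", "post-c"]
visited = list(throttled(slugs, BODY_FETCH_DELAY_S, sleep=pauses.append))
```

Because `throttled` is a generator, the pause happens only when the caller asks for the next item, so the delay sits between requests rather than after the last one.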

Endpoint status (verified 2026-06-30)

| Endpoint | Status |
| --- | --- |
| Archive | ✅ Stable |
| Single post | ✅ Stable |
| Comments | ✅ Stable |
| Author (via bylines) | ✅ Stable |
| Category leaderboard | ✅ Stable |
| Categories list | ✅ Stable |
| Search | ⚠️ Returns empty without session cookie |