Substack Scraper - Posts, Authors, Reactions & Newsletters
Pricing
Pay per event
Scrape Substack newsletters via each publication's public /api/v1/archive endpoint. Title, author, bio, audience (free/paid), reactions, comments, cover, podcast duration. HTTP only, $5/1K.
Developer
deusex machine
Maintained by Community
Substack Scraper — Posts, Authors, Reactions & Newsletter Metadata
Scrape Substack — the largest paid newsletter platform with over 5 million paid subscriptions and 50,000+ active writers — using the canonical public API endpoints each publication ships at /api/v1/archive. Extract every post with full metadata: title, subtitle, byline, audience (free vs paid), reactions, comments count, cover image, podcast duration and canonical URL. HTTP-only, no browser, no proxy, $5 per 1,000 posts.
If you run an ad network for newsletters, manage a sponsorship platform, build a sales tool for B2B SaaS targeting Substack writers, train an LLM on long-form journalism, monitor competitive content strategies, or run an analytics dashboard for the creator economy, this Substack scraper is the cleanest, fastest path to the data.
Why use this Substack scraper
Substack has no advertised public API, but every publication on the platform exposes its archive at /api/v1/archive — the same endpoint Substack's own web app uses to render the post list. The data is canonical, complete and stable.
This actor uses that endpoint directly. That means:
- ✅ Stable source: the endpoint is independent of page markup, so UI changes do not break extraction
- ✅ Complete fields: every datapoint Substack exposes for a post (audience, per-emoji reaction counts, podcast duration, word count when available)
- ✅ Full byline objects: author name, handle, bio, photo URL, primary publication, guest status
- ✅ No anti-bot measures encountered on the archive endpoint as of the latest tests
- ✅ Fast: 4–8 posts per second, single worker, 256 MB memory
- ✅ Cheap: $0.005 per post ($5 per 1,000), undercutting every Substack scraper in the Apify Store
What this Substack scraper extracts
Per post
| Field | Description | Example |
|---|---|---|
| `postId` | Substack internal post ID | `156234812` |
| `publicationId` | Internal publication ID | `10845` |
| `publicationUrl` | Canonical publication URL | `https://www.lennysnewsletter.com` |
| `title` | Post title | Why SaaS freemium playbooks don't work in AI |
| `subtitle` | Subhead/dek shown under the title | How to build an AI monetization strategy that actually works |
| `slug` | URL-safe post slug | `why-saas-freemium-playbooks-dont` |
| `canonicalUrl` | Full canonical URL of the post | `https://www.lennysnewsletter.com/p/why-saas-freemium...` |
| `postType` | `newsletter`, `podcast`, `thread`, `video` | `newsletter` |
| `audience` | Who can read: `everyone`, `only_paid`, `only_subscribers`, `founding` | `only_paid` |
| `postDate` | ISO 8601 publication date | `2026-05-05T13:03:32.007Z` |
| `description` | Short marketing description | "How to build an AI monetization strategy..." |
| `coverImage` | Hero/cover image URL | `https://substackcdn.com/image/...` |
| `reactions` | Emoji-keyed dict of reaction counts | `{"❤": 315}` |
| `reactionsTotal` | Sum across all emoji | `315` |
| `commentCount` | Total comments on the post | `6` |
| `wordCount` | Word count (when Substack stores it) | `2840` |
| `podcastDuration` | Duration in seconds (podcast posts only) | `2715` |
| `freeUnlockRequired` | True if reader must subscribe (even free) to unlock | `false` |
| `isAudio` | True if post has audio content (podcast/voiceover) | `false` |
| `sectionId` | Internal section ID if publication has sections | `8127` |
| `bylines` | Array of full author objects (see below) | `[{...}]` |
| `bylinesCount` | Number of authors on the post | `1` |
| `scrapedAt` | ISO 8601 UTC timestamp | `2026-05-18T20:55:14+00:00` |
Per byline (author embedded in each post)
| Field | Description | Example |
|---|---|---|
| `id` | Author user ID | `131847289` |
| `name` | Display name | Vikas Kansal |
| `handle` | Substack username | `vikaskansal` |
| `bio` | Self-written bio | "Product lead for Google AI subscriptions..." |
| `photoUrl` | Profile photo URL | `https://substack-post-media...` |
| `isGuest` | True if the author is a guest writer | `true` |
| `primaryPublicationName` | Their own newsletter name | Vikas Kansal |
| `primaryPublicationUrl` | Their own newsletter URL | `https://vikaskansal.substack.com` |
| `primaryPublicationId` | Their own publication ID | `8927213` |
Optional: global category catalog
Set includeCategories: true and the actor emits one extra helper record containing all 32 Substack categories (id, name, canonical name, rank). Useful for building a category tree in your application.
Use cases for this Substack data API
📨 Newsletter ad networks & sponsorship platforms
Tools like Beehiiv's ad marketplace, Swapstack, Letterhead and Hypefury sponsorship need fresh metadata for every newsletter they list. Schedule this actor weekly per publication URL — get the latest 50 posts, audience type (paid vs free) and reactions to qualify ad inventory.
💰 B2B SaaS sales prospecting among Substack writers
Substack writers are heavy buyers of newsletter automation, video tools, email design tools, course platforms, podcast hosting and CRM software. Build a sales target list by scraping the top 200 Substack publications in your niche and enriching each byline's bio for fit signals ("ex-Google", "Y Combinator", "PhD"...).
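The bio enrichment step can be sketched as a keyword match over each byline. This is a minimal illustration that assumes the record shape shown in the field tables; `FIT_SIGNALS` and `match_fit_signals` are hypothetical names, and the signal list is only the three examples quoted above.

```python
import re

# Hypothetical fit signals; replace with terms that matter for your product.
FIT_SIGNALS = ("ex-Google", "Y Combinator", "PhD")


def match_fit_signals(post, signals=FIT_SIGNALS):
    """Return {author handle: [matched signals]} for one post record."""
    matches = {}
    for byline in post.get("bylines") or []:
        bio = byline.get("bio") or ""  # bio may be missing or null
        hits = [s for s in signals
                if re.search(re.escape(s), bio, re.IGNORECASE)]
        if hits:
            matches[byline.get("handle")] = hits
    return matches


# Made-up sample record, not real scraper output.
post = {
    "bylines": [
        {"handle": "example_writer", "bio": "Ex-Google PM, now writing about pricing."},
        {"handle": "another_writer", "bio": None},
    ]
}
print(match_fit_signals(post))  # {'example_writer': ['ex-Google']}
```

Plain substring matching is crude; it will miss paraphrases such as "formerly at Google", so treat the result as a first-pass filter.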
🤖 LLM training data and RAG pipelines
Substack hosts a high density of long-form, original, well-edited writing on the open web. Use the metadata and canonical URLs from technology, finance, science or culture publications to index a corpus for fine-tuning, retrieval-augmented generation, or topical agents. Body text is not part of this actor's output.
📊 Creator economy analytics
How does post frequency correlate with reactions? Which audience type (free vs paid) gets the most comments? Pull thousands of posts across hundreds of publications and answer those questions with real data.
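As one way to approach the second question, the sketch below averages `commentCount` per `audience` value. It assumes the items have already been downloaded as a list of dicts in the output format documented on this page; `mean_comments_by_audience` is a hypothetical helper and the sample numbers are invented.

```python
from collections import defaultdict
from statistics import mean


def mean_comments_by_audience(posts):
    """Average commentCount for each audience tier present in the data."""
    buckets = defaultdict(list)
    for post in posts:
        count = post.get("commentCount")
        if count is not None:  # skip records without a comment count
            buckets[post.get("audience")].append(count)
    return {audience: mean(counts) for audience, counts in buckets.items()}


# Invented sample data for illustration only.
sample = [
    {"audience": "everyone", "commentCount": 10},
    {"audience": "everyone", "commentCount": 20},
    {"audience": "only_paid", "commentCount": 6},
]
print(mean_comments_by_audience(sample))
```

A mean is sensitive to a few viral posts; a median per tier may be a fairer comparison on real data.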
✍️ Competitive content marketing
Marketing teams at Notion, HubSpot, Linear, Vercel and Stripe monitor what their target audience reads on Substack. Schedule this scraper to send daily digests of new posts from competitor / industry publications to a Slack channel via Apify integrations.
📰 Journalism trend tracking
Want to know what every tech newsletter said about the OpenAI o5 launch? Pull the latest 20 posts from the top 50 Substack tech writers and run sentiment + keyword extraction on the titles, subtitles and descriptions (post bodies are not part of the output).
🎯 Investor / VC research
VCs increasingly track Substack engagement as a leading indicator of founder reputation, market thesis traction and sector heat. Pull the top tech / finance / AI publications and watch reaction velocity.
How to use this Substack scraper
Mode 1: Publication URLs (core)
Pass one or more Substack publication URLs. The actor paginates through each publication's archive using the public /api/v1/archive endpoint and returns one record per post.
```json
{
    "publicationUrls": [
        "https://www.lennysnewsletter.com",
        "https://stratechery.com",
        "https://www.platformer.news"
    ],
    "maxPostsPerPublication": 100,
    "maxTotalPosts": 500,
    "sortOrder": "new",
    "audienceFilter": "all"
}
```
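For readers curious about what "paginates through the archive" means, here is a rough sketch of an offset-based loop. It is not the actor's actual implementation: the query parameter names (`sort`, `offset`, `limit`) and the page size of 12 are assumptions, and `fetch_page` is an injected callable so the example runs without network access. The 300 ms pause matches the figure quoted in the FAQ.

```python
import time
from urllib.parse import parse_qs, urlparse


def archive_url(publication_url, offset, limit, sort="new"):
    # Parameter names are assumptions, not documented by this actor.
    base = publication_url.rstrip("/")
    return f"{base}/api/v1/archive?sort={sort}&offset={offset}&limit={limit}"


def iter_archive(fetch_page, publication_url, max_posts=100, limit=12, pause_s=0.3):
    """Yield posts page by page until the archive is empty or max_posts is hit."""
    offset = 0
    yielded = 0
    while yielded < max_posts:
        page = fetch_page(archive_url(publication_url, offset, limit))
        if not page:  # an empty page means the archive is exhausted
            break
        for post in page:
            yield post
            yielded += 1
            if yielded >= max_posts:
                return
        offset += len(page)
        time.sleep(pause_s)  # polite pause between paginated requests


def fake_fetch(url):
    # Stand-in for an HTTP GET: serves a pretend archive of 30 posts.
    query = parse_qs(urlparse(url).query)
    offset, limit = int(query["offset"][0]), int(query["limit"][0])
    return [{"id": i} for i in range(offset, min(offset + limit, 30))]


posts = list(iter_archive(fake_fetch, "https://example.substack.com",
                          max_posts=20, pause_s=0))
print(len(posts))  # 20
```

In real use `fetch_page` would perform the HTTP request and return the decoded JSON list.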
Mode 2: Audience filter
Filter results by audience tier:
- `"all"` (default): every post regardless of paywall
- `"free"`: only `audience: "everyone"` posts (readable without subscribing)
- `"paid"`: only `audience: "only_paid"`, `"only_subscribers"` or `"founding"` posts (paywalled)
Mode 3: Categories catalog
Set includeCategories: true to also receive the global list of 32 Substack categories (Culture, Technology, Business, U.S. Politics, Finance, AI, Crypto, etc) with internal IDs, ranks and parent relationships. Useful if you're building a Substack discovery interface in your own product.
```json
{
    "publicationUrls": ["https://www.lennysnewsletter.com"],
    "includeCategories": true,
    "maxPostsPerPublication": 20
}
```
Step-by-step tutorial — your first Substack run in 2 minutes
- Click "Try for free" on this actor's Apify Store page. Apify gives every new user $5 in credit.
- Find a Substack publication you want to scrape. Any URL like `https://{anything}.substack.com` or any custom domain (`stratechery.com`, `lennysnewsletter.com`) works.
- Paste the example input:

```json
{
    "publicationUrls": ["https://www.lennysnewsletter.com"],
    "maxPostsPerPublication": 20,
    "audienceFilter": "all"
}
```
- Click "Start". The actor pages the publication's archive and pushes one record per post.
- Download your dataset as JSON, CSV, Excel, RSS or HTML.
You'll have 20 fully-structured Substack posts in under 10 seconds.
Performance and cost
- HTTP only, no browser, no proxy. Uses `curl_cffi` with Chrome 120 impersonation against the publication's native API endpoints.
- 4–8 posts per second sustained throughput.
- No anti-bot on the archive endpoint as of the latest tests (Substack designed this endpoint to power their own web app — it's intentionally fast and unauthenticated for free archive access).
- Pricing: $0.005 per post + $0.00005 per actor start. No subscription, no commitment.
Pricing scenarios
| Workload | Posts | Cost |
|---|---|---|
| Try the actor on Lenny's | 20 | $0.10 |
| One Apify free $5 credit | ~1,000 | $5.00 |
| Top 10 tech publications × 50 latest posts | 500 | $2.50 |
| Daily refresh of 30 newsletters (1 month) | ~9,000 | $45.00 |
| Bulk archive snapshot — 100 pubs × 500 posts each | 50,000 | $250.00 |
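The scenarios above follow from the two event prices. The sketch below reproduces the arithmetic, assuming the start fee is charged once per run; `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding noise.

```python
from decimal import Decimal

PRICE_PER_POST = Decimal("0.005")     # $0.005 per post
PRICE_PER_START = Decimal("0.00005")  # $0.00005 per actor start


def estimate_cost(posts, runs=1):
    """Estimated charge in USD for a number of posts spread over some runs."""
    return posts * PRICE_PER_POST + runs * PRICE_PER_START


print(estimate_cost(500))       # 2.50005
print(estimate_cost(9000, 30))  # 45.00150
```

The start fee is small enough that the table rounds it away: 30 daily runs add $0.0015 to the $45.00 monthly figure.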
Output example (single Substack post)
```json
{
    "type": "post",
    "postId": 156234812,
    "publicationId": 10845,
    "publicationUrl": "https://www.lennysnewsletter.com",
    "title": "Why SaaS freemium playbooks don't work in AI, and what to do instead",
    "subtitle": "How to build an AI monetization strategy that actually works",
    "slug": "why-saas-freemium-playbooks-dont",
    "canonicalUrl": "https://www.lennysnewsletter.com/p/why-saas-freemium-playbooks-dont",
    "postType": "newsletter",
    "audience": "only_paid",
    "postDate": "2026-05-05T13:03:32.007Z",
    "description": "How to build an AI monetization strategy that actually works",
    "coverImage": "https://substackcdn.com/image/...",
    "reactions": {"❤": 315},
    "reactionsTotal": 315,
    "commentCount": 6,
    "wordCount": null,
    "podcastDuration": null,
    "freeUnlockRequired": false,
    "isAudio": false,
    "sectionId": null,
    "bylines": [
        {
            "id": 131847289,
            "name": "Vikas Kansal",
            "handle": "vikaskansal",
            "bio": "Product lead for Google AI subscriptions...",
            "photoUrl": "https://substack-post-media.s3.amazonaws.com/...",
            "isGuest": true,
            "primaryPublicationName": "Vikas Kansal",
            "primaryPublicationUrl": "https://vikaskansal.substack.com",
            "primaryPublicationId": 8927213
        }
    ],
    "bylinesCount": 1,
    "scrapedAt": "2026-05-18T20:55:14+00:00"
}
```
How this Substack scraper compares
| Approach | Pros | Cons |
|---|---|---|
| This actor | Official API endpoint, full bylines, $5/1K, no proxy | No global discovery — user provides publication URLs |
| Substack RSS feeds | Free | Sparse fields, no audience info, no reactions, no byline IDs |
| Newsletter Stack DBs (Letterhead, Hypefury) | Curated | Paid subscriptions $50–$300/mo, smaller coverage |
| BeehiivAds inventory data | Real-time | Beehiiv-only, no Substack overlap |
| Manual RSS-to-CSV scripts | Free | Brittle, no paid-post metadata, no API stability |
| Hiring a freelancer | Custom | $200–$800 one-off, not maintained |
How to call this Substack scraper from your code
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("makework36/substack-scraper").call(run_input={
    "publicationUrls": ["https://www.lennysnewsletter.com"],
    "maxPostsPerPublication": 50,
    "audienceFilter": "all",
})

# Iterate over the run's dataset, skipping any non-post helper record.
for p in client.dataset(run["defaultDatasetId"]).iterate_items():
    if p.get("type") == "post":
        print(p["title"], p["audience"], p["reactionsTotal"], p["commentCount"])
```
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('makework36/substack-scraper').call({
    publicationUrls: ['https://stratechery.com'],
    maxPostsPerPublication: 30,
});

// Fetch the run's dataset and print one line per post.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(p => console.log(p.title, p.bylines[0]?.name, p.reactionsTotal));
```
cURL (synchronous run)
```bash
curl -X POST "https://api.apify.com/v2/acts/makework36~substack-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"publicationUrls":["https://www.lennysnewsletter.com"],"maxPostsPerPublication":10}'
```
Frequently Asked Questions about scraping Substack
Is scraping Substack legal?
The /api/v1/archive endpoint is the same one Substack's own web client uses to render the archive page on every publication. It is unauthenticated by design — Substack actively wants the archive to be indexable. This actor consumes that same public endpoint at human-like rates. You are responsible for how you store and redistribute the data; respect each writer's copyright on post content, and the Substack Terms of Service for downstream use.
Why doesn't the scraper return post body text?
The archive endpoint returns post metadata only — title, subtitle, description, audience, byline. Full body content lives at the post detail endpoint ({pub}/p/{slug}) and is paywalled for only_paid and only_subscribers posts. Extracting paid bodies is not part of v1 (and would not pass Substack's TOS without subscriber credentials).
Will my account get banned?
The actor runs unauthenticated against public endpoints. Substack has no account-level rate limiting on the archive endpoint. We've not observed any IP block as of the latest tests. The actor inserts a 300 ms pause between paginated requests to remain polite.
How current is the data?
Live — every run hits Substack directly and returns posts as listed at request time. There is no cache.
Can I scrape every Substack publication that exists?
Substack hosts hundreds of thousands of publications. There is no public "top publications" discovery endpoint. To build a comprehensive dataset, supply known URLs (Substack writers usually advertise their newsletters on Twitter/X, LinkedIn or their own websites), or parse Substack's sitemap-tt.xml.gz (millions of pubs — uses lots of credits).
How do I find publication URLs?
- Visit `https://substack.com/explore` and click any newsletter
- Look for `*.substack.com` subdomains or custom domains on Twitter/X bios of writers in your niche
- Use the `includeCategories: true` option to get the 32-category catalog, then manually browse top publications
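Links collected this way often point at individual posts rather than at the publication itself. The sketch below reduces any such link to a root URL and removes duplicates before building `publicationUrls`. It assumes the actor expects root URLs, as in the examples on this page; `publication_root` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def publication_root(url):
    """Reduce a post or profile link to the publication's root URL."""
    if "://" not in url:
        url = "https://" + url  # allow bare domains such as stratechery.com
    host = urlsplit(url).netloc.lower()
    return f"https://{host}"


collected = [
    "https://www.lennysnewsletter.com/p/why-saas-freemium-playbooks-dont",
    "https://www.lennysnewsletter.com",
    "stratechery.com",
]
# dict.fromkeys removes duplicates while keeping first-seen order.
publication_urls = list(dict.fromkeys(publication_root(u) for u in collected))
print(publication_urls)
# ['https://www.lennysnewsletter.com', 'https://stratechery.com']
```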
Can I filter by paid vs free posts?
Yes — set audienceFilter to "free" (only public posts) or "paid" (paywalled posts). The actor still extracts metadata for paid posts; only the body is gated.
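The filter semantics can be written as a small predicate, which is handy if you run with `"all"` and want to split the results yourself afterwards. This mirrors the tiers listed under Mode 2; it illustrates the mapping and is not the actor's internal code. `matches_audience_filter` is a hypothetical name.

```python
# Audience values treated as paywalled, per the Mode 2 description.
PAID_AUDIENCES = {"only_paid", "only_subscribers", "founding"}


def matches_audience_filter(post, audience_filter="all"):
    """Return True if a post record passes the given audienceFilter value."""
    audience = post.get("audience")
    if audience_filter == "all":
        return True
    if audience_filter == "free":
        return audience == "everyone"
    if audience_filter == "paid":
        return audience in PAID_AUDIENCES
    raise ValueError(f"unknown audienceFilter: {audience_filter!r}")


posts = [{"audience": "everyone"}, {"audience": "only_paid"}, {"audience": "founding"}]
paid = [p for p in posts if matches_audience_filter(p, "paid")]
print(len(paid))  # 2
```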
Does the actor support podcasts?
Yes. Substack hosts thousands of podcast publications. Posts of postType: "podcast" include podcastDuration in seconds. Full audio is on the canonical URL and not part of the JSON output.
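Since `podcastDuration` is a raw number of seconds, a display-friendly form is a one-liner away. A minimal sketch, with `format_duration` as a hypothetical helper, using the example value from the field table:

```python
def format_duration(seconds):
    """Convert a podcastDuration in seconds to an H:MM:SS string."""
    if seconds is None:  # non-podcast posts carry null
        return None
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(format_duration(2715))  # 0:45:15
```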
Can I schedule this scraper?
Yes. Use Apify's built-in scheduler to refresh your dataset daily, weekly or monthly. Push results to Google Sheets, BigQuery, Postgres or Slack via Apify integrations.
Will the reactions count include all emoji?
Yes — the reactions field is a dict like {"❤": 315, "🔥": 22, "👍": 11}. The reactionsTotal field sums them.
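The relationship between the two fields is a plain sum, which you can recompute as a sanity check on downloaded data. `reactions_total` is a hypothetical helper name.

```python
def reactions_total(reactions):
    """Sum per-emoji counts; treats a missing or null dict as zero."""
    return sum((reactions or {}).values())


print(reactions_total({"❤": 315, "🔥": 22, "👍": 11}))  # 348
```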
How accurate is the byline data?
Bylines come straight from Substack's user database. The bio field is the self-written bio at the moment of post publication (which Substack stores per-post, not as a live join). primaryPublicationName and primaryPublicationUrl may have changed if the author migrated newsletters since publishing.
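Because each post carries its own snapshot of the author, building an author table means choosing which snapshot to keep. The sketch below keeps the byline from the most recent post per author `id`. It assumes `postDate` strings share one ISO 8601 format so that string comparison orders them correctly; `latest_author_snapshots` is a hypothetical helper and the sample data is invented.

```python
def latest_author_snapshots(posts):
    """Map author id to the byline snapshot from that author's newest post."""
    latest = {}
    for post in posts:
        post_date = post.get("postDate") or ""
        for byline in post.get("bylines") or []:
            author_id = byline.get("id")
            if author_id is None:
                continue
            seen = latest.get(author_id)
            # Same-format ISO 8601 strings sort chronologically.
            if seen is None or post_date > seen["postDate"]:
                latest[author_id] = {**byline, "postDate": post_date}
    return latest


# Invented sample data for illustration only.
sample = [
    {"postDate": "2025-01-10T09:00:00.000Z",
     "bylines": [{"id": 1, "name": "A. Writer", "bio": "Old bio"}]},
    {"postDate": "2026-03-02T09:00:00.000Z",
     "bylines": [{"id": 1, "name": "A. Writer", "bio": "New bio"}]},
]
print(latest_author_snapshots(sample)[1]["bio"])  # New bio
```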
Is there a free trial?
Yes — Apify gives every new user $5 in platform credit, enough to extract ~1,000 Substack posts with this actor.
Can I get subscriber counts per publication?
Substack does not expose subscriber counts publicly, and the archive endpoint doesn't include them. Some publications display "X,000 subscribers" in their hero copy; extracting that figure requires HTML scraping and is planned for the v1.1 release.
What about Substack Notes (the Twitter-like feed)?
Notes are a separate platform with its own API. This actor covers posts only. A Notes-specific scraper may come in v2.
Can I use this for academic / research projects?
Absolutely. Many social-science researchers and digital-humanities labs use Substack data for studying long-form journalism, polarization, paid content dynamics and creator economy trends. Cite the actor in your bibliography.
🔗 Other actors by makework36
Building content marketing, sales prospecting, or creator-economy tooling? Combine with these:
- Shopify Products Scraper — full Shopify catalog: title, SKU, price, variants, inventory
- Goodreads Scraper — books, authors, ratings, ISBN
- IndiaMART Suppliers Scraper — India B2B suppliers
- Email Finder Scraper — verified business emails
- Reddit SaaS Leads Scraper — startup pain points & buyers
- Lovable Sites Scraper — enumerate AI-builder apps
- Trustpilot Reviews Scraper — customer ratings
See all actors by makework36 on the Apify Store.
Roadmap
- v1.1: subscriber count extraction from publication homepage HTML.
- v1.2: post body extraction for `audience: "everyone"` posts (free posts only).
- v1.3: comments thread extraction.
- v2: Substack Notes scraper as a separate actor.
Disclaimer
This actor consumes the /api/v1/archive endpoint that every Substack publication exposes by design — the same endpoint that powers the publication's own web client. You are responsible for respecting each writer's copyright on the post content, Substack's Terms of Service, and applicable data protection regulations (GDPR for EU subjects, CCPA for California subjects) when storing, transforming or redistributing the data.
🙏 Ran this Substack scraper successfully? Leaving a review helps the Apify algorithm surface this actor to other newsletter operators and creator-economy teams. Much appreciated.