Substack Scraper - Posts, Authors, Reactions & Newsletters

Pricing

Pay per event

Scrape Substack newsletters via each publication's public archive API. Fields: title, author, bio, audience (free/paid), reactions, comments, cover image, podcast duration. HTTP only, $5 per 1,000 posts.

Developer

deusex machine

Maintained by Community

Substack Scraper — Posts, Authors, Reactions & Newsletter Metadata

Scrape Substack — the largest paid newsletter platform with over 5 million paid subscriptions and 50,000+ active writers — using the canonical public API endpoints each publication ships at /api/v1/archive. Extract every post with full metadata: title, subtitle, byline, audience (free vs paid), reactions, comments count, cover image, podcast duration and canonical URL. HTTP-only, no browser, no proxy, $5 per 1,000 posts.

If you run an ad network for newsletters, manage a sponsorship platform, build a sales tool for B2B SaaS targeting Substack writers, train an LLM on long-form journalism, monitor competitive content strategies, or run an analytics dashboard for the creator economy, this Substack scraper is the cleanest, fastest path to the data.

Why use this Substack scraper

Substack has no advertised public API, but every publication on the platform exposes its archive at /api/v1/archive — the same endpoint Substack's own web app uses to render the post list. The data is canonical, complete and stable.

This actor uses that endpoint directly. That means:

  • Official-grade reliability: the endpoint serves Substack's own web app, so it tends to stay the same when the UI changes
  • Complete fields: every datapoint Substack itself stores (audience, per-emoji reaction counts, podcast duration, word count when available)
  • Full byline objects: author name, handle, bio, photo URL, primary publication, guest status
  • No anti-bot measures encountered on the archive endpoint as of the latest tests
  • Fast: 4–8 posts per second, single worker, 256 MB memory
  • Cheap: $0.005 per post ($5 per 1,000), undercutting every Substack scraper in the Apify Store

What this Substack scraper extracts

Per post

| Field | Description | Example |
| --- | --- | --- |
| postId | Substack internal post ID | 156234812 |
| publicationId | Internal publication ID | 10845 |
| publicationUrl | Canonical publication URL | https://www.lennysnewsletter.com |
| title | Post title | Why SaaS freemium playbooks don't work in AI |
| subtitle | Subhead/dek shown under the title | How to build an AI monetization strategy that actually works |
| slug | URL-safe post slug | why-saas-freemium-playbooks-dont |
| canonicalUrl | Full canonical URL of the post | https://www.lennysnewsletter.com/p/why-saas-freemium... |
| postType | newsletter, podcast, thread, video | newsletter |
| audience | Who can read: everyone, only_paid, only_subscribers, founding | only_paid |
| postDate | ISO 8601 publication date | 2026-05-05T13:03:32.007Z |
| description | Short marketing description | "How to build an AI monetization strategy..." |
| coverImage | Hero/cover image URL | https://substackcdn.com/image/... |
| reactions | Emoji-keyed dict of reaction counts | {"❤": 315} |
| reactionsTotal | Sum across all emoji | 315 |
| commentCount | Total comments on the post | 6 |
| wordCount | Word count (when Substack stores it) | 2840 |
| podcastDuration | Duration in seconds (podcast posts only) | 2715 |
| freeUnlockRequired | True if reader must subscribe (even free) to unlock | false |
| isAudio | True if post has audio content (podcast/voiceover) | false |
| sectionId | Internal section ID if publication has sections | 8127 |
| bylines | Array of full author objects (see below) | [{...}] |
| bylinesCount | Number of authors on the post | 1 |
| scrapedAt | ISO 8601 UTC timestamp | 2026-05-18T20:55:14+00:00 |

Per byline (author embedded in each post)

| Field | Description | Example |
| --- | --- | --- |
| id | Author user ID | 131847289 |
| name | Display name | Vikas Kansal |
| handle | Substack username | vikaskansal |
| bio | Self-written bio | "Product lead for Google AI subscriptions..." |
| photoUrl | Profile photo URL | https://substack-post-media... |
| isGuest | True if the author is a guest writer | true |
| primaryPublicationName | Their own newsletter name | Vikas Kansal |
| primaryPublicationUrl | Their own newsletter URL | https://vikaskansal.substack.com |
| primaryPublicationId | Their own publication ID | 8927213 |

Optional: global category catalog

Set includeCategories: true and the actor emits one extra helper record containing all 32 Substack categories (id, name, canonical name, rank). Useful for building a category tree in your application.

Use cases for this Substack data API

📨 Newsletter ad networks & sponsorship platforms

Tools like Beehiiv's ad marketplace, Swapstack, Letterhead and Hypefury sponsorship need fresh metadata for every newsletter they list. Schedule this actor weekly per publication URL — get the latest 50 posts, audience type (paid vs free) and reactions to qualify ad inventory.

💰 B2B SaaS sales prospecting Substack writers

Substack writers are heavy buyers of newsletter automation, video tools, email design tools, course platforms, podcast hosting and CRM software. Build a sales target list by scraping the top 200 Substack publications in your niche and enriching each byline's bio for fit signals ("ex-Google", "Y Combinator", "PhD"...).
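As a rough sketch of the enrichment step, the snippet below scans each byline's bio for fit signals and returns one row per matching author. The signal list and the function names are illustrative assumptions, not part of the actor; the input is the list of post records the actor produces.

```python
import re

# Hypothetical fit signals; replace with terms that matter in your niche.
FIT_SIGNALS = ("ex-Google", "Y Combinator", "PhD")


def bio_signals(bio, signals=FIT_SIGNALS):
    """Return the signals found in a bio, matched case-insensitively."""
    if not bio:
        return []
    return [s for s in signals if re.search(re.escape(s), bio, re.IGNORECASE)]


def prospects(posts):
    """Yield (handle, name, matched signals) once per author with a match."""
    seen = set()
    for post in posts:
        for author in post.get("bylines") or []:
            handle = author.get("handle")
            if handle in seen:
                continue
            matched = bio_signals(author.get("bio"))
            if matched:
                seen.add(handle)
                yield handle, author.get("name"), matched
```

Plain substring matching is crude; it is only meant to shortlist authors for manual review.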

🤖 LLM training data and RAG pipelines

Substack hosts a high density of long-form, original, well-edited writing on the open web. Use the actor to collect post metadata and canonical URLs from technology, finance, science or culture publications as the index for a corpus for fine-tuning, retrieval-augmented generation, or topical agents. Post bodies are not part of the output and have to be obtained separately.

📊 Creator economy analytics

How does post frequency correlate with reactions? Which audience type (free vs paid) gets the most comments? Pull thousands of posts across hundreds of publications and answer those questions with real data.
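The second question can be answered with a few lines over the dataset. This is a minimal sketch that assumes the records carry the `audience` and `commentCount` fields described above; the function name is made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_comments_by_audience(posts):
    """Average commentCount per audience value, skipping non-post records."""
    buckets = defaultdict(list)
    for post in posts:
        if post.get("type", "post") != "post":
            continue  # e.g. the optional categories helper record
        buckets[post.get("audience")].append(post.get("commentCount") or 0)
    return {audience: mean(counts) for audience, counts in buckets.items()}
```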

✍️ Competitive content marketing

Marketing teams at companies such as Notion, HubSpot, Linear, Vercel and Stripe can monitor what their target audience reads on Substack. Schedule this scraper to send daily digests of new posts from competitor or industry publications to a Slack channel via Apify integrations.

📰 Journalism trend tracking

Want to know what every tech newsletter said about the OpenAI o5 launch? Pull the latest 20 posts from the top 50 Substack tech writers and run sentiment and keyword extraction on the titles, subtitles and descriptions (the actor does not return post bodies).

🎯 Investor / VC research

VCs increasingly track Substack engagement as a leading indicator of founder reputation, market thesis traction and sector heat. Pull the top tech / finance / AI publications and watch reaction velocity.
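"Reaction velocity" is not defined by the actor. One simple reading, assumed here, is reactions per day between `postDate` and `scrapedAt`. The sketch below computes that from a single post record.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting the trailing "Z" in postDate."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def reactions_per_day(post):
    """Reactions divided by the post's age in days at scrape time."""
    age = parse_ts(post["scrapedAt"]) - parse_ts(post["postDate"])
    age_days = age.total_seconds() / 86400
    if age_days <= 0:
        return 0.0
    return (post.get("reactionsTotal") or 0) / age_days
```

Comparing this number across scheduled runs gives a trend; a single snapshot only gives a lifetime average.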

How to use this Substack scraper

Mode 1: Publication URLs (core)

Pass one or more Substack publication URLs. The actor paginates through each publication's archive using the public /api/v1/archive endpoint and returns one record per post.

{
  "publicationUrls": [
    "https://www.lennysnewsletter.com",
    "https://stratechery.com",
    "https://www.platformer.news"
  ],
  "maxPostsPerPublication": 100,
  "maxTotalPosts": 500,
  "sortOrder": "new",
  "audienceFilter": "all"
}
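The pagination described above can be sketched as an offset loop. The query parameters (`sort`, `offset`, `limit`), the page size of 12 and the list-shaped response are assumptions for illustration; this README only names the /api/v1/archive path. The HTTP call is injected as `fetch_json`, so the sketch does not depend on a particular HTTP library.

```python
import time
from urllib.parse import urlencode


def archive_url(publication_url, offset, limit=12, sort="new"):
    """Build an archive page URL (query parameter names are assumed)."""
    query = urlencode({"sort": sort, "offset": offset, "limit": limit})
    return f"{publication_url.rstrip('/')}/api/v1/archive?{query}"


def iter_archive(publication_url, fetch_json, max_posts=100, limit=12, pause=0.3):
    """Yield up to max_posts raw post dicts, one page at a time.

    fetch_json is any callable that takes a URL and returns decoded JSON,
    assumed here to be a list of post objects.
    """
    offset = 0
    yielded = 0
    while yielded < max_posts:
        page = fetch_json(archive_url(publication_url, offset, limit))
        if not page:
            break  # an empty page means the archive is exhausted
        for post in page:
            yield post
            yielded += 1
            if yielded >= max_posts:
                return
        offset += len(page)
        time.sleep(pause)  # polite delay between paginated requests
```

The 300 ms default pause mirrors the delay the actor is described as using between paginated requests.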

Mode 2: Audience filter

Filter results by audience tier:

  • "all" (default) — every post regardless of paywall
  • "free" — only audience: "everyone" posts (readable without subscribing)
  • "paid" — only audience: "only_paid", "only_subscribers", "founding" posts (paywalled)
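The three filter values map onto the `audience` field as follows. This is a sketch of the rule as stated in the list above, useful for re-filtering a dataset that was scraped with "all"; the function name is an assumption, not the actor's internal code.

```python
# Audience values that the list above treats as paywalled.
PAID_AUDIENCES = {"only_paid", "only_subscribers", "founding"}


def matches_audience(post, audience_filter="all"):
    """Return True if the post passes the given audienceFilter value."""
    audience = post.get("audience")
    if audience_filter == "all":
        return True
    if audience_filter == "free":
        return audience == "everyone"
    if audience_filter == "paid":
        return audience in PAID_AUDIENCES
    raise ValueError(f"unknown audienceFilter: {audience_filter!r}")
```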

Mode 3: Categories catalog

Set includeCategories: true to also receive the global list of 32 Substack categories (Culture, Technology, Business, U.S. Politics, Finance, AI, Crypto, etc) with internal IDs, ranks and parent relationships. Useful if you're building a Substack discovery interface in your own product.

{
  "publicationUrls": ["https://www.lennysnewsletter.com"],
  "includeCategories": true,
  "maxPostsPerPublication": 20
}

Step-by-step tutorial — your first Substack run in 2 minutes

  1. Click "Try for free" on this actor's Apify Store page. Apify gives every new user $5 in credit.
  2. Find a Substack publication you want to scrape. Any URL like https://{anything}.substack.com or any custom domain (stratechery.com, lennysnewsletter.com) works.
  3. Paste the example input:
    {
      "publicationUrls": ["https://www.lennysnewsletter.com"],
      "maxPostsPerPublication": 20,
      "audienceFilter": "all"
    }
  4. Click "Start". The actor pages the publication's archive and pushes one record per post.
  5. Download your dataset as JSON, CSV, Excel, RSS or HTML.

You'll have 20 fully-structured Substack posts in under 10 seconds.

Performance and cost

  • HTTP only, no browser, no proxy. Uses curl_cffi with Chrome 120 impersonation against the publication's native API endpoints.
  • 4–8 posts per second sustained throughput.
  • No anti-bot on the archive endpoint as of the latest tests (Substack designed this endpoint to power their own web app — it's intentionally fast and unauthenticated for free archive access).
  • Pricing: $0.005 per post + $0.00005 per actor start. No subscription, no commitment.

Pricing scenarios

| Workload | Posts | Cost |
| --- | --- | --- |
| Try the actor on Lenny's | 20 | $0.10 |
| One Apify free $5 credit | ~1,000 | $5.00 |
| Top 10 tech publications × 50 latest posts | 500 | $2.50 |
| Daily refresh of 30 newsletters (1 month) | ~9,000 | $45.00 |
| Bulk archive snapshot: 100 pubs × 500 posts each | 50,000 | $250.00 |
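The scenarios above follow from the two listed prices. A small estimator, using `Decimal` to avoid floating-point drift, might look like the sketch below; the function name is illustrative.

```python
from decimal import Decimal

PRICE_PER_POST = Decimal("0.005")     # $5 per 1,000 posts
PRICE_PER_START = Decimal("0.00005")  # charged once per actor start


def estimate_cost(posts, starts=1):
    """Estimated cost in USD for a number of posts and actor starts."""
    return PRICE_PER_POST * posts + PRICE_PER_START * starts
```

The table's figures cover the per-post charge only; the per-start fee adds a few hundredths of a cent per run.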

Output example (single Substack post)

{
  "type": "post",
  "postId": 156234812,
  "publicationId": 10845,
  "publicationUrl": "https://www.lennysnewsletter.com",
  "title": "Why SaaS freemium playbooks don't work in AI, and what to do instead",
  "subtitle": "How to build an AI monetization strategy that actually works",
  "slug": "why-saas-freemium-playbooks-dont",
  "canonicalUrl": "https://www.lennysnewsletter.com/p/why-saas-freemium-playbooks-dont",
  "postType": "newsletter",
  "audience": "only_paid",
  "postDate": "2026-05-05T13:03:32.007Z",
  "description": "How to build an AI monetization strategy that actually works",
  "coverImage": "https://substackcdn.com/image/...",
  "reactions": {"❤": 315},
  "reactionsTotal": 315,
  "commentCount": 6,
  "wordCount": null,
  "podcastDuration": null,
  "freeUnlockRequired": false,
  "isAudio": false,
  "sectionId": null,
  "bylines": [
    {
      "id": 131847289,
      "name": "Vikas Kansal",
      "handle": "vikaskansal",
      "bio": "Product lead for Google AI subscriptions...",
      "photoUrl": "https://substack-post-media.s3.amazonaws.com/...",
      "isGuest": true,
      "primaryPublicationName": "Vikas Kansal",
      "primaryPublicationUrl": "https://vikaskansal.substack.com",
      "primaryPublicationId": 8927213
    }
  ],
  "bylinesCount": 1,
  "scrapedAt": "2026-05-18T20:55:14+00:00"
}
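Because `bylines` is a nested array, spreadsheet-style analysis usually needs it flattened. The following sketch emits one CSV row per (post, author) pair from records shaped like the example above; the column selection and the helper names are arbitrary choices made for illustration.

```python
import csv
import io

POST_COLUMNS = ["postId", "title", "audience", "postDate", "reactionsTotal", "commentCount"]
COLUMNS = POST_COLUMNS + ["authorHandle", "authorName"]


def byline_rows(post):
    """Yield one flat dict per author; posts without bylines yield one row."""
    base = {key: post.get(key) for key in POST_COLUMNS}
    for author in post.get("bylines") or [{}]:
        yield {**base, "authorHandle": author.get("handle"), "authorName": author.get("name")}


def to_csv(posts):
    """Render posts as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for post in posts:
        writer.writerows(byline_rows(post))
    return buffer.getvalue()
```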

How this Substack scraper compares

| Approach | Pros | Cons |
| --- | --- | --- |
| This actor | Official API endpoint, full bylines, $5/1K, no proxy | No global discovery; user provides publication URLs |
| Substack RSS feeds | Free | Sparse fields, no audience info, no reactions, no byline IDs |
| Newsletter Stack DBs (Letterhead, Hypefury) | Curated | Paid subscriptions $50–$300/mo, smaller coverage |
| Beehiiv Ads inventory data | Real-time | Beehiiv-only, no Substack overlap |
| Manual RSS-to-CSV scripts | Free | Brittle, no paid-post metadata, no API stability |
| Hiring a freelancer | Custom | $200–$800 one-off, not maintained |

How to call this Substack scraper from your code

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("makework36/substack-scraper").call(run_input={
    "publicationUrls": ["https://www.lennysnewsletter.com"],
    "maxPostsPerPublication": 50,
    "audienceFilter": "all",
})

# Iterate over the run's default dataset and print one line per post.
for p in client.dataset(run["defaultDatasetId"]).iterate_items():
    if p["type"] == "post":
        print(p["title"], p["audience"], p["reactionsTotal"], p["commentCount"])

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Run the actor and wait for it to finish.
const run = await client.actor('makework36/substack-scraper').call({
    publicationUrls: ['https://stratechery.com'],
    maxPostsPerPublication: 30,
});

// Read the run's default dataset and print one line per post.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(p => console.log(p.title, p.bylines[0]?.name, p.reactionsTotal));

cURL (synchronous run)

curl -X POST "https://api.apify.com/v2/acts/makework36~substack-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"publicationUrls":["https://www.lennysnewsletter.com"],"maxPostsPerPublication":10}'

Frequently Asked Questions about scraping Substack

Is it legal to scrape Substack?

The /api/v1/archive endpoint is the same one Substack's own web client uses to render the archive page on every publication. It is unauthenticated, which suggests Substack intends the archive to be publicly readable and indexable. This actor consumes that same public endpoint at human-like rates. You are responsible for how you store and redistribute the data; respect each writer's copyright on post content, and the Substack Terms of Service for downstream use.

Why doesn't the scraper return post body text?

The archive endpoint returns post metadata only — title, subtitle, description, audience, byline. Full body content lives at the post detail endpoint ({pub}/p/{slug}) and is paywalled for only_paid and only_subscribers posts. Extracting paid bodies is not part of v1 (and would not pass Substack's TOS without subscriber credentials).
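If you fetch bodies yourself, the post URL follows the {pub}/p/{slug} pattern mentioned above. A minimal sketch, with helper names chosen for illustration:

```python
def post_url(publication_url, slug):
    """Build the post page URL from a publication URL and a post slug."""
    return f"{publication_url.rstrip('/')}/p/{slug}"


def body_is_public(post):
    """True when the post is readable without subscribing (audience "everyone")."""
    return post.get("audience") == "everyone"
```

For the example record in this README, `post_url` reproduces the `canonicalUrl` value; in general, prefer `canonicalUrl` when the record has it.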

Will my account get banned?

The actor runs unauthenticated against public endpoints, so no Substack account is involved and there is no account to ban. We have not observed rate limiting or IP blocks on the archive endpoint as of the latest tests. The actor inserts a 300 ms pause between paginated requests to remain polite.

How current is the data?

Live — every run hits Substack directly and returns posts as listed at request time. There is no cache.

Can I scrape every Substack publication that exists?

Substack hosts hundreds of thousands of publications. There is no public "top publications" discovery endpoint. To build a comprehensive dataset, supply known URLs (Substack writers usually advertise their newsletters on Twitter/X, LinkedIn or their own websites), or parse Substack's sitemap-tt.xml.gz. That sitemap is very large, so scraping every publication listed in it consumes a lot of credits.

How do I find publication URLs?

  • Visit https://substack.com/explore and click any newsletter
  • Look for *.substack.com subdomains or custom domains on Twitter/X bios of writers in your niche
  • Use the includeCategories: true option to get the 32-category catalog, then manually browse top publications

Can I filter by paid vs free posts?

Yes — set audienceFilter to "free" (only public posts) or "paid" (paywalled posts). The actor still extracts metadata for paid posts; only the body is gated.

Does the actor support podcasts?

Yes. Substack hosts thousands of podcast publications. Posts of postType: "podcast" include podcastDuration in seconds. Full audio is on the canonical URL and not part of the JSON output.
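Since `podcastDuration` is given in seconds, a small helper can render it for display. This is an illustrative sketch; the function name is not part of the actor.

```python
def format_duration(seconds):
    """Format a duration in seconds as H:MM:SS; None stays None."""
    if seconds is None:
        return None
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"
```

The example value 2715 from the field table renders as 0:45:15.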

Can I schedule this scraper?

Yes. Use Apify's built-in scheduler to refresh your dataset daily, weekly or monthly. Push results to Google Sheets, BigQuery, Postgres or Slack via Apify integrations.

Will the reactions count include all emoji?

Yes — the reactions field is a dict like {"❤": 315, "🔥": 22, "👍": 11}. The reactionsTotal field sums them.
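The relationship between the two fields can be checked or recomputed locally. A minimal sketch, with hypothetical helper names:

```python
def reactions_total(reactions):
    """Sum the counts across all emoji; a missing dict counts as zero."""
    return sum((reactions or {}).values())


def top_reaction(reactions):
    """Return the emoji with the highest count, or None if there are none."""
    if not reactions:
        return None
    return max(reactions, key=reactions.get)
```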

How accurate is the byline data?

Bylines come straight from Substack's user database. The bio field is the self-written bio at the moment of post publication (which Substack stores per-post, not as a live join). primaryPublicationName and primaryPublicationUrl may have changed if the author migrated newsletters since publishing.

Is there a free trial?

Yes — Apify gives every new user $5 in platform credit, enough to extract ~1,000 Substack posts with this actor.

Can I get subscriber counts per publication?

Substack does not expose subscriber counts publicly; the archive endpoint doesn't include them. Some publications display "X,000 subscribers" in their hero copy; extracting that requires HTML scraping, which is planned for the v1.1 release.

What about Substack Notes (the Twitter-like feed)?

Notes are a separate platform with its own API. This actor covers posts only. A Notes-specific scraper may come in v2.

Can I use this for academic / research projects?

Absolutely. Many social-science researchers and digital-humanities labs use Substack data for studying long-form journalism, polarization, paid content dynamics and creator economy trends. Cite the actor in your bibliography.

🔗 Other actors by makework36

See all actors by makework36 on the Apify Store.

Roadmap

  • v1.1: subscriber count extraction from publication homepage HTML.
  • v1.2: post body extraction for audience: "everyone" posts (free posts only).
  • v1.3: comments thread extraction.
  • v2: Substack Notes scraper as a separate actor.

Disclaimer

This actor consumes the /api/v1/archive endpoint that every Substack publication exposes by design — the same endpoint that powers the publication's own web client. You are responsible for respecting each writer's copyright on the post content, Substack's Terms of Service, and applicable data protection regulations (GDPR for EU subjects, CCPA for California subjects) when storing, transforming or redistributing the data.

🙏 Ran this Substack scraper successfully? Leaving a review helps the Apify algorithm surface this actor to other newsletter operators and creator-economy teams. Much appreciated.