Substack Insights Scraper
Pricing
from $0.015 / entity scraped
Substack Insights Scraper
Scrape Substack publications, posts (HTML + markdown for AI/RAG), comments, Notes feeds, author profiles and contact info from bios. Input-time filters (paywall, reactions, words, keywords, subscribers). Historical subscriber-growth tracking across scheduled runs.
Pricing
from $0.015 / entity scraped
Rating
0.0
(0)
Developer
Yuliia Kulakova
Maintained by CommunityActor stats
0
Bookmarked
2
Total users
1
Monthly active users
2 days ago
Last modified
Categories
Share
The most complete Substack data extractor for AI/RAG builders, content marketers, VC scouts, brand sponsors, and lead-gen teams. Pull publications, posts (HTML + markdown), comments, author profiles, Notes feeds, contact info from bios, and historical subscriber growth — with input-time filters so you only pay for entities that match.
A serious upgrade over the existing Substack scrapers on Apify Store: zero of the 36 others offer subscriber-growth tracking, engagement filters at input time, or a cross-publication author graph. We do.

What you get
| Publication metadata | Name, subdomain, custom domain, description, language, subscriber count (when public), payments enabled, primary author |
| Full post archive | Up to 500 posts per publication. Title, subtitle, body (HTML + Markdown), word count, reading time, cover image, audience (free/paid), tags, section, podcast attachment |
| Engagement metrics | Reactions (with per-emoji breakdown), comments, restacks, quotes |
| AI/RAG-ready markdown | Substack platform widgets (subscribe boxes, share buttons) stripped automatically. Optional export of per-post .md files with YAML front matter to the key-value store — drop straight into LlamaIndex / LangChain |
| Nested comments | Full reply tree per post (parent → child mapping, depth, reactions per comment) |
| Author profiles | Bio, photo, Twitter handle, all publications they author (cross-pub graph for KOL / influencer discovery) |
| Notes feed | Per-author Notes timeline including restacks (which posts the author has boosted, with full source attribution) |
| Global publication search | Discover newsletters by keyword across all of Substack |
| Lead-gen contact info | Emails, phones, payment links (PayPal, Cash App, Venmo), aggregators (Linktree, Beacons), website, and 20+ social platforms categorised |
| Engagement filters at input | Free-only / paid-only, min reactions / comments / restacks, min word count, min subscribers, language, date range, include / exclude keywords, has podcast. You pay only for entities that match — no other Substack scraper on Apify does this. |
| Analytics block | Per-publication engagement rate, virality index, average reactions / comments per post, average word count, publish cadence per week, paid-post ratio, momentum (when history is on) |
| Historical trend tracking | Persisted snapshots + automatic deltas (subscriber growth per day, posts added, paid-ratio change, trend up / down / flat) across scheduled runs |
| Multi-format export | JSON, CSV, Excel, XML, HTML, plus optional markdown files |
Works out of the box — proxies included, no configuration required.
Use cases
- AI / RAG builders — schedule daily scrape of niche newsletters, get clean markdown chunks straight into Pinecone / Weaviate / LlamaIndex with one
.mdfile per post - Content marketers / agencies — track 15-30 competitor newsletters weekly, surface viral posts and topic gaps with the keyword + min-reactions filter
- VC scouts / investors — snapshot 200 creators monthly, flag newsletters with subscriber growth > X / day using the history feature
- Brand sponsors / ad buyers — filter publications by subscriber threshold + topic keywords to build a placement target list
- Lead-gen / SDR — extract author emails, websites, socials from publication bios for cold outreach
- Recruiters — find domain experts who write about specific topics (search by keyword in body, filter by min reactions to qualify)
- Newsletter operators — weekly archive diff on competitors, identify "what's working" by sorting posts by reactions
- Journalists / researchers — full-text public newsletter archives for citation, with provenance per post
Quick start
Paste a Substack URL or bare handle — that's the whole minimum input:
{"publicationHandles": ["noahpinion"]}
You get one publication record (Noahpinion: 450,000 subscribers, custom domain noahpinion.blog, payments enabled) plus 25 most-recent posts with full body, reactions, comments, tags, and analytics.
Common inputs
Multiple publications in one run:
{"publicationHandles": ["piratewires", "platformer", "noahpinion"],"maxPostsPerPublication": 10}
AI / RAG corpus build — free posts only, high engagement, long-form, markdown output:
{"publicationHandles": ["noahpinion"],"maxPostsPerPublication": 50,"paywallFilter": "free-only","minReactions": 100,"minWordCount": 500,"bodyFormat": "markdown","outputMarkdownFiles": true}
Each matching post is also written as <handle>__<slug>.md to the substack-markdown key-value store with YAML front matter — drop straight into LlamaIndex/LangChain.
Lead-gen sweep with contact extraction:
{"searchQueries": ["startups", "founders"],"maxSearchResults": 20,"minSubscribers": 5000,"includeContactInfo": true,"includePosts": false}
Newsletter monitoring with subscriber-growth tracking — schedule daily:
{"publicationHandles": ["noahpinion", "platformer", "piratewires"],"enableHistory": true,"historyDatasetName": "my-newsletter-tracker"}
From the second scheduled run onwards every publication carries a history.delta block with subscriberChange, subscriberGrowthPerDay, postsAdded, and a trend (up / down / flat).
Keyword-filtered competitor research:
{"publicationHandles": ["noahpinion"],"maxPostsPerPublication": 50,"keywordsInclude": ["china", "trade"],"dateFrom": "2026-01-01"}
Single-author profile + Notes feed:
{"userHandles": ["casey", "noah"],"includeNotes": true,"maxNotesPerAuthor": 50}
Global discovery — find newsletters by topic:
{"searchQueries": ["artificial intelligence", "macroeconomics", "biotech"],"maxSearchResults": 10}
Inputs reference
Targeting
| Field | Type | Default | Description |
|---|---|---|---|
urls | array | — | Substack URLs (publications or posts) or @username form |
publicationHandles | array | — | Bare publication handles (e.g. noahpinion) |
userHandles | array | — | Bare user handles for author profile + Notes |
searchQueries | array | — | Keywords for global publication search |
maxSearchResults | int | 10 | Top N publications per search query |
includePosts | boolean | true | Fetch the post archive per publication |
maxPostsPerPublication | int | 25 | Cap on posts fetched per publication |
includeFullBody | boolean | true | Extra request per post to fetch the full body |
bodyFormat | enum | both | both / html / markdown / none |
includeComments | boolean | false | Fetch nested comments per post |
includeNotes | boolean | false | Fetch Notes / restacks feed per user handle |
Engagement filters (paid only if matched)
| Field | Default | Description |
|---|---|---|
paywallFilter | any | any / free-only / paid-only |
minReactions | 0 | Skip posts below this reaction count |
minComments | 0 | Skip posts below this comment count |
minRestacks | 0 | Skip posts below this restack count |
minWordCount | 0 | Drop short posts (great for AI/RAG quality bar) |
maxWordCount | 0 (unlimited) | Drop overlong posts |
hasPodcast | any | Keep only posts with (or without) a podcast attachment |
postType | — | Allow-list (newsletter, podcast, thread) |
language | — | Two-letter codes (e.g. en, es) |
dateFrom / dateTo | — | ISO date or YYYY-MM-DD |
keywordsInclude | — | Must contain at least one keyword in title / subtitle / body (case-insensitive) |
keywordsExclude | — | Must NOT contain |
minSubscribers | 0 | Publication-level: skip pubs below this subscriber count |
paymentsEnabledOnly | false | Publication-level: keep only pubs that offer paid subscriptions |
Enrichment
| Field | Default | Description |
|---|---|---|
includeContactInfo | true | Emails, phones, socials, payment links from bios |
includeAnalytics | true | Engagement rate, virality, cadence, paid-post ratio |
enableHistory | false | Persist snapshots, compute deltas across runs |
historyDatasetName | substack-history | Named store reused across scheduled runs |
outputMarkdownFiles | false | Per-post .md files written to the substack-markdown KV store with YAML front matter |
Output
Every record carries type, id, handle (or publicationHandle for posts), url, scrapedAt. Type-specific fields are layered on top.
Publication record
{"type": "publication","id": "12345","handle": "noahpinion","name": "Noahpinion","url": "https://noahpinion.substack.com","customDomain": "noahpinion.blog","description": "Economics and other interesting stuff...","subscriberCount": 450000,"subscriberCountString": "450,000","paymentsEnabled": true,"language": "en","primaryAuthor": { "id": "...", "handle": "...", "name": "..." },"topPosts": [ /* top 5 by reactions */ ],"analytics": { /* see below */ },"history": { /* when enableHistory */ },"contactInfo": { /* when includeContactInfo */ }}
Post record
{"type": "post","id": "...","slug": "...","url": "...","publicationHandle": "noahpinion","title": "...","subtitle": "...","publishedAt": "...","audience": "everyone" /* or "only_paid" / "founding" */,"isPaywalled": false,"wordCount": 1234,"readingTimeMin": 6,"reactionCount": 256,"reactionsBreakdown": { "❤": 200, "🔥": 56 },"commentCount": 45,"restackCount": 12,"podcastUrl": null,"sectionName": null,"tags": ["..."],"bodyHtml": "...","bodyMarkdown": "...","coAuthors": [ /* with bio */ ]}
Analytics block
{"sampleSize": 10,"sampleDateRange": { "from": "...", "to": "..." },"avgReactionsPerPost": 580,"avgCommentsPerPost": 65,"avgWordCount": 3200,"totalReactions": 5800,"topPostReactions": 937,"topPost": { "slug": "...", "title": "...", "reactionCount": 937 },"publishCadencePerWeek": 4.67,"paidPostRatio": 0.33,"engagementRate": 0.00129,"viralityIndex": 1.62,"momentum": { "subscriberGrowthPerDay": 250, "trend": "up" }}
History block (with enableHistory)
{"firstSeen": "2026-06-01T00:00:00Z","snapshotCount": 14,"previous": { "at": "...", "subscriberCount": 449750 },"delta": {"daysElapsed": 1.0,"subscriberChange": 250,"subscriberGrowthPerDay": 250,"postsAdded": 1,"trend": "up"},"series": [ /* recent points */ ]}
Contact info block
{"emails": ["hello@example.com"],"phones": [],"website": "https://example.com","socials": {"twitter": "...","linkedin": "...","medium": "...","beehiiv": "..."},"paymentLinks": { "paypal": "..." },"aggregator": "https://linktr.ee/..."}
Pricing
Simple pay-per-event:
| Event | Cost |
|---|---|
| Actor start | $0.01 |
| Each scraped entity (publication / post / comment / note / author) | $0.015 |
Example runs:
| What you scrape | Calculation | Cost |
|---|---|---|
| 1 publication + 25 posts | $0.01 + 26 × $0.015 | $0.40 |
| 10 publications + 25 posts each | $0.01 + 260 × $0.015 | $3.91 |
| 100 publications daily monitoring (no posts) | $1.51 / run × 30 days | ~$45 / month |
| AI / RAG corpus: 50 newsletters × 100 posts | $0.01 + 5,050 × $0.015 | $75.76 |
| 500 publications discovery sweep (search + metadata) | $0.01 + 500 × $0.015 | $7.51 |
Compare: enterprise newsletter analytics tools start at $250–500 / month. We're per-use, no commitment.
FAQ
Does this work without configuration? Yes. Paste a Substack URL or bare handle and run. Proxies are included.
Why is subscriberCount sometimes null?
About half of publications choose to hide their subscriber number in their public settings — when that's the case, no scraper (ours or anyone else's) can recover it without a logged-in session. When subscriberCountVisibility: "private" is set by the publication owner, we surface null. For the half that keep it public, we extract the real number (verified on Noahpinion 450k, Platformer 176k, etc.).
Are paywalled post bodies fully scraped?
For free posts: full body is returned. For paid posts: Substack returns a short preview (the same the site shows to non-subscribers) — we surface what's exposed and flag isPaywalled: true. No scraper can extract paid bodies without an sp_dc-style subscriber cookie (and that requires a real paid subscription).
What's the markdown export mode?
With outputMarkdownFiles: true, every scraped post body is written as a separate .md file to the substack-markdown key-value store. The file is named <handle>__<slug>.md and starts with YAML front matter (title, slug, publication, publishedAt, audience, reactionCount, commentCount, wordCount). Platform widgets (subscribe boxes, share buttons) are stripped. Ready for LlamaIndex / LangChain document loaders.
Can I get historical subscriber growth?
Yes — with enableHistory: true plus a stable historyDatasetName. From the second scheduled run onwards every publication record carries a history.delta block with subscriberChange, subscriberGrowthPerDay, postsAdded, and a trend. This is the only Substack scraper on Apify that does this.
Will repeated runs over-charge me? Each run is metered independently. If you re-scrape the same publications daily, that's a billable entity per publication per run. The history store deduplicates time-series points so trend data stays clean.
Can I export to CSV / Excel? Yes. Apify automatically offers CSV, JSON, Excel, XML and HTML on every dataset.
Are the engagement filters applied before billing?
Yes. If a post fails your minReactions / paywallFilter / language / date filter, it's never written to the dataset and you don't pay for it.
What about Notes vs restacks — what's the difference?
A Substack Note is a short original post (like a tweet). A restack is an author boosting someone else's post into their feed. Both come back from the user's Notes feed. We tag each item with kind: "note" (original) or kind: "restack" (boost), so you can filter.
Does the cross-publication author graph work?
Yes — for any user handle, we surface their full publications[] list (with role: admin / author / etc.). Find a Substack author who also writes on Beehiiv? contactInfo.socials.beehiiv will surface the link if it's in their bio.
How fresh is the data? Real-time per request — the actor reads Substack's public feed at run time.
What's the difference vs other Substack scrapers on Apify? Most return raw post dumps. We add input-time engagement filters (no charge for filtered entities), markdown export ready for AI/RAG, bio-contact extraction (lead-gen), per-publication analytics (engagement rate, virality, cadence, paid-post ratio), historical subscriber-growth tracking, and a cross-publication author graph. The 36 existing Substack scrapers on Apify offer at most one or two of these — never all.
Legal
This actor scrapes only publicly available Substack data. It is not affiliated with, endorsed by, or sponsored by Substack Inc. Use responsibly, respect Substack's Terms of Service and the original authors' copyright. For analytics, research, journalism, lead generation, and AI training corpora.
Maintained by brilliant_gum on Apify.