# Substack Insights Scraper (`brilliant_gum/substack-insights-scraper`) Actor

Scrape Substack publications, posts (HTML + markdown for AI/RAG), comments, Notes feeds, author profiles and contact info from bios. Input-time filters (paywall, reactions, words, keywords, subscribers). Historical subscriber-growth tracking across scheduled runs.

- **URL**: https://apify.com/brilliant\_gum/substack-insights-scraper.md
- **Developed by:** [Yuliia Kulakova](https://apify.com/brilliant_gum) (community)
- **Categories:** Automation, AI, Developer tools
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.015 / entity scraped

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are a software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

````bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```bash

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Substack Insights Scraper

The most complete Substack data extractor for AI/RAG builders, content marketers, VC scouts, brand sponsors, and lead-gen teams. Pull **publications**, **posts** (HTML + markdown), **comments**, **author profiles**, **Notes feeds**, **contact info from bios**, and **historical subscriber growth** — with input-time filters so you only pay for entities that match.

A serious upgrade over the existing Substack scrapers on Apify Store: zero of the 36 others offer subscriber-growth tracking, engagement filters at input time, or a cross-publication author graph. We do.

![banner](https://i.imgur.com/nT5Cexy.png)

---

### What you get

| | |
|---|---|
| **Publication metadata** | Name, subdomain, custom domain, description, language, subscriber count (when public), payments enabled, primary author |
| **Full post archive** | Up to 500 posts per publication. Title, subtitle, body (HTML + Markdown), word count, reading time, cover image, audience (free/paid), tags, section, podcast attachment |
| **Engagement metrics** | Reactions (with per-emoji breakdown), comments, restacks, quotes |
| **AI/RAG-ready markdown** | Substack platform widgets (subscribe boxes, share buttons) stripped automatically. Optional export of per-post `.md` files with YAML front matter to the key-value store — drop straight into LlamaIndex / LangChain |
| **Nested comments** | Full reply tree per post (parent → child mapping, depth, reactions per comment) |
| **Author profiles** | Bio, photo, Twitter handle, all publications they author (cross-pub graph for KOL / influencer discovery) |
| **Notes feed** | Per-author Notes timeline including restacks (which posts the author has boosted, with full source attribution) |
| **Global publication search** | Discover newsletters by keyword across all of Substack |
| **Lead-gen contact info** | Emails, phones, payment links (PayPal, Cash App, Venmo), aggregators (Linktree, Beacons), website, and 20+ social platforms categorised |
| **Engagement filters at input** | Free-only / paid-only, min reactions / comments / restacks, min word count, min subscribers, language, date range, include / exclude keywords, has podcast. **You pay only for entities that match — no other Substack scraper on Apify does this.** |
| **Analytics block** | Per-publication engagement rate, virality index, average reactions / comments per post, average word count, publish cadence per week, paid-post ratio, momentum (when history is on) |
| **Historical trend tracking** | Persisted snapshots + automatic deltas (subscriber growth per day, posts added, paid-ratio change, trend up / down / flat) across scheduled runs |
| **Multi-format export** | JSON, CSV, Excel, XML, HTML, plus optional markdown files |

Works out of the box — proxies included, no configuration required.

---

### Use cases

- **AI / RAG builders** — schedule daily scrape of niche newsletters, get clean markdown chunks straight into Pinecone / Weaviate / LlamaIndex with one `.md` file per post
- **Content marketers / agencies** — track 15-30 competitor newsletters weekly, surface viral posts and topic gaps with the keyword + min-reactions filter
- **VC scouts / investors** — snapshot 200 creators monthly, flag newsletters with subscriber growth > X / day using the history feature
- **Brand sponsors / ad buyers** — filter publications by subscriber threshold + topic keywords to build a placement target list
- **Lead-gen / SDR** — extract author emails, websites, socials from publication bios for cold outreach
- **Recruiters** — find domain experts who write about specific topics (search by keyword in body, filter by min reactions to qualify)
- **Newsletter operators** — weekly archive diff on competitors, identify "what's working" by sorting posts by reactions
- **Journalists / researchers** — full-text public newsletter archives for citation, with provenance per post

---

### Quick start

Paste a Substack URL or bare handle — that's the whole minimum input:

```json
{
  "publicationHandles": ["noahpinion"]
}
````

You get one publication record (Noahpinion: 450,000 subscribers, custom domain `noahpinion.blog`, payments enabled) plus 25 most-recent posts with full body, reactions, comments, tags, and analytics.

***

### Common inputs

**Multiple publications in one run:**

```json
{
  "publicationHandles": ["piratewires", "platformer", "noahpinion"],
  "maxPostsPerPublication": 10
}
```

**AI / RAG corpus build — free posts only, high engagement, long-form, markdown output:**

```json
{
  "publicationHandles": ["noahpinion"],
  "maxPostsPerPublication": 50,
  "paywallFilter": "free-only",
  "minReactions": 100,
  "minWordCount": 500,
  "bodyFormat": "markdown",
  "outputMarkdownFiles": true
}
```

Each matching post is also written as `<handle>__<slug>.md` to the `substack-markdown` key-value store with YAML front matter — drop straight into LlamaIndex/LangChain.

**Lead-gen sweep with contact extraction:**

```json
{
  "searchQueries": ["startups", "founders"],
  "maxSearchResults": 20,
  "minSubscribers": 5000,
  "includeContactInfo": true,
  "includePosts": false
}
```

**Newsletter monitoring with subscriber-growth tracking — schedule daily:**

```json
{
  "publicationHandles": ["noahpinion", "platformer", "piratewires"],
  "enableHistory": true,
  "historyDatasetName": "my-newsletter-tracker"
}
```

From the second scheduled run onwards every publication carries a `history.delta` block with `subscriberChange`, `subscriberGrowthPerDay`, `postsAdded`, and a `trend` (`up` / `down` / `flat`).

**Keyword-filtered competitor research:**

```json
{
  "publicationHandles": ["noahpinion"],
  "maxPostsPerPublication": 50,
  "keywordsInclude": ["china", "trade"],
  "dateFrom": "2026-01-01"
}
```

**Single-author profile + Notes feed:**

```json
{
  "userHandles": ["casey", "noah"],
  "includeNotes": true,
  "maxNotesPerAuthor": 50
}
```

**Global discovery — find newsletters by topic:**

```json
{
  "searchQueries": ["artificial intelligence", "macroeconomics", "biotech"],
  "maxSearchResults": 10
}
```

***

### Inputs reference

#### Targeting

| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | array | — | Substack URLs (publications or posts) or `@username` form |
| `publicationHandles` | array | — | Bare publication handles (e.g. `noahpinion`) |
| `userHandles` | array | — | Bare user handles for author profile + Notes |
| `searchQueries` | array | — | Keywords for global publication search |
| `maxSearchResults` | int | 10 | Top N publications per search query |
| `includePosts` | boolean | true | Fetch the post archive per publication |
| `maxPostsPerPublication` | int | 25 | Cap on posts fetched per publication |
| `includeFullBody` | boolean | true | Extra request per post to fetch the full body |
| `bodyFormat` | enum | `both` | `both` / `html` / `markdown` / `none` |
| `includeComments` | boolean | false | Fetch nested comments per post |
| `includeNotes` | boolean | false | Fetch Notes / restacks feed per user handle |

#### Engagement filters (paid only if matched)

| Field | Default | Description |
|---|---|---|
| `paywallFilter` | `any` | `any` / `free-only` / `paid-only` |
| `minReactions` | 0 | Skip posts below this reaction count |
| `minComments` | 0 | Skip posts below this comment count |
| `minRestacks` | 0 | Skip posts below this restack count |
| `minWordCount` | 0 | Drop short posts (great for AI/RAG quality bar) |
| `maxWordCount` | 0 (unlimited) | Drop overlong posts |
| `hasPodcast` | any | Keep only posts with (or without) a podcast attachment |
| `postType` | — | Allow-list (`newsletter`, `podcast`, `thread`) |
| `language` | — | Two-letter codes (e.g. `en`, `es`) |
| `dateFrom` / `dateTo` | — | ISO date or YYYY-MM-DD |
| `keywordsInclude` | — | Must contain at least one keyword in title / subtitle / body (case-insensitive) |
| `keywordsExclude` | — | Must NOT contain |
| `minSubscribers` | 0 | Publication-level: skip pubs below this subscriber count |
| `paymentsEnabledOnly` | false | Publication-level: keep only pubs that offer paid subscriptions |

#### Enrichment

| Field | Default | Description |
|---|---|---|
| `includeContactInfo` | **true** | Emails, phones, socials, payment links from bios |
| `includeAnalytics` | **true** | Engagement rate, virality, cadence, paid-post ratio |
| `enableHistory` | false | Persist snapshots, compute deltas across runs |
| `historyDatasetName` | `substack-history` | Named store reused across scheduled runs |
| `outputMarkdownFiles` | false | Per-post `.md` files written to the `substack-markdown` KV store with YAML front matter |

***

### Output

Every record carries `type`, `id`, `handle` (or `publicationHandle` for posts), `url`, `scrapedAt`. Type-specific fields are layered on top.

#### Publication record

```json
{
  "type": "publication",
  "id": "12345",
  "handle": "noahpinion",
  "name": "Noahpinion",
  "url": "https://noahpinion.substack.com",
  "customDomain": "noahpinion.blog",
  "description": "Economics and other interesting stuff...",
  "subscriberCount": 450000,
  "subscriberCountString": "450,000",
  "paymentsEnabled": true,
  "language": "en",
  "primaryAuthor": { "id": "...", "handle": "...", "name": "..." },
  "topPosts": [ /* top 5 by reactions */ ],
  "analytics": { /* see below */ },
  "history": { /* when enableHistory */ },
  "contactInfo": { /* when includeContactInfo */ }
}
```

#### Post record

```json
{
  "type": "post",
  "id": "...",
  "slug": "...",
  "url": "...",
  "publicationHandle": "noahpinion",
  "title": "...",
  "subtitle": "...",
  "publishedAt": "...",
  "audience": "everyone" /* or "only_paid" / "founding" */,
  "isPaywalled": false,
  "wordCount": 1234,
  "readingTimeMin": 6,
  "reactionCount": 256,
  "reactionsBreakdown": { "❤": 200, "🔥": 56 },
  "commentCount": 45,
  "restackCount": 12,
  "podcastUrl": null,
  "sectionName": null,
  "tags": ["..."],
  "bodyHtml": "...",
  "bodyMarkdown": "...",
  "coAuthors": [ /* with bio */ ]
}
```

#### Analytics block

```json
{
  "sampleSize": 10,
  "sampleDateRange": { "from": "...", "to": "..." },
  "avgReactionsPerPost": 580,
  "avgCommentsPerPost": 65,
  "avgWordCount": 3200,
  "totalReactions": 5800,
  "topPostReactions": 937,
  "topPost": { "slug": "...", "title": "...", "reactionCount": 937 },
  "publishCadencePerWeek": 4.67,
  "paidPostRatio": 0.33,
  "engagementRate": 0.00129,
  "viralityIndex": 1.62,
  "momentum": { "subscriberGrowthPerDay": 250, "trend": "up" }
}
```

#### History block (with `enableHistory`)

```json
{
  "firstSeen": "2026-06-01T00:00:00Z",
  "snapshotCount": 14,
  "previous": { "at": "...", "subscriberCount": 449750 },
  "delta": {
    "daysElapsed": 1.0,
    "subscriberChange": 250,
    "subscriberGrowthPerDay": 250,
    "postsAdded": 1,
    "trend": "up"
  },
  "series": [ /* recent points */ ]
}
```

#### Contact info block

```json
{
  "emails": ["hello@example.com"],
  "phones": [],
  "website": "https://example.com",
  "socials": {
    "twitter": "...",
    "linkedin": "...",
    "medium": "...",
    "beehiiv": "..."
  },
  "paymentLinks": { "paypal": "..." },
  "aggregator": "https://linktr.ee/..."
}
```

***

### Pricing

Simple pay-per-event:

| Event | Cost |
|---|---|
| Actor start | $0.01 |
| Each scraped entity (publication / post / comment / note / author) | $0.015 |

**Example runs:**

| What you scrape | Calculation | Cost |
|---|---|---|
| 1 publication + 25 posts | $0.01 + 26 × $0.015 | **$0.40** |
| 10 publications + 25 posts each | $0.01 + 260 × $0.015 | **$3.91** |
| 100 publications daily monitoring (no posts) | $1.51 / run × 30 days | **~$45 / month** |
| AI / RAG corpus: 50 newsletters × 100 posts | $0.01 + 5,050 × $0.015 | **$75.76** |
| 500 publications discovery sweep (search + metadata) | $0.01 + 500 × $0.015 | **$7.51** |

Compare: enterprise newsletter analytics tools start at $250–500 / month. We're per-use, no commitment.

***

### FAQ

**Does this work without configuration?**
Yes. Paste a Substack URL or bare handle and run. Proxies are included.

**Why is `subscriberCount` sometimes null?**
About half of publications choose to hide their subscriber number in their public settings — when that's the case, no scraper (ours or anyone else's) can recover it without a logged-in session. When `subscriberCountVisibility: "private"` is set by the publication owner, we surface `null`. For the half that keep it public, we extract the real number (verified on Noahpinion 450k, Platformer 176k, etc.).

**Are paywalled post bodies fully scraped?**
For free posts: full body is returned. For paid posts: Substack returns a short preview (the same the site shows to non-subscribers) — we surface what's exposed and flag `isPaywalled: true`. No scraper can extract paid bodies without an `sp_dc`-style subscriber cookie (and that requires a real paid subscription).

**What's the markdown export mode?**
With `outputMarkdownFiles: true`, every scraped post body is written as a separate `.md` file to the `substack-markdown` key-value store. The file is named `<handle>__<slug>.md` and starts with YAML front matter (title, slug, publication, publishedAt, audience, reactionCount, commentCount, wordCount). Platform widgets (subscribe boxes, share buttons) are stripped. Ready for LlamaIndex / LangChain document loaders.

**Can I get historical subscriber growth?**
Yes — with `enableHistory: true` plus a stable `historyDatasetName`. From the second scheduled run onwards every publication record carries a `history.delta` block with `subscriberChange`, `subscriberGrowthPerDay`, `postsAdded`, and a `trend`. This is the only Substack scraper on Apify that does this.

**Will repeated runs over-charge me?**
Each run is metered independently. If you re-scrape the same publications daily, that's a billable entity per publication per run. The history store deduplicates time-series points so trend data stays clean.

**Can I export to CSV / Excel?**
Yes. Apify automatically offers CSV, JSON, Excel, XML and HTML on every dataset.

**Are the engagement filters applied before billing?**
Yes. If a post fails your `minReactions` / `paywallFilter` / language / date filter, it's never written to the dataset and you don't pay for it.

**What about Notes vs restacks — what's the difference?**
A Substack Note is a short original post (like a tweet). A restack is an author boosting someone else's post into their feed. Both come back from the user's Notes feed. We tag each item with `kind: "note"` (original) or `kind: "restack"` (boost), so you can filter.

**Does the cross-publication author graph work?**
Yes — for any user handle, we surface their full `publications[]` list (with role: admin / author / etc.). Find a Substack author who also writes on Beehiiv? `contactInfo.socials.beehiiv` will surface the link if it's in their bio.

**How fresh is the data?**
Real-time per request — the actor reads Substack's public feed at run time.

**What's the difference vs other Substack scrapers on Apify?**
Most return raw post dumps. We add input-time engagement filters (no charge for filtered entities), markdown export ready for AI/RAG, bio-contact extraction (lead-gen), per-publication analytics (engagement rate, virality, cadence, paid-post ratio), historical subscriber-growth tracking, and a cross-publication author graph. The 36 existing Substack scrapers on Apify offer at most one or two of these — never all.

***

### Legal

This actor scrapes only publicly available Substack data. It is **not affiliated with, endorsed by, or sponsored by Substack Inc.** Use responsibly, respect Substack's Terms of Service and the original authors' copyright. For analytics, research, journalism, lead generation, and AI training corpora.

***

Maintained by **brilliant\_gum** on Apify.

# Actor input Schema

## `urls` (type: `array`):

Publication or post URLs (e.g. https://example.substack.com or https://example.substack.com/p/some-post). Bare handles and @usernames also accepted. Leave empty if you use 'Publication handles', 'User handles' or 'Search queries' instead.

## `publicationHandles` (type: `array`):

Bare publication handles (e.g. 'noahpinion'). Convenient for CSV lists.

## `userHandles` (type: `array`):

User handles to fetch author profile and Notes feed.

## `searchQueries` (type: `array`):

Keywords to search Substack's global publication directory.

## `maxSearchResults` (type: `integer`):

How many top publications to return per search query.

## `includePosts` (type: `boolean`):

For each publication, fetch the post archive.

## `maxPostsPerPublication` (type: `integer`):

Cap on posts fetched per publication.

## `includeFullBody` (type: `boolean`):

Fetch the full body of each post (extra request per post). Disable to skip and rely on archive snippets only.

## `bodyFormat` (type: `string`):

Which body format(s) to keep. 'markdown' is the AI/RAG-friendly form.

## `includeComments` (type: `boolean`):

Fetch nested comments per post (extra request per post that has comments).

## `maxCommentsPerPost` (type: `integer`):

Cap on comments fetched per post.

## `includeNotes` (type: `boolean`):

For each user handle, pull their Notes/restacks feed.

## `maxNotesPerAuthor` (type: `integer`):

Cap on notes/restacks fetched per author handle.

## `paywallFilter` (type: `string`):

Keep all posts, only free ones, or only paid ones.

## `minReactions` (type: `integer`):

Skip posts below this reaction count.

## `minComments` (type: `integer`):

Skip posts below this comment count.

## `minRestacks` (type: `integer`):

Skip posts below this restack count.

## `minWordCount` (type: `integer`):

Useful for AI/RAG to filter out short announcements.

## `maxWordCount` (type: `integer`):

Drop posts longer than this. 0 means no upper limit.

## `dateFrom` (type: `string`):

ISO date or YYYY-MM-DD. Drop posts older than this.

## `dateTo` (type: `string`):

ISO date or YYYY-MM-DD. Drop posts newer than this.

## `keywordsInclude` (type: `array`):

Drop posts whose title + subtitle + body text doesn't include at least one of these (case-insensitive).

## `keywordsExclude` (type: `array`):

Drop posts whose text contains any of these keywords.

## `hasPodcast` (type: `string`):

Keep only posts with (or without) an attached podcast.

## `postType` (type: `array`):

Keep only certain post types (newsletter, podcast, thread). Empty = all.

## `minSubscribers` (type: `integer`):

Skip publications below this subscriber count. Note: ~50% of publications hide subscriber counts — those are also skipped when this filter is on.

## `paymentsEnabledOnly` (type: `boolean`):

Skip publications that don't offer paid subscriptions.

## `includeContactInfo` (type: `boolean`):

Parse emails, phones, payment links and categorise social/website links from bios.

## `includeAnalytics` (type: `boolean`):

Per-publication engagement rate, virality index, publish cadence, paid-post ratio, momentum (when history is on).

## `outputMarkdownFiles` (type: `boolean`):

Write each post body as a separate .md file to the 'substack-markdown' key-value store, with YAML front matter. Useful for LlamaIndex/LangChain ingestion.

## `enableHistory` (type: `boolean`):

Persist a timestamped snapshot every run into a named store, then compute deltas (subscriber growth per day, posts added, trend) vs the previous run. Schedule the actor daily/weekly to build a time-series.

## `historyDatasetName` (type: `string`):

Named key-value store to persist snapshots into. Reused across scheduled runs.

## `proxyConfiguration` (type: `object`):

Optional. Proxies are included and configured automatically — leave empty unless overriding.

## `maxConcurrency` (type: `integer`):

Parallel workers.

## `minDelayMs` (type: `integer`):

Per-request throttle.

## Actor input object example

```json
{
  "urls": [
    "https://piratewires.substack.com",
    "https://platformer.substack.com/p/the-state-of-tech",
    "@casey"
  ],
  "publicationHandles": [
    "piratewires",
    "platformer",
    "noahpinion"
  ],
  "userHandles": [
    "casey",
    "noah"
  ],
  "searchQueries": [
    "ai",
    "startups",
    "crypto"
  ],
  "maxSearchResults": 10,
  "includePosts": true,
  "maxPostsPerPublication": 25,
  "includeFullBody": true,
  "bodyFormat": "both",
  "includeComments": false,
  "maxCommentsPerPost": 100,
  "includeNotes": false,
  "maxNotesPerAuthor": 50,
  "paywallFilter": "any",
  "minReactions": 0,
  "minComments": 0,
  "minRestacks": 0,
  "minWordCount": 0,
  "maxWordCount": 0,
  "keywordsInclude": [
    "ai",
    "startup"
  ],
  "keywordsExclude": [
    "spam",
    "giveaway"
  ],
  "hasPodcast": "any",
  "postType": [
    "newsletter"
  ],
  "minSubscribers": 0,
  "paymentsEnabledOnly": false,
  "includeContactInfo": true,
  "includeAnalytics": true,
  "outputMarkdownFiles": false,
  "enableHistory": false,
  "historyDatasetName": "substack-history",
  "maxConcurrency": 5,
  "minDelayMs": 1000
}
```

# Actor output Schema

## `summary` (type: `string`):

Counts of scraped entities and pointers to the dataset and history store.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://noahpinion.substack.com"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("brilliant_gum/substack-insights-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "urls": ["https://noahpinion.substack.com"] }

# Run the Actor and wait for it to finish
run = client.actor("brilliant_gum/substack-insights-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://noahpinion.substack.com"
  ]
}' |
apify call brilliant_gum/substack-insights-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=brilliant_gum/substack-insights-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Substack Insights Scraper",
        "description": "Scrape Substack publications, posts (HTML + markdown for AI/RAG), comments, Notes feeds, author profiles and contact info from bios. Input-time filters (paywall, reactions, words, keywords, subscribers). Historical subscriber-growth tracking across scheduled runs.",
        "version": "1.0",
        "x-build-id": "gOEvW94Ne9ph6fmYI"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/brilliant_gum~substack-insights-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-brilliant_gum-substack-insights-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/brilliant_gum~substack-insights-scraper/runs": {
            "post": {
                "operationId": "runs-sync-brilliant_gum-substack-insights-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/brilliant_gum~substack-insights-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-brilliant_gum-substack-insights-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "urls": {
                        "title": "Substack URLs",
                        "type": "array",
                        "description": "Publication or post URLs (e.g. https://example.substack.com or https://example.substack.com/p/some-post). Bare handles and @usernames also accepted. Leave empty if you use 'Publication handles', 'User handles' or 'Search queries' instead.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "publicationHandles": {
                        "title": "Publication handles",
                        "type": "array",
                        "description": "Bare publication handles (e.g. 'noahpinion'). Convenient for CSV lists.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "userHandles": {
                        "title": "User handles (for author profile + Notes)",
                        "type": "array",
                        "description": "User handles to fetch author profile and Notes feed.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "searchQueries": {
                        "title": "Search queries (publication discovery)",
                        "type": "array",
                        "description": "Keywords to search Substack's global publication directory.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxSearchResults": {
                        "title": "Max results per search query",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "How many top publications to return per search query.",
                        "default": 10
                    },
                    "includePosts": {
                        "title": "Include posts",
                        "type": "boolean",
                        "description": "For each publication, fetch the post archive.",
                        "default": true
                    },
                    "maxPostsPerPublication": {
                        "title": "Max posts per publication",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Cap on posts fetched per publication.",
                        "default": 25
                    },
                    "includeFullBody": {
                        "title": "Include full post body",
                        "type": "boolean",
                        "description": "Fetch the full body of each post (extra request per post). Disable to skip and rely on archive snippets only.",
                        "default": true
                    },
                    "bodyFormat": {
                        "title": "Body format",
                        "enum": [
                            "both",
                            "html",
                            "markdown",
                            "none"
                        ],
                        "type": "string",
                        "description": "Which body format(s) to keep. 'markdown' is the AI/RAG-friendly form.",
                        "default": "both"
                    },
                    "includeComments": {
                        "title": "Include comments",
                        "type": "boolean",
                        "description": "Fetch nested comments per post (extra request per post that has comments).",
                        "default": false
                    },
                    "maxCommentsPerPost": {
                        "title": "Max comments per post",
                        "minimum": 0,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Cap on comments fetched per post.",
                        "default": 100
                    },
                    "includeNotes": {
                        "title": "Include Notes feed (per user handle)",
                        "type": "boolean",
                        "description": "For each user handle, pull their Notes/restacks feed.",
                        "default": false
                    },
                    "maxNotesPerAuthor": {
                        "title": "Max notes per author",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Cap on notes/restacks fetched per author handle.",
                        "default": 50
                    },
                    "paywallFilter": {
                        "title": "Paywall filter",
                        "enum": [
                            "any",
                            "free-only",
                            "paid-only"
                        ],
                        "type": "string",
                        "description": "Keep all posts, only free ones, or only paid ones.",
                        "default": "any"
                    },
                    "minReactions": {
                        "title": "Min reactions per post (filter applied before billing — no charge for filtered entities)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip posts below this reaction count.",
                        "default": 0
                    },
                    "minComments": {
                        "title": "Min comments per post",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip posts below this comment count.",
                        "default": 0
                    },
                    "minRestacks": {
                        "title": "Min restacks per post",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip posts below this restack count.",
                        "default": 0
                    },
                    "minWordCount": {
                        "title": "Min word count per post",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Useful for AI/RAG to filter out short announcements.",
                        "default": 0
                    },
                    "maxWordCount": {
                        "title": "Max word count per post (0 = unlimited)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Drop posts longer than this. 0 means no upper limit.",
                        "default": 0
                    },
                    "dateFrom": {
                        "title": "Earliest post date",
                        "type": "string",
                        "description": "ISO date or YYYY-MM-DD. Drop posts older than this."
                    },
                    "dateTo": {
                        "title": "Latest post date",
                        "type": "string",
                        "description": "ISO date or YYYY-MM-DD. Drop posts newer than this."
                    },
                    "keywordsInclude": {
                        "title": "Must contain keywords",
                        "type": "array",
                        "description": "Drop posts whose title + subtitle + body text doesn't include at least one of these (case-insensitive).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "keywordsExclude": {
                        "title": "Must NOT contain keywords",
                        "type": "array",
                        "description": "Drop posts whose text contains any of these keywords.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "hasPodcast": {
                        "title": "Has podcast filter",
                        "enum": [
                            "any",
                            "yes",
                            "no"
                        ],
                        "type": "string",
                        "description": "Keep only posts with (or without) an attached podcast.",
                        "default": "any"
                    },
                    "postType": {
                        "title": "Post type allow-list",
                        "type": "array",
                        "description": "Keep only certain post types (newsletter, podcast, thread). Empty = all.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "minSubscribers": {
                        "title": "Min subscribers per publication",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip publications below this subscriber count. Note: ~50% of publications hide subscriber counts — those are also skipped when this filter is on.",
                        "default": 0
                    },
                    "paymentsEnabledOnly": {
                        "title": "Payments-enabled publications only",
                        "type": "boolean",
                        "description": "Skip publications that don't offer paid subscriptions.",
                        "default": false
                    },
                    "includeContactInfo": {
                        "title": "Extract contact info (lead-gen)",
                        "type": "boolean",
                        "description": "Parse emails, phones, payment links and categorise social/website links from bios.",
                        "default": true
                    },
                    "includeAnalytics": {
                        "title": "Compute analytics block",
                        "type": "boolean",
                        "description": "Per-publication engagement rate, virality index, publish cadence, paid-post ratio, momentum (when history is on).",
                        "default": true
                    },
                    "outputMarkdownFiles": {
                        "title": "Export per-post .md files (AI/RAG mode)",
                        "type": "boolean",
                        "description": "Write each post body as a separate .md file to the 'substack-markdown' key-value store, with YAML front matter. Useful for LlamaIndex/LangChain ingestion.",
                        "default": false
                    },
                    "enableHistory": {
                        "title": "Enable historical tracking (subscriber growth & post velocity)",
                        "type": "boolean",
                        "description": "Persist a timestamped snapshot every run into a named store, then compute deltas (subscriber growth per day, posts added, trend) vs the previous run. Schedule the actor daily/weekly to build a time-series.",
                        "default": false
                    },
                    "historyDatasetName": {
                        "title": "History store name",
                        "type": "string",
                        "description": "Named key-value store to persist snapshots into. Reused across scheduled runs.",
                        "default": "substack-history"
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Optional. Proxies are included and configured automatically — leave empty unless overriding."
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Parallel workers.",
                        "default": 5
                    },
                    "minDelayMs": {
                        "title": "Min delay between requests (ms)",
                        "minimum": 200,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Per-request throttle.",
                        "default": 1000
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
