# Substack Scraper: Newsletter Posts, Archives & Subscribers (`perconey/substack-scraper`) Actor

Scrape any Substack publication: full post archive, single post detail with body, comment counts, reactions, paid/free audience, podcast metadata. No auth, no proxies, no cookies. Uses Substack official JSON API. Pay only per result.

- **URL**: https://apify.com/perconey/substack-scraper.md
- **Developed by:** [Perconey](https://apify.com/perconey) (community)
- **Categories:** Social media, Developer tools, Lead generation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

$1.00 / 1,000 result items

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Substack Scraper: Newsletter Posts, Archives & Subscribers

**Scrape any Substack publication** in seconds. Get the full post archive, single-post detail with body and comment counts, reactions, paid/free audience tier, podcast metadata - everything that Substack itself shows publicly. No browser, no proxies, no cookies, no Substack account needed. The Actor calls Substack's JSON API directly, so the data you get is the same data the web app gets, with full fidelity.

Works with every Substack publication: subdomains like `stratechery.substack.com`, custom domains like `lennysnewsletter.com` or `astralcodexten.com`, and paid newsletters (teaser content shown for paid posts).

### Why use this Substack scraper?

- **No auth, no setup** - point at any publication URL and start scraping. Substack's archive API is fully public.
- **Real data, full fidelity** - 25+ fields per post: title, subtitle, body HTML, reaction counts, comment counts, audience tier (free/paid), podcast duration, author bylines, cover image, canonical URL.
- **Custom-domain support** - we follow redirects, so `lennysnewsletter.com` works exactly like `*.substack.com`.
- **Date range filters** - `since` and `until` keep your runs scoped and cheap when monitoring on a schedule.
- **Pay-per-result pricing** - you only pay for posts you actually receive. Stop a run any time and the meter stops.
- **API + scheduler + integrations** - run from the Apify API, cron-schedule, fire webhooks into Slack/Make/Zapier on every new post.

### Use cases

- **Competitor / market research** - track top newsletters in your niche, see what topics drive reactions, monitor publication cadence.
- **Content audits** - export an entire newsletter's archive to spreadsheet, sort by reaction count, find your best-performing posts.
- **Lead generation** - find industry newsletters with highly engaged audiences using public reaction and comment counts (the Actor returns counts, not commenter or subscriber identities).
- **NLP / research datasets** - bulk-export posts with body text and metadata for sentiment, topic modeling, embedding indexes.
- **News monitoring** - schedule daily scrapes of trade newsletters, alert on new posts via Apify webhook.
- **Migration backups** - if you run a Substack, this is the easiest way to back up your full archive as JSON.

### How to use the Substack scraper

1. Pick an action from the **What do you want to scrape?** dropdown: `getArchive` for a publication's post list, `getPost` for one post's full body.
2. Fill **Substack publication URLs** - subdomain (`https://stratechery.substack.com`) or custom domain (`https://www.lennysnewsletter.com`).
3. Set **Max posts per publication** (default 100, 0 = unlimited).
4. Optional: `since` / `until` ISO dates, `audience` filter (everyone / only_paid / only_free), `includeBody` to fetch full HTML body for every post.
5. Click **Save & Start**. Results stream into the Dataset tab in real time. Export as JSON, CSV, Excel, HTML or XML.

### Input

| Field | Required | What it does |
|---|---|---|
| `action` | yes | `getArchive` or `getPost` |
| `publications` | yes | Publication URLs (getArchive) or post URLs (getPost), one per line |
| `maxItems` | no | Max posts per publication for getArchive (0 = unlimited) |
| `since` / `until` | no | ISO date filter for post_date |
| `audience` | no | `everyone` (default), `only_paid`, or `only_free` |
| `includeBody` | no | Default false. Set true to fetch each post's body_html in the archive run (one extra API call per post). |
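
The table above can be turned into a small input builder. This is a minimal sketch, not part of the Actor: `build_input` is a hypothetical helper, its checks only mirror the constraints listed in this README, and sending the archive-only options solely for `getArchive` is an assumption.

```python
from typing import List, Optional

ACTIONS = {"getArchive", "getPost"}
AUDIENCES = {"everyone", "only_paid", "only_free"}


def build_input(
    action: str,
    publications: List[str],
    max_items: int = 100,
    since: Optional[str] = None,
    until: Optional[str] = None,
    audience: str = "everyone",
    include_body: bool = False,
) -> dict:
    """Build an Actor input dict (hypothetical helper, not shipped with the Actor)."""
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {sorted(ACTIONS)}")
    if not publications:
        raise ValueError("publications must contain at least one URL")
    if audience not in AUDIENCES:
        raise ValueError(f"audience must be one of {sorted(AUDIENCES)}")
    if max_items < 0:
        raise ValueError("max_items must be >= 0 (0 means unlimited)")

    run_input = {"action": action, "publications": list(publications)}
    if action == "getArchive":
        # Assumption: these options only matter for archive runs.
        run_input["maxItems"] = max_items
        run_input["audience"] = audience
        run_input["includeBody"] = include_body
        if since:
            run_input["since"] = since
        if until:
            run_input["until"] = until
    return run_input
```

The returned dict can be passed as `run_input` to the Python client shown in the API section.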

### Output

Every run produces one **Dataset** with one item per result. Real example - `getArchive` on `lennysnewsletter.com`, 3 most recent posts:

```json
[
  { "_type": "post", "_action": "getArchive", "_publication": "https://www.lennysnewsletter.com",
    "title": "Why SaaS freemium playbooks don't work in AI, and what to do instead",
    "post_date": "2026-05-05T...", "audience": "only_paid",
    "canonical_url": "https://www.lennysnewsletter.com/p/...",
    "reaction_count": 292, "comment_count": 187, "word_count": 4521,
    "author_name": "Lenny Rachitsky" },
  { "_type": "post", "title": "Your Couch-to-5K for AI", "audience": "only_paid",
    "reaction_count": 363, "comment_count": 142 },
  { "_type": "post", "title": "New: A free year of Cursor, Google AI Pro...",
    "audience": "everyone", "reaction_count": 297, "comment_count": 88 }
]
````

You can download the dataset in various formats such as **JSON, JSON-Lines, CSV, Excel, HTML or XML**, or fetch programmatically via the [Apify Dataset API](https://docs.apify.com/api/v2#/reference/datasets).

### Data fields

| Field | Type | Description |
|---|---|---|
| `_type` | string | `post` or `error` |
| `_action` | string | The action that produced this row |
| `_publication` | string | Publication base URL |
| `id` / `slug` | string | Substack internal id and url slug |
| `title` / `subtitle` / `description` | string | Headlines |
| `post_date` | ISO 8601 | When the post was published |
| `type` | string | `newsletter`, `podcast`, `thread`, etc. |
| `audience` | string | `everyone` (free), `only_paid`, `only_free` |
| `canonical_url` | string | Web URL of the post |
| `cover_image` | string | Hero image URL |
| `word_count` | int | Word count (may be null for some posts) |
| `reaction_count` / `reactions` | int / object | Heart count + per-emoji breakdown |
| `comment_count` / `child_comment_count` | int | Top-level and total comments |
| `podcast_duration` | int | Seconds (podcast posts only) |
| `author_name` / `author_handle` / `author_id` | string | Primary author |
| `body_html` | string | Full HTML body. Empty in archive mode unless `includeBody=true`; always populated in `getPost` mode. |
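
As an illustration of working with these fields, the sketch below separates `post` rows from `error` rows and ranks posts by `reaction_count`, which is the "content audit" use case. `split_and_rank` is a hypothetical helper. The sample rows reuse titles and counts from the example output above, and the bare error row is an assumption, since the shape of error items is not documented here.

```python
from typing import List, Tuple


def split_and_rank(items: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (posts sorted by reaction_count descending, error rows)."""
    posts = [item for item in items if item.get("_type") == "post"]
    errors = [item for item in items if item.get("_type") == "error"]
    # Treat a missing or null reaction_count as 0 so sorting never fails.
    posts.sort(key=lambda post: post.get("reaction_count") or 0, reverse=True)
    return posts, errors


sample_items = [
    {"_type": "post", "reaction_count": 292,
     "title": "Why SaaS freemium playbooks don't work in AI, and what to do instead"},
    {"_type": "post", "reaction_count": 363, "title": "Your Couch-to-5K for AI"},
    {"_type": "post", "reaction_count": 297,
     "title": "New: A free year of Cursor, Google AI Pro..."},
    {"_type": "error"},  # assumption: only `_type` is known for error rows
]

ranked, failed = split_and_rank(sample_items)
for post in ranked:
    print(post["reaction_count"], post["title"])
```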

### Pricing - what does scraping Substack cost?

Pricing is **pay-per-result** - you pay only for posts you receive. A budget cap on each run means you never spend more than you allow.

**Sample budgets** at the published per-item price:

| Use case | Items | Approx. cost |
|---|---|---|
| Monitor a newsletter for 100 latest posts | 100 | ~$0.10 |
| Full archive of a 500-post publication | 500 | ~$0.50 |
| Daily scheduled scrape, 5 new posts/day, 30 days | 150 | ~$0.15 |
| 50-publication competitive scan, 20 posts each | 1,000 | ~$1.00 |

See the **Pricing** section on this page for the exact per-item rate.
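
The sample budgets follow from simple arithmetic on the listed rate of $1.00 per 1,000 result items. A minimal sketch, assuming that rate and ignoring any extra events that `includeBody` may add (the README says it raises the event count but not by how much); `estimate_cost_usd` is a hypothetical helper:

```python
def estimate_cost_usd(items: int, price_per_1000_usd: float = 1.00) -> float:
    """Estimate the run cost for a number of result items."""
    if items < 0:
        raise ValueError("items must be >= 0")
    return round(items * price_per_1000_usd / 1000, 4)


# Rows from the sample budget table above.
for label, items in [
    ("100 latest posts", 100),
    ("500-post archive", 500),
    ("5 posts/day for 30 days", 5 * 30),
    ("50 publications x 20 posts", 50 * 20),
]:
    print(f"{label}: ~${estimate_cost_usd(items):.2f}")
```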

### Tips & advanced options

- **Schedule it.** Set `since` to the last-run timestamp in your scheduled task so each run only ingests new posts. Cost stays flat regardless of publication age.
- **Skip bodies by default.** `getArchive` returns metadata only by default - that's enough to know what's new. Use `getPost` separately when you actually need a full post's body.
- **Custom domains work transparently.** `https://www.lennysnewsletter.com` and `https://lenny.substack.com` both work; we follow Substack's redirects automatically.
- **Audience filter is useful for "free-only" datasets.** Paid posts return teaser content only - if you don't want them, set `audience: only_free`.
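
The scheduling tip can be sketched as a function that derives the next run's input from the previous run's timestamp. `incremental_input` and its `overlap` margin are hypothetical. Because `since` is inclusive ("on or after"), a post published exactly at the boundary may be returned twice, so deduplicating by `id` downstream is a reasonable precaution.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def incremental_input(
    publications: List[str],
    last_run_at: datetime,
    overlap: timedelta = timedelta(0),
    free_only: bool = False,
) -> dict:
    """Build a getArchive input that only asks for posts since the last run."""
    if last_run_at.tzinfo is None:
        raise ValueError("last_run_at must be timezone-aware")
    # Optionally step back a little so posts near the boundary are not missed.
    since_utc = (last_run_at - overlap).astimezone(timezone.utc)
    return {
        "action": "getArchive",
        "publications": list(publications),
        "since": since_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "audience": "only_free" if free_only else "everyone",
        "includeBody": False,  # metadata only; fetch bodies with getPost when needed
    }
```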

### Integrations

- **REST API** - `POST /v2/acts/perconey~substack-scraper/runs`
- **Scheduler** - cron-style in Apify console
- **Webhooks** - Slack, Discord, custom endpoints on `RUN_SUCCEEDED`
- **Sheets / Notion / Airtable / Google Drive** via [Apify Integrations](https://apify.com/integrations)
- **Make / Zapier / n8n** via the same catalog
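
For the REST API route, a request can be assembled with the standard library alone. This sketch relies on the `https://api.apify.com/v2` server URL and the `token` query parameter from the OpenAPI specification further down; `build_run_request` and `start_run` are hypothetical helper names, and `start_run` performs a real network call when invoked.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "perconey~substack-scraper"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the Actor's /runs endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/runs?{query}"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_run(token: str, run_input: dict) -> dict:
    """Send the request and return the run object from the response's `data` key."""
    with urllib.request.urlopen(build_run_request(token, run_input)) as response:
        return json.load(response)["data"]
```

The returned run object includes `defaultDatasetId`, which identifies the dataset holding the results.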

### FAQ

**Do I need a Substack account?**
No. Every action works fully anonymously.

**What about paid posts?**
Paid posts return teaser content (title, subtitle, preview, audience flag). The full body is behind the paywall - the Actor surfaces what Substack itself exposes to non-subscribers.

**Is this allowed by Substack's terms of service?**
The Actor uses Substack's public JSON API the same way the web app does. Public data is publicly readable. Use the results responsibly - respect privacy, attribute creators, don't redistribute paid-only content.

**Can I scrape comments too?**
The `comment_count` field is included on every post. Full comment threads are a separate Substack endpoint - planned for a future release.

**Rate limits?**
Substack is generally lenient on the public archive endpoint. The Actor paces requests at roughly 6 per second and retries on HTTP 429, honoring `Retry-After`. Heavy parallel runs may still hit limits - start with one run.
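
To illustrate what pacing and `Retry-After` handling generally look like, here is a generic sketch. It is not the Actor's actual implementation; `retry_delay`, the exponential fallback, and the 60-second cap are all assumptions. Header lookup is case-sensitive in this simplified version.

```python
from typing import Mapping

# Minimum gap between requests for a pace of about 6 requests per second.
MIN_INTERVAL_S = 1 / 6


def retry_delay(
    headers: Mapping[str, str],
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 60.0,
) -> float:
    """Seconds to wait before retrying after an HTTP 429 response."""
    raw = headers.get("Retry-After")
    if raw is not None:
        try:
            # Retry-After given as a number of seconds.
            return min(max(float(raw), 0.0), cap_s)
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Fallback: exponential backoff (base, 2x, 4x, ...) up to the cap.
    return min(base_s * (2 ** attempt), cap_s)
```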

### Support & feedback

Bug, feature, or custom version? Open an issue from the **Issues** tab on the Apify page, or message `@perconey.bsky.social` on Bluesky.

***

*Disclaimer: this scraper reads public Substack data only. Don't use it to harass writers, scrape paid content for redistribution, or violate Substack's [Terms of Use](https://substack.com/tos).*

# Actor input Schema

## `action` (type: `string`):

Pick the type of data to collect. Each action accepts a different input format (filled below).

## `publications` (type: `array`):

For getArchive: one publication URL per line. Either subdomain form (https://on.substack.com) or custom domain (https://www.lennysnewsletter.com). For getPost: paste full post URLs (https://on.substack.com/p/open-tab-emily-sundberg).

## `maxItems` (type: `integer`):

For getArchive only. Maximum posts to fetch per publication. Use 0 for unlimited (the entire archive). Default 100.

## `since` (type: `string`):

Only posts published on or after this date. Format 2024-01-01 or 2024-01-01T00:00:00Z. Applies to getArchive.

## `until` (type: `string`):

Only posts published before this date.

## `audience` (type: `string`):

Filter posts by paywall audience. 'everyone' = all posts (free + paid teasers), 'only\_paid' = only paid posts (you only see teaser/title if not subscribed), 'only\_free' = only fully free posts.

## `includeBody` (type: `boolean`):

For getArchive: by default we return metadata only. Enable this to also fetch each post's full HTML body in the same run (uses one extra API call per post; raises run time and event count).

## Actor input object example

```json
{
  "action": "getArchive",
  "publications": [
    "https://on.substack.com"
  ],
  "maxItems": 100,
  "audience": "everyone",
  "includeBody": false
}
```
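
The `since` / `until` descriptions above imply an inclusive lower bound and an exclusive upper bound. A minimal sketch of that window check, useful for re-filtering results locally; `parse_iso` and `in_range` are hypothetical helpers, and reading date-only values as midnight UTC is an assumption the schema does not confirm:

```python
from datetime import datetime, timezone
from typing import Optional


def parse_iso(value: str) -> datetime:
    """Parse 2024-01-01 or 2024-01-01T00:00:00Z into an aware UTC datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def in_range(
    post_date: str,
    since: Optional[str] = None,
    until: Optional[str] = None,
) -> bool:
    """True if since <= post_date < until (either bound may be omitted)."""
    published = parse_iso(post_date)
    if since is not None and published < parse_iso(since):
        return False
    if until is not None and published >= parse_iso(until):
        return False
    return True
```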

# Actor output Schema

## `dataset` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "action": "getArchive",
    "publications": [
        "https://on.substack.com"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("perconey/substack-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "action": "getArchive", "publications": ["https://on.substack.com"] }

# Run the Actor and wait for it to finish
run = client.actor("perconey/substack-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "action": "getArchive",
  "publications": [
    "https://on.substack.com"
  ]
}' |
apify call perconey/substack-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=perconey/substack-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Substack Scraper: Newsletter Posts, Archives & Subscribers",
        "description": "Scrape any Substack publication: full post archive, single post detail with body, comment counts, reactions, paid/free audience, podcast metadata. No auth, no proxies, no cookies. Uses Substack official JSON API. Pay only per result.",
        "version": "0.2",
        "x-build-id": "MAMsGDXrCgpacAErs"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/perconey~substack-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-perconey-substack-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/perconey~substack-scraper/runs": {
            "post": {
                "operationId": "runs-sync-perconey-substack-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/perconey~substack-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-perconey-substack-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "action",
                    "publications"
                ],
                "properties": {
                    "action": {
                        "title": "What do you want to scrape?",
                        "enum": [
                            "getArchive",
                            "getPost"
                        ],
                        "type": "string",
                        "description": "Pick the type of data to collect. Each action accepts different input format (filled below).",
                        "default": "getArchive"
                    },
                    "publications": {
                        "title": "Substack publication URLs",
                        "type": "array",
                        "description": "For getArchive: one publication URL per line. Either subdomain form (https://on.substack.com) or custom domain (https://www.lennysnewsletter.com). For getPost: paste full post URLs (https://on.substack.com/p/open-tab-emily-sundberg).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxItems": {
                        "title": "Max posts per publication",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "For getArchive only. Maximum posts to fetch per publication. Use 0 for unlimited (the entire archive). Default 100.",
                        "default": 100
                    },
                    "since": {
                        "title": "Since (ISO date)",
                        "type": "string",
                        "description": "Only posts published on or after this date. Format 2024-01-01 or 2024-01-01T00:00:00Z. Applies to getArchive."
                    },
                    "until": {
                        "title": "Until (ISO date)",
                        "type": "string",
                        "description": "Only posts published before this date."
                    },
                    "audience": {
                        "title": "Audience filter",
                        "enum": [
                            "everyone",
                            "only_paid",
                            "only_free"
                        ],
                        "type": "string",
                        "description": "Filter posts by paywall audience. 'everyone' = all posts (free + paid teasers), 'only_paid' = only paid posts (you only see teaser/title if not subscribed), 'only_free' = only fully free posts.",
                        "default": "everyone"
                    },
                    "includeBody": {
                        "title": "Include full post body",
                        "type": "boolean",
                        "description": "For getArchive: by default we return metadata only. Enable this to also fetch each post's full HTML body in the same run (uses one extra API call per post; raises run time and event count).",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```