Substack Newsletter Scraper avatar

Substack Newsletter Scraper

Pricing

Pay per usage

Go to Apify Store
Substack Newsletter Scraper

Substack Newsletter Scraper

Scrape posts from any Substack publication — title, subtitle, date, paywall status, reaction count, comment count, word count, and full body HTML for free posts. Handles custom domains. Paginates to the full archive.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

DevilScrapes

DevilScrapes

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

4 days ago

Last modified

Share


🎯 What this scrapes

Substack hosts thousands of newsletters across every niche — from tech and finance to culture and investigative journalism. This Actor taps each publication's public JSON API to extract the full post archive: metadata for every post, plus body HTML for free-tier content. Works with native Substack domains (author.substack.com) and custom domains alike.

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotationcurl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured.
  • 🧱 Rate-limit-aware pacing — when the target pushes back, we slow down instead of getting banned.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing — you only pay for results that hit your dataset. No data, no charge.

💡 Use cases

  • Newsletter intelligence — track a competitor publication's cadence, topics, and engagement over time.
  • Content research — aggregate posts across multiple newsletters on a niche to find trends and gaps.
  • Lead gen — identify high-engagement free posts to understand what resonates with a target audience.
  • Archive backfill — pull the full post history into a warehouse for analysis or ML training.
  • Paywall mapping — understand the free-vs-paid content split for any publication.

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — most fields have sensible defaults.
  3. Click Start. Output streams into the run's dataset.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.

📥 Input

FieldTypeRequiredDefaultNotes
publicationsarrayyes['https://www.lennysnewsletter.com']One or more Substack publication URLs or custom domains. Accepts both native substack.com URLs and custom domains (e.g.
maxPostsPerPublicationintegerno50Maximum number of posts to collect per publication. Set to 0 to collect the full archive.
includeBodybooleannoTrueWhen enabled, body_html is included for free posts. Paywalled posts return null body.
proxyConfigurationobjectno{'useApifyProxy': True}Apify Proxy configuration. Substack's public API is well-behaved; the default shared pool is sufficient for most volumes

Example input

{
"publications": [
"https://www.lennysnewsletter.com"
],
"maxPostsPerPublication": 5,
"includeBody": false,
"proxyConfiguration": {
"useApifyProxy": true
}
}

📤 Output

Every row is one dataset item.

FieldTypeNotes
publicationstringBase URL of the publication that produced this post.
post_idintegerSubstack internal numeric post ID.
titlestringPost title.
subtitle['string', 'null']Post subtitle / deck (may be null).
urlstringCanonical URL of the post.
post_datestringPublication timestamp (ISO-8601).
audiencestringAccess tier: free or only_paid.
is_paywalledbooleanTrue when the post is behind the paywall.
reaction_countintegerTotal like/heart reactions.
comment_countintegerNumber of public comments.
word_count['integer', 'null']Word count when provided by the API.
body_html['string', 'null']Full post HTML for free posts; null for paywalled posts or when includeBody is false.
scraped_atstringISO-8601 timestamp of when this row was recorded.

Example output

{
"publication": "https://www.lennysnewsletter.com",
"post_id": 123456,
"title": "The most important metric you're ignoring",
"subtitle": "How to think about retention in your product",
"url": "https://www.lennysnewsletter.com/p/the-most-important-metric",
"post_date": "2024-03-15T09:00:00.000Z",
"audience": "free",
"is_paywalled": false,
"reaction_count": 847,
"comment_count": 43,
"word_count": 2100,
"body_html": null,
"scraped_at": "2026-06-07T10:00:00Z"
}

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

EventUSDWhat it is
actor-start$0.005One-off warm-up charge per run
result$0.0025Per dataset item

Example: 1 000 results at the rates above ≈ $2.50. No subscription, no minimum, no card to start — Apify gives every new account $5 of free credit.

🚧 Limitations

Body HTML is only available for posts the publication makes free. Paywalled posts return metadata only. Private Substack communities (invite-only) are not accessible. The Actor honours Substack's public API rate limits — very large archives may take several minutes.

❓ FAQ

Does this work with custom domains?

Yes. Pass the full base URL of the publication — whether it's author.substack.com or a custom domain like https://www.lennysnewsletter.com — and the Actor resolves the correct API endpoint.

Can I get paywalled post bodies?

No. Paywalled posts require a paid subscription; body_html returns null for those. You get all public metadata (title, date, reaction count, etc.) regardless of paywall status.

How far back does the archive go?

As far as the publication's API returns. We paginate until we hit maxPostsPerPublication or exhaust the archive, whichever comes first.

What if a publication has thousands of posts?

Set maxPostsPerPublication to 0 and the Actor will paginate to the full archive. Runs are priced per result, so longer archives cost proportionally more.

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.