Substack Scraper
Pricing
from $3.00 / 1,000 results
Scrape Substack publications via the public RSS feed of any newsletter. Extract post title, URL, author, publication date, body HTML, categories, and enclosures. HTTP-only with TLS impersonation (no auth, no proxy).
Developer
Crawler Bros
Scrape any Substack publication via its public RSS feed. Pulls post title, URL, author, publication date, body HTML, categories, and cover image. Multiple publications can be batched in one run. HTTP-only, using curl_cffi with Chrome TLS impersonation. No auth or proxy required.
What this actor does
- Accepts publication URLs in any form: full URL, custom domain, *.substack.com, or bare slug
- Auto-rewrites to <publication>/feed
- Parses RSS feed → extracts title / link / pubDate / dc:creator / content:encoded / categories / enclosure
- Filters: category, published-after, keyword in title/summary
- Optional body HTML inclusion (default on)
- Approximate wordCount and readingTimeMinutes
- Empty fields are omitted
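The URL normalization described above could look roughly like the following. This is a minimal sketch, not the actor's actual code: the function name `to_feed_url` and the rule "a host without a dot is a bare slug" are assumptions made for illustration.

```python
from urllib.parse import urlparse


def to_feed_url(publication: str) -> str:
    """Map a full URL, custom domain, *.substack.com host, or bare slug to a feed URL."""
    raw = publication.strip()
    if not raw:
        raise ValueError("empty publication")
    # urlparse only fills in netloc when a scheme is present, so add one if missing.
    if "://" not in raw:
        raw = "https://" + raw
    host = urlparse(raw).netloc.lower()
    # Assumption: a host without a dot is a bare slug on substack.com.
    if "." not in host:
        host = f"{host}.substack.com"
    # Any path on the input (for example a post URL) is dropped in favor of /feed.
    return f"https://{host}/feed"
```

With this sketch, `to_feed_url("noahpinion")` and `to_feed_url("https://noahpinion.substack.com/p/some-post")` both resolve to the same feed URL.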
Output per post
- title, url, guid
- author: from <dc:creator>
- publishedAt: ISO 8601 UTC (parsed from RFC 822 pubDate)
- publishedAtRaw: original RFC 822 string
- summary: plain-text version of <description> (capped at 500 chars)
- bodyHtml: full HTML body from <content:encoded> (when includeBody=true)
- wordCount, readingTimeMinutes
- categories[]
- coverImage: from <enclosure> URL
- publication, publicationUrl
- recordType: "post", scrapedAt
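To make the field mapping concrete, here is a rough sketch of how one RSS `<item>` could be turned into such a record using only the Python standard library. The helper names (`strip_tags`, `item_to_record`), the regex-based tag stripping, and the 200 words-per-minute reading speed are assumptions for illustration; the actor's real parsing may differ. The publication-level fields and `scrapedAt` are left out for brevity.

```python
import html
import math
import re
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
WORDS_PER_MINUTE = 200  # assumption: the actor's actual reading speed is not documented


def strip_tags(markup):
    """Crude HTML-to-text: drop tags, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", markup or "")
    return " ".join(html.unescape(text).split())


def item_to_record(item, include_body=True):
    """Build an output record from one RSS <item> element."""
    pub_raw = item.findtext("pubDate")
    body = item.findtext("content:encoded", namespaces=NS)
    enclosure = item.find("enclosure")
    record = {
        "title": item.findtext("title"),
        "url": item.findtext("link"),
        "guid": item.findtext("guid"),
        "author": item.findtext("dc:creator", namespaces=NS),
        "publishedAtRaw": pub_raw,
        "summary": strip_tags(item.findtext("description"))[:500],
        "categories": [c.text for c in item.findall("category") if c.text],
        "coverImage": enclosure.get("url") if enclosure is not None else None,
        "recordType": "post",
    }
    if pub_raw:
        # RFC 822 date -> timezone-aware datetime -> ISO 8601 in UTC
        parsed = parsedate_to_datetime(pub_raw)
        record["publishedAt"] = parsed.astimezone(timezone.utc).isoformat()
    if body:
        words = len(strip_tags(body).split())
        record["wordCount"] = words
        record["readingTimeMinutes"] = max(1, math.ceil(words / WORDS_PER_MINUTE))
        if include_body:
            record["bodyHtml"] = body
    # Empty fields are omitted from the output.
    return {k: v for k, v in record.items() if v not in (None, "", [])}
```

Items would typically be obtained with something like `ET.fromstring(feed_xml).iter("item")`.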
Input
| Field | Type | Default | Description |
|---|---|---|---|
| publications | array | ["platformer.news"] | List of publication URLs / domains / slugs (required) |
| categoryAnyOf | array | [] | Match at least one RSS <category> tag |
| publishedAfter | string | – | YYYY-MM-DD |
| containsKeyword | string | – | Title/summary contains substring |
| includeBody | bool | true | Include full body HTML |
| maxItems | int | 50 | Hard cap (1–1000) |
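As an illustration of how these filter inputs might combine, the sketch below applies them to already-parsed records. It is not the actor's implementation: the function names are hypothetical, and case-insensitive matching and treating the `publishedAfter` date itself as included are assumptions, since the table does not specify either.

```python
from datetime import date, datetime


def matches_filters(record, category_any_of=(), published_after=None, contains_keyword=None):
    """Return True if a record passes every filter that is set."""
    if category_any_of:
        wanted = {c.lower() for c in category_any_of}
        have = {c.lower() for c in record.get("categories", [])}
        if not wanted & have:
            return False
    if published_after:
        cutoff = date.fromisoformat(published_after)  # YYYY-MM-DD
        published = record.get("publishedAt")
        # Assumption: posts dated on the cutoff day are kept; undated posts are dropped.
        if not published or datetime.fromisoformat(published).date() < cutoff:
            return False
    if contains_keyword:
        needle = contains_keyword.lower()
        haystack = f"{record.get('title', '')} {record.get('summary', '')}".lower()
        if needle not in haystack:
            return False
    return True


def apply_filters(records, max_items=50, **filters):
    """Filter records, then enforce the maxItems hard cap."""
    kept = [r for r in records if matches_filters(r, **filters)]
    return kept[:max_items]
```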
Example: scrape Platformer + Noahpinion
{"publications": ["platformer.news", "noahpinion.substack.com"],"publishedAfter": "2024-01-01","maxItems": 100}
Example: filter by keyword
{"publications": ["platformer.news"],"containsKeyword": "antitrust","includeBody": true}
Example: bare slugs (auto-resolved to .substack.com)
{"publications": ["noahpinion", "thedailyupside"]}
Use cases
- Newsletter intel — track competitor publications, harvest content
- Market research — newsletters in your domain (analyst notes, sector reports)
- RSS aggregation — consolidate multiple Substacks into a single feed
- Content analysis — bulk-export newsletter posts for NLP / topic modeling
- Backup — archive your own / a friend's Substack posts
FAQ
Do I need a Substack account? No. The actor only reads public RSS feeds.
Why does it use TLS impersonation? Substack's edge sometimes returns 403 for requests that carry the default Python TLS fingerprint. curl_cffi with the chrome131 profile sends a handshake that matches Chrome's, which Substack accepts.
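For readers unfamiliar with curl_cffi, a request with a Chrome TLS profile might look like the sketch below. The wrapper name `fetch_feed` and its injectable `get` parameter are hypothetical and exist only so the function can be exercised without network access; `impersonate="chrome131"` requires a curl_cffi release that ships that profile.

```python
def fetch_feed(url, get=None, timeout=30):
    """Fetch a feed with a Chrome-like TLS fingerprint; returns (status_code, text)."""
    if get is None:
        # Third-party dependency: pip install curl_cffi
        from curl_cffi import requests as cffi_requests
        get = cffi_requests.get
    # impersonate selects the browser profile used for the TLS handshake.
    resp = get(url, impersonate="chrome131", timeout=timeout)
    return resp.status_code, resp.text
```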
What's the post URL format? https://<publication>/p/<slug>. The actor preserves whatever the RSS feed returns.
Are paid-only posts included? Substack's public RSS includes free posts and the public previews of paid posts. Full paid post content is not accessible without a subscription.
How fresh is the data? Near real-time. RSS feeds typically update within minutes of a post being published.
Can I scrape multiple publications in one run? Yes — pass multiple entries in publications. The actor walks each feed sequentially and dedupes by URL.
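The sequential walk with URL-based deduplication can be pictured as follows. This is only a sketch under assumptions: `collect_posts` and the `fetch_posts` callable (which returns parsed post records for one feed URL) are hypothetical names, and keeping the first occurrence of a duplicate is an assumed policy.

```python
def collect_posts(feed_urls, fetch_posts, max_items=50):
    """Walk feeds one at a time, skipping posts whose URL was already seen."""
    seen = set()
    collected = []
    for feed_url in feed_urls:
        for post in fetch_posts(feed_url):
            url = post.get("url")
            if url is not None:
                if url in seen:
                    continue  # duplicate: keep the first occurrence only
                seen.add(url)
            collected.append(post)
            if len(collected) >= max_items:
                return collected
    return collected
```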
What if a publication's RSS is blocked / rate-limited? The actor retries with exponential backoff on 403/429/5xx. After 3 retries it skips to the next publication and logs a warning.
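A retry loop of this kind could be sketched as below. The function name, the 1-second base delay, and the doubling schedule are assumptions (the actor's actual delays are not documented here); `fetch` is a hypothetical callable returning a `(status, body)` pair, and `sleep` is injectable so the loop can be tested without waiting.

```python
import logging
import time


def fetch_with_retries(url, fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Return the response body, or None if the feed should be skipped."""
    status = None
    for attempt in range(max_retries + 1):  # one initial try plus max_retries retries
        status, body = fetch(url)
        if status == 200:
            return body
        retryable = status in (403, 429) or 500 <= status < 600
        if not retryable or attempt == max_retries:
            break
        # Exponential backoff: base_delay, then 2x, then 4x, ...
        sleep(base_delay * 2 ** attempt)
    logging.warning("Skipping %s after status %s", url, status)
    return None
```

A caller would move on to the next publication whenever this returns `None`.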
Custom-domain Substacks? Yes — pass the custom domain (e.g. platformer.news, stratechery.com). The actor appends /feed regardless of subdomain shape.