Scrape posts, engagement metrics, and author data from any Substack publication. Get title, author, publish date, likes, comments, paywall status, and full body in Markdown or HTML. Paginates the full archive automatically.
Root-cause fix for post data extraction: Substack now assigns window._preloads via JSON.parse("..."), where the full page payload (including post data, author, and body_html) is passed as a JSON-encoded string argument. The previous extraction logic found the first { inside that string, parsed only the config metadata (isEU, base_url, etc.), and returned early, leaving the post content unreachable. The new Attempt 1 uses a regex that captures the full string argument of JSON.parse("..."), decodes it via JSON.parse('"' + content + '"'), and then parses the resulting JSON. This recovers post.title, post.body_html, post.publishedBylines, post.comment_count, post.reactions, and all other post fields.
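The decode step above can be sketched roughly as follows. This is a minimal illustration of the Attempt 1 approach, not the Actor's actual code; the function name extractPreloads and the exact regex are assumptions.

```javascript
// Hypothetical sketch of Attempt 1: capture the string literal passed to
// JSON.parse("..."), decode the JS string escapes, then parse the JSON.
function extractPreloads(html) {
  // The payload contains escaped quotes (\"), so the capture group must
  // accept backslash-escaped pairs as well as plain non-quote characters.
  const match = html.match(/window\._preloads\s*=\s*JSON\.parse\("((?:\\.|[^"\\])*)"\)/);
  if (!match) return null;
  // First decode the string literal into raw JSON text, then parse it.
  const jsonText = JSON.parse('"' + match[1] + '"');
  return JSON.parse(jsonText);
}

// Example: a minimal page snippet shaped like Substack's newer format.
const html = '<script>window._preloads = JSON.parse("{\\"isEU\\":false,\\"post\\":{\\"title\\":\\"Hello\\"}}")</script>';
const payload = extractPreloads(html);
console.log(payload.post.title); // "Hello"
```

Note that JSON.parse('"…"') only decodes JSON-valid escapes, which is why a separate JS string-escape fallback (Attempt 2 in 1.1.4 below) is still useful for literals containing escapes JSON does not allow.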
[1.1.4] - 2026-03-09
Fixed
extractNextData now handles Substack's newer HTML format where window._preloads is a JSON-encoded string assignment (= "{\"...\"}") rather than a plain object. Added JS string-escape unescaping as Attempt 2.
Added __NEXT_DATA__ script-tag extraction as a fallback (Attempt 3) for resilience against future Substack HTML changes.
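The Attempt 3 fallback can be sketched as below. The function name and selector regex are illustrative assumptions; the Actor's actual extraction may differ.

```javascript
// Fallback sketch: read the JSON payload from the __NEXT_DATA__ script tag
// when the window._preloads assignment cannot be located or decoded.
function extractNextDataFallback(html) {
  const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  return match ? JSON.parse(match[1]) : null;
}

const page = '<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"post":{"title":"Hi"}}}}</script>';
const nextData = extractNextDataFallback(page);
console.log(nextData.props.pageProps.post.title); // "Hi"
```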
[1.1.3] - 2026-03-09
Fixed
Archive API response normalisation: Substack returns a flat array ([{...}]), not { posts: [...] }. Updated ArchiveApiResponse type to a union and added normalisation logic (Array.isArray check) so both shapes are handled.
Pagination now falls back to page-fullness heuristic (≥12 items = more pages) when total_count is absent.
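The two fixes above can be sketched together as follows. The function names and the PAGE_SIZE constant are assumptions for illustration; the heuristic threshold of 12 items is taken from the entry above.

```javascript
// Sketch of the normalisation and pagination-heuristic logic described above.
const PAGE_SIZE = 12; // assumed archive page size, per the >=12 heuristic

function normalizeArchiveResponse(data) {
  // Substack may return a flat array of posts or { posts: [...] }.
  if (Array.isArray(data)) return { posts: data, totalCount: undefined };
  return { posts: data.posts ?? [], totalCount: data.total_count };
}

function hasMorePages({ posts, totalCount }, offset) {
  // Prefer the exact count; otherwise assume a full page means more follow.
  if (typeof totalCount === 'number') return offset + posts.length < totalCount;
  return posts.length >= PAGE_SIZE;
}

console.log(normalizeArchiveResponse([{ id: 1 }]).posts.length); // 1
console.log(hasMorePages({ posts: new Array(12).fill({}) }, 0)); // true
```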
[1.1.2] - 2026-03-09
Fixed
ARCHIVE_API handler: replaced $('body').text() with context.body (the raw response string). CheerioCrawler does not initialise Cheerio for application/json responses, which caused "$ is not a function" errors and prevented any posts from being scraped.
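A minimal sketch of the corrected logic, with an illustrative helper name (parseArchiveBody is not the Actor's actual identifier):

```javascript
// For JSON endpoints CheerioCrawler leaves $ uninitialised, so the handler
// must parse the raw response body rather than reading it through Cheerio.
function parseArchiveBody(body) {
  const text = typeof body === 'string' ? body : body.toString('utf-8');
  return JSON.parse(text);
}

// e.g. inside the ARCHIVE_API handler: const data = parseArchiveBody(context.body);
const posts = parseArchiveBody('[{"id":1,"title":"First"}]');
console.log(posts[0].title); // "First"
```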
[1.1.1] - 2026-03-09
Fixed
Added Substack detection guard in the DEFAULT handler — non-Substack URLs now log a clear error and exit cleanly instead of silently failing with a JSON parse error on the archive API endpoint.
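The guard might look roughly like this; the heuristics the Actor actually uses may differ, and looksLikeSubstack is a hypothetical name. The assumption here is that Substack publications either live on *.substack.com or embed a Substack-specific marker such as window._preloads in the HTML.

```javascript
// Illustrative detection guard: accept *.substack.com hosts, or custom
// domains whose HTML carries the window._preloads marker.
function looksLikeSubstack(url, html) {
  const { hostname } = new URL(url);
  if (hostname === 'substack.com' || hostname.endsWith('.substack.com')) return true;
  return html.includes('window._preloads');
}

console.log(looksLikeSubstack('https://example.substack.com/archive', '')); // true
console.log(looksLikeSubstack('https://example.com/', '<html></html>')); // false
```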
[1.1.0] - 2025-07-07
Added
webhookUrl optional input field — pass any HTTPS URL to receive a POST notification when the Actor run succeeds, fails, times out, or is aborted. Enables zero-config integration with Zapier, Make, n8n, and custom pipelines.
Webhook is registered via Actor.addWebhook() with an idempotency key (APIFY_ACTOR_RUN_ID) to prevent duplicate registrations on Actor restart.
New Integrations section in README covering webhook, Apify Integrations tab, and REST API workflows.
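The webhook registration above can be sketched as follows. buildWebhookOptions is a hypothetical helper; the event-type strings and option names are my assumption of how the Apify SDK's Actor.addWebhook is called here, not the Actor's verbatim code.

```javascript
// Sketch: build options for Actor.addWebhook covering the four terminal
// run states, keyed on the run ID so restarts don't register duplicates.
function buildWebhookOptions(webhookUrl, runId) {
  return {
    eventTypes: [
      'ACTOR.RUN.SUCCEEDED',
      'ACTOR.RUN.FAILED',
      'ACTOR.RUN.TIMED_OUT',
      'ACTOR.RUN.ABORTED',
    ],
    requestUrl: webhookUrl,
    // Re-registering with the same idempotency key is a no-op.
    idempotencyKey: runId,
  };
}

// Inside the Actor's main(), roughly:
// await Actor.addWebhook(buildWebhookOptions(input.webhookUrl, process.env.APIFY_ACTOR_RUN_ID));
```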
Fixed
Critical bug: ARCHIVE_API handler was parsing request.payload (always undefined for GET requests) instead of the actual API response body. Publication archive scraping now correctly uses $('body').text() to read the JSON response.
Free-tier local run cap: when running outside the Apify platform, runs are capped at 50 posts with a clear status message, preventing accidental large local jobs.
output_schema.json: Populated with full definitions for all 9 output fields (was empty {}).
README: Full rewrite to match Apify Store format — question-format H2 headers, no technical internals, no broken badges or placeholders, corrected pricing model (Pay Per Result).
package.json: Fixed "author" placeholder value.
[1.0.0] - 2026-03-09
Initial release of Substack Posts & Creator Scraper.
Implemented fast JSON extraction via Substack's embedded __NEXT_DATA__ payload.
Extracted post bodies, paywall statuses, and engagement metrics (likes/comments).