Substack Scraper | $2 / 1k | All-In-One

Pricing

from $1.99 / 1,000 results

Get full articles, user profiles, and search results with the All-in-One Substack Scraper. Extract rich data including titles, bios, subscriber counts, social links, and engagement metrics. Ideal for market research, creator discovery, trend tracking, and audience analysis.

Rating: 0.0 (0)

Developer: Fatih Tahta (Maintained by Community)

Actor stats

  • Bookmarked: 4
  • Total users: 92
  • Monthly active users: 18
  • Issues response: 0.23 hours
  • Last modified: 3 days ago

Substack Scraper | All-In-One

Slug: fatihtahta/substack-scraper

Overview

Substack Scraper | All-In-One collects structured public data from https://substack.com, including search results, publications, people, posts, notes, comments, and reply-thread content where supported by the selected input. Records can include core identifiers, URLs, titles, publication metadata, author information, timestamps, engagement counts, and richer content fields when detail enrichment is enabled. Because Substack is a widely used publishing and newsletter platform, its public content is useful for market intelligence, editorial analysis, research, enrichment, and monitoring workflows. The actor is designed for repeatable, automated collection with consistent JSON output that downstream systems can consume directly, and it suits recurring data acquisition where you need a dependable, operationally clear workflow for collecting current public records over time.

Why Use This Actor

  • Market research and analytics teams: collect structured Substack posts, publications, people, and recent discussions for market intelligence, topic tracking, coverage analysis, and operational reporting.
  • Product and content teams: monitor subject trends, publication activity, post velocity, and audience-facing content patterns to support editorial planning and content benchmarking.
  • Developers and data engineering teams: feed normalized JSON records into ETL jobs, search indexes, data warehouses, enrichment pipelines, and other downstream systems with minimal reshaping.
  • Lead generation and enrichment workflows: gather public creator, publication, and content attributes that can supplement prospecting, segmentation, and account research processes.
  • Monitoring and competitive tracking teams: run recurring collections to watch for newly published content, public conversation activity, and movement across tracked keywords or known pages.

Common Use Cases

  • Market intelligence: track topic coverage, newly published posts, publication activity, and audience-visible conversation around specific themes.
  • Lead generation: build curated lists of relevant writers, publications, or topic-specific public profiles for outreach research or enrichment.
  • Competitive monitoring: watch how tracked publications or authors publish, position, and discuss topics over time.
  • Catalog and directory building: populate internal databases with structured public publication, author, and post records.
  • Data enrichment: append fresh public content and publication attributes to CRM, BI, or analytics datasets.
  • Recurring reporting: schedule periodic runs to produce current datasets for dashboards, alerts, and trend reviews.
  • Conversation tracking: collect notes, comments, and optional reply threads when the goal is to analyze public engagement rather than long-form posts alone.

Quick Start

  1. Choose your input strategy: add known startUrls, enter one or more queries, or combine both when you need discovery and direct collection in the same run.
  2. For a first validation run, set a small limit so you can quickly confirm the output shape and record types.
  3. Select the appropriate result_type for keyword searches, and add optional filters such as publication_date, within_days, or language if you need a narrower scope.
  4. Enable enrich_data when you want fuller records, and turn on get_replies only when reply-thread collection is relevant to explicit note or comment-thread targets.
  5. Run the actor in Apify Console and inspect the first dataset records.
  6. Increase coverage, refine filters, or add a schedule after the dataset matches your intended workflow.
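The Quick Start above can also be driven programmatically. The sketch below is illustrative rather than an official snippet: it assumes the `apify-client` Python package and an `APIFY_TOKEN` environment variable, and it only triggers a run when a token is actually present.

```python
import json
import os

# Small validation input mirroring Quick Start steps 1-3: a low limit to
# confirm the output shape before scaling up.
run_input = {
    "queries": ["artificial intelligence"],
    "result_type": "posts",
    "publication_date": "last_week",
    "enrich_data": True,
    "limit": 5,
}

token = os.environ.get("APIFY_TOKEN")
if token:
    # Requires `pip install apify-client`; actor slug taken from this page.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("fatihtahta/substack-scraper").call(run_input=run_input)
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    print(f"collected {len(items)} records")
else:
    print(json.dumps(run_input, indent=2))
```

Inspect the first records before raising limit or adding a schedule.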

Input Parameters

Use startUrls for direct collection from known Substack pages, queries for keyword-driven discovery, or combine both in a single run.

  • startUrls (array of strings): Exact Substack pages to collect from directly. Publication homepages are collected through the archive feed (sort=new, paged by offset/limit), while note/comment pages are treated as thread sources. Useful for known search pages, publication profiles, individual posts, notes, comment pages, or custom-domain pages.
  • queries (array of strings): One or more keyword searches. Each query becomes its own search target for discovery-oriented runs.
  • result_type (string): Result set to return for keyword searches only. Allowed values: top, recent, posts, publications, people. Default: top.
  • publication_date (string): Optional recency filter for keyword-based post results. Allowed values: last_day (24 hours), last_week (7 days), last_month, last_year.
  • within_days (integer): Keep only posts published within the last N days. Minimum value: 1.
  • language (string): Restrict keyword-based post results to a specific language. Allowed values: ar, zh, cs, nl, en, fr, de, el, hi, hu, id, it, ja, ko, la, no, pl, pt, ro, ru, es, sv, th, tr, vi.
  • enrich_data (boolean): Fetch fuller records for supported posts, publications, or people instead of lighter search-result entries. Useful when detail is more important than run speed. Default: true.
  • get_replies (boolean): Collect reply threads for supported note or comment detail pages. Does not expand publication homepage discovery into comment-thread output. Default: false.
  • max_replies (integer): Maximum replies to save per post or thread. Use 0 to suppress replies when reply collection is enabled.
  • limit (integer): Per-input cap on the number of results saved for each startUrls entry or search query. Minimum value: 1. Leave empty for broader collection.
  • proxyConfiguration (object): Connection settings for Apify Proxy or a custom proxy when you need a different routing path. Default: {"useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"]}.

Choosing Inputs

Use startUrls when you already know the exact Substack pages you want to collect, such as a specific publication, post, search page, note, or curated list of known targets. Publication homepage startUrls collect posts from that publication through the archive API rather than a mixed homepage snapshot. Use queries when you want discovery-oriented collection based on topics, names, brands, or keywords.

Narrower filters such as result_type, publication_date, within_days, and language produce more targeted datasets, while broader settings improve discovery and surface more varied records. If you are validating a new workflow, start with a small limit, inspect the first records, and then increase coverage after confirming that the returned types and fields match your downstream needs.
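Before launching broader runs, it can help to sanity-check an input against the allowed values listed in the parameter table. The helper below is a stand-alone illustration (the actor performs its own validation server-side); the allowed-value sets are copied from the table above.

```python
ALLOWED_RESULT_TYPES = {"top", "recent", "posts", "publications", "people"}
ALLOWED_PUBLICATION_DATES = {"last_day", "last_week", "last_month", "last_year"}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks consistent."""
    problems = []
    if not run_input.get("startUrls") and not run_input.get("queries"):
        problems.append("provide startUrls, queries, or both")
    rt = run_input.get("result_type")
    if rt is not None and rt not in ALLOWED_RESULT_TYPES:
        problems.append(f"result_type must be one of {sorted(ALLOWED_RESULT_TYPES)}")
    pd = run_input.get("publication_date")
    if pd is not None and pd not in ALLOWED_PUBLICATION_DATES:
        problems.append("publication_date must be last_day, last_week, last_month, or last_year")
    wd = run_input.get("within_days")
    if wd is not None and (not isinstance(wd, int) or wd < 1):
        problems.append("within_days must be an integer >= 1")
    if run_input.get("limit") is not None and run_input["limit"] < 1:
        problems.append("limit must be >= 1")
    return problems


# A valid discovery input produces no problems.
print(validate_input({"queries": ["ai"], "result_type": "posts", "within_days": 7}))  # → []
```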

Example Inputs

Scenario: keyword discovery for recent posts

{
  "queries": ["artificial intelligence", "developer tools"],
  "result_type": "posts",
  "publication_date": "last_week",
  "language": "en",
  "enrich_data": true,
  "limit": 20
}

Scenario: direct URL collection from known pages

{
  "startUrls": [
    "https://substack.com/search/ai?searching=top",
    "https://www.platformer.news/"
  ],
  "enrich_data": true,
  "get_replies": false,
  "within_days": 30,
  "limit": 15
}

Scenario: targeted monitoring with reply collection

{
  "queries": ["movie"],
  "result_type": "recent",
  "within_days": 7,
  "enrich_data": true,
  "get_replies": true,
  "max_replies": 25,
  "limit": 10
}

Output

Output destination

The actor writes results to an Apify dataset as JSON records. The dataset is designed for direct consumption by analytics tools, ETL pipelines, and downstream APIs without post-processing.

Each item contains a stable record envelope plus a type-specific payload when multiple entity types are returned in the same run.

Record envelope (all items)

  • type (string, required): record family such as post or comment.
  • id (number or string, required): primary record identifier. Treat this as an opaque identifier in downstream systems because some record families expose prefixed or string-serialized forms in the dataset.
  • url (string, required): canonical public URL for the record.

Recommended idempotency key: type + ":" + id

Use that composite key for deduplication and upserts when loading repeated runs into warehouses, search indexes, CRMs, or operational databases. The stable envelope makes records easier to merge, deduplicate, and sync across recurring collections.
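The composite key can be applied in a few lines when merging repeated runs. A minimal sketch, using the envelope fields `type` and `id` described above (the sample records are abbreviated from the examples in this document):

```python
def record_key(item: dict) -> str:
    # Envelope fields `type` and `id`; id is treated as opaque, so cast to str.
    return f"{item['type']}:{item['id']}"


def dedupe(items: list[dict]) -> list[dict]:
    """Keep the first occurrence of each type:id key, preserving order."""
    seen, out = set(), []
    for item in items:
        key = record_key(item)
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out


runs = [
    {"type": "post", "id": "193819329", "url": "https://unstackit.substack.com/p/schedule-substack-notes"},
    {"type": "comment", "id": "c-247726222", "url": "https://substack.com/@backyardmoviecritic/note/c-247726222"},
    {"type": "post", "id": "193819329", "url": "https://unstackit.substack.com/p/schedule-substack-notes"},
]
print(len(dedupe(runs)))  # → 2
```

The same key works as the conflict target for warehouse upserts (for example, ON CONFLICT in PostgreSQL or a document ID in a search index).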

Examples

Notes:

  • The examples below match the current enriched dataset structure returned when enrich_data is enabled.
  • Long body_html strings are abbreviated inside the value for readability, but the example records show the full field shape now returned by the actor.

Example: post (type = "post")

{
  "id": "193819329",
  "type": "post",
  "title": "Substack Finally Introduced Notes Scheduling. This Is How To Do It.",
  "description": "Plus, at least FIVE ways you can use it to help your Substack grow and experiment intentionally.",
  "published_at": "2026-04-11T12:02:55.328000+00:00",
  "updated_at": "2026-04-11T12:10:13.090000+00:00",
  "url": "https://unstackit.substack.com/p/schedule-substack-notes",
  "author_name": "Kristi Keller \ud83c\udde8\ud83c\udde6",
  "author_handle": "kristikeller",
  "publication_name": "Unstack Substack",
  "publication_url": "https://unstackit.substack.com",
  "main_post_id": "193819329",
  "main_post_title": "Substack Finally Introduced Notes Scheduling. This Is How To Do It.",
  "main_post_url": "https://unstackit.substack.com/p/schedule-substack-notes",
  "main_post_published_at": "2026-04-11T12:02:55.328000+00:00",
  "subtitle": "Plus, at least FIVE ways you can use it to experiment and help your Substack grow intentionally.",
  "truncated_body_text": "This question has come up multiple times from several clients since Substack introduced the Notes feature:",
  "image_url": "https://substack-post-media.s3.amazonaws.com/public/images/626d8a7f-9db9-4f6e-933a-6663a9f9fbe8_2240x1260.png",
  "body_text": "This question has come up multiple times from several clients since Substack introduced the Notes feature:",
  "body_html": "<p>This question has come up multiple times from several clients since Substack introduced the Notes feature:</p><blockquote><p><em><strong>\u201cCan we schedule our notes to publish at a later date?\u201d</strong></em></p></blockquote><p>...</p>",
  "word_count": 549,
  "reaction_count": 60,
  "comment_count": 19,
  "child_comment_count": 11,
  "restack_count": 8,
  "post_tags": [
    {
      "id": "871d1cd1-6fa0-4a94-bf79-d4657addba2d",
      "publication_id": 2328962,
      "name": "Substack Notes",
      "slug": "substack-notes",
      "hidden": false
    },
    {
      "id": "ed4d7b60-078b-404f-a2e9-21f49cf290f6",
      "publication_id": 2328962,
      "name": "Substack How-To",
      "slug": "substack-how-to",
      "hidden": false
    }
  ],
  "published_bylines": [
    {
      "id": 165322060,
      "name": "Kristi Keller \ud83c\udde8\ud83c\udde6",
      "handle": "kristikeller",
      "photo_url": "https://substackcdn.com/image/fetch/$s_!TdHE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bee1713-6b2a-45fa-bbb6-e9437756af86_316x326.png",
      "bio": "Certified independent liberation advisor & builder of highly engaged Substack communities.",
      "is_guest": false
    }
  ],
  "audio_items": [
    {
      "post_id": 193819329,
      "voice_id": "en-US-AlloyTurboMultilingualNeural",
      "audio_url": "https://substack-video.s3.amazonaws.com/video_upload/post/193819329/tts/12334368-98cb-45d6-a507-805e3a789351/en-US-AlloyTurboMultilingualNeural.mp3",
      "type": "tts",
      "status": "completed"
    }
  ],
  "comments": [
    {
      "id": 241695637,
      "type": "comment",
      "body_text": "I scheduled a note everyday this week about tulips \ud83c\udf37 and it was quick and easy ...",
      "published_at": "2026-04-11T12:34:50.936000+00:00",
      "author_name": "Cerina Triglavcanin",
      "author_handle": "cerinatrig",
      "photo_url": "https://substack-post-media.s3.amazonaws.com/public/images/c3d05e18-ea68-4208-981e-47638e58953b_405x405.png",
      "reaction_count": 3,
      "children_count": 2,
      "childrenSummary": "2 replies by Kristi Keller \ud83c\udde8\ud83c\udde6 and others"
    }
  ],
  "post": {
    "audience": "everyone",
    "canonical_url": "https://unstackit.substack.com/p/schedule-substack-notes",
    "id": 193819329,
    "post_date": "2026-04-11T12:02:55.328Z",
    "updated_at": "2026-04-11T12:10:13.090Z",
    "publication_id": 2328962,
    "search_engine_description": "FIVE ways you can use Notes scheduling to help your Substack grow and experiment intentionally.",
    "slug": "schedule-substack-notes",
    "subtitle": "Plus, at least FIVE ways you can use it to experiment and help your Substack grow intentionally.",
    "title": "Substack Finally Introduced Notes Scheduling. This Is How To Do It.",
    "type": "newsletter",
    "write_comment_permissions": "everyone",
    "is_published": true,
    "restacks": 8,
    "reactions": {
      "\u2764": 60
    }
  },
  "publication": {
    "id": 2328962,
    "name": "Unstack Substack",
    "subdomain": "unstackit",
    "logo_url": "https://substack-post-media.s3.amazonaws.com/public/images/9ec83ce1-c3b2-4b6f-87d9-41902e60a69f_500x500.png",
    "homepage_type": "magaziney",
    "community_enabled": true,
    "payments_state": "enabled"
  },
  "publication_settings": {
    "enable_new_publisher": true,
    "podcast_enabled": false
  },
  "page_rank": 0
}

Example: comment (type = "comment")

{
  "id": "c-247726222",
  "type": "comment",
  "title": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
  "description": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
  "published_at": "2026-04-23T00:49:31.944000+00:00",
  "url": "https://substack.com/@backyardmoviecritic/note/c-247726222",
  "author_name": "Backyard Movie Critic",
  "author_handle": "backyardmoviecritic",
  "publication_name": "Backyard Movie Critic",
  "publication_url": "https://backyardmoviecritic.substack.com",
  "image_url": "https://substack-post-media.s3.amazonaws.com/public/images/1edaf33e-19c0-4c26-ab04-0df6f0e34e3d_452x452.png",
  "body_text": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
  "body_json": {
    "type": "doc",
    "attrs": {
      "schemaVersion": "v1"
    },
    "content": [
      {
        "type": "paragraph",
        "content": [
          {
            "type": "text",
            "text": "Movies that Made Me Love Movies:"
          }
        ]
      }
    ]
  },
  "attachments": [
    {
      "id": "3c033bf5-9636-4857-b762-dc298a726531",
      "type": "image",
      "imageUrl": "https://substack-post-media.s3.amazonaws.com/public/images/8057c6bf-2643-42a2-a5bd-800cc068e831_700x1050.png",
      "imageWidth": 700,
      "imageHeight": 1050,
      "explicit": false
    }
  ],
  "reaction_count": 40,
  "comment_count": 6,
  "restack_count": 2,
  "context": {
    "timestamp": "2026-04-23T00:49:31.944Z",
    "type": "note",
    "source": "db-note"
  },
  "comment": {
    "id": 247726222,
    "body": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
    "body_json": {
      "type": "doc",
      "attrs": {
        "schemaVersion": "v1"
      },
      "content": [
        {
          "type": "paragraph",
          "content": [
            {
              "type": "text",
              "text": "Movies that Made Me Love Movies:"
            }
          ]
        }
      ]
    },
    "type": "feed",
    "date": "2026-04-23T00:49:31.944Z",
    "name": "Backyard Movie Critic",
    "photo_url": "https://substack-post-media.s3.amazonaws.com/public/images/b427d26c-e0cb-4908-b8b6-2033359c2565_399x399.jpeg",
    "bio": "Gen X film junkie: silent horror to streaming chaos. Grew up on VHS gore and HBO late nights.",
    "handle": "backyardmoviecritic",
    "reaction_count": 40,
    "reactions": {
      "\u2764": 40
    },
    "restacks": 2,
    "children_count": 6,
    "attachments": [
      {
        "id": "3c033bf5-9636-4857-b762-dc298a726531",
        "type": "image",
        "imageUrl": "https://substack-post-media.s3.amazonaws.com/public/images/8057c6bf-2643-42a2-a5bd-800cc068e831_700x1050.png",
        "imageWidth": 700,
        "imageHeight": 1050,
        "explicit": false
      }
    ],
    "user_primary_publication": {
      "id": 7618357,
      "subdomain": "backyardmoviecritic",
      "name": "Backyard Movie Critic",
      "logo_url": "https://substack-post-media.s3.amazonaws.com/public/images/1edaf33e-19c0-4c26-ab04-0df6f0e34e3d_452x452.png",
      "payments_state": "disabled"
    },
    "language": "en"
  },
  "publication": {
    "id": 7618357,
    "subdomain": "backyardmoviecritic",
    "name": "Backyard Movie Critic",
    "logo_url": "https://substack-post-media.s3.amazonaws.com/public/images/1edaf33e-19c0-4c26-ab04-0df6f0e34e3d_452x452.png",
    "payments_state": "disabled"
  },
  "page_rank": 1
}

Example: reply (type = "reply")

{
  "id": "c-205156724",
  "type": "reply",
  "title": "This is such a thoughtful comment, thank you.",
  "published_at": "2026-01-26T01:26:46.936000+00:00",
  "url": "https://substack.com/@vanessathe/note/c-205156724",
  "author_name": "Vanessa",
  "author_handle": "vanessathe",
  "main_post_id": "173236288",
  "main_post_title": "eight films that made me fall madly in love with films",
  "main_post_url": "https://vanessateh.substack.com/p/eight-films-that-made-me-fall-madly",
  "main_post_published_at": "2026-01-16T05:06:39.097000+00:00",
  "body_text": "This is such a thoughtful comment, thank you.",
  "context": {
    "timestamp": "2026-01-26T01:26:46.936Z",
    "type": "reply"
  },
  "comment": {
    "id": 205156724,
    "type": "reply",
    "date": "2026-01-26T01:26:46.936Z",
    "body": "This is such a thoughtful comment, thank you.",
    "name": "Vanessa",
    "handle": "vanessathe",
    "children_count": 0
  },
  "comment_count": 0,
  "page_rank": 1
}

Field Reference

Record type: post

  • id (string, required): unique post identifier.
  • type (string, required): record type, always post.
  • url (string, required): public post URL.
  • title (string, required): post title.
  • description (string, optional): summary or teaser text.
  • subtitle (string, optional): secondary heading or subtitle.
  • published_at (string, required): publication timestamp in ISO 8601 format.
  • author_name / author_handle (string, optional): primary author display name and handle.
  • publication_name / publication_url (string, optional): source publication name and homepage URL.
  • main_post_id / main_post_title / main_post_url / main_post_published_at (string, optional): parent post references when applicable.
  • image_url (string, optional): primary image or media preview URL.
  • truncated_body_text (string, optional): shortened plain-text excerpt from the post detail payload.
  • body_text (string, optional): plain-text body content.
  • body_html / body_json (string or object, optional): HTML and structured rich-text body content.
  • word_count (integer, optional): word count for the post body.
  • child_comment_count (integer, optional): nested reply count returned by post detail endpoints.
  • reaction_count / comment_count / restack_count (integer, optional): visible engagement counters.
  • post_tags / published_bylines / audio_items (array, optional): structured post tags, bylines, and audio metadata when returned by enriched post detail.
  • comments (array, optional): trimmed top-level comment preview objects included on enriched post records.
  • post / publication / publication_settings (object, optional): cleaned public detail payloads preserved from the enriched response.
  • page_rank (integer, optional): rank position within the collected result page.
  • attributes (object, optional): generic normalized attributes bucket used for non-Substack-like fields such as price, brand, SKU, currency, or location when available.

Record type: comment

  • id (string, required): unique comment or note identifier.
  • type (string, required): record type, always comment.
  • url (string, required): public URL of the note or comment.
  • title (string, optional): short title derived from the content.
  • description (string, optional): summary text or repeat of the body.
  • published_at (string, required): publication timestamp in ISO 8601 format.
  • author_name / author_handle (string, optional): primary author display name and handle.
  • publication_name / publication_url (string, optional): related publication name and URL.
  • main_post_id / main_post_title / main_post_url / main_post_published_at (string, optional): parent post references when the comment belongs to a post thread.
  • subtitle (string, optional): subtitle when the record is comment-like but backed by a richer content item.
  • image_url (string, optional): preview image URL when present.
  • body_text (string, optional): plain-text comment or note body.
  • body_html / body_json (string or object, optional): HTML or structured rich-text body content when the source exposes it.
  • attachments (array, optional): media attachments associated with the comment or note.
  • word_count (integer, optional): word count when available.
  • reaction_count / comment_count / restack_count (integer, optional): visible engagement counters.
  • context (object, optional): cleaned note or thread context metadata such as source type and timestamp.
  • comment (object, optional): cleaned enriched comment payload, including public author-facing fields, attachments, and visible counters.
  • parent_comments (array, optional): trimmed parent-thread metadata when the detail endpoint includes thread ancestry.
  • can_reply / is_muted (boolean, optional): public interaction state returned by comment detail endpoints.
  • publication / author / post (object, optional): cleaned public nested entities when exposed by the source detail endpoint.
  • search_score (number, optional): ranking score returned with search results.
  • page_rank (integer, optional): rank position within the collected result page.
  • attributes (object, optional): generic normalized attributes bucket used for non-Substack-like fields when available.

Record type: reply

  • id (string, required): unique reply identifier.
  • type (string, required): record type, always reply.
  • url (string, required): public URL of the reply.
  • author_name / author_handle (string, optional): primary author display name and handle.
  • main_post_id / main_post_title / main_post_url / main_post_published_at (string, optional): parent post references for the thread.
  • body_text / body_html / body_json (string or object, optional): reply content in plain text, HTML, or structured rich-text form.
  • attachments (array, optional): media attachments associated with the reply.
  • context (object, optional): cleaned thread context metadata.
  • comment (object, optional): cleaned enriched reply payload.
  • publication / author / post (object, optional): cleaned nested entities when available from the detail endpoint.
  • reaction_count / comment_count / restack_count (integer, optional): visible engagement counters when present.
  • page_rank (integer, optional): rank position within the collected result page.

Data Quality, Guarantees, And Handling

  • Structured records: results are normalized into predictable JSON objects for downstream use.
  • Best-effort extraction: fields may vary by region, session, public availability, or source-side interface changes.
  • Optional fields: may be null or absent, so null-check them in downstream code.
  • Deduplication: use the composite key type + ":" + id across runs.
  • Freshness: results reflect the publicly available data at run time.
  • Repeated runs: use the recommended idempotency key when syncing data into warehouses, CRMs, or search indexes.
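Because optional fields may be missing or null, downstream code should read records defensively. A small illustrative normalizer (the function name and output shape are this document's invention, not part of the actor):

```python
def normalize_engagement(record: dict) -> dict:
    """Extract engagement counters with safe defaults for absent or null fields."""
    def count(field: str) -> int:
        value = record.get(field)
        # Missing or null counters are treated as 0 rather than raising.
        return value if isinstance(value, int) else 0

    return {
        "key": f"{record['type']}:{record['id']}",  # recommended idempotency key
        "reactions": count("reaction_count"),
        "comments": count("comment_count"),
        "restacks": count("restack_count"),
    }


# A sparse record (like the reply example above) still normalizes cleanly.
sparse = {"type": "reply", "id": "c-205156724", "comment_count": 0}
print(normalize_engagement(sparse))
```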

Tips For Best Results

  • Start with a small limit to validate the output shape before scaling up.
  • Use startUrls for known high-value targets and queries for broader discovery.
  • Choose result_type carefully; it applies only to query-based searches, not to direct startUrls.
  • Apply publication_date, within_days, or language only when you need tighter targeting.
  • Turn on enrich_data when downstream systems need fuller records rather than lightweight search entries.
  • Enable get_replies only for runs where note or comment thread depth matters.
  • Use type + ":" + id as your stable deduplication key across repeated runs.

How to Run on Apify

  1. Open the actor in Apify Console.
  2. Configure the available input fields for the target scope.
  3. Set the maximum number of outputs to collect with limit if you want a capped run.
  4. Click Start and wait for the run to finish.
  5. Review the dataset and download results in JSON, CSV, Excel, or other supported formats.

Scheduling & Automation

Automated Data Collection

You can schedule recurring runs to keep your dataset current without manual execution. This is useful for monitoring keywords, publications, and recent public conversation over time.

  • Navigate to Schedules in Apify Console
  • Create a new schedule (daily, weekly, or custom cron)
  • Configure input parameters
  • Enable notifications for run completion
  • Add webhooks for automated processing

Integration Options

  • Data warehouses: load normalized post and comment records into analytical stores for trend analysis, historical reporting, and topic-level monitoring.
  • BI dashboards: track publication activity, content volume, engagement signals, and keyword coverage over time.
  • Webhooks: trigger ingestion, validation, alerting, or transformation workflows after each completed run.
  • API-driven applications: consume dataset records directly in internal tools, search services, or customer-facing applications.
  • Google Sheets or Excel-based review: export smaller runs for editorial review, QA, or lightweight operational workflows.
  • Enrichment pipelines: append fresh public publication, author, and content attributes to existing CRM or analytics records.

Export Formats And Downstream Use

Apify datasets can be exported or consumed in formats that fit both manual review and automated data delivery workflows.

  • JSON: for APIs, applications, and data pipelines
  • CSV or Excel: for spreadsheet workflows and manual review
  • API access: for automated ingestion into internal systems
  • BI and warehouses: for reporting, dashboards, and historical analysis

Performance

Estimated run times:

  • Small runs (< 1,000 outputs): ~3-5 minutes
  • Medium runs (1,000-5,000 outputs): ~5-15 minutes
  • Large runs (5,000+ outputs): ~15-30 minutes

Execution time varies based on filters, result volume, and how much information is returned per record. Highly filtered runs can finish faster, while broad discovery or detail-rich records may take longer.

Limitations

  • Availability depends on what https://substack.com publicly exposes at run time.
  • Some optional fields may be absent on sparse records or record types with limited visible metadata.
  • Very broad searches may take longer or require a higher limit to capture the desired coverage.
  • Public field availability and naming can change when the target platform changes how content is presented.
  • Regional, account-level, language, or visibility differences can affect which records and attributes are available.
  • Reply-thread depth depends on the selected inputs and whether get_replies is enabled.

Troubleshooting

  • No results returned: check your filters, keyword spelling, direct URLs, and whether the target has matching public records.
  • Fewer results than expected: broaden filters, raise limit, or confirm that enough matching public records exist.
  • Some fields are empty: optional fields depend on what each record publicly provides.
  • Run takes longer than expected: reduce scope, lower limit for validation, or split broad collection into smaller runs.
  • Output changed: compare the current dataset with the field reference and include a small sample if you need support.

FAQ

What data does this actor collect?

It collects public Substack records such as posts, publications, people, notes, comments, and optional reply-thread content, depending on your selected inputs and filters.

Can I use direct URLs and keyword queries in the same run?

Yes. You can provide startUrls, queries, or both if you want direct collection and discovery-oriented search in one workflow.

Can I filter by date or language?

Yes. The actor supports publication_date, within_days, and language where those filters apply to the selected search flow.

Why did I receive fewer results than my limit?

limit sets an upper bound, not a guarantee. The final count depends on how many matching public records are available for your inputs and filters.

Can I schedule recurring runs?

Yes. Apify schedules can run the actor automatically on a daily, weekly, or custom cadence.

How do I avoid duplicates across runs?

Use type + ":" + id as the idempotency key when storing or syncing records downstream.

Can I export the data to CSV, Excel, or JSON?

Yes. Apify datasets support JSON export for pipelines and CSV or Excel export for spreadsheet-based review.

Does this actor collect private data?

No. It is intended for publicly available information from https://substack.com.

What should I include when reporting an issue?

Include the input you used with sensitive values redacted, the run ID, a short description of expected versus actual behavior, and an optional small output sample that shows the issue.

Compliance & Ethics

Responsible Data Collection

This actor collects publicly available newsletter, publication, author, post, note, and comment information from https://substack.com for legitimate business purposes, including:

  • Media and publishing research and market analysis
  • Content monitoring and reporting
  • Dataset enrichment and operational analytics

This section is informational and not legal advice. Users are responsible for ensuring their collection and use of data complies with applicable laws, regulations, and platform terms.

Best Practices

  • Use collected data in accordance with applicable laws, regulations, and the target site's terms
  • Respect individual privacy and personal information
  • Use data responsibly and avoid disruptive or excessive collection
  • Do not use this actor for spamming, harassment, or other harmful purposes
  • Follow relevant data protection requirements where applicable (for example GDPR or CCPA)

Support

For help, use the actor page or the repository Issues section. When reporting a problem, include the input used with sensitive values redacted, the run ID, expected versus actual behavior, and, if helpful, a small sample of the output that demonstrates the issue.