Medium Scraper | All-In-One | $2 / 1k

Get full articles, user profiles, and search results with the All-in-One Medium Scraper. Extract rich data including titles, bios, subscriber counts, social links, and engagement metrics. Ideal for market research, creator discovery, trend tracking, and audience analysis.

Pricing

$1.99 / 1,000 results


Developer

Fatih Tahta (Maintained by Community)

Actor stats

  • Bookmarked: 3
  • Total users: 24
  • Monthly active users: 3
  • Last modified: 5 days ago


Medium Scraper

Slug: fatihtahta/medium-scraper

Overview

Medium Scraper collects structured public records from Medium, including articles, user profiles, publications, topics, lists, search results, and key attributes such as titles, URLs, author details, engagement metrics, timestamps, tags, and article content where available. Medium is a large publishing platform for independent writers, publications, companies, and communities, making its public data useful for content research, audience analysis, editorial tracking, and market intelligence. The actor is built for automated, repeatable data acquisition with structured JSON output that can be used consistently across analytics, enrichment, and monitoring workflows. It supports both targeted collection from known Medium URLs and discovery-oriented collection from search queries or Medium search result URLs. The result is a dependable workflow for recurring public data collection without requiring manual browsing or ad hoc copy-and-paste processes.

Why Use This Actor

  • Market research and analytics: collect structured extraction outputs for content trends, author activity, topic coverage, publication movement, and engagement analysis.
  • Product and content teams: monitor relevant articles, creators, publications, and search result categories to inform editorial planning, competitive content review, and operational reporting.
  • Developers and data engineering teams: feed normalized JSON records into downstream systems, warehouses, enrichment pipelines, search indexes, and automated quality checks.
  • Lead generation and enrichment teams: identify public authors, profiles, publications, and topic-aligned content for research, segmentation, and CRM enrichment workflows.
  • Monitoring and competitive tracking teams: schedule repeatable collection for market intelligence, keyword tracking, publication monitoring, and change detection across public Medium content.

Common Use Cases

  • Market intelligence: track public writing activity, topic visibility, article engagement, author presence, and publication coverage across selected Medium searches.
  • Lead generation: build targeted prospect lists from public Medium profiles, authors, publications, or keyword-aligned content.
  • Competitive monitoring: follow specific authors, publications, topic areas, or search results to observe content cadence and audience response over time.
  • Catalog and directory building: populate internal databases with structured public records for profiles, articles, publications, topics, and lists.
  • Data enrichment: add current public Medium attributes to existing CRM, BI, analytics, or content intelligence datasets.
  • Recurring reporting: schedule periodic runs for dashboards, alerts, editorial reviews, and trend analysis.

Quick Start

  1. Choose whether to collect known pages using pageAndProfileUrls or discover records using searchQueriesOrUrls.
  2. For search-based runs, select the searchResultType that matches the entity you want to collect: Stories, Profiles, Publications, Topics, or Lists.
  3. Set a small limit, such as 10 or 25, for the first validation run.
  4. Run the actor in Apify Console.
  5. Inspect the first dataset records to confirm that the output shape matches your use case (the sketch after this list shows the same check from code).
  6. Increase coverage, add more URLs or queries, or schedule the actor once the output is verified.
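
If you prefer to run the validation from code rather than Apify Console, the minimal sketch below does the same thing with the official apify-client Python package. The token placeholder and the example query are assumptions for illustration; the actor slug and input field names come from this page.

from apify_client import ApifyClient

# <APIFY_TOKEN> is a placeholder; use your own Apify API token.
client = ApifyClient("<APIFY_TOKEN>")

# Small validation run: 10 is the minimum limit accepted by the actor.
run = client.actor("fatihtahta/medium-scraper").call(
    run_input={
        "searchResultType": "Stories",
        "searchQueriesOrUrls": ["artificial intelligence"],
        "limit": 10,
    }
)

# Inspect the first records to confirm the output shape before scaling up.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("record_type"), item.get("source_context", {}).get("requested_url"))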

Input Parameters

Configure the available filters below to define the collection scope.

  • pageAndProfileUrls (array of strings, default: none): direct Medium article or profile URLs to collect. Use this for targeted collection when you already know the author profiles or articles you need.
  • searchResultType (string, default: Stories): type of Medium search result to collect from search inputs. Allowed values: Stories, Profiles, Publications, Topics, Lists.
  • searchQueriesOrUrls (array of strings, default: none): Medium search keywords or Medium search result URLs. Each value is used to discover matching records for the selected searchResultType.
  • limit (integer, default: 50000, minimum: 10): maximum number of main records to save across the run.
  • proxyConfiguration (object, default: { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }): connection configuration for the run. The default value is suitable for most collections.

Choosing Inputs

Use pageAndProfileUrls when you have known Medium articles or profiles and need targeted, repeatable collection from those pages. Use searchQueriesOrUrls when you want discovery from keywords or when you already have Medium search result URLs that define a useful collection scope.

For search-based runs, searchResultType controls the entity category returned by each query. Choose Stories for articles, Profiles for users, Publications for Medium publications, Topics for topic pages, and Lists for list-style results.

Narrow, specific queries generally produce more targeted datasets. Broader queries improve discovery but may require a higher limit and more review before production use. Start with a small limit to validate output quality, then increase the value once the dataset shape and coverage match your workflow.

Example Inputs

Search-driven article discovery

{
  "searchResultType": "Stories",
  "searchQueriesOrUrls": [
    "artificial intelligence",
    "product management"
  ],
  "limit": 25
}

Direct profile and article collection

{
  "pageAndProfileUrls": [
    "https://medium.com/@anilmatcha",
    "https://medium.com/@cly11204/from-goldman-to-green-tea-why-burnout-tastes-like-ceremonial-grade-0e6d3695ba53"
  ],
  "limit": 10
}

Publication discovery from Medium search URLs

{
  "searchResultType": "Publications",
  "searchQueriesOrUrls": [
    "https://medium.com/search?q=data%20engineering",
    "https://medium.com/search?q=startups"
  ],
  "limit": 50
}

Output

Output destination

The actor writes results to an Apify dataset as JSON records. The dataset is designed for direct consumption by analytics tools, ETL pipelines, and downstream APIs without post-processing.

Each item contains a stable record envelope plus a type-specific payload. In the provided output shape, the record category is represented by record_type; downstream consumers can map this to type when building a unified envelope.

Record envelope (all items)

The stable record envelope is:

  • type (string, required): record category, such as profile or post. In raw records, this may be provided as record_type.
  • id (string or number, required): stable record identifier for the entity. In the examples below, Medium identifiers are returned as strings; store the dataset-provided value consistently in downstream systems.
  • url (string, required): canonical or Medium URL for the record.

Recommended idempotency key: type + ":" + id.

Use the idempotency key for deduplication and upserts when syncing records into warehouses, CRMs, search indexes, or internal applications. The envelope makes records easier to merge, deduplicate, and sync across repeated runs because each record can be identified independently from run time or query source.
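
As one possible implementation, the sketch below builds the recommended key and drops duplicates across runs. It assumes that raw records expose the category as record_type and carry the entity identifier inside the type-specific payload (for example, record["profile"]["id"]), as in the examples later on this page; adjust the lookup if your records already use the unified envelope.

def idempotency_key(record: dict) -> str:
    # Map the raw record_type field to the envelope's type field.
    record_type = record.get("type") or record.get("record_type")
    # The entity id may sit at the top level (unified envelope) or inside
    # the type-specific payload, e.g. record["profile"]["id"] (assumption).
    payload = record.get(record_type) or {}
    entity_id = record.get("id") or payload.get("id")
    return f"{record_type}:{entity_id}"

def deduplicate(items: list[dict]) -> list[dict]:
    # Keep the first occurrence of each key; later duplicates are dropped.
    seen: set[str] = set()
    fresh = []
    for item in items:
        key = idempotency_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh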

Examples

Example: profile (type = "profile")

{
  "record_type": "profile",
  "source_context": {
    "requested_url": "https://medium.com/@anilmatcha",
    "scraped_at": "2026-04-24T10:52:11.678Z",
    "input": {
      "type": "profileUrl",
      "value": "https://medium.com/@anilmatcha"
    },
    "discovered_from": null,
    "search": {
      "query": null,
      "result_type": null
    }
  },
  "profile": {
    "id": "b2e46d5c4730",
    "url": "https://medium.com/@anilmatcha",
    "name": "Anil Chandra Naidu Matcha",
    "username": "anilmatcha",
    "bio": "CTO@VadooAI. Reach me at https://twitter.com/matchaman11"
  },
  "profile_details": {
    "image_id": "2*cXQfG4itJf-Oh7v4fvgVUg.png",
    "pronouns": [],
    "is_book_author": false,
    "has_subdomain": false,
    "custom_domain": {
      "domain": null,
      "url": null
    }
  },
  "audience": {
    "follower_count": 790,
    "following_count": 48
  },
  "linked_accounts": {
    "mastodon": null,
    "newsletter": {
      "id": "c2f315e75766",
      "name": "b2e46d5c4730",
      "slug": "b2e46d5c4730",
      "type": "NEWSLETTER_TYPE_AUTHOR"
    }
  },
  "homepage_posts": {
    "items": [
      {
        "id": "8a701d141070",
        "title": "How to train a custom GPT on your data with EmbedAI + LlamaIndex",
        "subtitle": "ChatGPT, developed by OpenAI, has changed the way we interact online.",
        "url": "https://medium.com/llamaindex-blog/how-to-train-a-custom-gpt-on-your-data-with-embedai-llamaindex-8a701d141070",
        "published_at": {
          "first": 1702596440828,
          "first_iso": "2023-12-14T23:27:20.828Z",
          "latest": 1702596440828,
          "latest_iso": "2023-12-14T23:27:20.828Z"
        },
        "metrics": {
          "reading_time_minutes": 4.923899371069182
        },
        "engagement": {
          "clap_count": 297,
          "voter_count": 37
        },
        "access": {
          "is_locked": false
        },
        "preview_image": {
          "id": "1*g5SWoNXvEj6hqhZPB6CZCA.png",
          "original_width": 1024,
          "original_height": 1024
        },
        "collection": {
          "id": "d7683ed5043e",
          "name": "LlamaIndex Blog",
          "slug": "llamaindex-blog"
        }
      }
    ],
    "fetched_count": 10,
    "next_page": {
      "from": "L1757361037228",
      "limit": 10
    }
  }
}

Example: post (type = "post")

{
  "record_type": "post",
  "source_context": {
    "requested_url": "https://medium.com/@cly11204/from-goldman-to-green-tea-why-burnout-tastes-like-ceremonial-grade-0e6d3695ba53",
    "scraped_at": "2026-04-24T10:52:11.821Z",
    "input": {
      "type": "articleUrl",
      "value": "https://medium.com/@cly11204/from-goldman-to-green-tea-why-burnout-tastes-like-ceremonial-grade-0e6d3695ba53"
    },
    "discovered_from": null,
    "search": {
      "query": null,
      "result_type": null
    }
  },
  "post": {
    "id": "0e6d3695ba53",
    "title": "Why Every Investment Banker is Starting a Matcha Company: A Business Breakdown",
    "subtitle": "Inside the Investment Banker-to-Matcha Brand Pipeline",
    "urls": {
      "medium": "https://medium.com/@cly11204/from-goldman-to-green-tea-why-burnout-tastes-like-ceremonial-grade-0e6d3695ba53",
      "canonical": null
    },
    "published_at": {
      "first": 1750182485763,
      "first_iso": "2025-06-17T17:48:05.763Z",
      "latest": 1750811376100,
      "latest_iso": "2025-06-25T00:29:36.100Z"
    }
  },
  "metrics": {
    "reading_time_minutes": 13.09433962264151,
    "word_count": 3099
  },
  "engagement": {
    "clap_count": 40,
    "voter_count": 6
  },
  "access": {
    "is_locked": false,
    "is_locked_preview_only": false
  },
  "author": {
    "id": "3cc4d24c3017",
    "name": "Caitlin",
    "username": "cly11204",
    "bio": "Product | UX | Business | Plant Mom | Gym Girlie https://caitlinyeung.com",
    "image_id": "1*VXj8YlfaNt2yBIZVVyPtKA.jpeg",
    "profile_url": "https://medium.com/@cly11204"
  },
  "collection": null,
  "tags": [
    {
      "id": "business",
      "title": "Business",
      "slug": "business"
    }
  ],
  "content": {
    "sections": [
      {
        "section_id": "c264",
        "start_index": 0
      }
    ],
    "paragraphs": [
      {
        "id": "f1569197a174_0",
        "type": "IMG",
        "text": "Image courtesy of Unsplash",
        "href": null,
        "annotations": []
      },
      {
        "id": "f1569197a174_1",
        "type": "H3",
        "text": "Why Every Investment Banker is Starting a Matcha Company: A Business Breakdown",
        "href": null,
        "annotations": []
      }
    ],
    "plain_text": "Image courtesy of Unsplash\n\nWhy Every Investment Banker is Starting a Matcha Company: A Business Breakdown",
    "plain_text_preview": "Image courtesy of Unsplash\n\nWhy Every Investment Banker is Starting a Matcha Company...",
    "stats": {
      "paragraph_count": 196,
      "image_paragraph_count": 13,
      "link_paragraph_count": 0,
      "annotation_count": 9,
      "link_annotation_count": 9
    }
  }
}

Field Reference

Common fields

  • record_type (string, required): record category, such as profile or post.
  • source_context.requested_url (string, optional): URL requested for the record when available.
  • source_context.scraped_at (string, required): ISO timestamp for when the record was collected.
  • source_context.input.type (string, optional): input source category that produced the record.
  • source_context.input.value (string, optional): input value that produced the record.
  • source_context.discovered_from (string/object/null, optional): discovery context when the record was found from another source.
  • source_context.search.query (string/null, optional): search query associated with the record.
  • source_context.search.result_type (string/null, optional): selected search result type associated with the record.

Profile records

  • profile.id (string, required): Medium profile identifier.
  • profile.url (string, required): public Medium profile URL.
  • profile.name (string, optional): display name.
  • profile.username (string, optional): Medium username.
  • profile.bio (string/null, optional): public profile biography.
  • profile_details.image_id (string/null, optional): profile image identifier when available.
  • profile_details.pronouns (array, optional): public pronouns listed on the profile.
  • profile_details.is_book_author (boolean, optional): whether the profile is marked as a book author.
  • profile_details.has_subdomain (boolean, optional): whether the profile has a Medium subdomain.
  • profile_details.custom_domain.domain (string/null, optional): custom domain when configured.
  • profile_details.custom_domain.url (string/null, optional): custom domain URL when configured.
  • audience.follower_count / audience.following_count (number/null, optional): public audience counts.
  • linked_accounts.mastodon (object/null, optional): public Mastodon account details when available.
  • linked_accounts.newsletter.id (string, optional): newsletter identifier.
  • linked_accounts.newsletter.name (string, optional): newsletter name.
  • linked_accounts.newsletter.slug (string, optional): newsletter slug.
  • linked_accounts.newsletter.type (string, optional): newsletter type.
  • homepage_posts.items (array, optional): recent public posts shown for the profile.
  • homepage_posts.items.id (string, optional): post identifier.
  • homepage_posts.items.title (string, optional): post title.
  • homepage_posts.items.subtitle (string/null, optional): post subtitle.
  • homepage_posts.items.url (string, optional): post URL.
  • homepage_posts.items.published_at.first / homepage_posts.items.published_at.latest (number, optional): publication timestamps in milliseconds.
  • homepage_posts.items.published_at.first_iso / homepage_posts.items.published_at.latest_iso (string, optional): publication timestamps in ISO format.
  • homepage_posts.items.metrics.reading_time_minutes (number, optional): estimated reading time in minutes.
  • homepage_posts.items.engagement.clap_count / homepage_posts.items.engagement.voter_count (number, optional): public engagement counts.
  • homepage_posts.items.access.is_locked (boolean, optional): whether the post is marked as locked.
  • homepage_posts.items.preview_image.id (string/null, optional): preview image identifier.
  • homepage_posts.items.preview_image.original_width / homepage_posts.items.preview_image.original_height (number/null, optional): preview image dimensions in pixels.
  • homepage_posts.items.collection.id (string, optional): publication identifier when the post belongs to a collection.
  • homepage_posts.items.collection.name (string, optional): publication name.
  • homepage_posts.items.collection.slug (string, optional): publication slug.
  • homepage_posts.fetched_count (number, optional): number of homepage posts returned.
  • homepage_posts.next_page.from (string, optional): continuation marker when present.
  • homepage_posts.next_page.limit (number, optional): page size associated with the continuation marker.

Post records

  • post.id (string, required): Medium post identifier.
  • post.title (string, optional): article title.
  • post.subtitle (string/null, optional): article subtitle.
  • post.urls.medium (string, required): Medium article URL.
  • post.urls.canonical (string/null, optional): canonical URL when publicly available.
  • post.published_at.first / post.published_at.latest (number, optional): publication timestamps in milliseconds.
  • post.published_at.first_iso / post.published_at.latest_iso (string, optional): publication timestamps in ISO format.
  • metrics.reading_time_minutes (number, optional): estimated reading time in minutes.
  • metrics.word_count (number, optional): estimated word count.
  • engagement.clap_count / engagement.voter_count (number, optional): public engagement counts.
  • access.is_locked (boolean, optional): whether the post is marked as locked.
  • access.is_locked_preview_only (boolean, optional): whether only a locked preview is available.
  • author.id (string, optional): author profile identifier.
  • author.name (string, optional): author display name.
  • author.username (string, optional): Medium username.
  • author.bio (string/null, optional): public author biography.
  • author.image_id (string/null, optional): author profile image identifier.
  • author.profile_url (string, optional): author profile URL.
  • collection (object/null, optional): publication details when the post belongs to a collection.
  • tags (array, optional): public tags associated with the post.
  • tags.id (string, optional): tag identifier.
  • tags.title (string, optional): tag display title.
  • tags.slug (string, optional): tag slug.
  • content.sections (array, optional): article section index data.
  • content.sections.section_id (string, optional): section identifier.
  • content.sections.start_index (number, optional): paragraph start index for the section.
  • content.paragraphs (array, optional): article paragraphs and embedded content markers; see the reconstruction sketch after this list.
  • content.paragraphs.id (string, optional): paragraph identifier.
  • content.paragraphs.type (string, optional): paragraph type, such as P, H3, IMG, BQ, ULI, OLI, or IFRAME.
  • content.paragraphs.text (string/null, optional): paragraph text when available.
  • content.paragraphs.href (string/null, optional): linked URL when associated with the paragraph.
  • content.paragraphs.annotations (array, optional): inline annotation metadata.
  • content.paragraphs.annotations.type (string, optional): annotation type.
  • content.paragraphs.annotations.start / content.paragraphs.annotations.end (number, optional): text offsets for the annotation.
  • content.paragraphs.annotations.href (string, optional): annotation URL.
  • content.paragraphs.annotations.anchor_type (string, optional): link anchor type.
  • content.plain_text (string, optional): extracted article text.
  • content.plain_text_preview (string, optional): shortened article text preview.
  • content.stats.paragraph_count (number, optional): number of paragraphs returned.
  • content.stats.image_paragraph_count (number, optional): number of image paragraphs.
  • content.stats.link_paragraph_count (number, optional): number of link paragraphs.
  • content.stats.annotation_count (number, optional): number of annotations.
  • content.stats.link_annotation_count (number, optional): number of link annotations.
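
The paragraph fields above are enough to rebuild readable article text when the provided plain_text does not fit your pipeline. The sketch below is one way to do it, assuming the paragraph types listed in this reference; which types to skip is a workflow choice, not actor behavior.

def post_body_text(record: dict) -> str:
    # Rebuild readable text from content.paragraphs, skipping image and
    # embed markers (IMG, IFRAME), whose text fields typically hold captions.
    paragraphs = (record.get("content") or {}).get("paragraphs") or []
    lines = []
    for p in paragraphs:
        text = (p.get("text") or "").strip()
        if not text or p.get("type") in ("IMG", "IFRAME"):
            continue
        lines.append(text)
    return "\n\n".join(lines)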

Data Quality, Guarantees, And Handling

  • Structured records: results are normalized into predictable JSON objects for downstream use.
  • Best-effort extraction: fields may vary by region, session, availability, account visibility, or Medium interface changes.
  • Optional fields: null-check optional values in downstream code, especially biography, canonical URL, publication, image, audience, and content fields; a small helper is sketched after this list.
  • Deduplication: use type + ":" + id, mapping record_type to type where needed.
  • Freshness: results reflect the publicly available data at run time.
  • Repeated runs: use the recommended idempotency key when syncing data into warehouses, CRMs, search indexes, or internal applications.
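
A minimal null-safe accessor, as one way to apply the optional-field guidance above; the dotted-path convention is an illustration, not part of the actor's output contract.

from typing import Any

def get_path(record: dict, path: str, default: Any = None) -> Any:
    # Walk a dotted path such as "post.urls.canonical", returning the
    # default as soon as any step is missing or null.
    current: Any = record
    for part in path.split("."):
        if not isinstance(current, dict) or current.get(part) is None:
            return default
        current = current[part]
    return current

# Example: prefer the canonical URL and fall back to the Medium URL:
# url = get_path(record, "post.urls.canonical",
#                default=get_path(record, "post.urls.medium"))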

Tips For Best Results

  • Start with a small limit to validate the output shape before scaling up.
  • Use direct pageAndProfileUrls when the target authors or articles are already known.
  • Use searchQueriesOrUrls for discovery, keyword tracking, or broader content research.
  • Choose one searchResultType per run when you need cleaner segmentation by entity type.
  • Add queries gradually to understand how each term changes coverage and output composition.
  • Schedule recurring runs for monitoring workflows instead of relying on manual one-off exports.
  • Use stable identifiers for deduplication when storing results over time.

How to Run on Apify

  1. Open the actor in Apify Console.
  2. Configure the available input fields for the target scope.
  3. Set the maximum number of outputs to collect with limit.
  4. Click Start and wait for the run to finish.
  5. Open the dataset and review the first records.
  6. Download results in JSON, CSV, Excel, or another supported format.

Scheduling & Automation

Scheduling

Automated Data Collection

You can schedule recurring runs to keep Medium datasets fresh for monitoring, reporting, and enrichment workflows. Use a smaller, well-defined input scope for recurring jobs when consistency across runs is important.

  • Navigate to Schedules in Apify Console
  • Create a new schedule: daily, weekly, or custom cron
  • Configure input parameters
  • Enable notifications for run completion
  • Add webhooks for automated processing
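
As a sketch of the webhook option, the minimal receiver below ingests the dataset of each finished run. It assumes Apify's default webhook payload, which delivers the run object under resource, and uses Flask purely for illustration; the route path and the ingestion step are placeholders.

from apify_client import ApifyClient
from flask import Flask, request

app = Flask(__name__)
client = ApifyClient("<APIFY_TOKEN>")  # placeholder token

@app.route("/apify-webhook", methods=["POST"])
def handle_finished_run():
    payload = request.get_json(force=True)
    # Default payload template: the run object arrives under "resource" (assumption).
    dataset_id = (payload.get("resource") or {}).get("defaultDatasetId")
    if not dataset_id:
        return "missing defaultDatasetId", 400
    for item in client.dataset(dataset_id).iterate_items():
        pass  # replace with your validation, notification, or ingestion step
    return "ok", 200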

Integration Options

  • CRM enrichment: sync public author, profile, publication, and article attributes into lead or account records.
  • BI dashboards: track content volume, author activity, engagement metrics, topic coverage, and publication trends over time.
  • Data warehouses: store normalized JSON records for historical analysis, joins, reporting, and model-ready datasets.
  • Webhooks: trigger validation, notification, or ingestion workflows after each completed run.
  • Google Sheets or Airtable: review smaller profile, article, or publication datasets with lightweight operational teams.
  • ETL pipelines: route dataset exports into enrichment, classification, search, or monitoring systems.

Export Formats And Downstream Use

Apify datasets can be exported or consumed by downstream systems for operational workflows, reporting, and automation.

  • JSON: for APIs, applications, and data pipelines
  • CSV or Excel: for spreadsheet workflows and manual review
  • API access: for automated ingestion into internal systems
  • BI and warehouses: for reporting, dashboards, and historical analysis
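
For the API access option above, dataset items can also be fetched directly over HTTP. The sketch below pulls a CSV export from the standard dataset items endpoint; the dataset ID and token values are placeholders.

import requests

DATASET_ID = "<DATASET_ID>"    # placeholder: taken from the finished run
APIFY_TOKEN = "<APIFY_TOKEN>"  # placeholder: your Apify API token

resp = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "csv", "token": APIFY_TOKEN},
    timeout=60,
)
resp.raise_for_status()
with open("medium_records.csv", "wb") as f:
    f.write(resp.content)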

Performance

Estimated run times:

  • Small runs (< 1,000 outputs): ~3-5 minutes
  • Medium runs (1,000-5,000 outputs): ~5-15 minutes
  • Large runs (5,000+ outputs): ~15-30 minutes

Execution time varies based on filters, result volume, and how much information is returned per record. Highly targeted runs can finish faster, while broad discovery or detail-rich records may take longer.

Limitations

  • Availability depends on what Medium publicly exposes at run time.
  • Some optional fields may be missing on sparse profiles, articles, publications, topics, or list records.
  • Very broad searches may take longer or require higher limit values to capture enough records.
  • Medium-side changes can affect field availability, naming, or visible data.
  • Regional, account, or availability differences may change which results are visible.
  • Engagement metrics and audience counts should be treated as public point-in-time values.

Troubleshooting

  • No results returned: check query spelling, direct URLs, selected searchResultType, and whether Medium has matching public records.
  • Fewer results than expected: broaden queries, raise limit, add more input URLs, or verify that the target contains enough matching public records.
  • Some fields are empty: optional fields depend on what each record publicly provides.
  • Run takes longer than expected: reduce scope, lower limit for validation, or split broad collection into smaller segments.
  • Output changed: compare the current output with the field reference and include a small sample if support is needed.

FAQ

What data does this actor collect?

It collects public Medium articles, profiles, publications, topics, lists, and search result records, depending on the inputs and selected searchResultType.

Can I filter by location, category, date, price, or other criteria?

The available inputs are direct article/profile URLs, search queries or Medium search URLs, searchResultType, limit, and connection settings. Location, category, date, price, and sort fields are not part of the current input schema.

Can I provide direct Medium URLs?

Yes. Use pageAndProfileUrls for known Medium article or profile URLs.

Can I use keyword-based discovery?

Yes. Use searchQueriesOrUrls with a selected searchResultType such as Stories, Profiles, Publications, Topics, or Lists.

Why did I receive fewer results than my limit?

The limit is a maximum, not a guarantee. The run may return fewer records when the provided URLs, queries, or selected result type produce fewer matching public records.

Can I schedule recurring runs?

Yes. Use Apify schedules to run the actor daily, weekly, or on a custom cron schedule.

How do I avoid duplicates across runs?

Use the recommended idempotency key type + ":" + id. In raw records, map record_type to type when constructing the key.

Can I export the data to CSV, Excel, or JSON?

Yes. Apify datasets support exports including JSON, CSV, Excel, and other formats available in Apify Console.

Does this actor collect private data?

No. The actor is intended for publicly available Medium data only. Users are responsible for using collected data in accordance with applicable laws, regulations, and platform terms.

What should I include when reporting an issue?

Include the input used, with sensitive values redacted if needed, the run ID, expected versus actual behavior, and a small output sample when available.

Compliance & Ethics

Responsible Data Collection

This actor collects publicly available Medium content and profile information from https://medium.com for legitimate business purposes, including:

  • Content and media intelligence research and market analysis
  • Editorial, publication, and author monitoring
  • CRM, BI, and analytics enrichment workflows

Users are responsible for ensuring that their use of the data complies with applicable laws, regulations, and the target site's terms. This section is informational and not legal advice.

Best Practices

  • Use collected data in accordance with applicable laws, regulations, and the target site's terms
  • Respect individual privacy and personal information
  • Use data responsibly and avoid disruptive or excessive collection
  • Do not use this actor for spamming, harassment, or other harmful purposes
  • Follow relevant data protection requirements where applicable, such as GDPR and CCPA

Support

For help, use the Issues tab or the actor page support options. Include the input used with any sensitive values redacted, the run ID, expected versus actual behavior, and a small output sample if it helps illustrate the issue. Avoid sharing private credentials or confidential business data in support requests.