AI Training Data Scraper (Substack / Medium)

Pricing

from $30.00 / 1,000 results

Extract clean, structured text data from Substack and Medium publications — formatted as Markdown or Plain Text — ready for LLM fine-tuning, RAG pipelines, and content analysis.

Rating: 0.0 (0)

Developer: Brian (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 5 days ago

Why this scraper?

LLM training requires large volumes of high-quality, long-form text. Substack and Medium are among the richest sources of expert-written articles on the web, but scraping them manually is tedious, and the raw HTML is full of noise (ads, popups, subscribe widgets).

This Actor cleans the content, stripping out:

  • Subscribe popups & paywall gates
  • Navigation headers & footers
  • Author bio cards & social share buttons
  • Script/style tags and embedded widgets

What you get is pure, clean content — exactly what AI training pipelines need.
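
To make the cleaning step concrete, here is a minimal sketch of tag-based noise removal using only Python's standard library. It is not the Actor's actual implementation: the NOISE_TAGS and BLOCK_TAGS sets and the clean_html function are assumptions chosen for illustration, and real Substack or Medium pages would likely need site-specific rules (for example, matching subscribe widgets by class name).

```python
from html.parser import HTMLParser

# Assumption: tags whose whole subtree is treated as noise in this sketch.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "iframe"}
# Assumption: tags that end a paragraph-like block of text.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping everything inside noise tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside at least one noise tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._parts.append("\n\n")  # paragraph break after a block element

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def clean_html(html: str) -> str:
    """Return plain text with noise subtrees removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = [" ".join(p.split()) for p in parser.text().split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = (
        "<body><nav>Home | About</nav><h1>Title</h1>"
        "<p>First   paragraph.</p><script>alert(1)</script>"
        "<footer>Subscribe now</footer><p>Second.</p></body>"
    )
    print(clean_html(sample))
```

Counting only noise tags (rather than every open tag) keeps the sketch tolerant of unclosed elements such as a stray p tag inside a footer.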

What does it extract?

| Field | Description |
| --- | --- |
| title | Article headline |
| subtitle | Article subtitle (if present) |
| author | Author name |
| date | Publication date |
| content | Full article body in your chosen format |
| url | Original article URL |

Input Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| publicationUrls | Array | Substack or Medium publication URLs |
| maxArticlesPerPublication | Integer | Max articles to scrape per publication (default: 10) |
| outputFormat | String | markdown or text (default: markdown) |
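
As an illustration of how these parameters and their documented defaults fit together, the following sketch builds and validates an input object before a run. The build_run_input helper is hypothetical and not part of the Actor; the checks that publicationUrls is non-empty and that maxArticlesPerPublication is a positive integer are assumptions, since the listing only states the types and defaults.

```python
from urllib.parse import urlparse

# The two output formats named in the parameter table.
ALLOWED_FORMATS = {"markdown", "text"}


def build_run_input(publication_urls, max_articles_per_publication=10,
                    output_format="markdown"):
    """Return an input dict using the documented parameter names and defaults."""
    if not publication_urls:
        raise ValueError("publicationUrls must contain at least one URL")
    for url in publication_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Assumption: the Actor expects a positive integer here.
    if not isinstance(max_articles_per_publication, int) or max_articles_per_publication < 1:
        raise ValueError("maxArticlesPerPublication must be a positive integer")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "publicationUrls": list(publication_urls),
        "maxArticlesPerPublication": max_articles_per_publication,
        "outputFormat": output_format,
    }


if __name__ == "__main__":
    print(build_run_input(["https://newsletter.banklesshq.com"]))
```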

Sample Input

{
  "publicationUrls": [
    "https://newsletter.banklesshq.com",
    "https://medium.com/@exampleauthor"
  ],
  "maxArticlesPerPublication": 5,
  "outputFormat": "markdown"
}

Sample Output

{
  "url": "https://newsletter.banklesshq.com/p/the-future-of-defi",
  "title": "The Future of DeFi",
  "subtitle": "Where decentralized finance is headed next",
  "author": "Bankless",
  "date": "2024-01-15",
  "content": "# The Future of DeFi\n\nDecentralized finance has come a long way..."
}
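
Records shaped like the sample output can be turned into JSON Lines rows for a fine-tuning dataset. The sketch below is one possible layout, not a format the Actor produces: the to_jsonl_line helper, the "text"/"meta" row structure, and the choice of which fields are required are all assumptions for illustration.

```python
import json

# Assumption: a record is only usable for training if these fields are non-empty.
REQUIRED_FIELDS = ("url", "title", "content")


def to_jsonl_line(record: dict) -> str:
    """Serialize one scraped article as a single JSON Lines row."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")
    row = {
        "text": record["content"],
        "meta": {
            "title": record["title"],
            "subtitle": record.get("subtitle"),  # optional in the output schema
            "author": record.get("author"),
            "date": record.get("date"),
            "url": record["url"],
        },
    }
    # json.dumps escapes embedded newlines, so each row stays on one line.
    return json.dumps(row, ensure_ascii=False)


if __name__ == "__main__":
    article = {
        "url": "https://newsletter.banklesshq.com/p/the-future-of-defi",
        "title": "The Future of DeFi",
        "subtitle": "Where decentralized finance is headed next",
        "author": "Bankless",
        "date": "2024-01-15",
        "content": "# The Future of DeFi\n\nDecentralized finance has come a long way...",
    }
    print(to_jsonl_line(article))
```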

Use Cases

  • LLM Fine-Tuning: Build domain-specific training datasets from expert writers
  • RAG Pipelines: Populate vector databases with high-quality knowledge bases
  • Content Analysis: Analyze publication trends, writing styles, and topic coverage
  • Research: Systematically collect articles on specific subjects
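
For the RAG use case, the content field usually needs to be split into smaller passages before embedding. The following is a minimal paragraph-based chunking sketch under stated assumptions: the chunk_markdown helper and the 1000-character default are hypothetical, and a production pipeline would more likely size chunks by tokens for its specific embedding model.

```python
def chunk_markdown(content: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk,
    so chunks can exceed the limit in that one case.
    """
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate  # paragraph still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    body = "# The Future of DeFi\n\nDecentralized finance has come a long way..."
    for i, chunk in enumerate(chunk_markdown(body, max_chars=40)):
        print(i, repr(chunk))
```

Splitting on paragraph boundaries keeps headings attached to the text that follows them whenever they fit in the same chunk.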