Pricing

from $30.00 / 1,000 results

AI Training Data Scraper (Substack / Medium)

Extract clean, structured text from Substack and Medium publications, formatted as Markdown or plain text and ready for LLM fine-tuning, RAG pipelines, and content analysis.

Developer

Brian

Last modified

4 months ago

Why this scraper?

LLM training requires large volumes of high-quality, long-form text. Substack and Medium are among the richest public sources of expert-written articles, but scraping them by hand is tedious, and the raw HTML is full of noise (ads, popups, subscribe widgets).

This Actor cleans each article, stripping out:

Subscribe popups & paywall gates
Navigation headers & footers
Author bio cards & social share buttons
Script/style tags and embedded widgets

What remains is the article text itself, which is what AI training pipelines need.
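To make the cleaning step concrete, here is a minimal sketch of tag-based noise stripping using only Python's standard library. It is not the Actor's actual implementation: the `NOISE_TAGS` and `BLOCK_TAGS` sets, the `NoiseStripper` class, and the `strip_noise` helper are all hypothetical names chosen for illustration. Real subscribe popups and author bio cards are usually identified by class names or attributes, which this sketch does not attempt.

```python
from html.parser import HTMLParser

# Assumption: elements whose whole subtree is treated as noise in this sketch.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "iframe"}
# Assumption: elements that end a paragraph in the cleaned output.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote"}
PARA_BREAK = "\x00"  # sentinel that cannot collide with ordinary page whitespace


class NoiseStripper(HTMLParser):
    """Collect text that is not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a noise element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append(PARA_BREAK)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_noise(html: str) -> str:
    """Return the visible text of `html`, one blank line between blocks."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    paragraphs = "".join(parser.parts).split(PARA_BREAK)
    cleaned = [" ".join(p.split()) for p in paragraphs]  # collapse whitespace
    return "\n\n".join(p for p in cleaned if p)
```

Calling `strip_noise` on a page keeps headings and paragraphs and drops anything inside navigation, footer, script, or style elements.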

What does it extract?

Field	Description
`title`	Article headline
`subtitle`	Article subtitle (if present)
`author`	Author name
`date`	Publication date
`content`	Full article body in your chosen format
`url`	Original article URL

Input Parameters

Parameter	Type	Description
`publicationUrls`	Array	Substack or Medium publication URLs
`maxArticlesPerPublication`	Integer	Max articles to scrape per publication (default: 10)
`outputFormat`	String	`markdown` or `text` (default: markdown)
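As a rough illustration of how these parameters fit together, the sketch below builds an input object with the documented defaults and rejects obviously invalid values before a run is started. The `build_input` helper is hypothetical, and the specific checks (absolute http(s) URLs, a minimum of one article) are assumptions rather than rules documented by the Actor.

```python
from urllib.parse import urlparse

ALLOWED_FORMATS = {"markdown", "text"}


def build_input(publication_urls, max_articles_per_publication=10,
                output_format="markdown"):
    """Return an input dict using the parameter names from the table above."""
    if not publication_urls:
        raise ValueError("publicationUrls must contain at least one URL")
    for url in publication_urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are meaningful here.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if (isinstance(max_articles_per_publication, bool)
            or not isinstance(max_articles_per_publication, int)
            or max_articles_per_publication < 1):
        # Assumption: the lower bound of 1 is not stated in the documentation.
        raise ValueError("maxArticlesPerPublication must be a positive integer")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError("outputFormat must be 'markdown' or 'text'")
    return {
        "publicationUrls": list(publication_urls),
        "maxArticlesPerPublication": max_articles_per_publication,
        "outputFormat": output_format,
    }
```

The returned dict can be serialized with `json.dumps` and pasted into the Actor's input editor.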

Sample Input

{
    "publicationUrls": [
        "https://newsletter.banklesshq.com",
        "https://medium.com/@exampleauthor"
    ],
    "maxArticlesPerPublication": 5,
    "outputFormat": "markdown"
}

Sample Output

{
    "url": "https://newsletter.banklesshq.com/p/the-future-of-defi",
    "title": "The Future of DeFi",
    "subtitle": "Where decentralized finance is headed next",
    "author": "Bankless",
    "date": "2024-01-15",
    "content": "# The Future of DeFi\n\nDecentralized finance has come a long way..."
}
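Records shaped like the sample output can be turned into a JSON Lines file for fine-tuning with a few lines of standard-library Python. The sketch below is one possible layout, not a format the Actor emits: the `to_jsonl` helper and the `{"text": ..., "meta": ...}` structure are assumptions chosen for illustration.

```python
import json


def to_jsonl(records, min_chars=1):
    """Serialize article records to JSONL, skipping ones with too little content."""
    lines = []
    for record in records:
        content = (record.get("content") or "").strip()
        if len(content) < min_chars:
            continue  # drop empty or near-empty articles
        row = {
            "text": content,
            # Keep provenance so each training sample can be traced to its source.
            "meta": {
                "url": record.get("url"),
                "title": record.get("title"),
                "author": record.get("author"),
                "date": record.get("date"),
            },
        }
        # json.dumps escapes newlines, so each record stays on a single line.
        lines.append(json.dumps(row))
    return "\n".join(lines)
```

Raising `min_chars` is a simple way to filter out stubs and teaser posts before training.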

Use Cases

LLM Fine-Tuning: Build domain-specific training datasets from expert writers
RAG Pipelines: Populate vector databases with high-quality knowledge bases
Content Analysis: Analyze publication trends, writing styles, and topic coverage
Research: Systematically collect articles on specific subjects
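For the RAG use case, the `content` field usually needs to be split into chunks before it is embedded. The sketch below shows one simple approach, packing whole paragraphs into chunks up to a character budget. The `chunk_markdown` helper and the 1,000-character default are assumptions for illustration; production pipelines often chunk by tokens and add overlap.

```python
def chunk_markdown(content, max_chars=1000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for para in content.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits, or first paragraph of a new chunk
        else:
            chunks.append(current)  # flush the full chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps headings attached to the text that follows them whenever the budget allows.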

AI Training Data Curator

ryanclinton/ai-training-data-curator

Crawl any website and extract clean, structured text data ready for LLM fine-tuning, RAG pipelines, and AI model training.

Ryan Clinton

AI Training Data Scraper - LLM and RAG-Ready

george.the.developer/ai-training-data-scraper

Extract web content formatted for LLM fine-tuning and RAG pipelines. Output in OpenAI JSONL, Claude JSONL, Markdown, or raw text.

George Kioko

Website Content Scraper: Clean Markdown for AI and RAG

scrapemint/website-content-scraper

Crawl any website and get clean markdown, text, or HTML per page, ready for RAG pipelines, chatbots, and LLM fine-tuning. Plain HTTP, no browser, no API key. Pay per page.

Ken M

AI-Ready Website Crawler

optimus-fulcria/ai-ready-website-crawler

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Fulcria Labs

Medium Publications Search Scraper

easyapi/medium-publications-search-scraper

Scrape Medium publications by keywords - Extract publication details including name, description, URL and avatar from Medium's search results efficiently and reliably.

EasyApi

Substack Articles Extractor

extremescrapes/substack-articles-extractor

Extract Substack newsletter posts as clean Markdown for LLM consumption

Extreme Scrapes

Substack Profile Scraper

getdataforme/substack-profile-scraper

The Substack Profile Scraper efficiently extracts detailed data from Substack profiles and posts for analysis, research, and content aggregation....

GetDataForMe

Substack Discovery Scraper

getdataforme/substack-discovery-scraper

The Substack Discovery Scraper efficiently extracts and analyzes data from Substack publications, supporting market research, competitive intelligence, and content aggregation....

GetDataForMe