AI Training Data Scraper (Substack / Medium)
Pricing
from $30.00 / 1,000 results
Extract clean, structured text from Substack and Medium publications, formatted as Markdown or plain text and ready for LLM fine-tuning, RAG pipelines, and content analysis.
Developer: Brian
Actor stats
- Bookmarked: 0
- Total users: 2
- Monthly active users: 1
- Last modified: 5 days ago
Why this scraper?
LLM training requires large volumes of high-quality, long-form text. Substack and Medium host a large body of expert-written articles, but collecting them by hand is tedious, and the raw HTML is full of noise (ads, popups, subscribe widgets).
This Actor cleans each article by stripping out:
- Subscribe popups & paywall gates
- Navigation headers & footers
- Author bio cards & social share buttons
- Script/style tags and embedded widgets
What remains is the article text itself, which is what AI training pipelines need.
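The kind of cleanup listed above can be illustrated with a small standard-library sketch. This is not the Actor's implementation, which is not documented here; the tag lists `SKIP_TAGS` and `BLOCK_TAGS` and the helper `strip_noise` are hypothetical names chosen for the example, and real Substack or Medium pages would likely need site-specific rules on top.

```python
from html.parser import HTMLParser

# Assumed for illustration: tags whose entire contents are discarded.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumed for illustration: tags that end a paragraph-like block.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._parts.append("\n\n")  # paragraph break after a block

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        joined = "".join(self._parts)
        # Collapse whitespace inside each paragraph and drop empty ones.
        paragraphs = [" ".join(p.split()) for p in joined.split("\n\n")]
        return "\n\n".join(p for p in paragraphs if p)


def strip_noise(html: str) -> str:
    """Return the readable text of an HTML fragment as plain paragraphs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A production cleaner would typically also target popups and paywall gates by class name or ARIA role, which a tag-only filter like this one cannot detect.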
What does it extract?
| Field | Description |
|---|---|
| `title` | Article headline |
| `subtitle` | Article subtitle (if present) |
| `author` | Author name |
| `date` | Publication date |
| `content` | Full article body in your chosen format |
| `url` | Original article URL |
Input Parameters
| Parameter | Type | Description |
|---|---|---|
| `publicationUrls` | Array | Substack or Medium publication URLs |
| `maxArticlesPerPublication` | Integer | Max articles to scrape per publication (default: 10) |
| `outputFormat` | String | `markdown` or `text` (default: `markdown`) |
Sample Input
{
  "publicationUrls": [
    "https://newsletter.banklesshq.com",
    "https://medium.com/@exampleauthor"
  ],
  "maxArticlesPerPublication": 5,
  "outputFormat": "markdown"
}
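Before starting a run, it can help to check the input locally against the parameter table. The following is a minimal sketch: `build_input` is a hypothetical helper, and the rules that URLs must be absolute http(s) URLs and that the article limit must be a positive integer are assumptions, since the documentation above states only the types and defaults.

```python
from urllib.parse import urlparse

# The two formats named in the Input Parameters table.
ALLOWED_FORMATS = ("markdown", "text")


def build_input(publication_urls, max_articles=10, output_format="markdown"):
    """Return an input dict shaped like the Sample Input, or raise ValueError."""
    if not publication_urls:
        raise ValueError("publicationUrls must contain at least one URL")
    for url in publication_urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are meaningful here.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Assumption: the limit must be a positive integer (bool is rejected).
    if isinstance(max_articles, bool) or not isinstance(max_articles, int):
        raise ValueError("maxArticlesPerPublication must be an integer")
    if max_articles < 1:
        raise ValueError("maxArticlesPerPublication must be at least 1")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}")
    return {
        "publicationUrls": list(publication_urls),
        "maxArticlesPerPublication": max_articles,
        "outputFormat": output_format,
    }
```

Calling `build_input(["https://newsletter.banklesshq.com", "https://medium.com/@exampleauthor"], max_articles=5)` yields a dict equivalent to the Sample Input, which can then be serialized with `json.dumps`.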
Sample Output
{
  "url": "https://newsletter.banklesshq.com/p/the-future-of-defi",
  "title": "The Future of DeFi",
  "subtitle": "Where decentralized finance is headed next",
  "author": "Bankless",
  "date": "2024-01-15",
  "content": "# The Future of DeFi\n\nDecentralized finance has come a long way..."
}
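Records in this shape can be turned into one JSON object per line (JSONL), a layout many fine-tuning tools accept. The sketch below is one possible conversion, not a format the Actor itself produces: the `text` and `meta` keys and the helper name `to_training_line` are assumptions made for the example.

```python
import json


def to_training_line(record: dict) -> str:
    """Serialize one scraped record as a single JSONL line.

    The output keys ("text", "meta") are an assumed schema for illustration.
    """
    content = (record.get("content") or "").strip()
    if not content:
        raise ValueError("record has no content")
    row = {
        "text": content,
        # Keep provenance so each sample can be traced back to its article.
        "meta": {key: record.get(key) for key in ("title", "author", "date", "url")},
    }
    # json.dumps escapes newlines, so the result always fits on one line.
    return json.dumps(row, ensure_ascii=False)
```

Writing the returned strings to a file, one per line, gives a dataset in which every sample keeps a link to its source article.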
Use Cases
- LLM Fine-Tuning: Build domain-specific training datasets from expert writers
- RAG Pipelines: Populate vector databases with high-quality knowledge bases
- Content Analysis: Analyze publication trends, writing styles, and topic coverage
- Research: Systematically collect articles on specific subjects
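For the RAG use case, the `content` field usually has to be split into smaller passages before embedding. The following is a minimal sketch that packs paragraphs into chunks under a character budget; `chunk_content` and its `max_chars` default are hypothetical, and token-based limits or overlapping windows may suit a given embedding model better.

```python
def chunk_content(content: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate  # still fits (or the chunk is just starting)
        else:
            chunks.append(current)  # budget exceeded: close the chunk
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps headings next to the text that follows them when both fit in the budget, which tends to give retrieval results more context than fixed-width slicing.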