Blog Post Scraper for LLM
Pricing
from $50.00 / 1,000 articles scraped
Developer
Extreme Scrapes
Maintained by Community
Blog Post for AI Training Data
Extract blog posts as clean, image-free text optimized for AI/LLM training and fine-tuning. The Actor filters posts by word count and writes a combined JSONL file ready for ML pipelines.
Features
- Image-free output: strips all images to produce pure text training data
- Word count filtering: skips short posts that don't meet your quality threshold
- JSONL output: a combined output file ready for fine-tuning pipelines
- Batch processing: extracts hundreds of blog posts in a single run
- Dataset + KV store: results are written to both the Apify dataset and a single JSONL file in the Key-Value store
How It Works
- Provide blog post URLs and set a minimum word count threshold.
- The Actor fetches each post and strips all images.
- Posts below the word count threshold are skipped.
- Valid posts are stored in the dataset AND as a combined JSONL file in the Key-Value store.
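The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Actor's actual implementation: it assumes posts are already fetched as Markdown, that images appear as Markdown `![alt](src)` or HTML `<img>` tags, and that words are counted by splitting on whitespace. The helper names (`strip_images`, `build_records`, `to_jsonl`) are hypothetical.

```python
import json
import re

# Assumed image syntaxes: Markdown ![alt](src) and HTML <img ...> tags.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown: str) -> str:
    """Remove Markdown and HTML image references from the text."""
    return HTML_IMAGE.sub("", MD_IMAGE.sub("", markdown))


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def build_records(posts, min_word_count):
    """Turn (url, markdown) pairs into records, skipping short posts."""
    records = []
    for url, markdown in posts:
        clean = strip_images(markdown)
        count = word_count(clean)
        if count < min_word_count:
            continue  # below the threshold: skipped
        records.append({"url": url, "wordCount": count, "markdown": clean})
    return records


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl(build_records(posts, 200))` would yield the combined text that corresponds to the single JSONL file described above, with record fields matching the dataset record shape shown in the Output section.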
Input
{
  "startUrls": [
    { "url": "https://lilianweng.github.io/posts/2023-06-23-agent/" },
    { "url": "https://blog.google/technology/ai/google-gemini-ai/" }
  ],
  "minWordCount": 200
}
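If the URL list is generated programmatically, the input object can be built from a plain list of URLs. The sketch below only assembles the JSON shown above; `build_input` is a hypothetical helper, and the default of 200 is taken from the example rather than from any documented default.

```python
import json


def build_input(urls, min_word_count=200):
    """Build an input object with the startUrls / minWordCount shape shown above."""
    if min_word_count < 0:
        raise ValueError("min_word_count must be non-negative")
    return {
        "startUrls": [{"url": u} for u in urls],
        "minWordCount": min_word_count,
    }


actor_input = build_input(
    [
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://blog.google/technology/ai/google-gemini-ai/",
    ]
)
print(json.dumps(actor_input, indent=2))
```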
Output
Dataset record:
{
  "url": "https://lilianweng.github.io/posts/2023-06-23-agent/",
  "wordCount": 8542,
  "markdown": "# LLM Powered Autonomous Agents\n\nContent..."
}
Key-Value store: A single OUTPUT file in JSONL format containing all records.
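Once the OUTPUT file has been downloaded, it can be read line by line. The sketch below assumes the file content is already available as a string and that each non-empty line is one JSON record with the `url`, `wordCount`, and `markdown` fields shown above. `parse_jsonl` and `total_words` are hypothetical helpers.

```python
import json


def parse_jsonl(text: str):
    """Parse JSONL text into a list of dicts, ignoring blank lines."""
    records = []
    for line in text.split("\n"):
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def total_words(records) -> int:
    """Sum the wordCount field across all records."""
    return sum(r["wordCount"] for r in records)
```

For example, `total_words(parse_jsonl(text))` gives a quick size estimate of the corpus before it is passed to a fine-tuning pipeline.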
Use Cases
- Build fine-tuning datasets for LLMs
- Create training corpora from technical blogs
- Collect AI/ML research blog content
- Generate evaluation datasets
Keywords
blog scraper, AI training data, LLM dataset, fine-tuning data, blog to jsonl, training corpus
Pricing
$50 per 1,000 blog post extractions.
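As a rough budgeting aid, the listed rate works out to $0.05 per extracted post. The sketch below assumes a flat rate with no minimum charge and that only extracted posts are billed; whether posts skipped by the word count filter are billed is not stated here. `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(posts_extracted: int, price_per_thousand: float = 50.0) -> float:
    """Estimate cost in USD at a flat per-1,000 rate (assumed: no minimum charge)."""
    if posts_extracted < 0:
        raise ValueError("posts_extracted must be non-negative")
    return posts_extracted * price_per_thousand / 1000


print(estimate_cost_usd(250))  # 250 posts at $50 per 1,000 -> 12.5
```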