Blog Post Scraper for LLM

Pricing

from $50.00 / 1,000 articles scraped

Rating

0.0 (0)

Developer

Extreme Scrapes

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: a day ago

Blog Post for AI Training Data

Extract blog posts as clean, image-free text optimized for AI/LLM training and fine-tuning. Filters by word count and outputs combined JSONL format ready for ML pipelines.

Features

  • Image-free output: strips all images, leaving pure text for training data
  • Word count filtering: skips short posts that fall below your quality threshold
  • JSONL output: a combined output file ready for fine-tuning pipelines
  • Batch processing: extracts hundreds of blog posts in a single run
  • Dataset + KV store: results are available both in the Apify dataset and as a single JSONL file

How It Works

  1. Provide blog post URLs and set a minimum word count threshold.
  2. The Actor fetches each post and strips all images.
  3. Posts below the word count threshold are skipped.
  4. Valid posts are stored in the dataset AND as a combined JSONL file in the Key-Value store.
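
The steps above can be sketched locally in Python. This is an illustrative approximation, not the Actor's actual code: it assumes posts are already available as Markdown strings, that images appear as Markdown image syntax or HTML img tags, and that a "word" is a whitespace-separated token. The helper names (strip_images, count_words, build_records, to_jsonl) are hypothetical.

```python
import json
import re

# Assumption: images appear as Markdown ![alt](url) or raw HTML <img ...> tags.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)|<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown: str) -> str:
    """Remove image references and leave the surrounding text untouched."""
    return IMAGE_PATTERN.sub("", markdown)


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def build_records(posts: dict, min_word_count: int) -> list:
    """Turn {url: markdown} into records, skipping posts below the threshold."""
    records = []
    for url, markdown in posts.items():
        cleaned = strip_images(markdown)
        words = count_words(cleaned)
        if words < min_word_count:
            continue  # step 3: too short, skip
        records.append({"url": url, "wordCount": words, "markdown": cleaned})
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

The record keys mirror the dataset record shown under Output; everything else here is a stand-in for whatever the Actor does internally.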

Input

{
  "startUrls": [
    { "url": "https://lilianweng.github.io/posts/2023-06-23-agent/" },
    { "url": "https://blog.google/technology/ai/google-gemini-ai/" }
  ],
  "minWordCount": 200
}
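
When the URL list comes from another script, the input object can be assembled programmatically. The following is a minimal sketch: build_actor_input is a hypothetical helper, the default of 200 is simply taken from the example above rather than from a documented default, and the URL validation is an added convenience, not something the Actor is known to require.

```python
import json
from urllib.parse import urlparse


def build_actor_input(urls, min_word_count=200):
    """Build an input object with the startUrls / minWordCount shape shown above."""
    if min_word_count < 0:
        raise ValueError("min_word_count must be non-negative")
    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Reject anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        start_urls.append({"url": url})
    return {"startUrls": start_urls, "minWordCount": min_word_count}


if __name__ == "__main__":
    actor_input = build_actor_input(
        ["https://lilianweng.github.io/posts/2023-06-23-agent/"]
    )
    print(json.dumps(actor_input, indent=2))
```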

Output

Dataset record:

{
  "url": "https://lilianweng.github.io/posts/2023-06-23-agent/",
  "wordCount": 8542,
  "markdown": "# LLM Powered Autonomous Agents\n\nContent..."
}

Key-Value store: A single OUTPUT file in JSONL format containing all records.
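
Once the OUTPUT file has been downloaded, it can be read back one record per line. The sketch below assumes the file content is already in memory as a string and that each non-empty line is a JSON object with the url, wordCount, and markdown keys shown in the dataset record; parse_jsonl and total_words are hypothetical helper names.

```python
import json


def parse_jsonl(text: str) -> list:
    """Parse JSONL text into a list of dicts, ignoring blank lines."""
    records = []
    # Split on "\n" only: JSON escapes newlines inside strings, so every
    # physical line is expected to hold one complete record.
    for line in text.split("\n"):
        if not line.strip():
            continue
        records.append(json.loads(line))
    return records


def total_words(records: list) -> int:
    """Sum the wordCount field across all records."""
    return sum(record["wordCount"] for record in records)
```

A total word count like this gives a quick size estimate for the resulting corpus before it is fed into a fine-tuning pipeline.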

Use Cases

  • Build fine-tuning datasets for LLMs
  • Create training corpora from technical blogs
  • Collect AI/ML research blog content
  • Generate evaluation datasets

Keywords

blog scraper, AI training data, LLM dataset, fine-tuning data, blog to jsonl, training corpus

Pricing

$50 per 1,000 blog post extractions.