# AI Training Dataset Builder: Articles, Blogs & Web Pages
Turn any list of URLs into clean, structured training data for AI models, RAG pipelines, and LLM fine-tuning. Built for ML engineers, AI researchers, and dataset teams who need reliable web content at scale without writing custom scrapers for every site.
Pass in URLs. Get back clean JSON with title, author, publish date, body text, language, and word count. Pay only for pages that succeed.
## Who this is for
- AI / ML engineers building training corpora for LLMs and small language models
- RAG developers populating vector stores with fresh, structured content
- Dataset curators assembling fine-tuning sets from public web sources
- Content intelligence teams monitoring articles, blogs, and editorial pages
- Researchers harvesting public web pages for analysis at scale
If you currently maintain hand-rolled scrapers per site, this replaces all of them with one tool.
## What you get per URL

```json
{
  "url": "https://example.com/article",
  "title": "How Retrieval Augmented Generation Works",
  "description": "A practical guide to RAG architectures.",
  "author": "Jane Doe",
  "publishedAt": "2026-04-12T08:30:00Z",
  "language": "en",
  "wordCount": 1842,
  "text": "Retrieval augmented generation combines a retriever with a generator...",
  "scrapedAt": "2026-05-01T14:02:11Z"
}
```
Every field is normalized. Empty pages and thin content (under 50 words by default) are skipped automatically so your dataset stays clean.
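The skip rule is simple to reason about. A minimal sketch of it (not the actor's internal code; `min_words` mirrors the `minWordCount` input):

```python
def is_thin(text: str, min_words: int = 50) -> bool:
    """Default thin-content rule: pages under 50 words are skipped, not billed."""
    return len(text.split()) < min_words
```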
## How it works

```mermaid
flowchart LR
    A[Input: list of URLs] --> B[Headless Chromium]
    B --> C[Extract metadata + main text]
    C --> D{Word count above threshold?}
    D -- yes --> E[Push to dataset]
    D -- no --> F[Skip]
    E --> G[Charge per page]
```
Behind the scenes: Playwright renders the page (handles JS-heavy sites), the extractor pulls semantic HTML (`article`, `main`, `[role=main]`), and the dataset emits one JSON item per successful URL. No DOM tweaking, no per-site config.
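A minimal sketch of that extraction approach, assuming the Playwright sync API and the selector list above (illustrative only; the actor's internal code may differ):

```python
from playwright.sync_api import sync_playwright

def extract_main_text(url: str) -> str | None:
    """Render a page in headless Chromium and return the main content text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Try semantic containers in order of specificity.
        node = (page.query_selector("article")
                or page.query_selector("main")
                or page.query_selector("[role=main]"))
        text = node.inner_text() if node else None
        browser.close()
        return text
```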
## Quick start

### Run from the Apify Console

1. Click **Try for free**.
2. Paste your URLs.
3. Click **Start**.
4. Download the dataset as JSON, CSV, or Excel, or stream it into your pipeline.
### Run from the API

```bash
curl -X POST "https://api.apify.com/v2/acts/Turboextract~ai-training-dataset-builder/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      { "url": "https://blog.apify.com/web-scraping-vs-web-crawling/" },
      { "url": "https://example.com/article-2" }
    ],
    "maxPages": 100,
    "minWordCount": 50,
    "includeImages": false
  }'
```
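The run response includes a `defaultDatasetId`; once the run finishes, you can pull items straight from Apify's standard dataset endpoint (`DATASET_ID` below is a placeholder for that value, and `format` also accepts `csv` or `xlsx`):

```bash
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&token=YOUR_TOKEN"
```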
### Run from Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

run = client.actor("Turboextract/ai-training-dataset-builder").call(
    run_input={
        "startUrls": [{"url": "https://example.com/post"}],
        "maxPages": 500,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["wordCount"])
```
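Note that `call()` blocks until the run finishes, so the loop iterates a completed dataset. For long runs, the client also offers a non-blocking `start()` so you can poll for completion or attach a webhook instead.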
## Input fields

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | URLs to process |
| `maxPages` | integer | 100 | Safety cap per run |
| `includeImages` | boolean | false | Attach image URLs from the article body |
| `minWordCount` | integer | 50 | Skip pages below this word count |
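Put together, a complete input object (using the defaults from the table above) looks like this:

```json
{
  "startUrls": [{ "url": "https://example.com/article" }],
  "maxPages": 100,
  "includeImages": false,
  "minWordCount": 50
}
```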
## Pricing
Pay per page processed. No subscriptions.
| Volume | Price per page | Example total |
|---|---|---|
| First 50 pages (free tier) | $0.00 | $0.00 |
| Each page after that | $0.005 | 1,000 pages = $5 |
| 10,000 pages | $0.005 | $50 |
## How it compares
| Tool | Pricing model | 1,000 pages |
|---|---|---|
| AI Training Dataset Builder | $0.005 per page | $5 |
| Apify Web Content Crawler | Per result + compute | $7 to $15 |
| Diffbot Article API | $299 per month base | $300+ |
| Custom in-house scraper | Engineer time | $500+ build cost |
You only pay for pages that return clean content. Thin, blocked, or failed pages cost nothing.
## Common use cases

- LLM fine-tuning datasets from public blogs, documentation sites, and editorial archives
- RAG knowledge bases populated from a curated URL list, refreshed on a schedule (see the ingestion sketch after this list)
- Competitive content audits comparing publish cadence and word count across competitors
- Academic and journalistic research assembling source corpora across many domains
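A minimal sketch of that RAG ingestion step, assuming `apify_client` and a plain JSONL hand-off to whatever embedding and vector-store stage comes next (the chunk size, overlap, and output filename are illustrative choices, not actor parameters):

```python
import json
from apify_client import ApifyClient

CHUNK_WORDS = 300  # illustrative chunk size for embedding
OVERLAP = 50       # illustrative overlap between consecutive chunks

def chunk(text: str):
    """Yield overlapping word-window chunks of the article body."""
    words = text.split()
    step = CHUNK_WORDS - OVERLAP
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + CHUNK_WORDS])

client = ApifyClient("YOUR_TOKEN")
run = client.actor("Turboextract/ai-training-dataset-builder").call(
    run_input={"startUrls": [{"url": "https://example.com/post"}]}
)

# One JSONL record per chunk, ready for an embedding + upsert stage.
with open("rag_chunks.jsonl", "w") as f:
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        for i, piece in enumerate(chunk(item["text"])):
            f.write(json.dumps({
                "id": f'{item["url"]}#{i}',
                "title": item["title"],
                "text": piece,
            }) + "\n")
```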
## Tips for best results

- Start with 10 to 20 URLs to verify extraction quality on your target sites
- Set `minWordCount` higher (200 to 500) if you only want long-form content
- Use `maxPages` as a hard safety cap on every run
- Schedule the actor weekly to keep your training data fresh
## Pairs well with
- Reddit Brand Monitor & Lead Finder — pair article harvesting with social signals
- Website Lead Extractor — turn the same URL list into a B2B contact dataset
- Lead Enrichment Pipeline — chain extractors together for multi-source enrichment
(Links updated as related actors ship.)
## FAQ
**Does it handle JavaScript-rendered pages?** Yes. The actor uses headless Chromium via Playwright, so SPAs and JS-heavy sites work the same as static HTML.

**What about paywalls and login walls?** The actor reads what an unauthenticated browser sees. Paywalled content is not bypassed.

**How is this different from a generic web scraper?** Output is normalized for AI use cases: cleaned body text (not raw HTML), word count, language, and metadata. You can pipe it straight into a vector store or training pipeline.

**Can I run this on a schedule?** Yes. Apify's built-in scheduler runs the actor on any cron expression. Pair it with a webhook to ship new items to your store of choice.

**What if a page fails?** Failed pages are logged and skipped. You are not charged for failures.
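For the webhook route, a minimal sketch using Apify's webhooks API (the `requestUrl` endpoint is a placeholder you would host yourself; `ACTOR_ID` is this actor's ID from the Console):

```bash
curl -X POST "https://api.apify.com/v2/webhooks?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
    "condition": { "actorId": "ACTOR_ID" },
    "requestUrl": "https://example.com/ingest-hook"
  }'
```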
## Support

Open an issue on the actor's Apify page or message the maintainer. Bug reports that include the failing URL get the fastest turnaround.
Built and maintained by Turboextract on the Apify platform.