# AI Training Dataset Builder: Articles, Blogs & Web Pages (`turboextract/ai-training-dataset-builder`) Actor

Turn any list of URLs into clean, structured training data for AI models, RAG systems, and LLM fine-tuning. Built for ML engineers and AI teams.

- **URL**: https://apify.com/turboextract/ai-training-dataset-builder.md
- **Developed by:** [Moses Ndambuki](https://apify.com/turboextract) (community)
- **Categories:** AI, Other
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Training Dataset Builder: Articles, Blogs & Web Pages

**Turn any list of URLs into clean, structured training data for AI models, RAG pipelines, and LLM fine-tuning.** Built for ML engineers, AI researchers, and dataset teams who need reliable web content at scale without writing custom scrapers for every site.

Pass in URLs. Get back clean JSON with title, author, publish date, body text, language, and word count. Pay only for pages that succeed.

---

### Who this is for

- **AI / ML engineers** building training corpora for LLMs and small language models
- **RAG developers** populating vector stores with fresh, structured content
- **Dataset curators** assembling fine-tuning sets from public web sources
- **Content intelligence teams** monitoring articles, blogs, and editorial pages
- **Researchers** harvesting public web pages for analysis at scale

If you currently maintain hand-rolled scrapers per site, this replaces all of them with one tool.

---

### What you get per URL

```json
{
  "url": "https://example.com/article",
  "title": "How Retrieval Augmented Generation Works",
  "description": "A practical guide to RAG architectures.",
  "author": "Jane Doe",
  "publishedAt": "2026-04-12T08:30:00Z",
  "language": "en",
  "wordCount": 1842,
  "text": "Retrieval augmented generation combines a retriever with a generator...",
  "scrapedAt": "2026-05-01T14:02:11Z"
}
````

Every field is normalized. Empty pages and thin content (under 50 words by default) are skipped automatically so your dataset stays clean.
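
The thin-content rule can be reproduced downstream, for example to re-filter an existing dataset with a stricter threshold. The sketch below is a minimal illustration: `count_words` and `keep_page` are hypothetical helper names, and splitting on whitespace is an assumption, since the Actor's own word-counting method is not documented here.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (assumed tokenization, not the Actor's)."""
    return len(text.split())


def keep_page(text: str, min_word_count: int = 50) -> bool:
    """Keep a page only if it has at least `min_word_count` words.

    Mirrors the documented rule: pages with fewer words than the
    threshold are skipped.
    """
    return count_words(text) >= min_word_count


if __name__ == "__main__":
    sample = "Retrieval augmented generation combines a retriever with a generator"
    print(count_words(sample), keep_page(sample))
```

Because the check is `>=`, a page with exactly 50 words is kept under the default threshold.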

***

### How it works

```mermaid
flowchart LR
    A[Input: list of URLs] --> B[Headless Chromium]
    B --> C[Extract metadata + main text]
    C --> D{Word count above threshold?}
    D -- yes --> E[Push to dataset]
    D -- no --> F[Skip]
    E --> G[Charge per page]
```

Behind the scenes: Playwright renders the page (handles JS-heavy sites), the extractor pulls the main content from semantic HTML containers (`article`, `main`, `[role=main]`), and the Actor writes one JSON item per successful URL to the dataset. No DOM tweaking, no per-site config.
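
To make the semantic-container idea concrete, here is a rough standard-library sketch that collects text from the first `<article>`, `<main>`, or `role="main"` element. It is not the Actor's implementation: `MainTextParser` and `extract_main_text` are hypothetical names, the input is assumed to be already-rendered and well-formed HTML (every non-void tag closed), and script and style contents are dropped.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style", "noscript"}


class MainTextParser(HTMLParser):
    """Collects text inside the first article / main / [role=main] element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth inside the container; 0 means outside
        self.skip_depth = 0  # nesting depth inside script/style/noscript
        self.done = False    # True once the first container has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth == 0:
            if tag in ("article", "main") or dict(attrs).get("role") == "main":
                self.depth = 1
            return
        self.depth += 1
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the container's text fragments joined by single spaces."""
    parser = MainTextParser()
    parser.feed(html)
    return " ".join(parser.parts)
```

Pages without any of the three containers yield an empty string in this sketch; how the Actor handles that case is not stated in this README.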

***

### Quick start

#### Run from the Apify Console

1. Click **Try for free**.
2. Paste your URLs.
3. Click **Start**.
4. Download the dataset as JSON, CSV, Excel, or stream it into your pipeline.

#### Run from the API

```bash
curl -X POST "https://api.apify.com/v2/acts/turboextract~ai-training-dataset-builder/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      { "url": "https://blog.apify.com/web-scraping-vs-web-crawling/" },
      { "url": "https://example.com/article-2" }
    ],
    "maxPages": 100,
    "minWordCount": 50,
    "includeImages": false
  }'
```

#### Run from Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("turboextract/ai-training-dataset-builder").call(run_input={
    "startUrls": [{"url": "https://example.com/post"}],
    "maxPages": 500,
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["wordCount"])
```

***

### Input fields

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | URLs to process |
| `maxPages` | integer | 100 | Safety cap per run |
| `includeImages` | boolean | false | Attach image URLs from the article body |
| `minWordCount` | integer | 50 | Skip pages below this word count |

***

### Pricing

**Pay per page processed. No subscriptions.**

| Volume | Price per page | Cost |
|---|---|---|
| First 50 pages (free tier) | $0.000 | $0.00 |
| Each page after the first 50 | $0.005 | $5 per 1,000 pages |
| 10,000 pages | $0.005 | $50 |

#### How it compares

| Tool | Pricing model | 1,000 pages |
|---|---|---|
| **AI Training Dataset Builder** | $0.005 per page | **$5** |
| Apify Web Content Crawler | Per result + compute | $7 to $15 |
| Diffbot Article API | $299 per month base | $300+ |
| Custom in-house scraper | Engineer time | $500+ build cost |

You only pay for pages that return clean content. Thin, blocked, or failed pages cost nothing.
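
For budgeting, the per-page arithmetic can be sketched as follows. The $0.005 rate comes from the table above; `estimate_cost` is a hypothetical helper, and because the README does not state whether the 50 free pages are deducted from larger runs, `free_pages` is left as a parameter that defaults to 0 (which reproduces the table's $5 and $50 figures).

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.005")  # rate quoted in the pricing table


def estimate_cost(billable_pages: int, free_pages: int = 0) -> Decimal:
    """Estimate the charge in USD for pages that returned clean content.

    `billable_pages` should exclude thin, blocked, or failed pages,
    since those are described as costing nothing.
    """
    if billable_pages < 0 or free_pages < 0:
        raise ValueError("page counts must not be negative")
    charged = max(billable_pages - free_pages, 0)
    return charged * PRICE_PER_PAGE


if __name__ == "__main__":
    print(estimate_cost(1000))                 # flat rate, no free allowance
    print(estimate_cost(1000, free_pages=50))  # assuming the free tier is deducted
```

`Decimal` is used instead of floats so that small per-page amounts add up without rounding drift.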

***

### Common use cases

- **LLM fine-tuning datasets** from public blogs, documentation sites, and editorial archives
- **RAG knowledge bases** populated from a curated URL list, refreshed on a schedule
- **Competitive content audits** comparing publish cadence and word count across competitors
- **Academic and journalistic research** assembling source corpora across many domains
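
For the RAG use case, the `text` field usually needs to be split into smaller passages before embedding. The following is one simple way to do that, not something the Actor does itself: `chunk_words` is a hypothetical helper, and the window size and overlap are illustrative values to tune for your embedding model.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Split text into overlapping windows of whitespace-separated words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    demo = " ".join(str(i) for i in range(10))
    print(chunk_words(demo, chunk_size=4, overlap=1))
```

Each chunk can then be stored alongside the item's `url` and `title` so retrieved passages stay traceable to their source.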

***

### Tips for best results

- Start with 10 to 20 URLs to verify extraction quality on your target sites
- Set `minWordCount` higher (200 to 500) if you only want long-form content
- Use `maxPages` as a hard safety cap on every run
- Schedule the Actor weekly to keep your training data fresh

***

### Pairs well with

- **Reddit Brand Monitor & Lead Finder** — pair article harvesting with social signals
- **Website Lead Extractor** — turn the same URL list into a B2B contact dataset
- **Lead Enrichment Pipeline** — chain extractors together for multi-source enrichment

(Links will be added as related Actors ship.)

***

### FAQ

**Does it handle JavaScript-rendered pages?**
Yes. The Actor uses headless Chromium via Playwright, so SPAs and JS-heavy sites work the same as static HTML.

**What about paywalls and login walls?**
The Actor reads what an unauthenticated browser sees. Paywalled content is not bypassed.

**How is this different from a generic web scraper?**
Output is normalized for AI use cases: cleaned body text (not raw HTML), word count, language, and metadata. You can pipe it straight into a vector store or training pipeline.

**Can I run this on a schedule?**
Yes. Apify's built-in scheduler runs the Actor on any cron expression. Pair it with a webhook to ship new items to your store of choice.

**What if a page fails?**
Failed pages are logged and skipped. You are not charged for failures.

***

### Support

Open an issue on the Actor's Apify page or message the maintainer. Bug reports that include the failing URL get the fastest turnaround.

Built and maintained by [Turboextract](https://apify.com/turboextract) on the Apify platform.

# Actor input Schema

## `startUrls` (type: `array`):

URLs of articles, blog posts, or web pages to turn into training data items.

## `maxPages` (type: `integer`):

Safety cap on the number of pages processed in one run.

## `includeImages` (type: `boolean`):

If true, attaches the list of image URLs found inside the article body.

## `minWordCount` (type: `integer`):

Skip pages with fewer words than this. Filters out thin or empty content.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://blog.apify.com/web-scraping-vs-web-crawling/"
    }
  ],
  "maxPages": 100,
  "includeImages": false,
  "minWordCount": 50
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://blog.apify.com/web-scraping-vs-web-crawling/"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("turboextract/ai-training-dataset-builder").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://blog.apify.com/web-scraping-vs-web-crawling/" }] }

# Run the Actor and wait for it to finish
run = client.actor("turboextract/ai-training-dataset-builder").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
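
Training pipelines often expect JSON Lines rather than a list of dicts. As a possible follow-up to the example above, this sketch serializes dataset items to JSONL and optionally keeps only one language. `items_to_jsonl` is a hypothetical helper, and filtering on the `language` field assumes items carry it as shown in the sample output.

```python
import json


def items_to_jsonl(items, language=None) -> str:
    """Serialize items to JSON Lines, optionally keeping a single language."""
    lines = []
    for item in items:
        if language is not None and item.get("language") != language:
            continue
        # ensure_ascii=False keeps non-English text readable in the output file.
        lines.append(json.dumps(item, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    demo_items = [
        {"url": "https://example.com/a", "language": "en", "text": "Hello"},
        {"url": "https://example.com/b", "language": "de", "text": "Hallo"},
    ]
    print(items_to_jsonl(demo_items, language="en"), end="")
```

With the client from the example, `items_to_jsonl(client.dataset(run["defaultDatasetId"]).iterate_items())` would produce the file contents, which can then be written to disk.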

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://blog.apify.com/web-scraping-vs-web-crawling/"
    }
  ]
}' |
apify call turboextract/ai-training-dataset-builder --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=turboextract/ai-training-dataset-builder",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Training Dataset Builder: Articles, Blogs & Web Pages",
        "description": "Turn any list of URLs into clean, structured training data for AI models, RAG systems, and LLM fine-tuning. Built for ML engineers and AI teams.",
        "version": "0.0",
        "x-build-id": "pL9Uwnbm95bssxzcu"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/turboextract~ai-training-dataset-builder/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-turboextract-ai-training-dataset-builder",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/turboextract~ai-training-dataset-builder/runs": {
            "post": {
                "operationId": "runs-sync-turboextract-ai-training-dataset-builder",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/turboextract~ai-training-dataset-builder/run-sync": {
            "post": {
                "operationId": "run-sync-turboextract-ai-training-dataset-builder",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Source URLs",
                        "type": "array",
                        "description": "URLs of articles, blog posts, or web pages to turn into training data items.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxPages": {
                        "title": "Max pages",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Safety cap on the number of pages processed in one run.",
                        "default": 100
                    },
                    "includeImages": {
                        "title": "Include image URLs",
                        "type": "boolean",
                        "description": "If true, attaches the list of image URLs found inside the article body.",
                        "default": false
                    },
                    "minWordCount": {
                        "title": "Min word count",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip pages with fewer words than this. Filters out thin or empty content.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
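
For projects that cannot use the client library, the `run-sync-get-dataset-items` path in the specification above can be called with the standard library alone. The sketch below only builds the request; the base URL, path, `token` query parameter, and JSON body come from the specification, while `build_sync_request` is a hypothetical helper name.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.apify.com/v2"
SYNC_PATH = "/acts/turboextract~ai-training-dataset-builder/run-sync-get-dataset-items"


def build_sync_request(token: str, run_input: dict) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{BASE_URL}{SYNC_PATH}?{urlencode({'token': token})}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_sync_request(
        "<YOUR_API_TOKEN>",
        {"startUrls": [{"url": "https://blog.apify.com/web-scraping-vs-web-crawling/"}]},
    )
    print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` sends it. Since this endpoint waits for the run to complete, long runs may be better served by the asynchronous `/runs` path.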
