AI Training Data Curator

Curate high-quality training datasets for AI/ML models. Extract, clean & format text data from websites, papers & forums. Perfect for LLM training, RAG systems & research.


Scrape and curate high-quality training data for AI/ML models from legally reusable sources: public domain, Creative Commons, and open-access content.

Why This Actor?

Building AI models requires massive amounts of clean training data. This actor helps you:

  • Legally source training data from permissive sources
  • Clean and normalize text for model training
  • Remove duplicates to improve data quality
  • Export in AI-ready formats (JSONL, CSV)
  • Track provenance with full metadata

Supported Sources

| Source | License | Best For |
| --- | --- | --- |
| Wikipedia | CC-BY-SA 4.0 | General knowledge, encyclopedic content |
| arXiv | CC-BY / CC0 | Academic papers, research abstracts |
| Project Gutenberg | Public Domain | Classic literature, books |
| PubMed Central | Open Access | Medical/scientific papers |
| CourtListener | Public Domain | Legal documents, court opinions |
| GovInfo.gov | Public Domain | US government documents |
| Stack Overflow | CC-BY-SA 4.0 | Technical Q&A, programming |
| Wikimedia Commons | Various CC | Image descriptions, captions |
| Common Crawl | Mixed | Web content (verify licenses) |

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataType | string | "articles" | Type: articles, academic, legal, technical, conversational |
| source | string | "wikipedia" | Data source (see table above) |
| topic | string | "" | Topic filter (e.g., "machine learning") |
| language | string | "en" | Target language (en, de, fr, es, ru, zh, ja, any) |
| outputFormat | string | "jsonl" | Output format: jsonl, csv, json |
| maxItems | integer | 1000 | Maximum items to collect |
| minWordCount | integer | 100 | Minimum words per document |
| maxWordCount | integer | 0 | Maximum words (0 = no limit) |
| cleanText | boolean | true | Clean HTML, normalize whitespace |
| removeDuplicates | boolean | true | Filter near-duplicate content |
| includeMetadata | boolean | true | Include source metadata |
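
For programmatic runs, the same parameters can be passed through the official apify-client package. A minimal sketch; the token and actor ID below are placeholders, so use the values from your Apify console and this page:

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "dataType": "articles",
    "source": "wikipedia",
    "topic": "machine learning",
    "language": "en",
    "maxItems": 500,
    "outputFormat": "jsonl",
}

# Actor ID is a placeholder -- replace with the ID shown on this page
run = client.actor("vhub-systems/ai-training-data-curator").call(run_input=run_input)

# Stream the collected items from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["wordCount"])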

Output Format

Each item includes:

{
  "text": "The cleaned content text...",
  "source": "Wikipedia",
  "url": "https://en.wikipedia.org/wiki/...",
  "topic": "machine learning",
  "wordCount": 1523,
  "language": "en",
  "license": "CC-BY-SA 4.0",
  "author": "Various",
  "title": "Article Title",
  "scrapedAt": "2024-01-15T10:30:00.000Z",
  "dataType": "articles"
}
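
A quick way to sanity-check a downloaded JSONL file against this schema is to tally a couple of its fields (field names as shown above):

import json
from collections import Counter

licenses = Counter()
total_words = 0
count = 0
with open('output.jsonl', 'r') as f:
    for line in f:
        item = json.loads(line)
        licenses[item["license"]] += 1
        total_words += item["wordCount"]
        count += 1

print(f"{count} items, {total_words / max(count, 1):.0f} words on average")
print(licenses.most_common())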

Example Usage

Academic Papers (arXiv)

{
  "dataType": "academic",
  "source": "arxiv",
  "topic": "transformer neural networks",
  "maxItems": 5000,
  "outputFormat": "jsonl"
}

Legal Documents (CourtListener)

{
  "dataType": "legal",
  "source": "courtlistener",
  "topic": "intellectual property",
  "maxItems": 1000,
  "minWordCount": 500
}

Technical Q&A

{
  "dataType": "technical",
  "source": "stackoverflow",
  "topic": "python",
  "maxItems": 10000,
  "language": "en"
}

Data Quality Features

Text Cleaning

  • Removes HTML tags and formatting
  • Normalizes whitespace
  • Fixes encoding issues
  • Removes control characters
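
The actor's internal cleaning code isn't published; a comparable pass using only the Python standard library looks like this (a sketch, not the actor's implementation):

import html
import re
import unicodedata

def clean_text(raw: str) -> str:
    text = re.sub(r'<[^>]+>', ' ', raw)        # strip HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = unicodedata.normalize('NFC', text)  # repair common encoding artifacts
    # drop control characters, keeping ordinary whitespace
    text = ''.join(ch for ch in text
                   if ch in '\n\t ' or not unicodedata.category(ch).startswith('C'))
    return re.sub(r'\s+', ' ', text).strip()   # normalize whitespace

print(clean_text('<p>Hello&nbsp;&amp;\x00 world</p>'))  # -> 'Hello & world'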

Deduplication

  • Uses content hashing to detect duplicates
  • Catches near-duplicates (same first 1000 chars)
  • Configurable via removeDuplicates parameter
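
The near-duplicate rule above (hashing the first 1000 characters) is easy to reproduce when post-processing the actor's output; a sketch:

import hashlib
import json

seen, kept, total = set(), [], 0
with open('output.jsonl', 'r') as f:
    for line in f:
        total += 1
        item = json.loads(line)
        # hash only the first 1000 chars so copies with different tails still collide
        key = hashlib.sha256(item["text"][:1000].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(item)

print(f"kept {len(kept)} of {total} items")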

Language Detection

  • Automatic language detection
  • Filters content by target language
  • Supports: en, de, fr, es, ru, zh, ja
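
Which detector the actor uses isn't specified. If you want to re-check languages on your side, the langdetect package produces comparable two-letter labels:

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is probabilistic; seed it for stable labels

def is_language(text: str, lang: str = "en") -> bool:
    # classifying a 500-char sample is this sketch's speed/accuracy trade-off
    return detect(text[:500]) == lang

print(is_language("Transformers are a neural network architecture."))  # True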

Word Count Filtering

  • Set minimum/maximum word counts
  • Filter out too-short snippets
  • Avoid overly long documents
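
The same bounds can also be applied after the fact, mirroring the minWordCount/maxWordCount convention from the parameter table:

def within_bounds(item: dict, min_words: int = 100, max_words: int = 0) -> bool:
    # max_words == 0 means "no upper limit", as in the actor's input schema
    n = item["wordCount"]
    return n >= min_words and (max_words == 0 or n <= max_words)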

Licensing

All supported sources are either:

  • Public Domain - No copyright restrictions
  • Creative Commons - Permissive reuse licenses (CC-BY, CC-BY-SA, CC0)
  • Open Access - Explicitly allow reuse

Always verify the specific license for your use case. Some sources (like Common Crawl) contain mixed-license content.
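
One practical way to enforce this is an allow-list over the license field in the output. A sketch; the license strings here are examples taken from the tables above, so adjust them to what your run actually returns and to your own legal requirements:

import json

ALLOWED = {"Public Domain", "CC0", "CC-BY", "CC-BY-SA 4.0", "Open Access"}

with open('output.jsonl') as f, open('vetted.jsonl', 'w') as out:
    for line in f:
        if json.loads(line).get("license") in ALLOWED:
            out.write(line)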

Compliance Tips

  1. Attribution: CC-BY licenses require attribution
  2. Share-Alike: CC-BY-SA requires that derivative works use the same license
  3. Documentation: Keep provenance metadata for audit trails
  4. Regional Laws: Check local data protection regulations (GDPR, etc.)
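
For tip 1, the provenance metadata shown earlier is enough to generate an attribution file automatically; a sketch:

import json

with open('output.jsonl') as f, open('ATTRIBUTION.txt', 'w') as out:
    for line in f:
        item = json.loads(line)
        if item["license"].startswith("CC-BY"):
            out.write(f'"{item["title"]}" by {item["author"]} '
                      f'({item["url"]}), licensed {item["license"]}\n')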

Rate Limits

The actor respects rate limits for each source:

  • arXiv: 3-second delays between requests
  • PubMed: 1-second delays
  • Stack Exchange: Built-in throttling
  • Wikipedia: Concurrent request limits
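
If you build your own follow-up fetchers against the same sources, a per-source delay table reproduces the arXiv and PubMed behavior above; the 1-second default for other sources is this sketch's assumption, not the actor's rule:

import time
import requests

SOURCE_DELAYS = {"arxiv": 3.0, "pubmed": 1.0}  # seconds, per the list above

def polite_get(session: requests.Session, url: str, source: str) -> requests.Response:
    time.sleep(SOURCE_DELAYS.get(source, 1.0))  # assumed default for unlisted sources
    return session.get(url, timeout=30)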

Tips for Best Results

  1. Be Specific: Use topic parameter to focus on relevant content
  2. Set Word Limits: Filter out short/long content with word count params
  3. Use JSONL: Best format for streaming into ML pipelines
  4. Enable Metadata: Useful for filtering and analysis later
  5. Start Small: Test with 100-1000 items before large runs

Integration Examples

Python (Hugging Face Datasets)

from datasets import Dataset
import json

# Load the JSONL output produced by the actor
data = []
with open('output.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

dataset = Dataset.from_list(data)
dataset.push_to_hub("my-training-data")

OpenAI Fine-tuning

# Convert to OpenAI chat fine-tuning format
with open('openai_training.jsonl', 'w') as out:
    for item in data:
        out.write(json.dumps({
            "messages": [
                {"role": "user", "content": item['title'] or "Summarize:"},
                {"role": "assistant", "content": item['text']}
            ]
        }) + '\n')

Changelog

v1.0.0

  • Initial release
  • Support for 10+ data sources
  • Text cleaning and deduplication
  • Multiple output formats
  • Language detection and filtering

Support

For issues or feature requests, please open an issue on the actor's GitHub repository or contact the author.

License

MIT License - Free for commercial and non-commercial use.