Under maintenance

Pricing

Pay per usage

URL to LLM Dataset

Rating

0.0

(0)

Developer

Donny Nguyen

Last modified

2 days ago

What does it do?

URL to LLM Dataset crawls websites starting from the URLs you provide and produces a structured dataset optimized for LLM fine-tuning and AI training. For each page, it extracts the clean text content, title, metadata, word count, language, and heading structure. Pages below a configurable word count threshold are automatically filtered out, ensuring only substantive content makes it into your training data.
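
The word count filter described above can be sketched in a few lines of Python. This is an illustration only: how the actor counts words is not stated here, so splitting on whitespace is an assumption, and the helper names `count_words` and `passes_min_word_count` are hypothetical.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of a 'word')."""
    return len(text.split())


def passes_min_word_count(text: str, min_word_count: int = 100) -> bool:
    """Return True when the text meets the threshold (default mirrors minWordCount)."""
    return count_words(text) >= min_word_count


# Two toy pages: an error page and a page with 150 words.
pages = {
    "thin": "Page not found",
    "substantive": "word " * 150,
}
kept = [name for name, text in pages.items() if passes_min_word_count(text)]
print(kept)  # ['substantive']
```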

Why use this actor?

Building high-quality training datasets for LLMs requires processing web content at scale. This actor automates the entire pipeline: crawling, cleaning, structuring, and filtering web content into a format ready for fine-tuning or RAG (Retrieval-Augmented Generation) systems. It handles the tedious work of stripping navigation, ads, and boilerplate, leaving you with clean, structured text data.

How to use it

1. Go to the actor's page on the Apify platform.
2. Click Start to open the input configuration.
3. Enter one or more starting URLs to crawl.
4. Set the maximum pages per URL and minimum word count.
5. Click Start and wait for the results.
6. Download your dataset in JSON format for direct use in AI pipelines.

The actor follows same-domain links from your starting URLs, building a comprehensive dataset from each site.
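
To make "same-domain" concrete, here is a minimal sketch of how links found on a page could be narrowed to the starting site. It assumes that "same domain" means an identical hostname, which may differ from the actor's real rule, and the function name `same_domain_links` is hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep unique http(s) links on the same hostname."""
    base_host = urlparse(page_url).hostname
    kept: list[str] = []
    for href in hrefs:
        # Turn relative links into absolute ones and drop any "#fragment".
        absolute = urldefrag(urljoin(page_url, href)).url
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.hostname == base_host and absolute not in kept:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://docs.apify.com/platform/",
    [
        "/platform/actors",         # same host, kept
        "https://apify.com/store",  # different host, dropped
        "actors#intro",             # resolves to the first link, deduplicated
        "/academy",                 # same host, kept
        "mailto:hi@example.com",    # not http(s), dropped
    ],
)
print(links)
```

With the sample input, only the two `docs.apify.com` pages remain.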

Input configuration

Field	Type	Description	Default
urls	array	Starting URLs to crawl	["https://docs.apify.com"]
maxPagesPerUrl	integer	Max pages per starting URL	50
minWordCount	integer	Skip pages with fewer words	100
proxyConfiguration	object	Proxy settings	Apify Proxy
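
If you start runs from code instead of the web console, the fields above map to a plain input object. The sketch below builds that object with the documented defaults and shows one possible way to start a run with the `apify-client` Python package. The actor ID is not given in this document, so it is left as a parameter. The proxy object `{"useApifyProxy": True}` follows the usual Apify proxy input convention and is an assumption here, and the helper names are hypothetical.

```python
from typing import Any, Optional


def build_run_input(
    urls: Optional[list[str]] = None,
    max_pages_per_url: int = 50,
    min_word_count: int = 100,
) -> dict[str, Any]:
    """Build the actor input, falling back to the defaults from the table above."""
    start_urls = list(urls) if urls is not None else ["https://docs.apify.com"]
    if not start_urls:
        raise ValueError("urls must contain at least one starting URL")
    if max_pages_per_url < 1:
        raise ValueError("max_pages_per_url must be at least 1")
    if min_word_count < 0:
        raise ValueError("min_word_count must not be negative")
    return {
        "urls": start_urls,
        "maxPagesPerUrl": max_pages_per_url,
        "minWordCount": min_word_count,
        # Assumed shape of the "Apify Proxy" default.
        "proxyConfiguration": {"useApifyProxy": True},
    }


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run and return its dataset items (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily so the sketch loads without it

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(urls=["https://docs.apify.com"], max_pages_per_url=20)
print(run_input)
```

`run_actor` is not called here because it needs a real API token and the actor's ID.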

Output data

Each item in the dataset contains:

{
    "url": "https://docs.apify.com/platform/actors",
    "title": "Actors | Apify Documentation",
    "text": "Actors are serverless cloud programs that can do almost anything...",
    "wordCount": 1250,
    "charCount": 7890,
    "description": "Learn about Apify Actors",
    "language": "en",
    "headings": ["Actors", "Getting started", "Configuration"],
    "headingCount": 8,
    "baseUrl": "https://docs.apify.com",
    "scrapedAt": "2026-02-19T14:30:00.000Z"
}
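
As a rough illustration of using this output downstream, the sketch below converts dataset items into JSON Lines records. The target layout (`text` plus a small `meta` object) is a hypothetical choice, not a format required by the actor or by any particular fine-tuning framework.

```python
import json
from typing import Any, Iterable

# One item shaped like the example above (text shortened).
sample_item: dict[str, Any] = {
    "url": "https://docs.apify.com/platform/actors",
    "title": "Actors | Apify Documentation",
    "text": "Actors are serverless cloud programs that can do almost anything...",
    "wordCount": 1250,
    "language": "en",
    "headings": ["Actors", "Getting started", "Configuration"],
}


def to_training_record(item: dict[str, Any]) -> dict[str, Any]:
    """Keep the page text and a little provenance metadata."""
    return {
        "text": item["text"],
        "meta": {
            "source": item["url"],
            "title": item.get("title", ""),
            "language": item.get("language", ""),
        },
    }


def to_jsonl(items: Iterable[dict[str, Any]]) -> str:
    """Serialize items as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_training_record(item), ensure_ascii=False) for item in items
    )


jsonl = to_jsonl([sample_item])
print(jsonl)
```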

Cost of usage

This actor uses CheerioCrawler for fast, efficient crawling. A typical run crawling 50 pages from one site takes about 1-3 minutes and costs approximately $0.02-0.05 in platform credits. The actor is priced at $0.75 per 1,000 results with pay-per-event pricing, reflecting the high value of structured AI training data.
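
The per-result price can be turned into a quick estimate. This sketch assumes one result per retained page and covers only the $0.75 per 1,000 results charge. How that charge relates to the platform-credit figure above is not spelled out here, so the two are not combined.

```python
PRICE_PER_1000_RESULTS_USD = 0.75  # from the pricing stated above


def estimate_result_cost(result_count: int) -> float:
    """Estimate the per-result charge in USD for a given number of dataset items."""
    if result_count < 0:
        raise ValueError("result_count must not be negative")
    return result_count * PRICE_PER_1000_RESULTS_USD / 1000


print(f"${estimate_result_cost(50):.4f}")      # $0.0375
print(f"${estimate_result_cost(10_000):.2f}")  # $7.50
```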

Tips

Set minWordCount to 100 or higher to filter out thin pages like redirects, error pages, and navigation pages.
Use maxPagesPerUrl to control crawl depth and keep costs predictable.
The text field contains up to 50,000 characters per page to accommodate long-form content.
The headings array provides document structure information useful for chunking strategies in RAG.
Export as JSON for direct compatibility with popular LLM fine-tuning frameworks.
Combine datasets from multiple domains to build diverse training corpora.
Schedule runs to keep your training data fresh with the latest website content.
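
As one possible take on the chunking tip, the sketch below splits a page's `text` at the strings listed in `headings`. It assumes each heading appears verbatim in the text and in document order, which this page does not guarantee. Headings that cannot be found are skipped, and `chunk_by_headings` is a hypothetical helper name.

```python
def chunk_by_headings(text: str, headings: list[str]) -> list[dict[str, str]]:
    """Split text into chunks that each begin at a heading found verbatim in the text."""
    # Locate each heading in order, searching forward from the previous match.
    positions: list[tuple[int, str]] = []
    cursor = 0
    for heading in headings:
        index = text.find(heading, cursor)
        if index != -1:
            positions.append((index, heading))
            cursor = index + len(heading)

    if not positions:
        stripped = text.strip()
        return [{"heading": "", "content": stripped}] if stripped else []

    chunks: list[dict[str, str]] = []
    # Any text before the first heading becomes an untitled preamble chunk.
    preamble = text[: positions[0][0]].strip()
    if preamble:
        chunks.append({"heading": "", "content": preamble})

    for i, (start, heading) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        chunks.append({"heading": heading, "content": text[start + len(heading):end].strip()})
    return chunks


page_text = (
    "Actors Actors are serverless programs. "
    "Getting started Create an account. "
    "Configuration Set the input."
)
chunks = chunk_by_headings(page_text, ["Actors", "Getting started", "Configuration"])
for chunk in chunks:
    print(chunk)
```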

Built with Crawlee and Apify SDK. See more scrapers by consummate_mandala on Apify Store.

PDF to Text Extractor

consummate_mandala/pdf-to-text-extractor

Donny Nguyen

LLM Dataset Processor

dusan.vystrcil/llm-dataset-processor

Lets you process the output of other actors or a stored dataset with a single LLM prompt. It's useful if you need to enrich data, summarize content, extract specific information, or manipulate data in a structured way using AI.

Dušan Vystrčil

128

Sitemap to URL Crawler

logiover/sitemap-to-url-crawler

Instantly extract all public URLs from any website's sitemap.xml recursively. Handles nested sitemap indexes automatically. The fastest & cheapest way to build URL lists for RAG pipelines, LLM training, and SEO audits. Zero-config & blazing fast.

Logiover

Opentable Urls Script

hello.datawizards/opentable-urls-script

Opentable-Urls Script extracts rich restaurant data from OpenTable pages, including menus, images, ratings, location, cuisine, and pricing. Ideal for food apps, analytics, travel platforms, and AI datasets, delivering clean, structured JSON output with proxy support.

datawizards

Scrape Website To Llm Dataset — Data, Details & Metadata

tropical_quince/website-to-llm-dataset

Scrape website to llm dataset data at scale with this powerful Apify actor. Extracts data, details & metadata with automatic pagination and proxy rotation. Perfect for market research, competitive intelligence, and data-driven decision making.

Donny Nguyen

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

257

3.8

(3)

Web Scraper 🚀

datascoutapi/web-scraper

Web Scraper Pro extracts clean structured data for LLMs/RAG. Browser-based, 10x faster with anti-detection bypassing Cloudflare/CAPTCHA & proxy rotation. Bulk/recursive crawl 50k URLs at 500 pages/min. JSON/CSV/API, free tier.

halam

Google News Scraper

scrapier/google-news-scraper

Pull fresh news coverage from Google News with reliable scraping. Extract article metadata, summaries, sources, and URLs for trend analysis or reporting workflows. Designed for content teams, researchers, and automation pipelines.

Scrapier

Ai Training Data Enricher

fiery_dream/ai-training-data-enricher

Production-grade data enrichment and validation for LLM training datasets. Automatically clean, enrich, deduplicate, and validate your AI training data before fine-tuning.

Cody Churchwell

RAG Spider - Web to Markdown Crawler for AI Training Data

lenient_grove/RAG-Spider

Enterprise-grade web crawler that converts messy websites into clean, chunked Markdown for AI systems. Uses Mozilla Readability for 95% cleaner extraction than competitors. Outputs RAG-ready data with metadata and token estimates. Perfect for building knowledge bases and training AI chatbots.