URL to LLM Dataset
Pricing
Pay per usage
Developer
Donny Nguyen
What does it do?
URL to LLM Dataset crawls websites starting from the URLs you provide and produces a structured dataset optimized for LLM fine-tuning and AI training. For each page, it extracts the clean text content, title, metadata, word count, language, and heading structure. Pages below a configurable word count threshold are automatically filtered out, ensuring only substantive content makes it into your training data.
Why use this actor?
Building high-quality training datasets for LLMs requires processing web content at scale. This actor automates the entire pipeline: crawling, cleaning, structuring, and filtering web content into a format ready for fine-tuning or RAG (Retrieval-Augmented Generation) systems. It handles the tedious work of stripping navigation, ads, and boilerplate, leaving you with clean, structured text data.
How to use it
- Go to the actor's page on the Apify platform.
- Click Start to open the input configuration.
- Enter one or more starting URLs to crawl.
- Set the maximum pages per URL and minimum word count.
- Click Start and wait for the results.
- Download your dataset in JSON format for direct use in AI pipelines.
The actor follows same-domain links from your starting URLs, building a comprehensive dataset from each site.
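To illustrate what following same-domain links means, here is a minimal Python sketch. It is not the actor's own code: the function name `same_domain_links` is hypothetical, and it assumes "same domain" means an exact hostname match, which the description above does not specify.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep those on the same host.

    Assumption: "same domain" is treated as an exact hostname match.
    """
    host = urlparse(page_url).netloc.lower()
    kept: list[str] = []
    seen: set[str] = set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        # Drop fragments so /a and /a#section count as one page.
        normalized = parsed._replace(fragment="").geturl()
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parsed.netloc.lower() != host or normalized in seen:
            continue  # skip other hosts and duplicates
        seen.add(normalized)
        kept.append(normalized)
    return kept


links = same_domain_links(
    "https://docs.apify.com/platform",
    [
        "/platform/actors",
        "https://docs.apify.com/platform/actors#intro",
        "https://example.com/",
        "mailto:hi@example.com",
    ],
)
print(links)
```

In this sketch, the relative link and its fragment variant collapse into one URL, while the external and `mailto:` links are dropped.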
Input configuration
| Field | Type | Description | Default |
|---|---|---|---|
| urls | array | Starting URLs to crawl | ["https://docs.apify.com"] |
| maxPagesPerUrl | integer | Max pages per starting URL | 50 |
| minWordCount | integer | Skip pages with fewer words | 100 |
| proxyConfiguration | object | Proxy settings | Apify Proxy |
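As a rough illustration, the input fields above could be assembled in Python like this. The helper `build_input` is hypothetical, and the `{"useApifyProxy": True}` shape for the "Apify Proxy" default is an assumption based on the usual Apify proxy input format, not something this page confirms.

```python
import copy
import json

# Defaults copied from the input table above.
DEFAULT_INPUT = {
    "urls": ["https://docs.apify.com"],
    "maxPagesPerUrl": 50,
    "minWordCount": 100,
    # Assumption: the usual Apify proxy input shape for "Apify Proxy".
    "proxyConfiguration": {"useApifyProxy": True},
}


def build_input(urls, max_pages_per_url=50, min_word_count=100):
    """Return an input dict for the actor, with basic sanity checks."""
    if not urls:
        raise ValueError("at least one starting URL is required")
    if max_pages_per_url < 1:
        raise ValueError("maxPagesPerUrl must be at least 1")
    if min_word_count < 0:
        raise ValueError("minWordCount must not be negative")
    run_input = copy.deepcopy(DEFAULT_INPUT)
    run_input["urls"] = list(urls)
    run_input["maxPagesPerUrl"] = max_pages_per_url
    run_input["minWordCount"] = min_word_count
    return run_input


run_input = build_input(["https://docs.apify.com"], max_pages_per_url=20)
print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the input editor on the platform, or a dict like this can be passed as the run input when starting the actor through an API client.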
Output data
Each item in the dataset contains:
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors | Apify Documentation",
  "text": "Actors are serverless cloud programs that can do almost anything...",
  "wordCount": 1250,
  "charCount": 7890,
  "description": "Learn about Apify Actors",
  "language": "en",
  "headings": ["Actors", "Getting started", "Configuration"],
  "headingCount": 8,
  "baseUrl": "https://docs.apify.com",
  "scrapedAt": "2026-02-19T14:30:00.000Z"
}
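One way to prepare items like this for a training pipeline is to convert them to JSON Lines. The sketch below assumes a simple `{"text", "source", "title", "language"}` record shape. That shape and the helper name `to_jsonl_lines` are illustrative choices, since fine-tuning frameworks differ in the format they expect.

```python
import json


def to_jsonl_lines(items, min_word_count=0):
    """Turn dataset items into JSONL strings, skipping empty or short pages."""
    lines = []
    for item in items:
        text = (item.get("text") or "").strip()
        if not text or item.get("wordCount", 0) < min_word_count:
            continue
        record = {
            "text": text,
            "source": item.get("url"),
            "title": item.get("title"),
            "language": item.get("language"),
        }
        # ensure_ascii=False keeps non-English text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Sample items shaped like the output above (the second one is made up).
items = [
    {
        "url": "https://docs.apify.com/platform/actors",
        "title": "Actors | Apify Documentation",
        "text": "Actors are serverless cloud programs that can do almost anything...",
        "wordCount": 1250,
        "language": "en",
    },
    {"url": "https://docs.apify.com/empty", "title": "Empty", "text": "", "wordCount": 0},
]

lines = to_jsonl_lines(items, min_word_count=100)
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a `.jsonl` file gives one record per line.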
Cost of usage
This actor uses CheerioCrawler for fast, efficient crawling. A typical run that crawls 50 pages from one site takes about 1-3 minutes and costs approximately $0.02-0.05 in platform credits. The actor uses pay-per-event pricing at $0.75 per 1,000 results.
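The per-result price can be turned into a quick estimate. This sketch covers only the stated $0.75 per 1,000 results. It assumes each stored page counts as one result, and it leaves out any platform usage that may be billed separately. The function name `estimate_result_cost` is hypothetical.

```python
PRICE_PER_1000_RESULTS_USD = 0.75  # from the pricing stated above


def estimate_result_cost(result_count: int,
                         price_per_1000: float = PRICE_PER_1000_RESULTS_USD) -> float:
    """Estimate the per-result charge in USD for a number of dataset items."""
    if result_count < 0:
        raise ValueError("result_count must not be negative")
    return result_count * price_per_1000 / 1000


for results in (50, 1000, 20000):
    print(f"{results} results: ${estimate_result_cost(results):.4f}")
```

For the 50-page run described above, this gives about $0.04, which falls inside the quoted $0.02-0.05 range.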
Tips
- Set minWordCount to 100 or higher to filter out thin pages like redirects, error pages, and navigation pages.
- Use maxPagesPerUrl to cap the number of pages crawled per starting URL and keep costs predictable.
- The text field contains up to 50,000 characters per page to accommodate long-form content.
- The headings array provides document structure information useful for chunking strategies in RAG.
- Export as JSON for direct compatibility with popular LLM fine-tuning frameworks.
- Combine datasets from multiple domains to build diverse training corpora.
- Schedule runs to keep your training data fresh with the latest website content.
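As a sketch of the heading-based chunking mentioned in the tips, the function below splits a page's text wherever a line matches an entry in the headings array. It assumes headings appear verbatim on their own lines inside the text field, which this page does not confirm. The function name `chunk_by_headings` and the sample text are illustrative.

```python
def chunk_by_headings(text, headings):
    """Split text into chunks, starting a new chunk at each heading line.

    Assumption: each heading appears verbatim on its own line in `text`.
    Headings with no body text under them produce no chunk.
    """
    heading_set = {h.strip() for h in headings}
    chunks = []
    current_heading = None  # None for text that precedes the first heading
    current_lines = []

    def flush():
        if current_lines:
            chunks.append({"heading": current_heading, "text": "\n".join(current_lines)})

    for line in text.splitlines():
        stripped = line.strip()
        if stripped in heading_set:
            flush()
            current_heading = stripped
            current_lines = []
        elif stripped:
            current_lines.append(stripped)
    flush()
    return chunks


sample_text = (
    "Actors\n"
    "Actors are serverless cloud programs.\n"
    "Getting started\n"
    "Create an Actor from a template.\n"
    "Configuration\n"
    "Set the input in the Console.\n"
)
chunks = chunk_by_headings(sample_text, ["Actors", "Getting started", "Configuration"])
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

If the headings do not appear as separate lines in a given page, the whole text comes back as a single chunk with no heading, so a fallback such as fixed-size chunking may still be needed.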
Built with Crawlee and Apify SDK. See more scrapers by consummate_mandala on Apify Store.