Url To Llm Dataset

Under maintenance

Pricing

Pay per usage

Rating: 0.0 (0)

Developer

Donny Nguyen

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

What does it do?

URL to LLM Dataset crawls websites starting from the URLs you provide and produces a structured dataset optimized for LLM fine-tuning and AI training. For each page, it extracts the clean text content, title, metadata, word count, language, and heading structure. Pages below a configurable word count threshold are automatically filtered out, ensuring only substantive content makes it into your training data.
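The word count filter described above can be pictured with a minimal sketch. It assumes words are counted by splitting on whitespace, which may not match the actor's own tokenization; the function names and sample pages are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumption, not the actor's exact rule)."""
    return len(text.split())


def keep_page(text: str, min_word_count: int = 100) -> bool:
    """Keep a page only when it has at least min_word_count words."""
    return word_count(text) >= min_word_count


# Hypothetical pages: one thin error page, one substantive page.
pages = {
    "https://example.com/short": "Page not found",
    "https://example.com/long": "word " * 150,
}

# Only pages that pass the threshold would reach the dataset.
kept = [url for url, text in pages.items() if keep_page(text)]
```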

Why use this actor?

Building high-quality training datasets for LLMs requires processing web content at scale. This actor automates the entire pipeline: crawling, cleaning, structuring, and filtering web content into a format ready for fine-tuning or RAG (Retrieval-Augmented Generation) systems. It handles the tedious work of stripping navigation, ads, and boilerplate, leaving you with clean, structured text data.

How to use it

  1. Go to the actor's page on the Apify platform.
  2. Click Start to open the input configuration.
  3. Enter one or more starting URLs to crawl.
  4. Set the maximum pages per URL and minimum word count.
  5. Click Start and wait for the results.
  6. Download your dataset in JSON format for direct use in AI pipelines.

The actor follows same-domain links from your starting URLs, building a comprehensive dataset from each site.
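As a rough illustration of what "same-domain" link following can mean, the sketch below resolves each link against the page URL and keeps only those on the same host. The exact matching rule the actor applies (for example, how it treats subdomains) is not documented here, so the strict host comparison and the function name are assumptions.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep unique links on the same host."""
    base_host = urlparse(page_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        parsed = urlparse(urljoin(page_url, href))
        # Skip non-web schemes such as mailto: or javascript:.
        if parsed.scheme not in ("http", "https"):
            continue
        # Assumption: "same domain" means an identical host name.
        if parsed.netloc.lower() != base_host:
            continue
        # Drop the fragment so in-page anchors do not create duplicates.
        clean = parsed._replace(fragment="").geturl()
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


links = same_domain_links(
    "https://docs.apify.com/platform",
    [
        "/platform/actors",
        "https://apify.com/store",
        "#top",
        "mailto:hi@example.com",
        "/platform/actors#intro",
    ],
)
```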

Input configuration

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| urls | array | Starting URLs to crawl | ["https://docs.apify.com"] |
| maxPagesPerUrl | integer | Max pages per starting URL | 50 |
| minWordCount | integer | Skip pages with fewer words | 100 |
| proxyConfiguration | object | Proxy settings | Apify Proxy |
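The following sketch shows one way to assemble an input object from the fields and defaults listed above before submitting it to the actor. The `build_input` helper and its validation bounds are hypothetical, and `proxyConfiguration` is deliberately left out so that the actor's own default applies.

```python
import json

# Defaults copied from the input table above.
DEFAULTS = {
    "urls": ["https://docs.apify.com"],
    "maxPagesPerUrl": 50,
    "minWordCount": 100,
}


def build_input(**overrides) -> dict:
    """Merge overrides into the defaults and run basic sanity checks."""
    run_input = {**DEFAULTS, **overrides}
    urls = run_input["urls"]
    if not urls or not all(
        isinstance(u, str) and u.startswith(("http://", "https://")) for u in urls
    ):
        raise ValueError("urls must be a non-empty list of http(s) URLs")
    # Assumed bounds: at least one page per URL, non-negative word threshold.
    if not isinstance(run_input["maxPagesPerUrl"], int) or run_input["maxPagesPerUrl"] < 1:
        raise ValueError("maxPagesPerUrl must be a positive integer")
    if not isinstance(run_input["minWordCount"], int) or run_input["minWordCount"] < 0:
        raise ValueError("minWordCount must be a non-negative integer")
    return run_input


run_input = build_input(urls=["https://crawlee.dev"], maxPagesPerUrl=20)
input_json = json.dumps(run_input, indent=2)  # ready to paste into the input editor
```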

Output data

Each item in the dataset contains:

{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors | Apify Documentation",
  "text": "Actors are serverless cloud programs that can do almost anything...",
  "wordCount": 1250,
  "charCount": 7890,
  "description": "Learn about Apify Actors",
  "language": "en",
  "headings": ["Actors", "Getting started", "Configuration"],
  "headingCount": 8,
  "baseUrl": "https://docs.apify.com",
  "scrapedAt": "2026-02-19T14:30:00.000Z"
}
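Many fine-tuning tools expect one JSON object per line (JSONL) rather than a single JSON array. The sketch below converts downloaded items into that shape, assuming the item fields shown above. The `{"text", "meta"}` record layout is only an illustrative choice, since each framework defines its own schema.

```python
import json


def to_jsonl(items: list[dict], min_word_count: int = 0) -> str:
    """Serialize dataset items as JSONL, one training record per line."""
    lines = []
    for item in items:
        # Optional second-pass filter on the wordCount field.
        if item.get("wordCount", 0) < min_word_count:
            continue
        record = {
            "text": item["text"],
            "meta": {
                "url": item["url"],
                "title": item.get("title"),
                "language": item.get("language"),
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical items in the output format shown above.
items = [
    {"url": "https://docs.apify.com/a", "title": "A", "text": "Long page", "wordCount": 1250, "language": "en"},
    {"url": "https://docs.apify.com/b", "title": "B", "text": "Thin page", "wordCount": 120, "language": "en"},
]
jsonl = to_jsonl(items, min_word_count=500)
```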

Cost of usage

This actor uses CheerioCrawler, which fetches pages over plain HTTP and parses the HTML without launching a browser, so crawling is fast and efficient. A typical run crawling 50 pages from one site takes about 1-3 minutes and costs approximately $0.02-0.05 in platform credits. The actor is priced at $0.75 per 1,000 results under pay-per-event pricing.
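The per-result price translates into a simple estimate. The sketch below covers only the $0.75 per 1,000 results charge quoted above. Whether other charges apply on top of it is not stated here, so treat the figure as a rough guide; the function name is hypothetical.

```python
PRICE_PER_1000_RESULTS = 0.75  # USD, as quoted above


def estimate_result_cost(result_count: int) -> float:
    """Estimate the per-result charge in USD for a given number of dataset items."""
    return result_count * PRICE_PER_1000_RESULTS / 1000


# A run that yields the default maximum of 50 pages from one starting URL.
single_site = estimate_result_cost(50)

# Ten starting URLs, each reaching the 50-page cap.
ten_sites = estimate_result_cost(10 * 50)
```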

Tips

  • Set minWordCount to 100 or higher to filter out thin pages like redirects, error pages, and navigation pages.
  • Use maxPagesPerUrl to cap the number of pages crawled per starting URL and keep costs predictable.
  • The text field contains up to 50,000 characters per page to accommodate long-form content.
  • The headings array provides document structure information useful for chunking strategies in RAG.
  • Export as JSON for direct compatibility with popular LLM fine-tuning frameworks.
  • Combine datasets from multiple domains to build diverse training corpora.
  • Schedule runs to keep your training data fresh with the latest website content.
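One way to use the headings array for RAG chunking is to split the text field wherever a heading occurs. The sketch below assumes each heading string appears verbatim in the text and in document order, which the actor does not guarantee. Headings that cannot be found are skipped, and the function name is hypothetical.

```python
def chunk_by_headings(text: str, headings: list[str]) -> list[dict]:
    """Split text into chunks that each start at a heading found in the text."""
    positions = []
    cursor = 0
    for heading in headings:
        if not heading:
            continue
        # Search forward only, so repeated words earlier in the text are ignored.
        idx = text.find(heading, cursor)
        if idx == -1:
            continue
        positions.append((idx, heading))
        cursor = idx + len(heading)

    chunks = []
    # Anything before the first located heading becomes an untitled chunk.
    first_start = positions[0][0] if positions else len(text)
    preamble = text[:first_start].strip()
    if preamble:
        chunks.append({"heading": None, "text": preamble})

    for i, (start, heading) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        chunks.append({"heading": heading, "text": text[start:end].strip()})
    return chunks


sample_text = (
    "Actors\nActors are programs.\n"
    "Getting started\nInstall the CLI.\n"
    "Configuration\nSet the input."
)
chunks = chunk_by_headings(sample_text, ["Actors", "Getting started", "Configuration"])
```

Each chunk keeps its heading as metadata, which can be stored alongside the embedding to give retrieved passages some context.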

Built with Crawlee and Apify SDK. See more scrapers by consummate_mandala on Apify Store.