DomainForge LLM Dataset Builder

Pricing

from $0.01 / 1,000 results

Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction

Rating: 0.0 (0)

Developer: fanio zilla (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 1
Monthly active users: 1
Last modified: 5 hours ago

DomainForge LLM Dataset Builder 🛠️

Turn Any Website into Production-Ready LLM Training & RAG Data with One Click.


💡 Stop Scraping, Start Training

AI engineers and researchers can spend as much as 70-80% of their time building web scrapers, stripping boilerplate HTML, deduplicating content, and chunking files.

DomainForge does all of this for you automatically.

Whether you are fine-tuning a custom LLM, building a Retrieval-Augmented Generation (RAG) pipeline, or feeding a vector database, DomainForge crawls your target websites and processes them into clean, structured, deduplicated datasets.


✨ Superpowers

  • 🕷️ Smart Website Crawling: Crawl recursively, target specific URL paths (using globs or regex), or ingest entire sites from their sitemap.xml files.
  • 🧼 High-Fidelity Noise Removal: Powered by Mozilla's Readability engine. We strip headers, footers, sidebars, cookie banners, navigation menus, and ads, leaving only the main content.
  • 🧩 LLM-Optimized Auto-Chunking: Automatically splits long articles into clean, overlapping chunks (customizable chunk sizes and overlaps) optimized for vector store embeddings.
  • ⚡ Exact Deduplication: Removes duplicate pages and identical content blocks using SHA-256 hashing, collapsing republished copies of the same content into a single canonical record. Content that is merely similar is kept, because a hash only matches identical input. (Semantic, embedding-based near-duplicate deduplication is on the roadmap.)
  • 📂 Multi-Format Export: Download your clean dataset as JSONL (Hugging Face ready), JSON, CSV, or raw Markdown files.
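
The exact-deduplication bullet above can be illustrated with a minimal Python sketch. This is not DomainForge's actual implementation: the whitespace and case normalization, the helper names `content_hash` and `dedupe`, and the record shape are assumptions made for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the text after light normalization.

    Assumption: whitespace is collapsed and case is folded before hashing,
    so trivially reformatted copies produce the same digest.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each content hash, preserving order."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = content_hash(page["text"])
        if digest in seen:
            continue  # exact duplicate of an earlier page
        seen.add(digest)
        unique.append({**page, "dedupHash": digest})
    return unique


pages = [
    {"url": "https://example.com/a", "text": "High quality data is all you need."},
    {"url": "https://example.com/b", "text": "High  quality data is all you need."},
    {"url": "https://example.com/c", "text": "High quality data is most of what you need."},
]
unique_pages = dedupe(pages)
print([p["url"] for p in unique_pages])
```

In this sketch the second page is dropped because it differs from the first only in spacing, while the third survives because a change in wording yields a different digest. That is why near-duplicate detection needs a different technique than hashing.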

🚀 How It Works

DomainForge operates as a staged data refinery, where each stage feeds the next:

graph LR
A[Raw URL / Sitemap] --> B[Smart Crawler]
B --> C[Noise & Boilerplate Stripper]
C --> D[SHA-256 Deduplication]
D --> E[Semantic Chunking]
E --> F[LLM-Ready Dataset]
style A fill:#4dabf7,stroke:#228be6,stroke-width:2px,color:#fff
style F fill:#37b24d,stroke:#2b8a3e,stroke-width:2px,color:#fff
style B fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
style C fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
style D fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
style E fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
  1. Input a URL: Just enter a start URL (e.g. https://docs.yourcompany.com) or a sitemap.
  2. Forge processes the data: The Actor extracts metadata (author, publish date, language, word and token counts), cleans the markup, deduplicates content, and splits it into semantic chunks.
  3. Deploy: Directly ingest the output into your vector databases (Pinecone, Chroma, Qdrant) or Hugging Face datasets.
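
The chunking in step 2 can be pictured as a sliding window. The sketch below is an illustration under stated assumptions, not the Actor's code: it measures chunks in characters and cuts at fixed offsets, whereas DomainForge may count tokens or respect sentence boundaries. `chunk_text` is a hypothetical helper whose `chunk_size` and `chunk_overlap` parameters mirror the `chunkSize` and `chunkOverlap` inputs.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 25-character sample keeps the overlap easy to see.
sample = "".join(str(i % 10) for i in range(25))
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.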

🛠️ Simple No-Code Input

DomainForge is built for both developers and non-technical builders. You can configure it with a short JSON input or run it through the Apify Console UI.

Simple Configuration (Just the basics)

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 100
}

Full Enterprise Settings (Complete control)

{
  "startUrls": [
    { "url": "https://example.com/blog" },
    { "url": "https://example.com/sitemap.xml" }
  ],
  "maxCrawlPages": 1000,
  "maxDepth": 3,
  "includePatterns": ["*/blog/*", "*/docs/*"],
  "excludePatterns": ["*/admin/*", "*/login*"],
  "respectRobotsTxt": true,
  "saveMarkdown": true,
  "enableDeduplication": true,
  "chunkSize": 1024,
  "chunkOverlap": 200
}
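
To reason about which URLs the `includePatterns` and `excludePatterns` above would admit, a rough Python approximation can help. It rests on assumptions that may not match the Actor: exclude patterns win over include patterns, an empty include list allows everything, and matching uses Python's `fnmatch`, where `*` also matches `/`. `should_crawl` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """Return True when the URL passes the include/exclude filters.

    Assumptions: exclude patterns take precedence, and an empty include
    list means "allow everything". fnmatch's `*` also matches `/`.
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    if not include:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include)


include_patterns = ["*/blog/*", "*/docs/*"]
exclude_patterns = ["*/admin/*", "*/login*"]

for url in [
    "https://example.com/blog/getting-started",
    "https://example.com/docs/admin/settings",
    "https://example.com/pricing",
]:
    print(url, should_crawl(url, include_patterns, exclude_patterns))
```

Under these assumptions only the blog URL passes: the docs URL is rejected by the admin exclude pattern, and the pricing page matches no include pattern.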

📊 Beautiful Structured Output

Each crawled page is returned as a structured item ready for database storage or model ingestion:

{
  "url": "https://example.com/blog/getting-started",
  "title": "Getting Started with AI Datasets",
  "markdown": "# Getting Started with AI Datasets\n\nHigh quality data is all you need...",
  "text": "Getting Started with AI Datasets High quality data is all you need...",
  "metadata": {
    "author": "Jane Doe",
    "publishDate": "2026-06-21T00:00:00.000Z",
    "language": "en",
    "wordCount": 745,
    "tokenCountApprox": 980,
    "crawledAt": "2026-06-21T12:00:00.000Z"
  },
  "chunks": [
    { "text": "High quality data is all you need to train custom LLMs...", "tokenCount": 512 }
  ],
  "dedupHash": "8f3b2a9e..."
}
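
For vector-store ingestion it is often convenient to flatten each page item into one record per chunk. The following Python sketch assumes the item shape shown above. The output field names, the `#chunk-N` ID scheme, and the helper `item_to_chunk_records` are hypothetical and not part of DomainForge.

```python
import json


def item_to_chunk_records(item: dict) -> list[dict]:
    """Turn one page item into one flat record per chunk."""
    records = []
    for index, chunk in enumerate(item.get("chunks", [])):
        records.append({
            "id": f"{item['url']}#chunk-{index}",  # hypothetical ID scheme
            "source_url": item["url"],
            "title": item.get("title"),
            "language": item.get("metadata", {}).get("language"),
            "text": chunk["text"],
            "token_count": chunk.get("tokenCount"),
        })
    return records


item = {
    "url": "https://example.com/blog/getting-started",
    "title": "Getting Started with AI Datasets",
    "metadata": {"language": "en"},
    "chunks": [
        {"text": "High quality data is all you need to train custom LLMs...", "tokenCount": 512}
    ],
}

# One JSON object per line is the JSONL layout.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in item_to_chunk_records(item))
print(jsonl)
```

Keeping the source URL on every record lets a retrieval pipeline cite the original page for each chunk it returns.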

💡 Quick Tips for Best Results

  • Sitemaps are your friend: Use the sitemap URL (e.g. https://example.com/sitemap.xml) to crawl an entire site without relying on link discovery. Set maxDepth to 0 to crawl only the pages listed in the sitemap.
  • Targeted Ingestion: Use includePatterns (like */docs/* or */help/*) to avoid wasting compute on irrelevant pages such as contact or terms-of-service pages.
  • Hugging Face Friendly: Download the dataset in JSONL format using the Apify API for native integration with Hugging Face pipelines.
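
The Hugging Face tip above relies on the Apify API's dataset items endpoint, which accepts a `format` query parameter. The Python sketch below only builds the download URL. `DATASET_ID` is a placeholder for the dataset ID of your own run, and `dataset_items_url` is a hypothetical helper. Private datasets also need an API token, which is left out here.

```python
from urllib.parse import urlencode

APIFY_API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "jsonl") -> str:
    """Build the URL for downloading a dataset's items in the given format."""
    query = urlencode({"format": fmt})
    return f"{APIFY_API_BASE}/datasets/{dataset_id}/items?{query}"


# DATASET_ID is a placeholder; substitute the ID from your own Actor run.
url = dataset_items_url("DATASET_ID")
print(url)
```

Once the response is saved to a file, the Hugging Face `datasets` library should be able to read it with `load_dataset("json", data_files=...)`, since JSONL is one JSON object per line.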

📁 Example Configurations

Pre-built input presets for common scenarios live in the ./examples directory. Copy one into the Apify Console (or pass it as the Actor input) and adjust the URLs.

| Preset | Best for | Highlights |
| --- | --- | --- |
| ./examples/blog-crawl.json | Harvesting a blog | Path-scoped to /blog/, skips tag/author/pagination noise, exact dedup on |
| ./examples/documentation-sitemap.json | Ingesting a whole docs site | Sitemap-driven; maxDepth 0 crawls only listed pages, fast and complete |
| ./examples/rag-chunking.json | Building a RAG / vector corpus | Small 512-char chunks, tight overlap, markdown omitted for a lean dataset |

🔗 Try DomainForge Now

Get started in seconds! Run this Actor from the Apify Console, or call it with the Apify CLI:

$ apify call domainforge-llm-dataset-builder

For custom builds, bug reports, or feature requests, see the project's source code. Licensing terms are in the project LICENSE.