DomainForge LLM Dataset Builder
Pricing
from $0.01 / 1,000 results
Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction
Developer
fanio zilla
Maintained by Community
DomainForge LLM Dataset Builder 🛠️
Turn Any Website into Production-Ready LLM Training & RAG Data with One Click.
💡 Stop Scraping, Start Training
AI Engineers and researchers spend 70-80% of their time building web scrapers, stripping boilerplate HTML, deduplicating content, and chunking files.
DomainForge does all of this for you automatically.
Whether you are fine-tuning a custom LLM, building a Retrieval-Augmented Generation (RAG) pipeline, or feeding a vector database, DomainForge crawls your target websites and processes them into clean, structured, and deduplicated datasets instantly.
✨ Superpowers
- 🕷️ Smart Website Crawling: Crawl recursively, target specific URL paths (using globs or regex), or ingest entire sites using `sitemap.xml` files.
- 🧼 High-Fidelity Noise Removal: Powered by Mozilla's Readability engine. We strip headers, footers, sidebars, cookie banners, navigation menus, and ads, leaving only the main content.
- 🧩 LLM-Optimized Auto-Chunking: Automatically splits long articles into clean, overlapping chunks (customizable chunk sizes and overlaps) optimized for vector store embeddings.
- ⚡ Exact Deduplication: Removes duplicate pages and identical content blocks using SHA-256 hashing, collapsing republished copies of the same content into a single canonical record. Pages that differ even slightly are kept as separate records. (Semantic, embedding-based near-duplicate deduplication is on the roadmap.)
- 📂 Multi-Format Export: Download your clean dataset as JSONL (Hugging Face ready), JSON, CSV, or raw Markdown files.
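To illustrate what exact deduplication means in practice, here is a minimal Python sketch of hash-based deduplication. It is not DomainForge's actual implementation: the whitespace normalization step and the helper names `content_hash` and `deduplicate` are assumptions made for this example. It shows why a republished copy collapses into one record while a page that differs by a single character does not.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of whitespace-normalized text."""
    # Assumption: runs of whitespace are collapsed before hashing so that
    # layout-only differences do not change the hash.
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each distinct content hash."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = content_hash(page["text"])
        if digest in seen:
            continue  # exact duplicate of an earlier page
        seen.add(digest)
        unique.append({**page, "dedupHash": digest})
    return unique


pages = [
    {"url": "https://example.com/a", "text": "High quality data is all you need."},
    # Same content, different whitespace: treated as a duplicate of /a.
    {"url": "https://example.com/b", "text": "High  quality data is\nall you need."},
    # One character differs: an exact hash keeps it as a separate record.
    {"url": "https://example.com/c", "text": "High quality data is all you need!"},
]
unique_pages = deduplicate(pages)
print([page["url"] for page in unique_pages])
```

Catching the third page as a near-duplicate would need a similarity measure rather than a hash, which is what the roadmap item refers to.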
🚀 How It Works
DomainForge runs every page through a four-stage pipeline (crawl, clean, deduplicate, chunk):
    graph LR
    A[Raw URL / Sitemap] --> B[Smart Crawler]
    B --> C[Noise & Boilerplate Stripper]
    C --> D[SHA-256 Deduplication]
    D --> E[Semantic Chunking]
    E --> F[LLM-Ready Dataset]
    style A fill:#4dabf7,stroke:#228be6,stroke-width:2px,color:#fff
    style F fill:#37b24d,stroke:#2b8a3e,stroke-width:2px,color:#fff
    style B fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style C fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style D fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style E fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
- Input a URL: Just enter a start URL (e.g. `https://docs.yourcompany.com`) or a sitemap.
- Forge processes the data: The Actor extracts metadata (author, publish date, language, word and token counts), cleans the markup, deduplicates content, and splits it into semantic chunks.
- Deploy: Directly ingest the output into your vector databases (Pinecone, Chroma, Qdrant) or Hugging Face datasets.
🛠️ Simple No-Code Input
DomainForge is built for both developers and non-technical builders. You can configure it with a short JSON input or run it through the Apify Console UI.
Simple Configuration (Just the basics)
    {
      "startUrls": [{ "url": "https://docs.apify.com" }],
      "maxCrawlPages": 100
    }
Full Enterprise Settings (Complete control)
    {
      "startUrls": [
        { "url": "https://example.com/blog" },
        { "url": "https://example.com/sitemap.xml" }
      ],
      "maxCrawlPages": 1000,
      "maxDepth": 3,
      "includePatterns": ["*/blog/*", "*/docs/*"],
      "excludePatterns": ["*/admin/*", "*/login*"],
      "respectRobotsTxt": true,
      "saveMarkdown": true,
      "enableDeduplication": true,
      "chunkSize": 1024,
      "chunkOverlap": 200
    }
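To make the interplay of `chunkSize` and `chunkOverlap` concrete, the following Python sketch shows one common way to produce overlapping chunks. It assumes both settings are measured in characters (the presets table mentions "512-char chunks") and uses a plain sliding window. The function name `chunk_text` is hypothetical, and the Actor's real splitter may respect sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 2,500-character sample document.
sample = "".join(chr(97 + i % 26) for i in range(2500))
sample_chunks = chunk_text(sample, chunk_size=1024, chunk_overlap=200)
print([len(chunk) for chunk in sample_chunks])
```

With these settings the window advances 824 characters per step, so the last 200 characters of each chunk reappear at the start of the next one. A larger overlap preserves more context across chunk boundaries at the cost of more stored text.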
📊 Beautiful Structured Output
Each crawled page is returned as a structured item ready for database storage or model ingestion:
    {
      "url": "https://example.com/blog/getting-started",
      "title": "Getting Started with AI Datasets",
      "markdown": "# Getting Started with AI Datasets\n\nHigh quality data is all you need...",
      "text": "Getting Started with AI Datasets High quality data is all you need...",
      "metadata": {
        "author": "Jane Doe",
        "publishDate": "2026-06-21T00:00:00.000Z",
        "language": "en",
        "wordCount": 745,
        "tokenCountApprox": 980,
        "crawledAt": "2026-06-21T12:00:00.000Z"
      },
      "chunks": [
        { "text": "High quality data is all you need to train custom LLMs...", "tokenCount": 512 }
      ],
      "dedupHash": "8f3b2a9e..."
    }
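Vector stores usually expect one record per chunk rather than one per page. As a rough sketch, the Python below flattens an output item of the shape shown above into per-chunk records. The field names come from the example item. The record layout (`id`, `text`, `metadata`) and the ID scheme are assumptions for illustration, not a format required by DomainForge or by any particular vector database.

```python
import hashlib


def to_chunk_records(item: dict) -> list[dict]:
    """Turn one crawled page into a list of per-chunk records."""
    records = []
    for index, chunk in enumerate(item.get("chunks", [])):
        # Hypothetical ID scheme: a short hash of the page URL and chunk index,
        # so re-running the crawl yields stable IDs for upserts.
        raw_id = f"{item['url']}#{index}"
        record_id = hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16]
        records.append({
            "id": record_id,
            "text": chunk["text"],
            "metadata": {
                "url": item["url"],
                "title": item.get("title"),
                "language": item.get("metadata", {}).get("language"),
                "chunkIndex": index,
                "tokenCount": chunk.get("tokenCount"),
            },
        })
    return records


item = {
    "url": "https://example.com/blog/getting-started",
    "title": "Getting Started with AI Datasets",
    "metadata": {"language": "en"},
    "chunks": [
        {"text": "High quality data is all you need to train custom LLMs...", "tokenCount": 512}
    ],
}
records = to_chunk_records(item)
print(records[0]["metadata"])
```

Each record's `text` is what you would embed, and the metadata lets a RAG application link an answer back to its source page.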
💡 Quick Tips for Best Results
- Sitemaps are your friend: Use the sitemap URL (e.g. `https://example.com/sitemap.xml`) to crawl an entire site. Set `maxDepth` to `0` to crawl only the pages listed in the sitemap.
- Targeted Ingestion: Use `includePatterns` (like `*/docs/*` or `*/help/*`) to avoid wasting compute on irrelevant pages such as contact or terms-of-service pages.
- Hugging Face Friendly: Download the dataset in `JSONL` format using the Apify API for native integration with Hugging Face pipelines.
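Before starting a large crawl, it can help to check which URLs a set of patterns would let through. The Python sketch below approximates glob-style include and exclude filtering with the standard library's `fnmatch` module. It rests on two assumptions that may not match the Actor's behavior: exclude patterns take priority over include patterns, and an empty include list allows every URL. The function name `should_crawl` is hypothetical.

```python
from fnmatch import fnmatchcase


def should_crawl(url: str, include_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Return True if the URL passes the include/exclude glob filters."""
    # Assumption: an exclude match always wins.
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    # Assumption: no include patterns means "crawl everything not excluded".
    if not include_patterns:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_patterns)


include = ["*/blog/*", "*/docs/*"]
exclude = ["*/admin/*", "*/login*"]

for candidate in [
    "https://example.com/docs/intro",
    "https://example.com/terms",
    "https://example.com/blog/admin/settings",
]:
    print(candidate, should_crawl(candidate, include, exclude))
```

Note that in `fnmatch` the `*` wildcard also matches `/`, so `*/docs/*` matches pages at any depth below a `docs` path segment.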
📁 Example Configurations
Pre-built input presets for common scenarios live in the ./examples directory. Copy one into the Apify Console (or pass it as the Actor input) and adjust the URLs.
| Preset | Best for | Highlights |
|---|---|---|
| ./examples/blog-crawl.json | Harvesting a blog | Path-scoped to /blog/, skips tag/author/pagination noise, exact-dedup on |
| ./examples/documentation-sitemap.json | Ingesting a whole docs site | Sitemap-driven, maxDepth 0 crawls only listed pages — fast and complete |
| ./examples/rag-chunking.json | Building a RAG / vector corpus | Small 512-char chunks, tight overlap, markdown omitted for a lean dataset |
🔗 Try DomainForge Now
Get started in seconds! Run this Actor in the Apify Console, or call it from the Apify CLI:
$ apify call domainforge-llm-dataset-builder
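The Actor can also be started from Python with the `apify-client` package. The sketch below is a starting point, not a verified integration. The Actor ID is copied from the CLI command above and is an assumption, since Apify usually expects the full `<username>/<actor-name>` form. The `APIFY_TOKEN` environment variable and the helper names `build_input` and `run_actor` are also assumptions. The client import sits inside `run_actor` so the input builder can be tried without the package installed.

```python
import os

# Assumption: taken from the CLI command; the full ID on Apify usually
# looks like "<username>/<actor-name>".
ACTOR_ID = "domainforge-llm-dataset-builder"


def build_input(start_urls: list[str], max_crawl_pages: int = 100) -> dict:
    """Build the basic Actor input shown in the simple configuration."""
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxCrawlPages": max_crawl_pages,
    }


def run_actor(run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_input(["https://docs.apify.com"])
print(run_input)
# items = run_actor(run_input)  # requires a valid token and network access
```

Calling `run_actor(run_input)` blocks until the run finishes and returns items in the structured format described in the output section.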
For custom builds, bug reports, or feature requests, feel free to check the project LICENSE and source code.