LLM Data Pipeline Pro
Pricing
from $0.10 / 1,000 processed chunks
LLM Data Pipeline Pro
Transform websites into LLM training data. Scrape, validate, deduplicate, chunk for RAG, and export to OpenAI/Anthropic/Mistral formats. Built-in PII detection and GDPR compliance. Vector DB export to Pinecone & Qdrant.
Pricing
from $0.10 / 1,000 processed chunks
Rating
0.0
(0)
Developer

Theo Sanz
Actor stats
0
Bookmarked
2
Total users
1
Monthly active users
2 days ago
Last modified
Categories
Share
Transform any website into LLM-ready training data in minutes.
The Problem
Building datasets for LLM fine-tuning or RAG pipelines is painful:
- Web data is messy and inconsistent
- Duplicates waste your training budget
- PII creates legal liability
- Each LLM provider needs a different format
- GDPR compliance is a nightmare
The Solution
LLM Data Pipeline Pro handles the entire data preparation workflow in one click. Scrape, validate, deduplicate, chunk, and export to any format — with full compliance reporting.
Key Features
Data Collection
| Source | Description |
|---|---|
| Website Crawling | Crawl any site with configurable depth (1-10 levels) |
| Apify Datasets | Process existing scraped data from any Apify actor |
| Direct Text | Paste raw text for quick processing |
Quality Assurance
| Feature | What It Does |
|---|---|
| Content Validation | Filters out empty, too short, or too long content |
| Quality Scoring | Rates content 0-1, filters below your threshold |
| PII Detection | Finds emails, phones, SSN, credit cards, addresses |
| Auto-Masking | Automatically redacts sensitive information |
| Language Detection | Identifies content language |
Deduplication
| Method | Benefit |
|---|---|
| Hash-Based | Removes exact duplicates instantly |
| Semantic | Catches near-duplicates with similar meaning |
| Cross-Document | Ensures no overlap across your entire dataset |
Output Formats
| Format | Best For |
|---|---|
| OpenAI | GPT-3.5, GPT-4 fine-tuning |
| Anthropic | Claude fine-tuning via AWS Bedrock |
| Mistral | Mistral AI model training |
| HuggingFace | Open-source model training |
| Raw | Custom pipelines, RAG applications |
Vector Database Export
| Provider | Status |
|---|---|
| Pinecone | Supported |
| Qdrant | Supported |
| Weaviate | Coming Soon |
| Chroma | Coming Soon |
Compliance & Security
| Feature | Description |
|---|---|
| robots.txt Respect | Honors website crawling rules |
| ai.txt Respect | Follows the new AI training opt-out standard |
| Sensitive Site Exclusion | Automatically skips healthcare, government, financial sites |
| GDPR Reports | Generates audit-ready compliance documentation |
| Data Retention | Configurable retention policies |
How It Works
Step 1: Choose Your Source Provide URLs to crawl, an existing Apify dataset, or paste text directly.
Step 2: Configure Quality Rules Set minimum content length, quality threshold, and PII handling preferences.
Step 3: Select Output Format Pick your target LLM provider format and chunking settings.
Step 4: Run The pipeline handles validation, deduplication, chunking, and formatting automatically.
Step 5: Download Get your ready-to-use JSONL file or have chunks uploaded directly to your vector database.
Pricing
Pay Per Event
| Event | Price | When Charged |
|---|---|---|
| Actor Start | $0.001 | Once per run |
| Processed Chunk | $0.0001 | Per output chunk |
Cost Examples
| Use Case | Pages | Est. Chunks | Total Cost |
|---|---|---|---|
| Small docs site | 50 | ~200 | ~$0.02 |
| Medium knowledge base | 500 | ~2,000 | ~$0.20 |
| Large documentation | 5,000 | ~20,000 | ~$2.00 |
| Enterprise wiki | 10,000 | ~40,000 | ~$4.00 |
Vector Export Options
Option A: Bring Your Own Key (BYOK) Use your own OpenAI API key for embeddings. You pay OpenAI directly at their rates.
Option B: Managed Embeddings We handle everything. No API keys needed. Additional $0.0005 per chunk.
Output Structure
Dataset Items
Each processed chunk is saved individually in your chosen format, ready for:
- Direct upload to OpenAI fine-tuning
- Import into your training pipeline
- Integration with RAG frameworks
Key-Value Store
| File | Contents |
|---|---|
| OUTPUT | Complete pipeline results |
| STATS | Execution statistics by stage |
| COMPLIANCE_REPORT | GDPR audit documentation |
| training_data.jsonl | Ready-to-use training file |
Statistics Tracked
- Pages crawled (success/failed)
- Validation results (passed/failed)
- Duplicates removed
- Chunks generated
- Average chunk size
- Processing time per stage
Use Cases
Fine-Tuning Dataset Creation
Scrape your company documentation and export directly to OpenAI's fine-tuning format. Train custom models on your proprietary knowledge.
RAG Knowledge Base
Build a searchable knowledge base with automatic chunking and vector embeddings. Export directly to Pinecone or Qdrant.
Documentation Migration
Convert legacy documentation into modern LLM-compatible formats for chatbots and AI assistants.
Competitive Intelligence
Monitor competitor documentation and extract structured data for analysis.
Compliance Auditing
Generate detailed reports showing what data was collected, from where, and how it was processed.
Environment Variables
For BYOK mode, set these in your Apify actor settings:
| Variable | Purpose |
|---|---|
OPENAI_API_KEY | Generate embeddings for vector export |
PINECONE_API_KEY | Upload to Pinecone |
QDRANT_API_KEY | Upload to Qdrant |
Frequently Asked Questions
Is this GDPR compliant? Yes. The actor respects robots.txt and ai.txt, excludes sensitive sites, detects and masks PII, and generates compliance audit reports.
What's the maximum I can process? Up to 10,000 pages per run with configurable crawl depth up to 10 levels.
How does chunking work? Recursive text splitting with configurable chunk size (100-10,000 characters) and overlap (0-1,000 characters). Splits on paragraphs, sentences, then words.
Can I use my own vector database? Currently supports Pinecone and Qdrant. Weaviate and Chroma support coming soon.
What PII types are detected? Email addresses, phone numbers, Social Security numbers, credit card numbers, and physical addresses.
Support
- Issues: Open a ticket on the actor page
- Feature Requests: Contact via Apify messaging
- Documentation: Check the input schema for all available options
Changelog
v1.0 — Initial Release
- Multi-source input (URL, dataset, text)
- Five output formats (OpenAI, Anthropic, Mistral, HuggingFace, Raw)
- Pinecone and Qdrant integration
- PII detection and masking
- GDPR compliance reporting
- Configurable chunking with overlap
Built for the AI era. Process responsibly.