LLM Data Pipeline Pro

Pricing

from $0.10 / 1,000 processed chunks

Transform websites into LLM training data. Scrape, validate, deduplicate, chunk for RAG, and export to OpenAI/Anthropic/Mistral formats. Built-in PII detection and GDPR compliance. Vector DB export to Pinecone & Qdrant.

Developer: Theo Sanz (maintained by Community)

Last modified: 2 days ago

Transform any website into LLM-ready training data in minutes.


The Problem

Building datasets for LLM fine-tuning or RAG pipelines is painful:

  • Web data is messy and inconsistent
  • Duplicates waste your training budget
  • PII creates legal liability
  • Each LLM provider needs a different format
  • GDPR compliance is a nightmare

The Solution

LLM Data Pipeline Pro handles the entire data preparation workflow in a single run: scrape, validate, deduplicate, chunk, and export to any of the five supported formats, with full compliance reporting.


Key Features

Data Collection

| Source | Description |
| --- | --- |
| Website Crawling | Crawl any site with configurable depth (1-10 levels) |
| Apify Datasets | Process existing scraped data from any Apify actor |
| Direct Text | Paste raw text for quick processing |

Quality Assurance

| Feature | What It Does |
| --- | --- |
| Content Validation | Filters out empty, too short, or too long content |
| Quality Scoring | Rates content 0-1, filters below your threshold |
| PII Detection | Finds emails, phones, SSN, credit cards, addresses |
| Auto-Masking | Automatically redacts sensitive information |
| Language Detection | Identifies content language |
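
To make the PII detection and auto-masking rows concrete, here is a minimal sketch using regular expressions. The patterns, the `[LABEL]` placeholder style, and the `mask_pii` helper are illustrative assumptions, not the actor's actual implementation, and they cover only three of the listed PII types in US-style formats.

```python
import re

# Illustrative patterns only (assumed, not taken from the actor).
# Production-grade PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count hits per type."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn(f"[{label}]", text)
        counts[label] = hits
    return text, counts


masked, found = mask_pii("Mail jane@example.com or call 555-123-4567. SSN 123-45-6789.")
print(masked)
print(found)
```

Returning the per-type counts alongside the masked text is one way such a step could feed a compliance report.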

Deduplication

| Method | Benefit |
| --- | --- |
| Hash-Based | Removes exact duplicates instantly |
| Semantic | Catches near-duplicates with similar meaning |
| Cross-Document | Ensures no overlap across your entire dataset |
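
As a rough illustration of the hash-based method, the sketch below keeps the first occurrence of each document by comparing SHA-256 digests. The choice of SHA-256 and the whitespace and case normalization are assumptions made for this example; the actor's own hashing scheme is not documented here.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def dedupe_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Hello  world", "hello world", "Something else"]
unique_docs = dedupe_exact(docs)
print(unique_docs)
```

Storing only digests keeps memory use flat regardless of document length. Semantic near-duplicate detection needs a different technique, such as comparing embeddings, and is not covered by this sketch.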

Output Formats

| Format | Best For |
| --- | --- |
| OpenAI | GPT-3.5, GPT-4 fine-tuning |
| Anthropic | Claude fine-tuning via AWS Bedrock |
| Mistral | Mistral AI model training |
| HuggingFace | Open-source model training |
| Raw | Custom pipelines, RAG applications |

Vector Database Export

| Provider | Status |
| --- | --- |
| Pinecone | Supported |
| Qdrant | Supported |
| Weaviate | Coming Soon |
| Chroma | Coming Soon |

Compliance & Security

| Feature | Description |
| --- | --- |
| robots.txt Respect | Honors website crawling rules |
| ai.txt Respect | Follows the proposed ai.txt opt-out convention for AI training |
| Sensitive Site Exclusion | Automatically skips healthcare, government, financial sites |
| GDPR Reports | Generates audit-ready compliance documentation |
| Data Retention | Configurable retention policies |
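
For readers unfamiliar with how a crawler honors robots.txt, the following sketch uses Python's standard `urllib.robotparser` module. The `is_allowed` helper, the `example-bot` user agent, and the sample rules are hypothetical; how the actor performs this check internally is not specified.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt content permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration: everything under /private/ is off limits.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

docs_ok = is_allowed(ROBOTS, "example-bot", "https://example.com/docs/intro")
private_ok = is_allowed(ROBOTS, "example-bot", "https://example.com/private/data")
print(docs_ok, private_ok)
```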

How It Works

Step 1: Choose Your Source
Provide URLs to crawl, an existing Apify dataset, or paste text directly.

Step 2: Configure Quality Rules
Set minimum content length, quality threshold, and PII handling preferences.

Step 3: Select Output Format
Pick your target LLM provider format and chunking settings.

Step 4: Run
The pipeline handles validation, deduplication, chunking, and formatting automatically.

Step 5: Download
Get your ready-to-use JSONL file or have chunks uploaded directly to your vector database.
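
The steps above can also be driven from code. The sketch below uses the `apify-client` Python package, whose `ApifyClient(...).actor(...).call(run_input=...)` and `dataset(...).iterate_items()` calls are real. Everything specific to this actor is an assumption: the input keys (`startUrls`, `outputFormat`, `chunkSize`, `chunkOverlap`) are hypothetical placeholders, and the actor ID must be taken from the Store page. Check the input schema for the real field names.

```python
def build_run_input(start_urls, output_format="openai", chunk_size=1000, chunk_overlap=100):
    """Assemble a run input dict. Every key below is a hypothetical placeholder."""
    if not 100 <= chunk_size <= 10_000:
        raise ValueError("chunk_size must be between 100 and 10,000 characters")
    if not 0 <= chunk_overlap <= 1_000:
        raise ValueError("chunk_overlap must be between 0 and 1,000 characters")
    return {
        "startUrls": [{"url": url} for url in start_urls],  # hypothetical key
        "outputFormat": output_format,                      # hypothetical key
        "chunkSize": chunk_size,                            # hypothetical key
        "chunkOverlap": chunk_overlap,                      # hypothetical key
    }


def run_pipeline(token, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items.

    Requires `pip install apify-client`; imported lazily so that
    build_run_input stays usable without the package installed.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["https://example.com/docs"])
print(run_input)
```

Calling `run_pipeline` with your API token and the actor's ID starts a billable run, so it is left out of the example above.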


Pricing

Pay Per Event

| Event | Price | When Charged |
| --- | --- | --- |
| Actor Start | $0.001 | Once per run |
| Processed Chunk | $0.0001 | Per output chunk |

Cost Examples

| Use Case | Pages | Est. Chunks | Total Cost |
| --- | --- | --- | --- |
| Small docs site | 50 | ~200 | ~$0.02 |
| Medium knowledge base | 500 | ~2,000 | ~$0.20 |
| Large documentation | 5,000 | ~20,000 | ~$2.00 |
| Enterprise wiki | 10,000 | ~40,000 | ~$4.00 |

Vector Export Options

Option A: Bring Your Own Key (BYOK)
Use your own OpenAI API key for embeddings. You pay OpenAI directly at their rates.

Option B: Managed Embeddings
We handle everything. No API keys needed. Additional $0.0005 per chunk.
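
The arithmetic behind these prices can be sketched as a small estimator. The rates come from the pricing section above; the `estimate_cost` function itself is a hypothetical helper, and it assumes the managed-embeddings surcharge is simply added to the per-chunk price. With BYOK, OpenAI's own embedding charges are billed separately and are not modeled here.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.001")         # charged once per run
PER_CHUNK = Decimal("0.0001")          # charged per output chunk
MANAGED_EMBEDDING = Decimal("0.0005")  # extra per chunk with managed embeddings


def estimate_cost(chunks: int, managed_embeddings: bool = False) -> Decimal:
    """Estimate the charge in USD for one run producing `chunks` output chunks."""
    per_chunk = PER_CHUNK
    if managed_embeddings:
        per_chunk += MANAGED_EMBEDDING
    return ACTOR_START + per_chunk * chunks


small = estimate_cost(200)
enterprise = estimate_cost(40_000)
managed = estimate_cost(2_000, managed_embeddings=True)
print(small, enterprise, managed)
```

The results include the $0.001 start fee, which is why they come out slightly above the rounded totals in the cost examples. `Decimal` is used instead of floats so that sub-cent amounts add up exactly.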


Output Structure

Dataset Items

Each processed chunk is saved individually in your chosen format, ready for:

  • Direct upload to OpenAI fine-tuning
  • Import into your training pipeline
  • Integration with RAG frameworks

Key-Value Store

| File | Contents |
| --- | --- |
| OUTPUT | Complete pipeline results |
| STATS | Execution statistics by stage |
| COMPLIANCE_REPORT | GDPR audit documentation |
| training_data.jsonl | Ready-to-use training file |
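
Once `training_data.jsonl` has been downloaded, a quick sanity check is to confirm that every line parses as JSON before passing the file to a training job. The sketch below is one way to do that. The `load_jsonl` helper is hypothetical, and the `{"text": ...}` records in the demo are placeholder data, not the actor's actual record schema.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Parse a JSONL file, skipping blank lines and failing loudly on bad JSON."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from exc
    return records


# Demo with a throwaway file containing placeholder records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "training_data.jsonl"
    sample.write_text('{"text": "first chunk"}\n\n{"text": "second chunk"}\n', encoding="utf-8")
    records = load_jsonl(sample)

print(len(records))
```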

Statistics Tracked

  • Pages crawled (success/failed)
  • Validation results (passed/failed)
  • Duplicates removed
  • Chunks generated
  • Average chunk size
  • Processing time per stage

Use Cases

Fine-Tuning Dataset Creation

Scrape your company documentation and export directly to OpenAI's fine-tuning format. Train custom models on your proprietary knowledge.

RAG Knowledge Base

Build a searchable knowledge base with automatic chunking and vector embeddings. Export directly to Pinecone or Qdrant.

Documentation Migration

Convert legacy documentation into modern LLM-compatible formats for chatbots and AI assistants.

Competitive Intelligence

Monitor competitor documentation and extract structured data for analysis.

Compliance Auditing

Generate detailed reports showing what data was collected, from where, and how it was processed.


Environment Variables

For BYOK mode, set these in your Apify actor settings:

| Variable | Purpose |
| --- | --- |
| OPENAI_API_KEY | Generate embeddings for vector export |
| PINECONE_API_KEY | Upload to Pinecone |
| QDRANT_API_KEY | Upload to Qdrant |

Frequently Asked Questions

Is this GDPR compliant?
Yes. The actor respects robots.txt and ai.txt, excludes sensitive sites, detects and masks PII, and generates compliance audit reports.

What's the maximum I can process?
Up to 10,000 pages per run, with a configurable crawl depth of up to 10 levels.

How does chunking work?
The actor uses recursive text splitting with a configurable chunk size (100-10,000 characters) and overlap (0-1,000 characters). It splits on paragraphs first, then sentences, then words.
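
A minimal sketch of recursive splitting with overlap, under stated assumptions: the separators (`"\n\n"`, `". "`, `" "`), the character-based overlap, and the function names are choices made for this example, not the actor's documented behavior. The separator is dropped wherever a split occurs, and the overlap is taken as raw characters, so it can start mid-word.

```python
SEPARATORS = ["\n\n", ". ", " "]  # paragraphs, then sentences, then words


def split_recursive(text, limit, separators=SEPARATORS):
    """Split text into pieces of at most `limit` characters, coarsest separator first."""
    if len(text) <= limit:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    sep, finer = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= limit:
            current = candidate  # keep packing parts into the current piece
            continue
        if current.strip():
            pieces.append(current)
        if len(part) <= limit:
            current = part
        else:
            # The part alone is too long: retry with the next, finer separator.
            pieces.extend(split_recursive(part, limit, finer))
            current = ""
    if current.strip():
        pieces.append(current)
    return pieces


def chunk_text(text, chunk_size=1000, overlap=100):
    """Chunk text so each chunk, including its overlap prefix, fits in chunk_size."""
    if overlap == 0:
        return split_recursive(text, chunk_size)
    limit = chunk_size - overlap - 1  # reserve room for the overlap and a joining space
    if overlap < 0 or limit < 1:
        raise ValueError("overlap must be non-negative and smaller than chunk_size - 1")
    pieces = split_recursive(text, limit)
    chunks = pieces[:1]
    for previous, piece in zip(pieces, pieces[1:]):
        chunks.append(previous[-overlap:] + " " + piece)
    return chunks


text = "Alpha beta gamma. Delta epsilon zeta.\n\nEta theta iota kappa lambda."
chunks = chunk_text(text, chunk_size=30, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Reserving the overlap budget up front keeps every chunk within `chunk_size`, at the cost of slightly smaller base pieces.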

Can I use my own vector database?
Pinecone and Qdrant are currently supported. Weaviate and Chroma support is coming soon.

What PII types are detected?
Email addresses, phone numbers, Social Security numbers, credit card numbers, and physical addresses.


Support

  • Issues: Open a ticket on the actor page
  • Feature Requests: Contact via Apify messaging
  • Documentation: Check the input schema for all available options

Changelog

v1.0 — Initial Release

  • Multi-source input (URL, dataset, text)
  • Five output formats (OpenAI, Anthropic, Mistral, HuggingFace, Raw)
  • Pinecone and Qdrant integration
  • PII detection and masking
  • GDPR compliance reporting
  • Configurable chunking with overlap

Built for the AI era. Process responsibly.