RAG Pipeline
Pricing: from $5.00 / 1,000 results

One-click RAG pipeline: chunks text, generates embeddings, and stores vectors in Pinecone or Qdrant. Provide your content and API keys -- the orchestrator handles the rest.

Developer: mick_
Last modified: 10 days ago
One-click RAG pipeline on Apify. Chunk text, generate embeddings, and store vectors in Pinecone or Qdrant -- all in a single actor run. MCP-ready for AI agent integration.
What It Does
This actor orchestrates three sub-actors in sequence to build a complete RAG (Retrieval-Augmented Generation) pipeline. Feed it your content and it handles chunking, embedding, and vector storage automatically. Returns a pipeline summary -- ready for orchestration or consumption by AI agents via MCP.
Your content -> RAG Content Chunker (chunk by paragraphs, sentences, or Markdown headers) -> RAG Embedding Generator (OpenAI or Cohere embeddings) -> RAG Vector Store Writer (upsert to Pinecone or Qdrant)
You provide your content, API keys, and vector DB config. The pipeline handles dataset handoff between steps automatically.
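The chunking step can be sketched in a few lines. This is an illustrative re-implementation, not the sub-actor's actual code: it counts words as a stand-in for tokens, and the function name is ours.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    Sizes are in words here as a simple proxy for tokens; the real
    chunker works on tokens and supports multiple strategies.
    """
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - chunk_overlap, 1)  # advance less than chunk_size to overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last chunk reached the end
            break
    return chunks

# 1,000 words with the default 512/64 settings -> three overlapping chunks
chunks = chunk_text("word " * 1000, chunk_size=512, chunk_overlap=64)
```

Each chunk then flows to the embedder and the vector-store writer via dataset handoff, as described above.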
New Feature: Bulk File Upload
Already have a document? Upload it directly to Apify Storage and run the full pipeline against it -- no crawling, no copy-pasting.
How to upload a file:
- Go to Apify Console → Storage → Key-Value Stores
- Click + Create new store and give it a name
- Click + Add record and upload your .txt, .md, or .pdf file
- Find your file and click the copy-link icon to copy the direct URL. Make sure the URL starts with `api.apify.com`, not `console.apify.com` -- the console URL is the web page, not the file.
- Paste the URL into the `file_url` field along with your API keys and run
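The `api.apify.com` vs `console.apify.com` check above is easy to automate before submitting a run. A minimal sketch (the function name is ours, not part of the actor):

```python
from urllib.parse import urlparse

def is_direct_record_url(url: str) -> bool:
    """True only for direct key-value-store record URLs on api.apify.com,
    not Apify Console web-page URLs."""
    p = urlparse(url)
    return (
        p.scheme == "https"
        and p.hostname == "api.apify.com"
        and "/key-value-stores/" in p.path
        and "/records/" in p.path
    )
```

Running it against both URL shapes from the steps above shows the difference: the Console URL shares the path segments but fails the hostname check.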
Example -- full pipeline from a file:

```json
{
  "file_url": "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.md",
  "chunking_strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64,
  "embedding_api_key": "sk-...",
  "embedding_provider": "openai",
  "embedding_model": "text-embedding-3-small",
  "vector_db_api_key": "your-qdrant-key",
  "vector_db_provider": "qdrant",
  "index_name": "my-rag-collection",
  "qdrant_url": "https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"
}
```
Or skip storage entirely -- paste text or Markdown directly into the text field:

```json
{
  "text": "# My Document\n\nThis is my content...",
  "chunking_strategy": "markdown",
  "chunk_size": 512,
  "embedding_api_key": "sk-...",
  ...
}
```
Supported file formats: .txt, .md, .markdown, .html, .pdf
Max file size: 5MB
URL requirements: Must be a public HTTPS URL. Apify Storage, S3, Dropbox (shared public link), or GitHub raw URLs all work.
PDF note: Text-based PDFs are supported. Scanned/image-only PDFs have no text layer and will fail -- convert them with OCR first. Office documents (.docx, .xlsx) are not yet supported.
Input
Content Source (choose one)
Option A β Direct text:
{ "text": "# My Document\n\nContent..." }
Option B -- single file URL (Apify Storage):

```json
{
  "file_url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc.txt",
  "chunking_strategy": "markdown",
  "chunk_size": 512
}
```
Option C -- multiple file URLs, bulk (Apify Storage):

```json
{
  "file_urls": [
    "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc1.txt",
    "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc2.md"
  ]
}
```
Option D -- dataset from crawler:

```json
{ "source_dataset_id": "your-crawler-dataset-id", "source_dataset_field": "markdown" }
```
Priority order when multiple sources are provided:

`source_dataset_id` > `file_urls` > `file_url` > `text`
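That priority order can be expressed as a short resolver. A sketch under the documented rules (the function is illustrative, not the orchestrator's code):

```python
def resolve_source(run_input: dict) -> str:
    """Return the key of the winning content source, following the
    documented priority: source_dataset_id > file_urls > file_url > text."""
    for key in ("source_dataset_id", "file_urls", "file_url", "text"):
        if run_input.get(key):  # skip missing/empty values
            return key
    raise ValueError("no content source provided")
```

Note that an empty list or empty string is treated as absent, so a blank `file_urls` does not shadow a populated `text`.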
| Parameter | Required | Default | Description |
|---|---|---|---|
| `text` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | Plain text or Markdown to process |
| `file_url` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | HTTPS URL to a single file in Apify Storage |
| `file_urls` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | List of HTTPS URLs to files in Apify Storage (max 20, 10 MB per file). Contents are fetched and concatenated before chunking. |
| `source_dataset_id` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | Apify dataset ID from a crawler |
| `source_dataset_field` | No | `text` | Field to read from source dataset items |
| `chunking_strategy` | No | `recursive` | `recursive`, `markdown`, or `sentence` |
| `chunk_size` | No | 512 | Target chunk size in tokens (64-8192) |
| `chunk_overlap` | No | 64 | Overlap between chunks in tokens (0-2048) |
| `embedding_api_key` | Yes | - | OpenAI or Cohere API key |
| `embedding_provider` | No | `openai` | `openai` or `cohere` |
| `embedding_model` | No | `text-embedding-3-small` | Embedding model name |
| `embedding_batch_size` | No | 128 | Texts per API request |
| `vector_db_api_key` | Yes | - | Pinecone or Qdrant API key |
| `vector_db_provider` | No | `pinecone` | `pinecone` or `qdrant` |
| `index_name` | Yes | - | Index (Pinecone) or collection (Qdrant) name |
| `qdrant_url` | If `vector_db_provider` is `qdrant` | - | Qdrant Cloud cluster URL |
| `pinecone_namespace` | No | `""` | Pinecone namespace |
| `qdrant_distance_metric` | No | `Cosine` | `Cosine`, `Dot`, or `Euclid` |
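The `embedding_batch_size` parameter controls how many chunk texts go into each embedding API request. The grouping itself is simple; a sketch (not the embedder's actual code):

```python
def batched(texts: list[str], batch_size: int = 128) -> list[list[str]]:
    """Group chunk texts into batches of at most batch_size per API request,
    mirroring the embedding_batch_size parameter (default 128)."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# 300 chunks at the default batch size -> two full batches and one partial
batches = batched([f"chunk {i}" for i in range(300)])
```

Larger batches mean fewer round-trips to OpenAI/Cohere at the cost of bigger request payloads.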
Output
A single summary item in the default dataset:

```json
{
  "_summary": true,
  "pipeline": {
    "total_duration_seconds": 12.345,
    "steps": {
      "chunker": { "actor": "labrat011/rag-content-chunker", "status": "SUCCEEDED", "duration_seconds": 3.2 },
      "embedder": { "actor": "labrat011/rag-embedding-generator", "status": "SUCCEEDED", "duration_seconds": 5.1 },
      "writer": { "actor": "labrat011/rag-vector-store-writer", "status": "SUCCEEDED", "duration_seconds": 4.0 }
    }
  },
  "result": {
    "total_upserted": 42,
    "vector_db_provider": "qdrant",
    "index_name": "my-collection"
  }
}
```
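A downstream consumer can read that summary item like any JSON record, for example to verify all three steps succeeded before trusting the upsert count. A sketch using a trimmed copy of the example values above:

```python
import json

# Trimmed copy of the example summary shown above.
summary = json.loads("""
{
  "_summary": true,
  "pipeline": {
    "total_duration_seconds": 12.345,
    "steps": {
      "chunker":  {"status": "SUCCEEDED", "duration_seconds": 3.2},
      "embedder": {"status": "SUCCEEDED", "duration_seconds": 5.1},
      "writer":   {"status": "SUCCEEDED", "duration_seconds": 4.0}
    }
  },
  "result": {"total_upserted": 42, "vector_db_provider": "qdrant"}
}
""")

steps = summary["pipeline"]["steps"]
failed = [name for name, s in steps.items() if s["status"] != "SUCCEEDED"]
# Only trust the upsert count if every step succeeded.
total_upserted = summary["result"]["total_upserted"] if not failed else 0
```

The same check works on the real dataset item fetched via the Apify API.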
Pricing
The orchestrator charges $0.005 per pipeline run ($5.00 per 1,000 runs). Sub-actors charge separately:
| Actor | Rate |
|---|---|
| RAG Content Chunker | $0.0005/chunk |
| RAG Embedding Generator | $0.0003/embedding |
| RAG Vector Store Writer | $0.0004/vector |
You also pay the embedding provider (OpenAI/Cohere) and vector DB provider (Pinecone/Qdrant) at their standard rates.
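Using the rates above, the Apify-side cost of a run can be estimated up front, assuming one embedding and one vector per chunk. A sketch (function name and assumptions are ours; provider fees are excluded):

```python
def pipeline_cost(chunks: int,
                  run_fee: float = 0.005,      # orchestrator, per run
                  chunk_rate: float = 0.0005,  # RAG Content Chunker, per chunk
                  embed_rate: float = 0.0003,  # RAG Embedding Generator, per embedding
                  write_rate: float = 0.0004   # RAG Vector Store Writer, per vector
                  ) -> float:
    """Estimated Apify-side cost in USD for one pipeline run,
    assuming one embedding and one upserted vector per chunk.
    Excludes OpenAI/Cohere and Pinecone/Qdrant charges."""
    return round(run_fee + chunks * (chunk_rate + embed_rate + write_rate), 6)
```

For example, a 100-chunk document costs $0.005 + 100 × $0.0012 = $0.125 on the Apify side.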
Example: Quick Start with Qdrant
Option A β direct text:
{"text": "Your document content goes here...","chunking_strategy": "recursive","chunk_size": 512,"embedding_api_key": "sk-...","embedding_provider": "openai","embedding_model": "text-embedding-3-small","vector_db_api_key": "your-qdrant-key","vector_db_provider": "qdrant","index_name": "my-rag-collection","qdrant_url": "https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"}
Option B β file upload (Apify Storage, S3, or any public HTTPS URL):
{"file_url": "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.md","chunking_strategy": "markdown","chunk_size": 512,"embedding_api_key": "sk-...","embedding_provider": "openai","embedding_model": "text-embedding-3-small","vector_db_api_key": "your-qdrant-key","vector_db_provider": "qdrant","index_name": "my-rag-collection","qdrant_url": "https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"}
Sub-Actors
- labrat011/rag-content-chunker (RAG Content Chunker)
- labrat011/rag-embedding-generator (RAG Embedding Generator)
- labrat011/rag-vector-store-writer (RAG Vector Store Writer)
Security
- API keys are validated for presence only and never logged
- Qdrant URLs are validated against the `cloud.qdrant.io` pattern (SSRF prevention)
- All string inputs are sanitized against control characters
- Dataset IDs and field names are validated with strict regex patterns
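The Qdrant URL check described above can be sketched as a host-pattern allowlist. This is a hypothetical re-implementation of the documented behavior, not the actor's code:

```python
import re

# Accept only HTTPS URLs whose host ends in .cloud.qdrant.io,
# with an optional port -- rejecting arbitrary hosts blocks SSRF.
QDRANT_URL_RE = re.compile(
    r"^https://[a-z0-9-]+(\.[a-z0-9-]+)*\.cloud\.qdrant\.io(:\d+)?$",
    re.IGNORECASE,
)

def is_allowed_qdrant_url(url: str) -> bool:
    """True only for Qdrant Cloud cluster URLs."""
    return QDRANT_URL_RE.match(url) is not None
```

Anchoring the pattern at both ends matters: without the trailing `$`, a host like `cloud.qdrant.io.evil.com` would slip through.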
License
MIT
MCP Integration
This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.
- Endpoint: `https://mcp.apify.com?tools=labrat011/rag-pipeline`
- Auth: `Authorization: Bearer <APIFY_TOKEN>`
- Transport: Streamable HTTP
- Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI
Example MCP config (Claude Desktop / Cursor):

```json
{
  "mcpServers": {
    "rag-pipeline": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-pipeline",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}
```
AI agents can use this actor to ingest text into a vector database, build RAG knowledge bases, and set up retrieval-augmented generation pipelines -- all as a single callable MCP tool.