RAG Pipeline
Pricing: from $5.00 / 1,000 results

One-click RAG pipeline: chunks text, generates embeddings, and stores vectors in Pinecone or Qdrant. Provide your content and API keys -- the orchestrator handles the rest.

Developer: mick_
Last modified: 10 days ago
One-click RAG pipeline on Apify. Chunk text, generate embeddings, and store vectors in Pinecone or Qdrant -- all in a single actor run. MCP-ready for AI agent integration.
What It Does
This actor orchestrates three sub-actors in sequence to build a complete RAG (Retrieval-Augmented Generation) pipeline. Feed it your content and it handles chunking, embedding, and vector storage automatically. Returns a pipeline summary -- ready for orchestration or consumption by AI agents via MCP.
Your content -> RAG Content Chunker (chunk by paragraphs, sentences, or Markdown headers) -> RAG Embedding Generator (OpenAI or Cohere embeddings) -> RAG Vector Store Writer (upsert to Pinecone or Qdrant)
You provide your content, API keys, and vector DB config. The pipeline handles dataset handoff between steps automatically.
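The chunking step can be sketched in a few lines. This is an illustrative re-implementation, not the sub-actor's actual code: it counts words as a stand-in for tokens, and the function name is ours.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    Sizes are in words here as a simple proxy for tokens; the real
    chunker works on tokens and supports multiple strategies.
    """
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - chunk_overlap, 1)  # advance less than chunk_size to overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last chunk reached the end
            break
    return chunks

# 1,000 words with the default 512/64 settings -> three overlapping chunks
chunks = chunk_text("word " * 1000, chunk_size=512, chunk_overlap=64)
```

Each chunk then flows to the embedder and the vector-store writer via dataset handoff, as described above.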
New Feature: Bulk File Upload
Already have a document? Upload it directly to Apify Storage and run the full pipeline against it -- no crawling, no copy-pasting.
How to upload a file:
- Go to Apify Console → Storage → Key-Value Stores
- Click + Create new store and give it a name
- Click + Add record and upload your .txt, .md, or .pdf file
- Find your file and click the copy-link icon to copy the direct URL. Make sure the URL starts with `api.apify.com`, not `console.apify.com` -- the console URL is the web page, not the file.
- Paste the URL into the `file_url` field along with your API keys and run
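The `api.apify.com` vs `console.apify.com` check above is easy to automate before submitting a run. A minimal sketch (the function name is ours, not part of the actor):

```python
from urllib.parse import urlparse

def is_direct_record_url(url: str) -> bool:
    """True only for direct key-value-store record URLs on api.apify.com,
    not Apify Console web-page URLs."""
    p = urlparse(url)
    return (
        p.scheme == "https"
        and p.hostname == "api.apify.com"
        and "/key-value-stores/" in p.path
        and "/records/" in p.path
    )
```

Running it against both URL shapes from the steps above shows the difference: the Console URL shares the path segments but fails the hostname check.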
Example -- full pipeline from a file:

```json
{
  "file_url": "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.md",
  "chunking_strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64,
  "embedding_api_key": "sk-...",
  "embedding_provider": "openai",
  "embedding_model": "text-embedding-3-small",
  "vector_db_api_key": "your-qdrant-key",
  "vector_db_provider": "qdrant",
  "index_name": "my-rag-collection",
  "qdrant_url": "https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"
}
```
Or skip storage entirely -- paste text or Markdown directly into the text field:

```json
{
  "text": "# My Document\n\nThis is my content...",
  "chunking_strategy": "markdown",
  "chunk_size": 512,
  "embedding_api_key": "sk-...",
  ...
}
```
Supported file formats: .txt, .md, .markdown, .html, .pdf
Max file size: 5MB
URL requirements: Must be a public HTTPS URL. Apify Storage, S3, Dropbox (shared public link), or GitHub raw URLs all work.
PDF note: Text-based PDFs are supported. Scanned/image-only PDFs have no text layer and will fail -- convert them with OCR first. Office documents (.docx, .xlsx) are not yet supported.
Input
Content Source (choose one)
Option A β Direct text:
{ "text": "# My Document\n\nContent..." }
Option B -- single file URL (Apify Storage):

```json
{
  "file_url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc.txt",
  "chunking_strategy": "markdown",
  "chunk_size": 512
}
```
Option C -- multiple file URLs, bulk (Apify Storage):

```json
{
  "file_urls": [
    "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc1.txt",
    "https://api.apify.com/v2/key-value-stores/STORE_ID/records/doc2.md"
  ]
}
```
Option D -- dataset from crawler:

```json
{ "source_dataset_id": "your-crawler-dataset-id", "source_dataset_field": "markdown" }
```
Priority order when multiple sources are provided:

`source_dataset_id` > `file_urls` > `file_url` > `text`
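That priority order can be expressed as a short resolver. A sketch under the documented rules (the function is illustrative, not the orchestrator's code):

```python
def resolve_source(run_input: dict) -> str:
    """Return the key of the winning content source, following the
    documented priority: source_dataset_id > file_urls > file_url > text."""
    for key in ("source_dataset_id", "file_urls", "file_url", "text"):
        if run_input.get(key):  # skip missing/empty values
            return key
    raise ValueError("no content source provided")
```

Note that an empty list or empty string is treated as absent, so a blank `file_urls` does not shadow a populated `text`.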
| Parameter | Required | Default | Description |
|---|---|---|---|
| `text` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | Plain text or Markdown to process |
| `file_url` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | HTTPS URL to a single file in Apify Storage |
| `file_urls` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | List of HTTPS URLs to files in Apify Storage (max 20, 10 MB per file). Contents are fetched and concatenated before chunking. |
| `source_dataset_id` | One of `text`, `file_url`, `file_urls`, or `source_dataset_id` | - | Apify dataset ID from a crawler |
| `source_dataset_field` | No | `text` | Field to read from source dataset items |
| `chunking_strategy` | No | `recursive` | `recursive`, `markdown`, or `sentence` |
| `chunk_size` | No | 512 | Target chunk size in tokens (64-8192) |
| `chunk_overlap` | No | 64 | Overlap between chunks in tokens (0-2048) |
| `embedding_api_key` | Yes | - | OpenAI or Cohere API key |
| `embedding_provider` | No | `openai` | `openai` or `cohere` |
| `embedding_model` | No | `text-embedding-3-small` | Embedding model name |
| `embedding_batch_size` | No | 128 | Texts per API request |
| `vector_db_api_key` | Yes | - | Pinecone or Qdrant API key |
| `vector_db_provider` | No | `pinecone` | `pinecone` or `qdrant` |
| `index_name` | Yes | - | Index (Pinecone) or collection (Qdrant) name |
| `qdrant_url` | If `vector_db_provider` is `qdrant` | - | Qdrant Cloud cluster URL |
| `pinecone_namespace` | No | `""` | Pinecone namespace |
| `qdrant_distance_metric` | No | `Cosine` | `Cosine`, `Dot`, or `Euclid` |
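The `embedding_batch_size` parameter controls how many chunk texts go into each embedding API request. The grouping itself is simple; a sketch (not the embedder's actual code):

```python
def batched(texts: list[str], batch_size: int = 128) -> list[list[str]]:
    """Group chunk texts into batches of at most batch_size per API request,
    mirroring the embedding_batch_size parameter (default 128)."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# 300 chunks at the default batch size -> two full batches and one partial
batches = batched([f"chunk {i}" for i in range(300)])
```

Larger batches mean fewer round-trips to OpenAI/Cohere at the cost of bigger request payloads.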
Output
A single summary item in the default dataset:

```json
{
  "_summary": true,
  "pipeline": {
    "total_duration_seconds": 12.345,
    "steps": {
      "chunker": { "actor": "labrat011/rag-content-chunker", "status": "SUCCEEDED", "duration_seconds": 3.2 },
      "embedder": { "actor": "labrat011/rag-embedding-generator", "status": "SUCCEEDED", "duration_seconds": 5.1 },
      "writer": { "actor": "labrat011/rag-vector-store-writer", "status": "SUCCEEDED", "duration_seconds": 4.0 }
    }
  },
  "result": {
    "total_upserted": 42,
    "vector_db_provider": "qdrant",
    "index_name": "my-collection"
  }
}
```
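A downstream consumer can read that summary item like any JSON record, for example to verify all three steps succeeded before trusting the upsert count. A sketch using a trimmed copy of the example values above:

```python
import json

# Trimmed copy of the example summary shown above.
summary = json.loads("""
{
  "_summary": true,
  "pipeline": {
    "total_duration_seconds": 12.345,
    "steps": {
      "chunker":  {"status": "SUCCEEDED", "duration_seconds": 3.2},
      "embedder": {"status": "SUCCEEDED", "duration_seconds": 5.1},
      "writer":   {"status": "SUCCEEDED", "duration_seconds": 4.0}
    }
  },
  "result": {"total_upserted": 42, "vector_db_provider": "qdrant"}
}
""")

steps = summary["pipeline"]["steps"]
failed = [name for name, s in steps.items() if s["status"] != "SUCCEEDED"]
# Only trust the upsert count if every step succeeded.
total_upserted = summary["result"]["total_upserted"] if not failed else 0
```

The same check works on the real dataset item fetched via the Apify API.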
Pricing
The orchestrator charges $0.005 per pipeline run ($5.00 per 1,000 runs). Sub-actors charge separately:
| Actor | Rate |
|---|---|
| RAG Content Chunker | $0.0005/chunk |
| RAG Embedding Generator | $0.0003/embedding |
| RAG Vector Store Writer | $0.0004/vector |
You also pay the embedding provider (OpenAI/Cohere) and vector DB provider (Pinecone/Qdrant) at their standard rates.
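Using the rates above, the Apify-side cost of a run can be estimated up front, assuming one embedding and one vector per chunk. A sketch (function name and assumptions are ours; provider fees are excluded):

```python
def pipeline_cost(chunks: int,
                  run_fee: float = 0.005,      # orchestrator, per run
                  chunk_rate: float = 0.0005,  # RAG Content Chunker, per chunk
                  embed_rate: float = 0.0003,  # RAG Embedding Generator, per embedding
                  write_rate: float = 0.0004   # RAG Vector Store Writer, per vector
                  ) -> float:
    """Estimated Apify-side cost in USD for one pipeline run,
    assuming one embedding and one upserted vector per chunk.
    Excludes OpenAI/Cohere and Pinecone/Qdrant charges."""
    return round(run_fee + chunks * (chunk_rate + embed_rate + write_rate), 6)
```

For example, a 100-chunk document costs $0.005 + 100 × $0.0012 = $0.125 on the Apify side.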
Example: Quick Start with Qdrant
Option A β direct text:
{"text": "Your document content goes here...","chunking_strategy": "recursive","chunk_size": 512,"embedding_api_key": "sk-...","embedding_provider": "openai","embedding_model": "text-embedding-3-small","vector_db_api_key": "your-qdrant-key","vector_db_provider": "qdrant","index_name": "my-rag-collection","qdrant_url": "https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"}
Option B β file upload (Apify Storage, S3, or any public HTTPS URL):
{"file_url": "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.md","chunking_strategy": "markdown","chunk_size": 512,"embedding_api_key": "sk-...","embedding_provider": "openai","embedding_model": "text-embedding-3-small","vector_db_api_key": "your-qdrant-key","vector_db_provider": "qdrant","index_name": "my-rag-collection","qdrant_url": "https://your-cluster.us-west-1.aws.cloud.qdrant.io:6333"}
Sub-Actors
- labrat011/rag-content-chunker (RAG Content Chunker)
- labrat011/rag-embedding-generator (RAG Embedding Generator)
- labrat011/rag-vector-store-writer (RAG Vector Store Writer)
Security
- API keys are validated for presence only and never logged
- Qdrant URLs are validated against the `cloud.qdrant.io` pattern (SSRF prevention)
- All string inputs are sanitized against control characters
- Dataset IDs and field names are validated with strict regex patterns
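The Qdrant URL check described above can be sketched as a host-pattern allowlist. This is a hypothetical re-implementation of the documented behavior, not the actor's code:

```python
import re

# Accept only HTTPS URLs whose host ends in .cloud.qdrant.io,
# with an optional port -- rejecting arbitrary hosts blocks SSRF.
QDRANT_URL_RE = re.compile(
    r"^https://[a-z0-9-]+(\.[a-z0-9-]+)*\.cloud\.qdrant\.io(:\d+)?$",
    re.IGNORECASE,
)

def is_allowed_qdrant_url(url: str) -> bool:
    """True only for Qdrant Cloud cluster URLs."""
    return QDRANT_URL_RE.match(url) is not None
```

Anchoring the pattern at both ends matters: without the trailing `$`, a host like `cloud.qdrant.io.evil.com` would slip through.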
License
MIT
MCP Integration
This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.
- Endpoint: `https://mcp.apify.com?tools=labrat011/rag-pipeline`
- Auth: `Authorization: Bearer <APIFY_TOKEN>`
- Transport: Streamable HTTP
- Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI
Example MCP config (Claude Desktop / Cursor):

```json
{
  "mcpServers": {
    "rag-pipeline": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-pipeline",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}
```
AI agents can use this actor to ingest text into a vector database, build RAG knowledge bases, and set up retrieval-augmented generation pipelines -- all as a single callable MCP tool.