Rag Embedding Generator
Apify Actor that generates vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains directly with RAG Content Chunker or any crawler output. Outputs flat embedding objects with pass-through metadata, ready for any vector database. No vendor lock-in. MCP-ready for AI agent integration.
Features
- Two embedding providers: OpenAI (text-embedding-3-small/large, ada-002) and Cohere (embed-english/multilingual-v3.0, light variants)
- Three input modes: single text, text list, or dataset chaining from any previous actor
- Pass-through metadata from RAG Content Chunker (chunk_id, source_url, page_title, section_heading)
- Batched API requests for throughput (up to 2048 texts per OpenAI call, 96 per Cohere call)
- Exponential backoff retry on rate limits and transient failures (3 attempts; see the sketch after this list)
- API key marked `isSecret` -- never logged, never stored, never included in output
- Hardcoded API base URLs to prevent SSRF attacks
- Input validation and sanitization (key format checks, dataset ID regex, text length limits)
- Output: raw float arrays compatible with any vector DB (Pinecone, Qdrant, Weaviate, Chroma, etc.)
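The batching and retry behavior above can be pictured roughly as follows. This is a simplified sketch of the pattern, not the actor's actual embedder code; the function names and delay values are illustrative:

```python
import random
import time

def embed_with_retry(call_api, texts, max_attempts=3):
    """Send one batch to the provider, retrying with exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_api(texts)  # one OpenAI or Cohere embeddings request
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt + random.random())  # 2s, 4s, ... plus jitter

def embed_in_batches(call_api, texts, batch_size=128):
    """Split the input into provider-sized batches (<= 2048 for OpenAI, <= 96 for Cohere)."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_with_retry(call_api, texts[start:start + batch_size]))
    return vectors
```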
Requirements
- Python 3.11+
- Apify platform account (for running as Actor)
- OpenAI or Cohere API key
Install dependencies:
$ pip install -r requirements.txt
Configuration
Actor Inputs
Defined in .actor/INPUT_SCHEMA.json:
- `api_key` (string, required) -- your OpenAI or Cohere API key. Marked `isSecret`
- `provider` (string, optional) -- `"openai"` (default) or `"cohere"`
- `model` (string, optional) -- embedding model to use. Default: `"text-embedding-3-small"`
- `text` (string, optional) -- a single text string to embed, max 100,000 characters
- `texts` (array, optional) -- a list of text strings to embed, max 10,000 items
- `dataset_id` (string, optional) -- Apify dataset ID from a previous actor run (e.g., RAG Content Chunker). Takes priority over `text`/`texts`
- `dataset_field` (string, optional) -- field to read from each dataset item. Default: `"text"`. Supports dot notation (see the sketch below)
- `batch_size` (integer, optional) -- texts per API request. Default: 128. Max: 2048 (OpenAI) or 96 (Cohere)
- `include_text` (boolean, optional) -- include original text in output. Default: false

At least one of `text`, `texts`, or `dataset_id` must be provided, plus `api_key`.
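Dot-notation lookup for `dataset_field` might be resolved along these lines; this is an illustrative sketch (the helper name `resolve_field` is hypothetical), not the actor's actual code:

```python
def resolve_field(item: dict, path: str, default=None):
    """Walk a dot-notation path such as "metadata.text" through a dataset item."""
    value = item
    for key in path.split("."):
        if isinstance(value, dict) and key in value:
            value = value[key]
        else:
            return default
    return value

# dataset_field = "metadata.text"
item = {"metadata": {"text": "chunk body"}, "chunk_id": "a1b2"}
print(resolve_field(item, "metadata.text"))  # -> "chunk body"
```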
Supported Models
| Provider | Model | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | Default. Cheapest, good quality |
| OpenAI | text-embedding-3-large | 3072 | Best quality, higher cost |
| OpenAI | text-embedding-ada-002 | 1536 | Legacy, widely deployed |
| Cohere | embed-english-v3.0 | 1024 | English-optimized |
| Cohere | embed-multilingual-v3.0 | 1024 | 100+ languages |
| Cohere | embed-english-light-v3.0 | 384 | Faster, smaller vectors |
| Cohere | embed-multilingual-light-v3.0 | 384 | Faster, multilingual |
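When creating the target collection in your vector database, its vector size must match the dimensions column above. A small lookup like the following (an illustrative helper, not part of the actor) keeps that mapping in one place:

```python
# Model -> embedding dimensions, taken from the table above.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "embed-english-v3.0": 1024,
    "embed-multilingual-v3.0": 1024,
    "embed-english-light-v3.0": 384,
    "embed-multilingual-light-v3.0": 384,
}

def dimensions_for(model: str) -> int:
    """Vector size to use when creating a collection for the given model."""
    if model not in MODEL_DIMENSIONS:
        raise ValueError(f"Unknown embedding model: {model}")
    return MODEL_DIMENSIONS[model]
```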
Usage
Local (CLI)
$ APIFY_TOKEN=your-token apify run
Single Text Input
{"api_key": "sk-your-openai-key","provider": "openai","model": "text-embedding-3-small","text": "This is a sample text to embed into a vector representation."}
Text List Input
{"api_key": "sk-your-openai-key","texts": ["First document to embed.","Second document to embed.","Third document to embed."]}
Dataset Chaining (from RAG Content Chunker)
{"api_key": "sk-your-openai-key","dataset_id": "abc123XYZ","dataset_field": "text","model": "text-embedding-3-small","batch_size": 256}
Example Output
Each embedding is a separate dataset item:
{"index": 0,"embedding": [0.0123, -0.0456, 0.0789, "...1536 floats total"],"dimensions": 1536,"token_count": 12,"chunk_id": "a1b2c3d4e5f67890","source_url": "https://example.com/page","page_title": "Example Page","section_heading": "Introduction"}
A summary item is appended at the end:
{"_summary": true,"total_embeddings": 42,"total_tokens": 8374,"provider": "openai","model": "text-embedding-3-small","dimensions": 1536,"processing_time": 3.241,"billing": {"total_embeddings": 42,"amount": 0.0126,"rate_per_embedding": 0.0003}}
Pipeline Position
This actor fills the embedding step in a standard RAG pipeline:
Crawl (Website Content Crawler, 101K+ users) -> Clean (optional preprocessing) -> Chunk (RAG Content Chunker) -> Embed (this actor) -> Store (Pinecone, Qdrant, Weaviate integrations)
Chaining with RAG Content Chunker
- Run RAG Content Chunker on your text or crawler output
- Copy the output dataset ID from the chunker run
- Pass it as `dataset_id` to this actor
- This actor reads each chunk, skips `_summary` rows, and passes through `chunk_id`, `source_url`, `page_title`, and `section_heading` metadata
The output vectors include all the metadata needed to store them in a vector database with proper source attribution.
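For example, the output dataset can be read with the Apify client and upserted into a vector store. A minimal sketch using Qdrant, assuming a collection named "docs" already exists with the matching vector size; the collection name and payload fields are illustrative choices, not part of the actor:

```python
from apify_client import ApifyClient
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

apify = ApifyClient("<APIFY_TOKEN>")
qdrant = QdrantClient(url="http://localhost:6333")

points = []
for item in apify.dataset("<embedding-dataset-id>").iterate_items():
    if item.get("_summary"):  # skip the trailing summary row
        continue
    points.append(PointStruct(
        id=item["index"],
        vector=item["embedding"],
        payload={
            "chunk_id": item.get("chunk_id"),
            "source_url": item.get("source_url"),
            "page_title": item.get("page_title"),
            "section_heading": item.get("section_heading"),
        },
    ))

qdrant.upsert(collection_name="docs", points=points)
```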
Architecture
- `src/agent/main.py` -- Actor entry point, input routing (text/texts/dataset), dataset loading, output
- `src/agent/embedder.py` -- Core embedding engine, OpenAI + Cohere API calls, batching, retry logic
- `src/agent/validation.py` -- Input validation, API key format checks, provider/model whitelist, sanitization
- `src/agent/pricing.py` -- PPE billing calculator ($0.0003/embedding)
- `skill.md` -- Machine-readable skill contract for agent discovery
Security
- API key handling: Marked `isSecret` in the input schema, validated for format only, never logged or stored, stripped from error messages
- SSRF prevention: Outbound requests are hardcoded to `api.openai.com` and `api.cohere.ai` only -- no user-supplied URLs
- Provider/model whitelist: Only known provider+model combinations are accepted, preventing arbitrary endpoint injection
- Input sanitization: Control characters stripped, dataset IDs and field names regex-validated, text length bounded
- Error safety: All error messages pass through `_sanitize_error()` to ensure API keys are never leaked in logs or output (illustrative sketch below)
- No data retention: Texts and embeddings exist only in memory during the run
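The actor's actual `_sanitize_error()` is not reproduced here, but key redaction of this kind typically looks something like the following sketch; the patterns and replacement text are assumptions for illustration only:

```python
import re

# Patterns that resemble OpenAI or Cohere API keys (illustrative, not exhaustive).
_KEY_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{10,}"),  # OpenAI-style keys
    re.compile(r"\b[A-Za-z0-9]{40}\b"),    # generic 40-character tokens
]

def sanitize_error(message: str) -> str:
    """Redact anything that looks like an API key before it reaches logs or output."""
    for pattern in _KEY_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    return message

print(sanitize_error("401 Unauthorized for key sk-abc123def456ghi789"))
# -> "401 Unauthorized for key [REDACTED]"
```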
Pricing
Pay-Per-Event (PPE): $0.0003 per embedding ($0.30 per 1,000 embeddings).
This is the actor's platform fee only. You also pay the embedding provider (OpenAI or Cohere) directly via your own API key.
| Content Size | Approx. Embeddings | Actor Fee | Provider Fee (OpenAI 3-small) |
|---|---|---|---|
| Single blog post | 10-20 | $0.003-$0.006 | ~$0.001 |
| 10-page website | 50-100 | $0.015-$0.03 | ~$0.005 |
| 100-page docs site | 500-1,000 | $0.15-$0.30 | ~$0.05 |
| Large knowledge base | 5,000-10,000 | $1.50-$3.00 | ~$0.50 |
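As a quick sanity check on the table, the actor fee is simply the embedding count times the PPE rate; the provider fee depends on token volume and your provider's current per-token price, which you should plug in yourself (none is assumed here):

```python
ACTOR_RATE_PER_EMBEDDING = 0.0003  # USD, the PPE rate quoted above

def estimate_actor_fee(num_embeddings: int) -> float:
    """Platform fee for this actor: embedding count times the PPE rate."""
    return num_embeddings * ACTOR_RATE_PER_EMBEDDING

def estimate_provider_fee(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Provider-side fee; pass your provider's current embedding price per million tokens."""
    return total_tokens * usd_per_million_tokens / 1_000_000

# Example: a 100-page docs site chunked into ~750 pieces
print(estimate_actor_fee(750))  # -> 0.225, inside the $0.15-$0.30 row above
```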
Troubleshooting
- "API key is required": Provide your OpenAI or Cohere API key in the
api_keyfield - "Invalid OpenAI API key format": OpenAI keys start with
sk-followed by alphanumeric characters - "Invalid model for provider": Check the supported models table above. Model names are case-sensitive
- "No input provided": Supply at least one of
text,texts, ordataset_id - "Text exceeds maximum length": Individual texts are limited to 100K characters. Use
textsordataset_idfor bulk - "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 characters
- "API key is invalid or expired": Your provider API key was rejected. Verify it in your OpenAI/Cohere dashboard
- "Failed after 3 attempts": Transient API error. Try again, or reduce
batch_sizeif hitting rate limits - Dataset errors: Verify the dataset ID exists and the actor has access to it
License
See LICENSE file for details.
MCP Integration
This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.
- Endpoint: https://mcp.apify.com?tools=labrat011/rag-embedding-generator
- Auth: `Authorization: Bearer <APIFY_TOKEN>`
- Transport: Streamable HTTP
- Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI
Example MCP config (Claude Desktop / Cursor):
{"mcpServers": {"rag-embedding-generator": {"url": "https://mcp.apify.com?tools=labrat011/rag-embedding-generator","headers": {"Authorization": "Bearer <APIFY_TOKEN>"}}}}
AI agents can use this actor to generate vector embeddings from text using OpenAI or Cohere, embed chunked documents, and prepare data for vector database storage -- all as a callable MCP tool.