Rag Embedding Generator

Apify Actor that generates vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains directly with RAG Content Chunker or any crawler output. Outputs flat embedding objects with pass-through metadata, ready for any vector database. No vendor lock-in. MCP-ready for AI agent integration.

Features

  • Two embedding providers: OpenAI (text-embedding-3-small/large, ada-002) and Cohere (embed-english/multilingual-v3.0, light variants)
  • Three input modes: single text, text list, or dataset chaining from any previous actor
  • Pass-through metadata from RAG Content Chunker (chunk_id, source_url, page_title, section_heading)
  • Batched API requests for throughput (up to 2048 texts per OpenAI call, 96 per Cohere call)
  • Exponential backoff retry on rate limits and transient failures (3 attempts)
  • API key marked isSecret -- never logged, never stored, never included in output
  • Hardcoded API base URLs to prevent SSRF attacks
  • Input validation and sanitization (key format checks, dataset ID regex, text length limits)
  • Output: raw float arrays compatible with any vector DB (Pinecone, Qdrant, Weaviate, Chroma, etc.)

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)
  • OpenAI or Cohere API key

Install dependencies:

$ pip install -r requirements.txt

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • api_key (string, required) -- your OpenAI or Cohere API key. Marked isSecret
  • provider (string, optional) -- "openai" (default) or "cohere"
  • model (string, optional) -- embedding model to use. Default: "text-embedding-3-small"
  • text (string, optional) -- a single text string to embed, max 100,000 characters
  • texts (array, optional) -- a list of text strings to embed, max 10,000 items
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., RAG Content Chunker). Takes priority over text/texts
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
  • batch_size (integer, optional) -- texts per API request. Default: 128. Max: 2048 (OpenAI) or 96 (Cohere)
  • include_text (boolean, optional) -- include original text in output. Default: false

At least one of text, texts, or dataset_id must be provided, along with api_key.
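
The dot notation accepted by dataset_field lets you read nested values out of dataset items. A minimal sketch of how such a lookup behaves (the helper below is illustrative only, not the actor's actual implementation):

def resolve_field(item: dict, path: str):
    # Walk a dot-separated path such as "metadata.text" through nested dicts.
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # missing fields are treated as absent
        value = value[key]
    return value

item = {"metadata": {"text": "Chunked content here"}, "source_url": "https://example.com/page"}
resolve_field(item, "metadata.text")  # -> "Chunked content here"
resolve_field(item, "source_url")     # -> "https://example.com/page"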

Supported Models

Provider | Model                         | Dimensions | Notes
OpenAI   | text-embedding-3-small        | 1536       | Default. Cheapest, good quality
OpenAI   | text-embedding-3-large        | 3072       | Best quality, higher cost
OpenAI   | text-embedding-ada-002        | 1536       | Legacy, widely deployed
Cohere   | embed-english-v3.0            | 1024       | English-optimized
Cohere   | embed-multilingual-v3.0       | 1024       | 100+ languages
Cohere   | embed-english-light-v3.0      | 384        | Faster, smaller vectors
Cohere   | embed-multilingual-light-v3.0 | 384        | Faster, multilingual

Usage

Local (CLI)

$ APIFY_TOKEN=your-token apify run

Single Text Input

{
  "api_key": "sk-your-openai-key",
  "provider": "openai",
  "model": "text-embedding-3-small",
  "text": "This is a sample text to embed into a vector representation."
}

Text List Input

{
  "api_key": "sk-your-openai-key",
  "texts": [
    "First document to embed.",
    "Second document to embed.",
    "Third document to embed."
  ]
}

Dataset Chaining (from RAG Content Chunker)

{
  "api_key": "sk-your-openai-key",
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "model": "text-embedding-3-small",
  "batch_size": 256
}
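
The same inputs can also be passed programmatically. A minimal sketch assuming the apify-client Python package and your Apify API token; the actor ID below is taken from the MCP endpoint shown later and may need adjusting to the actor as published in your account:

from apify_client import ApifyClient

client = ApifyClient("your-apify-token")

# Start a run and wait for it to finish; run_input mirrors the JSON examples above.
run = client.actor("labrat011/rag-embedding-generator").call(
    run_input={
        "api_key": "sk-your-openai-key",
        "provider": "openai",
        "model": "text-embedding-3-small",
        "text": "This is a sample text to embed into a vector representation.",
    }
)

print(run["defaultDatasetId"])  # dataset containing the embedding items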

Example Output

Each embedding is a separate dataset item:

{
  "index": 0,
  "embedding": [0.0123, -0.0456, 0.0789, "...1536 floats total"],
  "dimensions": 1536,
  "token_count": 12,
  "chunk_id": "a1b2c3d4e5f67890",
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_embeddings": 42,
  "total_tokens": 8374,
  "provider": "openai",
  "model": "text-embedding-3-small",
  "dimensions": 1536,
  "processing_time": 3.241,
  "billing": {
    "total_embeddings": 42,
    "amount": 0.0126,
    "rate_per_embedding": 0.0003
  }
}
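
A short sketch (again assuming the apify-client Python package) of reading the output dataset and separating embedding rows from the trailing summary row:

from apify_client import ApifyClient

client = ApifyClient("your-apify-token")
items = client.dataset("your-output-dataset-id").list_items().items

# Split embedding rows from the appended summary item.
rows = [item for item in items if not item.get("_summary")]
summary = next((item for item in items if item.get("_summary")), None)

for row in rows:
    vector = row["embedding"]
    metadata = {key: row.get(key) for key in ("chunk_id", "source_url", "page_title", "section_heading")}
    # hand (vector, metadata) to the vector database client of your choice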

Pipeline Position

This actor fills the embedding step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
-> Clean (optional preprocessing)
-> Chunk (RAG Content Chunker)
-> Embed (this actor)
-> Store (Pinecone, Qdrant, Weaviate integrations)

Chaining with RAG Content Chunker

  1. Run RAG Content Chunker on your text or crawler output
  2. Copy the output dataset ID from the chunker run
  3. Pass it as dataset_id to this actor
  4. This actor reads each chunk, skips _summary rows, and passes through chunk_id, source_url, page_title, and section_heading metadata

The output vectors include all the metadata needed to store them in a vector database with proper source attribution.
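
Programmatically, the chaining looks like this. A sketch assuming the apify-client Python package; the chunker's actor ID and its run input are placeholders, so substitute the RAG Content Chunker actor and input you actually use:

from apify_client import ApifyClient

client = ApifyClient("your-apify-token")

# Steps 1-2: run the chunker (placeholder actor ID and input) and take its output dataset ID.
chunker_run = client.actor("<rag-content-chunker-actor-id>").call(
    run_input={"text": "...your long document or crawler output..."}
)
chunk_dataset_id = chunker_run["defaultDatasetId"]

# Steps 3-4: feed that dataset straight into this actor; chunk metadata is passed through.
embed_run = client.actor("labrat011/rag-embedding-generator").call(
    run_input={
        "api_key": "sk-your-openai-key",
        "dataset_id": chunk_dataset_id,
        "dataset_field": "text",
    }
)
print(embed_run["defaultDatasetId"])  # embeddings with chunk metadata attached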

Architecture

  • src/agent/main.py -- Actor entry point, input routing (text/texts/dataset), dataset loading, output
  • src/agent/embedder.py -- Core embedding engine, OpenAI + Cohere API calls, batching, retry logic (sketched after this list)
  • src/agent/validation.py -- Input validation, API key format checks, provider/model whitelist, sanitization
  • src/agent/pricing.py -- PPE billing calculator ($0.0003/embedding)
  • skill.md -- Machine-readable skill contract for agent discovery
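
The batching and exponential-backoff behaviour described above can be pictured as follows. This is an outline of the approach, not the actor's actual embedder.py; it assumes the public OpenAI embeddings REST endpoint and the requests library:

import time
import requests

OPENAI_URL = "https://api.openai.com/v1/embeddings"  # hardcoded, no user-supplied URLs

def embed_batched(texts, api_key, model="text-embedding-3-small", batch_size=128, attempts=3):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(attempts):
            resp = requests.post(
                OPENAI_URL,
                headers={"Authorization": f"Bearer {api_key}"},
                json={"model": model, "input": batch},
                timeout=60,
            )
            if resp.status_code == 200:
                vectors.extend(d["embedding"] for d in resp.json()["data"])
                break
            if resp.status_code in (429, 500, 502, 503) and attempt < attempts - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
                continue
            raise RuntimeError(f"Embedding request failed with status {resp.status_code} after {attempt + 1} attempt(s)")
    return vectors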

Security

  • API key handling: Marked isSecret in input schema, validated for format only, never logged or stored, stripped from error messages
  • SSRF prevention: Outbound requests hardcoded to api.openai.com and api.cohere.ai only -- no user-supplied URLs
  • Provider/model whitelist: Only known provider+model combinations accepted, prevents arbitrary endpoint injection
  • Input sanitization: Control characters stripped, dataset IDs and field names regex-validated, text length bounded
  • Error safety: All error messages pass through _sanitize_error() to ensure API keys are never leaked in logs or output (illustrated after this list)
  • No data retention: Texts and embeddings exist only in memory during the run
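
The error-sanitization step can be pictured as a redaction pass applied before anything is logged or re-raised. A hypothetical sketch; the real _sanitize_error() and its key patterns may differ:

import re

# Redact anything that looks like an OpenAI key (sk-...) or a long opaque token (a guess at Cohere key shape).
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{10,}|[A-Za-z0-9]{40,}")

def sanitize_error(message: str) -> str:
    return _KEY_PATTERN.sub("[REDACTED]", message)

sanitize_error("Auth failed for key sk-abc123def456ghi789")
# -> "Auth failed for key [REDACTED]"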

Pricing

Pay-Per-Event (PPE): $0.0003 per embedding ($0.30 per 1,000 embeddings).

This is the actor's platform fee only. You also pay the embedding provider (OpenAI or Cohere) directly via your own API key.

Content Size         | Approx. Embeddings | Actor Fee     | Provider Fee (OpenAI 3-small)
Single blog post     | 10-20              | $0.003-$0.006 | ~$0.001
10-page website      | 50-100             | $0.015-$0.03  | ~$0.005
100-page docs site   | 500-1,000          | $0.15-$0.30   | ~$0.05
Large knowledge base | 5,000-10,000       | $1.50-$3.00   | ~$0.50
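
The actor fee is a straight multiplication, as a quick sanity check against the billing block in the example summary item above:

RATE_PER_EMBEDDING = 0.0003  # USD, actor platform fee

def actor_fee(num_embeddings: int) -> float:
    return num_embeddings * RATE_PER_EMBEDDING

actor_fee(42)     # $0.0126 -- matches the example summary's billing block
actor_fee(1_000)  # $0.30   -- matches the "$0.30 per 1,000 embeddings" rate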

Troubleshooting

  • "API key is required": Provide your OpenAI or Cohere API key in the api_key field
  • "Invalid OpenAI API key format": OpenAI keys start with sk- followed by alphanumeric characters
  • "Invalid model for provider": Check the supported models table above. Model names are case-sensitive
  • "No input provided": Supply at least one of text, texts, or dataset_id
  • "Text exceeds maximum length": Individual texts are limited to 100K characters. Use texts or dataset_id for bulk
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 characters
  • "API key is invalid or expired": Your provider API key was rejected. Verify it in your OpenAI/Cohere dashboard
  • "Failed after 3 attempts": Transient API error. Try again, or reduce batch_size if hitting rate limits
  • Dataset errors: Verify the dataset ID exists and the actor has access to it

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-embedding-generator
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "rag-embedding-generator": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-embedding-generator",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to generate vector embeddings from text using OpenAI or Cohere, embed chunked documents, and prepare data for vector database storage -- all as a callable MCP tool.