Rag Embedding Generator

Apify Actor that generates vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains directly with RAG Content Chunker or any crawler output. Outputs flat embedding objects with pass-through metadata, ready for any vector database. No vendor lock-in. MCP-ready for AI agent integration.

Features

  • Two embedding providers: OpenAI (text-embedding-3-small/large, ada-002) and Cohere (embed-english/multilingual-v3.0, light variants)
  • Three input modes: single text, text list, or dataset chaining from any previous actor
  • Pass-through metadata from RAG Content Chunker (chunk_id, source_url, page_title, section_heading)
  • Batched API requests for throughput (up to 2048 texts per OpenAI call, 96 per Cohere call)
  • Exponential backoff retry on rate limits and transient failures (3 attempts)
  • API key marked isSecret -- never logged, never stored, never included in output
  • Hardcoded API base URLs to prevent SSRF attacks
  • Input validation and sanitization (key format checks, dataset ID regex, text length limits)
  • Output: raw float arrays compatible with any vector DB (Pinecone, Qdrant, Weaviate, Chroma, etc.)

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)
  • OpenAI or Cohere API key

Install dependencies:

$ pip install -r requirements.txt

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • api_key (string, required) -- your OpenAI or Cohere API key. Marked isSecret
  • provider (string, optional) -- "openai" (default) or "cohere"
  • model (string, optional) -- embedding model to use. Default: "text-embedding-3-small"
  • text (string, optional) -- a single text string to embed, max 100,000 characters
  • texts (array, optional) -- a list of text strings to embed, max 10,000 items
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., RAG Content Chunker). Takes priority over text/texts
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
  • batch_size (integer, optional) -- texts per API request. Default: 128. Max: 2048 (OpenAI) or 96 (Cohere)
  • include_text (boolean, optional) -- include original text in output. Default: false

At least one of text, texts, or dataset_id must be provided, along with api_key.
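
The dot notation accepted by dataset_field lets you read nested values out of dataset items. A minimal sketch of how such a lookup behaves (the helper below is illustrative only, not the actor's actual implementation):

def resolve_field(item: dict, path: str):
    # Walk a dot-separated path such as "metadata.text" through nested dicts.
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # missing fields are treated as absent
        value = value[key]
    return value

item = {"metadata": {"text": "Chunked content here"}, "source_url": "https://example.com/page"}
resolve_field(item, "metadata.text")  # -> "Chunked content here"
resolve_field(item, "source_url")     # -> "https://example.com/page"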

Supported Models

Provider | Model                         | Dimensions | Notes
OpenAI   | text-embedding-3-small        | 1536       | Default. Cheapest, good quality
OpenAI   | text-embedding-3-large        | 3072       | Best quality, higher cost
OpenAI   | text-embedding-ada-002        | 1536       | Legacy, widely deployed
Cohere   | embed-english-v3.0            | 1024       | English-optimized
Cohere   | embed-multilingual-v3.0       | 1024       | 100+ languages
Cohere   | embed-english-light-v3.0      | 384        | Faster, smaller vectors
Cohere   | embed-multilingual-light-v3.0 | 384        | Faster, multilingual

Usage

Local (CLI)

$ APIFY_TOKEN=your-token apify run

Single Text Input

{
  "api_key": "sk-your-openai-key",
  "provider": "openai",
  "model": "text-embedding-3-small",
  "text": "This is a sample text to embed into a vector representation."
}

Text List Input

{
  "api_key": "sk-your-openai-key",
  "texts": [
    "First document to embed.",
    "Second document to embed.",
    "Third document to embed."
  ]
}

Dataset Chaining (from RAG Content Chunker)

{
  "api_key": "sk-your-openai-key",
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "model": "text-embedding-3-small",
  "batch_size": 256
}
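
The same inputs can also be passed programmatically. A minimal sketch assuming the apify-client Python package and your Apify API token; the actor ID below is taken from the MCP endpoint shown later and may need adjusting to the actor as published in your account:

from apify_client import ApifyClient

client = ApifyClient("your-apify-token")

# Start a run and wait for it to finish; run_input mirrors the JSON examples above.
run = client.actor("labrat011/rag-embedding-generator").call(
    run_input={
        "api_key": "sk-your-openai-key",
        "provider": "openai",
        "model": "text-embedding-3-small",
        "text": "This is a sample text to embed into a vector representation.",
    }
)

print(run["defaultDatasetId"])  # dataset containing the embedding items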

Example Output

Each embedding is a separate dataset item:

{
  "index": 0,
  "embedding": [0.0123, -0.0456, 0.0789, "...1536 floats total"],
  "dimensions": 1536,
  "token_count": 12,
  "chunk_id": "a1b2c3d4e5f67890",
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_embeddings": 42,
  "total_tokens": 8374,
  "provider": "openai",
  "model": "text-embedding-3-small",
  "dimensions": 1536,
  "processing_time": 3.241,
  "billing": {
    "total_embeddings": 42,
    "amount": 0.0126,
    "rate_per_embedding": 0.0003
  }
}
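
A short sketch (again assuming the apify-client Python package) of reading the output dataset and separating embedding rows from the trailing summary row:

from apify_client import ApifyClient

client = ApifyClient("your-apify-token")
items = client.dataset("your-output-dataset-id").list_items().items

# Split embedding rows from the appended summary item.
rows = [item for item in items if not item.get("_summary")]
summary = next((item for item in items if item.get("_summary")), None)

for row in rows:
    vector = row["embedding"]
    metadata = {key: row.get(key) for key in ("chunk_id", "source_url", "page_title", "section_heading")}
    # hand (vector, metadata) to the vector database client of your choice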

Pipeline Position

This actor fills the embedding step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
-> Clean (optional preprocessing)
-> Chunk (RAG Content Chunker)
-> Embed (this actor)
-> Store (Pinecone, Qdrant, Weaviate integrations)

Chaining with RAG Content Chunker

  1. Run RAG Content Chunker on your text or crawler output
  2. Copy the output dataset ID from the chunker run
  3. Pass it as dataset_id to this actor
  4. This actor reads each chunk, skips _summary rows, and passes through chunk_id, source_url, page_title, and section_heading metadata

The output vectors include all the metadata needed to store them in a vector database with proper source attribution.
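
Programmatically, the chaining looks like this. A sketch assuming the apify-client Python package; the chunker's actor ID and its run input are placeholders, so substitute the RAG Content Chunker actor and input you actually use:

from apify_client import ApifyClient

client = ApifyClient("your-apify-token")

# Steps 1-2: run the chunker (placeholder actor ID and input) and take its output dataset ID.
chunker_run = client.actor("<rag-content-chunker-actor-id>").call(
    run_input={"text": "...your long document or crawler output..."}
)
chunk_dataset_id = chunker_run["defaultDatasetId"]

# Steps 3-4: feed that dataset straight into this actor; chunk metadata is passed through.
embed_run = client.actor("labrat011/rag-embedding-generator").call(
    run_input={
        "api_key": "sk-your-openai-key",
        "dataset_id": chunk_dataset_id,
        "dataset_field": "text",
    }
)
print(embed_run["defaultDatasetId"])  # embeddings with chunk metadata attached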

Architecture

  • src/agent/main.py -- Actor entry point, input routing (text/texts/dataset), dataset loading, output
  • src/agent/embedder.py -- Core embedding engine, OpenAI + Cohere API calls, batching, retry logic (sketched after this list)
  • src/agent/validation.py -- Input validation, API key format checks, provider/model whitelist, sanitization
  • src/agent/pricing.py -- PPE billing calculator ($0.0003/embedding)
  • skill.md -- Machine-readable skill contract for agent discovery
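
The batching and exponential-backoff behaviour described above can be pictured as follows. This is an outline of the approach, not the actor's actual embedder.py; it assumes the public OpenAI embeddings REST endpoint and the requests library:

import time
import requests

OPENAI_URL = "https://api.openai.com/v1/embeddings"  # hardcoded, no user-supplied URLs

def embed_batched(texts, api_key, model="text-embedding-3-small", batch_size=128, attempts=3):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(attempts):
            resp = requests.post(
                OPENAI_URL,
                headers={"Authorization": f"Bearer {api_key}"},
                json={"model": model, "input": batch},
                timeout=60,
            )
            if resp.status_code == 200:
                vectors.extend(d["embedding"] for d in resp.json()["data"])
                break
            if resp.status_code in (429, 500, 502, 503) and attempt < attempts - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
                continue
            raise RuntimeError(f"Embedding request failed with status {resp.status_code} after {attempt + 1} attempt(s)")
    return vectors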

Security

  • API key handling: Marked isSecret in input schema, validated for format only, never logged or stored, stripped from error messages
  • SSRF prevention: Outbound requests hardcoded to api.openai.com and api.cohere.ai only -- no user-supplied URLs
  • Provider/model whitelist: Only known provider+model combinations accepted, prevents arbitrary endpoint injection
  • Input sanitization: Control characters stripped, dataset IDs and field names regex-validated, text length bounded
  • Error safety: All error messages pass through _sanitize_error() to ensure API keys are never leaked in logs or output (illustrated after this list)
  • No data retention: Texts and embeddings exist only in memory during the run
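
The error-sanitization step can be pictured as a redaction pass applied before anything is logged or re-raised. A hypothetical sketch; the real _sanitize_error() and its key patterns may differ:

import re

# Redact anything that looks like an OpenAI key (sk-...) or a long opaque token (a guess at Cohere key shape).
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{10,}|[A-Za-z0-9]{40,}")

def sanitize_error(message: str) -> str:
    return _KEY_PATTERN.sub("[REDACTED]", message)

sanitize_error("Auth failed for key sk-abc123def456ghi789")
# -> "Auth failed for key [REDACTED]"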

Pricing

Pay-Per-Event (PPE): $0.0003 per embedding ($0.30 per 1,000 embeddings).

This is the actor's platform fee only. You also pay the embedding provider (OpenAI or Cohere) directly via your own API key.

Content Size         | Approx. Embeddings | Actor Fee     | Provider Fee (OpenAI 3-small)
Single blog post     | 10-20              | $0.003-$0.006 | ~$0.001
10-page website      | 50-100             | $0.015-$0.03  | ~$0.005
100-page docs site   | 500-1,000          | $0.15-$0.30   | ~$0.05
Large knowledge base | 5,000-10,000       | $1.50-$3.00   | ~$0.50
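
The actor fee is a straight multiplication, as a quick sanity check against the billing block in the example summary item above:

RATE_PER_EMBEDDING = 0.0003  # USD, actor platform fee

def actor_fee(num_embeddings: int) -> float:
    return num_embeddings * RATE_PER_EMBEDDING

actor_fee(42)     # $0.0126 -- matches the example summary's billing block
actor_fee(1_000)  # $0.30   -- matches the "$0.30 per 1,000 embeddings" rate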

Troubleshooting

  • "API key is required": Provide your OpenAI or Cohere API key in the api_key field
  • "Invalid OpenAI API key format": OpenAI keys start with sk- followed by alphanumeric characters
  • "Invalid model for provider": Check the supported models table above. Model names are case-sensitive
  • "No input provided": Supply at least one of text, texts, or dataset_id
  • "Text exceeds maximum length": Individual texts are limited to 100K characters. Use texts or dataset_id for bulk
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 characters
  • "API key is invalid or expired": Your provider API key was rejected. Verify it in your OpenAI/Cohere dashboard
  • "Failed after 3 attempts": Transient API error. Try again, or reduce batch_size if hitting rate limits
  • Dataset errors: Verify the dataset ID exists and the actor has access to it

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-embedding-generator
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "rag-embedding-generator": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-embedding-generator",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to generate vector embeddings from text using OpenAI or Cohere, embed chunked documents, and prepare data for vector database storage -- all as a callable MCP tool.