Google Live Translate

Pricing

from $0.90 / successful api call

Apify Actor & MCP Server for real-time translation, transcription, and language detection using Google Gemini 3.5 Live Translate with emotional voice preservation.

Developer

Sergio Calvo

Maintained by Community

Google Live Translate - Apify Actor & MCP Server

Google Live Translate is an industrial-grade dual system written in Node.js and TypeScript. It natively integrates Google's official multimodal translation API to process text, audio files, and Google Meet recordings in real time while preserving the emotional intonation and voice characteristics of the original speaker.

The project offers two production-ready distribution interfaces:

  1. Apify Actor (/actor): Engineered for high-throughput batch processing, audio streaming, and large-scale subtitle generation.
  2. MCP Server (/mcp-server): A server compatible with Anthropic's Model Context Protocol (MCP) that exposes translation and language detection tools directly to AI agents such as Claude Desktop.

🎯 Target Audience & 💡 Primary Use Cases

Commercial Value & Primary Use Cases

  • Automated Content Localization: Automatically generate multi-language subtitles (SRT, VTT, JSON) for corporate videos, webinars, and tutorials in minutes.
  • International Meeting Auditing: Transcribe and translate sales or support calls in real-time, capturing emotional nuances and voice tone.
  • Machine Learning & Datasets: Process large volumes of audio files to compile clean datasets for AI training or sector-specific customer service analysis.

Target Audience

  • Data Analysts & Scientists: Retrieve clean JSON datasets containing exact timestamps (startMs, endMs), original transcriptions, translations, and accuracy scores (confidence).
  • Operations & Business Teams: Automate the translation of Google Meet recordings stored in Drive without writing code, improving global team collaboration.
  • AI Developers & Engineers: Seamlessly integrate audio translation with emotional voice cloning into local agent workflows via the MCP server.

⚙️ Key Features (What the Actor does)

  • Direct & Optimized API Calls: Connects natively to Google's official gemini-3.5-live-translate API via raw WebSockets.
  • Emotional Voice Style Preservation: Automatically detects tone, rhythm, and expressiveness when the preserveVoiceStyle: true setting is enabled.
  • Automatic Language Detection: Autodetects the source language (supporting 70+ languages under the BCP-47 standard) with a corresponding confidence metric.
  • Smart Audio Chunking: Processes large files by splitting audio into fixed-length segments (8 seconds by default) using a static FFmpeg binary, which prevents context limit errors and keeps timestamps precise.
  • Translated Audio Output: Captures the translated PCM audio streamed back from Gemini, concatenates all segments, and saves the final translated voice output as play-ready WAV and MP3 files.
  • Ultra-Fast Inactivity Latch: Implements a smart text-activity detector that closes the Bidi stream 4 seconds after transcription stops, avoiding the default 90-second socket timeout and reducing processing time by 90%.
  • Native Error Management: Instead of crashing on invalid inputs or network errors, it records a structured error payload in the dataset.
  • Flexible Export Formats: Outputs clean results in JSON, SRT, VTT, and plaintext formats.
  • Rate Limiting & Exponential Backoff: Built-in throttling at a maximum of 10 requests per second with automatic exponential retries for network drops or rate limits (HTTP 429).
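
The throttling and retry behavior described in the last bullet can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the helper names (`backoffDelayMs`, `minIntervalMs`, `withBackoff`), the 500 ms base delay, and the 30-second cap are all assumptions chosen for the example.

```typescript
// Hypothetical helpers: names and default values are assumptions for illustration.

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/** Minimum spacing between request starts for a given requests-per-second cap. */
function minIntervalMs(maxPerSecond: number): number {
  return 1000 / maxPerSecond;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Runs `task`, retrying with exponential delays while `isRetryable(error)` is true. */
async function withBackoff<T>(
  task: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxAttempts = 5,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Give up on the last attempt or on errors that retrying cannot fix.
      if (attempt + 1 >= maxAttempts || !isRetryable(error)) throw error;
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}
```

With a cap of 10 requests per second, `minIntervalMs(10)` yields a 100 ms spacing between request starts; a caller would typically treat HTTP 429 responses and dropped connections as retryable.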

Why Google Live Translate? (Competitive Advantage)

Unlike traditional translators that only process text and strip away the speaker's vocal characteristics, Google Live Translate merges acoustic transcription with a multimodal AI model. This setup delivers:

  1. An 85% reduction in latency compared to traditional cascaded pipelines (transcribe -> translate -> synthesize).
  2. True emotional voice style preservation (capturing humor, severity, or urgency) to improve empathy in automated customer service channels.
  3. Unique Technical Versatility: Runs on serverless cloud infrastructure (Apify) for large batch processing, or locally on a developer's machine (MCP) as an LLM utility extension.

⚙️ Input Schema

The Actor accepts the following parameters in its input form:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | string | Yes | text | Supported modes: text, audio_file, audio_base64, audio_url, meet_recording |
| targetLang | string | Yes | - | Target language code selected from a dropdown (e.g., es, fr, zh, en) |
| inputText | string | No | - | Plain text to translate (required if mode is text) |
| audioFile | string | No | - | Upload local audio file directly from your computer (required if mode is audio_file) |
| audioBase64 | string | No | - | Base64-encoded audio track (required if mode is audio_base64) |
| audioUrl | string | No | - | Public URL of the audio/video file, or Google Drive URL (for Meet recordings) |
| sourceLang | string | No | auto | BCP-47 source language code or auto for auto-detection (dropdown select) |
| preserveVoiceStyle | boolean | No | true | Preserve the speaker's original emotional tone and voice style |
| outputFormat | string | No | json | Format of the output: json, srt, vtt, plaintext |
| googleCloudApiKey | string | No | - | Google Cloud API Key. If omitted, the Actor attempts to use ADC or Service Account JSON |
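
As a rough illustration of how the mode-dependent fields fit together, the table can be expressed as a TypeScript type with a small validator. The `ActorInput` type and `validateInput` function are hypothetical names, and mapping both audio_url and meet_recording to `audioUrl` is an assumption inferred from the field descriptions rather than something the schema states outright.

```typescript
type Mode = "text" | "audio_file" | "audio_base64" | "audio_url" | "meet_recording";

interface ActorInput {
  mode: Mode;
  targetLang: string;
  inputText?: string;
  audioFile?: string;
  audioBase64?: string;
  audioUrl?: string;
  sourceLang?: string; // defaults to "auto"
  preserveVoiceStyle?: boolean; // defaults to true
  outputFormat?: "json" | "srt" | "vtt" | "plaintext"; // defaults to "json"
  googleCloudApiKey?: string;
}

// Field each mode depends on. The audio_url and meet_recording rows are assumptions.
const REQUIRED_FIELD: Record<Mode, keyof ActorInput> = {
  text: "inputText",
  audio_file: "audioFile",
  audio_base64: "audioBase64",
  audio_url: "audioUrl",
  meet_recording: "audioUrl",
};

/** Returns a list of human-readable problems; an empty list means the input looks usable. */
function validateInput(input: ActorInput): string[] {
  const errors: string[] = [];
  if (!input.targetLang) errors.push("targetLang is required");
  const field = REQUIRED_FIELD[input.mode];
  if (!input[field]) errors.push(`${field} is required when mode is "${input.mode}"`);
  return errors;
}

// Example input: translate a publicly hosted recording to Spanish and return SRT subtitles.
const exampleInput: ActorInput = {
  mode: "audio_url",
  targetLang: "es",
  audioUrl: "https://example.com/meeting.mp3",
  outputFormat: "srt",
};
```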

📊 Output Schema

Audio translations output a detailed JSON structure saved to the Apify Dataset:

{
  "translationId": "aud-xyz123456",
  "sourceLang": "en",
  "targetLang": "es",
  "detectedLang": "en",
  "inputType": "audio_url",
  "segments": [
    {
      "index": 0,
      "startMs": 0,
      "endMs": 8000,
      "originalText": "Good morning and welcome to our annual review meeting.",
      "translatedText": "Buenos días y bienvenidos a nuestra reunión de revisión anual.",
      "confidence": 0.98,
      "voiceStylePreserved": true
    }
  ],
  "metadata": {
    "durationMs": 8000,
    "wordCount": 11,
    "processingMs": 1120,
    "modelVersion": "gemini-3.5-live-translate"
  },
  "srtContent": "1\n00:00:00,000 --> 00:00:08,000\nBuenos días y bienvenidos a nuestra reunión de revisión anual.\n",
  "vttContent": "WEBVTT\n\n1\n00:00:00.000 --> 00:00:08.000\nBuenos días y bienvenidos a nuestra reunión de revisión anual.\n",
  "plaintextContent": "Buenos días y bienvenidos a nuestra reunión de revisión anual."
}
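
The `srtContent` and `vttContent` fields can be derived from `segments` alone. The sketch below shows one way to do that; `formatTimestamp`, `toSrt`, and `toVtt` are hypothetical helper names, and the Actor's own formatter may differ in details such as cue numbering or line wrapping.

```typescript
interface Segment {
  index: number;
  startMs: number;
  endMs: number;
  translatedText: string;
}

/** Formats milliseconds as HH:MM:SS<sep>mmm ("," for SRT, "." for WebVTT). */
function formatTimestamp(ms: number, sep: "," | "."): string {
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}${sep}${pad(millis, 3)}`;
}

/** Builds one cue per segment, numbered from 1, with cues separated by a blank line. */
function toCues(segments: Segment[], sep: "," | "."): string {
  return segments
    .map(
      (s) =>
        `${s.index + 1}\n` +
        `${formatTimestamp(s.startMs, sep)} --> ${formatTimestamp(s.endMs, sep)}\n` +
        `${s.translatedText}\n`,
    )
    .join("\n");
}

function toSrt(segments: Segment[]): string {
  return toCues(segments, ",");
}

function toVtt(segments: Segment[]): string {
  return `WEBVTT\n\n${toCues(segments, ".")}`;
}
```

Applied to the single segment in the sample output, these helpers reproduce the `srtContent` and `vttContent` strings shown there.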

🛠️ Detailed Architecture & How It Works

This Actor bridges local media files and the Gemini Multimodal Live API Bidi (bidirectional) WebSocket stream.

1. Audio Splitting and Preprocessing

When an audio/video file (or URL/uploaded file) is processed, the Actor uses a static binary of FFmpeg to:

  1. Split the file into small chunks (default: 8 seconds each) so that each request stays within the model's context limits and timestamps remain precise.
  2. Downsample and encode each audio chunk to 16kHz, mono, 16-bit signed little-endian PCM (s16le) format, which is the native input format expected by Gemini.
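
The two preprocessing steps above can be sketched as a chunk planner plus an FFmpeg argument builder. The function names are hypothetical and the exact flags the Actor passes are not documented here; the arguments below are one standard way to cut a time range and emit 16 kHz mono s16le PCM.

```typescript
interface ChunkPlan {
  index: number;
  startMs: number;
  endMs: number;
}

/** Splits a duration into consecutive chunks of at most `chunkMs` milliseconds. */
function planChunks(durationMs: number, chunkMs = 8000): ChunkPlan[] {
  const chunks: ChunkPlan[] = [];
  for (let startMs = 0, index = 0; startMs < durationMs; startMs += chunkMs, index++) {
    chunks.push({ index, startMs, endMs: Math.min(startMs + chunkMs, durationMs) });
  }
  return chunks;
}

/** FFmpeg arguments that extract one chunk as raw 16 kHz mono signed 16-bit little-endian PCM. */
function ffmpegArgsForChunk(inputPath: string, chunk: ChunkPlan, outputPath: string): string[] {
  return [
    "-ss", (chunk.startMs / 1000).toFixed(3), // seek to the chunk start (seconds)
    "-t", ((chunk.endMs - chunk.startMs) / 1000).toFixed(3), // chunk duration (seconds)
    "-i", inputPath,
    "-vn", // ignore any video stream
    "-ac", "1", // mono
    "-ar", "16000", // 16 kHz sample rate
    "-acodec", "pcm_s16le",
    "-f", "s16le", // raw PCM container
    outputPath,
  ];
}
```

For a 30-second file, `planChunks(30000)` yields four chunks, the last covering 24000 to 30000 ms; each argument list could then be passed to the FFmpeg binary through a child process.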

2. WebSocket Connection & Latch Handshake

For each chunk, the Actor opens a secure WebSocket connection to wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent.

  • Latch Mechanism: It sends the initial setup frame (the configuration block) and waits for the server's setupComplete confirmation. Audio data is streamed only after the latch completes, which prevents the common 1007 protocol errors.
  • Audio Streaming: Audio chunks are read in memory and sent as base64-encoded frames to the socket.
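
The latch can be modeled independently of the socket itself. In this sketch the `SetupLatch` class is hypothetical, `send` stands in for the WebSocket's send method, and the frame shapes (`setup`, `setupComplete`, `realtimeInput.mediaChunks`) are assumptions based on the publicly documented Live API message format; the Actor's actual frames may differ.

```typescript
type SendFn = (frame: string) => void;

/** Holds audio frames back until the server confirms the setup frame. */
class SetupLatch {
  private ready = false;
  private readonly pending: string[] = [];
  private readonly send: SendFn;

  constructor(send: SendFn, model: string) {
    this.send = send;
    // The setup frame is always the first thing on the wire (assumed shape).
    this.send(JSON.stringify({ setup: { model } }));
  }

  /** Sends a base64 PCM frame now if the latch is open, otherwise queues it. */
  sendAudio(base64Pcm: string): void {
    const frame = JSON.stringify({
      realtimeInput: {
        mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64Pcm }],
      },
    });
    if (this.ready) this.send(frame);
    else this.pending.push(frame);
  }

  /** Opens the latch on setupComplete and flushes queued audio in order. */
  handleServerMessage(raw: string): void {
    const message: unknown = JSON.parse(raw);
    const isObject = typeof message === "object" && message !== null;
    if (!this.ready && isObject && "setupComplete" in (message as object)) {
      this.ready = true;
      for (const frame of this.pending.splice(0)) this.send(frame);
    }
  }
}
```

Queuing rather than dropping early audio means callers can start feeding chunks immediately without tracking the handshake state themselves.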

3. Smart Transcription Inactivity Check

Unlike traditional text models, Gemini Live Translate keeps streaming real-time audio (including padding/silence) to keep the line active, which normally causes clients to hang until hitting the 90-second socket timeout.

  • Text-Activity Monitor: Our translator monitors incoming frames and keeps track of the last time actual transcription text was received.
  • Fast Exit: If 4 seconds pass without new transcription text after all audio chunks are sent, the socket is cleanly closed. This reduces processing time from 6 minutes to ~30 seconds for a 30-second audio track.
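
A minimal sketch of the inactivity check, with the clock injected so the logic is easy to test. The `InactivityMonitor` class is a hypothetical name, and measuring the 4-second window from the later of "last transcription text" and "all audio sent" is one interpretation of the rule described above.

```typescript
/** Decides when a stream can be closed because transcription text has stopped arriving. */
class InactivityMonitor {
  private lastActivityMs: number;
  private allAudioSent = false;
  private readonly idleMs: number;

  constructor(startMs: number, idleMs = 4000) {
    this.lastActivityMs = startMs;
    this.idleMs = idleMs;
  }

  /** Call whenever a frame containing actual transcription text arrives. */
  onTranscriptionText(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  /** Call once the final audio chunk has been written to the socket. */
  onAllAudioSent(nowMs: number): void {
    this.allAudioSent = true;
    this.lastActivityMs = Math.max(this.lastActivityMs, nowMs);
  }

  /** True once all audio is sent and no text has arrived for `idleMs`. */
  shouldClose(nowMs: number): boolean {
    return this.allAudioSent && nowMs - this.lastActivityMs >= this.idleMs;
  }
}
```

In practice a caller would poll `shouldClose(Date.now())` on a short timer and close the socket when it returns true; audio-only or silence frames never reset the timer because only text events call `onTranscriptionText`.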

4. Audio Recovery & MP3 Packaging

The Actor captures the base64-encoded translated PCM audio (audio/pcm;rate=24000) returned by Gemini.

  • It concatenates the raw buffers.
  • Prepends a standard 44-byte WAV header with the exact sample rate (24000 Hz) and PCM properties.
  • Encodes the WAV file into a compressed MP3 file (translated_output.mp3) using FFmpeg.
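
The concatenation and header steps can be sketched with typed arrays. `buildWavHeader` and `pcmToWav` are hypothetical names; the 24000 Hz rate comes from the MIME type above, while mono 16-bit samples are an assumption about the "PCM properties" the Actor writes.

```typescript
/** Builds the canonical 44-byte RIFF/WAVE header for uncompressed PCM data. */
function buildWavHeader(
  dataLength: number,
  sampleRate = 24000,
  channels = 1, // assumption: mono output
  bitsPerSample = 16, // assumption: 16-bit samples
): Uint8Array {
  const header = new Uint8Array(44);
  const view = new DataView(header.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) header[offset + i] = text.charCodeAt(i);
  };
  const blockAlign = channels * (bitsPerSample / 8);

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataLength, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // bytes per second
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataLength, true);
  return header;
}

/** Concatenates raw PCM buffers and prepends the WAV header. */
function pcmToWav(pcmChunks: Uint8Array[], sampleRate = 24000): Uint8Array {
  const dataLength = pcmChunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const wav = new Uint8Array(44 + dataLength);
  wav.set(buildWavHeader(dataLength, sampleRate), 0);
  let offset = 44;
  for (const chunk of pcmChunks) {
    wav.set(chunk, offset);
    offset += chunk.length;
  }
  return wav;
}
```

The resulting bytes can be written to disk as a .wav file and then handed to FFmpeg for MP3 encoding.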

🚀 Installation & Quick Start Guide

Deploying the Actor to Apify

  1. Log in to the Apify CLI: If you have the Apify CLI installed globally, run:
    $ apify login
  2. Push the Actor: Run the push command from the /actor directory:
    $ npx apify-cli push
    Note: The included .actorignore file automatically excludes local audio test files and compilations (dist/, .system_generated/, *.wav, *.mp3) to keep deployment packages small and fast.
  3. Configure Settings: In the Apify Console, set your GOOGLE_CLOUD_API_KEY under the Environment Variables section.

Local Development and Testing

To test the audio translation and MP3 voice generation locally:

  1. Build the TypeScript files:
    $ npm run build
  2. Run the test script with your API key (PowerShell syntax):
    $env:GOOGLE_CLOUD_API_KEY="YOUR_API_KEY"; node dist/test-local-audio.js
    In a POSIX shell (macOS or Linux), the equivalent is:
    $ GOOGLE_CLOUD_API_KEY="YOUR_API_KEY" node dist/test-local-audio.js
    This translates the sample MP3 file, prints subtitles to the console, and saves translated_output.wav and translated_output.mp3 in the workspace.

Integrating the MCP Server (Claude Desktop)

  1. Build the MCP server:
    cd mcp-server
    npm install
    npm run build
  2. Add the config to your Claude Desktop configuration file (claude_desktop_config.json):
    {
      "mcpServers": {
        "google-live-translate": {
          "command": "node",
          "args": ["/absolute/path/to/mcp-server/dist/index.js"],
          "env": {
            "GOOGLE_CLOUD_API_KEY": "YOUR_GEMINI_API_KEY"
          }
        }
      }
    }

💼 Business Use Cases & Monetization

| Segment | Workflow | Cost & Value Model |
| --- | --- | --- |
| Multilingual Support | Call centers requiring real-time translation between agents and clients. | $0.90 per successful API call |
| Video Subtitling | Content creators and e-learning platforms publishing across global markets. | $0.90 per successful API call |
| International Meetings | Pipelines that translate Google Meet recordings and deliver SRT subtitles. | $0.90 per successful API call |
| NLP Research & Datasets | Translation datasets with confidence scores, metadata, and voice style details. | $0.90 per successful API call |

🔌 Automation & Integrations

  • No-Code Platforms: Trigger the Actor via Webhooks from Make, Zapier, n8n, or ActiveCampaign as soon as a new recording is uploaded.
  • Schedules: Set up Apify's internal Cron Schedules to automatically look for and translate new recordings in Google Drive at regular intervals (daily, weekly, etc.).
  • Cloud Databases: Export structured datasets directly to BigQuery, Snowflake, Amazon S3, Postgres, or vector databases for downstream RAG analytics pipelines.

🌟 Frequently Asked Questions (FAQ)

Does the system require a local FFmpeg installation?

No. The project includes @ffmpeg-installer/ffmpeg as a dependency, which installs a platform-specific static FFmpeg binary (Windows, macOS, or Linux) out of the box. Audio splitting therefore works without extra setup both in local environments and in Docker containers.

How are private Google Meet recordings fetched from Google Drive?

If you configure the MCP Server or the Actor using a Google Service Account JSON or Application Default Credentials (ADC) with Drive read access, the system automatically requests a secure OAuth token and sends it in the download request header (Authorization: Bearer <TOKEN>).
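
As a rough sketch of the download step, the request could be built as shown below. The helper names are hypothetical, the access token is assumed to have been obtained already from the Service Account or ADC flow, and using the Drive v3 files endpoint with alt=media is an assumption about how the Actor fetches the bytes.

```typescript
/** Extracts the file ID from common Google Drive share URL shapes, or null if none is found. */
function extractDriveFileId(url: string): string | null {
  const match =
    /\/file\/d\/([A-Za-z0-9_-]+)/.exec(url) ?? /[?&]id=([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}

interface DownloadRequest {
  url: string;
  headers: Record<string, string>;
}

/** Builds an authenticated media download request for a Drive file. */
function buildDriveDownloadRequest(fileId: string, accessToken: string): DownloadRequest {
  return {
    // Assumed endpoint: Drive API v3 files.get with alt=media returns the file content.
    url: `https://www.googleapis.com/drive/v3/files/${encodeURIComponent(fileId)}?alt=media`,
    headers: { Authorization: `Bearer ${accessToken}` },
  };
}
```

The returned `url` and `headers` can be passed to any HTTP client, for example `fetch(request.url, { headers: request.headers })`.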

Can it translate video files as well as audio?

Yes. The bundled FFmpeg binary automatically demuxes the audio track from video files (such as .mp4, .mkv, or .webm) and transcodes it into a 16kHz mono PCM stream for Gemini.

How does structured schema data improve AI engine discoverability?

Based on Generative Engine Optimization (GEO) research by Princeton University, serving rich, schema-structured JSON outputs and structured page markup increases the visibility and citation rates of resources by AI search engines (like ChatGPT, Perplexity, and Gemini) by up to 40%, ensuring accuracy and proper attribution of origin data.


[!NOTE]
This service communicates directly with official Google Cloud APIs, ensuring full data privacy compliance without using web scraping techniques.