Speech AI MCP Server

Speech AI MCP server with 9 tools covering pronunciation scoring (0-100 at the overall, sentence, word, and phoneme levels), speech-to-text with word-level timestamps, text-to-speech with 12 English voices, and multilingual Whisper transcription (99 languages plus speaker diarization). Latency is under 300 ms for pronunciation scoring and speech-to-text. Pricing is pay-per-use at $0.02 per tool call.

Developer

Fabio Suizu

Last modified: 4 months ago

Tools

Tool	Description
assess_pronunciation	Score English pronunciation from audio (0-100 at overall, sentence, word, phoneme levels)
transcribe_audio	Convert spoken English to text with word-level timestamps
synthesize_speech	Generate natural speech from text (12 voices, American & British)
transcribe_audio_pro	Whisper Large V3 Turbo: 99 languages, speaker diarization
list_tts_voices	List available text-to-speech voices
check_pronunciation_service	Health check for pronunciation backend
check_stt_service	Health check for STT backend
check_tts_service	Health check for TTS backend
check_whisper_service	Health check for Whisper backend

Pronunciation Scoring

Returns scores (0-100) at four granularity levels:

Level	Description
Overall	Global pronunciation quality
Sentence	Sentence-level fluency and accuracy
Word	Per-word pronunciation scores
Phoneme	Individual sound accuracy (IPA + ARPAbet)

Performance

Accuracy: Exceeds human inter-annotator agreement, with a Pearson correlation coefficient (PCC) of 0.576 vs 0.555
Validated: 9,259 utterances from speakers of 7 first-language (L1) backgrounds, zero errors
Latency: Sub-300ms for pronunciation and STT
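
For readers unfamiliar with the accuracy metric, here is a minimal sketch of how a Pearson correlation coefficient between model scores and human ratings can be computed. It assumes PCC in the figures above means Pearson correlation. The score lists are made up for illustration and are not taken from the validation set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical 0-100 scores, invented for this example only.
    model_scores = [82, 45, 91, 60, 73]
    human_scores = [80, 50, 95, 55, 70]
    print(round(pearson(model_scores, human_scores), 3))
```

A value of 1.0 means the two sets of scores rise and fall together perfectly, and 0 means no linear relationship.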

How to Use

MCP Endpoint

https://Ym2gS88TksnTdTcPq.apify.actor/mcp?token=YOUR_APIFY_TOKEN
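
As a rough sketch, the endpoint URL and an MCP tool-call message could be assembled as follows. MCP messages are JSON-RPC 2.0 and tools are invoked with the `tools/call` method. The helper names `build_mcp_url` and `build_tool_call` are hypothetical. A real MCP client would normally handle the transport and initialization handshake for you, so this only illustrates the shape of the request.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from this page; the token is your own Apify API token.
MCP_BASE_URL = "https://Ym2gS88TksnTdTcPq.apify.actor/mcp"


def build_mcp_url(token: str) -> str:
    """Append the Apify token as a URL-encoded query parameter."""
    return f"{MCP_BASE_URL}?{urlencode({'token': token})}"


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for one MCP tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_mcp_url("YOUR_APIFY_TOKEN"))
    # Health-check tool from the Tools table; assumed to take no arguments.
    print(build_tool_call("check_tts_service", {}))
```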

Example: Pronunciation Assessment

{
  "audio_base64": "<base64-encoded-audio>",
  "text": "The quick brown fox jumps over the lazy dog"
}
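
The `audio_base64` field needs the audio file's bytes encoded as base64. A minimal sketch of building the arguments above in Python, assuming the server expects standard base64 without a data-URI prefix (not confirmed here). The function names are hypothetical.

```python
import base64
from pathlib import Path


def build_assessment_args(audio_bytes: bytes, text: str) -> dict:
    """Build the argument object for assess_pronunciation from raw audio bytes."""
    if not audio_bytes:
        raise ValueError("audio is empty")
    return {
        # Standard base64, decoded to str so the dict is JSON-serializable.
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "text": text,
    }


def build_assessment_args_from_file(path: str, text: str) -> dict:
    """Same as above, reading the audio from a file on disk."""
    return build_assessment_args(Path(path).read_bytes(), text)
```

For example, `build_assessment_args_from_file("sample.wav", "The quick brown fox jumps over the lazy dog")` would produce a payload in the shape shown above, where `sample.wav` is a placeholder for your own recording.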

Example: Text-to-Speech

{
  "text": "Hello, how are you today?",
  "voice": "af_heart",
  "speed": 1.0
}
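
Technical Details lists the TTS output as 24kHz WAV. How the audio is delivered in the tool response is not documented on this page, so the sketch below assumes you already have the WAV bytes in hand and only shows how to inspect them with Python's standard `wave` module. `make_silence` is a hypothetical stand-in that generates test audio locally.

```python
import io
import wave

EXPECTED_SAMPLE_RATE = 24_000  # "24kHz WAV" per Technical Details


def wav_info(wav_bytes: bytes) -> dict:
    """Read channel count, sample rate, and duration from in-memory WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def make_silence(seconds: float = 0.5, rate: int = EXPECTED_SAMPLE_RATE) -> bytes:
    """Generate mono 16-bit silent WAV data, used here in place of real TTS output."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    info = wav_info(make_silence())
    print(info["sample_rate"] == EXPECTED_SAMPLE_RATE, info)
```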

Pricing

$0.02 per tool call (pay-per-event).
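
A quick way to budget usage, sketched under the assumption that every tool call, including health checks, is billed at the same flat rate:

```python
from decimal import Decimal

PRICE_PER_CALL = Decimal("0.02")  # USD per tool call, from this page


def estimate_cost(tool_calls: int) -> Decimal:
    """Estimated charge in USD for a given number of tool calls."""
    if tool_calls < 0:
        raise ValueError("tool_calls must be non-negative")
    # Decimal avoids the rounding drift of binary floats in money arithmetic.
    return PRICE_PER_CALL * tool_calls


if __name__ == "__main__":
    print(estimate_cost(150))  # 150 calls at $0.02 each
```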

Technical Details

Pronunciation Model: Conformer-CTC Small (17MB, INT8 quantized)
TTS Model: Kokoro-82M (12 English voices, 24kHz WAV)
STT Pro: Whisper Large V3 Turbo (99 languages, speaker diarization)
Audio: Supports WAV, MP3, OGG, FLAC, WebM
Backend: Azure Container Apps, auto-scaling
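
Before encoding and uploading a file, it can help to check its extension against the supported audio formats listed above. This is a minimal client-side sketch. It only looks at the file name, not the actual contents, and the function name is hypothetical.

```python
from pathlib import Path

# Extensions corresponding to the formats listed under Technical Details.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension matches a listed audio format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("clip.WAV", "voice.webm", "song.aac"):
        print(name, is_supported_audio(name))
```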

Links

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-assessment
API Docs: https://apim-ai-apis.azure-api.net/pronunciation/docs
Company: https://brainiall.com

Brainiall Speech MCP Server

vivid_astronaut/speech-ai-mcp

Production speech AI tools for AI agents: Brainiall Pronunciation (0-100 phoneme/word/sentence scoring), Brainiall Speech (transcription with timestamps), Brainiall Speech Pro (99 languages + diarization), Brainiall Voice (12 voices). Sub-second p50.

Fabio Suizu

Hugging Face Audio AI

alizarin_refrigerator-owner/hugging-face-audio-ai

Audio processing with Hugging Face models: speech recognition, text-to-speech, and audio analysis. Speech-to-Text: transcribe audio. Text-to-Speech: generate natural speech. Audio Classification: classify sounds. Voice Activity Detection: detect speech. Speaker Diarization: identify speakers. Music Generation: create music.

The Howlers

Text to speech generator

akash9078/advanced-text-to-speech

Professional-grade Text-to-Speech (TTS) actor powered by advanced AI models. Convert any text into natural, human-like speech with 50+ premium voices across 9 languages. Perfect for content creation, accessibility, voiceovers, audiobooks, podcasts, and multilingual applications.

Akash Kumar Naik

AI Voice Generator MCP Server

szoni/apify-tts-mcp

Convert text to natural speech (text-to-speech / TTS) via MCP — multiple AI voices and models. Pay per character, no provider account or API key needed. Ready for Claude, Cursor and other AI agents.

Szoni

Text to Speech Generator

moving_beacon-owner1/my-actor-30

Convert text into natural-sounding speech in multiple languages with ease.

Jamshaid Arif

Google Free Text to Speech

jupri/google-speech

Use free Google Text to Speech to convert text into speech.

cat

Text To Speech

vivid_astronaut/text-to-speech

Convert text to natural speech using AI voices. Multiple voices and languages available. Generate audio files for podcasts, videos, accessibility, and voice assistants.

Fabio Suizu

Speech-to-Text Transcription

hgservices/speech-to-text

Transcribe audio and video from YouTube, TikTok, podcasts, X, and 1,000+ other sites, or from any direct media URL, into accurate, speaker-labeled text. Uses the world's best speech-to-text AI models with automatic language detection, multilingual support, and smart formatting.

Harish Garg

Text to Speech

hgservices/text-to-speech

Turn any text into natural-sounding speech with AI voices in seconds. Powered by world-class AI models, with multilingual voices and MP3, WAV, FLAC, Opus, and AAC output. No setup or coding required.

Harish Garg

Speech to Text — YouTube, TikTok, Instagram, 99+ Languages

andronixmd/speech-to-text-transcriber

Multi-engine speech-to-text for YouTube, TikTok, Instagram, podcasts, X, and direct media URLs. Auto-detects 99+ languages, routes across Groq/OpenAI/ElevenLabs/Google with automatic failover, and returns text, SRT/VTT subtitles, and optional speaker diarization. Pay-per-event — no subscription.