Speech AI MCP Server

Speech AI MCP server with 9 tools: pronunciation scoring (0-100 at phoneme/word/sentence level), speech-to-text with timestamps, text-to-speech with 12 English voices, and multilingual Whisper transcription (99 languages + speaker diarization). Sub-300ms latency. Pay-per-use: $0.02/call.

Pricing

Pay per usage

Developer

Fabio Suizu

Maintained by Community

AI-powered speech tools for MCP-enabled AI agents: pronunciation scoring, speech-to-text, text-to-speech, and multilingual transcription.

Tools

| Tool | Description |
| --- | --- |
| assess_pronunciation | Score English pronunciation from audio (0-100 at overall, sentence, word, and phoneme levels) |
| transcribe_audio | Convert spoken English to text with word-level timestamps |
| synthesize_speech | Generate natural speech from text (12 voices, American & British) |
| transcribe_audio_pro | Whisper Large V3 Turbo: 99 languages, speaker diarization |
| list_tts_voices | List available text-to-speech voices |
| check_pronunciation_service | Health check for pronunciation backend |
| check_stt_service | Health check for STT backend |
| check_tts_service | Health check for TTS backend |
| check_whisper_service | Health check for Whisper backend |
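
Each working tool in the list above appears to have a matching health-check tool. A minimal sketch of that pairing follows; the mapping is inferred from the tool names and is an assumption, in particular the pairing of `list_tts_voices` with the TTS health check. The helper name `health_check_for` is hypothetical.

```python
# Assumed pairing of each working tool with the health check for its backend.
# Inferred from the tool names in the listing, not confirmed by the server.
HEALTH_CHECK_FOR = {
    "assess_pronunciation": "check_pronunciation_service",
    "transcribe_audio": "check_stt_service",
    "synthesize_speech": "check_tts_service",
    "list_tts_voices": "check_tts_service",  # assumption: served by the TTS backend
    "transcribe_audio_pro": "check_whisper_service",
}


def health_check_for(tool_name: str) -> str:
    """Return the name of the health-check tool to call before `tool_name`."""
    try:
        return HEALTH_CHECK_FOR[tool_name]
    except KeyError:
        raise ValueError(f"Unknown tool: {tool_name!r}") from None
```

An agent could call the returned health-check tool once before sending a batch of requests to the same backend.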

Pronunciation Scoring

Returns scores (0-100) at four granularity levels:

| Level | Description |
| --- | --- |
| Overall | Global pronunciation quality |
| Sentence | Sentence-level fluency and accuracy |
| Word | Per-word pronunciation scores |
| Phoneme | Individual sound accuracy (IPA + ARPAbet) |

Performance

  • Accuracy: Exceeds human inter-annotator agreement (Pearson correlation coefficient, PCC: 0.576 vs 0.555)
  • Validated: 9,259 utterances from speakers of 7 native-language (L1) backgrounds, zero errors
  • Latency: Sub-300ms for pronunciation and STT

How to Use

MCP Endpoint

https://Ym2gS88TksnTdTcPq.apify.actor/mcp?token=YOUR_APIFY_TOKEN
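
MCP clients talk to a server with JSON-RPC 2.0 messages, and a tool is invoked with the `tools/call` method. The sketch below only builds the endpoint URL and the message body; it sends nothing over the network. The helper names `build_endpoint` and `build_tool_call` are hypothetical, and a real session also needs the MCP initialization handshake, which an MCP client library normally handles.

```python
import json
from urllib.parse import urlencode

# Base URL taken from the endpoint above; the token is appended as a query parameter.
MCP_BASE_URL = "https://Ym2gS88TksnTdTcPq.apify.actor/mcp"


def build_endpoint(token: str) -> str:
    """Return the MCP endpoint URL with the Apify token URL-encoded."""
    return f"{MCP_BASE_URL}?{urlencode({'token': token})}"


def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for the given tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload)
```

For example, `build_tool_call("list_tts_voices", {})` yields the body a client would send to list the available voices, assuming that tool takes no arguments.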

Example: Pronunciation Assessment

{
  "audio_base64": "<base64-encoded-audio>",
  "text": "The quick brown fox jumps over the lazy dog"
}
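
The `audio_base64` field expects the audio file's raw bytes encoded as base64 text. A minimal sketch of preparing those arguments is shown here; the function name `build_pronunciation_args` is hypothetical, and only the two field names come from the example above.

```python
import base64


def build_pronunciation_args(audio_bytes: bytes, reference_text: str) -> dict:
    """Build the argument object for a pronunciation assessment request."""
    if not audio_bytes:
        raise ValueError("audio_bytes must not be empty")
    return {
        # Base64 output is bytes; decode to ASCII so it can be placed in JSON.
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "text": reference_text,
    }
```

In practice the bytes would come from a recording on disk, for example `Path("sample.wav").read_bytes()`, where `sample.wav` is a placeholder file name.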

Example: Text-to-Speech

{
  "text": "Hello, how are you today?",
  "voice": "af_heart",
  "speed": 1.0
}
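
The Technical Details section lists the TTS output as 24kHz WAV. The sketch below inspects such audio with Python's standard `wave` module. It assumes the synthesized audio arrives as a base64-encoded WAV string; this page does not document the response shape, so that is an assumption, and `describe_wav` is a hypothetical helper.

```python
import base64
import io
import wave


def describe_wav(audio_base64: str) -> dict:
    """Decode base64 WAV data and report its basic audio properties."""
    raw = base64.b64decode(audio_base64)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }
```

Checking that `sample_rate_hz` equals 24000 is a quick way to confirm the decoded audio matches the documented output format.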

Pricing

$0.02 per tool call (pay-per-event).
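
With a flat per-call price, a usage estimate is simple multiplication. The sketch below assumes every tool call, including health checks, is billed at the same $0.02 rate; the page does not say whether any calls are exempt. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Price per tool call as stated above. Decimal avoids float rounding in currency math.
PRICE_PER_CALL_USD = Decimal("0.02")


def estimate_cost(tool_calls: int) -> Decimal:
    """Estimate the USD cost of a number of tool calls at the flat rate."""
    if tool_calls < 0:
        raise ValueError("tool_calls must be non-negative")
    return PRICE_PER_CALL_USD * tool_calls
```

Under that assumption, 1,500 calls in a month would cost $30.00.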

Technical Details

  • Pronunciation Model: Conformer-CTC Small (17MB, INT8 quantized)
  • TTS Model: Kokoro-82M (12 English voices, 24kHz WAV)
  • STT Pro: Whisper Large V3 Turbo (99 languages, speaker diarization)
  • Audio: Supports WAV, MP3, OGG, FLAC, WebM
  • Backend: Azure Container Apps, auto-scaling
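
Since only five audio formats are listed, a client may want to reject other files before encoding and uploading them. This is a rough sketch that judges the format by file extension alone, which is an assumption: the server may inspect the audio content instead. `is_supported_audio` is a hypothetical helper.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats listed above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension matches a listed audio format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```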