Speech AI MCP Server
Speech AI MCP server with 9 tools covering pronunciation scoring (0-100 at the overall, sentence, word, and phoneme levels), English speech-to-text with word-level timestamps, text-to-speech with 12 English voices, and multilingual Whisper transcription (99 languages, with speaker diarization). Latency is under 300 ms for pronunciation scoring and speech-to-text. Pay-per-use pricing: $0.02 per tool call.
Developer: Fabio Suizu (maintained by Community)
AI-powered speech tools for MCP-enabled AI agents: pronunciation scoring, speech-to-text, text-to-speech, and multilingual transcription.
Tools
| Tool | Description |
|---|---|
| assess_pronunciation | Score English pronunciation from audio (0-100 at overall, sentence, word, phoneme levels) |
| transcribe_audio | Convert spoken English to text with word-level timestamps |
| synthesize_speech | Generate natural speech from text (12 voices, American & British) |
| transcribe_audio_pro | Whisper Large V3 Turbo: 99 languages, speaker diarization |
| list_tts_voices | List available text-to-speech voices |
| check_pronunciation_service | Health check for pronunciation backend |
| check_stt_service | Health check for STT backend |
| check_tts_service | Health check for TTS backend |
| check_whisper_service | Health check for Whisper backend |
Pronunciation Scoring
The assess_pronunciation tool returns scores (0-100) at four levels of granularity:
| Level | Description |
|---|---|
| Overall | Global pronunciation quality |
| Sentence | Sentence-level fluency and accuracy |
| Word | Per-word pronunciation scores |
| Phoneme | Individual sound accuracy (IPA + ARPAbet) |
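As a rough illustration of how the nested word and phoneme scores might be consumed, the sketch below models them with small data classes and picks out the words that need the most practice. The class names, field names, and the threshold of 60 are all hypothetical; the actual response schema of assess_pronunciation is not documented here and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PhonemeScore:
    symbol: str   # hypothetical field: an IPA or ARPAbet symbol
    score: float  # 0-100


@dataclass
class WordScore:
    word: str     # hypothetical field names throughout
    score: float  # 0-100
    phonemes: list[PhonemeScore] = field(default_factory=list)


def weakest_words(words: list[WordScore], threshold: float = 60.0) -> list[str]:
    """Return words scoring below the threshold, lowest score first."""
    ranked = sorted(words, key=lambda w: w.score)
    return [w.word for w in ranked if w.score < threshold]


# Made-up scores, purely for illustration.
words = [
    WordScore("quick", 82.0),
    WordScore("brown", 55.0, [PhonemeScore("R", 40.0)]),
    WordScore("fox", 48.5),
]
print(weakest_words(words))  # ['fox', 'brown']
```

The threshold is arbitrary; an application would tune it to its own learners.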
Performance
- Accuracy: Pearson correlation coefficient (PCC) of 0.576, exceeding human inter-annotator agreement (PCC 0.555)
- Validation: 9,259 utterances from speakers of 7 first-language (L1) backgrounds, with zero errors
- Latency: Under 300 ms for pronunciation scoring and STT
How to Use
MCP Endpoint
https://Ym2gS88TksnTdTcPq.apify.actor/mcp?token=YOUR_APIFY_TOKEN
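A minimal sketch of assembling that endpoint URL in code, so the token is URL-encoded and never pasted in by hand. The helper name `build_mcp_url` is hypothetical; only the host, path, and `token` query parameter come from the endpoint shown above.

```python
from urllib.parse import urlencode

# Host and path copied from the endpoint above.
MCP_BASE_URL = "https://Ym2gS88TksnTdTcPq.apify.actor/mcp"


def build_mcp_url(apify_token: str) -> str:
    """Return the MCP endpoint URL with the token as a query parameter."""
    if not apify_token:
        raise ValueError("an Apify API token is required")
    return f"{MCP_BASE_URL}?{urlencode({'token': apify_token})}"


# Placeholder token; in practice, read it from an environment variable.
url = build_mcp_url("YOUR_APIFY_TOKEN")
print(url)
```

The resulting URL is what you would paste into an MCP client's server configuration.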
Example: Pronunciation Assessment
{"audio_base64": "<base64-encoded-audio>", "text": "The quick brown fox jumps over the lazy dog"}
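The `audio_base64` field expects the raw audio bytes encoded as base64 text. The sketch below shows one way to produce the arguments above from bytes already in memory; the helper name `build_pronunciation_args` and the sample bytes are hypothetical, and only the two field names come from the example.

```python
import base64
import json


def build_pronunciation_args(audio_bytes: bytes, reference_text: str) -> dict:
    """Build the assess_pronunciation arguments from raw audio bytes."""
    if not audio_bytes:
        raise ValueError("audio_bytes must not be empty")
    return {
        # b64encode returns bytes; decode to str so the dict is JSON-serializable.
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "text": reference_text,
    }


# Stand-in bytes; in practice read them with Path("recording.wav").read_bytes(),
# where recording.wav is a hypothetical file name.
sample_audio = b"RIFF\x00\x00\x00\x00WAVE"
args = build_pronunciation_args(
    sample_audio, "The quick brown fox jumps over the lazy dog"
)
print(json.dumps(args))
```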
Example: Text-to-Speech
{"text": "Hello, how are you today?", "voice": "af_heart", "speed": 1.0}
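MCP clients send tool arguments like these inside a JSON-RPC 2.0 `tools/call` request. The sketch below builds that envelope for synthesize_speech without sending it; the helper name `tools_call` is hypothetical, and a real client would normally use an MCP client library, which also handles the initialization handshake and transport.

```python
import json
from itertools import count

# Simple incrementing source of JSON-RPC request ids.
_request_ids = count(1)


def tools_call(tool_name: str, arguments: dict) -> dict:
    """Wrap tool arguments in a JSON-RPC 2.0 'tools/call' request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = tools_call(
    "synthesize_speech",
    {"text": "Hello, how are you today?", "voice": "af_heart", "speed": 1.0},
)
print(json.dumps(request, indent=2))
```

The same envelope works for any tool in the table by changing the tool name and arguments.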
Pricing
$0.02 per tool call (pay-per-event).
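For budgeting, a back-of-the-envelope estimate follows directly from the flat rate. This sketch assumes every tool call, including health checks and voice listing, is billed at the same $0.02; that assumption is not confirmed by the listing, and Apify platform charges, if any, are not included.

```python
from decimal import Decimal

# Flat rate quoted above; Decimal avoids floating-point rounding on money.
PRICE_PER_CALL_USD = Decimal("0.02")


def estimate_cost(tool_calls: int) -> Decimal:
    """Estimate the charge in USD for a number of tool calls."""
    if tool_calls < 0:
        raise ValueError("tool_calls must be non-negative")
    return PRICE_PER_CALL_USD * tool_calls


print(estimate_cost(500))  # 10.00
```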
Technical Details
- Pronunciation Model: Conformer-CTC Small (17MB, INT8 quantized)
- TTS Model: Kokoro-82M (12 English voices, 24kHz WAV)
- STT Pro: Whisper Large V3 Turbo (99 languages, speaker diarization)
- Audio: Supports WAV, MP3, OGG, FLAC, WebM
- Backend: Azure Container Apps, auto-scaling