Hugging Face Audio AI
Pricing
from $0.01 / 1,000 results
Audio processing with Hugging Face models: speech recognition, text-to-speech, and audio analysis.
- Speech-to-Text: Transcribe audio
- Text-to-Speech: Generate natural speech
- Audio Classification: Classify sounds
- Voice Activity Detection: Detect speech
- Speaker Diarization: Identify speakers
- Music Generation: Create music
Developer

John Rippy
Hugging Face Audio - AI Audio Processing with Whisper, TTS & Music Models
Process audio with state-of-the-art AI models: speech-to-text transcription with Whisper, text-to-speech synthesis, audio classification, speaker diarization, voice activity detection, and AI music generation. No GPU or audio expertise is required. The Actor is bring-your-own-key (BYOK): you supply your own Hugging Face API token.
Features
- Speech-to-Text - Transcribe audio with OpenAI Whisper (97%+ accuracy)
- Text-to-Speech - Generate natural-sounding speech
- Audio Classification - Identify speech, music, sounds
- Audio Enhancement - Denoise and improve audio quality
- Voice Activity Detection - Find speech segments in audio
- Speaker Diarization - Identify different speakers
- Audio Embeddings - Generate audio feature vectors
- Music Generation - Create music from text prompts
- Multi-language Support - 99+ languages for transcription
- Demo Mode - Test with sample data before going live
Who Should Use This Actor?
Content Creators
Transcribe podcasts and videos. Generate voiceovers. Create background music.
Marketing Teams
Transcribe webinars and interviews. Generate audio ads. Analyze call recordings.
Media Companies
Automate transcription. Generate subtitles. Classify audio content.
Researchers
Transcribe interviews. Analyze audio datasets. Extract audio features.
Podcasters
Get instant transcripts. Generate show notes. Create audio clips.
Accessibility Teams
Generate captions. Create audio descriptions. Convert text to speech.
Quick Start
Demo Mode (Free Test)
{"task": "speech_to_text","audioUrl": "https://example.com/podcast.mp3","demoMode": true}
Speech-to-Text (Whisper)
{"task": "speech_to_text","apiToken": "hf_your_token_here","model": "openai/whisper-large-v3","audioUrl": "https://example.com/interview.mp3","language": "en","returnTimestamps": true,"demoMode": false}
Text-to-Speech
{"task": "text_to_speech","apiToken": "hf_your_token_here","text": "Welcome to our podcast. Today we're discussing the latest trends in AI.","speakerId": 0,"demoMode": false}
Audio Classification
{"task": "audio_classification","apiToken": "hf_your_token_here","audioUrl": "https://example.com/audio-clip.wav","demoMode": false}
Voice Activity Detection
{"task": "voice_activity_detection","apiToken": "hf_your_token_here","audioUrl": "https://example.com/recording.wav","demoMode": false}
Speaker Diarization
{"task": "speaker_diarization","apiToken": "hf_your_token_here","audioUrl": "https://example.com/meeting.mp3","demoMode": false}
Audio Enhancement (Noise Reduction)
{"task": "audio_to_audio","apiToken": "hf_your_token_here","audioUrl": "https://example.com/noisy-recording.wav","demoMode": false}
Audio Embeddings
{"task": "audio_embeddings","apiToken": "hf_your_token_here","audioUrl": "https://example.com/audio-sample.wav","demoMode": false}
Music Generation
{"task": "music_generation","apiToken": "hf_your_token_here","musicPrompt": "Upbeat electronic music, 120 BPM, energetic, modern","duration": 10,"demoMode": false}
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| task | string | required | Task to perform (see task list) |
| apiToken | string | - | Your Hugging Face API token |
| model | string | task default | Specific model to use |
| audioUrl | string | - | URL to audio file |
| text | string | - | Text for speech synthesis |
| musicPrompt | string | - | Prompt for music generation |
| language | string | - | Language code for transcription |
| targetLanguage | string | - | Target language for translation |
| voicePreset | string | - | Voice preset for TTS |
| speakerId | number | - | Speaker ID for multi-speaker TTS |
| duration | number | 10 | Music duration in seconds |
| sampleRate | number | 22050 | Audio sample rate |
| returnTimestamps | boolean | false | Include word timestamps |
| waitForModel | boolean | true | Wait for model to load |
| webhookUrl | string | - | Webhook URL for results |
| demoMode | boolean | true | Return sample data |
Available Tasks
| Task | Description | Default Model |
|---|---|---|
| speech_to_text | Transcribe audio | Whisper-large-v3 |
| text_to_speech | Generate speech | SpeechT5 TTS |
| audio_classification | Classify audio | AST-AudioSet |
| audio_to_audio | Enhance/transform audio | SepFormer |
| voice_activity_detection | Detect speech segments | Pyannote VAD |
| speaker_diarization | Identify speakers | Pyannote Diarization |
| audio_embeddings | Extract audio features | Wav2Vec2 |
| music_generation | Generate music | MusicGen-small |
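The parameter and task tables can be combined into a small pre-flight check before starting a run. The sketch below is only illustrative: the `REQUIRED_FIELDS` mapping is inferred from the Quick Start examples rather than taken from a published schema, and `build_input` is a hypothetical helper name.

```python
# Required fields per task, inferred from the Quick Start examples.
# This mapping is an assumption, not the Actor's published input schema.
REQUIRED_FIELDS = {
    "speech_to_text": ["audioUrl"],
    "text_to_speech": ["text"],
    "audio_classification": ["audioUrl"],
    "audio_to_audio": ["audioUrl"],
    "voice_activity_detection": ["audioUrl"],
    "speaker_diarization": ["audioUrl"],
    "audio_embeddings": ["audioUrl"],
    "music_generation": ["musicPrompt"],
}


def build_input(task, api_token=None, demo_mode=True, **fields):
    """Return an input dict for the Actor, raising ValueError on obvious gaps."""
    if task not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown task: {task!r}")
    missing = [name for name in REQUIRED_FIELDS[task] if not fields.get(name)]
    if missing:
        raise ValueError(f"Task {task!r} is missing: {', '.join(missing)}")
    # demoMode defaults to true, so a token only matters for live runs.
    if not demo_mode and not api_token:
        raise ValueError("apiToken is needed when demoMode is false")
    payload = {"task": task, "demoMode": demo_mode, **fields}
    if api_token:
        payload["apiToken"] = api_token
    return payload
```

For example, `build_input("music_generation", api_token="hf_your_token_here", demo_mode=False, musicPrompt="Calm ambient music", duration=30)` returns a dict that can be serialized as the Actor input.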
Output Format
Speech-to-Text
{"success": true,"model": "openai/whisper-large-v3","transcription": "Hello and welcome to our presentation today. We will be discussing the latest developments in artificial intelligence.","language": "en","confidence": 0.95,"segments": [{"start": 0.0, "end": 2.5, "text": "Hello and welcome to our presentation today.", "confidence": 0.96},{"start": 2.5, "end": 5.0, "text": "We will be discussing the latest developments", "confidence": 0.94},{"start": 5.0, "end": 7.5, "text": "in artificial intelligence.", "confidence": 0.95}],"audioUrl": "https://example.com/audio.mp3"}
Text-to-Speech
{"success": true,"model": "microsoft/speecht5_tts","audioBase64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...","mimeType": "audio/wav","text": "Welcome to our podcast...","durationSeconds": 3.5}
Audio Classification
{"success": true,"model": "MIT/ast-finetuned-audioset-10-10-0.4593","classifications": [{"label": "Speech", "score": 0.89},{"label": "Music", "score": 0.07},{"label": "Silence", "score": 0.03},{"label": "Noise", "score": 0.01}],"audioUrl": "https://example.com/audio.wav"}
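A common follow-up step is to keep only the strongest label and discard low-confidence results. The sketch below assumes the `classifications` shape from the sample above; the `min_score` threshold of 0.5 is an arbitrary illustrative default.

```python
def top_label(result, min_score=0.5):
    """Return the highest-scoring label, or None if it falls below min_score."""
    classifications = result.get("classifications", [])
    if not classifications:
        return None
    best = max(classifications, key=lambda item: item["score"])
    return best["label"] if best["score"] >= min_score else None
```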
Voice Activity Detection
{"success": true,"model": "pyannote/voice-activity-detection","speechSegments": [{"start": 0.5, "end": 3.2, "confidence": 0.98},{"start": 4.1, "end": 7.8, "confidence": 0.95},{"start": 9.0, "end": 12.5, "confidence": 0.97}],"totalSpeechDuration": 9.9,"totalAudioDuration": 15.0,"speechRatio": 0.66}
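The summary fields follow directly from the segments: `totalSpeechDuration` is the sum of the segment lengths, and `speechRatio` is that sum divided by `totalAudioDuration`. A short sketch that reproduces the numbers in the sample (`summarize_speech` is a hypothetical helper name):

```python
def summarize_speech(segments, total_audio_duration):
    """Return (total speech seconds, speech ratio), each rounded to 2 decimals."""
    total_speech = sum(seg["end"] - seg["start"] for seg in segments)
    ratio = total_speech / total_audio_duration if total_audio_duration else 0.0
    return round(total_speech, 2), round(ratio, 2)


segments = [
    {"start": 0.5, "end": 3.2},
    {"start": 4.1, "end": 7.8},
    {"start": 9.0, "end": 12.5},
]
# 2.7 + 3.7 + 3.5 = 9.9 seconds of speech in 15.0 seconds of audio
print(summarize_speech(segments, 15.0))
```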
Speaker Diarization
{"success": true,"model": "pyannote/speaker-diarization","speakers": [{"speaker": "SPEAKER_00","segments": [{"start": 0.5, "end": 5.2},{"start": 10.1, "end": 15.3}]},{"speaker": "SPEAKER_01","segments": [{"start": 5.5, "end": 9.8},{"start": 15.5, "end": 20.0}]}],"speakerCount": 2}
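The diarization output groups segments by speaker, while meeting notes usually need a chronological view. The sketch below flattens the `speakers` array into a time-ordered list of turns and totals the talk time per speaker. It assumes the output shape from the sample above; both function names are hypothetical.

```python
def speaker_timeline(result):
    """Return (start, end, speaker) tuples sorted by start time."""
    turns = [
        (seg["start"], seg["end"], entry["speaker"])
        for entry in result.get("speakers", [])
        for seg in entry["segments"]
    ]
    return sorted(turns)


def talk_time(result):
    """Return total speaking seconds per speaker, rounded to 2 decimals."""
    return {
        entry["speaker"]: round(
            sum(seg["end"] - seg["start"] for seg in entry["segments"]), 2
        )
        for entry in result.get("speakers", [])
    }
```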
Music Generation
{"success": true,"model": "facebook/musicgen-small","audioBase64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...","mimeType": "audio/wav","prompt": "Upbeat electronic music, 120 BPM","durationSeconds": 10,"sampleRate": 32000}
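Both the text-to-speech and music generation outputs return audio as a base64 string in `audioBase64`. A minimal sketch for writing it to disk, assuming the field holds a complete base64-encoded file (the sample strings above are truncated and will not decode to playable audio); `save_audio` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_audio(result, path):
    """Decode result['audioBase64'] and write it to path; return bytes written."""
    audio_bytes = base64.b64decode(result["audioBase64"])
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

The file extension should match the `mimeType` field, which is `audio/wav` in both samples.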
Pricing (Pay-Per-Event)
| Event | Description | Price |
|---|---|---|
| audio_processed | Per audio task completed | $0.01 |
Example costs:
- Transcribe 50 audio files: 50 × $0.01 = $0.50
- Generate 100 TTS clips: 100 × $0.01 = $1.00
- Classify 200 audio samples: 200 × $0.01 = $2.00
- Demo mode: $0.00
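The example costs above are a flat multiplication of task count by the `audio_processed` price. A small sketch of the same arithmetic, using `Decimal` to avoid floating-point rounding on currency (`estimate_cost` is a hypothetical helper name):

```python
from decimal import Decimal

# Price of one audio_processed event, from the pricing table.
PRICE_PER_TASK = Decimal("0.01")


def estimate_cost(task_count, demo_mode=False):
    """Return the Actor charge in dollars for a number of completed tasks."""
    if demo_mode:
        return Decimal("0.00")
    return (PRICE_PER_TASK * task_count).quantize(Decimal("0.01"))


for count in (50, 100, 200):
    print(count, "tasks:", estimate_cost(count))
```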
Cost Comparison
| Tool | Tool price (per audio minute) | This Actor (per task) |
|---|---|---|
| Rev.ai | $0.025 | ~$0.01 |
| AssemblyAI | $0.015 | ~$0.01 |
| AWS Transcribe | $0.024 | ~$0.01 |
Common Scenarios
Scenario 1: Podcast Transcription
{"task": "speech_to_text","apiToken": "hf_your_token","audioUrl": "https://example.com/podcast-episode-42.mp3","model": "openai/whisper-large-v3","returnTimestamps": true,"webhookUrl": "https://hooks.zapier.com/...","demoMode": false}
Scenario 2: Meeting Notes with Speaker ID
{"task": "speaker_diarization","apiToken": "hf_your_token","audioUrl": "https://example.com/team-meeting.mp3","demoMode": false}
Scenario 3: Audio Ad Voiceover
{"task": "text_to_speech","apiToken": "hf_your_token","text": "Discover the future of marketing automation. Visit ActorArsenal.com today.","demoMode": false}
Scenario 4: Background Music for Content
{"task": "music_generation","apiToken": "hf_your_token","musicPrompt": "Calm ambient music, soft piano, peaceful, suitable for meditation video","duration": 30,"demoMode": false}
Webhook & Automation Integration
Zapier / Make.com / n8n
- Create a webhook trigger
- Copy the URL to `webhookUrl`
- Process audio results in your workflow
Popular automations:
- Transcripts -> Notion/Google Docs
- TTS audio -> Cloud storage
- Diarization -> Meeting summaries
- Music -> Video editing software
Supported Languages (Whisper)
Whisper supports 99+ languages including:
- English, Spanish, French, German, Italian, Portuguese
- Chinese (Mandarin, Cantonese), Japanese, Korean
- Arabic, Hindi, Russian, Turkish
- And many more...
Hugging Face AI Suite
| Actor | Best For |
|---|---|
| Hugging Face Master | All-in-one (text + image + audio) |
| Hugging Face Text | Text processing |
| Hugging Face Image | Image processing |
| Hugging Face Audio | Audio processing (lightweight) |
| Hugging Face Hub | Model discovery |
FAQ
Q: What audio formats are supported?
A: MP3, WAV, FLAC, OGG, and M4A. WAV is recommended for quality.
Q: What's the max audio length?
A: Whisper processes up to 30 seconds per chunk. Longer files are automatically chunked.
Q: How accurate is Whisper?
A: Whisper-large-v3 is cited here at 97%+ accuracy on standard benchmarks. Real-world accuracy depends on audio quality, background noise, accents, and language.
Q: Can I transcribe non-English audio?
A: Yes. Whisper supports 99+ languages. Set the `language` parameter, or leave it unset to let Whisper auto-detect the language.
Q: How long does music generation take?
A: ~10-30 seconds per 10 seconds of music, depending on model load.
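To illustrate the 30-second chunking mentioned in the audio length answer, the sketch below splits a file duration into consecutive windows. It only shows the idea: how the Actor or the underlying model actually chunks audio (for example, whether windows overlap) is not documented here, and `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0):
    """Return consecutive (start, end) windows covering total_seconds."""
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows
```

A 75-second file yields three windows under this scheme: two full 30-second chunks and a final 15-second remainder.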
Common Problems & Solutions
"Model is loading"
- Large models need warm-up time
- Set `waitForModel: true` (default)
- Audio processing can take 2-3 minutes for large files
"Audio format not supported"
- Convert to WAV or MP3
- Ensure proper encoding
"Transcription quality is poor"
- Use Whisper-large-v3 for best quality
- Ensure audio is clear with minimal background noise
- Specify correct language code
"Demo data showing"
- Set `demoMode: false`
- Provide your Hugging Face API token
Built by John Rippy | Actor Arsenal