Hugging Face Audio AI

Pricing

from $0.01 / 1,000 results

Audio processing with Hugging Face models: speech recognition, text-to-speech, and audio analysis.

  • Speech-to-Text: Transcribe audio
  • Text-to-Speech: Generate natural speech
  • Audio Classification: Classify sounds
  • Voice Activity Detection: Detect speech
  • Speaker Diarization: Identify speakers
  • Music Generation: Create music

Developer

John Rippy

Maintained by Community

Hugging Face Audio - AI Audio Processing with Whisper, TTS & Music Models

Process audio with state-of-the-art AI models: speech-to-text transcription with Whisper, text-to-speech synthesis, audio classification, speaker diarization, voice activity detection, and AI music generation. No GPU or audio expertise required. Bring your own key (BYOK): you supply your Hugging Face API token.

Features

  • Speech-to-Text - Transcribe audio with OpenAI Whisper (97%+ word accuracy reported on clean-speech benchmarks)
  • Text-to-Speech - Generate natural-sounding speech
  • Audio Classification - Identify speech, music, sounds
  • Audio Enhancement - Denoise and improve audio quality
  • Voice Activity Detection - Find speech segments in audio
  • Speaker Diarization - Identify different speakers
  • Audio Embeddings - Generate audio feature vectors
  • Music Generation - Create music from text prompts
  • Multi-language Support - 99+ languages for transcription
  • Demo Mode - Test with sample data before going live

Who Should Use This Actor?

Content Creators

Transcribe podcasts and videos. Generate voiceovers. Create background music.

Marketing Teams

Transcribe webinars and interviews. Generate audio ads. Analyze call recordings.

Media Companies

Automate transcription. Generate subtitles. Classify audio content.

Researchers

Transcribe interviews. Analyze audio datasets. Extract audio features.

Podcasters

Get instant transcripts. Generate show notes. Create audio clips.

Accessibility Teams

Generate captions. Create audio descriptions. Convert text to speech.

Quick Start

Demo Mode (Free Test)

{
  "task": "speech_to_text",
  "audioUrl": "https://example.com/podcast.mp3",
  "demoMode": true
}

Speech-to-Text (Whisper)

{
  "task": "speech_to_text",
  "apiToken": "hf_your_token_here",
  "model": "openai/whisper-large-v3",
  "audioUrl": "https://example.com/interview.mp3",
  "language": "en",
  "returnTimestamps": true,
  "demoMode": false
}

Text-to-Speech

{
  "task": "text_to_speech",
  "apiToken": "hf_your_token_here",
  "text": "Welcome to our podcast. Today we're discussing the latest trends in AI.",
  "speakerId": 0,
  "demoMode": false
}

Audio Classification

{
  "task": "audio_classification",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/audio-clip.wav",
  "demoMode": false
}

Voice Activity Detection

{
  "task": "voice_activity_detection",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/recording.wav",
  "demoMode": false
}

Speaker Diarization

{
  "task": "speaker_diarization",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/meeting.mp3",
  "demoMode": false
}

Audio Enhancement (Noise Reduction)

{
  "task": "audio_to_audio",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/noisy-recording.wav",
  "demoMode": false
}

Audio Embeddings

{
  "task": "audio_embeddings",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/audio-sample.wav",
  "demoMode": false
}

Music Generation

{
  "task": "music_generation",
  "apiToken": "hf_your_token_here",
  "musicPrompt": "Upbeat electronic music, 120 BPM, energetic, modern",
  "duration": 10,
  "demoMode": false
}
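
Any of the inputs above can also be submitted programmatically. The following is a minimal sketch, not the author's documented method: it builds a request for the Apify API's synchronous "run Actor and get dataset items" endpoint using only the standard library. The Actor ID `your-username~hugging-face-audio` and the token `apify_api_xxx` are placeholders (assumptions); substitute the values shown in your Apify Console.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, apify_token, run_input):
    """Build (but do not send) a POST request that runs an Actor
    synchronously and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {apify_token}",
        },
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and parse the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Demo-mode input copied from the Quick Start section.
demo_input = {
    "task": "speech_to_text",
    "audioUrl": "https://example.com/podcast.mp3",
    "demoMode": True,
}

# Placeholder Actor ID and token: both are assumptions, not real values.
request = build_run_request(
    "your-username~hugging-face-audio", "apify_api_xxx", demo_input
)
```

Calling `send(request)` would perform the run; it is kept separate so the request can be inspected or logged first.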

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| task | string | required | Task to perform (see task list) |
| apiToken | string | - | Your Hugging Face API token |
| model | string | task default | Specific model to use |
| audioUrl | string | - | URL to audio file |
| text | string | - | Text for speech synthesis |
| musicPrompt | string | - | Prompt for music generation |
| language | string | - | Language code for transcription |
| targetLanguage | string | - | Target language for translation |
| voicePreset | string | - | Voice preset for TTS |
| speakerId | number | - | Speaker ID for multi-speaker TTS |
| duration | number | 10 | Music duration in seconds |
| sampleRate | number | 22050 | Audio sample rate (Hz) |
| returnTimestamps | boolean | false | Include word timestamps |
| waitForModel | boolean | true | Wait for model to load |
| webhookUrl | string | - | Webhook URL for results |
| demoMode | boolean | true | Return sample data |
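
A small pre-flight check can catch incomplete inputs before a run is started. The sketch below is an assumption-laden illustration: the mapping from task to required field is inferred from the Quick Start examples rather than from a published schema, and the rule that `apiToken` is needed when `demoMode` is false is inferred from the troubleshooting notes.

```python
# Tasks that take an audio file as input (inferred from the Quick Start examples).
AUDIO_TASKS = (
    "speech_to_text",
    "audio_classification",
    "audio_to_audio",
    "voice_activity_detection",
    "speaker_diarization",
    "audio_embeddings",
)

# Hypothetical mapping of task -> the one content field it needs.
REQUIRED_FIELD = {task: "audioUrl" for task in AUDIO_TASKS}
REQUIRED_FIELD["text_to_speech"] = "text"
REQUIRED_FIELD["music_generation"] = "musicPrompt"


def validate_input(run_input):
    """Return a list of human-readable problems; an empty list means OK."""
    task = run_input.get("task")
    if task not in REQUIRED_FIELD:
        return [f"unknown or missing task: {task!r}"]

    problems = []
    field = REQUIRED_FIELD[task]
    if not run_input.get(field):
        problems.append(f"{task} needs '{field}'")

    # demoMode defaults to true according to the parameter table.
    live_run = run_input.get("demoMode", True) is False
    if live_run and not run_input.get("apiToken"):
        problems.append("apiToken is required when demoMode is false")
    return problems
```

For example, `validate_input({"task": "text_to_speech", "demoMode": False})` reports both the missing `text` and the missing `apiToken`.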

Available Tasks

| Task | Description | Default Model |
|---|---|---|
| speech_to_text | Transcribe audio | Whisper-large-v3 |
| text_to_speech | Generate speech | SpeechT5 TTS |
| audio_classification | Classify audio | AST-AudioSet |
| audio_to_audio | Enhance/transform audio | SepFormer |
| voice_activity_detection | Detect speech segments | Pyannote VAD |
| speaker_diarization | Identify speakers | Pyannote Diarization |
| audio_embeddings | Extract audio features | Wav2Vec2 |
| music_generation | Generate music | MusicGen-small |

Output Format

Speech-to-Text

{
  "success": true,
  "model": "openai/whisper-large-v3",
  "transcription": "Hello and welcome to our presentation today. We will be discussing the latest developments in artificial intelligence.",
  "language": "en",
  "confidence": 0.95,
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome to our presentation today.", "confidence": 0.96},
    {"start": 2.5, "end": 5.0, "text": "We will be discussing the latest developments", "confidence": 0.94},
    {"start": 5.0, "end": 7.5, "text": "in artificial intelligence.", "confidence": 0.95}
  ],
  "audioUrl": "https://example.com/audio.mp3"
}
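
Since each segment carries `start`, `end`, and `text`, the output maps naturally onto subtitle files. Here is a minimal sketch that converts the `segments` array into SubRip (SRT) text; it assumes the field names shown in the sample output above and that times are in seconds.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn a list of {start, end, text} segments into SRT-formatted text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_line}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Segments copied from the sample output above.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome to our presentation today."},
    {"start": 2.5, "end": 5.0, "text": "We will be discussing the latest developments"},
    {"start": 5.0, "end": 7.5, "text": "in artificial intelligence."},
]
srt_text = segments_to_srt(sample_segments)
```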

Text-to-Speech

{
  "success": true,
  "model": "microsoft/speecht5_tts",
  "audioBase64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
  "mimeType": "audio/wav",
  "text": "Welcome to our podcast...",
  "durationSeconds": 3.5
}
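
The audio comes back base64-encoded in `audioBase64`, so it has to be decoded before it can be played or saved. The sketch below assumes the payload is a complete WAV file, as `mimeType` suggests. Because the sample value above is truncated, the example builds its own half-second silent WAV to stand in for a real result.

```python
import base64
import io
import wave


def decode_audio(result):
    """Return the raw audio bytes from a result object with an audioBase64 field."""
    if not result.get("success"):
        raise ValueError("result is not marked successful")
    return base64.b64decode(result["audioBase64"], validate=True)


def wav_duration_seconds(wav_bytes):
    """Read the WAV header and compute the duration from frames / sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Stand-in for a real response: 0.5 s of 16-bit mono silence at 16 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(16000)
    wav_out.writeframes(b"\x00\x00" * 8000)

fake_result = {
    "success": True,
    "mimeType": "audio/wav",
    "audioBase64": base64.b64encode(buffer.getvalue()).decode("ascii"),
}

wav_bytes = decode_audio(fake_result)
duration = wav_duration_seconds(wav_bytes)
```

To keep the file, write the bytes out, for example with `pathlib.Path("speech.wav").write_bytes(wav_bytes)`.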

Audio Classification

{
  "success": true,
  "model": "MIT/ast-finetuned-audioset-10-10-0.4593",
  "classifications": [
    {"label": "Speech", "score": 0.89},
    {"label": "Music", "score": 0.07},
    {"label": "Silence", "score": 0.03},
    {"label": "Noise", "score": 0.01}
  ],
  "audioUrl": "https://example.com/audio.wav"
}
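
Downstream code usually wants a single decision rather than the full score list. A minimal sketch, assuming the `classifications` shape shown above; the 0.5 threshold is an arbitrary illustration, not a value recommended by the Actor.

```python
def top_label(result, min_score=0.5):
    """Return the highest-scoring label, or None if nothing clears min_score."""
    best = max(result["classifications"], key=lambda c: c["score"], default=None)
    if best is None or best["score"] < min_score:
        return None
    return best["label"]


# Scores copied from the sample output above.
sample = {
    "classifications": [
        {"label": "Speech", "score": 0.89},
        {"label": "Music", "score": 0.07},
        {"label": "Silence", "score": 0.03},
        {"label": "Noise", "score": 0.01},
    ]
}
label = top_label(sample)
```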

Voice Activity Detection

{
  "success": true,
  "model": "pyannote/voice-activity-detection",
  "speechSegments": [
    {"start": 0.5, "end": 3.2, "confidence": 0.98},
    {"start": 4.1, "end": 7.8, "confidence": 0.95},
    {"start": 9.0, "end": 12.5, "confidence": 0.97}
  ],
  "totalSpeechDuration": 9.9,
  "totalAudioDuration": 15.0,
  "speechRatio": 0.66
}
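
The summary fields follow directly from the segments: `totalSpeechDuration` is the sum of segment lengths (2.7 + 3.7 + 3.5 = 9.9 s) and `speechRatio` is that sum divided by `totalAudioDuration` (9.9 / 15.0 = 0.66). The sketch below recomputes both and also derives the silent gaps between segments, which can be handy for trimming; it assumes segments do not need merging beyond simple overlap handling.

```python
def summarize_speech(segments, total_audio_duration):
    """Return (total speech seconds, speech ratio), both rounded to 2 places."""
    total_speech = sum(seg["end"] - seg["start"] for seg in segments)
    ratio = total_speech / total_audio_duration if total_audio_duration > 0 else 0.0
    return round(total_speech, 2), round(ratio, 2)


def silence_gaps(segments, total_audio_duration):
    """Return (start, end) pairs for the stretches not covered by speech."""
    gaps = []
    cursor = 0.0
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["start"] > cursor:
            gaps.append((cursor, seg["start"]))
        cursor = max(cursor, seg["end"])
    if cursor < total_audio_duration:
        gaps.append((cursor, total_audio_duration))
    return gaps


# Segments copied from the sample output above.
speech_segments = [
    {"start": 0.5, "end": 3.2},
    {"start": 4.1, "end": 7.8},
    {"start": 9.0, "end": 12.5},
]
summary = summarize_speech(speech_segments, 15.0)
gaps = silence_gaps(speech_segments, 15.0)
```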

Speaker Diarization

{
  "success": true,
  "model": "pyannote/speaker-diarization",
  "speakers": [
    {
      "speaker": "SPEAKER_00",
      "segments": [
        {"start": 0.5, "end": 5.2},
        {"start": 10.1, "end": 15.3}
      ]
    },
    {
      "speaker": "SPEAKER_01",
      "segments": [
        {"start": 5.5, "end": 9.8},
        {"start": 15.5, "end": 20.0}
      ]
    }
  ],
  "speakerCount": 2
}
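
The output groups segments by speaker, whereas meeting notes usually need them in chronological order. A minimal sketch, assuming the shape shown above, that flattens the result into a timeline of turns and totals each speaker's talk time:

```python
def to_timeline(result):
    """Flatten per-speaker segments into (start, end, speaker) tuples sorted by time."""
    turns = [
        (seg["start"], seg["end"], spk["speaker"])
        for spk in result["speakers"]
        for seg in spk["segments"]
    ]
    return sorted(turns)


def talk_time(result):
    """Return total speaking time in seconds per speaker, rounded to 2 places."""
    return {
        spk["speaker"]: round(sum(s["end"] - s["start"] for s in spk["segments"]), 2)
        for spk in result["speakers"]
    }


# Data copied from the sample output above.
sample = {
    "speakers": [
        {"speaker": "SPEAKER_00",
         "segments": [{"start": 0.5, "end": 5.2}, {"start": 10.1, "end": 15.3}]},
        {"speaker": "SPEAKER_01",
         "segments": [{"start": 5.5, "end": 9.8}, {"start": 15.5, "end": 20.0}]},
    ]
}
timeline = to_timeline(sample)
totals = talk_time(sample)
```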

Music Generation

{
  "success": true,
  "model": "facebook/musicgen-small",
  "audioBase64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
  "mimeType": "audio/wav",
  "prompt": "Upbeat electronic music, 120 BPM",
  "durationSeconds": 10,
  "sampleRate": 32000
}

Pricing (Pay-Per-Event)

| Event | Description | Price |
|---|---|---|
| audio_processed | Per audio task completed | $0.01 |

Example costs:

  • Transcribe 50 audio files: 50 × $0.01 = $0.50
  • Generate 100 TTS clips: 100 × $0.01 = $1.00
  • Classify 200 audio samples: 200 × $0.01 = $2.00
  • Demo mode: $0.00
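
The example costs above are a flat multiplication of task count by the per-event price. A minimal sketch of that arithmetic, using `Decimal` to avoid floating-point rounding in currency values; the $0.01 price is taken from the pricing table, and any separate Hugging Face charges are outside its scope.

```python
from decimal import Decimal

PRICE_PER_TASK = Decimal("0.01")  # audio_processed event price from the table above


def estimate_cost(task_count, demo_mode=False):
    """Estimate the Actor charge in USD for a number of completed tasks."""
    if task_count < 0:
        raise ValueError("task_count must be non-negative")
    if demo_mode:
        return Decimal("0.00")
    return PRICE_PER_TASK * task_count


transcription_cost = estimate_cost(50)    # 50 x $0.01
tts_cost = estimate_cost(100)             # 100 x $0.01
classification_cost = estimate_cost(200)  # 200 x $0.01
```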

Cost Comparison

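| Tool | Price per audio minute | This Actor (per task) |
|---|---|---|
| Rev.ai | $0.025 | ~$0.01 |
| AssemblyAI | $0.015 | ~$0.01 |
| AWS Transcribe | $0.024 | ~$0.01 |

Note that the other tools are priced per minute of audio, while this Actor charges per completed task, so the actual difference depends on the length of each file.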

Common Scenarios

Scenario 1: Podcast Transcription

{
  "task": "speech_to_text",
  "apiToken": "hf_your_token",
  "audioUrl": "https://example.com/podcast-episode-42.mp3",
  "model": "openai/whisper-large-v3",
  "returnTimestamps": true,
  "webhookUrl": "https://hooks.zapier.com/...",
  "demoMode": false
}

Scenario 2: Meeting Notes with Speaker ID

{
  "task": "speaker_diarization",
  "apiToken": "hf_your_token",
  "audioUrl": "https://example.com/team-meeting.mp3",
  "demoMode": false
}

Scenario 3: Audio Ad Voiceover

{
  "task": "text_to_speech",
  "apiToken": "hf_your_token",
  "text": "Discover the future of marketing automation. Visit ActorArsenal.com today.",
  "demoMode": false
}

Scenario 4: Background Music for Content

{
  "task": "music_generation",
  "apiToken": "hf_your_token",
  "musicPrompt": "Calm ambient music, soft piano, peaceful, suitable for meditation video",
  "duration": 30,
  "demoMode": false
}

Webhook & Automation Integration

Zapier / Make.com / n8n

  1. Create a webhook trigger
  2. Copy the URL to webhookUrl
  3. Process audio results in your workflow

Popular automations:

  • Transcripts -> Notion/Google Docs
  • TTS audio -> Cloud storage
  • Diarization -> Meeting summaries
  • Music -> Video editing software
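
If you prefer to receive results on your own server instead of a no-code tool, a small HTTP endpoint is enough. The sketch below is hypothetical: it assumes the Actor POSTs the same JSON result object shown under Output Format to `webhookUrl`, which this page does not confirm. The routing function is separate from the handler so it can be tested without starting a server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def describe_result(payload):
    """Classify an incoming result by the fields it carries (assumed payload shape)."""
    if not isinstance(payload, dict) or not payload.get("success"):
        return "failed"
    if "transcription" in payload:
        return "transcript"
    if "speakers" in payload:
        return "diarization"
    if "audioBase64" in payload:
        return "audio"
    return "other"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)  # Body was not valid JSON.
            self.end_headers()
            return
        print("received:", describe_result(payload))
        self.send_response(204)  # Acknowledge with no response body.
        self.end_headers()


def serve(port=8080):
    """Start the webhook receiver; blocks until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts listening on port 8080; the public URL of that endpoint is what would go into `webhookUrl`.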

Supported Languages (Whisper)

Whisper supports 99+ languages including:

  • English, Spanish, French, German, Italian, Portuguese
  • Chinese (Mandarin, Cantonese), Japanese, Korean
  • Arabic, Hindi, Russian, Turkish
  • And many more...

Hugging Face AI Suite

| Actor | Best For |
|---|---|
| Hugging Face Master | All-in-one (text + image + audio) |
| Hugging Face Text | Text processing |
| Hugging Face Image | Image processing |
| Hugging Face Audio | Audio processing (lightweight) |
| Hugging Face Hub | Model discovery |

FAQ

Q: What audio formats are supported?

A: MP3, WAV, FLAC, OGG, M4A. WAV recommended for quality.

Q: What's the max audio length?

A: Whisper processes up to 30 seconds per chunk. Longer files are automatically chunked.
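
To illustrate what 30-second chunking means for a longer file, the sketch below computes the window boundaries for a given duration. It is only an illustration: the Actor's actual chunking strategy (for example, whether windows overlap) is not described here, so this assumes simple back-to-back windows.

```python
def chunk_windows(duration_s, chunk_s=30.0):
    """Split [0, duration_s] into consecutive (start, end) windows of at most chunk_s seconds."""
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration must be >= 0 and chunk size must be > 0")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


# A 75-second recording yields two full windows and one 15-second remainder.
windows = chunk_windows(75.0)
```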

Q: How accurate is Whisper?

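A: Whisper-large-v3 is reported to reach 97%+ word accuracy on standard clean-speech benchmarks, which is competitive with commercial services. Real-world accuracy varies with audio quality, accents, and language.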

Q: Can I transcribe non-English audio?

A: Yes! Whisper supports 99+ languages. Set the language parameter to a language code, or omit it to let Whisper auto-detect the language.

Q: How long does music generation take?

A: ~10-30 seconds per 10 seconds of music, depending on model load.

Common Problems & Solutions

"Model is loading"

  • Large models need warm-up time
  • Set waitForModel: true (default)
  • Audio processing can take 2-3 minutes for large files

"Audio format not supported"

  • Convert to WAV or MP3
  • Ensure proper encoding

"Transcription quality is poor"

  • Use Whisper-large-v3 for best quality
  • Ensure audio is clear with minimal background noise
  • Specify correct language code

"Demo data showing"

  • Set demoMode: false
  • Provide your Hugging Face API token

Built by John Rippy | Actor Arsenal