Pricing

from $0.01 / 1,000 results

Hugging Face Audio AI

Audio processing with Hugging Face models: speech recognition, text-to-speech, and audio analysis.

Speech-to-Text: Transcribe audio
Text-to-Speech: Generate natural speech
Audio Classification: Classify sounds
Voice Activity Detection: Detect speech
Speaker Diarization: Identify speakers
Music Generation: Create music

Developer: The Howlers

Last modified: 15 days ago

Hugging Face Audio - AI Audio Processing with Whisper, TTS & Music Models

Process audio with state-of-the-art AI models: speech-to-text transcription with Whisper, text-to-speech synthesis, audio classification, speaker diarization, voice activity detection, and AI music generation. No GPU or audio expertise required. Bring your own key (BYOK): you supply your own Hugging Face API token.

Features

Speech-to-Text - Transcribe audio with OpenAI Whisper (97%+ accuracy)
Text-to-Speech - Generate natural-sounding speech
Audio Classification - Identify speech, music, sounds
Audio Enhancement - Denoise and improve audio quality
Voice Activity Detection - Find speech segments in audio
Speaker Diarization - Identify different speakers
Audio Embeddings - Generate audio feature vectors
Music Generation - Create music from text prompts
Multi-language Support - 99+ languages for transcription
Demo Mode - Test with sample data before going live

Who Should Use This Actor?

Content Creators

Transcribe podcasts and videos. Generate voiceovers. Create background music.

Marketing Teams

Transcribe webinars and interviews. Generate audio ads. Analyze call recordings.

Media Companies

Automate transcription. Generate subtitles. Classify audio content.

Researchers

Transcribe interviews. Analyze audio datasets. Extract audio features.

Podcasters

Get instant transcripts. Generate show notes. Create audio clips.

Accessibility Teams

Generate captions. Create audio descriptions. Convert text to speech.

Quick Start

Demo Mode (Free Test)

{
  "task": "speech_to_text",
  "audioUrl": "https://example.com/podcast.mp3",
  "demoMode": true
}
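
One way to start a run with this input from Python is the Apify API client. The sketch below assumes the `apify-client` package is installed, that results are written to the run's default dataset, and that `ACTOR_ID` is replaced with this Actor's real ID, which this page does not state. `demo_input` and `run_actor` are illustrative helper names, not part of the Actor.

```python
from typing import Any

ACTOR_ID = "<username>/<actor-name>"  # placeholder: the real ID is not given here


def demo_input(audio_url: str) -> dict[str, Any]:
    """Build the demo-mode input shown above (that example omits apiToken)."""
    return {"task": "speech_to_text", "audioUrl": audio_url, "demoMode": True}


def run_actor(apify_token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so demo_input() works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(apify_token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Assumption: the Actor stores its output in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_actor("<your Apify token>", demo_input("https://example.com/podcast.mp3"))` would then return the sample transcription records.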

Speech-to-Text (Whisper)

{
  "task": "speech_to_text",
  "apiToken": "hf_your_token_here",
  "model": "openai/whisper-large-v3",
  "audioUrl": "https://example.com/interview.mp3",
  "language": "en",
  "returnTimestamps": true,
  "demoMode": false
}

Text-to-Speech

{
  "task": "text_to_speech",
  "apiToken": "hf_your_token_here",
  "text": "Welcome to our podcast. Today we're discussing the latest trends in AI.",
  "speakerId": 0,
  "demoMode": false
}

Audio Classification

{
  "task": "audio_classification",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/audio-clip.wav",
  "demoMode": false
}

Voice Activity Detection

{
  "task": "voice_activity_detection",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/recording.wav",
  "demoMode": false
}

Speaker Diarization

{
  "task": "speaker_diarization",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/meeting.mp3",
  "demoMode": false
}

Audio Enhancement (Noise Reduction)

{
  "task": "audio_to_audio",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/noisy-recording.wav",
  "demoMode": false
}

Audio Embeddings

{
  "task": "audio_embeddings",
  "apiToken": "hf_your_token_here",
  "audioUrl": "https://example.com/audio-sample.wav",
  "demoMode": false
}

Music Generation

{
  "task": "music_generation",
  "apiToken": "hf_your_token_here",
  "musicPrompt": "Upbeat electronic music, 120 BPM, energetic, modern",
  "duration": 10,
  "demoMode": false
}

Input Parameters

Parameter	Type	Default	Description
`task`	string	required	Task to perform (see task list)
`apiToken`	string	-	Your Hugging Face API token
`model`	string	task default	Specific model to use
`audioUrl`	string	-	URL to audio file
`text`	string	-	Text for speech synthesis
`musicPrompt`	string	-	Prompt for music generation
`language`	string	-	Language code for transcription
`targetLanguage`	string	-	Target language for translation
`voicePreset`	string	-	Voice preset for TTS
`speakerId`	number	-	Speaker ID for multi-speaker TTS
`duration`	number	`10`	Music duration in seconds
`sampleRate`	number	`22050`	Audio sample rate
`returnTimestamps`	boolean	`false`	Include word timestamps
`waitForModel`	boolean	`true`	Wait for model to load
`webhookUrl`	string	-	Webhook URL for results
`demoMode`	boolean	`true`	Return sample data

Available Tasks

Task	Description	Default Model
`speech_to_text`	Transcribe audio	Whisper-large-v3
`text_to_speech`	Generate speech	SpeechT5 TTS
`audio_classification`	Classify audio	AST-AudioSet
`audio_to_audio`	Enhance/transform audio	SepFormer
`voice_activity_detection`	Detect speech segments	Pyannote VAD
`speaker_diarization`	Identify speakers	Pyannote Diarization
`audio_embeddings`	Extract audio features	Wav2Vec2
`music_generation`	Generate music	MusicGen-small
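
The parameters and task names in the two tables above can be checked before a run is started. The following is a minimal sketch, not the Actor's own validation: `build_input` is a hypothetical helper, and the rule for which field each task needs is an assumption inferred from the Quick Start examples.

```python
from typing import Any, Optional

TASKS = {
    "speech_to_text", "text_to_speech", "audio_classification", "audio_to_audio",
    "voice_activity_detection", "speaker_diarization", "audio_embeddings",
    "music_generation",
}

# Assumption: inferred from the Quick Start examples, not documented as rules.
# Every task not listed here is assumed to need "audioUrl".
NON_AUDIO_FIELD = {"text_to_speech": "text", "music_generation": "musicPrompt"}


def build_input(task: str, api_token: Optional[str] = None,
                demo_mode: bool = True, **options: Any) -> dict[str, Any]:
    """Return an input object, raising ValueError on obvious mistakes."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    needed = NON_AUDIO_FIELD.get(task, "audioUrl")
    if not options.get(needed):
        raise ValueError(f"task {task!r} needs the {needed!r} field")
    if not demo_mode and not api_token:
        raise ValueError("apiToken is required when demoMode is false")
    run_input: dict[str, Any] = {"task": task, "demoMode": demo_mode, **options}
    if api_token:
        run_input["apiToken"] = api_token
    return run_input
```

For example, `build_input("music_generation", api_token="hf_your_token_here", demo_mode=False, musicPrompt="Calm ambient music", duration=30)` produces an object in the same shape as the Music Generation example.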

Output Format

Speech-to-Text

{
  "success": true,
  "model": "openai/whisper-large-v3",
  "transcription": "Hello and welcome to our presentation today. We will be discussing the latest developments in artificial intelligence.",
  "language": "en",
  "confidence": 0.95,
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome to our presentation today.", "confidence": 0.96},
    {"start": 2.5, "end": 5.0, "text": "We will be discussing the latest developments", "confidence": 0.94},
    {"start": 5.0, "end": 7.5, "text": "in artificial intelligence.", "confidence": 0.95}
  ],
  "audioUrl": "https://example.com/audio.mp3"
}
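
Timestamped segments like these map directly onto subtitle files. As a sketch, assuming each segment carries `start`, `end`, and `text` as in the sample above, the hypothetical helpers below render them in SubRip (SRT) format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn a list of {"start", "end", "text"} segments into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_line}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```

Passing the `segments` array from a result to `segments_to_srt` yields text that can be saved with an `.srt` extension.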

Text-to-Speech

{
  "success": true,
  "model": "microsoft/speecht5_tts",
  "audioBase64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
  "mimeType": "audio/wav",
  "text": "Welcome to our podcast...",
  "durationSeconds": 3.5
}
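
The `audioBase64` field has to be decoded before it can be played. A minimal sketch follows, assuming the field holds the complete Base64-encoded file (the value in the sample above is truncated and will not decode to valid audio). `decode_audio` and `save_audio` are illustrative names.

```python
import base64
from pathlib import Path


def decode_audio(result: dict) -> bytes:
    """Return the raw audio bytes from a successful result."""
    if not result.get("success"):
        raise ValueError("result does not report success")
    return base64.b64decode(result["audioBase64"])


def save_audio(result: dict, path: str) -> int:
    """Write the decoded audio to `path` and return the number of bytes written."""
    return Path(path).write_bytes(decode_audio(result))
```

The same helpers should apply to Music Generation results, which use the same `audioBase64` and `mimeType` fields.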

Audio Classification

{
  "success": true,
  "model": "MIT/ast-finetuned-audioset-10-10-0.4593",
  "classifications": [
    {"label": "Speech", "score": 0.89},
    {"label": "Music", "score": 0.07},
    {"label": "Silence", "score": 0.03},
    {"label": "Noise", "score": 0.01}
  ],
  "audioUrl": "https://example.com/audio.wav"
}
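
A common next step is to reduce the score list to a single label. The sketch below picks the highest-scoring entry and rejects it when it falls under a cutoff; `top_label` and its `min_score` threshold are hypothetical and not part of the Actor.

```python
from typing import Optional


def top_label(result: dict, min_score: float = 0.5) -> Optional[str]:
    """Return the best label, or None if its score is below min_score."""
    best = max(result["classifications"], key=lambda item: item["score"])
    return best["label"] if best["score"] >= min_score else None
```

For the sample above, `top_label(result)` returns "Speech".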

Voice Activity Detection

{
  "success": true,
  "model": "pyannote/voice-activity-detection",
  "speechSegments": [
    {"start": 0.5, "end": 3.2, "confidence": 0.98},
    {"start": 4.1, "end": 7.8, "confidence": 0.95},
    {"start": 9.0, "end": 12.5, "confidence": 0.97}
  ],
  "totalSpeechDuration": 9.9,
  "totalAudioDuration": 15.0,
  "speechRatio": 0.66
}
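
The summary fields follow from the segments: in the sample, (3.2 - 0.5) + (7.8 - 4.1) + (12.5 - 9.0) = 9.9 seconds of speech, and 9.9 / 15.0 = 0.66. A short sketch that recomputes them (the helper name `speech_stats` is illustrative, and rounding to two decimals is an assumption):

```python
def speech_stats(segments: list[dict], total_audio_duration: float) -> tuple[float, float]:
    """Return (total speech seconds, speech ratio), both rounded to 2 decimals."""
    total_speech = sum(seg["end"] - seg["start"] for seg in segments)
    return round(total_speech, 2), round(total_speech / total_audio_duration, 2)
```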

Speaker Diarization

{
  "success": true,
  "model": "pyannote/speaker-diarization",
  "speakers": [
    {
      "speaker": "SPEAKER_00",
      "segments": [
        {"start": 0.5, "end": 5.2},
        {"start": 10.1, "end": 15.3}
      ]
    },
    {
      "speaker": "SPEAKER_01",
      "segments": [
        {"start": 5.5, "end": 9.8},
        {"start": 15.5, "end": 20.0}
      ]
    }
  ],
  "speakerCount": 2
}
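
Because segments are grouped per speaker, meeting notes usually need them merged back into chronological order. A sketch assuming the structure shown above (`to_timeline` and `talk_time` are illustrative helpers):

```python
def to_timeline(result: dict) -> list[tuple[float, float, str]]:
    """Flatten per-speaker segments into (start, end, speaker) turns sorted by time."""
    turns = [
        (seg["start"], seg["end"], entry["speaker"])
        for entry in result["speakers"]
        for seg in entry["segments"]
    ]
    return sorted(turns)


def talk_time(result: dict) -> dict[str, float]:
    """Total speaking time per speaker in seconds, rounded to 2 decimals."""
    return {
        entry["speaker"]: round(sum(s["end"] - s["start"] for s in entry["segments"]), 2)
        for entry in result["speakers"]
    }
```

The timeline can then be aligned with a `speech_to_text` transcript by matching segment start times.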

Music Generation

{
  "success": true,
  "model": "facebook/musicgen-small",
  "audioBase64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA...",
  "mimeType": "audio/wav",
  "prompt": "Upbeat electronic music, 120 BPM",
  "durationSeconds": 10,
  "sampleRate": 32000
}
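
To confirm that a generated clip matches the reported `sampleRate` and `durationSeconds`, the decoded audio can be inspected with Python's standard `wave` module. This sketch assumes the payload is an uncompressed PCM WAV file, which the `audio/wav` MIME type suggests but does not guarantee; `wav_info` is an illustrative name.

```python
import base64
import io
import wave


def wav_info(audio_base64: str) -> tuple[int, float]:
    """Return (sample rate in Hz, duration in seconds) of a Base64-encoded WAV."""
    raw = base64.b64decode(audio_base64)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate
```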

Pricing (Pay-Per-Event)

Event	Description	Price
`audio_processed`	Per audio task completed	$0.01

Example costs:

Transcribe 50 audio files: 50 × $0.01 = $0.50
Generate 100 TTS clips: 100 × $0.01 = $1.00
Classify 200 audio samples: 200 × $0.01 = $2.00
Demo mode: $0.00
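
The same arithmetic as a small sketch, using the $0.01 `audio_processed` price from the table above and `Decimal` to avoid floating-point rounding. `estimate_cost` is an illustrative helper, and it covers only the Actor's per-task event, not any other platform charges.

```python
from decimal import Decimal

PRICE_PER_TASK = Decimal("0.01")  # audio_processed event price from the table


def estimate_cost(tasks: int, demo_mode: bool = False) -> Decimal:
    """Estimated charge in USD for a number of completed audio tasks."""
    if tasks < 0:
        raise ValueError("tasks must be non-negative")
    return Decimal("0.00") if demo_mode else PRICE_PER_TASK * tasks
```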

Cost Comparison

Tool	Price (per audio minute)	This Actor (per task)
Rev.ai	$0.025	~$0.01
AssemblyAI	$0.015	~$0.01
AWS Transcribe	$0.024	~$0.01

Common Scenarios

Scenario 1: Podcast Transcription

{
  "task": "speech_to_text",
  "apiToken": "hf_your_token",
  "audioUrl": "https://example.com/podcast-episode-42.mp3",
  "model": "openai/whisper-large-v3",
  "returnTimestamps": true,
  "webhookUrl": "https://hooks.zapier.com/...",
  "demoMode": false
}

Scenario 2: Meeting Notes with Speaker ID

{
  "task": "speaker_diarization",
  "apiToken": "hf_your_token",
  "audioUrl": "https://example.com/team-meeting.mp3",
  "demoMode": false
}

Scenario 3: Audio Ad Voiceover

{
  "task": "text_to_speech",
  "apiToken": "hf_your_token",
  "text": "Discover the future of marketing automation. Visit ActorArsenal.com today.",
  "demoMode": false
}

Scenario 4: Background Music for Content

{
  "task": "music_generation",
  "apiToken": "hf_your_token",
  "musicPrompt": "Calm ambient music, soft piano, peaceful, suitable for meditation video",
  "duration": 30,
  "demoMode": false
}

Webhook & Automation Integration

Zapier / Make.com / n8n

Create a webhook trigger in Zapier, Make.com, or n8n
Copy the trigger URL into the webhookUrl input parameter
Process the audio results in your workflow

Popular automations:

Transcripts -> Notion/Google Docs
TTS audio -> Cloud storage
Diarization -> Meeting summaries
Music -> Video editing software
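
For a self-hosted alternative to these tools, a small HTTP receiver is enough. The sketch below assumes the Actor sends an HTTP POST whose JSON body has the same shape as the objects under Output Format; this page does not document the webhook payload, so treat the field handling as hypothetical. `summarize_payload`, `WebhookHandler`, and `serve` are illustrative names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_payload(body: bytes) -> str:
    """Describe a result payload by the output fields it contains."""
    data = json.loads(body)
    if not data.get("success"):
        return "failed"
    if "transcription" in data:
        return f"transcript ({len(data['transcription'])} chars)"
    if "audioBase64" in data:
        return f"audio ({data.get('mimeType', 'unknown type')})"
    for key in ("classifications", "speechSegments", "speakers"):
        if key in data:
            return f"{key} ({len(data[key])} items)"
    return "unrecognized result"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read exactly the number of bytes announced by the sender.
        length = int(self.headers.get("Content-Length", 0))
        print(summarize_payload(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Listen for webhook calls until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Running `serve()` on a publicly reachable host and passing its URL as `webhookUrl` would print one summary line per completed task.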

Supported Languages (Whisper)

Whisper supports 99+ languages including:

English, Spanish, French, German, Italian, Portuguese
Chinese (Mandarin, Cantonese), Japanese, Korean
Arabic, Hindi, Russian, Turkish
And many more...

Hugging Face AI Suite

Actor	Best For
Hugging Face Master	All-in-one (text + image + audio)
Hugging Face Text	Text processing
Hugging Face Image	Image processing
Hugging Face Audio	Audio processing (lightweight)
Hugging Face Hub	Model discovery

FAQ

Q: What audio formats are supported?

A: MP3, WAV, FLAC, OGG, and M4A. WAV is recommended for best quality.

Q: What's the max audio length?

A: Whisper processes up to 30 seconds per chunk. Longer files are automatically chunked.
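
As an illustration of what chunking means, the sketch below splits a duration into consecutive windows of at most 30 seconds. It is not the Actor's implementation: real pipelines often overlap neighbouring chunks to avoid cutting words, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(duration: float, chunk_seconds: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering [0, duration] in non-overlapping chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds
```

A 75-second file, for example, is covered by three chunks: 0 to 30, 30 to 60, and 60 to 75 seconds.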

Q: How accurate is Whisper?

A: Whisper-large-v3 achieves 97%+ accuracy on standard benchmarks, better than most commercial solutions.

Q: Can I transcribe non-English audio?

A: Yes! Whisper supports 99+ languages. Specify the language parameter or let Whisper auto-detect the language.

Q: How long does music generation take?

A: ~10-30 seconds per 10 seconds of music, depending on model load.

Common Problems & Solutions

"Model is loading"

Large models need warm-up time
Set waitForModel: true (default)
Audio processing can take 2-3 minutes for large files

"Audio format not supported"

Convert to WAV or MP3
Ensure proper encoding

"Transcription quality is poor"

Use Whisper-large-v3 for best quality
Ensure audio is clear with minimal background noise
Specify correct language code

"Demo data showing"

Set demoMode: false
Provide your Hugging Face API token

📞 Support

Actor Arsenal: Full Actor Catalog
Developer: John Rippy

Built by John Rippy | Actor Arsenal
