Deprecated

Pricing

Pay per usage

YouTube Speech Dataset Builder

Generate multilingual speech datasets from YouTube using WhisperX, transcription, language detection, and code-switch analysis for ASR training, benchmarking, and speech AI research.

Developer

Jona

Last modified: a month ago

Multilingual Code-Switching Audio Scraper

Scrapes YouTube for code-switched speech (e.g. Malayalam+English / Manglish), transcribes with WhisperX, detects language switch points, and outputs annotated clips ready for ASR benchmark dataset creation.

Built as part of the Manglish ASR Benchmark project — a rigorous evaluation dataset and leaderboard for code-switched Indian English speech recognition.

What It Does

Searches YouTube for content matching your queries (or processes URLs you provide)
Downloads audio with yt-dlp
Transcribes with WhisperX — word-level timestamps
Detects language spans — which words are Malayalam vs English, using Unicode script ranges
Finds switch points — timestamps where the speaker switches language mid-sentence
Filters clips by code-switching quality (configurable ratio threshold)
Outputs annotated JSON — YouTube IDs + timestamps for research publishing, or full audio clips for local training
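The span, switch-point, and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the `Word` type, the function names, the per-word timestamps, and the 0.2 to 0.8 ratio thresholds are all assumptions, and the ratio here is a plain word count, which may differ from how the Actor computes `en_ratio`.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with timestamps (seconds) and a language label."""
    text: str
    start: float
    end: float
    lang: str  # e.g. "ml" or "en"


def merge_spans(words):
    """Merge consecutive words that share a language into one span."""
    spans = []
    for w in words:
        if spans and spans[-1]["lang"] == w.lang:
            spans[-1]["end"] = w.end
            spans[-1]["text"] += " " + w.text
        else:
            spans.append({"start": w.start, "end": w.end,
                          "lang": w.lang, "text": w.text})
    return spans


def switch_points(spans):
    """A switch point is the start time of every span after the first."""
    return [s["start"] for s in spans[1:]]


def english_ratio(words):
    """Fraction of words labeled English (word-count based, an assumption)."""
    if not words:
        return 0.0
    return sum(w.lang == "en" for w in words) / len(words)


def keep_clip(words, min_ratio=0.2, max_ratio=0.8):
    """Keep clips that are neither almost all English nor almost none."""
    return min_ratio <= english_ratio(words) <= max_ratio


# Hypothetical word-level input, loosely modeled on the sample transcript.
words = [
    Word("Njan", 0.0, 0.4, "ml"), Word("yesterday", 0.4, 1.1, "en"),
    Word("office-il", 1.1, 2.0, "ml"), Word("but", 2.0, 2.4, "en"),
    Word("the", 2.4, 2.6, "en"), Word("meeting", 2.6, 3.2, "en"),
    Word("was", 3.2, 3.5, "en"), Word("boring", 3.5, 4.5, "en"),
]
spans = merge_spans(words)
points = switch_points(spans)
```

With this input, `spans` has four entries and `points` holds the three span boundaries, the same shape as `language_spans` and `switch_points` in the output.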

Output Format

Each result item:

{
  "clip_id": "ml_en_0042",
  "youtube_id": "dQw4w9WgXcQ",
  "video_title": "My Day in Kerala - Manglish Vlog",
  "start_sec": 42.3,
  "end_sec": 55.1,
  "duration_sec": 12.8,
  "transcript": "Njan yesterday office-il പോയി, but the meeting was boring",
  "language_spans": [
    {"start": 0.0, "end": 0.4, "lang": "ml", "text": "Njan"},
    {"start": 0.4, "end": 1.1, "lang": "en", "text": "yesterday"},
    {"start": 1.1, "end": 2.0, "lang": "ml", "text": "office-il"},
    {"start": 2.0, "end": 4.5, "lang": "en", "text": "but the meeting was boring"}
  ],
  "switch_points": [0.4, 1.1, 2.0],
  "switch_count": 3,
  "primary_lang_ratio": 0.42,
  "en_ratio": 0.58,
  "confidence": 0.87
}
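When consuming these items, a few internal consistency checks can catch malformed records early. The sketch below is an assumption-laden example rather than part of the Actor: `check_item` is a hypothetical helper, the tolerances are arbitrary, and the rule that the two ratios sum to 1 only makes sense for two-language clips like the sample.

```python
import math


def check_item(item):
    """Return a list of consistency problems found in one result item."""
    problems = []
    # duration_sec should match the clip boundaries (allowing for rounding).
    if not math.isclose(item["end_sec"] - item["start_sec"],
                        item["duration_sec"], abs_tol=0.05):
        problems.append("duration_sec does not match end_sec - start_sec")
    # switch_count should equal the number of listed switch points.
    if item["switch_count"] != len(item["switch_points"]):
        problems.append("switch_count does not match switch_points")
    # Assumption: exactly two languages, so the ratios should sum to ~1.
    if not math.isclose(item["primary_lang_ratio"] + item["en_ratio"],
                        1.0, abs_tol=0.01):
        problems.append("language ratios do not sum to 1")
    # Spans should be ordered and non-overlapping.
    spans = item["language_spans"]
    for prev, cur in zip(spans, spans[1:]):
        if cur["start"] < prev["end"]:
            problems.append("language_spans overlap or are out of order")
            break
    return problems


# Numeric fields copied from the sample item above.
sample = {
    "start_sec": 42.3, "end_sec": 55.1, "duration_sec": 12.8,
    "language_spans": [
        {"start": 0.0, "end": 0.4, "lang": "ml", "text": "Njan"},
        {"start": 0.4, "end": 1.1, "lang": "en", "text": "yesterday"},
        {"start": 1.1, "end": 2.0, "lang": "ml", "text": "office-il"},
        {"start": 2.0, "end": 4.5, "lang": "en",
         "text": "but the meeting was boring"},
    ],
    "switch_points": [0.4, 1.1, 2.0], "switch_count": 3,
    "primary_lang_ratio": 0.42, "en_ratio": 0.58,
}
problems = check_item(sample)
```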

Running Locally (M4 Mac / Any Machine)

Prerequisites

# Install ffmpeg (required for audio processing)
brew install ffmpeg   # Mac
# sudo apt install ffmpeg  # Linux

# Install Python dependencies
pip install -r requirements.txt

Run

# Edit input settings
nano storage/key_value_stores/default/INPUT.json

# Run
python -m src

Results → output/results.json

Audio clips (if publishIdsOnly=false) → output/clips/

Running on Apify

Deploy via Apify CLI:

npm install -g apify-cli
apify login
apify push

Or drag the folder into Apify Console → Create Actor → Upload source.

Research Mode vs Full Audio Mode

Setting	`publishIdsOnly: true`	`publishIdsOnly: false`
What's saved	YouTube ID + timestamps	Actual .wav clip files
Intended use	HuggingFace dataset publishing	Local model training
Downloaded audio	Deleted after processing	Saved to output/clips/
Legal	Follows academic dataset norms	For personal/research use only

For HuggingFace publishing: use publishIdsOnly: true and publish your results.json as the dataset; users reconstruct the audio themselves from the YouTube IDs and timestamps. This is the standard approach for YouTube-derived datasets (AudioSet, VGGSound, etc.).
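For someone on the consuming side of an IDs-only dataset, reconstructing a clip could look roughly like the sketch below. It only builds a yt-dlp command line and does not run it. `build_clip_command` and the `clips` output directory are hypothetical, and the assumption is that yt-dlp's `--download-sections` accepts a `*start-end` range in seconds; check the yt-dlp documentation for your installed version.

```python
def build_clip_command(item, out_dir="clips"):
    """Build a yt-dlp command that fetches one clip's audio as WAV.

    `item` is one record from results.json. Nothing is downloaded here;
    the caller decides whether and how to run the command.
    """
    url = f"https://www.youtube.com/watch?v={item['youtube_id']}"
    # "*" marks a time range rather than a chapter-title pattern.
    section = f"*{item['start_sec']}-{item['end_sec']}"
    return [
        "yt-dlp",
        "--download-sections", section,
        "-x", "--audio-format", "wav",  # extract audio, convert to WAV
        "-o", f"{out_dir}/{item['clip_id']}.%(ext)s",
        url,
    ]


item = {"clip_id": "ml_en_0042", "youtube_id": "dQw4w9WgXcQ",
        "start_sec": 42.3, "end_sec": 55.1}
cmd = build_clip_command(item)
# To actually download: subprocess.run(cmd, check=True)
```

Returning the argument list instead of running it keeps the helper easy to inspect and lets callers add their own retry or rate-limiting logic.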

Supported Languages (Phase 1)

Code	Language	Script Detection
`ml`	Malayalam	Unicode 0D00–0D7F
`hi`	Hindi	Unicode 0900–097F
`ta`	Tamil	Unicode 0B80–0BFF
`te`	Telugu	Unicode 0C00–0C7F
`kn`	Kannada	Unicode 0C80–0CFF
`bn`	Bengali	Unicode 0980–09FF
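A script-range check built from the table above might look like the following. This is a sketch under assumptions, not the Actor's code: `SCRIPT_RANGES`, `detect_word_lang`, and the fallback to `"en"` are illustrative names and choices. Note that a script-only check like this labels romanized words (for example "Njan" written in Latin letters) as the default language; how the Actor handles those is not described here.

```python
# Unicode block boundaries taken from the table above.
SCRIPT_RANGES = {
    "ml": (0x0D00, 0x0D7F),  # Malayalam
    "hi": (0x0900, 0x097F),  # Devanagari (Hindi)
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "bn": (0x0980, 0x09FF),  # Bengali
}


def detect_word_lang(word, default="en"):
    """Label a word by the Indic script most of its characters belong to.

    Words with no characters in any listed range fall back to `default`.
    """
    counts = {}
    for ch in word:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[code] = counts.get(code, 0) + 1
                break
    if not counts:
        return default
    return max(counts, key=counts.get)


labels = [detect_word_lang(w) for w in ["പോയി", "boring", "नमस्ते", "Njan"]]
```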

The Switch-Point WER Metric

This scraper feeds the Switch-Point WER benchmark — a novel evaluation metric that measures ASR accuracy specifically in a ±2 word window around each language switch.

Standard WER averages errors over the whole utterance, so it hides the fact that models tend to fail specifically at the moment of switching. Switch-Point WER isolates those errors.
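One way such a windowed metric could be computed is sketched below. The benchmark's exact definition is not given here, so several choices are assumptions: a switch index `s` means reference word `s` starts a new language, the window covers reference positions `s-2` through `s+1`, insertions are charged to the nearest following reference word, and the score is errors in the window divided by reference words in the window. The example words are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists; return (op, ref_index) pairs."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,
                           dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", min(i, n - 1)))  # charge to next ref word
            j -= 1
    return ops[::-1]


def wer(ref, hyp):
    """Standard word error rate over the whole utterance."""
    errors = sum(op != "ok" for op, _ in align(ref, hyp))
    return errors / len(ref) if ref else 0.0


def switch_point_wer(ref, hyp, switch_indices, k=2):
    """Error rate restricted to a window of k words each side of a switch."""
    window = set()
    for s in switch_indices:
        window.update(range(max(0, s - k), min(len(ref), s + k)))
    if not window:
        return 0.0
    errors = sum(op != "ok" and idx in window for op, idx in align(ref, hyp))
    return errors / len(window)


ref = ["we", "had", "a", "long", "meeting", "pakshe", "njan", "poyi"]
hyp = ["we", "had", "a", "long", "meeting", "paksha", "nyan", "poyi"]
overall = wer(ref, hyp)                     # 2 errors over 8 words
at_switch = switch_point_wer(ref, hyp, [5])  # 2 errors over 4 window words
```

In this toy case both errors sit right after the switch, so the windowed score is twice the overall WER, which is the effect the metric is meant to expose.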

→ Manglish ASR Benchmark on HuggingFace (link after publish)

Project Context

This is part of a larger research effort:

Phase 1: Malayalam+English (Manglish) — this actor
Phase 2: Tamil+English, Hindi+English
End goal: Published HuggingFace dataset + leaderboard + fine-tuned Whisper checkpoint

Built by Jona Joy
