YouTube Speech Dataset Builder

Under maintenance

Pricing

Pay per usage

Generate multilingual speech datasets from YouTube using WhisperX, transcription, language detection, and code-switch analysis for ASR training, benchmarking, and speech AI research.

Developer

Jona

Maintained by Community

Multilingual Code-Switching Audio Scraper

Scrapes YouTube for code-switched speech (e.g. Malayalam+English / Manglish), transcribes with WhisperX, detects language switch points, and outputs annotated clips ready for ASR benchmark dataset creation.

Built as part of the Manglish ASR Benchmark project — a rigorous evaluation dataset and leaderboard for code-switched Indian English speech recognition.


What It Does

  1. Searches YouTube for content matching your queries (or processes URLs you provide)
  2. Downloads audio with yt-dlp
  3. Transcribes with WhisperX — word-level timestamps
  4. Detects language spans — which words are Malayalam vs English, using Unicode script ranges
  5. Finds switch points — timestamps where the speaker switches language mid-sentence
  6. Filters clips by code-switching quality (configurable ratio threshold)
  7. Outputs annotated JSON — YouTube IDs + timestamps for research publishing, or full audio clips for local training
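Steps 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual code: it assumes each word arrives as a dict with `start`, `end`, and a `lang` tag, and the function names and threshold values (`min_ratio`, `max_ratio`, `min_switches`) are hypothetical. It also assumes the ratio is counted per word; the actor may weight by duration instead.

```python
def find_switch_points(words):
    """Return the start time of every word whose language differs from the previous word's."""
    points = []
    for prev, cur in zip(words, words[1:]):
        if cur["lang"] != prev["lang"]:
            points.append(cur["start"])
    return points


def english_ratio(words):
    """Fraction of words tagged as English (assumption: counted per word, not per second)."""
    if not words:
        return 0.0
    return sum(1 for w in words if w["lang"] == "en") / len(words)


def passes_filter(words, min_ratio=0.2, max_ratio=0.8, min_switches=1):
    """Keep a clip only if both languages are well represented and it switches at least once.

    The default thresholds are illustrative placeholders, not the actor's defaults.
    """
    ratio = english_ratio(words)
    return min_ratio <= ratio <= max_ratio and len(find_switch_points(words)) >= min_switches


# Example: the four spans from a Manglish clip, treated as word-level units.
words = [
    {"start": 0.0, "end": 0.4, "lang": "ml"},
    {"start": 0.4, "end": 1.1, "lang": "en"},
    {"start": 1.1, "end": 2.0, "lang": "ml"},
    {"start": 2.0, "end": 4.5, "lang": "en"},
]
switches = find_switch_points(words)
keep = passes_filter(words)
```

A monolingual clip produces no switch points and is rejected by `passes_filter` regardless of its ratio.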

Output Format

Each result item:

{
  "clip_id": "ml_en_0042",
  "youtube_id": "dQw4w9WgXcQ",
  "video_title": "My Day in Kerala - Manglish Vlog",
  "start_sec": 42.3,
  "end_sec": 55.1,
  "duration_sec": 12.8,
  "transcript": "Njan yesterday office-il പോയി, but the meeting was boring",
  "language_spans": [
    {"start": 0.0, "end": 0.4, "lang": "ml", "text": "Njan"},
    {"start": 0.4, "end": 1.1, "lang": "en", "text": "yesterday"},
    {"start": 1.1, "end": 2.0, "lang": "ml", "text": "office-il"},
    {"start": 2.0, "end": 4.5, "lang": "en", "text": "but the meeting was boring"}
  ],
  "switch_points": [0.4, 1.1, 2.0],
  "switch_count": 3,
  "primary_lang_ratio": 0.42,
  "en_ratio": 0.58,
  "confidence": 0.87
}
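When consuming these items downstream, a quick consistency check can catch malformed records early. The sketch below is one possible validator, not part of the actor: `check_clip` and its tolerance are hypothetical, and the rule that the two ratios sum to 1 is an assumption that holds for the two-language example above but is not stated anywhere as a guarantee.

```python
import math


def check_clip(item, tol=0.05):
    """Return a list of human-readable problems found in one result item (empty if none)."""
    problems = []
    if not math.isclose(item["end_sec"] - item["start_sec"], item["duration_sec"], abs_tol=tol):
        problems.append("duration_sec does not match end_sec - start_sec")
    if item["switch_count"] != len(item["switch_points"]):
        problems.append("switch_count does not match len(switch_points)")
    # Assumption: a two-language clip, so the two ratios should sum to 1.
    if not math.isclose(item["primary_lang_ratio"] + item["en_ratio"], 1.0, abs_tol=tol):
        problems.append("language ratios do not sum to 1")
    spans = item["language_spans"]
    for prev, cur in zip(spans, spans[1:]):
        if cur["start"] < prev["end"]:
            problems.append(f"span starting at {cur['start']} overlaps the previous span")
    return problems


# Trimmed copy of the example item shown above.
sample = {
    "start_sec": 42.3,
    "end_sec": 55.1,
    "duration_sec": 12.8,
    "language_spans": [
        {"start": 0.0, "end": 0.4, "lang": "ml"},
        {"start": 0.4, "end": 1.1, "lang": "en"},
        {"start": 1.1, "end": 2.0, "lang": "ml"},
        {"start": 2.0, "end": 4.5, "lang": "en"},
    ],
    "switch_points": [0.4, 1.1, 2.0],
    "switch_count": 3,
    "primary_lang_ratio": 0.42,
    "en_ratio": 0.58,
}
problems = check_clip(sample)
```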

Running Locally (M4 Mac / Any Machine)

Prerequisites

# Install ffmpeg (required for audio processing)
brew install ffmpeg # Mac
# sudo apt install ffmpeg # Linux
# Install Python dependencies
pip install -r requirements.txt

Run

# Edit input settings
nano storage/key_value_stores/default/INPUT.json
# Run
python -m src

Results → output/results.json

Audio clips (if publishIdsOnly=false) → output/clips/


Running on Apify

Deploy via Apify CLI:

npm install -g apify-cli
apify login
apify push

Or drag the folder into Apify Console → Create Actor → Upload source.


Research Mode vs Full Audio Mode

| Setting | publishIdsOnly: true | publishIdsOnly: false |
| What's saved | YouTube ID + timestamps | Actual .wav clip files |
| For | HuggingFace dataset publishing | Local model training |
| Audio downloaded | Deleted after processing | Saved to output/clips/ |
| Legal | Follows academic dataset norms | For personal/research use only |

For HuggingFace publishing: use publishIdsOnly: true and publish your results.json as the dataset. Users then reconstruct the audio themselves from the YouTube IDs and timestamps. This is the standard approach for YouTube-derived datasets (AudioSet, VGGSound, etc.).
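On the consumer side, reconstructing a clip from an ID and timestamps could look roughly like the sketch below. It assumes the user has already downloaded the video's full audio to a local file by their own means, and it only builds the `ffmpeg` argument list rather than running it. The helper names, the `clips` output directory, and the 16 kHz mono output format are illustrative assumptions, not something the dataset prescribes.

```python
def watch_url(youtube_id):
    """Standard YouTube watch URL for a video ID."""
    return f"https://www.youtube.com/watch?v={youtube_id}"


def ffmpeg_cut_args(src_path, item, out_dir="clips"):
    """Build an ffmpeg command that cuts [start_sec, end_sec] out of a local audio file.

    -ss / -to select the time range; -ac 1 and -ar 16000 resample to 16 kHz mono,
    a common ASR input format (an assumption here, not a dataset requirement).
    """
    out_path = f"{out_dir}/{item['clip_id']}.wav"
    return [
        "ffmpeg", "-i", src_path,
        "-ss", str(item["start_sec"]),
        "-to", str(item["end_sec"]),
        "-ac", "1", "-ar", "16000",
        out_path,
    ]


item = {"clip_id": "ml_en_0042", "youtube_id": "dQw4w9WgXcQ", "start_sec": 42.3, "end_sec": 55.1}
url = watch_url(item["youtube_id"])
args = ffmpeg_cut_args("full_audio.wav", item)
```

The resulting list can be passed to `subprocess.run(args, check=True)` once the source file and the output directory exist.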


Supported Languages (Phase 1)

| Code | Language | Script Detection |
| ml | Malayalam | Unicode 0D00–0D7F |
| hi | Hindi | Unicode 0900–097F |
| ta | Tamil | Unicode 0B80–0BFF |
| te | Telugu | Unicode 0C00–0C7F |
| kn | Kannada | Unicode 0C80–0CFF |
| bn | Bengali | Unicode 0980–09FF |
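Script-range detection with the ranges in the table above can be sketched as follows. The function name and the `default` fallback are hypothetical. Note that a pure script check cannot recognise romanized words: a Malayalam word written in Latin letters falls through to the default, and how the actor tags such words is not described here.

```python
# Unicode block ranges copied from the table above.
SCRIPT_RANGES = {
    "ml": (0x0D00, 0x0D7F),
    "hi": (0x0900, 0x097F),
    "ta": (0x0B80, 0x0BFF),
    "te": (0x0C00, 0x0C7F),
    "kn": (0x0C80, 0x0CFF),
    "bn": (0x0980, 0x09FF),
}


def detect_word_lang(word, default="en"):
    """Return the code of the first supported script found in `word`, else `default`."""
    for ch in word:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                return code
    return default


tags = [detect_word_lang(w) for w in ["yesterday", "പോയി", "but"]]
```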

The Switch-Point WER Metric

This scraper feeds the Switch-Point WER benchmark — a novel evaluation metric that measures ASR accuracy specifically in a ±2 word window around each language switch.

Standard WER is averaged over the whole utterance, so it hides errors that cluster at the moment of switching. Switch-Point WER isolates them.
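The benchmark's exact definition is not given here, so the following is only one possible reading of the metric, under stated assumptions: a switch at reference index `s` means word `s` is the first word in the new language, the window covers the two reference words before and the two after that boundary, and errors are taken from a standard Levenshtein word alignment. Insertions are attributed to the reference word they precede. All function names are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists; return (op, ref_index) pairs.

    op is "ok", "sub", "del" or "ins". For "ins", ref_index is the reference
    position the inserted word precedes (len(ref) if inserted at the end).
    """
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", i))
            j -= 1
    return ops[::-1]


def switch_point_wer(ref, hyp, switch_indices, window=2):
    """Errors attributed to reference words near a switch, divided by the number of such words."""
    in_window = set()
    for s in switch_indices:
        in_window.update(range(max(0, s - window), min(len(ref), s + window)))
    if not in_window:
        return 0.0
    errors = sum(1 for op, idx in align(ref, hyp) if op != "ok" and idx in in_window)
    return errors / len(in_window)


ref = ["Njan", "yesterday", "office-il", "പോയി", "but", "the", "meeting", "was", "boring"]
hyp = ["Njan", "yesterday", "office", "but", "the", "meeting", "was", "boring"]
sp_wer = switch_point_wer(ref, hyp, switch_indices=[1, 2, 4])
```

In this example both errors fall inside the switch windows, so the windowed score (2 errors over 6 words) is higher than the whole-utterance WER (2 errors over 9 words), which is the effect the metric is meant to expose.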

Manglish ASR Benchmark on HuggingFace (link after publish)


Project Context

This is part of a larger research effort:

  • Phase 1: Malayalam+English (Manglish) — this actor
  • Phase 2: Tamil+English, Hindi+English
  • End goal: Published HuggingFace dataset + leaderboard + fine-tuned Whisper checkpoint

Built by Jona Joy