# YouTube Speech Dataset Builder (`eternallabs/multilingual-codeswitching-scraper`) Actor

Generate multilingual speech datasets from YouTube using WhisperX, transcription, language detection, and code-switch analysis for ASR training, benchmarking, and speech AI research.

- **URL**: https://apify.com/eternallabs/multilingual-codeswitching-scraper.md
- **Developed by:** [Jona](https://apify.com/eternallabs) (community)
- **Categories:** AI, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly users, 0.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is billed per platform usage. The Actor itself is free to use; you pay only for the Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Multilingual Code-Switching Audio Scraper

Scrapes YouTube for **code-switched speech** (e.g. Malayalam+English / Manglish), transcribes with WhisperX, detects language switch points, and outputs annotated clips ready for ASR benchmark dataset creation.

Built as part of the **Manglish ASR Benchmark** project — a rigorous evaluation dataset and leaderboard for code-switched Indian English speech recognition.

---

### What It Does

1. **Searches YouTube** for content matching your queries (or processes URLs you provide)
2. **Downloads audio** with yt-dlp
3. **Transcribes** with WhisperX — word-level timestamps
4. **Detects language spans** — which words are Malayalam vs English, using Unicode script ranges
5. **Finds switch points** — timestamps where the speaker switches language mid-sentence
6. **Filters clips** by code-switching quality (configurable ratio threshold)
7. **Outputs** annotated JSON — YouTube IDs + timestamps for research publishing, or full audio clips for local training
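
Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: it assumes every transcribed word already carries a language label, the `Word` type and `merge_spans` function are hypothetical names, and the word-level timestamps are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with timestamps (hypothetical structure)."""
    text: str
    start: float
    end: float
    lang: str


def merge_spans(words):
    """Group consecutive words that share a language into spans."""
    spans = []
    for w in words:
        if spans and spans[-1]["lang"] == w.lang:
            # Same language as the previous word: extend the current span.
            spans[-1]["end"] = w.end
            spans[-1]["text"] += " " + w.text
        else:
            # Language changed (or first word): start a new span.
            spans.append({"start": w.start, "end": w.end, "lang": w.lang, "text": w.text})
    return spans


# Illustrative word timings, not real WhisperX output.
words = [
    Word("Njan", 0.0, 0.4, "ml"),
    Word("yesterday", 0.4, 1.1, "en"),
    Word("office-il", 1.1, 2.0, "ml"),
    Word("but", 2.0, 2.3, "en"),
    Word("the", 2.3, 2.5, "en"),
    Word("meeting", 2.5, 3.2, "en"),
    Word("was", 3.2, 3.5, "en"),
    Word("boring", 3.5, 4.5, "en"),
]
spans = merge_spans(words)
```

Each boundary between two adjacent spans is then a candidate switch point.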

---

### Output Format

Each result item:

```json
{
  "clip_id": "ml_en_0042",
  "youtube_id": "dQw4w9WgXcQ",
  "video_title": "My Day in Kerala - Manglish Vlog",
  "start_sec": 42.3,
  "end_sec": 55.1,
  "duration_sec": 12.8,
  "transcript": "Njan yesterday office-il പോയി, but the meeting was boring",
  "language_spans": [
    {"start": 0.0, "end": 0.4, "lang": "ml", "text": "Njan"},
    {"start": 0.4, "end": 1.1, "lang": "en", "text": "yesterday"},
    {"start": 1.1, "end": 2.0, "lang": "ml", "text": "office-il"},
    {"start": 2.0, "end": 4.5, "lang": "en", "text": "but the meeting was boring"}
  ],
  "switch_points": [0.4, 1.1, 2.0],
  "switch_count": 3,
  "primary_lang_ratio": 0.42,
  "en_ratio": 0.58,
  "confidence": 0.87
}
````
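
The example suggests that `switch_points` holds the start time of every span whose language differs from the span before it, and that `switch_count` is the length of that list. Under that assumption, a consumer of the dataset could re-derive both fields as a sanity check. The function name `switch_points_from_spans` is hypothetical.

```python
def switch_points_from_spans(language_spans):
    """Return the start time of each span whose language differs from the previous span."""
    points = []
    for prev, cur in zip(language_spans, language_spans[1:]):
        if cur["lang"] != prev["lang"]:
            points.append(cur["start"])
    return points


# Spans copied from the example item above.
language_spans = [
    {"start": 0.0, "end": 0.4, "lang": "ml", "text": "Njan"},
    {"start": 0.4, "end": 1.1, "lang": "en", "text": "yesterday"},
    {"start": 1.1, "end": 2.0, "lang": "ml", "text": "office-il"},
    {"start": 2.0, "end": 4.5, "lang": "en", "text": "but the meeting was boring"},
]
points = switch_points_from_spans(language_spans)
print(points, len(points))  # [0.4, 1.1, 2.0] 3
```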

***

### Running Locally (M4 Mac / Any Machine)

#### Prerequisites

```bash
# Install ffmpeg (required for audio processing)
brew install ffmpeg   # Mac
# sudo apt install ffmpeg  # Linux

# Install Python dependencies
pip install -r requirements.txt
```

#### Run

```bash
# Edit input settings
nano storage/key_value_stores/default/INPUT.json

# Run
python -m src
```

Results → `output/results.json`

Audio clips (if `publishIdsOnly` is `false`) → `output/clips/`

***

### Running on Apify

Deploy via Apify CLI:

```bash
npm install -g apify-cli
apify login
apify push
```

Or drag the folder into Apify Console → Create Actor → Upload source.

***

### Research Mode vs Full Audio Mode

| Setting | `publishIdsOnly: true` | `publishIdsOnly: false` |
|---------|----------------------|------------------------|
| What's saved | YouTube ID + timestamps | Actual .wav clip files |
| For | HuggingFace dataset publishing | Local model training |
| Audio downloaded | Deleted after processing | Saved to output/clips/ |
| Legal | Follows academic dataset norms | For personal/research use only |

**For HuggingFace publishing:** Use `publishIdsOnly: true` and publish your `results.json` as the dataset; users reconstruct the audio from the IDs themselves. This is the standard approach for datasets such as AudioSet and VGGSound.
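
As a rough sketch of the reconstruction step, a dataset user who already has the full audio of a video on disk could cut a clip with ffmpeg using the `start_sec` and `duration_sec` fields. The helper name `ffmpeg_cut_command`, the `clips` output directory, and the 16 kHz mono output format are assumptions made for illustration; obtaining the source audio is left to the user and is not shown.

```python
def ffmpeg_cut_command(source_path, item, out_dir="clips"):
    """Build (but do not run) an ffmpeg command that cuts one clip from a local audio file."""
    out_path = f"{out_dir}/{item['clip_id']}.wav"
    return [
        "ffmpeg",
        "-i", source_path,
        "-ss", str(item["start_sec"]),    # seek to the clip start
        "-t", str(item["duration_sec"]),  # keep this many seconds
        "-ar", "16000",                   # assumed sample rate: 16 kHz
        "-ac", "1",                       # assumed channel layout: mono
        out_path,
    ]


item = {"clip_id": "ml_en_0042", "start_sec": 42.3, "duration_sec": 12.8}
command = ffmpeg_cut_command("dQw4w9WgXcQ.m4a", item)
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the output directory exists.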

***

### Supported Languages (Phase 1)

| Code | Language | Script Detection |
|------|----------|-----------------|
| `ml` | Malayalam | Unicode 0D00–0D7F |
| `hi` | Hindi | Unicode 0900–097F |
| `ta` | Tamil | Unicode 0B80–0BFF |
| `te` | Telugu | Unicode 0C00–0C7F |
| `kn` | Kannada | Unicode 0C80–0CFF |
| `bn` | Bengali | Unicode 0980–09FF |
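
A minimal sketch of script-based detection using the ranges in the table above. The names `SCRIPT_RANGES` and `detect_word_lang` are hypothetical, and falling back to `en` when no character matches is an assumption. Note that romanized words (such as "Njan" in the output example) contain no characters from these blocks, so this sketch alone would label them `en`; how the Actor handles them is not documented here.

```python
# Unicode block boundaries taken from the table above.
SCRIPT_RANGES = {
    "ml": (0x0D00, 0x0D7F),
    "hi": (0x0900, 0x097F),
    "ta": (0x0B80, 0x0BFF),
    "te": (0x0C00, 0x0C7F),
    "kn": (0x0C80, 0x0CFF),
    "bn": (0x0980, 0x09FF),
}


def detect_word_lang(word, default="en"):
    """Return the code of the first script range containing any character of `word`."""
    for ch in word:
        code_point = ord(ch)
        for lang, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                return lang
    # No Indic-script character found: assume English (an assumption of this sketch).
    return default


print(detect_word_lang("പോയി"))     # ml
print(detect_word_lang("meeting"))  # en
```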

***

### The Switch-Point WER Metric

This scraper feeds the **Switch-Point WER** benchmark — a novel evaluation metric that measures ASR accuracy specifically in a ±2 word window around each language switch.

Standard WER averages over the whole utterance, so it hides the fact that models fail *specifically at the moment of switching*. Switch-Point WER isolates these errors.
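
The exact definition of the metric is not given in this README, so the following only sketches the window selection, under the assumption that "±2 words" means the two reference words on each side of a switch boundary. The function name `switch_window_positions` is hypothetical. Aligning the hypothesis to the reference and counting errors inside the window is not shown.

```python
def switch_window_positions(langs, window=2):
    """Return sorted indices of words lying within `window` words of a language switch."""
    positions = set()
    for i in range(1, len(langs)):
        if langs[i] != langs[i - 1]:
            # The switch sits between word i-1 and word i.
            low = max(0, i - window)
            high = min(len(langs), i + window)
            positions.update(range(low, high))
    return sorted(positions)


# Per-word language labels of a reference transcript (illustrative).
langs = ["ml", "ml", "ml", "en", "en", "en", "en"]
print(switch_window_positions(langs))  # [1, 2, 3, 4]
```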

→ [Manglish ASR Benchmark on HuggingFace](#) *(link after publish)*

***

### Project Context

This is part of a larger research effort:

- **Phase 1:** Malayalam+English (Manglish), covered by this Actor
- **Phase 2:** Tamil+English, Hindi+English
- **End goal:** Published HuggingFace dataset + leaderboard + fine-tuned Whisper checkpoint

***

*Built by Jona Joy*

# Actor input Schema

## `searchQueries` (type: `array`):

List of search terms to find relevant YouTube videos. E.g. 'Malayalam vlog', 'Manglish interview', 'Kerala tech talk'

## `youtubeUrls` (type: `array`):

Directly provide YouTube video URLs to process instead of or in addition to search queries.

## `primaryLanguage` (type: `string`):

The main non-English language you expect in the audio.

## `whisperModel` (type: `string`):

Larger = more accurate but slower. 'small' recommended for most use cases.

## `maxVideos` (type: `integer`):

Maximum number of YouTube videos to download and process per run.

## `minClipDuration` (type: `integer`):

Ignore segments shorter than this many seconds.

## `maxClipDuration` (type: `integer`):

Split segments longer than this many seconds.
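
To illustrate how the two duration settings could interact, here is a minimal sketch that cuts a segment into fixed-size chunks and drops any chunk that ends up too short. The function name `split_segment` is hypothetical, and the Actor may well split differently (for example on word boundaries).

```python
def split_segment(start, end, min_dur=4, max_dur=15):
    """Split [start, end] into chunks of at most max_dur seconds, dropping chunks under min_dur."""
    if max_dur <= 0:
        raise ValueError("max_dur must be positive")
    clips = []
    cursor = start
    while cursor < end:
        clip_end = min(cursor + max_dur, end)
        if clip_end - cursor >= min_dur:
            clips.append((cursor, clip_end))
        cursor = clip_end
    return clips


print(split_segment(0, 40))  # [(0, 15), (15, 30), (30, 40)]
print(split_segment(0, 32))  # [(0, 15), (15, 30)]
```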

## `minCodeSwitchRatio` (type: `number`):

Reject clips where one language dominates too much. 0.15 means at least 15% must be the minority language.
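
A minimal sketch of such a filter, assuming it compares the smaller of the two ratio fields from the output item against the threshold. The function name `passes_code_switch_filter` is hypothetical.

```python
def passes_code_switch_filter(item, min_ratio=0.15):
    """Keep a clip only if the minority language covers at least `min_ratio` of it."""
    minority_share = min(item["primary_lang_ratio"], item["en_ratio"])
    return minority_share >= min_ratio


balanced = {"primary_lang_ratio": 0.42, "en_ratio": 0.58}
lopsided = {"primary_lang_ratio": 0.9, "en_ratio": 0.1}
print(passes_code_switch_filter(balanced))  # True
print(passes_code_switch_filter(lopsided))  # False
```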

## `publishIdsOnly` (type: `boolean`):

If true, output only YouTube video IDs + timestamps (for HuggingFace dataset publishing). If false, save actual audio clips locally.

## Actor input object example

```json
{
  "searchQueries": [
    "Malayalam English vlog",
    "Manglish interview",
    "Kerala tech talk English"
  ],
  "youtubeUrls": [],
  "primaryLanguage": "ml",
  "whisperModel": "small",
  "maxVideos": 10,
  "minClipDuration": 4,
  "maxClipDuration": 15,
  "minCodeSwitchRatio": 0.15,
  "publishIdsOnly": true
}
```
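
Before starting a run, it can be handy to check an input object locally. The sketch below mirrors the required fields, enums, and numeric bounds listed in the OpenAPI specification further down this page. The function name `validate_input` and the error messages are hypothetical, and this is not a substitute for the platform's own validation.

```python
ALLOWED_LANGS = {"ml", "hi", "ta", "te", "kn", "bn"}
ALLOWED_MODELS = {"tiny", "base", "small", "medium"}


def validate_input(inp):
    """Return a list of problems found in an input dict (an empty list means none found)."""
    errors = []
    if "searchQueries" not in inp:
        errors.append("searchQueries is required")
    if inp.get("primaryLanguage") not in ALLOWED_LANGS:
        errors.append("primaryLanguage must be one of ml, hi, ta, te, kn, bn")
    if inp.get("whisperModel", "small") not in ALLOWED_MODELS:
        errors.append("whisperModel must be one of tiny, base, small, medium")
    if not 1 <= inp.get("maxVideos", 10) <= 50:
        errors.append("maxVideos must be between 1 and 50")
    if inp.get("minClipDuration", 4) < 2:
        errors.append("minClipDuration must be at least 2")
    if inp.get("maxClipDuration", 15) > 30:
        errors.append("maxClipDuration must be at most 30")
    if not 0.05 <= inp.get("minCodeSwitchRatio", 0.15) <= 0.5:
        errors.append("minCodeSwitchRatio must be between 0.05 and 0.5")
    return errors


example_input = {
    "searchQueries": ["Malayalam English vlog", "Manglish interview"],
    "youtubeUrls": [],
    "primaryLanguage": "ml",
    "whisperModel": "small",
    "maxVideos": 10,
    "minClipDuration": 4,
    "maxClipDuration": 15,
    "minCodeSwitchRatio": 0.15,
    "publishIdsOnly": True,
}
print(validate_input(example_input))  # []
```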

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

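// Prepare Actor input (the input schema requires `searchQueries` and `primaryLanguage`)
const input = {
    searchQueries: ['Manglish interview'],
    primaryLanguage: 'ml',
};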

// Run the Actor and wait for it to finish
const run = await client.actor("eternallabs/multilingual-codeswitching-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

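# Prepare the Actor input (the input schema requires `searchQueries` and `primaryLanguage`)
run_input = {
    "searchQueries": ["Manglish interview"],
    "primaryLanguage": "ml",
}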

# Run the Actor and wait for it to finish
run = client.actor("eternallabs/multilingual-codeswitching-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"searchQueries": ["Manglish interview"], "primaryLanguage": "ml"}' |
apify call eternallabs/multilingual-codeswitching-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=eternallabs/multilingual-codeswitching-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "YouTube Speech Dataset Builder",
        "description": "Generate multilingual speech datasets from YouTube using WhisperX, transcription, language detection, and code-switch analysis for ASR training, benchmarking, and speech AI research.",
        "version": "1.0",
        "x-build-id": "WtkDdqtEeqby9s0vX"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/eternallabs~multilingual-codeswitching-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-eternallabs-multilingual-codeswitching-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/eternallabs~multilingual-codeswitching-scraper/runs": {
            "post": {
                "operationId": "runs-sync-eternallabs-multilingual-codeswitching-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/eternallabs~multilingual-codeswitching-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-eternallabs-multilingual-codeswitching-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "searchQueries",
                    "primaryLanguage"
                ],
                "properties": {
                    "searchQueries": {
                        "title": "YouTube Search Queries",
                        "type": "array",
                        "description": "List of search terms to find relevant YouTube videos. E.g. 'Malayalam vlog', 'Manglish interview', 'Kerala tech talk'",
                        "default": [
                            "Malayalam English vlog",
                            "Manglish interview",
                            "Kerala tech talk English"
                        ],
                        "items": {
                            "type": "string"
                        }
                    },
                    "youtubeUrls": {
                        "title": "Specific YouTube URLs (optional)",
                        "type": "array",
                        "description": "Directly provide YouTube video URLs to process instead of or in addition to search queries.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "primaryLanguage": {
                        "title": "Primary Language",
                        "enum": [
                            "ml",
                            "hi",
                            "ta",
                            "te",
                            "kn",
                            "bn"
                        ],
                        "type": "string",
                        "description": "The main non-English language you expect in the audio.",
                        "default": "ml"
                    },
                    "whisperModel": {
                        "title": "Whisper Model Size",
                        "enum": [
                            "tiny",
                            "base",
                            "small",
                            "medium"
                        ],
                        "type": "string",
                        "description": "Larger = more accurate but slower. 'small' recommended for most use cases.",
                        "default": "small"
                    },
                    "maxVideos": {
                        "title": "Max Videos to Process",
                        "minimum": 1,
                        "maximum": 50,
                        "type": "integer",
                        "description": "Maximum number of YouTube videos to download and process per run.",
                        "default": 10
                    },
                    "minClipDuration": {
                        "title": "Minimum Clip Duration (seconds)",
                        "minimum": 2,
                        "type": "integer",
                        "description": "Ignore segments shorter than this.",
                        "default": 4
                    },
                    "maxClipDuration": {
                        "title": "Maximum Clip Duration (seconds)",
                        "maximum": 30,
                        "type": "integer",
                        "description": "Split segments longer than this.",
                        "default": 15
                    },
                    "minCodeSwitchRatio": {
                        "title": "Minimum Code-Switch Ratio",
                        "minimum": 0.05,
                        "maximum": 0.5,
                        "type": "number",
                        "description": "Reject clips where one language dominates too much. 0.15 means at least 15% must be the minority language.",
                        "default": 0.15
                    },
                    "publishIdsOnly": {
                        "title": "Research Mode (IDs Only)",
                        "type": "boolean",
                        "description": "If true, output only YouTube video IDs + timestamps (for HuggingFace dataset publishing). If false, save actual audio clips locally.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
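
For projects that call the REST API directly, the `run-sync-get-dataset-items` endpoint from the specification above can be addressed with the Python standard library. The sketch below only builds the request and does not send it. The helper name `build_run_request` is hypothetical, and the token is passed as a query parameter because that is how the specification declares it.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.apify.com/v2"
ENDPOINT = "/acts/eternallabs~multilingual-codeswitching-scraper/run-sync-get-dataset-items"


def build_run_request(token, actor_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{BASE_URL}{ENDPOINT}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "YOUR_API_TOKEN",  # placeholder, replace with a real token
    {"searchQueries": ["Manglish interview"], "primaryLanguage": "ml"},
)
```

Sending it is then a matter of `urllib.request.urlopen(request)` and decoding the JSON response. Because the call blocks until the run finishes, long transcription runs may be better served by the asynchronous `/runs` endpoint.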
