# Youtube Transcript Scraper (`scrapesmith/youtube-transcript-scraper`) Actor

Extract timestamped transcripts and captions from any YouTube video in 20+ languages. Bulk scrape thousands of videos with full metadata — title, views, duration, channel, and publish date included.

- **URL**: https://apify.com/scrapesmith/youtube-transcript-scraper.md
- **Developed by:** [Scrape Smith](https://apify.com/scrapesmith) (community)
- **Categories:** Social media, AI, Automation
- **Stats:** 5 total users, 2 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $5.00 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## YouTube Transcript Scraper

Extract full transcripts, captions, and subtitles from any YouTube video — with metadata, multi-language support, and bulk processing built in.

---

### What This Scraper Does

YouTube Transcript Scraper lets you extract the complete spoken text from any YouTube video as structured, timestamped data. Whether you need transcripts for AI training datasets, content analysis, SEO research, subtitle generation, accessibility tools, or media monitoring — this scraper delivers fast, reliable results at scale.

Each result includes the full timestamped transcript broken into segments, a joined plain-text version of the entire transcript, and rich video metadata including title, channel name, view count, duration, publish date, category, description, and keywords — all in a single run with no extra API calls.

---

### Key Features

- **Bulk processing** — feed hundreds or thousands of video URLs or IDs in one run
- **Multi-language support** — request transcripts in 20+ languages with automatic fallback
- **Smart language fallback** — if your requested language isn't available, falls back to English, then to whatever is available, so you always get something
- **Rich metadata** — title, author, channel ID, views, duration, publish date, category, description, and keywords extracted alongside every transcript
- **State migration** — if Apify migrates the run to a new server, the scraper resumes exactly where it left off with zero duplicate data
- **Full audit trail** — every video ID is recorded with a status field so you know exactly what happened to each one

---

### Use Cases

#### AI & Machine Learning
- Build speech-to-text training datasets from YouTube transcripts
- Create question-answering datasets from educational video content
- Extract text corpora for NLP model training
- Build knowledge bases from tutorial and lecture videos
- Fine-tune LLMs on domain-specific YouTube content

#### Content & SEO
- Analyze competitor video scripts and talking points
- Extract keywords and topics from top-ranking YouTube videos
- Repurpose video content into blog posts, articles, and newsletters
- Identify content gaps by analyzing what topics creators cover
- Index video content for internal search engines

#### Research & Journalism
- Monitor what politicians, executives, and public figures say in videos
- Track brand mentions and sentiment across YouTube content
- Analyze trends in educational content across channels
- Archive spoken content from videos for future reference

#### Accessibility & Localization
- Generate subtitle files for videos that lack them
- Build translation pipelines by extracting source transcripts
- Create searchable archives of video content
- Make video content accessible for hearing-impaired audiences

#### Business Intelligence
- Monitor earnings calls, conference talks, and product announcements
- Track competitor messaging and product positioning
- Extract insights from industry conference presentations
- Analyze customer testimonial and review videos

---

### Input

| Field | Type | Required | Description |
|---|---|---|---|
| videoIds | array | ✅ | YouTube video URLs or IDs |
| lang | string | ❌ | Preferred caption language code. Default: `en` |
| minDelay | integer | ❌ | Minimum delay between requests in ms. Default: `300` |
| maxDelay | integer | ❌ | Maximum delay between requests in ms. Default: `800` |

#### Accepted URL Formats

- `https://www.youtube.com/watch?v=jNQXAC9IVRw`
- `https://youtu.be/jNQXAC9IVRw`
- `https://www.youtube.com/shorts/jNQXAC9IVRw`
- `jNQXAC9IVRw`
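
If you want to normalize inputs before sending them to the Actor (for example, to deduplicate a list), the four accepted shapes above can be reduced to a bare video ID on the client side. The following is a minimal sketch, not the Actor's own parser; the function name `extract_video_id` and the set of accepted hostnames are assumptions made for illustration.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Plain video IDs are 11 characters (letters, digits, "-" and "_").
_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")

# Hostnames treated as YouTube in this sketch (an assumption, not an exhaustive list).
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(value: str) -> Optional[str]:
    """Return the video ID for a watch URL, youtu.be link, shorts URL, or bare ID."""
    value = value.strip()
    if _ID_RE.match(value):
        return value  # already a bare ID

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=<id>
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith("/shorts/"):
            # https://www.youtube.com/shorts/<id>
            candidate = parsed.path.split("/")[2]

    # Reject anything that does not look like an 11-character ID.
    return candidate if candidate and _ID_RE.match(candidate) else None
```

Inputs that match none of the four shapes return `None`, so they can be filtered out before the run starts.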

#### Supported Languages

| Code | Language |
|---|---|
| en | English |
| en-US | English (United States) |
| en-GB | English (United Kingdom) |
| zh | Chinese (Simplified) |
| zh-Hant | Chinese (Traditional) |
| pt | Portuguese |
| pt-BR | Portuguese (Brazil) |
| es | Spanish |
| es-ES | Spanish (Spain) |
| de | German |
| fr | French |
| it | Italian |
| ru | Russian |
| tr | Turkish |
| ja | Japanese |
| ko | Korean |
| hi | Hindi |
| id | Indonesian |
| fil | Filipino |
| vi | Vietnamese |

---

### Output

Each item in the dataset represents one video:
```json
{
  "video_id": "jNQXAC9IVRw",
  "video_url": "https://www.youtube.com/watch?v=jNQXAC9IVRw",
  "title": "Me at the zoo",
  "author": "jawed",
  "channel_id": "UC4QobU6STFB0P71PMvOGN5A",
  "channel_url": "https://www.youtube.com/channel/UC4QobU6STFB0P71PMvOGN5A",
  "view_count": 386567321,
  "duration_seconds": 19,
  "publish_date": "2005-04-23T20:31:52-07:00",
  "category": "Film & Animation",
  "description": "The first video on YouTube.",
  "keywords": ["me at the zoo", "jawed karim", "first youtube video"],
  "is_family_safe": true,
  "is_private": false,
  "transcript_text": "All right so here we are in front of the elephants...",
  "segment_count": 12,
  "segments": [
    { "start": 1.2, "dur": 2.16, "text": "All right so here we are" },
    { "start": 3.36, "dur": 1.8, "text": "in front of the elephants" }
  ],
  "lang": "en",
  "lang_requested": "en",
  "lang_fallback": false,
  "status": "ok"
}
````

#### Status Values

| Status | Meaning |
|---|---|
| `ok` | Transcript extracted successfully in requested language |
| `low_quality` | Transcript found but very few segments (under 5) |
| `lang_not_found` | Requested language unavailable, fell back to another |
| `no_captions` | Video has no caption tracks available |
| `unplayable` | Video is private, deleted, age-restricted, or region-restricted |
| `error` | Unexpected failure; retried on the next run |
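
Downstream code will usually want to separate items that carry a transcript from those that do not. The sketch below groups dataset items by the `status` values in the table above. The helper name `split_by_status` and the choice to treat `ok`, `low_quality`, and `lang_not_found` as usable are assumptions; adjust the set to your own quality bar.

```python
from collections import Counter

# Statuses that, per the table above, still come with transcript data.
# Treating all three as usable is an assumption of this sketch.
USABLE_STATUSES = {"ok", "low_quality", "lang_not_found"}


def split_by_status(items):
    """Return (usable_items, skipped_items, per-status counts)."""
    usable = [it for it in items if it.get("status") in USABLE_STATUSES]
    skipped = [it for it in items if it.get("status") not in USABLE_STATUSES]
    counts = Counter(it.get("status") for it in items)
    return usable, skipped, counts


if __name__ == "__main__":
    sample = [
        {"video_id": "a", "status": "ok"},
        {"video_id": "b", "status": "no_captions"},
        {"video_id": "c", "status": "low_quality"},
    ]
    usable, skipped, counts = split_by_status(sample)
    print(len(usable), "usable,", len(skipped), "skipped:", dict(counts))
```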

---

### Language Fallback Logic

1. Try requested language exactly
2. If not found and requested language wasn't English, try English
3. If English also not found, use first available language track
4. Whenever a fallback occurs, `lang_fallback` is set to `true` and `lang` records the language actually returned, so you know exactly what you received
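
The steps above can be sketched as a small selection function. This is an illustration of the described order, not the Actor's source code: the name `pick_track` is hypothetical, and it assumes "English" in step 2 means the exact code `en`.

```python
from typing import List, Optional, Tuple


def pick_track(available: List[str], requested: str = "en") -> Tuple[Optional[str], bool]:
    """Return (lang, lang_fallback) for a video's available caption languages.

    `available` is the list of language codes in the order the tracks are listed.
    Returns (None, False) when the video has no caption tracks at all.
    """
    if not available:
        return None, False  # corresponds to status "no_captions"
    if requested in available:
        return requested, False  # step 1: exact match
    if requested != "en" and "en" in available:
        return "en", True  # step 2: fall back to English
    return available[0], True  # step 3: first available track


if __name__ == "__main__":
    print(pick_track(["de", "en"], "fr"))
    print(pick_track(["ja"], "en"))
```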

---

### Performance

- Typical throughput: **50 videos per minute**
- State saved every 60 seconds — migrations lose minimal progress
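
For planning purposes, the throughput figure translates into a rough run time. The sketch below also checks that figure against the default `minDelay` and `maxDelay` values and the two requests per video described in the FAQ. It assumes, hypothetically, that one delay is drawn uniformly between the two bounds before every request; the Actor's actual scheduling is not documented here.

```python
import math


def estimate_minutes(video_count: int, videos_per_minute: float = 50.0) -> int:
    """Whole minutes needed at the stated typical throughput, rounded up."""
    if video_count <= 0:
        return 0
    return math.ceil(video_count / videos_per_minute)


def delay_bound_per_minute(min_delay_ms: int = 300,
                           max_delay_ms: int = 800,
                           requests_per_video: int = 2) -> float:
    """Upper bound on videos per minute if only the inter-request delay cost time.

    Assumes one uniformly distributed delay per request (an assumption of this sketch).
    """
    mean_delay_s = (min_delay_ms + max_delay_ms) / 2 / 1000
    return 60 / (mean_delay_s * requests_per_video)


if __name__ == "__main__":
    print(estimate_minutes(1000), "minutes for 1,000 videos")
    print(round(delay_bound_per_minute(), 1), "videos/minute from delays alone")
```

Under these assumptions the default delays alone cap throughput at roughly 54 videos per minute, which is in line with the typical figure quoted above.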

---

### Frequently Asked Questions

**Can I scrape transcripts from any YouTube video?**
Only videos that have captions enabled. Most videos uploaded after 2020 have auto-generated captions in English. Older videos, music videos, and some regional content may have no captions at all — these are returned with `status: no_captions`.

**Does this work with auto-generated captions?**
Yes. YouTube auto-generated captions are returned the same way as manually uploaded ones. The transcript text may be less accurate for auto-generated captions but is still very usable for most NLP and analysis tasks.

**What happens if my requested language isn't available?**
The scraper falls back automatically — first to English, then to whatever language track is available. The `lang_fallback` field will be `true` and `lang` will tell you exactly which language was used. You never get an empty result just because one language wasn't available.

**Can I scrape transcripts in bulk?**
Yes. There is no hard limit on input size. Feed thousands of video IDs in one run. The scraper processes them sequentially with rate limiting to avoid detection and handles session rotation automatically for long runs.

**How do I get video IDs in bulk?**
Use the YouTube Search Scraper or YouTube Channel Scraper Actors to collect video IDs first, then feed them into this Actor. You can also paste YouTube URLs directly; the scraper extracts the ID automatically from any of the accepted URL formats.

**What is the `segments` field?**
An array of timestamped caption segments, each with `start` (seconds from beginning), `dur` (duration in seconds), and `text`. Use this for subtitle file generation, time-aligned text analysis, or synchronizing text with video.

**What is `transcript_text`?**
All segments joined into a single plain text string with spaces. Use this for NLP processing, keyword extraction, content indexing, or anywhere you need the full transcript as a readable block of text.

**Will this work on private or age-restricted videos?**
No. Private videos and age-restricted videos both return `status: unplayable`.

**Does the scraper resume after Apify migration?**
Yes. Completed video IDs are saved to the Apify key-value store every 60 seconds and on every migration event. When the run resumes on a new container, it skips all already-completed videos and continues from where it left off.

**Can I use this for AI training data?**
Yes. The structured output with `transcript_text` and `segments` is designed for downstream NLP use. Combine it with the metadata fields (title, category, keywords, description) to build richly labelled training datasets.

**How many requests does this make per video?**
Two requests per video — one to the YouTube player API to get the caption track URL and video metadata, and one to fetch the actual transcript XML. Both are lightweight and use your authenticated session.

**Is a proxy required?**
No. This scraper uses authenticated YouTube session cookies, which bypass IP-based bot detection, so no proxy is needed for normal usage. The optional `proxyConfiguration` input lets you enable a residential proxy if you see consecutive `unplayable` errors. If you are running extremely high volumes, you may want to add additional YouTube accounts to the session pool instead of proxies.

**What languages are supported?**
20 language codes confirmed from real YouTube caption data: en, en-US, en-GB, zh, zh-Hant, pt, pt-BR, es, es-ES, de, fr, it, ru, tr, ja, ko, hi, id, fil, vi. If the video has captions in a language not in this list, use the JSON input mode and type the language code manually — the scraper will attempt to find it.

**Why are some transcripts marked `low_quality`?**
These are videos with fewer than 5 caption segments: typically very short videos, music videos with minimal speech, or partially captioned content. The transcript data is still pushed and usable, just flagged so you can filter if needed.

**Can I extract subtitles for YouTube Shorts?**
Yes. YouTube Shorts URLs are supported as input and processed identically to regular videos.

# Actor input Schema

## `videoIds` (type: `array`):

YouTube video URLs or IDs. Accepts watch URLs, youtu.be links, shorts URLs, or plain 11-character video IDs.

## `lang` (type: `string`):

Preferred caption language. Falls back to English then first available if not found on the video.

## `proxyConfiguration` (type: `object`):

Optional proxy. Enable residential proxy if getting consecutive unplayable errors.

## Actor input object example

```json
{
  "videoIds": [
    "https://www.youtube.com/watch?v=jNQXAC9IVRw"
  ],
  "lang": "en",
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "videoIds": [
        "https://www.youtube.com/watch?v=jNQXAC9IVRw"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("scrapesmith/youtube-transcript-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "videoIds": ["https://www.youtube.com/watch?v=jNQXAC9IVRw"] }

# Run the Actor and wait for it to finish
run = client.actor("scrapesmith/youtube-transcript-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "videoIds": [
    "https://www.youtube.com/watch?v=jNQXAC9IVRw"
  ]
}' |
apify call scrapesmith/youtube-transcript-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=scrapesmith/youtube-transcript-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Youtube Transcript Scraper",
        "description": "Extract timestamped transcripts and captions from any YouTube video in 20+ languages. Bulk scrape thousands of videos with full metadata — title, views, duration, channel, and publish date included.",
        "version": "0.0",
        "x-build-id": "wrb0492xVLlG4TuPk"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/scrapesmith~youtube-transcript-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-scrapesmith-youtube-transcript-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/scrapesmith~youtube-transcript-scraper/runs": {
            "post": {
                "operationId": "runs-sync-scrapesmith-youtube-transcript-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/scrapesmith~youtube-transcript-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-scrapesmith-youtube-transcript-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "videoIds"
                ],
                "properties": {
                    "videoIds": {
                        "title": "Video URLs or IDs",
                        "type": "array",
                        "description": "YouTube video URLs or IDs. Accepts watch URLs, youtu.be links, shorts URLs, or plain 11-character video IDs.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "lang": {
                        "title": "Target Language",
                        "enum": [
                            "en",
                            "en-US",
                            "en-GB",
                            "zh",
                            "zh-Hant",
                            "pt",
                            "pt-BR",
                            "es",
                            "es-ES",
                            "de",
                            "fr",
                            "it",
                            "ru",
                            "tr",
                            "ja",
                            "ko",
                            "hi",
                            "id",
                            "fil",
                            "vi"
                        ],
                        "type": "string",
                        "description": "Preferred caption language. Falls back to English then first available if not found on the video.",
                        "default": "en"
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Optional proxy. Enable residential proxy if getting consecutive unplayable errors.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
