YouTube Transcript Scraper

Pricing

$20.00/month + usage

Get transcripts from YouTube videos and Shorts as plain text or structured timestamped segments. Results come with title, description, likes, channel details, and other metadata.

Developer

Embion

Maintained by Community

▶️ Scrape transcripts and metadata from YouTube videos and Shorts

This Actor gets transcripts supplied by the video creator or generated by YouTube. It works in two modes: full text or structured segments with exact timestamps. Built for automation pipelines: stable output, reliable retries, structured error codes, and proxy support.

Features:

  • Extracts transcripts as plain text or timestamped segments
  • Includes title, description, keywords, category, duration, publish date, channel name, subscriber count, and more
  • Supports HTML and plain text transcript formats
  • Writes consistent structure even when transcripts are missing
  • Retries on failures, tracks error reasons, produces structured error items
  • Supports residential proxies (the recommended setting for the highest reliability)

For the exact output format, see the "Successful item example" section.

📦 Output dataset

Each processed URL produces one dataset item with the following structure:

✅ Successful item example

Long values in the examples below are truncated with "...etc" markers to keep them readable. The real dataset output contains the full values without truncation.

Also note that some videos have no transcript in the requested language. In that case the Actor still writes the item to the dataset, with caption_text and captions_structured both set to null.

Example output when with_timestamps is true (enabled). The captions_structured field is filled and the caption_text field is null:

{
  "url": "https://youtu.be/dqwpQarrDwk", // URL provided as input in 'start_urls'
  "id": "dqwpQarrDwk", // video ID
  "title": "1,000km Cable to the Stars - The Skyhook", // video title
  "channel_name": "Kurzgesagt – In a Nutshell", // name of the channel which posted the video
  "channel_id": "UCsXVk37bltHxD1rDPwtNM8Q", // ID of the channel which posted the video
  "channel_url": "http://www.youtube.com/@kurzgesagt", // URL of the channel which posted the video
  "channel_subscribers_text": "24.8M subscribers", // subscriber count as shown on video page
  "channel_subscribers": 24800000, // subscriber count parsed as number
  "category": "Education", // category of the video
  "duration": 420, // total duration of the video in seconds, 0 if livestream
  "view_count": 12705758, // total number of views from microdata
  "like_count": 409346, // total number of likes from microdata
  "published_date": "2019-11-17T13:30:03.000Z", // when the video was published from microdata
  "published_date_text": "Nov 17, 2019", // when the video was published as it appears on the website
  "keywords": [ // keywords of the video
    "Skyhook",
    "Spacetether",
    "Tether",
    "Space",
    "Spavetravel",
    ...etc
  ],
  "caption_lang": "en", // language code of captions
  "caption_generated": false, // true if captions are auto-generated
  "caption_text": null, // null when "with_timestamps" input is true
  "captions_structured": [ // filled when "with_timestamps" is true and captions (transcript) exist
    {
      "start_ms": "1200", // milliseconds from start of the video when the text appears, guaranteed to be non-null if object exists
      "end_ms": "2920", // milliseconds from start of the video when the text hides, guaranteed to be non-null if object exists
      "snippet": "Getting to space is hard." // actual text in subtitles, may be null
    },
    {
      "start_ms": "3080",
      "end_ms": "6580",
      "snippet": "Right now, it’s like going up on a mountain on a unicycle-"
    },
    ...etc
  ],
  "available_captions": [ // list of all captions YouTube declares for the video
    {
      "language_code": "sq", // guaranteed to be non-null if object exists
      "name": "Albanian", // may be null depending on what YouTube returns
      "generated": false // guaranteed to be non-null if object exists
    },
    {
      "language_code": "ar",
      "name": "Arabic",
      "generated": false
    },
    ...etc
  ],
  "unlisted": false, // video requires direct URL and is hidden from search
  "live": false, // video is a livestream
  "error_code": null, // null on success
  "description": "Sources: https://sites.google.com/view/sources-skyhooks/\n\nGet your 12,020 SPACE Calendar here: https://shop.kurzgesagt.org/\nWORLDWIDE SHIPPING IS AVAILABLE!\n\nGetting to space is incredibly hard, expensive and needs a lot of resources. \nA more efficient way to get there is a Skyhook (or Spacetether), an ever rotating cable with a counter weight, that catapults spaceships from earth orbit into the depths of space. \n\n\nOUR CHANNELS\n▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀\nGerman Channel: https://kgs.link/youtubeDE \nSpanish Channel: https://kgs.link/youtubeES \n\n\nHOW CAN YOU SUPPORT US?\n▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀\nThis is how we make our living and it would be a pleasure if you support us!\n\nGet Merch designed with ❤ from https://kgs.link/shop" ...etc // raw description as shown below video
}
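
The timestamped segments in captions_structured can be turned into readable lines with a few lines of Python. The following is only a sketch, not part of the Actor: the helper names format_ms and render_segments are hypothetical, and it assumes start_ms and end_ms arrive as numeric strings, as they do in the example above.

```python
from typing import Iterable, Optional, TypedDict


class Segment(TypedDict):
    start_ms: str            # milliseconds as a numeric string, as in the example
    end_ms: str
    snippet: Optional[str]   # may be null according to the field notes


def format_ms(ms: int) -> str:
    """Render a millisecond offset as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def render_segments(segments: Iterable[Segment]) -> list[str]:
    """Turn captions_structured entries into '[start --> end] text' lines."""
    lines = []
    for seg in segments:
        text = seg.get("snippet")
        if not text:  # skip segments whose snippet is null or empty
            continue
        start = format_ms(int(seg["start_ms"]))
        end = format_ms(int(seg["end_ms"]))
        lines.append(f"[{start} --> {end}] {text}")
    return lines


sample = [
    {"start_ms": "1200", "end_ms": "2920", "snippet": "Getting to space is hard."},
    {"start_ms": "3080", "end_ms": "6580", "snippet": None},
]
print("\n".join(render_segments(sample)))
```

Segments with a null snippet are dropped here; whether to keep them as empty lines is a choice that depends on the downstream consumer.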

Example output when with_timestamps is false (disabled). The caption_text field is filled and the captions_structured field is null:

{
  "url": "https://youtu.be/dqwpQarrDwk",
  "id": "dqwpQarrDwk",
  "title": "1,000km Cable to the Stars - The Skyhook",
  "channel_name": "Kurzgesagt – In a Nutshell",
  "channel_id": "UCsXVk37bltHxD1rDPwtNM8Q",
  "channel_url": "http://www.youtube.com/@kurzgesagt",
  "channel_subscribers_text": "24.8M subscribers",
  "channel_subscribers": 24800000,
  "category": "Education",
  "duration": 420,
  "view_count": 12705758,
  "like_count": 409346,
  "published_date": "2019-11-17T13:30:03.000Z",
  "published_date_text": "Nov 17, 2019",
  "keywords": [
    "Skyhook",
    "Spacetether",
    "Tether",
    "Space",
    "Spavetravel",
    ...etc
  ],
  "caption_lang": "en",
  "caption_generated": false,
  "caption_text": "Getting to space is hard. Right now, it’s like going up on a mountain on a unicycle- with a backpack full of explosives. Incredibly slow, you can’t transport a lot of stuff, and you might die. A rocket needs to reach a velocity about 40,000km an hour to escape from Earth. To get to that speed, rockets are mostly containers for fuel with a tiny tip of payload. This is bad if you want to go to other planets, because you need a lot of heavy stuff if you want to survive, and maybe even come back. So, is there a way to get to space with less fuel and more payload? A nice thing that solved most of our transport problems on Earth is what you call infrastructure. Whether it’s roads for cars, ports for ships, or rails for trains, we’ve made it easier to get to places. We can apply the same solution to space travel. Space infrastructure will make getting into orbit and out to the Moon, Mars, and beyond easier and cheaper." ...etc,
  "captions_structured": null,
  "available_captions": [
    {
      "language_code": "sq",
      "name": "Albanian",
      "generated": false
    },
    {
      "language_code": "ar",
      "name": "Arabic",
      "generated": false
    },
    ...etc
  ],
  "unlisted": false,
  "live": false,
  "error_code": null,
  "description": "Sources: https://sites.google.com/view/sources-skyhooks/\n\nGet your 12,020 SPACE Calendar here: https://shop.kurzgesagt.org/\nWORLDWIDE SHIPPING IS AVAILABLE!\n\nGetting to space is incredibly hard, expensive and needs a lot of resources. \nA more efficient way to get there is a Skyhook (or Spacetether), an ever rotating cable with a counter weight, that catapults spaceships from earth orbit into the depths of space. \n\n\nOUR CHANNELS\n▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀\nGerman Channel: https://kgs.link/youtubeDE \nSpanish Channel: https://kgs.link/youtubeES \n\n\nHOW CAN YOU SUPPORT US?\n▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀\nThis is how we make our living and it would be a pleasure if you support us!\n\nGet Merch designed with ❤ from https://kgs.link/shop" ...etc
}
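
Because exactly one of caption_text and captions_structured is filled depending on with_timestamps, and both are null when no transcript exists, downstream code may want a single accessor that works in every case. A minimal sketch in Python, assuming items shaped like the two examples above; the function name transcript_text is hypothetical, and joining snippets with a single space is an assumption that suits the plain_text format better than html.

```python
from typing import Any, Optional


def transcript_text(item: dict[str, Any]) -> Optional[str]:
    """Return the transcript as one string, or None if the item has none."""
    # with_timestamps = false: the Actor already provides a single string.
    if item.get("caption_text") is not None:
        return item["caption_text"]
    # with_timestamps = true: join the segment snippets, skipping null ones.
    segments = item.get("captions_structured")
    if segments is None:
        return None  # no transcript for the requested language, or an error item
    return " ".join(s["snippet"] for s in segments if s.get("snippet"))


timestamped = {
    "caption_text": None,
    "captions_structured": [
        {"start_ms": "1200", "end_ms": "2920", "snippet": "Getting to space is hard."},
        {"start_ms": "3080", "end_ms": "6580", "snippet": None},
    ],
}
plain = {"caption_text": "Getting to space is hard.", "captions_structured": None}
missing = {"caption_text": None, "captions_structured": None}

for example in (timestamped, plain, missing):
    print(transcript_text(example))
```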

❌ Error item example

If the Actor fails to get video information for a given URL, it writes an error record to the dataset. This lets your downstream automation verify that the Actor actually attempted to process that URL.

The Actor can access only publicly available videos, including unlisted ones. Errors may occur for any of the following reasons:

  • Anti-bot protection or suspicious traffic block
  • The URL is not a valid YouTube video page (homepage, channel, search, Shorts feed, etc.)
  • Video was unpublished, deleted, set to private, or expired
  • Video requires login due to age restriction or membership-only access
  • Video is blocked in the region of your proxy
  • Redirect or network issue prevented resolving the video URL
  • Transcripts exist but the selected track could not be fetched or parsed
  • Critical metadata (e.g., video ID, title, or caption manifest) was missing
  • Global or regional YouTube outages

Here is what dataset records look like when the Actor fails to fetch information about a specific video:

{
  "url": "https://example.com/",
  "error_code": "not_youtube"
}
{
  "url": "https://youtube.com/",
  "error_code": "invalid_page_type"
}
{
  "url": "https://www.youtube.com/watch?v=aAkMkVFwAoX",
  "error_code": "video_unavailable"
}

Possible values of the error_code field in the dataset:

Code                        Meaning
not_youtube                 Input link is not recognised as a valid YouTube video URL.
resolve_error               The link could not be resolved to a playable video (redirect or network issue).
invalid_page_type           The page type is unsupported (for example, experimental formats).
transcript_fetch_error      Transcript metadata could not be retrieved.
transcript_selection_error  Transcript metadata exists but the selected track failed to load.
missing_critical_data       Essential metadata was missing, preventing a complete record.
video_info_fetch_error      Video metadata retrieval returned an unexpected response.
video_unavailable           The video is blocked, private, removed, or otherwise unavailable.
failed                      All retries were exhausted due to an unexpected error.
null                        No error encountered.
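
Since every input URL yields a dataset item, and error_code is null on success, a downstream job can separate successes from failures and summarise the failure reasons. The sketch below is one way to do that in Python; split_results is a hypothetical helper name, and the sample items are shortened versions of the examples above.

```python
from collections import Counter
from typing import Any, Iterable


def split_results(items: Iterable[dict[str, Any]]):
    """Partition dataset items by error_code and count the failure reasons."""
    ok: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for item in items:
        # error_code is null (None) on success and a string code on failure.
        (failed if item.get("error_code") else ok).append(item)
    return ok, failed, Counter(item["error_code"] for item in failed)


items = [
    {"url": "https://youtu.be/dqwpQarrDwk", "id": "dqwpQarrDwk", "error_code": None},
    {"url": "https://example.com/", "error_code": "not_youtube"},
    {"url": "https://youtube.com/", "error_code": "invalid_page_type"},
]
ok, failed, counts = split_results(items)
print(len(ok), len(failed), dict(counts))
```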

⚙️ Inputs

Field                     Type                      Description
start_urls                array of request objects  Each entry must include a url that points to a YouTube video or Shorts page. Optional HTTP method and headers are supported.
caption_language          string                    Language code prioritised for transcripts (for example en, es, de).
with_description          boolean                   Include the video description text in the output when true.
with_timestamps           boolean                   When true, emit timestamped transcript segments in captions_structured; otherwise emit a single transcript string in caption_text.
allow_generated_captions  boolean                   Fall back to auto-generated transcripts if creator-supplied transcripts are unavailable.
caption_format            string                    Accepts plain_text or html, applied to transcript fields.
concurrency               integer                   Maximum number of videos processed simultaneously. Tune to match your proxy capacity.
max_retries               integer                   Number of retry attempts per video before an error item is written.
proxy                     object                    Standard Apify proxy configuration payload. We recommend enabling residential proxies for reliable results.

Example input

{
  "allow_generated_captions": true,
  "caption_format": "plain_text",
  "caption_language": "en",
  "concurrency": 10,
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "start_urls": [
    {
      "url": "https://youtu.be/dqwpQarrDwk"
    }
  ],
  "with_description": true,
  "with_timestamps": true,
  "max_retries": 5
}
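
When inputs are generated by a script rather than typed into the Console, it can help to build the payload in one place and check caption_format before starting a run. The following Python sketch does that. The function build_run_input and its use_residential_proxy flag are hypothetical, and the default values simply mirror the example input above; they are not confirmed to be the Actor's own defaults.

```python
from typing import Any, Iterable

ALLOWED_CAPTION_FORMATS = {"plain_text", "html"}  # values listed in the inputs table


def build_run_input(
    urls: Iterable[str],
    caption_language: str = "en",
    with_timestamps: bool = True,
    with_description: bool = True,
    allow_generated_captions: bool = True,
    caption_format: str = "plain_text",
    concurrency: int = 10,
    max_retries: int = 5,
    use_residential_proxy: bool = True,  # hypothetical convenience flag
) -> dict[str, Any]:
    """Assemble an input payload using the field names from the inputs table."""
    if caption_format not in ALLOWED_CAPTION_FORMATS:
        raise ValueError(
            f"caption_format must be one of {sorted(ALLOWED_CAPTION_FORMATS)}"
        )
    proxy: dict[str, Any] = {"useApifyProxy": True}
    if use_residential_proxy:
        proxy["apifyProxyGroups"] = ["RESIDENTIAL"]
    return {
        "start_urls": [{"url": url} for url in urls],
        "caption_language": caption_language,
        "with_timestamps": with_timestamps,
        "with_description": with_description,
        "allow_generated_captions": allow_generated_captions,
        "caption_format": caption_format,
        "concurrency": concurrency,
        "max_retries": max_retries,
        "proxy": proxy,
    }


run_input = build_run_input(["https://youtu.be/dqwpQarrDwk"])
print(run_input["start_urls"], run_input["proxy"])
```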

💬 Language codes for transcripts

There is no complete list of possible language codes for the caption_language input field, because some codes may be internal and undocumented. The most reliable way to discover language codes is to scrape a few videos you are interested in and check the available_captions.language_code field in the resulting dataset.

However, the following list closely matches the set of language codes we have seen on YouTube:

Code     Language
ab       Abkhazian
aa       Afar
af       Afrikaans
sq       Albanian
ase      American Sign Language
am       Amharic
ar       Arabic
arc      Aramaic
hy       Armenian
as       Assamese
ay       Aymara
az       Azerbaijani
bn       Bangla
ba       Bashkir
eu       Basque
be       Belarusian
bh       Bihari
bi       Bislama
bs       Bosnian
br       Breton
bg       Bulgarian
yue      Cantonese
yue-HK   Cantonese (Hong Kong)
ca       Catalan
chr      Cherokee
zh       Chinese
zh-CN    Chinese (China)
zh-HK    Chinese (Hong Kong)
zh-Hans  Chinese (Simplified)
zh-SG    Chinese (Singapore)
zh-TW    Chinese (Taiwan)
zh-Hant  Chinese (Traditional)
cho      Choctaw
co       Corsican
hr       Croatian
cs       Czech
da       Danish
nl       Dutch
nl-BE    Dutch (Belgium)
nl-NL    Dutch (Netherlands)
dz       Dzongkha
en       English
en-CA    English (Canada)
en-IE    English (Ireland)
en-GB    English (United Kingdom)
en-US    English (United States)
eo       Esperanto
et       Estonian
fo       Faroese
fj       Fijian
fil      Filipino
fi       Finnish
fr       French
fr-BE    French (Belgium)
fr-CA    French (Canada)
fr-FR    French (France)
fr-CH    French (Switzerland)
gl       Galician
ka       Georgian
de       German
de-AT    German (Austria)
de-DE    German (Germany)
de-CH    German (Switzerland)
el       Greek
kl       Greenlandic (Kalaallisut)
gn       Guarani
gu       Gujarati
hak      Hakka Chinese
hak-TW   Hakka Chinese (Taiwan)
ha       Hausa
iw       Hebrew
hi       Hindi
hi-Latn  Hindi (Phonetic)
hu       Hungarian
is       Icelandic
ig       Igbo
id       Indonesian
ia       Interlingua
ie       Interlingue
iu       Inuktitut
ik       Inupiaq
ga       Irish
it       Italian
ja       Japanese
jv       Javanese
kn       Kannada
ks       Kashmiri
kk       Kazakh
km       Khmer
rw       Kinyarwanda
tlh      Klingon
ko       Korean
ku       Kurdish
ky       Kyrgyz
lo       Lao
la       Latin
lv       Latvian
ln       Lingala
lt       Lithuanian
lb       Luxembourgish
mk       Macedonian
mg       Malagasy
ms       Malay
ml       Malayalam
mt       Maltese
mi       Maori
mr       Marathi
mas      Masai
nan      Min Nan Chinese
nan-TW   Min Nan Chinese (Taiwan)
mo       Moldavian
mn       Mongolian
my       Myanmar (Burmese)
na       Nauru
nv       Navajo
ne       Nepali
no       Norwegian
oc       Occitan
or       Odia
om       Oromo
ps       Pashto
fa       Persian
fa-AF    Persian (Afghanistan)
fa-IR    Persian (Iran)
pl       Polish
pt       Portuguese
pt-BR    Portuguese (Brazil)
pt-PT    Portuguese (Portugal)
pa       Punjabi
qu       Quechua
ro       Romanian
rm       Romansh
rn       Rundi
ru       Russian
ru-Latn  Russian (Phonetic)
sm       Samoan
sg       Sango
sa       Sanskrit
gd       Scottish Gaelic
sr       Serbian
sr-Cyrl  Serbian (Cyrillic)
sr-Latn  Serbian (Latin)
sh       Serbo-Croatian
sdp      Sherdukpen
sn       Shona
sd       Sindhi
si       Sinhala
sk       Slovak
sl       Slovenian
so       Somali
st       Southern Sotho
es       Spanish
es-419   Spanish (Latin America)
es-MX    Spanish (Mexico)
es-ES    Spanish (Spain)
su       Sundanese
sw       Swahili
ss       Swati
sv       Swedish
tl       Tagalog
tg       Tajik
ta       Tamil
tt       Tatar
te       Telugu
th       Thai
bo       Tibetan
ti       Tigrinya
to       Tongan
ts       Tsonga
tn       Tswana
tr       Turkish
tk       Turkmen
tw       Twi
uk       Ukrainian
ur       Urdu
uz       Uzbek
vi       Vietnamese
vo       Volapük
cy       Welsh
fy       Western Frisian
wo       Wolof
xh       Xhosa
yi       Yiddish
yo       Yoruba
zu       Zulu

Source: https://gist.github.com/stpe/f0ef216bda12ffed8b939a455f0d4b65
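
The discovery approach described at the start of this section, scraping a few videos and reading available_captions.language_code, could be automated along these lines. This is a sketch in Python with a hypothetical helper name, collect_language_codes; it assumes items shaped like the output examples, where error items carry no available_captions field.

```python
from typing import Any, Iterable, Optional


def collect_language_codes(
    items: Iterable[dict[str, Any]],
) -> dict[str, Optional[str]]:
    """Map each language_code seen in available_captions to its display name."""
    seen: dict[str, Optional[str]] = {}
    for item in items:
        # A missing or null list (for example on error items) is treated as empty.
        for track in item.get("available_captions") or []:
            code = track["language_code"]
            # Keep the first non-null name seen for a code; name may be null.
            if seen.get(code) is None:
                seen[code] = track.get("name")
    return dict(sorted(seen.items()))


items = [
    {
        "available_captions": [
            {"language_code": "sq", "name": "Albanian", "generated": False},
            {"language_code": "ar", "name": None, "generated": False},
        ]
    },
    {
        "available_captions": [
            {"language_code": "ar", "name": "Arabic", "generated": False},
        ]
    },
    {"url": "https://example.com/", "error_code": "not_youtube"},
]
print(collect_language_codes(items))
```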

🚀 Running the Actor

  1. Register for or log in to Apify.
  2. Open the Actor in Apify Console and configure your preferred input (inline or JSON).
  3. Start the run and observe progress in the live log stream.
  4. Download the dataset once the run finishes.
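
The same steps can also be scripted instead of clicked through in the Console. The sketch below assumes the third-party apify-client Python package and uses its ApifyClient, actor(...).call(run_input=...) and dataset(...).iterate_items() calls. The ACTOR_ID placeholder, the APIFY_TOKEN environment variable name and the helper run_and_fetch are assumptions for illustration; copy the real Actor ID from the Actor's page.

```python
from typing import Any

ACTOR_ID = "<username>/<actor-name>"  # placeholder, not the real Actor ID


def run_and_fetch(
    client: Any, actor_id: str, run_input: dict[str, Any]
) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return all dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The Actor run did not return any run details")
    # Each run writes its results to a default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    import os

    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env var name
    run_input = {
        "start_urls": [{"url": "https://youtu.be/dqwpQarrDwk"}],
        "caption_language": "en",
        "with_timestamps": False,
    }
    for item in run_and_fetch(client, ACTOR_ID, run_input):
        print(item["url"], item.get("error_code"))
```

Calling main() with a valid token set performs a real run, which is billed like any other run. The client is passed into run_and_fetch as a parameter so the function can be exercised with a stub in tests.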