YouTube Transcript Corpus Audit & RAG Readiness

Pricing

from $6.00 / 1,000 transcript rag chunks

Extract public YouTube captions, audit transcript coverage, score RAG readiness, and create timestamped supporting chunks that are not charged separately in report mode.

Developer

太郎 山田

Maintained by Community

YouTube Corpus Audit & RAG Readiness Report

Turn public YouTube captions from videos, playlists, or channels into a decision-ready RAG corpus audit. The report focuses on caption coverage, missing-caption risk, chunking quality, retrieval QA actions, and the next run needed to move from raw transcript extraction to usable AI retrieval.

Use corpus_snapshot for a compact coverage checklist, or rag_readiness when you need retrieval QA actions and prioritized fixes. The legacy chunks mode is still available for timestamped transcript rows.

Store Quickstart

Recommended first run:

{
  "videoUrls": [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  ],
  "language": "en",
  "reportTier": "corpus_snapshot",
  "maxChargeUsd": 9,
  "maxVideos": 1,
  "delivery": "dataset",
  "dryRun": false
}
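As a rough illustration of sending the quickstart input programmatically, the sketch below builds (but does not send) a request to the Apify API endpoint that runs an Actor synchronously and returns its dataset items. The Actor ID `your-username~your-actor-name` and the token are placeholders, not values taken from this listing; substitute the real ones from the Apify Console.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

# Same input as the recommended first run above.
QUICKSTART_INPUT = {
    "videoUrls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"],
    "language": "en",
    "reportTier": "corpus_snapshot",
    "maxChargeUsd": 9,
    "maxVideos": 1,
    "delivery": "dataset",
    "dryRun": False,
}


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder Actor ID and token (assumptions, not real values).
    req = build_run_request("your-username~your-actor-name", "YOUR_APIFY_TOKEN", QUICKSTART_INPUT)
    print(req.get_method(), req.full_url)
    # To actually start the run (this may incur charges):
    # with urllib.request.urlopen(req) as resp:
    #     items = json.load(resp)
```

Setting "dryRun" to true in the input is a cheap way to test the wiring first, since dry runs are listed as no-charge.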

Input Examples

Corpus Snapshot Report

{
  "videoUrls": [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  ],
  "reportTier": "corpus_snapshot",
  "maxChargeUsd": 9,
  "maxVideos": 1,
  "delivery": "dataset",
  "dryRun": false
}

RAG Readiness Report

{
  "channelUrls": [
    "https://www.youtube.com/@OpenAI"
  ],
  "reportTier": "rag_readiness",
  "maxChargeUsd": 29,
  "maxVideos": 5,
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "delivery": "dataset",
  "dryRun": false
}
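To make the chunkSize and chunkOverlap settings concrete, here is a minimal sliding-window sketch. It assumes both values are measured in characters and that captions arrive as (start_seconds, text) pairs; the Actor's actual chunking method and units are not documented here, so treat this as an illustration of the general technique only.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Chunk:
    start_seconds: float  # start time of the caption segment where the chunk begins
    text: str


def chunk_captions(segments, chunk_size=1200, chunk_overlap=150):
    """Split (start_seconds, text) caption segments into overlapping character windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Record the character offset at which each segment starts in the joined text.
    offsets, parts, pos = [], [], 0
    for _, text in segments:
        offsets.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    full_text = " ".join(parts)

    step = chunk_size - chunk_overlap
    chunks = []
    for begin in range(0, len(full_text), step):
        # Find the segment that contains the first character of this window.
        seg_index = bisect_right(offsets, begin) - 1
        chunks.append(Chunk(segments[seg_index][0], full_text[begin:begin + chunk_size]))
        if begin + chunk_size >= len(full_text):
            break  # the window already reached the end of the transcript
    return chunks
```

With the defaults from the input above, consecutive chunks share 150 characters, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.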

Legacy Transcript Chunks

{
  "videoIds": [
    "dQw4w9WgXcQ"
  ],
  "reportTier": "chunks",
  "language": "en",
  "delivery": "webhook",
  "webhookUrl": "https://example.com/webhook",
  "dryRun": false
}

Sample Output

{
  "meta": {
    "actorName": "youtube-channel-transcript-rag-intelligence",
    "actorTitle": "YouTube Channel Transcript RAG Intelligence",
    "fetchedAt": "2026-05-09T00:00:00.000Z",
    "totalRows": 2
  },
  "rows": [
    {
      "rowType": "corpus_audit_report",
      "reportTier": "rag_readiness",
      "status": "success",
      "chargedEvent": "rag_readiness_report",
      "chargedUsd": 29,
      "decisionSummary": "RAG readiness report: 4/5 videos have usable public captions, coverage score 80, RAG readiness 76/100, missing-caption risk medium.",
      "coverageScore": 80,
      "ragReadinessScore": 76,
      "missingCaptionRisk": "medium",
      "chunkingRisks": [
        {
          "severity": "low",
          "code": "chunking_ok",
          "action": "Use current chunking as a baseline for retrieval QA."
        }
      ],
      "retrievalQaChecklist": [
        {
          "check": "caption_coverage",
          "status": "pass",
          "action": "Confirm target videos have public captions before indexing."
        },
        {
          "check": "grounding",
          "status": "required",
          "action": "Run golden Q&A prompts and require timestamped source citations."
        }
      ],
      "actionList": [
        "Replace captionless or unavailable videos and keep their warning rows as no-charge source evidence.",
        "Build 5-10 golden retrieval questions and verify each answer cites a timestamped source chunk.",
        "Tag missing-caption videos as ingestion blockers before scheduling recurring corpus updates."
      ],
      "previewReport": {
        "nextRunInput": {
          "channelUrls": ["https://www.youtube.com/@OpenAI"],
          "reportTier": "rag_readiness",
          "maxChargeUsd": 29,
          "dryRun": false
        }
      },
      "sourceUrls": ["https://www.youtube.com/@OpenAI"]
    }
  ],
  "warnings": []
}
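One way to consume this output is a simple go/no-go gate before indexing. The sketch below reads the report row, compares ragReadinessScore against a threshold, and lists checklist items that have not passed. The threshold of 70 and the helper name are assumptions for illustration, not values defined by the Actor; the sample is trimmed to the fields the function reads.

```python
# Trimmed copy of the sample output above.
SAMPLE = {
    "rows": [
        {
            "rowType": "corpus_audit_report",
            "status": "success",
            "ragReadinessScore": 76,
            "retrievalQaChecklist": [
                {"check": "caption_coverage", "status": "pass"},
                {"check": "grounding", "status": "required"},
            ],
        }
    ],
    "warnings": [],
}


def summarize_report(output: dict, min_score: int = 70) -> dict:
    """Return a small readiness summary from the first corpus_audit_report row."""
    report = next(
        (row for row in output.get("rows", []) if row.get("rowType") == "corpus_audit_report"),
        None,
    )
    if report is None:
        return {"ready": False, "score": None, "open_checks": []}

    score = report.get("ragReadinessScore")
    # Any checklist item whose status is not "pass" still needs attention.
    open_checks = [
        item["check"]
        for item in report.get("retrievalQaChecklist", [])
        if item.get("status") != "pass"
    ]
    ready = report.get("status") == "success" and score is not None and score >= min_score
    return {"ready": ready, "score": score, "open_checks": open_checks}


if __name__ == "__main__":
    print(summarize_report(SAMPLE))
```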

Output Fields

Report rows include:

  • decisionSummary
  • coverageScore
  • ragReadinessScore
  • missingCaptionRisk
  • chunkingRisks
  • retrievalQaChecklist
  • actionList and prioritizedActions
  • previewReport.nextRunInput
  • status, chargedEvent, chargedUsd, reason
  • sourceUrls, warnings, errors

In report mode, supporting transcript chunks are included as evidence rows at no charge. The legacy chunks mode returns timestamped rag_chunk rows with video metadata, chunk text, timestamps, and source URLs.
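When a run returns a mix of report rows and chunk rows, it can help to separate them by rowType and to turn chunk timestamps into clickable citations. The sketch below assumes chunk rows carry a rowType of rag_chunk; the arguments video_id and start_seconds are hypothetical stand-ins, since the exact chunk field names are not listed here. The `t=<seconds>s` parameter is the standard YouTube way to link to a position in a video.

```python
from collections import defaultdict


def group_rows_by_type(rows: list) -> dict:
    """Group output rows by their rowType value (rows without one go under 'unknown')."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.get("rowType", "unknown")].append(row)
    return dict(grouped)


def citation_url(video_id: str, start_seconds: float) -> str:
    """Build a YouTube watch URL that starts playback at the chunk's timestamp."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_seconds)}s"


if __name__ == "__main__":
    rows = [
        {"rowType": "corpus_audit_report", "ragReadinessScore": 76},
        {"rowType": "rag_chunk"},  # real chunk rows carry more fields
    ]
    print({kind: len(items) for kind, items in group_rows_by_type(rows).items()})
    print(citation_url("dQw4w9WgXcQ", 90.7))
```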

Pricing and No-Charge Rules

  • corpus_snapshot emits corpus_snapshot_report.
  • rag_readiness emits rag_readiness_report.
  • Report mode charges at most one paid event per run. Supporting chunks are no-charge evidence rows.
  • Rows produced by dryRun, demoMode, caption failures, source failures, or the maxChargeUsd limit are not charged.
  • The recurring watch summary is planned and proof-gated. It is not selectable in the public input schema and is not promoted until paid proof exists.
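The rules above can be checked against a finished run. This sketch assumes every paid row reports a positive chargedUsd, as in the sample output, and flags two conditions: more than one paid event in a report-mode run, or a total above the maxChargeUsd you supplied. The function name is illustrative and not part of the Actor.

```python
def audit_charges(rows: list[dict], max_charge_usd: float) -> list[str]:
    """Return human-readable problems found in the charge fields of a run's rows."""
    # Rows without chargedUsd (or with 0/None) count as no-charge rows.
    paid = [row for row in rows if (row.get("chargedUsd") or 0) > 0]
    total = sum(row["chargedUsd"] for row in paid)

    problems = []
    if len(paid) > 1:
        events = ", ".join(str(row.get("chargedEvent")) for row in paid)
        problems.append(f"expected at most one paid event, found {len(paid)}: {events}")
    if total > max_charge_usd:
        problems.append(f"total charge {total} exceeds maxChargeUsd {max_charge_usd}")
    return problems


if __name__ == "__main__":
    rows = [
        {"rowType": "corpus_audit_report", "chargedEvent": "rag_readiness_report", "chargedUsd": 29},
        {"rowType": "rag_chunk"},  # supporting evidence row, no charge
    ]
    print(audit_charges(rows, max_charge_usd=29))
```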

Compliance Guardrails

  • Uses public YouTube pages and public caption tracks only.
  • No account session, private video, member-only, paywalled, or login-only access is used.
  • No CAPTCHA or rate-limit bypass is attempted.
  • Do not position output as a replacement for rights-managed transcript licensing.
  • Do not claim ranking, sales, or revenue improvements from the report.
  • Do not use provider emblems or wording that implies upstream approval.

See Also