Pricing

from $0.001 / result

Japanese Website Content Crawler for RAG

Crawls Japanese documentation, help centers, blogs, and product sites and extracts them as Markdown, text, or HTML that is easy to use in RAG, vector databases, LLM apps, and internal search.

Rating: 0.0 (0)

Developer: nezha

Last modified: 2 months ago

Output preview

The Dataset stores each page's body content, structural information, and crawl metadata.

URL	Title	Format	Words	Language	Depth
`/resources/introduction-to-web-accessibility-guidebook`	ウェブアクセシビリティ導入ガイドブック	markdown	2430	ja	0
`/resources/standard-guidelines`	標準ガイドライン	markdown	1180	ja	1
`/policies`	政策	markdown	940	ja	1

Main fields:

url, title, description, canonicalUrl
content, markdown, text, html, cleanHtml
headings, wordCount, language, depth, httpStatusCode, crawledAt
OUTPUT_SUMMARY, FAILED_PAGES, SKIPPED_PAGES, CLEAN_HTML_INDEX
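
As a quick sanity check on a run, the per-page fields can be aggregated after exporting the Dataset. The sketch below is a minimal example that assumes each item is a dict carrying the `wordCount` and `depth` fields listed above; the `summarize` helper is hypothetical and not part of the Actor, and the sample rows reuse the values from the preview table.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize(items: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate page count, total words, and pages per crawl depth."""
    items = list(items)
    return {
        "pages": len(items),
        "totalWords": sum(item.get("wordCount", 0) for item in items),
        "pagesByDepth": dict(Counter(item.get("depth", 0) for item in items)),
    }


# Rows taken from the preview table; only the fields used here are included.
sample = [
    {"url": "/resources/introduction-to-web-accessibility-guidebook", "wordCount": 2430, "depth": 0},
    {"url": "/resources/standard-guidelines", "wordCount": 1180, "depth": 1},
    {"url": "/policies", "wordCount": 940, "depth": 1},
]

print(summarize(sample))
```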

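What this Actor can do

Convert Japanese sites to Markdown, plain text, or clean HTML
Discover pages from a sitemap or from the start URLs
Restrict the crawl scope to the same domain and the same path as the start URL
Automatically exclude non-HTML files such as PDFs, images, videos, Office files, and archives
Save failed pages, skipped pages, and a run summary to the Key-value store
Produce output that is easy to use for RAG, chunking, embeddings, AI knowledge bases, and internal search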
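
The scope rules in the list above (same domain, same path, no non-HTML files) can be illustrated with a small predicate. This is a sketch under stated assumptions, not the Actor's actual implementation: the `in_scope` function and the `NON_HTML_EXTENSIONS` set are hypothetical, and the real extension list is not documented here.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension list for illustration only.
NON_HTML_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".gif", ".mp4",
    ".docx", ".xlsx", ".pptx", ".zip",
}


def in_scope(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the same host, under the start
    URL's path, and does not look like a non-HTML file."""
    start = urlparse(start_url)
    cand = urlparse(candidate_url)
    if cand.scheme not in ("http", "https"):
        return False
    if cand.hostname != start.hostname:
        return False
    # Compare paths as directory prefixes so /docs does not match /docsearch.
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    cand_path = cand.path if cand.path.endswith("/") else cand.path + "/"
    if not cand_path.startswith(prefix):
        return False
    extension = posixpath.splitext(cand.path)[1].lower()
    return extension not in NON_HTML_EXTENSIONS
```

Comparing paths with a trailing slash is a deliberate choice: a plain string prefix check would treat `/docsearch` as being under `/docs`.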

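Quick start

Enter the target URL in the 日本語サイト、ドキュメント、ヘルプセンターのURL (Japanese site, documentation, or help center URL) field.
For the first run, leave maxPages: 3, crawlMode: auto, and outputFormat: markdown unchanged.
Check the Dataset and OUTPUT_SUMMARY.
If the output is as expected, increase maxPages and move on to the production crawl.

In auto mode, the Actor tries the sitemap first and, if no target pages are found there, follows links from the start URL.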
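
The fallback behavior of auto mode can be sketched as follows. This is only an illustration of the described order (sitemap first, then link following): `discover_urls`, `from_sitemap`, and `from_links` are hypothetical names, and the two discovery callables are injected so that the sketch stays independent of any HTTP library.

```python
from typing import Callable, Iterable, List


def discover_urls(
    start_url: str,
    from_sitemap: Callable[[str], Iterable[str]],
    from_links: Callable[[str], Iterable[str]],
    max_pages: int = 3,
) -> List[str]:
    """Try the sitemap first; fall back to links from the start URL."""
    urls = list(from_sitemap(start_url))
    if not urls:
        # The sitemap produced no target pages, so follow links instead.
        urls = list(from_links(start_url))
    # De-duplicate while preserving order, then cap at max_pages.
    return list(dict.fromkeys(urls))[:max_pages]
```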

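Cost and run size

maxPages is the main setting that controls run time and cost. The default of 3 pages is meant for a quick preview. For large documentation sites or help centers, combine sitemap, sameDomainOnly, and URL filters to avoid search pages, login pages, tag pages, and download pages.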
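
To make the URL filtering idea concrete, the sketch below excludes the page types named above with regular expressions. The patterns and the `should_crawl` helper are assumptions for illustration; the Actor's own filter syntax is not shown in this document, so adapt the patterns to whatever filter input it accepts.

```python
import re

# Hypothetical patterns for search, login, tag, and download pages.
EXCLUDE_PATTERNS = [
    r"/search(/|\?|$)",
    r"/login(/|\?|$)",
    r"/tags?(/|\?|$)",
    r"/downloads?(/|\?|$)",
]
_COMPILED = [re.compile(pattern) for pattern in EXCLUDE_PATTERNS]


def should_crawl(url: str) -> bool:
    """Return False for URLs that match any excluded page type."""
    return not any(pattern.search(url) for pattern in _COMPILED)
```

Each pattern requires a path boundary after the keyword, so a page such as `/tagline` is not mistaken for a tag page.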

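Main use cases

Japanese documentation into RAG
Extract product documentation, API documentation, and technical guides as Markdown or HTML, then pass them on to chunking, embedding, and search.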
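
As one possible downstream step, the extracted Markdown can be split into heading-scoped chunks before embedding. The sketch below is a minimal example and not part of the Actor: `chunk_markdown` is a hypothetical helper, and it splits by character count because Japanese text has no whitespace word boundaries to split on.

```python
import re
from typing import Dict, List

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split Markdown into chunks, each tagged with its nearest heading."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered section, cutting it into max_chars pieces.
        text = "\n".join(buffer).strip()
        for start in range(0, len(text), max_chars):
            chunks.append({"heading": heading, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Keeping the heading next to each chunk gives the retriever some context even when a long section is cut into several pieces.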

Help centers into AI support
Convert FAQs and support articles to text for use in internal search, support AI, and inquiry handling.

Blogs and product pages into a knowledge base
Save articles, guides, and product pages together with their title, headings, canonical URL, and body content.

Japanese sites to Markdown
Generate Markdown through a repeatable workflow instead of manual copy and paste.

Full JSON examples

Input examples

Example 1: Quick Markdown preview

{
  "startUrls": [
    {
      "url": "https://www.digital.go.jp/resources/introduction-to-web-accessibility-guidebook"
    }
  ],
  "maxPages": 3,
  "crawlMode": "auto",
  "sitemapUrls": [
    "https://www.digital.go.jp/sitemap.xml"
  ],
  "outputFormat": "markdown",
  "maxDepth": 1,
  "sameDomainOnly": true,
  "saveCleanHtml": false
}

Example 2: Follow links from the start URL

{
  "startUrls": [
    {
      "url": "https://www.digital.go.jp/resources/introduction-to-web-accessibility-guidebook"
    }
  ],
  "maxPages": 10,
  "crawlMode": "website",
  "outputFormat": "text",
  "maxDepth": 1,
  "sameDomainOnly": true
}

Example 3: Extract more broadly from the sitemap

{
  "startUrls": [
    {
      "url": "https://www.digital.go.jp/"
    }
  ],
  "maxPages": 20,
  "crawlMode": "sitemap",
  "sitemapUrls": [
    "https://www.digital.go.jp/sitemap.xml"
  ],
  "outputFormat": "markdown",
  "sameDomainOnly": true,
  "saveCleanHtml": false
}

Running from the API

This Actor can be run from the Apify API, the Apify Python client, and the Apify JavaScript client.

API reference: Apify API
Client docs: Apify clients
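
A run through the Apify Python client might look like the sketch below. Treat it as an outline under assumptions: `ACTOR_ID` is a placeholder because the Actor's full ID is not given in this document, and `build_run_input` and `run_crawl` are hypothetical helpers. The input keys are taken from the input examples above, and the `apify-client` package must be installed for `run_crawl` to work.

```python
from typing import Any, Dict, List

# Placeholder: replace with the Actor's real "<username>/<actor-name>" ID.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(start_url: str, max_pages: int = 3) -> Dict[str, Any]:
    """Build an input object using keys from the input examples."""
    return {
        "startUrls": [{"url": start_url}],
        "maxPages": max_pages,
        "crawlMode": "auto",
        "outputFormat": "markdown",
        "sameDomainOnly": True,
    }


def run_crawl(token: str, start_url: str, max_pages: int = 3) -> List[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return the Dataset items."""
    # Imported here so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(start_url, max_pages))
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Starting with the default `max_pages=3` mirrors the quick preview run; raise it only after the Dataset output looks right.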
