Pricing

from $0.001 / result

Japanese Website Content Crawler for RAG

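Crawls Japanese documentation, help centers, blogs, and product sites and extracts them as Markdown, text, or HTML that is easy to use in RAG, vector databases, LLM apps, and internal search.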

Developer

nezha

Last modified: 2 months ago

Categories

Developer tools

SEO tools

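URLs of Japanese sites, documentation, or help centers

startUrls

Required

Paste the Japanese pages or sections you want to convert to Markdown, text, or HTML. You can specify a documentation home page, a help center category, a blog index, part of a product site, and so on. When Target scope only (sameDomainOnly) is on, the URL path is also used to define the crawl scope.

Type: array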

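Maximum number of pages to extract

maxPages

Optional

Maximum number of pages saved to the Dataset. For a first run, keep the value at 3 for a fast preview, check the output, and then increase it. This is the most important setting for controlling run time and cost.

Type: integer

Minimum: 1

Default: 3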

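Page discovery method

crawlMode

Optional

Auto first uses sitemap URLs to find pages quickly and, if no target pages are found, follows links from the start URLs. Choose Website links to crawl section by section, or Sitemap only to restrict discovery to the sitemap.

Type: string

Default: auto

Options:

auto, website, sitemap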
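The fallback described for Auto can be sketched as follows. This is a minimal illustration of the behavior as the description reads, not the Actor's actual code: `discover_pages`, `from_sitemap`, and `from_links` are hypothetical names, and the two callables stand in for whatever discovery the Actor really performs.

```python
from typing import Callable, List


def discover_pages(
    crawl_mode: str,
    from_sitemap: Callable[[], List[str]],
    from_links: Callable[[], List[str]],
) -> List[str]:
    """Choose a discovery strategy for a given crawlMode value."""
    if crawl_mode == "sitemap":
        # Sitemap only: never follow links.
        return from_sitemap()
    if crawl_mode == "website":
        # Website links: never consult the sitemap.
        return from_links()
    if crawl_mode == "auto":
        pages = from_sitemap()
        # Fall back to link-following only when the sitemap yields nothing.
        return pages if pages else from_links()
    raise ValueError(f"unknown crawlMode: {crawl_mode!r}")


# Pretend the sitemap contained no target pages, so Auto falls back to links.
pages = discover_pages(
    "auto",
    from_sitemap=lambda: [],
    from_links=lambda: ["https://example.com/docs/intro"],
)
print(pages)
```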

Sitemap URL

sitemapUrls

Optional

Optional sitemap.xml URLs. If left empty, the Actor tries /sitemap.xml on the domain of each start URL. Useful for documentation, help centers, and blogs that have a reliable sitemap.

Type: array
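To make the empty-field default concrete, here is a small sketch of how a /sitemap.xml candidate could be derived from each start URL. The function name `default_sitemap_urls` is hypothetical, and keeping the start URL's scheme is an assumption; the source only says that /sitemap.xml is tried on each start URL's domain.

```python
from urllib.parse import urlsplit


def default_sitemap_urls(start_urls):
    """Return one /sitemap.xml candidate per distinct start-URL domain."""
    candidates = []
    for url in start_urls:
        parts = urlsplit(url)
        candidate = f"{parts.scheme}://{parts.netloc}/sitemap.xml"
        if candidate not in candidates:  # keep first-seen order, drop repeats
            candidates.append(candidate)
    return candidates


print(default_sitemap_urls([
    "https://example.com/docs/getting-started",
    "https://example.com/help",
    "https://blog.example.org/",
]))
```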

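Link depth

maxDepth

Optional

Specifies how many levels of links to follow in Website links mode. 0 crawls only the pasted URLs, 1 includes directly linked pages, and 2 or more is for crawling a broader section. Ignored in Sitemap only mode.

Type: integer

Minimum: 0

Default: 1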
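The depth semantics (0, 1, 2 or more) can be illustrated with a breadth-first walk over an in-memory link graph. This is a sketch under the assumption that depth is counted in link hops from the start URLs; `crawl_order` and the `links` mapping are hypothetical and no network access is involved.

```python
from collections import deque


def crawl_order(start_urls, links, max_depth):
    """Breadth-first walk of a link graph that stops expanding at max_depth."""
    starts = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(starts)
    queue = deque((url, 0) for url in starts)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # this page is kept, but its links are not followed
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site structure: page -> pages it links to.
links = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
}
for depth in (0, 1, 2):
    print(depth, crawl_order(["/docs"], links, depth))
```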

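Target scope only

sameDomainOnly

Optional

Crawls only pages on the same domain and under the same path as the start URL. Example: starting from /docs limits the crawl to pages under /docs. Turn this off only if you want to extend the crawl to other domains or the whole site.

Type: boolean

Default: true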
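A scope check along these lines might look like the sketch below. It assumes "same domain" means an exact hostname match and "under the path" means a path-segment prefix; how the Actor treats subdomains or trailing slashes is not stated, and `in_scope` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def in_scope(start_url, candidate_url):
    """True when candidate_url is on the same host and under the start path."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if start.hostname != cand.hostname:
        return False
    prefix = start.path.rstrip("/")
    # Match the prefix on whole path segments so /docs does not admit /docs-old.
    return cand.path == prefix or cand.path.startswith(prefix + "/")


start = "https://example.com/docs"
for url in (
    "https://example.com/docs/intro",
    "https://example.com/docs-old/intro",
    "https://example.com/blog/post",
    "https://other.example.net/docs/intro",
):
    print(url, in_scope(start, url))
```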

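Main content format

outputFormat

Optional

Specifies the format stored in the Dataset's content field. Markdown suits most RAG pipelines and vector databases, plain text suits lightweight search and QA, and HTML, which preserves structure, suits custom parsing.

Type: string

Default: markdown

Options:

markdown, text, html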

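Save clean HTML as separate records

saveCleanHtml

Optional

Saves each page's cleaned HTML to the Key-value store and writes an index to CLEAN_HTML_INDEX. Leave this off for fast previews and turn it on when downstream processing needs the HTML files.

Type: boolean

Default: false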

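CSS selector for the main content area

contentSelector

Optional

Optional CSS selector that identifies the main content area. Examples: main, article, .docs-content, #content. If left empty, the Actor auto-detects in the order main, article, [role=main], body.

Type: string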
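The auto-detection order can be sketched with the standard-library HTML parser. This is only an illustration of the documented fallback order (main, article, [role=main], body); `pick_content_root` is a hypothetical helper that reports which candidate would be chosen, and it does not evaluate arbitrary CSS selectors.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record which of the candidate containers appear in a document."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        self.found.add(tag)
        if dict(attrs).get("role") == "main":
            self.found.add("[role=main]")


def pick_content_root(html, content_selector=None):
    """Return the explicit selector, or the first auto-detected candidate."""
    if content_selector:
        return content_selector
    collector = _TagCollector()
    collector.feed(html)
    for candidate in ("main", "article", "[role=main]", "body"):
        if candidate in collector.found:
            return candidate
    return None


print(pick_content_root("<body><div role='main'><p>text</p></div></body>"))
print(pick_content_root("<body><p>text</p></body>", content_selector=".docs-content"))
```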

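CSS selectors to remove

removeSelectors

Optional

Optional CSS selectors to remove before extraction. Examples: .sidebar, .cookie-banner, .newsletter, .toc, .ads. Use this when navigation or shared page components leak into the output.

Type: array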

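Minimum text length

minTextLength

Optional

Skips pages whose extracted text is shorter than this number of characters. Keep it at 0 for the first run, then increase it later if you want to exclude empty pages, redirects, or index pages.

Type: integer

Minimum: 0

Default: 0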

URL patterns to include

includeUrlGlobs

Optional

Optional Crawlee glob patterns for URLs to keep. Examples: /docs/, /help/. Use this only when you want to narrow the crawl scope further.

Type: array

URL patterns to exclude

excludeUrlGlobs

Optional

Optional Crawlee glob patterns for URLs to skip. Examples: /search/, /login/, ?utm_, /*.pdf.

Type: array
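How include and exclude patterns might combine can be sketched with Python's fnmatch. Note the assumptions: fnmatch is only an approximation and does not implement Crawlee's glob syntax, the patterns below are hypothetical rewrites of the examples above, and giving excludes priority over includes is a guess rather than documented behavior.

```python
from fnmatch import fnmatchcase


def keep_url(url, include_globs=(), exclude_globs=()):
    """Apply exclude patterns first, then require an include match if any exist."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if include_globs:
        return any(fnmatchcase(url, pattern) for pattern in include_globs)
    return True


include = ["*/docs/*"]
# "[?]" matches a literal question mark; a bare "?" would match any character.
exclude = ["*/search/*", "*[?]utm_*", "*.pdf"]

for url in (
    "https://example.com/docs/intro",
    "https://example.com/docs/guide.pdf",
    "https://example.com/docs/intro?utm_source=mail",
    "https://example.com/blog/post",
):
    print(url, keep_url(url, include, exclude))
```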

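Additional file extensions to skip

excludeFileExtensions

Optional

This Actor automatically skips common non-HTML files such as pdf, images, video, Office files, and archives. Add extensions only if the target site has its own download formats.

Type: array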
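Extension-based skipping could work roughly as in the sketch below. The built-in set shown here is illustrative only, since the source names categories (pdf, images, video, Office files, archives) but not the exact list, and `is_skipped` and the `jtd` extension are hypothetical.

```python
from posixpath import splitext
from urllib.parse import urlsplit

# Illustrative stand-in for the Actor's built-in list, which is not documented.
ASSUMED_BUILTIN_EXTENSIONS = {"pdf", "png", "jpg", "mp4", "docx", "zip"}


def is_skipped(url, extra_extensions=()):
    """True when the URL path ends in a built-in or user-added extension."""
    extension = splitext(urlsplit(url).path)[1].lstrip(".").lower()
    extras = {e.lstrip(".").lower() for e in extra_extensions}
    return extension in (ASSUMED_BUILTIN_EXTENSIONS | extras)


print(is_skipped("https://example.com/files/manual.PDF"))
print(is_skipped("https://example.com/docs/intro"))
print(is_skipped("https://example.com/download/report.jtd", ["jtd"]))
```

The query string is ignored because only the path component is inspected, so a URL such as /files/manual.pdf?download=1 is treated the same as /files/manual.pdf.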

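CSS selector to wait for

waitForSelector

Optional

Optional selector to wait for before extraction. Examples: main, .article-body. Use this for JavaScript sites where the main content appears after the initial load.

Type: string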

Page load timeout

navigationTimeoutSecs

Optional

Number of seconds allowed for page navigation and for waiting on the optional selector. Increase it for slow, JavaScript-heavy sites.

Type: integer

Minimum: 15

Default: 25

Proxy configuration

proxyConfiguration

Optional

Optional Apify proxy configuration. For public documentation, help centers, and blogs, the default direct connection is usually sufficient.

Type: object
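Putting the fields together, a run input could be assembled and checked against the documented defaults, minimums, and options before it is submitted. This is a sketch: `build_run_input` is a hypothetical helper, and the shape of each startUrls entry (an object with a `url` key) is an assumption, since the schema only states that the field is an array.

```python
# Defaults, minimums, and options copied from the field descriptions above.
DEFAULTS = {
    "maxPages": 3,
    "crawlMode": "auto",
    "maxDepth": 1,
    "sameDomainOnly": True,
    "outputFormat": "markdown",
    "saveCleanHtml": False,
    "minTextLength": 0,
    "navigationTimeoutSecs": 25,
}
MINIMUMS = {"maxPages": 1, "maxDepth": 0, "minTextLength": 0, "navigationTimeoutSecs": 15}
CRAWL_MODES = {"auto", "website", "sitemap"}
OUTPUT_FORMATS = {"markdown", "text", "html"}


def build_run_input(start_urls, **overrides):
    """Merge overrides into the documented defaults and validate the result."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    run_input = {"startUrls": list(start_urls), **DEFAULTS, **overrides}
    for key, minimum in MINIMUMS.items():
        if run_input[key] < minimum:
            raise ValueError(f"{key} must be at least {minimum}")
    if run_input["crawlMode"] not in CRAWL_MODES:
        raise ValueError("crawlMode must be one of auto, website, sitemap")
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError("outputFormat must be one of markdown, text, html")
    return run_input


# Entry shape is an assumption; the schema only says the field is an array.
preview = build_run_input([{"url": "https://example.com/docs"}])
wider = build_run_input(
    [{"url": "https://example.com/docs"}],
    maxPages=50,
    maxDepth=2,
    removeSelectors=[".sidebar", ".toc"],
)
print(preview["maxPages"], wider["maxPages"])
```

Starting with the default preview input and widening maxPages and maxDepth afterwards follows the advice given for maxPages.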
