
Website Content Crawler for RAG

Pricing

from $0.01 / result

Crawl documentation sites, help centers, blogs, and websites, then extract clean markdown, text, or HTML for RAG pipelines, vector databases, and LLM applications.

Rating

0.0 (0)

Developer

yun qing

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 10
  • Monthly active users: 2
  • Last modified: 10 days ago

Crawl docs sites, help centers, blogs, and websites, then extract clean content as markdown, text, or HTML for RAG, vector databases, and LLM pipelines.

Built for:

  • AI engineers
  • RAG developers
  • Knowledge base teams
  • Developer tooling teams

Why use this Actor?

  • Crawl from start URLs or sitemap URLs
  • Keep the crawl inside your target scope
  • Filter out PDFs and non-HTML files
  • Store clean HTML separately for downstream processing
  • Export markdown, text, or HTML depending on your ingestion workflow

Typical use cases

  • Crawl product documentation into a vector database
  • Ingest help center content into an internal knowledge base
  • Extract clean website content for LLM applications
  • Capture docs and blog content for search or analysis

What makes it useful for content ingestion

  • sitemap mode for docs and help center sites
  • scope control to avoid crawling unrelated pages
  • PDF and file filtering to keep the output focused
  • clean HTML storage for downstream parsing and chunking
  • markdown, text, and HTML outputs for different pipelines
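
The scope control and file filtering described above can be sketched as a single URL check. The helper and constant names here are illustrative, not the Actor's real internals; the extension list matches the Typical input example.

```typescript
// Assumed extension list, mirroring the excludeFileExtensions example input.
const EXCLUDED_EXTENSIONS = [".pdf", ".zip", ".doc", ".docx", ".ppt", ".pptx"];

function shouldCrawl(candidate: string, startUrl: string): boolean {
  const target = new URL(candidate);
  const scope = new URL(startUrl);
  // Scope control: stay on the same origin as the start URL.
  if (target.origin !== scope.origin) return false;
  // File filtering: skip non-HTML files by extension.
  const path = target.pathname.toLowerCase();
  return !EXCLUDED_EXTENSIONS.some((ext) => path.endsWith(ext));
}
```

Checking origin rather than hostname also keeps the crawl from drifting between http and https variants of a site.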

If this is your first run, start with:

  • 1 start URL or 1 sitemap URL
  • contentFormat: markdown
  • a conservative maxDepth
  • file filtering enabled
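
A conservative first-run input following these recommendations might look like the sketch below. The field names come from the Typical input example; the interface itself is an assumption, not the Actor's published schema.

```typescript
// Illustrative input shape; field names follow the Typical input example.
interface CrawlerInput {
  startUrls: { url: string }[];
  crawlMode: "website" | "sitemap";
  contentFormat: "markdown" | "text" | "html";
  maxDepth: number;
  excludeFileExtensions: string[];
}

const firstRun: CrawlerInput = {
  startUrls: [{ url: "https://docs.apify.com/" }],
  crawlMode: "website",
  contentFormat: "markdown",
  maxDepth: 2, // a conservative depth keeps the first crawl small
  excludeFileExtensions: [".pdf", ".zip", ".doc", ".docx", ".ppt", ".pptx"],
};
```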

Good first-run targets:

  • a product docs site
  • a help center
  • a blog section

Example workflows

1. Docs site to RAG

Use the Actor to crawl a documentation site, then send the markdown or clean HTML output into your chunking and embedding pipeline.

Best for:

  • internal developer docs
  • product documentation
  • public API docs
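
The chunking step of this workflow can be sketched as below, assuming markdown output. Real pipelines usually use a dedicated splitter with overlap handling; this minimal version just splits on top-level headings, then caps chunk size.

```typescript
// Minimal chunking sketch for markdown output (illustrative, not part of the Actor).
function chunkMarkdown(markdown: string, maxChars = 1000): string[] {
  const chunks: string[] = [];
  // Split on top-level headings so each chunk stays topically coherent.
  const sections = markdown.split(/\n(?=# )/);
  for (const section of sections) {
    for (let i = 0; i < section.length; i += maxChars) {
      chunks.push(section.slice(i, i + maxChars));
    }
  }
  return chunks;
}
```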

2. Help center to knowledge base

Crawl support articles from a help center and export them as clean text or markdown for:

  • internal search
  • support copilots
  • FAQ assistants

3. Website content extraction for LLM apps

Collect structured content from blogs, docs, and product pages to build:

  • retrieval systems
  • internal knowledge tools
  • content analysis workflows

Typical input

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "crawlMode": "website",
  "contentFormat": "markdown",
  "maxDepth": 2,
  "excludeFileExtensions": [".pdf", ".zip", ".doc", ".docx", ".ppt", ".pptx"]
}

Local development

pnpm actor:dev websiteContentCrawler --example 0 --force-input
pnpm actor:dev websiteContentCrawler --example 2 --force-input

Notes:

  • input-examples.json is used by local actor:dev
  • Apify platform automated testing uses the prefill values from .actor/input_schema.json
  • The schema uses a public default URL so automated testing can pass without relying on localhost

Build

pnpm actor:build websiteContentCrawler

Publish

pnpm actor:push websiteContentCrawler
pnpm actor:push websiteContentCrawler --dry-run
pnpm actor:push websiteContentCrawler --sync-meta --prefer-local-meta

Dataset Output

Each dataset item includes:

  • url
  • title
  • description
  • content
  • contentFormat
  • cleanHtml
  • markdown
  • text
  • html
  • wordCount
  • language
  • canonicalUrl
  • depth
  • httpStatusCode
  • crawledAt

Crawl Modes

  • website: start from startUrls, then follow links recursively
  • sitemap: load URLs from sitemapUrls, or fall back to the start URL's origin + /sitemap.xml
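
The sitemap fallback can be sketched with the standard URL API; the function name is illustrative, not the Actor's internal API.

```typescript
// Resolve the sitemap to crawl: explicit sitemapUrls win, otherwise
// fall back to the start URL's origin plus /sitemap.xml.
function resolveSitemapUrl(startUrl: string, sitemapUrls?: string[]): string {
  if (sitemapUrls && sitemapUrls.length > 0) return sitemapUrls[0];
  return new URL("/sitemap.xml", new URL(startUrl).origin).toString();
}
```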

Separate Clean HTML Storage

  • CLEAN_HTML_INDEX stores the mapping between page URL and KVS record key
  • Individual cleaned HTML records are stored as CLEAN_HTML_000001, CLEAN_HTML_000002, and so on
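
The record key scheme above is a sequential counter zero-padded to six digits, which can be sketched as:

```typescript
// Build a KVS record key like CLEAN_HTML_000001 (sketch of the naming scheme).
function cleanHtmlKey(index: number): string {
  return `CLEAN_HTML_${String(index).padStart(6, "0")}`;
}
```

Fixed-width padding keeps the keys lexicographically sorted in the same order they were written.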