Website Content Crawler

Pricing: from $0.01 / result

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, and integrates with the wider LLM ecosystem.

Rating: 0.0 (0)

Developer: yun qing (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 8 hours ago

A focused first-version actor inspired by Apify's Website Content Crawler.

It crawls one or more websites, stays inside the start URL scope, extracts cleaned page content, and stores the result as markdown, text, or html.

Features

  • Crawl from one or more start URLs
  • Crawl from sitemap URLs in sitemap mode
  • Follow links recursively with maxDepth
  • Keep the crawl inside the same start URL scope
  • Filter out PDFs and other non-HTML files by extension
  • Extract cleaned page content from main, article, or a custom selector
  • Output markdown, text, or html
  • Store OUTPUT_SUMMARY, FAILED_PAGES, SKIPPED_PAGES, and CLEAN_HTML_INDEX
  • Store cleaned HTML separately in key-value store records like CLEAN_HTML_000001
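The "same start URL scope" rule above can be sketched as a simple URL comparison. This is an illustrative sketch, not the actor's actual code; the helper name `isInScope` is hypothetical:

```typescript
// Illustrative sketch: keep a crawl inside the start URL's scope.
// A candidate link is in scope when it shares the start URL's origin
// and its path begins with the start URL's directory.
// `isInScope` is a hypothetical helper, not the actor's real API.
function isInScope(startUrl: string, candidate: string): boolean {
  const start = new URL(startUrl);
  const target = new URL(candidate);
  if (start.origin !== target.origin) return false;
  // Treat the start URL's directory as the scope root.
  const basePath = start.pathname.endsWith("/")
    ? start.pathname
    : start.pathname.slice(0, start.pathname.lastIndexOf("/") + 1);
  return target.pathname.startsWith(basePath);
}
```

Under this reading, a crawl started at `https://example.com/docs/` would follow links under `/docs/` but skip `/blog/` and other hosts.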

Local Development

pnpm actor:dev websiteContentCrawler --example 0 --force-input
pnpm actor:dev websiteContentCrawler --example 2 --force-input

Notes:

  • input-examples.json is used by local actor:dev
  • Apify platform automated testing uses the prefill values from .actor/input_schema.json
  • The schema now uses a public default URL so automated testing can pass without relying on localhost
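For orientation, a minimal input might look like the sketch below. Only startUrls, sitemapUrls, and maxDepth are named elsewhere in this README; the other field names (crawlMode, contentSelector, outputFormat) are assumptions and may differ from the actual .actor/input_schema.json:

```json
{
  "startUrls": [{ "url": "https://example.com/docs/" }],
  "crawlMode": "website",
  "maxDepth": 2,
  "contentSelector": "main",
  "outputFormat": "markdown"
}
```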

Build

pnpm actor:build websiteContentCrawler

Publish

pnpm actor:push websiteContentCrawler
pnpm actor:push websiteContentCrawler --dry-run

Dataset Output

Each dataset item includes:

  • url
  • title
  • description
  • content
  • contentFormat
  • cleanHtml
  • markdown
  • text
  • html
  • wordCount
  • language
  • canonicalUrl
  • depth
  • httpStatusCode
  • crawledAt
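A dataset item might look like the following sketch; all values are illustrative, and only the field the chosen output format fills (here markdown) is assumed to be populated:

```json
{
  "url": "https://example.com/docs/intro",
  "title": "Introduction",
  "description": "Getting started guide",
  "content": "# Introduction\n...",
  "contentFormat": "markdown",
  "cleanHtml": "CLEAN_HTML_000001",
  "markdown": "# Introduction\n...",
  "text": null,
  "html": null,
  "wordCount": 345,
  "language": "en",
  "canonicalUrl": "https://example.com/docs/intro",
  "depth": 1,
  "httpStatusCode": 200,
  "crawledAt": "2025-01-01T12:00:00.000Z"
}
```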

Crawl Modes

  • website: start from startUrls, then follow links recursively
  • sitemap: load URLs from sitemapUrls, or fall back to origin + /sitemap.xml
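The sitemap fallback in the list above can be sketched with the WHATWG URL API; `fallbackSitemapUrl` is an illustrative name, not the actor's code:

```typescript
// Illustrative sketch of the sitemap-mode fallback: when no
// sitemapUrls are given, derive origin + /sitemap.xml from a
// start URL. `fallbackSitemapUrl` is a hypothetical helper.
function fallbackSitemapUrl(startUrl: string): string {
  // Resolving an absolute path against the start URL keeps only its origin.
  return new URL("/sitemap.xml", startUrl).toString();
}
```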

Separate Clean HTML Storage

  • CLEAN_HTML_INDEX stores the mapping between page URL and KVS record key
  • Individual cleaned HTML records are stored as CLEAN_HTML_000001, CLEAN_HTML_000002, and so on
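The sequential record keys above follow a zero-padded counter pattern, which can be sketched as follows; `cleanHtmlKey` is an illustrative name, not the actor's code:

```typescript
// Illustrative sketch: build key-value store record keys like
// CLEAN_HTML_000001 by zero-padding a 1-based page counter to
// six digits. `cleanHtmlKey` is a hypothetical helper.
function cleanHtmlKey(index: number): string {
  return `CLEAN_HTML_${String(index).padStart(6, "0")}`;
}
```

Fixed-width keys keep records in crawl order when listed lexicographically, which is presumably why the padded form is used.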