Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, and produces output ready for the wider LLM ecosystem.
Pricing: from $0.01 / result
Rating: 0.0 (0)
Developer: yun qing (Maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 8 hours ago
A focused first-version actor inspired by Apify's Website Content Crawler.
It crawls one or more websites, stays inside the start URL scope, extracts cleaned page content, and stores the result as markdown, text, or html.
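Staying inside the start URL scope usually comes down to a prefix check on the origin and path. A minimal sketch of that idea (the function name `isInScope` is illustrative, not the Actor's actual API):

```typescript
// Sketch: keep the crawl inside the start URL scope.
// A candidate link is in scope when it shares the start URL's origin
// and its path begins with the start URL's path.
function isInScope(startUrl: string, candidateUrl: string): boolean {
  const start = new URL(startUrl);
  const candidate = new URL(candidateUrl);
  if (candidate.origin !== start.origin) return false;
  return candidate.pathname.startsWith(start.pathname);
}
```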
Features
- Crawl from one or more start URLs
- Crawl from sitemap URLs in `sitemap` mode
- Follow links recursively with `maxDepth`
- Keep the crawl inside the same start URL scope
- Filter out PDFs and other non-HTML files by extension
- Extract cleaned page content from `main`, `article`, or a custom selector
- Output `markdown`, `text`, or `html`
- Store `OUTPUT_SUMMARY`, `FAILED_PAGES`, `SKIPPED_PAGES`, and `CLEAN_HTML_INDEX` records
- Store cleaned HTML separately in key-value store records like `CLEAN_HTML_000001`
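Taken together, these features suggest the shape of the Actor's input. A hedged sketch of what an input object might look like (the exact key names live in `.actor/input_schema.json`; the names shown here are illustrative assumptions, not verified against the schema):

```json
{
  "startUrls": [{ "url": "https://example.com/docs/" }],
  "crawlMode": "website",
  "maxDepth": 2,
  "contentSelector": "main",
  "outputFormat": "markdown"
}
```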
Local Development
```
pnpm actor:dev websiteContentCrawler --example 0 --force-input
pnpm actor:dev websiteContentCrawler --example 2 --force-input
```
Notes:
- `input-examples.json` is used by local `actor:dev`
- Apify platform automated testing uses the `prefill` values from `.actor/input_schema.json`
- The schema now uses a public default URL so automated testing can pass without relying on localhost
Build
```
pnpm actor:build websiteContentCrawler
```
Publish
```
pnpm actor:push websiteContentCrawler
pnpm actor:push websiteContentCrawler --dry-run
```
Dataset Output
Each dataset item includes:
`url`, `title`, `description`, `content`, `contentFormat`, `cleanHtml`, `markdown`, `text`, `html`, `wordCount`, `language`, `canonicalUrl`, `depth`, `httpStatusCode`, `crawledAt`
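For illustration, a single dataset item might look like the following. Only the field names come from the list above; the values are made up, and the assumption that `cleanHtml` holds a key-value store record key (rather than the HTML itself) is a guess based on the separate clean HTML storage described below:

```json
{
  "url": "https://example.com/docs/intro",
  "title": "Introduction",
  "description": "Getting started guide",
  "content": "# Introduction ...",
  "contentFormat": "markdown",
  "cleanHtml": "CLEAN_HTML_000001",
  "wordCount": 512,
  "language": "en",
  "canonicalUrl": "https://example.com/docs/intro",
  "depth": 1,
  "httpStatusCode": 200,
  "crawledAt": "2025-01-01T00:00:00.000Z"
}
```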
Crawl Modes
- `website`: start from `startUrls`, then follow links recursively
- `sitemap`: load URLs from `sitemapUrls`, or fall back to `origin + /sitemap.xml`
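The sitemap fallback described above can be sketched as a one-liner that derives the default sitemap location from a start URL's origin (the function name is illustrative, not the Actor's actual API):

```typescript
// Sketch: when no sitemapUrls are given in sitemap mode, fall back to
// the conventional location at the start URL's origin + /sitemap.xml.
function defaultSitemapUrl(startUrl: string): string {
  return `${new URL(startUrl).origin}/sitemap.xml`;
}
```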
Separate Clean HTML Storage
- `CLEAN_HTML_INDEX` stores the mapping between page URL and KVS record key
- Individual cleaned HTML records are stored as `CLEAN_HTML_000001`, `CLEAN_HTML_000002`, and so on
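The sequential key format shown above is just a zero-padded counter. A minimal sketch of how such keys could be generated (the helper name is hypothetical, not taken from the Actor's source):

```typescript
// Sketch: build key-value store record keys like CLEAN_HTML_000001
// by zero-padding a 1-based counter to six digits.
function cleanHtmlKey(index: number): string {
  return `CLEAN_HTML_${String(index).padStart(6, '0')}`;
}
```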