
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
3.7 (41)
1514
Total users: 58K
Monthly users: 7.9K
Runs succeeded: >99%
Issues response: 7.6 days
Last modified: 19 hours ago
Crawling sitemaps
Closed
The crawler doesn't detect any URLs when given a sitemap, such as any default sitemap autogenerated by WordPress: https://guykawasaki.com/sitemap_index.xml
Hello @sentiyen and thank you for your interest in this Actor!
We've just released a new Website Content Crawler update (0.3.29) that adds a new input option, Consider URLs from Sitemaps (or useSitemaps via the API). Setting this to true tells the Actor to look for sitemap files among the start URLs and also to actively discover sitemaps on crawled domains (e.g. by analyzing robots.txt). This should help our users get more consistent results from their crawls and also enable some alternative use cases, such as yours.
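To illustrate the kind of sitemap discovery mentioned above, here is a minimal sketch that reads Sitemap entries out of robots.txt content using Python's standard library. It is not the Actor's actual implementation; the helper name sitemaps_from_robots and the example robots.txt text are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared via 'Sitemap:' lines in robots.txt text."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap entries were found.
    return parser.site_maps() or []


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap_index.xml
"""

print(sitemaps_from_robots(EXAMPLE_ROBOTS))
```

A real crawler would also need to fetch robots.txt over HTTP and follow sitemap index files to the sitemaps they list; both steps are left out of this sketch.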
In your next run, please try enabling this input option and let us know how it went. Thanks again!
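For anyone setting the option through the API, a run input with it enabled might look like the sketch below. Only useSitemaps comes from the reply above; the startUrls field name and its list-of-objects shape are assumptions about the Actor's input schema, and build_run_input is a hypothetical helper, so please check the Actor's input documentation before relying on it.

```python
import json


def build_run_input(start_urls, use_sitemaps=True):
    """Build an input payload with sitemap handling enabled (field names partly assumed)."""
    return {
        # Assumed field name and shape: a list of {"url": ...} objects.
        "startUrls": [{"url": url} for url in start_urls],
        # Option named in the reply above.
        "useSitemaps": use_sitemaps,
    }


run_input = build_run_input(["https://guykawasaki.com/sitemap_index.xml"])
print(json.dumps(run_input, indent=2))
```

The resulting JSON can be pasted into the Actor's JSON input editor or sent as the body of an API run request.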