Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.7 (41)

1514

Total users

58K

Monthly users

7.9K

Runs succeeded

>99%

Issues response

7.6 days

Last modified

19 hours ago

Crawling sitemaps

Closed

sentiyen opened this issue
a year ago

The crawler doesn't detect any URLs when given a sitemap, such as any default sitemap autogenerated by WordPress: https://guykawasaki.com/sitemap_index.xml

jindrich.bar

Hello @sentiyen and thank you for your interest in this Actor!

We've just released a new Website Content Crawler update (0.3.29) that adds a new input option, Consider URLs from Sitemaps (or useSitemaps via the API). Setting it to true tells the Actor to look for sitemap files among the start URLs and also to actively discover sitemaps on crawled domains (e.g. by analyzing robots.txt). This should help our users get more consistent results from their crawls and also enable some alternative use cases, such as yours.
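To make the discovery idea more concrete, here is a minimal sketch of how sitemap URLs can be read from a robots.txt body and how a sitemap index (like the WordPress one in the original report) can be told apart from a regular URL set. This is not the Actor's actual implementation; the function names `sitemaps_from_robots` and `parse_sitemap` are hypothetical, and the sketch assumes the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be used by the site).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, because the URL itself contains colons.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('index' or 'urlset', list of <loc> values) for a sitemap document."""
    root = ET.fromstring(xml_text)
    # A <sitemapindex> root lists further sitemaps; anything else is treated as a URL set.
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return kind, locs
```

A crawler built on this sketch would fetch each URL returned for an "index" document and parse it again, and would enqueue the URLs returned for a "urlset" document as pages to crawl.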

In your next run, please try enabling this input option (and let us know how it goes!) Thanks again!
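For anyone starting runs through the API rather than the UI, a rough sketch of a run request with the option enabled might look like the following. Only `useSitemaps` comes from this thread; the `startUrls` input shape, the `apify~website-content-crawler` Actor ID, the runs endpoint, and the `APIFY_TOKEN` environment variable are assumptions here and should be checked against the Actor's input schema and the Apify API documentation.

```python
import json
import os
import urllib.request

# Assumed Actor identifier in the "username~actor-name" form used in API URLs.
ACTOR_ID = "apify~website-content-crawler"


def build_run_input(start_urls, use_sitemaps=True):
    """Build the Actor input; 'startUrls' is an assumed field, 'useSitemaps' is from the thread."""
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "useSitemaps": use_sitemaps,
    }


def build_run_request(token, run_input):
    """Prepare (but do not send) a POST request that would start an Actor run."""
    return urllib.request.Request(
        f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    token = os.environ.get("APIFY_TOKEN")  # hypothetical variable name
    if token:
        run_input = build_run_input(["https://guykawasaki.com/sitemap_index.xml"])
        # Sends the request only when a token is available.
        with urllib.request.urlopen(build_run_request(token, run_input)) as resp:
            print(json.load(resp))
```

Passing the sitemap URL itself as a start URL matches the use case from the original report; a regular page URL should also work if sitemap discovery behaves as described above.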