Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Sitemap discovery takes long time (15 minutes)

Closed

Jiří Spilka (jiri.spilka) opened this issue
a month ago

I'm raising this issue on behalf of @axstv.

Fetching sitemaps is taking around 15 minutes in total and this is occurring across different runs. While disabling sitemap use is a workaround, we need to investigate the cause.

jiri.spilka

We have a PR ready to help reduce latency.

If you only need to scrape results from the startUrls, set max depth to 0 and turn off Consider URLs from sitemap. This way, scraping will be faster, and you’ll get data only from the pages specified in startUrls.
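The workaround above could be expressed as an Actor input roughly like the sketch below. It assumes the input fields are named `startUrls`, `maxCrawlDepth`, and `useSitemaps` (the latter standing for "Consider URLs from sitemap"); check the Actor's input schema before relying on these names. The helper names `build_start_urls_only_input` and `run_crawler` are hypothetical, and the run itself assumes the `apify-client` Python package is installed.

```python
from __future__ import annotations

from typing import Any


def build_start_urls_only_input(urls: list[str]) -> dict[str, Any]:
    """Build an input that scrapes only the given pages (hypothetical helper)."""
    return {
        "startUrls": [{"url": url} for url in urls],
        # Assumed field name for "max depth": 0 means do not follow links.
        "maxCrawlDepth": 0,
        # Assumed field name for "Consider URLs from sitemap".
        "useSitemaps": False,
    }


def run_crawler(token: str, urls: list[str]) -> list[dict[str, Any]]:
    """Start the Actor with the input above and return the dataset items."""
    # Imported lazily so the input builder works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_start_urls_only_input(urls)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_crawler("<YOUR_API_TOKEN>", ["https://example.com/docs"])` should then return results for that single page only, without any sitemap fetching.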

jiri.spilka

We have made changes to sitemap fetching and tested them. It is way faster now. This update will be shipped in the next release. I’ll keep this issue open until the release is out.

jindrich.bar

Today, we released a new version of WCC (0.3.54), which has more aggressive timeouts for sitemap loading and internal crawler operations. This ensures faster crawling progression, even in the case of a slow or misconfigured server.

We checked that the patches in 0.3.54 improve the performance of the Actor for your runs, too. To start using the new version, select it in the Run options section of the Actor input, either by pinning the version number or by switching to the latest build tag.
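When starting runs through the API instead of the Run options UI, the same choice might look like the sketch below. It assumes the `apify-client` Python package, whose `call` method accepts a `build` argument, and it assumes that passing `"0.3.54"` selects that version and `"latest"` selects the latest build tag. The helper names `call_options` and `run_with_build` are hypothetical.

```python
from __future__ import annotations

from typing import Any, Optional

WCC_ACTOR_ID = "apify/website-content-crawler"


def call_options(pin_version: Optional[str] = None) -> dict[str, Any]:
    """Return keyword arguments selecting a pinned version or the latest tag."""
    # Assumption: the build option accepts either a version number or a tag.
    return {"build": pin_version if pin_version else "latest"}


def run_with_build(
    token: str, run_input: dict[str, Any], pin_version: Optional[str] = None
) -> Optional[dict[str, Any]]:
    """Start the Actor on the chosen build and return the run record."""
    # Imported lazily so call_options works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    return client.actor(WCC_ACTOR_ID).call(
        run_input=run_input, **call_options(pin_version)
    )
```

Pinning (`pin_version="0.3.54"`) keeps runs on a known version, while omitting it follows whatever the latest tag points to.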

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

Developer
Maintained by Apify