Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.

Crawling does not work for custom sitemaps

Closed

sai_sampath opened this issue
3 months ago

I want Apify to accept and use custom sitemaps. We have a use case for this. For example, take this sitemap: https://customgpt-streamlit.s3.amazonaws.com/customgpt-streamlit/f1357fcd-a8cb-4498-a641-ec541a999fbc.xml

The crawl fails on it. I've tried multiple Actors and different settings, such as turning sitemaps on and off, and it still doesn't work.

These are the runs:

RUN 1: https://console.apify.com/actors/runs/BVzk3hXVg4bC8HNh2

RUN 2: https://console.apify.com/actors/runs/KFJmq9MqQKPRMHn2c

RUN 3: https://console.apify.com/actors/runs/EBjTGlHXS2CDIzzW1

RUN 4: https://console.apify.com/actors/runs/MLrPsygiOhTnyaKnZ

Can you please look into this?

unified_antechamber

3 months ago

+1

jindrich.bar

Hello and thank you for your interest in this Actor!

We're sorry about the inconvenience caused by this issue. Currently, WCC only supports automatic scanning for sitemaps (by reading the robots.txt on a given domain and downloading the sitemap automatically). Passing the actual sitemap file as a start URL causes issues, as the start URLs are only expected to be valid web pages.

Our team will look into this use case and we'll try to come up with a way of supporting this.

Thank you for your patience. I'll get back to you once we figure out the best way to go about this. Cheers!
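
For readers wondering what "automatic scanning" via robots.txt looks like in general, here is a minimal sketch using only the Python standard library. It is not WCC's actual implementation; the function name `sitemaps_from_robots` and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in `Sitemap:` directives of a robots.txt body."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access here.
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap directive exists.
    return parser.site_maps() or []


# Hypothetical robots.txt content, used only for illustration.
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml
"""

print(sitemaps_from_robots(sample))
```

A sitemap hosted somewhere that robots.txt does not mention, like the S3 file in the original report, would never be found this way, which is why passing it in directly matters.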

sai_sampath

3 months ago

Thank you for the response. Please try to get this done, as this is one of our major use cases. Hope to hear good news from you!

vladfrangu_apify

Hey! Following up on the above: we've determined that this should have worked (passing in a sitemap XML with the Use sitemaps option turned on), and the fact that it doesn't is a bug. We will follow up once we release the bug fix for this! :D

Thanks a bunch for reporting this!
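
For reference, the configuration described above might look roughly like the sketch below. The field names `startUrls` and `useSitemaps` are assumptions about the Actor's input schema and should be checked against the Actor's input documentation; `build_run_input` and the example.com URL are hypothetical.

```python
import json


def build_run_input(sitemap_url: str) -> dict:
    """Build an Actor input that passes a sitemap URL as the start URL."""
    return {
        # Assumed field name: list of start URL objects.
        "startUrls": [{"url": sitemap_url}],
        # Assumed field name for the "Use sitemaps" option.
        "useSitemaps": True,
    }


run_input = build_run_input("https://example.com/sitemap.xml")
print(json.dumps(run_input, indent=2))
```

The resulting JSON could then be pasted into the Actor's input editor or sent through an API client.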

sai_sampath

3 months ago

Thank you for the update. Looking forward to good news soon!

jindrich.bar

We've finally released a patch that fixes this issue. When given a URL that leads to a sitemap, the crawler now treats the sitemap as a list of links to enqueue.

I'll close this issue now, but make sure to let us know if anything breaks for you.

Thank you for your patience!
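
To illustrate the idea of treating a sitemap as a list of links, here is a minimal standard-library sketch. It is not the patched crawler code; `links_from_sitemap` and the sample XML are invented for the example, and the only thing taken from the sitemaps.org protocol is the namespace and the `<loc>` element.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def links_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


# Hypothetical sitemap content, used only for illustration.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/docs/intro</loc></url>"
    "</urlset>"
)

for link in links_from_sitemap(sample_sitemap):
    print(link)  # each link would be added to the crawl queue
```

In a sitemap index file the `<loc>` entries point to further sitemaps rather than pages; this sketch does not follow them.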

Developer
Maintained by Apify