Website Content Crawler
Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.
I'd like this Apify Actor to accept and use custom sitemaps. We have a use case for this.
For example, take this sitemap:
https://customgpt-streamlit.s3.amazonaws.com/customgpt-streamlit/f1357fcd-a8cb-4498-a641-ec541a999fbc.xml
The crawl fails on it. I've tried multiple Actors and different settings, such as turning the sitemaps option on and off, and it still doesn't work.
These are the runs:
RUN 1: https://console.apify.com/actors/runs/BVzk3hXVg4bC8HNh2
RUN 2: https://console.apify.com/actors/runs/KFJmq9MqQKPRMHn2c
RUN 3: https://console.apify.com/actors/runs/EBjTGlHXS2CDIzzW1
RUN 4: https://console.apify.com/actors/runs/MLrPsygiOhTnyaKnZ
Can you please look into this?
+1
Hello and thank you for your interest in this Actor!
We're sorry about the inconvenience caused by this issue. Currently, WCC only supports automatic scanning for sitemaps (by reading the robots.txt on a given domain and downloading the sitemap automatically).
Passing the actual sitemap file as a start URL causes issues, as the start URLs are only expected to be valid web pages.
Our team will look into this use case and we'll try to come up with a way of supporting this.
Thank you for your patience. I'll get back to you once we figure out the best way to go about this. Cheers!
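For readers unfamiliar with the automatic discovery mentioned in the reply above, here is a minimal Python sketch of the general idea: read the Sitemap entries from a robots.txt file, then collect the page URLs listed in a sitemap. This is not WCC's actual implementation. The function names, the example.com URLs, and the sample inputs are all hypothetical, and the sketch works on text that has already been downloaded rather than fetching anything itself.

```python
import xml.etree.ElementTree as ET
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments (a simplification: this would also cut a URL
        # that contains '#').
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the colon in 'https://' survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def locations_from_sitemap(sitemap_xml: str) -> List[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    locations = []
    for element in root.iter():
        # Tags are namespaced, e.g. '{http://www.sitemaps.org/...}loc'.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            locations.append(element.text.strip())
    return locations


# Hypothetical sample inputs, standing in for downloaded files.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap.xml
"""

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>
"""

if __name__ == "__main__":
    print(sitemap_urls_from_robots(SAMPLE_ROBOTS))
    print(locations_from_sitemap(SAMPLE_SITEMAP))
```

A sitemap hosted somewhere other than the crawled domain, like the S3-hosted file in the original report, is never reached through this robots.txt route, which is why passing it directly is a separate use case.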
Thank you for the response. Please try to get this done, as this is one of our major use cases. We hope to hear good news from you!
Hey! Following up from above, we've found that this should have worked (passing in a sitemap XML with the Use sitemaps option turned on), and the fact that it doesn't is a bug. We will follow up once we release the bug fix for this! :D
Thanks a bunch for reporting this!
Thank you for the update. Looking forward to good news soon!
- 2k monthly users
- 99.9% runs succeeded
- 2.9 days response time
- Created in Mar 2023
- Modified 3 days ago