Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.

Crawling does not work for custom sitemaps

Open

sai_sampath opened this issue
a month ago

I want Apify to accept and use custom sitemaps; we have a use case for this. For example, take this sitemap: https://customgpt-streamlit.s3.amazonaws.com/customgpt-streamlit/f1357fcd-a8cb-4498-a641-ec541a999fbc.xml

The crawl fails on it. I've tried multiple Actors and different settings, such as turning the sitemaps option on and off, and it still doesn't work.

These are the runs:

RUN 1: https://console.apify.com/actors/runs/BVzk3hXVg4bC8HNh2

RUN 2: https://console.apify.com/actors/runs/KFJmq9MqQKPRMHn2c

RUN 3: https://console.apify.com/actors/runs/EBjTGlHXS2CDIzzW1

RUN 4: https://console.apify.com/actors/runs/MLrPsygiOhTnyaKnZ

Can you please look into this?

unified_antechamber

a month ago

+1

User avatar

Hello and thank you for your interest in this Actor!

We're sorry about the inconvenience caused by this issue. Currently, Website Content Crawler (WCC) only supports automatic scanning for sitemaps: it reads the robots.txt on a given domain and downloads the sitemaps listed there. Passing the sitemap file itself as a start URL causes issues, because start URLs are expected to be valid web pages.

Our team will look into this use case and we'll try to come up with a way of supporting this.

Thank you for your patience. I'll get back to you once we figure out the best way to go about this. Cheers!
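
For readers unfamiliar with sitemap auto-discovery, here is a minimal Python sketch of the general technique mentioned above: reading a robots.txt body and collecting the URLs on its `Sitemap:` lines. It is not the Actor's actual code. The function name `sitemap_urls_from_robots` and the sample robots.txt content are hypothetical and only illustrate the idea.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on `Sitemap:` lines of a robots.txt body.

    Hypothetical helper for illustration; not part of any Apify API.
    """
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the first colon so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Made-up robots.txt content on a placeholder domain.
sample_robots = (
    "User-agent: *\n"
    "Disallow: /private\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
discovered = sitemap_urls_from_robots(sample_robots)
```

A sitemap hosted at an arbitrary URL, such as the S3 file in the original report, is not referenced from the crawled domain's robots.txt, so discovery along these lines would never find it.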

sai_sampath

a month ago

Thank you for the response. Please try to get this done, as this is one of our major use cases. Hope to hear good news from you!

User avatar

Hey! Following up on the above: we've determined that this should have worked (passing in a sitemap XML with the Use sitemaps option turned on), and the fact that it doesn't is a bug. We will follow up once we release the bug fix for this! :D

Thanks a bunch for reporting this!
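
Until the fix is released, one possible workaround is to expand the sitemap yourself and pass the page URLs as start URLs, since start URLs are expected to be ordinary web pages. The Python sketch below assumes a standard `urlset` sitemap. The helper names are hypothetical, and the `startUrls` field with its `{"url": ...}` item shape is an assumption about the Actor's input that this thread does not confirm, so check it against the Actor's input schema.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def page_urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract page URLs from the <url><loc> entries of a urlset sitemap."""
    root = ET.fromstring(xml_text)
    urls = []
    # Only <url> entries are read; <sitemap> entries of a sitemap index are ignored.
    for url_el in root.iter(f"{SITEMAP_NS}url"):
        loc = url_el.find(f"{SITEMAP_NS}loc")
        if loc is not None and loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def build_run_input(page_urls: list[str]) -> dict:
    """Build a run input dict. The `startUrls` shape is an assumption."""
    return {"startUrls": [{"url": u} for u in page_urls]}


# Made-up sitemap content on a placeholder domain.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)
run_input = build_run_input(page_urls_from_sitemap(sample_sitemap))
```

The resulting dictionary could then be supplied as the run input through the Apify Console or an API client.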

sai_sampath

a month ago

Thank you for the update. Looking forward to good news soon!

Developer
Maintained by Apify
Actor metrics
  • 2.1k monthly users
  • 99.8% runs succeeded
  • 3 days response time
  • Created in Mar 2023
  • Modified 1 day ago