Website Content Crawler

Pricing

Pay per usage

Developed by Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.9 (41)

Total users: 60K
Monthly users: 7.7K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 5 days ago

Crawling does not work for custom sitemaps

Closed

sai_sampath opened this issue
a year ago

I want the Apify Actor to accept and use custom sitemaps. We have a use case for this. For example, take this sitemap: https://customgpt-streamlit.s3.amazonaws.com/customgpt-streamlit/f1357fcd-a8cb-4498-a641-ec541a999fbc.xml

The crawl fails on it. I've tried multiple Actors and different settings, such as turning the sitemaps option on and off, and it still doesn't work.

These are the runs:

RUN 1: https://console.apify.com/actors/runs/BVzk3hXVg4bC8HNh2

RUN 2: https://console.apify.com/actors/runs/KFJmq9MqQKPRMHn2c

RUN 3: https://console.apify.com/actors/runs/EBjTGlHXS2CDIzzW1

RUN 4: https://console.apify.com/actors/runs/MLrPsygiOhTnyaKnZ

Can you please look into this?

unified_antechamber

a year ago

+1

jindrich.bar

Hello and thank you for your interest in this Actor!

We're sorry about the inconvenience caused by this issue. Currently, Website Content Crawler (WCC) only supports automatic sitemap discovery: it reads the robots.txt on a given domain and downloads the sitemaps listed there. Passing the sitemap file itself as a start URL causes issues, because start URLs are expected to be valid web pages.

Our team will look into this use case and we'll try to come up with a way of supporting this.

Thank you for your patience. I'll get back to you once we figure out the best way to go about this. Cheers!
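
For readers unfamiliar with the robots.txt-based discovery described in this reply, here is a minimal sketch of the general idea using only the Python standard library. It is not the Actor's actual implementation; the function name `sitemaps_from_robots` and the sample robots.txt content are hypothetical and used for illustration only.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no "Sitemap:" lines were found.
    return parser.site_maps() or []


# Hypothetical robots.txt content, for illustration only.
sample = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml
"""

print(sitemaps_from_robots(sample))
```

A site whose robots.txt does not list a sitemap yields an empty list here, which is why a custom sitemap hosted elsewhere (as in this issue) cannot be found by this kind of discovery alone.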

sai_sampath

a year ago

Thank you for the response. Please try to get this done, as this is one of our major use cases. I hope to hear good news from you!

vladfrangu_apify

Hey! Following up from above, we've determined that this should have worked (passing in a sitemap XML with the Use sitemaps option turned on), and the fact that it doesn't is a bug. We will follow up once we release the bug fix for this! :D

Thanks a bunch for reporting this!
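
As a rough illustration of the setup described in this reply (a sitemap XML as the start URL, with the Use sitemaps option on), the run input might be built as below. This is a sketch under assumptions: the field names `startUrls` and `useSitemaps` should be checked against the Actor's current input schema, and the helper `build_run_input` and the example URL are hypothetical.

```python
import json


def build_run_input(sitemap_url: str) -> dict:
    """Build a run input that points the crawler at a sitemap file."""
    return {
        # Assumed field name: the list of start URLs, one object per URL.
        "startUrls": [{"url": sitemap_url}],
        # Assumed field name for the "Use sitemaps" option mentioned above.
        "useSitemaps": True,
    }


# Hypothetical sitemap URL, for illustration only.
run_input = build_run_input("https://example.com/custom-sitemap.xml")
print(json.dumps(run_input, indent=2))
```

The resulting JSON can be pasted into the Actor's input editor or passed to an API client of your choice.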

sai_sampath

a year ago

Thank you for the update. Looking forward to good news soon!

jindrich.bar

We've finally released a patch that fixes this issue. When processing a URL leading to a sitemap, the crawler now processes the sitemap as a list of links to enqueue.

I'll close this issue now, but make sure to let us know if anything breaks for you.

Thank you for your patience!
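
To make "processes the sitemap as a list of links to enqueue" concrete, here is a minimal sketch of extracting the links from a sitemap with the Python standard library. It only illustrates the general technique and is not the crawler's actual code; the function name `links_from_sitemap` and the sample sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, in ElementTree's {uri}tag form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def links_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    links = []
    # iter() walks the whole tree, so <loc> entries nested in <url> are found.
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            links.append(loc.text.strip())
    return links


# Hypothetical sitemap content, for illustration only.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/getting-started</loc></url>
</urlset>"""

for link in links_from_sitemap(sample_sitemap):
    print("enqueue:", link)
```

A real crawler would then add each returned URL to its request queue instead of printing it.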