Website Content Crawler
No credit card required
Website Content Crawler
No credit card required
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Do you want to learn more about this Actor?
Get a demoI'm raising this issue on behalf of @axstv.
Fetching sitemaps is taking around 15 minutes in total and this is occurring across different runs. While disabling sitemap use is a workaround, we need to investigate the cause.
We have a PR ready to help reduce latency.
If you only need to scrape results from the startUrls
, set max depth to 0 and turn off Consider URLs from sitemap
.
This way, scraping will be faster, and you’ll get data only from the pages specified in startUrls.
We have made changes to sitemap fetching and tested them. It is way faster now. This update will be shipped in the next release. I’ll keep this issue open until the release is out.
Today, we released a new version of WCC (0.3.54
), which has more aggressive timeouts for sitemap loading and internal crawler operations. This ensures faster crawling progression, even in the case of a slow or misconfigured server.
We checked that the patches in 0.3.54
improve the performance of the Actor for your runs, too. Start using the new version in the Run options
section of the Actor input, either either by pinning the version number or by switching to the latest
build tag.
I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!
- 3.8k monthly users
- 636 stars
- 100.0% runs succeeded
- 2.7 days response time
- Created in Mar 2023
- Modified 7 days ago