
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
4.5 (39)
1357
Total users: 51.5k
Monthly users: 7.4k
Runs succeeded: >99%
Issues response: 6.4 days
Last modified: 4 days ago
Invalid Input URL error causing runs to fail
Closed
Entire runs fail and never start because of an invalid URL error (when using multiple URLs as input).
Issue 1: I could not identify the invalid URLs after checking the structure of the URLs. (What makes a URL invalid in the first place, as opposed to one that simply fails to scrape?)
Issue 2: Even if some URLs fail for some reason, the run should continue and scrape the valid URLs instead of failing completely.
Thank you!
Hello and thank you for your interest in this Actor!
Looking at your list of URLs, it seems that you do have some invalid URLs in there (e.g. https://v or http://How bad is the global chip shortage problem). But as you correctly pointed out, the crawler should probably just skip those, not fail immediately.
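For anyone who wants to pre-filter a URL list before submitting it, here is a minimal sketch in Python. It is not the Actor's own validation logic; the function names is_valid_start_url and split_urls are hypothetical, and the rule that a hostname must contain a dot is an assumed heuristic (it would also reject hosts such as localhost).

```python
from urllib.parse import urlsplit


def is_valid_start_url(url: str) -> bool:
    """Heuristic check that `url` looks like a crawlable http(s) URL.

    Assumption: this is an illustrative client-side filter, not the
    validation the Actor itself performs.
    """
    # A URL cannot contain raw whitespace, so search phrases pasted
    # after "http://" are rejected here.
    if not url or any(ch.isspace() for ch in url):
        return False
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for malformed input such as "http://["
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # Assumed heuristic: require a dotted hostname (rejects "https://v").
    return bool(host) and "." in host


def split_urls(urls):
    """Partition `urls` into (valid, invalid) lists, preserving order."""
    valid, invalid = [], []
    for url in urls:
        (valid if is_valid_start_url(url) else invalid).append(url)
    return valid, invalid


if __name__ == "__main__":
    candidates = [
        "https://example.com/article",
        "http://How bad is the global chip shortage problem",
        "https://v",
    ]
    valid, invalid = split_urls(candidates)
    print("valid:", valid)
    print("invalid:", invalid)
```

Returning the rejected entries alongside the valid ones makes it easy to see which inputs were dropped and why a run would otherwise have been refused.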
I patched the Actor and released a new beta version, where this is fixed (so you can re-run the Actor with your input). To switch to the beta version, simply scroll down in the Input tab and select Run options > Build > beta.
I'll keep this issue open until we release a new latest version, once we have tested everything properly. Until then, you can let us know how your crawl went and whether you have encountered any other problems with this Actor.
Thanks again!
Hello again - just letting you know, I've just released a new latest version, 0.3.31, where this problem is fixed.
Closing this issue for now, but as always, let us know in case of any problems. Thank you!