Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.5 (39)
Pricing: Pay per usage

Total users: 51.5k
Monthly users: 7.4k
Runs succeeded: >99%
Issues response: 6.4 days
Last modified: 4 days ago
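
For anyone calling the crawler programmatically, here is a minimal sketch using the official apify-client Python package; the `startUrls` input field and the `url`/`text` fields of the output items follow the Actor's documented schema, while the token and start URL are placeholders:

```python
from apify_client import ApifyClient

# Authenticate with an Apify API token (placeholder value).
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com"}]},
)

# Crawled pages land in the run's default dataset, one item per page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item.get("text", "")[:200])
```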

Invalid Input URL error causing runs to fail

Closed

motivated_leaflet opened this issue a year ago

Entire runs fail and never start due to an invalid URL error. (I'm using multiple URLs as input.)

Issue 1: I could not identify any invalid URLs after checking their structure. What makes a URL invalid in the first place, as opposed to one that simply fails to scrape?

Issue 2: Even if some URLs are invalid for some reason, the run should continue and scrape the valid ones instead of failing completely.

Thank you!
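
One caller-side workaround for this class of failure is to pre-filter the input list before submitting it to the Actor. Below is a minimal sketch in Python, assuming the inputs are plain URL strings; the hostname pattern is only a heuristic (it also rejects single-label hosts such as localhost), not the Actor's actual validation logic:

```python
import re
from urllib.parse import urlparse

# Rough hostname check: dot-separated labels of letters, digits, and hyphens.
HOSTNAME_RE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def is_valid_url(url: str) -> bool:
    """Return True for http(s) URLs whose host looks like a real hostname."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname or ""
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(HOSTNAME_RE.match(host))

urls = [
    "https://docs.apify.com",                              # kept
    "https://v",                                           # dropped: single-label host
    "http://How bad is the global chip shortage problem",  # dropped: not a hostname
]

start_urls = [{"url": u} for u in urls if is_valid_url(u)]
print(start_urls)  # [{'url': 'https://docs.apify.com'}]
```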

jindrich.bar

Hello and thank you for your interest in this Actor!

Looking at your list of URLs, it seems that you indeed have some invalid URLs in there (e.g. https://v or http://How bad is the global chip shortage problem). But as you correctly pointed out, the crawler should probably just skip those, not fail immediately.

I patched the Actor and released a new beta version where this is fixed, so you can re-run the Actor with your input. To switch to the beta version, scroll down in the Input tab and select Run options > Build > beta. I'll keep this issue open until we test everything properly and release a new latest version; until then, let us know how your crawl went and whether you encountered any other problems with this Actor.

Thanks again!
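
The same build switch can be made when starting the Actor via the API. A minimal sketch with the Apify Python client, whose `.call()` method accepts a `build` parameter; the token and start URL are placeholders:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token

# build="beta" is equivalent to Run options > Build > beta in the UI.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com"}]},
    build="beta",
)
```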

jindrich.bar

Hello again! Just letting you know that I've released a new latest version, 0.3.31, where this problem is fixed.

Closing this issue for now, but as always, let us know in case of any problems. Thank you!