Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
even though the site is live and we use residential proxies, the scraper failed
Hello, I see the issue. In the HTML processing, you should change the "HTML transformer" to the default, which is "Readable text." I will close the ticket. If you need anything else, let us know.
I'm looking at this configuration, but I wonder what is different about this site? I've run thousands of scraping jobs with the Actor using the same configuration.
OK, the configuration we chose is correct. We use "HTML transformer": null, for which the documentation says:
None - Only removes the HTML elements specified via 'Remove HTML elements' option.
That is what we want: to specify the elements to omit. And again, it worked for thousands of sites we have already run with the Actor.
Can you please check why this specific configuration is failing for this site?
After further investigation, we discovered that the issue was not in the HTML transformer but in the timeout. Timeouts can occur for multiple reasons, such as slow content rendering, server overload, or residential proxies being slower. If you want to keep the same input setup, our recommendation is to increase 'requestTimeoutSecs'.
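For readers who want to see what that recommendation might look like in practice, here is a minimal sketch in Python that builds a run input with a longer request timeout. Only 'requestTimeoutSecs' is named in this thread. The other field names and values (`startUrls`, `htmlTransformer` set to `"none"`, `removeElementsCssSelector`, `proxyConfiguration`), the helper `build_crawler_input`, and the 120-second figure are assumptions made for illustration and should be checked against the Actor's input schema.

```python
import json


def build_crawler_input(start_url, request_timeout_secs=120):
    """Build an illustrative run input that keeps the same setup
    but raises the per-request timeout (hypothetical helper)."""
    if request_timeout_secs <= 0:
        raise ValueError("request_timeout_secs must be positive")
    return {
        # Assumed field name for the list of pages to crawl.
        "startUrls": [{"url": start_url}],
        # Assumed spelling of the "None" HTML transformer option.
        "htmlTransformer": "none",
        # Assumed field for the 'Remove HTML elements' option; selectors are examples.
        "removeElementsCssSelector": "nav, footer, script, style",
        # Assumed shape of the residential proxy setting.
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
        # The setting recommended in the reply above; 120 is an arbitrary example.
        "requestTimeoutSecs": request_timeout_secs,
    }


if __name__ == "__main__":
    # Print the input as JSON so it can be pasted into the Actor's input editor.
    print(json.dumps(build_crawler_input("https://example.com"), indent=2))
```

A reasonable approach is to raise the timeout in steps and re-run the failing URL, since a larger value also makes genuinely dead pages take longer to fail.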
Worked. Thanks, closing the issue.
Actor Metrics
4k monthly users
839 stars
>99% runs succeeded
1 day response time
Created in Mar 2023
Modified 18 hours ago