Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Hi, my run didn't work

Closed

ballerine opened this issue
4 months ago

Even though the site is live and we use residential proxies, the scraper failed.

Oscardz

Hello, I see the issue. In the HTML processing, you should change the "HTML transformer" to the default, which is "Readable text." I will close the ticket. If you need anything else, let us know.

ballerine

4 months ago

I'm looking for this configuration, but I wonder what is different about this site. I've run thousands of scraping jobs with the Actor using the same configuration.

ballerine

4 months ago

OK, the configuration we chose is correct. We use "HTML transformer": null, and the documentation says:

None - Only removes the HTML elements specified via 'Remove HTML elements' option.

That is what we want: we specify which elements to omit. And again, it has worked for thousands of sites we have already run with the Actor.

Can you please check why this specific configuration is failing for this site?

Oscardz

After further investigation, we found that the issue was not the HTML transformer but a timeout. Timeouts can have several causes, such as slow content rendering, server overload, or slower residential proxies. If you want to keep the same input setup, we recommend increasing 'requestTimeoutSecs'.
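
For readers hitting the same timeout, here is a minimal sketch of an Actor input that keeps the setup described in this thread and raises 'requestTimeoutSecs'. Only 'requestTimeoutSecs' is named in the thread. The other field names ("startUrls", "htmlTransformer", "removeElementsCssSelector", "proxyConfiguration"), the selector list, and the value 120 are assumptions for illustration and should be checked against the Actor's input schema. The helper `build_run_input` is hypothetical.

```python
import json


def build_run_input(start_url, request_timeout_secs=120,
                    remove_selectors="nav, footer, script, style"):
    """Build an input dict with a raised request timeout.

    Field names other than requestTimeoutSecs are assumptions; verify them
    against the Actor's input schema before use.
    """
    if request_timeout_secs <= 0:
        raise ValueError("request_timeout_secs must be positive")
    return {
        "startUrls": [{"url": start_url}],
        # Assumed spelling of the "None" HTML transformer option.
        "htmlTransformer": "none",
        # Assumed field behind the 'Remove HTML elements' option.
        "removeElementsCssSelector": remove_selectors,
        # The setting recommended in the reply above; 120 is an example value.
        "requestTimeoutSecs": request_timeout_secs,
        # Assumed shape of a residential proxy configuration.
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }


if __name__ == "__main__":
    # Print the input so it can be inspected or pasted into the Actor's JSON editor.
    print(json.dumps(build_run_input("https://example.com"), indent=2))
```

With the `apify-client` Python package, a dict like this can be passed as `run_input` to `ApifyClient(token).actor("apify/website-content-crawler").call(...)`. A sensible approach is to raise the timeout in steps only for the sites that fail, since a larger value also makes genuinely dead pages take longer to give up on.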

ballerine

4 months ago

Worked. Thanks, closing the issue.

Developer
Maintained by Apify

Actor Metrics

  • 4k monthly users

  • 839 stars

  • >99% runs succeeded

  • 1 day response time

  • Created in Mar 2023

  • Modified 18 hours ago