Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Seeing many `NS_ERROR_X` errors for a crawl

Closed

embrace_ai opened this issue a year ago

I've tried a couple of configurations for crawling a particular website, but it fails to process the first page with a variety of errors such as `NS_ERROR_NET_INTERRUPT`, `NS_ERROR_PROXY_CONNECTION_REFUSED`, and `NS_ERROR_NET_TIMEOUT`.

I've tried using residential proxies, the playwright:firefox crawler, and increased the memory and timeout parameters for my task but have had no success.

The information provided by enabling `debugLog` and `debugMode` hasn't helped me figure out the cause.

Thank you!

jindrich.bar

Hello and thank you for your interest in this Actor!

The page you are trying to scrape seems to be misconfigured - while it doesn't respond on https://bmc.com (the connection never goes through and proxy / browser eventually times out), it does respond on https://www.bmc.com.

A quick look at the DNS records shows that the apex domain (without the www) points to a different address than the www domain. The apex domain server doesn't seem to listen on port 443 - this is why the https://bmc.com request never completes. Interestingly, the apex domain server does listen on port 80 and correctly redirects you to the www URL - i.e. going to http://bmc.com (http, not https) works too. This definitely feels broken and probably hurts their SEO performance - if you happen to know the admin of that website, it's something they should look into.
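If you want to run the same kind of check yourself, here is a minimal Python sketch using only the standard library. It resolves each host name and tries a plain TCP connection on ports 80 and 443. The function names (`resolve_ipv4`, `port_is_open`, `diagnose`) and the 5-second timeout are made up for this illustration; this is not what the Actor does internally.

```python
import socket


def resolve_ipv4(host):
    """Return the sorted IPv4 addresses the host name resolves to."""
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False


def diagnose(hosts, ports=(80, 443)):
    """Collect resolved addresses and per-port reachability for each host."""
    report = {}
    for host in hosts:
        report[host] = {
            "addresses": resolve_ipv4(host),
            "ports": {port: port_is_open(host, port) for port in ports},
        }
    return report


# Example call (needs network access, so it is left commented out):
# print(diagnose(["bmc.com", "www.bmc.com"]))
```

If the two hosts report different addresses and the apex host shows port 443 as closed while port 80 is open, that matches the situation described above. Note that an open port only means the TCP handshake succeeded; it says nothing about TLS certificates or redirects.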

You can easily fix your scraper run by replacing the https://bmc.com start URL with https://www.bmc.com. You can reset the rest of the settings to their defaults (mostly the proxies: RESIDENTIAL proxies are more expensive, and you can easily scrape this website with DATACENTER ones). Check out my example run here - you can see that WCC indeed scrapes the contents of the website.
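As a rough illustration of that fix, the sketch below rewrites an apex start URL to its www form and builds a minimal run input. The helper names `with_www` and `build_run_input` are hypothetical. The `startUrls` and `proxyConfiguration` field names are assumed from typical Apify Actor inputs, so check them against the Actor's input schema before relying on them.

```python
from urllib.parse import urlsplit, urlunsplit


def with_www(url):
    """Prefix the host with 'www.' unless it already has it.

    Naive on purpose: it assumes the URL points at an apex domain and
    drops any user:password part of the URL.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or host.startswith("www."):
        return url
    netloc = "www." + host
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


def build_run_input(start_url):
    """Build a minimal run input, leaving every other setting at its default."""
    return {
        # Assumed field name: list of start URL objects.
        "startUrls": [{"url": with_www(start_url)}],
        # Assumed field name: use Apify Proxy without naming a proxy group,
        # i.e. do not request RESIDENTIAL proxies explicitly.
        "proxyConfiguration": {"useApifyProxy": True},
    }


run_input = build_run_input("https://bmc.com")
print(run_input["startUrls"])
```

The resulting dictionary could then be pasted into the Actor's JSON input editor or passed to an API client; that step is left out here because it needs an Apify account and token.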

Thank you again for bringing this up!

Pricing

This Actor is paid per platform usage: the Actor itself is free to use, and you only pay for the Apify platform usage.