
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Seeing many `NS_ERROR_X` errors for a crawl
Closed
I've tried a couple of configurations for crawling a particular website, and it fails to process the first page due to a variety of errors such as `NS_ERROR_NET_INTERRUPT`, `NS_ERROR_PROXY_CONNECTION_REFUSED`, and `NS_ERROR_NET_TIMEOUT`.
I've tried using residential proxies and the `playwright:firefox` crawler, and I've increased the memory and timeout parameters for my task, but have had no success.
The information provided by enabling `debugLog` and `debugMode` hasn't helped me figure out the cause.
Thank you!
Hello and thank you for your interest in this Actor!
The page you are trying to scrape seems to be misconfigured: while it doesn't respond on https://bmc.com (the connection never goes through and the proxy or browser eventually times out), it does respond on https://www.bmc.com.
A quick look into the DNS records shows that the apex domain (without the `www`) points to a different address than the `www` domain. The apex domain server doesn't seem to listen on port `443`, which is why the https://bmc.com request never completes. Interestingly, the apex domain server does listen on port `80` and correctly redirects you to the `www` URL, i.e. going to http://bmc.com (`http`, not `https`) works too. This definitely feels broken and probably hurts their SEO performance. If you happen to know the admin of that website, it's something they should look into.
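If you want to reproduce this kind of check yourself, the following is a minimal sketch using only the Python standard library. It compares the addresses a hostname resolves to and tests whether ports 80 and 443 accept TCP connections. The function names (`resolve_ips`, `port_open`, `diagnose`) are hypothetical helpers made up for this illustration, not part of the Actor or of any Apify tooling.

```python
import socket


def resolve_ips(hostname: str) -> set[str]:
    """Return the set of IP addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def port_open(hostname: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to hostname:port succeeds within the timeout."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def diagnose(hostnames: list[str]) -> dict[str, dict]:
    """Collect resolved IPs and port 80/443 reachability for each hostname."""
    report = {}
    for host in hostnames:
        report[host] = {
            "ips": sorted(resolve_ips(host)),
            "port_80": port_open(host, 80),
            "port_443": port_open(host, 443),
        }
    return report
```

Calling `diagnose(["bmc.com", "www.bmc.com"])` would let you compare the two hostnames side by side. A host that resolves but reports `port_443` as `False` matches the behavior described above, although results depend on where the check is run from.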
You can easily fix your scraper run by replacing the https://bmc.com start URL with https://www.bmc.com. You can reset the rest of the settings to their defaults (mostly the proxies: RESIDENTIAL proxies are more expensive, and you can easily scrape this website with DATACENTER ones).
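As a rough sketch of that fix in code, the snippet below rewrites a start URL to its `www` variant and builds a minimal run input that leaves everything else at its defaults. Both `with_www` and `build_run_input` are hypothetical helpers, and the input field names (`startUrls`, `proxyConfiguration`, `useApifyProxy`) are assumptions about the Actor's input schema, so verify them against the Actor's input documentation before relying on them.

```python
from urllib.parse import urlsplit, urlunsplit


def with_www(url: str) -> str:
    """Prefix the hostname with 'www.' unless it already has that prefix.

    Assumption: the caller only passes apex-domain URLs whose 'www' variant
    is known to work. Any userinfo in the URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or host.startswith("www."):
        return url
    netloc = "www." + host
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def build_run_input(start_url: str) -> dict:
    """Build a minimal run input (field names are assumed, not confirmed here)."""
    return {
        "startUrls": [{"url": with_www(start_url)}],
        # No proxy groups listed, so the platform's default proxy selection
        # applies instead of the more expensive RESIDENTIAL group.
        "proxyConfiguration": {"useApifyProxy": True},
    }
```

For example, `build_run_input("https://bmc.com")` produces an input whose only start URL is `https://www.bmc.com`.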
Check out my example run here - you can see that WCC indeed scrapes the contents of the website.
Thank you again for bringing this up!
Pricing
Pricing model
Pay per usage
This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.