Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
I am calling the Actor via the low-code platform n8n. I am setting the memory to 4096 MB, both in n8n and in the Apify Console itself, but the Actor still uses more than 8 GB of RAM, which kills the run, and I am prompted to upgrade my account. When I run the Actor directly from Apify with the memory set to 4096 MB, it works fine and uses only 4096 MB. How can I get the Actor to limit its memory to 4096 MB when calling it via the API and n8n, so that I don't hit these limits?
This is the message I get from Apify in n8n.
"Payment required - perhaps check your payment details? [item 0] By launching this job you will exceed the memory limit of 8192MB for all your Actor runs and builds (currently used: 8192MB, requested: 4096MB). Please consider an upgrade to a paid plan at https://console.apify.com/billing/subscription to increase your Actor memory limit."
Hi, thank you for using Website Content Crawler!
First, you should be able to run Website Content Crawler with 8GB, even as a free user.
Could it be that you are running multiple Actors in parallel, causing the memory quota to be exceeded?
If you want to limit the memory via an API call, you can specify it in the request body using the memory parameter.
Here’s an example:
curl -X POST https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?token=APIFY_API_TOKEN \
  -d '{"memory":4096, "startUrls":[{"url":"https://example.com"}]}' \
  -H "Content-Type: application/json"
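The run endpoints also accept memory as a query parameter, so a variant of the same call worth trying (the token is again a placeholder) is:

curl -X POST "https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?token=APIFY_API_TOKEN&memory=4096" \
  -d '{"startUrls":[{"url":"https://example.com"}]}' \
  -H "Content-Type: application/json"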
Hope this helps. Jiri
Thanks for the reply and suggestion. I did try the suggested endpoint, but I still get the same result.
My guess is that Apify is launching multiple Actors, maybe one for each of the websites being sent in via the API. Any suggestions on how I can get Apify and the Actor to process the crawls sequentially instead?
Hi,
If you are making multiple calls, each call will start a new Actor run.
If you need to call the Actor sequentially, you’ll need to orchestrate the calls one by one.
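For example, here is a minimal sketch of sequential orchestration from a shell (assuming Bash; urls.txt is a hypothetical file with one start URL per line, and APIFY_API_TOKEN is a placeholder). Because the run-sync endpoint only returns once the run has finished, each crawl completes before the next one starts:

# urls.txt (hypothetical): one start URL per line
while read -r url; do
  curl -X POST "https://api.apify.com/v2/acts/apify~website-content-crawler/run-sync-get-dataset-items?token=APIFY_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"startUrls\": [{\"url\": \"$url\"}]}"
done < urls.txt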
Alternatively, you can provide a list of startURLs like this:
1"startUrls": [ 2 {"url": "https://example.com"}, 3 {"url": "https://apify.com"} 4]
However, in this case, the output will contain results from different domains mixed together.
It also depends on whether you want to crawl multiple pages or just scrape a single URL.
If you can share an example call, I’ll be able to help you more specifically.
I am using the low-code tool n8n (similar to make.com) to make these calls, so I don't think I can share the call with you. If you are familiar with n8n, I can share the workflow JSON with you?
n8n uses nodes and has a pre-built node for working with Apify. I am feeding 10 items (website URLs) into the Apify node. Based on what you have shared with me, I think 10 Actors are being spun up at the same time. When I take this into production, I may have hundreds of websites being fed in.
I will see if I can figure out a way to send only one request at a time.
"It also depends on whether you want to crawl multiple pages or just scrape a single URL." I am wanting to scrape each one of the websites, but in a way that keeps the required memory below 8 gigs.
Hi, thank you for sharing the details!
I understand the issue now. My apologies—earlier I overlooked the fact that you are running the code from n8n.
It would be great if you could share the workflow with us. My colleague, @dusan.vystrcil, uses n8n daily with Apify and might be able to assist you.
"It also depends on whether you want to crawl multiple pages or just scrape a single URL."
I want to scrape each one of the websites, but in a way that keeps the required memory below 8 GB.
If you only need to scrape the URLs without crawling further, you can add all the URLs into startUrls and set the maximum crawling depth to 0. This way, you'll get the content of each URL without consuming additional memory.
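For illustration, the Actor input could then look something like this (the URLs are placeholders):

{
  "startUrls": [
    { "url": "https://example.com" },
    { "url": "https://example.org" }
  ],
  "maxCrawlDepth": 0
}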
Attached is the JSON copy of the n8n workflow. I am using a community node to access Apify, so you will need to install that first. I was using HTTP nodes to access Apify, but this community node does a neater job. The community node is called n8n-nodes-apify.
Hi,
here's the workflow which works for me.
I transformed the data into a list and also added maxCrawlDepth = 0.
Feel free to test it out and let us know if something's missing.
Thanks - on testing I am getting this error from the Apify node:
{ "error": { "type": "invalid-input", "message": "Input is not valid: Field input.startUrls must be array" } }
PS Thanks so much for taking the time to help me out.
I fixed the JSON
{
  "startUrls": [
    {{ $json.data.map(item => `{"url": "${item.url}"}`).join(',') }}
  ],
  "maxCrawlDepth": 0
}
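An alternative that might be less error-prone (an untested sketch, assuming the incoming items sit under $json.data and each has a url property) is to let JSON.stringify build the whole array instead of concatenating strings by hand:

{
  "startUrls": {{ JSON.stringify($json.data.map(item => ({ "url": item.url }))) }},
  "maxCrawlDepth": 0
}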
So now data passes to Apify, but I am getting this error:
{ "errorMessage": "The connection was aborted, perhaps the server is offline", "errorDetails": { "rawErrorMessage": [ "timeout of 300000ms exceeded", "timeout of 300000ms exceeded" ], "httpCode": "ECONNABORTED" },
The run continues in Apify; it's just the connection with n8n that times out. In the Apify node in n8n I have set the timeout to 0, which should mean it is infinite?
In Apify, when the run completes, I cannot see the dataset - "Cannot load data from dataset. Reason: Network Error".
It seems that there's some issue with this specific page: https://chiropracticcollective.com/
It keeps failing to connect, so the whole run takes well over 6 minutes. If your setup has a lower timeout than the run takes to finish, the n8n server loses the connection to Apify and can't retrieve the data.
Try leaving out that specific page or setting a longer timeout.
I'm not sure about that "Network Error" - I just retrieved your dataset successfully, so maybe just some temporary error.
Thanks, I have removed all timeout limits, both in the Apify Console and in the Apify node, but it is still dropping the connection at exactly 5 minutes every time. There must be a hard-coded time limit somewhere that I can't access.
That seems to be a limit of the n8n platform. See https://community.n8n.io/t/what-is-the-maximum-timeout-for-cloud-3-minutes-or-unlimited/25115
A solution could be to use Apify's integrations: you can set up a webhook that is triggered after a successful run, with a payload containing the ID of the created dataset. In n8n, you then create a Webhook trigger node for it and continue your flow from there.
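As a rough sketch of that setup (ACTOR_ID, the token, and the n8n URL are placeholders, and the payload template should be double-checked against the Apify webhooks documentation), such a webhook could be created via the API like this:

curl -X POST "https://api.apify.com/v2/webhooks?token=APIFY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
    "condition": { "actorId": "ACTOR_ID" },
    "requestUrl": "https://YOUR-N8N-INSTANCE/webhook/apify-run-finished",
    "payloadTemplate": "{\"datasetId\": {{resource.defaultDatasetId}}}"
  }'

The Webhook node in n8n then receives that payload after each successful run, and the dataset items can be fetched in a follow-up request using the dataset ID.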
Hey @Robmobius, in case you've got further questions, let us know. For now I'm closing this issue.