
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 4.6 (38)
Pricing: Pay per usage
Total users: 47.5k
Monthly users: 6.5k
Runs succeeded: >99%
Response time: 4.6 days
Last modified: 5 days ago
Timeout setting does not work
Closed
[Need Help]
I set a timeout, but it does not work. Even with the timeout setting on, the run sometimes keeps going and uses a lot of my balance...
What could I do?
It happens frequently.
Hello and thank you for your interest in this Actor!
Would you mind sharing a specific run where this happens? (Sharing only the Run ID is fine.)
The run linked to this issue has the limits (maxResults and maxCrawlPages) set to default and behaves as expected.
Note that the requestTimeoutSecs option sets the timeout per request (crawling one page), not for the entire Actor run. requestTimeoutSecs mostly protects the Actor from malformed (or too large) pages that take too long to parse and process.
You can also set a run-wide timeout limit in the bottom-most section of the input schema (Run Options > Timeout).
Note that this means that the Apify Platform kills the Actor once it exceeds the given time - this is great for keeping tabs on the Platform usage ($$$), but also might result in incomplete results in your dataset (the Actor might not be able to finish its job).
But again, if you have encountered something that doesn't feel right, please share the Run ID here. Thanks!
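For illustration, a minimal sketch of where the per-request option lives, assuming you build the Actor input in Python (the URL and values below are placeholders, not recommendations): requestTimeoutSecs is part of the Actor input and caps a single page, while the run-wide limit lives outside of it.

# Hypothetical Actor input for Website Content Crawler; values are placeholders.
run_input = {
    'startUrls': [{'url': 'https://example.com'}],  # placeholder start URL
    'requestTimeoutSecs': 60,  # per-page timeout: one slow page is cut off after 60 s
}
# The run-wide hard timeout is NOT part of this input - it is set under
# Run Options > Timeout in the console, or per API call (see below).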
manuel3
Thank you for replying.
As you say, I want it to kill the Actor once it exceeds the time. I set it to 180 seconds, but, for example, this run took 4 hours and I cancelled it.
yxKmXu6URp8v7g7Nt
manuel3
What could I do?
Thank you for the additional information!
Looking at the run you linked, I can see that the (hard) timeout for this run was the default 360,000 seconds. I can also see that you didn't start this run from the web console, but via the API (from a Python script, perhaps?).
Note that the timeout set in the input schema (in the web console) only applies to the run you start from the web. If you want to start a run with a hard timeout from your Python script, you need to pass the timeout option from there (e.g., see the documentation for the ActorClient.start() method in our Python client - you can pass the named argument timeout_secs there).
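A minimal sketch of that call, assuming a placeholder token and start URL - note that timeout_secs sits next to run_input, not inside it:

from apify_client import ApifyClient

client = ApifyClient('MY_API_TOKEN')  # placeholder token

# start() returns without waiting for the run to finish;
# the Platform kills the run once it exceeds 180 seconds.
run = client.actor('apify/website-content-crawler').start(
    run_input={'startUrls': [{'url': 'https://example.com'}]},  # placeholder URL
    timeout_secs=180,  # run-wide hard timeout, passed as a keyword argument
)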
If you are making the API calls yourself in your script, you can pass the query parameter timeout (see the documentation). However, we strongly recommend you use the Apify Client for Python - it provides a much nicer DX.
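And a rough sketch of the raw-API variant, assuming the requests library, a placeholder token, and the v2 "run Actor" endpoint (the Actor ID is written with ~ instead of / in the URL); here timeout is a query parameter in seconds:

import requests

# Start the Actor run directly via the HTTP API; 'timeout' is the hard run limit in seconds.
response = requests.post(
    'https://api.apify.com/v2/acts/apify~website-content-crawler/runs',
    params={'token': 'MY_API_TOKEN', 'timeout': 180},  # placeholder token
    json={'startUrls': [{'url': 'https://example.com'}]},  # Actor input, placeholder URL
)
response.raise_for_status()
print(response.json()['data']['status'])  # a freshly started run reports e.g. 'READY' or 'RUNNING'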
TL;DR: pass the timeout option with every (Actor start) API call you make. Let me know how it went!
manuel3
Thank you for your reply. Following your advice, I wrote the code below with "timeout_secs", but the duration is still uncontrollable...
from apify_client import ApifyClient

apify_client = ApifyClient('MY_API_KEY')

actor_call = apify_client.actor('apify/website-content-crawler').call(run_input={
    'startUrls': [{'url': 'https://www.sakataseed.co.jp/special/korotan/howto/'}],
    'maxRequestsPerCrawl': 1,
    'maxCrawlingDepth': 1,
    'timeout_secs': 30
})

dataset_items = apify_client.dataset(actor_call['defaultDatasetId']).list_items().items

for item in dataset_items:
    print(item['url'])
    print(item['text'])
    print('---')
manuel3
I will attach the code file here, too.
Hello again!
Note that timeout_secs is not a part of the Actor input (run_input); it's a separate keyword argument to the call method (see the docs). The following code should work as expected:
from apify_client import ApifyClient

apify_client = ApifyClient('MY-API-KEY')

actor_call = apify_client.actor('apify/website-content-crawler').call(
    run_input={
        'startUrls': [{
            'url': 'https://www.sakataseed.co.jp/special/korotan/howto/'
        }],
        'maxRequestsPerCrawl': 1,
        'maxCrawlingDepth': 1,
    },
    timeout_secs=30  # timeout_secs is a separate keyword argument
)
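As a quick way to verify the limit actually applies (a small sketch, assuming the run object returned by call() carries the usual run fields), you can inspect the finished run after the call returns - a run killed by the hard timeout should end with a TIMED-OUT status instead of SUCCEEDED:

# Inspect the finished run returned by call() above.
print(actor_call['status'])  # e.g. 'SUCCEEDED', or 'TIMED-OUT' when the hard timeout kicked in
print(actor_call['startedAt'], actor_call['finishedAt'])  # rough wall-clock duration of the run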
Once again, thank you for your patience! This kind of feedback is very important to us - we'll look at what we can do regarding the documentation - we'd love to make it more approachable. Thanks again!