Crash detection #20
The easy solution is to always set a timeout in each mirai() call.
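For illustration, a per-task timeout can be supplied through mirai()'s .timeout argument (a minimal sketch; the task body and the 60-second value are placeholders, and the argument is understood to take milliseconds in current mirai releases):

```r
library(mirai)

# Placeholder workload: give up after 60 seconds (60000 ms).
# If the connection dies or the task overruns, the mirai resolves to an
# error value instead of staying unresolved forever.
m <- mirai(
  {
    Sys.sleep(runif(1, 0, 30))  # stand-in for the real task
    "done"
  },
  .timeout = 60000
)
```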
Thanks, I appreciate your openness to these features. I do not think a timeout in each mirai() would be enough for my case. I am seldom sure a priori how long each task will need, and if I overestimate the timeout, that could mean a significant delay in finding out about the crash. For the efficiency of the {targets} pipelines I have in mind, I think I would need to find out as soon as possible if a particular job will not finish due to a broken connection.
Implemented in 5661217: an active queue is now self-repairing if a node fails. This will be extended to the more general version of the active queue once that is implemented.
Amazing! I tested this locally, and it appears to work (see below). I only worry about one possible edge case: if the task itself is what crashes the server, automatic resubmission could keep sending it to fresh servers and crashing each one in turn.
How would you suggest I avoid this improbable but vicious loop? Current test:
This would be rare, as all evaluation is wrapped in a …. You would need to call …. How quickly it attempts retries etc. are options that can be set through nanonext/NNG.
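As a rough illustration only: nanonext exposes NNG options through opt(), and NNG's dialer reconnect settings are the kind of knob meant here. The socket setup and the option names below are assumptions for the sketch, not anything confirmed in this thread, and they act on a standalone socket rather than mirai's internal one.

```r
library(nanonext)

# Hypothetical req socket, separate from anything mirai manages.
sock <- socket("req")

# Assumed NNG dialer reconnect back-off options, in milliseconds
# (corresponding to NNG_OPT_RECONNMINT / NNG_OPT_RECONNMAXT); set on the
# socket so they apply to dialers created afterwards.
opt(sock, "reconnect-time-min") <- 1000
opt(sock, "reconnect-time-max") <- 10000

dial(sock, url = "tcp://10.0.0.1:5555")  # placeholder address
```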
I see. Is there an NNG option for the maximum number of retries?
That is not currently an option. This is such an edge case... I am not sure it is something that is worth handling, even on your side.
I agree that it is extremely rare, but I do care about it. The loop in #20 (comment) could be such a nightmare for an unlucky user with servers running as AWS Batch jobs backed by expensive EC2 instances. Is there anything else we can do through …? If not, I think I would need to let go of my plans for fault tolerance in ….
For something like this, if it is going to be useful for 99% of cases, then why not go ahead with the feature but just have the option to turn it off for the 1% of times when this might happen. If you really care about the 1%, then have the default switched to off; but we tend to overestimate the probability of rare events in any case.
Yeah, I think I could let users opt in or out of fault tolerance via ….
If a server crashes while running a task, is there a way to promptly know if the task is never going to complete? I tried the following steps on my SGE cluster. On a server node:
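Roughly along these lines (a sketch with a recent mirai API; the function name, address, and port are placeholders for what was actually run, and older mirai versions used a differently named server-launching function):

```r
# On the server node: start a persistent process that connects to the
# client at the placeholder address and executes tasks sent to it.
mirai::daemon("tcp://10.0.0.1:5555")
```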
On the client with a different node and different IP than the server:
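Something like this (again a sketch; the address is a placeholder, assumed to match the one used on the server node):

```r
library(mirai)

# On the client: point daemons() at the same placeholder address, then
# submit a task that takes 10 seconds.
daemons(url = "tcp://10.0.0.1:5555")
m <- mirai(Sys.sleep(10))
```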
Then, before the 10 seconds had elapsed, I terminated the server process. On the client, the mirai object looks the same as it did while the job was running.
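For reference, the state on the client can be inspected like this (a sketch; unresolved() and the $data field are part of the mirai API):

```r
unresolved(m)  # still TRUE after the server process was killed
m$data         # still the 'unresolved' sentinel, as if the job were running
```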