Long/inconsistent requeue times with 2.9.1 + slurm #2117
I believe the two components that can mark a node as unhealthy are clustermgtd and the slurm daemons.

Why doesn't clustermgtd mark the node as unhealthy?

Before I detail what I think the issue is, I want to address why clustermgtd doesn't mark this node as unhealthy. clustermgtd periodically checks for the following on each compute node in the cluster:
Shutting down the instance doesn't affect scheduled events. When I shut down an instance as you did in your example, I see the following statuses:
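As an aside for anyone who wants to run the same check on their own instance, here is a minimal boto3 sketch of that status/scheduled-events query; the region and instance ID are placeholders, and this is not ParallelCluster's actual monitoring code.

```python
# Hypothetical sketch: query EC2 status checks and scheduled events for one instance.
# The instance ID and region below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instance_status(
    InstanceIds=["i-0123456789abcdef0"],  # placeholder
    IncludeAllInstances=True,             # also report stopped instances
)

for status in resp["InstanceStatuses"]:
    print("instance state: ", status["InstanceState"]["Name"])
    print("system status:  ", status["SystemStatus"]["Status"])
    print("instance status:", status["InstanceStatus"]["Status"])
    # Scheduled events (e.g. instance retirement) show up here; a user-initiated
    # shutdown does not add any, which matches the observation above.
    print("events:         ", status.get("Events", []))
```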
Why don't the slurm daemons mark the node as down?

I believe the reason it takes so long for slurm to replace the node is the value configured for ResumeTimeout. This value is set to 3600 seconds (one hour) by ParallelCluster. I believe that value is much larger than the default of 60 seconds to handle the case where an AMI does not have the software ParallelCluster requires.

Why doesn't the SlurmdTimeout parameter apply?

I don't have an answer for this. From the same slurm.conf documentation:
To me, it would seem that if a node's slurmd wasn't responding, then this timeout would take precedence over the ResumeTimeout. Perhaps slurmctld waits for the duration of the ResumeTimeout even after a node has responded once, in order to accommodate nodes that are intermittently responsive when resuming for whatever reason. I'll get clarification on this.
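For context, a quick way to confirm the timeouts in play on a running cluster is to read them back from slurm's own configuration. A small sketch; it just shells out to scontrol, which must be available on the node where it runs:

```python
# Sketch: print the timeout values discussed above from the live slurm config.
# Run on the head node (or any node with scontrol available).
import subprocess

config = subprocess.run(
    ["scontrol", "show", "config"], capture_output=True, text=True, check=True
).stdout

for line in config.splitlines():
    key = line.split("=", 1)[0].strip()
    if key in ("ResumeTimeout", "SlurmdTimeout"):
        # On a ParallelCluster 2.9.x cluster, ResumeTimeout is expected to be
        # 3600 sec, versus slurm's default of 60 sec mentioned above.
        print(line.strip())
```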
Just to clarify, in the example I'm manually shutting down the instance for demonstration - what I'm trying to figure out in practice is spot nodes being terminated by AWS, and certain jobs causing the entire node to become unresponsive. I guess my follow-up questions would be:
Hi @keien, checking in on this issue.

Regarding your use-case question: the use case of a spot node being terminated by AWS should be covered by the current logic. However, if your spot interruption behavior is stopping or hibernating the instance, you may still run into the delay described here.

On the other hand, we are still looking into the issue, and we are not sure why slurm would take so long to mark nodes as down when the instance is stopped. We will try to update this issue with more details once we have more information. Thanks!
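As a side note, one way to check which interruption behavior a cluster's spot requests actually use is to query the spot requests directly. A hedged boto3 sketch; the region and filter values are illustrative:

```python
# Sketch: list active spot requests and their interruption behavior
# (terminate, stop, or hibernate). Region is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_spot_instance_requests(
    Filters=[{"Name": "state", "Values": ["active"]}]
)

for req in resp["SpotInstanceRequests"]:
    print(
        req["SpotInstanceRequestId"],
        req.get("InstanceId", "-"),
        req.get("InstanceInterruptionBehavior", "terminate"),
    )
```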
Okay, I was testing the behavior by running a shutdown on the instance.
Unfortunately I see the same thing when terminating the node instead of shutting it down - the node stays in the same state. One possible theory is that if the logic is monitoring the instance health checks, the health checks only show issues very briefly after termination, before they become completely unavailable, so maybe the timing window is too short to detect that it went down?
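A rough way to test that timing-window theory is to poll the status API right after terminating an instance and watch how quickly the reported checks disappear. A sketch, with a placeholder instance ID and an arbitrary polling interval:

```python
# Sketch: watch how EC2 status reporting changes in the minutes after a
# termination. Instance ID, loop count, and interval are placeholders.
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder

for _ in range(20):
    try:
        resp = ec2.describe_instance_status(
            InstanceIds=[instance_id], IncludeAllInstances=True
        )
    except ClientError as err:
        print("status query failed:", err)
        break
    if not resp["InstanceStatuses"]:
        print("no status returned (instance no longer reported)")
    else:
        s = resp["InstanceStatuses"][0]
        print(
            s["InstanceState"]["Name"],
            s["SystemStatus"]["Status"],
            s["InstanceStatus"]["Status"],
        )
    time.sleep(15)
```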
Hi @keien,
Unfortunately I cannot reproduce the issue with termination. On termination, the node is handled as expected on my end. We are still looking into the root cause for your original issue. Hopefully we can have an update soon.
Let me try to reproduce it again to get the relevant section of the logs. To address the OS thing (if it happens to be the issue): no, we cannot switch away from CentOS 7 at this time.
Vaguely related symptom that I just saw: a spot instance was allocated and almost immediately terminated by AWS (not uncommon with GPU instances), but slurm never updated its power state from powering up to powering down, so now I've got a node stuck in that powering-up state.
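For anyone hitting the same stuck-node symptom, one manual escape hatch (a sketch only, not an official ParallelCluster recovery procedure; the node name is a placeholder) is to force the node's state with scontrol so slurm stops waiting on it:

```python
# Sketch: manually mark a stuck node as DOWN so slurm stops waiting on it.
# Node name is a placeholder; this is not an official recovery procedure.
import subprocess

node = "compute-dy-gpu-1"  # placeholder node name

subprocess.run(
    ["scontrol", "update", f"NodeName={node}", "State=DOWN", "Reason=spot-terminated"],
    check=True,
)
```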
Hi @keien,

After communication with SchedMD, we confirm that there is problematic behavior in slurm of not marking a non-responsive node as down if the node is within ResumeTimeout. Details for this issue have been documented in our wiki here. If you have comments/feedback regarding this slurm issue, please leave them in issue thread #2146 so we can track inputs from our users and make a case to continually push SchedMD.

As discussed in the wiki, you should not see a similar issue in the case of instance termination. Please let us know if you have any other open questions regarding your use case. Thank you!
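If the mitigation ends up being a lower ResumeTimeout (that is my reading of the discussion above; check the linked wiki for the actual recommendation), a rough sketch of applying it on the head node could look like this. The slurm.conf path and the new value are assumptions, not verified guidance:

```python
# Hedged sketch: lower ResumeTimeout in slurm.conf and ask slurmctld to reload.
# Path and value are assumptions; verify against the wiki before using anything like this.
import re
import subprocess

SLURM_CONF = "/opt/slurm/etc/slurm.conf"  # typical ParallelCluster 2.x location (assumption)
NEW_TIMEOUT = 600                         # seconds; illustrative value only

with open(SLURM_CONF) as f:
    conf = f.read()

conf = re.sub(r"(?m)^ResumeTimeout\s*=\s*\d+", f"ResumeTimeout={NEW_TIMEOUT}", conf)

with open(SLURM_CONF, "w") as f:
    f.write(conf)

subprocess.run(["scontrol", "reconfigure"], check=True)
```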
Appreciate the follow-up - will set that. Feel free to close this.
Required Info:
Bug description and how to reproduce:
When a node with a job is terminated, the time it takes for slurm to update the job/node state is quite variable - sometimes a few minutes, but I've seen it take nearly an hour. Here is an example run:
first job running:
terminate node:
right after termination:
attempting to salloc while in this state:
the salloc job gets stuck in CG after the above:
job restarted:
logs:
As you can see, the above node took just under an hour to bring down.
I'd really like to see slurm update the node state a lot faster, so we can avoid this long restart time as well as avoid getting a bunch of jobs stuck trying to run on a dead node.