Auto-restart hanging job if faulty node detected #176
Comments
On …:

- Very common: …
- Sometimes: … (UVM-related functions on the stack. Note that the proc of the hang here is not necessarily your own proc, but it hints at a problem with UVM.)
- Rare: …
- Rare: …
This seems like a highly site-specific problem, so my suggestion would be to implement a generic "detect_hanging" function (working title) that does nothing by default and that can be overridden by users in their settings.py for specific locations. One problem that I see is that the hang detection would need to run in the manager and not the worker.
- This assumes passwordless SSH access to all nodes has been set up.
- It adds a dependency to the Python env running the manager.

I am not against it. All of my counterpoints are easily mitigated by making the actual check a user implementation; then the user is responsible for setting up the requirements.
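A minimal sketch of what such an overridable "detect_hanging" hook could look like; the signature and the place where Sisyphus would call it are assumptions, nothing like this exists yet:

```python
# Hypothetical default, e.g. in Sisyphus' global settings: a no-op, so nothing
# changes unless a site overrides this function in its settings.py.
def detect_hanging(job_dir: str, node_name: str) -> bool:
    """Return True if the job running in job_dir on node_name should be
    treated as hanging (and thus be canceled and restarted). Default: never."""
    return False
```

A site-specific settings.py could then replace this with, e.g., an SSH-based check like the ones discussed further below.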
Is it? It might be specific to NVIDIA GPUs, and maybe also Linux, but then this is rather generic?
Actually both are possible. In the worker, you could also start another subproc which does those checks. And maybe this will touch a file when the checks are successful, so then the Sisyphus manager only needs to check if this file has recently been touched. But otherwise, why is it a problem if the hang detection runs in the manager?
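A rough sketch of that heartbeat-file idea; file name, interval, and the concrete check are made up here, only the mechanism matters:

```python
import os
import subprocess
import threading
import time

HEARTBEAT_FILE = "engine/health_heartbeat"  # made-up path inside the job dir
CHECK_INTERVAL = 60  # seconds between checks in the worker


def _health_check_loop(job_dir: str):
    """Worker side: run a health check periodically, touch a heartbeat file on success."""
    heartbeat_path = os.path.join(job_dir, HEARTBEAT_FILE)
    while True:
        try:
            # Example check; any of the checks discussed in this issue could go here.
            subprocess.run(["nvidia-smi"], check=True, capture_output=True, timeout=30)
            with open(heartbeat_path, "a"):
                pass  # create the file if it does not exist yet
            os.utime(heartbeat_path, None)  # update mtime = "checks were fine just now"
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            pass  # leave the file untouched; the manager will see it going stale
        time.sleep(CHECK_INTERVAL)


def start_health_check_thread(job_dir: str):
    threading.Thread(target=_health_check_loop, args=(job_dir,), daemon=True).start()


def heartbeat_is_recent(job_dir: str, max_age: float = 300.0) -> bool:
    """Manager side: only look at the heartbeat file's mtime."""
    try:
        return time.time() - os.path.getmtime(os.path.join(job_dir, HEARTBEAT_FILE)) < max_age
    except OSError:
        return True  # no heartbeat yet (job just started); don't flag it
```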
Yes, but this is usually the case.
Maybe someone who buys server-grade hardware does not have these problems. Maybe other sites have additional problems. Maybe users of other distros have different error messages.
Maybe they can share the output of this mechanism, and then at i6 one would use a script that queries the admin database of faulty nodes rather than trying to log in to each node.
If the problems don't occur, then the check would also not hurt. But actually, at least for DGX with V100-SXM3 (at ITC), and also V100-SXM2 (at ITC), and also A10 (i6), I see very similar problems. If there are additional often-occurring problems, we can always extend it. The dmesg messages are only specific to Linux, or maybe the NVIDIA driver. I get your point that this should be easily user-configurable and extensible. But I also think a few of those checks are very generic and probably useful for everybody, and it might make sense to directly provide them as part of Sisyphus so that users can easily use them. Maybe a config option for each particular check (e.g. …). So, I'm thinking about having a few predefined checks which the user can enable, and then additionally maybe such a function as you said, like …. And additionally maybe an option like ….
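Purely for illustration, such a settings shape could look like the following; all option and function names here are placeholders for the discussion, not existing Sisyphus settings:

```python
# Placeholder names only, to illustrate "one enable option per predefined check"
# plus an additional custom hook.
CHECK_HANGING_NVIDIA_SMI = True        # predefined check: `nvidia-smi` on the node
CHECK_HANGING_TORCH_CUDA_INIT = False  # predefined check: `python -c "import torch; torch.cuda.init()"`
CHECK_HANGING_PY_SPY = False           # predefined check: `py-spy dump -p <pid>` with a timeout


def check_job_hanging(job, node_name):
    """Additional custom, site-specific check (also hypothetical)."""
    return False
```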
The faulty nodes are set into drain mode. So we could just check whether a node is in drain mode. But we would need to extend the engine API. Slurm and the other engines would each need their own specific mechanism to check that.
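For Slurm specifically, such an engine-API extension could be sketched like this (the function name is made up; the `sinfo` flags are standard Slurm):

```python
import subprocess


def slurm_node_is_draining(node_name: str) -> bool:
    """Return True if Slurm reports the node as draining/drained/down."""
    out = subprocess.run(
        ["sinfo", "--noheader", "--nodes", node_name, "--format", "%t"],
        capture_output=True, text=True, check=True, timeout=30,
    ).stdout.strip()
    # %t prints the compact node state, e.g. "idle", "mix", "drain", "drng", "down";
    # a trailing "*" marks a non-responding node.
    return out.rstrip("*").startswith(("drain", "drng", "down"))
```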
I just noticed: when I see the warning "Job marked as running but logging file has not been updated", that is when …. (I started with some initial implementation, also inside the worker, as another thread, which performs the same check, via ….)

I just had a hanging case, but not so much related to the other GPU-related hangs here (I think), where the worker proc looks like:
And:

Via pystack:
As you see, this hangs in …. This …. But that means we basically already have a reliable way to see if the proc hangs. If we see this warning, it is really hanging. Or not? I actually sometimes see the warning also for some jobs at the beginning, when they are slow at loading (e.g. …).

Edit: See also the added pystack output. As you see from there, it hangs when reading … for ….
Hm, very strange. I then attached via …:
Again a hanging job. So, now debugging why I actually get the warning "Job marked as running but logging file has not been updated". Current procs:
So, checking the Sisyphus worker:
This seems ok. So, current files:
So, the …:
This is exactly my current time. So I don't understand: why do I get the warning "Job marked as running but logging file has not been updated"? The ….

So, I restarted the Sis manager to check if/when I get the same warning again:
So, it takes 6 minutes, and then prints the warning again. In …:

```python
maximal_file_age = gs.WAIT_PERIOD_JOB_FS_SYNC + gs.PLOGGING_UPDATE_FILE_PERIOD + gs.WAIT_PERIOD_JOB_CLEANUP
```

I have:

```python
WAIT_PERIOD_JOB_FS_SYNC = 1
PLOGGING_UPDATE_FILE_PERIOD = 60
WAIT_PERIOD_JOB_CLEANUP = 10
```

With these values, maximal_file_age is 1 + 60 + 10 = 71 seconds, so well below those 6 minutes. I wonder, maybe the ….
When you check that the …?
I checked from the node of the manager.

My current explanation is: I think …. I'm not really sure how to verify this. I maybe need an exact timeline from an independent node to log when the file is updated. At the same time, I need frequent sampling of the worker with ….

The actual question is how to reliably implement the job-hanging check. If it hangs only temporarily, it should not restart. So this means maybe just that I need to increase the timeout. Then, additionally, I maybe should extend ….
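One way to make the check more robust against merely temporary hangs, as discussed above, would be to only act once the staleness persists over several consecutive manager scans; a sketch with made-up names and thresholds:

```python
import os
import time
from collections import defaultdict

STALE_SCANS_BEFORE_ACTION = 3  # require this many consecutive stale observations
_stale_counts = defaultdict(int)


def job_considered_hanging(job_id: str, log_file: str, maximal_file_age: float) -> bool:
    """Return True only if the log file stayed stale over several consecutive scans."""
    try:
        stale = time.time() - os.path.getmtime(log_file) > maximal_file_age
    except OSError:
        stale = False  # cannot stat the file right now (e.g. FS hiccup); don't count it
    if stale:
        _stale_counts[job_id] += 1
    else:
        _stale_counts[job_id] = 0
    return _stale_counts[job_id] >= STALE_SCANS_BEFORE_ACTION
```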
Currently the admins already have some mechanism to detect faulty nodes and then let them drain and reboot. However, if you are unlucky and it is your job currently hanging there, nothing is done about that. It needs manual inspection by the user and manual cancellation of the job (or waiting until the time limit is hit). This is of course very suboptimal, and we are thinking about solutions. It was suggested that such logic could also be implemented as part of the Sisyphus manager.
Sisyphus already detects potential hanging jobs by checking whether the log has not been updated recently, and then prints the warning "Job marked as running but logging file has not been updated".
So, the question is, how to detect whether it is really hanging and should be canceled and restarted.
And then the job should not go into error state, also not retry_error, but just be restarted? Or maybe, if it cannot be resumed, it should also automatically be cleared first?
And maybe Sisyphus should also keep a temporary local list of excluded nodes and temporarily add the node there (e.g. for 2h or so)?
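A sketch of such a temporary exclude list (hypothetical helper, not an existing Sisyphus feature; the 2h duration is the example from above):

```python
import time

EXCLUDE_DURATION = 2 * 60 * 60  # keep a node excluded for 2 hours
_excluded_nodes = {}  # node name -> time of exclusion


def exclude_node(node_name: str):
    _excluded_nodes[node_name] = time.time()


def currently_excluded_nodes() -> list:
    """Nodes excluded less than EXCLUDE_DURATION ago, e.g. to pass to the engine
    when submitting (for Slurm: sbatch --exclude=<node1,node2,...>)."""
    now = time.time()
    return [n for n, t in _excluded_nodes.items() if now - t < EXCLUDE_DURATION]
```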
Here are some possible hang detection checks (a combined sketch follows below the list). All of them would involve logging on to the node via SSH.
- Check if `nvidia-smi` returns an error. (Maybe some unrelated GPU is faulty, but anyway, if your proc hangs and this is the case, I think it's OK to cancel the job.) (This already covers a lot of the cases I had, but not all.)
- Check whether `python -c "import torch; torch.cuda.init()"` hangs.
- Check `py-spy dump -p <pid>`, whether that hangs (given some timeout, maybe 10 sec). (One problem: which PID actually? All the (deep) children of slurm_script? Only the direct children would not cover my RETURNN training setups with the option `use_train_proc_manager`, where only the sub-sub-proc hangs.) (Alternative to py-spy: maybe just strace, or gdb, or something else? strace output looks like `strace: Process 81401 attached` and then nothing more comes. py-spy is also only for Python, so not generic.)
- Check `dmesg` for some common errors? But what exactly?
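A minimal sketch wiring the checks above together via SSH. It assumes passwordless SSH to the node (as noted in the comments below), the timeouts are arbitrary, and the open question of which PID to inspect is simply passed in as a parameter:

```python
import subprocess


def _ssh_ok(node: str, cmd: str, timeout: float) -> bool:
    """Run a command on the node via SSH; True iff it finishes successfully in time."""
    try:
        subprocess.run(["ssh", node, cmd], check=True, capture_output=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False


def node_looks_faulty(node: str, job_pid=None) -> bool:
    # Check 1: nvidia-smi must answer and exit cleanly.
    if not _ssh_ok(node, "nvidia-smi", timeout=60):
        return True
    # Check 2: CUDA init via torch must not hang (assumes torch is available on the node).
    if not _ssh_ok(node, 'python -c "import torch; torch.cuda.init()"', timeout=120):
        return True
    # Check 3: py-spy dump on the job's (Python) process must not hang.
    if job_pid is not None and not _ssh_ok(node, "py-spy dump -p %i" % job_pid, timeout=10):
        return True
    return False
```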