SLURM quality of life improvements #405
Comments
Adding this support for SLURM wouldn't be too bad:
Something like `job_dir` we could relatively easily extend to `local_cwd` and `local_docker` -- more complex for k8s/batch/ray.
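For context, a `job_dir`-style option in the local schedulers could be as simple as the sketch below (my own naming and layout, not actual torchx code): honour the directory if the user passes one, otherwise fall back to a temp dir.

```python
import os
import tempfile


def resolve_job_dir(cfg: dict, app_id: str) -> str:
    # Minimal sketch (not torchx code): honour a user-supplied job_dir run
    # option if present, otherwise fall back to a throwaway temp dir, which is
    # roughly what a local scheduler does by default.
    base = cfg.get("job_dir") or tempfile.mkdtemp(prefix="torchx_")
    job_dir = os.path.join(base, app_id)
    os.makedirs(job_dir, exist_ok=True)
    return job_dir
```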
For the heterogeneous jobs displaying differently, that's tricky in the current model given the macros like https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L138. I did look, but it doesn't appear that sacct/squeue has a way to hide child jobs. You can use …
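As a user-side workaround until then, something like the following can collapse the `squeue` output to one line per job; it assumes heterogeneous-job components show up as `<jobid>+<offset>`, which may differ by SLURM version/config:

```python
import getpass
import subprocess


def squeue_one_line_per_job(user: str) -> list[str]:
    # Workaround sketch: assumes heterogeneous-job components appear in squeue
    # as "<jobid>+<offset>", so keeping only the "+0" component (and plain job
    # ids) collapses a multi-node torchx job back to a single line.
    out = subprocess.run(
        ["squeue", "--user", user, "--noheader", "--format", "%i %j %T %M %D"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    rows = [line for line in out if line.strip()]
    return [r for r in rows if "+" not in r.split()[0] or r.split()[0].endswith("+0")]


if __name__ == "__main__":
    print("\n".join(squeue_one_line_per_job(getpass.getuser())))
```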
I think it's hard to see us migrating to use …
#412 makes it so that when running with …
…the worker's stdout and stderr streams (#412) Summary: Pull Request resolved: #412 Addresses the QoL issue around SLURM logs mentioned in #405. TL;DR - since torchx launches nodes (not tasks) in SLURM, the stdout and stderr logs are combined for all 8 workers on the node (versus having separate ones for each worker when launched as tasks). This makes `dist.ddp` set the `--tee=3` flag to torchelastic, which prefixes each line of stderr and stdout of the workers with the local_rank of that worker so that the user can easily grep out the logs for a particular worker. Reviewed By: d4l3k Differential Revision: D34726681 fbshipit-source-id: 937f472d702e195f7344a042db69ab5ac1c1e900
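For pulling a single worker's lines back out of the combined per-node log once `--tee` is enabled, a small filter along these lines works; the exact prefix format is an assumption (check a real log line), so treat this as a sketch:

```python
import sys


def lines_for_local_rank(log_path: str, local_rank: int):
    # Sketch: recover one worker's output from the combined per-node log once
    # --tee is on. Assumes the prefix torchelastic adds looks like
    # "[<local_rank>]:"; verify against an actual log line.
    prefix = f"[{local_rank}]:"
    with open(log_path) as f:
        for line in f:
            if line.startswith(prefix):
                yield line[len(prefix):].lstrip()


if __name__ == "__main__":
    path, rank = sys.argv[1], int(sys.argv[2])
    sys.stdout.writelines(lines_for_local_rank(path, rank))
```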
…instead of having everything redirected to stdout (#414) Summary: Pull Request resolved: #414 Addresses the log-related QoL issue mentioned in #405. Split stdout and stderr into two separate log files and adjust the `slurm_scheduler.log_iter` implementation to support this. Reviewed By: d4l3k Differential Revision: D34729144 fbshipit-source-id: 1505d265f8a779b1ad3499cd42ff01858a47fc4e
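Roughly what the split looks like on the SLURM side (a sketch with hypothetical helper names, not the exact directives torchx emits): give each replica its own `--output` and `--error` file and let SLURM substitute `%j` for the job id.

```python
def log_opts(app_name: str, replica_id: int) -> list:
    # Hypothetical helper: per-replica SLURM options that send stdout and
    # stderr to separate files; SLURM expands %j to the job id at submit time.
    base = f"slurm-%j-{app_name}-{replica_id}"
    return [f"--output={base}.out", f"--error={base}.err"]


# e.g. log_opts("train_app", 0) ->
#   ["--output=slurm-%j-train_app-0.out", "--error=slurm-%j-train_app-0.err"]
```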
We need to work with the lightning team to make sure that the ranks displayed here match the ones used in lightning, which isn't guaranteed to be the case right now, as @kiukchung and I discovered the other day.
Summary: This automatically infers `nomem` by using `sinfo` to get the `Memory` setting for the specified partition. If we can't determine the memory setting or it's <= 1000 MB, we disable memory allocation checks. #405 Pull Request resolved: #461 Test Plan: CI Slurm integration tests w/ `nomem` removed Reviewed By: kiukchung Differential Revision: D35678067 Pulled By: d4l3k fbshipit-source-id: 91fbd15339f4e0ddf495da8232ccc451f01488ae
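The heuristic described in the summary boils down to something like this sketch (not the exact torchx implementation; `sinfo --format %m` reports the partition's memory per node in MB):

```python
import subprocess


def partition_memory_mb(partition: str):
    # Sketch of the heuristic (not the exact torchx implementation): ask sinfo
    # for the partition's configured memory per node; "%m" prints it in MB,
    # sometimes suffixed with "+".
    try:
        out = subprocess.run(
            ["sinfo", "--partition", partition, "--noheader", "--format", "%m"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
    except (OSError, subprocess.CalledProcessError):
        return None
    for token in out:
        digits = "".join(ch for ch in token if ch.isdigit())
        if digits:
            return int(digits)
    return None


def should_skip_mem_request(partition: str) -> bool:
    # If the memory setting can't be determined or is <= 1000 MB, behave as if
    # `nomem` was set and don't request memory for the job.
    mem = partition_memory_mb(partition)
    return mem is None or mem <= 1000
```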
Description
Making a couple of requests to improve QoL on SLURM
Detailed Proposal
It would be helpful to have -

- A way to name jobs - the `squeue` command output includes the job name, and if our jobs are all run via torchx, every job will be named `train_app-{i}`, which makes it hard to identify which experiment / project the job is from.
- The `time` argument doesn't say what the unit is - maybe we just follow the SLURM API, but it would be nice if we clarified that.
- `squeue` logs show every node as a separate line - so a 32 node job would take 32 lines instead of 1. This just makes it harder to monitor jobs - not a technical issue, just a QoL one :)
- The logs are `slurm-{job-id}-train_app-{node-id}.out` files (per node) and a single `slurm-{job-id}.out`. Normally, our jobs instead have logs of the form `{job-id}-{node-id}.out` and `{job-id}-{node-id}.err` (per node) - the separation between `stderr` and `stdout` helps find which machine actually crashed more easily. And I'm not sure what `slurm-{job-id}.out` corresponds to - maybe it's a consequence of the heterogeneous jobs?
- With torchelastic, it becomes harder to debug which node crashed since every node logs a crash (so grepping for `Traceback` will return each log file instead of just the node which originally crashed) - maybe there is a way to figure this out and I just don't know what to look for? `global_rank` is not equal to `local_rank + node_id * gpus_per_node`, i.e. global rank 0 can be on node 3 (see the sketch after this list).
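On the last point, a tiny diagnostic (not part of torchx) that only relies on the environment variables torchelastic exports can make the rank-to-host mapping visible; run it as the entrypoint, or call it at the top of the training script:

```python
import os
import socket


def describe_rank() -> str:
    # Diagnostic sketch: torchelastic exports these variables to every worker
    # process, so printing them from each worker shows which host actually
    # owns global rank 0.
    keys = ["RANK", "LOCAL_RANK", "GROUP_RANK", "LOCAL_WORLD_SIZE", "WORLD_SIZE"]
    vals = " ".join(f"{k}={os.environ.get(k, '?')}" for k in keys)
    return f"host={socket.gethostname()} {vals}"


if __name__ == "__main__":
    print(describe_rank())
```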