Description
Making a couple of requests to improve QoL on SLURM
Detailed Proposal
It would be helpful to have:
- The ability to specify the output path. Currently, you need to cd to the right path for this, which generally needs a helper function to set up the directory, cd to it, and then launch via torchx. torchx could ideally handle this for us. (slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416)
- Code isolation and reproducibility. While doing research, we make a change, launch an experiment, and repeat. To make sure each experiment uses the same consistent code, we copy the code to the experiment directory (which also helps with reproducibility). (slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416)
- Verification of the passed launch script. If I launch from a wrong directory, for instance, I would still queue up the job and wait a few minutes / hours, only for it to crash because of a wrong path (i.e. the launch script does not exist). (See the workspace-helper sketch after this list for the kind of helper described in these three points.)
- Being able to specify a job name. SLURM shows job details, including the job name, when running the `squeue` command. If our jobs are all run via torchx, every job is named `train_app-{i}`, which makes it hard to identify which experiment / project a job is from.
- The `time` argument doesn't say what the unit is. Maybe we just follow the SLURM API, but it would be nice to clarify that. (See the time-format sketch after this list.)
- torchx submits jobs in heterogeneous mode. This is something FAIR users aren't familiar with - I'm guessing there is feature parity in terms of execution and command support, and hopefully scheduling speed parity as well (not sure about the latter)? Also, `squeue` shows every node as a separate line, so a 32-node job takes 32 lines instead of 1. This just makes it harder to monitor jobs - not a technical issue, just a QoL one :)
- The job logs are created as per-node `slurm-{job-id}-train_app-{node-id}.out` files plus a single `slurm-{job-id}.out`. Normally, our jobs instead produce per-node logs of the form `{job-id}-{node-id}.out` and `{job-id}-{node-id}.err` - the separation between `stderr` and `stdout` makes it easier to find which machine actually crashed. I'm also not sure what `slurm-{job-id}.out` corresponds to - maybe it's a consequence of the heterogeneous jobs? With torchelastic, it becomes harder to debug which node crashed, since every node logs a crash (so grepping for `Traceback` returns every log file instead of just the node that originally crashed) - maybe there is a way to figure this out and I just don't know what to look for? (See the log-triage sketch after this list.)
- The `global_rank` is not equal to `local_rank + node_id * gpus_per_node`, i.e. global rank 0 can end up on node 3. (See the rank-check sketch after this list.)
- Automatically set nomem on pcluster.
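
To make the first three requests concrete, here is a minimal sketch of the kind of wrapper we end up writing around torchx today. Everything in it is an assumption for illustration: the paths, the `launch_experiment` name, and the exact `torchx run ... dist.ddp --script ...` invocation will differ per project. The point is the pre-flight script check, the isolated per-experiment code snapshot, and launching from that directory.

```python
# Hypothetical helper; paths, names and the exact torchx invocation are
# illustrative assumptions, not torchx APIs.
import shutil
import subprocess
import sys
from pathlib import Path


def launch_experiment(code_root: str, launch_script: str, exp_dir: str) -> None:
    root = Path(code_root).expanduser()
    script = root / launch_script
    # 1. Fail fast instead of discovering a bad path hours later in the queue.
    if not script.is_file():
        sys.exit(f"launch script not found: {script}")

    # 2. Snapshot the code into the experiment directory so every run is
    #    pinned to the exact source it was launched with (reproducibility).
    workspace = Path(exp_dir).expanduser()
    workspace.mkdir(parents=True, exist_ok=True)
    snapshot = workspace / "code"
    shutil.copytree(root, snapshot, dirs_exist_ok=True)

    # 3. Launch from the workspace so the SLURM output lands next to the snapshot.
    subprocess.run(
        ["torchx", "run", "--scheduler", "slurm",
         "dist.ddp", "--script", str(snapshot / launch_script)],
        cwd=workspace,
        check=True,
    )


if __name__ == "__main__":
    launch_experiment("~/my_project", "train.py", "/checkpoint/me/exp_001")
```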
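For the `time` argument, a tiny sketch of what "clarifying the unit" looks like on the user side today. The helper name is made up; it just converts an explicit number of minutes into SLURM's unambiguous days-hours:minutes:seconds form.

```python
# Illustrative only: make the time unit explicit by always passing SLURM's
# D-HH:MM:SS form instead of a bare number.
def minutes_to_slurm_time(minutes: int) -> str:
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}-{hours:02d}:{mins:02d}:00"


assert minutes_to_slurm_time(90) == "0-01:30:00"
assert minutes_to_slurm_time(3 * 24 * 60) == "3-00:00:00"
```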
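For the log question, this is roughly the triage heuristic implied above, written out. The `slurm-{job-id}-*.out` glob and the "earliest-modified log with a `Traceback` is probably the original crash" assumption are mine; neither torchx nor SLURM provides this.

```python
# Heuristic sketch: list the per-node logs that contain a Traceback, oldest
# first, assuming the node that crashed first stopped writing first and the
# rest only failed afterwards on collectives / rendezvous.
import glob
import os
import sys
import time


def find_crashed_nodes(job_id: str) -> None:
    hits = []
    for path in glob.glob(f"slurm-{job_id}-*.out"):
        with open(path, errors="replace") as f:
            if any("Traceback" in line for line in f):
                hits.append((os.path.getmtime(path), path))
    for mtime, path in sorted(hits):
        print(time.strftime("%H:%M:%S", time.localtime(mtime)), path)


if __name__ == "__main__":
    find_crashed_nodes(sys.argv[1])
```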
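And for the rank layout point, a minimal check one could drop into the training script, assuming the usual torchelastic environment variables (`RANK`, `LOCAL_RANK`, `GROUP_RANK`, `LOCAL_WORLD_SIZE`) are present on every worker. It just prints the workers where the observed global rank disagrees with `local_rank + node_id * gpus_per_node`.

```python
# Sanity check for the rank mapping; relies on the environment variables that
# torchelastic / torchrun set on each worker.
import os


def check_rank_layout() -> None:
    rank = int(os.environ["RANK"])                # global rank
    local_rank = int(os.environ["LOCAL_RANK"])    # rank within this node
    node_id = int(os.environ["GROUP_RANK"])       # worker-group (node) index
    gpus_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

    expected = node_id * gpus_per_node + local_rank
    if rank != expected:
        # e.g. global rank 0 can show up on node 3
        print(f"node {node_id}, local rank {local_rank}: "
              f"global rank {rank}, expected {expected}")


if __name__ == "__main__":
    check_rank_layout()
```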