
SLURM quality of life improvements #405

Open

@mannatsingh

Description

Making a few requests to improve quality of life (QoL) on SLURM.

Detailed Proposal

It would be helpful to have -

  • The ability to specify the output path. Currently, you need to cd to the right path before launching, which generally requires a helper function to set up the directory, cd into it, and then launch via torchx; ideally torchx could handle this for us (a sketch of the current workaround follows this list). slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416
  • Code isolation and reproducibility. While doing research, we make a change, launch an experiment, and repeat. To make sure each experiment runs against the same consistent code, we copy the code into the experiment directory, which also helps with reproducibility (see the snapshot sketch after this list). slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416
  • Verification of the passed launch script. If I launch from the wrong directory, for instance, the job still gets queued and only crashes minutes or hours later because the launch script does not exist at the given path; a pre-submit check (sketched after this list) would catch this immediately.
  • Being able to specify a job name. SLURM shows job details, including the job name, in the squeue output. If all of our jobs are run via torchx, every job is named train_app-{i}, which makes it hard to tell which experiment or project a job belongs to (see the batch-script sketch after this list).
  • The time argument doesn't say what unit it uses. Maybe we just follow the SLURM API, but it would be nice to clarify that in the docs.
  • torchx submits jobs in heterogeneous mode. This is something FAIR users aren't familiar with - I'm guessing heterogeneous jobs have feature parity in terms of execution and command support, and hopefully scheduling-speed parity as well (not sure about the latter)? Also, squeue shows every node as a separate line, so a 32-node job takes 32 lines instead of 1. This just makes jobs harder to monitor - not a technical issue, just a QoL one :)
  • The job logs are written to slurm-{job-id}-train_app-{node-id}.out files (one per node) plus a single slurm-{job-id}.out. Normally our jobs instead produce {job-id}-{node-id}.out and {job-id}-{node-id}.err per node - separating stderr from stdout makes it much easier to find which machine actually crashed (the batch-script sketch after this list shows the equivalent Slurm options). I'm also not sure what slurm-{job-id}.out corresponds to - maybe it's a consequence of the heterogeneous jobs? With torchelastic it becomes harder to debug which node crashed, since every node logs a crash (so grepping for Traceback returns every log file instead of just the node that originally crashed) - maybe there is a way to figure this out and I just don't know what to look for?
  • The global_rank is not equal to local_rank + node_id * gpus_per_node, i.e. global rank 0 can end up on node 3 (the expected mapping is spelled out after this list).
  • Automatically set nomem on pcluster.
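
The first bullet describes the current workaround in words; below is a minimal sketch of it, assuming the caller supplies the experiment directory and the full torchx command line (`launch_in_dir` and its arguments are illustrative, not an existing torchx API).

```python
import os
import subprocess


def launch_in_dir(job_dir: str, torchx_cmd: list[str]) -> None:
    """Create the experiment directory, cd into it, and launch via torchx.

    This is the manual helper the bullet asks torchx to absorb: output files
    land under ``job_dir`` only because we chdir before launching.
    """
    os.makedirs(job_dir, exist_ok=True)
    original_cwd = os.getcwd()
    try:
        os.chdir(job_dir)
        # e.g. torchx_cmd = ["torchx", "run", ...] built by the caller
        subprocess.run(torchx_cmd, check=True)
    finally:
        os.chdir(original_cwd)
```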
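Likewise for the code-isolation bullet, a hedged sketch of the copy-the-code-then-launch workaround; `snapshot_code`, its arguments, and the ignore patterns are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def snapshot_code(src_root: str, experiment_dir: str) -> Path:
    """Copy the current source tree into the experiment directory so the
    queued job runs against a frozen snapshot instead of a moving checkout."""
    dest = Path(experiment_dir) / "code"
    shutil.copytree(
        src_root,
        dest,
        ignore=shutil.ignore_patterns(".git", "__pycache__", "*.pyc"),
    )
    return dest
```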
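For the launch-script bullet, the request boils down to a fail-fast check at submit time; a trivial sketch (the function name is hypothetical):

```python
from pathlib import Path


def check_launch_script(script_path: str) -> None:
    """Fail at submit time instead of minutes or hours later in the queue."""
    path = Path(script_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Launch script not found: {path} - refusing to queue the job"
        )
```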
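The job-name, time-limit, and log-file bullets all map onto plain Slurm options (`--job-name`, `--time`, and the `--output`/`--error` filename patterns on srun). The sketch below renders a minimal batch script using them; `render_batch_script` and its arguments are illustrative, not part of torchx.

```python
def render_batch_script(job_name: str, time_limit: str, log_dir: str, cmd: str) -> str:
    """Render a minimal Slurm batch script with a recognizable job name, an
    explicit time limit, and per-node stdout/stderr files for the job step.

    ``time_limit`` uses Slurm's own formats, e.g. "60" (minutes),
    "01:30:00" (hours:minutes:seconds), or "2-00:00:00" (days-hours:min:sec).
    In the srun line, %j expands to the job id and %n to the node id relative
    to the job, so each node gets its own .out/.err pair.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",  # shown in the NAME column of squeue
        f"#SBATCH --time={time_limit}",
        f"srun --output={log_dir}/%j-%n.out --error={log_dir}/%j-%n.err {cmd}",
    ])
```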
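Finally, the rank bullet refers to the conventional contiguous-by-node layout; spelling it out explicitly (the names follow the bullet, not any torchx API):

```python
def expected_global_rank(node_id: int, local_rank: int, gpus_per_node: int) -> int:
    """The mapping the bullet expects: ranks are laid out contiguously by
    node, so node 0 holds global ranks 0..gpus_per_node-1, node 1 the next
    block, and so on. The report is that jobs do not always follow this
    layout (e.g. global rank 0 may land on node 3)."""
    return node_id * gpus_per_node + local_rank


# With 8 GPUs per node, local rank 0 on node 3 would be global rank 24.
assert expected_global_rank(node_id=3, local_rank=0, gpus_per_node=8) == 24
```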
