
SLURM quality of life improvements #405

Open

@mannatsingh

Description

Making a few requests to improve quality of life (QoL) on SLURM.

Detailed Proposal

It would be helpful to have -

  • The ability to specify the output path. Currently, you need to cd to the right path before launching, which generally requires a helper function to set up the directory, cd into it, and then launch via torchx; ideally torchx could handle this for us (a sketch of the current workaround follows this list). slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416
  • Code isolation and reproducibility. While doing research, we make a change, launch an experiment, and repeat. To make sure each experiment runs against the same consistent code, we copy the code into the experiment directory, which also helps with reproducibility (see the snapshot sketch after this list). slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416
  • Verification of the passed launch script. If I launch from the wrong directory, for instance, the job still gets queued and only crashes minutes or hours later because the launch script does not exist at the given path; a pre-submit check (sketched after this list) would catch this immediately.
  • Being able to specify a job name. SLURM shows job details, including the job name, in the squeue output. If all of our jobs are run via torchx, every job is named train_app-{i}, which makes it hard to tell which experiment or project a job belongs to (see the batch-script sketch after this list).
  • The time argument doesn't say what unit it uses. Maybe we just follow the SLURM API, but it would be nice to clarify that in the docs.
  • torchx submits jobs in heterogeneous mode. This is something FAIR users aren't familiar with - I'm guessing heterogeneous jobs have feature parity in terms of execution and command support, and hopefully scheduling-speed parity as well (not sure about the latter)? Also, squeue shows every node as a separate line, so a 32-node job takes 32 lines instead of 1. This just makes jobs harder to monitor - not a technical issue, just a QoL one :)
  • The job logs are written to slurm-{job-id}-train_app-{node-id}.out files (one per node) plus a single slurm-{job-id}.out. Normally our jobs instead produce {job-id}-{node-id}.out and {job-id}-{node-id}.err per node - separating stderr from stdout makes it much easier to find which machine actually crashed (the batch-script sketch after this list shows the equivalent Slurm options). I'm also not sure what slurm-{job-id}.out corresponds to - maybe it's a consequence of the heterogeneous jobs? With torchelastic it becomes harder to debug which node crashed, since every node logs a crash (so grepping for Traceback returns every log file instead of just the node that originally crashed) - maybe there is a way to figure this out and I just don't know what to look for?
  • The global_rank is not equal to local_rank + node_id * gpus_per_node, i.e. global rank 0 can end up on node 3 (the expected mapping is spelled out after this list).
  • Automatically set nomem on pcluster.
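
The first bullet describes the current workaround in words; below is a minimal sketch of it, assuming the caller supplies the experiment directory and the full torchx command line (`launch_in_dir` and its arguments are illustrative, not an existing torchx API).

```python
import os
import subprocess


def launch_in_dir(job_dir: str, torchx_cmd: list[str]) -> None:
    """Create the experiment directory, cd into it, and launch via torchx.

    This is the manual helper the bullet asks torchx to absorb: output files
    land under ``job_dir`` only because we chdir before launching.
    """
    os.makedirs(job_dir, exist_ok=True)
    original_cwd = os.getcwd()
    try:
        os.chdir(job_dir)
        # e.g. torchx_cmd = ["torchx", "run", ...] built by the caller
        subprocess.run(torchx_cmd, check=True)
    finally:
        os.chdir(original_cwd)
```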
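Likewise for the code-isolation bullet, a hedged sketch of the copy-the-code-then-launch workaround; `snapshot_code`, its arguments, and the ignore patterns are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def snapshot_code(src_root: str, experiment_dir: str) -> Path:
    """Copy the current source tree into the experiment directory so the
    queued job runs against a frozen snapshot instead of a moving checkout."""
    dest = Path(experiment_dir) / "code"
    shutil.copytree(
        src_root,
        dest,
        ignore=shutil.ignore_patterns(".git", "__pycache__", "*.pyc"),
    )
    return dest
```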
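For the launch-script bullet, the request boils down to a fail-fast check at submit time; a trivial sketch (the function name is hypothetical):

```python
from pathlib import Path


def check_launch_script(script_path: str) -> None:
    """Fail at submit time instead of minutes or hours later in the queue."""
    path = Path(script_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Launch script not found: {path} - refusing to queue the job"
        )
```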
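The job-name, time-limit, and log-file bullets all map onto plain Slurm options (`--job-name`, `--time`, and the `--output`/`--error` filename patterns on srun). The sketch below renders a minimal batch script using them; `render_batch_script` and its arguments are illustrative, not part of torchx.

```python
def render_batch_script(job_name: str, time_limit: str, log_dir: str, cmd: str) -> str:
    """Render a minimal Slurm batch script with a recognizable job name, an
    explicit time limit, and per-node stdout/stderr files for the job step.

    ``time_limit`` uses Slurm's own formats, e.g. "60" (minutes),
    "01:30:00" (hours:minutes:seconds), or "2-00:00:00" (days-hours:min:sec).
    In the srun line, %j expands to the job id and %n to the node id relative
    to the job, so each node gets its own .out/.err pair.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",  # shown in the NAME column of squeue
        f"#SBATCH --time={time_limit}",
        f"srun --output={log_dir}/%j-%n.out --error={log_dir}/%j-%n.err {cmd}",
    ])
```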
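Finally, the rank bullet refers to the conventional contiguous-by-node layout; spelling it out explicitly (the names follow the bullet, not any torchx API):

```python
def expected_global_rank(node_id: int, local_rank: int, gpus_per_node: int) -> int:
    """The mapping the bullet expects: ranks are laid out contiguously by
    node, so node 0 holds global ranks 0..gpus_per_node-1, node 1 the next
    block, and so on. The report is that jobs do not always follow this
    layout (e.g. global rank 0 may land on node 3)."""
    return node_id * gpus_per_node + local_rank


# With 8 GPUs per node, local rank 0 on node 3 would be global rank 24.
assert expected_global_rank(node_id=3, local_rank=0, gpus_per_node=8) == 24
```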
