
SLURM quality of life improvements #405


Open
5 of 9 tasks
mannatsingh opened this issue Mar 4, 2022 · 5 comments
Assignees: d4l3k
Labels: slurm (slurm scheduler)

Comments

@mannatsingh

mannatsingh commented Mar 4, 2022

Description

Making a number of requests to improve quality of life (QoL) on SLURM.

Detailed Proposal

It would be helpful to have:

  • The ability to specify the output path. Currently, you need to cd to the right path for this, which generally requires a helper function to set up the directory, cd to it, and then launch via torchx (see the sketch after this list). Ideally torchx could handle this for us. slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416
  • Code isolation and reproducibility. While doing research, we make a change, launch an experiment, and repeat. To make sure each experiment runs against a consistent snapshot of the code, we copy the code to the experiment directory (which also helps with reproducibility). slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416
  • Verification of the passed launch script. If I launch from the wrong directory, for instance, the job still gets queued and sits for minutes or hours, only to crash because of a bad path (i.e. the launch script does not exist).
  • Being able to specify a job name. SLURM shows job details, including the job name, when running the squeue command. If all our jobs are run via torchx, every job is named train_app-{i}, which makes it hard to identify which experiment / project a job is from.
  • The time argument doesn't say what its unit is. Maybe we just follow the SLURM API, but it would be nice to clarify that.
  • torchx submits jobs in heterogeneous mode, which FAIR users aren't familiar with. I'm guessing there should be parity in terms of execution, command support, and features, and hopefully scheduling speed too (not sure about the latter)? The squeue output shows every node as a separate line, so a 32-node job takes up 32 lines instead of 1. This just makes it harder to monitor jobs - not a technical issue, just a QoL one :)
  • The job logs are created as slurm-{job-id}-train_app-{node-id}.out files (per node) plus a single slurm-{job-id}.out. Normally our jobs instead have logs of the form {job-id}-{node-id}.out and {job-id}-{node-id}.err (per node); separating stderr from stdout makes it easier to find which machine actually crashed. I'm also not sure what slurm-{job-id}.out corresponds to - maybe it's a consequence of the heterogeneous jobs? With torchelastic, it becomes harder to debug which node crashed, since every node logs a crash (so grepping for Traceback returns every log file instead of just the node that originally crashed) - maybe there is a way to figure this out and I just don't know what to look for?
  • The global_rank is not equal to local_rank + node_id * gpus_per_node, e.g. global rank 0 can end up on node 3.
  • Automatically set nomem on pcluster.
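
For the first two points, here's a minimal sketch of the kind of launcher helper this currently requires (the experiment directory layout and the exact torchx run arguments are illustrative assumptions, not actual FAIR tooling):

```python
import os
import shutil
import subprocess
import sys


def launch_experiment(exp_name: str, component_args: list) -> None:
    """Copy the current code into a per-experiment directory, cd into it,
    and launch via torchx so outputs and code are isolated per experiment."""
    # Hypothetical experiment root; the real layout is site-specific.
    exp_dir = os.path.join("/checkpoint", os.environ["USER"], "experiments", exp_name)
    code_dir = os.path.join(exp_dir, "code")
    os.makedirs(exp_dir, exist_ok=True)

    # Snapshot the code for reproducibility (assumes we run this from the repo root).
    shutil.copytree(os.getcwd(), code_dir, dirs_exist_ok=True)

    # Launch from inside the snapshot so SLURM output files land under the experiment dir.
    subprocess.run(
        ["torchx", "run", "--scheduler", "slurm"] + component_args,
        cwd=code_dir,
        check=True,
    )


if __name__ == "__main__":
    launch_experiment(sys.argv[1], sys.argv[2:])
```
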
@d4l3k
Member

d4l3k commented Mar 4, 2022

adding this support for slurm wouldn't be too bad:

  1. generalize the workspace file logic from docker_workspace (.torchxignore)
  2. add a job_dir argument to allow specifying an isolation env
  3. change launch code to cp + chdir
  4. add some statefile (.torchxjobdirs) so torchx log knows where to find logs for slurm

Something like job_dir we could relatively easily extend to local_cwd and local_docker -- it's more complex for k8s/batch/ray.
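
A rough sketch of steps 2-4 (the helper name and the .torchxjobdirs format are hypothetical; the actual implementation in #416 may differ):

```python
import json
import os
import shutil


def prepare_job_dir(job_dir: str, workspace: str, job_id: str) -> str:
    """Copy the workspace into an isolated per-job directory (cp + chdir) and
    record it in a .torchxjobdirs statefile so torchx log can find the logs later."""
    dst = os.path.join(job_dir, job_id)
    # Step 3: cp the workspace; .torchxignore filtering (step 1) is omitted here.
    shutil.copytree(workspace, dst)

    # Step 4: a hypothetical statefile format mapping job ids to their job dirs.
    statefile = os.path.join(job_dir, ".torchxjobdirs")
    state = {}
    if os.path.exists(statefile):
        with open(statefile) as f:
            state = json.load(f)
    state[job_id] = dst
    with open(statefile, "w") as f:
        json.dump(state, f, indent=2)

    # Launch from inside the isolated copy so slurm output files land there.
    os.chdir(dst)
    return dst
```
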

d4l3k self-assigned this Mar 4, 2022
@d4l3k
Member

d4l3k commented Mar 7, 2022

For the heterogeneous jobs displaying differently, that's tricky in the current model. Macros like replica_id generally need to be applied on a per-worker basis. If we wrap the app in a runtime, it does allow us to materialize those later, though it adds an extra dependency. Slurm virtualenv/conda will have TorchX installed anyway in most cases, so that's not necessarily a blocker, but it changes the model from what we've had so far.

https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L138

I did look, but it doesn't appear that sacct/squeue has a way to hide child jobs. You can use torchx status, so we could add a torchx queue method to render this better for all schedulers.
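
As an illustration, a torchx queue-style view could collapse heterogeneous components into one line per job; this sketch assumes squeue reports components as <het_job_id>+<offset> and is not an existing torchx command:

```python
import subprocess
from collections import defaultdict


def squeue_grouped() -> None:
    """Collapse heterogeneous job components (e.g. 1234+0, 1234+1, ...) into a
    single line per job with a component count."""
    out = subprocess.run(
        ["squeue", "-h", "-o", "%i|%j|%T"],  # job id | job name | state
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    groups = defaultdict(list)
    for line in out.splitlines():
        job_id, name, state = line.split("|")
        base = job_id.split("+")[0]  # strip the heterogeneous component offset
        groups[base].append((name, state))
    for base, comps in groups.items():
        name, state = comps[0]
        print(f"{base}  {name}  {state}  ({len(comps)} components)")
```
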

@mannatsingh
Author

You can use torchx status so we could add a torchx queue method to render this better for all schedulers

I think it's hard to see us migrating to torchx status - squeue gives the status of all jobs, which is what I normally check. torchx wouldn't even be aware of all the jobs being run (since they might have been queued outside of torchx). Even if it were, that's introducing a new workflow, which I'd imagine most people would want to avoid (unless it gave them some benefit).

@kiukchung
Contributor

re: The job logs are created in per node files

#412 makes it so that when running with dist.ddp the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line. So you'd see something akin to this:

[screenshot: node log lines prefixed with the producing worker's local_rank]
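
For reference, a small sketch of filtering one worker's lines out of a combined node log; the [<local_rank>]: prefix format and the example log file name are assumptions based on the description above:

```python
def worker_lines(log_path: str, local_rank: int):
    """Yield only the lines produced by one worker from a combined node log,
    relying on the [<local_rank>]: prefix added by --tee."""
    prefix = f"[{local_rank}]:"
    with open(log_path) as f:
        for line in f:
            if line.startswith(prefix):
                yield line.rstrip("\n")


# Example: print worker 3's output from node 0's combined log file.
for line in worker_lines("slurm-1234-train_app-0.out", 3):
    print(line)
```
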

facebook-github-bot pushed a commit that referenced this issue Mar 9, 2022
…the worker's stdout and stderr streams (#412)

Summary:
Pull Request resolved: #412

Addresses the QOL issue around SLURM logs mentioned in #405

TL;DR - since torchx launches nodes (not tasks) in SLURM, the stdout and stderr logs are combined for all 8 workers on the node (versus having separate ones for each worker when launched as task). This makes `dist.ddp` set `--tee=3` flag to torchelastic which prefixes each line of stderr and stdout of the workers with the local_rank of that worker so that the user can easily grep out the logs for a particular worker.

Reviewed By: d4l3k

Differential Revision: D34726681

fbshipit-source-id: 937f472d702e195f7344a042db69ab5ac1c1e900
facebook-github-bot pushed a commit that referenced this issue Mar 9, 2022
…d of having everything redirected to stdout (#414)

Summary:
Pull Request resolved: #414

Addresses the log-related QoL issue mentioned in #405.

Split the stdout and stderr into two separate log files and make adjustments to the slurm_scheduler.log_iter implementation to support this.

Reviewed By: d4l3k

Differential Revision: D34729144

fbshipit-source-id: 1505d265f8a779b1ad3499cd42ff01858a47fc4e
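
A loose sketch of reading the split streams; the per-replica file naming and the function signature here are hypothetical, not the actual slurm_scheduler.log_iter API:

```python
import os
from typing import Iterator


def iter_log(job_dir: str, job_id: str, replica_id: int, stream: str = "stdout") -> Iterator[str]:
    """Read the per-replica log for the requested stream now that stdout and
    stderr are written to separate files."""
    ext = "out" if stream == "stdout" else "err"
    # Hypothetical naming; the real slurm_scheduler file names may differ.
    path = os.path.join(job_dir, f"{job_id}-{replica_id}.{ext}")
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")
```
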
@mannatsingh
Author

#412 makes it so that when running with dist.ddp the node stdout and stderr log lines are prefixed with the local_rank of the worker that produced that line...

We need to work with the lightning team to make sure that the ranks displayed here match the ones used in lightning, which isn't guaranteed to be the case right now, as @kiukchung and I discovered the other day.

d4l3k added the slurm (slurm scheduler) label Apr 8, 2022
facebook-github-bot pushed a commit that referenced this issue Apr 15, 2022
Summary:
This automatically infers `nomem` by using `sinfo` to get the `Memory` setting for the specified partition. If we can't determine the memory setting, or it's <= 1000 MB, we disable memory allocation checks.

 #405

Pull Request resolved: #461

Test Plan:
CI
Slurm integration tests w/ `nomem` removed

Reviewed By: kiukchung

Differential Revision: D35678067

Pulled By: d4l3k

fbshipit-source-id: 91fbd15339f4e0ddf495da8232ccc451f01488ae
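
A back-of-the-envelope sketch of that inference, assuming sinfo's %m format (memory per node in MB) and a hypothetical helper name; the actual implementation in #461 may differ:

```python
import subprocess


def should_set_nomem(partition: str) -> bool:
    """Return True if memory allocation checks should be disabled: either we
    can't determine the partition's Memory setting or it is <= 1000 MB."""
    try:
        out = subprocess.run(
            ["sinfo", "-h", "--partition", partition, "-o", "%m"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
        # %m prints memory per node in megabytes, e.g. "1" or "500000+".
        mem_mb = int(out.splitlines()[0].rstrip("+"))
    except (subprocess.CalledProcessError, ValueError, IndexError, FileNotFoundError):
        return True  # can't determine the memory setting
    return mem_mb <= 1000
```
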