[torchx/schedulers] Add more runopts for SLURM #389

Closed
kiukchung opened this issue Feb 12, 2022 · 2 comments

kiukchung commented Feb 12, 2022

Description

Currently only a handful of SLURM options are exposed as runopts. This issue asks for more of them.

Motivation/Background

FAIR users typically set these configs: https://github.com/facebookresearch/pycls/blob/8c79a8e2adfffa7cae3a88aace28ef45e52aa7e5/pycls/core/distributed.py#L120-L130

Some of them can be set via the AppDef (especially those related to resources: mem, gpu, cpu, etc.). Others, like "email", either need to be offered directly as a runopt or need a more dynamic way to be passed in (see the detailed proposal).

Detailed Proposal

Either:

  1. keep adding user-requested sbatch options on an as-needed basis
  2. support dynamic key-value pairs ("--cfg sbatch_options=k:v,k:v,k:v")
  3. support SLURM-specific options via appdef.metadata (we do this for our internal schedulers, where users can set thrift fields, as JSON, directly from the metadata)
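
To make option 2 concrete, here is a minimal sketch of how a `k:v,k:v` string could be parsed and turned into long-form sbatch flags. The function names `parse_sbatch_options` and `to_sbatch_args` are hypothetical and not part of torchx; the sketch assumes every key is a valid long-form sbatch option name and does no validation of that.

```python
from typing import Dict, List


def parse_sbatch_options(cfg_value: str) -> Dict[str, str]:
    """Parse a "k:v,k:v" string into a dict (hypothetical helper)."""
    options: Dict[str, str] = {}
    for pair in cfg_value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled commas
        # Split on the first colon only, so values such as 1:00:00 survive.
        key, sep, value = pair.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected 'key:value', got {pair!r}")
        options[key.strip()] = value.strip()
    return options


def to_sbatch_args(options: Dict[str, str]) -> List[str]:
    """Render each pair as a long-form flag, e.g. --partition=dev."""
    return [f"--{key}={value}" for key, value in options.items()]


if __name__ == "__main__":
    opts = parse_sbatch_options("partition:dev,time:1:00:00,comment:test")
    print(to_sbatch_args(opts))
```

One limitation of this format: a value that itself contains a comma (for example a list of mail types) cannot be expressed without an escaping rule, which is one argument for the JSON-in-metadata approach in option 3.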

Alternatives

(discussed in the proposal above)

Additional context/links

N/A

kiukchung added this to the 0.1.2 release milestone on Feb 12, 2022
@mannatsingh

It seems that submitit also takes route 2 for additional params: https://github.com/facebookincubator/submitit/blob/e37899bce0c7c58e3cc46ecb5b7fa8ce941fc3d7/submitit/slurm/slurm.py#L438.

For FAIR's use cases, I think partition, timeout, comments and constraints (constraints decide which machines are returned) are required.
We also use email as part of our workflows. It isn't strictly necessary, but it would be really helpful to have!

d4l3k commented Feb 15, 2022

We already have partition and timeout.

#391 adds comments, constraints and email
