Labels
bug (Something isn't working), module: runner (issues related to the torchx.runner and torchx.scheduler modules), slurm (slurm scheduler)
Description
🐛 Bug
According to aws/aws-parallelcluster#2198, ParallelCluster (PCluster) has problems running jobs that set explicit memory requirements.
We need to modify our slurm scheduler to address this.
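A minimal sketch of one possible fix, assuming the scheduler builds its sbatch options from the role's resource spec: make the explicit memory request optional so it can be dropped on clusters that reject it. The `build_sbatch_opts` helper and `skip_mem` flag below are hypothetical names for illustration, not torchx's actual API:

```python
# Hypothetical sketch: build sbatch directives, omitting --mem when the
# cluster (e.g. ParallelCluster) cannot schedule jobs with explicit memory.
from typing import List, Optional


def build_sbatch_opts(
    cpus: int,
    gpus: int,
    mem_mb: Optional[int],
    skip_mem: bool = False,  # could be driven by a scheduler arg
) -> List[str]:
    opts = [f"--cpus-per-task={cpus}"]
    if gpus > 0:
        opts.append(f"--gpus-per-task={gpus}")
    # Only emit an explicit memory request when the cluster supports it.
    if mem_mb is not None and not skip_mem:
        opts.append(f"--mem={mem_mb}")
    return opts


# With skip_mem=True the --mem directive is dropped entirely:
print(build_sbatch_opts(cpus=8, gpus=1, mem_mb=16000, skip_mem=True))
# ['--cpus-per-task=8', '--gpus-per-task=1']
```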
Module (check all that apply):
- [ ] torchx.spec
- [ ] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [ ] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [ ] torchx.examples
- [ ] other
To Reproduce
Steps to reproduce the behavior:
- ssh to the slurm cluster
- create a main.py that prints "hello world"
- run `torchx run -s slurm --scheduler_args partition=compute,time=10 dist.ddp --script main.py`
Expected behavior
The job executes successfully.