Is it possible for SLURM auto submit to work on DP? #1456
-
In my experience, it never works. I looked at the trainer code and saw that the logic managing this only runs under DDP:

```python
def configure_slurm_ddp(self, num_gpu_nodes):
    self.is_slurm_managing_tasks = False

    ### !!HERE!!
    if self.use_ddp:
        self.num_requested_gpus = self.num_gpus * num_gpu_nodes
        self.num_slurm_tasks = 0
        try:
            self.num_slurm_tasks = int(os.environ['SLURM_NTASKS'])
            self.is_slurm_managing_tasks = self.num_slurm_tasks == self.num_requested_gpus

            # in interactive mode we don't manage tasks
            job_name = os.environ['SLURM_JOB_NAME']
            if job_name == 'bash':
                self.is_slurm_managing_tasks = False
        except Exception:
            # likely not on slurm, so set the slurm managed flag to false
            self.is_slurm_managing_tasks = False
```

However, sometimes we are not using distributed training on SLURM (only DP). It would be nice if the auto-resubmit feature still worked in that situation.
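For illustration, here is a minimal sketch of the direction I have in mind, written as a standalone function rather than the trainer method. It assumes a `use_dp` flag exists alongside `use_ddp`, and that under DP SLURM launches one task per node (since a single process drives all of a node's GPUs) rather than one task per GPU; both assumptions would need checking against the trainer internals before a real PR.

```python
import os

def slurm_is_managing_tasks(use_ddp, use_dp, num_gpus, num_gpu_nodes):
    """Return True when SLURM launched the expected number of tasks.

    Hypothetical standalone version of the check above, extended to DP.
    """
    if not (use_ddp or use_dp):
        return False

    # DDP expects one SLURM task per GPU; DP drives all of a node's
    # GPUs from a single process, so it expects one task per node.
    num_requested_tasks = num_gpu_nodes if use_dp else num_gpus * num_gpu_nodes

    try:
        num_slurm_tasks = int(os.environ['SLURM_NTASKS'])
    except (KeyError, ValueError):
        # likely not running under SLURM
        return False

    # in interactive mode (e.g. an srun ... bash session) we don't manage tasks
    if os.environ.get('SLURM_JOB_NAME') == 'bash':
        return False

    return num_slurm_tasks == num_requested_tasks
```

With something like this, a DP job submitted with `#SBATCH --ntasks-per-node=1` would presumably be detected as SLURM-managed, so the auto-resubmit signal handling could kick in.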
Replies: 3 comments
-
good point. i think auto resubmit should work no matter how training is happening so long as it detects slurm. mind submitting a PR?
-
I'll try to work on it!
-
actually, lightning supports slurm no matter what backend you use...