Is it possible for SLURM auto submit to work on DP? #1456
-
In my experience, it never works. I looked at the trainer code and saw that the logic managing this only runs under DDP:

```python
def configure_slurm_ddp(self, num_gpu_nodes):
    self.is_slurm_managing_tasks = False

    ### !!HERE!!
    if self.use_ddp:
        self.num_requested_gpus = self.num_gpus * num_gpu_nodes
        self.num_slurm_tasks = 0
        try:
            self.num_slurm_tasks = int(os.environ['SLURM_NTASKS'])
            self.is_slurm_managing_tasks = self.num_slurm_tasks == self.num_requested_gpus

            # in interactive mode we don't manage tasks
            job_name = os.environ['SLURM_JOB_NAME']
            if job_name == 'bash':
                self.is_slurm_managing_tasks = False
        except Exception:
            # likely not on slurm, so set the slurm managed flag to false
            self.is_slurm_managing_tasks = False
```

However, sometimes we are not using distributed training on SLURM (only DP). It would be nice if the auto-resubmit feature still worked in that situation.
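For illustration, here is a minimal sketch of the direction I have in mind, written as a standalone function rather than the trainer method. It assumes a `use_dp` flag exists alongside `use_ddp`, and that under DP SLURM launches one task per node (since a single process drives all of a node's GPUs) rather than one task per GPU; both assumptions would need checking against the trainer internals before a real PR.

```python
import os

def slurm_is_managing_tasks(use_ddp, use_dp, num_gpus, num_gpu_nodes):
    """Return True when SLURM launched the expected number of tasks.

    Hypothetical standalone version of the check above, extended to DP.
    """
    if not (use_ddp or use_dp):
        return False

    # DDP expects one SLURM task per GPU; DP drives all of a node's
    # GPUs from a single process, so it expects one task per node.
    num_requested_tasks = num_gpu_nodes if use_dp else num_gpus * num_gpu_nodes

    try:
        num_slurm_tasks = int(os.environ['SLURM_NTASKS'])
    except (KeyError, ValueError):
        # likely not running under SLURM
        return False

    # in interactive mode (e.g. an srun ... bash session) we don't manage tasks
    if os.environ.get('SLURM_JOB_NAME') == 'bash':
        return False

    return num_slurm_tasks == num_requested_tasks
```

With something like this, a DP job submitted with `#SBATCH --ntasks-per-node=1` would presumably be detected as SLURM-managed, so the auto-resubmit signal handling could kick in.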
Replies: 3 comments
-
good point. i think auto resubmit should work no matter how training is happening so long as it detects slurm. mind submitting a PR?
-
I'll try to work on it!
-
actually, lightning supports slurm no matter what backend you use...