SLURM Job keeps running after Successful Job Completion (Hydra Submitit Plugin) #1731

Open
subho406 opened this issue Mar 9, 2023 · 2 comments

subho406 commented Mar 9, 2023

Hi,

I am using the Hydra submitit plugin to schedule sweep jobs on the Compute Canada cluster. I use the following config to schedule the sweeps:

defaults:
  - _self_
  - override hydra/launcher: submitit_slurm


tags: null
project_name: "test"
seed: 1
steps: 5000000
log_interval: 10000
trainer:
  rollout_len: 256
  num_envs: 8
eval_interval: null
task:
  num_distractors: 6
use_wandb: True

hydra:
  mode: MULTIRUN
  launcher:
    setup:
      - export WANDB_MODE=offline
    account: test
    cpus_per_task: 8
    mem_gb: 5
    timeout_min: 300

  sweeper:
    params:
      trainer/seq_model: lstm, gru
      trainer.optimizer.learning_rate: 0.05, 0.01
      seed: 2,3,4,5
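
For reference, the sweeper params above expand to a Cartesian product of the listed values, assuming Hydra's default basic sweeper (the config shown does not select another one). A minimal Python sketch that enumerates the resulting override sets; `sweep_params` and `expand_sweep` are illustrative names, not part of Hydra:

```python
from itertools import product

# Sweep parameters copied from the config in this issue.
sweep_params = {
    "trainer/seq_model": ["lstm", "gru"],
    "trainer.optimizer.learning_rate": ["0.05", "0.01"],
    "seed": ["2", "3", "4", "5"],
}


def expand_sweep(params):
    """Return one list of 'key=value' overrides per combination of values."""
    keys = list(params)
    return [
        [f"{key}={value}" for key, value in zip(keys, values)]
        for values in product(*(params[key] for key in keys))
    ]


jobs = expand_sweep(sweep_params)
print(len(jobs))  # 2 * 2 * 4 = 16 combinations
print(jobs[0])
```

Each launch therefore submits 16 jobs, so any job that lingers until the 300-minute timeout is multiplied across the whole sweep.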

My jobs run successfully and finish before the specified timeout (300 minutes, i.e. 5 hours). However, it seems like the SLURM job keeps running even though the process has exited. I checked trainer.log, and it seems like submitit is ignoring the SIGTERM signal:

[2023-03-09 10:55:49,583][submitit][INFO] - Job completed successfully
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,586][submitit][WARNING] - Bypassing signal SIGTERM
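
The warnings above indicate that the process received SIGTERM and did not terminate. As an illustration only, the sketch below shows the general pattern of a handler that logs a signal and carries on. It is not submitit's actual code, and `demo` and `bypass_handler` are hypothetical names:

```python
import os
import signal


def demo():
    """Install a handler that only logs SIGTERM, then send SIGTERM to ourselves."""
    received = []

    def bypass_handler(signum, frame):
        # Record the signal and return, so the process keeps running.
        name = signal.Signals(signum).name
        received.append(name)
        print(f"Bypassing signal {name}")

    previous = signal.signal(signal.SIGTERM, bypass_handler)
    try:
        # With the default handler this would terminate the process.
        os.kill(os.getpid(), signal.SIGTERM)
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(signal.SIGTERM, previous)
    return received


print(demo())
```

A process that handles SIGTERM this way only stops when it exits on its own or receives a signal it cannot catch, such as SIGKILL.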

I'm not sure if this is a bug. Is there a way for the SLURM jobs to be killed right after successful completion, rather than lingering until the timeout? This would free up a lot of resources for other jobs in the queue.
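
As a diagnostic aid, not a fix, one could scan the job logs for the pattern shown above to spot runs that have finished but may still be holding an allocation. The sketch below assumes the log line format from the excerpt in this issue; `summarize_log` is a hypothetical helper:

```python
import re

COMPLETED = "Job completed successfully"
BYPASS = re.compile(r"Bypassing signal (\w+)")


def summarize_log(text):
    """Return (completed, signals bypassed after the completion line)."""
    completed = False
    bypassed = []
    for line in text.splitlines():
        if COMPLETED in line:
            completed = True
        elif completed:
            match = BYPASS.search(line)
            if match:
                bypassed.append(match.group(1))
    return completed, bypassed


# Sample lines copied from the log excerpt in this issue.
sample = """\
[2023-03-09 10:55:49,583][submitit][INFO] - Job completed successfully
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
"""
completed, bypassed = summarize_log(sample)
print(completed, len(bypassed))
```

A job whose log reports completion but which still shows up in `squeue` is a candidate for a manual `scancel`.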

System Information:

Linux cedar1.cedar.computecanada.ca 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022 x86_64 GNU/Linux

@terrykong

FWIW, I am seeing this as well using the local launcher: hydra/launcher=submitit_local

nikhilxb commented Nov 3, 2023

Resolved in #1677.
