SLURM Job keeps running after Successful Job Completion (Hydra Submitit Plugin) #1731

subho406 · 2023-03-09T22:08:18Z

Hi,

I am using the Hydra submitit plugin to schedule sweep jobs on the Compute Canada cluster, with the following config:

defaults:
  - _self_
  - override hydra/launcher: submitit_slurm


tags: null
project_name: "test"
seed: 1
steps: 5000000
log_interval: 10000
trainer:
  rollout_len: 256
  num_envs: 8
eval_interval: null
task:
  num_distractors: 6
use_wandb: True

hydra:
  mode: MULTIRUN
  launcher:
    setup:
      - export WANDB_MODE=offline
    account: test
    cpus_per_task: 8
    mem_gb: 5
    timeout_min: 300

  sweeper:
    params:
      trainer/seq_model: lstm, gru
      trainer.optimizer.learning_rate: 0.05, 0.01
      seed: 2,3,4,5

My jobs run successfully and finish well before the specified timeout (5 hours). However, the SLURM job keeps running even though the training process has exited. I checked trainer.log, and it looks like submitit is ignoring the SIGTERM signal:

[2023-03-09 10:55:49,583][submitit][INFO] - Job completed successfully
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,585][submitit][WARNING] - Bypassing signal SIGTERM
[2023-03-09 10:55:49,586][submitit][WARNING] - Bypassing signal SIGTERM
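
For readers unfamiliar with what "bypassing" a signal means, here is a minimal Python sketch of a handler that logs SIGTERM and returns without exiting. It is an illustration only and not submitit's actual code; the names `bypass_handler`, `received`, and `demo` are hypothetical. It assumes a POSIX system, where a process can send a signal to itself with `os.kill`.

```python
import logging
import os
import signal
import time

logger = logging.getLogger("sigterm_demo")  # hypothetical logger name
received = []  # names of the signals the handler has seen


def bypass_handler(signum, frame):
    # Log the signal and return without exiting, so the process survives it.
    name = signal.Signals(signum).name
    logger.warning("Bypassing signal %s", name)
    received.append(name)


def demo():
    """Install the handler, send SIGTERM to this process, restore the old handler."""
    previous = signal.signal(signal.SIGTERM, bypass_handler)
    try:
        os.kill(os.getpid(), signal.SIGTERM)
        time.sleep(0.05)  # give the interpreter a moment to run the handler
    finally:
        signal.signal(signal.SIGTERM, previous)
    return list(received)
```

With a handler like this installed, a SIGTERM alone does not end the process, which would be consistent with the repeated warnings in the log above.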

I'm not sure whether this is a bug. Is there a way to have the SLURM jobs terminated as soon as the job completes successfully, rather than at the timeout? That would free up a lot of resources for other jobs in the queue.
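
As a stopgap, lingering jobs can be cancelled by hand with SLURM's `scancel` command. The sketch below wraps that in Python; it is an assumption-laden workaround, not a fix for the underlying behavior. The helper names `scancel_command` and `cancel_jobs` and the job IDs in the usage note are hypothetical, and it defaults to a dry run so nothing is cancelled by accident.

```python
import shlex
import subprocess


def scancel_command(job_ids):
    """Build an `scancel` argument list for the given SLURM job IDs."""
    ids = [str(job_id) for job_id in job_ids]
    if not ids:
        raise ValueError("no job IDs given")
    for job_id in ids:
        # Accept plain IDs (12345) and array-task IDs (12345_3) only.
        if not job_id.replace("_", "").isdigit():
            raise ValueError(f"unexpected job ID: {job_id!r}")
    return ["scancel", *ids]


def cancel_jobs(job_ids, dry_run=True):
    """Return the command line; actually run it only when dry_run is False."""
    cmd = scancel_command(job_ids)
    if not dry_run:
        subprocess.run(cmd, check=True)  # requires scancel on PATH
    return shlex.join(cmd)
```

For example, `cancel_jobs([12345, "12346_2"])` only returns the command string to inspect, and passing `dry_run=False` runs it. Confirm from the job's log that the run has really finished before cancelling.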

System Information:

Linux cedar1.cedar.computecanada.ca 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022 x86_64 GNU/Linux


terrykong · 2023-03-14T22:17:25Z

FWIW, I am seeing this as well using the local launcher: hydra/launcher=submitit_local

nikhilxb · 2023-11-03T17:40:05Z

Resolved in #1677.
