Turn off Signal Handling #1760

Open
lukasbm opened this issue Jan 15, 2024 · 2 comments

Comments

lukasbm commented Jan 15, 2024

Hi everyone,
I might have run into a bit of an edge case.
I use the hydra-submitit plugin to launch multiple tasks on a SLURM cluster.
The configuration works, and the generated SLURM submission scripts are exactly what I want.

However, it seems that the submitit entry point (submitit/core/_submit.py) immediately starts waiting on signals from the launched SLURM job.
On my HPC setup the jobs are launched on a physically different machine, so any signal communication will not work.
The SLURM job does actually launch, but it is killed after around 2 seconds.
Is there a way to turn off signal handling and use submitit purely as a submission script generator and launcher?

The signal errors usually look like this:

  File "/home/hpc/.../submitit/core/core.py", line 289, in results
    outcome, result = self._get_outcome_and_result()
  File "/home/hpc/.../submitit/core/core.py", line 384, in _get_outcome_and_result
    raise utils.UncompletedJobError("\n".join(message))
submitit.core.utils.UncompletedJobError: Job 742754_0 (task: 0) with path /home/hpc/.../.submitit/742754_0/742754_0_0_result.pkl
has not produced any output (state: FAILED)
Error stream produced:
----------------------------------------
srun: error: tg064: task 0: User defined signal 2
srun: launch/slurm: _step_signal: Terminating StepId=742755.0

launch script:

#!/bin/bash

# note: I removed error and output for brevity

# Parameters
#SBATCH --array=0-1%2
#SBATCH --gres=gpu:1
#SBATCH --job-name=_implementations
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --signal=USR2@120
#SBATCH --time=2
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered python3 -u -m submitit.core._submit /home/hpc/.../.submitit/%j

TIA
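
One thing that may be worth checking, offered as an assumption rather than a diagnosis: per the sbatch documentation, `--signal=USR2@120` asks SLURM to send SIGUSR2 once the job is within 120 seconds of its end time, and a bare number for `--time` is read as minutes, so `--time=2` is a 2-minute limit. Under that reading the warning window covers the whole time limit, which would make the signal due as soon as the job starts. The helper name `seconds_until_signal` below is hypothetical and only does the arithmetic:

```python
def seconds_until_signal(time_limit_min: int, signal_lead_s: int) -> int:
    """Seconds of runtime before SLURM is due to send the warning signal.

    Assumes --time is given in whole minutes and --signal=SIG@N means
    "send SIG when the job is within N seconds of its end time".
    """
    runtime_s = time_limit_min * 60
    # A lead time longer than the limit cannot push the signal before the start.
    return max(0, runtime_s - signal_lead_s)


# Values taken from the launch script above: --time=2 and --signal=USR2@120.
print(seconds_until_signal(time_limit_min=2, signal_lead_s=120))
# A longer limit, for comparison (60 minutes is an illustrative value).
print(seconds_until_signal(time_limit_min=60, signal_lead_s=120))
```

If this is the cause, raising the time limit or shrinking the signal lead time should change when the job is interrupted, without touching any signal handlers.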

lukasbm changed the title from "Turn of Signal Handling" to "Turn off Signal Handling" on Jan 15, 2024

lukasbm commented Jan 16, 2024

OK, so I got the submitted jobs themselves to work by ignoring most signals.
I did this by adding the following to the job's code:

import signal


def setup_signal_handlers() -> None:
    """Ignore the signals that were interrupting the job."""
    # SIG_IGN tells the OS to discard the signal instead of running a handler.
    signal.signal(signal.SIGUSR1, signal.SIG_IGN)  # ignore SIGUSR1
    signal.signal(signal.SIGUSR2, signal.SIG_IGN)  # ignore SIGUSR2
    signal.signal(signal.SIGCONT, signal.SIG_IGN)  # ignore SIGCONT
    signal.signal(signal.SIGTERM, signal.SIG_IGN)  # ignore SIGTERM
    signal.signal(signal.SIGHUP, signal.SIG_IGN)  # ignore SIGHUP
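
As a quick local check of what this workaround does, here is a minimal sketch using only the Python standard library on a POSIX system. It ignores SIGUSR2, sends that signal to the current process, and confirms the process is still running. The helper names `ignore_signals` and `survives_sigusr2` are hypothetical, and the sketch assumes it runs in the main thread, where Python allows handlers to be changed:

```python
import os
import signal


def ignore_signals(signums):
    """Set each signal to SIG_IGN and return the handlers that were replaced."""
    previous = {}
    for signum in signums:
        previous[signum] = signal.signal(signum, signal.SIG_IGN)
    return previous


def survives_sigusr2() -> bool:
    """Send SIGUSR2 to this process while it is ignored; return True if still alive."""
    previous = ignore_signals([signal.SIGUSR2])
    try:
        # With the default disposition this would terminate the process.
        os.kill(os.getpid(), signal.SIGUSR2)
        return True
    finally:
        # Restore the earlier handlers so the check leaves no side effects.
        for signum, handler in previous.items():
            if handler is not None:
                signal.signal(signum, handler)


if __name__ == "__main__":
    print("still running:", survives_sigusr2())
```

Note that ignoring SIGTERM as well means the process can only be stopped with SIGKILL, which gives it no chance to clean up.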

However, the main process (in my case the hydra-submitit plugin), which is responsible for launching the jobs, never exits,
even after the submitted jobs have long completed.

Right now I always just kill the main process. I haven't encountered any problems with this approach yet, though I would be interested to know if there's a better solution.
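
One alternative to killing the launcher by hand is to bound the wait with a deadline. The sketch below is an assumption-heavy illustration, not the Hydra launcher's behavior: it assumes you drive the submission from your own script and hold job objects that expose a `done()` method returning a bool. `JobLike` and `wait_with_deadline` are hypothetical names, and only the standard library is used:

```python
import time
from typing import Iterable, List, Protocol


class JobLike(Protocol):
    """Anything with a done() method; a stand-in for a real job handle."""

    def done(self) -> bool: ...


def wait_with_deadline(
    jobs: Iterable[JobLike], timeout_s: float, poll_s: float = 5.0
) -> List[JobLike]:
    """Poll jobs until all are done or timeout_s elapses.

    Returns the jobs that were still unfinished when the wait ended,
    so the caller can decide whether to exit, log, or keep waiting.
    """
    deadline = time.monotonic() + timeout_s
    pending = list(jobs)
    while pending:
        pending = [job for job in pending if not job.done()]
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    return pending
```

With this shape the launching script exits on its own once the deadline passes, and the returned list shows which jobs it gave up on.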


gil2rok commented Mar 8, 2024

Facing something similar
