Hi everyone,
I might have run into a bit of an edge case.
I use the hydra-submitit plugin to launch multiple tasks on a SLURM cluster.
The configuration works, and the generated SLURM submission scripts are exactly what I want.
However, it seems like the submitit entry point (submitit/core/_submit.py) immediately starts waiting on signals from the launched SLURM job.
On my HPC setup the jobs are launched on a physically different machine, so any signal communication will not work.
The SLURM job does actually launch, but it is killed after around 2 seconds.
Is there a way to turn off signal handling and essentially use submitit as just a submission-script generator + launcher?
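To make the intent concrete, here is a minimal fire-and-forget sketch with plain submitit (no hydra plugin; `my_fn` and the folder name are placeholders). As far as I can tell, the local process only blocks once `job.result()` is called, so skipping that call would leave submitit acting as a script generator + launcher:

```python
# Minimal fire-and-forget sketch (plain submitit, no hydra plugin).
# my_fn and the folder are placeholders. As far as I can tell, the local
# process only blocks once job.result() is called, so we never call it.
import submitit

def my_fn(x):
    return x * 2

executor = submitit.AutoExecutor(folder="log_submitit")
executor.update_parameters(timeout_min=60)

job = executor.submit(my_fn, 3)  # writes the sbatch script and submits it
print(job.job_id)                # done: no waiting on signals or results
```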
The signal errors usually look like this:
File "/home/hpc/.../submitit/core/core.py", line 289, in results
outcome, result = self._get_outcome_and_result()
File "/home/hpc/.../submitit/core/core.py", line 384, in _get_outcome_and_result
raise utils.UncompletedJobError("\n".join(message))
submitit.core.utils.UncompletedJobError: Job 742754_0 (task: 0) with path /home/hpc/.../.submitit/742754_0/742754_0_0_result.pkl
has not produced any output (state: FAILED)
Error stream produced:
----------------------------------------
srun: error: tg064: task 0: User defined signal 2
srun: launch/slurm: _step_signal: Terminating StepId=742755.0
launch script:
```bash
#!/bin/bash

# note: i removed error and output for brevity

# Parameters
#SBATCH --array=0-1%2
#SBATCH --gres=gpu:1
#SBATCH --job-name=_implementations
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --signal=USR2@120
#SBATCH --time=2
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered python3 -u -m submitit.core._submit /home/hpc/.../.submitit/%j
```
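Possibly related: if I read the generated script correctly, `--signal=USR2@120` asks SLURM to deliver USR2 120 seconds before the time limit, and `--time=2` means 2 minutes, so the signal would be due almost immediately after the job starts, which might explain the ~2 second kill. A hedged sketch of the knobs that I believe control these lines in plain submitit (parameter names as I understand them from the submitit docs; values are illustrative only):

```python
# Hedged sketch: the parameters that, as I understand the docs, generate
# the "--time" and "--signal=USR2@<delay>" lines above. With time=2
# (minutes) and signal_delay_s=120, USR2 would be due right at job start.
# The values below are illustrative only.
import submitit

executor = submitit.SlurmExecutor(folder="log_submitit")
executor.update_parameters(
    time=60,            # minutes, i.e. --time=60
    signal_delay_s=30,  # i.e. --signal=USR2@30
)
```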
TIA
However, the main process (in my case the hydra-submitit plugin), which is responsible for launching the jobs, never exits, even after the submitted jobs have long since completed.
Right now I just kill the main process. I haven't run into any problems with this approach yet, but I would be interested to know if there is a better solution.
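For what it's worth, a hedged sketch of an alternative in plain submitit (which may not map 1:1 onto the hydra plugin): poll the jobs without blocking and exit once they are done, rather than letting submitit wait on the result pickles. `my_fn` and the inputs are placeholders:

```python
# Hedged sketch (plain submitit, not the hydra plugin): poll job state
# without blocking, then exit, instead of waiting on the result pickles.
# my_fn and the inputs are placeholders.
import time
import submitit

def my_fn(x):
    return x * 2

executor = submitit.AutoExecutor(folder="log_submitit")
jobs = [executor.submit(my_fn, x) for x in range(4)]

while not all(job.done() for job in jobs):  # done() checks state, no blocking
    time.sleep(30)
print("all jobs finished; exiting without reading results")
```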