I have been dealing with a strange submitit problem that I am having trouble understanding. All jobs I launch through submitit die after 7-10 hours without an error. This only happens on our cluster running Slurm 19.05; on a different cluster running Slurm 20.11, the jobs run fine for the entire allotted time. Are there specific Slurm settings that submitit needs in order to work? Is submitit incompatible with Slurm 19.05? Note that the problem is specific to jobs launched on Slurm through submitit: I can launch sbatch jobs manually just fine, and srun also works on my cluster.
Here is a minimal reproducible example:
launch_script:
import submitit

# Slurm options passed through to sbatch
slurm_additional_parameters = {
    "partition": "russ_reserved",
    "time": "3-00:00:00",
    "gpus": 1,
    "cpus_per_gpu": 20,
    "mem": 62,
}


def test():
    # Busy-loop forever, so the job only ends when Slurm stops it
    while True:
        pass


# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="test_cluster_log")
# set the job name and pass the Slurm options above to the executor
slurm_additional_parameters["job_name"] = "test_cluster"
executor.update_parameters(slurm_additional_parameters=slurm_additional_parameters)
job = executor.submit(test)  # submit test() as a Slurm job
print(job.job_id)  # ID of your job
output:
slurmstepd: error: *** STEP 250338.0 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 250338 ON matrix-2-1 CANCELLED AT 2024-02-10T03:41:37 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
submitit WARNING (2024-02-10 03:41:37,635) - Bypassing signal SIGCONT
submitit WARNING (2024-02-10 03:41:37,636) - Bypassing signal SIGTERM
submitit version: 1.5.1
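To narrow down when the job actually stops, the busy loop in test() could be swapped for a function that prints a timestamped heartbeat. The following is only a sketch for diagnosis, not a fix: heartbeat_test, interval_s and max_beats are hypothetical names that are not part of submitit, and it assumes that whatever the job prints ends up in the job's log under the test_cluster_log folder.

```python
import datetime
import time


def heartbeat_test(interval_s=60.0, max_beats=None):
    """Print a timestamped line every interval_s seconds.

    With max_beats=None the loop never ends, like the original test().
    A finite max_beats makes the function easy to try locally.
    """
    beats = 0
    start = time.monotonic()
    while max_beats is None or beats < max_beats:
        elapsed = time.monotonic() - start
        now = datetime.datetime.now().isoformat(timespec="seconds")
        # flush=True so the line is written out even if the job is killed later
        print(f"{now} alive, elapsed {elapsed:.0f}s", flush=True)
        beats += 1
        time.sleep(interval_s)
    return beats
```

Submitting it with executor.submit(heartbeat_test) in place of executor.submit(test) should leave a last heartbeat line that can be compared with the CANCELLED timestamp in the slurmstepd message and with the slurmctld log that message points to.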