Retry SLURM job submission #58
base: v1_18_bosco
Conversation
Under load, the SLURM scheduler is prone to barf on any client commands. Retry the job submission if it fails.
retcode=$?
retry=0
MAX_RETRY=3
until [ $retry -eq $MAX_RETRY ] ; do
It looks like the number of submission attempts is 3, but the number of retries is actually only 2, right?
The first attempt is retry=0; once retry=3, the loop condition is satisfied and the loop exits. That should be three tries, right?
Yeah, that's 3 tries total but only 2 retries, so we should either call the variable MAX_TRIES or bump the initial value of retry.
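For readability, a minimal sketch of the renaming option, assuming the submission sits inside the loop (the actual patch may structure it differently); only the try/MAX_TRIES names are new, everything else follows the existing diff:

try=0
MAX_TRIES=3
until [ $try -eq $MAX_TRIES ] ; do
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
    retcode=$?
    if [ $retcode -eq 0 ] ; then
        break
    fi
    try=$((try+1))   # POSIX arithmetic; the patch currently uses $[...]
    sleep 10
done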
jobID=`${slurm_binpath}/sbatch $bls_tmp_file` # actual submission
retcode=$?
retry=0
MAX_RETRY=3
Could we add this as a config variable, slurm_max_submit_retries, defaulting to 0, and reference it via ${slurm_max_submit_retries}?
OK
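A possible shape for that change, sketched under the assumption that slurm_max_submit_retries is loaded from blah.config the same way slurm_binpath already is; with the proposed default of 0 the script keeps today's single-attempt behaviour:

max_submit_retries=${slurm_max_submit_retries:-0}   # 0 retries unless the site configures more
retry=0
until [ $retry -gt $max_submit_retries ] ; do
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
    retcode=$?
    if [ $retcode -eq 0 ] ; then
        break
    fi
    retry=$((retry+1))
    sleep 10
done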
break
fi
retry=$[$retry+1]
sleep 10
Could we make the sleep back off exponentially?
Can do.
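A minimal sketch of that backoff, assuming the 10-second base delay from the current patch and the config variable proposed above; the delay doubles after each failed attempt:

delay=10
retry=0
until [ $retry -gt ${slurm_max_submit_retries:-0} ] ; do
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
    retcode=$?
    if [ $retcode -eq 0 ] ; then
        break
    fi
    retry=$((retry+1))
    sleep $delay
    delay=$((delay*2))   # 10s, 20s, 40s, ...
done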
Hi @PerilousApricot, we've had some discussions locally, and we actually want to disable job retries on the CE completely (https://opensciencegrid.atlassian.net/browse/SOFTWARE-3407), since pilots are just resource requests. Would this have a negative impact on a site, say getting dinged by the WLCG for failing pilots?
TBH, I don't know the WLCG effects, and I'm a little unsure of the context from the JIRA ticket as well.
The idea is that pilots are cheap resource requests that are fairly uniform, so if a CE fails to submit a pilot job to the batch system, it should just give up and wait for more pilots. I was wondering if pilot success/failure at sites is tracked closely.
Under load, the SLURM scheduler is prone to barf on any client commands.
Retry the job submission if it fails.
The retry count and delay should be exposed as configurable values instead of being hardcoded.