Retry SLURM job submission #58
base: v1_18_bosco
Conversation
Under load, the SLURM scheduler is prone to barf on any client commands. Retry the job submission if it fails.
retcode=$?
retry=0
MAX_RETRY=3
until [ $retry -eq $MAX_RETRY ] ; do
It looks like the number of submission attempts is 3, but the number of retries is actually only 2, right?
The first attempt is retry=0; once retry=3, the loop condition is satisfied and the loop exits. That should be three tries, right?
Yeah, that's 3 tries total but only 2 retries, so we should either call the variable MAX_TRIES or bump the initial value of retry.
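For readability, a minimal sketch of the renaming option, assuming the submission sits inside the loop (the actual patch may structure it differently); only the try/MAX_TRIES names are new, everything else follows the existing diff:

try=0
MAX_TRIES=3
until [ $try -eq $MAX_TRIES ] ; do
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
    retcode=$?
    if [ $retcode -eq 0 ] ; then
        break
    fi
    try=$((try+1))   # POSIX arithmetic; the patch currently uses $[...]
    sleep 10
done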
jobID=`${slurm_binpath}/sbatch $bls_tmp_file` # actual submission
retcode=$?
retry=0
MAX_RETRY=3
Could we add this as a config variable, slurm_max_submit_retries, defaulting to 0, and reference it via ${slurm_max_submit_retries}?
OK
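A possible shape for that change, sketched under the assumption that slurm_max_submit_retries is loaded from blah.config the same way slurm_binpath already is; with the proposed default of 0 the script keeps today's single-attempt behaviour:

max_submit_retries=${slurm_max_submit_retries:-0}   # 0 retries unless the site configures more
retry=0
until [ $retry -gt $max_submit_retries ] ; do
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
    retcode=$?
    if [ $retcode -eq 0 ] ; then
        break
    fi
    retry=$((retry+1))
    sleep 10
done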
break
fi
retry=$[$retry+1]
sleep 10
Could we make the sleep back off exponentially?
Can do.
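A minimal sketch of that backoff, assuming the 10-second base delay from the current patch and the config variable proposed above; the delay doubles after each failed attempt:

delay=10
retry=0
until [ $retry -gt ${slurm_max_submit_retries:-0} ] ; do
    jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
    retcode=$?
    if [ $retcode -eq 0 ] ; then
        break
    fi
    retry=$((retry+1))
    sleep $delay
    delay=$((delay*2))   # 10s, 20s, 40s, ...
done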
Hi @PerilousApricot, we've had some discussions locally, and we actually want to disable job retries on the CE completely (https://opensciencegrid.atlassian.net/browse/SOFTWARE-3407), since pilots are just resource requests. Would this have a negative impact on a site, say getting dinged by the WLCG for failing pilots?
TBH, I don't know the WLCG effects, and I'm a little unsure of the context from the JIRA ticket as well.
The idea is that pilots are cheap resource requests that are fairly uniform, so if a CE fails to submit a pilot job to the batch system, it should just give up and wait for more pilots. I was wondering if pilot success/failure at sites is tracked closely.
Under load, the SLURM scheduler is prone to barf on any client commands.
Retry the job submission if it fails.
The retry count and delay should be exposed as configurable values instead of being hardcoded.