This repository has been archived by the owner on Oct 10, 2019. It is now read-only.

Retry SLURM job submission #58

Open
wants to merge 1 commit into base: v1_18_bosco

Conversation

PerilousApricot

Under load, the SLURM scheduler is prone to barf on any client commands.
Retry the job submission if it fails.

This should be exposed as configurable values, instead of being hardcoded.
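For context, here is a minimal sketch of what a retry wrapper around sbatch along these lines could look like. The variable names slurm_binpath and bls_tmp_file come from the diff below; the loop structure shown here is an illustration of the approach, not the exact patch.

    # Sketch only: retry the submission a fixed number of times, sleeping between tries.
    retry=0
    MAX_RETRY=3                                       # hardcoded for now; see the review comments below
    until [ $retry -eq $MAX_RETRY ] ; do
        jobID=`${slurm_binpath}/sbatch $bls_tmp_file` # actual submission
        retcode=$?
        if [ $retcode -eq 0 ] ; then
            break                                     # submitted successfully
        fi
        retry=$((retry + 1))
        sleep 10                                      # fixed delay between attempts
    done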

retcode=$?
retry=0
MAX_RETRY=3
until [ $retry -eq $MAX_RETRY ] ; do
Member

This looks like the number of attempts to submit is 3 but the number of retries is actually 2, right?

Author

The first attempt is retry=0; once retry=3, the loop condition fails and the loop breaks. That should be three tries, right?

Member

Yeah, that's 3 tries total but only 2 retries, so we should either call the variable MAX_TRIES or bump the initial value of retry.
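To make the counting concrete, here is a stand-alone illustration of the loop semantics being discussed (not part of the patch): with retry starting at 0 and MAX_RETRY=3, the body runs three times, i.e. one initial try plus two retries.

    # Illustration only: how many times does the body run?
    retry=0
    MAX_RETRY=3
    until [ $retry -eq $MAX_RETRY ] ; do
        echo "attempt $((retry + 1))"   # prints: attempt 1, attempt 2, attempt 3
        retry=$((retry + 1))
    done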

jobID=`${slurm_binpath}/sbatch $bls_tmp_file` # actual submission
retcode=$?
retry=0
MAX_RETRY=3
Member

Could we add this as a config variable slurm_max_submit_retries here, defaulting to 0, and reference it via ${slurm_max_submit_retries}?

Author

OK
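For illustration, assuming the submit script already sources the BLAH configuration so that a slurm_max_submit_retries setting is visible as a shell variable (the config mechanism itself is not shown in this diff), the suggested defaulting might look like this sketch:

    # Sketch only: read the knob from the sourced config; defaulting to 0 retries
    # means a single submission attempt and no retry loop behaviour at all.
    max_retries=${slurm_max_submit_retries:-0}
    attempt=0
    while : ; do
        jobID=`${slurm_binpath}/sbatch $bls_tmp_file`
        retcode=$?
        [ $retcode -eq 0 ] && break                 # success, stop trying
        [ $attempt -ge $max_retries ] && break      # out of retries, give up
        attempt=$((attempt + 1))
        sleep 10
    done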

break
fi
retry=$[$retry+1]
sleep 10
Member

Could we make the sleep back off exponentially?

Author

Can do.
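For reference, one common way to get exponential backoff in the shell (a sketch, not the final patch) is to replace the fixed delay shown above with one that doubles on each retry:

    # Sketch only: replace the fixed "sleep 10" with a delay that doubles each retry.
    sleep $((10 * (2 ** retry)))    # 10s, 20s, 40s, ... for retry = 0, 1, 2, ...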

@brianhlin requested a review from @djw8605 on July 26, 2017.
@brianhlin
Member

Hi @PerilousApricot, we've had some discussions locally and we actually want to disable job retries on the CE completely (https://opensciencegrid.atlassian.net/browse/SOFTWARE-3407), since pilots are just resource requests. Would this have a negative impact for a site, say, getting dinged by the WLCG for failing pilots?

@PerilousApricot
Author

TBH, I don't know the WLCG effects, and I'm a little unsure of the context from the JIRA ticket as well.

@brianhlin
Member

The idea is that pilots are cheap resource requests that are fairly uniform, so if a CE fails to submit a pilot job to the batch system, it should just give up and wait for more pilots. I was wondering whether pilot success/failure at sites is tracked closely.
