[JENKINS-49707] Use retry to recover from node outages #405

Merged
merged 2 commits into jenkins-infra:master on Jul 7, 2022

Conversation

jglick
Contributor

@jglick commented Jun 7, 2022

To be merged if and when jenkinsci/workflow-basic-steps-plugin#203 is released and ci.jenkins.io is updated to it.

Could also benefit from jenkinsci/kubernetes-plugin#1190, similar to jenkinsci/pipeline-model-definition-plugin#533 (comment) but for Scripted syntax. The tricky bit is that kubernetesAgent() only retries errors from K8s agents (hence the name), applying some specialized logic so it will not retry e.g. out-of-memory errors; but in buildPlugin() we allow any of a number of labels to be used, and it is not obvious which would be K8s agents and which VMs, so we are using plain agent() for now. Currently there is no way to express the desire "please retry if this was a K8s agent and had a qualifying reason as restricted to pods, or if this was a non-K8s agent and had a generic agent outage condition".
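For illustration, a minimal Scripted-syntax sketch of the difference (the labels and build steps here are just examples, not the actual buildPlugin() code):

 // Generic condition, as used in this PR: retried whenever the node is lost,
 // whether the agent was a K8s pod or a VM.
 retry(count: 3, conditions: [agent(), nonresumable()]) {
     node('linux') {            // illustrative label; might map to a pod or a VM
         checkout scm
         sh 'mvn -B verify'
     }
 }

 // K8s-specific condition: only matches errors from K8s agents, and skips
 // non-infrastructure reasons such as OOMKilled, so it is only appropriate
 // when the label is known to request a pod.
 retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) {
     node('maven-11-pod') {     // illustrative label assumed to select a pod template
         checkout scm
         sh 'mvn -B verify'
     }
 }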

@@ -48,224 +48,227 @@ def call(Map params = [:]) {
}

tasks[stageIdentifier] = {
node(label) {
Contributor Author

You need to ignore whitespace diffs in this PR. (I tried to just use one-space indentation, but Spotless shot me down.)

}
}
// TODO use kubernetesAgent() if we know we are using a pod
retry(count: 3, conditions: [agent(), nonresumable()]) {
Contributor

Suggested change:
retry(count: 3, conditions: [agent(), nonresumable()]) {

Sorry for the naive question, but I'm not sure I understand the "meaning" of the two functions.

I understand the instruction to mean "retry the provided block, at most 3 times, if the failure is because of the agent (?) or if the failure is that the pipeline cannot be resumed successfully after a controller restart."

Is that correct?

Contributor Author

Yes. nonresumable (Non-resumable steps in Snippet Generator) is for builds that were running inside steps like checkout, junit, etc. which cannot pause and resume execution after a controller restart—unlike sh, input, etc.
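To make that concrete, a rough sketch (label as in the surrounding buildPlugin() code; the Maven invocation and report path are illustrative only):

 retry(count: 3, conditions: [agent(), nonresumable()]) {
     node(label) {
         checkout scm       // non-resumable step: nonresumable() allows a retry if a
                            // controller restart interrupts it
         sh 'mvn -B verify' // durable step: resumes across a controller restart on its own,
                            // while agent() still covers the agent being lost outright
         junit '**/target/surefire-reports/*.xml'   // non-resumable, like checkout
     }
 }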

Contributor Author

It occurs to me that I am missing help.html files for the conditions. I can add some.

@jglick
Contributor Author

jglick commented Jun 7, 2022

The idea would be to recover from stuff like https://ci.jenkins.io/job/Plugins/job/pipeline-groovy-lib-plugin/job/master/6/execution/node/65/log/

 Cannot contact jnlp-maven-11-3n2pk: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
 Could not connect to jnlp-maven-11-3n2pk to send interrupt signal to process

though I do not think I have managed to reproduce that exact error in a functional test.

@jglick
Contributor Author

jglick commented Jun 10, 2022

@dduportal @lemeurherve does this look like it would be helpful to you? If I proceed to release the basic versions of these changes, will you pick them up on ci.jenkins.io and try this library change? What else could I do to clarify the functionality and improve documentation?

@dduportal
Contributor

@dduportal @lemeurherve does this look like it would be helpful to you?

It absolutely makes sense for us! We are tracking it in jenkins-infra/helpdesk#2984, so we won't make you wait any longer than you already have 😅 Thanks for your patience and for your work communicating it to us!

If I proceed to release the basic versions of these changes, will you pick them up on ci.jenkins.io and try this library change?

Absolutely yes, that would help a lot, not only the infra team but all the contributors who spend quite some time waiting when agents go offline. We do not mind testing this as soon as possible.

What would be the most useful way for you to ship?

  • Trying on infra.ci.jenkins.io (private instance on the weekly core: lower risk of breaking things, but less usage, so potentially less feedback), and then on other controllers?
  • Or maybe on ci.jenkins.io (more builds, more cases, more feedback, but a risk of impacting more people) as soon as possible?

=> Asking the question openly, based on your level of confidence in the change and the need for "real-life feedback".

What else could I do to clarify the functionality and improve documentation?

@lemeurherve had some questions earlier today; I'll let him ask them there.
I might have some too; we'll share them with you, but it looks really good for a first version.

@lemeurherve
Member

Here are some questions I have, but I can already say it looks like it could be really helpful indeed, thanks!

it is not obvious which would be K8s agents and which VMs

Do you know if there is a way to determine at runtime if the current job is executed on a k8s agent?

What else could I do to clarify the functionality and improve documentation?

  1. What are the different conditions accepted by this function? Are there others besides agents(), non-resumables() and kubernetesAgent()? (Additional question: why isn't this last one plural? Unrelated? A misunderstanding on my part?)

  2. What does the StepContext passed in the context argument look like? Are there some examples? (Already existing before your PR; asking here as I don't know the subject.)

  3. From https://github.com/jglick/workflow-basic-steps-plugin/blob/retry-JENKINS-49707-base/src/main/resources/org/jenkinsci/plugins/workflow/steps/RetryStep/help-conditions.html:

     "If there are no specified conditions, the block will always be retried except in case of user aborts."

     Is there something preventing us from specifying no condition at all, so we wouldn't have to care about running on a VM or on a k8s agent?

  4. Is there a way to see how many retries have been attempted on past builds? If so, I'm wondering if it should be monitored, to spot jobs that are frequently retried.

@lemeurherve
Member

(Let me know if you prefer these questions to be asked there instead of here)

@jglick
Contributor Author

jglick commented Jun 13, 2022

What would be the most useful way for you to ship?

My plan was to just release the first tranche of updates (jenkinsci/kubernetes-plugin#1190 and deps) and wait for you to install them from the public UC on ci.jenkins.io, merge this, and wait for feedback. I presumed you would not want to be running prerelease plugin versions. If you are OK doing so, then I can prepare incremental builds (basically just making sure all PRs are up to date with their base branch) and give you download URLs, which would be attractive in that we could smoke out any critical flaws before releasing updates to the Jenkins community as a whole.

Do you know if there is a way to determine at runtime if the current job is executed on a k8s agent?

Good question. I do not think so, at least not as easily. The POD_LABEL environment variable is set inside the podTemplate step, but you are not using that—you are using pod templates defined globally (in KubernetesCloud) and then selected implicitly by a matching label in node. buildPlugin.groovy could of course hard-code the list of labels known to request K8s pods but this would have to be synchronized with infrastructure changes.
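For comparison, a minimal sketch of the two styles (the pod, container, image, and label names are illustrative); POD_LABEL only exists in the first form:

 // Inline pod template: POD_LABEL is defined inside the podTemplate block,
 // so the build knows it is running on a K8s agent.
 podTemplate(containers: [
     containerTemplate(name: 'maven', image: 'maven:3.8-eclipse-temurin-11', command: 'sleep', args: 'infinity')
 ]) {
     node(POD_LABEL) {
         container('maven') {
             sh 'mvn -version'
         }
     }
 }

 // Globally defined pod template (the ci.jenkins.io style): selected implicitly by a
 // matching label, so the Pipeline itself cannot tell whether 'maven-11' is a pod or a VM.
 node('maven-11') {
     sh 'mvn -version'
 }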

Are there others besides agents(), non-resumables() and kubernetesAgent()?

Just those currently. Actually the first two are named agent (not plural) and nonresumable.

What does the StepContext passed in the context argument look like?

https://javadoc.jenkins.io/plugin/workflow-step-api/org/jenkinsci/plugins/workflow/steps/StepContext.html but this sounds like a question better posed as a line comment in one of the plugin PRs (not sure which one you were even looking at).

Is there something preventing us from specifying no condition at all, so we wouldn't have to care about running on a VM or on a k8s agent?

I am not sure I understand the question. Is this about selecting agent vs. kubernetesAgent? The latter is more restrictive—it only matches errors thrown by a K8s agent, and only those which do not have a well-known (K8s) termination reason like OOMKilled. So you can use agent() on a K8s build (as this PR currently does) but it will retry in circumstances where you would probably prefer it did not.

Is there a way to see how many retries have been attempted on past builds?

Well this would be recorded in the log and in Pipeline metadata but there is no sort of dashboard or published metric or whatever. I suppose this could be fed to the metrics plugin somehow as an optional dep.

@dduportal
Contributor

I presumed you would not want to be running prerelease plugin versions. If you are OK doing so, then I can prepare incremental builds (basically just making sure all PRs are up to date with their base branch) and give you download URLs, which would be attractive in that we could smoke out any critical flaws before releasing updates to the Jenkins community as a whole.

That is a correct assumption as a general rule of thumb, but in this case we don't mind.

We can absolutely install the incremental plugins on infra.ci as a first step (plugins defined in https://github.com/jenkins-infra/docker-jenkins-weekly/blob/main/plugins.txt can be URLs), as this instance holds the reports generated for RPU that take quite some time to build: a great use case to kill some agents wildly and see how it behaves.

If it doesn't break anything, then we can roll out the incrementals to ci.jenkins.io and increase the testing area, all of this before releasing the plugins: is that good for you?

@jglick
Contributor Author

jglick commented Jun 13, 2022

@jglick
Contributor Author

jglick commented Jun 14, 2022

Just let me know when jenkins-infra/docker-jenkins-weeklyci#512 is live so this can be taken out of draft.

@jglick
Contributor Author

jglick commented Jun 14, 2022

(We can also experiment in special Jenkinsfiles like that of bom.)

@lemeurherve
Member

It's deployed on infra.ci.jenkins.io: jenkins-infra/kubernetes-management#2502

@dduportal
Contributor

(We can also experiment in special Jenkinsfiles like that of bom.)

Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" docker image is only about infra.ci.jenkins.io.

We are checking how it behaves. So far:

@lemeurherve
Member

lemeurherve commented Jun 14, 2022

Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" docker image is only about infra.ci.jenkins.io.

We could maybe add an isInfra() condition, WDYT?

@dduportal
Contributor

Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" docker image is only about infra.ci.jenkins.io.

We could maybe add an isInfra() condition, WDYT?

Not sure how that solves the first "gated rollout", where we need the updated plugins, which should not break existing workloads in their current state?

@jglick
Contributor Author

jglick commented Jun 14, 2022

The jobs that were running during the plugin upgrade were stuck waiting for an agent after the restart (no node online... waiting) but nothing happened for ~1hour.

Nothing to do with these updates, but anyway—that should not happen, they should have automatically aborted after waiting 5m. So there is a bug somewhere. What sort of agents? (I cannot even see build logs on that system, much less get system info.)

when the controller restarts, it is retried by default

No, that is not something introduced by this change, which only takes effect if you use the retry step (with the new option). You must be seeing something different.

(jenkinsci/workflow-durable-task-step-plugin#180 would have subtle effects on the behavior of builds running across controller restarts, though still without any sort of automatic retry, but this patch is not included in jenkins-infra/docker-jenkins-weeklyci#512 as I am leaving that to a later stage. Basically it would allow retry to work when the cause of failure was an agent being lost after a restart. That is not a normal scenario for a controller restart, where agents running builds are expected to reconnect without error; it is designed for things like a complete cluster outage.)

I would like to check the deployment/behavior with you @jglick live, if you have some availability in the upcoming days (morning for you)?

Sure. Just put something on my calendar. Maybe starting 09.15 US-EDT.

add an isInfra() condition

Not sure I follow. This patch is about buildPlugin() only—no effect expected on anything outside https://ci.jenkins.io/job/Plugins/.

@dduportal
Contributor

Let's go and try this behavior!

@dduportal merged commit e468733 into jenkins-infra:master on Jul 7, 2022
bat 'git clean -xffd || echo %GITUNAVAILABLEMESSAGE%'
}
}
// TODO use kubernetesAgent() if we know we are using a pod
Member

@timja Jul 7, 2022

Doesn't kubernetesAgent return false if it can't find a match? I.e., would it actually cause a problem to always set this? jenkinsci/kubernetes-plugin#1190

(as core could benefit from this as well; just trying to figure out the best approach)

Contributor Author

@jglick Jul 7, 2022

I have not settled on the right DX here. Currently kubernetesAgent matches only when used on K8s agents. So if you use conditions: [kubernetesAgent()] then you will get no retries of non-K8s (e.g., Docker VM) node blocks, which seems like the expected behavior. And conditions: [agent(), kubernetesAgent()] is logically the same as conditions: [agent()]: it matches whenever the node is lost, disregarding K8s details (like genuine problems with the pod definition). I guess there could be an option like conditions: [kubernetesAgent(handleNonKubernetes: true)] which would fall back to generic agent() behavior when given a non-K8s agent, which is what we would ideally use here, where we cannot easily tell what sort of agent is being used.
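Something like this, where handleNonKubernetes is purely hypothetical at this point (not an existing parameter), and the build step is illustrative:

 // Hypothetical: fall back to generic agent() behavior on non-K8s agents,
 // but keep the pod-aware filtering (e.g. no retry on OOMKilled) on K8s agents.
 retry(count: 3, conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()]) {
     node(label) {
         sh 'mvn -B verify'   // illustrative build step
     }
 }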

Member

so if you were to use:

conditions: [agent(), kubernetesAgent()]

on a non-Kubernetes agent (because you don't know when you'll be on Kubernetes and when not), it won't cause a problem?

Contributor Author

It would not cause a problem; it would just behave identically to

conditions: [agent()]

Same for

conditions: [kubernetesAgent(), agent()]

Contributor Author

In other words, when used with a K8s agent, it would retry node even though the kubernetes plugin knows that in this case the problem was not an infrastructure outage: something like OOMKilled (bad memory limit request) or DeadlineExceeded (e.g., missing container image) that is really the developer’s concern.
