[JENKINS-49707] Use retry to recover from node outages #405

Merged
merged 2 commits into jenkins-infra:master on Jul 7, 2022

Conversation

jglick
Contributor

@jglick commented Jun 7, 2022

To be merged if and when jenkinsci/workflow-basic-steps-plugin#203 is released and ci.jenkins.io is updated to it.

Could also benefit from jenkinsci/kubernetes-plugin#1190, similar to jenkinsci/pipeline-model-definition-plugin#533 (comment) but for Scripted syntax. The tricky bit is that kubernetesAgent() only retries errors from K8s agents (hence the name), applying some specialized logic so it will not retry e.g. out-of-memory errors; but in buildPlugin() we allow any of a number of labels to be used, and it is not obvious which would be K8s agents and which VMs, so we are using plain agent() for now. Currently there is no way to express the desire "please retry if this was a K8s agent and had a qualifying reason as restricted to pods, or if this was a non-K8s agent and had a generic agent outage condition".
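For illustration, a minimal Scripted-syntax sketch of the difference (the labels and build steps here are just examples, not the actual buildPlugin() code):

 // Generic condition, as used in this PR: retried whenever the node is lost,
 // whether the agent was a K8s pod or a VM.
 retry(count: 3, conditions: [agent(), nonresumable()]) {
     node('linux') {            // illustrative label; might map to a pod or a VM
         checkout scm
         sh 'mvn -B verify'
     }
 }

 // K8s-specific condition: only matches errors from K8s agents, and skips
 // non-infrastructure reasons such as OOMKilled, so it is only appropriate
 // when the label is known to request a pod.
 retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) {
     node('maven-11-pod') {     // illustrative label assumed to select a pod template
         checkout scm
         sh 'mvn -B verify'
     }
 }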

@@ -48,224 +48,227 @@ def call(Map params = [:]) {
}

tasks[stageIdentifier] = {
node(label) {
Contributor Author

You need to ignore whitespace diffs in this PR. (I tried to just use one-space indentation, but Spotless shot me down.)

}
}
// TODO use kubernetesAgent() if we know we are using a pod
retry(count: 3, conditions: [agent(), nonresumable()]) {
Contributor

Suggested change:
retry(count: 3, conditions: [agent(), nonresumable()]) {

Sorry for the naive question, but I'm not sure I understand the "meaning" of the two functions.

I understand the instruction to mean "retry the provided block, at most 3 times, if the failure is because of the agent (?) or if the failure is that the pipeline cannot be resumed successfully after a controller restart."

Is that correct?

Contributor Author

Yes. nonresumable (Non-resumable steps in Snippet Generator) is for builds that were running inside steps like checkout, junit, etc. which cannot pause and resume execution after a controller restart—unlike sh, input, etc.
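To make that concrete, a rough sketch (label as in the surrounding buildPlugin() code; the Maven invocation and report path are illustrative only):

 retry(count: 3, conditions: [agent(), nonresumable()]) {
     node(label) {
         checkout scm       // non-resumable step: nonresumable() allows a retry if a
                            // controller restart interrupts it
         sh 'mvn -B verify' // durable step: resumes across a controller restart on its own,
                            // while agent() still covers the agent being lost outright
         junit '**/target/surefire-reports/*.xml'   // non-resumable, like checkout
     }
 }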

Contributor Author

It occurs to me that I am missing help.html files for the conditions. I can add some.

@jglick
Contributor Author

jglick commented Jun 7, 2022

The idea would be to recover from stuff like https://ci.jenkins.io/job/Plugins/job/pipeline-groovy-lib-plugin/job/master/6/execution/node/65/log/

 Cannot contact jnlp-maven-11-3n2pk: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
 Could not connect to jnlp-maven-11-3n2pk to send interrupt signal to process

though I do not think I have managed to reproduce that exact error in a functional test.

@jglick
Contributor Author

jglick commented Jun 10, 2022

@dduportal @lemeurherve does this look like it would be helpful to you? If I proceed to release the basic versions of these changes, will you pick them up on ci.jenkins.io and try this library change? What else could I do to clarify the functionality and improve documentation?

@dduportal
Contributor

@dduportal @lemeurherve does this look like it would be helpful to you?

It absolutely makes sense for us! We are tracking it in jenkins-infra/helpdesk#2984, so we won't make you wait any longer than you already have 😅 Thanks for your patience and for your work communicating it to us!

If I proceed to release the basic versions of these changes, will you pick them up on ci.jenkins.io and try this library change?

Absolutely yes, that would help a lot, not only the infra team but all the contributors who spend quite some time waiting when agents go offline. We do not mind testing this as soon as possible.

What would be the most useful way for you to ship?

  • Trying on infra.ci.jenkins.io (private instance on the weekly core: lower risk of breaking things, but less usage, so potentially less feedback), and then on other controllers?
  • Or maybe on ci.jenkins.io (more builds, more cases, more feedback, but a risk of impacting more people) as soon as possible?

=> Asking the question openly, based on your level of confidence in the change and the need for "real-life feedback".

What else could I do to clarify the functionality and improve documentation?

@lemeurherve had some questions earlier today; I'll let him ask them there.
I might have some too; we'll share them with you, but it looks really good for a first version.

@lemeurherve
Member

Here are some questions I have, but I can already say it looks like it could be really helpful indeed, thanks!

it is not obvious which would be K8s agents and which VMs

Do you know if there is a way to determine at runtime if the current job is executed on a k8s agent?

What else could I do to clarify the functionality and improve documentation?

  1. What are the different conditions accepted by this function? Are there others besides agents(), non-resumables() and kubernetesAgent()? (Additional question: why isn't this last one plural? Unrelated? A misunderstanding on my part?)

  2. What does the StepContext passed in the context argument look like? Are there some examples? (Already existing before your PR; asking here as I don't know the subject.)

  3. From https://github.com/jglick/workflow-basic-steps-plugin/blob/retry-JENKINS-49707-base/src/main/resources/org/jenkinsci/plugins/workflow/steps/RetryStep/help-conditions.html:

     "If there are no specified conditions, the block will always be retried except in case of user aborts."

     Is there something preventing us from specifying no condition at all, so we wouldn't have to care about running on a VM or on a k8s agent?

  4. Is there a way to see how many retries have been attempted on past builds? If so, I'm wondering if it should be monitored, to spot jobs that are frequently retried.

@lemeurherve
Member

(Let me know if you prefer these questions to be asked there instead of here)

@jglick
Contributor Author

jglick commented Jun 13, 2022

What would be the most useful way for you to ship?

My plan was to just release the first tranche of updates (jenkinsci/kubernetes-plugin#1190 and deps) and wait for you to install them from the public UC on ci.jenkins.io, merge this, and wait for feedback. I presumed you would not want to be running prerelease plugin versions. If you are OK doing so, then I can prepare incremental builds (basically just making sure all PRs are up to date with their base branch) and give you download URLs, which would be attractive in that we could smoke out any critical flaws before releasing updates to the Jenkins community as a whole.

Do you know if there is a way to determine at runtime if the current job is executed on a k8s agent?

Good question. I do not think so, at least not as easily. The POD_LABEL environment variable is set inside the podTemplate step, but you are not using that—you are using pod templates defined globally (in KubernetesCloud) and then selected implicitly by a matching label in node. buildPlugin.groovy could of course hard-code the list of labels known to request K8s pods but this would have to be synchronized with infrastructure changes.
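For comparison, a minimal sketch of the two styles (the pod, container, image, and label names are illustrative); POD_LABEL only exists in the first form:

 // Inline pod template: POD_LABEL is defined inside the podTemplate block,
 // so the build knows it is running on a K8s agent.
 podTemplate(containers: [
     containerTemplate(name: 'maven', image: 'maven:3.8-eclipse-temurin-11', command: 'sleep', args: 'infinity')
 ]) {
     node(POD_LABEL) {
         container('maven') {
             sh 'mvn -version'
         }
     }
 }

 // Globally defined pod template (the ci.jenkins.io style): selected implicitly by a
 // matching label, so the Pipeline itself cannot tell whether 'maven-11' is a pod or a VM.
 node('maven-11') {
     sh 'mvn -version'
 }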

Are there others besides agents(), non-resumables() and kubernetesAgent()?

Just those currently. Actually the first two are named agent (not plural) and nonresumable.

What does the StepContext passed in the context argument look like?

https://javadoc.jenkins.io/plugin/workflow-step-api/org/jenkinsci/plugins/workflow/steps/StepContext.html but this sounds like a question better posed as a line comment in one of the plugin PRs (not sure which one you were even looking at).

Is there something preventing us from specifying no condition at all, so we wouldn't have to care about running on a VM or on a k8s agent?

I am not sure I understand the question. Is this about selecting agent vs. kubernetesAgent? The latter is more restrictive—it only matches errors thrown by a K8s agent, and only those which do not have a well-known (K8s) termination reason like OOMKilled. So you can use agent() on a K8s build (as this PR currently does) but it will retry in circumstances where you would probably prefer it did not.

Is there a way to see how many retries have been attempted on past builds?

Well this would be recorded in the log and in Pipeline metadata but there is no sort of dashboard or published metric or whatever. I suppose this could be fed to the metrics plugin somehow as an optional dep.

@dduportal
Contributor

I presumed you would not want to be running prerelease plugin versions. If you are OK doing so, then I can prepare incremental builds (basically just making sure all PRs are up to date with their base branch) and give you download URLs, which would be attractive in that we could smoke out any critical flaws before releasing updates to the Jenkins community as a whole.

That is a correct assumption as a general rule of thumb, but in this case we don't mind.

We can absolutely install the incremental plugins on infra.ci as a first step (plugins defined in https://github.com/jenkins-infra/docker-jenkins-weekly/blob/main/plugins.txt can be URLs), as this instance holds the reports generated for RPU that take quite some time to build: a great use case to kill some agents wildly and see how it behaves.

If it doesn't break anything, then we can roll out the incrementals to ci.jenkins.io and increase the testing area, all of this before releasing the plugins: is that good for you?

@jglick
Contributor Author

jglick commented Jun 13, 2022

@jglick
Contributor Author

jglick commented Jun 14, 2022

Just let me know when jenkins-infra/docker-jenkins-weeklyci#512 is live so this can be taken out of draft.

@jglick
Contributor Author

jglick commented Jun 14, 2022

(We can also experiment in special Jenkinsfiles like that of bom.)

@lemeurherve
Member

It's deployed on infra.ci.jenkins.io: jenkins-infra/kubernetes-management#2502

@dduportal
Contributor

(We can also experiment in special Jenkinsfiles like that of bom.)

Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" docker image is only about infra.ci.jenkins.io.

We are checking how it behaves. So far:

@lemeurherve
Member

lemeurherve commented Jun 14, 2022

Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" docker image is only about infra.ci.jenkins.io.

We could maybe add an isInfra() condition, WDYT?

@dduportal
Contributor

Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" docker image is only about infra.ci.jenkins.io.

We could maybe add an isInfra() condition, WDYT?

Not sure how that solves the first "gated rollout", where we need the updated plugins, which should not break existing workloads in their current state?

@jglick
Contributor Author

jglick commented Jun 14, 2022

The jobs that were running during the plugin upgrade were stuck waiting for an agent after the restart (no node online... waiting) but nothing happened for ~1hour.

Nothing to do with these updates, but anyway—that should not happen, they should have automatically aborted after waiting 5m. So there is a bug somewhere. What sort of agents? (I cannot even see build logs on that system, much less get system info.)

when the controller restarts, it is retried by default

No, that is not something introduced by this change, which only takes effect if you use the retry step (with the new option). You must be seeing something different.

(jenkinsci/workflow-durable-task-step-plugin#180 would have subtle effects on the behavior of builds running across controller restarts, though still without any sort of automatic retry, but this patch is not included in jenkins-infra/docker-jenkins-weeklyci#512 as I am leaving that to a later stage. Basically it would allow retry to work when the cause of failure was an agent being lost after a restart. That is not a normal scenario for a controller restart, where agents running builds are expected to reconnect without error; it is designed for things like a complete cluster outage.)

I would like to check the deployment/behavior with you @jglick live, if you have some availability in the upcoming days (morning for you)?

Sure. Just put something on my calendar. Maybe starting 09.15 US-EDT.

add an isInfra() condition

Not sure I follow. This patch is about buildPlugin() only—no effect expected on anything outside https://ci.jenkins.io/job/Plugins/.

@dduportal
Contributor

Let's go and try this behavior!

@dduportal merged commit e468733 into jenkins-infra:master on Jul 7, 2022
bat 'git clean -xffd || echo %GITUNAVAILABLEMESSAGE%'
}
}
// TODO use kubernetesAgent() if we know we are using a pod
Member

@timja Jul 7, 2022

Doesn't kubernetesAgent return false if it can't find a match? I.e., would it actually cause a problem to always set this? jenkinsci/kubernetes-plugin#1190

(as core could benefit from this as well; just trying to figure out the best approach)

Contributor Author

@jglick Jul 7, 2022

I have not settled on the right DX here. Currently kubernetesAgent matches only when used on K8s agents. So if you use conditions: [kubernetesAgent()] then you will get no retries of non-K8s (e.g., Docker VM) node blocks, which seems like the expected behavior. And conditions: [agent(), kubernetesAgent()] is logically the same as conditions: [agent()]: it matches whenever the node is lost, disregarding K8s details (like genuine problems with the pod definition). I guess there could be an option like conditions: [kubernetesAgent(handleNonKubernetes: true)] which would fall back to generic agent() behavior when given a non-K8s agent, which is what we would ideally use here, where we cannot easily tell what sort of agent is being used.
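Something like this, where handleNonKubernetes is purely hypothetical at this point (not an existing parameter), and the build step is illustrative:

 // Hypothetical: fall back to generic agent() behavior on non-K8s agents,
 // but keep the pod-aware filtering (e.g. no retry on OOMKilled) on K8s agents.
 retry(count: 3, conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()]) {
     node(label) {
         sh 'mvn -B verify'   // illustrative build step
     }
 }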

Member

so if you were to use:

conditions: [agent(), kubernetesAgent()]

on a non-Kubernetes agent (because you don't know when you'll be on Kubernetes and when not), it won't cause a problem?

Contributor Author

It would not cause a problem; it would just behave identically to

conditions: [agent()]

Same for

conditions: [kubernetesAgent(), agent()]

Contributor Author

In other words, when used with a K8s agent, it would retry node even though the kubernetes plugin knows that in this case the problem was not an infrastructure outage: something like OOMKilled (bad memory limit request) or DeadlineExceeded (e.g., missing container image) that is really the developer’s concern.
