[JENKINS-49707] Use retry to recover from node outages #405
Conversation
@@ -48,224 +48,227 @@ def call(Map params = [:]) {
    }

    tasks[stageIdentifier] = {
        node(label) {
You need to ignore WS diffs in this PR. (I tried to just use one-space indentation but Spotless shot me down.)
        }
    }
    // TODO use kubernetesAgent() if we know we are using a pod
    retry(count: 3, conditions: [agent(), nonresumable()]) {
Sorry for the naive question, but I am not sure I understand the meaning of the two conditions.
I read this as: retry the instructions in the provided block, a maximum of 3 times, if the failure is caused by the agent, or if the failure is that the pipeline could not be resumed successfully after a controller restart.
Is that correct?
Yes. `nonresumable` (Non-resumable steps in Snippet Generator) is for builds that were running inside steps like `checkout`, `junit`, etc., which cannot pause and resume execution after a controller restart, unlike `sh`, `input`, etc.
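For reference, here is a minimal Scripted-syntax sketch of the pattern under discussion (the label and the build steps are placeholders, not taken from this PR): the whole `node` block is re-run, up to the given count, if the agent is lost (`agent()`) or if the controller restarts while a non-resumable step such as `checkout` or `junit` is in progress (`nonresumable()`).

```groovy
// Minimal sketch; 'linux' and the build steps are illustrative placeholders.
retry(count: 3, conditions: [agent(), nonresumable()]) {
    node('linux') {
        checkout scm                              // non-resumable: cannot survive a controller restart mid-step
        sh 'mvn -B -ntp verify'                   // durable: normally resumes after a restart, but is lost if the agent dies
        junit '**/target/surefire-reports/*.xml'  // non-resumable, like checkout
    }
}
```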
It occurs to me that I am missing help.html files for the conditions. I can add some.
The idea would be to recover from stuff like https://ci.jenkins.io/job/Plugins/job/pipeline-groovy-lib-plugin/job/master/6/execution/node/65/log/ though I do not think I have managed to reproduce that exact error in a functional test.
@dduportal @lemeurherve does this look like it would be helpful to you? If I proceed to release the basic versions of these changes, will you pick them up on ci.jenkins.io and try this library change? What else could I do to clarify the functionality and improve documentation?
It absolutely makes sense for us! We are tracking it in jenkins-infra/helpdesk#2984, so we won't make you wait for us more than you already did 😅 Thanks for your patience and for the work to communicate it to us!
Absolutely yes, that would help a lot, not only the infra team but all the contributors who are spending quite some time waiting when agents go offline. We do not mind testing this as soon as possible. What would be the most useful way for you to ship?
=> Asking the question openly, based on your level of confidence in the change + need for real-life feedback.
@lemeurherve had some questions earlier today; I'll let him ask them there.
Here are some questions I have, but I can already say it looks like it could be really helpful indeed, thanks!
Do you know if there is a way to determine at runtime whether the current job is executed on a k8s agent?
What are the different conditions available?
Is there something preventing us from not specifying any condition, so we wouldn't have to care about running on a VM or on a k8s agent?
(Let me know if you prefer these questions to be asked there instead of here)
My plan was to just release the first tranche of updates (jenkinsci/kubernetes-plugin#1190 and deps) and wait for you to install them from the public UC on ci.jenkins.io, merge this, and wait for feedback. I presumed you would not want to be running prerelease plugin versions. If you are OK doing so, then I can prepare incremental builds (basically just making sure all PRs are up to date with their base branch) and give you download URLs, which would be attractive in that we could smoke out any critical flaws before releasing updates to the Jenkins community as a whole.
Good question. I do not think so, at least not as easily. The
Just those currently. Actually the first two are named
https://javadoc.jenkins.io/plugin/workflow-step-api/org/jenkinsci/plugins/workflow/steps/StepContext.html but this sounds like a question better posed as a line comment in one of the plugin PRs (not sure which one you were even looking at).
I am not sure I understand the question. Is this about selecting
Well this would be recorded in the log and in Pipeline metadata but there is no sort of dashboard or published metric or whatever. I suppose this could be fed to the
That is a correct assumption as a general rule of thumb, but in this case we don't mind. We can absolutely install the incremental plugin builds on infra.ci as a first step (plugins defined in https://github.com/jenkins-infra/docker-jenkins-weekly/blob/main/plugins.txt can be URLs), as this instance holds the reports generated for RPU that take quite some time to build: a great use case to kill some agents wildly and see how it behaves. If it's not breaking, then we can roll out the incrementals to ci.jenkins.io and increase the testing area, all of this before releasing the plugins: is that good for you?
Just let me know when jenkins-infra/docker-jenkins-weeklyci#512 is live so this can be taken out of draft.
(We can also experiment in special
It's deployed on infra.ci.jenkins.io: jenkins-infra/kubernetes-management#2502
Yep, the draft status is still required until it's live on ci.jenkins.io though: the "weekly" Docker image only concerns infra.ci.jenkins.io. We are checking how it behaves. So far:
We could maybe add an
Not sure how it solves the first "gated rollout" step, where we need the updated plugins, which should not break existing workloads in their current state?
Nothing to do with these updates, but anyway—that should not happen, they should have automatically aborted after waiting 5m. So there is a bug somewhere. What sort of agents? (I cannot even see build logs on that system, much less get system info.)
No, that is not something introduced by this change, which only takes effect if you use the (jenkinsci/workflow-durable-task-step-plugin#180 would have subtle effects on the behavior of builds running across controller restarts, though still not any sort of automatic retry, but that patch is not included in jenkins-infra/docker-jenkins-weeklyci#512 as I am leaving it to a later stage). Basically that would allow
Sure. Just put something on my calendar. Maybe starting 09.15 US-EDT.
Not sure I follow. This patch is about
Let's go try this behavior!
        bat 'git clean -xffd || echo %GITUNAVAILABLEMESSAGE%'
    }
}
// TODO use kubernetesAgent() if we know we are using a pod
Doesn't the `kubernetesAgent` return false if it can't find a match? I.e., would it actually cause a problem to always set this? jenkinsci/kubernetes-plugin#1190
(As core could benefit from this as well, just trying to figure out the best approach.)
I have not settled on the right DX here. Currently `kubernetesAgent` matches only when used on K8s agents. So if you use `conditions: [kubernetesAgent()]` then you will get no retries of non-K8s (e.g., Docker VM) `node` blocks, which seems like the expected behavior. And `conditions: [agent(), kubernetesAgent()]` is logically the same as `conditions: [agent()]`: it matches whenever the node is lost, disregarding K8s details (like genuine problems with the pod definition). I guess there could be an option like `conditions: [kubernetesAgent(handleNonKubernetes: true)]` which would fall back to generic `agent()` behavior when given a non-K8s agent, which is what we would ideally use here, where we cannot easily tell what sort of agent is being used.
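To make the variants concrete, here is a hedged sketch (the labels are placeholders; note that `handleNonKubernetes: true` is only the hypothetical option mentioned above, not something the kubernetes plugin currently offers):

```groovy
// K8s-aware only: retries just when the kubernetes plugin classifies the
// failure of a pod-based agent as an infrastructure problem.
retry(count: 3, conditions: [kubernetesAgent()]) {
    node('k8s-maven') { sh 'mvn -B verify' }
}

// Logically the same as plain agent(): retries whenever the node is lost,
// on any kind of agent, with no K8s-specific filtering.
retry(count: 3, conditions: [agent(), kubernetesAgent()]) {
    node('some-label') { sh 'mvn -B verify' }
}

// Hypothetical (not implemented): K8s-aware filtering on pods, falling back
// to generic agent() behavior on other agents.
// retry(count: 3, conditions: [kubernetesAgent(handleNonKubernetes: true)]) { ... }
```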
So if you were to use `conditions: [agent(), kubernetesAgent()]` on a non-Kubernetes agent (because you don't know when you'll be on Kubernetes and when not), it won't cause a problem?
It would not cause a problem; it would just behave identically to `conditions: [agent()]`. Same for `conditions: [kubernetesAgent(), agent()]`.
In other words, when used with a K8s agent, it would retry `node` even though the kubernetes plugin knows that in this case the problem was not an infrastructure outage: something like `OOMKilled` (bad memory limit request) or `DeadlineExceeded` (e.g., missing container image) that is really the developer's concern.
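A hedged illustration of that distinction (the pod YAML and image are made up for this sketch): with `kubernetesAgent()` alone, a container terminated as `OOMKilled` because of a deliberately tight memory limit is treated as the developer's problem and not retried, whereas adding plain `agent()` would retry it like any other lost node.

```groovy
// Sketch only: the pod definition and image are illustrative, not from this PR.
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: maven
    image: maven:3.9-eclipse-temurin-17
    command: ['sleep', 'infinity']
    resources:
      limits:
        memory: 128Mi   # too small: an OOMKilled here is a build problem, not an outage
''') {
    retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) {
        node(POD_LABEL) {
            container('maven') {
                sh 'mvn -B verify'   // not retried if the container is OOMKilled
            }
        }
    }
}
```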
…nfra#405) Co-authored-by: Tim Jacomb <21194782+timja@users.noreply.github.com>
If and when jenkinsci/workflow-basic-steps-plugin#203 is released and ci.jenkins.io is updated to it.
Could also benefit from jenkinsci/kubernetes-plugin#1190, similar to jenkinsci/pipeline-model-definition-plugin#533 (comment) but for Scripted syntax. The tricky bit is that `kubernetesAgent()` only retries errors from K8s agents (hence the name), applying some specialized logic so it will not retry e.g. out-of-memory errors; but in `buildPlugin()` we allow any of a number of labels to be used, and it is not obvious which would be K8s agents and which VMs, so we are using plain `agent()` for now. Currently there is no way to express the desire "please retry if this was a K8s agent and had a qualifying reason as restricted to pods, or if this was a non-K8s agent and had a generic agent outage condition".
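A hedged sketch of that constraint (the label handling is simplified and the names are placeholders, not the real `buildPlugin()` parameters): because the same call may land on either a K8s pod or a VM depending on the label, the generic condition is used rather than the K8s-specific one.

```groovy
// Simplified, illustrative version of the library pattern.
def label = 'linux'   // in buildPlugin() the label comes from the caller and may map to a K8s pod or a VM

retry(count: 3, conditions: [agent(), nonresumable()]) {   // generic: works for both kinds of agents
    // [kubernetesAgent(), nonresumable()] would skip retries entirely on VM labels
    node(label) {
        checkout scm
        sh 'mvn -B verify'
    }
}
```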