-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-67164] Call StepExecution.onResume directly from WorkflowRun.onLoad rather than via FlowExecutionList.ItemListenerImpl to ensure step resumption #178
Conversation
…han via FlowExecutionList to ensure resumption
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
Outdated
Show resolved
Hide resolved
src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
Show resolved
Hide resolved
// TODO: Unclear how this might happen in practice. | ||
FlowExecutionList.get().unregister(b.asFlowExecutionOwner()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I am not sure how this might happen outside of something like a backup where files are being copied indiscriminately. The code around build completion in WorkflowRun
looks reasonable to me from a quick glance, and the call to unregister
is the final thing that happens, so the only thing that comes to mind is that there is some issue causing the build to not be saved as expected, for example perhaps some issue with a BulkChange
being aborted or something like that.
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking right.
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
Show resolved
Hide resolved
public void onSuccess(List<StepExecution> result) { | ||
FlowExecutionList list = FlowExecutionList.get(); | ||
FlowExecutionOwner owner = e.getOwner(); | ||
if (!list.runningTasks.contains(owner)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check isComplete
first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking about that as well. CpsFlowExecution.isComplete
is checked in CpsFlowExecution.onLoad
before loadProgramAsync
gets called, but WorkflowRun.onLoad
only checks CpsFlowExecution.isComplete()
if WorkflowRun.completed
is null
, which I think should only be true for old Pipelines from before WorkflowRun.completed
was added (which seems wrong, probably better to always check in case there is a mismatch between the serialized state of the execution and the run, so I will look at that in a bit), so there is no guarantee about the value of CpsFlowExecution.isComplete
here as far as I can tell, so it might be worth checking.
I also wasn't sure if it would be possible for there to be a timing issue, for example if the pipeline was suspended somewhere where it doesn't need any additional input to proceed and so the CPS VM thread finishes the build while we are on another thread still in WorkflowRun.onLoad
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it would be possible for there to be a timing issue
Almost certainly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost certainly.
If this is the case though, then we cannot fix the issue here, because there is no synchronization that can guarantee that the execution does not complete right after we check isComplete
but before we do anything else. Maybe the CPS VM thread should not attempt to execute anything until WorkflowRun.onLoad
finishes or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes you think the observed problem is the result of a race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following your question. I just mean that it is not clear to me whether checking isComplete
here actually gives us any useful guarantees, partly because WorkflowRun.onLoad
is not guaranteed to check it and partly because the execution actually completing will happen on the CPS VM thread, which is unrelated to the thread calling WorkflowRun.onLoad
. I don't necessarily think that the underlying reported problem is a race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I said
then we cannot fix the issue here,
I just meant that we cannot fix any timing issues related to checking isComplete
or not. Maybe that was not clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, so reading through WorkflowRun.onLoad
again, it is guaranteed to call isComplete
inside of WorkflowRun.getExecution
here, and if the execution is complete, it marks the WorkflowRun
as complete (this was the fix for the symptoms of JENKINS-50199), so this condition would be false and so FlowExecutionListener.fireResumed
would be skipped along with ResumeStepExecutionListener.onResumed
if the execution was marked as complete in its serialized form.
That leaves us with a potential case where WorkflowRun.getExecution
loads and resumes the execution, which has not yet completed, but then the execution completes before ResumeStepExecutionListener.onResumed
has been called by WorkflowRun.onLoad
after WorkflowRun.getExecution
and so we resume the steps unnecessarily. We could check isComplete
here in case this happens, but it seems somewhat unlikely, and the issue is even if we do check it, the execution could complete right after the check before ResumeStepExecutionListener.onResumed
attempts to resume the step executions, so we wouldn't really be guaranteed to prevent the problem. We could try to modify execution resumption to somehow only start after WorkflowRun.onLoad
returns, but that seems like a somewhat risky change unless we have some evidence that it is a problem in practice.
I think I'll go ahead and add the isComplete
check since it won't hurt and could potentially help in some case, but I'll also add a comment mentioning that execution is asynchronous and so there could still be problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check added with a comment in e9f2421.
Once the build completes I will do some downstream testing, probably just with a draft PR against https://github.com/jenkinsci/bom. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should be safe.
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
Show resolved
Hide resolved
src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
Show resolved
Hide resolved
src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
Show resolved
Hide resolved
src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
Show resolved
Hide resolved
} | ||
} | ||
|
||
}, MoreExecutors.directExecutor()); // TODO: Unclear if we need to run this asynchronously or if StepExecution.onResume has any particular thread requirements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like running this synchronously caused JENKINS-67351. I will look into switching to using a different thread.
…kflowRun.onLoad rather than via FlowExecutionList.ItemListenerImpl to ensure step resumption (jenkinsci#178)" This reverts commit 37c8c49.
See JENKINS-67164.
We have seen various reports of cases where a Pipeline resumes after a Jenkins restart and is occupying an executor slot, but the Pipeline in question appears to have already completed based on log output and it hangs forever after resuming. For example, see JENKINS-50199 (fixed via hardening a different resumption code path) and JENKINS-43587 (which maybe should not have been marked as resolved).
I was able to reproduce this issue when restoring Jenkins from a backup taken while Pipeline builds were in progress. The Pipelines in question were in the process of running a
SynchronousNonBlockingStepExecution
when the backup was taken, and that step should have failed when the Pipeline resumed andFlowExecutionListener$ItemListenerImpl
calledStepExecution.onResume
, but this call never happened because the builds in question were not present inorg.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml
on disk for some reason, so the execution just hung forever waiting for that step to complete. Additionally, the build logs indicated that the builds had already proceeded past this step, so it is not clear to me whyprogram.dat
was outdated. The Pipelines were using the standardPERFORMANCE_OPTIMIZED
durability mode, and theirbuild.xml
files showed them as still in progress.This PR attempts to harden the
StepExecution
resumption code by replacingFlowExecutionListener.ItemListenerImpl
with aFlowExecutionListener
that implementsonResumed
.WorkflowRun.onLoad
directly callsFlowExecutionListener.fireOnResumed
when it resumes in-progress Pipelines (happens automatically whenWorkflowRun.onLoad
is called), so it is guaranteed to be called for every Pipeline that resumes regardless of the stated ofFlowExecutionList
.I will keep investigating to try to understand how this might happen in cases besides backups, since I think that this issue has been seen even when just restarting a controller normally.