Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-67164] Call StepExecution.onResume directly from WorkflowRun.onLoad rather than via FlowExecutionList.ItemListenerImpl to ensure step resumption #178

Merged
merged 7 commits into from
Nov 19, 2021

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Nov 4, 2021

See JENKINS-67164.

We have seen various reports of cases where a Pipeline resumes after a Jenkins restart and is occupying an executor slot, but the Pipeline in question appears to have already completed based on log output and it hangs forever after resuming. For example, see JENKINS-50199 (fixed via hardening a different resumption code path) and JENKINS-43587 (which maybe should not have been marked as resolved).

I was able to reproduce this issue when restoring Jenkins from a backup taken while Pipeline builds were in progress. The Pipelines in question were in the process of running a SynchronousNonBlockingStepExecution when the backup was taken, and that step should have failed when the Pipeline resumed and FlowExecutionListener$ItemListenerImpl called StepExecution.onResume, but this call never happened because the builds in question were not present in org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml on disk for some reason, so the execution just hung forever waiting for that step to complete. Additionally, the build logs indicated that the builds had already proceeded past this step, so it is not clear to me why program.dat was outdated. The Pipelines were using the standard PERFORMANCE_OPTIMIZED durability mode, and their build.xml files showed them as still in progress.

This PR attempts to harden the StepExecution resumption code by replacing FlowExecutionListener.ItemListenerImpl with a FlowExecutionListener that implements onResumed. WorkflowRun.onLoad directly calls FlowExecutionListener.fireOnResumed when it resumes in-progress Pipelines (happens automatically when WorkflowRun.onLoad is called), so it is guaranteed to be called for every Pipeline that resumes regardless of the stated of FlowExecutionList.

I will keep investigating to try to understand how this might happen in cases besides backups, since I think that this issue has been seen even when just restarting a controller normally.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

…han via FlowExecutionList to ensure resumption
Comment on lines +101 to +102
// TODO: Unclear how this might happen in practice.
FlowExecutionList.get().unregister(b.asFlowExecutionOwner());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I am not sure how this might happen outside of something like a backup where files are being copied indiscriminately. The code around build completion in WorkflowRun looks reasonable to me from a quick glance, and the call to unregister is the final thing that happens, so the only thing that comes to mind is that there is some issue causing the build to not be saved as expected, for example perhaps some issue with a BulkChange being aborted or something like that.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking right.

public void onSuccess(List<StepExecution> result) {
FlowExecutionList list = FlowExecutionList.get();
FlowExecutionOwner owner = e.getOwner();
if (!list.runningTasks.contains(owner)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check isComplete first?

Copy link
Member Author

@dwnusbaum dwnusbaum Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking about that as well. CpsFlowExecution.isComplete is checked in CpsFlowExecution.onLoad before loadProgramAsync gets called, but WorkflowRun.onLoad only checks CpsFlowExecution.isComplete() if WorkflowRun.completed is null, which I think should only be true for old Pipelines from before WorkflowRun.completed was added (which seems wrong, probably better to always check in case there is a mismatch between the serialized state of the execution and the run, so I will look at that in a bit), so there is no guarantee about the value of CpsFlowExecution.isComplete here as far as I can tell, so it might be worth checking.

I also wasn't sure if it would be possible for there to be a timing issue, for example if the pipeline was suspended somewhere where it doesn't need any additional input to proceed and so the CPS VM thread finishes the build while we are on another thread still in WorkflowRun.onLoad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it would be possible for there to be a timing issue

Almost certainly.

Copy link
Member Author

@dwnusbaum dwnusbaum Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost certainly.

If this is the case though, then we cannot fix the issue here, because there is no synchronization that can guarantee that the execution does not complete right after we check isComplete but before we do anything else. Maybe the CPS VM thread should not attempt to execute anything until WorkflowRun.onLoad finishes or something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes you think the observed problem is the result of a race condition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following your question. I just mean that it is not clear to me whether checking isComplete here actually gives us any useful guarantees, partly because WorkflowRun.onLoad is not guaranteed to check it and partly because the execution actually completing will happen on the CPS VM thread, which is unrelated to the thread calling WorkflowRun.onLoad. I don't necessarily think that the underlying reported problem is a race condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I said

then we cannot fix the issue here,

I just meant that we cannot fix any timing issues related to checking isComplete or not. Maybe that was not clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so reading through WorkflowRun.onLoad again, it is guaranteed to call isComplete inside of WorkflowRun.getExecution here, and if the execution is complete, it marks the WorkflowRun as complete (this was the fix for the symptoms of JENKINS-50199), so this condition would be false and so FlowExecutionListener.fireResumed would be skipped along with ResumeStepExecutionListener.onResumed if the execution was marked as complete in its serialized form.

That leaves us with a potential case where WorkflowRun.getExecution loads and resumes the execution, which has not yet completed, but then the execution completes before ResumeStepExecutionListener.onResumed has been called by WorkflowRun.onLoad after WorkflowRun.getExecution and so we resume the steps unnecessarily. We could check isComplete here in case this happens, but it seems somewhat unlikely, and the issue is even if we do check it, the execution could complete right after the check before ResumeStepExecutionListener.onResumed attempts to resume the step executions, so we wouldn't really be guaranteed to prevent the problem. We could try to modify execution resumption to somehow only start after WorkflowRun.onLoad returns, but that seems like a somewhat risky change unless we have some evidence that it is a problem in practice.

I think I'll go ahead and add the isComplete check since it won't hurt and could potentially help in some case, but I'll also add a comment mentioning that execution is asynchronous and so there could still be problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check added with a comment in e9f2421.

@dwnusbaum
Copy link
Member Author

Once the build completes I will do some downstream testing, probably just with a draft PR against https://github.com/jenkinsci/bom.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should be safe.

@dwnusbaum dwnusbaum added the bug label Nov 17, 2021
@dwnusbaum dwnusbaum changed the title Call StepExecution.onResume directly from WorkflowRun.onLoad rather than via FlowExecutionList.ItemListenerImpl to ensure resumption [JENKINS-67164] Call StepExecution.onResume directly from WorkflowRun.onLoad rather than via FlowExecutionList.ItemListenerImpl to ensure step resumption Nov 17, 2021
@dwnusbaum dwnusbaum marked this pull request as ready for review November 17, 2021 21:29
@dwnusbaum dwnusbaum requested a review from car-roll November 17, 2021 21:29
}
}

}, MoreExecutors.directExecutor()); // TODO: Unclear if we need to run this asynchronously or if StepExecution.onResume has any particular thread requirements.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like running this synchronously caused JENKINS-67351. I will look into switching to using a different thread.

dwnusbaum added a commit to dwnusbaum/workflow-api-plugin that referenced this pull request Jan 14, 2022
…kflowRun.onLoad rather than via FlowExecutionList.ItemListenerImpl to ensure step resumption (jenkinsci#178)"

This reverts commit 37c8c49.
@dwnusbaum dwnusbaum deleted the step-resumption-reliability branch May 10, 2022 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants