-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-67351] Avoid deadlock when resuming Pipelines in some cases #188
Conversation
…ion against other types of deadlock as in JENKINS-25890
… to help consumers avoid resuming Pipelines inadvertently
/** | ||
* Returns true if all executions that were present in this {@link FlowExecutionList} have been loaded and resumed. | ||
* | ||
* This takes place slightly after {@link InitMilestone#COMPLETED} is reached during Jenkins startup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/jenkinsci/jenkins/blob/960925fc5055208d2b64a0388161ddcbcd5916aa/core/src/main/java/jenkins/model/Jenkins.java#L987-L1033. Perhaps we could instead add an API to core to indicate that Jenkins.<init>
is truly complete, although that would not help us right at this moment, or maybe we could check for STARTUP_MARKER_FILE
instead or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea from @jglick is to refactor FlowExecutionList$ItemListenerImpl.onLoaded
to instead be an @Initializer
method so that we can then just use InitMilestone.COMPLETED
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #188 (comment) and later comments. For now we plan to leave this method in place but mark it as @Restricted(Beta.class)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using STARTUP_MARKER_FILE since it's already there, but this way seems more explicit without having to dig into core
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBD if we actually need the new API. Otherwise looks good.
} | ||
} | ||
|
||
// Using an unbounded thread pool to avoid potential deadlock as in JENKINS-25890. TODO: Unclear if this is actually necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timer
is probably good enough? Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe I'll switch back to Timer
for now and then if there is related deadlock we can introduce an ExecutorService
. In this case we are at least guaranteed to release the RunMap
lock as long as the callback does not use the current thread, so even if Timer
is saturated and the task to resume the step executions is not scheduled quickly, I don't think anything else would be blocked waiting for it.
Yeah I am currently testing a switch from |
So far it causes |
That test is edited in jenkinsci/workflow-durable-task-step-plugin#180 FYI. |
The issue with that test seems to be that having Pipelines resume before this code runs changes the behavior slightly. Compare the log output of the tests, here is the current behavior:
Here is the behavior when
Easy enough to fix that test, but I will check |
Not sure I follow. Do you mean that |
Oh actually it looks like the problem is that when using Tests in |
Hmm, true, that sounds dangerous. I tried to think of a cleaner system than the API currently proposed here, but have not come up with anything. Maybe at least mark it |
Sure, that sounds good to me. |
… precaution against other types of deadlock as in JENKINS-25890" This reverts commit c178493.
Note: not deployed due to apparent flake https://github.com/jenkinsci/workflow-api-plugin/runs/4555248728 |
@@ -163,6 +167,18 @@ public static FlowExecutionList get() { | |||
return l; | |||
} | |||
|
|||
/** | |||
* Returns true if all executions that were present in this {@link FlowExecutionList} have been loaded and resumed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that last part actually true? Now that onResume
can be called asynchronously, this does not seem to be tracking that it completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The WorkflowRun
and CpsFlowExecution
have been loaded, CpsFlowExecution.loadProgramAsync
and FlowExecutionListener.onResumed
have been called, but StepExecution.onResume
may not have been called yet or may still be in progress.
@dwnusbaum I think this needs to be reverted. It causes frequent test failures in diff --git src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java
index d36e18d..41110d7 100644
--- src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java
+++ src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java
@@ -79,6 +79,7 @@ public class SSHAgentStepExecution extends AbstractStepExecutionImpl implements
@Override
public void onResume() {
super.onResume();
+ listener.getLogger().println("=== onResume starts");
try {
purgeSockets();
initRemoteAgent();
@@ -86,6 +87,7 @@ public class SSHAgentStepExecution extends AbstractStepExecutionImpl implements
listener.getLogger().println(Messages.SSHAgentBuildWrapper_CouldNotStartAgent());
x.printStackTrace(listener.getLogger());
}
+ listener.getLogger().println("=== onResume ends");
}
static FilePath tempDir(FilePath ws) { it is possible to see that with this patch,
is typically printed somewhere in the middle of |
@jglick Fine (I have no time to investigate), but this cannot be reverted on its own, #178 as well as jenkinsci/workflow-durable-task-step-plugin#188 and jenkinsci/workflow-durable-task-step-plugin#185 would need to be reverted all at the same time. |
Right, that is what I meant. |
Perhaps adding some kind of API that |
Either of those proposals sound reasonable. |
Well, I got sidetracked and looked into this a bit.
This does not seem to work on its own, because at least in tests, From a bit more testing, passing an
|
…e cases (jenkinsci#188)" This reverts commit 6d6de20.
Perhaps. The following seems to restore reasonable behavior to diff --git src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
index b04be34..10eff35 100644
--- src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
+++ src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
@@ -173,10 +173,18 @@ public class FlowExecutionList implements Iterable<FlowExecution> {
*/
@Extension
public static class ItemListenerImpl extends ItemListener {
+ void pause(FlowExecution e, boolean b) {
+ try {
+ e.getClass().getMethod("pause", boolean.class).invoke(e, b);
+ } catch (Exception x) {
+ x.printStackTrace();
+ }
+ }
@Override
public void onLoaded() {
FlowExecutionList list = FlowExecutionList.get();
for (final FlowExecution e : list) {
+ pause(e, true);
// The call to FlowExecutionOwner.get in the implementation of iterator() is sufficent to load the Pipeline.
LOGGER.log(Level.FINE, "Eagerly loaded {0}", e);
Futures.addCallback(e.getCurrentExecutions(false), new FutureCallback<List<StepExecution>>() {
@@ -190,6 +198,7 @@ public class FlowExecutionList implements Iterable<FlowExecution> {
se.getContext().onFailure(x);
}
}
+ pause(e, false);
}
@Override
@@ -200,7 +209,7 @@ public class FlowExecutionList implements Iterable<FlowExecution> {
LOGGER.log(Level.WARNING, "Failed to load " + e, t);
}
}
- }, MoreExecutors.directExecutor());
+ }, Timer.get());
}
}
} If I could get something like that to work, was there any other obstacle to restoring your fixes (basically reverting #198 & jenkinsci/workflow-durable-task-step-plugin#197)? |
Not that I am aware of. |
See JENKINS-67351, #178, and jenkinsci/workflow-durable-task-step-plugin#185.
The
RunMap
lock is always held whenWorkflowRun.onLoad
callsFlowExecutionListener.fireResumed
, which means that any steps that need to acquire other locks inStepExecution.onResume
(e.g.ExecutorStepExecution.onResume
needs to acquire the queue lock to schedule a build) may trigger a deadlock with other code that also attempts to acquire theRunMap
lock.To prevent these issues, we use a new
ExecutorService
so that theRunMap
lock is not held when resuming step executions.This PR also introduces a new API that will be used to update jenkinsci/workflow-durable-task-step-plugin#185 (will mark as ready for review once I have a PR up for that) to prevent Pipelines from being resumed before
FlowExecutionListener$ItemListenerImpl.onLoaded
has even had a chance to execute, which can lead to similar issues.