-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Release node body reference when finishing a PlaceholderTask #117
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine, though it would be easier to make body
be a WeakReference<BodyExecution>
I suppose.
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Outdated
Show resolved
Hide resolved
Sure, I reimplemented the change with |
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice and simple. You checked that the WeakReference
rewrite solves the test failure?
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Outdated
Show resolved
Hide resolved
…cutorStepExecution.java Co-Authored-By: Jesse Glick <jglick@cloudbees.com>
Yes, of course. But don't take my word for it, see the results for yourself in jenkinsci/workflow-cps-global-lib-plugin#82. |
…ut `BodyExecution`. When we do not use pickles, there is no need for `transient`. Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak. This is only possible now that `Callback` holds the `ExecutorStepExecution`. And we can now properly handle `AsynchronousExecution.interrupt`.
…tion of `node` step, not kill whole build (#180) * Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution` * Amending test from #34 * Fixing `Callback` * Exploring where to detect a failure eligible for retry * Making `normalNodeDisappearance` pass under somewhat altered circumstances * First successful `node` block retry * `FilePathDynamicContext` must take precedence over `ExecutorStepDynamicContext.FilePathTranslator` * `ExecutorStepTest.unloadableExecutorPickle` as currently conceived no longer makes sense * Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableException` by copying some diagnostic code from `FilePathDynamicContext` * Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnection` * Marking `ExecutorStepRetryEligibility` beta * Javadoc imports * SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.node` to be null * Reasonable behavior of `ExecutorPickleTest.canceledQueueItem` * User-visible logging about 5m timeout now that `ExecutorPickle.printWaitingMessage` is unused * `ExecutorPickleTest` → `ExecutorStepDynamicContextTest` * Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `ExecutorStepExecution` * Some user-visible logging when `ExecutorStepRetryEligibility` is _not_ used * Do not block on `FutureImpl`. After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer` which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message: ``` FINE o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart #1] java.util.concurrent.TimeoutException at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553) at … FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back ``` * A `sh` step could fail to find a workspace at the moment it starts. java.lang.NullPointerException at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:329) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at … * Also handling `ClosedChannelException` as commonly thrown by `SimpleBuildStep`s and the like: jenkinsci/kubernetes-plugin#1083 (comment) * Also retry after `MissingContextVariableException` on `FilePath` * Remove `super.onResume` left over in #46; see jenkinsci/workflow-step-api-plugin#66 * Documenting need for JENKINS-30383 * jenkinsci/workflow-step-api-plugin#77 allows `retryNodeBlockSynchAcrossRestarts` to be fixed * jenkinsci/workflow-step-api-plugin#77 released * Expanding `MissingContextVariableException` type list after noticing that `isUnix` requests `Launcher` not `FilePath` * Never log a message with only `PlaceholderTask.cookie`; also need `.runId` for context * Suppressing uninteresting log message * Maybe need to call `StepContext.saveState`? * Deleting comment; for now it seems clearest to keep `ExecutorStepDynamicContext` as a distinct type * Resolve longstanding tech debt from #104 & #117 about `BodyExecution`. When we do not use pickles, there is no need for `transient`. Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak. This is only possible now that `Callback` holds the `ExecutorStepExecution`. And we can now properly handle `AsynchronousExecution.interrupt`. * Sketch of updating `WorkspaceStepExecution`: #180 (comment) * Allowing `FilePathDynamicContext` to block waiting for an agent to reconnect. Not great since the CPS VM thread will only grant 5s. Better would be to avoid unnecessary use of `StepContext.get(FilePath.class)` (which implicitly requires a live `Channel`); perhaps `FilePathRepresentation` should become an API. Amends b65ff1c. #180 (comment) * Adding a comment about 8b08398 * 8b08398 seems to allow a968adf to be finished. * Refining 8b08398 to avoid tail call * `org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated * SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an SVUID; taking one from trunk * Adding `NodeTranslator` * Beginning rewrite to `AgentErrorCondition` * Extracted `RetryExecutorStepTest` from `ExecutorStepTest` * Rely on blocking `ExecutorStepExecution.onResume` * Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `unrestorableAgent`. Sometimes a failing `waitForMessage` seems to trigger https://maven.apache.org/surefire/maven-surefire-plugin/faq.html#corruptedstream though I cannot track down why. * `PlaceholderTask` may not hold a reference to `ExecutorStepExecution` * Equivalent of #184 for `ExecutorStepDynamicContext` * Assert jenkinsci/workflow-api-plugin@b1778a9 * jenkinsci/workflow-cps-plugin#534 & jenkinsci/workflow-job-plugin#260 released * Windows tests jenkinsci/workflow-basic-steps-plugin#203 (comment) * Avoiding `powershell` jenkinsci/workflow-basic-steps-plugin#203 (comment) * Better usage of Hamcrest; got a flake https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=6779178701 with `QueueTaskCancelled` instead of `RemovedNodeCause` * In fact the flake was in a different test (`canceledQueueItem`) * Forgot to make `canceledQueueItem` not use `sh`, though it should not really matter since this step is expected to fail anyway * Test flake: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=7240329028 * Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well * Can now run all of `AgentErrorConditionTest` * Responding to `CancellationException` in `ExecutorStepDynamicContext` with `QueueTaskCancelled`, analogous to handling in `ExecutorPickle` * Permit `canceledQueueItem` to pass on `RemovedNodeCause` * Normalizing indentation * jenkinsci/workflow-step-api-plugin#124 deprecates an overload which 2e95e3a only handles at one call site; TBD if others should be switched to `actualInterruption` * Refined `ExecutorStepDynamicContext` behavior in case of `quietingDown` #180 (comment) or `terminating` #180 (comment) * Add an explicit `serialVersionUID` to `ExecutorStepExecution.PlaceholderTask.Callback` #180 (comment) * Optimizing `withExecution` to only consider steps in the current build #180 (comment) * Pick up jenkinsci/workflow-api-plugin#256 #180 (comment) * jenkinsci/workflow-api-plugin#256 released
Problem
After upgrading
workflow-durable-task-step
from 2.31 to 2.32 inworkflow-cps-global-lib-plugin
,LibraryMemoryTest#loaderReleased
starts failing with the following stack trace:Evaluation
Bisection revealed that the problem does not occur with
workflow-durable-task-step
2.31 (see jenkinsci/workflow-cps-global-lib-plugin#80) but does occur withworkflow-durable-task-step
2.32 (see jenkinsci/workflow-cps-global-lib-plugin#81). Looking at the changes in the compare view, I noticed two new fields:DurableTaskStep#removedNodeDiscovered
andPlaceholderTask#body
, both added in #104.DurableTaskStep#removedNodeDiscovered
is along
and is unlikely to be the leak, so I hypothesized thatPlaceholderTask#body
was the leak. Sure enough, when I changedbody = context.newBodyInvoker().[...]
back tocontext.newBodyInvoker().[...]
inPlaceholderExecutable#run
(the only place wherePlaceholderExecutable#body
was ever set) and re-ranLibraryMemoryTest#loaderReleased
, it passed. This confirmed that the proximate source of the problem wasPlaceholderTask#body
.While I don't know enough about this plugin to be completely sure, based on the above analysis I strongly suspect that the cause of the problem is that
PlaceholderTask#body
is being leaked, exposing theworkflow-cps-global-lib-plugin
test failure inLibraryMemoryTest#loaderReleased
.Solution
Assuming that the above evaluation is correct, the solution would be to release
PlaceholderExecutable#body
when it is no longer needed. Without knowing very much about this plugin,PlaceholderTask#finish
seemed like as good a place as any to do this. Much to my surprise,workflow-cps-global-lib-plugin
'sLibraryMemoryTest#loaderReleased
test started passing again once I made this change.If I am misunderstanding the cause of the problem or if there is a better solution, please feel free to let me know.