Avoid infinite loops due to corrupted flow graphs in some cases and improve resumption error handling #349
Conversation
…kupView and LinearBlockHoppingScanner and improve resumption error handling
```
@@ -362,20 +362,31 @@ private static final class ParallelResumer {
                    nodes.put(n, se);
                } else {
                    LOGGER.warning(() -> "Could not find FlowNode for " + se + " so it will not be resumed");
                    // TODO: Should we call se.getContext().onFailure()?
```
I will try to create tests for all of these cases to see what happens.
Well, it's kind of interesting. Specifically in the case of a missing or corrupt `FlowNode`, there are two cases that I see:

- If `FlowExecution.heads` contains the bad node, then everything gets handled in `CpsFlowExecution.onLoad`, we never read `program.dat`, we create placeholder nodes, the Pipeline fails, great.
- If `FlowExecution.heads` does not contain the bad node, then the Pipeline loads, we attempt to resume it, various warnings get logged, and it hangs forever. The problem is that this call in `CpsStepContext` throws right away, so the `CpsThread` never resumes with the outcome, and the Pipeline hangs. It seems like this log message should at least be bumped to `WARNING` because any errors there are likely to be the proximate cause of a build hanging forever, but I wonder if we should also attempt to call `CpsFlowExecution.croak` or `CpsVmExecutorService.reportProblem` in that catch block.
So long story short, it seems like there is no point trying to do anything special in these cases in this plugin for now.
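For readers wondering what the TODO in the diff above would have looked like if the plugin had taken that route, here is a rough, hypothetical sketch. It uses the real `StepExecution.getContext()` and `StepContext.onFailure(Throwable)` APIs from `workflow-step-api`, but the `failResume` helper is invented for illustration and this is not what the PR actually does.

```java
// Hypothetical sketch (not the plugin's actual code): fail the step whose FlowNode
// cannot be found instead of silently skipping it, so the build errors out rather
// than hanging while waiting for that step to resume.
import org.jenkinsci.plugins.workflow.steps.StepExecution;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

class ResumeFallbackSketch {
    private static final Logger LOGGER = Logger.getLogger(ResumeFallbackSketch.class.getName());

    // "failResume" is an invented name for illustration only.
    static void failResume(StepExecution se) {
        LOGGER.warning(() -> "Could not find FlowNode for " + se + " so it will not be resumed");
        try {
            // StepContext.onFailure completes the step abnormally, which should let the
            // enclosing Pipeline finish with an error instead of waiting forever.
            se.getContext().onFailure(new IOException("Could not find FlowNode for " + se));
        } catch (RuntimeException e) {
            LOGGER.log(Level.WARNING, e, () -> "Failed to report resume failure for " + se);
        }
    }
}
```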
jenkinsci/workflow-cps-plugin#916 updates the relevant log message to `WARNING` so that we have an idea if the problematic situation is ever occurring in practice.
```
                }
            }
        } catch (Exception e) {
            // TODO: Should we instead just try to resume steps with no respect to topological order?
```
Not sure if this would be better or worse than killing all of the steps. If you end up here, something is pretty wrong with your flow graph. The main causes I can think of are if a `BlockEndNode.startId` points to a node that either doesn't exist, is not a `BlockStartNode`, or (the new case I just saw) points to a `BlockStartNode` with a newer ID than the end node, creating a cycle in the graph.
The question though is whether flow graph corruption necessarily means that trying to resume the steps is pointless, or if there is a decent chance that if we fall back to synchronous resumption of each step execution in order, the build might be able to complete successfully.
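The "detected and throw an exception" behavior discussed in this PR boils down to a visited-set guard: remember the IDs of nodes already seen during the backwards walk and fail fast when one repeats. Below is a minimal, generic sketch of that guard, assuming a walk that follows each node's first parent; the class and method names are invented for illustration and this is not the actual `StandardGraphLookupView` or `LinearBlockHoppingScanner` code.

```java
// Generic visited-set loop guard for walking backwards through a flow graph.
// Illustrative only; not the plugin's actual implementation.
import org.jenkinsci.plugins.workflow.graph.FlowNode;

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FlowGraphCycleGuardSketch {
    static void checkFirstParentChainForCycles(FlowNode start) {
        Set<String> visited = new HashSet<>();
        FlowNode current = start;
        while (current != null) {
            if (!visited.add(current.getId())) {
                // A corrupted graph (e.g. a BlockEndNode whose startId points at a node
                // with a newer ID) creates a cycle; without this check the walk never ends.
                throw new IllegalStateException("Cycle in flow graph at node " + current.getId());
            }
            List<FlowNode> parents = current.getParents();
            current = parents.isEmpty() ? null : parents.get(0);
        }
    }
}
```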
Ah I was just looking for #221 the other day as a reference, because its original motivating use case is obsolete as of jenkinsci/ssh-agent-plugin#152. Of course there might still be other steps which legitimately need context from an enclosing step in their `onResume`, but I am not aware of any offhand. So I think it would be reasonable to fall back to resuming all steps in random order. I would not expect the build to complete successfully, but it might be able to e.g. release external resources more gracefully.
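A minimal sketch of what that fallback could look like, assuming the surviving `StepExecution`s are already collected into a list; the `resumeIgnoringOrder` name is invented for illustration, though `StepExecution.onResume()` is the real resumption hook.

```java
// Illustrative fallback: resume every step execution directly, ignoring the
// topological order of their FlowNodes, and isolate failures per step so one
// bad step does not prevent the others from getting a chance to clean up.
import org.jenkinsci.plugins.workflow.steps.StepExecution;

import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class UnorderedResumeSketch {
    private static final Logger LOGGER = Logger.getLogger(UnorderedResumeSketch.class.getName());

    static void resumeIgnoringOrder(List<StepExecution> executions) {
        for (StepExecution se : executions) {
            try {
                // Even if the build cannot complete, each step gets a chance to
                // e.g. release external resources on resumption.
                se.onResume();
            } catch (Exception e) {
                LOGGER.log(Level.WARNING, e, () -> "Failed to resume " + se);
            }
        }
    }
}
```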
c6b4fa9 adjusts things so we resume all step executions as long as they have a `FlowNode`. #349 (comment) discusses the cases where the `FlowNode` is missing and/or corrupted.
Related to #347. This PR tries to address these main issues:

- Infinite loops in `StandardGraphLookupView.bruteForceScanForEnclosingBlock` and `LinearBlockHoppingScanner` are now detected and throw an exception. This does not fix all possible infinite loops with flow graph iteration APIs, but these are the only two cases I saw where the infinite loop happens internally rather than the caller actually iterating over nodes infinitely. Loops in `StandardGraphLookupView.bruteForceScanForEnclosingBlock` via `PlaceholderTask.getAffinityKey` brick the build queue, and loops in `LinearBlockHoppingScanner` brick the `CpsVmExecutorService` for the build in question.

Testing done

Submitter checklist