-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve support bundle data for running Pipelines, update support bundle component names and file paths, and update log warning for a particular type of error during step completion #916
Conversation
plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsVmExecutorService.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsVmExecutorService.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsVmExecutorService.java
Outdated
Show resolved
Hide resolved
…f in progress operations
…r to reflect its severity
e0ec99a
to
a3db23e
Compare
metrics
// TODO: If the problem is with the FlowNode and not the CpsFlowExecution, should we try to call | ||
// CpsVmExecutorService.reportProblem or CpsFlowExecution.croak to kill the build right away? | ||
LOGGER.log(Level.WARNING, "Unable to load FlowNode or CpsFlowExecution when completing " + this + ", which is likely to cause its execution to hang indefinitely", x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not have any tests in this plugin which enter this catch
block, and the only case where I ever saw it was while manually creating broken Pipelines while investigating jenkinsci/workflow-api-plugin#349. I think for now it is good enough to just update the log message, and if we see it come up when investigating stuck Pipelines in the future, then it might be worth investigating further.
/** instances of {@link Timing} which have not yet completed for reporting counts and durations in support bundles. Never persisted. */ | ||
transient @NonNull Set<Timing> liveIncompleteTimings = ConcurrentHashMap.newKeySet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered a few different approaches, like using metrics
or refactoring liveTimings
, but in the end this seemed like the best way to report useful information about ongoing operations while keeping things simple and similar to what we already do. I don't see any benefit to persisting this data since it only applies to actively running tasks - once the build completes it should be empty, barring any cleanup-related bugs.
@@ -108,13 +118,47 @@ public CpsThreadDump getThreadDump() { | |||
|
|||
@Override public void addContents(Container container) { | |||
container.add(new Content("nodes/master/pipeline-thread-dump.txt") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth updating the display name and file name to be more generic, and/or moving this to a distinct class, or should we just keep it? Users may be used to the current display name, and there may be some CloudBees-internal automation which expects the current file name, but I am not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear to me how this related to the existing pipeline-timings.txt
. Should we consolidate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One is for completed builds, one is for running builds. If I was going to rename both of them, I would do this:
- "Timing data about recently completed Pipeline builds"/
pipeline-timings.txt
would become "Recently completed Pipeline builds"/pipeline-recent-builds.txt
- "Thread dumps of running Pipeline builds"/
pipeline-thread-dump.txt
would become "Running Pipeline builds"/pipeline-running-builds.txt
We could combine them into a single "Pipeline builds"/pipelines.txt
component, but I think keeping running and completed builds separate seems clearest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do this
Seems reasonable. Anyway, this is fine as is, just trying to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some GitHub searches for usage of these exact strings and checked with CloudBees support about support bundle automation and didn't find anything problematic, so I will go ahead and update the names here.
EDIT: Also, the IDs that keep track of which Component
s are active are based on the simple class name of the Component
by default, so I will just leave the class names as-is and keep them where they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See e8af1ac.
@@ -108,13 +118,47 @@ public CpsThreadDump getThreadDump() { | |||
|
|||
@Override public void addContents(Container container) { | |||
container.add(new Content("nodes/master/pipeline-thread-dump.txt") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear to me how this related to the existing pipeline-timings.txt
. Should we consolidate?
var started = Instant.ofEpochMilli(run.getStartTimeInMillis()); | ||
pw.println("Started: " + started); | ||
var duration = Duration.between(started, Instant.now()); | ||
pw.print("Duration: " + duration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very legible; consider Util.getTimeSpanString
(IIRC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is subjective (personally I like ISO 8601 in contexts like this). Mainly I did not use getTimeSpanString
to avoid localization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearer, thanks.
Relates to #892, jenkinsci/workflow-api-plugin#347, and jenkinsci/workflow-api-plugin#349.
Recently I have seen two cases where the CPS VM thread task queue is either very backed up or completely stuck for some builds, leading to various problems. It is currently difficult to detect this state without either running Groovy scripts that use reflection to access internal state of
SingleLaneExecutorService
or analyzing a heap dump. I think it is worth tracking both the amount of time that a build spends waiting for queued tasks to start as well as the current number of queued tasks.Here is an example of
pipeline-thread-dump.txt
with the data from this PR:Testing done
Submitter checklist