Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve support bundle data for running Pipelines, update support bundle component names and file paths, and update log warning for a particular type of error during step completion #916

Merged
merged 8 commits into from
Aug 15, 2024

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Aug 8, 2024

Relates to #892, jenkinsci/workflow-api-plugin#347, and jenkinsci/workflow-api-plugin#349.

Recently I have seen two cases where the CPS VM thread task queue is either very backed up or completely stuck for some builds, leading to various problems. It is currently difficult to detect this state without either running Groovy scripts that use reflection to access internal state of SingleLaneExecutorService or analyzing a heap dump. I think it is worth tracking both the amount of time that a build spends waiting for queued tasks to start as well as the current number of queued tasks.

Here is an example of pipeline-thread-dump.txt with the data from this PR:

Build: p #39
Started: 2024-08-12T20:08:44.701Z
Duration: PT7.651134S
Timings:
  classLoad	33ms
  flowNode	69ms
  parse	69ms
  run	177ms
  runQueue	417ms
  saveProgram	50ms
Active operations:
  run	1	7326ms
  runQueue	7	51535ms
Approximate graph size: 11
Thread #5
	at DSL.sleep(sleeping for another 52 sec)
	at WorkflowScript.run(WorkflowScript:3)
	at DSL.parallel(Native Method)
	at WorkflowScript.run(WorkflowScript:1)
Thread #6
	at WorkflowScript.run(WorkflowScript:5)
Thread #7
	at WorkflowScript.run(WorkflowScript:7)
Thread #8
	at WorkflowScript.run(WorkflowScript:9)


Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@dwnusbaum dwnusbaum force-pushed the cps-vm-thread-metrics branch from e0ec99a to a3db23e Compare August 12, 2024 19:15
@dwnusbaum dwnusbaum changed the title Track CPS VM thread queue size and task execution times via metrics Improve support bundle data for running Pipelines and update log warning for a particular type of error during step completion Aug 12, 2024
Comment on lines +438 to +440
// TODO: If the problem is with the FlowNode and not the CpsFlowExecution, should we try to call
// CpsVmExecutorService.reportProblem or CpsFlowExecution.croak to kill the build right away?
LOGGER.log(Level.WARNING, "Unable to load FlowNode or CpsFlowExecution when completing " + this + ", which is likely to cause its execution to hang indefinitely", x);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not have any tests in this plugin which enter this catch block, and the only case where I ever saw it was while manually creating broken Pipelines while investigating jenkinsci/workflow-api-plugin#349. I think for now it is good enough to just update the log message, and if we see it come up when investigating stuck Pipelines in the future, then it might be worth investigating further.

Comment on lines +407 to +408
/** instances of {@link Timing} which have not yet completed for reporting counts and durations in support bundles. Never persisted. */
transient @NonNull Set<Timing> liveIncompleteTimings = ConcurrentHashMap.newKeySet();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered a few different approaches, like using metrics or refactoring liveTimings, but in the end this seemed like the best way to report useful information about ongoing operations while keeping things simple and similar to what we already do. I don't see any benefit to persisting this data since it only applies to actively running tasks - once the build completes it should be empty, barring any cleanup-related bugs.

@@ -108,13 +118,47 @@ public CpsThreadDump getThreadDump() {

@Override public void addContents(Container container) {
container.add(new Content("nodes/master/pipeline-thread-dump.txt") {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth updating the display name and file name to be more generic, and/or moving this to a distinct class, or should we just keep it? Users may be used to the current display name, and there may be some CloudBees-internal automation which expects the current file name, but I am not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear to me how this related to the existing pipeline-timings.txt. Should we consolidate?

Copy link
Member Author

@dwnusbaum dwnusbaum Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is for completed builds, one is for running builds. If I was going to rename both of them, I would do this:

  • "Timing data about recently completed Pipeline builds"/pipeline-timings.txt would become "Recently completed Pipeline builds"/pipeline-recent-builds.txt
  • "Thread dumps of running Pipeline builds"/pipeline-thread-dump.txt would become "Running Pipeline builds"/pipeline-running-builds.txt

We could combine them into a single "Pipeline builds"/pipelines.txt component, but I think keeping running and completed builds separate seems clearest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do this

Seems reasonable. Anyway, this is fine as is, just trying to clarify.

Copy link
Member Author

@dwnusbaum dwnusbaum Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some GitHub searches for usage of these exact strings and checked with CloudBees support about support bundle automation and didn't find anything problematic, so I will go ahead and update the names here.

EDIT: Also, the IDs that keep track of which Components are active are based on the simple class name of the Component by default, so I will just leave the class names as-is and keep them where they are.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See e8af1ac.

@@ -108,13 +118,47 @@ public CpsThreadDump getThreadDump() {

@Override public void addContents(Container container) {
container.add(new Content("nodes/master/pipeline-thread-dump.txt") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear to me how this related to the existing pipeline-timings.txt. Should we consolidate?

var started = Instant.ofEpochMilli(run.getStartTimeInMillis());
pw.println("Started: " + started);
var duration = Duration.between(started, Instant.now());
pw.print("Duration: " + duration);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very legible; consider Util.getTimeSpanString (IIRC).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is subjective (personally I like ISO 8601 in contexts like this). Mainly I did not use getTimeSpanString to avoid localization.

@dwnusbaum dwnusbaum requested review from jglick and a team August 13, 2024 19:24
@dwnusbaum dwnusbaum changed the title Improve support bundle data for running Pipelines and update log warning for a particular type of error during step completion Improve support bundle data for running Pipelines, update support bundle component names and file paths, and update log warning for a particular type of error during step completion Aug 13, 2024
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearer, thanks.

@dwnusbaum dwnusbaum merged commit 3519a32 into jenkinsci:master Aug 15, 2024
16 checks passed
@dwnusbaum dwnusbaum deleted the cps-vm-thread-metrics branch August 15, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants