
Log a warning in CpsVmExecutorService when tasks start long after they were submitted #892

Closed

Conversation

@dwnusbaum (Member) commented May 23, 2024

@Dohbedoh and I recently looked at a Pipeline build that was seemingly hanging forever with no obvious errors reported. It seems that the cause was that the build was reading a 20ish MB JSON file into memory with readFile and parsing it using @Grab with json-simple (exact details unimportant, readJSON from pipeline-utility-steps has the same behavior). The parsed value was accidentally stored as a global variable on the Pipeline script itself, causing it to be serialized in program.dat every time a step was written, so program.dat was also 20+ MB. The script then started 200ish parallel branches, each with a node step, and nothing else really happened for 30+ minutes.

Thread dumps consistently showed the VM thread for the build writing program.dat deep in JBoss Marshalling internals as a result of calls to CpsStepContext.saveState, which I think were triggered by the node steps starting (not 100% sure, because the stack trace loses the async context). The stack trace for that thread was not identical across thread dumps, so some progress was being made. Pipeline thread dumps showed only 200ish node steps with nothing running inside of them. Using a Groovy script along with a reproducer, we saw that SingleLaneExecutorService.tasks.size() for the build's CpsVmExecutorService.delegate was around 200 entries (i.e. roughly one task per branch), so my hypothesis is that CpsVmExecutorService was so backed up serializing this huge JSON object once for every single parallel branch that the actual Pipeline script was unable to progress, because the tasks submitted by CpsThreadGroup.scheduleRun were at the end of the queue.

I think we should at least try to detect this type of situation and log a warning. We try to time out script execution that occupies the VM thread for more than 5 minutes with the following code, but in this situation the issue was with other tasks that have no such timeout.

try (Timeout timeout = Timeout.limit(5, TimeUnit.MINUTES)) {
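Those queued tasks are what the warning proposed here targets: record when each task is submitted and, when it finally starts running, warn if the delay was large. A minimal sketch of the idea (the `Instant submitted = Instant.now()` line in the diff below is the real starting point; the threshold, class name, and log message here are placeholders, not necessarily what the draft does):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.logging.Level;
import java.util.logging.Logger;

class QueueDelayWarning {
    private static final Logger LOGGER = Logger.getLogger(QueueDelayWarning.class.getName());
    // Placeholder threshold; the draft may use a different value or make it configurable.
    private static final Duration THRESHOLD = Duration.ofMinutes(5);

    /** Wraps a task so that a long submit-to-start delay is logged when the task finally runs. */
    static Runnable wrap(Runnable r) {
        final Instant submitted = Instant.now();
        return () -> {
            Duration queued = Duration.between(submitted, Instant.now());
            if (queued.compareTo(THRESHOLD) > 0) {
                LOGGER.log(Level.WARNING,
                        "Task started {0} after it was submitted; the CPS VM queue may be backed up",
                        queued);
            }
            r.run();
        };
    }
}
```

(As noted in the review comment below, the same logic would need to be applied to the wrap(Callable) overload.)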

Just a draft PR for now because I am not sure about the approach and have not had time to test it much. A more direct improvement for this particular case could be to instead coalesce calls to CpsStepContext.saveState. For example, in this scenario where 200 parallel branches each start a node step which all call saveState, we only need to make sure that saveState runs once after all of the node steps have started. If the first call to saveState starts running when one node step has started and the other 199 have not, and the save takes 5 minutes, there is no reason to then save the CpsThreadGroup 199 more times. We just need to save it one more time, since each call to CpsStepContext.saveState submits a task to save the CpsThreadGroup that subsumes all previously queued save tasks, as long as no other tasks were interleaved between them.
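To illustrate the coalescing idea in isolation, a rough sketch with hypothetical names (this is not the plugin's actual saveState code, and it glosses over the caveat about other tasks being interleaved between the queued saves):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

class CoalescingSaver {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // True while a save task is queued but has not yet started running.
    private final AtomicBoolean savePending = new AtomicBoolean();

    /**
     * Called by each step that wants the state persisted. At most one save is
     * queued at a time; later requests are subsumed by the already-queued save.
     */
    void requestSave(Runnable doSave) {
        if (savePending.compareAndSet(false, true)) {
            executor.submit(() -> {
                // Clear the flag before saving so a request that arrives while the
                // save is running schedules exactly one more save rather than being lost.
                savePending.set(false);
                doSave.run();
            });
        }
        // Otherwise: the already-queued save will persist this step's state too.
    }
}
```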

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@@ -77,9 +79,18 @@ public Void call() {

@Override
protected Runnable wrap(final Runnable r) {
Instant submitted = Instant.now();
dwnusbaum (Member, Author) commented on this diff:

Would need to be copied to the wrap(Callable) overload as well.

@dwnusbaum (Member, Author) commented:

Still in the back of my mind, but I have not heard anything back from the user who originally had this issue and I have no time to investigate this further for now. I think it would be nice to have a way to check the queue depth for SingleLaneExecutorService, the distribution of wait times for queued callables, and the distribution of execution times for each callable. Maybe this information could be exposed via metrics.
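For example, something along these lines could expose that data, assuming a Dropwizard MetricRegistry is available (e.g. from the metrics plugin); the metric names and the CpsVmMetrics class are invented for illustration:

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.Queue;

class CpsVmMetrics {
    private final Timer queueWait; // submit-to-start delay per task
    private final Timer taskRun;   // time each task spends running

    CpsVmMetrics(MetricRegistry registry, Queue<?> tasks) {
        queueWait = registry.timer("cps.vm.queue.wait");
        taskRun = registry.timer("cps.vm.task.duration");
        // Current queue depth, sampled whenever the gauge is read.
        registry.register("cps.vm.queue.depth", (Gauge<Integer>) tasks::size);
    }

    /** Wraps a task at submission time so wait and run durations are recorded. */
    Runnable instrument(Runnable r) {
        final Timer.Context waiting = queueWait.time(); // started when the task is submitted
        return () -> {
            waiting.stop();                             // stopped when the task actually starts
            final Timer.Context running = taskRun.time();
            try {
                r.run();
            } finally {
                running.stop();
            }
        };
    }
}
```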
