Log a warning in CpsVmExecutorService when tasks start long after they were submitted #892
@Dohbedoh and I recently looked at a Pipeline build that was seemingly hanging forever with no obvious errors reported. It seems that the cause was that the build was reading a 20ish MB JSON file into memory with `readFile` and parsing it using `@Grab` with `json-simple` (exact details unimportant; `readJSON` from `pipeline-utility-steps` has the same behavior). The parsed value was accidentally stored as a global variable on the Pipeline script itself, causing it to be serialized in `program.dat` every time a step was written, so `program.dat` was also 20+ MB. The script then started 200ish parallel branches, each with a `node` step, and nothing else really happened for 30+ minutes.

Thread dumps always showed the VM thread for the build writing `program.dat` deep in JBoss Marshalling internals as a result of calls to `CpsStepContext.saveState`, which I think were a result of the `node` step starting, see here (not 100% sure because the stack trace loses the async context). The stack trace for that thread was not identical across thread dumps, so it seemed that some progress was being made. Pipeline thread dumps showed only 200ish `node` steps with nothing running inside of them. Using a Groovy script along with a reproducer, we saw that `SingleLaneExecutorService.tasks.size()` for `CpsVmExecutorService.delegate()` for the build was around 200 entries (i.e. roughly one task per branch), so my hypothesis is that `CpsVmExecutorService` was so backed up having to serialize this huge JSON object once for every single parallel branch that the actual Pipeline script was unable to progress because the tasks submitted by `CpsThreadGroup.scheduleRun` were at the end of the queue.

I think we should at least try to detect this type of situation and log a warning. We try to time out script execution that occupies the VM thread for more than 5 minutes with the following code, but in this situation the issue was with other tasks that have no such timeout.
`workflow-cps-plugin/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsThread.java`, line 176 in `7fc878b`
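As a rough illustration of the kind of check this is aiming for, the queue delay can be measured by stamping each task at submission and comparing against the clock when it actually starts running. This is a minimal sketch, not the actual patch; the class name, threshold, and log message are placeholders:

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Wraps a task to record when it was submitted and logs a warning if it
 * only starts running long afterwards, which would indicate that the
 * single-lane queue for this build's CPS VM thread is backed up.
 */
class DelayReportingRunnable implements Runnable {

    private static final Logger LOGGER = Logger.getLogger(DelayReportingRunnable.class.getName());

    /** Hypothetical threshold; a real change would likely make this configurable. */
    private static final long THRESHOLD_NANOS = TimeUnit.MINUTES.toNanos(5);

    private final Runnable delegate;
    private final long submittedAt = System.nanoTime();

    DelayReportingRunnable(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        long delayNanos = System.nanoTime() - submittedAt;
        if (delayNanos > THRESHOLD_NANOS) {
            LOGGER.log(Level.WARNING,
                    "Task {0} started {1} seconds after it was submitted; the CPS VM executor queue may be backed up",
                    new Object[] {delegate, TimeUnit.NANOSECONDS.toSeconds(delayNanos)});
        }
        delegate.run();
    }
}
```

Since `CpsVmExecutorService` already wraps each submitted task, recording a timestamp at submission time should add negligible overhead per task.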
Just a draft PR for now because I am not sure about the approach and have not had time to test it much. A more direct improvement for this particular case could be to instead coalesce calls to `CpsStepContext.saveState`. For example, in this scenario where 200 parallel branches each start a `node` step, all of which call `saveState`, we only need to make sure that `saveState` runs once after all of the `node` steps have started. If the first call to `saveState` starts running when one `node` step has started and the other 199 have not, and the save takes 5 minutes, there is no reason to then save the `CpsThreadGroup` 199 more times. We just need to save it one more time, since each call to `CpsStepContext.saveState` submits a task to save the `CpsThreadGroup` that subsumes all previously queued tasks as long as no other tasks were interleaved between them.
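To make the coalescing idea concrete, one simple shape for it is a pending flag: a save request only enqueues a task if none is already queued, and the queued task serializes whatever the current state is when it runs. This is a generic sketch under those assumptions, with hypothetical names (`CoalescingSaver`, `requestSave`), not the plugin's actual API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Coalesces save requests: any number of requests that arrive while a
 * save task is still waiting in the queue collapse into that one task.
 */
class CoalescingSaver {

    private final AtomicBoolean savePending = new AtomicBoolean();
    private final ExecutorService executor; // stand-in for the build's single-lane executor
    private final Runnable doSave;          // stand-in for persisting the CpsThreadGroup

    CoalescingSaver(ExecutorService executor, Runnable doSave) {
        this.executor = executor;
        this.doSave = doSave;
    }

    /** Called in place of unconditionally submitting a save task. */
    void requestSave() {
        // Enqueue a task only if none is already queued; callers that lose
        // this race are subsumed by the save that is going to run anyway.
        if (savePending.compareAndSet(false, true)) {
            executor.submit(() -> {
                // Clear the flag before saving so that a request arriving
                // during a slow save schedules one more save afterwards
                // rather than being dropped.
                savePending.set(false);
                doSave.run();
            });
        }
    }
}
```

In the 200-branch scenario above, the first `node` step would enqueue one save and most of the other 199 calls would find `savePending` already set, so the group would be serialized roughly twice per wave of requests instead of 200 times.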
Testing done
Submitter checklist