
Log a warning in CpsVmExecutorService when tasks start long after they were submitted #892

Closed

Conversation

@dwnusbaum (Member) commented May 23, 2024

@Dohbedoh and I recently looked at a Pipeline build that was seemingly hanging forever with no obvious errors reported. It seems that the cause was that the build was reading a 20ish MB JSON file into memory with readFile and parsing it using @Grab with json-simple (exact details unimportant, readJSON from pipeline-utility-steps has the same behavior). The parsed value was accidentally stored as a global variable on the Pipeline script itself, causing it to be serialized in program.dat every time a step was written, so program.dat was also 20+ MB. The script then started 200ish parallel branches, each with a node step, and nothing else really happened for 30+ minutes.

Thread dumps consistently showed the VM thread for the build writing program.dat deep in JBoss Marshalling internals as a result of calls to CpsStepContext.saveState, which I think were triggered by the node steps starting (not 100% sure, because the stack trace loses the async context). The stack trace for that thread was not identical across thread dumps, so some progress was being made. Pipeline thread dumps showed only 200ish node steps with nothing running inside of them. Using a Groovy script along with a reproducer, we saw that SingleLaneExecutorService.tasks.size() for the build's CpsVmExecutorService.delegate was around 200 entries (i.e. roughly one task per branch), so my hypothesis is that CpsVmExecutorService was so backed up serializing this huge JSON object once for every single parallel branch that the actual Pipeline script was unable to progress, because the tasks submitted by CpsThreadGroup.scheduleRun were at the end of the queue.

I think we should at least try to detect this type of situation and log a warning. We try to time out script execution that occupies the VM thread for more than 5 minutes with the following code, but in this situation the issue was with other tasks that have no such timeout.

try (Timeout timeout = Timeout.limit(5, TimeUnit.MINUTES)) {
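Those queued tasks are what the warning proposed here targets: record when each task is submitted and, when it finally starts running, warn if the delay was large. A minimal sketch of the idea (the `Instant submitted = Instant.now()` line in the diff below is the real starting point; the threshold, class name, and log message here are placeholders, not necessarily what the draft does):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.logging.Level;
import java.util.logging.Logger;

class QueueDelayWarning {
    private static final Logger LOGGER = Logger.getLogger(QueueDelayWarning.class.getName());
    // Placeholder threshold; the draft may use a different value or make it configurable.
    private static final Duration THRESHOLD = Duration.ofMinutes(5);

    /** Wraps a task so that a long submit-to-start delay is logged when the task finally runs. */
    static Runnable wrap(Runnable r) {
        final Instant submitted = Instant.now();
        return () -> {
            Duration queued = Duration.between(submitted, Instant.now());
            if (queued.compareTo(THRESHOLD) > 0) {
                LOGGER.log(Level.WARNING,
                        "Task started {0} after it was submitted; the CPS VM queue may be backed up",
                        queued);
            }
            r.run();
        };
    }
}
```

(As noted in the review comment below, the same logic would need to be applied to the wrap(Callable) overload.)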

Just a draft PR for now because I am not sure about the approach and have not had time to test it much. A more direct improvement for this particular case could be to instead coalesce calls to CpsStepContext.saveState. For example, in this scenario where 200 parallel branches each start a node step which all call saveState, we only need to make sure that saveState runs once after all of the node steps have started. If the first call to saveState starts running when one node step has started and the other 199 have not, and the save takes 5 minutes, there is no reason to then save the CpsThreadGroup 199 more times. We just need to save it one more time, since each call to CpsStepContext.saveState submits a task to save the CpsThreadGroup that subsumes all previously queued save tasks, as long as no other tasks were interleaved between them.
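To illustrate the coalescing idea in isolation, a rough sketch with hypothetical names (this is not the plugin's actual saveState code, and it glosses over the caveat about other tasks being interleaved between the queued saves):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

class CoalescingSaver {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // True while a save task is queued but has not yet started running.
    private final AtomicBoolean savePending = new AtomicBoolean();

    /**
     * Called by each step that wants the state persisted. At most one save is
     * queued at a time; later requests are subsumed by the already-queued save.
     */
    void requestSave(Runnable doSave) {
        if (savePending.compareAndSet(false, true)) {
            executor.submit(() -> {
                // Clear the flag before saving so a request that arrives while the
                // save is running schedules exactly one more save rather than being lost.
                savePending.set(false);
                doSave.run();
            });
        }
        // Otherwise: the already-queued save will persist this step's state too.
    }
}
```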

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@@ -77,9 +79,18 @@ public Void call() {

@Override
protected Runnable wrap(final Runnable r) {
Instant submitted = Instant.now();
dwnusbaum (Member, Author) commented on this diff:

Would need to be copied to the wrap(Callable) overload as well.

@dwnusbaum (Member, Author) commented:

Still in the back of my mind, but I have not heard anything back from the user who originally had this issue and I have no time to investigate this further for now. I think it would be nice to have a way to check the queue depth for SingleLaneExecutorService, the distribution of wait times for queued callables, and the distribution of execution times for each callable. Maybe this information could be exposed via metrics.
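For example, something along these lines could expose that data, assuming a Dropwizard MetricRegistry is available (e.g. from the metrics plugin); the metric names and the CpsVmMetrics class are invented for illustration:

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.Queue;

class CpsVmMetrics {
    private final Timer queueWait; // submit-to-start delay per task
    private final Timer taskRun;   // time each task spends running

    CpsVmMetrics(MetricRegistry registry, Queue<?> tasks) {
        queueWait = registry.timer("cps.vm.queue.wait");
        taskRun = registry.timer("cps.vm.task.duration");
        // Current queue depth, sampled whenever the gauge is read.
        registry.register("cps.vm.queue.depth", (Gauge<Integer>) tasks::size);
    }

    /** Wraps a task at submission time so wait and run durations are recorded. */
    Runnable instrument(Runnable r) {
        final Timer.Context waiting = queueWait.time(); // started when the task is submitted
        return () -> {
            waiting.stop();                             // stopped when the task actually starts
            final Timer.Context running = taskRun.time();
            try {
                r.run();
            } finally {
                running.stop();
            }
        };
    }
}
```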
