fix(threads): back out #1388 Vertexecutor #1449
Conversation
Test image available:
Force-pushed from 8c89be0 to 92a83cd
Test image available:
Force-pushed from 92a83cd to e8408a5
Test image available:
@maxcao13 @tthvo if you have some spare time, could you help with this? I think this is a pretty low-risk change since it mostly sets some things back to how they were ~1 month ago. I am waiting for some further details from the bug reporter, but I think this bug can be triggered by setting up a scenario where there are 20+ target applications available for Cryostat to discover on startup. The k8s/OpenShift part of the report is probably not important - I expect this would also reproduce in the smoketest podman setup, so simply copy-pasting enough configurations of one of the sample apps should be able to trigger it, but I haven't tried that yet. This is a small patch and a bugfix, so there's no rush to get it reviewed before tomorrow's development codefreeze.
I ran smoketest with 100 Here's my
Hmm, okay. Those have the
Original reporter confirmed that the bug reproduced in their environment with 20 targets, but with 17 it was OK. That confirms my suspicion of the root cause, so I'm pretty confident this is a good fix. It would still be best to actually test that.
Running
I think there are still some problems with blocking the event loop, however. For example, using this PR image as the CORE_IMG, I had 11 vertx-fib-demos running in a namespace, and when I created an Automated Rule with match expression: true, I got this:
I'm unable to get more than 20 targets up on my oc cluster because it apparently takes too many resources; the first log came from a local crc instance and the second from cluster bot. Another thing I notice is that the Cryostat logger generates tons of messages, super fast. This might be because of the recursive discovery tree update function, which I think emits a log message every time it observes a target.
Tons of these messages get sent even though the targets likely can't all be new.
I think that last case of the
It looks like in those cases it is just one event loop thread that is blocked for too long, but it ends up becoming unblocked after a few seconds. That's still not proper threading behaviour, but it isn't as critically broken as in the #1448 report. When you run this scenario, does the Cryostat container end up getting killed by OpenShift, or does it stay up and running?
I think the container got killed; it was just hard to tell since it gets rolling updates, but the log said so.
Ah okay, I wasn't sure if that was because you had actually shut it down or because it got killed. Strange.
https://vertx.io/docs/apidocs/io/vertx/core/VertxOptions.html#setWorkerPoolSize-int-
The size of the worker pool can be reduced to make this condition easier to test.
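For reference, a minimal sketch (using the plain Vert.x API from the linked Javadoc, not Cryostat's actual bootstrap wiring) of shrinking the worker pool so that pool exhaustion is easier to reproduce with only a handful of targets:

```java
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

public class SmallWorkerPool {
    public static void main(String[] args) {
        // The default worker pool size is 20; shrinking it means only a few
        // discovered targets are needed before every worker thread is occupied.
        VertxOptions options = new VertxOptions().setWorkerPoolSize(5);
        Vertx vertx = Vertx.vertx(options);
        // ... deploy verticles / run the scenario under test against this instance
    }
}
```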
Force-pushed from e8408a5 to adf8102
Force-pushed from c599a0a to 09c6c19
cryostat-sample-5cf7b86589-72hc6-cryostat-sample.log
Yup, seems like that was the issue. I get Vert.x thread blocking just like in the original issue; testing with your PR, there is no such thing with the 9 pods and the worker size of 5.
Okay great, thank you for running those tests for me. I have finished setting back every instance where I think the Vert.x worker threads were being misused, but I have had very split attention today, so I need to go over this more carefully and exercise things to make sure nothing new is broken.
I also tried with 4 vertx pods, and even that was an issue since Cryostat discovers itself. When I set the pod count to 3, it was fine :-)
You mean this test case was still broken (as expected according to the report) pre-patch, and post-patch it now works fine, right?
Force-pushed from 2b86d5b to f29b865
Test image available:
I mean, pre-patch, if the number of targets was >= the worker pool size, then there would be problems, and if I set numTargets to workerPool - 1, it would be okay, confirming the issue. But yes, the patch fixes it regardless.
Force-pushed from f29b865 to b8a16ca
Hmm, but this is strange: I now get blocked threads in the normal smoketest.sh deployment (probably because the default deployment still uses Vertexecutor):
This is with
and no changes to anything, including smoketest.sh itself. Seems like something to do with jvmId handling again... I'll look and try to see what the problem is.
Test image available:
This reverts commit b2fdcfd.
Signed-off-by: Andrew Azores <aazores@redhat.com>
Force-pushed from 7f60a77 to 78f6f7a
This PR/issue depends on:
Related to #1448
Depends on #1466
Description of the change:
This backs out a threadpool-related change from #1388, setting many injection sites of the new implementation back to what they previously were.
Motivation for the change:
The Vert.x worker pool was being used in a lot of places where it was not strictly necessary. That thread pool has a fixed size. If there are too many forking tasks to be done, it is possible that all the threads in the pool become deadlocked, waiting on tasks that will never complete because those tasks are stuck in the queue waiting to be serviced by the same pool. This leaves only the Vert.x event loop alive.
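To make that failure mode concrete, here is a minimal, self-contained sketch against the plain Vert.x 4 executeBlocking API (illustrative only, not Cryostat's actual code): outer blocking tasks occupy the entire worker pool while waiting on inner tasks that are queued to the same pool, so none of them can ever make progress, while the event loop itself stays alive.

```java
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

import java.util.concurrent.CountDownLatch;

public class WorkerPoolStarvation {
    public static void main(String[] args) {
        // Two worker threads, three outer tasks: once both workers are occupied,
        // the inner tasks (and the third outer task) sit in the queue forever.
        Vertx vertx = Vertx.vertx(new VertxOptions().setWorkerPoolSize(2));

        for (int i = 0; i < 3; i++) {
            final int id = i;
            vertx.executeBlocking(outer -> {
                CountDownLatch latch = new CountDownLatch(1);
                // The inner task needs a free worker thread from the same pool...
                vertx.executeBlocking(inner -> {
                    latch.countDown();
                    inner.complete();
                }, false, r -> {});
                try {
                    // ...but this outer task is holding a worker while it waits,
                    // so with the pool full, neither task can make progress.
                    latch.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                outer.complete(id);
            }, false, res -> System.out.println("completed task " + res.result()));
        }
        // The event loop threads remain alive, and the blocked-thread checker
        // starts logging warnings about the stuck worker threads.
    }
}
```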