App with Kafka Producer fails to initialize the producer #43158
Comments
One of the symptoms was that the CPU was saturated (100%+)
More context: The producer is being managed by quarkus-messaging-kafka
cc @mariofusco
This issue seems similar to the other linked problem, but in that case the bug was in how a jar resource was opened concurrently. Looking at the stack trace pasted here, the problem seems to happen when the resource is already open and ready to be used, which is admittedly even weirder. @pcasaes I will try to investigate this further, but I'm afraid I cannot do much without a reproducer, and at the moment I don't know how to implement one. Can you please provide some hints or, even better, a simplified (and anonymized if necessary) project doing something similar to the one where you experienced this problem? I understand that this simplified project very likely won't throw that exception, but at least I will have some clue about where to look.
Is this a symptom or a cause? E.g. is it possible that the problem happens only when the CPU is already saturated?
The service uses quarkus-messaging-kafka to consume from one topic and then publish to a new topic. When the instance came up there were already records ready for it to consume. I'm not sure whether the Kafka producer initializes its I/O thread lazily. Regardless, the class it tried to load was requested almost a second after startup. 2024-09-09T10:38:09.924887101Z JVM (powered by Quarkus 3.14.2) started in 7.308s. CPU shot up at about 10:38:10. I can't give you a hard figure, so I'm unsure whether it's cause or consequence. Looking at the code, I would guess that close is being called twice for the same resource. It looks like this could cause the issue, but this is from a naive reading of the code. Edit: reproducing is hard since this happened once after several restarts.
I think the issue can be caused by eviction of the resource being called concurrently more than once: https://github.com/quarkusio/quarkus/blob/3.14.2/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L193 The story: A. Some call to loadClass evicts the JarResource X from the cache and starts a call to release (Thread 1) I was able to "reproduce" this by playing around with the
Repro with video instructions: https://github.com/pcasaes/quarkus/tree/pc/classload-race-condition-repro (video: 2024-09-10.15-35-14.mp4)
Thread 1: Evicts our target jar
Order of execution:
Thread 1 (first eviction)
Thread 2 (happy path)
Thread 1 (first eviction)
Thread 3 (second eviction)
Thread 2 (happy path)
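To make that order of execution concrete, here is a minimal single-threaded sketch of how a second eviction can push a reference counter to -1. `CountedResource` and `DoubleCloseDemo` are hypothetical names for a simplified model, not the actual Quarkus `JarResource` code; the model assumes the cache holds one reference and that eviction releases it through `close()`.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-in for a reference-counted jar handle. */
class CountedResource {
    // Assumption: the cache itself holds the initial reference.
    private final AtomicInteger refs = new AtomicInteger(1);

    void acquire() {
        refs.incrementAndGet();
    }

    void release() {
        int remaining = refs.decrementAndGet();
        if (remaining < 0) {
            throw new IllegalStateException(
                    "The reference counter cannot be negative, found: " + remaining);
        }
    }

    /** Eviction drops the cache's reference; nothing prevents a second call. */
    void close() {
        release();
    }
}

class DoubleCloseDemo {
    /** Replays the interleaving step by step and returns the resulting error message. */
    static String run() {
        CountedResource jar = new CountedResource();
        jar.acquire();       // Thread 2 (happy path) starts using the jar: 2
        jar.close();         // Thread 1 (first eviction): 1
        jar.close();         // Thread 3 (second eviction): 0
        try {
            jar.release();   // Thread 2 (happy path) finishes: -1
            return "no error";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The steps run sequentially here on purpose: in the real class loader the same ordering only shows up when the threads happen to interleave this way, which is why breakpoints were needed to trigger it.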
This fixes the scenario above. Not sure if it's the right approach though. The idea is to make sure that close is performed only ONCE per jar file resource instance.
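One way to express the "close only once" idea is a compare-and-set guard. This is a sketch under assumptions, not the proposed patch: `CloseOnceResource` is a hypothetical class, and the use of `AtomicBoolean` is an illustrative choice that the thread does not confirm.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted handle whose close() takes effect at most once. */
class CloseOnceResource {
    // Assumption: the cache itself holds the initial reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicBoolean closed = new AtomicBoolean(false);

    void acquire() {
        refs.incrementAndGet();
    }

    void release() {
        int remaining = refs.decrementAndGet();
        if (remaining < 0) {
            throw new IllegalStateException(
                    "The reference counter cannot be negative, found: " + remaining);
        }
    }

    /** Only the first caller wins the compare-and-set and releases the cache's reference. */
    void close() {
        if (closed.compareAndSet(false, true)) {
            release();
        }
    }

    int count() {
        return refs.get();
    }

    public static void main(String[] args) {
        CloseOnceResource jar = new CloseOnceResource();
        jar.acquire();  // a reader starts using the jar
        jar.close();    // first eviction releases the cache's reference
        jar.close();    // second eviction is a no-op
        jar.release();  // the reader finishes without an exception
        System.out.println(jar.count());
    }
}
```

With the guard in place, the second eviction no longer decrements the counter, so the final release by the reader brings it to zero instead of below it.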
Thanks a lot for your proposed fix @pcasaes. I must admit that I'm only now seeing what you wrote here, and I commented on your pull request before reading your explanation. There I was wondering how the … Out of curiosity, did you try the fix that you proposed with your use case? If so, does it solve the problem? I'm asking because I'm still unable to reproduce it on my side.
Yes, I repeated the steps and the illegal state is not reached. The last acquire releases just fine. I do question whether a reference that has been marked as closed should ever be added to the cache again, though. As for the reproducer, you can only do it by controlling the execution order (using breakpoints). Classic race condition: sometimes it happens, sometimes not. Close can be called more than once because a reference can be removed from and added to the cache many times, all while a happy-path acquire and then release is taking place.
@pcasaes could you get a few thread dumps when your CPU is at 100% ( |
I can't actually reproduce what happened in production. I was only able to prove that close can be called more than once.
For the repro video I forgot to mention that all the breakpoints in JarFileResource had this condition Also, the repro branch is different than the potential fix: https://github.com/pcasaes/quarkus/tree/pc/classload-race-condition-repro |
Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.
What do you think of improving the test like this? This PR adds our case (where a reference could be re-added to and re-evicted from the cache), but more importantly it ensures that no exceptions were thrown in the worker threads.
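The point about worker threads can be illustrated with a small helper that collects failures from an executor instead of letting them vanish. This is a sketch of the general technique, not the test from the PR: `WorkerFailures` and `runAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs tasks on worker threads and reports what they threw. */
class WorkerFailures {
    static List<Throwable> runAll(List<Runnable> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        List<Throwable> failures = new ArrayList<>();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable task : tasks) {
                futures.add(pool.submit(task));
            }
            for (Future<?> future : futures) {
                try {
                    // get() rethrows a worker's exception wrapped in ExecutionException.
                    future.get();
                } catch (ExecutionException e) {
                    failures.add(e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failures.add(e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return failures;
    }

    public static void main(String[] args) {
        Runnable ok = () -> {};
        Runnable bad = () -> {
            throw new IllegalStateException("boom");
        };
        // A test would assert that this list is empty.
        System.out.println(runAll(List.of(ok, bad)));
    }
}
```

A test built this way fails when any worker throws, because the assertion runs on the test thread against the collected list rather than relying on the worker's exception to propagate by itself.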
Thanks @pcasaes it's a great idea! Please, feel free to send such a PR, it is indeed a useful and due contribution 🙏
Here: #43286
I think we can close this one as it has been fixed in 3.14.4. Thanks for the great report and the amazing collaboration!
FTR, just earlier today I was hit by Great to see this fixed in 3.14.4! |
Without this change, the original test would pass if an exception bubbled up into a worker thread. Relates to: quarkusio/quarkus#43158
Ensure that any exceptions thrown in worker threads are properly checked to avoid misleading test results. This change prevents the test from passing if an exception bubbles up into a worker thread. Relates to: quarkusio/quarkus#43158
Describe the bug
Very rarely, we see a Kafka producer fail to initialize and end up in an unrecoverable state, even though all health checks pass.
The root exception is: "The reference counter cannot be negative, found: -1"
Seems to be related to #42067
Expected behavior
The class loader should not fail when starting a Kafka producer.
Actual behavior
In rare cases, class loading fails while the producer is starting.
How to Reproduce?
This happens very rarely, which points to non-deterministic behavior. Not sure how to reproduce this.
Output of uname -a or ver
Linux ... #1 SMP Wed Aug 7 16:53:27 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
Output of java -version
openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment (Red_Hat-21.0.4.0.7-1) (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-21.0.4.0.7-1) (build 21.0.4+7-LTS, mixed mode, sharing)
Quarkus version or git rev
3.14.2
JVM (powered by Quarkus 3.14.2) started in 7.308s
Build tool (ie. output of mvnw --version or gradlew --version)
Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Additional information
No response