Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App with Kafka Producer fails to initialize the producer #43158

Closed
pcasaes opened this issue Sep 9, 2024 · 18 comments
Closed

App with Kafka Producer fails to initialize the producer #43158

pcasaes opened this issue Sep 9, 2024 · 18 comments
Labels
area/kafka kind/bug Something isn't working
Milestone

Comments

@pcasaes
Copy link
Contributor

pcasaes commented Sep 9, 2024

Describe the bug

Very rarely we are seeing a kafka producer fail to initialize in an unrecoverable state (all health checks pass).

The root exception is: "The reference counter cannot be negative, found: -1"

Seems to be related to #42067

{
  "timestamp": "2024-09-09T10:38:10.767329108Z",
  "sequence": 114,
  "loggerClassName": "org.apache.kafka.common.utils.LogContext$LocationAwareKafkaLogger",
  "loggerName": "org.apache.kafka.clients.producer.internals.Sender",
  "level": "ERROR",
  "message": "[Producer clientId=kafka-producer-...bddf94778-znbcx] Uncaught error in kafka producer I/O thread: ",
  "threadName": "kafka-producer-network-thread | kafka-producer-...bddf94778-znbcx",
  "threadId": 42,
  "ndc": "",
  "hostName": "...bddf94778-znbcx",
  "processName": "/usr/lib/jvm/java-21-openjdk-21.0.4.0.7-1.el9.aarch64/bin/java",
  "processId": 1,
  "exception": {
    "refId": 1,
    "exceptionType": "java.lang.IllegalStateException",
    "message": "The reference counter cannot be negative, found: -1",
    "frames": [
      {
        "class": "io.quarkus.bootstrap.runner.JarFileReference",
        "method": "release",
        "line": 58
      },
      {
        "class": "io.quarkus.bootstrap.runner.JarFileReference",
        "method": "consumeSharedJarFile",
        "line": 128
      },
      {
        "class": "io.quarkus.bootstrap.runner.JarFileReference",
        "method": "withJarFile",
        "line": 96
      },
      {
        "class": "io.quarkus.bootstrap.runner.JarResource",
        "method": "getResourceData",
        "line": 56
      },
      {
        "class": "io.quarkus.bootstrap.runner.RunnerClassLoader",
        "method": "loadClass",
        "line": 106
      },
      {
        "class": "io.quarkus.bootstrap.runner.RunnerClassLoader",
        "method": "loadClass",
        "line": 72
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "doSend",
        "line": 549
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "doSend",
        "line": 522
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "handleInitiateApiVersionRequests",
        "line": 1035
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "poll",
        "line": 597
      },
      {
        "class": "org.apache.kafka.clients.NetworkClientUtils",
        "method": "awaitReady",
        "line": 74
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "awaitNodeReady",
        "line": 571
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "maybeSendAndPollTransactionalRequest",
        "line": 492
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "runOnce",
        "line": 339
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "run",
        "line": 253
      },
      {
        "class": "java.lang.Thread",
        "method": "run",
        "line": 1583
      }
    ]
  }

Expected behavior

Class loader should not fail when starting a kafka producer

Actual behavior

In rare cases fails to start class loader

How to Reproduce?

This happens very rarely which points to non deterministic behavior. Not sure how to reproduce this.

Output of uname -a or ver

Linux ... #1 SMP Wed Aug 7 16:53:27 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Output of java -version

openjdk version "21.0.4" 2024-07-16 LTS OpenJDK Runtime Environment (Red_Hat-21.0.4.0.7-1) (build 21.0.4+7-LTS) OpenJDK 64-Bit Server VM (Red_Hat-21.0.4.0.7-1) (build 21.0.4+7-LTS, mixed mode, sharing)

Quarkus version or git rev

3.14.2
JVM (powered by Quarkus 3.14.2) started in 7.308s

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)

Additional information

No response

@pcasaes pcasaes added the kind/bug Something isn't working label Sep 9, 2024
@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 9, 2024

One of the symptoms was that the CPU was saturated (100%+)

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 9, 2024

More context: The producer is being managed by quarkus-messaging-kafka

@geoand
Copy link
Contributor

geoand commented Sep 10, 2024

cc @mariofusco

@mariofusco
Copy link
Contributor

This issue seems to be similar to the other linked problem, but for that one the bug was in the concurrent way how a jar resource was opened, while looking at the pasted stack here the problem seems to happen when the resource is already opened and ready to be used, which is admittedly even weirder.

@pcasaes I will try to investigate this further, but I'm afraid that I cannot do much without a reproducer, and at the moment I don't know how to implement one. Can you please provide some hints or even better a simplified (and anonymized if necessary) project doing something similar to the one where you experienced this problem? I understand that very likely this simplified project won't throw that exception, but at least I will have some clue on where to look at.

One of the symptoms was that the CPU was saturated (100%+)

Is this a symptom or a cause? E.g. is it possible that the problem happens only when the CPU is already saturated?

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 10, 2024

The service uses quarkus-messaging-kafka to consume from one topic and then publishes to a new topic. When the instance went up there were records ready for it to consume. Not sure if the kafka producer is lazy in it's io thread initialization. regardless the class it tried to load was almost a second after startup.

2024-09-09T10:38:09.924887101Z JVM (powered by Quarkus 3.14.2) started in 7.308s.
2024-09-09T10:38:10.767329108Z Uncaught error in kafka producer I/O thread

CPU shot up at about 10:38:10. Can't give you a hard figure so unsure if it's cause or consequence.

Looking at the code I would guess that close is being called twice for the same resource. It looks like this could cause the issue, but this is a from a naive reading of the code.

Edit: reproducing is hard since this happened once after several restarts.
Edit 2: The consumer is configured with a dlq

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 10, 2024

I'm thinking the issue can be caused by eviction of the resource being called concurrently more than once https://github.com/quarkusio/quarkus/blob/3.14.2/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L193

The story:

A. Some calls to loadClass evicts the JarResource X form the cache and starts call to release (Thread 1)
B. loadClass is called for JarResource X which adds it to the cache and call the happy path acquire (Thread 2), counter is incremented to 2.
C. Some calls to loadClass evicts the JarResource X (the one added from call B) from the cache and starts call to release (Thread 3)
D. release in Thread 1 from A is completed, counter is decremented to 1.
E. Release in Thread 3 from C is completed, counter is decremented to 0.
F. happy path call release (Thread 2), fails with error (counter is 0).

I was able to "reproduce" this by playing around with the RunnerClassLoaderTest#testConcurrentJarCloseAndReload and setting appropriate breakpoints.

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 10, 2024

Repro with video instruction

https://github.com/pcasaes/quarkus/tree/pc/classload-race-condition-repro

2024-09-10.15-35-14.mp4

Thread 1: Evicts our target jar
Thread 2: Load our target jar into the cache again
Thread 3: Evicts our target jar again

order of execution

Thread 1 (first eviction)
evicts the target jar
calls release

Thread 2 (happy path)
adds target jar to cache
class acquire
increments count to 2

Thread 1 (first eviction)
decrements count to 1

Thread 3 (second eviction)
evicts the target jar
calls release
decrements count to 0

Thread 2 (happy path)
tries to release but counter at 0 (fails).

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 11, 2024

This fixes the scenario above. Not sure if it's the right approach though.

#43221

The idea is make sure that a close is only performed ONCE per jar file resource instance.

@mariofusco
Copy link
Contributor

Thanks a lot for your proposed fix @pcasaes. I must admit that I'm just seeing what you wrote here and commented your pull request before reading your explanation. There I was wondering how the close could be called more than once, but it seems you provided an explanation here. Let me read this with more attention and try to understand what's going on in your use case.

Out of curiosity did you try the fix that you proposed with your use case? If so does it solve the problem? I'm asking because I'm still unable to reproduce it on my side.

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 12, 2024

Yes, I repeated the steps and the illegal state is not reached.

The last acquire releases just fine.

I do question if a reference that has been marked as closed should ever be added again to the cache though.

As for the reproducer, you can only do it by controlling the execution order (using break points). Classic race condition, sometimes it happens, sometimes not.

Close can be called more than once because a reference can be removed and added many times to the cache, all while a happy path acquire and then release is taking place.

@gsmet
Copy link
Member

gsmet commented Sep 12, 2024

@pcasaes could you get a few thread dumps when your CPU is at 100% (kill -3 <pid>)? I'm curious as to what it's doing.

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 12, 2024

I can't actually reproduce what happened in production.

I was only able to prove that close can be called more than once.

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 12, 2024

@mariofusco

For the repro video I forgot to mention that all the breakpoints in JarFileResource had this condition this.jarFile.name.contains("simple"). Otherwise it's too many release and acquires to keep track.

Also, the repro branch is different than the potential fix: https://github.com/pcasaes/quarkus/tree/pc/classload-race-condition-repro

pcasaes added a commit to pcasaes/quarkus that referenced this issue Sep 13, 2024
Attempts to test against
quarkusio#43158

The original test would pass if an exception bubbled up into a worker thread.
@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 13, 2024

@mariofusco @franz1981

What do you think of improving the test like this?
#43286

This PR adds our case (where a reference could be re added and re evicted from the cache), but more importantly it ensures that no exceptions were thrown in the worker threads.

@franz1981
Copy link
Contributor

Thanks @pcasaes it's a great idea! Please, feel free to send such a PR, it is indeed a useful and due contribution 🙏

@pcasaes
Copy link
Contributor Author

pcasaes commented Sep 13, 2024

Here: #43286

@gsmet
Copy link
Member

gsmet commented Sep 14, 2024

I think we can close this one as it has been fixed in 3.14.4. Thanks for the great report and the amazing collaboration!

@gsmet gsmet closed this as completed Sep 14, 2024
@gsmet gsmet added this to the 3.14.4 milestone Sep 14, 2024
@famod
Copy link
Member

famod commented Sep 16, 2024

FTR, just earlier today I was hit by The reference counter cannot be negative, found: -1 in JarFileReference#release on Quarkus 3.14.2. It happened in a readiness check shortly after the service was up. No Kafka involved.

Great to see this fixed in 3.14.4!

melloware pushed a commit to melloware/quarkus that referenced this issue Sep 17, 2024
Attempts to test against
quarkusio#43158

The original test would pass if an exception bubbled up into a worker thread.
geoand added a commit to gsmet/devoxx24-playground that referenced this issue Sep 20, 2024
Without this change, the original test would pass if an exception bubbled up into a worker thread.

Relates to:
quarkusio/quarkus#43158
geoand added a commit to gsmet/devoxx24-playground that referenced this issue Sep 20, 2024
Ensure that any exceptions thrown in worker threads are properly checked
to avoid misleading test results. This change prevents the test from passing
if an exception bubbles up into a worker thread.

Relates to: quarkusio/quarkus#43158
danielsoro pushed a commit to danielsoro/quarkus that referenced this issue Sep 20, 2024
Attempts to test against
quarkusio#43158

The original test would pass if an exception bubbled up into a worker thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kafka kind/bug Something isn't working
Projects
Development

No branches or pull requests

6 participants