App with Kafka Producer fails to initialize the producer #43158

pcasaes · 2024-09-09T18:11:25Z

Describe the bug

Very rarely we are seeing a kafka producer fail to initialize in an unrecoverable state (all health checks pass).

The root exception is: "The reference counter cannot be negative, found: -1"

Seems to be related to #42067

{
  "timestamp": "2024-09-09T10:38:10.767329108Z",
  "sequence": 114,
  "loggerClassName": "org.apache.kafka.common.utils.LogContext$LocationAwareKafkaLogger",
  "loggerName": "org.apache.kafka.clients.producer.internals.Sender",
  "level": "ERROR",
  "message": "[Producer clientId=kafka-producer-...bddf94778-znbcx] Uncaught error in kafka producer I/O thread: ",
  "threadName": "kafka-producer-network-thread | kafka-producer-...bddf94778-znbcx",
  "threadId": 42,
  "ndc": "",
  "hostName": "...bddf94778-znbcx",
  "processName": "/usr/lib/jvm/java-21-openjdk-21.0.4.0.7-1.el9.aarch64/bin/java",
  "processId": 1,
  "exception": {
    "refId": 1,
    "exceptionType": "java.lang.IllegalStateException",
    "message": "The reference counter cannot be negative, found: -1",
    "frames": [
      {
        "class": "io.quarkus.bootstrap.runner.JarFileReference",
        "method": "release",
        "line": 58
      },
      {
        "class": "io.quarkus.bootstrap.runner.JarFileReference",
        "method": "consumeSharedJarFile",
        "line": 128
      },
      {
        "class": "io.quarkus.bootstrap.runner.JarFileReference",
        "method": "withJarFile",
        "line": 96
      },
      {
        "class": "io.quarkus.bootstrap.runner.JarResource",
        "method": "getResourceData",
        "line": 56
      },
      {
        "class": "io.quarkus.bootstrap.runner.RunnerClassLoader",
        "method": "loadClass",
        "line": 106
      },
      {
        "class": "io.quarkus.bootstrap.runner.RunnerClassLoader",
        "method": "loadClass",
        "line": 72
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "doSend",
        "line": 549
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "doSend",
        "line": 522
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "handleInitiateApiVersionRequests",
        "line": 1035
      },
      {
        "class": "org.apache.kafka.clients.NetworkClient",
        "method": "poll",
        "line": 597
      },
      {
        "class": "org.apache.kafka.clients.NetworkClientUtils",
        "method": "awaitReady",
        "line": 74
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "awaitNodeReady",
        "line": 571
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "maybeSendAndPollTransactionalRequest",
        "line": 492
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "runOnce",
        "line": 339
      },
      {
        "class": "org.apache.kafka.clients.producer.internals.Sender",
        "method": "run",
        "line": 253
      },
      {
        "class": "java.lang.Thread",
        "method": "run",
        "line": 1583
      }
    ]
  }

Expected behavior

Class loader should not fail when starting a kafka producer

Actual behavior

In rare cases fails to start class loader

How to Reproduce?

This happens very rarely which points to non deterministic behavior. Not sure how to reproduce this.

Output of `uname -a` or `ver`

Linux ... #1 SMP Wed Aug 7 16:53:27 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Output of `java -version`

openjdk version "21.0.4" 2024-07-16 LTS OpenJDK Runtime Environment (Red_Hat-21.0.4.0.7-1) (build 21.0.4+7-LTS) OpenJDK 64-Bit Server VM (Red_Hat-21.0.4.0.7-1) (build 21.0.4+7-LTS, mixed mode, sharing)

Quarkus version or git rev

3.14.2
JVM (powered by Quarkus 3.14.2) started in 7.308s

Build tool (ie. output of `mvnw --version` or `gradlew --version`)

Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)

Additional information

No response

The text was updated successfully, but these errors were encountered:

pcasaes · 2024-09-09T18:40:06Z

One of the symptoms was that the CPU was saturated (100%+)

pcasaes · 2024-09-09T21:01:22Z

More context: The producer is being managed by quarkus-messaging-kafka

geoand · 2024-09-10T09:18:04Z

cc @mariofusco

mariofusco · 2024-09-10T13:23:11Z

This issue seems to be similar to the other linked problem, but for that one the bug was in the concurrent way how a jar resource was opened, while looking at the pasted stack here the problem seems to happen when the resource is already opened and ready to be used, which is admittedly even weirder.

@pcasaes I will try to investigate this further, but I'm afraid that I cannot do much without a reproducer, and at the moment I don't know how to implement one. Can you please provide some hints or even better a simplified (and anonymized if necessary) project doing something similar to the one where you experienced this problem? I understand that very likely this simplified project won't throw that exception, but at least I will have some clue on where to look at.

One of the symptoms was that the CPU was saturated (100%+)

Is this a symptom or a cause? E.g. is it possible that the problem happens only when the CPU is already saturated?

pcasaes · 2024-09-10T17:25:23Z

The service uses quarkus-messaging-kafka to consume from one topic and then publishes to a new topic. When the instance went up there were records ready for it to consume. Not sure if the kafka producer is lazy in it's io thread initialization. regardless the class it tried to load was almost a second after startup.

2024-09-09T10:38:09.924887101Z JVM (powered by Quarkus 3.14.2) started in 7.308s.
2024-09-09T10:38:10.767329108Z Uncaught error in kafka producer I/O thread

CPU shot up at about 10:38:10. Can't give you a hard figure so unsure if it's cause or consequence.

Looking at the code I would guess that close is being called twice for the same resource. It looks like this could cause the issue, but this is a from a naive reading of the code.

Edit: reproducing is hard since this happened once after several restarts.
Edit 2: The consumer is configured with a dlq

pcasaes · 2024-09-10T21:57:16Z

I'm thinking the issue can be caused by eviction of the resource being called concurrently more than once https://github.com/quarkusio/quarkus/blob/3.14.2/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L193

The story:

A. Some calls to loadClass evicts the JarResource X form the cache and starts call to release (Thread 1)
B. loadClass is called for JarResource X which adds it to the cache and call the happy path acquire (Thread 2), counter is incremented to 2.
C. Some calls to loadClass evicts the JarResource X (the one added from call B) from the cache and starts call to release (Thread 3)
D. release in Thread 1 from A is completed, counter is decremented to 1.
E. Release in Thread 3 from C is completed, counter is decremented to 0.
F. happy path call release (Thread 2), fails with error (counter is 0).

I was able to "reproduce" this by playing around with the RunnerClassLoaderTest#testConcurrentJarCloseAndReload and setting appropriate breakpoints.

pcasaes · 2024-09-10T22:39:02Z

Repro with video instruction

https://github.com/pcasaes/quarkus/tree/pc/classload-race-condition-repro

2024-09-10.15-35-14.mp4

Thread 1: Evicts our target jar
Thread 2: Load our target jar into the cache again
Thread 3: Evicts our target jar again

order of execution

Thread 1 (first eviction)
evicts the target jar
calls release

Thread 2 (happy path)
adds target jar to cache
class acquire
increments count to 2

Thread 1 (first eviction)
decrements count to 1

Thread 3 (second eviction)
evicts the target jar
calls release
decrements count to 0

Thread 2 (happy path)
tries to release but counter at 0 (fails).

pcasaes · 2024-09-11T18:28:09Z

This fixes the scenario above. Not sure if it's the right approach though.

#43221

The idea is make sure that a close is only performed ONCE per jar file resource instance.

mariofusco · 2024-09-12T08:09:41Z

Thanks a lot for your proposed fix @pcasaes. I must admit that I'm just seeing what you wrote here and commented your pull request before reading your explanation. There I was wondering how the close could be called more than once, but it seems you provided an explanation here. Let me read this with more attention and try to understand what's going on in your use case.

Out of curiosity did you try the fix that you proposed with your use case? If so does it solve the problem? I'm asking because I'm still unable to reproduce it on my side.

pcasaes · 2024-09-12T17:07:15Z

Yes, I repeated the steps and the illegal state is not reached.

The last acquire releases just fine.

I do question if a reference that has been marked as closed should ever be added again to the cache though.

As for the reproducer, you can only do it by controlling the execution order (using break points). Classic race condition, sometimes it happens, sometimes not.

Close can be called more than once because a reference can be removed and added many times to the cache, all while a happy path acquire and then release is taking place.

gsmet · 2024-09-12T18:07:35Z

@pcasaes could you get a few thread dumps when your CPU is at 100% (kill -3 <pid>)? I'm curious as to what it's doing.

pcasaes · 2024-09-12T18:16:22Z

I can't actually reproduce what happened in production.

I was only able to prove that close can be called more than once.

pcasaes · 2024-09-12T18:38:45Z

@mariofusco

For the repro video I forgot to mention that all the breakpoints in JarFileResource had this condition this.jarFile.name.contains("simple"). Otherwise it's too many release and acquires to keep track.

Also, the repro branch is different than the potential fix: https://github.com/pcasaes/quarkus/tree/pc/classload-race-condition-repro

Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.

pcasaes · 2024-09-13T16:59:07Z

@mariofusco @franz1981

What do you think of improving the test like this?
#43286

This PR adds our case (where a reference could be re added and re evicted from the cache), but more importantly it ensures that no exceptions were thrown in the worker threads.

franz1981 · 2024-09-13T17:10:58Z

Thanks @pcasaes it's a great idea! Please, feel free to send such a PR, it is indeed a useful and due contribution 🙏

pcasaes · 2024-09-13T18:16:44Z

Here: #43286

gsmet · 2024-09-14T16:54:39Z

I think we can close this one as it has been fixed in 3.14.4. Thanks for the great report and the amazing collaboration!

famod · 2024-09-16T14:02:07Z

FTR, just earlier today I was hit by The reference counter cannot be negative, found: -1 in JarFileReference#release on Quarkus 3.14.2. It happened in a readiness check shortly after the service was up. No Kafka involved.

Great to see this fixed in 3.14.4!

Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.

Without this change, the original test would pass if an exception bubbled up into a worker thread. Relates to: quarkusio/quarkus#43158

Ensure that any exceptions thrown in worker threads are properly checked to avoid misleading test results. This change prevents the test from passing if an exception bubbles up into a worker thread. Relates to: quarkusio/quarkus#43158

Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.

pcasaes added the kind/bug Something isn't working label Sep 9, 2024

geoand added the area/kafka label Sep 10, 2024

pcasaes mentioned this issue Sep 11, 2024

Fix jar file reference release race condition #43221

Closed

franz1981 mentioned this issue Sep 12, 2024

Fix jar file reference close race condition #43257

Merged

pcasaes added a commit to pcasaes/quarkus that referenced this issue Sep 13, 2024

Check for exceptions in threads on class loader test

66608a7

Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.

pcasaes mentioned this issue Sep 13, 2024

Check for exceptions in threads on class loader test #43286

Merged

gsmet closed this as completed Sep 14, 2024

gsmet added this to the 3.14.4 milestone Sep 14, 2024

melloware pushed a commit to melloware/quarkus that referenced this issue Sep 17, 2024

Check for exceptions in threads on class loader test

911ac9c

Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.

geoand added a commit to gsmet/devoxx24-playground that referenced this issue Sep 20, 2024

Check for exceptions in threads on classloader test

d44b3dd

Without this change, the original test would pass if an exception bubbled up into a worker thread. Relates to: quarkusio/quarkus#43158

geoand mentioned this issue Sep 20, 2024

Check for exceptions in threads on classloader test gsmet/devoxx24-playground#33

Closed

geoand mentioned this issue Sep 20, 2024

Add exception handling for worker threads gsmet/devoxx24-playground#34

Closed

danielsoro pushed a commit to danielsoro/quarkus that referenced this issue Sep 20, 2024

Check for exceptions in threads on class loader test

fefe87d

Attempts to test against quarkusio#43158 The original test would pass if an exception bubbled up into a worker thread.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

App with Kafka Producer fails to initialize the producer #43158

App with Kafka Producer fails to initialize the producer #43158

pcasaes commented Sep 9, 2024 •

edited by geoand

Loading

pcasaes commented Sep 9, 2024 •

edited

Loading

pcasaes commented Sep 9, 2024

geoand commented Sep 10, 2024

mariofusco commented Sep 10, 2024

pcasaes commented Sep 10, 2024 •

edited

Loading

pcasaes commented Sep 10, 2024 •

edited

Loading

pcasaes commented Sep 10, 2024 •

edited

Loading

pcasaes commented Sep 11, 2024 •

edited

Loading

mariofusco commented Sep 12, 2024

pcasaes commented Sep 12, 2024

gsmet commented Sep 12, 2024

pcasaes commented Sep 12, 2024

pcasaes commented Sep 12, 2024

pcasaes commented Sep 13, 2024

franz1981 commented Sep 13, 2024

pcasaes commented Sep 13, 2024

gsmet commented Sep 14, 2024

famod commented Sep 16, 2024

App with Kafka Producer fails to initialize the producer #43158

App with Kafka Producer fails to initialize the producer #43158

Comments

pcasaes commented Sep 9, 2024 • edited by geoand Loading

Describe the bug

Expected behavior

Actual behavior

How to Reproduce?

Output of uname -a or ver

Output of java -version

Quarkus version or git rev

Build tool (ie. output of mvnw --version or gradlew --version)

Additional information

pcasaes commented Sep 9, 2024 • edited Loading

pcasaes commented Sep 9, 2024

geoand commented Sep 10, 2024

mariofusco commented Sep 10, 2024

pcasaes commented Sep 10, 2024 • edited Loading

pcasaes commented Sep 10, 2024 • edited Loading

pcasaes commented Sep 10, 2024 • edited Loading

pcasaes commented Sep 11, 2024 • edited Loading

mariofusco commented Sep 12, 2024

pcasaes commented Sep 12, 2024

gsmet commented Sep 12, 2024

pcasaes commented Sep 12, 2024

pcasaes commented Sep 12, 2024

pcasaes commented Sep 13, 2024

franz1981 commented Sep 13, 2024

pcasaes commented Sep 13, 2024

gsmet commented Sep 14, 2024

famod commented Sep 16, 2024

pcasaes commented Sep 9, 2024 •

edited by geoand

Loading

Output of `uname -a` or `ver`

Output of `java -version`

Build tool (ie. output of `mvnw --version` or `gradlew --version`)

pcasaes commented Sep 9, 2024 •

edited

Loading

pcasaes commented Sep 10, 2024 •

edited

Loading

pcasaes commented Sep 10, 2024 •

edited

Loading

pcasaes commented Sep 10, 2024 •

edited

Loading

pcasaes commented Sep 11, 2024 •

edited

Loading