Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App with Infinispan fails after changes in JarFileReference that rarely results in illegal state after changes on ReadWriteLock #42067

Closed
michalvavrik opened this issue Jul 23, 2024 · 16 comments · Fixed by #42139
Labels
area/infinispan Infinispan area/kubernetes kind/bug Something isn't working
Milestone

Comments

@michalvavrik
Copy link
Member

michalvavrik commented Jul 23, 2024

Describe the bug

I have application with Infinispan extension that is deployed to Openshift using the OpenShift extension. Lately I have experienced flakiness when Infinipan library performs service loading on application startup. My colleague @jedla97 discovered that this started happen after #40942.

Expected behavior

No race.

Actual behavior

Application startup fails with exception:

03:16:59,364 INFO  [two] 03:16:57,505 Failed to initialize a channel. Closing: [id: 0xd1e053bd]: java.lang.IllegalStateException: The reference counter cannot be negative, found: -1
03:16:59,364 INFO  [two] 	at io.quarkus.bootstrap.runner.JarFileReference.release(JarFileReference.java:58)
03:16:59,364 INFO  [two] 	at io.quarkus.bootstrap.runner.JarFileReference.consumeSharedJarFile(JarFileReference.java:129)
03:16:59,364 INFO  [two] 	at io.quarkus.bootstrap.runner.JarFileReference.withJarFile(JarFileReference.java:107)
03:16:59,364 INFO  [two] 	at io.quarkus.bootstrap.runner.JarResource.getResourceURL(JarResource.java:89)
03:16:59,364 INFO  [two] 	at io.quarkus.bootstrap.runner.RunnerClassLoader.findResources(RunnerClassLoader.java:265)
03:16:59,364 INFO  [two] 	at java.base/java.lang.ClassLoader.getResources(ClassLoader.java:1473)
03:16:59,364 INFO  [two] 	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1203)
03:16:59,364 INFO  [two] 	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1228)
03:16:59,364 INFO  [two] 	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1273)
03:16:59,364 INFO  [two] 	at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1309)
03:16:59,364 INFO  [two] 	at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1393)
03:16:59,364 INFO  [two] 	at org.infinispan.commons.util.SaslUtils.getFactories(SaslUtils.java:58)
03:16:59,365 INFO  [two] 	at org.infinispan.commons.util.SaslUtils.getSaslClientFactories(SaslUtils.java:49)
03:16:59,365 INFO  [two] 	at org.infinispan.client.hotrod.impl.transport.netty.ChannelInitializer.getSaslClientFactory(ChannelInitializer.java:227)
03:16:59,365 INFO  [two] 	at org.infinispan.client.hotrod.impl.transport.netty.ChannelInitializer.initAuthentication(ChannelInitializer.java:204)
03:16:59,365 INFO  [two] 	at org.infinispan.client.hotrod.impl.transport.netty.ChannelInitializer.initChannel(ChannelInitializer.java:110)
03:16:59,365 INFO  [two] 	at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:129)
03:16:59,365 INFO  [two] 	at io.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:112)
03:16:59,365 INFO  [two] 	at io.netty.channel.AbstractChannelHandlerContext.callHandlerAdded(AbstractChannelHandlerContext.java:1130)
03:16:59,365 INFO  [two] 	at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:608)
03:16:59,365 INFO  [two] 	at io.netty.channel.DefaultChannelPipeline.access$100(DefaultChannelPipeline.java:45)
03:16:59,365 INFO  [two] 	at io.netty.channel.DefaultChannelPipeline$PendingHandlerAddedTask.execute(DefaultChannelPipeline.java:1460)
03:16:59,365 INFO  [two] 	at io.netty.channel.DefaultChannelPipeline.callHandlerAddedForAllHandlers(DefaultChannelPipeline.java:1114)
03:16:59,365 INFO  [two] 	at io.netty.channel.DefaultChannelPipeline.invokeHandlerAddedIfNeeded(DefaultChannelPipeline.java:649)
03:16:59,365 INFO  [two] 	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:513)
03:16:59,365 INFO  [two] 	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:428)
03:16:59,365 INFO  [two] 	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:485)
03:16:59,365 INFO  [two] 	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
03:16:59,365 INFO  [two] 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
03:16:59,365 INFO  [two] 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
03:16:59,365 INFO  [two] 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
03:16:59,366 INFO  [two] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
03:16:59,366 INFO  [two] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
03:16:59,366 INFO  [two] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
03:16:59,366 INFO  [two] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
03:16:59,366 INFO  [two] 	at java.base/java.lang.Thread.run(Thread.java:840)
03:16:59,366 INFO  [two] 03:16:57,517 ISPN004007: Exception encountered. Retry 0 out of 0: io.netty.channel.StacklessClosedChannelException
03:16:59,366 INFO  [two] 	at io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(ChannelPromise)(Unknown Source)
03:16:59,366 INFO  [two] 03:16:57,517 ISPN004007: Exception encountered. Retry 0 out of 0: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ts-qhdkqxqyjo-infinispan-cluster.datagrid-cluster.svc.cluster.local/172.30.48.250:11222
03:16:59,366 INFO  [two] Caused by: java.net.ConnectException: Connection refused
03:16:59,366 INFO  [two] 	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
03:16:59,366 INFO  [two] 	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
03:16:59,366 INFO  [two] 	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946)
03:16:59,366 INFO  [two] 	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
03:16:59,366 INFO  [two] 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
03:16:59,366 INFO  [two] 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
03:16:59,366 INFO  [two] 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
03:16:59,366 INFO  [two] 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
03:16:59,366 INFO  [two] 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
03:16:59,367 INFO  [two] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
03:16:59,367 INFO  [two] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
03:16:59,367 INFO  [two] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
03:16:59,367 INFO  [two] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
03:16:59,367 INFO  [two] 	at java.base/java.lang.Thread.run(Thread.java:840)

How to Reproduce?

Reproducer:

  1. have OpenShift cluster up and running, be logged in there
  2. git clone git@github.com:quarkus-qe/quarkus-test-suite.git
  3. git checkout e4056763aecf9a048b89a0e4f2b5ae0e9a98da25 (that just because I don't know if I won't disable the test, you likely can skip this step)
  4. quarkus-test-suite/infinispan-client
  5. mvn clean verify -Dopenshift -Dit.test=OperatorOpenShiftInfinispanCountersIT -Doc.reruns=0

But I'll be honest with you, our Jenkins has by far more success in reproducing it than I do on my workstation. I have never actually seen this exception outside of Jenkins logs.

Output of uname -a or ver

Fedora 40

Output of java -version

17.0.10, vendor: Red Hat, Inc., runtime: /qa/tools/opt/x86_64/java-17-openjdk-17.0.10.0.7-1.portable.jdk.el.x86_64

Quarkus version or git rev

999-SNAPSHOT

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)

Additional information

No response

@michalvavrik michalvavrik added the kind/bug Something isn't working label Jul 23, 2024
Copy link

quarkus-bot bot commented Jul 23, 2024

/cc @geoand (openshift), @iocanel (openshift), @karesti (infinispan), @wburns (infinispan)

@michalvavrik michalvavrik changed the title App with Infinispan fails after changes in JarFileReference that results in illegal state after changes on ReadWriteLock App with Infinispan fails after changes in JarFileReference that rarely results in illegal state after changes on ReadWriteLock Jul 23, 2024
@geoand
Copy link
Contributor

geoand commented Jul 23, 2024

cc @mariofusco

Also cc @gsmet for awareness

@michalvavrik
Copy link
Member Author

Yeah, now I have mentioned #42007 that I didn't find even though I was searching through issues for couple of minutes. Let me check if I have some failures with HEAD after that fix.

@michalvavrik
Copy link
Member Author

Yeah, I have one from yesterday, so issue holds.

@michalvavrik
Copy link
Member Author

@mariofusco if you have access to RH VPN I can show you several runs with full logs (ping me on Slack), not sure if it provides you with more information, but it's probably better than nothing in case you, like me, cannot reproduce it locally.

@michalvavrik
Copy link
Member Author

though #42007 was merged but yesterday and I don't have HEAD after it. I'll try to rerun it couple of times to confirm it's still an issue. Anyway, my exception is different.

@famod
Copy link
Member

famod commented Jul 23, 2024

Same here! I get it all the time I boot my app with 3.13.0.CR1 (I wasn't able to check anything more recent yet).
As it stands, 3.13.0 breaks my app entirely.

Edit, more details:
Seems to be happening everywhere (healtch checks, jobs, serving rest requests), e.g.:

java.lang.IllegalStateException: The reference counter cannot be negative, found: -1
	at io.quarkus.bootstrap.runner.JarFileReference.release(JarFileReference.java:58)
	at io.quarkus.bootstrap.runner.JarFileReference.consumeSharedJarFile(JarFileReference.java:129)
	at io.quarkus.bootstrap.runner.JarFileReference.withJarFile(JarFileReference.java:107)
	at io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:56)
	at io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
	at io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Unknown Source)
	at java.base/java.lang.Class.forName(Unknown Source)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(Unknown Source)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(Unknown Source)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(Unknown Source)
	at java.base/java.util.ServiceLoader$2.hasNext(Unknown Source)
	at java.base/java.util.ServiceLoader$3.hasNext(Unknown Source)
	at io.quarkus.rest.client.reactive.runtime.RestClientListeners.get(RestClientListeners.java:18)
	at io.quarkus.rest.client.reactive.runtime.RestClientBuilderImpl.build(RestClientBuilderImpl.java:403)
	at io.quarkus.rest.client.reactive.runtime.QuarkusRestClientBuilderImpl.build(QuarkusRestClientBuilderImpl.java:267)
	at io.quarkus.rest.client.reactive.runtime.RestClientCDIDelegateBuilder.build(RestClientCDIDelegateBuilder.java:67)
	at io.quarkus.rest.client.reactive.runtime.RestClientCDIDelegateBuilder.createDelegate(RestClientCDIDelegateBuilder.java:49)
	at io.quarkus.rest.client.reactive.runtime.RestClientReactiveCDIWrapperBase.delegate(RestClientReactiveCDIWrapperBase.java:76)
	at io.quarkus.rest.client.reactive.runtime.RestClientReactiveCDIWrapperBase.<init>(RestClientReactiveCDIWrapperBase.java:30)
	at de.someproject.SomeHealthApi$$CDIWrapper.<init>(Unknown Source)
    [...]

@jedla97
Copy link
Contributor

jedla97 commented Jul 23, 2024

Afaik I try to run now (build Quarkus yesterday with #42007) and on 1st try I get the error containing the java.lang.IllegalStateException: The reference counter cannot be negative

For me the enviroment is Fedora 40 (6.9.6-200.fc40.x86_64) and java:

openjdk 21 2023-09-19
OpenJDK Runtime Environment (build 21+35-2513)
OpenJDK 64-Bit Server VM (build 21+35-2513, mixed mode, sharing)

Openshift which I use is 4.16

Edit: Try it with JDK 17 and also see the error

openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment GraalVM CE 22.3.2 (build 17.0.7+7-jvmci-22.3-b18)
OpenJDK 64-Bit Server VM GraalVM CE 22.3.2 (build 17.0.7+7-jvmci-22.3-b18, mixed mode, sharing)

@michalvavrik
Copy link
Member Author

thanks a lot @jedla97

@geoand
Copy link
Contributor

geoand commented Jul 23, 2024

As it stands, 3.13.0 breaks my app entirely

Given this, I think we have no choice but to revert the ClassLoader change @mariofusco

@mariofusco
Copy link
Contributor

As it stands, 3.13.0 breaks my app entirely

Given this, I think we have no choice but to revert the ClassLoader change @mariofusco

@geoand @gsmet Sorry about this inconvenience and yes, I need more time to investigate this, so please feel free to revert that change for now. I will ping @michalvavrik immediately and try to figure out with his help what's going wrong.

@geoand
Copy link
Contributor

geoand commented Jul 23, 2024

Sorry about this inconvenience

There is absolutely no reason to apologize!

so please feel free to revert that change for now

Can you please open a PR with this revert?

@mariofusco
Copy link
Contributor

Same here! I get it all the time I boot my app with 3.13.0.CR1 (I wasn't able to check anything more recent yet). As it stands, 3.13.0 breaks my app entirely.

Edit, more details: Seems to be happening everywhere (healtch checks, jobs, serving rest requests), e.g.:

@famod Do you have a reproducer that you can share?

@mariofusco
Copy link
Contributor

Can you please open a PR with this revert?

#42083

@famod
Copy link
Member

famod commented Jul 25, 2024

@mariofusco

Same here! I get it all the time I boot my app with 3.13.0.CR1 (I wasn't able to check anything more recent yet). As it stands, 3.13.0 breaks my app entirely.
Edit, more details: Seems to be happening everywhere (healtch checks, jobs, serving rest requests), e.g.:

@famod Do you have a reproducer that you can share?

Not ATM, sorry. I'm about to go on PTO and my queue is full.

@mariofusco
Copy link
Contributor

Not ATM, sorry. I'm about to go on PTO and my queue is full.

@famod not necessary anymore, I have reproduced the problem and I think I fixed it #42139 I would appreciate it if you could test your reproducer against that pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/infinispan Infinispan area/kubernetes kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants