Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error starting operator with K8s v1.31.0 (429 Message: storage is (re)initializing) #2526

Closed
Donnerbart opened this issue Sep 5, 2024 · 24 comments

Comments

@Donnerbart
Copy link
Contributor

Bug Report

What did you do?

I've updated the Docker image for our operator integration tests to rancher/k3s:v1.31.0-k3s1.

What did you expect to see?

I expected our tests to continue to work.

What did you see instead? Under which circumstances?

In our integration tests the operator doesn't start anymore, throwing the following stacktrace:

Failed to start quarkus
java.lang.RuntimeException: Failed to start quarkus
	at io.quarkus.runner.ApplicationImpl.doStart(Unknown Source)
	at io.quarkus.runtime.Application.start(Application.java:101)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at io.quarkus.runner.bootstrap.StartupActionImpl.run(StartupActionImpl.java:305)
	at io.quarkus.test.junit.QuarkusTestExtension.doJavaStart(QuarkusTestExtension.java:248)
	at io.quarkus.test.junit.QuarkusTestExtension.ensureStarted(QuarkusTestExtension.java:601)
	at io.quarkus.test.junit.QuarkusTestExtension.beforeAll(QuarkusTestExtension.java:651)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Caused by: io.javaoperatorsdk.operator.OperatorException: Error starting operator
	at io.javaoperatorsdk.operator.Operator.start(Operator.java:166)
	at io.javaoperatorsdk.operator.OperatorProducer_ProducerMethod_operator_4a13E9CyUpNLli_Z-91F5zETApk_ClientProxy.start(Unknown Source)
	at io.quarkiverse.operatorsdk.runtime.AppEventListener.onStartup(AppEventListener.java:28)
	at io.quarkiverse.operatorsdk.runtime.AppEventListener_Observer_onStartup_lQpwZkDdRQXLkHyaStWxDBo99eQ.notify(Unknown Source)
	at io.quarkus.arc.impl.EventImpl$Notifier.notifyObservers(EventImpl.java:351)
	at io.quarkus.arc.impl.EventImpl$Notifier.notify(EventImpl.java:333)
	at io.quarkus.arc.impl.EventImpl.fire(EventImpl.java:80)
	at io.quarkus.arc.runtime.ArcRecorder.fireLifecycleEvent(ArcRecorder.java:156)
	at io.quarkus.arc.runtime.ArcRecorder.handleLifecycleEvents(ArcRecorder.java:107)
	at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy_0(Unknown Source)
	at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy(Unknown Source)
	... 8 more
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:72)
	at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.executeAndWaitForAllToComplete(ExecutorServiceManager.java:67)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.boundedExecuteAndWaitForAllToComplete(ExecutorServiceManager.java:47)
	at io.javaoperatorsdk.operator.ControllerManager.start(ControllerManager.java:44)
	at io.javaoperatorsdk.operator.Operator.start(Operator.java:161)
	... 18 more
Caused by: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:70)
	... 23 more
Caused by: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
	at io.javaoperatorsdk.operator.processing.event.EventSourceManager.startEventSource(EventSourceManager.java:127)
	at io.javaoperatorsdk.operator.processing.event.EventSourceManager.start(EventSourceManager.java:68)
	at io.javaoperatorsdk.operator.processing.Controller.start(Controller.java:344)
	at io.javaoperatorsdk.operator.ControllerManager.lambda$start$0(ControllerManager.java:45)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$0(ExecutorServiceManager.java:61)
	at java.base@21.0.4/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base@21.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.4/java.lang.Thread.run(Thread.java:1583)
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:72)
	at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.executeAndWaitForAllToComplete(ExecutorServiceManager.java:67)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.boundedExecuteAndWaitForAllToComplete(ExecutorServiceManager.java:47)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager.start(InformerManager.java:59)
	at io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource.start(ManagedInformerEventSource.java:87)
	at io.javaoperatorsdk.operator.processing.event.source.controller.ControllerResourceEventSource.start(ControllerResourceEventSource.java:60)
	at io.javaoperatorsdk.operator.processing.event.NamedEventSource.start(NamedEventSource.java:27)
	at io.javaoperatorsdk.operator.processing.event.EventSourceManager.startEventSource(EventSourceManager.java:122)
	... 8 more
Caused by: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:70)
	... 16 more
Caused by: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:111)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager.lambda$start$0(InformerManager.java:62)
	... 5 more
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:95)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:88)
	... 6 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
	at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:142)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
	at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:51)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
	at io.fabric8.kubernetes.client.vertx.VertxHttpRequest.lambda$null$1(VertxHttpRequest.java:121)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:279)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:261)
	at io.vertx.core.http.impl.HttpEventHandler.handleEnd(HttpEventHandler.java:76)
	at io.vertx.core.http.impl.HttpClientResponseImpl.handleEnd(HttpClientResponseImpl.java:250)
	at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.lambda$new$0(Http1xClientConnection.java:421)
	at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:279)
	at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:157)
	at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleEnd(Http1xClientConnection.java:709)
	at io.vertx.core.impl.ContextImpl.lambda$execute$4(ContextImpl.java:315)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more

Environment

Kubernetes cluster type:

A custom testcontainer class OperatorK3sContainer extends K3sContainer switched from v1.30.4-k3s1 to rancher/k3s:v1.31.0-k3s1. This container is started in a custom QuarkusTestResourceLifecycleManager:

@QuarkusTest
@QuarkusTestResource(OperatorTestResourceLifecycleManager.class)

$ Mention java-operator-sdk version from pom.xml file

2024-09-05 13:19:33,260 INFO  [io.qua.ope.dep.VersionAlignmentCheckingStep] (build-6) QOSDK: 6.7.3 (commit: 2f026a2) built on Fri Aug 09 22:01:53 CEST 2024
2024-09-05 13:19:33,270 INFO  [io.qua.ope.dep.VersionAlignmentCheckingStep] (build-6) JOSDK: 4.9.2 (commit: 841bf79) built on Fri May 31 21:44:13 CEST 2024

$ java -version

openjdk version "21.0.4" 2024-07-16
OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu224.04)
OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu224.04, mixed mode, sharing)

$ kubectl version

Client Version: v1.31.0
Kustomize Version: v5.4.2

Possible Solution

I think there might be a missing retry in the watcher/informer logic.

As a workaround it helps to disable this feature gate: --kube-apiserver-arg=feature-gates=ResilientWatchCacheInitialization=false (see additional context). But this only reduces the likelihood for the error to occur (1/22 test jobs succeeded -> 20/22 test jobs succeeded).

Additional context

I first stumbled across this "Fix flaking TestStoreListResourceVersion" K8s PR since it has the same error message.

After a bit of investigation I think the root cause is KEP-4568: Resilient watchcache initialization that was implemented in this PR.

ResilientWatchCacheInitialization - handling the changes to returning 429 errors instead of keeping the requests hanging

https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4568-resilient-watchcache-initialization#risks-and-mitigations

I'm not sure about a JOSDK standalone test setup, but I think with Quarkus the operator might be started concurrently or even before the test resources. My assumption is that with older K8s versions the CRD watcher was just hanging until the storage was initialized. And now it is rejected right away with a 429 Message: storage is (re)initializing.

I further assume that this needs a retry logic in the operator SDK, because this is probably not handled by the fabric8 K8s client, only by the GoLang K8s client.

@metacosm
Copy link
Collaborator

metacosm commented Sep 5, 2024

I've updated the QOSDK CI to use the latest KinD which should be using Kubernetes 1.31.0 and things seem to be working correctly…

@manusa
Copy link
Contributor

manusa commented Sep 6, 2024

I'm not sure about a JOSDK standalone test setup, but I think with Quarkus the operator might be started concurrently or even before the test resources. My assumption is that with older K8s versions the CRD watcher was just hanging until the storage was initialized. And now it is rejected right away with a 429 Message: storage is (re)initializing.

This seems to be the most likely cause.

I further assume that this needs a retry logic in the operator SDK, because this is probably not handled by the fabric8io/kubernetes-client#6055, only by the GoLang K8s client.

The client does have built-in retry logic enabled by default (10).

I'm not sure if the operator is disabling the retry or setting other values in the setup

@csviri
Copy link
Collaborator

csviri commented Sep 6, 2024

@Donnerbart could you pls try to set stopOnInformerErrorDuringStartup to false. That might solve the problem.

@Donnerbart
Copy link
Contributor Author

Donnerbart commented Sep 6, 2024

I've updated the QOSDK CI to use the latest KinD which should be using Kubernetes 1.31.0 and things seem to be working correctly…

How does the initialization work in the QOSDK tests? Is the K8s cluster started before the operator? And is the operator running in the test JVM or deployed into the K8s cluster?

We have Helm chart integration tests that work perfectly fine. But here the operator is deployed into the running K8s cluster. This must either give K8s enough time to initialize the storage, or the operator pod is not even scheduled before K8s is done with the initialization. At least we didn't see any failure there so far.

We just have this in the operator tests, where I'm pretty sure K8s and the operator are started in parallel. We can very clearly see the different startup scenarios in the logging as well:

  • Helm charts: INFO io.quarkus - hivemq-platform-operator 1.6.0-SNAPSHOT on JVM (powered by Quarkus 3.13.2) started in 10.940s. Listening on: http://0.0.0.0:8080/
  • Operator: hivemq-platform-operator 1.6.0-SNAPSHOT on JVM (powered by Quarkus 3.11.0) started in 67.273s. Listening on: http://localhost:33489/

In the Helm charts K8s is ready, so it's just the operator in PROD mode that is quickly started. In the operator tests it's running in DEV mode and needs to wait for K8s to initialize, before the informers can be started successfully. The timings also didn't change with K8s 1.31.0.

It all really looks like this was just working before, because K8s didn't immediately close the hanging watcher request. But in the KEP they state the 429 could've happened before, this is why they directly enabled this part of the feature:

for the changes to handing requests when watchcache is uninitialized, we believe the risk is visibly lower (and 429 errors were already returned before, even in small clusters where priority-levels are small enough to often admit only 1-2 inflight requests at once anyway) and we will enable it by-default from the very beginning

https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4568-resilient-watchcache-initialization#risks-and-mitigations

We had a phase in the past where we had random startup errors of the operator. But I cannot recall if this was the root cause or something else. It resolved itself, so I never investigated this further until now.

could you pls try to set stopOnInformerErrorDuringStartup to false. That might solve the problem.

This was a bit tricky, because it's a property in the Buildtime Runtime fixed config. That took me longer to figure out than I want to admit 😓 I put this setting into our test application.properties now. I can see multiple retries now, but it works and all tests are green.

listSyncAndWatch failed for hivemq.com/v1/hivemq-platforms, will retry
java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32775/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).

BTW, disabling the new feature via --kube-apiserver-arg=feature-gates=ResilientWatchCacheInitialization=false doesn't help with the startup time. It's still about a minute with the operator running in DEV mode in the test JVM. This just supports my assumption that the informer request was always pending/hanging, until the K8s storage was initialized. And just now we need the retries because of the closed requests.

So this is a functional workaround for us, thanks for the help! Right now I don't see this as a production issue. Not sure if you want to change anything in the Operator SDK for test scenarios though.

@Donnerbart
Copy link
Contributor Author

Donnerbart commented Sep 6, 2024

@metacosm Are the QOSDK tests using the Quarkus Dev Services like this?

quarkus.kubernetes-client.devservices.enabled=true
quarkus.kubernetes-client.devservices.flavor=kind

With QOSDK 6.7.3 and Quarkus 3.13.2 that supports quarkus.kubernetes-client.devservices.api-version=v1.30.0 as newest version. The Dev Services start a com.dajudge.kindcontainer.KindContainer and the newest release just added support for 1.30.

But it could very well be that you'll never run into the issue with a default Quarkus setup. It looks like the Dev services are started first and then the operator is started. In a small out-of-the-box reproducer project it just takes 12 seconds to bring up the operator, which looks like the K8s server is already initialized and there is no wait time on the informer start.

After some thought I think the root cause might be different wait strategies in the K8s container classes that are used:

I will give it a try to copy over that logic into our container. If that helps it's probably an issue of the K3sContainer from testcontainers-java, that doesn't wait properly for the K8s initialization.

@Donnerbart
Copy link
Contributor Author

Donnerbart commented Sep 13, 2024

@metacosm

I've updated the QOSDK CI to use the latest KinD which should be using Kubernetes 1.31.0 and things seem to be working correctly…

I just contributed the support for K8s v1.31.0 to the kind container that Quarkus is using. This PR will update the dependency in Quarkus itself: quarkusio/quarkus#43255

I assume after the next Quarkus release the QOSDK repo needs to bump the Quarkus version before the tests would run with K8s v1.31.0?

@Donnerbart
Copy link
Contributor Author

Not sure how the integration test setup in the JOSDK project works exactly (my Maven days are a bit in the past). It looks like you're firing up a minikube cluster before starting the tests? At least in the GitHub actions it looks like it's a separate build step.

This will probably also solve the problem, since there is very likely no operator started before the cluster is initialized.

@Donnerbart
Copy link
Contributor Author

Donnerbart commented Sep 13, 2024

We see this issue with our Helm chart tests now as well. These are open source :)

In this setup the issue doesn't result in a failed build, which is why we didn't see it sooner. The operator is deployed via a Helm chart into a K8s cluster. So if the operator has a startup failure its container just restarts. Since this problem resolves once the K8s storage is initialized, it will just extend the test duration.

Our test setup for Helm is a custom JUnit extension that fires up a K8s container. It's based on the testcontainers K3sContainer, like our operator tests. Our JUnit extension caches the K3s container, so it's re-used in the next test running in the same testsuite. This is why we only see the failure once on the first operator start per testsuite.

We use the rancher K3s v1.31.0 image with the Helm binary installed, so we can executed Helm commands directly in the K8s container.

All integration tests extend AbstractHelmChartIT that uses the JUnit extension starting K8s. And by default our operator Helm chart is installed before each test.

So it all seems to come down to the lifecycle of the test setup and probably if an operator restart will result in a test failure or not, if this issue affects you.

@Donnerbart
Copy link
Contributor Author

I'm running out of ideas how to properly fix this on our end.

Even if I change the wait strategy of our custom K8s container to wait for the log message of the cacher (something like [cacher.go:463] cacher (hivemq-platforms.hivemq.com): initialized), this would not help with our operator tests.

Since we don't use the Quarkus dev services, we probably cannot control the lifecycle so that the operator is started after our K8s cluster. Unless we don't let Quarkus start the operator and also do that ourselves.

So if this should be a supported use-case

@QuarkusTest
@WithTestResource(BringYourOwnK8sContainerTestResourceLifecycleManager.class)

this probably needs a fix in the JOSDK.

The 429 on informer requests during the cacher initialization is now the default behavior of K8s v1.31.0 and newer. And they state that this could've happened before. I think this new behavior should be handled gracefully by the operator even without the stopOnInformerErrorDuringStartup() switch, because it's an expected, retryable error.

Since that setting is buildtime fixed we cannot even set it easily in our Helm chart tests. And I'm a bit hesitant to change that for production. And I don't even like to set it for tests, because then we don't test the production behavior.

And also in production, this might lead to operator container restarts, as we see them now in our Helm chart tests. This might trigger monitoring systems, even if this recovers quickly.

@metacosm
Copy link
Collaborator

@metacosm

I've updated the QOSDK CI to use the latest KinD which should be using Kubernetes 1.31.0 and things seem to be working correctly…

I just contributed the support for K8s v1.31.0 to the kind container that Quarkus is using. This PR will update the dependency in Quarkus itself: quarkusio/quarkus#43255

I assume after the next Quarkus release the QOSDK repo needs to bump the Quarkus version before the tests would run with K8s v1.31.0?

Yes, since they use the testcontainer version. However, in CI, we also have tests against an actual cluster that should be running 1.31.0 now and we haven't noticed any issue there after that update.

@metacosm
Copy link
Collaborator

Not sure how the integration test setup in the JOSDK project works exactly (my Maven days are a bit in the past). It looks like you're firing up a minikube cluster before starting the tests? At least in the GitHub actions it looks like it's a separate build step.

Yeah, the integration test setup in JOSDK is not fully automated to be run locally. You'd need to launch your local cluster (or use a remote cluster, as long as you are connected to the cluster you want to test against via kubectl) separately, manually first. You'd then launch the integration tests using: ./mvnw package -P no-unit-tests (yeah, I know, not the most obvious thing in the world 😓)

This will probably also solve the problem, since there is very likely no operator started before the cluster is initialized.

Indeed.

@Donnerbart
Copy link
Contributor Author

Donnerbart commented Sep 13, 2024

We did further testing and I think we know where the crucial race condition part is. The watch cache from K8s (where the new feature was implemented) seems to cache each K8s resource individually. For custom resources the initialization starts when the CRD is applied to the cluster.

With our CRD it takes about 34 seconds to initialize the watch cache. If you start an operator in that time frame, the informer start will get the 429 response.

This is a reproducer to show the initialization phase:

#/!bin/bash

# we need a new K8s cluster with an empty watch cache
minikube delete

# the verbosity of the API server must be 1 to see the watch cache logs (cacher.go)
minikube start --kubernetes-version v1.31.0 --extra-config=apiserver.v=1

# download the HiveMQ Platform Operator CRD
curl -o hivemq-crd.yml -L https://raw.githubusercontent.com/hivemq/helm-charts/master/charts/hivemq-platform-operator/crds/hivemq-platforms.hivemq.com-v1.yml

# apply the CRD
start=$(date +%s)
kubectl apply -f hivemq-crd.yml
end_deployed=$(date +%s)

# wait for the cache initialization of the HiveMQ custom resource
while ! minikube logs | grep cacher | grep hivemq &> /dev/null; do
  sleep 1
done
end_initialized=$(date +%s)

# print stats
diff_deployed=$((end_deployed - start))
diff_initialized=$((end_initialized - end_deployed))
echo "Time to deploy the HiveMQ CRD: $diff_deployed s"
echo "Time to initialize cacher for HiveMQ CRD: $diff_initialized s"

# clean up
rm hivemq-crd.yml

@Donnerbart
Copy link
Contributor Author

Donnerbart commented Sep 13, 2024

We can reproduce this with a simple Helm install of our operator in a fresh minikube cluster running with K8s v1.31.0. It's most reliable when the operator Docker image is pre-loaded into the minikube Docker registry (eliminates the download time from Dockerhub into the minikube registry).

#/!bin/bash

# function to retrieve the pod name
get_pod_name() {
  kubectl get pods -l app.kubernetes.io/name=hivemq-platform-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null
}

# function to get the current restart count
get_current_restart_count() {
  kubectl get pod "$POD_NAME" -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# we need a new K8s cluster with an empty watch cache
minikube delete

# the verbosity of the API server must be 1 to see the watch cache logs (cacher.go)
minikube start --kubernetes-version v1.31.0 --extra-config=apiserver.v=1

# pre-load the operator Docker image into the minikube cluster
docker pull hivemq/hivemq-platform-operator:1.5.2
minikube image load hivemq/hivemq-platform-operator:1.5.2

# add the HiveMQ Helm repository
helm repo add hivemq https://hivemq.github.io/helm-charts
helm repo update

# install the HiveMQ Platform Operator
helm install test-operator hivemq/hivemq-platform-operator --version 0.2.10 --set logLevel=DEBUG

# find the operator pod
while true; do
  POD_NAME=$(get_pod_name)
  if [ -n "$POD_NAME" ]; then
    echo "Found pod: $POD_NAME"
    break
  fi
  echo "No pod found, retrying in 1 second..."
  sleep 1
done

# wait until the restart count increases
echo "Waiting for pod $POD_NAME to restart..."
while true; do
  current_restart_count=$(get_current_restart_count)
  # check if restart count has increased
  if [ "$current_restart_count" -gt "0" ]; then
    echo "Pod $POD_NAME has restarted."
    break
  fi
  sleep 1
done

# show the log of the crashed operator
kubectl logs $POD_NAME --previous

@metacosm
Copy link
Collaborator

Thank you for the detailed information. We will look into it ASAP.

@metacosm
Copy link
Collaborator

It seems as if the informer should block until the cache is populated to avoid this issue. What do you think @shawkins?

@metacosm
Copy link
Collaborator

Seems like testcontainers have been updated in Quarkus main, I now see the error in the standard QOSDK build.

@shawkins
Copy link
Collaborator

Discussed with @metacosm it appears that retries are disabled for the operator sdk http client - possibly due to the retry logic that the operator sdk maintains. However the retry behavior for the client is lower level - for 429, 500+, and IOExceptions - so it seems like it could be enabled for the operator sdk as well such that the operator sdk retry is for higher-level 400 responses.

It is also possible to tweak the informer exception handler to account for the 429 case as well.

@Donnerbart
Copy link
Contributor Author

From this comment I believe the GoLang client is doing automatic retries on the 429 errors (at least on read operations, because they are idempotent). So I assume everyone on the K8s team felt safe to enable this feature right away.

I also think retryable, read-only operations could be handled automatically by the K8s client itself. Just failures on write operations should be thrown into the operator SDK.

@metacosm
Copy link
Collaborator

@Donnerbart yes, this is handled automatically by the fabric8 client, it's just that, for some reason that I will investigate once I'm done with other stuff, the client is currently configured to not retry. So, theoretically, this issue should be addressed by a simple configuration change.

@metacosm
Copy link
Collaborator

Looks like the issue is in how Quarkus handles the configuration…

@metacosm
Copy link
Collaborator

@Donnerbart the workaround at this point is to use quarkus.kubernetes-client.request-retry-backoff-limit=10 (or whichever maximum number of retries you want to allow) in your application.properties until quarkusio/quarkus#43474 is fixed and released. Another option would be to use a KubernetesConfigCustomizer to set that value by default in QOSDK (so as to restore the default from the fabric8 client), however that would then ignore the property if it had been set in the operator's application.properties.

metacosm added a commit to quarkiverse/quarkus-operator-sdk that referenced this issue Sep 24, 2024
This should be removed when the underlying issue is fixed.

See quarkusio/quarkus#43474

Signed-off-by: Chris Laprun <claprun@redhat.com>
metacosm added a commit to quarkiverse/quarkus-operator-sdk that referenced this issue Sep 25, 2024
This should be removed when the underlying issue is fixed.

See quarkusio/quarkus#43474

Signed-off-by: Chris Laprun <claprun@redhat.com>
metacosm added a commit to quarkiverse/quarkus-operator-sdk that referenced this issue Sep 25, 2024
This should be removed when the underlying issue is fixed.

See quarkusio/quarkus#43474

Signed-off-by: Chris Laprun <claprun@redhat.com>
@Donnerbart
Copy link
Contributor Author

The workaround with quarkus.kubernetes-client.request-retry-backoff-limit=10 works fine for us. And thanks a lot for fixing this upstream in the Quarkus project. That was a really long road to find the root cause behind all of this!

@metacosm
Copy link
Collaborator

metacosm commented Oct 2, 2024

The workaround with quarkus.kubernetes-client.request-retry-backoff-limit=10 works fine for us. And thanks a lot for fixing this upstream in the Quarkus project. That was a really long road to find the root cause behind all of this!

Sorry we didn't get there sooner. And thanks a lot for your investigations as well, which proved essential to figuring out that there indeed was an issue! Community FTW!

@metacosm
Copy link
Collaborator

metacosm commented Oct 2, 2024

Closing this as this seems to be addressed.

@metacosm metacosm closed this as completed Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants