
Kubernetes client in 0.22 fails to connect to the API server on Kube 1.18 #5044

Closed
peppe77 opened this issue May 27, 2021 · 23 comments

@peppe77

peppe77 commented May 27, 2021

K8S: v1.18

Strimzi Kafka Operator was v0.18 - operational
Kafka Cluster v2.5.0 - operational

In order to get to Kafka 2.7.0 we first need to upgrade the operator to v0.22, but we came across a problem.

We upgraded the operator from 0.18 to 0.22 and got the following (full operator logs are attached: one with the workaround applied, where the problem below does not occur, and one without it, where the excerpt below happens):

        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.ClusterOperationsImpl.fetchVersion(ClusterOperationsImpl.java:54) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
        at io.fabric8.kubernetes.client.DefaultKubernetesClient.getVersion(DefaultKubernetesClient.java:489) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
        at io.strimzi.operator.PlatformFeaturesAvailability.lambda$getVersionInfoFromKubernetes$5(PlatformFeaturesAvailability.java:150) ~[io.strimzi.operator-common-0.22.0.jar:0.22.0]
        at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$2(ContextImpl.java:313) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: okhttp3.internal.http2.ConnectionShutdownException

This could have been caused by any number of things (not necessarily the client release), but we ruled out the most obvious ones for the following reasons:

We had successfully upgraded the operator from 0.15 to 0.18 and did not hit any problem with the K8S client reaching the K8S API; the 0.18 operator and Kafka 2.5.0 worked fine after that upgrade.

We did some manual tests/checks to ensure there was nothing preventing the client from reaching the K8S API - all good.
[Most critical one]: we set HTTP2_DISABLE: "true" and the Strimzi operator then got past the point where it used to fail. The pod actually comes up, whereas before it would crash-loop. (There are other errors in the logs, but they are not related to the operator's K8S client being unable to call the K8S API.)

If there were any problem with connectivity/access/permissions, then the run with HTTP2_DISABLE = "true" should also have failed, but it did not.
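For reference, this is roughly how the variable gets passed - a minimal sketch of the cluster operator Deployment env, assuming the default name strimzi-cluster-operator from the standard install files (adjust to your Helm/terraform values):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: strimzi-cluster-operator
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          env:
            # Read by the fabric8 kubernetes-client inside the operator;
            # forces plain HTTP/1.1 to the API server instead of HTTP/2.
            - name: HTTP2_DISABLE
              value: "true"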

If you need additional logs/information, please let us know. All we need to do to reproduce the problem is remove the HTTP2_DISABLE env variable mentioned above. We have an environment where the problem is easily reproducible - we just need help enabling/collecting more logs to understand why the K8S client cannot call the K8S API as reported above.

To Reproduce
Steps to reproduce the behavior:

  1. Operator 0.15 / Kafka 2.3.0 - upgraded Kafka to 2.4.0 - all good.
  2. Upgraded operator 0.15 to 0.18 - all good.
  3. Upgraded Kafka 2.4.0 to 2.5.0 - all good.
  4. Upgraded operator 0.18 to 0.22 - problem reported herein.

[UPDATE]: I just installed operator 0.22 on a K8S cluster (sandbox) that does not even have a Kafka cluster, and the same problem occurred. If I install 0.18, it comes up right away with no problem.

2021-05-27 21:34:48 ERROR PlatformFeaturesAvailability:152 - Detection of Kubernetes version failed.
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.ClusterOperationsImpl.fetchVersion(ClusterOperationsImpl.java:54) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
	at io.fabric8.kubernetes.client.DefaultKubernetesClient.getVersion(DefaultKubernetesClient.java:489) ~[io.fabric8.kubernetes-client-5.0.2.jar:?]
	at io.strimzi.operator.PlatformFeaturesAvailability.lambda$getVersionInfoFromKubernetes$5(PlatformFeaturesAvailability.java:150) ~[io.strimzi.operator-common-0.22.0.jar:0.22.0]
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$2(ContextImpl.java:313) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.net.SocketException: Broken pipe (Write failed)

Expected behavior
The operator upgrades from 0.18 to 0.22 successfully.

Environment (please complete the following information):

  • Strimzi version: [operator 0.18 - tried upgrade to 0.22; Kafka 2.5.0]
  • Installation method: [terraform / helm - using Strimzi charts]
  • Kubernetes cluster: [Kubernetes 1.18]
  • Infrastructure: [AWS/EC2 - Self Managed K8S cluster]

kafka_operator_022_http2_false.log
kafka_operator_022.log

@peppe77 peppe77 added the bug label May 27, 2021
@scholzj scholzj changed the title [not able to upgrade operator 0.18 to 0.22 on k8s 1.1.8] k8s client fails to call k8s api Kubernetes client in 0.22 fails to connect to the API server on Kube 1.18 May 27, 2021
@scholzj
Member

scholzj commented May 27, 2021

This issue does not seem to be widespread ... so I think it would be interesting to understand what makes your Kubernetes cluster special so that it suffers from this.

@peppe77
Author

peppe77 commented May 27, 2021

@scholzj that would not explain why the 0.15 and 0.18 operators work in all of our clusters while 0.20, 0.21 and 0.22 have problems. I also recall someone else reporting problems between the operator and K8S 1.18. If we had anything special, then 0.18 should have failed too, right? It would be good to understand what changed in the client after 0.18.

@scholzj
Member

scholzj commented May 27, 2021

Every version has different libraries. So it would not be surprising that some versions work and some don't.

@peppe77
Author

peppe77 commented May 27, 2021

@scholzj what additional logs can be enabled/collected to better understand the problem (and, if warranted, report it to fabric8io)? Thanks

@scholzj
Member

scholzj commented May 27, 2021

The logging configuration is in the strimzi-cluster-operator config map. The OkHttp library is what actually does the communication, so I guess you can try to get some logs from that.

@peppe77
Author

peppe77 commented May 27, 2021

In a sandbox cluster (never even used to deploy Kafka clusters), I deployed the operator on K8S 1.18 and hit the very same problem with Strimzi/Kafka operator versions 0.19, 0.20, 0.21 and 0.22. I have now installed 0.18 and it is up and running - no problem and no workaround. If we pass HTTP2_DISABLE="true", the operator versions listed above get past that env/K8S version detection phase. @scholzj thanks, will take a look and get some more logs. Appreciated.

@peppe77
Author

peppe77 commented Jun 1, 2021

@scholzj and/or anyone -> any idea on how to enable OKHttp logs?

@scholzj
Member

scholzj commented Jun 1, 2021

I don't know ... judging by how the other configuration looks, maybe something like this might work?

logger.okhttp.name = <OkHttp Package Name>
logger.okhttp.level = WARN
logger.okhttp.additivity = false
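
For what it's worth, the OkHttp 3.x/4.x root package is okhttp3, so concretely that guess would look like this (note that OkHttp itself logs mostly through java.util.logging, so without a JUL-to-Log4j bridge this may not capture much):

logger.okhttp.name = okhttp3
logger.okhttp.level = DEBUG
logger.okhttp.additivity = false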

@peppe77
Author

peppe77 commented Jun 3, 2021

@scholzj please check this out - I managed to enable additional logging [GODEBUG=http2debug=2] in the K8S API server and, as you can see below, it rejects the client's connection attempts (TCP RST) with the following:

kube-apiserver-ip-10-64-1-194.ec2.internal kube-apiserver I0603 04:10:09.138948 1 log.go:172] http2: server rejecting conn: INADEQUATE_SECURITY, Prohibited TLS 1.2 Cipher Suite: 9d
kube-apiserver-ip-10-64-1-194.ec2.internal kube-apiserver I0603 04:10:09.138978 1 log.go:172] http2: Framer 0xc014c8b880: wrote GOAWAY len=43 LastStreamID=0 ErrCode=INADEQUATE_SECURITY Debug="Prohibited TLS 1.2 Cipher Suite: 9d"
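
For anyone who wants the same API server logging, this is roughly how it can be turned on - a sketch assuming a kubeadm-style static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml (the kubelet restarts the pod when the file changes):

spec:
  containers:
    - name: kube-apiserver
      env:
        # Go runtime debug flag: logs every HTTP/2 frame, including the
        # GOAWAY with ErrCode=INADEQUATE_SECURITY shown above.
        - name: GODEBUG
          value: http2debug=2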

This happens with v5.0.2 (judging by the operator 0.22 logs provided herein), though based on the very similar issues below, this should have been fixed in v5.0.2 already, no?

#2212
kubernetes-client/java#1149

It seems to be a regression. Please take a look at the two issues above as well - please advise.

thanks /Pedro

@peppe77
Author

peppe77 commented Jun 3, 2021

@scholzj the question here is whether 9d is in the TLS 1.2 blacklist [https://httpwg.org/specs/rfc7540.html#BadCipherSuites]. We are not picking any specific cipher, so 9d gets picked by default and used for HTTP/2.

@scholzj
Member

scholzj commented Jun 3, 2021

I have no idea what 9d is. But if your API server supports only TLS 1.3 on HTTP/2, then you probably have to disable HTTP/2 in the operator. This sounds exactly like the special thing about your Kubernetes cluster which breaks it. However, it shows this has nothing to do with the previous issue, which was about something completely different.

TBH, I have no idea how the OkHttp client decides what TLS version to use. Java 11 normally supports TLSv1.3, but no clue why it uses 1.2 here.

@peppe77
Author

peppe77 commented Jun 3, 2021

@scholzj I do not think it only supports TLS 1.3; as previously mentioned, it does not support a long list of ciphers, and perhaps the one proposed is among those. Moreover, we have not configured the api-server to only do TLS v1.3 and/or to disallow specific TLS v1.2 ciphers (by default that blacklisted list is simply not accepted).

@peppe77
Author

peppe77 commented Jun 3, 2021

Here is confirmation that the K8S API server we have can do TLSv1.2 - the following is configured:
- --tls-min-version=VersionTLS12

@scholzj
Member

scholzj commented Jun 3, 2021

Well, on Minikube, for example, it works fine by default, so this problem does not exist there. It could be an unsupported cipher suite, but it is not really clear which one in that case.

@peppe77
Author

peppe77 commented Jun 3, 2021

Can we specify a specific cipher for v0.22 to use (or how does it pick/propose one)?

@scholzj
Member

scholzj commented Jun 3, 2021

I don't know. Fabric8 does not seem to support it. I'm not sure if OkHttp has some configuration options for this. Maybe you can do it somewhere in the JDK configuration?
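
One untested idea on the JDK side: JSSE honours the jdk.tls.disabledAlgorithms security property (which also accepts full cipher suite names), so the offending suite could in principle be blocked from ever being negotiated via an extra security properties file. A sketch, assuming a hypothetical file path; note that a property set this way replaces the JDK default list rather than appending to it:

# passed to the JVM with -Djava.security.properties=/path/to/extra.security
# (hypothetical path; normally you would copy the JDK defaults and add to them)
jdk.tls.disabledAlgorithms=TLS_RSA_WITH_AES_256_GCM_SHA384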

@peppe77
Author

peppe77 commented Jun 3, 2021

@scholzj now we know what is going on. There are a few things here:

  1. Based on https://www.ibm.com/docs/en/zos/2.3.0?topic=programming-cipher-suite-definitions, 9d => TLS_RSA_WITH_AES_256_GCM_SHA384, which, per https://httpwg.org/specs/rfc7540.html#BadCipherSuites, is considered weak and should not be proposed.
  2. We "forced" our K8S 1.18 API server to accept this specific cipher and the problem disappeared, as expected (see the flag sketch below). In our clusters, for security reasons, we do not allow weak ciphers.
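
The change in point 2 was along these lines - a sketch of the relevant kube-apiserver flags, not our exact manifest; the important part is that the suite the client ends up negotiating appears in --tls-cipher-suites:

    - --tls-min-version=VersionTLS12
    - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384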

Conclusions:

  • The Minikube setup used for testing most likely accepts this cipher; it should be revised/updated so that it no longer allows it, which would comply with the above and ensure the operator is exercised in a more secure configuration.
  • The client (fabric8io) should not propose such a cipher by default - the default should be only ciphers that are not weak/blacklisted. Likewise, the Strimzi operator needs to either handle this somehow or place a requirement on the chosen client to fix or properly handle it.
  • Anyone for whom the operator v0.22 / K8S 1.18 combination works has an api-server that is not strictly configured, thus allowing a weak/blacklisted cipher to be used.

@peppe77
Author

peppe77 commented Jun 3, 2021

@scholzj for reference, we had also opened bug report on fabric8 client -> fabric8io/kubernetes-client#3176

@slachiewicz

slachiewicz commented Jun 14, 2021

@peppe77 could you try running with the KUBERNETES_TLS_VERSIONS env variable (or -Dkubernetes.tls.version) set to TLSv1.2,TLSv1.3? That should override the operator's kubernetes-client default setting of TLS 1.2 and may open it up to using TLS 1.3 cipher suites.
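
On the operator Deployment that would be the same env mechanism as HTTP2_DISABLE earlier in this thread, roughly (a sketch):

          env:
            - name: KUBERNETES_TLS_VERSIONS
              value: "TLSv1.2,TLSv1.3"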

@peppe77
Author

peppe77 commented Jun 14, 2021

@slachiewicz I set it, but got the same outcome. Please let us know if you want us to try/check other things. Thanks /Pedro

I0614 20:51:55.139928       1 log.go:172] http2: Framer 0xc0166b7dc0: wrote GOAWAY len=43 LastStreamID=0 ErrCode=INADEQUATE_SECURITY Debug="Prohibited TLS 1.2 Cipher Suite: 9d"

@withlin

withlin commented Sep 23, 2021

same issue

@vutkin

vutkin commented Jan 12, 2022

I have the same problem too

@scholzj
Member

scholzj commented Jul 21, 2022

Triaged on 21.7.2022: This seems to be a Fabric8 (OkHttp) issue. Fabric8 is working on pluggable HTTP clients, which might help solve it in the future. Strimzi uses the latest Fabric8 version, so once it is fixed there, we will adopt the fix. But there does not seem to be anything we can do in Strimzi itself to fix this. This should be closed.

@scholzj scholzj closed this as not planned Jul 21, 2022