[Bug] unable to load certificates when cruise control is turned on #3694

Closed
jrivers96 opened this issue Sep 22, 2020 · 15 comments
jrivers96 commented Sep 22, 2020

Certificate problem when Cruise Control is turned on

I have a cluster with 35 Kafka brokers and 5 ZooKeeper nodes on Strimzi 0.19 that has been running for a month on AWS EKS.

I ran k edit kafka kafka-cluster and turned on Cruise Control with the default settings ({}). The brokers rolled and I now see the error below. The cluster is set up with OAuth authentication and external and internal certs.

Any ideas? The cluster seems to be fully operational otherwise.
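For reference, the change was just enabling Cruise Control in the Kafka custom resource, roughly like this (a minimal sketch of the edit; the rest of the spec is elided):

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    # ... existing broker configuration ...
  zookeeper:
    # ... existing zookeeper configuration ...
  cruiseControl: {}   # enable Cruise Control with all defaults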

k -n system-strimzi-dev logs kafka-cluster-cruise-control-6795f646b8-75lmb -c cruise-control


Preparing certificates for internal communication
Adding /etc/tls-sidecar/cluster-ca-certs/ca.crt to truststore /tmp/cruise-control/replication.truststore.p12 with alias ca
Certificate was added to keystore
unable to load certificates

  tls:
    authentication:
      type: oauth
      clientId: kafka-broker
      clientSecret:
        key: secret
        secretName: broker-oauth-secret
      disableTlsHostnameVerification: false
      jwksEndpointUri: https://keycloak.dev/auth/realms/pro-realm/protocol/openid-connect/certs
      validIssuerUri: https://keycloak.dev/auth/realms/pro-realm
      userNameClaim: preferred_username
      tlsTrustedCertificates:
      - secretName: ca-truststore
        certificate: ca.crt
      jwksExpirySeconds: 960
      jwksRefreshSeconds: 300
  external:
    type: loadbalancer
    tls: true
    configuration:
      brokerCertChainAndKey:
        certificate: ca.crt
        key: ca.key
        secretName: external-cert-secret
    authentication:
      type: oauth
      clientId: kafka-broker
      clientSecret:
        key: secret
        secretName: broker-oauth-secret
      disableTlsHostnameVerification: false
      jwksEndpointUri: https://keycloak.dev/auth/realms/pro-realm/protocol/openid-connect/certs
      validIssuerUri: https://keycloak.dev/auth/realms/pro-realm
      userNameClaim: preferred_username
      tlsTrustedCertificates:
      - secretName: ca-truststore
        certificate: ca.crt
      jwksExpirySeconds: 960
      jwksRefreshSeconds: 300

k get secrets

NAME TYPE DATA AGE
broker-oauth-secret Opaque 1 20d
ca-truststore Opaque 1 20d
default-token-9xhxm kubernetes.io/service-account-token 3 20d
external-cert-secret Opaque 2 20d
kafka-cluster-clients-ca Opaque 1 7d7h
kafka-cluster-clients-ca-cert Opaque 3 7d7h
kafka-cluster-cluster-ca Opaque 1 7d7h
kafka-cluster-cluster-ca-cert Opaque 3 7d7h
kafka-cluster-cluster-operator-certs Opaque 4 7d7h
kafka-cluster-cruise-control-certs Opaque 4 41m
kafka-cluster-cruise-control-token-wlxz5 kubernetes.io/service-account-token 3 41m
kafka-cluster-entity-operator-certs Opaque 4 7d7h
kafka-cluster-entity-operator-token-vp2js kubernetes.io/service-account-token 3 7d7h
kafka-cluster-kafka-brokers Opaque 140 7d7h
kafka-cluster-kafka-exporter-certs Opaque 4 7d7h
kafka-cluster-kafka-exporter-token-wvm92 kubernetes.io/service-account-token 3 7d7h
kafka-cluster-kafka-token-vpj7l kubernetes.io/service-account-token 3 7d7h
kafka-cluster-zookeeper-nodes Opaque 20 7d7h
kafka-cluster-zookeeper-token-rf5tv kubernetes.io/service-account-token 3 7d7h

jrivers96 added the bug label on Sep 22, 2020
jrivers96 changed the title from "[Bug] unable to load certificates when cruise control is turn on" to "[Bug] unable to load certificates when cruise control is turned on" on Sep 22, 2020
scholzj (Member) commented Sep 23, 2020

So, just to clarify ... the Cruise Control pod is crash looping after the unable to load certificates error? Does the tls-sidecar container in the same pod seem to run OK, or does it also show some error? Can you check that the secret kafka-cluster-cruise-control-certs is not empty and contains the crt and key files?
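Something along these lines should show whether the secret exists and whether its entries have non-zero sizes (using the namespace from the logs above):

kubectl -n system-strimzi-dev describe secret kafka-cluster-cruise-control-certs
kubectl -n system-strimzi-dev get secret kafka-cluster-cruise-control-certs -o yaml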

jrivers96 (Author) commented:

Yes, the Cruise Control pod is crash looping after the unable to load certificates error.

It looks like the sidecar fails as well.

k -n system-strimzi-dev logs kafka-cluster-cruise-control-6795f646b8-zrhx6 -c tls-sidecar
Starting Stunnel with configuration:
pid = /usr/local/var/run/stunnel.pid
foreground = yes
debug = notice
sslVersion = TLSv1.2
[zookeeper-2181]
client = yes
CAfile = /tmp/cluster-ca.crt
cert = /etc/tls-sidecar/cc-certs/cruise-control.crt
key = /etc/tls-sidecar/cc-certs/cruise-control.key
accept = 127.0.0.1:2181
connect = kafka-cluster-zookeeper-client:2181
delay = yes
verify = 2


Clients allowed=504187
stunnel 4.56 on x86_64-redhat-linux-gnu platform
Compiled/running with OpenSSL 1.0.1e-fips 11 Feb 2013
Threading:PTHREAD Sockets:POLL,IPv6 SSL:ENGINE,OCSP,FIPS Auth:LIBWRAP
Reading configuration from file /tmp/stunnel.conf
FIPS mode is enabled
Compression not enabled
PRNG seeded successfully
Initializing service [zookeeper-2181]
Insecure file permissions on /etc/tls-sidecar/cc-certs/cruise-control.key
Certificate: /etc/tls-sidecar/cc-certs/cruise-control.crt
Error reading certificate file: /etc/tls-sidecar/cc-certs/cruise-control.crt
error queue: 140DC009: error:140DC009:SSL routines:SSL_CTX_use_certificate_chain_file:PEM lib
SSL_CTX_use_certificate_chain_file: 906D06C: error:0906D06C:PEM routines:PEM_read_bio:no start line
Service [zookeeper-2181]: Failed to initialize SSL context
str_stats: 12 block(s), 1053 data byte(s), 696 control byte(s)

It appears the crt file is empty.

Name:         kafka-cluster-cruise-control-certs
Namespace:    mdic-strimzi-pro
Labels:       app.kubernetes.io/instance=kafka-cluster
              app.kubernetes.io/managed-by=strimzi-cluster-operator
              app.kubernetes.io/name=cruise-control
              app.kubernetes.io/part-of=strimzi-kafka-cluster
              strimzi.io/cluster=kafka-cluster
              strimzi.io/kind=Kafka
              strimzi.io/name=strimzi
Annotations:  <none>

Type:  Opaque

Data
====
cruise-control.crt:       0 bytes
cruise-control.key:       1704 bytes
cruise-control.p12:       0 bytes
cruise-control.password:  12 bytes
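For completeness, one way to confirm that the .crt entry contains no parseable certificate (a sketch; note the dot in the key name has to be escaped in the jsonpath expression):

kubectl -n mdic-strimzi-pro get secret kafka-cluster-cruise-control-certs \
  -o jsonpath='{.data.cruise-control\.crt}' | base64 -d | openssl x509 -noout -text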

scholzj (Member) commented Sep 23, 2020

It looks like something went wrong when generating the certificates - the .crt and .p12 files are empty. Could you:

  • Check the Cluster CA secrets (both <cluster-name>-cluster-ca and <cluster-name>-cluster-ca-cert) to see whether they exist and contain an actual certificate and key?
  • If you have a log from the Cluster Operator from when you enabled Cruise Control, it might tell us what went wrong so we can fix it.

If the Cluster CA secrets look ok, I think you should just delete the kafka-cluster-cruise-control-certs secret and wait until it is recreated (and afterwards delete the CC pod). If the CA secrets are also broken, that might possibly be related to the issues from the comment you deleted and the other issues.
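Roughly, that recovery path would look like this (a sketch only; substitute your namespace, and use the actual Cruise Control pod name shown by kubectl get pods):

# 1. check that the Cluster CA secrets exist and actually contain data
kubectl -n <namespace> describe secret kafka-cluster-cluster-ca kafka-cluster-cluster-ca-cert
# 2. if they look fine, delete the broken Cruise Control cert secret so the operator regenerates it
kubectl -n <namespace> delete secret kafka-cluster-cruise-control-certs
# 3. once the secret is recreated, delete the Cruise Control pod so it restarts with the new certs
kubectl -n <namespace> delete pod kafka-cluster-cruise-control-<pod-suffix>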

jrivers96 (Author) commented:

I deleted my comment because I realized there were changes made to the cluster this afternoon that I wasn't aware of.

A cyber security engineer applied this annotation to all services:
service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags:

I looked at the persistent volumes and they were all in a terminating state. I'm not sure why. I didn't get to verify whether the Cluster CA looked okay (but all consumers and producers were operational).

I'll stand the cluster back up and turn on cruise control as I did before and see if it works this time.

scholzj (Member) commented Sep 23, 2020

Hmm, that sounds weird. Normally, if you apply a label or annotation to a service, it should not touch the pods at all unless it causes the load balancer to be recreated (which would roll them but not delete them).

jrivers96 (Author) commented:

I think at 1:30 EST we had a broker down, and at 2:30 EST cyber security came by and applied that tag while the broker was still down. I then looked at the cluster and noticed brokers were missing and persistent volumes were in a terminating state.

scholzj (Member) commented Sep 23, 2020

I think I probably asked about this already ... but in your storage configuration, you do not use deleteClaim: true, right?

jrivers96 (Author) commented:

storage:
  type: jbod
  volumes:
  - id: 0
    type: persistent-claim
    class: aws-ssd
    size: 250Gi
    deleteClaim: false

jrivers96 (Author) commented:

Strange - I wiped out the cluster.

k -n -strimzi-pro delete -f kafka-persistent-pro-deployed.yaml
k -n -strimzi-pro get PersistentVolumeClaim | awk '{ print $1 }' | grep kafka | xargs kubectl -n-strimzi-pro delete PersistentVolumeClaim
k -n -strimzi-pro get PersistentVolume | awk '{ print $1,$6 }' | grep kafka | awk '{ print $1 }'| xargs kubectl -n-strimzi-pro delete PersistentVolume
k -n -strimzi-pro apply -f kafka-persistent-pro-deployed.yaml 	

I can't even seem to bring the cluster back up...

k -n mdic-strimzi-pro logs strimzi-cluster-operator-57869945dd-rnv42

2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 9 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-9, retrying after at least 16000ms
2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 8 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-8, retrying after at least 16000ms
2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 7 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-7, retrying after at least 16000ms
2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 6 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-6, retrying after at least 16000ms

Can the CRDs get corrupted somehow or end up in some sort of bad state?

scholzj (Member) commented Sep 23, 2020

Can you share the whole log? Do the pods exist? When the cluster is deleted, it can happen that it is deleted in the middle of a reconciliation ... in which case the Cluster Operator might need some time to figure that out, finish the old reconciliation of the deleted cluster, and start a new one for the new cluster. Could that be the case here? I would need the full log to confirm.

Can the CRDs get corrupted somehow or end up in some sort of bad state?

I haven't seen anything like that. But TBH, we use the CRDs and know how they work from the user perspective ... I cannot say I understand the details of the CRD implementation in Kubernetes and all the ways they can go wrong. In any case, deleting the CRD triggers deletion of the CRs, which triggers deletion of the cluster. So if something were wrong there, anything is possible.
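If you want to rule that out, checking that the CRD and the Kafka custom resource are still present and intact is usually enough (a sketch, assuming the standard Strimzi CRD name):

kubectl get crd kafkas.kafka.strimzi.io
kubectl -n <namespace> get kafka kafka-cluster -o yaml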

jrivers96 (Author) commented Sep 24, 2020

Oh boy. The cyber security engineer came by and applied a YAML where the requests/limits on Kafka were flipped. This was the cause of the Kafka cluster dying and the volumes going away.
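If "flipped" means the requests ended up larger than the limits, Kubernetes rejects such pod specs, which would explain the brokers not coming back up. For reference, a valid resources block in the Kafka CR keeps requests at or below limits (the values here are purely illustrative):

spec:
  kafka:
    resources:
      requests:
        memory: 8Gi
        cpu: "2"
      limits:
        memory: 8Gi
        cpu: "4"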

I believe the Cruise Control bug itself was valid, but I can't collect information on it because of the aforementioned problem.

scholzj (Member) commented Sep 25, 2020

Yeah, if the secret was empty it was probably some bug. It's hard to find the exact cause without the logs. Did deleting the secret help? Or was this resolved by the other problem?

jrivers96 (Author) commented:

Yeah, we resolved this by causing a bigger problem and having to reset the cluster, unfortunately. I'll try to reproduce this at some point before we go into production with it.

scholzj (Member) commented Feb 21, 2021

@jrivers96 Did you manage to reproduce this again?

scholzj (Member) commented Jul 7, 2022

Triaged on 7th July 2022: Should be closed. Seems like it never happened again?

scholzj closed this as not planned on Jul 7, 2022