[Bug] unable to load certificates when cruise control is turned on #3694

Closed
jrivers96 opened this issue Sep 22, 2020 · 15 comments
jrivers96 commented Sep 22, 2020

Certificate problem when Cruise Control is turned on

I have a cluster with 35 Kafka brokers and 5 ZooKeeper nodes on Strimzi 0.19 that has been running for a month on AWS EKS.

I ran k edit kafka kafka-cluster and turned on Cruise Control with the default settings ({}). The brokers rolled and I now see the error below. The cluster is set up with OAuth authentication and external and internal certs.

Any ideas? The cluster seems to be fully operational otherwise.
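For reference, the change was just enabling Cruise Control in the Kafka custom resource, roughly like this (a minimal sketch of the edit; the rest of the spec is elided):

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    # ... existing broker configuration ...
  zookeeper:
    # ... existing zookeeper configuration ...
  cruiseControl: {}   # enable Cruise Control with all defaults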

k -n system-strimzi-dev logs kafka-cluster-cruise-control-6795f646b8-75lmb -c cruise-control


Preparing certificates for internal communication
Adding /etc/tls-sidecar/cluster-ca-certs/ca.crt to truststore /tmp/cruise-control/replication.truststore.p12 with alias ca
Certificate was added to keystore
unable to load certificates

  tls:
    authentication:
      type: oauth
      clientId: kafka-broker
      clientSecret:
        key: secret
        secretName: broker-oauth-secret
      disableTlsHostnameVerification: false
      jwksEndpointUri: https://keycloak.dev/auth/realms/pro-realm/protocol/openid-connect/certs
      validIssuerUri: https://keycloak.dev/auth/realms/pro-realm
      userNameClaim: preferred_username
      tlsTrustedCertificates:
      - secretName: ca-truststore
        certificate: ca.crt
      jwksExpirySeconds: 960
      jwksRefreshSeconds: 300
  external:
    type: loadbalancer
    tls: true
    configuration:
      brokerCertChainAndKey:
        certificate: ca.crt
        key: ca.key
        secretName: external-cert-secret
    authentication:
      type: oauth
      clientId: kafka-broker
      clientSecret:
        key: secret
        secretName: broker-oauth-secret
      disableTlsHostnameVerification: false
      jwksEndpointUri: https://keycloak.dev/auth/realms/pro-realm/protocol/openid-connect/certs
      validIssuerUri: https://keycloak.dev/auth/realms/pro-realm
      userNameClaim: preferred_username
      tlsTrustedCertificates:
      - secretName: ca-truststore
        certificate: ca.crt
      jwksExpirySeconds: 960
      jwksRefreshSeconds: 300

k get secrets

NAME TYPE DATA AGE
broker-oauth-secret Opaque 1 20d
ca-truststore Opaque 1 20d
default-token-9xhxm kubernetes.io/service-account-token 3 20d
external-cert-secret Opaque 2 20d
kafka-cluster-clients-ca Opaque 1 7d7h
kafka-cluster-clients-ca-cert Opaque 3 7d7h
kafka-cluster-cluster-ca Opaque 1 7d7h
kafka-cluster-cluster-ca-cert Opaque 3 7d7h
kafka-cluster-cluster-operator-certs Opaque 4 7d7h
kafka-cluster-cruise-control-certs Opaque 4 41m
kafka-cluster-cruise-control-token-wlxz5 kubernetes.io/service-account-token 3 41m
kafka-cluster-entity-operator-certs Opaque 4 7d7h
kafka-cluster-entity-operator-token-vp2js kubernetes.io/service-account-token 3 7d7h
kafka-cluster-kafka-brokers Opaque 140 7d7h
kafka-cluster-kafka-exporter-certs Opaque 4 7d7h
kafka-cluster-kafka-exporter-token-wvm92 kubernetes.io/service-account-token 3 7d7h
kafka-cluster-kafka-token-vpj7l kubernetes.io/service-account-token 3 7d7h
kafka-cluster-zookeeper-nodes Opaque 20 7d7h
kafka-cluster-zookeeper-token-rf5tv kubernetes.io/service-account-token 3 7d7h

jrivers96 added the bug label on Sep 22, 2020
jrivers96 changed the title from "[Bug] unable to load certificates when cruise control is turn on" to "[Bug] unable to load certificates when cruise control is turned on" on Sep 22, 2020
scholzj (Member) commented Sep 23, 2020

So, just to clarify ... the Cruise Control pod is crash looping after the unable to load certificates error? Does the tls-sidecar container in the same pod seem to run OK, or does it also show some error? Can you check that the secret kafka-cluster-cruise-control-certs is not empty and contains the crt and key files?
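Something along these lines should show whether the secret exists and whether its entries have non-zero sizes (using the namespace from the logs above):

kubectl -n system-strimzi-dev describe secret kafka-cluster-cruise-control-certs
kubectl -n system-strimzi-dev get secret kafka-cluster-cruise-control-certs -o yaml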

jrivers96 (Author) commented:

Yes, the Cruise Control pod is crash looping after the unable to load certificates error.

It looks like the sidecar fails as well.

k -n system-strimzi-dev logs kafka-cluster-cruise-control-6795f646b8-zrhx6 -c tls-sidecar
Starting Stunnel with configuration:
pid = /usr/local/var/run/stunnel.pid
foreground = yes
debug = notice
sslVersion = TLSv1.2
[zookeeper-2181]
client = yes
CAfile = /tmp/cluster-ca.crt
cert = /etc/tls-sidecar/cc-certs/cruise-control.crt
key = /etc/tls-sidecar/cc-certs/cruise-control.key
accept = 127.0.0.1:2181
connect = kafka-cluster-zookeeper-client:2181
delay = yes
verify = 2


Clients allowed=504187
stunnel 4.56 on x86_64-redhat-linux-gnu platform
Compiled/running with OpenSSL 1.0.1e-fips 11 Feb 2013
Threading:PTHREAD Sockets:POLL,IPv6 SSL:ENGINE,OCSP,FIPS Auth:LIBWRAP
Reading configuration from file /tmp/stunnel.conf
FIPS mode is enabled
Compression not enabled
PRNG seeded successfully
Initializing service [zookeeper-2181]
Insecure file permissions on /etc/tls-sidecar/cc-certs/cruise-control.key
Certificate: /etc/tls-sidecar/cc-certs/cruise-control.crt
Error reading certificate file: /etc/tls-sidecar/cc-certs/cruise-control.crt
error queue: 140DC009: error:140DC009:SSL routines:SSL_CTX_use_certificate_chain_file:PEM lib
SSL_CTX_use_certificate_chain_file: 906D06C: error:0906D06C:PEM routines:PEM_read_bio:no start line
Service [zookeeper-2181]: Failed to initialize SSL context
str_stats: 12 block(s), 1053 data byte(s), 696 control byte(s)

It appears the crt file is empty.

Name:         kafka-cluster-cruise-control-certs
Namespace:    mdic-strimzi-pro
Labels:       app.kubernetes.io/instance=kafka-cluster
              app.kubernetes.io/managed-by=strimzi-cluster-operator
              app.kubernetes.io/name=cruise-control
              app.kubernetes.io/part-of=strimzi-kafka-cluster
              strimzi.io/cluster=kafka-cluster
              strimzi.io/kind=Kafka
              strimzi.io/name=strimzi
Annotations:  <none>

Type:  Opaque

Data
====
cruise-control.crt:       0 bytes
cruise-control.key:       1704 bytes
cruise-control.p12:       0 bytes
cruise-control.password:  12 bytes
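For completeness, one way to confirm that the .crt entry contains no parseable certificate (a sketch; note the dot in the key name has to be escaped in the jsonpath expression):

kubectl -n mdic-strimzi-pro get secret kafka-cluster-cruise-control-certs \
  -o jsonpath='{.data.cruise-control\.crt}' | base64 -d | openssl x509 -noout -text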

scholzj (Member) commented Sep 23, 2020

It looks like something went wrong when generating the certificates - the .crt and .p12 files are empty. Could you:

  • Check the Cluster CA secrets (both <cluster-name>-cluster-ca and <cluster-name>-cluster-ca-cert) to see whether they exist and contain an actual certificate and key?
  • If you have a log from the Cluster Operator from when you enabled Cruise Control, it might tell us what went wrong so we can fix it.

If the Cluster CA secrets look ok, I think you should just delete the kafka-cluster-cruise-control-certs secret and wait until it is recreated (and afterwards delete the CC pod). If the CA secrets are also broken, that might possibly be related to the issues from the comment you deleted and the other issues.
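Roughly, that recovery path would look like this (a sketch only; substitute your namespace, and use the actual Cruise Control pod name shown by kubectl get pods):

# 1. check that the Cluster CA secrets exist and actually contain data
kubectl -n <namespace> describe secret kafka-cluster-cluster-ca kafka-cluster-cluster-ca-cert
# 2. if they look fine, delete the broken Cruise Control cert secret so the operator regenerates it
kubectl -n <namespace> delete secret kafka-cluster-cruise-control-certs
# 3. once the secret is recreated, delete the Cruise Control pod so it restarts with the new certs
kubectl -n <namespace> delete pod kafka-cluster-cruise-control-<pod-suffix>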

jrivers96 (Author) commented:

I deleted my comment because I realized there were changes made to the cluster this afternoon that I wasn't aware of.

A cyber security engineer applied this annotation to all services:
service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags:

I looked at the persistent volumes and they were all in a terminating state. I'm not sure why. I didn't get to verify whether the Cluster CA looked okay (but all consumers and producers were operational).

I'll stand the cluster back up and turn on cruise control as I did before and see if it works this time.

scholzj (Member) commented Sep 23, 2020

Hmm, that sounds weird. Normally, if you apply a label or annotation to a service, it should not touch the pods at all unless it causes the load balancer to be recreated (which would roll them but not delete them).

jrivers96 (Author) commented:

I think at 1:30 EST we had a broker down, and at 2:30 EST cyber security came by and applied that tag while the broker was still down. I then looked at the cluster and noticed brokers were missing and persistent volumes were in a terminating state.

scholzj (Member) commented Sep 23, 2020

I think I probably asked about this already ... but in your storage configuration, you do not use deleteClaim: true, right?

jrivers96 (Author) commented:

storage:
  type: jbod
  volumes:
  - id: 0
    type: persistent-claim
    class: aws-ssd
    size: 250Gi
    deleteClaim: false

jrivers96 (Author) commented:

Strange - I wiped out the cluster.

k -n -strimzi-pro delete -f kafka-persistent-pro-deployed.yaml
k -n -strimzi-pro get PersistentVolumeClaim | awk '{ print $1 }' | grep kafka | xargs kubectl -n-strimzi-pro delete PersistentVolumeClaim
k -n -strimzi-pro get PersistentVolume | awk '{ print $1,$6 }' | grep kafka | awk '{ print $1 }'| xargs kubectl -n-strimzi-pro delete PersistentVolume
k -n -strimzi-pro apply -f kafka-persistent-pro-deployed.yaml 	

I can't even seem to bring the cluster back up...

k -n mdic-strimzi-pro logs strimzi-cluster-operator-57869945dd-rnv42

2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 9 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-9, retrying after at least 16000ms
2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 8 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-8, retrying after at least 16000ms
2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 7 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-7, retrying after at least 16000ms
2020-09-23 22:55:28 INFO  KafkaRoller:260 - Reconciliation #1(watch) Kafka(mdic-strimzi-pro/kafka-cluster): Could not roll pod 6 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: An error while try to create an admin client for pod kafka-cluster-kafka-6, retrying after at least 16000ms

Can the CRDs get corrupted somehow or end up in some sort of bad state?

scholzj (Member) commented Sep 23, 2020

Can you share the whole log? Do the pods exist? When the cluster is deleted, it can happen that it is deleted in the middle of a reconciliation ... in which case the Cluster Operator might need some time to figure that out, finish the old reconciliation of the deleted cluster, and start a new one for the new cluster. Could that be the case here? I would need the full log to confirm.

Can the CRDs get corrupted somehow or end up in some sort of bad state?

I haven't seen anything like that. But TBH, we use the CRDs and know how they work from the user perspective ... I cannot say I understand the details of the CRD implementation in Kubernetes and all the ways they can go wrong. In any case, deleting the CRD triggers deletion of the CRs, which triggers deletion of the cluster. So if something were wrong there, anything is possible.
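If you want to rule that out, checking that the CRD and the Kafka custom resource are still present and intact is usually enough (a sketch, assuming the standard Strimzi CRD name):

kubectl get crd kafkas.kafka.strimzi.io
kubectl -n <namespace> get kafka kafka-cluster -o yaml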

jrivers96 (Author) commented Sep 24, 2020

Oh boy. The cyber security engineer came by and applied a YAML where the requests/limits on Kafka were flipped. This was the cause of the Kafka cluster dying and the volumes going away.
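If "flipped" means the requests ended up larger than the limits, Kubernetes rejects such pod specs, which would explain the brokers not coming back up. For reference, a valid resources block in the Kafka CR keeps requests at or below limits (the values here are purely illustrative):

spec:
  kafka:
    resources:
      requests:
        memory: 8Gi
        cpu: "2"
      limits:
        memory: 8Gi
        cpu: "4"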

I believe the Cruise Control bug itself was valid, but I can't collect information on it because of the aforementioned problem.

scholzj (Member) commented Sep 25, 2020

Yeah, if the secret was empty it was probably some bug. It's hard to find the exact cause without the logs. Did deleting the secret help? Or was this resolved by the other problem?

jrivers96 (Author) commented:

Yeah, we resolved this by causing a bigger problem and having to reset the cluster, unfortunately. I'll try to reproduce this at some point before we go into production with it.

scholzj (Member) commented Feb 21, 2021

@jrivers96 Did you manage to reproduce this again?

scholzj (Member) commented Jul 7, 2022

Triaged on 7th July 2022: Should be closed. Seems like it never happened again?

scholzj closed this as not planned on Jul 7, 2022