
ES status health reported by the operator not updated due to a reconciliation error #5330

Closed
thbkrkr opened this issue Feb 4, 2022 · 3 comments
Labels
>bug Something isn't working

Comments

@thbkrkr
Contributor

thbkrkr commented Feb 4, 2022

The ES health reported by the operator in the ES resource status may never be updated if the operator encounters an error during the reconciliation loop.

Steps to reproduce

  • Create an ES cluster
apiVersion: v1
kind: Secret
metadata:
  name: bad-ca
stringData:
  ca.crt: bla
  ca.key: bla
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: yellowforever
spec:
  version: 7.16.2
  # transport:                  
  #   tls:
  #     certificate:
  #       secretName: bad-ca
  nodeSets:
  - name: master
    count: 1
    config:
      node.store.allow_mmap: false
  • Make cluster health yellow
# create an index with 1 replica (endpoint and credentials depend on your setup)
curl -k -u "elastic:$PASSWORD" -H 'Content-Type: application/json' \
  -XPUT 'https://localhost:9200/my-index-1' \
  -d '{"settings":{"number_of_shards":1,"number_of_replicas":1}}'
  • Configure the bad CA secret in the cluster

Uncomment transport.tls.certificate.secretName and reapply.

  • Make cluster health green
# delete the index
curl -k -u "elastic:$PASSWORD" -XDELETE 'https://localhost:9200/my-index-1'
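
At this point the cluster itself is green again, but because the bad CA secret makes certificate reconciliation fail, the reported status keeps the last health observed before the error. A quick check (cluster name from the manifest above; output illustrative):

kubectl get elasticsearch yellowforever
NAME            HEALTH   NODES   VERSION   PHASE             AGE
yellowforever   yellow   1       7.16.2    ApplyingChanges   10m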
@thbkrkr thbkrkr added the discuss We need to figure this out label Feb 4, 2022
@botelastic botelastic bot added the triage label Feb 4, 2022
@thbkrkr thbkrkr changed the title ES cluster health reported by the operator might be misleading if a reconciliation error ES status health reported by the operator not updated due to a reconciliation error Feb 4, 2022
@thbkrkr
Contributor Author

thbkrkr commented Feb 4, 2022

The right thing to do seems to be to update the status even when there are errors in the reconcile loop, and that is what we already do. The problem here is that we depend on the bad secret to get the cluster health: we fail to reconcile the certificates, which are required to create the ES observer, which is required to get the cluster health, which is required to update the status. Strictly speaking, we don't need the transport certificates to get the cluster health, but the HTTP and transport certificates are reconciled together.

certificateResources, res := certificates.Reconcile(
	ctx,
	d,
	d.ES,
	[]corev1.Service{*externalService, *internalService},
	d.OperatorParameters.CACertRotation,
	d.OperatorParameters.CertRotation,
)
if results.WithResults(res).HasError() {
	return results
}

controllerUser, err := user.ReconcileUsersAndRoles(ctx, d.Client, d.ES, d.DynamicWatches(), d.Recorder())
if err != nil {
	return results.WithError(err)
}

// Patch the Pods to add the expected node labels as annotations. Record the error, if any, but do not stop the
// reconciliation loop as we don't want to prevent other updates from being applied to the cluster.
results.WithResults(annotatePodsWithNodeLabels(ctx, d.Client, d.ES))

resourcesState, err := reconcile.NewResourcesStateFromAPI(d.Client, d.ES)
if err != nil {
	return results.WithError(err)
}
min, err := version.MinInPods(resourcesState.CurrentPods, label.VersionLabelName)
if err != nil {
	return results.WithError(err)
}
if min == nil {
	min = &d.Version
}

warnUnsupportedDistro(resourcesState.AllPods, d.ReconcileState.Recorder)

observedState := d.Observers.ObservedStateResolver(
	d.ES,
	d.newElasticsearchClient(
		resourcesState,
		controllerUser,
		*min,
		certificateResources.TrustedHTTPCertificates,
	),
)

// always update the elasticsearch state bits
d.ReconcileState.UpdateElasticsearchState(*resourcesState, observedState())

I see a few things that could improve this:

  • We could separate the certificate reconciliation for HTTP from that for transport, so that as soon as the HTTP certs are reconciled we can start the observer and update the status. The problem would still occur if reconciling the HTTP certs fails, so this is only an optimization that reduces the situations in which the problem can occur.
  • We could set the status health to unknown when a reconciliation error occurs before the observer is started (see the sketch after this list).
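
A minimal sketch of the second option, reduced to the health selection logic; the type and helper names here are illustrative, not the actual ECK API:

package main

import "fmt"

// ElasticsearchHealth mirrors the health values written to the ES resource
// status ("green", "yellow", "red"); "unknown" is what would be reported
// when the observer could not be started.
type ElasticsearchHealth string

const (
	ElasticsearchYellowHealth  ElasticsearchHealth = "yellow"
	ElasticsearchUnknownHealth ElasticsearchHealth = "unknown"
)

// healthToReport picks the health to write into the status: the observed
// health if the observer ran during this reconciliation, otherwise "unknown"
// instead of silently keeping the previously reported, possibly stale, value.
func healthToReport(observerStarted bool, observed ElasticsearchHealth) ElasticsearchHealth {
	if !observerStarted {
		return ElasticsearchUnknownHealth
	}
	return observed
}

func main() {
	// Reconciliation failed before the observer started (e.g. the bad CA
	// secret above): report unknown rather than the stale yellow.
	fmt.Println(healthToReport(false, ElasticsearchYellowHealth)) // unknown
	// Observer ran: report whatever it observed.
	fmt.Println(healthToReport(true, ElasticsearchYellowHealth)) // yellow
}

Reporting unknown is arguably more honest than a stale green/yellow: consumers of the status can then distinguish "the operator cannot tell" from "the cluster is degraded".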

@thbkrkr thbkrkr added the >bug Something isn't working label Feb 4, 2022
@botelastic botelastic bot removed the triage label Feb 4, 2022
@thbkrkr thbkrkr removed the discuss We need to figure this out label Feb 17, 2022
@pebrc
Collaborator

pebrc commented Sep 26, 2022

Is this bug still present given that we have made significant changes to how the status is calculated?

@thbkrkr
Contributor Author

thbkrkr commented Sep 26, 2022

Indeed, this bug is fixed by #5349.

@thbkrkr thbkrkr closed this as completed Sep 26, 2022