
ES status health reported by the operator not updated due to a reconciliation error #5330

Closed
thbkrkr opened this issue Feb 4, 2022 · 3 comments
Labels
>bug Something isn't working

Comments

@thbkrkr
Contributor

thbkrkr commented Feb 4, 2022

The ES health reported by the operator in the ES resource status may never be updated if the operator encounters an error during the reconciliation loop.

Steps to reproduce

  • Create an ES cluster
apiVersion: v1
kind: Secret
metadata:
  name: bad-ca
stringData:
  ca.crt: bla
  ca.key: bla
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: yellowforever
spec:
  version: 7.16.2
  # transport:                  
  #   tls:
  #     certificate:
  #       secretName: bad-ca
  nodeSets:
  - name: master
    count: 1
    config:
      node.store.allow_mmap: false
  • Make cluster health yellow
# create an index with 1 replica (endpoint and credentials depend on your setup)
curl -k -u "elastic:$PASSWORD" -H 'Content-Type: application/json' \
  -XPUT 'https://localhost:9200/my-index-1' \
  -d '{"settings":{"number_of_shards":1,"number_of_replicas":1}}'
  • Configure the bad CA secret in the cluster

Uncomment transport.tls.certificate.secretName and reapply.

  • Make cluster health green
# delete the index
curl -k -u "elastic:$PASSWORD" -XDELETE 'https://localhost:9200/my-index-1'
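
At this point the cluster itself is green again, but because the bad CA secret makes certificate reconciliation fail, the reported status keeps the last health observed before the error. A quick check (cluster name from the manifest above; output illustrative):

kubectl get elasticsearch yellowforever
NAME            HEALTH   NODES   VERSION   PHASE             AGE
yellowforever   yellow   1       7.16.2    ApplyingChanges   10m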
@thbkrkr thbkrkr added the discuss We need to figure this out label Feb 4, 2022
@botelastic botelastic bot added the triage label Feb 4, 2022
@thbkrkr thbkrkr changed the title ES cluster health reported by the operator might be misleading if a reconciliation error ES status health reported by the operator not updated due to a reconciliation error Feb 4, 2022
@thbkrkr
Contributor Author

thbkrkr commented Feb 4, 2022

The right thing to do seems to be to update the status even when there are errors in the reconcile loop, and that is what we already do. The problem here is that we depend on the bad secret to get the cluster health: we fail to reconcile the certificates, which are required to create the ES observer, which is required to get the cluster health, which is required to update the status. Strictly speaking, we don't need the transport certificates to get the cluster health, but the HTTP and transport certificates are reconciled together.

certificateResources, res := certificates.Reconcile(
	ctx,
	d,
	d.ES,
	[]corev1.Service{*externalService, *internalService},
	d.OperatorParameters.CACertRotation,
	d.OperatorParameters.CertRotation,
)
if results.WithResults(res).HasError() {
	return results
}

controllerUser, err := user.ReconcileUsersAndRoles(ctx, d.Client, d.ES, d.DynamicWatches(), d.Recorder())
if err != nil {
	return results.WithError(err)
}

// Patch the Pods to add the expected node labels as annotations. Record the error, if any, but do not stop the
// reconciliation loop as we don't want to prevent other updates from being applied to the cluster.
results.WithResults(annotatePodsWithNodeLabels(ctx, d.Client, d.ES))

resourcesState, err := reconcile.NewResourcesStateFromAPI(d.Client, d.ES)
if err != nil {
	return results.WithError(err)
}
min, err := version.MinInPods(resourcesState.CurrentPods, label.VersionLabelName)
if err != nil {
	return results.WithError(err)
}
if min == nil {
	min = &d.Version
}

warnUnsupportedDistro(resourcesState.AllPods, d.ReconcileState.Recorder)

observedState := d.Observers.ObservedStateResolver(
	d.ES,
	d.newElasticsearchClient(
		resourcesState,
		controllerUser,
		*min,
		certificateResources.TrustedHTTPCertificates,
	),
)

// always update the elasticsearch state bits
d.ReconcileState.UpdateElasticsearchState(*resourcesState, observedState())

I see a few things that could improve this:

  • We could separate the certificate reconciliation for HTTP from that for transport, so that as soon as the HTTP certs are reconciled we can start the observer and update the status. The problem would still occur if reconciling the HTTP certs fails, so this is only an optimization that reduces the situations in which the problem can occur.
  • We could set the status health to unknown when a reconciliation error occurs before the observer is started (see the sketch after this list).
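
A minimal sketch of the second option, reduced to the health selection logic; the type and helper names here are illustrative, not the actual ECK API:

package main

import "fmt"

// ElasticsearchHealth mirrors the health values written to the ES resource
// status ("green", "yellow", "red"); "unknown" is what would be reported
// when the observer could not be started.
type ElasticsearchHealth string

const (
	ElasticsearchYellowHealth  ElasticsearchHealth = "yellow"
	ElasticsearchUnknownHealth ElasticsearchHealth = "unknown"
)

// healthToReport picks the health to write into the status: the observed
// health if the observer ran during this reconciliation, otherwise "unknown"
// instead of silently keeping the previously reported, possibly stale, value.
func healthToReport(observerStarted bool, observed ElasticsearchHealth) ElasticsearchHealth {
	if !observerStarted {
		return ElasticsearchUnknownHealth
	}
	return observed
}

func main() {
	// Reconciliation failed before the observer started (e.g. the bad CA
	// secret above): report unknown rather than the stale yellow.
	fmt.Println(healthToReport(false, ElasticsearchYellowHealth)) // unknown
	// Observer ran: report whatever it observed.
	fmt.Println(healthToReport(true, ElasticsearchYellowHealth)) // yellow
}

Reporting unknown is arguably more honest than a stale green/yellow: consumers of the status can then distinguish "the operator cannot tell" from "the cluster is degraded".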

@thbkrkr thbkrkr added the >bug Something isn't working label Feb 4, 2022
@botelastic botelastic bot removed the triage label Feb 4, 2022
@thbkrkr thbkrkr removed the discuss We need to figure this out label Feb 17, 2022
@pebrc
Collaborator

pebrc commented Sep 26, 2022

Is this bug still present given that we have made significant changes to how the status is calculated?

@thbkrkr
Contributor Author

thbkrkr commented Sep 26, 2022

Indeed, this bug is fixed by #5349.

@thbkrkr thbkrkr closed this as completed Sep 26, 2022