
[BUG] Elastic operator mistakenly deletes the secret objects when seeing a stale cluster state #5249

Closed
srteam2020 opened this issue Jan 14, 2022 · 2 comments · Fixed by #5273
Labels
>bug Something isn't working


@srteam2020
Contributor

Bug Report

We find that the elastic operator, when seeing a stale cluster state, can mistakenly delete the "soft-owned" secret objects used for storing user passwords and TLS certificates.

What did you do?
We ran a simple workload with the elastic operator that:

  1. creates an elasticsearch cluster named elasticsearch
  2. deletes the elasticsearch
  3. recreates the elasticsearch

What did you expect to see?
After recreation, the elasticsearch and its owned objects should become stable without unexpected object deletion.

What did you see instead? Under which circumstances?
We find that three secret objects, -es-elastic-user, -es-http-certs-public, and -es-transport-certs-public (used for storing user passwords and TLS certificates), are mistakenly deleted by the operator after recreating the elasticsearch. The unexpected deletion is caused by the operator seeing a stale cluster state from an apiserver. After further inspection, we identified the concrete steps that trigger the bug:

  1. When deleting the elasticsearch, the operator successfully deletes all the "soft-owned" secrets.
  2. After recreating the elasticsearch, the operator creates the three secrets again.
  3. The operator crashes and restarts (e.g., due to a machine reboot) and connects to a stale apiserver. The apiserver has not caught up with the latest cluster state and still holds a stale view in which the elasticsearch is deleted but the secrets from step 1 exist.
  4. After the restart, the operator finds that the elasticsearch is missing from the stale apiserver and invokes GarbageCollectSoftOwnedSecrets, which deletes the three secrets that are currently in use (a rough sketch of this pattern is shown after these steps).
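
The pattern can be illustrated with the following Go sketch. This is a hypothetical reconstruction, not ECK's actual GarbageCollectSoftOwnedSecrets: the function name, its signature, and the way the soft-owned secrets are passed in are illustrative, assuming a controller-runtime client.

```go
// Rough sketch of the failure mode in step 4 -- not ECK's actual implementation.
// A controller-runtime client is assumed; garbageCollectIfOwnerMissing and the
// way soft-owned secrets are passed in are illustrative only.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func garbageCollectIfOwnerMissing(ctx context.Context, c client.Client,
	ownerKey client.ObjectKey, owner client.Object, softOwnedSecrets []corev1.Secret) error {
	// Ask the apiserver whether the owning Elasticsearch still exists.
	err := c.Get(ctx, ownerKey, owner)
	if err == nil {
		return nil // owner still exists, nothing to garbage collect
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	// The owner appears deleted -- but with a stale apiserver this may be an outdated view.
	for i := range softOwnedSecrets {
		// Unconditional delete: nothing ties the request to the exact secret object
		// the operator observed, so a secret recreated for the new cluster is removed too.
		if err := c.Delete(ctx, &softOwnedSecrets[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```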

Environment

  • ECK version:

    660bc92

  • Kubernetes information:


    • On premise: v1.18.9
  • Logs:
    After recreating the elasticsearch, we see the following logs showing that the operator is garbage collecting the secrets, which is not expected:

{"log.level":"info","@timestamp":"2022-01-14T01:25:00.681Z","log.logger":"generic-reconciler","message":"Garbage collecting secret","service.version":"2.0.0-SNAPSHOT+3cba748e","service.type":"eck","ecs.version":"1.4.0","namespace":"default","secret_name":"elasticsearch-cluster-es-transport-certs-public","owner_name":"elasticsearch-cluster","owner_kind":"Elasticsearch"}
...
{"log.level":"info","@timestamp":"2022-01-14T01:25:00.686Z","log.logger":"generic-reconciler","message":"Garbage collecting secret","service.version":"2.0.0-SNAPSHOT+3cba748e","service.type":"eck","ecs.version":"1.4.0","namespace":"default","secret_name":"elasticsearch-cluster-es-http-certs-public","owner_name":"elasticsearch-cluster","owner_kind":"Elasticsearch"}
...
{"log.level":"info","@timestamp":"2022-01-14T01:25:00.690Z","log.logger":"generic-reconciler","message":"Garbage collecting secret","service.version":"2.0.0-SNAPSHOT+3cba748e","service.type":"eck","ecs.version":"1.4.0","namespace":"default","secret_name":"elasticsearch-cluster-es-elastic-user","owner_name":"elasticsearch-cluster","owner_kind":"Elasticsearch"}
  • Additional information:
    The staleness problem can be alleviated by setting the UID of the secret object as a precondition when deleting the objects in GarbageCollectSoftOwnedSecrets. That way, if the secret objects observed by the operator are also stale, k8s will find that the precondition is not satisfied and refuse the deletion (see the sketch below).
    We are willing to send a PR to help fix this problem.
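
A minimal sketch of what the proposed UID precondition could look like using controller-runtime's client.Preconditions delete option. The helper name deleteSoftOwnedSecret is illustrative, not ECK's actual code.

```go
// Minimal sketch of the proposed fix -- not ECK's actual code. A controller-runtime
// client is assumed; deleteSoftOwnedSecret is an illustrative helper name.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deleteSoftOwnedSecret(ctx context.Context, c client.Client, observed *corev1.Secret) error {
	// Only delete if the live object's UID still matches the UID the operator observed.
	err := c.Delete(ctx, observed, client.Preconditions{UID: &observed.UID})
	switch {
	case apierrors.IsNotFound(err):
		return nil // already gone
	case apierrors.IsConflict(err):
		// Precondition failed: the live secret has a different UID, i.e. the secret the
		// operator observed was stale. The in-use secret is left untouched.
		return nil
	default:
		return err
	}
}
```

With the precondition, a delete issued against a stale secret UID is rejected with a Conflict by the apiserver, so a secret recreated for the new cluster keeps its new UID and survives.
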
@botelastic botelastic bot added the triage label Jan 14, 2022
@pebrc
Collaborator

pebrc commented Jan 17, 2022

Thanks for the bug report. It definitely sounds like a reasonable proposal to use a precondition when deleting the secrets. I wonder, however, if it will cover all edge cases of the scenario you are describing. Your description implies inconsistent results across the different resources, where the secrets already reflect the new state (or an even older state) while the Elasticsearch resource is lagging behind and is still shown as deleted. Of these two possible causes (secrets in the cache with the old UID and secrets in the cache with the new UID), only the first one would be caught by the precondition, IIUC.

@pebrc pebrc added the >bug Something isn't working label Jan 17, 2022
@botelastic botelastic bot removed the triage label Jan 17, 2022
@srteam2020
Contributor Author

@pebrc Thanks for the reply!

Of these two possible causes (secrets in the cache with the old UID and secrets in the cache with the new UID) only the first one would be caught by the precondition IIUC.

This is a great point. Yes, using the UID can only help the case where the secrets seen by the operator are also stale, as k8s can reject the stale UID. If the secrets seen by the operator are fresh, a precondition with the fresh UID cannot prevent the deletion. Unfortunately, I have not found an easy-to-implement solution to prevent the second case for the elastic operator.

Please let me know if you have any thoughts on how to prevent the second case. I can open a PR to use the precondition so that at least the first case is handled.
