Bug Report
We find that the elastic operator, when seeing stale cluster state, can mistakenly delete the "soft-owned" secret objects used for storing user passwords and TLS certificates.
What did you do?
We ran a simple workload with the elastic operator that (a minimal sketch of the workload follows the list):
1. create an elasticsearch cluster named elasticsearch
2. delete the elasticsearch
3. recreate the elasticsearch
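For concreteness, here is a minimal sketch of such a workload using a controller-runtime client and an unstructured Elasticsearch object. The namespace, Elasticsearch version, and nodeSet values are illustrative assumptions, not details from the original report:

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// newElasticsearch builds an Elasticsearch resource named "elasticsearch" as an
// unstructured object, so the sketch does not depend on the ECK API packages.
func newElasticsearch() *unstructured.Unstructured {
	es := &unstructured.Unstructured{}
	es.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "elasticsearch.k8s.elastic.co",
		Version: "v1",
		Kind:    "Elasticsearch",
	})
	es.SetNamespace("default") // assumed namespace
	es.SetName("elasticsearch")
	// Illustrative spec: a single-node cluster; the version is an assumption.
	_ = unstructured.SetNestedField(es.Object, "8.13.0", "spec", "version")
	_ = unstructured.SetNestedSlice(es.Object, []interface{}{
		map[string]interface{}{"name": "default", "count": int64(1)},
	}, "spec", "nodeSets")
	return es
}

func main() {
	ctx := context.Background()
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// 1. create an elasticsearch cluster named "elasticsearch"
	if err := c.Create(ctx, newElasticsearch()); err != nil {
		panic(err)
	}

	// 2. delete it (the real workload waits until the deletion completes)
	if err := c.Delete(ctx, newElasticsearch()); err != nil {
		panic(err)
	}

	// 3. recreate it under the same name
	if err := c.Create(ctx, newElasticsearch()); err != nil {
		panic(err)
	}
}
```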
What did you expect to see?
After recreation, the elasticsearch and its owned objects should become stable without unexpected object deletion.
What did you see instead? Under which circumstances?
We find that three secret objects, -es-elastic-user, -es-http-certs-public, and -es-transport-certs-public (used for storing user passwords and TLS certificates), are mistakenly deleted by the operator after recreating the elasticsearch. The unexpected deletion is caused by the operator seeing a stale cluster state from an apiserver. After further inspection, we identified the concrete steps that trigger the bug (an illustrative sketch of the garbage-collection logic follows the steps):
1. When deleting the elasticsearch, the operator successfully deletes all the "soft-owned" secrets.
2. After recreating the elasticsearch, the operator creates the three secrets again.
3. The operator crashes and restarts (e.g., due to a machine reboot) and connects to a stale apiserver. This apiserver has not caught up with the latest cluster state and holds a stale view in which the elasticsearch is deleted but the secrets still exist (from step 1).
4. After the restart, the operator finds that the elasticsearch is missing from the stale apiserver and invokes GarbageCollectSoftOwnedSecrets to delete the three secrets that are currently in use.
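To make the failure mode concrete, the sketch below shows the general shape of soft-owner garbage collection; it is not the actual GarbageCollectSoftOwnedSecrets implementation, and the soft-owner label keys and the function name are hypothetical. The operator lists secrets that carry a soft-owner label, checks whether the referenced Elasticsearch still exists, and deletes the secret when the owner appears to be gone; if that existence check hits a stale apiserver, step 4 deletes secrets that are still in use.

```go
package gc

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical soft-owner label key, used only for this sketch.
const softOwnerNameLabel = "example.com/soft-owner-name"

// garbageCollectSoftOwnedSecrets deletes secrets whose soft owner no longer
// exists. If c reads from a stale apiserver, a recreated Elasticsearch may
// appear missing, and its secrets are deleted even though they are in use.
func garbageCollectSoftOwnedSecrets(ctx context.Context, c client.Client) error {
	var secrets corev1.SecretList
	if err := c.List(ctx, &secrets, client.HasLabels{softOwnerNameLabel}); err != nil {
		return err
	}
	for i := range secrets.Items {
		secret := &secrets.Items[i]
		ownerName := secret.Labels[softOwnerNameLabel]

		// Check whether the soft owner (the Elasticsearch cluster) still exists.
		owner := &unstructured.Unstructured{}
		owner.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "elasticsearch.k8s.elastic.co", Version: "v1", Kind: "Elasticsearch",
		})
		err := c.Get(ctx, types.NamespacedName{Namespace: secret.Namespace, Name: ownerName}, owner)
		if err == nil {
			continue // owner is still there, keep the secret
		}
		if !apierrors.IsNotFound(err) {
			return err
		}
		// Owner appears to be gone: delete the secret. With a stale view, this
		// is exactly the unexpected deletion described in step 4.
		if err := c.Delete(ctx, secret); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```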
Environment
ECK version: 660bc92
Logs:
After recreating the elasticsearch, we see logs showing that the operator is garbage collecting the secrets, which is not expected.
Additional information:
The staleness problem can be alleviated by setting the UID of the secret object as a precondition when deleting the objects in GarbageCollectSoftOwnedSecrets. That way, if the secret objects observed by the operator are also stale, k8s will find that the precondition is not satisfied and refuse the deletion. A minimal sketch of such a precondition-guarded delete follows.
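As an illustration of the proposed mitigation, here is a hedged sketch assuming a controller-runtime client; the helper function is hypothetical, while client.Preconditions with a UID is the standard delete option:

```go
package gc

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteSecretWithUIDPrecondition deletes the given secret only if the UID
// stored on the server still matches the UID the operator observed. If the
// operator's view of the secret is stale (old UID), the apiserver rejects
// the delete instead of removing the live secret.
func deleteSecretWithUIDPrecondition(ctx context.Context, c client.Client, secret *corev1.Secret) error {
	uid := secret.UID
	err := c.Delete(ctx, secret, client.Preconditions{UID: &uid})
	if apierrors.IsConflict(err) || apierrors.IsNotFound(err) {
		// Precondition failed or the secret is already gone: nothing to delete.
		return nil
	}
	return err
}
```

If the operator's cached secret carries an old UID, the apiserver rejects the delete and the live secret survives; as the discussion below notes, this does not help when the cached secret is already the fresh one.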
We are willing to send a PR to help fix this problem.
Thanks for the bug report. It definitely sounds like a reasonable proposal to use a precondition when deleting the secrets. I wonder, however, if it will cover all edge cases of the scenario you are describing. Your description implies inconsistent results across the different resources, where the secrets already reflect the new state (or an even older state) while the Elasticsearch resource is lagging behind and still shown as deleted. Of these two possible causes (secrets in the cache with the old UID and secrets in the cache with the new UID) only the first one would be caught by the precondition IIUC.
Of these two possible causes (secrets in the cache with the old UID and secrets in the cache with the new UID) only the first one would be caught by the precondition IIUC.
This is a great point. Yes, using the UID can only help the case where the secrets seen by the operator are also stale, since k8s can reject the stale UID. If the secrets seen by the operator are fresh, a precondition with the fresh UID cannot prevent the deletion. Unfortunately, I have not found an easy-to-implement solution to prevent the second case for the elastic operator.
Please let me know if you have any thoughts on how to prevent the second case. I can open a PR to use the precondition so that at least the first case is handled.