
[BUG] Elastic operator mistakenly deletes the secret objects when seeing a stale cluster state #5249

Closed
srteam2020 opened this issue Jan 14, 2022 · 2 comments · Fixed by #5273
Labels
>bug Something isn't working


@srteam2020
Contributor

Bug Report

We find that the elastic operator, when seeing a stale cluster state, can mistakenly delete the "soft-owned" secret objects used for storing user passwords and TLS certificates.

What did you do?
We ran a simple workload with the elastic operator that:

  1. creates an elasticsearch cluster named elasticsearch
  2. deletes the elasticsearch
  3. recreates the elasticsearch

What did you expect to see?
After recreation, the elasticsearch and its owned objects should become stable without unexpected object deletion.

What did you see instead? Under which circumstances?
We find that three secret objects, -es-elastic-user, -es-http-certs-public, and -es-transport-certs-public (used for storing user passwords and TLS certificates), are mistakenly deleted by the operator after recreating the elasticsearch. The unexpected deletion is caused by the operator seeing a stale cluster state from an apiserver. After further inspection, we identified the concrete steps that trigger the bug:

  1. When deleting the elasticsearch, the operator successfully deletes all the "soft-owned" secrets.
  2. After recreating the elasticsearch, the operator creates the three secrets again.
  3. The operator crashes and restarts (e.g., due to a machine reboot) and connects to a stale apiserver. The apiserver has not caught up with the latest cluster state and still holds a stale view in which the elasticsearch is deleted but the secrets from step 1 exist.
  4. After the restart, the operator finds that the elasticsearch is missing from the stale apiserver and invokes GarbageCollectSoftOwnedSecrets, which deletes the three secrets that are currently in use (a rough sketch of this pattern is shown after these steps).
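
The pattern can be illustrated with the following Go sketch. This is a hypothetical reconstruction, not ECK's actual GarbageCollectSoftOwnedSecrets: the function name, its signature, and the way the soft-owned secrets are passed in are illustrative, assuming a controller-runtime client.

```go
// Rough sketch of the failure mode in step 4 -- not ECK's actual implementation.
// A controller-runtime client is assumed; garbageCollectIfOwnerMissing and the
// way soft-owned secrets are passed in are illustrative only.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func garbageCollectIfOwnerMissing(ctx context.Context, c client.Client,
	ownerKey client.ObjectKey, owner client.Object, softOwnedSecrets []corev1.Secret) error {
	// Ask the apiserver whether the owning Elasticsearch still exists.
	err := c.Get(ctx, ownerKey, owner)
	if err == nil {
		return nil // owner still exists, nothing to garbage collect
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	// The owner appears deleted -- but with a stale apiserver this may be an outdated view.
	for i := range softOwnedSecrets {
		// Unconditional delete: nothing ties the request to the exact secret object
		// the operator observed, so a secret recreated for the new cluster is removed too.
		if err := c.Delete(ctx, &softOwnedSecrets[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```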

Environment

  • ECK version:

    660bc92

  • Kubernetes information:


    • On premise: v1.18.9
  • Logs:
    After recreating the elasticsearch, we see the following logs showing that the operator is garbage collecting the secrets, which is not expected:

{"log.level":"info","@timestamp":"2022-01-14T01:25:00.681Z","log.logger":"generic-reconciler","message":"Garbage collecting secret","service.version":"2.0.0-SNAPSHOT+3cba748e","service.type":"eck","ecs.version":"1.4.0","namespace":"default","secret_name":"elasticsearch-cluster-es-transport-certs-public","owner_name":"elasticsearch-cluster","owner_kind":"Elasticsearch"}
...
{"log.level":"info","@timestamp":"2022-01-14T01:25:00.686Z","log.logger":"generic-reconciler","message":"Garbage collecting secret","service.version":"2.0.0-SNAPSHOT+3cba748e","service.type":"eck","ecs.version":"1.4.0","namespace":"default","secret_name":"elasticsearch-cluster-es-http-certs-public","owner_name":"elasticsearch-cluster","owner_kind":"Elasticsearch"}
...
{"log.level":"info","@timestamp":"2022-01-14T01:25:00.690Z","log.logger":"generic-reconciler","message":"Garbage collecting secret","service.version":"2.0.0-SNAPSHOT+3cba748e","service.type":"eck","ecs.version":"1.4.0","namespace":"default","secret_name":"elasticsearch-cluster-es-elastic-user","owner_name":"elasticsearch-cluster","owner_kind":"Elasticsearch"}
  • Additional information:
    The staleness problem can be alleviated by setting the UID of the secret object as a precondition when deleting the objects in GarbageCollectSoftOwnedSecrets. That way, if the secret objects observed by the operator are also stale, k8s will find that the precondition is not satisfied and refuse the deletion (see the sketch below).
    We are willing to send a PR to help fix this problem.
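
A minimal sketch of what the proposed UID precondition could look like using controller-runtime's client.Preconditions delete option. The helper name deleteSoftOwnedSecret is illustrative, not ECK's actual code.

```go
// Minimal sketch of the proposed fix -- not ECK's actual code. A controller-runtime
// client is assumed; deleteSoftOwnedSecret is an illustrative helper name.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deleteSoftOwnedSecret(ctx context.Context, c client.Client, observed *corev1.Secret) error {
	// Only delete if the live object's UID still matches the UID the operator observed.
	err := c.Delete(ctx, observed, client.Preconditions{UID: &observed.UID})
	switch {
	case apierrors.IsNotFound(err):
		return nil // already gone
	case apierrors.IsConflict(err):
		// Precondition failed: the live secret has a different UID, i.e. the secret the
		// operator observed was stale. The in-use secret is left untouched.
		return nil
	default:
		return err
	}
}
```

With the precondition, a delete issued against a stale secret UID is rejected with a Conflict by the apiserver, so a secret recreated for the new cluster keeps its new UID and survives.
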
@botelastic botelastic bot added the triage label Jan 14, 2022
@pebrc
Collaborator

pebrc commented Jan 17, 2022

Thanks for the bug report. It definitely sounds like a reasonable proposal to use a precondition when deleting the secrets. I wonder, however, if it will cover all edge cases of the scenario you are describing. Your description implies inconsistent results across the different resources, where the secrets already reflect the new state (or an even older state) while the Elasticsearch resource is lagging behind and is still shown as deleted. Of these two possible causes (secrets in the cache with the old UID and secrets in the cache with the new UID), only the first one would be caught by the precondition, IIUC.

@pebrc pebrc added the >bug Something isn't working label Jan 17, 2022
@botelastic botelastic bot removed the triage label Jan 17, 2022
@srteam2020
Contributor Author

@pebrc Thanks for the reply!

Of these two possible causes (secrets in the cache with the old UID and secrets in the cache with the new UID) only the first one would be caught by the precondition IIUC.

This is a great point. Yes, using the UID can only help the case where the secrets seen by the operator are also stale, as k8s can reject the stale UID. If the secrets seen by the operator are fresh, a precondition with the fresh UID cannot prevent the deletion. Unfortunately, I have not found an easy-to-implement solution to prevent the second case for the elastic operator.

Please let me know if you have any thoughts on how to prevent the second case. I can open a PR to use the precondition so that at least the first case is handled.
