Elasticsearch controller: fix panic and dropped error result during node shutdown #7875
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We were ignoring an error during a node shutdown initialisation and panicking on a subsequently uninitialised map.
Technically it should be enough to handle the error but I chose to also intialise the map by default.
For future reference, this was discovered through an e2e test (props to @thbkrkr for the throrough analysis).
We discovered that a previous shutdown record was still present. This should not happen as we clear shutdown records before proceeding to the next node. We then found that the operator had been restarted twice once in the middle to the upgrade process. Which lead to the discovery of the following panic:
If a shutdown request was made and the local state about the number of shutdown requests was not intialised correctly another shutdown request can be made. Updating the state with the outcome of the request could then lead to the observed panic.
We are lackign a bit of test coverage in this area,
so I am probably going to push a unit test once I find a the right scope.I don't have the capacity right now to improve the coverage. We relied on e2e test coverage for this part of the code in the past and one could say that the e2e test found the bug eventually.We should however alert on panics in the e2e cluster.