Elasticsearch controller: fix panic and dropped error result during node shutdown #7875

pebrc · 2024-06-07T10:11:35Z

We were ignoring an error during node shutdown initialisation and then panicking on a map that was left uninitialised as a result.

Technically it should be enough to handle the error, but I chose to also initialise the map by default.

For future reference, this was discovered through an e2e test (props to @thbkrkr for the thorough analysis).

[
	expected less than 1, got 2, restarts: [
		{o44cyWaAQQuzmvh3g_iOzQ RESTART 187844 1717724317347 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}} 
		{LSN_PaSuRoOIsQF3QKyx3g RESTART 187995 1717724469546 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
]

We discovered that a previous shutdown record was still present. This should not happen, as we clear shutdown records before proceeding to the next node. We then found that the operator had been restarted twice, once in the middle of the upgrade process, which led to the discovery of the following panic:

If a shutdown request was made and the local state tracking the number of shutdown requests was not initialised correctly, another shutdown request could be made. Updating the state with the outcome of that request could then lead to the observed panic.

We are lacking a bit of test coverage in this area, ~~so I am probably going to push a unit test once I find the right scope.~~ I don't have the capacity right now to improve the coverage. We relied on e2e test coverage for this part of the code in the past, and one could say that the e2e test found the bug eventually.

We should however alert on panics in the e2e cluster.

thbkrkr

LGTM

thbkrkr · 2024-06-07T12:33:02Z

buildkite test this -f p=gke,s=7.17.8,t=TestMutationAndReversal

Fix panic and dropped error result

46f1a25

pebrc added the >bug Something isn't working label Jun 7, 2024

thbkrkr approved these changes Jun 7, 2024

Improve error message in e2e test

62b188c

pebrc mentioned this pull request Jun 7, 2024

Alert on operator panics during e2e tests #7876

Open

pebrc merged commit 33546a3 into elastic:main Jun 7, 2024
5 checks passed

barkbay added the v2.14.0 label Jul 25, 2024

barkbay changed the title ~~Fix panic and dropped error result~~ Elasticsearch controller: fix panic and dropped error result during node shutdown Jul 25, 2024
