Handle empty NODE_ID in Elasticsearch PreStop hook #7892

BobVanB · 2024-06-11T11:36:08Z

The main problem

When upgrading an image with some other plugin, the operator terminates each pod and tries to remove it from the ES cluster.

This piece of code can produce an empty NODE_ID:

NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')

Result

There is no NODE_ID, so the request URL has an empty path segment between _nodes and shutdown:

{"@timestamp": "2024-06-11T09:37:05+00:00", "message": "400 http://<cluster>-es-internal-http.<namespace>.svc:9200/_nodes//shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}

The pod that is restarting fails each time with error_exit "failed to call node shutdown API", and the shutdown is never requested. As a result, the same pod is recreated and the whole sequence starts over from the top.
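The failure above could be avoided by checking NODE_ID before the shutdown URL is built. The following is a minimal sketch, not the code from this PR: the function names extract_node_id and require_node_id are hypothetical, and it swaps the original grep/cut pipeline for an exact awk match on the name column, assuming the response file holds "id name" lines as requested with h=id,name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the node ID for a pod name, reading a file that
# holds the "id name" lines returned by _cat/nodes?full_id=true&h=id,name.
extract_node_id() {
  local pod_name=$1 resp_body=$2
  # Compare the name column exactly instead of matching a substring.
  awk -v name="$pod_name" '$2 == name { print $1; exit }' "$resp_body"
}

# Hypothetical guard: return non-zero when no ID is found, so the caller can
# stop before a URL like /_nodes//shutdown is ever requested.
require_node_id() {
  local node_id
  node_id=$(extract_node_id "$1" "$2")
  if [ -z "$node_id" ]; then
    echo "failed to retrieve node ID for $1" >&2
    return 1
  fi
  printf '%s\n' "$node_id"
}
```

A caller could then write something like NODE_ID=$(require_node_id "$POD_NAME" "$resp_body") and handle the non-zero status explicitly instead of continuing with an empty value.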

What I still want to know

Is the node removed before calling _cat/nodes?

When the node is terminated and pre-stop-hook-script.sh is called, is it possible that the node has already been removed from the _cat/nodes result? Or is it possible that the query ends up on the terminating node and returns no result?

This piece of code returns the list of nodes, and I wonder whether, once the pod is terminating, its node is already absent from this list of active nodes. I have no basis for this claim yet: I have not confirmed whether NODE_ID is empty because the other nodes in the cluster no longer see the terminated node.

request -X GET "${ES_URL}/_cat/nodes?full_id=true&h=id,name"

Why is terminationGracePeriodSeconds way less than the possible script run time?

The default terminationGracePeriodSeconds is 180 seconds.
The script also has two retry 10 calls, which wait 2 ** count seconds between attempts.
This can add up to a lot of wait time:
round 1: 1 second
round 2: 1 second from the previous round + 1 + 2 = 4 seconds
round 3: 4 seconds from the previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds from the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = about 17 minutes

Should terminationGracePeriodSeconds be set to 30 minutes?
Should retry 10 be much lower, something like retry 8,
ending with "retry 8/8 exited 1, no more retries left"?
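One way to sanity-check the estimate above is to add up the sleeps of a single retry call. This is a rough sketch under stated assumptions, not the script's own code: total_backoff is a hypothetical helper, the wait is assumed to double on every retry (1, 2, 4, ... seconds, as the rounds listed above suggest), no sleep is assumed after the final failed attempt, and the time spent in the requests themselves is ignored.

```shell
#!/usr/bin/env bash

# Hypothetical helper: worst-case seconds slept by one "retry N" call,
# assuming the wait before retry k (k = 0, 1, ...) is 2 ** k seconds and
# that nothing is slept after the last failed attempt.
total_backoff() {
  local retries=$1 count=0 total=0
  while [ "$count" -lt $((retries - 1)) ]; do
    total=$((total + 2 ** count))
    count=$((count + 1))
  done
  echo "$total"
}

total_backoff 10   # 1 + 2 + ... + 256 = 511
total_backoff 8    # 1 + 2 + ... + 64  = 127
```

Under these assumptions, two retry 10 calls can sleep for up to 1022 seconds, roughly 17 minutes, which is far more than the 180-second grace period. Two retry 8 calls would sleep for at most 254 seconds, which is still above 180 seconds.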

What has been done

After some debugging and trying to understand the code, I ended up cleaning it up a little and ran shellcheck on it.
I tried not to rewrite it all.
WONT: build a retry loop to get the NODE_ID.
I wanted to know whether this should use retry 3 or just error_exit "failed to retrieve node ID".
After the cleanup, it looks like this was not needed.
Use spaces instead of \t; this keeps the script readable inside the ConfigMap.
Reduce retry to 8.

PoC Result

Added some debug information to prove that the script is working.
Will add that it is not fun to debug the bash script without 'set -x'.

{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving nodes", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "resp_body: /tmp/tmp.k6cZwbNtph", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "NODE_ID: h3WUy....aTV9qjl7w", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "success to retrieve node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving shutdown request", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "check shutdown response", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "delaying termination for 50 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}

What has not been done

...

prepare-fs.sh
readiness-probe-script.sh
suspend.sh

thbkrkr · 2024-06-27T08:08:01Z

buildkite test this

thbkrkr · 2024-06-27T12:03:44Z

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr · 2024-06-27T18:15:13Z

This breaks two e2e tests:

TestMutationResizeMemoryDown/Stopping_to_watch_for_correct_node_shutdown_API_usage
TestMutationResizeMemoryUp/Stopping_to_watch_for_correct_node_shutdown_API_usage

I don't know exactly what's going on yet.
The failure is because during the mutation, /_nodes/shutdown returned more than one entry:

[
	{Q3SwszElnHxaJg RESTART pre-stop hook 1719495839993 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
	{eUcnfdK-Q3SwszElnHxaJg RESTART 70382 1719495839494 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
]

thbkrkr · 2024-06-27T18:48:53Z

The pre-stop hook incorrectly extracted the node id, which created 2 shutdown records with different ids (Q3SwszElnHxaJg and eUcnfdK-Q3SwszElnHxaJg).

pkg/controller/elasticsearch/nodespec/lifecycle_hook.go

Allow the '-' character.
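The thread does not show the pattern that was changed in lifecycle_hook.go, so the following is only an illustration of how a character class without '-' could yield the truncated ID seen above. The function names and the pod name es-default-0 are hypothetical; the node ID is the one from the failing test output. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/usr/bin/env bash

# Sample line in the "id name" format of _cat/nodes?h=id,name.
# The pod name is a made-up placeholder.
line='eUcnfdK-Q3SwszElnHxaJg es-default-0'

# Hypothetical pattern without '-' in the character class: the match can
# only start after the hyphen, so the ID is truncated.
extract_id_without_hyphen() {
  printf '%s\n' "$1" | grep -oE "[a-zA-Z0-9_]+ $2" | cut -f 1 -d ' '
}

# Hypothetical pattern with '-' allowed: the whole ID is matched.
extract_id_with_hyphen() {
  printf '%s\n' "$1" | grep -oE "[a-zA-Z0-9_-]+ $2" | cut -f 1 -d ' '
}

extract_id_without_hyphen "$line" es-default-0   # prints Q3SwszElnHxaJg
extract_id_with_hyphen "$line" es-default-0      # prints eUcnfdK-Q3SwszElnHxaJg
```

Registering a shutdown under the truncated ID while another record exists under the full ID is consistent with the two entries that /_nodes/shutdown returned in the failing tests.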

thbkrkr · 2024-06-27T19:23:23Z

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

BobVanB · 2024-06-29T12:13:41Z

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr · 2024-06-29T13:07:04Z

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr · 2024-07-01T08:33:26Z

Thank you @BobVanB!

pkg/controller/elasticsearch/nodespec/lifecycle_hook.go

This should be quoted.

thbkrkr · 2024-07-03T20:40:43Z

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr · 2024-07-04T09:17:25Z

buildkite test this

thbkrkr · 2024-07-04T11:40:52Z

Thank you very much for the contribution and for your patience @BobVanB.

botelastic bot added the triage label Jun 11, 2024

BobVanB changed the title ~~fix: cleanup pre-stop-hook-script.sh and wait for NODE_ID~~ fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Jun 11, 2024

BobVanB changed the title ~~fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty~~ WIP: fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Jun 11, 2024

BobVanB changed the title ~~WIP: fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty~~ fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Jun 12, 2024

BobVanB added 12 commits June 12, 2024 11:44

fix: cleanup pre-stop-hook-script.sh

482f23b

fix: update the exit default from delayed_exit

fb62e1d

fix: use spaces inside shellscript

9561a9f

fix: use spaces inside shellscript

1442ef1

fix: cleanup grep for NODE_ID

18f1b87

fix: spaces before param

46d08b3

fix: update retry to 7

8f53e76

fix: temp return expanded param for basic auth (wrong)

c3a8fd9

fix: update retry to 8

99a6d51

fix: add working version with alot of debug info

9e22058

fix: cleanup debug code

452fb48

fix: remove comments

8f1cea9

BobVanB force-pushed the pre-stop-hook-script branch from 57ddcbe to 8f1cea9 Compare June 12, 2024 09:44

BobVanB and others added 12 commits June 12, 2024 11:54

fix: update retry to 8

0e8c5c4

fix; update set flags

6ee066b

fix: remove comments

2954da3

fix: use simpel definition for argument parameter

d8db45d

Merge branch 'main' into pre-stop-hook-script

ecf5746

Merge branch 'main' into pre-stop-hook-script

48e005c

fix: correct BASIC_AUTH array

03d6bc0

fix: remove shellcheck disabled comment

cf587e6

fix: prevent globbing

97f12ed

fix: globbing message

cb2bf9e

fix: smaller footprint

29234cd

fix: smaller footprint

a74bc1d

pebrc added the >enhancement Enhancement of existing functionality label Jun 17, 2024

BobVanB and others added 2 commits June 27, 2024 08:36

Update pkg/controller/elasticsearch/nodespec/lifecycle_hook.go

2e6ea64

Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>

Update pkg/controller/elasticsearch/nodespec/lifecycle_hook.go

b85c7e8

You got me. Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>

thbkrkr added the v2.14.0 label Jun 27, 2024

thbkrkr reviewed Jun 27, 2024

pkg/controller/elasticsearch/nodespec/lifecycle_hook.go (outdated)

Update pkg/controller/elasticsearch/nodespec/lifecycle_hook.go

a56d570

Allow the '-' character.

BobVanB and others added 2 commits June 28, 2024 06:27

Merge branch 'main' into pre-stop-hook-script

4cc0bd0

fix: return grep to original

5f955e6

thbkrkr approved these changes Jul 1, 2024

pebrc approved these changes Jul 1, 2024

barkbay reviewed Jul 1, 2024

pkg/controller/elasticsearch/nodespec/lifecycle_hook.go (outdated)

BobVanB commented Jul 2, 2024

pkg/controller/elasticsearch/nodespec/lifecycle_hook.go (outdated)

BobVanB and others added 3 commits July 2, 2024 18:04

Update pkg/controller/elasticsearch/nodespec/lifecycle_hook.go

e7c69da

This should be quoted.

Merge branch 'main' into pre-stop-hook-script

94ca9f9

Variabilize and comment magic number

3285537

Merge branch 'main' into pre-stop-hook-script

953cf00

thbkrkr merged commit b6c77b6 into elastic:main Jul 4, 2024
5 checks passed

barkbay changed the title ~~fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty~~ Handle empty NODE_ID in Elasticsearch PreStop hook Jul 25, 2024

barkbay added >bug Something isn't working and removed >enhancement Enhancement of existing functionality labels Jul 25, 2024
