Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle empty NODE_ID in Elasticsearch PreStop hook #7892

Merged
merged 34 commits into from
Jul 4, 2024

Conversation

BobVanB
Copy link
Contributor

@BobVanB BobVanB commented Jun 11, 2024

The main problem

When upgrading a image with some other plugin, the operator will terminate each pod and try to remove it from the ES-cluster.

This piece of code can be empty:

NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')

Result

  • There is no NODE_ID and the request is broken between _nodes and shutdown
    {"@timestamp": "2024-06-11T09:37:05+00:00", "message": "400 http://<cluster>-es-internal-http.<namespace>.svc:9200/_nodes//shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
    
  • The pod that is restarting will crash each time with error_exit "failed to call node shutdown API" and a shutdown is never called. Thus resulting in recreating the same pod again and starting allover from the top.

What i still want to know

Is the node removed before calling _cat/nodes

When the node is terminated and the pre-stop-hook-script.sh is called, is it possible that the node is already removed from the _cat/nodes query? Or is it possible that the query ends op on the terminated node and doesn't give a result.

This piece of code returns the list of nodes and i wonder if the pod is terminated the node is actually already not present in this list from active nodes. Still no basis for this claim, but i have not confirmed if the NODE_ID is empty because the other nodes in the cluster don't see the node that is terminated.

request -X GET "${ES_URL}/_cat/nodes?full_id=true&h=id,name"

Why is terminationGracePeriodSeconds way less then possible script run time?

The default terminationGracePeriodSeconds is 180 seconds.
The scripts has also has 2 retry 10 calls, witch has count ** 2 as wait.
This can result in alot of wait time:
round 1: 1 second
round 2: 1 second of previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = +- 17 minutes

  1. should the terminationGracePeriodSeconds set to 30 minutes?
  2. should the retry 10 be way less, something like retry 8
    and get "retry 8/8 exited 1, no more retries left"

What has been done

  • After some debugging and trying to understand the code, i ended up cleaning it up a little and used shellcheck.
    I tried to not rewrite it all
  • WONT: build a retry loop to get the NODE_ID
    Want to know if this should use a retry 3 or just error_exit "failed to retrieve node ID"
    After cleanup, looks like this was not needed.
  • Use spaces instead of \t, this will ensure a readability inside the configmap.
  • Retry to 8

PoC Result

Added some debug information to prove that the script is working.
Will add that it is not fun to debug the bash script without 'set -x'.

{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving nodes", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "resp_body: /tmp/tmp.k6cZwbNtph", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "NODE_ID: h3WUy....aTV9qjl7w", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "success to retrieve node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving shutdown request", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "check shutdown response", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "delaying termination for 50 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}

What has not been done

...

  • prepare-fs.sh
  • readiness-probe-script.sh
  • suspend.sh

@botelastic botelastic bot added the triage label Jun 11, 2024
@BobVanB BobVanB changed the title fix: cleanup pre-stop-hook-script.sh and wait for NODE_ID fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Jun 11, 2024
@BobVanB BobVanB changed the title fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty WIP: fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Jun 11, 2024
@BobVanB BobVanB changed the title WIP: fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Jun 12, 2024
@pebrc pebrc added the >enhancement Enhancement of existing functionality label Jun 17, 2024
BobVanB and others added 2 commits June 27, 2024 08:36
Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>
You got me.

Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>
@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 27, 2024

buildkite test this

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 27, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 27, 2024

This breaks two e2e tests:

TestMutationResizeMemoryDown/Stopping_to_watch_for_correct_node_shutdown_API_usage
TestMutationResizeMemoryUp/Stopping_to_watch_for_correct_node_shutdown_API_usage

I don't know exactly what's going on yet.
The failure is because during the mutation, /_nodes/shutdown returned more than one entry:

[
	{Q3SwszElnHxaJg RESTART pre-stop hook 1719495839993 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
	{eUcnfdK-Q3SwszElnHxaJg RESTART 70382 1719495839494 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
]

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 27, 2024

The pre-stop hook incorrectly extracted the node id, which created 2 shutdown records with different ids (Q3SwszElnHxaJg and eUcnfdK-Q3SwszElnHxaJg).

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 27, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

@BobVanB
Copy link
Contributor Author

BobVanB commented Jun 29, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

1 similar comment
@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 29, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

@thbkrkr
Copy link
Contributor

thbkrkr commented Jul 1, 2024

Thank you @BobVanB!

@thbkrkr
Copy link
Contributor

thbkrkr commented Jul 3, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

@thbkrkr
Copy link
Contributor

thbkrkr commented Jul 4, 2024

buildkite test this

@thbkrkr thbkrkr merged commit b6c77b6 into elastic:main Jul 4, 2024
5 checks passed
@thbkrkr
Copy link
Contributor

thbkrkr commented Jul 4, 2024

Thank you very much for the contribution and for your patience @BobVanB.

@barkbay barkbay changed the title fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty Handle empty NODE_ID in Elasticsearch PreStop hook Jul 25, 2024
@barkbay barkbay added >bug Something isn't working and removed >enhancement Enhancement of existing functionality labels Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug Something isn't working v2.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants