Use Elasticsearch readiness port #7847

Merged: 24 commits merged into elastic:main on Jun 4, 2024

Conversation

@pebrc commented May 24, 2024

Fixes #7841

As of 8.2.0, use the new readiness.port setting to enable a TCP readiness check that is sensitive to the cluster membership of the node. This should improve cluster availability during external disruptions, e.g. due to node upgrades, as the PDB will disallow further upgrades until the most recently upgraded node has rejoined the cluster.
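
For illustration, a minimal sketch of the two pieces this refers to. The port number, constants and function names are assumptions for this sketch, not the operator's actual values; only the script path is taken from the rendered Pod spec further down in this conversation.

```go
// Sketch only: enable Elasticsearch's readiness TCP listener and point the
// Pod readiness probe at the script that checks it. Port 8080 is an assumption.
package sketch

import corev1 "k8s.io/api/core/v1"

// readinessPort is set via `readiness.port` in elasticsearch.yml; as of 8.2.0
// Elasticsearch only accepts TCP connections on it while the node is part of
// the cluster and not shutting down.
const readinessPort = 8080

func readinessSettings() map[string]interface{} {
	return map[string]interface{}{"readiness.port": readinessPort}
}

func readinessProbe() corev1.Probe {
	return corev1.Probe{
		FailureThreshold: 3,
		PeriodSeconds:    5,
		TimeoutSeconds:   3,
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"bash", "-c", "/mnt/elastic-internal/scripts/readiness-port-script.sh"},
			},
		},
	}
}
```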

I chose to create a separate file mounted next to the existing script to facilitate upgrades. Specifically, I wanted to avoid a situation where, during the upgrade, we update the scripts ConfigMap, overwriting the old probe script with the new one and thereby breaking the not-yet-upgraded nodes in the cluster.

The approach taken here creates some technical debt in the form of the extra script hanging around when not needed. But I imagine we can drop it once we stop supporting ES < 8.2.0.

An alternative approach would be to integrate a version check into the existing script, which seemed more complicated to reason about to me.

Marked as a bug because a cluster might become unavailable if master nodes are deleted while the previously deleted ones are not yet back in the cluster, which violates our promise of interruption-free ES operations when running in HA mode with multiple master nodes.

@pebrc added the >bug (Something isn't working) label May 24, 2024
pebrc commented May 24, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=8.1.3,s=8.13.2

@barkbay left a comment

> I chose to create a separate file mounted next to the existing script to facilitate upgrades. Specifically, I wanted to avoid a situation where, during the upgrade, we update the scripts ConfigMap, overwriting the old probe script with the new one and thereby breaking the not-yet-upgraded nodes in the cluster.

👍

LGTM!

@thbkrkr left a comment

LGTM

thbkrkr commented May 24, 2024

buildkite test this p=gke,s=8.1.3,t=TestVersionUpgradeToLatest8x

Edit: Oops, I didn't see you had already done it in the long version via #7847 (comment).

"command": [
"bash",
"-c",
"/mnt/elastic-internal/scripts/pre-stop-hook-script.sh"
@pebrc: pre 8.2.0

"command": [
"bash",
"-c",
"/mnt/elastic-internal/scripts/readiness-port-script.sh"
@pebrc: New script, post 8.2.0
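
For readers skimming the diff, a rough sketch of the version gate behind these two snippets. Names are illustrative and the legacy script path is assumed, not taken from this PR; blang/semver is used only to keep the sketch self-contained.

```go
// Sketch: pick the probe script per node version so that both scripts can
// stay mounted during a rolling upgrade.
package sketch

import "github.com/blang/semver/v4"

var readinessPortMinVersion = semver.MustParse("8.2.0")

func readinessProbeScript(v semver.Version) string {
	if v.GTE(readinessPortMinVersion) {
		return "/mnt/elastic-internal/scripts/readiness-port-script.sh"
	}
	// Assumed name of the pre-8.2.0 probe script.
	return "/mnt/elastic-internal/scripts/readiness-probe-script.sh"
}
```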

pebrc commented May 25, 2024

buildkite test this p=gke,s=8.1.3,t=TestVersionUpgradeToLatest8x

pebrc commented May 25, 2024

buildkite test this -f p=kind,s=8.1.3 -m t=TestVersionUpgradeSingleToLatest8x,t=TestVersionUpgradeTwoNodesToLatest8x,t=TestExternalESStackMonitoring,t=TestForceUpgradePendingPodsInOneStatefulSet,t=TestKillSingleNodeReusePV,t=TestPodTemplateValidation,t=TestRedClusterCanBeModifiedByDisablingPredicate,t=TestStackConfigPolicy,t=TestVolumeRetention

pebrc commented May 25, 2024

There is an edge case with single-node clusters that I did not consider:

  • the readiness.port reports an ES node as not ready if a shutdown record exists for that node
  • an unready node is not listed in the (internal) k8s service we use and is assumed not reachable
  • this in turn means the operator cannot call the DELETE _nodes/$NODE_ID/shutdown API to remove the shutdown record, so the cluster remains unready forever

The problem does not present itself in the same way with multi-node clusters, where at least one node is available at all times, but the edge case applies to them as well if, for example, all nodes are deleted at the same time or some other external factor intervenes.
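
To make the deadlock concrete, a sketch of the cleanup request the operator needs to send. The HTTP client wiring and URL construction are illustrative; only the API path comes from the comment above.

```go
// Sketch: remove a node's shutdown record. If the unready node has been
// dropped from the internal Service, there is no address left to send this
// to, so the record stays and the node never reports ready again.
package sketch

import (
	"context"
	"fmt"
	"net/http"
)

func deleteShutdownRecord(ctx context.Context, c *http.Client, baseURL, nodeID string) error {
	url := fmt.Sprintf("%s/_nodes/%s/shutdown", baseURL, nodeID)
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("deleting shutdown record for %s: unexpected status %d", nodeID, resp.StatusCode)
	}
	return nil
}
```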

@pebrc marked this pull request as draft May 25, 2024 15:20
pebrc commented May 25, 2024

buildkite test this -f p=kind,E2E_TAGS=es -m s=8.1.3,s=8.13.2

pebrc commented May 27, 2024

buildkite test this -f p=kind,E2E_TAGS=es -m s=8.1.3,s=8.13.2

pebrc commented May 27, 2024

I am thinking about the different options to address the problem:

  1. Consider this a bug in the Elasticsearch implementation of the readiness probe? (I am leaning towards no, as the contract for the shutdown API is indeed that the orchestrator cleans up the shutdown records.)
  2. Allow unready Pods in the service (maybe only as of ES >= 8.2.0). This is what I have been testing in 2358947 and the following commits. While this addresses the issue at hand, it also introduces extra errors when the operator tries to connect to a node that is not ready and is in fact not serving requests.
  3. Rewrite the Elasticsearch client logic to not go through the k8s service but instead pick a Pod at random from a priority list of all ready Pods followed by all running (but not ready) Pods (see the sketch below).
  4. A variant of 3 that still goes through the service but uses a different label that the operator puts onto the Pods when they become ready to serve but are not necessarily ready in the k8s sense (Njal's idea from 2019). I don't like that too much due to the extra API requests required for labelling the Pods, and I prefer 3.

Note that all of the options only affect the internal service; the external service used by users and clients should still be based on the readiness of the Pods.
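
A rough sketch of what option 3 could look like; names and wiring are illustrative, not the code in this PR.

```go
// Sketch: prefer ready Pods, fall back to running-but-not-ready Pods, and
// pick at random within the chosen tier.
package sketch

import (
	"errors"
	"math/rand"

	corev1 "k8s.io/api/core/v1"
)

func pickPod(pods []corev1.Pod) (corev1.Pod, error) {
	var ready, running []corev1.Pod
	for _, p := range pods {
		switch {
		case isReady(p):
			ready = append(ready, p)
		case p.Status.Phase == corev1.PodRunning:
			running = append(running, p)
		}
	}
	for _, tier := range [][]corev1.Pod{ready, running} {
		if len(tier) > 0 {
			return tier[rand.Intn(len(tier))], nil
		}
	}
	return corev1.Pod{}, errors.New("no usable Elasticsearch Pod")
}

func isReady(p corev1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```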

@barkbay @thbkrkr given that you have reviewed this PR, I would be curious about your thoughts.

barkbay commented May 27, 2024

Could we rely on the service unless there is no ready endpoint, in which case we would pick a random Pod? (Mostly to be conservative and only use the new logic for some edge cases like single-node clusters, all Pods deleted at the same time, ...)
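
A sketch of that conservative variant; the inputs (whether the internal Service currently has ready endpoints, and the list of running Pods) are assumed helpers, not actual operator code.

```go
// Sketch: keep the internal Service as the default target and only dial a
// randomly chosen running Pod when the Service has no ready endpoints
// (single node restarting, all Pods deleted at once, ...).
package sketch

import (
	"fmt"
	"math/rand"

	corev1 "k8s.io/api/core/v1"
)

func elasticsearchURL(serviceURL string, hasReadyEndpoints bool, runningPods []corev1.Pod) string {
	if hasReadyEndpoints || len(runningPods) == 0 {
		return serviceURL
	}
	p := runningPods[rand.Intn(len(runningPods))]
	return fmt.Sprintf("https://%s:9200", p.Status.PodIP)
}
```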

@pebrc added the v2.14.0 label May 27, 2024
pebrc commented May 29, 2024

buildkite test this -f p=kind,E2E_TAGS=es -m s=8.1.3,s=8.13.2

pebrc commented May 29, 2024

The test run from the comment above completed successfully in 2h instead of the usual 4h, which is suspicious. I need to take a closer look to see whether this is a bug or whether the fact that we optimistically make connections to non-ready Pods speeds up the tests that much.

@pebrc marked this pull request as ready for review May 29, 2024 20:32
pebrc commented May 29, 2024

buildkite test this -f p=gke,TESTS_MATCH=* -m s=8.1.3,s=8.13.2

pebrc commented May 29, 2024

buildkite test this -f p=gke,t=Test -m s=8.1.3,s=8.13.2

@pebrc requested review from thbkrkr and barkbay May 30, 2024 05:05
thbkrkr commented May 30, 2024

It is indeed faster.

# gke from cloud-on-k8s-operator-nightly/builds/512
> awk '{print $2, $1}' prev | grep -v "ms " | grep "^[[0-9]*m" | sort -rn | head -10
20m18.72s TestSamples
13m51.4s TestUpdateESSecureSettings
12m34.79s TestFleetMode
10m43.55s TestVersionUpgradeOrdering
9m59.37s TestMutationAndReversal
9m26.65s TestMutationWithLargerMaxUnavailable
9m14.36s TestMutationHTTPToHTTPS
8m21.28s TestMutationResizeMemoryDown
8m20.25s TestFleetAPMIntegrationRecipe
8m19.11s TestMutationResizeMemoryUp

# gke-8-13-2 from cloud-on-k8s-operator/builds/8587
> awk '{print $2, $1}' new | grep -v "ms " | grep "^[[0-9]*m" | sort -rn | head -10
17m23.71s TestSamples
10m22.91s TestFleetMode
8m27.45s TestUpdateESSecureSettings
7m55.65s TestAutoscaling
7m37.89s TestVersionUpgradeOrdering
6m7.49s TestMutationHTTPToHTTPS
6m51.17s TestMetricbeatStackMonitoringRecipe
6m5.69s TestExternalESStackMonitoring
6m20.72s TestFleetAgentWithoutTLS
5m54.3s TestGlobalCA

@barkbay left a comment

LGTM

Review threads (outdated, resolved) on:
pkg/utils/k8s/k8sutils.go
pkg/controller/elasticsearch/client/url.go
pkg/controller/elasticsearch/client/url.go
pebrc commented Jun 4, 2024

I tried to compare memory usage against a baseline from the last nightly run:

[screenshot: operator memory usage, nightly baseline]

with one of the longer runs triggered from this PR (#7847 (comment)):

[screenshot: operator memory usage, this PR's test run]

I did not spot any significant change, but the comparison leaves much to be desired.

@pebrc enabled auto-merge (squash) June 4, 2024 18:15
@pebrc merged commit d1a0de6 into elastic:main Jun 4, 2024
5 checks passed
@barkbay changed the title from "Use readiness port" to "Use Elasticearch readiness port" Jul 25, 2024
@barkbay changed the title from "Use Elasticearch readiness port" to "Use Elasticsearch readiness port" Jul 25, 2024
Labels
>bug Something isn't working · v2.14.0

Successfully merging this pull request may close these issues:

Use Elasticearch readiness port instead of / for Pod readiness (#7841)

3 participants