Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to allow etcd-druid to patch STS when there is only a change of mount-path even if all pods are not up and running #915

Merged
merged 8 commits into from
Nov 12, 2024

Conversation

unmarshall
Copy link
Contributor

@unmarshall unmarshall commented Nov 12, 2024

How to categorize this PR?

/area control-plane
/kind bug

What this PR does / why we need it:
The issue is described well in #908. This PR allows etcd-druid to patch the StatefulSet allowing the unhealthy pod to come up eventually if what has changed is only TLS volume mount path.

NOTE: If the TLS has been enabled for peer URL and is still not been reflected in StatefulSet then to ensure that the cluster comes up post update of StatefulSet, it is essential that all members be up and running in a 3 member cluster. Therefore that restriction will still be in place. What this PR does is relax this restriction in case only the mount paths have changed.
v0.23.x changes the mount paths for TLS volumes.

Additionally it also now adds an additional label to the etcd client and peer service to ensure that only statefulset pods are selected by it.

Which issue(s) this PR fixes:
Fixes #908 and #914

Release note:

Fixes etcd client and peer service label selector, ensuring that only Etcd statefulset pods are selected.
etcd controller now differentiates between TLS configuration change and peer TLS enablement. Only if peer TLS has been enabled and not yet reflected it will wait for all pods to come up else it will allow patching of statefulset.

@unmarshall unmarshall requested a review from a team as a code owner November 12, 2024 07:34
@gardener-robot gardener-robot added area/control-plane Control plane related kind/bug Bug needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Nov 12, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 12, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 12, 2024
@ishan16696
Copy link
Member

/assign @ishan16696 @seshachalam-yv

@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 12, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 12, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 12, 2024
@unmarshall
Copy link
Contributor Author

/retest

@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 12, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 12, 2024
@unmarshall
Copy link
Contributor Author

/retest

@ishan16696
Copy link
Member

I have mentioned the steps to reproduce the bug and to test the fix : #908 (comment)

@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 12, 2024
Copy link
Contributor

@seshachalam-yv seshachalam-yv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review labels Nov 12, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 12, 2024
@ishan16696
Copy link
Member

ishan16696 commented Nov 12, 2024

I have also tested the compaction job fix:

Before the fix etcd compaction pod is also added to endpointslices due to label selector of svc.

$kubectl describe endpointslices.discovery.k8s.io etcd-main-client-qght6
Name:         etcd-main-client-qght6
....
Endpoints:
  - Addresses:  10.244.0.19
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/etcd-main-0
    NodeName:   kind-control-plane
    Zone:       <unset>
  - Addresses:  10.244.0.20
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/etcd-main-compactor-stvkj
    NodeName:   kind-control-plane
    Zone:       <unset>
Events:         <none>

$ kubectl describe endpointslices.discovery.k8s.io etcd-main-peer-cbc4h

Name:         etcd-main-peer-cbc4h
...
Endpoints:
  - Addresses:  10.244.0.19
    Conditions:
      Ready:    true
    Hostname:   etcd-main-0
    TargetRef:  Pod/etcd-main-0
    NodeName:   kind-control-plane
    Zone:       <unset>
  - Addresses:  10.244.0.20
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/etcd-main-compactor-stvkj
    NodeName:   kind-control-plane
    Zone:       <unset>
Events:         <none>

Then I deployed druid with this PR image and trigger the compaction pod, etcd compaction pod is not added to endpointSlices as now services are selecting the only etcd statefulset pods.

NAME                          READY   STATUS    RESTARTS   AGE
etcd-main-0                   2/2     Running   0          9m16s
etcd-main-compactor-88zsb     1/1     Running   0          90s

k describe endpointslices.discovery.k8s.io etcd-main-client-qght6
Name:         etcd-main-client-qght6
...
Endpoints:
  - Addresses:  10.244.0.19
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/etcd-main-0
    NodeName:   kind-control-plane
    Zone:       <unset>
Events:         <none>

k describe endpointslices.discovery.k8s.io etcd-main-peer-cbc4h
Name:         etcd-main-peer-cbc4h
...
Endpoints:
  - Addresses:  10.244.0.19
    Conditions:
      Ready:    true
    Hostname:   etcd-main-0
    TargetRef:  Pod/etcd-main-0
    NodeName:   kind-control-plane
    Zone:       <unset>
Events:         <none>

Copy link
Member

@ishan16696 ishan16696 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

@unmarshall
Copy link
Contributor Author

/test pull-etcd-druid-e2e-kind-nondistroless-etcd

@unmarshall unmarshall merged commit 1af5ea3 into gardener:master Nov 12, 2024
11 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 12, 2024
unmarshall added a commit to unmarshall/etcd-druid that referenced this pull request Nov 12, 2024
…ount-path even if all pods are not up and running (gardener#915)
unmarshall added a commit that referenced this pull request Nov 12, 2024
…ount-path even if all pods are not up and running (#915) (#918)
@ishan16696 ishan16696 deleted the cmfix branch November 12, 2024 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related kind/bug Bug needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
7 participants