Modularization & Status Enhancements #188
Conversation
Thanks a lot for the PR @timuthy! Some very minor comments/questions below.
Besides that, I see that you have addressed the transition to the `Unknown` state for the members, but not the further transition from `Unknown` to `NotReady`, and also the separate transition to `NotReady` status for a pod that is not ready. Would you address that in a separate PR or do you think those cases should be addressed differently?
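The `Unknown` to `NotReady` transition discussed here is essentially a function of the heartbeat age. A minimal, hypothetical sketch of such a rule (simplified types and threshold names are assumptions, not the PR's actual code):

```go
package status

import "time"

type memberStatus string

const (
	ready    memberStatus = "Ready"
	unknown  memberStatus = "Unknown"
	notReady memberStatus = "NotReady"
)

// nextStatus assumes LastUpdateTime is used as a heartbeat: after unknownAfter
// without an update the member becomes Unknown, and after notReadyAfter it
// degrades further to NotReady.
func nextStatus(current memberStatus, lastUpdate, now time.Time, unknownAfter, notReadyAfter time.Duration) memberStatus {
	elapsed := now.Sub(lastUpdate)
	switch {
	case elapsed > notReadyAfter:
		return notReady
	case elapsed > unknownAfter:
		return unknown
	default:
		return current
	}
}
```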
Thanks for mentioning it @amshuman-kr. I will update this PR as written in the proposal and as discussed via Slack.
/status author-action
@timuthy The pull request was assigned to you under
Added via 134958d
Thanks for the change @timuthy! It looks good. Do you think adding `status.initialReplicas` and using `max(status.initialReplicas, len(status.Members))` make sense? The maintenance of `status.initialReplicas` could be postponed to later issues where the custodian is enhanced for other reconciliation actions like triggering bootstrapping.
WDYT?
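For illustration, the suggested `max(status.initialReplicas, len(status.Members))` could look roughly like this (field and function names are assumptions for this sketch, not actual API fields):

```go
package status

// effectiveClusterSize derives the cluster size from
// max(status.initialReplicas, len(status.members)) so that an incomplete
// member list does not shrink the basis for the quorum calculation.
func effectiveClusterSize(initialReplicas int32, memberCount int) int32 {
	if int32(memberCount) > initialReplicas {
		return int32(memberCount)
	}
	return initialReplicas
}

// quorum returns the minimum number of healthy members required.
func quorum(clusterSize int32) int32 {
	return clusterSize/2 + 1
}
```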
Remove `status.podRef` which was introduced by an earlier, unreleased commit and use `status.name` instead.
Thanks for the discussion @amshuman-kr. I added the missing pieces in the last 3 commits (separated for easier review). We can squash-merge the PR once we are ready.
Thanks a lot for the changes @timuthy! The changes look good. Just a couple of comments below (the first one is merely FYI).
// Bootstrap is a special case which is handled by the etcd controller.
if !inBootstrap(etcd) && len(members) != 0 {
	etcd.Status.ClusterSize = pointer.Int32Ptr(int32(len(members)))
}
Strictly speaking, this is incorrect. Only the initial bootstrapping is triggered by the etcd controller. Later bootstrapping due to quorum failure is to be done by the custodian controller. But this change could be done in #158.
I added this comment as it reflects the current state.
Tbh, I'm not yet convinced that the Custodian controller should perform self-healing in the form of re-bootstrapping, because the Etcd controller can in the meantime still reconcile `etcd` resources, which can lead to races. In order to avoid a race, you need to introduce some kind of lock, which is also not too nice, IMO.
Maybe I didn't check all cases in detail, but at the moment I think an alternative would be easier to implement:

- Custodian sees that the cluster is not healthy and self-healing is required.
- Custodian triggers a reconciliation for the affected `etcd` resource (see the sketch below).
- Etcd controller kicks in and takes the necessary action to bring the cluster up again (=> desired state). In the meantime, any changes to the very same `etcd` resource cannot lead to a race.

We don't have to decide this in this PR of course, but I still wanted to give an alternative approach to think about.
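To make the alternative concrete, here is a minimal, hypothetical sketch of the trigger step, assuming a recent controller-runtime client and that the etcd controller reacts to the `gardener.cloud/operation=reconcile` annotation (the helper name is made up):

```go
package custodian

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// triggerReconciliation only marks the Etcd resource for reconciliation;
// the etcd controller then performs the actual healing.
func triggerReconciliation(ctx context.Context, c client.Client, etcd client.Object) error {
	patch := client.MergeFrom(etcd.DeepCopyObject().(client.Object))
	annotations := etcd.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations["gardener.cloud/operation"] = "reconcile"
	etcd.SetAnnotations(annotations)
	return c.Patch(ctx, etcd, patch)
}
```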
> In order to avoid a race, you need to introduce some kind of lock which is also not too nice, IMO.

Even if only the etcd controller is responsible for bootstrapping non-quorate clusters, some sort of race avoidance is still required, because the `Etcd` spec might get updated while bootstrapping is still ongoing. In that case, it should either wait until the ongoing bootstrapping has completed before applying the latest spec changes, or abort the current bootstrapping (which might unnecessarily prolong the downtime).

> - Custodian triggers a reconciliation for the affected `etcd` resource.

This cannot be the regular etcd controller reconciliation, which might apply `Etcd` resource spec changes along with triggering bootstrapping. As part of the contract of a gardener extension, etcd-druid must not apply `Etcd` resource spec changes unless the `reconcile` operation annotation is present.
On the other hand, it makes no sense for etcd-druid to wait for the next shoot reconciliation to fix a non-quorate etcd cluster.
But let's take this discussion out of this PR.
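For reference, the contract check itself is small; a hedged sketch (the annotation key is the gardener operation annotation, the helper name is made up):

```go
package controller

// operationAnnotation is the gardener operation annotation key.
const operationAnnotation = "gardener.cloud/operation"

// shouldApplySpecChanges only allows Etcd spec changes to be applied when the
// reconcile operation annotation is present.
func shouldApplySpecChanges(annotations map[string]string) bool {
	return annotations[operationAnnotation] == "reconcile"
}
```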
Thanks @timuthy for the changes. /lgtm
How to categorize this PR?
/area control-plane
/kind enhancement
What this PR does / why we need it:
This PR adds the following status checks for the `etcd` resource:

- whether there are enough `Ready` members in `status.members` to fulfill the quorum,
- whether all members in `status.members` are `Ready`,
- whether each member's `LastUpdateTime`, used as a heartbeat, is within the expected time range (configurable via `--etcd-member-threshold`).
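A simplified sketch of the quorum-based readiness rule described above (types, field names, and the helper are assumptions for illustration, not the PR's implementation):

```go
package status

import "time"

type member struct {
	ready          bool
	lastUpdateTime time.Time
}

// clusterReady counts members that are Ready and whose heartbeat
// (lastUpdateTime) is not older than the member threshold, and compares the
// count against the quorum.
func clusterReady(members []member, clusterSize int, memberThreshold time.Duration, now time.Time) bool {
	quorum := clusterSize/2 + 1
	readyCount := 0
	for _, m := range members {
		if m.ready && now.Sub(m.lastUpdateTime) <= memberThreshold {
			readyCount++
		}
	}
	return readyCount >= quorum
}
```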
Which issue(s) this PR fixes:
Fixes #151
Special notes for your reviewer:
The PR does not handle cases in which an etcd member is lost and replaced, and thus the corresponding `status.member` entry has to be removed. This needs to be done by the component which removes the lost member from the etcd cluster.
Release note:
A re-sync mechanism has been added for the Custodian controller. The new flag `--custodian-sync-period` (default: `30s`) controls the duration after which the Custodian controller re-enqueues `etcd` resources for reconciliation. This can be considered a health-check interval.
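As a rough illustration of how such a periodic re-sync is commonly wired up with controller-runtime (the reconciler body is omitted and the field name is an assumption mirroring the flag):

```go
package custodian

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler sketches a custodian-style controller that re-enqueues resources periodically.
type Reconciler struct {
	SyncPeriod time.Duration // value of --custodian-sync-period, e.g. 30s
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... run the status/health checks for the requested etcd resource here ...

	// Re-enqueue so the checks run again after the configured sync period.
	return ctrl.Result{RequeueAfter: r.SyncPeriod}, nil
}
```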