Reduce load on etcd/kube-apiserver on pod eviction #949

thiyyakat · 2024-10-21T08:35:59Z

What this PR does / why we need it:

This PR changes how pods are listed before eviction: it uses a PodInformer, which serves the list from a local cache, rather than querying the kube-apiserver/etcd directly. The pods are listed only after the cache has synced.

Which issue(s) this PR fixes:
Fixes #703

Special notes for your reviewer:

The impact of the change was tested by running pods protected by a PDB with maxUnavailable set to 0 on a cluster with 20 machines, then deleting all the machines to initiate drain. Without the change, the peak traffic recorded was 8.51 MB/s. With the change, the peak traffic recorded was 1.48 MB/s.

Integration tests for the AWS and Azure providers completed successfully.

Additionally, to manually check that the podInformer cache syncs successfully before the podLister.List() call, a time.Sleep(30 * time.Second) was introduced before calling RunCordonOrUncordon(), and a new pod (default/nginx-pod2) was deployed during the sleep period. Logs were added to print the names of all pods on the node returned by the podLister. After the deletion of the machine was triggered, the machine entered the drain flow, and after the sleep the new pod's name was logged:

I1023 16:13:08.044770   91923 machine_util.go:1182] (drainNode) Invoking RunDrain, forceDeleteMachine: false, forceDeletePods: false, timeOutDuration: 5m0s
I1023 16:13:08.044894   91923 drain.go:238] ABOUT TO SLEEP for 30s
I1023 16:13:43.861554   91923 drain.go:369] Found pod default/nginx-pod2

Release note:

MCM will use an `informer` instead of the `clientset` to list pods in the drain logic. This will reduce the load on etcd/kube-apiserver.

gardener-robot-ci-3 · 2024-10-21T08:36:34Z

Thank you @thiyyakat for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

elankath · 2024-10-22T04:04:39Z

pkg/util/provider/drain/drain.go

@@ -232,20 +239,26 @@ func (o *Options) RunDrain(ctx context.Context) error {
 		klog.Errorf("Drain Error: Cordoning of node failed with error: %v", err)
 		return err
 	}
+	stopCh := make(chan struct{})


We should not construct an empty stopCh here as it becomes useless for signalling. We should use a channel obtained from context.WithTimeout using the drain timeout passed to NewDrainOptions. And then use the Done() method on the returned context to get the stopCh.

PS: There is already a context created for this a little later in RunDrain: `drainContext, cancelFn := context.WithDeadline(ctx, o.drainStartedOn.Add(o.Timeout))`

We should use the Done() channel from this.

elankath · 2024-10-22T04:32:03Z

pkg/util/provider/drain/drain.go

-			if w != nil {
-				ws[w.string] = append(ws[w.string], pod.Name)
+	for _, pod := range podList {
+		if pod.Spec.NodeName == o.nodeName {


Just use a negated check here and `continue` instead of nesting.
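
A rough sketch of the shape being asked for. The `pod` struct and `podsOnNode` function are hypothetical simplifications for illustration, not the actual types or code in drain.go; only the `!=` check followed by `continue` is the point.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod object.
type pod struct {
	Name     string
	NodeName string
}

// podsOnNode returns the names of the pods scheduled on nodeName.
// Pods on other nodes are skipped early, so the main logic stays unnested.
func podsOnNode(podList []pod, nodeName string) []string {
	var names []string
	for _, p := range podList {
		if p.NodeName != nodeName {
			continue
		}
		names = append(names, p.Name)
	}
	return names
}

func main() {
	pods := []pod{
		{Name: "nginx-pod1", NodeName: "node-a"},
		{Name: "nginx-pod2", NodeName: "node-b"},
		{Name: "nginx-pod3", NodeName: "node-a"},
	}
	fmt.Println(podsOnNode(pods, "node-a"))
}
```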

elankath

looks good. small changes requested.

aaronfern · 2024-10-22T09:02:31Z

pkg/util/provider/app/app.go

@@ -279,6 +279,7 @@ func StartControllers(s *options.MCServer,
 		targetCoreInformerFactory.Storage().V1().VolumeAttachments(),
 		machineSharedInformers.MachineClasses(),
 		machineSharedInformers.Machines(),
+		targetCoreInformerFactory.Core().V1().Pods(),


Nit: Can you move this a few lines up so that all InformerFactorys are together?

elankath · 2024-10-23T06:24:29Z

pkg/util/provider/drain/drain.go

 	if err != nil {
-		return pods, err
+		return nil, err


minor nit: with named return params, you don't need explicit valued return. Just a bare return is sufficient.
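
For illustration, a small sketch of the bare-return form. `listPodNames` and its `fail` parameter are hypothetical and only mimic the shape of a function with named results. One thing to keep in mind: a bare `return` hands back whatever the named results currently hold, so it matches `return nil, err` only if `names` has not been assigned at that point.

```go
package main

import (
	"errors"
	"fmt"
)

// listPodNames is a hypothetical function with named return parameters.
func listPodNames(fail bool) (names []string, err error) {
	if fail {
		err = errors.New("list failed")
		// Bare return: names is still nil here, err carries the failure.
		return
	}
	names = []string{"default/nginx-pod2"}
	return
}

func main() {
	names, err := listPodNames(true)
	fmt.Println(names == nil, err)

	names, err = listPodNames(false)
	fmt.Println(names, err)
}
```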

elankath

/lgtm . Please run IT.

rishabh-11

/lgtm

* Reduce etcd traffic by using SharedInformer to list pods for drain logic
* Address review comments

thiyyakat requested a review from a team as a code owner October 21, 2024 08:35

gardener-robot added needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Oct 21, 2024

thiyyakat self-assigned this Oct 21, 2024

rishabh-11 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 21, 2024

gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 21, 2024

Reduce etcd traffic by using SharedInformer to list pods for drain logic

32a2ae9

thiyyakat force-pushed the reduce-drain-traffic branch from c84c066 to 32a2ae9 Compare October 21, 2024 11:38

elankath reviewed Oct 22, 2024

elankath requested changes Oct 22, 2024

gardener-robot added the needs/changes Needs (more) changes label Oct 22, 2024

aaronfern reviewed Oct 22, 2024

thiyyakat force-pushed the reduce-drain-traffic branch from c582bbe to 08127fe Compare October 22, 2024 10:19

rishabh-11 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 22, 2024

gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 22, 2024

thiyyakat requested a review from elankath October 23, 2024 06:09

elankath reviewed Oct 23, 2024

elankath approved these changes Oct 23, 2024

gardener-robot added reviewed/lgtm Has approval for merging and removed needs/changes Needs (more) changes needs/review Needs review labels Oct 23, 2024

gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 23, 2024

Address review comments

cd01e68

thiyyakat force-pushed the reduce-drain-traffic branch from 08127fe to cd01e68 Compare October 23, 2024 06:56

gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 23, 2024

rishabh-11 approved these changes Oct 28, 2024

rishabh-11 merged commit 31834e2 into gardener:master Oct 28, 2024
8 checks passed

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 28, 2024

acumino pushed a commit to acumino/machine-controller-manager that referenced this pull request Nov 25, 2024

Reduce load on etcd/kube-apiserver on pod eviction (gardener#949)

0249abe

* Reduce etcd traffic by using SharedInformer to list pods for drain logic
* Address review comments
