Reduce load on etcd/kube-apiserver on pod eviction #949
Conversation
Thank you @thiyyakat for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.
Force-pushed from c84c066 to 32a2ae9
pkg/util/provider/drain/drain.go (Outdated)
@@ -232,20 +239,26 @@ func (o *Options) RunDrain(ctx context.Context) error {
	klog.Errorf("Drain Error: Cordoning of node failed with error: %v", err)
	return err
}
stopCh := make(chan struct{})
We should not construct an empty `stopCh` here, as it becomes useless for signalling. We should use a channel obtained from `context.WithTimeout` using the drain timeout passed to `NewDrainOptions`, and then use the `Done()` method on the returned context to get the `stopCh`.
PS: There is already a context created for this a little later in `RunDrain`: `drainContext, cancelFn := context.WithDeadline(ctx, o.drainStartedOn.Add(o.Timeout))`. We should use the `Done()` channel from this.
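For illustration, a minimal runnable sketch of this pattern, with local stand-ins for `o.drainStartedOn` and `o.Timeout` (this is not the PR's actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Stand-ins for the drain options passed to NewDrainOptions.
	drainStartedOn := time.Now()
	timeout := 2 * time.Second

	// Derive the deadline from the drain timeout instead of constructing
	// an empty, never-signalled stop channel.
	drainContext, cancelFn := context.WithDeadline(context.Background(), drainStartedOn.Add(timeout))
	defer cancelFn()

	// Done() returns a channel that closes when the deadline expires or
	// the context is cancelled, so it can serve directly as stopCh.
	stopCh := drainContext.Done()

	<-stopCh
	fmt.Println("drain deadline reached; stop channel closed:", drainContext.Err())
}
```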
pkg/util/provider/drain/drain.go (Outdated)
if w != nil {
	ws[w.string] = append(ws[w.string], pod.Name)
for _, pod := range podList {
	if pod.Spec.NodeName == o.nodeName {
Just use a negated check here and continue, instead of nesting; see the sketch below.
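For illustration, the flattened shape the comment suggests, using toy data and a local stand-in for `o.nodeName`:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	nodeName := "node-to-drain" // stand-in for o.nodeName

	podList := []corev1.Pod{
		{Spec: corev1.PodSpec{NodeName: "node-to-drain"}},
		{Spec: corev1.PodSpec{NodeName: "other-node"}},
	}

	for _, pod := range podList {
		// A negated check plus continue avoids nesting the loop body.
		if pod.Spec.NodeName != nodeName {
			continue
		}
		fmt.Printf("handling pod on %s\n", pod.Spec.NodeName)
	}
}
```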
Looks good. Small changes requested.
pkg/util/provider/app/app.go (Outdated)
@@ -279,6 +279,7 @@ func StartControllers(s *options.MCServer,
	targetCoreInformerFactory.Storage().V1().VolumeAttachments(),
	machineSharedInformers.MachineClasses(),
	machineSharedInformers.Machines(),
	targetCoreInformerFactory.Core().V1().Pods(),
Nit: Can you move this a few lines up so that all `InformerFactory`s are together?
Force-pushed from c582bbe to 08127fe
pkg/util/provider/drain/drain.go (Outdated)
 if err != nil {
-	return pods, err
+	return nil, err
Minor nit: with named return params, you don't need an explicit valued return; a bare `return` is sufficient.
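For illustration, a self-contained example of named return parameters with a bare return (a hypothetical function, not the PR's):

```go
package main

import (
	"errors"
	"fmt"
)

// listPodNames uses named return parameters, so a bare `return`
// hands back the current values of names and err.
func listPodNames(fail bool) (names []string, err error) {
	if fail {
		err = errors.New("listing failed")
		return // equivalent to: return names, err (names is nil here)
	}
	names = []string{"nginx-pod1", "nginx-pod2"}
	return
}

func main() {
	names, err := listPodNames(false)
	fmt.Println(names, err)
}
```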
/lgtm. Please run IT.
Force-pushed from 08127fe to cd01e68
/lgtm
* Reduce etcd traffic by using SharedInformer to list pods for drain logic
* Address review comments
What this PR does / why we need it:
The PR changes the way pods are listed before eviction by using a `PodInformer`, which reads from a local cache rather than directly querying the kube-apiserver/etcd. The pods are listed only after the cache has synced; a minimal sketch of this pattern follows the issue reference below.

Which issue(s) this PR fixes:
Fixes #703
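A minimal sketch of the informer-backed listing described above, assuming a local kubeconfig and a hypothetical node name (the PR itself wires the pod informer through `StartControllers`):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative client setup from a local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	factory.Start(ctx.Done())

	// List only after the local cache has synced, so reads are served
	// from memory instead of hitting the kube-apiserver/etcd.
	if !cache.WaitForCacheSync(ctx.Done(), podInformer.Informer().HasSynced) {
		panic("pod informer cache failed to sync")
	}

	pods, err := podInformer.Lister().List(labels.Everything())
	if err != nil {
		panic(err)
	}
	for _, pod := range pods {
		if pod.Spec.NodeName != "node-to-drain" { // hypothetical node name
			continue
		}
		fmt.Printf("%s/%s\n", pod.Namespace, pod.Name)
	}
}
```

Because all reads after the sync come from the informer's in-memory store, repeated listings during eviction retries no longer hit etcd, which is consistent with the traffic drop reported below.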
Special notes for your reviewer:
Impact of the changes was tested by running pods covered by a PDB with `maxUnavailable` set to 0, on a cluster with 20 machines, and then deleting all the machines, thereby initiating drain (a sketch of such a PDB is shown below). Without the change, peak traffic recorded was 8.51 MB/s; with the change, it was 1.48 MB/s.

Integration tests run for providers AWS and Azure completed successfully.
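For reference, a sketch of the kind of PDB used in this test; the name, namespace, and selector here are assumptions:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// maxUnavailable: 0 blocks all voluntary evictions, so the drain
	// logic keeps retrying and the pod-listing path stays under load.
	maxUnavailable := intstr.FromInt(0)

	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "block-eviction", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "nginx"},
			},
		},
	}
	fmt.Printf("%s/%s maxUnavailable=%s\n", pdb.Namespace, pdb.Name, pdb.Spec.MaxUnavailable.String())
}
```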
Additionally, to manually check that the `podInformer` cache syncs successfully before the `podLister.List()` call, a `time.Sleep(30 * time.Second)` was introduced before calling `RunCordonOrUncordon()`, and a new pod (default/nginx-pod2) was deployed during the sleep period. Logs were added to print the names of all pods on the node returned by the podLister. After triggering the deletion of the machine, the machine entered the drain flow, and after the sleep the new pod's name was logged.

Release note: