
Joining of machines fails when max # of (worker) reconciliation is reached. #490

Closed
prashanth26 opened this issue Jul 17, 2020 · 8 comments · Fixed by #491
Assignees
Labels
area/ops-productivity Operator productivity related (how to improve operations) effort/1d Effort for issue is around 1 day kind/bug Bug priority/2 Priority (lower number equals higher priority) status/in-progress Issue is in progress/work

Comments

@prashanth26
Contributor

prashanth26 commented Jul 17, 2020

What happened:
Newly created machines fail to join, which hinders (slows down) rolling updates/scale-ups. This problem also leads to hanging machineSets/machineDeployments.

What you expected to happen:
Already created machines should be able to join/re-join even if there are several machine operations running in parallel.

How to reproduce it (as minimally and precisely as possible):
Create a machineSet with a higher desired value. Deploy pods with PDBs that prohibit draining of machines. Now scale down the machineSet to a much smaller number (by more than half its original desired count). You will see that the machineSet takes too long to reach the desired state (longer than the drain timeout period).

Anything else we need to know:

Environment:

@prashanth26 prashanth26 added the kind/bug Bug label Jul 17, 2020
@prashanth26
Contributor Author

/priority critical
/area operations
/status in-progress
/size xs

/assign

@gardener-robot gardener-robot added priority/critical Needs to be resolved soon, because it impacts users negatively size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/in-progress Issue is in progress/work labels Jul 17, 2020
@prashanth26 prashanth26 added the area/ops-productivity Operator productivity related (how to improve operations) label Jul 17, 2020
@prashanth26
Contributor Author

Restarting the MCM temporarily works around this issue. However, we still need to look into it.

@prashanth26
Contributor Author

prashanth26 commented Jul 17, 2020

I checked through the code path. It looks like joining/re-joining of nodes is possible (and isn't blocked by this). However, I think this issue is caused by the maximum number of parallel workers allowed. Currently, MCM only allows 5 workers (per resource kind) to process in parallel.

Code for the same is in the links below.

My guess is that this is the cause, as restarting usually fixes the issue, and I had observed more than 5 machines stuck in draining, which would exhaust the workers. I also checked Prometheus and found that the average processing time and the worker (machine) queue length keep increasing, causing this starvation. The peaks drop to zero on restarts and keep growing while machines are stuck draining.

[Screenshot: Screen Shot 2020-07-17 at 2.04.03 PM]

[Screenshot: Screen Shot 2020-07-17 at 2.02.50 PM]
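
For illustration, here is a minimal sketch of the usual client-go worker-pool pattern (not the actual MCM code linked above; `reconcileMachine` is a hypothetical stand-in): a fixed number of goroutines pull items off a shared workqueue, so if all 5 workers are blocked inside long-running drains, everything else in the queue, including machines that only need to join/re-join, just waits.

```go
// Sketch of the fixed worker-pool pattern that bounds parallel reconciliations.
package sketch

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

const concurrentMachineWorkers = 5 // the per-kind default discussed in this issue

func runMachineController(queue workqueue.RateLimitingInterface, stopCh <-chan struct{}) {
	// Start a fixed pool of workers; this is the upper bound on parallel
	// machine reconciliations for this resource kind.
	for i := 0; i < concurrentMachineWorkers; i++ {
		go wait.Until(func() { processNextItem(queue) }, time.Second, stopCh)
	}
	<-stopCh
}

func processNextItem(queue workqueue.RateLimitingInterface) {
	key, quit := queue.Get()
	if quit {
		return
	}
	defer queue.Done(key)

	// reconcileMachine is hypothetical; in the starvation scenario it blocks
	// for up to the drain timeout, occupying one of the workers the whole time.
	if err := reconcileMachine(key.(string)); err != nil {
		queue.AddRateLimited(key)
		return
	}
	queue.Forget(key)
}

func reconcileMachine(key string) error {
	fmt.Println("reconciling", key)
	return nil
}
```
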

@prashanth26
Contributor Author

prashanth26 commented Jul 17, 2020

There are two solutions I see.

  1. Increase the max allowed concurrent workers.
    a. Set this statically to a higher number, say 10.
    b. We could set this value based on the number of nodes in the cluster.
  2. Set drain call timeouts to smaller intervals (~10-minute intervals). A drainTimeout of 2 hours with this set to 10 minutes would mean 12 separate drain calls instead of 1 long 120-minute call, which is what currently causes this block (see the sketch below).
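
A minimal sketch of option (2), under assumed names (`drainNode` stands in for the existing drain logic): each attempt is bounded by a short slice instead of the full drainTimeout, and the machine is simply retried on the next reconciliation when the slice expires, so the worker is freed roughly every 10 minutes.

```go
// Sketch: bound each drain attempt by a short slice instead of the full timeout.
package sketch

import (
	"context"
	"time"
)

const (
	totalDrainTimeout = 2 * time.Hour    // overall budget, as today
	drainSliceTimeout = 10 * time.Minute // per-attempt budget (proposed)
)

// drainNode is hypothetical and stands in for the existing drain logic.
func drainNode(ctx context.Context, nodeName string) error {
	return nil
}

// drainInSlices performs one bounded drain attempt; the caller re-enqueues the
// machine if the slice expires before draining completes and the overall
// deadline has not yet passed.
func drainInSlices(ctx context.Context, nodeName string, drainStartedAt time.Time) (done bool, err error) {
	if time.Since(drainStartedAt) > totalDrainTimeout {
		// Overall drain timeout exceeded; proceed with forceful deletion, as today.
		return true, nil
	}

	sliceCtx, cancel := context.WithTimeout(ctx, drainSliceTimeout)
	defer cancel()

	if err := drainNode(sliceCtx, nodeName); err != nil {
		if sliceCtx.Err() == context.DeadlineExceeded {
			// Slice expired: give the worker back and retry on the next reconciliation.
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```
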

I prefer approach (2). However, maybe there is a better way to solve this.

cc: @amshuman-kr

@prashanth26 prashanth26 changed the title Joining of machines fails if machineSet is frozen Joining of machines fails when max # of reconciliation is reached. Jul 17, 2020
@prashanth26 prashanth26 changed the title Joining of machines fails when max # of reconciliation is reached. Joining of machines fails when max # of (worker) reconciliation is reached. Jul 17, 2020
@amshuman-kr

@prashanth26 Thanks a lot for the analysis. My preference is actually different from options 1 and 2.

  3. Decouple the drain (at least the waiting part) from the reconciliation path. Ideally, we should fire whatever delete/evict calls are possible, simply ReconcileAfter, and pick up from where the last reconciliation left off in the next reconciliation. The current waiting and retrying in a loop inside the reconciliation is generally a bad idea (see the sketch below).
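
A rough sketch of this suggestion, with assumed helper names (`evictPod`, `listPodsPendingEviction` are stand-ins, not MCM's actual API): fire the eviction calls that are currently possible without waiting on them, then re-enqueue the machine after a short delay so the next reconciliation continues where this one left off, and no worker is held hostage by a long drain.

```go
// Sketch: non-blocking drain step that requeues instead of waiting in a loop.
package sketch

import (
	"context"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// evictPod and listPodsPendingEviction are hypothetical stand-ins for the
// existing eviction/listing logic.
func evictPod(ctx context.Context, podKey string) error {
	return nil
}

func listPodsPendingEviction(ctx context.Context, nodeName string) []string {
	return nil
}

func reconcileDrain(ctx context.Context, queue workqueue.RateLimitingInterface, machineKey, nodeName string) error {
	pending := listPodsPendingEviction(ctx, nodeName)
	if len(pending) == 0 {
		// Drain complete; continue with machine deletion in this reconciliation.
		return nil
	}

	// Fire whatever evict calls are possible right now, but do not wait for
	// the pods to actually terminate (PDBs may block them for a long time).
	for _, pod := range pending {
		_ = evictPod(ctx, pod) // errors (e.g. PDB violations) are retried on the next pass
	}

	// Re-enqueue instead of blocking inside the reconciliation loop.
	queue.AddAfter(machineKey, 30*time.Second)
	return nil
}
```
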

@prashanth26
Contributor Author

(3) is also a good idea. It skipped my mind.

Let's increase the workers to 10 for now (see the sketch below), and later we can decouple the drain logic as you mentioned in (3).
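
A minimal sketch of the interim fix, using a hypothetical flag name (the real MCM option may be named differently): make the per-kind worker count configurable and raise it from 5 to 10.

```go
// Sketch: configurable worker count with a raised default.
package sketch

import (
	"flag"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical flag name; the actual MCM flag/option may differ.
var concurrentMachineSyncs = flag.Int(
	"concurrent-machine-syncs", 10,
	"number of machine objects reconciled in parallel per resource kind",
)

func startMachineWorkers(worker func(), stopCh <-chan struct{}) {
	for i := 0; i < *concurrentMachineSyncs; i++ {
		go wait.Until(worker, time.Second, stopCh)
	}
}
```
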

@prashanth26
Contributor Author

/reopen to keep track of this issue and implement Amshu's suggestion of part (3) eventually.

@prashanth26
Contributor Author

Closing this issue in favor of #48.

@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) effort/1d Effort for issue is around 1 day and removed priority/critical Needs to be resolved soon, because it impacts users negatively size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Mar 8, 2021