
Joining of machines fails when max # of (worker) reconciliation is reached. #490

Closed
prashanth26 opened this issue Jul 17, 2020 · 8 comments · Fixed by #491
Assignees
Labels
area/ops-productivity Operator productivity related (how to improve operations) effort/1d Effort for issue is around 1 day kind/bug Bug priority/2 Priority (lower number equals higher priority) status/in-progress Issue is in progress/work

Comments

@prashanth26
Contributor

prashanth26 commented Jul 17, 2020

What happened:
Newly created machines fail to join, which hinders (slows down) rolling updates/scale-ups. This problem also leads to hanging machineSets/machineDeployments.

What you expected to happen:
Already created machines should be able to join/re-join even if there are several machine operations running in parallel.

How to reproduce it (as minimally and precisely as possible):
Create a machineSet with a higher desired value. Deploy pods with PDBs that prohibit draining of machines. Now scale down the machineSet to a much smaller number (by more than half its original desired count). You will see that the machineSet takes too long to reach the desired state (longer than the drain timeout period).

Anything else we need to know:

Environment:

@prashanth26 prashanth26 added the kind/bug Bug label Jul 17, 2020
@prashanth26
Contributor Author

/priority critical
/area operations
/status in-progress
/size xs

/assign

@gardener-robot gardener-robot added priority/critical Needs to be resolved soon, because it impacts users negatively size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/in-progress Issue is in progress/work labels Jul 17, 2020
@prashanth26 prashanth26 added the area/ops-productivity Operator productivity related (how to improve operations) label Jul 17, 2020
@prashanth26
Contributor Author

Restarting the MCM temporarily works around this issue. However, we still need to look into it.

@prashanth26
Contributor Author

prashanth26 commented Jul 17, 2020

I checked through the code path. It looks like joining/re-joining of nodes is possible (and isn't blocked by this). However, I think this issue is caused by the maximum number of parallel workers allowed. Currently, MCM only allows 5 workers (per resource kind) to process in parallel.

Code for the same is in the links below.

My guess is that this is the cause, as restarting usually fixes the issue, and I had observed more than 5 machines stuck in draining, which would exhaust the workers. I also checked Prometheus and found that the average processing time and the worker (machine) queue length keep increasing, causing this starvation. The peaks drop to zero on restarts and keep growing while machines are stuck draining.

[Screenshot: Screen Shot 2020-07-17 at 2.04.03 PM]

[Screenshot: Screen Shot 2020-07-17 at 2.02.50 PM]
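
For illustration, here is a minimal sketch of the usual client-go worker-pool pattern (not the actual MCM code linked above; `reconcileMachine` is a hypothetical stand-in): a fixed number of goroutines pull items off a shared workqueue, so if all 5 workers are blocked inside long-running drains, everything else in the queue, including machines that only need to join/re-join, just waits.

```go
// Sketch of the fixed worker-pool pattern that bounds parallel reconciliations.
package sketch

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

const concurrentMachineWorkers = 5 // the per-kind default discussed in this issue

func runMachineController(queue workqueue.RateLimitingInterface, stopCh <-chan struct{}) {
	// Start a fixed pool of workers; this is the upper bound on parallel
	// machine reconciliations for this resource kind.
	for i := 0; i < concurrentMachineWorkers; i++ {
		go wait.Until(func() { processNextItem(queue) }, time.Second, stopCh)
	}
	<-stopCh
}

func processNextItem(queue workqueue.RateLimitingInterface) {
	key, quit := queue.Get()
	if quit {
		return
	}
	defer queue.Done(key)

	// reconcileMachine is hypothetical; in the starvation scenario it blocks
	// for up to the drain timeout, occupying one of the workers the whole time.
	if err := reconcileMachine(key.(string)); err != nil {
		queue.AddRateLimited(key)
		return
	}
	queue.Forget(key)
}

func reconcileMachine(key string) error {
	fmt.Println("reconciling", key)
	return nil
}
```
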

@prashanth26
Contributor Author

prashanth26 commented Jul 17, 2020

There are two solutions I see.

  1. Increase the max allowed concurrent workers.
    a. Set this statically to a higher number, say 10.
    b. We could set this value based on the number of nodes in the cluster.
  2. Set drain call timeouts to smaller intervals (~10-minute intervals). A drainTimeout of 2 hours with this set to 10 minutes would mean 12 separate drain calls instead of 1 long 120-minute call, which is what currently causes this block (see the sketch below).
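
A minimal sketch of option (2), under assumed names (`drainNode` stands in for the existing drain logic): each attempt is bounded by a short slice instead of the full drainTimeout, and the machine is simply retried on the next reconciliation when the slice expires, so the worker is freed roughly every 10 minutes.

```go
// Sketch: bound each drain attempt by a short slice instead of the full timeout.
package sketch

import (
	"context"
	"time"
)

const (
	totalDrainTimeout = 2 * time.Hour    // overall budget, as today
	drainSliceTimeout = 10 * time.Minute // per-attempt budget (proposed)
)

// drainNode is hypothetical and stands in for the existing drain logic.
func drainNode(ctx context.Context, nodeName string) error {
	return nil
}

// drainInSlices performs one bounded drain attempt; the caller re-enqueues the
// machine if the slice expires before draining completes and the overall
// deadline has not yet passed.
func drainInSlices(ctx context.Context, nodeName string, drainStartedAt time.Time) (done bool, err error) {
	if time.Since(drainStartedAt) > totalDrainTimeout {
		// Overall drain timeout exceeded; proceed with forceful deletion, as today.
		return true, nil
	}

	sliceCtx, cancel := context.WithTimeout(ctx, drainSliceTimeout)
	defer cancel()

	if err := drainNode(sliceCtx, nodeName); err != nil {
		if sliceCtx.Err() == context.DeadlineExceeded {
			// Slice expired: give the worker back and retry on the next reconciliation.
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```
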

I prefer approach (2). However, maybe there is a better way to solve this.

cc: @amshuman-kr

@prashanth26 prashanth26 changed the title Joining of machines fails if machineSet is frozen Joining of machines fails when max # of reconciliation is reached. Jul 17, 2020
@prashanth26 prashanth26 changed the title Joining of machines fails when max # of reconciliation is reached. Joining of machines fails when max # of (worker) reconciliation is reached. Jul 17, 2020
@amshuman-kr

@prashanth26 Thanks a lot for the analysis. My preference is actually different from options 1 and 2.

  3. Decouple the drain (at least the waiting part) from the reconciliation path. Ideally, we should fire whatever delete/evict calls are possible, simply ReconcileAfter, and pick up from where the last reconciliation left off in the next reconciliation. The current waiting and retrying in a loop inside the reconciliation is generally a bad idea (see the sketch below).
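
A rough sketch of this suggestion, with assumed helper names (`evictPod`, `listPodsPendingEviction` are stand-ins, not MCM's actual API): fire the eviction calls that are currently possible without waiting on them, then re-enqueue the machine after a short delay so the next reconciliation continues where this one left off, and no worker is held hostage by a long drain.

```go
// Sketch: non-blocking drain step that requeues instead of waiting in a loop.
package sketch

import (
	"context"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// evictPod and listPodsPendingEviction are hypothetical stand-ins for the
// existing eviction/listing logic.
func evictPod(ctx context.Context, podKey string) error {
	return nil
}

func listPodsPendingEviction(ctx context.Context, nodeName string) []string {
	return nil
}

func reconcileDrain(ctx context.Context, queue workqueue.RateLimitingInterface, machineKey, nodeName string) error {
	pending := listPodsPendingEviction(ctx, nodeName)
	if len(pending) == 0 {
		// Drain complete; continue with machine deletion in this reconciliation.
		return nil
	}

	// Fire whatever evict calls are possible right now, but do not wait for
	// the pods to actually terminate (PDBs may block them for a long time).
	for _, pod := range pending {
		_ = evictPod(ctx, pod) // errors (e.g. PDB violations) are retried on the next pass
	}

	// Re-enqueue instead of blocking inside the reconciliation loop.
	queue.AddAfter(machineKey, 30*time.Second)
	return nil
}
```
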

@prashanth26
Contributor Author

(3) is also a good idea. It skipped my mind.

Let's increase the workers to 10 for now (see the sketch below), and later we can decouple the drain logic as you mentioned in (3).
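
A minimal sketch of the interim fix, using a hypothetical flag name (the real MCM option may be named differently): make the per-kind worker count configurable and raise it from 5 to 10.

```go
// Sketch: configurable worker count with a raised default.
package sketch

import (
	"flag"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Hypothetical flag name; the actual MCM flag/option may differ.
var concurrentMachineSyncs = flag.Int(
	"concurrent-machine-syncs", 10,
	"number of machine objects reconciled in parallel per resource kind",
)

func startMachineWorkers(worker func(), stopCh <-chan struct{}) {
	for i := 0; i < *concurrentMachineSyncs; i++ {
		go wait.Until(worker, time.Second, stopCh)
	}
}
```
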

@prashanth26
Contributor Author

/reopen to keep track of this issue and implement Amshu's suggestion of part (3) eventually.

@prashanth26
Contributor Author

Closing this issue in favor of #48.

@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) effort/1d Effort for issue is around 1 day and removed priority/critical Needs to be resolved soon, because it impacts users negatively size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Mar 8, 2021