Joining of machines fails when max # of (worker) reconciliation is reached. #490
Comments
/priority critical /assign |
Restarting the MCM temporarily resolves this issue. However, we still need to investigate it. |
I checked the code path. It looks like joining/re-joining of nodes is possible and isn't blocked by this. However, I think the issue is caused by the maximum number of parallel workers allowed. Currently, MCM allows only 5 workers (per resource kind) to process in parallel. The relevant code is in the links below.
My guess is that this is the cause, since restarting usually fixes the issue, and I had observed more than 5 machines stuck in draining. I also checked Prometheus: the graph shows the average processing time and the worker (machine) queue length increasing, which points to this starvation. The peaks drop to zero on restarts and climb again as machines get stuck draining. |
I see two possible solutions.
I prefer approach (2). However, maybe there is a better way to solve this. cc: @amshuman-kr |
@prashanth26 Thanks a lot for the analysis. My preference is actually different from options 1 and 2.
|
(3) is also a good idea. It slipped my mind. Let's increase the workers to 10 for now, and later we can decouple the drain logic as you mentioned in (3). |
/reopen to keep track of this issue and implement Amshu's suggestion of part (3) eventually. |
Closing this issue in favor of #48. |
What happened:
Newly created machines fail to join, which slows down rolling updates and scale-ups. This leads to machineSets/machineDeployments hanging.
What you expected to happen:
Already created machines should be able to join/re-join even if there are several machine operations running in parallel.
How to reproduce it (as minimally and precisely as possible):
Create a machineSet with a high desired replica count. Deploy pods with PDBs that prohibit draining of machines. Now scale down the machineSet to a much smaller number (removing more than half of its original desired count). You will see that the machineSet takes too long to reach the desired state (longer than the drain timeout period).
Anything else we need to know:
Environment: