[UToronto] Pod scheduled on full node and spawn failure #1004
Comments
@consideRatio, I searched the logs for that user pod but didn't find anything, for some reason.
Not sure if it has anything to do with this issue, but I did see a "recommendation" in the Azure portal about an "unsupported k8s version detected".
And I believe we're running the default 1.20.7?
Are we maybe hitting the max number of pods the node can contain, and is that somehow eliciting this behavior?
@damianavila, I believe it's 110. But yes, I've checked and 110 is the max number of pods allowed per node.

Findings

I analyzed the dashboards a bit and this is what they show:
I don't feel I have the capacity to debug this, but I want to throw in a quick workaround idea: make sure that user pods packed that tightly on a node run out of memory to schedule before they run out of "number of pods" per node and fail ugly like this. If you do that workaround, remember that some non-user pods may also be running on each node, so aim for a max of ~90 user pods per node, for example. If this is done, also make sure to put in a note that the resource requests were tweaked to ensure this, and that it will depend on the nodes you have in the cluster. So, it is not something you could do if you have nodes of different capacities available for users.
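A minimal sketch of that workaround, assuming the standard z2jh singleuser.memory keys and nesting under jupyterhub: (the node size and numbers below are purely illustrative, not the actual UToronto values):

```yaml
# Hypothetical values snippet: make the per-user memory request large enough
# that a node runs out of allocatable memory at roughly 90 user pods, before
# it ever hits the 110-pods-per-node cap.
jupyterhub:
  singleuser:
    memory:
      # Illustrative arithmetic: a node with ~180 GB allocatable to user
      # pods divided by a 2G guarantee caps out at ~90 users per node.
      guarantee: 2G
      limit: 2G
```

The caveat above still applies: this only works if all user nodes have the same capacity, and the note explaining why the requests were sized this way should live next to the config.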
One possibility is that it's a race condition because we have multiple schedulers running on the cluster - the default cluster scheduler and the custom user scheduler. They could be racing each other, and the following sequence of events might be happening.
I encountered something like this (but not exactly this) at the Berkeley cluster. There, I worked around it by configuring the GKE cluster scheduler itself to behave closer to our user scheduler (see https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles). I don't think such a thing is possible in AKS yet. If that's indeed the case, I think we should:
This can sometimes be tricky due to how kube-scheduler has changed, and given that we want to support a few versions of k8s at the same time. Here is a relevant note from z2jh about the need to bump RBAC resources. I'll look into this a bit right now. See jupyterhub/zero-to-jupyterhub-k8s#2590
GKE supports setting a cluster autoscaler profile that optimizes utilization[1]. This changes the behavior of the autoscaler *and* the scheduler, packing pods in tighter into nodes rather than 'spreading' them around. This is what the z2jh user-scheduler primarily does. We can actually do away with the user-scheduler if we use the OPTIMIZE_UTILIZATION autoscaler profile - Berkeley has been running under this setup for a few months now. This reduces how much resources our core nodes need, and the GKE provided scheduler is also faster. This will also prevent issues like 2i2c-org#1004. [1]: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles
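For reference, a hedged sketch of what dropping the user-scheduler looks like on the z2jh side once a GKE cluster runs with the optimize-utilization autoscaling profile. The key names come from the z2jh chart; the nesting under jupyterhub: is an assumption about this repo's layout:

```yaml
# Hypothetical values snippet: with GKE's OPTIMIZE_UTILIZATION profile the
# default scheduler already packs pods tightly, so the z2jh user-scheduler
# (and the kube-scheduler pods it runs on the core nodes) can be turned off.
jupyterhub:
  scheduling:
    userScheduler:
      enabled: false
```

Note that this only applies to GKE clusters; as noted elsewhere in this thread, AKS (where the UToronto hub runs) does not expose an equivalent profile.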
Thanks for pushing this forward @consideRatio! With #1026, I believe we should close this issue once #1087, which updates the kube-scheduler, is merged. What do you all think? I figure, since this is most likely a race, reproducing it will be hard. We could also increase the resources of the scheduler before closing this, as per @yuvipanda's suggestion.
The user pods declare the scheduler they want to be scheduled by, so I'm confident it isn't a race between schedulers trying to schedule the same pod(s). If many pods are to be scheduled on the same node by different schedulers, maybe there could be some race condition, and that would be avoided by having pure user nodes with only user pods (and a few system pods). But the user-scheduler was a kube-scheduler configured for k8s 1.19; what was the k8s api-server version? They are supposed to have the same version, but it may still work, and it has worked great ever since the user-scheduler was introduced long ago.
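For context on the "pods declare the scheduler" point: the scheduler is selected per pod via the pod spec's schedulerName field. A rough sketch of what a z2jh user pod carries (names are illustrative; the real scheduler name depends on the Helm release):

```yaml
# Sketch of the relevant part of a user pod spec. Only pods that set
# schedulerName to the user-scheduler are handled by it; system and core pods
# keep the default kube-scheduler. The two schedulers therefore never fight
# over the same pod, only over the remaining room on the same node.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-exampleuser          # hypothetical user pod name
spec:
  schedulerName: prod-user-scheduler # z2jh sets "<release>-user-scheduler"
  containers:
    - name: notebook
      image: jupyter/base-notebook   # illustrative image
```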
Yeah, this was my assumption as well, since system pods are not scheduled by the user-scheduler. But there aren't a lot of system pods around, and this occurs too often for that to be the explanation. k8s is 1.20.
@choldgraf I don't think #1087 includes the updated kube-scheduler; that would probably be in #1055
Ah, and #1026 doesn't affect this because that only works for GKE clusters.
I saw:
Note that the update of kube-scheduler won't happen until z2jh 2.0.0 is used.
OK, so it sounds like this one will be resolved (hopefully!) when this issue is resolved: I'll add that to the top comment, and since this is an improvement over time and not a continuous active fire, I think we can close out this incident. I'll let the Toronto folks know to watch that issue. If nobody objects in an hour or two, I'll close this and email them with this info.
Did you email them back, @choldgraf? I do not see a note on the ticket: https://2i2c.freshdesk.com/a/tickets/79
@damianavila good point - I've sent them an email today!
Since the email has been sent, let's close the incident!
Description of problem and opportunity to address it
New ticket from the UToronto folks reporting a spawn failure (which has also happened in the past). This is the ticket: https://2i2c.freshdesk.com/a/tickets/79
Relevant logs reported:
Updates and ongoing work
Initially
I first thought the spawn timeout was because the new node took too much time to come up, but as @consideRatio pointed out, there isn't actually any scale-up event in the logs.
To investigate
I'm copy-pasting what @consideRatio noted in the ticket because I believe it's very relevant and presents a great debugging strategy for this issue:
If it had tried to scale up, I would expect to see a note about a scale-up event. Instead, there are two "successfully assigned pod" events, but then there is a note about "too many pods" and the pod being kicked out of the node.
Questions to debug:
Possible theory
Race condition because we have multiple schedulers running on the cluster - the default cluster scheduler and the custom user scheduler. They could be racing each other, and the following sequence of events might be happening.
Steps that could fix it
- Bump the kube-scheduler version -> Upgrade our hubs to Z2JH 2 / JupyterHub 3.0 #1055
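As a purely illustrative footnote on what bumping the kube-scheduler version touches in config terms: the z2jh chart exposes the user-scheduler image, so a hub could in principle pin a newer tag there. This is a hedged sketch, not a recommended fix; per the z2jh note linked above, crossing kube-scheduler minor versions can also require RBAC and scheduler-config changes, which is why the real bump arrives with z2jh 2.0.0:

```yaml
# Hypothetical override pinning the user-scheduler's kube-scheduler image to
# match the cluster's 1.20 control plane. Tag and nesting are assumptions.
jupyterhub:
  scheduling:
    userScheduler:
      image:
        name: k8s.gcr.io/kube-scheduler
        tag: v1.20.15   # illustrative; pick a tag matching the api-server
```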