[UToronto] Pod scheduled on full node and spawn failure #1004

Closed · 1 task
GeorgianaElena opened this issue Feb 17, 2022 · 17 comments
Comments

@GeorgianaElena
Member

GeorgianaElena commented Feb 17, 2022

Description of problem and opportunity to address it

New ticket from the UToronto folks reporting a spawn failure (one that has also happened in the past). The ticket: https://2i2c.freshdesk.com/a/tickets/79

Relevant logs reported:
(screenshot attachment: toront-logs)

Updates and ongoing work

Initially

I first thought the spawn timeout was because the new node took too much time to come up, but as @consideRatio pointed out, there isn't actually any scale-up event in the logs.

To investigate

I'm copy-pasting what @consideRatio noted in the ticket because I believe it's very relevant and presents a great debugging strategy for this issue:

If it had tried to scale up, I would expect to see a note about a scale-up event - but instead there are two "successfully assigned pod" events, followed by a note about "too many pods" and the pod being kicked out of the node.

Questions to debug:

  • Why is the user pod being assigned to an already full node?
  • One lead to consider is the custom pod scheduler we use to schedule the user pods onto nodes. The JupyterHub Helm chart starts a dedicated kube-scheduler that may fail to recognize that a node already has too many pods and schedule the pod there even though it shouldn't. The user-scheduler is running a kube-scheduler binary with custom configuration. This kube-scheduler binary comes from a pinned version of k8s; maybe it should be updated, because the scheduling logic may be different now compared to before - or that version may have had a bug related to this, etc.
  • The logs of the user-scheduler pod that assigned this user pod to a node are a relevant source of information I think, but probably quite technical in nature (see the commands sketched after this list).
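A minimal sketch of how one might follow that debugging strategy, assuming the hub's namespace and the affected user pod's name (both placeholders here), and assuming the z2jh chart's usual `component=user-scheduler` label on the scheduler pods:

```shell
# Placeholders: <namespace> is the hub's k8s namespace, <user-pod> is the affected
# pod (user pods are typically named jupyter-<username>).

# Events for the pod, including "Scheduled" and any "too many pods" rejections
kubectl -n <namespace> describe pod <user-pod>

# Logs from the user-scheduler pods deployed by the z2jh chart
kubectl -n <namespace> logs -l component=user-scheduler --tail=-1 | grep <user-pod>
```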

Possible theory

A race condition could be at play because we have multiple schedulers running on the cluster - the default cluster scheduler and the custom user scheduler. They could be racing each other, and the following sequence of events might be happening.

  1. User scheduler sees nodeA with 109 pods
  2. Cluster scheduler sees nodeA with 109 pods
  3. Cluster scheduler assigns podA to nodeA
  4. User scheduler doesn't update current status of nodeA, but assigns podB to nodeA
  5. Kubelet on nodeA pulls podA, and when it tries to pull podB it fails with this error message - 110 is the Kubernetes default per-node pod limit (a quick check for this is sketched below).
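To check whether the 110-pod ceiling is really what's being hit, a couple of standard kubectl queries (node name is a placeholder) show each node's pod capacity and what the kubelet reports:

```shell
# Pod capacity / allocatable pod count per node (defaults to 110 on most distributions)
kubectl get nodes -o custom-columns=NODE:.metadata.name,CAPACITY:.status.capacity.pods,ALLOCATABLE:.status.allocatable.pods

# Detailed view for one node, including the "Non-terminated Pods" count
kubectl describe node <node-name>
```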

Steps that could fix it

@GeorgianaElena
Member Author

The logs of the user-scheduler pod that assigned this user pod to a node are a relevant source of information I think, but probably quite technical in nature.

@consideRatio, I searched the logs for that user pod but didn't find anything, for some reason.

The user-scheduler is running a kube-scheduler binary with custom configuration. This kube-scheduler binary comes from a pinned version of k8s; maybe it should be updated, because the scheduling logic may be different now compared to before - or that version may have had a bug related to this, etc.

Not sure if it has anything to do with this issue, but I did see a "recommendation" in the Azure portal about "unsupported k8s version detected".

Running az aks get-versions --location canadacentral -o table returns:

KubernetesVersion    Upgrades
-------------------  -------------------------
1.23.3(preview)      None available
1.22.6               1.23.3(preview)
1.22.4               1.22.6, 1.23.3(preview)
1.21.9               1.22.4, 1.22.6
1.21.7               1.21.9, 1.22.4, 1.22.6
1.20.15              1.21.7, 1.21.9
1.20.13              1.20.15, 1.21.7, 1.21.9
1.19.13              1.20.13, 1.20.15
1.19.11              1.19.13, 1.20.13, 1.20.15

And I believe we're running the default 1.20.7?
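To confirm what the cluster is actually running, something along these lines should work (resource group and cluster name are placeholders):

```shell
# Control-plane version as reported by AKS
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query kubernetesVersion -o tsv

# API server and node versions as seen from kubectl
kubectl version --short
kubectl get nodes -o wide
```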

@damianavila
Contributor

"too many pods"

Are we maybe hitting the maximum number of pods the node can contain, and is that maybe somehow eliciting this behavior?
That used: 110 capacity:100 smells bad...

@GeorgianaElena
Member Author

That used: 110 capacity:100 smells bad...

@damianavila, I believe it's 110 with 110. But yes, I've checked and 110 is the max number of pods allowed per node.
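For reference, one way to count how many pods are actually on the suspect node (node name is a placeholder) and compare against that 110 limit:

```shell
# All pods (any namespace) currently assigned to the node
# (includes Succeeded/Failed pods; add status.phase=Running to the selector to narrow it)
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name> --no-headers | wc -l
```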

Findings

I analyzed the dashboards a bit and this is what it shows:

(screenshot attachment: toronto-dashboard)

  • The peaks are approx. 20 min after the timestamp of the reported error (by the way, the log timestamps are UTC, right?)

  • Also, I never see that node having 110 pods on it in the logs 🤔 It was at 108 at the time the logs reported the node as full.

  • There are also a few pods consuming a significant amount of RAM in that time interval.
    (screenshot attachment: high-mem)

  • A new node did come up 15 min after this error, even though the logs didn't show a scale-up event

(screenshot attachment: new-node)

@consideRatio
Member

I don't feel I have the capacity to debug this, but I want to throw in a quick workaround idea: make sure the user pods run out of schedulable memory on a node before they run out of the per-node "number of pods" limit and fail in an ugly way like this. If you do that workaround, remember to consider that some non-user pods may also be running on each node, so maybe aim for a maximum of 90 user pods per node, for example.

If this is done, also make sure to put in a note that the requested resources are tweaked to ensure this - and that it depends on the nodes you have in the cluster. So it is not something you could do if you have nodes with different capacities available for users.
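A minimal sketch of that workaround in the hub's z2jh config, assuming (hypothetically) nodes with roughly 100 GB of memory allocatable for user pods, so that a ~1.1 GB guarantee caps a node at about 90 user pods before the 110-pod limit is reached; the numbers would need to be recalculated per node type:

```yaml
# Hypothetical numbers; recalculate from the actual node's allocatable memory.
singleuser:
  memory:
    guarantee: 1.1G   # scheduler can pack at most ~90 user pods on a ~100 GB node
    limit: 2G
```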

@yuvipanda
Member

One possibility is that it's a race condition because we have multiple schedulers running on the cluster - the default cluster scheduler and the custom user scheduler. They could be racing each other, and the following sequence of events might be happening.

  1. User scheduler sees nodeA with 109 pods
  2. Cluster scheduler sees nodeA with 109 pods
  3. Cluster scheduler assigns podA to nodeA
  4. User scheduler doesn't update current status of nodeA, but assigns podB to nodeA
  5. Kubelet on nodeA pulls podA, and when it tries to pull podB it fails with this error message - 110 is the Kubernetes default per-node pod limit.

I encountered something like this (but not exactly this) at the Berkeley cluster. There, I worked around it by configuring the GKE cluster scheduler itself to behave closer to our user scheduler (see https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles). I don't think such a thing is possible in AKS yet. If that's the case, I think we should:

  1. See if the version of the user scheduler needs to be upgraded, and upgrade it
  2. See if more CPU and memory for the user scheduler would help (a config sketch follows below).
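For point 2, the z2jh chart exposes resource requests for the user-scheduler under `scheduling.userScheduler`; a hedged sketch of bumping them (the numbers are illustrative, not tested):

```yaml
scheduling:
  userScheduler:
    enabled: true
    resources:            # illustrative values; the chart's defaults are lower
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
```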

@consideRatio
Member

consideRatio commented Feb 22, 2022

See if the version of the user scheduler needs to be upgraded, and upgrade it

This can sometimes be tricky due to how kube-scheduler has changed over time, and given that we want to support a few versions of k8s at the same time. Here is a relevant note from z2jh about the need to bump RBAC resources:

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/6f4609e2f73258c7e0146f8e849d31aa7fe7f09f/jupyterhub/values.yaml#L432-L436

I'll look into this right now a bit. See jupyterhub/zero-to-jupyterhub-k8s#2590

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Feb 24, 2022
GKE supports setting a cluster autoscaler profile that
optimizes utilization[1]. This changes the behavior of
the autoscaler *and* the scheduler, packing pods in tighter
into nodes rather than 'spreading' them around. This is
what the z2jh user-scheduler primarily does. We can actually
do away with the user-scheduler if we use the OPTIMIZE_UTILIZATION
autoscaler profile - Berkeley has been running under this setup
for a few months now. This reduces how much resources our core
nodes need, and the GKE provided scheduler is also faster.
This will also prevent issues like
2i2c-org#1004.

[1]: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles
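For reference, the autoscaling profile mentioned in the commit can be set on an existing GKE cluster with gcloud (cluster name and zone are placeholders):

```shell
gcloud container clusters update <cluster-name> \
  --zone <zone> \
  --autoscaling-profile optimize-utilization
```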
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Feb 28, 2022 (same commit message as above)
@GeorgianaElena
Member Author

I'll look into this right now a bit. See jupyterhub/zero-to-jupyterhub-k8s#2590

Thanks for pushing this forward @consideRatio!

With #1026, I believe we should close this issue once #1087 that updates the kube-scheduler is merged.

What do you all think? I figure that, since this is most likely a race, reproducing it will be hard. We could also increase the resources of the scheduler before closing this, as per @yuvipanda's suggestion.

@consideRatio
Member

consideRatio commented Mar 11, 2022

The user pods declare the scheduler they want to be scheduled by, so I'm confident it isn't a race between schedulers trying to schedule the same pod(s). If many pods are to be scheduled on the same node by different schedulers, maybe there could be some race condition that would be avoided by having pure user nodes with only user pods (and a few system pods).

But the user-scheduler was a configured kube-scheduler from k8s 1.19 - what was the k8s api-server version? They are supposed to have the same version, but a mismatch may still work, and it has worked great for as long as the user-scheduler has been around.
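A quick way to compare the two versions (namespace is a placeholder; the z2jh chart names the deployment `user-scheduler`):

```shell
# kube-scheduler version baked into the user-scheduler image tag
kubectl -n <namespace> get deployment user-scheduler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# API server version ("Server Version" line)
kubectl version --short
```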

@yuvipanda
Member

If many pods are to be scheduled on the same node, by different schedulers, maybe there could be some race condition that would be avoided by having pure user nodes with only user pods (and a few system pods).

Yeah, this was my assumption, as system pods are not scheduled by the user-scheduler. But there aren't a lot of system pods around, and this occurs too often for that.

k8s is 1.20.

@choldgraf
Member

Hey all - I see that in this comment:

With #1026, I believe we should close this issue once #1087 that updates the kube-scheduler is merged.

We suggested closing this one once those two PRs are merged. They are now both merged. Should we close this issue? If not, what is our next step here?

@yuvipanda
Member

@choldgraf I don't think #1087 includes the updated kube-scheduler; that would probably be in #1055

@yuvipanda
Member

Ah, and #1026 doesn't affect this because that only works for GKE clusters.

@consideRatio
Member

I saw:

[x] Update kube-scheduler version -> Revert "Revert "Upgrade to z2jh 1.2.0"" #1087

Note that the update of kube-scheduler won't happen until z2jh 2.0.0 is used.

@choldgraf
Member

choldgraf commented Mar 24, 2022

OK so it sounds like this one will be resolved (hopefully!) when this issue is resolved:

I'll add that to the top comment, and since this is an improvement over time and not a continuous active fire, I think that we can close out this incident. I'll let the Toronto folks know to watch that issue.

If nobody objects in an hour or two, I'll close this and email them w/ this info

@damianavila
Contributor

If nobody objects in an hour or two, I'll close this and email them w/ this info

Did you email them back, @choldgraf? I do not see a note on the ticket: https://2i2c.freshdesk.com/a/tickets/79
But I am not sure if you had some other email exchange with them.

@choldgraf
Member

choldgraf commented Apr 2, 2022

@damianavila good point - I've sent them an email today!

@damianavila damianavila moved this from Blocked to Ready to work in DEPRECATED Engineering and Product Backlog Apr 12, 2022
@damianavila
Contributor

Since the email has been sent, let's close the incident!
