[UToronto] Pod scheduled on full node and spawn failure #1004

Closed · 1 task
GeorgianaElena opened this issue Feb 17, 2022 · 17 comments
Comments

@GeorgianaElena
Member

GeorgianaElena commented Feb 17, 2022

Description of problem and opportunity to address it

New ticket from the UToronto folks reporting a spawn failure (one that has also happened in the past). The ticket: https://2i2c.freshdesk.com/a/tickets/79

Relevant logs reported:
(screenshot attachment: toront-logs)

Updates and ongoing work

Initially

I first thought the spawn timeout was because the new node took too much time to come up, but as @consideRatio pointed out, there isn't actually any scale-up event in the logs.

To investigate

I'm copy-pasting what @consideRatio noted in the ticket because I believe it's very relevant and presents a great debugging strategy for this issue:

If it had tried to scale up, I would expect to see a note about a scale-up event - but instead there are two "successfully assigned pod" events, followed by a note about "too many pods" and the pod being kicked out of the node.

Questions to debug:

  • Why is the user pod being assigned to an already full node?
  • One lead to consider is the custom pod scheduler we use to schedule the user pods onto nodes. The JupyterHub Helm chart starts a dedicated kube-scheduler that may fail to recognize that a node already has too many pods and schedule the pod there even though it shouldn't. The user-scheduler is running a kube-scheduler binary with custom configuration. This kube-scheduler binary comes from a pinned version of k8s; maybe it should be updated, because the scheduling logic may be different now compared to before - or that version may have had a bug related to this, etc.
  • The logs of the user-scheduler pod that assigned this user pod to a node are a relevant source of information I think, but probably quite technical in nature (see the commands sketched after this list).
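A minimal sketch of how one might follow that debugging strategy, assuming the hub's namespace and the affected user pod's name (both placeholders here), and assuming the z2jh chart's usual `component=user-scheduler` label on the scheduler pods:

```shell
# Placeholders: <namespace> is the hub's k8s namespace, <user-pod> is the affected
# pod (user pods are typically named jupyter-<username>).

# Events for the pod, including "Scheduled" and any "too many pods" rejections
kubectl -n <namespace> describe pod <user-pod>

# Logs from the user-scheduler pods deployed by the z2jh chart
kubectl -n <namespace> logs -l component=user-scheduler --tail=-1 | grep <user-pod>
```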

Possible theory

A race condition could be at play because we have multiple schedulers running on the cluster - the default cluster scheduler and the custom user scheduler. They could be racing each other, and the following sequence of events might be happening.

  1. User scheduler sees nodeA with 109 pods
  2. Cluster scheduler sees nodeA with 109 pods
  3. Cluster scheduler assigns podA to nodeA
  4. User scheduler doesn't update current status of nodeA, but assigns podB to nodeA
  5. Kubelet on nodeA pulls podA, and when it tries to pull podB it fails with this error message - 110 is the Kubernetes default per-node pod limit (a quick check for this is sketched below).
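To check whether the 110-pod ceiling is really what's being hit, a couple of standard kubectl queries (node name is a placeholder) show each node's pod capacity and what the kubelet reports:

```shell
# Pod capacity / allocatable pod count per node (defaults to 110 on most distributions)
kubectl get nodes -o custom-columns=NODE:.metadata.name,CAPACITY:.status.capacity.pods,ALLOCATABLE:.status.allocatable.pods

# Detailed view for one node, including the "Non-terminated Pods" count
kubectl describe node <node-name>
```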

Steps that could fix it

@GeorgianaElena
Member Author

The logs of the user-scheduler pod that assigned this user pod to a node are a relevant source of information I think, but probably quite technical in nature.

@consideRatio, I searched the logs for that user pod but didn't find anything, for some reason.

The user-scheduler is running a kube-scheduler binary with custom configuration. This kube-scheduler binary comes from a pinned version of k8s; maybe it should be updated, because the scheduling logic may be different now compared to before - or that version may have had a bug related to this, etc.

Not sure if it has anything to do with this issue, but I did see a "recommendation" in the Azure portal about "unsupported k8s version detected".

Running az aks get-versions --location canadacentral -o table returns:

KubernetesVersion    Upgrades
-------------------  -------------------------
1.23.3(preview)      None available
1.22.6               1.23.3(preview)
1.22.4               1.22.6, 1.23.3(preview)
1.21.9               1.22.4, 1.22.6
1.21.7               1.21.9, 1.22.4, 1.22.6
1.20.15              1.21.7, 1.21.9
1.20.13              1.20.15, 1.21.7, 1.21.9
1.19.13              1.20.13, 1.20.15
1.19.11              1.19.13, 1.20.13, 1.20.15

And I believe we're running the default 1.20.7?
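To confirm what the cluster is actually running, something along these lines should work (resource group and cluster name are placeholders):

```shell
# Control-plane version as reported by AKS
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query kubernetesVersion -o tsv

# API server and node versions as seen from kubectl
kubectl version --short
kubectl get nodes -o wide
```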

@damianavila
Contributor

"too many pods"

Are we maybe hitting the maximum number of pods the node can contain, and is that maybe somehow eliciting this behavior?
That used: 110 capacity:100 smells bad...

@GeorgianaElena
Member Author

That used: 110 capacity:100 smells bad...

@damianavila, I believe it's 110 with 110. But yes, I've checked and 110 is the max number of pods allowed per node.
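For reference, one way to count how many pods are actually on the suspect node (node name is a placeholder) and compare against that 110 limit:

```shell
# All pods (any namespace) currently assigned to the node
# (includes Succeeded/Failed pods; add status.phase=Running to the selector to narrow it)
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name> --no-headers | wc -l
```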

Findings

I analyzed the dashboards a bit and this is what it shows:

(screenshot attachment: toronto-dashboard)

  • The peaks are approx. 20 min after the timestamp of the reported error (by the way, the log timestamps are UTC, right?)

  • Also, I never see that node having 110 pods on it in the logs 🤔 It was at 108 at the time the logs reported the node as full.

  • There are also a few pods consuming a significant amount of RAM in that time interval.
    (screenshot attachment: high-mem)

  • A new node did come up 15 min after this error, even though the logs didn't show a scale-up event

(screenshot attachment: new-node)

@consideRatio
Member

I don't feel I have the capacity to debug this, but I want to throw in a quick workaround idea: make sure the user pods run out of schedulable memory on a node before they run out of the per-node "number of pods" limit and fail in an ugly way like this. If you do that workaround, remember to consider that some non-user pods may also be running on each node, so maybe aim for a maximum of 90 user pods per node, for example.

If this is done, also make sure to put in a note that the requested resources are tweaked to ensure this - and that it depends on the nodes you have in the cluster. So it is not something you could do if you have nodes with different capacities available for users.
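A minimal sketch of that workaround in the hub's z2jh config, assuming (hypothetically) nodes with roughly 100 GB of memory allocatable for user pods, so that a ~1.1 GB guarantee caps a node at about 90 user pods before the 110-pod limit is reached; the numbers would need to be recalculated per node type:

```yaml
# Hypothetical numbers; recalculate from the actual node's allocatable memory.
singleuser:
  memory:
    guarantee: 1.1G   # scheduler can pack at most ~90 user pods on a ~100 GB node
    limit: 2G
```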

@yuvipanda
Member

One possibility is that it's a race condition because we have multiple schedulers running on the cluster - the default cluster scheduler and the custom user scheduler. They could be racing each other, and the following sequence of events might be happening.

  1. User scheduler sees nodeA with 109 pods
  2. Cluster scheduler sees nodeA with 109 pods
  3. Cluster scheduler assigns podA to nodeA
  4. User scheduler doesn't update current status of nodeA, but assigns podB to nodeA
  5. Kubelet on nodeA pulls podA, and when it tries to pull podB it fails with this error message - 110 is the Kubernetes default per-node pod limit.

I encountered something like this (but not exactly this) at the Berkeley cluster. There, I worked around it by configuring the GKE cluster scheduler itself to behave closer to our user scheduler (see https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles). I don't think such a thing is possible in AKS yet. If that's the case, I think we should:

  1. See if the version of the user scheduler needs to be upgraded, and upgrade it
  2. See if more CPU and memory for the user scheduler would help (a config sketch follows below).
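For point 2, the z2jh chart exposes resource requests for the user-scheduler under `scheduling.userScheduler`; a hedged sketch of bumping them (the numbers are illustrative, not tested):

```yaml
scheduling:
  userScheduler:
    enabled: true
    resources:            # illustrative values; the chart's defaults are lower
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
```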

@consideRatio
Member

consideRatio commented Feb 22, 2022

See if the version of the user scheduler needs to be upgraded, and upgrade it

This can sometimes be tricky due to how kube-scheduler has changed over time, and given that we want to support a few versions of k8s at the same time. Here is a relevant note from z2jh about the need to bump RBAC resources:

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/6f4609e2f73258c7e0146f8e849d31aa7fe7f09f/jupyterhub/values.yaml#L432-L436

I'll look into this right now a bit. See jupyterhub/zero-to-jupyterhub-k8s#2590

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Feb 24, 2022
GKE supports setting a cluster autoscaler profile that
optimizes utilization[1]. This changes the behavior of
the autoscaler *and* the scheduler, packing pods in tighter
into nodes rather than 'spreading' them around. This is
what the z2jh user-scheduler primarily does. We can actually
do away with the user-scheduler if we use the OPTIMIZE_UTILIZATION
autoscaler profile - Berkeley has been running under this setup
for a few months now. This reduces how much resources our core
nodes need, and the GKE provided scheduler is also faster.
This will also prevent issues like
2i2c-org#1004.

[1]: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#autoscaling_profiles
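For reference, the autoscaling profile mentioned in the commit can be set on an existing GKE cluster with gcloud (cluster name and zone are placeholders):

```shell
gcloud container clusters update <cluster-name> \
  --zone <zone> \
  --autoscaling-profile optimize-utilization
```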
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Feb 28, 2022 (same commit message as above)
@GeorgianaElena
Member Author

I'll look into this right now a bit. See jupyterhub/zero-to-jupyterhub-k8s#2590

Thanks for pushing this forward @consideRatio!

With #1026, I believe we should close this issue once #1087 that updates the kube-scheduler is merged.

What do you all think? I figure that, since this is most likely a race, reproducing it will be hard. We could also increase the resources of the scheduler before closing this, as per @yuvipanda's suggestion.

@consideRatio
Member

consideRatio commented Mar 11, 2022

The user pods declare the scheduler they want to be scheduled by, so I'm confident it isn't a race between schedulers trying to schedule the same pod(s). If many pods are to be scheduled on the same node by different schedulers, maybe there could be some race condition that would be avoided by having pure user nodes with only user pods (and a few system pods).

But the user-scheduler was a configured kube-scheduler from k8s 1.19 - what was the k8s api-server version? They are supposed to have the same version, but a mismatch may still work, and it has worked great for as long as the user-scheduler has been around.
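A quick way to compare the two versions (namespace is a placeholder; the z2jh chart names the deployment `user-scheduler`):

```shell
# kube-scheduler version baked into the user-scheduler image tag
kubectl -n <namespace> get deployment user-scheduler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# API server version ("Server Version" line)
kubectl version --short
```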

@yuvipanda
Member

If many pods are to be scheduled on the same node, by different schedulers, maybe there could be some race condition that would be avoided by having pure user nodes with only user pods (and a few system pods).

Yeah, this was my assumption, as system pods are not scheduled by the user-scheduler. But there aren't a lot of system pods around, and this occurs too often for that.

k8s is 1.20.

@choldgraf
Member

Hey all - I see that in this comment:

With #1026, I believe we should close this issue once #1087 that updates the kube-scheduler is merged.

We suggested closing this one once those two PRs are merged. They are now both merged. Should we close this issue? If not, what is our next step here?

@yuvipanda
Member

@choldgraf I don't think #1087 includes the updated kube-scheduler; that would probably be in #1055

@yuvipanda
Member

Ah, and #1026 doesn't affect this because that only works for GKE clusters.

@consideRatio
Member

I saw:

[x] Update kube-scheduler version -> Revert "Revert "Upgrade to z2jh 1.2.0"" #1087

Note that the update of kube-scheduler won't happen until z2jh 2.0.0 is used.

@choldgraf
Member

choldgraf commented Mar 24, 2022

OK so it sounds like this one will be resolved (hopefully!) when this issue is resolved:

I'll add that to the top comment, and since this is an improvement over time and not a continuous active fire, I think that we can close out this incident. I'll let the Toronto folks know to watch that issue.

If nobody objects in an hour or two, I'll close this and email them w/ this info

@damianavila
Contributor

If nobody objects in an hour or two, I'll close this and email them w/ this info

Did you email them back, @choldgraf? I do not see a note on the ticket: https://2i2c.freshdesk.com/a/tickets/79
But I am not sure if you had some other email exchange with them.

@choldgraf
Member

choldgraf commented Apr 2, 2022

@damianavila good point - I've sent them an email today!

@damianavila damianavila moved this from Blocked to Ready to work in DEPRECATED Engineering and Product Backlog Apr 12, 2022
@damianavila
Contributor

Since the email has been sent, let's close the incident!
