
Kurtosis schedules pods to the same node, even if multiple nodes are available #953

Closed
barnabasbusa opened this issue Jul 20, 2023 · 2 comments
Labels
bug Something isn't working painful Painful bug

Comments

@barnabasbusa
Collaborator

What's your CLI version?

0.80.12

Description & steps to reproduce

I use DigitalOcean as my Kubernetes cluster provider. The cluster currently has 2 nodes, with the maximum node count set to 6.

NAME                       STATUS   ROLES    AGE   VERSION
chaos-ams3-default-f1158   Ready    <none>   2d    v1.27.2
chaos-ams3-default-fzppo   Ready    <none>   92m   v1.27.2

I used the config below to deploy a workload on this cluster:

{
  "participants": [
    {
      "el_client_type": "geth",
      "el_client_image": "ethpandaops/geth:4844-devnet-6-e03b5ad",
      "cl_client_type": "lighthouse",
      "cl_client_image": "ethpandaops/lighthouse:boxed-blobs-d534ac0",
      "count": 8
    }
  ],
  "network_params": {
    "deneb_fork_epoch": 3
  },
  "launch_additional_services": true,
  "wait_for_finalization": false,
  "wait_for_verifications": false,
  "global_client_log_level": "info"
}

It's expected to spin up 8 pairs of Ethereum nodes.

However, most of these pods are getting killed because they run out of resources.

When inspecting the cluster, I can see that all the pods were scheduled onto the same node, which does not have enough resources to run all of these containers:

k -n kurtosis-enclave--5f9291d1b4414340a936ee2717bbfd2c get pods -owide
NAME                             READY   STATUS                   RESTARTS   AGE     IP             NODE                       NOMINATED NODE   READINESS GATES
cl-1-lighthouse-geth             1/1     Running                  0          2m22s   10.244.0.72    chaos-ams3-default-fzppo   <none>           <none>
cl-1-lighthouse-geth-validator   1/1     Running                  0          2m10s   10.244.0.6     chaos-ams3-default-fzppo   <none>           <none>
cl-2-lighthouse-geth             1/1     Running                  0          2m7s    10.244.0.85    chaos-ams3-default-fzppo   <none>           <none>
cl-2-lighthouse-geth-validator   1/1     Running                  0          2m3s    10.244.0.25    chaos-ams3-default-fzppo   <none>           <none>
cl-3-lighthouse-geth             1/1     Running                  0          118s    10.244.0.83    chaos-ams3-default-fzppo   <none>           <none>
cl-3-lighthouse-geth-validator   1/1     Running                  0          114s    10.244.0.93    chaos-ams3-default-fzppo   <none>           <none>
cl-4-lighthouse-geth             1/1     Running                  0          110s    10.244.0.26    chaos-ams3-default-fzppo   <none>           <none>
cl-4-lighthouse-geth-validator   1/1     Running                  0          104s    10.244.0.16    chaos-ams3-default-fzppo   <none>           <none>
cl-5-lighthouse-geth             1/1     Running                  0          100s    10.244.0.84    chaos-ams3-default-fzppo   <none>           <none>
cl-5-lighthouse-geth-validator   1/1     Running                  0          96s     10.244.0.43    chaos-ams3-default-fzppo   <none>           <none>
cl-6-lighthouse-geth             1/1     Running                  0          92s     10.244.0.47    chaos-ams3-default-fzppo   <none>           <none>
cl-6-lighthouse-geth-validator   1/1     Running                  0          87s     10.244.0.21    chaos-ams3-default-fzppo   <none>           <none>
el-1-geth-lighthouse             0/1     OOMKilled                0          3m59s   10.244.0.26    chaos-ams3-default-fzppo   <none>           <none>
el-2-geth-lighthouse             0/1     OOMKilled                0          3m51s   10.244.0.120   chaos-ams3-default-fzppo   <none>           <none>
el-3-geth-lighthouse             0/1     OOMKilled                0          3m47s   10.244.0.37    chaos-ams3-default-fzppo   <none>           <none>
el-4-geth-lighthouse             1/1     Running                  0          3m42s   10.244.0.3     chaos-ams3-default-fzppo   <none>           <none>
el-5-geth-lighthouse             0/1     ContainerStatusUnknown   1          3m36s   10.244.0.27    chaos-ams3-default-fzppo   <none>           <none>
el-6-geth-lighthouse             0/1     OOMKilled                0          3m28s   10.244.0.30    chaos-ams3-default-fzppo   <none>           <none>
el-7-geth-lighthouse             0/1     OOMKilled                0          3m17s   10.244.0.13    chaos-ams3-default-fzppo   <none>           <none>
el-8-geth-lighthouse             1/1     Running                  0          3m10s   10.244.0.48    chaos-ams3-default-fzppo   <none>           <none>
kurtosis-api                     1/1     Running                  0          5m21s   10.244.0.81    chaos-ams3-default-fzppo   <none>           <none>

I have a feeling that Kurtosis is somehow trying to handle pod scheduling itself instead of letting the Kubernetes scheduler do it.

Desired behavior

Inspect how many nodes are available and, based on that, distribute the node pairs round-robin across the different machines (roughly as sketched below).
Working some magic with an autoscaler would be icing on top.
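
A hypothetical sketch of what that round-robin distribution could look like (not existing Kurtosis code; the function name and pair labels are made up purely for illustration):

def assign_pairs_round_robin(node_names, pair_count):
    # Spread each EL/CL pair across the available nodes in round-robin order:
    # pair 1 -> node 1, pair 2 -> node 2, pair 3 -> node 1, ...
    assignments = {}
    for i in range(pair_count):
        assignments["pair-%d" % (i + 1)] = node_names[i % len(node_names)]
    return assignments

# With the two nodes above and the 8 pairs from the config:
# assign_pairs_round_robin(["chaos-ams3-default-f1158", "chaos-ams3-default-fzppo"], 8)
# puts pairs 1, 3, 5, 7 on f1158 and pairs 2, 4, 6, 8 on fzppo.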

What is the severity of this bug?

Painful; this is causing significant friction in my workflow.

@barnabasbusa barnabasbusa added the bug Something isn't working label Jul 20, 2023
@github-actions github-actions bot added the painful Painful bug label Jul 20, 2023
@mieubrisse
Collaborator

mieubrisse commented Jul 20, 2023

Oh that's super weird; we don't touch the scheduling algorithm at all - just throw Pods at Kubernetes and let it do its thing. I suspect that it's related to your discussion on #952, where - because the resource limits aren't getting set - Kubernetes thinks "oh these are very light Pods" and just throws them all on the same node, but in reality they're very heavy. If you were to hack in a min_memory requirement to the ServiceConfig that you're using, does that temporarily solve the issue?
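
For reference, a minimal sketch of what that workaround might look like in a package's Starlark, assuming ServiceConfig accepts min_memory / min_cpu fields that are translated into Kubernetes resource requests (the service name and values below are placeholders, not the ethereum-package's actual config):

def run(plan):
    geth_config = ServiceConfig(
        image = "ethpandaops/geth:4844-devnet-6-e03b5ad",
        min_memory = 2048,  # assumed to be megabytes requested per pod
        min_cpu = 1000,     # assumed to be millicores requested per pod
    )
    plan.add_service(name = "el-geth", config = geth_config)

With resource requests set, the Kubernetes scheduler has real numbers to bin-pack against, so it should stop piling every pod onto a single node once that node's requested capacity is used up.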

@mieubrisse
Collaborator

And re.

Working some magic with an autoscaler would be icing on top.

Coming in the next 1-2 months ;)
