
Endless nodes are created after expireAfter elapses on a node in some scenarios #1842

Open
otoupin-nsesi opened this issue Nov 25, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@otoupin-nsesi

otoupin-nsesi commented Nov 25, 2024

Description

Observed Behavior:

After expireAfter elapses on a node, pods start to get evicted, and new nodes are created endlessly in an attempt to schedule those pods. Also, pods that don't have PDBs are NOT evicted.

Expected Behavior:

After expireAfter elapses on a node, pods start to get evicted, and at most one node is created to schedule those pods. Also, pods that don't have PDBs are evicted. There may be the odd pod whose PDB prevents the node from getting recycled, but in that case we can set terminationGracePeriod.

Reproduction Steps:

  1. Have one CloudNativePG database in the cluster (or a similar workload: a single replica plus a PDB; see the manifest sketch after these steps).
  2. CloudNativePG will add a PDB to the primary.
  3. Have a nodepool with a relatively short expiry (expireAfter). In our case dev environments are set to 24h, so we caught this early.
  4. Once a node expires, a strange behaviour is triggered:
    1. As expected, in v1 expiries are now forceful, so Karpenter begins to evict the pods.
    2. As expected, a new node is spun up to take up the slack.
    3. But then the problems start:
      1. Since there is a PDB on a single replica (there is only one PG primary at a time), eviction does not happen. So far so good (this is also the old behaviour; in v0.37.x the node simply can't expire until we restart the database manually or kill the primary).
      2. However, none of the other pods on this node are evicted either, even though the documentation and the log messages suggest they should be.
      3. The new node from earlier is nominated for those pods, but they never transfer to it, because they are never evicted.
      4. Then, at the next batch of pod scheduling, we get "found provisionable pod(s)" again, and a new nodeclaim is added (for the same pods as earlier).
      5. And again
      6. And again
      7. And again
    4. So we end up in a situation where we have a lot of unused nodes, containing only daemonsets and new workloads.
  5. At that point I restarted the database, the primary moved, the PDB was removed, and everything then slowly healed. However, there was no sign of the "infinite nodeclaim creation" ever ending before that.
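For anyone without a CloudNativePG cluster handy, a minimal stand-in for steps 1 and 2 (names here are hypothetical, not taken from our environment) would be a single-replica Deployment guarded by a PDB with minAvailable: 1:

```yaml
# Hypothetical stand-in for the CloudNativePG primary: one replica that a PDB
# will never allow to be evicted voluntarily.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fake-primary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fake-primary
  template:
    metadata:
      labels:
        app: fake-primary
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
---
# With a single replica, minAvailable: 1 means the eviction API can never succeed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: fake-primary
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: fake-primary
```

Pair this with a nodepool whose expireAfter is short (24h in our case), and once a node hosting fake-primary expires, the behaviour described above should reproduce.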

We believe this is a bug. We couldn't find a workaround (aside from removing expireAfter), and have reverted to the v0.37.x series for now.

A few clues:
The state of the cluster 30m-45m after expiry. Node 53-23 is the one that expired. Any nodes younger than 30min are running mostly empty (aside from daemonsets).

[Screenshot: node-create-hell-clean]

On the expired node, the pods are nominated to be scheduled on a different node, but as you can see, that never happens.

NOTE: I don't recall 100% whether this screenshot shows the CloudNativePG primary itself or one of its neighbouring pods, but I believe it's the primary.

[Screenshot: node-should-schedule]

And finally, the log line that appears after every scheduling event, saying it found provisionable pod(s); each occurrence precedes a new, unnecessary nodeclaim:

karpenter-5d967c944c-k8xb8 {"level":"INFO","time":"2024-11-13T22:47:24.148Z","logger":"controller","message":"found provisionable pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"7c981fa7-3071-4de8-87b3-370a15664ba7","Pods":"monitoring/monitoring-grafana-pg-1, kube-system/coredns-58745b69fb-sd222, cnpg-system/cnpg-cloudnative-pg-7667bd696d-lrqvb, kube-system/aws-load-balancer-controller-74b584c6df-fckdn, harbor/harbor-container-webhook-78657f5698-kmmrz","duration":"87.726672ms"}

Versions:

  • Chart Version: 1.0.7
  • Kubernetes Version (kubectl version): v1.29.10

Extra:

  • I would like to build or modify a test case to prove and diagnose this behaviour; any pointers? I've looked at the source code, but wanted to post this report first to gather feedback.
  • Is there any other workaround aside from disabling expireAfter on the node pool?
  • Finally, in our context this bug is triggered by CloudNativePG primaries, but it would apply to any workload with a single replica and a PDB with minAvailable: 1.
@otoupin-nsesi otoupin-nsesi added the kind/bug Categorizes issue or PR as related to a bug. label Nov 25, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 25, 2024
@danielloader

@jonathan-innis
Member

Can you share the PDB that you are using and the StatefulSet/Deployment? From looking at the other thread, it sounds like there may be something else blocking Karpenter from performing the eviction that needs to be integrated with Karpenter's down-scaling logic.

@jonathan-innis
Member

/triage needs-information

@sidewinder12s

sidewinder12s commented Dec 10, 2024

From the linked issue, it sounds like they configure a PDB that will block pod termination forever.

I think I am also seeing behavior similar to this with the do-not-evict annotation on pods blocking pod termination. You can observe something similar by running Karpenter with a deployment that has a topologySpreadConstraint, around 15 replicas, and an expireAfter period of about 10m (see the sketch below).

I'm using v1.0.8
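A rough sketch of that repro (names and values here are illustrative, not an exact manifest); note that in the v1 API the annotation is karpenter.sh/do-not-disrupt, which replaced the pre-v1 karpenter.sh/do-not-evict:

```yaml
# Illustrative deployment: 15 spread-out replicas whose pods opt out of disruption,
# running on a NodePool with expireAfter set to roughly 10m.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-blockers
spec:
  replicas: 15
  selector:
    matchLabels:
      app: spread-blockers
  template:
    metadata:
      labels:
        app: spread-blockers
      annotations:
        karpenter.sh/do-not-disrupt: "true"  # blocks voluntary disruption of these pods
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-blockers
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```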

@sidewinder12s

Actually, I just tested this again. Letting Karpenter run in that configuration, with 15 pods blocking node termination, put Karpenter into a bad state where it seemed unable to scale down nodes, logging a lot of this "waiting on cluster sync" message:

{"level":"DEBUG","time":"2024-12-10T23:30:54.287Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"c8f4afc7-527c-491f-bd3c-e73f119dcc30"}
{"level":"DEBUG","time":"2024-12-10T23:30:55.289Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"600f540f-60cb-498b-9b94-c72e4bf5a4d4"}
{"level":"DEBUG","time":"2024-12-10T23:30:56.292Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"7e5a4f3d-c985-40ef-b46b-eca1925ce2ee"}
{"level":"DEBUG","time":"2024-12-10T23:30:57.294Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"9719c8af-4cbd-4a01-bd81-8c57c5a8c482"}
{"level":"DEBUG","time":"2024-12-10T23:30:58.296Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"63b7771e-b55e-4b41-a313-cac4d9ebc53a"}
{"level":"DEBUG","time":"2024-12-10T23:30:58.644Z","logger":"controller","caller":"reconcile/reconcile.go:142","message":"deleting expired nodeclaim","commit":"a2875e3","controller":"nodeclaim.expiration","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"use1-test01-default-spot-kjsx9"},"namespace":"","name":"use1-test01-default-spot-kjsx9","reconcileID":"3224ba1b-82e6-4989-b77f-ea08e798ba2c"}
{"level":"DEBUG","time":"2024-12-10T23:30:59.080Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"222b8e98-b936-4e43-b696-f8a38ab4f78d"}
{"level":"DEBUG","time":"2024-12-10T23:30:59.297Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"97b9d15d-133c-4d6e-991d-7b688a48e2ef"}

@bartoszgridgg

We just ran into the same issue with the do-not-evict annotation: each pod gets a new node, and we end up with a massive number of underutilized nodes.

@heybronson

We just experienced this issue: all nodeclaims expired, and Karpenter was stuck on "waiting on cluster sync" indefinitely. We had to remove the nodeclaims manually.

@saurav-agarwalla
Contributor

I put up a PR to solve this for pods with karpenter.sh/do-not-disrupt=true since they can't really reschedule to another node: #2033

This was the same issue that was raised in aws/karpenter-provider-aws#7521.

But regarding the original expectation:

> After expireAfter elapses on a node, pods start to get evicted, and at most one node is created to schedule those pods. Also, pods that don't have PDBs are evicted. There may be the odd pod whose PDB prevents the node from getting recycled, but in that case we can set terminationGracePeriod.

You are correct that terminationGracePeriod is the way to handle this. If PDBs indefinitely hold up a nodeclaim from terminating, those pods will not move to a new node.

The docs talk more about this:

> Warning
> Misconfigured PDBs and pods with the karpenter.sh/do-not-disrupt annotation may block draining indefinitely. For this reason, it is not recommended to set expireAfter without also setting terminationGracePeriod if your cluster has pods with the karpenter.sh/do-not-disrupt annotation. Doing so can result in partially drained nodes stuck in the cluster, driving up cluster cost and potentially requiring manual intervention to resolve.
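As a concrete illustration of pairing the two settings (the durations below are just examples, not recommendations), both fields live on the NodePool's node template in the v1 API:

```yaml
# Example NodePool fragment: expiry combined with a drain deadline, so a blocking
# PDB or a do-not-disrupt pod cannot hold an expired node indefinitely.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:                  # assumes the AWS provider's EC2NodeClass
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      expireAfter: 24h               # node becomes eligible for forceful expiry after 24h
      terminationGracePeriod: 1h     # after 1h of draining, remaining pods are force-deleted
```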

@saurav-agarwalla
Contributor

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 26, 2025
@jorgeperezc

jorgeperezc commented Feb 27, 2025

> I put up a PR to solve this for pods with karpenter.sh/do-not-disrupt=true since they can't really reschedule to another node: #2033
>
> This was the same issue that was raised in aws/karpenter-provider-aws#7521.

That PR is not the solution we were expecting.

> But regarding the original expectation:
>
> > After expireAfter elapses on a node, pods start to get evicted, and at most one node is created to schedule those pods. Also, pods that don't have PDBs are evicted. There may be the odd pod whose PDB prevents the node from getting recycled, but in that case we can set terminationGracePeriod.
>
> You are correct that terminationGracePeriod is the way to handle this. If PDBs indefinitely hold up a nodeclaim from terminating, those pods will not move to a new node.
>
> The docs talk more about this:
>
> > Warning
> > Misconfigured PDBs and pods with the karpenter.sh/do-not-disrupt annotation may block draining indefinitely. For this reason, it is not recommended to set expireAfter without also setting terminationGracePeriod if your cluster has pods with the karpenter.sh/do-not-disrupt annotation. Doing so can result in partially drained nodes stuck in the cluster, driving up cluster cost and potentially requiring manual intervention to resolve.

The documentation you mentioned was added after our reports, a clever way to avoid the problem. Defining terminationGracePeriod is also not the solution: if your workload doesn't terminate quickly, you could end up provisioning resources that you won't use.

This is explained here in case you'd like to review it: #1928
