Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle unset apiVersion and kind in nodeClassRef #909

Closed
jonathan-innis opened this issue Dec 24, 2023 · 14 comments · Fixed by #1360
Closed

How to handle unset apiVersion and kind in nodeClassRef #909

jonathan-innis opened this issue Dec 24, 2023 · 14 comments · Fixed by #1360
Labels
kind/feature Categorizes issue or PR as related to a new feature. v1 Issues requiring resolution by the v1 milestone

Comments

@jonathan-innis
Copy link
Member

jonathan-innis commented Dec 24, 2023

Description

What problem are you trying to solve?

As part of #337, we need to know the apiVersion and the kind of the nodeClassRef. This is also true when it comes to supporting #493 where we would want to roll-up NodeClass readiness into NodePool readiness so that we can determine if a NodePool can be provisioned with based on both whether the NodePool is in a "Ready" state as well as when the NodeClass is in a "Ready" state. You can imagine something like the example below when it comes to reasoning about NodeClass readiness and NodePool readiness.

NodeClass Readiness
status:
  conditions:
  - type: Ready
    status: "True"
NodePool Readiness
status:
  conditions:
  - type: Ready
    status: "True"
  - type: NodeClassReady
    status: "True"

Right now, there is no hard requirement to set apiVersion and kind in your nodeClassRef (only name) since the CloudProviders who are actually grabbing the nodeClassRef to pull true type for GetInstanceTypes and Create calls already know the type under the hood. For AWS, this type is always EC2NodeClass; for Azure, this type is always AKSNodeClass today. This works fine for now; however, this becomes problematic when we want to grab the underlying resource dynamically from the neutral code using the dynamic client, like we would need to do if we were trying to grab the readiness of the NodeClass to bubble-up that readiness into the NodePool.

We need to have some way to infer the actual type and the name field isn't enough. Ideally, we would require that the apiVersion and kind always be set in the NodePool's nodeClassRef; however, doing so out of the gate would be a breaking change and would break existing resources which don't already have this field set on the cluster. There are a couple thoughts that I had here around how to achieve this:

Solution 1: Runtime Defaults through cloudprovider.go

Codify the concept of a "default" NodeClass. This means that a CloudProvider would set some global variables var DefaultNodeClassKind string and var DefaultNodeClassApiVersion string in the cloudprovider.go file which Karpenter would cast an object into if the type wasn't explicitly set on the nodeClassRef. This works fine, but probably doesn't have good, long-term extensibility. It's also less observable to set this at runtime vs. having it statically resolve into the NodePool. You could also imagine a way to do this through something like a CRD annotation that would set the default NodeClass through the discovery client (in a way very similar to the default storage class). This is really a fancier version of the in-code methodology so for the purposes of thinking about the best decision here, I'm throwing it into the same bucket.

Solution 2: Static Defaults through OpenAPI

The second option is to have the CloudProviders today default the apiVersion and the kind directly into the OpenAPI spec so that the string values are automatically resolved into the NodePool statically. That basically means that if someone submitted the following nodeClassRef through their manifest

Submitted Manifest
nodeClassRef:
  name: default
Resolved Manifest (for AWS)
nodeClassRef:
  apiVersion: karpenter.k8s.aws/v1beta1
  kind: EC2NodeClass
  name: default

We can allow these values to be defaulted for some time and then require them to be set explicitly at v1. This would allow existing resources that don't have these fields set to have a migration path to be fully required to be set down the line. The kicker of this solution: We can't update the nodeClassRef defaulting without causing a breaking change to the hash mechanism that we are using for drift. If we updating the defaulting mechanism outright, we would cause all nodes on the cluster to drift and get rolled on upgrade, which we obviously want to avoid. One solution to the drift problem is:

  1. Create a second annotation which we call karpenter.sh/nodepool-hash-version
  2. Set this value to v2 on NodePool's and re-hash the NodePool when we upgrade to the new version
  3. Re-inject the karpenter.sh/nodepool-hash based on the current hash value of the owning NodePool. Update the karpenter.sh/nodepool-hash and set karpenter.sh/nodepool-hash-version to v2 on startup
  4. Don't let any NodeClaim/Node drift without karpenter.sh/nodepool-hash-version annotation set to equal the karpenter.sh/nodepool-hash-version of the owning NodePool

The upshot of Solution 2 is that we would need to coordinate setting the OpenAPI defaults with the other cloud providers that we support on the same minor version. That way, each cloudprovider got the benefit of us bumping the hash version alongside us setting static defaults for nodeClassRefs.

How important is this feature to you?

This unblocks us in different ways where we would like to do dynamic watches based on unknown apiVersion/group and kind but have some of these values unset in NodePool's today.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis
Copy link
Member Author

@jonathan-innis jonathan-innis added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/feature labels Dec 26, 2023
@njtran
Copy link
Contributor

njtran commented Dec 28, 2023

You could also imagine a way to do this through something like a CRD annotation that would set the default NodeClass through the discovery client (in a way very similar to the default storage class). This is really a fancier version of the in-code methodology so for the purposes of thinking about the best decision here, I'm throwing it into the same bucket.

Not quite following this point. Is this to say that CPs would vend an annotation on their NodeClass CRD like karpenter.sh/nodeclass: owned or something like that so Karpenter would list CRDs that have that annotation, and infer the GVK from the object it discovered?

We can't update the nodeClassRef defaulting without causing a breaking change to the hash mechanism that we are using for drift. If we updating the defaulting mechanism outright, we would cause all nodes on the cluster to drift and get rolled on upgrade, which we obviously want to avoid. One solution to the drift problem is:

Can't we just make the kind and apiVersion fields non-hashed? I'm not sure if it wires through properly, being a referenced struct, but we should check the internal implementation of the hashing function here to see if it recursively hashes.

@jonathan-innis
Copy link
Member Author

or something like that so Karpenter would list CRDs that have that annotation, and infer the GVK from the object it discovered

Yep, you could imagine that a Cloud Provider adds annotations to NodeClasses to tell the kubernetes-sigs/karpenter library that this CRD is a NodeClass and should be watched.

Can't we just make the kind and apiVersion fields non-hashed

Changing from using it in the hash to not using it in the hash is a breaking change. The hashing mechanism does currently recursively hash everything. You can imagine that it basically does a JSON serialization and then hashes that, which means that it includes the current apiVersion and kind if they are set.

@njtran
Copy link
Contributor

njtran commented Dec 29, 2023

Changing from using it in the hash to not using it in the hash is a breaking change. The hashing mechanism does currently recursively hash everything. You can imagine that it basically does a JSON serialization and then hashes that, which means that it includes the current apiVersion and kind if they are set.

Right, so if it recursively respects hash settings per-field, if we add the Kind and APIVersion to the spec, we also add the nohash annotation, which would be the same as the previous version, and there would be no changes to the hashing function, right?

The hashing compatibility is only an issue when we want to remove fields that were previously hashed, as going from hash(n fields) -> hash(n+1 fields) is injective, and hash(n fields) -> hash(n-1 fields) is surjective.

Yep, you could imagine that a Cloud Provider adds annotations to NodeClasses to tell the kubernetes-sigs/karpenter library that this CRD is a NodeClass and should be watched.

Doesn't this assume that there's only one NodeClass CRD for a given CP? We'd need to explicitly add a apiVersion and kind later in the NodePool spec to support this, which makes our solution less flexible to changes here. Also, I recognize this is opening a can of worms into a whole new ideological discussion about if multiple NodeClass CRDs even make sense.

@sftim
Copy link

sftim commented Dec 29, 2023

Someone may one day want to allow people to migrate from:

nodeClassRef:
  apiVersion: karpenter.k8s.aws/v1
  kind: EC2NodeClass
  name: default

to

nodeClassRef:
  apiVersion: karpenter.k8s.aws/v2alpha1 # new API version
  kind: EC2NodeClass
  name: default

(or a variant on that scheme, such as dual-running two NodePools as part of the rollout).
A cloud provider could also want to rename their product. Unlikely for EC2 but someone somewhere will want to.

How about:

  1. change the provider-specific controller code to mutate any nodeClassRef to be specific to the API group/version/kind that the code recommends, but then put that behind 2 command line options. Defaulting the apiVersion I'd have as on-by-default; defaulting the kind I would have off-by-default at tool level (and then on-by-default at Helm chart level; more on that later in this comment).
  2. switch to a v1beta2 API for NodePool, and for that API version require that nodeClassRef has an apiVersion and kind.

and

  1. encourage provider-specific implementations to offer a mutating admission webhook,, that infers the apiVersion and kind provided it's unambiguous to do so. The webhook can also warn people that they left some fields out.

Helm charts etc could have the controller's mutation of existing objects on by default and have the webhook on by default. Later down the line, eg v1beta3, switch the controller to not mutate existing kind fields but leave the webhook active. The mutating webhook remains as something that you typically deploy, but don't have to.


BTW: Kubernetes API versioning aims to support round-trip compatibility between stable versions, eg karpenter.k8s.aws/v1 to karpenter.k8s.aws/v2 (eg: you can create an object using the v2 API and read it back with the v1 API, mutate it, and then read that back using the v2 API).

We don't promise it works fine between a stable version and a newer alpha, though for core Kubernetes we do try hard. This is relevant because we probably do want experimenters people to try out future API versions before we stabilize them.

@jonathan-innis
Copy link
Member Author

jonathan-innis commented Dec 30, 2023

which would be the same as the previous version, and there would be no changes to the hashing function, right

Unfortunately, this isn't correct. We are hashing the entirety of the nodeClassRef today, including the kind and apiVersion values that are currently able to be set. We can't know if a user has set this value or not, so if we start defaulting this value while still hashing kind and apiVersion, we will roll nodes for users that previously hadn't set this value. If we stop hashing these fields, this will roll nodes for users that had previously set the values.

In either case, I think we want to hash on these values, so I think our current behavior is correct. The main concern here is just handling avoiding node rolling if we start defaulting it.

Doesn't this assume that there's only one NodeClass CRD for a given CP?

It doesn't necessarily imply that. You could imagine that you could create an annotation solution that had karpenter.sh/nodeclass: true for all NodeClass CRDs and then karpenter.sh/nodeclass-default: true for the default one. I don't personally this is the direction that we should go, but this is one way to make this solution viable for multiple NodeClasses.


Someone may one day want to allow people to migrate from

Personally, this is why I think we should use group instead of apiVersion. I'm not sure there's much point in having someone differentiate versions like this considering that versions should convertible between each other like you called out (though happy to discuss if you think that there are merits to having version specified rather than just the group).

switch to a v1beta2 API for NodePool

If we bumped to a new apiVersion and started requiring the value to be set in v1beta2, I'm imagining that the code would have to start automatically setting the nodeClassRef for any Update/Patch calls which means that there's going to be a change that isn't round-trippable from v1beta1 (since we won't know which data has been injected into the spec going from v1beta1 to v1beta2), unless v1beta1 chooses to default fields that aren't set outright.

This is part of the reason that I was thinking that it would be reasonable to default in the OpenAPI spec for the CRDs that the CloudProviders release (I see this as an equivalent solution to the webhook solution since it's basically a defaulting webhook that just uses the OpenAPI schema to set the default). Setting the default in this way seems reasonable to me, but we still have to overcome the hash value used for drift changing here.

@tallaxes
Copy link

In either case, I think we want to hash on these values, so I think our current behavior is correct.

Why do we want to include the reference itself in the hash? There is a separate hash on NodeClass itself, so there is already a mechanism for detecting material change there. It seems it is a detected change in a combination of NodePool and NodeClass that indicates drift, the reference between them should not matter? (Except for the purpose of actually locating the corresponding NodeClass ...) Or am I missing something?

(I understand the unfortunate implications of possibly changing the hashing function, just wondering about "our current behavior is correct" ...)

@jonathan-innis
Copy link
Member Author

Why do we want to include the reference itself in the hash? There is a separate hash on NodeClass itself, so there is already a mechanism for detecting material change there

Changing the hash when the reference changes is the least surprising of the two options IMO. For instance, the AWS provider tags instances with the karpenter.k8s.aws/ec2nodeclass tag and if Karpenter didn't drift when there was a change in the nodeClassRef, this tag wouldn't change for instances and Karpenter wouldn't drift existing NodeClaims onto the new tag, which is a tad confusing. Given that users are generally not going to be creating multiple NodeClasses unless there are marked differences between them, I think that 99.9% of the time the NodeClaim would drift in either case, but this strikes me as the most expected behavior.

@tallaxes
Copy link

For instance, the AWS provider tags instances with the karpenter.k8s.aws/ec2nodeclass tag and if Karpenter didn't drift when there was a change in the nodeClassRef, this tag wouldn't change for instances and Karpenter wouldn't drift existing NodeClaims onto the new tag, which is a tad confusing.

I see. Maybe another way to think about this is that, since NodeClass name itself happens to be part of / affects the desired state of resources (at least in AWS provider), it should be included in the computed hash of NodeClass? Would that achieve the same effect? (I do realize that most changes to hash logic are breaking, just trying to understand the tradeoffs.)

It also occurs to me that this tag (and maybe resource tagging in general) may be yet another good candidate for in-place drift?

@jonathan-innis
Copy link
Member Author

may be yet another good candidate for in-place drift

Yeah, I could definitely see an argument for that. Today, we don't in-place drift anything, but I could see us configuring to do this for certain resources. Out of curiousity, what's the benefit for you of not drifting on changes to the nodeClassRef?

@tallaxes
Copy link

Out of curiousity, what's the benefit for you of not drifting on changes to the nodeClassRef?

This is cited as one for the reasons for having to do drift hash versioning, so got me wondering why do it to begin with. Other than that - no particular benefit, just trying to get clarity on where a change in reference properly fits in the drift framework, and understand the forces and tradeoffs between options.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2024
@jonathan-innis
Copy link
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. v1 Issues requiring resolution by the v1 milestone
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants