-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop remove operations from webhook patches #3132
Drop remove operations from webhook patches #3132
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
49ebfba
to
8b5b1ce
Compare
/cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just confirmation. Have you ever seen any dropping schemes in the reconcilers/controllers?
For sure, I have seen your comment mentioning the investigations: #2878 (comment)
"sigs.k8s.io/controller-runtime/pkg/webhook/conversion" | ||
) | ||
|
||
// This code is copied from https://github.com/kubernetes-sigs/controller-runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the exact file name is /pkg/builder/webhook.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, added filename and hash.
that could be caused by unkown fields in the go types Change-Id: Iabc66880aef0227f60ec3deec9109cfb255fe2da
8b5b1ce
to
e58ee5f
Compare
Change-Id: I4de1ebbe75a9220466c50381d0e708a1b6d23c8f
e58ee5f
to
de03011
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm.
Change-Id: I232b1f2cd1a8a0c6b2f2fa7506da4d17d3e3ae55
0d84c42
to
c2f6eca
Compare
// that are not present in the go types, which can occur when Kueue libraries are behind the latest | ||
// released CRDs. | ||
// Dropping the "remove" operations is safe because Kueue's job mutators never remove fields. | ||
func (h *losslessDefaulter) Handle(ctx context.Context, req admission.Request) admission.Response { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was chatting with @jpbetz to get his opinion on my former approach: doing a round-trip of the raw object. This proposal was rejected by the maintainers of controller-runtime because it might break backwards compatibility when a go type removes an omitempty clause that was set for a field before kubernetes-sigs/controller-runtime#2931 (comment)
He agreed that both approaches (current behavior of controller-runtime and kubernetes-sigs/controller-runtime#2931) have pros and cons. However, getting new fields dropped is worse than dealing with CRDs that did backward-incompatible changes related to omitempty.
And, in the case of Kueue, we are only dealing with production ready CRDs and we only modify a few fields, so the likelihood of a problem is low.
So, we could move in that direction instead of the approach in the current PR to drop "remove" operations, which can be less future proof.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support moving forward with kubernetes-sigs/controller-runtime#2932 to at least enable us in Kueue to register the lossLesshandler.
However, for the long term solution LossLessHandler isn't ideal. Even if we don't the problem yet, it might happen with increased adoption, for example an adopter of Kueue wants to write an in-house webhook to drop a problematic field from Volcano CRD (for example to work-around some issue).
I see the concern in kubernetes-sigs/controller-runtime#2931, but some solution like this could be handy, maybe it just needs additional care for the problematic cases mentioned?
I'm also wondering if this problem is related to kubernetes/kubernetes#126175 (might be red herring, but PTAL)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but some solution like this could be handy, maybe it just needs additional care for the problematic cases mentioned?
We wouldn't be able to know generically. It depends on the CRD. But again, the likelihood is very low.
So, up to you. Would you prefer the implementation of LosslessHandler as suggested in kubernetes-sigs/controller-runtime#2931 or are you ok with proceeding with the current PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will need to learn more about the pros & cons of thttps://github.com/kubernetes-sigs/controller-runtime/pull/2931.
Would the change in omitempty be problematic then only for build-in types, or CRDs also? Also is it problematic removing "omitempty", or adding as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problematic scenario is when the go library in Kueue doesn't have omitempty, but the object stored in apiserver doesn't have the field. In this case, roundtripping in Kueue would add the fields, leading to possibly invalid patches. This can happen in two scenarios:
- The go type author removes the omitempty from a field, and this is the library that Kueue uses. A user created the object when omitempty was in place.
- Kueue is using an older library that doesn't have omitempty. The go type author adds omitempty to a field, and a user creates an object using this version.
So yes, it can happen in either scenario. @jpbetz, please correct me if my interpretation is wrong.
Another diminishing factor is that this only matters if we are modifying a field that is having these omitempty transitions. If the transition happens in an object that Kueue doesn't touch, then this wouldn't show up in the jsondiff. But the omitempty status of all the parents of the field could be relevant.
I think kubernetes/kubernetes#126175 is also related, but it rather happens with the current CustomDefaulter of controller-runtime, and it wouldn't happen with my version because it wouldn't show in the jsondiff.
So the question comes down to:
Do we want to prioritize supporting newly added fields that Kueue libraries don't know about? Or do we prioritize supporting buggy types (which, as we can see, could also be a problem in the current version)? I think it's only fair to prioritize the former and be in the watch to report issues as necessary in other libraries and upgrade when appropriate.
OTOH, the LosslessDefaulter as currently implemented (just dropping every remove operation), doesn't have problems with adding or removing omitempty, as long as we know that we are not removing any fields in the webhook. Although it's still subject to a case like kubernetes/kubernetes#126175
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the omitempty issue only affects Kueue webhooks if they are missing in the fields that Kueue modifies. If Kueue doesn't modify them, then they would simply not form part of the jsondiff. So, to ensure Kueue correctness, we only need to audit the fields that we touch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that Michal's concern is for the third-party Kueue controllers. Isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but unfortunately we cannot do much better than document the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but unfortunately we cannot do much better than document the problem.
I agree with you. The third-party controller is not controllable by us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was more speculative, touching two potential concerns. Let me explain below in more detail.
So, to ensure Kueue correctness, we only need to audit the fields that we touch.
Yes, but even if we audit all fields modified by Kueue for TFJob at a given release of Kueue and TFJob, Kueue may want to use new fields in its future releases (for new features, like managedBy). Then, we need to audit the new field, and if we find an "omitempty" bug in TFJob operator in the field, then what do we do? Wait with the new feature of Kueue until the TFJob is fixed and released, or build the new feature based on the buggy field? None of the choices is ideal.
I guess that Michal's concern is for the third-party Kueue controllers. Isn't it?
Yeah, this is another consideration, if RoundTrip defaulter is widespread in customer, and third-party webhooks, it would create an incentive not to fix the omitempty bugs (whatever small, but still).
/lgtm Also, @tenzen-y @alculquicondor WDYT about cherry-picking to 0.8? |
LGTM label has been added. Git tree hash: a5685fbb976d32e79114b24e05592ed8622f3625
|
re cherry-picking: I think we are relatively safe given that we are using the latest libraries in 0.8 already. And people who might be thinking of upgrading k8s or kuberay are more likely to also stay up to date with Kueue. |
Indeed, I had some discussions about cherry-picking there: #3132 (comment) So, I'm ok without cherry-picking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alculquicondor @mimowo @jpbetz!
/lgtm
/approve
/hold cancel
@alculquicondor Please open a PR to update docs and an issue to discuss about the RoudtripDefaulter as we discussed.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, mimowo, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This seems to be a networking connectivity error. /test pull-kueue-test-multikueue-e2e-main |
Due to #3144 |
/test pull-kueue-test-integration-main |
if response.Allowed { | ||
var patches []jsonpatch.Operation | ||
for _, p := range response.Patches { | ||
if p.Operation != "remove" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alculquicondor would it be feasible to make the LossLessDefaulter more selective - only drop remove operations which correspond to fields outside of schema? If possible, then we could allow webhooks to remove known fields, while still preventing dropping of unknown fields. I suppose the answer may not be straightforward - depending how easily we can check if a remove operation corresponds to a known field, also performance may play a role.
If possible, maybe it would increase chances for promoting LossLessDefaulter as part of opt-in defaulter inside controller-runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there is a way using the reflect
package. But I don't think I'll have time to explore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mimowo That sounds more safety. Could we open an issue on the Kueue side?
If anyone is interested in this enhancement, they can take it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be wonderful, opened: #3174
/cherry-pick release-0.8 |
@mbobrovskyi: #3132 failed to apply on top of branch "release-0.8":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…atches (#3358) * Drop remove operations from webhook patches that could be caused by unkown fields in the go types Change-Id: Iabc66880aef0227f60ec3deec9109cfb255fe2da * Resync webhook builder Change-Id: I4de1ebbe75a9220466c50381d0e708a1b6d23c8f * Rename Defaulter to LosslessDefaulter Change-Id: I232b1f2cd1a8a0c6b2f2fa7506da4d17d3e3ae55 --------- Co-authored-by: Aldo Culquicondor <acondor@google.com>
* Drop remove operations from webhook patches that could be caused by unkown fields in the go types Change-Id: Iabc66880aef0227f60ec3deec9109cfb255fe2da * Resync webhook builder Change-Id: I4de1ebbe75a9220466c50381d0e708a1b6d23c8f * Rename Defaulter to LosslessDefaulter Change-Id: I232b1f2cd1a8a0c6b2f2fa7506da4d17d3e3ae55
What type of PR is this?
/kind bug
What this PR does / why we need it:
Prevent job webhooks from dropping fields for newer API fields when Kueue libraries are behind the latest released CRDs.
This was possible by reusing controller-runtime's handler for CustomDefaulter, by dropping the "remove" operations from the jsondiffs.
I had to duplicate controller-runtime's webhook builder, because there is no way to introduce a custom handler in it. I have an open PR to upstream this ability kubernetes-sigs/controller-runtime#2932
Which issue(s) this PR fixes:
Fixes #2878
Special notes for your reviewer:
Just to expand on why I decided to go with this approach of dropping the remove operations:
Another option I considered was for every job to return jsonpatches in the Suspend operation. But then it becomes quite hard to explain how to do your own, if you aren't already familiar with jsonpatches. It's much simpler to say "just edit the object". Then I was thinking how can we produce safe patches generically, starting from an edited object. One option would have been to do something similar to kubernetes-sigs/controller-runtime#2931, but that approach might introduce other bugs, as explained in kubernetes-sigs/controller-runtime#2931 (comment)
So ultimately I decided to exploit the fact that our webhooks simply don't drop fields.
Does this PR introduce a user-facing change?