Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop remove operations from webhook patches #3132

Merged
merged 3 commits into from
Sep 27, 2024

Conversation

alculquicondor
Copy link
Contributor

@alculquicondor alculquicondor commented Sep 24, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Prevent job webhooks from dropping fields for newer API fields when Kueue libraries are behind the latest released CRDs.

This was possible by reusing controller-runtime's handler for CustomDefaulter, by dropping the "remove" operations from the jsondiffs.

I had to duplicate controller-runtime's webhook builder, because there is no way to introduce a custom handler in it. I have an open PR to upstream this ability kubernetes-sigs/controller-runtime#2932

Which issue(s) this PR fixes:

Fixes #2878

Special notes for your reviewer:

Just to expand on why I decided to go with this approach of dropping the remove operations:
Another option I considered was for every job to return jsonpatches in the Suspend operation. But then it becomes quite hard to explain how to do your own, if you aren't already familiar with jsonpatches. It's much simpler to say "just edit the object". Then I was thinking how can we produce safe patches generically, starting from an edited object. One option would have been to do something similar to kubernetes-sigs/controller-runtime#2931, but that approach might introduce other bugs, as explained in kubernetes-sigs/controller-runtime#2931 (comment)

So ultimately I decided to exploit the fact that our webhooks simply don't drop fields.

Does this PR introduce a user-facing change?

Prevent job webhooks from dropping fields for newer API fields when Kueue libraries are behind the latest released CRDs.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 24, 2024
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2024
Copy link

netlify bot commented Sep 24, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit c2f6eca
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66f4536bd1fdc300070ab33b

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 24, 2024
@alculquicondor alculquicondor force-pushed the fix-webhooks branch 2 times, most recently from 49ebfba to 8b5b1ce Compare September 24, 2024 18:59
@alculquicondor alculquicondor changed the title WIP: Drop remove operations from webhook patches Drop remove operations from webhook patches Sep 24, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024
@tenzen-y
Copy link
Member

/cc

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just confirmation. Have you ever seen any dropping schemes in the reconcilers/controllers?

For sure, I have seen your comment mentioning the investigations: #2878 (comment)

"sigs.k8s.io/controller-runtime/pkg/webhook/conversion"
)

// This code is copied from https://github.com/kubernetes-sigs/controller-runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the exact file name is /pkg/builder/webhook.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added filename and hash.

pkg/controller/jobframework/webhook/builder.go Outdated Show resolved Hide resolved
that could be caused by unkown fields in the go types

Change-Id: Iabc66880aef0227f60ec3deec9109cfb255fe2da
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 25, 2024
Change-Id: I4de1ebbe75a9220466c50381d0e708a1b6d23c8f
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm.

Change-Id: I232b1f2cd1a8a0c6b2f2fa7506da4d17d3e3ae55
// that are not present in the go types, which can occur when Kueue libraries are behind the latest
// released CRDs.
// Dropping the "remove" operations is safe because Kueue's job mutators never remove fields.
func (h *losslessDefaulter) Handle(ctx context.Context, req admission.Request) admission.Response {
Copy link
Contributor Author

@alculquicondor alculquicondor Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was chatting with @jpbetz to get his opinion on my former approach: doing a round-trip of the raw object. This proposal was rejected by the maintainers of controller-runtime because it might break backwards compatibility when a go type removes an omitempty clause that was set for a field before kubernetes-sigs/controller-runtime#2931 (comment)

He agreed that both approaches (current behavior of controller-runtime and kubernetes-sigs/controller-runtime#2931) have pros and cons. However, getting new fields dropped is worse than dealing with CRDs that did backward-incompatible changes related to omitempty.
And, in the case of Kueue, we are only dealing with production ready CRDs and we only modify a few fields, so the likelihood of a problem is low.
So, we could move in that direction instead of the approach in the current PR to drop "remove" operations, which can be less future proof.
WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support moving forward with kubernetes-sigs/controller-runtime#2932 to at least enable us in Kueue to register the lossLesshandler.

However, for the long term solution LossLessHandler isn't ideal. Even if we don't the problem yet, it might happen with increased adoption, for example an adopter of Kueue wants to write an in-house webhook to drop a problematic field from Volcano CRD (for example to work-around some issue).

I see the concern in kubernetes-sigs/controller-runtime#2931, but some solution like this could be handy, maybe it just needs additional care for the problematic cases mentioned?

I'm also wondering if this problem is related to kubernetes/kubernetes#126175 (might be red herring, but PTAL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but some solution like this could be handy, maybe it just needs additional care for the problematic cases mentioned?

We wouldn't be able to know generically. It depends on the CRD. But again, the likelihood is very low.

So, up to you. Would you prefer the implementation of LosslessHandler as suggested in kubernetes-sigs/controller-runtime#2931 or are you ok with proceeding with the current PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to learn more about the pros & cons of thttps://github.com/kubernetes-sigs/controller-runtime/pull/2931.
Would the change in omitempty be problematic then only for build-in types, or CRDs also? Also is it problematic removing "omitempty", or adding as well?

Copy link
Contributor Author

@alculquicondor alculquicondor Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problematic scenario is when the go library in Kueue doesn't have omitempty, but the object stored in apiserver doesn't have the field. In this case, roundtripping in Kueue would add the fields, leading to possibly invalid patches. This can happen in two scenarios:

  • The go type author removes the omitempty from a field, and this is the library that Kueue uses. A user created the object when omitempty was in place.
  • Kueue is using an older library that doesn't have omitempty. The go type author adds omitempty to a field, and a user creates an object using this version.

So yes, it can happen in either scenario. @jpbetz, please correct me if my interpretation is wrong.

Another diminishing factor is that this only matters if we are modifying a field that is having these omitempty transitions. If the transition happens in an object that Kueue doesn't touch, then this wouldn't show up in the jsondiff. But the omitempty status of all the parents of the field could be relevant.

I think kubernetes/kubernetes#126175 is also related, but it rather happens with the current CustomDefaulter of controller-runtime, and it wouldn't happen with my version because it wouldn't show in the jsondiff.

So the question comes down to:
Do we want to prioritize supporting newly added fields that Kueue libraries don't know about? Or do we prioritize supporting buggy types (which, as we can see, could also be a problem in the current version)? I think it's only fair to prioritize the former and be in the watch to report issues as necessary in other libraries and upgrade when appropriate.

OTOH, the LosslessDefaulter as currently implemented (just dropping every remove operation), doesn't have problems with adding or removing omitempty, as long as we know that we are not removing any fields in the webhook. Although it's still subject to a case like kubernetes/kubernetes#126175

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the omitempty issue only affects Kueue webhooks if they are missing in the fields that Kueue modifies. If Kueue doesn't modify them, then they would simply not form part of the jsondiff. So, to ensure Kueue correctness, we only need to audit the fields that we touch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that Michal's concern is for the third-party Kueue controllers. Isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but unfortunately we cannot do much better than document the problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but unfortunately we cannot do much better than document the problem.

I agree with you. The third-party controller is not controllable by us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was more speculative, touching two potential concerns. Let me explain below in more detail.

So, to ensure Kueue correctness, we only need to audit the fields that we touch.

Yes, but even if we audit all fields modified by Kueue for TFJob at a given release of Kueue and TFJob, Kueue may want to use new fields in its future releases (for new features, like managedBy). Then, we need to audit the new field, and if we find an "omitempty" bug in TFJob operator in the field, then what do we do? Wait with the new feature of Kueue until the TFJob is fixed and released, or build the new feature based on the buggy field? None of the choices is ideal.

I guess that Michal's concern is for the third-party Kueue controllers. Isn't it?

Yeah, this is another consideration, if RoundTrip defaulter is widespread in customer, and third-party webhooks, it would create an incentive not to fix the omitempty bugs (whatever small, but still).

@mimowo
Copy link
Contributor

mimowo commented Sep 27, 2024

/lgtm
/approve
/hold
In case @tenzen-y has more comments. @alculquicondor please follow up with a note about the recommendation for registering webhooks for integrations.

Also, @tenzen-y @alculquicondor WDYT about cherry-picking to 0.8?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 27, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 27, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: a5685fbb976d32e79114b24e05592ed8622f3625

@alculquicondor
Copy link
Contributor Author

re cherry-picking:
This PR depends on #3132, so maybe a bit too much for a clean merge?

I think we are relatively safe given that we are using the latest libraries in 0.8 already. And people who might be thinking of upgrading k8s or kuberay are more likely to also stay up to date with Kueue.

@tenzen-y
Copy link
Member

re cherry-picking: This PR depends on #3132, so maybe a bit too much for a clean merge?

I think we are relatively safe given that we are using the latest libraries in 0.8 already. And people who might be thinking of upgrading k8s or kuberay are more likely to also stay up to date with Kueue.

Indeed, I had some discussions about cherry-picking there: #3132 (comment)

So, I'm ok without cherry-picking.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alculquicondor @mimowo @jpbetz!
/lgtm
/approve
/hold cancel

@alculquicondor Please open a PR to update docs and an issue to discuss about the RoudtripDefaulter as we discussed.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 27, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,mimowo,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member

go: github.com/mikefarah/yq/v4@v4.44.3: Get "https://proxy.golang.org/github.com/mikefarah/yq/v4/@v/v4.44.3.info": dial tcp 142.250.191.113:443: i/o timeout

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/3132/pull-kueue-test-multikueue-e2e-main/1839710600670744576

This seems to be a networking connectivity error.

/test pull-kueue-test-multikueue-e2e-main

@tenzen-y
Copy link
Member

Due to #3144

@tenzen-y
Copy link
Member

/test pull-kueue-test-integration-main

@k8s-ci-robot k8s-ci-robot merged commit 08caf7e into kubernetes-sigs:main Sep 27, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Sep 27, 2024
if response.Allowed {
var patches []jsonpatch.Operation
for _, p := range response.Patches {
if p.Operation != "remove" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor would it be feasible to make the LossLessDefaulter more selective - only drop remove operations which correspond to fields outside of schema? If possible, then we could allow webhooks to remove known fields, while still preventing dropping of unknown fields. I suppose the answer may not be straightforward - depending how easily we can check if a remove operation corresponds to a known field, also performance may play a role.

If possible, maybe it would increase chances for promoting LossLessDefaulter as part of opt-in defaulter inside controller-runtime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there is a way using the reflect package. But I don't think I'll have time to explore it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo That sounds more safety. Could we open an issue on the Kueue side?
If anyone is interested in this enhancement, they can take it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be wonderful, opened: #3174

@mbobrovskyi
Copy link
Contributor

/cherry-pick release-0.8

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mbobrovskyi: #3132 failed to apply on top of branch "release-0.8":

Applying: Drop remove operations from webhook patches
Using index info to reconstruct a base tree...
M	go.mod
A	pkg/controller/jobframework/noop_webhook.go
A	pkg/controller/jobs/deployment/deployment_webhook.go
M	pkg/controller/jobs/job/job_webhook.go
M	pkg/controller/jobs/pod/pod_webhook.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/jobs/pod/pod_webhook.go
Auto-merging pkg/controller/jobs/noop/noop_webhook.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/noop/noop_webhook.go
Auto-merging pkg/controller/jobs/job/job_webhook.go
CONFLICT (modify/delete): pkg/controller/jobs/deployment/deployment_webhook.go deleted in HEAD and modified in Drop remove operations from webhook patches. Version Drop remove operations from webhook patches of pkg/controller/jobs/deployment/deployment_webhook.go left in tree.
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Drop remove operations from webhook patches

In response to this:

/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot pushed a commit that referenced this pull request Oct 29, 2024
…atches (#3358)

* Drop remove operations from webhook patches

that could be caused by unkown fields in the go types

Change-Id: Iabc66880aef0227f60ec3deec9109cfb255fe2da

* Resync webhook builder

Change-Id: I4de1ebbe75a9220466c50381d0e708a1b6d23c8f

* Rename Defaulter to LosslessDefaulter

Change-Id: I232b1f2cd1a8a0c6b2f2fa7506da4d17d3e3ae55

---------

Co-authored-by: Aldo Culquicondor <acondor@google.com>
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
* Drop remove operations from webhook patches

that could be caused by unkown fields in the go types

Change-Id: Iabc66880aef0227f60ec3deec9109cfb255fe2da

* Resync webhook builder

Change-Id: I4de1ebbe75a9220466c50381d0e708a1b6d23c8f

* Rename Defaulter to LosslessDefaulter

Change-Id: I232b1f2cd1a8a0c6b2f2fa7506da4d17d3e3ae55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kueue mutating webhooks drops fields in KubeRay resources
8 participants