☂️-issue for Server-Side Apply in Gardener #4122

timebertt · 2021-05-27T16:59:51Z

How to categorize this issue?

/area dev-productivity scalability
/kind impediment

[...] Server-Side Apply. We identified many places, especially in the Gardenlet code, that will benefit from SSA compared to Json-Merge/Strategic-Merge patches that are used today. Less conflicts, less network transfers, less API calls, lower code complexity, etc.. Other components in the Gardener landscape, e.g. Etcd-Druid or provider extensions, will benefit from SSA to at least the same extent.

While investigating how and under which circumstances SSA can be leveraged in Gardener as part of #4027, we got aware of the following major problems that will most probably prevent us from leveraging SSA as originally intended. @timuthy and I agreed on summarizing them in an issue for future reference.

removing/unsetting fields in custom resources via apply that were set during a previous apply is buggy (ref "null" values are dropped from CustomResources kubernetes/kubernetes#101036, Field deletion doesn't work well in SSA kubernetes-sigs/structured-merge-diff#171). The bug was fixed in v1.21 and backported to v1.{18,19,20}, though we would heavily depend on it once gardener uses SSA to deploy/reconcile extension resources (e.g. Infrastructure) in the seed.
- We probably don't want to add additional requirements for specific patch versions for the seed's k8s versions (e.g. require at least v1.18.19), so this effectively prevents us from properly using SSA for this use case.
switching from the typical CreateOrUpdate-logic to SSA in a controller doesn't seem to be straight-forward for resources that were created before SSA was enabled (resources that don't have field ownership tracked in metadata.managedFields yet).
- The problem is, that the first apply will add the before-first-apply field manager, that owns all fields that were present before the apply. Once the new field manager (e.g. gardenlet) wants to remove a field, it will only give up ownership but the field will still not be removed, as it is still co-owned by the before-first-apply manager.
- In order to workaround this issue and take over sole ownership of all fields during the migration, we need to take ownership of at least one field before the first apply
- e.g. take ownership of timestamp annotation:
```
$ k patch infra infra --type=merge -p '{"metadata":{"managedFields": [{"apiVersion":"extensions.gardener.cloud/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{"f:gardener.cloud/timestamp":{}}}},"manager":"gardenlet","operation":"Update","time":"2021-05-27T13:03:49Z"}]}}'
```
- and then apply the full spec, which will make the new manager owner of all spec fields

There are also other minor problems with SSA, that won't hinder us from using it, but are worth keeping in mind:

ApplyConfigurations for built-in API resources are only available in client-go@v1.21+ (ref Apply for client-go's typed clients kubernetes/enhancements#2155), so we will need to vendor v1.21 first before using SSA for built-in types, where using go structs is not applicable
ApplyConfigurations for our own API types need to be generated using applyconfiguration-gen, so same condition for them as above
several issues cause the resourceVersion to be bumped on every SSA even though it's a no-op (SSA triggers update events on apply to resource when it is no-op kubernetes/kubernetes#100024)
- one of them regarding the creationTimestamp is solved in v1.22 (ref fieldmanager: Strip managedfields BEFORE we update the timestamp kubernetes/kubernetes#100032)
status is tracked in managedFields although not set explicitly by an applier (as our extension API types don't specify omitempty for the status field)
- not problematic as long as the status is protected by the /status subresource and status itself is not updated via SSA
- fixed in v1.21 (ref Server-Side Apply status wiping kubernetes/kubernetes#99661)
changes via the /scale subresource are not tracked in managedFields
- fixed in v1.22 (ref Track ownership of scale subresource kubernetes/kubernetes#98377)
- not problematic for us right now (I think), as long as we don't use SSA for resources scaled by HPA

The text was updated successfully, but these errors were encountered:

timebertt · 2021-05-27T17:05:28Z

So far there's no decision whether to drop the idea of using SSA in gardener or pursue it further. However, the list above shows that we have to be very conscious about when we can properly leverage it.
We still want to understand the before-first-apply manager and the reasoning behind it, so maybe this provides us with some more arguments for taking a decision here.

For the reasons explained in gardener#4057, MergePatchOrCreate is not safe to use without reading objects from the API server / cache first. Ideally we would switch to server-side apply to simplify such operations on controlled objects (e.g. extension resources), though we cannot leverage it for now because of several issues described in gardener#4122.

* Cleanup unused/dead code * Make EXPECTPatch easier to debug * Refactor extensions helper funcs This prefactoring performs several changes in the extension helper funcs: - the funcs directly work on the passed in objects instead of creating new objects via separately passed newObjFuncs. Passed objects are expected to be filled with the latest state the controller/component applied/observed/retrieved, but at least namespace and name. This way, controllers can keep working on a single in-memory object instead of retrieving the same object over and over again. - WaitUntilExtensionObjectReady additionally checks that the added timestamp annotation is present on the object to prevent false early exits when reading stale data (e.g. from a cached client). - makes naming more consistent and clearer This provides the following benefits: - makes the funcs ready to be used with cached clients, as they are able to properly handle stale data now - controllers/components automatically have the latest state of the objects present without retrieving them again from the API server. This will ease further refactorings and allows to safely use patches in a lot of cases instead of CreateOrUpdate (+RetryOnConflict). - the funcs are easier to use and comprehend: only the desired object needs to be passed, no need to pass newObjFuncs, namespace and name - the funcs are more similar to other helper funcs (e.g. CreateOrUpdate) and the mechanisms employed in e.g. c-r clients - no need to pass a client.Object to postReadyFuncs which needs to be casted again * Make BackupBucket controller ready for cached client * Make BackupEntry controller/component ready for cached client * Make ContainerRuntime component ready for cached client * Make ControlPlane component ready for cached client * Make Extension component ready for cached client * Make Infrastructure component ready for cached client * Make Network component ready for cached client * Make OperatingSystemConfig component ready for cached client * Make Worker component ready for cached client * Fix failing unit test * Address review feedback * Switch from MergePatchOrCreate to GetAndCreateOrMergePatch For the reasons explained in #4057, MergePatchOrCreate is not safe to use without reading objects from the API server / cache first. Ideally we would switch to server-side apply to simplify such operations on controlled objects (e.g. extension resources), though we cannot leverage it for now because of several issues described in #4122. * Remove *PatchOrCreate funcs As described in #4057, *PatchOrCreate is easy to use incorrectly. Remove *PatchOrCreate funcs in order to minimize the risk of misuse. There might still be good use cases for this mechanism, though they are limited. If there is such a good use case, we can use it also without a helper func.

timebertt · 2021-06-01T14:30:40Z

There is an upstream issue about the migration from CSA/CreateOrUpdate to SSA: kubernetes/kubernetes#99003

gardener-ci-robot · 2022-03-30T13:00:27Z

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close

/lifecycle rotten

* Cleanup unused/dead code * Make EXPECTPatch easier to debug * Refactor extensions helper funcs This prefactoring performs several changes in the extension helper funcs: - the funcs directly work on the passed in objects instead of creating new objects via separately passed newObjFuncs. Passed objects are expected to be filled with the latest state the controller/component applied/observed/retrieved, but at least namespace and name. This way, controllers can keep working on a single in-memory object instead of retrieving the same object over and over again. - WaitUntilExtensionObjectReady additionally checks that the added timestamp annotation is present on the object to prevent false early exits when reading stale data (e.g. from a cached client). - makes naming more consistent and clearer This provides the following benefits: - makes the funcs ready to be used with cached clients, as they are able to properly handle stale data now - controllers/components automatically have the latest state of the objects present without retrieving them again from the API server. This will ease further refactorings and allows to safely use patches in a lot of cases instead of CreateOrUpdate (+RetryOnConflict). - the funcs are easier to use and comprehend: only the desired object needs to be passed, no need to pass newObjFuncs, namespace and name - the funcs are more similar to other helper funcs (e.g. CreateOrUpdate) and the mechanisms employed in e.g. c-r clients - no need to pass a client.Object to postReadyFuncs which needs to be casted again * Make BackupBucket controller ready for cached client * Make BackupEntry controller/component ready for cached client * Make ContainerRuntime component ready for cached client * Make ControlPlane component ready for cached client * Make Extension component ready for cached client * Make Infrastructure component ready for cached client * Make Network component ready for cached client * Make OperatingSystemConfig component ready for cached client * Make Worker component ready for cached client * Fix failing unit test * Address review feedback * Switch from MergePatchOrCreate to GetAndCreateOrMergePatch For the reasons explained in gardener#4057, MergePatchOrCreate is not safe to use without reading objects from the API server / cache first. Ideally we would switch to server-side apply to simplify such operations on controlled objects (e.g. extension resources), though we cannot leverage it for now because of several issues described in gardener#4122. * Remove *PatchOrCreate funcs As described in gardener#4057, *PatchOrCreate is easy to use incorrectly. Remove *PatchOrCreate funcs in order to minimize the risk of misuse. There might still be good use cases for this mechanism, though they are limited. If there is such a good use case, we can use it also without a helper func.

gardener-ci-robot · 2022-04-29T13:23:38Z

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

gardener-prow · 2022-04-29T13:23:41Z

@gardener-ci-robot: Closing this issue.

In response to this:

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen

Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

* Cleanup unused/dead code * Make EXPECTPatch easier to debug * Refactor extensions helper funcs This prefactoring performs several changes in the extension helper funcs: - the funcs directly work on the passed in objects instead of creating new objects via separately passed newObjFuncs. Passed objects are expected to be filled with the latest state the controller/component applied/observed/retrieved, but at least namespace and name. This way, controllers can keep working on a single in-memory object instead of retrieving the same object over and over again. - WaitUntilExtensionObjectReady additionally checks that the added timestamp annotation is present on the object to prevent false early exits when reading stale data (e.g. from a cached client). - makes naming more consistent and clearer This provides the following benefits: - makes the funcs ready to be used with cached clients, as they are able to properly handle stale data now - controllers/components automatically have the latest state of the objects present without retrieving them again from the API server. This will ease further refactorings and allows to safely use patches in a lot of cases instead of CreateOrUpdate (+RetryOnConflict). - the funcs are easier to use and comprehend: only the desired object needs to be passed, no need to pass newObjFuncs, namespace and name - the funcs are more similar to other helper funcs (e.g. CreateOrUpdate) and the mechanisms employed in e.g. c-r clients - no need to pass a client.Object to postReadyFuncs which needs to be casted again * Make BackupBucket controller ready for cached client * Make BackupEntry controller/component ready for cached client * Make ContainerRuntime component ready for cached client * Make ControlPlane component ready for cached client * Make Extension component ready for cached client * Make Infrastructure component ready for cached client * Make Network component ready for cached client * Make OperatingSystemConfig component ready for cached client * Make Worker component ready for cached client * Fix failing unit test * Address review feedback * Switch from MergePatchOrCreate to GetAndCreateOrMergePatch For the reasons explained in gardener#4057, MergePatchOrCreate is not safe to use without reading objects from the API server / cache first. Ideally we would switch to server-side apply to simplify such operations on controlled objects (e.g. extension resources), though we cannot leverage it for now because of several issues described in gardener#4122. * Remove *PatchOrCreate funcs As described in gardener#4057, *PatchOrCreate is easy to use incorrectly. Remove *PatchOrCreate funcs in order to minimize the risk of misuse. There might still be good use cases for this mechanism, though they are limited. If there is such a good use case, we can use it also without a helper func.

timebertt added the kind/enhancement Enhancement, improvement, extension label May 27, 2021

gardener-robot added area/dev-productivity Developer productivity related (how to improve development) area/scalability Scalability related kind/impediment Something that impedes developers, operators, users or others in their work labels May 27, 2021

timuthy mentioned this issue Jul 20, 2021

[Feature] Prepare Druid for Server-Side Apply gardener/etcd-druid#179

Closed

2 tasks

gardener-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2021

gardener-prow bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 30, 2022

gardener-prow bot closed this as completed Apr 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

☂️-issue for Server-Side Apply in Gardener #4122

☂️-issue for Server-Side Apply in Gardener #4122

timebertt commented May 27, 2021

timebertt commented May 27, 2021

timebertt commented Jun 1, 2021

gardener-ci-robot commented Mar 30, 2022

gardener-ci-robot commented Apr 29, 2022

gardener-prow bot commented Apr 29, 2022

☂️-issue for Server-Side Apply in Gardener #4122

☂️-issue for Server-Side Apply in Gardener #4122

Comments

timebertt commented May 27, 2021

timebertt commented May 27, 2021

timebertt commented Jun 1, 2021

gardener-ci-robot commented Mar 30, 2022

gardener-ci-robot commented Apr 29, 2022

gardener-prow bot commented Apr 29, 2022