
Allow smooth upgrades to new kube-proxy with nft #5345

Merged

Conversation

@g-gaston (Member) commented Mar 22, 2023

Problem statement

Symptoms

When upgrading a Cloudstack cluster (either in k8s 1.22 or 1.23) from eks-a v0.14.x to v0.15.0 (code in main), the api-server loses connectivity to the webhooks. This makes it impossible to create/update CRDs with validation/mutation webhooks. In particular, the eks-a process fails when moving the management resources back to the management cluster with clusterctl, since this recreates the capi objects.

Cause

Preface: upgrading from v0.14.3 to v0.15.0 involves upgrading from eks-d v1.23.16-eks-1-23-16 to v1.23.16-eks-1-23-17. The new eks-d version includes the new kube-proxy with support for iptables nft. The old kube-proxy always uses iptables legacy.

During an upgrade, when the new machine for the new CP node is started, if the machine has iptables nft as the default, the kubelet will use it. Then, before capi updates the kube-proxy image version in the DS (this doesn't happen until the CP upgrade is finished), the old kube-proxy is scheduled in the node. This old kube-proxy doesn't support nft and always uses iptables legacy. When it starts, it adds legacy iptables rules. However, at this point the kubelet has already added iptables-nft rules.

After the CP has been updated, capi updates the kube-proxy DS to the new version. This new version has the new wrapper, which detects the rules introduced by the kubelet, so it starts using nft.
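For reference, the wrapper's choice boils down to checking which iptables backend already holds kubelet-created rules. The Go sketch below is only an illustration of that idea; the real wrapper is a script baked into the kube-proxy image, and the chain names here follow the upstream iptables-wrapper convention rather than anything in this PR.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// countKubeletRules runs the given save binary (iptables-legacy-save or
// iptables-nft-save) and counts lines that mention the chains kubelet creates.
func countKubeletRules(saveBin string) int {
    out, err := exec.Command(saveBin, "-t", "mangle").CombinedOutput()
    if err != nil {
        return 0 // treat a missing/failing backend as "no rules"
    }
    count := 0
    for _, line := range strings.Split(string(out), "\n") {
        if strings.Contains(line, "KUBE-IPTABLES-HINT") || strings.Contains(line, "KUBE-KUBELET-CANARY") {
            count++
        }
    }
    return count
}

func main() {
    // Pick whichever backend the kubelet already wrote rules into.
    backend := "legacy"
    if countKubeletRules("iptables-nft-save") > countKubeletRules("iptables-legacy-save") {
        backend = "nft"
    }
    fmt.Println("selected iptables backend:", backend)
}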

The hypothesis is that these leftover legacy rules break the k8s service IP "redirection". It's also possible that Cilium is getting messed up when it starts since it also depends on the rules introduced by kube-proxy: cilium/cilium#20123

So it turns out this problem is not exclusive to Cloudstack; it will happen with any OS that defaults to nft. It just happens that the only provider where we have that kind of image right now is Cloudstack.

Solution

TLDR: before the upgrade, schedule a DS with the old kube-proxy only on the old nodes and schedule the existing DS with the new kube-proxy version only on the new nodes.

Upgrade flow changes (a rough client-go sketch of steps 4 and 6 follows the list):

  1. Run everything we do today up until the actual capi upgrade/CP upgrade (the capi objects upgrade, not the capi components upgrade).
  2. Add an annotation to the KCP to stop kube-proxy reconciliations. Otherwise, the KCP controller will undo our next changes.
  3. Add the special label anywhere.eks.amazonaws.com/iptableslegacy=true to all nodes. This allows us to identify the old nodes later.
  4. Add node (anti) affinity to the main kube-proxy DS using requiredDuringSchedulingIgnoredDuringExecution and DoesNotExist. This ensures the original kube-proxy is not scheduled on the old nodes after we update the image version.
  5. Delete the kube-proxy pods on the old nodes. Node affinity should remove them anyway, but this avoids race conditions.
  6. Create a new DS kube-proxy-iptables-legacy with the old kube-proxy version and a node selector on anywhere.eks.amazonaws.com/iptableslegacy=true. This DS is only scheduled on the current (pre-upgrade) nodes.
  7. Update the main kube-proxy DS to the new eks-d version v1.23.16-eks-1-23-18.
  8. Continue with the upgrade flow.
  9. Delete kube-proxy-iptables-legacy.
  10. Remove nodeAffinity from kube-proxy DS.
  11. Remove annotation from KCP to resume kube-proxy reconciliations.
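As a rough illustration of steps 4 and 6, the sketch below uses plain client-go types. It is not the code in this PR: the label key comes from the list above, while the function names, the kube-system/kube-proxy object names, and the wholesale spec clone are assumptions made for the example.

package kubeproxy

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const iptablesLegacyLabel = "anywhere.eks.amazonaws.com/iptableslegacy"

// addAntiAffinityToMainDS (step 4 above) keeps the main kube-proxy DS off any
// node that carries the legacy label.
func addAntiAffinityToMainDS(ctx context.Context, c kubernetes.Interface) error {
    ds, err := c.AppsV1().DaemonSets("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
    if err != nil {
        return err
    }
    ds.Spec.Template.Spec.Affinity = &corev1.Affinity{
        NodeAffinity: &corev1.NodeAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
                NodeSelectorTerms: []corev1.NodeSelectorTerm{{
                    MatchExpressions: []corev1.NodeSelectorRequirement{{
                        Key:      iptablesLegacyLabel,
                        Operator: corev1.NodeSelectorOpDoesNotExist,
                    }},
                }},
            },
        },
    }
    _, err = c.AppsV1().DaemonSets("kube-system").Update(ctx, ds, metav1.UpdateOptions{})
    return err
}

// createLegacyDS (step 6) clones the main DS into kube-proxy-iptables-legacy,
// pins it to the old image, and restricts it to labeled (pre-upgrade) nodes.
// Selector/labels handling is elided; the real change may build the spec differently.
func createLegacyDS(ctx context.Context, c kubernetes.Interface, oldImage string) error {
    mainDS, err := c.AppsV1().DaemonSets("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
    if err != nil {
        return err
    }
    legacy := &appsv1.DaemonSet{
        ObjectMeta: metav1.ObjectMeta{Name: "kube-proxy-iptables-legacy", Namespace: "kube-system"},
        Spec:       *mainDS.Spec.DeepCopy(),
    }
    legacy.Spec.Template.Spec.Containers[0].Image = oldImage
    legacy.Spec.Template.Spec.Affinity = nil
    legacy.Spec.Template.Spec.NodeSelector = map[string]string{iptablesLegacyLabel: "true"}
    _, err = c.AppsV1().DaemonSets("kube-system").Create(ctx, legacy, metav1.CreateOptions{})
    return err
}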

Implementation notes

  • I have used a Go client for the k8s API calls. I know this is a semi-new thing, especially in the CLI, and it requires a group conversation and agreement before it gets introduced in the codebase. However, I went with it because I knew the implementation was going to be significantly faster and this issue is a release blocker. If there are concerns with using this client, I would ask the reviewers to write them down but let the PR continue its course. And if required, I'll come back and refactor this to use kubectl calls. (A minimal sketch of building such a client follows this list.)
  • I haven't put a lot of thought into naming and the placement of the code, both physically (package) and logically (from where we call the new code). If you have better ideas, please voice them. I'm willing to move this around as necessary if the timing allows it. If it doesn't, I'll create a ticket to come back and address the agreed alternative.
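For context on the first bullet, building such a client looks roughly like this minimal sketch. It assumes controller-runtime and a kubeconfig path; it is not necessarily how this PR wires things up.

package clientsketch

import (
    "k8s.io/client-go/tools/clientcmd"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// newRuntimeClient builds a controller-runtime client from a kubeconfig file.
// With an empty Options struct, client.New falls back to the default scheme,
// which already registers the core and apps types needed for the kube-proxy DS.
func newRuntimeClient(kubeconfigPath string) (client.Client, error) {
    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
    if err != nil {
        return nil, err
    }
    return client.New(cfg, client.Options{})
}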

Testing

I have manually tested the present code a few times and it works just fine.

TODOs

  • Cleanup code
  • Add cleanup phase after upgrade implementing 8, 9 and 10.
  • Run e2e test locally
  • Add k8s -> eks-d mapping for all versions
  • Unit tests
  • (Maybe) add retrying to the client calls or retry the whole "prepare" call and make the code reentrant. This will depend on timing. So far I haven't found issues during testing.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot (Collaborator)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@eks-distro-bot eks-distro-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 22, 2023
@codecov codecov bot commented Mar 22, 2023

Codecov Report

Merging #5345 (7b917ad) into main (7ee77c6) will increase coverage by 0.25%.
The diff coverage is 81.74%.

❗ Current head 7b917ad differs from pull request most recent head a8bd485. Consider uploading reports for the commit a8bd485 to get more accurate results

@@            Coverage Diff             @@
##             main    #5345      +/-   ##
==========================================
+ Coverage   72.39%   72.65%   +0.25%     
==========================================
  Files         440      441       +1     
  Lines       36125    36660     +535     
==========================================
+ Hits        26152    26634     +482     
- Misses       8408     8426      +18     
- Partials     1565     1600      +35     
Impacted Files Coverage Δ
pkg/task/task.go 88.14% <ø> (ø)
pkg/workflows/upgrade.go 63.56% <60.00%> (-0.16%) ⬇️
pkg/clustermanager/kube_proxy.go 79.06% <79.06%> (ø)
controllers/resource/reconciler.go 63.76% <100.00%> (+4.60%) ⬆️
pkg/clients/kubernetes/runtimeclient.go 79.41% <100.00%> (+1.28%) ⬆️
pkg/cluster/spec.go 78.03% <100.00%> (+1.04%) ⬆️
pkg/clustermanager/client.go 100.00% <100.00%> (ø)
pkg/clustermanager/retrier_client.go 92.30% <100.00%> (ø)
pkg/controller/clientutil/annotations.go 100.00% <100.00%> (ø)
pkg/controller/clusterapi.go 100.00% <100.00%> (ø)
... and 1 more

... and 26 files with indirect coverage changes


@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch 2 times, most recently from d679b5c to 5ab642f Compare March 23, 2023 00:32
@g-gaston g-gaston marked this pull request as ready for review March 23, 2023 00:33
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from 1e05315 to bb9dfe9 Compare March 23, 2023 15:58
@g-gaston g-gaston changed the title [WIP] Allow smooth upgrades from old iptables legacy kube-proxy to new kube-proxy Allow smooth upgrades to new kube-proxy with nft Mar 23, 2023
@g-gaston (Member Author):

Apologies for doing this kind of unexported method testing, but this reconciler and the legacy controller are extremely difficult to test without gigantic setups. Since there were no tests for this, there was no setup I could reuse.

This controller should go away on the next iteration, so I opted to test the private method directly.

@@ -12,15 +12,15 @@ import (

// RetrierClient wraps around a ClusterClient, offering retry functionality for some operations.
type RetrierClient struct {
-   *client
+   *clusterManageClient
@g-gaston (Member Author):

This is just renaming to avoid collision with the client package

Member:

Suggested change
*clusterManageClient
*clusterManagerClient

@@ -15,6 +16,7 @@ func TestValidateVersionSuccess(t *testing.T) {
}

func TestValidateVersionError(t *testing.T) {
+   os.Unsetenv(eksctl.VersionEnvVar)
@g-gaston (Member Author):

This prevents flakes locally when the env var is set

@drewvanstone drewvanstone added this to the v0.15.0 milestone Mar 23, 2023
@jiayiwang7 jiayiwang7 assigned jiayiwang7 and unassigned jiayiwang7 Mar 23, 2023
@jiayiwang7 jiayiwang7 self-requested a review March 23, 2023 17:57
controllers/resource/reconciler.go (outdated, resolved)
if needsPrepare, err := needsKubeProxyPreUpgrade(spec, kcp); err != nil {
    return err
} else if !needsPrepare {
    log.V(4).Info("Kube-proxy upgrade doesn't need special handling", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)
Member:

Isn't this just the following? Or did you want to highlight the special handling between old and new differently?

Suggested change
log.V(4).Info("Kube-proxy upgrade doesn't need special handling", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)
log.V(4).Info("skipping kube-proxy upgrade setup as it contains same version", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)

@g-gaston (Member Author):

It doesn't necessarily contain the same version.
It can contain two different versions, neither of which includes the new kube-proxy.
Or it can contain two different versions, both of which include the new kube-proxy.

@g-gaston (Member Author):

To clarify: this prepare process is only necessary when there is a transition between a kube-proxy that always uses legacy (the "old" one) and a kube-proxy that uses the new iptables wrapper (the "new" one), which adds support for nft.
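In other words, the check is roughly the following. This is an illustrative sketch only; the boundary constant is a made-up placeholder, since the real code compares against whichever eks-d build first shipped the nft-aware wrapper.

// needsKubeProxyPreUpgradeSketch returns true exactly when we move from a
// legacy-only kube-proxy to one that ships the nft-aware wrapper.
func needsKubeProxyPreUpgradeSketch(currentBuild, newBuild int) bool {
    const firstNFTAwareBuild = 17 // placeholder, not the real boundary build number
    oldIsLegacyOnly := currentBuild < firstNFTAwareBuild
    newHasWrapper := newBuild >= firstNFTAwareBuild
    return oldIsLegacyOnly && newHasWrapper
}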


currentImage := kubeProxy.Spec.Template.Spec.Containers[0].Image
if currentImage == newKubeProxyImage {
    log.V(4).Info("Kube-proxy image update seems stable", "wantImage", newKubeProxyImage, "currentImage", currentImage)
Member:

How do you ensure stability here? With these retries, is this enough to ensure that the new image has not been reverted? This won't run into race conditions, right? Or can it?

@g-gaston (Member Author):

It waits for two seconds after every update and then checks if the image is the expected one.
This tries to mitigate a race condition where the KCP controller hasn't yet read the skip annotation and undoes our update to the kube-proxy image. So we repeat this update in a loop and verify that the image doesn't get changed back.

If you are asking if this guarantees correctness: as with almost any wait-based mechanism, it doesn't. It's still possible, although unlikely, that the KCP controller won't see the skip annotation in 2 seconds and might try to undo the change. The longer the wait, the more confidence, but it will never be 100%. I couldn't figure out any algorithm that guaranteed the KCP controller had read the skip annotation before we update the DS.
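A hedged sketch of that mitigation follows; it is not the PR's exact code, and the helper function signatures, the retry budget, and the 2-second wait are assumptions.

package kubeproxy

import (
    "context"
    "fmt"
    "time"
)

// ensureImageStable re-applies the desired kube-proxy image and only returns
// once it has stayed in place after a short wait, which gives the KCP
// controller time to observe the skip annotation.
func ensureImageStable(ctx context.Context, update func(context.Context, string) error, current func(context.Context) (string, error), want string) error {
    for i := 0; i < 10; i++ { // retry budget is illustrative
        if err := update(ctx, want); err != nil {
            return err
        }
        time.Sleep(2 * time.Second)
        got, err := current(ctx)
        if err != nil {
            return err
        }
        if got == want {
            return nil // image stayed put; assume the annotation is now honored
        }
    }
    return fmt.Errorf("kube-proxy image kept being reverted, wanted %s", want)
}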

Member:

Is 2 seconds enough to have this check then? Just wondering if we need longer or if KCP modifies another property when it sees the skip annotation. If not, yea then this seems fine.

@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from e253e34 to ba4386d Compare March 24, 2023 14:46
pkg/clients/kubernetes/runtimeclient.go (outdated, resolved)
pkg/clients/kubernetes/runtimeclient.go (outdated, resolved)
pkg/cluster/spec.go (outdated, resolved)
pkg/clustermanager/kube_proxy.go (resolved)
pkg/clustermanager/kube_proxy.go (resolved)
pkg/clustermanager/kube_proxy.go (outdated, resolved)
pkg/clustermanager/kube_proxy.go (outdated, resolved)
pkg/clustermanager/kube_proxy.go (resolved)
pkg/clustermanager/kube_proxy.go (outdated, resolved)
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from c1ccb4b to b088d84 Compare March 24, 2023 17:24
@eks-distro-bot eks-distro-bot added lgtm and removed lgtm labels Mar 24, 2023
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from e352eac to 7b917ad Compare March 24, 2023 17:59
@jiayiwang7 (Member)

/lgtm

The new eks-d version includes the new kube-proxy with support for
iptables nft. The old kube-proxy always uses iptables legacy.

During an upgrade, when the new machine for the new CP node is started,
if the machine has iptables nft as the default, the kubelet will use it.
Then, before capi updates the kube-proxy image version in the DS (this
doesn't happen until the CP upgrade is finished), the old kube-proxy is
scheduled in the node. This old kube-proxy doesn't support nft and
always uses iptables legacy. When it starts, it adds legacy iptables
rules. However, at this point the kubelet has already added iptables-nft
rules.

After the CP has been updated, capi updates the kube-proxy DS to the new
version. This new version has the new wrapper, which detects the rules
introduced by the kubelet, so it starts using nft.

The hypothesis is that these leftover legacy rules break the k8s service
IP "redirection".

This allows a smooth transition by scheduling a DS with the old kube proxy
only in the old nodes and schedule a DS with the new kube-proxy only in
the new nodes.
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from 7b917ad to a8bd485 Compare March 24, 2023 18:15
@jiayiwang7 (Member)

/lgtm

@g-gaston (Member Author)

/approve

@eks-distro-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: g-gaston

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@g-gaston (Member Author)

/cherry-pick release-0.15

@eks-distro-pr-bot (Contributor)

@g-gaston: once the present PR merges, I will cherry-pick it on top of release-0.15 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@g-gaston (Member Author)

eks-anywhere-e2e-presubmit seems like a flake when pushing the image
/retest

@g-gaston (Member Author)

/retest

@eks-distro-pr-bot (Contributor)

@g-gaston: new pull request created: #5383

In response to this:

/cherry-pick release-0.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
Before the kube-proxy upgrader
(aws#5345) all API calls to the
management cluster came either from kubectl or clusterctl, which both
happened to be run in a docker container in the admin machine. This was
the first piece of code that introduced the use of Kubernetes Go client
directly from the CLI binary. This means that if a user was relying on
this internal implementation (explicit interface vs implicit interface),
their system could break if it wasn't set up to give the CLI network
connectivity to the kind cluster.

This PR "reverts" the addition of that new paradigm by changing the
underlying client implementation to use kubectl commands.
g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
eks-distro-bot pushed a commit that referenced this pull request Apr 12, 2023
eks-distro-pr-bot pushed a commit to eks-distro-pr-bot/eks-anywhere that referenced this pull request Apr 13, 2023
eks-distro-pr-bot pushed a commit to eks-distro-pr-bot/eks-anywhere that referenced this pull request Apr 13, 2023
eks-distro-bot pushed a commit that referenced this pull request Apr 13, 2023
ddjjia pushed a commit to ddjjia/eks-anywhere that referenced this pull request Jun 8, 2023
Labels
approved blocker lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants