Allow smooth upgrades to new kube-proxy with nft #5345
Conversation
Skipping CI for Draft Pull Request.
Codecov Report
@@ Coverage Diff @@
## main #5345 +/- ##
==========================================
+ Coverage 72.39% 72.65% +0.25%
==========================================
Files 440 441 +1
Lines 36125 36660 +535
==========================================
+ Hits 26152 26634 +482
- Misses 8408 8426 +18
- Partials 1565 1600 +35
... and 26 files with indirect coverage changes
Force-pushed from d679b5c to 5ab642f.
Force-pushed from 1e05315 to bb9dfe9.
Apologies for doing this kind of unexported method testing
But this reconciler and the legacy controller are extremely difficult to test without gigantic setups. Since there were no tests for this, there was no setup I could reuse.
This controller should go away on the next iteration, so I opted to test the private method directly.
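As an illustration of that approach, here is a minimal sketch of a same-package (white-box) test; the package, reconciler type, and method are placeholders, not the PR's actual code.

```go
// Hypothetical package and types, only to show the white-box pattern:
// a _test.go file in the same package can call unexported methods directly,
// avoiding the gigantic controller setup mentioned above.
package controllers

import "testing"

// reconciler stands in for the real controller type.
type reconciler struct{}

// prepareKubeProxy stands in for the unexported method under test.
func (r *reconciler) prepareKubeProxy() error { return nil }

func TestPrepareKubeProxyDirectly(t *testing.T) {
	r := &reconciler{}
	if err := r.prepareKubeProxy(); err != nil {
		t.Fatalf("prepareKubeProxy() returned an error: %v", err)
	}
}
```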
pkg/clustermanager/retrier_client.go (outdated)

@@ -12,15 +12,15 @@ import (

// RetrierClient wraps around a ClusterClient, offering retry functionality for some operations.
type RetrierClient struct {
-	*client
+	*clusterManageClient
This is just renaming to avoid collision with the client package
Suggested change:
- *clusterManageClient
+ *clusterManagerClient
@@ -15,6 +16,7 @@ func TestValidateVersionSuccess(t *testing.T) {
}

func TestValidateVersionError(t *testing.T) {
	os.Unsetenv(eksctl.VersionEnvVar)
This prevents flakes locally when the env var is set
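If flakiness from the local shell environment is a broader concern, a small helper along these lines (not part of the PR; names are illustrative) can clear the variable for the duration of the test and restore it afterwards:

```go
package eksctl_test

import (
	"os"
	"testing"
)

// unsetEnvForTest clears key for this test only; any pre-existing value is
// restored via t.Cleanup, so the developer's shell environment is untouched
// once the test finishes.
func unsetEnvForTest(t *testing.T, key string) {
	t.Helper()
	if prev, ok := os.LookupEnv(key); ok {
		t.Cleanup(func() { os.Setenv(key, prev) })
	}
	os.Unsetenv(key)
}
```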
if needsPrepare, err := needsKubeProxyPreUpgrade(spec, kcp); err != nil {
	return err
} else if !needsPrepare {
	log.V(4).Info("Kube-proxy upgrade doesn't need special handling", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)
Isn't this just the following? Or did you want to highlight the special handling between old and new differently?
log.V(4).Info("Kube-proxy upgrade doesn't need special handling", "currentVersion", kcp.Spec.Version, "newVersion", newVersion) | |
log.V(4).Info("skipping kube-proxy upgrade setup as it contains same version", "currentVersion", kcp.Spec.Version, "newVersion", newVersion) |
It doesn't necessarily contain the same version.
The current and new versions can be different with neither of them including the new kube-proxy.
Or they can be two different versions with both of them including the new kube-proxy.
To clarify: this prepare process is only necessary when there is a transition between a kube-proxy that always uses legacy (the "old" one) and a kube-proxy that uses the new iptables wrapper (the "new" one), which adds support for nft.
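In other words, the check is about crossing the old-to-new boundary rather than about version equality. A hedged sketch of that predicate (helper names and the version cut-off are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// hasNFTWrapper is a placeholder: a real check would derive this from the
// eks-d release metadata rather than a string comparison.
func hasNFTWrapper(version string) bool {
	return version >= "v1.23.16-eks-1-23-17" // assumed first release with the wrapper
}

// needsPrepare is true only when upgrading from a legacy-only kube-proxy to
// one that ships the nft-aware wrapper. Different versions on the same side
// of that boundary need no special handling.
func needsPrepare(currentVersion, newVersion string) bool {
	return !hasNFTWrapper(currentVersion) && hasNFTWrapper(newVersion)
}

func main() {
	fmt.Println(needsPrepare("v1.23.16-eks-1-23-16", "v1.23.16-eks-1-23-17")) // true: old -> new
	fmt.Println(needsPrepare("v1.23.16-eks-1-23-17", "v1.23.16-eks-1-23-18")) // false: both new
}
```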
currentImage := kubeProxy.Spec.Template.Spec.Containers[0].Image
if currentImage == newKubeProxyImage {
	log.V(4).Info("Kube-proxy image update seems stable", "wantImage", newKubeProxyImage, "currentImage", currentImage)
How do you ensure stability here? Are these retries enough to ensure that the new image has not been reverted? This won't run into race conditions, right, or can it?
It waits for two seconds after every update and then checks if the image is the expected one.
This tries to mitigate a race condition where the KCP controller hasn't yet read the skip annotation, so it undoes our update to the kube-proxy image. So we repeat this update in a loop and verify that the image doesn't get changed back.
If you are asking whether this guarantees correctness: as with almost any wait-based mechanism, it doesn't. It's still possible, although unlikely, that the KCP controller won't see the skip annotation in 2 seconds and might try to undo the change. The longer the wait, the more confidence, but it will never be 100%. I couldn't figure out any algorithm that guaranteed the KCP controller had read the skip annotation before we update the DS.
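For context, here is a hedged sketch of that wait-and-verify idea; the function name, constants, and use of the controller-runtime client are illustrative, not the PR's exact implementation.

```go
package kubeproxy

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureKubeProxyImageStable re-applies the desired image and only returns
// once the image has survived a short settle window, reducing (not
// eliminating) the chance that the KCP controller reverts it before it has
// observed the skip annotation.
func ensureKubeProxyImageStable(ctx context.Context, c client.Client, wantImage string) error {
	const (
		attempts = 5
		settle   = 2 * time.Second
	)
	key := types.NamespacedName{Namespace: "kube-system", Name: "kube-proxy"}

	for i := 0; i < attempts; i++ {
		ds := &appsv1.DaemonSet{}
		if err := c.Get(ctx, key, ds); err != nil {
			return err
		}
		// Assumes kube-proxy is the first container in the DaemonSet pod spec.
		if ds.Spec.Template.Spec.Containers[0].Image != wantImage {
			ds.Spec.Template.Spec.Containers[0].Image = wantImage
			if err := c.Update(ctx, ds); err != nil {
				return err
			}
		}

		time.Sleep(settle) // give the KCP controller time to react, if it is going to

		if err := c.Get(ctx, key, ds); err != nil {
			return err
		}
		if ds.Spec.Template.Spec.Containers[0].Image == wantImage {
			// The image survived the settle window; treat the update as stable.
			return nil
		}
		// The image was reverted, most likely because the KCP controller had
		// not yet seen the skip annotation. Loop and try again.
	}
	return fmt.Errorf("kube-proxy image kept being reverted; wanted %s", wantImage)
}
```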
Is 2 seconds enough to have this check then? Just wondering if we need longer or if KCP modifies another property when it sees the skip annotation. If not, yea then this seems fine.
Force-pushed from e253e34 to ba4386d.
Force-pushed from c1ccb4b to b088d84.
Force-pushed from e352eac to 7b917ad.
/lgtm
The new eks-d version includes the new kube-proxy with support for iptables nft. The old kube-proxy always uses iptables legacy.

During an upgrade, when the new machine for the new CP node is started, if the machine has iptables nft as the default, the kubelet will use it. Then, before capi updates the kube-proxy image version in the DS (this doesn't happen until the CP upgrade is finished), the old kube-proxy is scheduled in the node. This old kube-proxy doesn't support nft and always uses iptables legacy. When it starts, it adds legacy iptables rules. However, at this point the kubelet has already added iptables-nft rules.

After the CP has been updated, capi updates the kube-proxy DS to the new version. This new version has the new wrapper, which detects the rules introduced by the kubelet, so it starts using nft. The hypothesis is that these leftover legacy rules break the k8s service IP "redirection".

This allows a smooth transition by scheduling a DS with the old kube-proxy only in the old nodes and scheduling a DS with the new kube-proxy only in the new nodes.
Force-pushed from 7b917ad to a8bd485.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: g-gaston. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cherry-pick release-0.15
@g-gaston: once the present PR merges, I will cherry-pick it on top of release-0.15 in a new PR and assign it to you.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
@g-gaston: new pull request created: #5383
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Before the kube-proxy upgrader (#5345) all API calls to the management cluster came either from kubectl or clusterctl, which both happened to be run in a docker container in the admin machine. This was the first piece of code that introduced the use of the Kubernetes Go client directly from the CLI binary. This means that if a user was relying on this internal implementation (explicit interface vs implicit interface), their system could break if it wasn't set up to give the CLI network connectivity to the kind cluster. This PR "reverts" the addition of that new paradigm by changing the underlying client implementation to use kubectl commands. Co-authored-by: Guillermo Gaston <gaslor@amazon.com>
Problem statement

Symptoms

When upgrading a Cloudstack cluster (either in k8s 1.22 or 1.23) from eks-a v0.14.x to v0.15.0 (code in main), the api-server loses connectivity to the webhooks. This makes it impossible to create/update CRDs with validation/mutation webhooks. In particular, the eks-a process fails when moving the management resources back to the management cluster with clusterctl, since this recreates the capi objects.

Cause
Preface: upgrading from v0.14.3 to v0.15.0 involves upgrading from eks-d v1.23.16-eks-1-23-16 to v1.23.16-eks-1-23-17. The new eks-d version includes the new kube-proxy with support for iptables nft. The old kube-proxy always uses iptables legacy.

During an upgrade, when the new machine for the new CP node is started, if the machine has iptables nft as the default, the kubelet will use it. Then, before capi updates the kube-proxy image version in the DS (this doesn't happen until the CP upgrade is finished), the old kube-proxy is scheduled in the node. This old kube-proxy doesn't support nft and always uses iptables legacy. When it starts, it adds legacy iptables rules. However, at this point the kubelet has already added iptables-nft rules.

After the CP has been updated, capi updates the kube-proxy DS to the new version. This new version has the new wrapper, which detects the rules introduced by the kubelet, so it starts using nft.

The hypothesis is that these leftover legacy rules break the k8s service IP "redirection". It's also possible that Cilium is getting messed up when it starts, since it also depends on the rules introduced by kube-proxy: cilium/cilium#20123

So it turns out this problem is not exclusive to Cloudstack; it will happen with any OS that defaults to nft. It just happens that the only provider where we have that kind of image right now is Cloudstack.
Solution

TLDR: before the upgrade, schedule a DS with the old kube-proxy only in the old nodes and schedule the existing DS with the new kube-proxy version only in the new nodes.

Upgrade flow changes:

1. Annotate the KCP object so the kcp controller skips kube-proxy reconciliations. If not, the kcp controller will undo our next changes.
2. Label all existing nodes with anywhere.eks.amazonaws.com/iptableslegacy=true. This allows us to identify the old nodes later.
3. Add node affinity to the kube-proxy DS using requiredDuringSchedulingIgnoredDuringExecution and DoesNotExist (see the sketch after this list). This ensures the original kube-proxy is not scheduled in the old nodes after we update the image version.
4. Delete the kube-proxy pods in old nodes. Node affinity should remove them, but this removes race conditions.
5. Create a new DS kube-proxy-iptables-legacy with the old kube-proxy version, using a node selector with anywhere.eks.amazonaws.com/iptableslegacy=true. This DS is only scheduled in the current (pre-upgrade) nodes.
6. Update the kube-proxy DS with the new eks-d version v1.23.16-eks-1-23-18.
7. Once the upgrade has finished, delete kube-proxy-iptables-legacy.
8. Remove the nodeAffinity from the kube-proxy DS.
9. Remove the KCP annotation to re-enable kube-proxy reconciliations.
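A hedged illustration of the scheduling pieces from the list above, using client-go types (constants and function names are illustrative, not the PR's actual code): the node affinity added to the original kube-proxy DS, and the node selector for the temporary legacy DS.

```go
package kubeproxy

import corev1 "k8s.io/api/core/v1"

const legacyLabel = "anywhere.eks.amazonaws.com/iptableslegacy"

// avoidLegacyNodesAffinity keeps the updated kube-proxy DS off nodes that
// still carry the legacy label (the pre-upgrade nodes).
func avoidLegacyNodesAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      legacyLabel,
						Operator: corev1.NodeSelectorOpDoesNotExist,
					}},
				}},
			},
		},
	}
}

// legacyOnlyNodeSelector pins the temporary kube-proxy-iptables-legacy DS to
// the labeled (pre-upgrade) nodes.
func legacyOnlyNodeSelector() map[string]string {
	return map[string]string{legacyLabel: "true"}
}
```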
Implementation notes

The upgrader talks to the cluster with the Kubernetes Go client directly from the CLI instead of kubectl calls.

Testing
I have manually tested the present code a few times and it works just fine.
TODOs
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.