
Allow smooth upgrades to new kube-proxy with nft #5345

Merged

Conversation

@g-gaston (Member) commented Mar 22, 2023

Problem statement

Symptoms

When upgrading a Cloudstack cluster (either in k8s 1.22 or 1.23) from eks-a v0.14.x to v0.15.0 (code in main), the api-server loses connectivity to the webhooks. This makes it impossible to create/update CRDs with validation/mutation webhooks. In particular, the eks-a process fails when moving the management resources back to the management cluster with clusterctl, since this recreates the capi objects.

Cause

Preface: upgrading from v0.14.3 to v0.15.0 involves upgrading from eks-d v1.23.16-eks-1-23-16 to v1.23.16-eks-1-23-17. The new eks-d version includes the new kube-proxy with support for iptables nft. The old kube-proxy always uses iptables legacy.

During an upgrade, when the new machine for the new CP node is started, if the machine has iptables nft as the default, the kubelet will use it. Then, before capi updates the kube-proxy image version in the DS (this doesn't happen until the CP upgrade is finished), the old kube-proxy is scheduled in the node. This old kube-proxy doesn't support nft and always uses iptables legacy. When it starts, it adds legacy iptables rules. However, at this point the kubelet has already added iptables-nft rules.

After the CP has been updated, capi updates the kube-proxy DS to the new version. This new version has the new wrapper, which detects the rules introduced by the kubelet, so it starts using nft.
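For reference, the wrapper's choice boils down to checking which iptables backend already holds kubelet-created rules. The Go sketch below is only an illustration of that idea; the real wrapper is a script baked into the kube-proxy image, and the chain names here follow the upstream iptables-wrapper convention rather than anything in this PR.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// countKubeletRules runs the given save binary (iptables-legacy-save or
// iptables-nft-save) and counts lines that mention the chains kubelet creates.
func countKubeletRules(saveBin string) int {
    out, err := exec.Command(saveBin, "-t", "mangle").CombinedOutput()
    if err != nil {
        return 0 // treat a missing/failing backend as "no rules"
    }
    count := 0
    for _, line := range strings.Split(string(out), "\n") {
        if strings.Contains(line, "KUBE-IPTABLES-HINT") || strings.Contains(line, "KUBE-KUBELET-CANARY") {
            count++
        }
    }
    return count
}

func main() {
    // Pick whichever backend the kubelet already wrote rules into.
    backend := "legacy"
    if countKubeletRules("iptables-nft-save") > countKubeletRules("iptables-legacy-save") {
        backend = "nft"
    }
    fmt.Println("selected iptables backend:", backend)
}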

The hypothesis is that these leftover legacy rules break the k8s service IP "redirection". It's also possible that Cilium is getting messed up when it starts since it also depends on the rules introduced by kube-proxy: cilium/cilium#20123

So it turns out this problem is not exclusive to Cloudstack; it will happen with any OS that defaults to nft. It just happens that the only provider where we have that kind of image right now is Cloudstack.

Solution

TLDR: before the upgrade, schedule a DS with the old kube-proxy only on the old nodes and schedule the existing DS with the new kube-proxy version only on the new nodes.

Upgrade flow changes (a rough client-go sketch of steps 4 and 6 follows the list):

  1. Run everything we do today up until the actual capi upgrade/CP upgrade (the capi objects upgrade, not the capi components upgrade).
  2. Add an annotation to the KCP to stop kube-proxy reconciliations. Otherwise, the KCP controller will undo our next changes.
  3. Add the special label anywhere.eks.amazonaws.com/iptableslegacy=true to all nodes. This allows us to identify the old nodes later.
  4. Add node (anti) affinity to the main kube-proxy DS using requiredDuringSchedulingIgnoredDuringExecution and DoesNotExist. This ensures the original kube-proxy is not scheduled on the old nodes after we update the image version.
  5. Delete the kube-proxy pods on the old nodes. Node affinity should remove them anyway, but this avoids race conditions.
  6. Create a new DS kube-proxy-iptables-legacy with the old kube-proxy version and a node selector on anywhere.eks.amazonaws.com/iptableslegacy=true. This DS is only scheduled on the current (pre-upgrade) nodes.
  7. Update the main kube-proxy DS to the new eks-d version v1.23.16-eks-1-23-18.
  8. Continue with the upgrade flow.
  9. Delete kube-proxy-iptables-legacy.
  10. Remove nodeAffinity from kube-proxy DS.
  11. Remove annotation from KCP to resume kube-proxy reconciliations.
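As a rough illustration of steps 4 and 6, the sketch below uses plain client-go types. It is not the code in this PR: the label key comes from the list above, while the function names, the kube-system/kube-proxy object names, and the wholesale spec clone are assumptions made for the example.

package kubeproxy

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const iptablesLegacyLabel = "anywhere.eks.amazonaws.com/iptableslegacy"

// addAntiAffinityToMainDS (step 4 above) keeps the main kube-proxy DS off any
// node that carries the legacy label.
func addAntiAffinityToMainDS(ctx context.Context, c kubernetes.Interface) error {
    ds, err := c.AppsV1().DaemonSets("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
    if err != nil {
        return err
    }
    ds.Spec.Template.Spec.Affinity = &corev1.Affinity{
        NodeAffinity: &corev1.NodeAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
                NodeSelectorTerms: []corev1.NodeSelectorTerm{{
                    MatchExpressions: []corev1.NodeSelectorRequirement{{
                        Key:      iptablesLegacyLabel,
                        Operator: corev1.NodeSelectorOpDoesNotExist,
                    }},
                }},
            },
        },
    }
    _, err = c.AppsV1().DaemonSets("kube-system").Update(ctx, ds, metav1.UpdateOptions{})
    return err
}

// createLegacyDS (step 6) clones the main DS into kube-proxy-iptables-legacy,
// pins it to the old image, and restricts it to labeled (pre-upgrade) nodes.
// Selector/labels handling is elided; the real change may build the spec differently.
func createLegacyDS(ctx context.Context, c kubernetes.Interface, oldImage string) error {
    mainDS, err := c.AppsV1().DaemonSets("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
    if err != nil {
        return err
    }
    legacy := &appsv1.DaemonSet{
        ObjectMeta: metav1.ObjectMeta{Name: "kube-proxy-iptables-legacy", Namespace: "kube-system"},
        Spec:       *mainDS.Spec.DeepCopy(),
    }
    legacy.Spec.Template.Spec.Containers[0].Image = oldImage
    legacy.Spec.Template.Spec.Affinity = nil
    legacy.Spec.Template.Spec.NodeSelector = map[string]string{iptablesLegacyLabel: "true"}
    _, err = c.AppsV1().DaemonSets("kube-system").Create(ctx, legacy, metav1.CreateOptions{})
    return err
}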

Implementation notes

  • I have used a Go client for the k8s API calls. I know this is a semi-new thing, especially in the CLI, and it requires a group conversation and agreement before it gets introduced in the codebase. However, I went with it because I knew the implementation was going to be significantly faster and this issue is a release blocker. If there are concerns with using this client, I would ask the reviewers to write them down but let the PR continue its course. And if required, I'll come back and refactor this to use kubectl calls. (A minimal sketch of building such a client follows this list.)
  • I haven't put a lot of thought into naming and the placement of the code, both physically (package) and logically (from where we call the new code). If you have better ideas, please voice them. I'm willing to move this around as necessary if the timing allows it. If it doesn't, I'll create a ticket to come back and address the agreed alternative.
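For context on the first bullet, building such a client looks roughly like this minimal sketch. It assumes controller-runtime and a kubeconfig path; it is not necessarily how this PR wires things up.

package clientsketch

import (
    "k8s.io/client-go/tools/clientcmd"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// newRuntimeClient builds a controller-runtime client from a kubeconfig file.
// With an empty Options struct, client.New falls back to the default scheme,
// which already registers the core and apps types needed for the kube-proxy DS.
func newRuntimeClient(kubeconfigPath string) (client.Client, error) {
    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
    if err != nil {
        return nil, err
    }
    return client.New(cfg, client.Options{})
}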

Testing

I have manually tested the present code a few times and it works just fine.

TODOs

  • Cleanup code
  • Add cleanup phase after upgrade implementing 8, 9 and 10.
  • Run e2e test locally
  • Add k8s -> eks-d mapping for all versions
  • Unit tests
  • (Maybe) add retrying to the client calls or retry the whole "prepare" call and make the code reentrant. This will depend on timing. So far I haven't found issues during testing.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot (Collaborator)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@eks-distro-bot eks-distro-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 22, 2023
@codecov codecov bot commented Mar 22, 2023

Codecov Report

Merging #5345 (7b917ad) into main (7ee77c6) will increase coverage by 0.25%.
The diff coverage is 81.74%.

❗ Current head 7b917ad differs from pull request most recent head a8bd485. Consider uploading reports for the commit a8bd485 to get more accurate results

@@            Coverage Diff             @@
##             main    #5345      +/-   ##
==========================================
+ Coverage   72.39%   72.65%   +0.25%     
==========================================
  Files         440      441       +1     
  Lines       36125    36660     +535     
==========================================
+ Hits        26152    26634     +482     
- Misses       8408     8426      +18     
- Partials     1565     1600      +35     
Impacted Files Coverage Δ
pkg/task/task.go 88.14% <ø> (ø)
pkg/workflows/upgrade.go 63.56% <60.00%> (-0.16%) ⬇️
pkg/clustermanager/kube_proxy.go 79.06% <79.06%> (ø)
controllers/resource/reconciler.go 63.76% <100.00%> (+4.60%) ⬆️
pkg/clients/kubernetes/runtimeclient.go 79.41% <100.00%> (+1.28%) ⬆️
pkg/cluster/spec.go 78.03% <100.00%> (+1.04%) ⬆️
pkg/clustermanager/client.go 100.00% <100.00%> (ø)
pkg/clustermanager/retrier_client.go 92.30% <100.00%> (ø)
pkg/controller/clientutil/annotations.go 100.00% <100.00%> (ø)
pkg/controller/clusterapi.go 100.00% <100.00%> (ø)
... and 1 more

... and 26 files with indirect coverage changes


@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch 2 times, most recently from d679b5c to 5ab642f Compare March 23, 2023 00:32
@g-gaston g-gaston marked this pull request as ready for review March 23, 2023 00:33
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from 1e05315 to bb9dfe9 Compare March 23, 2023 15:58
@g-gaston g-gaston changed the title [WIP] Allow smooth upgrades from old iptables legacy kube-proxy to new kube-proxy Allow smooth upgrades to new kube-proxy with nft Mar 23, 2023
@g-gaston (Member Author):

Apologies for doing this kind of unexported method testing, but this reconciler and the legacy controller are extremely difficult to test without gigantic setups. Since there were no tests for this, there was no setup I could reuse.

This controller should go away on the next iteration, so I opted to test the private method directly.

@@ -12,15 +12,15 @@ import (

// RetrierClient wraps around a ClusterClient, offering retry functionality for some operations.
type RetrierClient struct {
-   *client
+   *clusterManageClient
@g-gaston (Member Author):

This is just renaming to avoid collision with the client package

Member:

Suggested change
*clusterManageClient
*clusterManagerClient

@@ -15,6 +16,7 @@ func TestValidateVersionSuccess(t *testing.T) {
}

func TestValidateVersionError(t *testing.T) {
+   os.Unsetenv(eksctl.VersionEnvVar)
@g-gaston (Member Author):

This prevents flakes locally when the env var is set

@drewvanstone drewvanstone added this to the v0.15.0 milestone Mar 23, 2023
@jiayiwang7 jiayiwang7 assigned jiayiwang7 and unassigned jiayiwang7 Mar 23, 2023
@jiayiwang7 jiayiwang7 self-requested a review March 23, 2023 17:57
controllers/resource/reconciler.go (outdated, resolved)
if needsPrepare, err := needsKubeProxyPreUpgrade(spec, kcp); err != nil {
    return err
} else if !needsPrepare {
    log.V(4).Info("Kube-proxy upgrade doesn't need special handling", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)
Member:

Isn't this just the following? Or did you want to highlight the special handling between old and new differently?

Suggested change
log.V(4).Info("Kube-proxy upgrade doesn't need special handling", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)
log.V(4).Info("skipping kube-proxy upgrade setup as it contains same version", "currentVersion", kcp.Spec.Version, "newVersion", newVersion)

@g-gaston (Member Author):

It doesn't necessarily contain the same version.
It can contain two different versions, neither of which includes the new kube-proxy.
Or it can contain two different versions, both of which include the new kube-proxy.

@g-gaston (Member Author):

To clarify: this prepare process is only necessary when there is a transition between a kube-proxy that always uses legacy (the "old" one) and a kube-proxy that uses the new iptables wrapper (the "new" one), which adds support for nft.
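In other words, the check is roughly the following. This is an illustrative sketch only; the boundary constant is a made-up placeholder, since the real code compares against whichever eks-d build first shipped the nft-aware wrapper.

// needsKubeProxyPreUpgradeSketch returns true exactly when we move from a
// legacy-only kube-proxy to one that ships the nft-aware wrapper.
func needsKubeProxyPreUpgradeSketch(currentBuild, newBuild int) bool {
    const firstNFTAwareBuild = 17 // placeholder, not the real boundary build number
    oldIsLegacyOnly := currentBuild < firstNFTAwareBuild
    newHasWrapper := newBuild >= firstNFTAwareBuild
    return oldIsLegacyOnly && newHasWrapper
}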


currentImage := kubeProxy.Spec.Template.Spec.Containers[0].Image
if currentImage == newKubeProxyImage {
    log.V(4).Info("Kube-proxy image update seems stable", "wantImage", newKubeProxyImage, "currentImage", currentImage)
Member:

How do you ensure stability here? With these retries, is this enough to ensure that the new image has not been reverted? This won't run into race conditions, right? Or can it?

@g-gaston (Member Author):

It waits for two seconds after every update and then checks if the image is the expected one.
This tries to mitigate a race condition where the KCP controller hasn't yet read the skip annotation and undoes our update to the kube-proxy image. So we repeat this update in a loop and verify that the image doesn't get changed back.

If you are asking if this guarantees correctness: as with almost any wait-based mechanism, it doesn't. It's still possible, although unlikely, that the KCP controller won't see the skip annotation in 2 seconds and might try to undo the change. The longer the wait, the more confidence, but it will never be 100%. I couldn't figure out any algorithm that guaranteed the KCP controller had read the skip annotation before we update the DS.
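A hedged sketch of that mitigation follows; it is not the PR's exact code, and the helper function signatures, the retry budget, and the 2-second wait are assumptions.

package kubeproxy

import (
    "context"
    "fmt"
    "time"
)

// ensureImageStable re-applies the desired kube-proxy image and only returns
// once it has stayed in place after a short wait, which gives the KCP
// controller time to observe the skip annotation.
func ensureImageStable(ctx context.Context, update func(context.Context, string) error, current func(context.Context) (string, error), want string) error {
    for i := 0; i < 10; i++ { // retry budget is illustrative
        if err := update(ctx, want); err != nil {
            return err
        }
        time.Sleep(2 * time.Second)
        got, err := current(ctx)
        if err != nil {
            return err
        }
        if got == want {
            return nil // image stayed put; assume the annotation is now honored
        }
    }
    return fmt.Errorf("kube-proxy image kept being reverted, wanted %s", want)
}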

Member:

Is 2 seconds enough to have this check then? Just wondering if we need longer or if KCP modifies another property when it sees the skip annotation. If not, yea then this seems fine.

@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from e253e34 to ba4386d Compare March 24, 2023 14:46
pkg/clients/kubernetes/runtimeclient.go (outdated, resolved)
pkg/clients/kubernetes/runtimeclient.go (outdated, resolved)
pkg/cluster/spec.go (outdated, resolved)
pkg/clustermanager/kube_proxy.go (resolved)
pkg/clustermanager/kube_proxy.go (resolved)
pkg/clustermanager/kube_proxy.go (outdated, resolved)
pkg/clustermanager/kube_proxy.go (outdated, resolved)
pkg/clustermanager/kube_proxy.go (resolved)
pkg/clustermanager/kube_proxy.go (outdated, resolved)
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from c1ccb4b to b088d84 Compare March 24, 2023 17:24
@eks-distro-bot eks-distro-bot added lgtm and removed lgtm labels Mar 24, 2023
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from e352eac to 7b917ad Compare March 24, 2023 17:59
@jiayiwang7 (Member)

/lgtm

The new eks-d version includes the new kube-proxy with support for
iptables nft. The old kube-proxy always uses iptables legacy.

During an upgrade, when the new machine for the new CP node is started,
if the machine has iptables nft as the default, the kubelet will use it.
Then, before capi updates the kube-proxy image version in the DS (this
doesn't happen until the CP upgrade is finished), the old kube-proxy is
scheduled in the node. This old kube-proxy doesn't support nft and
always uses iptables legacy. When it starts, it adds legacy iptables
rules. However, at this point the kubelet has already added iptables-nft
rules.

After the CP has been updated, capi updates the kube-proxy DS to the new
version. This new version has the new wrapper, which detects the rules
introduced by the kubelet, so it starts using nft.

The hypothesis is that these leftover legacy rules break the k8s service
IP "redirection".

This allows a smooth transition by scheduling a DS with the old kube proxy
only in the old nodes and schedule a DS with the new kube-proxy only in
the new nodes.
@g-gaston g-gaston force-pushed the fix-upgrade-cloudstack-from-iptables-legacy branch from 7b917ad to a8bd485 Compare March 24, 2023 18:15
@jiayiwang7 (Member)

/lgtm

@g-gaston (Member Author)

/approve

@eks-distro-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: g-gaston

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@g-gaston (Member Author)

/cherry-pick release-0.15

@eks-distro-pr-bot (Contributor)

@g-gaston: once the present PR merges, I will cherry-pick it on top of release-0.15 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@g-gaston (Member Author)

eks-anywhere-e2e-presubmit seems like a flake when pushing the image
/retest

@g-gaston (Member Author)

/retest

@eks-distro-pr-bot (Contributor)

@g-gaston: new pull request created: #5383

In response to this:

/cherry-pick release-0.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
Before the kube-proxy upgrader
(aws#5345) all API calls to the
management cluster came either from kubectl or clusterctl, which both
happened to be run in a docker container in the admin machine. This was
the first piece of code that introduced the use of Kubernetes Go client
directly from the CLI binary. This means that if a user was relying on
this internal implementation (explicit interface vs implicit interface),
their system could break if it wasn't set up to give the CLI network
connectivity to the kind cluster.

This PR "reverts" the addition of that new paradigm by changing the
underlying client implementation to use kubectl commands.
g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
g-gaston added a commit to g-gaston/eks-anywhere that referenced this pull request Apr 12, 2023
eks-distro-bot pushed a commit that referenced this pull request Apr 12, 2023
eks-distro-pr-bot pushed a commit to eks-distro-pr-bot/eks-anywhere that referenced this pull request Apr 13, 2023
eks-distro-pr-bot pushed a commit to eks-distro-pr-bot/eks-anywhere that referenced this pull request Apr 13, 2023
eks-distro-bot pushed a commit that referenced this pull request Apr 13, 2023
ddjjia pushed a commit to ddjjia/eks-anywhere that referenced this pull request Jun 8, 2023
Labels
approved blocker lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants