Supporting hosts configured to use iptables-nft in kube-proxy #1729

Closed
jaxesn opened this issue Feb 6, 2023 · 5 comments · Fixed by aws/eks-distro-build-tooling#885 or #1728

Comments

@jaxesn
Member

jaxesn commented Feb 6, 2023

Background

We introduced our minimal base images when we released Kubernetes 1.22 support. We have since gone back and retrofitted previous versions to ship on the minimal bases. In EKS-D we use two minimal bases, eks-distro-minimal-base and eks-distro-minimal-base-iptables. The iptables variant was originally used only by the kube-proxy image, but we have since used it elsewhere (as the base for the kind image, the base for the kindnetd image, and the EKS Managed kube-proxy addon has a minimal variant based on our iptables image). At the time we introduced this image, upstream was using debian as the base image for their kube-proxy.

A decision we made when creating the iptables minimal base was to avoid including bash and other "coreutils". This meant a divergence between our kube-proxy and upstream's. Upstream's image uses iptables-wrapper.sh in place of iptables directly in order to support both iptables-legacy and iptables-nft. Based on which mode it detects the host using, update-alternatives is called within the container so that if the host is using iptables-legacy, the container uses iptables-legacy, and the same for iptables-nft. Since we do not include bash and this script in our kube-proxy, our image will only ever use iptables-legacy.
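For illustration, here is a rough Go sketch of that detect-and-switch behavior. The binary paths, the mangle-table check, and the chain names (KUBE-IPTABLES-HINT / KUBE-KUBELET-CANARY) are assumptions based on my reading of the upstream wrapper, not a verified port:

```go
// Minimal sketch: detect which iptables backend the host is using by looking
// for kube-created chains in each backend's ruleset, then point the
// container's iptables alternative at that backend. Mirrors what
// iptables-wrapper.sh does; not a drop-in replacement for it.
package main

import (
	"log"
	"os/exec"
	"strings"
)

// countKubeChains runs the given save binary (iptables-legacy-save or
// iptables-nft-save) and counts lines that look like kube-managed chains.
func countKubeChains(saveBin string) int {
	out, err := exec.Command(saveBin, "-t", "mangle").CombinedOutput()
	if err != nil {
		return 0
	}
	count := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "KUBE-IPTABLES-HINT") || strings.Contains(line, "KUBE-KUBELET-CANARY") {
			count++
		}
	}
	return count
}

func main() {
	mode := "legacy"
	if countKubeChains("iptables-nft-save") > countKubeChains("iptables-legacy-save") {
		mode = "nft"
	}
	// Switch the container's iptables/ip6tables alternatives to match the host.
	for _, alt := range []string{"iptables", "ip6tables"} {
		target := "/usr/sbin/" + alt + "-" + mode
		if err := exec.Command("update-alternatives", "--set", alt, target).Run(); err != nil {
			log.Fatalf("failed to set %s to %s: %v", alt, target, err)
		}
	}
	log.Printf("configured container iptables for %s mode", mode)
}
```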

@M00nF1sh and I discussed this back when creating the iptables base image and this issue was logged. We decided that since our primary node images on EKS and EKS-Anywhere were AL2 and Ubuntu 20.04, both of which use iptables-legacy by default, we did not need to handle the case of hosts being configured to use iptables-nft. We did however decide to include iptables-nft in our base image.

Upstream has recently introduced an iptables base image based on distroless, which I believe shipped as part of 1.24 or 1.25. They use a similar technique to build this image as we do for some of our minimal base images: they specify the binaries needed by the iptables-wrapper.sh script and use ldd to determine which libraries also need to be included in the final image. This gives a pretty good best of both worlds, where the final image is fairly minimal both in size and CVE scope, and their wrapper script is able to support both legacy and nft iptables.
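To make the ldd technique concrete, here is a simplified Go sketch; the binary paths are examples and the parsing is much cruder than the real build tooling:

```go
// Illustrative sketch: given a list of binaries, shell out to ldd and collect
// the shared libraries each one needs so they can be copied into a minimal
// image. Only handles "lib => /path" lines; the real tooling is more involved.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// sharedLibs parses ldd output lines like
// "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)" and returns the
// resolved library paths.
func sharedLibs(binary string) ([]string, error) {
	out, err := exec.Command("ldd", binary).Output()
	if err != nil {
		return nil, err
	}
	var libs []string
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Fields(line)
		for i, f := range fields {
			if f == "=>" && i+1 < len(fields) && strings.HasPrefix(fields[i+1], "/") {
				libs = append(libs, fields[i+1])
			}
		}
	}
	return libs, nil
}

func main() {
	for _, bin := range []string{"/usr/sbin/iptables-nft", "/usr/sbin/iptables-legacy"} {
		libs, err := sharedLibs(bin)
		if err != nil {
			fmt.Printf("%s: %v\n", bin, err)
			continue
		}
		fmt.Printf("%s needs: %s\n", bin, strings.Join(libs, ", "))
	}
}
```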

There have been some efforts upstream to go-ify this wrapper script, which would remove the need for bash (or dash, in the case of upstream's image) and the coreutils in the final image.

Ubuntu 22.04 and RHEL 8.7

We have recently begun testing EKS-A using Ubuntu 22.04 and RHEL 8.7, both of which default iptables to iptables-nft. Through this testing we have run into issues both in CAPD and CAPC clusters which appear to be nft vs legacy related.

An interesting bit is that, at least for RHEL 8.7 on CAPC, 1.23 works totally fine, but when using 1.24 we see the issue. Our best guess (thanks @vignesh-goutham for tracking this down!) is that in 1.24 this was added to the kubelet. Since kubelet runs on the host, when it creates that "hint" chain, the kube-proxy container, which is still using iptables-legacy, is not able to deal with the nft-created chain; on 1.23, kube-proxy was the only actor interacting with iptables, so this was not an issue. From what I can gather from the following stackoverflow and stackoverflow posts, the main difference between nft and legacy is the frontend syntax and the resulting rules are "compatible", which could explain why we don't see issues with 1.23 but we do with 1.24.

Options

Recommended short-term
We modify our kube-proxy-base build to include bash and the other utils upstream includes in their distroless-based iptables image. This is a pretty small lift which would look something like this. The downside here is slight bloat to the image and, most importantly for us, a larger CVE scope which would require more frequent releases. That said, I do not expect this to significantly increase our need for security releases.

Alternative short-term
Document that we only support hosts using iptables-legacy and, for eks-a, just ship node images (ova/ami/raw/etc) configured with legacy for now (this assumes all the OS's we build still support both implementations).

Long-term
Get involved with the effort to produce an equivalent golang-based version of iptables-wrapper.sh and bake that into our iptables base image instead of relying on bash and the wrapper script. This is something we could roll out in our images ahead of upstream, since we have already created a divergence between ours and theirs anyway.

Others

  • The aws-vpc-cni has similar handling for iptables as the kube-proxy image. They took the approach of allowing their users to set an env var to switch to nft mode from legacy. We could introduce a small golang app entrypoint in our kube-proxy which checks for this env var and runs the correct update-alternatives command (see the sketch after this list). This would require us to modify EKS-A to set this on our kube-proxy deployments when the node OS is one that defaults to nft.
  • Remove what we think is the offending change between 1.23 and 1.24 in the EKS-D build of the kubelet. This feels a bit extreme and risky.
  • In 1.25, it looks like some of the iptables handling in the kubelet was moved to be configurable by a feature gate. We could do more testing in 1.25 and see if maybe we could backport this to 1.24 if enabling that flag solves our issue in 1.25.
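A sketch of what the env-var entrypoint from the first bullet could look like. The IPTABLES_MODE variable name and the kube-proxy binary path are illustrative assumptions, not an existing interface:

```go
// Hypothetical entrypoint: read an env var, switch the container's iptables
// alternatives to match, then exec the real kube-proxy binary.
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Variable name and default are illustrative.
	mode := os.Getenv("IPTABLES_MODE")
	if mode != "nft" {
		mode = "legacy"
	}
	for _, alt := range []string{"iptables", "ip6tables"} {
		target := "/usr/sbin/" + alt + "-" + mode
		if err := exec.Command("update-alternatives", "--set", alt, target).Run(); err != nil {
			log.Fatalf("update-alternatives failed for %s: %v", alt, err)
		}
	}
	// Hand off to the real kube-proxy binary, preserving args and environment.
	// The binary path is an assumption about where kube-proxy lives in the image.
	argv := append([]string{"/usr/local/bin/kube-proxy"}, os.Args[1:]...)
	if err := syscall.Exec(argv[0], argv, os.Environ()); err != nil {
		log.Fatalf("failed to exec kube-proxy: %v", err)
	}
}
```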

Testing

I was able to repro, I think, using the Ubuntu 22.04 AMI with the EKS-D kops scripts. The conformance tests for 1.23 still pass, and there are a few that fail on 1.24 which pass if I use upstream's kube-proxy image.

@danbudris
Member

danbudris commented Feb 6, 2023

If we're going to go to the Go entrypoint eventually anyway, do we believe that it'd be enough work and risk that we shouldn't just do it as the first solution?

final image is fairly minimal both in size and cve scope

this seems right, something like GNU Bash won't add significantly to our CVE load -- a total of 17 CVEs since 1999 and only one in the last couple of years. The other utils are also older and more stable and won't be subject to security fixes at the frequency of something under more active development. It will, however, expand the surface of the kube-proxy image to have a shell available; using a Go binary as the entrypoint for kube-proxy would allow us to avoid including a shell in the image.

The aws-vpc-cni has similar handling for iptables as the kube-proxy image. They took the approach of allowing their users to set an env var to switch to nft mode from legacy (aws/amazon-vpc-cni-k8s#2155). We could introduce a small golang app entrypoint in our kube-proxy which checks for this env var and runs the correct update-alternatives command. This would require we modify EKS-A to set this on our kube-proxy deployments when the node OS is one that defaults to nft.

Wondering what you think the cons here are? The work required vs a bash script which is well-trodden and straightforward?

Remove what we think is the offending change between 1.23 and 1.24 in the EKS-D build of the kubelet. This feels a bit extreme and risky.

Agreed; this doesn't seem like a good option when we have more stable alternatives.

@vignesh-goutham
Member

I prefer both the short-term and long-term solutions proposed over the other options. While we have diverged, this keeps us moving in the same direction as upstream and sets us up better for when the iptables ownership change from kubelet to kube-proxy goes fully GA in Kubernetes 1.28.

In 1.25, it looks like some of the iptables handling in the kubelet was moved to be configurable by a feature gate. We could do more testing in 1.25 and see if maybe we could backport this to 1.24 if enabling that flag solves our issue in 1.25.

When EKS-A supports 1.25 we should start using this feature gate for sure. But I don't think back-porting this would be enough to solve our issue. Even with this feature gate, kubelet will still create the KUBE-FIREWALL rule for martian packet filtering, which could potentially cause problems if kubelet and kube-proxy use different iptables backends.

@g-gaston
Member

g-gaston commented Feb 6, 2023

The aws-vpc-cni has similar handling for iptables as the kube-proxy image. They took the approach of allowing their users to set an env var to switch to nft mode from legacy (aws/amazon-vpc-cni-k8s#2155). We could introduce a small golang app entrypoint in our kube-proxy which checks for this env var and runs the correct update-alternatives command. This would require we modify EKS-A to set this on our kube-proxy deployments when the node OS is one that defaults to nft.

I would vote against this. It adds a fair amount of work, since we would need to tinker with the default kubeadm installation, possibly having to handle it ourselves. Also, it adds one more "OS dependent" knob, which is always hard in eks-a given we can't assume full control over the host's image and we can't inspect it before the machines are created. It also means we would need to make this option available in our API, since users can have custom images.

Remove what we think is the offending change between 1.23 and 1.24 in the EKS-D build of the kubelet. This feels a bit extreme and risky.

Yeah this doesn't seem wise.

In 1.25, it looks like some of the iptables handling in the kubelet was moved to be configurable by a feature gate. We could do more testing in 1.25 and see if maybe we could backport this to 1.24 if enabling that flag solves our issue in 1.25.

As @vignesh-goutham was saying, that doesn't guarantee it will work.

Upstream recommends using the same implementation for rules set both from the host and from containers. Although you can technically use both legacy and nft at the same time, they are two different kernel subsystems, so priority/precedence works independently in the two sets of rules. This means that unless you define the rules in both backends with this in mind, packets can go through both chains and end up nowhere (which is what I saw for capd). I suspect that the fact that it worked for <1.24 was just pure luck in the way the two sets of rules interacted with each other.

And even if we are able to test all the possible network scenarios (also, how many CNIs would we test?) to guarantee there isn't a pattern that is broken, it could break again in a future k8s release. It seems this would increase the maintenance cost down the road.


As for an alternative to the short term solution: what about just documenting that we only support hosts using iptables legacy? And for eks-a, we just ship all images configured with legacy for now (this is assuming all the OS's we build still support both implementations).

@jaxesn
Member Author

jaxesn commented Feb 6, 2023

If we're going to go to the Go entrypoint eventually anyway, do we believe that it'd be enough work and risk that we shouldn't just do it as the first solution?

I actually don't think it's a huge lift to jump to the golang-based solution. At the end of the day the wrapper script is not doing too much, and I suspect the upstream PR more or less already works. I also don't see going this route as a huge risk, considering right now nft does not work at all. I would support going straight to this option and, as I said, I don't think we have to wait for upstream to adopt the golang approach.

I agree with other comments around the other options not being very ideal. I would also vote against all of those.

As for an alternative to the short term solution. What about just documenting that we only support hosts using iptables legacy? And for eks-a, we just ship all images configured with legacy for now (this is assuming all the OS's we build still support both implementations).

That could certainly be an option. @vignesh-goutham would need to confirm that we could prebake our ubuntu 22.04 and rhel 8.7 images with legacy as the iptables version vs nft, which is the default. I can't imagine we couldn't, since that's basically what we do in the container image.

@jaxesn
Member Author

jaxesn commented Feb 6, 2023

I suspect that the fact that it worked for <1.24 was just pure luck in the way the two set of rules interacted with each other.

Yes exactly!
