
kube-router does not work with iptables 1.8.8 (nf_tables) on host #112477

Closed
ncopa opened this issue Sep 15, 2022 · 29 comments
Labels: kind/bug, needs-triage, sig/network, sig/release

Comments

@ncopa (Contributor) commented Sep 15, 2022

What happened?

Running kubelet on a host with iptables 1.8.8 (nf_tables mode) does not work, because the kube-proxy image uses iptables 1.8.7. kube-proxy ends up replacing the rule

-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP

with

-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -j DROP

This causes the network to stop working.

What did you expect to happen?

The network should continue to work regardless of the iptables version installed on the host.

How can we reproduce it (as minimally and precisely as possible)?

Try to join a worker with iptables 1.8.8 in nf_tables mode on the host.

Anything else we need to know?

The problem is that iptables-save from iptables 1.8.7 does not correctly read iptables rules created with iptables 1.8.8 (nf_tables).

If I manually run the following on the host (using iptables 1.8.8):

iptables-save | grep -E '(Generated by|mytest)'
# Generated by iptables-save v1.8.8 (nf_tables) on Thu Sep 15 14:34:46 2022
# Generated by iptables-save v1.8.8 (nf_tables) on Thu Sep 15 14:34:46 2022
-A KUBE-FIREWALL -m comment --comment mytest -m mark --mark 0x8000/0x8000 -j DROP
# Generated by iptables-save v1.8.8 (nf_tables) on Thu Sep 15 14:34:46 2022

It shows the -m mark --mark 0x8000/0x8000 match.

If I then use nsenter to enter the kube-proxy pod's mount namespace and do the same, I get:

/ # nsenter -t $(pidof kube-proxy) -m iptables-save | grep -E '(Generated by|mytest)'
# Generated by iptables-save v1.8.7 on Thu Sep 15 14:36:38 2022
# Generated by iptables-save v1.8.7 on Thu Sep 15 14:36:38 2022
-A KUBE-FIREWALL -m comment --comment mytest -j DROP
# Generated by iptables-save v1.8.7 on Thu Sep 15 14:36:38 2022

As you can see, the -m mark --mark 0x8000/0x8000 match is lost, so all packets are dropped, not only the marked ones.

Possible workarounds:

  • use iptables in legacy mode on host
  • downgrade iptables to 1.8.7 on the host

Possible fixes:

  • upgrade iptables in kube-proxy image to 1.8.8 (and make sure that it always is latest).
  • change the logic in kube-proxy so that it does not touch/re-inject previously created iptables rules (e.g. does not do iptables-save | ... | iptables-restore); a sketch of this idea follows the list
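As an illustration of the second option, here is a minimal, hypothetical Go sketch of per-rule check-then-append management (shelling out to iptables -C and iptables -A). It never reads back or rewrites rules owned by other components; the function and rule arguments are examples only, not kube-proxy code:

package main

import (
	"fmt"
	"os/exec"
)

// ensureRule appends a rule only if an identical rule is not already present.
// It never round-trips other components' rules through iptables-save /
// iptables-restore, so a version skew between iptables binaries cannot
// silently rewrite rules it does not own.
func ensureRule(table, chain string, ruleArgs ...string) error {
	checkArgs := append([]string{"-t", table, "-C", chain}, ruleArgs...)
	if exec.Command("iptables", checkArgs...).Run() == nil {
		return nil // rule already present
	}
	appendArgs := append([]string{"-t", table, "-A", chain}, ruleArgs...)
	if out, err := exec.Command("iptables", appendArgs...).CombinedOutput(); err != nil {
		return fmt.Errorf("appending rule: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Example: the KUBE-FIREWALL drop rule discussed above.
	if err := ensureRule("filter", "KUBE-FIREWALL",
		"-m", "comment", "--comment", "kubernetes firewall for dropping marked packets",
		"-m", "mark", "--mark", "0x8000/0x8000",
		"-j", "DROP"); err != nil {
		fmt.Println(err)
	}
}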

Kubernetes version

$ kubectl version

W0915 14:46:03.488791    2464 loader.go:223] Config not found: /var/lib/k0s/pki/admin.conf

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Unable to connect to the server: dial tcp 127.0.0.1:8080: i/o timeout
I think the `Unable to connect` message is due to the firewall being broken...

Cloud provider

n/a

OS version

# On Linux:
$ cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.16.2
PRETTY_NAME="Alpine Linux v3.16"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

$ uname -a
Linux worker0 5.15.67-0-lts #1-Alpine SMP Fri, 09 Sep 2022 06:15:47 +0000 x86_64 GNU/Linux

Install tools

k0s

Container runtime (CRI) and version (if applicable)

n/a

Related plugins (CNI, CSI, ...) and versions (if applicable)

n/a
@ncopa added the kind/bug label on Sep 15, 2022
@k8s-ci-robot added the needs-sig label on Sep 15, 2022
@k8s-ci-robot (Contributor)

@ncopa: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label on Sep 15, 2022
@ncopa (Contributor, Author) commented Sep 15, 2022

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label on Sep 15, 2022
@ncopa (Contributor, Author) commented Sep 15, 2022

The rule that gets mangled by older iptables is created here:

if _, err := iptClient.EnsureRule(utiliptables.Append, utiliptables.TableFilter, KubeFirewallChain,

@bridgetkromhout (Member)

/assign @danwinship

@uablrek (Contributor) commented Sep 15, 2022

change the logic in kube-proxy so that it does not touch/re-inject previously created iptables rules (e.g. does not do iptables-save | ... | iptables-restore)

Iptables rules must not be added one by one. The reason is that each update is: read everything to user space, update one rule, write everything back to the kernel. This was one reason nft was invented, so perhaps this is not true for nf_tables mode, but please check.

@danwinship (Contributor)

kube-proxy re-creates this rule on purpose so it doesn't matter whether it's doing it one-by-one or via iptables-restore.

@danwinship (Contributor)

This seems to be a bug in iptables and I don't think we can plausibly work around it. (Changing the version of iptables in the kube-proxy image would just introduce the bug in the opposite scenario, where kube-proxy has the newer version and kubelet has the older version.) The answer for now seems to be "don't use iptables 1.8.8, it's broken".

@danwinship (Contributor)

Filed https://bugzilla.netfilter.org/show_bug.cgi?id=1632

@aojea (Member) commented Sep 15, 2022

Filed https://bugzilla.netfilter.org/show_bug.cgi?id=1632

Pasting the reply here for visibility

I fear this is expected behaviour - at least I have not seen any attempts at keeping the in kernel ruleset compatible to older versions of iptables-nft.

The root-cause here is that newer iptables-nft uses a native nftables expression to match on packet mark, while older ones use the xtables extension in kernel.

Newer iptables-nft will correctly interpret rulesets created by older versions, but a mere 'iptables-save | iptables-restore' might shut the door behind you.

another `if` needed here, Dan 😄: https://github.com/kubernetes-sigs/iptables-wrappers/blob/master/iptables-wrapper-installer.sh
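For context: iptables-wrappers installs a wrapper that picks the iptables backend (legacy vs nft) at runtime. A rough, hypothetical Go sketch of the general idea, under the assumption that the decision can be made by checking which backend already contains a kubelet-created chain such as KUBE-IPTABLES-HINT (visible in the nft dumps later in this thread); this only illustrates the concept, not the wrapper script's actual logic:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// detectBackend guesses which iptables backend the host is using by checking
// which one already contains a kubelet-created chain (KUBE-IPTABLES-HINT in
// the mangle table). Illustration only; the real wrapper uses its own
// heuristics.
func detectBackend() string {
	for _, mode := range []string{"nft", "legacy"} {
		out, err := exec.Command("iptables-"+mode+"-save", "-t", "mangle").Output()
		if err == nil && strings.Contains(string(out), "KUBE-IPTABLES-HINT") {
			return mode
		}
	}
	return "legacy" // fallback when no hint chain is found
}

func main() {
	fmt.Println("using iptables backend:", detectBackend())
}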

@ncopa (Contributor, Author) commented Sep 16, 2022

This seems to be a bug in iptables and I don't think we can plausibly work around it.

I think the behavior is expected.

(Changing the version of iptables in the kube-proxy image would just introduce the bug in the opposite scenario, where kube-proxy has the newer version and kubelet has the older version.)

It is only a problem if kubelet reads/parses the iptables rules, which I don't think it does. I could not find any iptables-save in kubelet. There is an iptables -C, though, but I think in the worst case it will only end up adding a duplicate rule. I can try experimenting with it a bit.

The answer for now seems to be "don't use iptables 1.8.8, it's broken".

It means, "don't use any iptables newer than whatever is shipped with kube-proxy".

@aojea (Member) commented Sep 16, 2022

It is only a problem if kubelet reads/parses the iptables rules, which I don't think it does

yeah, based on your comments it seems the iptables-save | iptables-restore combo is the problematic one. kubelet doesn't use that, but that doesn't mean users aren't doing it for other reasons, and we'll break those users :/

It means, "don't use any iptables newer than whatever is shipped with kube-proxy".

I think it is more "if you use a containerized kube-proxy, don't use a newer host iptables that is not compatible with the iptables version inside the kube-proxy container" and "kubernetes 1.25 ships a kube-proxy image that uses iptables 1.8.7, which is incompatible with iptables 1.8.8 and will break your cluster if your versions are not in sync" 🙃

So, it is important to mention that the kube-proxy image generated as part of the release is based on debian-bullseye and we use the iptables version shipped there

https://github.com/kubernetes/release/blob/a0c26c8a657c321a82e44d1ec101cdae4d692578/images/build/distroless-iptables/Makefile#L23

I don't know what compatibility commitment we have for the published component images, but I don't think it is going to be feasible to support all the combinations existing in the wild (the kube-proxy binary must be 100% compatible, though)

/sig release

@k8s-ci-robot added the sig/release label on Sep 16, 2022
@danwinship (Contributor)

It is only a problem if kubelet reads/parses the iptables rules, which I don't think it does.

Yeah, I'm not sure exactly how this is failing for the OP... It seems to me that you should end up with two copies of the KUBE-FIREWALL rule, which both look correct according to 1.8.8, but only one of which looks correct to 1.8.7. But both of them should be correct at the nftables level...

@danwinship (Contributor)

The answer for now seems to be "don't use iptables 1.8.8, it's broken".

It means, "don't use any iptables newer than whatever is shipped with kube-proxy".

I'm arguing that the behavior in 1.8.8 is a bug, and that it should presumably be fixed in 1.8.9, after which you will again be able to use any combination of iptables versions 1.8.3-1.8.7 and 1.8.9+. (The iptables maintainers are currently arguing against this, but I'm going to continue arguing against them.)

If the behavior in 1.8.8 is not declared a bug, then we probably need to accelerate (and possibly backport) KEP 3178 because the only plausible answer at that point is that we need to make sure there are never any cases where kubelet and kube-proxy look at / modify the same rules.

@danwinship (Contributor)

Yeah, I'm not sure exactly how this is failing for the OP

@ncopa can you get a copy of "iptables-save-1.8.7 | grep KUBE-FIREWALL", "iptables-save-1.8.8 | grep KUBE-FIREWALL" and "nft list ruleset | awk '/KUBE-FIREWALL/,/}/'" from when the bug is occurring? (I think that nft command line should pull out just the KUBE-FIREWALL-relevant parts of the "nft list ruleset" output, but if that fails then just do a full "nft list ruleset" and copy out the relevant bits yourself...)

@Bamul19 commented Sep 20, 2022

/assign @danwinship

@jnummelin (Contributor) commented Sep 20, 2022

@danwinship I've been working with @ncopa debugging this

Here's what I can see with 1.8.8:

bash-5.1# /var/lib/k0s/bin/iptables-save -V
iptables-save v1.8.8 (nf_tables)
bash-5.1# /var/lib/k0s/bin/iptables-save | grep KUBE-FIREWALL
:KUBE-FIREWALL - [0:0]
-A INPUT -j KUBE-FIREWALL
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP

Dropping 1.8.7 into the host gets me this:

bash-5.1# ./xtables-nft-multi iptables-save -V
iptables-save v1.8.7 (nf_tables)
bash-5.1# ./xtables-nft-multi iptables-save | grep KUBE-FIREWALL
:KUBE-FIREWALL - [0:0]
-A INPUT -j KUBE-FIREWALL
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -j DROP

Looking directly with nft:

bash-5.1# nft -V
nftables v1.0.2 (Lester Gooch)
  cli:          readline
  json:         yes
  minigmp:      no
  libxtables:   no
bash-5.1# nft list ruleset | awk '/KUBE-FIREWALL/,/}/'
        chain KUBE-FIREWALL {
                ip saddr != 127.0.0.0/8 ip daddr 127.0.0.0/8 # xt_comment # xt_conntrack counter packets 0 bytes 0 drop
                # xt_comment meta mark & 0x00008000 == 0x00008000 counter packets 0 bytes 0 drop
        }
                counter packets 274303 bytes 71631859 jump KUBE-FIREWALL
        }
                counter packets 270657 bytes 69900591 jump KUBE-FIREWALL
        }
        chain KUBE-FIREWALL {
                # xt_comment meta mark & 0x00008000 == 0x00008000 counter packets 0 bytes 0 drop
        }

Here's the full nft list ruleset output:

bash-5.1# nft list ruleset
table ip mangle {
        chain KUBE-IPTABLES-HINT {
        }

        chain KUBE-KUBELET-CANARY {
        }
}
table ip filter {
        chain KUBE-FIREWALL {
                ip saddr != 127.0.0.0/8 ip daddr 127.0.0.0/8 # xt_comment # xt_conntrack counter packets 0 bytes 0 drop
                # xt_comment meta mark & 0x00008000 == 0x00008000 counter packets 0 bytes 0 drop
        }

        chain OUTPUT {
                type filter hook output priority filter; policy accept;
                counter packets 288636 bytes 74204816 jump KUBE-FIREWALL
        }

        chain INPUT {
                type filter hook input priority filter; policy accept;
                counter packets 284972 bytes 73116499 jump KUBE-FIREWALL
        }

        chain KUBE-KUBELET-CANARY {
        }
}
table ip nat {
        chain KUBE-MARK-DROP {
                counter packets 0 bytes 0 # xt_MARK
        }

        chain KUBE-MARK-MASQ {
                counter packets 0 bytes 0 # xt_MARK
        }

        chain KUBE-POSTROUTING {
                meta mark & 0x00004000 != 0x00004000 counter packets 544 bytes 38556 return
                counter packets 0 bytes 0 # xt_MARK
                # xt_comment counter packets 0 bytes 0 # xt_MASQUERADE
        }

        chain POSTROUTING {
                type nat hook postrouting priority srcnat; policy accept;
                # xt_comment counter packets 544 bytes 38556 jump KUBE-POSTROUTING
        }

        chain KUBE-KUBELET-CANARY {
        }
}
table ip6 mangle {
        chain KUBE-IPTABLES-HINT {
        }

        chain KUBE-KUBELET-CANARY {
        }
}
table ip6 nat {
        chain KUBE-MARK-DROP {
                counter packets 0 bytes 0 # xt_MARK
        }

        chain KUBE-MARK-MASQ {
                counter packets 0 bytes 0 # xt_MARK
        }

        chain KUBE-POSTROUTING {
                meta mark & 0x00004000 != 0x00004000 counter packets 0 bytes 0 return
                counter packets 0 bytes 0 # xt_MARK
                # xt_comment counter packets 0 bytes 0 # xt_MASQUERADE
        }

        chain POSTROUTING {
                type nat hook postrouting priority srcnat; policy accept;
                # xt_comment counter packets 0 bytes 0 jump KUBE-POSTROUTING
        }

        chain KUBE-KUBELET-CANARY {
        }
}
table ip6 filter {
        chain KUBE-FIREWALL {
                # xt_comment meta mark & 0x00008000 == 0x00008000 counter packets 0 bytes 0 drop
        }

        chain KUBE-KUBELET-CANARY {
        }
}

@danwinship (Contributor) commented Sep 20, 2022

So belatedly, it occurs to me that neither the iptables kube-proxy nor the ipvs kube-proxy ever refers to the filter table KUBE-FIREWALL chain... so the fact that it would not be able to read it correctly with the older kube-proxy seems even less relevant.

If packets are actually getting dropped, that implies that the kernel representation of the rule is wrong. But from the nft list ruleset output it seems correct.

Also relevant note: this is with the ipvs backend, not the iptables backend.

Did you upgrade to 1.25 at the same time as you upgraded from iptables 1.8.7 to iptables 1.8.8? If not, what order did things happen in? (ie, do we know that either "kube 1.24 with ipvs proxy and iptables 1.8.8" or "kube 1.25 with ipvs proxy and iptables 1.8.7" definitely works?)

@jnummelin (Contributor)

I'm not versed enough on nft rules to even start guessing... :)

This is with a fresh install of 1.25 where the host has 1.8.8 in nftables mode. It works perfectly in legacy mode; we've used that for a good while. Things started to break once we switched our embedded (k0s distro) iptables to nftables mode. We caught this in our smoke tests on Alpine and have been testing on various OSes since.

@danwinship (Contributor)

Can anyone confirm that

  • kube 1.25
  • ipvs proxy
  • iptables 1.8.7 in nft mode

works correctly?

@uablrek (Contributor) commented Sep 21, 2022

I don't run the kube-proxy container, but start kube-proxy directly on the node with a script. I guess that's not what you want since the problem seems to be host-container version mismatch, but I can test any combination if my setup is ok.

@uablrek (Contributor) commented Sep 21, 2022

It works in my env. But I also have kernel linux-5.19.9 and build iptables myself, so sorry, I don't think I can run relevant tests.

@danwinship (Contributor)

I guess that's not what you want since the problem seems to be host-container version mismatch

I'm not at all convinced that that's what the problem is.

My worry was that this had nothing at all to do with iptables and the bug was just "ipvs mode in 1.25 is totally broken (for at least some users)".

If it's not that, then my next two theories are "there's a second change in iptables 1.8.8 that is actually causing the problem, and the thing the OP noticed is completely irrelevant" and "iptables 1.8.8 is incompatible with certain kernel versions".

The fact that 1.8.7 can't correctly read a certain rule created by 1.8.8 should not be causing any problems, because kube-proxy never looks at that rule.

@thockin (Member) commented Sep 24, 2022

🍿

@ncopa (Contributor, Author) commented Oct 7, 2022

The fact that 1.8.7 can't correctly read a certain rule created by 1.8.8 should not be causing any problems, because kube-proxy never looks at that rule.

I may be completely wrong, but it very much looks like pkg/proxy/iptables executes iptables-save here:

if err := ipt.SaveInto(utiliptables.TableNAT, iptablesData); err != nil {

and then sends it back to the kernel with iptables-restore here:

err = ipt.Restore(utiliptables.TableNAT, natLines, utiliptables.NoFlushTables, utiliptables.RestoreCounters)

It also looks like kube-proxy imports pkg/proxy/iptables here:

"k8s.io/kubernetes/pkg/proxy/iptables"

Even if it does not look at that rule, it certainly looks like it sends the output of iptables-save to iptables-restore. But I may be wrong; it could be something else, like kube-router, that does it.

@danwinship (Contributor)

That's the mode: iptables proxy source, and you're using mode: ipvs. The ipvs proxier doesn't ever call iptables-save in 1.25.

But ignoring that, even in the iptables proxier, no rules get copied from the iptables-save output to the iptables-restore input; the only part of the iptables-save output that it looks at is the chain declarations, which it uses to figure out if there are stale service/endpoint chains lying around that need to be deleted now.
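To make that concrete, here is a hedged Go sketch of the pattern described: only the ":CHAIN" declaration lines are read from the iptables-save output, and no rule lines are ever fed back to iptables-restore. The function name and sample input are illustrative, not kube-proxy's actual code:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

// chainsFromSave extracts only the chain declarations (lines such as
// ":KUBE-SVC-XYZ - [0:0]") from iptables-save output. Rule lines ("-A ...")
// are ignored, so foreign rules are never re-serialized or re-injected.
func chainsFromSave(save string) []string {
	var chains []string
	scanner := bufio.NewScanner(strings.NewReader(save))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, ":") {
			chains = append(chains, strings.Fields(line[1:])[0])
		}
	}
	return chains
}

func main() {
	save := "*nat\n:KUBE-SERVICES - [0:0]\n:KUBE-SVC-STALE - [0:0]\n-A KUBE-SERVICES -j KUBE-SVC-STALE\nCOMMIT\n"
	fmt.Println(chainsFromSave(save)) // [KUBE-SERVICES KUBE-SVC-STALE]
}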

But I may be wrong; it could be something else, like kube-router, that does it.

ah, I know nothing about kube-router, but yes, it might be some component other than kube-proxy that is breaking things...

@BenTheElder (Member)

@ncopa what k0s version, kube-router image, etc?

AFAICT kube-router is used as the CNI in k0s (when enabled), deployed from an image that probably has some other iptables version.

https://github.com/k0sproject/k0s/blob/5d5fa1161efd675a8eacc7aa02fe22a8e2ed56d8/docs/networking.md#in-cluster-networking

Searching kube-router while thinking about:

But ignoring that, even in the iptables proxier, no rules get copied from the iptables-save output to the iptables-restore input; the only part of the iptables-save output that it looks at is the chain declarations, which it uses to figure out if there are stale service/endpoint chains lying around that need to be deleted now.

At a glance, it looks like kube-router does iptables-save | filter | iptables-restore ...

https://github.com/cloudnativelabs/kube-router/blob/b5028025b2dbed4f76fb9723c682fe3f64d66a25/pkg/controllers/netpol/network_policy_controller.go#L591-L612
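For contrast, a minimal, hedged Go sketch of that general save-filter-restore round trip (not kube-router's actual code): the entire ruleset, including rules created by kubelet and other components, gets re-serialized by whatever iptables binary this process bundles and then re-injected, which is exactly where the 1.8.7-reading-1.8.8 mangling shown above is written back into the kernel:

package main

import (
	"bytes"
	"log"
	"os/exec"
	"strings"
)

// saveFilterRestore dumps the whole ruleset, lets the caller rewrite each
// line, and feeds everything back through iptables-restore. Rules owned by
// other components ride along, re-encoded by whatever iptables version this
// binary bundles -- which is where version skew bites.
func saveFilterRestore(filter func(line string) string) error {
	out, err := exec.Command("iptables-save").Output()
	if err != nil {
		return err
	}
	lines := strings.Split(string(out), "\n")
	for i, line := range lines {
		lines[i] = filter(line)
	}
	restore := exec.Command("iptables-restore")
	restore.Stdin = bytes.NewBufferString(strings.Join(lines, "\n"))
	return restore.Run()
}

func main() {
	// Identity filter: even with no changes, foreign rules get re-serialized.
	if err := saveFilterRestore(func(l string) string { return l }); err != nil {
		log.Fatal(err)
	}
}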

@danwinship (Contributor)

OK, I commented more on the kube-router bug (cloudnativelabs/kube-router#1370) suggesting how to fix the problem. I don't think there's anything more we need to be tracking here...

FWIW, as a workaround until they fix it, you could try running kubelet with --feature-gates=IPTablesOwnershipCleanup=true, which will cause it to not create the problematic "-j DROP" rule. Of course, this is an alpha feature and you may run into problems with components that assume kubelet still creates those rules, but if you do then you can report them and help that KEP move forward 🙂

/close

@k8s-ci-robot (Contributor)

@danwinship: Closing this issue.

In response to this:

OK, I commented more on the kube-router bug (cloudnativelabs/kube-router#1370) suggesting how to fix the problem. I don't think there's anything more we need to be tracking here...

FWIW, as a workaround until they fix it, you could try running kubelet with --feature-gates=IPTablesOwnershipCleanup=true, which will cause it to not create the problematic "-j DROP" rule. Of course, this is an alpha feature and you may run into problems with components that assume kubelet still creates those rules, but if you do then you can report them and help that KEP move forward 🙂

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jnummelin (Contributor)

I don't think there's anything more we need to be tracking here...

@danwinship Maybe it would be worth adding some general docs (not sure if anyone will actually read those, though 😄) about possible iptables version inconsistencies. Yes, the issue is not triggered by kube-proxy, but if iptables provides no "backwards compatibility", I fear this won't be the last time someone hits this sort of issue with a networking component.

Thanks anyway for looking into this, even though it turned out to be outside of kube-proxy.
