
kube-router does not work with iptables 1.8.8 (nf_tables) on host #1370

Closed
jnummelin opened this issue Oct 10, 2022 · 6 comments

@jnummelin
Contributor

jnummelin commented Oct 10, 2022

What happened?
Running kubelet on a host with iptables 1.8.8 (nf_tables mode) does not work because the kube-router image uses iptables 1.8.7. kube-proxy's rule

-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP

ends up being replaced with

-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -j DROP

This causes networking on the host to stop working.

The problem is that iptables-save from iptables 1.8.7 cannot correctly dump rules created with iptables 1.8.8 (nf_tables).

If I run the following manually on the host (using iptables 1.8.8):

iptables-save | grep -E '(Generated by|mytest)'
# Generated by iptables-save v1.8.8 (nf_tables) on Thu Sep 15 14:34:46 2022
# Generated by iptables-save v1.8.8 (nf_tables) on Thu Sep 15 14:34:46 2022
-A KUBE-FIREWALL -m comment --comment mytest -m mark --mark 0x8000/0x8000 -j DROP
# Generated by iptables-save v1.8.8 (nf_tables) on Thu Sep 15 14:34:46 2022

It shows the -m mark --mark 0x8000/0x8000 match.

If I then use nsenter to run the same command in the kube-router pod, I get:

/ # nsenter -t $(pidof kube-router) -m iptables-save | grep -E '(Generated by|mytest)'
# Generated by iptables-save v1.8.7 on Thu Sep 15 14:36:38 2022
# Generated by iptables-save v1.8.7 on Thu Sep 15 14:36:38 2022
-A KUBE-FIREWALL -m comment --comment mytest -j DROP
# Generated by iptables-save v1.8.7 on Thu Sep 15 14:36:38 2022

As you can see, the -m mark --mark 0x8000/0x8000 match is lost, so all packets are dropped, not only the marked ones.

Since kube-router's NPC code works by running iptables-save and iptables-restore, the end result is that the rules are broken during the restore and all networking on the host goes bust.
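
A minimal sketch of the failure mode, assuming the commands are run inside the kube-router pod's mount namespace (the iptables 1.8.7 userspace) and using an illustrative dump path:

# the save already loses the --mark match, and the restore writes the broken rule back
iptables-save > /tmp/rules.dump
grep 'kubernetes firewall' /tmp/rules.dump    # no "-m mark --mark 0x8000/0x8000" here
iptables-restore < /tmp/rules.dump            # an unconditional DROP is now active in the kernel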

Possible workarounds:

  • use iptables in legacy mode on the host (see the sketch below)
  • downgrade iptables to 1.8.7 on the host (we selected this for the k0s distro)
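
For the legacy-mode workaround, a rough sketch assuming a Debian-style alternatives setup (paths and mechanism differ between distributions):

# switch the host's iptables userspace to the legacy (x_tables) backend
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
# note: existing nf_tables-backed rules stay in the kernel until flushed or the node reboots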

What did you expect to happen?
Networking continues to work regardless of the iptables version installed on the host.

How can we reproduce the behavior you experienced?
Steps to reproduce the behavior:

  1. Install iptables 1.8.8 (nftables mode) on the host
  2. Deploy kube-router
  3. Networking goes 💥


System Information (please complete the following information):

  • Kube-Router Version (kube-router --version): 1.5.1
  • Kube-Router Parameters: --run-router=true --run-firewall=true --run-service-proxy=false --bgp-graceful-restart=true
  • Kubernetes Version (kubectl version): 1.25, 1.24, 1.23; I believe the Kubernetes version is largely irrelevant here
  • Cloud Type: any
  • Kubernetes Deployment Type: k0s
  • Kube-Router Deployment Type: DaemonSet
  • Cluster Size: any


Additional context
At first we thought this was an issue with kube-proxy, so we also created an issue in the Kubernetes repo, which has some discussion on this: kubernetes/kubernetes#112477
The Kubernetes folks created an issue on the netfilter tracker: https://bugzilla.netfilter.org/show_bug.cgi?id=1632
The kube-router Slack channel also has an open discussion on this: https://kubernetes.slack.com/archives/C8DCQGTSB/p1663231184884829

jnummelin added the bug label Oct 10, 2022
@danwinship

danwinship commented Oct 11, 2022

So yeah, as pointed out by @BenTheElder in the kube-proxy bug, the problem is this code here, which basically does iptables-save | npc.cleanupStaleRules() | iptables-restore. Not only does this trigger the new bug, but that code has a race condition anyway: it's possible kube-proxy or someone else will modify the rules in between when you do iptables-save and iptables-restore, and that other component's modifications will then be lost by your iptables-restore. (This sounds like an unlikely race because npc.cleanupStaleRules() should be fast, but unfortunately, if kube-proxy tries to start its iptables-restore at any point while kube-router is in its iptables-save (which could take several seconds), then kube-proxy will end up getting the xtables lock for its iptables-restore before kube-router gets a chance to do its own iptables-restore.)

The "right" way to do this is to use iptables-restore --noflush and send it a restore input that only contains -X rules for the chains you want to delete. This will (a) avoid the new 1.8.7 vs 1.8.8 bug, (b) avoid the race condition, (c) be a lot faster (on hosts with lots of rules) because iptables-restore won't have to re-parse all those rules you aren't actually changing.

eg, if the iptables-save output looks like:

*filter
:INPUT - [0:0]
:OUTPUT - [0:0]
:KUBE-ROUTER-1 - [0:0]
:KUBE-ROUTER-2 - [0:0]
:KUBE-ROUTER-3 - [0:0]
:KUBE-ROUTER-4 - [0:0]
:KUBE-SVC-ABCD - [0:0]
:KUBE-SVC-EFGH - [0:0]
:I-DUNNO-SOME-ISTIO-CHAIN-OR-SOMETHING - [0:0]
-A KUBE-ROUTER-1 ...
-A KUBE-ROUTER-2 ...
...

and you want to delete KUBE-ROUTER-1 and KUBE-ROUTER-4, then you'd write out

*filter
:KUBE-ROUTER-1 - [0:0]
:KUBE-ROUTER-4 - [0:0]
-X KUBE-ROUTER-1
-X KUBE-ROUTER-4

and pass that to iptables-restore --noflush, and it will delete those chains and leave every other chain untouched.

(which means you didn't even need most of the iptables-save output, but unfortunately there's no way to ask iptables to just tell you what chains currently exist without having it also tell you all the actual rules.)
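
For completeness, a rough sketch of feeding that fragment to iptables-restore --noflush; note that a real restore input also needs a COMMIT line to close the table block:

iptables-restore --noflush <<'EOF'
*filter
:KUBE-ROUTER-1 - [0:0]
:KUBE-ROUTER-4 - [0:0]
-X KUBE-ROUTER-1
-X KUBE-ROUTER-4
COMMIT
EOF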

@aauren
Collaborator

aauren commented Oct 11, 2022

Hey @danwinship and @jnummelin!

Thanks for doing a lot of the leg work on this one.

In response to @danwinship's comment, it should be noted that the logic Dan points out really only accounts for a very small use case, essentially when people call kube-router --cleanup-config. The real place where most of the iptables-save and iptables-restore work happens is when kube-router performs full syncs: https://github.com/cloudnativelabs/kube-router/blob/master/pkg/controllers/netpol/network_policy_controller.go#L218-L286 That code has the same overall logic flow, but it's not just deleting chains and rules; it also updates chains and rules in more complex ways that would be difficult to refactor.

While it is possible that this may introduce a race condition, it seems unlikely. Since we introduced this in kube-router 1.3 we have never found that it has happened in practice. Even on high-volume nodes (more than 100 pods) that are part of large clusters, we typically see the entire sync loop run in less than 2 seconds, and almost all of that time is spent in the iptables-save and iptables-restore logic itself, where the iptables binary will be holding the iptables lock. This means the window in which another binary could conceivably update an iptables rule between the two operations is minuscule.

I would also argue that control loops that interact with iptables should operate on the basis of state, not events. Meaning that if they miss a single sync, or if somehow that sync is ineffectual, then the next time the loop runs it should attempt to correct this by applying the desired state again. This is what kube-router does, and it has been the only effective way we've found to work with iptables / ipset / IPVS consistently when you consider that these resources are system-wide and you can never guarantee that your app is the only one changing them.

Still, I can't deny that it might happen. The current approach is pretty blasé about other elements on the system that might try to interact with the filter table. While that hasn't been a problem that users have noticed as of yet, I suppose it could happen, and if it did it would probably lead to a bug that was pretty difficult to track down.

In response to this, I'll probably create a new issue that tracks upgrading kube-router to work with the --noflush flag on iptables-restore. Because we aren't just deleting chains and cleaning up, but rather inserting new logic and potentially re-arranging old logic, this would be a pretty significant rewrite (e.g. if an existing rule needed to be moved up or down in the chain, we would have to send a delete instruction, track how that changed the order of the existing rules, and then insert that same instruction in a new spot). Given the historical lack of external contributions to kube-router, this seems like something that wouldn't happen in the very near future.

However, I think that most of the above is just treating a symptom and not the root cause. The real issue here is that upstream isn't keeping iptables-save and iptables-restore compatible across versions. Having kube-router only touch the rules that it creates might have worked around this specific issue, but it also might not have. What if kube-router had to work with an existing rule made by kube-proxy in order to create a valid network policy? Kube-router could still end up in a situation where it cannot trust the data it gets from iptables-save when it formulates its restore. This still seems like an accident waiting to happen.

The best paths forward that I see here are:

  1. kube-router should attempt to stay at the maximum user-space tooling versions whenever possible, so that if anything, it uses a newer version than the host's user space does. This may have the side effect of breaking host-namespace tools that use iptables-save / iptables-restore, but I would argue that kube-router is likely the biggest user of these tools during normal runtime, so it would be better for us to be ahead in versions than behind.
  2. We should attempt to convince upstream that breaking backwards compatibility like this is not something that should be done in bugfix revisions. Theoretically backwards compatibility should only be broken for major revisions, but even minor revisions would be a step up. That would allow us to see changes like this coming more easily.
  3. In the iptables upstream thread, they mentioned that there is supposed to be some indication when you might be executing against a non-compatible ruleset. I would recommend that we figure out how this could have been seen from our user-space tooling and then force an error to be logged so that it doesn't leave the user's system unreachable like it did for @jnummelin (see the detection sketch after this list). From the bugzilla thread:
     Iptables extensions have revisions. These were introduced to allow for changes to their binary representation in transmit between kernel and user space. Since iptables prefers the highest kernel-supported revision, a rule added using iptables version X may not list correctly with iptables version X-1. It will print '[unsupported revision]', so this situation may at least be detected.
  4. We should document in kube-router that differing versions of the iptables / ipset / ipvs user-space tooling between the host and the kube-router container have the potential to cause issues, so that users are at least more aware that this can be a problem. Users who absolutely want to avoid this issue could also do so by running kube-router as a binary on the host and using the host's built-in tooling. I would note that that is how most of the sites I manage run kube-router, and while it makes some of the deployment orchestration more difficult, it has saved us from numerous bugs like this in the past.
  5. I'll create the issue that I mentioned earlier to track moving our usage of iptables-restore to iptables-restore --noflush, so that we have less potential for touching rules that were not created with the user-space tools inside the kube-router container.
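
A minimal sketch of the kind of detection mentioned in item 3, assuming the '[unsupported revision]' marker appears verbatim in the iptables-save output (the exact string and where it appears depend on the iptables version):

# run inside the kube-router container before trusting an iptables-save dump
if iptables-save | grep -q 'unsupported revision'; then
    echo "iptables-save output contains extension revisions this iptables version cannot parse;" >&2
    echo "host and container iptables versions likely differ - refusing to restore" >&2
    exit 1
fi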

@danwinship / @jnummelin let me know what you think!

@danwinship

Almost all of that time is spent in the iptables-save and iptables-restore logic itself, where the iptables binary will be holding the iptables lock. This means the window in which another binary could conceivably update an iptables rule between the two operations is minuscule.

What I was trying to point out is that kube-proxy doesn't have to start its update during the minuscule space between the operations; if it starts any time between when kube-router starts its iptables-save and when kube-router starts its iptables-restore, then it will get the lock before kube-router's iptables-restore does:


TIME      KUBE-ROUTER                      KUBE-PROXY

0.000     exec(iptables-save)
            acquires xtables lock
1.500                                      exec(iptables-restore)
                                             (blocks trying to get the lock)
2.000       releases lock
                                             acquires the lock
                                             writes new kube-proxy rules...
          npc.syncNetworkPolicyChains()
          npc.syncPodFirewallChains()
          npc.cleanupStaleRules()
          exec(iptables-restore)
            (blocks trying to get the lock)
4.001:                                       releases lock
            acquires the lock
            overwrites kube-proxy's previous changes...
            releases lock

Since we introduced this in kube-router 1.3 we have never found that it has happened in practice.

Fortunately, it seems like kube-router is mostly focused on the filter table, while kube-proxy is mostly focused on the nat table, so they'd only rarely come into conflict. (And kube-proxy writes out its full filter table state every time it updates the filter table, so even if the race condition did get hit, things would only be out of sync briefly.)

Because we aren't just deleting chains and cleaning up, but rather inserting new logic and potentially re-arranging old logic, this would be a pretty significant rewrite

Ah, no, it actually wouldn't be. I didn't talk about inserting new rules in my example above because I had only seen the cleanup code, not the sync code. But you can pass both -X rules and -A rules to iptables-restore together. (This is what kube-proxy does.) The important thing is just that your iptables-restore input should only contain your own chains, not anyone else's. Basically, when you use --noflush, there are three possibilities for a chain:

  • if you have no :CHAINNAME line for the chain, and no -X CHAINNAME or -A CHAINNAME rules, then the restore doesn't affect that chain at all
  • if you have :CHAINNAME and then later -X CHAINNAME (and no -A CHAINNAME), then the chain gets deleted
  • if you have :CHAINNAME (and no -X CHAINNAME), and 0 or more -A CHAINNAME ... lines, then the chain gets completely replaced by the rules in the restore input (just like what would have happened without --noflush)

So, write out all the old/modified/new rules for your own chains, and write out -X rules for the chains you want to delete, and just ignore everything else that was in the iptables-save output.
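
A rough sketch of a combined restore input along those lines (the chain names follow the earlier example; the rule contents are illustrative, not kube-router's actual rules):

iptables-restore --noflush <<'EOF'
*filter
:KUBE-ROUTER-1 - [0:0]
:KUBE-ROUTER-2 - [0:0]
:KUBE-ROUTER-4 - [0:0]
-A KUBE-ROUTER-1 -m comment --comment "rebuilt rules for chain 1" -j ACCEPT
-A KUBE-ROUTER-2 -m comment --comment "rebuilt rules for chain 2" -j KUBE-ROUTER-1
-X KUBE-ROUTER-4
COMMIT
EOF
# KUBE-ROUTER-1 and KUBE-ROUTER-2 are flushed and replaced with the -A lines above,
# KUBE-ROUTER-4 is deleted, and every other chain (KUBE-SVC-*, the Istio chain, ...) is left untouched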

@aauren
Collaborator

aauren commented Oct 11, 2022

What I was trying to point out is that kube-proxy doesn't have to start its update during the minuscule space between the operations; if it starts any time between when kube-router starts its iptables-save and when kube-router starts its iptables-restore, then it will get the lock before kube-router's iptables-restore does:

Fair enough. Although, to be more specific, most of the time is taken by iptables-restore, not by the save or the kube-router logic, so I think the chance is minimal, especially given what you said about kube-proxy being mostly concerned with tables other than filter and the fact that kube-proxy enforces state as a controller.

Still, this is not an argument against changing it. I agree that we shouldn't be touching rules that are not ours where we can help it. I was just trying to point out that this rare race condition is not what is forcing this issue right now and is unlikely to force it in the future.

The important thing is just that your iptables-restore input should only contain your own chains, not anyone else's

I'm not sure that this is wholly possible. It's true for 99% of the rules that kube-router works with, where they are our own rules in our own chains. However, we still have to operate on common chains like INPUT, OUTPUT, and FORWARD, where we may have to interact in some way with other people's rules.

But you can pass both -X rules and -A rules to iptables-restore together.

I'll admit that this is a better approach than what I was considering, which was trying to manipulate rules in place with individual -D, -I, and -A operations. So thanks for the pointer based upon what y'all have learned with kube-proxy.

I still think, though, that at the end of the day it is important that we are able to detect when there is an incompatibility between iptables rules written with one version of the tooling and ones written with another. For now, the breakage was limited to a single match module (--mark), but what if the host rules didn't show up at all in iptables-save? We wouldn't be able to intelligently figure out how to order our chains and rules at all in a way that didn't have potentially disastrous effects on the system.

@danwinship

However, we do still have to operate on common chains like INPUT, OUTPUT, and FORWARD where we have the possibility that we'll have to interact in some way with other people's rules.

Ah, yeah. In kube-proxy we do a handful of iptables -C calls to validate the base rules in the "shared" chains (e.g. "-A INPUT -j KUBE-SERVICES"), followed by iptables -A calls to recreate them if they got deleted somehow, followed by a big iptables-restore to sync up all of our "private" chains.
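
A minimal sketch of that check-then-append pattern, using the KUBE-SERVICES jump as the illustrative rule:

# -C exits non-zero if the rule is absent; -w waits for the xtables lock instead of failing
if ! iptables -w -C INPUT -j KUBE-SERVICES 2>/dev/null; then
    iptables -w -A INPUT -j KUBE-SERVICES
fi
# ...followed by one big iptables-restore --noflush to sync the "private" chains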

@aauren
Collaborator

aauren commented Oct 20, 2022

@jnummelin - Release v1.5.2 is out; it contains an update to the Alpine base image (3.16) and ships iptables 1.8.8. The release notes specifically call out this change and advise users about the potential for user-space conflicts. Given that the most precarious operation seems to be iptables-save / iptables-restore, which kube-router likely uses more than other apps on the system, keeping up with iptables updates and Alpine base images in the future should hopefully help reduce this issue.

I've also added new documentation to the requirements section of the user-guide to help advise users about keeping user-space tooling versions in sync.

Additionally, we opened #1372 to track changing the iptables-restore logic to use --noflush, which should help minimize the issue by only touching our own rules.

That being said, I think there is still potential for problems at some point in the future from conflicting user-space tooling between the kube-router container, other containers, and the host. Unfortunately, I don't believe there is anything further that kube-router can do (besides what we've already done above) to reduce the likelihood of this occurring. I'll keep monitoring the upstream issue to see if they can give us any more information in the future.

Thanks @danwinship / @jnummelin for tracking this down! For now, I'm going to close out this issue.
