From 1be36b71b07c34fb569c75756edd55988b67f4e0 Mon Sep 17 00:00:00 2001 From: Dan Winship Date: Thu, 1 Dec 2022 12:46:13 -0500 Subject: [PATCH] Initial proposal of kube-proxy nftables mode --- keps/prod-readiness/sig-network/3866.yaml | 6 + .../sig-network/3866-nftables-proxy/README.md | 1941 +++++++++++++++++ keps/sig-network/3866-nftables-proxy/kep.yaml | 39 + 3 files changed, 1986 insertions(+) create mode 100644 keps/prod-readiness/sig-network/3866.yaml create mode 100644 keps/sig-network/3866-nftables-proxy/README.md create mode 100644 keps/sig-network/3866-nftables-proxy/kep.yaml diff --git a/keps/prod-readiness/sig-network/3866.yaml b/keps/prod-readiness/sig-network/3866.yaml new file mode 100644 index 000000000000..fc9f12e430e9 --- /dev/null +++ b/keps/prod-readiness/sig-network/3866.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 3866 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-network/3866-nftables-proxy/README.md b/keps/sig-network/3866-nftables-proxy/README.md new file mode 100644 index 000000000000..adaa89754ca6 --- /dev/null +++ b/keps/sig-network/3866-nftables-proxy/README.md @@ -0,0 +1,1941 @@ +# KEP-3866: Add an nftables-based kube-proxy backend + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [The iptables kernel subsystem has unfixable performance problems](#the-iptables-kernel-subsystem-has-unfixable-performance-problems) + - [Upstream development has moved on from iptables to nftables](#upstream-development-has-moved-on-from-iptables-to-nftables) + - [The ipvs mode of kube-proxy will not save us](#the--mode-of-kube-proxy-will-not-save-us) + - [The nf_tables mode of /sbin/iptables will not save us](#the--mode-of--will-not-save-us) + - [The iptables mode of kube-proxy has grown crufty](#the--mode-of-kube-proxy-has-grown-crufty) + - [We will hopefully be able to trade 2 supported backends for 1](#we-will-hopefully-be-able-to-trade-2-supported-backends-for-1) + - [Writing a new kube-proxy mode will help to focus our cleanup/refactoring efforts](#writing-a-new-kube-proxy-mode-will-help-to-focus-our-cleanuprefactoring-efforts) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) + - [Functionality](#functionality) + - [Compatibility](#compatibility) + - [Security](#security) +- [Design Details](#design-details) + - [High level design](#high-level-design) + - [Low level design](#low-level-design) + - [Tables](#tables) + - [Communicating with the kernel nftables subsystem](#communicating-with-the-kernel-nftables-subsystem) + - [Notes on the sample rules in this KEP](#notes-on-the-sample-rules-in-this-kep) + - [Versioning and compatibility](#versioning-and-compatibility) + - [NAT rules](#nat-rules) + - [General Service dispatch](#general-service-dispatch) + - [Masquerading](#masquerading) + - [Session affinity](#session-affinity) + - [Filter rules](#filter-rules) + - [Dropping or rejecting packets for services with no endpoints](#dropping-or-rejecting-packets-for-services-with-no-endpoints) + - [Dropping traffic rejected by LoadBalancerSourceRanges](#dropping-traffic-rejected-by-) + - [Forcing traffic on HealthCheckNodePorts to be accepted](#forcing-traffic-on-s-to-be-accepted) + - [Future improvements](#future-improvements) + - [Changes from the iptables 
kube-proxy backend](#changes-from-the-iptables-kube-proxy-backend) + - [Localhost NodePorts](#localhost-nodeports) + - [NodePort Addresses](#nodeport-addresses) + - [Behavior of service IPs](#behavior-of-service-ips) + - [Defining an API for integration with admin/debug/third-party rules](#defining-an-api-for-integration-with-admindebugthird-party-rules) + - [Rule monitoring](#rule-monitoring) + - [Multiple instances of kube-proxy](#multiple-instances-of-) + - [Switching between kube-proxy modes](#switching-between-kube-proxy-modes) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Scalability & Performance tests](#scalability--performance-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Continue to improve the iptables mode](#continue-to-improve-the--mode) + - [Fix up the ipvs mode](#fix-up-the--mode) + - [Use an existing nftables-based kube-proxy implementation](#use-an-existing-nftables-based-kube-proxy-implementation) + - [Create an eBPF-based proxy implementation](#create-an-ebpf-based-proxy-implementation) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +The default kube-proxy implementation on Linux is currently based on +iptables. IPTables was the preferred packet filtering and processing +system in the Linux kernel for many years (starting with the 2.4 +kernel in 2001). However, problems with iptables led to the +development of a successor, nftables, first made available in the 3.13 +kernel in 2014, and growing increasingly featureful and usable as a +replacement for iptables since then. Development on iptables has +mostly stopped, with new features and performance improvements +primarily going into nftables instead. + +This KEP proposes the creation of a new official/supported nftables +backend for kube-proxy. While it is hoped that this backend will +eventually replace both the `iptables` and `ipvs` backends and become +the default kube-proxy mode on Linux, that replacement/deprecation +would be handled in a separate future KEP. + +## Motivation + +There are currently two officially supported kube-proxy backends for +Linux: `iptables` and `ipvs`. (The original `userspace` backend was +deprecated several releases ago and removed from the tree in 1.26.) + +The `iptables` mode of kube-proxy is currently the default, and it is +generally considered "good enough" for most use cases. Nonetheless, +there are good arguments for replacing it with a new `nftables` mode. + +### The iptables kernel subsystem has unfixable performance problems + +Although much work has been done to improve the performance of the +kube-proxy `iptables` backend, there are fundamental +performance-related problems with the implementation of iptables in +the kernel, both on the "control plane" side and on the "data plane" +side: + + - The control plane is problematic because the iptables API does not + support making incremental changes to the ruleset. 
If you want to + add a single iptables rule, the iptables binary must acquire a lock, + download the entire ruleset from the kernel, find the appropriate + place in the ruleset to add the new rule, add it, re-upload the + entire ruleset to the kernel, and release the lock. This becomes + slower and slower as the ruleset increases in size (ie, as the + number of Kubernetes Services grows). If you want to replace a large + number of rules (as kube-proxy does frequently), then simply the + time that it takes `/sbin/iptables-restore` to parse all of the + rules becomes substantial. + + - The data plane is problematic because (for the most part), the + number of iptables rules used to implement a set of Kubernetes + Services is directly proportional to the number of Services. And + every packet going through the system then needs to pass through + all of these rules, slowing down the traffic. + +IPTables is the bottleneck in kube-proxy performance, and it always +will be until we stop using it. + +### Upstream development has moved on from iptables to nftables + +In large part due to its unfixable problems, development on iptables +in the kernel has slowed down and mostly stopped. New features are not +being added to iptables, because nftables is supposed to do everything +iptables does, but better. + +Although there is no plan to remove iptables from the upstream kernel, +that does not guarantee that iptables will remain supported by +_distributions_ forever. In particular, Red Hat has declared that +[iptables is deprecated in RHEL 9] and is likely to be removed +entirely in RHEL 10, a few years from now. Other distributions have +made smaller steps in the same direction; for instance, [Debian +removed `iptables` from the set of "required" packages] in Debian 11 +(Bullseye). + +The RHEL deprecation in particular impacts Kubernetes in two ways: + + 1. Many Kubernetes users run RHEL or one of its downstreams, so in a + few years when RHEL 10 is released, they will be unable to use + kube-proxy in `iptables` mode (or, for that matter, in `ipvs` or + `userspace` mode, since those modes also make heavy use of the + iptables API). + + 2. Several upstream iptables bugs and performance problems that + affect Kubernetes have been fixed by Red Hat developers over the + past several years. With Red Hat no longer making any effort to + maintain iptables, it is less likely that upstream iptables bugs + that affect Kubernetes in the future would be fixed promptly, if + at all. + +[iptables is deprecated in RHEL 9]: https://access.redhat.com/solutions/6739041 +[Debian removed `iptables` from the set of "required" packages]: https://salsa.debian.org/pkg-netfilter-team/pkg-iptables/-/commit/c59797aab9 + +### The `ipvs` mode of kube-proxy will not save us + +Because of the problems with iptables, some developers added an `ipvs` +mode to kube-proxy in 2017. It was generally hoped that this could +eventually solve all of the problems with the `iptables` mode and +become its replacement, but this never really happened. It's not +entirely clear why... [kubeadm #817], "Track when we can enable the +ipvs mode for the kube-proxy by default" is perhaps a good snapshot of +the initial excitement followed by growing disillusionment with the +`ipvs` mode: + + - "a few issues ... re: the version of iptables/ipset shipped in the + kube-proxy container image" + - "clearly not ready for defaulting" + - "complications ... 
with IPVS kernel modules missing or disabled on + user nodes" + - "we are still lacking tests" + - "still does not completely align with what [we] support in + iptables mode" + - "iptables works and people are familiar with it" + - "[not sure that it was ever intended for IPVS to be the default]" + +Additionally, the kernel IPVS APIs alone do not provide enough +functionality to fully implement Kubernetes services, and so the +`ipvs` backend also makes heavy use of the iptables API. Thus, if we +are worried about iptables deprecation, then in order to switch to +using `ipvs` as the default mode, we would have to port the iptables +parts of it to use nftables anyway. But at that point, there would be +little excuse for using IPVS for the core load-balancing part, +particularly given that IPVS, like iptables, is no longer an +actively-developed technology. + +[kubeadm #817]: https://github.com/kubernetes/kubeadm/issues/817 +[not sure that it was ever intended for IPVS to be the default]: https://en.wikipedia.org/wiki/The_Fox_and_the_Grapes + +### The `nf_tables` mode of `/sbin/iptables` will not save us + +In 2018, with the 1.8.0 release of the iptables client binaries, a new +mode was added to the binaries, to allow them to use the nftables API +in the kernel rather than the legacy iptables API, while still +preserving the "API" of the original iptables binaries. As of 2022, +most Linux distributions now use this mode, so the legacy iptables +kernel API is mostly dead. + +However, this new mode does not add any new _syntax_, and so it is not +possible to use any of the new nftables features (like maps) that are +not present in iptables. + +Furthermore, the compatibility constraints imposed by the user-facing +API of the iptables binaries themselves prevent them from being able +to take advantage of many of the performance improvements associated +with nftables. + +(Additionally, the RHEL deprecation of iptables includes +`iptables-nft` as well.) + +### The `iptables` mode of kube-proxy has grown crufty + +Because `iptables` is the default kube-proxy mode, it is subject to +strong backward-compatibility constraints which mean that certain +"features" that are now considered to be bad ideas cannot be removed +because they might break some existing users. A few examples: + + - It allows NodePort services to be accessed on `localhost`, which + requires it to set a sysctl to a value that may introduce security + holes on the system. More generally, it defaults to having + NodePort services be accessible on _all_ node IPs, when most users + would probably prefer them to be more restricted. + + - It implements the `LoadBalancerSourceRanges` feature for traffic + addressed directly to LoadBalancer IPs, but not for traffic + redirected to a NodePort by an external LoadBalancer. + + - Some new functionality only works correctly if the administrator + passes certain command-line options to kube-proxy (eg, + `--cluster-cidr`), but we cannot make those options be mandatory, + since that would break old clusters that aren't passing them. + +A new kube-proxy mode, which existing users would have to explicitly opt +into, could revisit these and other decisions. (Though if we expect it +to eventually become the default, then we might decide to avoid such +changes anyway.) 
+ +### We will hopefully be able to trade 2 supported backends for 1 + +Right now SIG Network is supporting both the `iptables` and `ipvs` +backends of kube-proxy, and does not feel like it can ditch `ipvs` +because of perceived performance issues with `iptables`. If we create a new +backend which is as functional and non-buggy as `iptables` but as +performant as `ipvs`, then we could (eventually) deprecate both of the +existing backends and only have one Linux backend to support in the future. + +### Writing a new kube-proxy mode will help to focus our cleanup/refactoring efforts + +There is a desire to provide a "kube-proxy library" that third parties +could use as a base for external service proxy implementations +([KEP-3786]). The existing "core kube-proxy" code, while functional, +is not very well designed and is not something we would want to +support other people using in its current form. + +Writing a new proxy backend will force us to look over all of this +shared code again, and perhaps give us new ideas on how it can be +cleaned up, rationalized, and optimized. + +[KEP-3786]: https://github.com/kubernetes/enhancements/issues/3786 + +### Goals + +- Design and implement an `nftables` mode for kube-proxy. + + - Consider various fixes to legacy `iptables` mode behavior. + + - Do not enable the `route_localnet` sysctl. + + - Add a more restrictive startup mode to kube-proxy, which + will error out if the configuration is invalid (e.g., + "`--detect-local-mode ClusterCIDR`" without specifying + "`--cluster-cidr`") or incomplete (e.g., + partially-dual-stack but not fully-dual-stack). + + - (Possibly other changes discussed in this KEP.) + + - Ensure that any such changes are clearly documented for + users. + + - To the extent possible, provide metrics to allow `iptables` + users to easily determine if they are using features that + would behave differently in `nftables` mode. + + - Document specific details of the nftables implementation that we + want to consider as "API". In particular, document the + high-level behavior that authors of network plugins can rely + on. We may also document ways that third parties or + administrators can integrate with kube-proxy's rules at a lower + level. + +- Allowing switching from the `iptables` (or `ipvs`) mode to + `nftables`, or vice versa, without needing to manually clean up + rules in between. + +- Document the minimum kernel/distro requirements for the new backend. + +- Document incompatible changes between `iptables` mode and `nftables` + mode (e.g. localhost NodePorts, firewall handling, etc). + +- Do performance testing comparing the `iptables`, + `ipvs`, and `nftables` backends in small, medium, and large + clusters, comparing both the "control plane" aspects (time/CPU usage + spent reprogramming rules) and "data plane" aspects (latency and + throughput of packets to service IPs). + +- Help with the clean-up and refactoring of the kube-proxy "library" + code. + +- Although this KEP does not include anything post-GA (e.g., making + `nftables` the default backend, or changing the status of the + `iptables` and/or `ipvs` backends), we should have at least the + start of a plan for the future by the time this KEP goes GA, to + ensure that we don't just end up permanently maintaining 3 backends + instead of 2. + +### Non-Goals + +- Falling into the same traps as the `ipvs` backend, to the extent + that we can identify what those traps were. 
+ +- Removing the iptables `KUBE-IPTABLES-HINT` chain from kubelet; that + chain exists for the benefit of any component on the node that wants + to use iptables, and so should continue to exist even if no part of + the kubernetes core uses iptables itself. (And there is no need to + add anything similar for nftables, since there are no bits of host + filesystem configuration related to nftables that containerized + nftables users need to worry about.) + +## Proposal + +### Notes/Constraints/Caveats + +At least three nftables-based kube-proxy implementations already +exist, but none of them seems suitable either to adopt directly or to +use as a starting point: + +- [kube-nftlb]: This is built on top of a separate nftables-based load + balancer project called [nftlb], which means that rather than + translating Kubernetes Services directly into nftables rules, it + translates them into nftlb load balancer objects, which then get + translated into nftables rules. Besides making the code more + confusing for users who aren't already familiar with nftlb, this + also means that in many cases, new Service features would need to + have features added to the nftlb core first before kube-nftld could + consume them. (Also, it has not been updated since November 2020.) + +- [nfproxy]: Its README notes that "nfproxy is not a 1:1 copy of + kube-proxy (iptables) in terms of features. nfproxy is not going to + cover all corner cases and special features addressed by + kube-proxy". (Also, it has not been updated since January 2021.) + +- [kpng's nft backend]: This was written as a proof of concept and is + mostly a straightforward translation of the iptables rules to + nftables, and doesn't make good use of nftables features that would + let it reduce the total number of rules. It also makes heavy use of + kpng's APIs, like "DiffStore", which there is not consensus about + adopting upstream. + +[kube-nftlb]: https://github.com/zevenet/kube-nftlb +[nftlb]: https://github.com/zevenet/nftlb +[nfproxy]: https://github.com/sbezverk/nfproxy +[kpng's nft backend]: https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft + +### Risks and Mitigations + +#### Functionality + +The primary risk of the proposal is feature or stability regressions, +which will be addressed by testing, and by a slow, optional, rollout +of the new proxy mode. + +The most important mitigation for this risk is ensuring that rollback +from `nftables` mode back to `iptables`/`ipvs` mode works reliably. + +#### Compatibility + +Many Kubernetes networking implementations use kube-proxy as their +service proxy implementation. Given that few low-level details of +kube-proxy's behavior are explicitly specified, using it as part of a +larger networking implementation (and in particular, writing a +NetworkPolicy implementation that interoperates with it correctly) +necessarily requires making assumptions about (currently-)undocumented +aspects of its behavior (such as exactly when and how packets get +rewritten). + +While the `nftables` mode is likely to look very similar to the +`iptables` mode from the outside, some CNI plugins, NetworkPolicy +implementations, etc, may need updates in order to work with it. (This +may further limit the amount of testing the new mode can get during +the Alpha phase, if it is not yet compatible with popular network +plugins at that point.) There is not much we can do here, other than +avoiding *gratuitous* behavioral differences. 
+ +#### Security + +The `nftables` mode should not pose any new security issues relative +to the `iptables` mode. + +## Design Details + +### High level design + +At a high level, the new mode should have the same architecture as the +existing modes; it will use the service/endpoint-tracking code in +`k8s.io/kubernetes/pkg/proxy` to watch for changes, and update rules +in the kernel accordingly. + +### Low level design + +Some details will be figured out as we implement it. We may start with +an implementation that is architecturally closer to the `iptables` +mode, and then rewrite it to take advantage of additional nftables +features over time. + +#### Tables + +Unlike iptables, nftables does not have any reserved/default tables or +chains (eg, `nat`, `PREROUTING`). Instead, each nftables user is +expected to create and work with its own table(s), and to ignore the +tables created by other components (for example, when firewalld is +running in nftables mode, restarting it only flushes the rules in the +`firewalld` table, unlike when it is running in iptables mode, where +restarting it causes it to flush _all_ rules). + +Within each table, "base chains" can be connected to "hooks" that give +them behavior similar to the built-in iptables chains. (For example, a +chain with the properties `type nat` and `hook prerouting` would work +like the `PREROUTING` chain in the iptables `nat` table.) The +"priority" of a base chain controls when it runs relative to other +chains connected to the same hook in the same or other tables. + +An nftables table can only contain rules for a single "family" (`ip` +(v4), `ip6`, `inet` (both IPv4 and IPv6), `arp`, `bridge`, or +`netdev`). We will create a single `kube-proxy` table in the `ip` +family, and another in the `ip6` family. All of our chains, sets, +maps, etc, will go into those tables. + +(In theory, instead of creating one table each in the `ip` and `ip6` +families, we could create a single table in the `inet` family and put +both IPv4 and IPv6 chains/rules there. However, this wouldn't really +result in much simplification, because we would still need separate +sets/maps to match IPv4 addresses and IPv6 addresses. (There is no +data type that can store/match either an IPv4 address or an IPv6 +address.) Furthermore, because of how Kubernetes Services evolved in +parallel with the existing kube-proxy implementation, we have ended up +with a dual-stack Service semantics that is most easily implemented by +handling IPv4 and IPv6 completely separately anyway.) + +#### Communicating with the kernel nftables subsystem + +We will use the `nft` command-line tool to read and write rules, much +like how we use command-line tools in the `iptables` and `ipvs` +backends. + +However, the `nft` tool is mostly just a thin wrapper around +`libnftables`, so any golang API that wraps the `nft` command-line +could easily be rewritten to use `libnftables` directly (via a cgo +wrapper) in the future if that seemed like a better idea. (In theory +we could also use netlink directly, without needing cgo or external +libraries, but this would probably be a bad idea; `libnftables` +implements quite a bit of functionality on top of the raw netlink +API.) + +The nftables command-line tool allows either a single command per +invocation (as with `/sbin/iptables`): + +``` +$ nft add table ip kube-proxy '{ comment "Kubernetes service proxying rules"; }' +$ nft add chain ip kube-proxy services +$ nft add rule ip kube-proxy services ip daddr . ip protocol . 
th dport vmap @service_ips +``` + +or multiple commands to be executed in a single atomic transaction (as +with `/sbin/iptables-restore`, but more flexible): + +``` +$ nft -f - <> + +Decide if we want to stick with iptables-like affinity on sourceIP +only, switch to ipvs-like sourceIP+sourcePort affinity, add a new +`v1.ServiceAffinity` value to disambiguate, or something else. + +(See also https://github.com/kubernetes/kubernetes/pull/112806, which +removed session affinity timeouts from conformance, and claimed that +"Our plan is to deprecate the current affinity options and re-add +specific options for various behaviors so it's clear exactly what +plugins support and which behavior (if any) we want to require for +conformance in the future.") + +(FTR, the nftables backend would have no difficulty implementing the +existing timeout behavior.) + +<<[/UNRESOLVED]>> +``` + +#### Filter rules + +The `iptables` mode uses the `filter` table for three kinds of rules: + +##### Dropping or rejecting packets for services with no endpoints + +As with service dispatch, this is easily handled with a verdict map: + +``` +map no_endpoint_services { + type ipv4_addr . inet_proto . inet_service : verdict + elements = { + 192.168.99.22 . tcp . 80 : drop, + 172.30.0.46 . tcp . 80 : goto reject_chain, + 1.2.3.4 . tcp . 80 : drop + } +} + +chain filter { + ... + ip daddr . ip protocol . th dport vmap @no_endpoint_services + ... +} + +# helper chain needed because "reject" is not a "verdict" and so can't +# be used directly in a verdict map +chain reject_chain { + reject +} +``` + +##### Dropping traffic rejected by `LoadBalancerSourceRanges` + +The implementation of LoadBalancer source ranges will be similar to +the ipset-based implementation in the `ipvs` kube proxy: we use one +set to recognize "traffic that is subject to source ranges", and then +another to recognize "traffic that is _accepted_ by its service's +source ranges". Traffic which matches the first set but not the second +gets dropped: + +``` +set firewall { + comment "destinations that are subject to LoadBalancerSourceRanges"; + type ipv4_addr . inet_proto . inet_service +} +set firewall_allow { + comment "destination+sources that are allowed by LoadBalancerSourceRanges"; + type ipv4_addr . inet_proto . inet_service . ipv4_addr +} + +chain filter { + ... + ip daddr . ip protocol . th dport @firewall jump firewall_check + ... +} + +chain firewall_check { + ip daddr . ip protocol . th dport . ip saddr @firewall_allow return + drop +} +``` + +Where, eg, adding a Service with LoadBalancer IP `10.1.2.3`, port +`80`, and source ranges `["192.168.0.3/32", "192.168.1.0/24"]` would +result in: + +``` +add element ip kube-proxy firewall { 10.1.2.3 . tcp . 80 } +add element ip kube-proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.0.3/32 } +add element ip kube-proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.1.0/24 } +``` + +##### Forcing traffic on `HealthCheckNodePort`s to be accepted + +The `iptables` mode adds rules to ensure that traffic to NodePort +services' health check ports is allowed through the firewall. eg: + +``` +-A KUBE-NODEPORTS -m comment --comment "ns2/svc2:p80 health check node port" -m tcp -p tcp --dport 30000 -j ACCEPT +``` + +(There are also rules to accept any traffic that has already been +tagged by conntrack.) + +This cannot be done reliably in nftables; the semantics of `accept` +(or `-j ACCEPT` in iptables) is to end processing _of the current +table_. 
In iptables, this effectively guarantees that the packet is accepted (since `-j ACCEPT` is mostly only used in the `filter` table), but in nftables, it is still possible that someone would later call `drop` on the packet from another table, causing it to be dropped. There is no way to reliably "sneak behind the firewall's back" like you can in iptables; if an nftables-based firewall is dropping kube-proxy's packets, then you need to actually configure _that firewall_ to accept them instead.

However, this firewall-bypassing behavior is somewhat legacy anyway; the `iptables` proxy is able to bypass a _local_ firewall, but has no ability to bypass a firewall implemented at the cloud network layer, which is perhaps the more common configuration these days. Administrators using non-local firewalls are already required to configure those firewalls correctly to allow Kubernetes traffic through, and it is reasonable for us to just extend that requirement to administrators using local firewalls as well.

Thus, the `nftables` backend will not attempt to replicate these `iptables`-backend rules.

#### Future improvements

Further improvements are likely possible.

For example, it would be nice to not need a separate "hairpin" check for every endpoint. There is no way to ask directly "does this packet have the same source and destination IP?", but the proof-of-concept [kpng nftables backend] does this instead:

```
set hairpin {
    type ipv4_addr . ipv4_addr;
    elements = {
        10.180.0.4 . 10.180.0.4,
        10.180.0.5 . 10.180.0.5,
        ...
    }
}

chain ... {
    ...
    ip saddr . ip daddr @hairpin jump mark_for_masquerade
}
```

More efficiently, if nftables eventually got the ability to call eBPF programs as part of rule processing (like iptables's `-m ebpf`), then we could write a trivial eBPF program to check "source IP equals destination IP" and then call that rather than needing the giant set of redundant IPs.

If we do this, then we don't need the per-endpoint hairpin check rules. If we could also get rid of the per-endpoint affinity-updating rules, then we could get rid of the per-endpoint chains entirely, since `dnat to ...` is an allowed vmap verdict:

```
chain svc_4SW47YFZTEDKD3PK {
    # FIXME handle affinity somehow

    # Send to random endpoint
    numgen random mod 2 vmap {
        0 : dnat to 10.180.0.4:8000,
        1 : dnat to 10.180.0.5:8000
    }
}
```

With the current set of nftables functionality, it does not seem possible to do this (in the case where affinity is in use), but future features may make it possible.

It is not yet clear what the tradeoffs of such rewrites are, either in terms of runtime performance, or of admin/developer comprehensibility of the ruleset.

[kpng nftables backend]: https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft

### Changes from the iptables kube-proxy backend

Switching to a new backend which people will have to opt into gives us the chance to break backward compatibility in various places where we don't like the current iptables kube-proxy behavior.

However, if we intend to eventually make the `nftables` mode the default, then differences from `iptables` mode will be more of a problem, so we should limit these changes to cases where the benefit outweighs the cost.

#### Localhost NodePorts

Kube-proxy in `iptables` mode supports NodePorts on `127.0.0.1` (for IPv4 services) by default.
(Kube-proxy in `ipvs` mode does not support +this, and neither mode supports localhost NodePorts for IPv6 services, +although `userspace` mode did, in single-stack IPv6 clusters.) + +Localhost NodePort traffic does not work cleanly with a DNAT-based +approach to NodePorts, because moving a localhost packet to network +interface other than `lo` causes the kernel to consider it "martian" +and refuse to route it. There are various ways around this problem: + + 1. The `userspace` approach: Proxy packets in userspace rather than + redirecting them with DNAT. (The `userspace` proxy did this for + all IPs; the fact that localhost NodePorts worked with the + `userspace` proxy was a coincidence, not an explicitly-intended + feature). + + 2. The `iptables` approach: Enable the `route_localnet` sysctl, + which tells the kernel to never consider IPv4 loopback addresses + to be "martian", so that DNAT works. This only works for IPv4; + there is no corresponding sysctl for IPv6. Unfortunately, enabling + this sysctl opens security holes ([CVE-2020-8558]), which + kube-proxy then needs to try to close, which it does by creating + iptables rules to block all the packets that `route_localnet` + would have blocked _except_ for the ones we want (which assumes + that the administrator [didn't also change certain other sysctls] + that might have been safe to change had we not set + `route_localnet`, and which according to some reports [may block + legitimate traffic] in some configurations). + + 3. The Cilium approach: Intercept the connect(2) call with eBPF and + rewrite the destination IP there, so that the network stack never + actually sees a packet with destination `127.0.0.1` / `::1`. (As + in the `userspace` kube-proxy case, this is not a special-case + for localhost, it's just how Cilium does service proxying.) + + 4. If you control the client, you can explicitly bind the socket to + `127.0.0.1` / `::1` before connecting. (I'm not sure why this + works since the packet still eventually gets routed off `lo`.) It + doesn't seem to be possible to "spoof" this after the socket is + created, though as with the previous case, you could do this by + intercepting syscalls with eBPF. + +In discussions about this feature, only one real use case has been +presented: it allows you to run a docker registry in a pod and then +have nodes use a NodePort service via `127.0.0.1` to access that +registry. Docker treats `127.0.0.1` as an "insecure registry" by +default (though containerd and cri-o do not) and so does not require +TLS authentication in this case; using any other IP would require +setting up TLS certificates, making the deployment more complicated. +(In other words, this is basically an intentional exploitation of the +security hole that CVE-2020-8558 warns about: enabling +`route_localnet` may allow someone to access a service that doesn't +require authentication because it assumed it was only accessible to +localhost.) + +In all other cases, it is generally possible (though not always +convenient) to just rewrite things to use the node IP rather than +localhost (or to use a ClusterIP rather than a NodePort). Indeed, +since localhost NodePorts do not work with `ipvs` mode or with IPv6, +many places that used to use NodePorts on `127.0.0.1` have already +been rewritten to not do so (eg [contiv/vpp#1434]). + +So: + + - There is no way to make IPv6 localhost NodePorts work with a + NAT-based solution. 
+ + - The way to make IPv4 localhost NodePorts work with NAT introduces + a security hole, and we don't necessarily have a fully-generic way + to mitigate it. + + - The only commonly-argued-for use case for the feature involves + deploying a service in a configuration which its own documentation + describes as insecure and "only appropriate for testing". + + - The use case in question works by default against cri-dockerd + but not against containerd or cri-o with their default + configurations. + + - cri-dockerd, containerd, and cri-o all allow additional + "insecure registry" IPs/CIDRs to be configured, so an + administrator could configure them to allow non-TLS image + pulling against a ClusterIP. + +Given this, I think we should not try to support localhost NodePorts +in the `nftables` backend. + +``` +<<[UNRESOLVED dnat-but-no-route_localnet ]>> + +As a possible compromise, we could make the `nftables` backend create +appropriate DNAT and SNAT rules for localhost NodePorts (when +`--nodeport-addresses` includes `127.0.0.1`), but _not_ change +`route_localnet`. In that case, we could document that administrators +could enable `route_localnet` themselves if they wanted to support +NodePorts on `127.0.0.1`, but then they would also be responsible for +mitigating any security holes they had introduced. + +<<[/UNRESOLVED]>> +``` + +[CVE-2020-8558]: https://nvd.nist.gov/vuln/detail/CVE-2020-8558 +[didn't also change certain other sysctls]: https://github.com/kubernetes/kubernetes/pull/91666#issuecomment-640733664 +[may block legitimate traffic]: https://github.com/kubernetes/kubernetes/pull/91666#issuecomment-763549921 +[contiv/vpp#1434]: https://github.com/contiv/vpp/pull/1434 + +#### NodePort Addresses + +In addition to the localhost issue, iptables kube-proxy defaults to +accepting NodePort connections on all local IPs, which has effects +varying from intended-but-unexpected ("why can people connect to +NodePort services from the management network?") to clearly-just-wrong +("why can people connect to NodePort services on LoadBalancer IPs?") + +The nftables proxy should default to only opening NodePorts on a +single interface, probably the interface with the default route by +default. (Ideally, you really want it to accept NodePorts on the +interface that holds the route to the cloud load balancers, but we +don't necessarily know what that is ahead of time.) Admins can use +`--nodeport-addresses` to override this. + +#### Behavior of service IPs + +``` +<<[UNRESOLVED unused service IP ports ]>> + +@thockin has suggested that service IPs should reject connections on +ports they aren't using. (This would most easily be implemented by +adding a `--service-cidr` flag to kube-proxy so we could just "reject +everything else", but even without that we could at least reject +connections on inactive ports of active service IPs.) + +<<[/UNRESOLVED]>> +``` + +``` +<<[UNRESOLVED service IP pings ]>> + +Users sometimes get confused by the fact that service IPs do not +respond to ICMP pings, and perhaps this is something we could change. + +<<[/UNRESOLVED]>> +``` + +#### Defining an API for integration with admin/debug/third-party rules + +Administrators sometimes want to add rules to log or drop certain +packets. Kube-proxy makes this difficult because it is constantly +rewriting its rules, making it likely that admin-added rules will be +deleted shortly after being added. 
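
With nftables, such rules do not need to be added to kube-proxy's own chains at all; as discussed below, they can live in a table of the administrator's own that kube-proxy never rewrites. Purely as an illustration (the table name, chain name, priority value, and service IP here are all hypothetical, and the priority assumes it sorts before kube-proxy's own prerouting chain):

```
# A hypothetical administrator-owned table that logs traffic to one service IP
# before kube-proxy's rules see it; kube-proxy never touches this table.
nft -f - <<EOF
table ip admin-debug {
    chain prerouting {
        type filter hook prerouting priority -10;
        ip daddr 172.30.0.44 tcp dport 80 counter log prefix "svc-debug: "
    }
}
EOF
```
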

Likewise, external components (eg, NetworkPolicy implementations) may want to write rules that integrate with kube-proxy's rules in well-defined ways.

The existing kube-proxy modes do not provide any explicit "API" for integrating with them, although certain implementation details of the `iptables` backend in particular (e.g. the fact that service IPs in packets are rewritten to endpoint IPs during iptables's `PREROUTING` phase, and that masquerading will not happen before `POSTROUTING`) are effectively API, in that we know that changing them would result in significant ecosystem breakage.

We should provide a stronger definition of these larger-scale "black box" guarantees in the `nftables` backend. NFTables makes this easier than iptables in some ways, because each application is expected to create their own table, and not interfere with anyone else's tables. If we document the `priority` values we use to connect to each nftables hook, then admins and third party developers should be able to reliably process packets before or after kube-proxy, without needing to modify kube-proxy's chains/rules.

In cases where administrators want to insert rules into the middle of particular service or endpoint chains, we would have the same problem that the `iptables` backend has, which is that it would be difficult for us to avoid accidentally overwriting them when we update rules. Additionally, we want to preserve our ability to redesign the rules later to take better advantage of nftables features, which would be impossible to do if we were officially allowing users to modify the existing rules.

One possibility would be to add "admin override" vmaps that are normally empty but which admins could add `jump`/`goto` rules to for specific services to augment/bypass the normal service processing. It probably makes sense to leave these out initially and see if people actually do need them, or if creating rules in another table is sufficient.

```
<<[UNRESOLVED external rule integration API ]>>

It will be easier to figure out what the right thing to do here is once we actually have a working implementation.

<<[/UNRESOLVED]>>
```

#### Rule monitoring

Given the constraints of the iptables API, it would be extremely inefficient to do [a controller loop in the "standard" style]:

```
for {
    desired := getDesiredState()
    current := getCurrentState()
    makeChanges(desired, current)
}
```

(In particular, the combination of "`getCurrentState`" and "`makeChanges`" is slower than just skipping the "`getCurrentState`" and rewriting everything from scratch every time.)

In the past, the `iptables` backend *did* rewrite everything from scratch every time:

```
for {
    desired := getDesiredState()
    makeChanges(desired, nil)
}
```

but [KEP-3453] "Minimizing iptables-restore input size" changed this, to improve performance:

```
for {
    desired := getDesiredState()
    predicted := getPredictedState()
    if err := makeChanges(desired, predicted); err != nil {
        makeChanges(desired, nil)
    }
}
```

That is, it makes incremental updates under the assumption that the current state is correct, but if an update fails (e.g. because it assumes the existence of a chain that didn't exist), kube-proxy falls back to doing a full rewrite. (It also eventually falls back to a full update after enough time passes.)
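
In nftables terms, such a partial update could be a very small atomic transaction. As a purely illustrative sketch (reusing the map and chain names from the examples above; the exact element syntax is an assumption about how the rules will be structured), an endpoint appearing for a previously-empty service might become:

```
# Hypothetical incremental resync for one service, applied as a single
# atomic transaction; on failure, kube-proxy would fall back to a full resync.
nft -f - <<EOF
delete element ip kube-proxy no_endpoint_services { 172.30.0.46 . tcp . 80 }
add element ip kube-proxy service_ips { 172.30.0.46 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK }
EOF
```
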
+ +Proxies based on iptables have also historically had the problem that +system processes (particularly firewall implementations) would +sometimes flush all iptables rules and restart with a clean state, +thus completely breaking kube-proxy. The initial solution for this +problem was to just recreate all iptables rules every 30 seconds even +if no services/endpoints had changed. Later this was changed to create +a single "canary" chain, and check every 30 seconds that the canary +had not been deleted, and only recreate everything from scratch if the +canary disappears. + +NFTables provides a way to monitor for changes without doing polling; +you can keep a netlink socket open to the kernel (or a pipe open to an +`nft monitor` process) and receive notifications when particular kinds +of nftables objects are created or destroyed. + +However, the "everyone uses their own table" design of nftables means +that this should not be necessary. IPTables-based firewall +implementations flush all iptables rules because everyone's iptables +rules are all mixed together and it's hard to do otherwise. But in +nftables, a firewall ought to only flush _its own_ table when +restarting, and leave everyone else's tables untouched. In particular, +firewalld works this way when using nftables. We will need to see what +other firewall implementations do. + +[a controller loop in the "standard" style]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md +[KEP-3453]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3453-minimize-iptables-restore/README.md + +#### Multiple instances of `kube-proxy` + +``` +<<[UNRESOLVED multiple instances ]>> + +@uablrek has suggested various changes aimed at allowing multiple +kube-proxy instances on a single node: + + - Have the top-level table name by overridable. + - Allow configuring the chain priorities. + - Allow configuring which interfaces to process traffic on. + +This can be revisited once we have a basic implementation. + +<<[/UNRESOLVED]>> +``` + +### Switching between kube-proxy modes + +In the past, kube-proxy attempted to allow users to switch between the +`userspace` and `iptables` modes (and later the `ipvs` mode) by just +restarting kube-proxy with the new arguments. Each mode would attempt +to clean up the iptables rules used by the other modes on startup. + +Unfortunately, this didn't work well because the three modes all used +some of the same iptables chains, so, e.g., when kube-proxy started up +in `iptables` mode, it would try to delete the `userspace` rules, but +this would end up deleting rules that had been created by `iptables` +mode too, which mean that any time you restarted kube-proxy, it would +immediately delete some of its rules and be in a broken state until it +managed to re-sync from the apiserver. So this code was removed with +[KEP-2448]. + +However, the same problem would not apply when switching between an +iptables-based mode and an nftables-based mode; it should be safe to +delete all `iptables` and `ipvs` rules when starting kube-proxy in +`nftables` mode, and to delete all `nftables` rules when starting +kube-proxy in `iptables` or `ipvs` mode. This will make it easier for +users to switch between modes. 
+ +Since rollback from `nftables` mode is most important when the +`nftables` mode is not actually working correctly, we should do our +best to make sure that the cleanup code that runs when rolling back to +`iptables`/`ipvs` mode is likely to work correctly even if the rest of +the `nftables` code is broken. To that end, we can have it simply run +`nft` directly, bypassing the abstractions used by the rest of the +code. Since our rules will be isolated to our own tables, all we need +to do to clean up all of our rules is: + +``` +nft delete table ip kube-proxy +nft delete table ip6 kube-proxy +``` + +In fact, this is simple enough that we could document it explicitly as +something administrators could do if they run into problems while +rolling back. + +[KEP-2448]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2448-Remove-kube-proxy-automatic-clean-up-logic + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + +We will add unit tests for the `nftables` mode that are equivalent to +the ones for the `iptables` mode. In particular, we will port over the +tests that feed Services and EndpointSlices into the proxy engine, +dump the generated ruleset, and then mock running packets through the +ruleset to determine how they would behave. + +Since virtually all of the new code will be in a new directory, there +should not be any large changes either way to the test coverage +percentages in any existing directories. + +As of 2023-09-22, `pkg/proxy/iptables` has 70.6% code coverage in its +unit tests. For Alpha, we will have comparable coverage for +`nftables`. However, since the `nftables` implementation is new, and +more likely to have bugs than the older, widely-used `iptables` +implementation, we will also add additional unit tests before Beta. + +##### Integration tests + +Kube-proxy does not have integration tests. + +##### e2e tests + +Most of the e2e testing of kube-proxy is backend-agnostic. Initially, +we will need a separate e2e job to test the nftables mode (like we do +with ipvs). Eventually, if nftables becomes the default, then this +would be flipped around to having a legacy "iptables" job. + +The test "`[It should recreate its iptables rules if they are +deleted]`" tests (a) that kubelet recreates `KUBE-IPTABLES-HINT` if it +is deleted, and (b) that deleting all `KUBE-*` iptables rules does not +cause services to be broken forever. The latter part is obviously a +no-op under `nftables` kube-proxy, but we can run it anyway. (We are +currently assuming that we will not need an nftables version of this +test, since the problem of one component deleting another component's +rules should not exist with nftables.) + +(Though not directly related to kube-proxy, there are also other e2e +tests that use iptables which should eventually be ported to nftables; +notably, the ones using [`TestUnderTemporaryNetworkFailure`].) + +For the most part, we should not need to add any nftables-specific e2e +tests; the `nftables` backend's job is just to implement the Service +proxy API to the same specifications as the other backends do, so the +existing e2e tests already cover everything relevant. 
The only +exception to this is in cases where we change default behavior from +the `iptables` backend, in which case we may need new tests for the +different behavior. + +We will eventually need e2e tests for switching between `iptables` and +`nftables` mode in an existing cluster. + +[It should recreate its iptables rules if they are deleted]: https://github.com/kubernetes/kubernetes/blob/v1.27.0/test/e2e/network/networking.go#L550 +[`TestUnderTemporaryNetworkFailure`]: https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.2/test/e2e/framework/network/utils.go#L1078 + + + +- : + +#### Scalability & Performance tests + +``` +<<[UNRESOLVED perfscale ]>> + +- For the control plane side, the existing scalability tests are + probably reasonable, assuming we implement the same + `NetworkProgrammingLatency` metric as the existing backends. + +- For the data plane side, there are tests of + `InClusterNetworkLatency`, but no one is really looking at the + results yet and they may need work before they are useable. + +- We should also make sure that other metrics (CPU, RAM, I/O, etc) + remain reasonable in an `nftables` cluster. + +<<[/UNRESOLVED]>> +``` + +### Graduation Criteria + +#### Alpha + +- `kube-proxy --proxy-mode nftables` available behind a feature gate + +- nftables mode has unit test parity with iptables + +- An nftables-mode e2e job exists, and passes + +- Documentation describes any changes in behavior between the + `iptables` and `ipvs` modes and the `nftables` mode. + +- Documentation explains how to manually clean up nftables rules in + case things go very wrong. + +#### Beta + +- At least two releases since Alpha. + +- The nftables mode has seen at least a bit of real-world usage. + +- No major outstanding bugs. + +- nftables mode better unit test coverage than iptables mode + (currently) has. (It is possible that we will end up adding + equivalent unit tests to the iptables backend in the process.) + +- A "kube-proxy mode-switching" e2e job exists, to confirm that you + can redeploy kube-proxy in a different mode in an existing cluster. + Rollback is confirmed to be reliable. + +- An nftables e2e periodic perf/scale job exists, and shows + performance as good as iptables and ipvs. + +- Documentation describes any changes in behavior between the + `iptables` and `ipvs` modes and the `nftables` mode. Any warnings + that we have decide to add for `iptables` users using functionality + that behaves differently in `nftables` have been added. + +- No UNRESOLVED sections in the KEP. (In particular, we have figured + out what sort of "API" we will offer for integrating third-party + nftables rules.) + +#### GA + +- At least two releases since Beta. + +- The nftables mode has seen non-trivial real-world usage. + +- The nftables mode has no bugs / regressions that would make us + hesitate to recommend it. + +- We have at least the start of a plan for the next steps (changing + the default mode, deprecating the old backends, etc). + +### Upgrade / Downgrade Strategy + +The new mode should not introduce any upgrade/downgrade problems, +excepting that you can't downgrade or feature-disable a cluster using +the new kube-proxy mode without switching it back to `iptables` or +`ipvs` first. (The older kube-proxy would refuse to start if given +`--proxy-mode nftables`, and wouldn't know how to clean up stale +nftables service rules if any were present.) 
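
Before downgrading, an administrator can confirm that a node has actually been switched back by checking that kube-proxy's nftables tables are gone and its iptables chains have been recreated. A minimal sketch (`KUBE-SERVICES` is the `iptables` backend's existing top-level dispatch chain):

```
# Should print nothing once the nftables-mode tables have been cleaned up:
nft list tables | grep kube-proxy

# Should show the iptables-mode chains again:
iptables-save -t nat | grep '^:KUBE-SERVICES'
```
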
+ +When rolling out or rolling back the feature, it should be safe to +enable the feature gate and change the configuration at the same time, +since nothing cares about the feature gate except for kube-proxy +itself. Likewise, it is expected to be safe to roll out the feature in +a live cluster, even though this will result in different proxy modes +running on different nodes, because Kubernetes service proxying is +defined in such a way that no node needs to be aware of the +implementation details of the service proxy implementation on any +other node. + +### Version Skew Strategy + +The feature is isolated to kube-proxy and does not introduce any API +changes, so the versions of other components do not matter. + +Kube-proxy has no problems skewing with different versions of itself +across different nodes, because Kubernetes service proxying is defined +in such a way that no node needs to be aware of the implementation +details of the service proxy implementation on any other node. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +The administrator must enable the feature gate to make the feature +available, and then must run kube-proxy with the +`--proxy-mode=nftables` flag. + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: NFTablesProxyMode + - Components depending on the feature gate: + - kube-proxy +- [X] Other + - Describe the mechanism: + - kube-proxy must be restarted with the new `--proxy-mode`. + - Will enabling / disabling the feature require downtime of the control + plane? + - No + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + - No + +###### Does enabling the feature change any default behavior? + +Enabling the feature gate does not change any behavior; it just makes +the `--proxy-mode=nftables` option available. + +Switching from `--proxy-mode=iptables` or `--proxy-mode=ipvs` to +`--proxy-mode=nftables` will likely change some behavior, depending +on what we decide to do about certain un-loved kube-proxy features +like localhost nodeports. Whatever differences in behavior exist will +be explained clearly by the documentation; this is no different from +users switching from `iptables` to `ipvs`, which initially did not +have feature parity with `iptables`. + +(Assuming we eventually make `nftables` the default, then differences +in behavior from `iptables` will be more important, but making it the +default is not part of _this_ KEP.) + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes, though it is necessary to clean up the nftables rules that were +created, or they will continue to intercept service traffic. In any +normal case, this should happen automatically when restarting +kube-proxy in `iptables` or `ipvs` mode, however, that assumes the +user is rolling back to a still-new-enough version of kube-proxy. If +the user wants to roll back the cluster to a version of Kubernetes +that doesn't have the nftables kube-proxy code (i.e., rolling back +from Alpha to Pre-Alpha), or if they are rolling back to an external +service proxy implementation (e.g., kpng), then they would need to +make sure that the nftables rules got cleaned up _before_ they rolled +back, or else clean them up manually. (We can document how to do +this.) 
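
For reference, the manual cleanup is just the two commands described under "Switching between kube-proxy modes" above:

```
nft delete table ip kube-proxy
nft delete table ip6 kube-proxy
```
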
+ +###### What happens if we reenable the feature if it was previously rolled back? + +It should just work. + +###### Are there any tests for feature enablement/disablement? + +The actual feature gate enablement/disablement itself is not +interesting, since it only controls whether `--proxy-mode nftables` +can be selected. + +We will need an e2e test of switching a node from `iptables` (or +`ipvs`) mode to `nftables`, and vice versa. The Graduation Criteria +currently list this e2e test as being a criterion for Beta, not Alpha, +since we don't really expect people to be switching their existing +clusters over to an Alpha version of kube-proxy anyway. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + +The operator is the one who would enable the feature, and they would +know it is in use by looking at the kube-proxy configuration. + +###### How can someone using this feature know that it is working for their instance? + +- [X] Other (treat as last resort) + - Details: If Services still work then the feature is working + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +TBD. + +We should implement the existing "programming latency" metrics that +the other backends implement (`NetworkProgrammingLatency`, +`SyncProxyRulesLastQueuedTimestamp` / `SyncProxyRulesLastTimestamp`, +and `SyncProxyRulesLatency`). It's not clear if there will be a +distinction between "full syncs" and "partial syncs" that works the +same way as in the `iptables` backend, but if there is, then the +metrics related to that should also be implemented. + +It's not clear yet what sort of nftables-specific metrics will be +interesting. For example, in the `iptables` backend we have +`sync_proxy_rules_iptables_total`, which tells you the total number of +iptables rules kube-proxy has programmed. But the equivalent metric in +the `nftables` backend is not going to be as interesting, because many +of the things that are done with rules in the `iptables` backend will +be done with maps and sets in the `nftables` backend. Likewise, just +tallying "total number of rules and set/map elements" is not likely to +be useful, because the entire point of sets and maps is that they have +more-or-less **O(1)** behavior, so knowing the number of elements is +not going to give you much information about how well the system is +likely to be performing. + +- [X] Metrics + - Metric names: + - `network_programming_duration_seconds` (already exists) + - `sync_proxy_rules_last_queued_timestamp_seconds` (already exists) + - `sync_proxy_rules_last_timestamp_seconds` (already exists) + - `sync_proxy_rules_duration_seconds` (already exists) + - ... + - Components exposing the metric: + - kube-proxy + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? 
+ + + +If we change any functionality relative to `iptables` mode (e.g., not +allowing localhost NodePorts by default), it would be good to add +metrics to the `iptables` mode, allowing users to be aware of whether +they are depending on these features. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + +It may require a newer kernel than some current users have. It does +not depend on anything else in the cluster. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + +Probably not; kube-proxy will still be using the same +Service/EndpointSlice-monitoring code, it will just be doing different +things locally with the results. + +###### Will enabling / using this feature result in introducing new API types? + +No + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +It is not expected to... + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +The same way that kube-proxy currently does; updates stop being +processed until the apiserver is available again. + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + +- Initial proposal: 2023-02-01 + +## Drawbacks + +Adding a new officially-supported kube-proxy implementation implies +more work for SIG Network (especially if we are not able to deprecate +either of the existing backends soon). + +Replacing the default kube-proxy implementation will affect many +users. + +However, doing nothing would result in a situation where, eventually, +many users would be unable to use the default proxy implementation. + +## Alternatives + +### Continue to improve the `iptables` mode + +We have made many improvements to the `iptables` mode, and could make +more. In particular, we could make the `iptables` mode use IP sets +like the `ipvs` mode does. + +However, even if we could solve literally all of the performance +problems with the `iptables` mode, there is still the looming +deprecation issue. + +(See also "[The iptables kernel subsystem has unfixable performance +problems](#the-iptables-kernel-subsystem-has-unfixable-performance-problems)".) + +### Fix up the `ipvs` mode + +Rather than implementing an entirely new `nftables` kube-proxy mode, +we could try to fix up the existing `ipvs` mode. + +However, the `ipvs` mode makes extensive use of the iptables API in +addition to the IPVS API. So while it solves the performance problems +with the `iptables` mode, it does not address the deprecation issue. +So we would at least have to rewrite it to be IPVS+nftables rather +than IPVS+iptables. + +(See also "[The ipvs mode of kube-proxy will not save +us](#the--mode-of-kube-proxy-will-not-save-us)".) + +### Use an existing nftables-based kube-proxy implementation + +Discussed in [Notes/Constraints/Caveats](#notesconstraintscaveats). 
+
+### Create an eBPF-based proxy implementation
+
+Another possibility would be to try to replace the `iptables` and
+`ipvs` modes with an eBPF-based proxy backend, instead of an
+nftables one. eBPF is very trendy, but it is also notoriously
+difficult to work with.
+
+One problem with this approach is that the APIs to access conntrack
+information from eBPF programs only exist in the very newest kernels.
+In particular, the API for NATting a connection from eBPF was only
+added in the recently-released 6.1 kernel. It will be a long time
+before a majority of Kubernetes users have a kernel new enough that we
+can depend on that API.
+
+Thus, an eBPF-based kube-proxy implementation would initially need a
+number of workarounds for missing functionality, adding to its
+complexity (and potentially forcing architectural choices that would
+not otherwise be necessary, to support the workarounds).
+
+One interesting eBPF-based approach for service proxying is to use
+eBPF to intercept the `connect()` call in pods, and rewrite the
+destination IP before the packets are even sent. In this case, eBPF
+conntrack support is not needed (though it would still be needed for
+non-local service connections, such as connections via NodePorts). One
+nice feature of this approach is that it integrates well with possible
+future "multi-network Service" ideas, in which a pod might connect to
+a service IP that resolves to an IP on a secondary network which is
+only reachable by certain pods. In the case of a "normal" service
+proxy that does destination IP rewriting in the host network
+namespace, this would result in a packet that was undeliverable
+(because the host network namespace has no route to the isolated
+secondary pod network), but a service proxy that does `connect()`-time
+rewriting would rewrite the connection before it ever left the pod
+network namespace, allowing the connection to proceed.
+
+The multi-network effort is still in the very early stages, and it is
+not clear that it will actually adopt a model of multi-network
+Services that works this way. (It is also _possible_ to make such a
+model work with a mostly-host-network-based proxy implementation; it's
+just more complicated.)
+
diff --git a/keps/sig-network/3866-nftables-proxy/kep.yaml b/keps/sig-network/3866-nftables-proxy/kep.yaml
new file mode 100644
index 000000000000..b5ab82e9ab4b
--- /dev/null
+++ b/keps/sig-network/3866-nftables-proxy/kep.yaml
@@ -0,0 +1,39 @@
+title: Add an nftables-based kube-proxy backend
+kep-number: 3866
+authors:
+  - "@danwinship"
+owning-sig: sig-network
+status: implementable
+creation-date: 2023-02-01
+reviewers:
+  - "@thockin"
+  - "@dcbw"
+  - "@aojea"
+approvers:
+  - "@thockin"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.29"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.29"
+  beta: "v1.31"
+  stable: "v1.33"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: NFTablesProxyMode
+    components:
+      - kube-proxy
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+  - ...