diff --git a/keps/sig-network/3866-nftables-proxy/README.md b/keps/sig-network/3866-nftables-proxy/README.md
new file mode 100644
index 000000000000..3c48975af7ea
--- /dev/null
+++ b/keps/sig-network/3866-nftables-proxy/README.md
@@ -0,0 +1,1342 @@
+# KEP-3866: An nftables-based kube-proxy backend
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+ - [The iptables kernel subsystem has unfixable performance problems](#the-iptables-kernel-subsystem-has-unfixable-performance-problems)
+ - [Upstream development has moved on from iptables to nftables](#upstream-development-has-moved-on-from-iptables-to-nftables)
+ - [The ipvs mode of kube-proxy will not save us](#the--mode-of-kube-proxy-will-not-save-us)
+ - [The nf_tables mode of /sbin/iptables will not save us](#the--mode-of--will-not-save-us)
+ - [The iptables mode of kube-proxy has grown crufty](#the--mode-of-kube-proxy-has-grown-crufty)
+ - [We will hopefully be able to trade 2 supported backends for 1](#we-will-hopefully-be-able-to-trade-2-supported-backends-for-1)
+ - [Writing a new kube-proxy mode may help with our "KPNG" goals](#writing-a-new-kube-proxy-mode-may-help-with-our-kpng-goals)
+ - [Goals](#goals)
+ - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+ - [Notes/Constraints/Caveats](#notesconstraintscaveats)
+ - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+ - [High level](#high-level)
+ - [Low level](#low-level)
+ - [Tables](#tables)
+ - [Communicating with the kernel nftables subsystem](#communicating-with-the-kernel-nftables-subsystem)
+ - [Versioning and compatibility](#versioning-and-compatibility)
+ - [NAT rules](#nat-rules)
+ - [General Service dispatch](#general-service-dispatch)
+ - [Masquerading](#masquerading)
+ - [Session affinity](#session-affinity)
+ - [Filter rules](#filter-rules)
+ - [Dropping or rejecting packets for services with no endpoints](#dropping-or-rejecting-packets-for-services-with-no-endpoints)
+ - [Dropping traffic rejected by LoadBalancerSourceRanges](#dropping-traffic-rejected-by-)
+ - [Forcing traffic on HealthCheckNodePorts to be accepted](#forcing-traffic-on--to-be-accepted)
+ - [Future improvements](#future-improvements)
+ - [Test Plan](#test-plan)
+ - [Prerequisite testing updates](#prerequisite-testing-updates)
+ - [Unit tests](#unit-tests)
+ - [Integration tests](#integration-tests)
+ - [e2e tests](#e2e-tests)
+ - [Graduation Criteria](#graduation-criteria)
+ - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+ - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+ - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+ - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+ - [Monitoring Requirements](#monitoring-requirements)
+ - [Dependencies](#dependencies)
+ - [Scalability](#scalability)
+ - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+ - [Continue to improve the iptables mode](#continue-to-improve-the--mode)
+ - [Fix up the ipvs mode](#fix-up-the--mode)
+ - [Use an existing nftables-based kube-proxy implementation](#use-an-existing-nftables-based-kube-proxy-implementation)
+ - [Create an eBPF-based proxy implementation](#create-an-ebpf-based-proxy-implementation)
+
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+ - [ ] e2e Tests for all Beta API Operations (endpoints)
+ - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+ - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+ - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+The default kube-proxy implementation on Linux is currently based on
+iptables, which was the preferred packet filtering and processing
+system in the Linux kernel for many years (starting with the 2.4
+kernel in 2001). However, problems with iptables led to the
+development of a successor, nftables, first made available in the 3.13
+kernel in 2014, which has grown increasingly featureful and usable as a
+replacement for iptables since then. Development on iptables has
+mostly stopped, with new features and performance improvements
+primarily going into nftables instead.
+
+This KEP proposes the creation of a new official/supported nftables
+backend for kube-proxy. While it is hoped that this backend will
+eventually replace both the `iptables` and `ipvs` backends and become
+the default kube-proxy mode on Linux, that replacement/deprecation
+would be handled in a separate future KEP.
+
+## Motivation
+
+There are currently two officially supported kube-proxy backends for
+Linux: `iptables` and `ipvs`. (The original `userspace` backend was
+deprecated several releases ago and removed from the tree in 1.25.)
+
+The `iptables` mode of kube-proxy is currently the default, and it is
+generally considered "good enough" for most use cases. Nonetheless,
+there are good arguments for replacing it with a new `nftables` mode.
+
+### The iptables kernel subsystem has unfixable performance problems
+
+Although much work has been done to improve the performance of the
+kube-proxy `iptables` backend, there are fundamental
+performance-related problems with the implementation of iptables in
+the kernel, both on the "control plane" side and on the "data plane"
+side:
+
+ - The control plane is problematic because the iptables API does not
+ support making incremental changes to the ruleset. If you want to
+ add a single iptables rule, the iptables binary must acquire a lock,
+ download the entire ruleset from the kernel, find the appropriate
+ place in the ruleset to add the new rule, add it, re-upload the
+ entire ruleset to the kernel, and release the lock. This becomes
+ slower and slower as the ruleset increases in size (ie, as the
+ number of Kubernetes Services grows). If you want to replace a large
+ number of rules (as kube-proxy does frequently), then even just the
+ time it takes `/sbin/iptables-restore` to parse all of the rules
+ becomes substantial.
+
+ - The data plane is problematic because (for the most part), the
+ number of iptables rules used to implement a set of Kubernetes
+ Services is directly proportional to the number of Services. And
+ every packet going through the system then needs to pass through
+ all of these rules, slowing down the traffic.
+
+The iptables subsystem is the bottleneck in kube-proxy performance,
+and it always will be until we stop using it.
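+
+As a rough illustration of the difference (a sketch only; the
+`kube_proxy` table and `service_ips` verdict map used here are
+described later in this KEP):
+
+```
+# iptables: the only way to change anything is to rewrite everything
+iptables-save -t nat > rules.txt     # dump the entire table
+# ... edit rules.txt to add or remove the relevant rules ...
+iptables-restore -T nat < rules.txt  # re-upload the entire table
+
+# nftables: a single map element can be added or removed directly
+nft add element ip kube_proxy service_ips \
+    '{ 172.30.0.44 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK }'
+```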
+
+### Upstream development has moved on from iptables to nftables
+
+In large part due to its unfixable problems, development on iptables
+in the kernel has slowed down and mostly stopped. New features are not
+being added to iptables, because nftables is supposed to do everything
+iptables does, but better.
+
+Although there is no plan to remove iptables from the upstream kernel,
+that does not guarantee that iptables will remain supported by
+_distributions_ forever. In particular, Red Hat has declared that
+[iptables is deprecated in RHEL 9] and is likely to be removed
+entirely in RHEL 10, a few years from now. Other distributions have
+made smaller steps in the same direction; for instance, [Debian
+removed `iptables` from the set of "required" packages] in Debian 11
+(Bullseye).
+
+The RHEL deprecation in particular impacts Kubernetes in two ways:
+
+ 1. Many Kubernetes users run RHEL or one of its downstreams, so in a
+ few years when RHEL 10 is released, they will be unable to use
+ kube-proxy in `iptables` mode (or, for that matter, in `ipvs` or
+ `userspace` mode, since those modes also make heavy use of the
+ iptables API).
+
+ 2. Several upstream iptables bugs and performance problems that
+ affect Kubernetes have been fixed by Red Hat developers over the
+ past several years. With Red Hat no longer making any effort to
+ maintain iptables, it is less likely that upstream iptables bugs
+ that affect Kubernetes in the future would be fixed promptly, if
+ at all.
+
+[iptables is deprecated in RHEL 9]: https://access.redhat.com/solutions/6739041
+[Debian removed `iptables` from the set of "required" packages]: https://salsa.debian.org/pkg-netfilter-team/pkg-iptables/-/commit/c59797aab9
+
+### The `ipvs` mode of kube-proxy will not save us
+
+Because of the problems with iptables, some developers added an `ipvs`
+mode to kube-proxy in 2017. It was generally hoped that this could
+eventually solve all of the problems with the `iptables` mode and
+become its replacement, but this never really happened. It's not
+entirely clear why... [kubeadm #817], "Track when we can enable the
+ipvs mode for the kube-proxy by default" is perhaps a good snapshot of
+the initial excitement followed by growing disillusionment with the
+`ipvs` mode:
+
+ - "a few issues ... re: the version of iptables/ipset shipped in the
+ kube-proxy container image"
+ - "clearly not ready for defaulting"
+ - "complications ... with IPVS kernel modules missing or disabled on
+ user nodes"
+ - "we are still lacking tests"
+ - "still does not completely align with what [we] support in
+ iptables mode"
+ - "iptables works and people are familiar with it"
+ - "[not sure that it was ever intended for IPVS to be the default]"
+
+Additionally, the kernel IPVS APIs alone do not provide enough
+functionality to fully implement Kubernetes services, and so the
+`ipvs` backend also makes heavy use of the iptables API. Thus, if we
+are worried about iptables deprecation, then in order to switch to
+using `ipvs` as the default mode, we would have to port the iptables
+parts of it to use nftables anyway. But at that point, there would be
+little excuse for using IPVS for the core load-balancing part,
+particularly given that IPVS, like iptables, is no longer an
+actively-developed technology.
+
+[kubeadm #817]: https://github.com/kubernetes/kubeadm/issues/817
+[not sure that it was ever intended for IPVS to be the default]: https://en.wikipedia.org/wiki/The_Fox_and_the_Grapes
+
+### The `nf_tables` mode of `/sbin/iptables` will not save us
+
+In 2018, with the 1.8.0 release of the iptables client binaries, a new
+mode was added to the binaries, to allow them to use the nftables API
+in the kernel rather than the legacy iptables API, while still
+preserving the "API" of the original iptables binaries. As of 2022,
+most Linux distributions now use this mode, so the legacy iptables
+kernel API is mostly dead.
+
+However, this new mode does not add any new _syntax_, and so it is not
+possible to use any of the new nftables features (like maps) that are
+not present in iptables.
+
+Furthermore, the compatibility constraints imposed by the user-facing
+API of the iptables binaries themselves prevent them from being able
+to take advantage of many of the performance improvements associated
+with nftables.
+
+### The `iptables` mode of kube-proxy has grown crufty
+
+Because `iptables` is the default kube-proxy mode, it is subject to
+strong backward-compatibility constraints which mean that certain
+"features" that are now considered to be bad ideas cannot be removed
+because they might break some existing users. A few examples:
+
+ - It allows NodePort services to be accessed on `localhost`, which
+ requires it to set a sysctl to a value that may introduce security
+ holes on the system. More generally, it defaults to having
+ NodePort services be accessible on _all_ node IPs, when most users
+ would probably prefer them to be more restricted.
+
+ - It implements the `LoadBalancerSourceRanges` feature for traffic
+ addressed directly to LoadBalancer IPs, but not for traffic
+ redirected to a NodePort by an external LoadBalancer.
+
+ - Some new functionality only works correctly if the administrator
+ passes certain command-line options to kube-proxy (eg,
+ `--cluster-cidr`), but we cannot make those options be mandatory,
+ since that would break old clusters that aren't passing them.
+
+A new kube-proxy, which existing users would have to explicitly opt
+into, could revisit these and other decisions.
+
+### We will hopefully be able to trade 2 supported backends for 1
+
+Right now SIG Network is supporting both the `iptables` and `ipvs`
+backends of kube-proxy, and does not feel like it can ditch `ipvs`
+because of performance issues with `iptables`. If we create a new
+backend which is as functional and non-buggy as `iptables` but as
+performant as `ipvs`, then we could (eventually) deprecate both of the
+existing backends and only have one backend to support in the future.
+
+### Writing a new kube-proxy mode may help with our "KPNG" goals
+
+The [KPNG] (Kube-Proxy Next Generation) working group has been working
+on the future of kube-proxy's underlying architecture. They have
+recently proposed a [kube-proxy library KEP]. Creating a new proxy
+mode which will be officially supported, but which does not (yet) have
+the same compatibility and non-bugginess requirements as the
+`iptables` and `ipvs` modes should help with that project, because we
+can target the new backend to the new library without worrying about
+breaking the old backends.
+
+[KPNG]: https://github.com/kubernetes-sigs/kpng
+[kube-proxy library KEP]: https://github.com/kubernetes/enhancements/pull/3649
+
+### Goals
+
+- Design and implement an `nftables` mode for kube-proxy.
+
+ - Drop support for localhost nodeports
+
+ - Ensure that all configuration which is _required_ for full
+ functionality (eg, `--cluster-cidr`) is actually required,
+ rather than just logging warnings about missing functionality.
+
+ - Consider other fixes to legacy `iptables` mode behavior.
+
+- Come up with at least a vague plan to eventually make `nftables` the
+ default backend.
+
+- Decide whether we can/should deprecate or even remove the `iptables`
+ and/or `ipvs` backends. (Perhaps they can be pushed out of tree, a
+ la `cri-dockerd`.)
+
+- Take advantage of kube-proxy-related work being done by the kpng
+ working group.
+
+### Non-Goals
+
+- Falling into the same traps as the `ipvs` backend, to the extent
+ that we can identify what those traps were.
+
+## Proposal
+
+### Notes/Constraints/Caveats
+
+At least three nftables-based kube-proxy implementations already
+exist, but none of them seems suitable either to adopt directly or to
+use as a starting point:
+
+- [kube-nftlb]: This is built on top of a separate nftables-based load
+ balancer project called [nftlb], which means that rather than
+ translating Kubernetes Services directly into nftables rules, it
+ translates them into nftlb load balancer objects, which then get
+ translated into nftables rules. Besides making the code more
+ confusing for users who aren't already familiar with nftlb, this
+ also means that in many cases, new Service features would need to
+ have features added to the nftlb core first before kube-nftlb could
+ consume them. (Also, it has not been updated in two years.)
+
+- [nfproxy]: Its README notes that "nfproxy is not a 1:1 copy of
+ kube-proxy (iptables) in terms of features. nfproxy is not going to
+ cover all corner cases and special features addressed by
+ kube-proxy". (Also, it has not been updated in two years.)
+
+- [kpng's nft backend]: This was written as a proof of concept and is
+ mostly a straightforward translation of the iptables rules to
+ nftables, and doesn't make good use of nftables features that would
+ let it reduce the total number of rules. It also makes heavy use of
+ kpng's APIs, like "DiffStore", which there is not consensus about
+ adopting upstream.
+
+[kube-nftlb]: https://github.com/zevenet/kube-nftlb
+[nftlb]: https://github.com/zevenet/nftlb
+[nfproxy]: https://github.com/sbezverk/nfproxy
+[kpng's nft backend]: https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft
+
+### Risks and Mitigations
+
+The primary risk of the proposal is feature regressions, which will be
+addressed by testing, and by a slow, optional, rollout of the new proxy
+mode.
+
+The `nftables` mode should not pose any new security issues relative
+to the `iptables` mode.
+
+## Design Details
+
+### High level
+
+At a high level, the new mode should have the same architecture as the
+existing modes; it will use the service/endpoint-tracking code in
+`k8s.io/kubernetes/pkg/proxy` (or its eventual replacement from kpng)
+to watch for changes, and update rules in the kernel accordingly.
+
+### Low level
+
+Some details will be figured out as we implement it. We may start with
+an implementation that is architecturally closer to the `iptables`
+mode, and then rewrite it to take advantage of additional nftables
+features over time.
+
+#### Tables
+
+Unlike iptables, nftables does not have any reserved/default tables or
+chains (eg, `nat`, `PREROUTING`). Users are expected to create their
+own tables and chains for their own purposes. An nftables table can
+only contain rules for a single "family" (`ip` (v4), `ip6`, `inet`
+(both IPv4 and IPv6), `arp`, `bridge`, or `netdev`), but unlike in
+iptables, you can have both "filter"-type chains and "NAT"-type chains
+in the same table.
+
+So, we will create a single `kube_proxy` table in the `ip` family, and
+another in the `ip6` family. All of our chains, sets, maps, etc, will
+go into those tables. Other system components (eg, firewalld) should
+ignore our table, so we should not need to worry about watching for
+other people deleting our rules like we have to in the `iptables`
+backend.
+
+(In theory, instead of creating one table each in the `ip` and `ip6`
+families, we could create a single table in the `inet` family and put
+both IPv4 and IPv6 chains/rules there. However, this wouldn't really
+result in much simplification, because we would still need separate
+sets/maps to match IPv4 addresses and IPv6 addresses. (There is no
+data type that can store/match either an IPv4 address or an IPv6
+address.) Furthermore, because of how Kubernetes Services evolved in
+parallel with the existing kube-proxy implementation, we have ended up
+with dual-stack Service semantics that are most easily implemented by
+handling IPv4 and IPv6 completely separately anyway.)
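+
+As a sketch (in the `nft` command syntax discussed in the next
+section), the initial setup would then be roughly:
+
+```
+add table ip kube_proxy { comment "Kubernetes service proxying rules"; }
+add table ip6 kube_proxy { comment "Kubernetes service proxying rules"; }
+```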
+
+#### Communicating with the kernel nftables subsystem
+
+At least initially, we will use the `nft` command-line tool to read
+and write rules, much like how we use command-line tools in the
+`iptables` and `ipvs` backends. However, the `nft` tool is mostly just
+a thin wrapper around `libnftables`, and it would be possible to use
+that directly instead in the future, given a cgo wrapper.
+
+When reading data from the kernel (`nft list ...`), `nft` outputs the
+data in a nested "object" form:
+
+```
+table ip kube_proxy {
+ comment "Kubernetes service proxying rules";
+
+ chain services {
+ ip daddr . ip protocol . th dport vmap @service_ips
+ }
+}
+```
+
+(This is the "native" nftables syntax, but the tools also support a
+JSON syntax that may be easier for us to work with...)
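+
+For comparison, the JSON form of roughly the same data (as produced by
+`nft -j list ...`) looks something like the following. This is
+abbreviated and untested; the exact fields vary by version, and rules
+carry their match/action logic in an `expr` array of expression
+objects:
+
+```
+{"nftables": [
+  {"metainfo": {"version": "1.0.1", "json_schema_version": 1}},
+  {"table": {"family": "ip", "name": "kube_proxy", "handle": 1}},
+  {"chain": {"family": "ip", "table": "kube_proxy", "name": "services", "handle": 2}},
+  {"rule": {"family": "ip", "table": "kube_proxy", "chain": "services", "handle": 3,
+            "expr": [ ... ]}}
+]}
+```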
+
+When writing data to the kernel, `nft` accepts the data in either the
+same "object" form used by `nft list`, or in the form of a set of
+`nft` command lines without the leading "`nft`" (which are then
+executed atomically):
+
+```
+add table ip kube_proxy { comment "Kubernetes service proxying rules"; }
+add chain ip kube_proxy services
+add rule ip kube_proxy services ip daddr . ip protocol . th dport vmap @service_ips
+```
+
+The "object" form is more logical and easy to understand, but the
+"command" form is better for dynamic usage. In particular, it allows
+you to add and remove individual chains, rules, map/set elements, etc,
+without needing to also include the chains/rules/elements that you are
+not modifying.
+
+The examples below all show the "object" form of data, but it should
+be understood that these are examples of what would be seen in `nft
+list` output after kube-proxy creates the rules (with additional
+`#`-preceded comments added to help the KEP reader), not examples of
+the data we will actually be passing to `nft`.
+
+The examples below are also all IPv4-specific, for simplicity. When
+actually writing out rules for nft, we will need to switch between,
+e.g., "`ip daddr`" and "`ip6 daddr`" appropriately, to match an IPv4
+or IPv6 destination address. This will actually be fairly simple
+because the `nft` command lets you create "variables" (really
+constants) and substitute their values into the rules. Thus, we can
+just always have the rule-generating code write "`$IP daddr`", and
+then pass either "`-D IP=ip`" or "`-D IP=ip6`" to `nft` to fix it up.
+
+(Also, most of the examples below have not actually been tested and
+may have syntax errors. Caveat lector.)
+
+#### Versioning and compatibility
+
+Since nftables is subject to much more development than iptables has
+been recently, we will need to pay more attention to kernel and tool
+versions.
+
+The `nft` command has a `--check` option which can be used to check if
+a command could be run successfully; it parses the input, and then
+(assuming success), uploads the data to the kernel and asks the kernel
+to check it (but not actually act on it) as well. Thus, with a few
+`nft --check` runs at startup we should be able to confirm what
+features are known to both the tooling and the kernel.
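+
+For example (a hypothetical, untested probe), kube-proxy could confirm
+at startup that the `ct original` matches used by the masquerading
+rules below are supported by both `nft` and the kernel:
+
+```
+nft --check -f - <<'EOF' && echo "ct original is supported"
+table ip kube_proxy_feature_test {
+    chain test {
+        ct original ip daddr 10.0.0.1 counter
+    }
+}
+EOF
+```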
+
+It is not yet clear what the minimum kernel or `nft` command-line
+versions needed by the `nftables` backend will be. The newest feature
+used in the examples below was added in Linux 5.6, released in March
+2020 (though they could be rewritten to not need that feature).
+
+It is possible some users will not be able to upgrade from the
+`iptables` and `ipvs` backends to `nftables`. (Certainly the
+`nftables` backend will not support RHEL 7, which some people are
+still using Kubernetes with.)
+
+#### NAT rules
+
+##### General Service dispatch
+
+For ClusterIP and external IP services, we will use an nftables
+"verdict map" to store the logic about where to dispatch traffic,
+based on destination IP, protocol, and port. We will then need only a
+single actual rule to apply the verdict map to all inbound traffic.
+(Or it may end up making more sense to have separate verdict maps for
+ClusterIP, ExternalIP, and LoadBalancer IP?) Likewise, for NodePort
+traffic, we will use a verdict map matching only on destination
+protocol / port, with the rules set up to only check the `nodeports`
+map for packets addressed to a local IP.
+
+```
+map service_ips {
+ comment "ClusterIP, ExternalIP and LoadBalancer IP traffic";
+
+ # The "type" clause defines the map's datatype; the key type is to
+ # the left of the ":" and the value type to the right. The map key
+ # in this case is a concatenation (".") of three values; an IPv4
+ # address, a protocol (tcp/udp/sctp), and a port (aka
+ # "inet_service"). The map value is a "verdict", which is one of a
+ # limited set of nftables actions. In this case, the verdicts are
+ # all "goto" statements.
+
+ type ipv4_addr . inet_proto . inet_service : verdict;
+
+ elements {
+ 172.30.0.44 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK,
+ 192.168.99.33 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK,
+ ...
+ }
+}
+
+map service_nodeports {
+ comment "NodePort traffic";
+ type inet_proto . inet_service : verdict;
+
+ elements {
+ tcp . 3001 : goto svc_4SW47YFZTEDKD3PK,
+ ...
+ }
+}
+
+chain prerouting {
+ jump services
+ jump nodeports
+}
+
+chain services {
+ # Construct a key from the destination address, protocol, and port,
+ # then look that key up in the `service_ips` vmap and take the
+ # associated action if it is found.
+
+ ip daddr . ip protocol . th dport vmap @service_ips
+}
+
+chain nodeports {
+ # Return if the destination IP is non-local, or if it's localhost.
+ fib daddr type != local return
+ ip daddr == 127.0.0.1 return
+
+ # If --nodeport-addresses was in use then the above would instead be
+ # something like:
+ # ip daddr != { 192.168.1.5, 192.168.3.10 } return
+
+ # dispatch on the service_nodeports vmap
+ ip protocol . th dport vmap @service_nodeports
+}
+
+# Example per-service chain
+chain svc_4SW47YFZTEDKD3PK {
+ # Send to random endpoint chain using an inline vmap
+ numgen random mod 2 vmap {
+ 0 : goto sep_UKSFD7AGPMPPLUHC,
+ 1 : goto sep_C6EBXVWJJZMIWKLZ
+ }
+}
+
+# Example per-endpoint chain
+chain sep_UKSFD7AGPMPPLUHC {
+ # masquerade hairpin traffic
+ ip saddr 10.180.0.4 jump mark_for_masquerade
+
+ # send to selected endpoint
+ dnat to 10.180.0.4:8000
+}
+```
+
+##### Masquerading
+
+The example rules above include
+
+```
+ ip saddr 10.180.0.4 jump mark_for_masquerade
+```
+
+to masquerade hairpin traffic, as in the `iptables` proxier. This
+assumes the existence of a `mark_for_masquerade` chain, not shown.
+
+nftables has the same constraints on DNAT and masquerading as iptables
+does; you can only DNAT from the "prerouting" stage and you can only
+masquerade from the "postrouting" stage. Thus, as with `iptables`, the
+`nftables` proxy will have to handle DNAT and masquerading at separate
+times. One possibility would be to simply copy the existing logic from
+the `iptables` proxy, using the packet mark to communicate from the
+prerouting chains to the postrouting ones.
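+
+That mark-based approach might look roughly like the following
+(untested, and assuming the same `0x4000` masquerade mark bit that the
+`iptables` proxier uses by default):
+
+```
+chain mark_for_masquerade {
+    meta mark set mark or 0x4000
+}
+
+# base chain, hooked into postrouting
+chain postrouting {
+    type nat hook postrouting priority srcnat;
+    meta mark and 0x4000 == 0x4000 masquerade
+}
+```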
+
+However, it should be possible to do this in nftables without using
+the mark or any other externally-visible state; we can just create an
+nftables `set`, and use that to communicate information between the
+chains. Something like:
+
+```
+# Set of 5-tuples of connections that need masquerading
+set need_masquerade {
+ type ipv4_addr . inet_service . ipv4_addr . inet_service . inet_proto;
+ flags timeout ; timeout 5s ;
+}
+
+chain mark_for_masquerade {
+ update @need_masquerade { ip saddr . th sport . ip daddr . th dport . ip protocol }
+}
+
+chain postrouting_do_masquerade {
+ # We use "ct original ip daddr" and "ct original proto-dst" here
+ # since the packet may have been DNATted by this point.
+
+ ip saddr . th sport . ct original ip daddr . ct original proto-dst . ip protocol @need_masquerade masquerade
+}
+```
+
+This is not yet tested, but some kernel nftables developers have
+confirmed that it ought to work.
+
+##### Session affinity
+
+Session affinity can be done in roughly the same way as in the
+`iptables` proxy, just using the more general nftables "set" framework
+rather than the affinity-specific version of sets provided by the
+iptables `recent` module. In fact, since nftables allows arbitrary set
+keys, we can optimize relative to `iptables`, and only have a single
+affinity set per service, rather than one per endpoint. (And we also
+have the flexibility to change the affinity key in the future if we
+want to, eg to key on source IP+port rather than just source IP.)
+
+```
+set affinity_4SW47YFZTEDKD3PK {
+ # Source IP . Destination IP . Destination Port
+ type ipv4_addr . ipv4_addr . inet_service;
+ flags timeout; timeout 3h;
+}
+
+chain svc_4SW47YFZTEDKD3PK {
+ # Check for existing session affinity against each endpoint
+ ip saddr . 10.180.0.4 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_UKSFD7AGPMPPLUHC
+ ip saddr . 10.180.0.5 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_C6EBXVWJJZMIWKLZ
+
+ # Send to random endpoint chain
+ numgen random mod 2 vmap {
+ 0 : goto sep_UKSFD7AGPMPPLUHC,
+ 1 : goto sep_C6EBXVWJJZMIWKLZ
+ }
+}
+
+chain sep_UKSFD7AGPMPPLUHC {
+ # Mark the source as having affinity for this endpoint
+ update @affinity_4SW47YFZTEDKD3PK { ip saddr . 10.180.0.4 . 80 }
+
+ ip saddr 10.180.0.4 jump mark_for_masquerade
+ dnat to 10.180.0.4:8000
+}
+
+# likewise for other endpoint(s)...
+```
+
+#### Filter rules
+
+The `iptables` mode uses the `filter` table for three kinds of rules:
+
+##### Dropping or rejecting packets for services with no endpoints
+
+As with service dispatch, this is easily handled with a verdict map:
+
+```
+map no_endpoint_services {
+ type ipv4_addr . inet_proto . inet_service : verdict
+ elements = {
+ 192.168.99.22 . tcp . 80 : drop,
+ 172.30.0.46 . tcp . 80 : goto reject_chain,
+ 1.2.3.4 . tcp . 80 : drop
+ }
+}
+
+chain filter {
+ ...
+ ip daddr . ip protocol . th dport vmap @no_endpoint_services
+ ...
+}
+
+# helper chain needed because "reject" is not a "verdict" and so can't
+# be used directly in a verdict map
+chain reject_chain {
+ reject
+}
+```
+
+##### Dropping traffic rejected by `LoadBalancerSourceRanges`
+
+The implementation of LoadBalancer source ranges will be similar to
+the ipset-based implementation in the `ipvs` kube proxy: we use one
+set to recognize "traffic that is subject to source ranges", and then
+another to recognize "traffic that is _accepted_ by its service's
+source ranges". Traffic which matches the first set but not the second
+gets dropped:
+
+```
+set firewall {
+ comment "destinations that are subject to LoadBalancerSourceRanges";
+ type ipv4_addr . inet_proto . inet_service
+}
+set firewall_allow {
+ comment "destination+sources that are allowed by LoadBalancerSourceRanges";
+ type ipv4_addr . inet_proto . inet_service . ipv4_addr
+ # "interval" is needed so that elements can contain CIDRs, not just single IPs
+ flags interval;
+}
+
+chain filter {
+ ...
+ ip daddr . ip protocol . th dport @firewall jump firewall_check
+ ...
+}
+
+chain firewall_check {
+ ip daddr . ip protocol . th dport . ip saddr @firewall_allow return
+ drop
+}
+```
+
+Where, eg, adding a Service with LoadBalancer IP `10.1.2.3`, port
+`80`, and source ranges `["192.168.0.3/32", "192.168.1.0/24"]` would
+result in:
+
+```
+add element ip kube_proxy firewall { 10.1.2.3 . tcp . 80 }
+add element ip kube_proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.0.3/32 }
+add element ip kube_proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.1.0/24 }
+```
+
+##### Forcing traffic on `HealthCheckNodePorts` to be accepted
+
+The `iptables` mode adds rules to ensure that traffic to NodePort
+services' health check ports is allowed through the firewall. eg:
+
+```
+-A KUBE-NODEPORTS -m comment --comment "ns2/svc2:p80 health check node port" -m tcp -p tcp --dport 30000 -j ACCEPT
+```
+
+(There are also rules to accept any traffic that has already been
+tagged by conntrack.)
+
+This cannot be done reliably in nftables; the `accept` and `drop`
+rules work differently than they do in iptables, and so if there is a
+firewall that would drop traffic to that port, then there is no
+guaranteed way to "sneak behind its back" like you can in iptables; we
+would need to actually properly configure _that firewall_ to accept
+the packets.
+
+However, these sorts of rules are somewhat legacy anyway; they work
+(in the `iptables` proxy) to bypass a _local_ firewall, but they would
+do nothing to bypass a firewall implemented at the cloud network
+layer, which is perhaps a more common configuration these days anyway.
+Administrators using non-local firewalls are already required to
+configure those firewalls correctly to allow Kubernetes traffic
+through, and it is reasonable for us to just extend that requirement
+to administrators using local firewalls as well.
+
+Thus, the `nftables` backend will not attempt to replicate these
+`iptables`-backend rules.
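+
+For example, an administrator using a local firewalld-based firewall
+would be expected to open the health check port themselves, e.g. (a
+runtime-only change, using the port number from the example above):
+
+```
+firewall-cmd --add-port=30000/tcp
+```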
+
+#### Future improvements
+
+Further improvements are likely possible.
+
+For example, it would be nice to not need a separate "hairpin" check for
+every endpoint. There is no way to ask directly "does this packet have
+the same source and destination IP?", but the proof-of-concept [kpng
+nftables backend] does this instead:
+
+```
+set hairpin {
+ type ipv4_addr . ipv4_addr;
+ elements {
+ 10.180.0.4 . 10.180.0.4,
+ 10.180.0.5 . 10.180.0.5,
+ ...
+ }
+}
+
+chain ... {
+ ...
+ ip saddr . ip daddr @hairpin jump mark_for_masquerade
+}
+```
+
+More efficiently, if nftables eventually got the ability to call eBPF
+programs as part of rule processing (like iptables's `-m ebpf`) then
+we could write a trivial eBPF program to check "source IP equals
+destination IP" and then call that rather than needing the giant set
+of redundant IPs.
+
+If we do this, then we don't need the per-endpoint hairpin check
+rules. If we could also get rid of the per-endpoint affinity-updating
+rules, then we could get rid of the per-endpoint chains entirely,
+since `dnat to ...` is an allowed vmap verdict:
+
+```
+chain svc_4SW47YFZTEDKD3PK {
+ # FIXME handle affinity somehow
+
+ # Send to random endpoint
+ numgen random mod 2 vmap {
+ 0 : dnat to 10.180.0.4:8000,
+ 1 : dnat to 10.180.0.5:8000
+ }
+}
+```
+
+With the current set of nftables functionality, it does not seem
+possible to do this (in the case where affinity is in use), but future
+features may make it possible.
+
+It is not yet clear what the tradeoffs of such rewrites are, either in
+terms of runtime performance, or of admin/developer-comprehensibility
+of the ruleset.
+
+[kpng nftables backend]: https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft
+
+### Test Plan
+
+
+
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+
+
+##### Unit tests
+
+We will add unit tests for the `nftables` mode that are equivalent to
+the ones for the `iptables` mode. In particular, we will port over the
+tests that feed Services and EndpointSlices into the proxy engine,
+dump the generated ruleset, and then mock running packets through the
+ruleset to determine how they would behave.
+
+The `cmd/kube-proxy/app` tests mostly only test configuration parsing,
+and we will extend them to understand the new mode and its associated
+configuration options, but there will not be many changes made there.
+
+
+
+
+##### Integration tests
+
+Kube-proxy does not have integration tests.
+
+##### e2e tests
+
+Most of the e2e testing of kube-proxy is backend-agnostic. Initially,
+we will need a separate e2e job to test the nftables mode (like we do
+with ipvs). Eventually, if nftables becomes the default, then this
+would be flipped around to having a legacy "iptables" job.
+
+The handful of e2e tests that specifically examine iptables rules will
+need to be updated to be able to work with either backend.
+
+
+
+
+### Graduation Criteria
+
+
+
+### Upgrade / Downgrade Strategy
+
+The new mode should not introduce any upgrade/downgrade problems,
+excepting that you can't downgrade or feature-disable a cluster using
+the new kube-proxy mode without switching it back to `iptables` or
+`ipvs` first.
+
+When rolling out or rolling back the feature, it should be safe to
+enable the feature gate and change the configuration at the same time,
+since nothing cares about the feature gate except for kube-proxy
+itself. Likewise, it is expected to be safe to roll out the feature in
+a live cluster, even though this will result in different proxy modes
+running on different nodes, because Kubernetes service proxying is
+defined in such a way that no node needs to be aware of the
+implementation details of the service proxy implementation on any
+other node.
+
+(However, see the notes below in [Feature Enablement and
+Rollback](#feature-enablement-and-rollback) about stale rule cleanup
+when switching modes.)
+
+### Version Skew Strategy
+
+The feature is isolated to kube-proxy and does not introduce any API
+changes, so the versions of other components do not matter.
+
+## Production Readiness Review Questionnaire
+
+
+
+### Feature Enablement and Rollback
+
+
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+The administrator must enable the feature gate to make the feature
+available, and then must run kube-proxy with the
+`--proxy-mode=nftables` flag.
+
+Kube-proxy does not delete its rules on exit (to avoid service
+interruptions when restarting/upgrading kube-proxy, or if it crashes).
+This means that when switching between proxy modes, it is necessary
+for the administrator to ensure that the rules created by the old
+proxy mode get deleted. (Failure to do so may result in stale service
+rules being left behind for an arbitrarily long time.) The simplest
+way to do this is to reboot each node when switching from one proxy
+mode to another, but it is also possible to run kube-proxy in "cleanup
+and exit" mode, eg:
+
+```
+kube-proxy --proxy-mode=iptables --cleanup
+```
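+
+After switching modes (or rebooting), an administrator can confirm
+that no stale rules remain with checks along these lines (illustrative
+only; the table and chain names are described in Design Details
+above):
+
+```
+# rules from the iptables / ipvs modes should eventually disappear
+iptables-save | grep -c KUBE-
+# the nftables mode's table should no longer exist after cleanup
+nft list table ip kube_proxy
+```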
+
+- [X] Feature gate (also fill in values in `kep.yaml`)
+ - Feature gate name: NFTablesKubeProxy
+ - Components depending on the feature gate:
+ - kube-proxy
+- [X] Other
+ - Describe the mechanism:
+ - See above
+ - Will enabling / disabling the feature require downtime of the control
+ plane?
+ - No
+ - Will enabling / disabling the feature require downtime or reprovisioning
+ of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+ - See above
+
+###### Does enabling the feature change any default behavior?
+
+Enabling the feature gate does not change any behavior; it just makes
+the `--proxy-mode=nftables` option available.
+
+Switching from `--proxy-mode=iptables` or `--proxy-mode=ipvs` to
+`--proxy-mode=nftables` will likely change some behavior, depending
+on what we decide to do about certain un-loved kube-proxy features
+like localhost nodeports.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes, though the same caveat about rebooting or running `kube-proxy
+--cleanup` applies as in the "enabling" case.
+
+Of course, if the user is rolling back, that suggests that the
+`nftables` mode was not working correctly, in which case the
+`--cleanup` option may _also_ not work correctly, so rebooting the
+node is safer.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+It should just work.
+
+###### Are there any tests for feature enablement/disablement?
+
+
+
+### Rollout, Upgrade and Rollback Planning
+
+
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+
+
+###### What specific metrics should inform a rollback?
+
+
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+
+
+### Monitoring Requirements
+
+
+
+###### How can an operator determine if the feature is in use by workloads?
+
+The feature is used by the cluster as a whole, and the operator would
+know that it was in use from looking at the cluster configuration.
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [X] Other (treat as last resort)
+ - Details: If Services still work then the feature is working
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [X] Metrics
+ - Metric names:
+ - ...
+ - Components exposing the metric:
+ - kube-proxy
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+
+
+### Dependencies
+
+
+
+###### Does this feature depend on any specific services running in the cluster?
+
+It may require a newer kernel than some current users have. It does
+not depend on anything else in the cluster.
+
+### Scalability
+
+
+
+###### Will enabling / using this feature result in any new API calls?
+
+Probably not; kube-proxy will still be using the same
+Service/EndpointSlice-monitoring code, it will just be doing different
+things locally with the results.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+It is not expected to...
+
+### Troubleshooting
+
+
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+The same way that kube-proxy currently does; updates stop being
+processed until the apiserver is available again.
+
+###### What are other known failure modes?
+
+
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+- Initial proposal: 2023-02-01
+
+## Drawbacks
+
+Adding a new officially-supported kube-proxy implementation implies
+more work for SIG Network (especially if we are not able to deprecate
+either of the existing backends soon).
+
+Replacing the default kube-proxy implementation will affect many
+users.
+
+However, doing nothing would result in a situation where, eventually,
+many users would be unable to use the default proxy implementation.
+
+## Alternatives
+
+### Continue to improve the `iptables` mode
+
+We have made many improvements to the `iptables` mode, and could make
+more. In particular, we could make the `iptables` mode use IP sets
+like the `ipvs` mode does.
+
+However, even if we could solve literally all of the performance
+problems with the `iptables` mode, there is still the looming
+deprecation issue.
+
+(See also "[The iptables kernel subsystem has unfixable performance
+problems](#the-iptables-kernel-subsystem-has-unfixable-performance-problems)".)
+
+### Fix up the `ipvs` mode
+
+Rather than implementing an entirely new `nftables` kube-proxy mode,
+we could try to fix up the existing `ipvs` mode.
+
+However, the `ipvs` mode makes extensive use of the iptables API in
+addition to the IPVS API. So while it solves the performance problems
+with the `iptables` mode, it does not address the deprecation issue.
+So we would at least have to rewrite it to be IPVS+nftables rather
+than IPVS+iptables.
+
+(See also "[The ipvs mode of kube-proxy will not save
+us](#the--mode-of-kube-proxy-will-not-save-us)".)
+
+### Use an existing nftables-based kube-proxy implementation
+
+Discussed in [Notes/Constraints/Caveats](#notesconstraintscaveats).
+
+### Create an eBPF-based proxy implementation
+
+Another possibility would be to try to replace the `iptables` and
+`ipvs` modes with an eBPF-based proxy backend, instead of an
+nftables one. eBPF is very trendy, but it is also notoriously
+difficult to work with.
+
+One problem with this approach is that the APIs to access conntrack
+information from eBPF programs only exist in the very newest kernels.
+In particular, the API for NATting a connection from eBPF was only
+added in the recently-released 6.1 kernel. It will be a long time
+before a majority of Kubernetes users have a kernel new enough that we
+can depend on that API.
+
+Thus, an eBPF-based kube-proxy implementation would initially need a
+number of workarounds for missing functionality, adding to its
+complexity (and potentially forcing architectural choices that would
+not otherwise be necessary, to support the workarounds).
+
+One interesting eBPF-based approach for service proxying is to use
+eBPF to intercept the `connect()` call in pods, and rewrite the
+destination IP before the packets are even sent. In this case, eBPF
+conntrack support is not needed (though it would still be needed for
+non-local service connections, such as connections via NodePorts). One
+nice feature of this approach is that it integrates well with possible
+future "multi-network Service" ideas, in which a pod might connect to
+a service IP that resolves to an IP on a secondary network which is
+only reachable by certain pods. In the case of a "normal" service
+proxy that does destination IP rewriting in the host network
+namespace, this would result in a packet that was undeliverable
+(because the host network namespace has no route to the isolated
+secondary pod network), but a service proxy that does `connect()`-time
+rewriting would rewrite the connection before it ever left the pod
+network namespace, allowing the connection to proceed.
+
+The multi-network effort is still in the very early stages, and it is
+not clear that it will actually adopt a model of multi-network
+Services that works this way. (It is also _possible_ to make such a
+model work with a mostly-host-network-based proxy implementation; it's
+just more complicated.)
+
diff --git a/keps/sig-network/3866-nftables-proxy/kep.yaml b/keps/sig-network/3866-nftables-proxy/kep.yaml
new file mode 100644
index 000000000000..0549e182a374
--- /dev/null
+++ b/keps/sig-network/3866-nftables-proxy/kep.yaml
@@ -0,0 +1,39 @@
+title: An nftables-based kube-proxy backend
+kep-number: 3866
+authors:
+ - "@danwinship"
+owning-sig: sig-network
+status: provisional
+creation-date: 2023-02-01
+reviewers:
+ - "@thockin"
+ - "@dcbw"
+ - "@aojea"
+approvers:
+ - "@thockin"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.27"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+ alpha: "v1.28"
+ beta: "v1.30"
+ stable: "v1.32"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+ - name: NFTablesKubeProxy
+ components:
+ - kube-proxy
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+ - ...