From ffc09075aaef170cd05171be0da477c9e5181801 Mon Sep 17 00:00:00 2001
From: Dan Winship
Date: Thu, 29 Aug 2024 10:02:35 -0400
Subject: [PATCH] Document nftables backends for portmap and ipmasq

Also belatedly remove the recommendation of using
`"externalSetMarkChain": "KUBE-MARK-MASQ"`.

Signed-off-by: Dan Winship
---
 content/plugins/current/main/bridge.md  |   1 +
 content/plugins/current/main/ptp.md     |   1 +
 content/plugins/current/meta/portmap.md | 131 +++++++++++++++++++-----
 3 files changed, 108 insertions(+), 25 deletions(-)

diff --git a/content/plugins/current/main/bridge.md b/content/plugins/current/main/bridge.md
index f4187ea..d4a1b14 100644
--- a/content/plugins/current/main/bridge.md
+++ b/content/plugins/current/main/bridge.md
@@ -83,6 +83,7 @@ If the bridge is missing, the plugin will create one on first use and, if gatewa
 * `isDefaultGateway` (boolean, optional): Sets isGateway to true and makes the assigned IP the default route. Defaults to false.
 * `forceAddress` (boolean, optional): Indicates if a new IP address should be set if the previous value has been changed. Defaults to false.
 * `ipMasq` (boolean, optional): set up IP Masquerade on the host for traffic originating from this network and destined outside of it. Defaults to false.
+* `ipMasqBackend` (string, optional): IP masquerading implementation to use when `ipMasq` is true. Can be "iptables" or "nftables". Defaults to "iptables", unless only "nftables" is available.
 * `mtu` (integer, optional): explicitly set MTU to the specified value. Defaults to the value chosen by the kernel.
 * `hairpinMode` (boolean, optional): set hairpin mode for interfaces on the bridge. Defaults to false.
 * `ipam` (dictionary, required): IPAM configuration to be used for this network. For L2-only network, create empty dictionary.
diff --git a/content/plugins/current/main/ptp.md b/content/plugins/current/main/ptp.md
index 4805a13..9940ea9 100644
--- a/content/plugins/current/main/ptp.md
+++ b/content/plugins/current/main/ptp.md
@@ -34,6 +34,7 @@ The traffic of the container interface will be routed through the interface of t
 * `name` (string, required): the name of the network
 * `type` (string, required): "ptp"
 * `ipMasq` (boolean, optional): set up IP Masquerade on the host for traffic originating from ip of this network and destined outside of this network. Defaults to false.
+* `ipMasqBackend` (string, optional): IP masquerading implementation to use when `ipMasq` is true. Can be "iptables" or "nftables". Defaults to "iptables", unless only "nftables" is available.
 * `mtu` (integer, optional): explicitly set MTU to the specified value. Defaults to value chosen by the kernel.
 * `ipam` (dictionary, required): IPAM configuration to be used for this network.
 * `dns` (dictionary, optional): DNS information to return as described in the [Result](https://github.com/containernetworking/cni/blob/master/SPEC.md#result).
diff --git a/content/plugins/current/meta/portmap.md b/content/plugins/current/meta/portmap.md
index 1e16646..b790322 100644
--- a/content/plugins/current/meta/portmap.md
+++ b/content/plugins/current/meta/portmap.md
@@ -16,11 +16,12 @@ the following configuration options:
 * `snat` - boolean, default true. If true or omitted, set up the SNAT chains
 * `masqAll` - boolean, default false. If false or omitted, the `snat` rule set up on loopback & hairpin traffic, else will `snat` all source traffic.
-* `markMasqBit` - int, (0-31), default 13. The mark bit to use for masquerading (see section SNAT). Cannot be set when `externalSetMarkChain` is used.
-* `externalSetMarkChain` - string, default nil. If you already have a Masquerade mark chain (e.g. Kubernetes), specify it here. This will use that instead of creating a separate chain. When this is set, `markMasqBit` must be unspecified.
-* `conditionsV4`, `conditionsV6` - array of strings. A list of arbitrary `iptables`
+* `markMasqBit` - int, (0-31), default 13. The mark bit to use for masquerading (see section SNAT). Cannot be set when `externalSetMarkChain` is used. (Only used by the "iptables" backend.)
+* `externalSetMarkChain` - string, default nil. If you already have a Masquerade mark chain (e.g. Kubernetes), specify it here. This will use that instead of creating a separate chain. When this is set, `markMasqBit` must be unspecified. (Only used by the "iptables" backend.)
+* `conditionsV4`, `conditionsV6` - array of strings. A list of arbitrary `iptables` or `nft`
 matches to add to the per-container rule. This may be useful if you wish to exclude specific IPs from port-mapping
+* `backend` - string. The backend ("iptables" or "nftables") to use for rules. Defaults to "iptables", unless iptables is unavailable, or nftables-specific configuration is provided (e.g., in `conditionsV4`).
 
 The plugin expects to receive the actual list of port mappings via the
 `portMappings` [capability argument](https://github.com/containernetworking/cni/blob/master/CONVENTIONS.md)
@@ -49,16 +50,23 @@ look like:
     {
       "type": "portmap",
       "capabilities": {"portMappings": true},
-      "externalSetMarkChain": "KUBE-MARK-MASQ"
     }
   ]
 }
 ```
 
+(Note that `"externalSetMarkChain": "KUBE-MARK-MASQ"` is [not
+recommended] with recent releases of Kubernetes, since that chain is
+considered private to kube-proxy, and may change in the future (and
+does not exist when using kube-proxy in "nftables" mode).)
+
+[not recommended]: https://kubernetes.io/blog/2022/09/07/iptables-chains-not-api/
+
 A configuration file with all options set:
 ```json
 {
   "type": "portmap",
+  "backend": "iptables",
   "capabilities": {"portMappings": true},
   "snat": true,
   "markMasqBit": 13,
@@ -68,9 +76,19 @@ A configuration file with all options set:
 }
 ```
 
+Or using the "nftables" backend:
+```json
+{
+  "type": "portmap",
+  "backend": "nftables",
+  "capabilities": {"portMappings": true},
+  "snat": true,
+  "conditionsV4": ["ip", "daddr", "!=", "192.0.2.0/24"],
+  "conditionsV6": ["ip6", "daddr", "!=", "fc00::/7"]
+}
+```
 
-
-## Rule structure
+## Rule structure (iptables)
 The plugin sets up two sequences of chains and rules - one "primary" DNAT
 sequence to rewrite the destination, and one additional SNAT sequence that
 will masquerade traffic as needed.
@@ -101,42 +119,105 @@ rules look like this:
 - `-p tcp -s 127.0.0.1 --dport 8043 -j CNI-HOSTPORT-SETMARK`
 - `-p tcp --dport 8043 -j DNAT --to-destination 172.16.30.2:443`
 
-New connections to the host will have to traverse every rule, so large numbers
-of port forwards may have a performance impact. This won't affect established
-connections, just the first packet.
-
 ### SNAT (Masquerade)
 Some packets also need to have the source address rewritten:
 * connections from localhost
 * Hairpin traffic back to the container.
-* Plugins whose traffic does not go through the default net namespace e.g., ipvlan,macvlan,etc. (need `masqAll` option)
 
 In the DNAT chain, a bit is set on the mark for packets that need snat. This
 chain performs that masquerading. By default, bit 13 is set, but this is
 configurable. If you are using other tools that also use the iptables mark,
 you should make sure this doesn't conflict.
 
-Some container runtimes, most notably Kubernetes, already have a set of rules
-for masquerading when a specific mark bit is set. If so enabled, the plugin
-will use that chain instead.
-
 `POSTROUTING`:
 - `-j CNI-HOSTPORT-MASQ`
 
 `CNI-HOSTPORT-MASQ`:
 - `--mark 0x2000 -j MASQUERADE`
 
-Because MASQUERADE happens in POSTROUTING, it means that packets with source ip
-127.0.0.1 need to first pass a routing boundary before being masqueraded. By
-default, that is not allowed in Linux. So, the plugin needs to enable the sysctl
-`net.ipv4.conf.IFNAME.route_localnet`, where IFNAME is the name of the host-side
-interface that routes traffic to the container.
+## Rule structure (nftables)
+The organization is slightly simpler than in the iptables case. All
+rules are created in the `cni_hostport` table (of the `ip` or `ip6`
+family, as appropriate).
+
+### DNAT
+The DNAT rule rewrites the destination port and address of new connections.
+DNAT rules are added to the `hostports` or `hostip_hostports` chains
+of the `cni_hostport` table, depending on whether the mapping is for
+all host IPs or only for a single host IP.
+
+So, if a single container exists on IP 172.16.30.2/24 with ports 8080
+and 8043 on the host forwarded to ports 80 and 443 in the container,
+the rules look like this:
 
-There is no equivalent to `route_localnet` for ipv6, so connections to ::1
-will not be portmapped for ipv6. If you need port forwarding from localhost,
-your container must have an ipv4 address.
+```
+table ip cni_hostport {
+        comment "CNI portmap plugin";
+
+        chain prerouting {
+                type nat hook prerouting priority dstnat;
+                jump hostports
+        }
+        chain output {
+                type nat hook output priority dstnat;
+                fib daddr type local jump hostports
+        }
+
+        chain hostports {
+                ip protocol tcp th dport 8080 dnat ip addr . port to 172.16.30.2 . 80
+                ip protocol tcp th dport 8043 dnat ip addr . port to 172.16.30.2 . 443
+        }
+}
+```
+
+### SNAT (Masquerade)
+Some packets also need to have the source address rewritten:
+* connections from localhost
+* Hairpin traffic back to the container.
+* Plugins whose traffic does not go through the default net namespace, e.g., ipvlan, macvlan, etc. (need `masqAll` option)
+
+Unlike the iptables backend, the nftables backend figures out the
+packets that need to be masqueraded without using the packet mark or
+an external chain. Continuing the above example:
+
+```
+table ip cni_hostport {
+        comment "CNI portmap plugin";
+
+        chain masquerading {
+                type nat hook postrouting priority srcnat;
+                # Hairpin traffic
+                ip saddr 172.16.30.2 ip daddr 172.16.30.2 masquerade
+                # Localhost hostports
+                ip saddr 127.0.0.1 ip daddr 172.16.30.2 masquerade
+        }
+}
+```
 
 ## Known issues
-- ipsets could improve efficiency
-- forwarding from localhost does not work with ipv6.
+
+### Efficiency
+
+Each new connection to the host will have to traverse every rule in
+the chain, so large numbers of port forwards may have a performance
+impact. (This won't affect established connections, just the first
+packet.)
+
+In theory, it should be possible to use nftables sets (or ipsets with
+iptables) to address this problem, but for complicated technical
+reasons, this doesn't quite work.
+
+### Localhost hostports
+
+Because MASQUERADE happens in POSTROUTING, packets with source ip
+127.0.0.1 need to first pass a routing boundary before being
+masqueraded. By default, that is not allowed in Linux. So, the plugin
+needs to enable the sysctl `net.ipv4.conf.IFNAME.route_localnet`,
+where IFNAME is the name of the host-side interface that routes
+traffic to the container.
+
+There is no equivalent to `route_localnet` for ipv6, so connections to
+::1 will not be portmapped for ipv6. If you need port forwarding from
+localhost, your container must have an ipv4 address.
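
Note on the option constraints documented above: the default `markMasqBit` of 13 and the `--mark 0x2000` rule in the iptables SNAT section are the same value, since bit 13 is `1 << 13`. The following Python sketch is only an illustration of the documented constraints (`backend` is "iptables" or "nftables", `markMasqBit` is in 0-31 and cannot be combined with `externalSetMarkChain`). The helper names `mark_mask` and `check_portmap_conf` are hypothetical and are not part of the plugin, which may validate its configuration differently.

```python
def mark_mask(bit: int = 13) -> int:
    """Return the packet-mark value for a `markMasqBit` setting (hypothetical helper)."""
    if not 0 <= bit <= 31:
        raise ValueError("markMasqBit must be in the range 0-31")
    return 1 << bit


def check_portmap_conf(conf: dict) -> None:
    """Raise ValueError if `conf` breaks a constraint stated in the option list.

    This is an assumption-based sketch, not the plugin's own validation code.
    """
    backend = conf.get("backend")
    if backend is not None and backend not in ("iptables", "nftables"):
        raise ValueError(f"unknown backend: {backend!r}")
    if "markMasqBit" in conf and "externalSetMarkChain" in conf:
        raise ValueError("markMasqBit cannot be set when externalSetMarkChain is used")
    if "markMasqBit" in conf:
        mark_mask(conf["markMasqBit"])  # range check only


# Bit 13 corresponds to the 0x2000 mark used in the CNI-HOSTPORT-MASQ rule.
print(hex(mark_mask(13)))
```

For instance, passing `{"type": "portmap", "backend": "backend"}` to `check_portmap_conf` raises `ValueError`, while the same configuration with `"backend": "nftables"` passes.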