- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Kubelet currently creates some IPTables chains at startup. In the past it actually used some of them, but with the removal of dockershim it no longer does. Additionally, for historical reasons, it creates some IPTables chains that are duplicates of chains also created by kube-proxy, and some that are only used by kube-proxy, but which kube-proxy requires kubelet to create on its behalf. (The initial comment in kubernetes #82125 has more details on the history of how we got to where we are now, or at least to where we were before dockershim was removed.)
We should clean this up so that kubelet no longer creates unnecessary IPTables chains, and so that kube-proxy creates all of the IPTables chains it needs.
- Remove most IPTables chain management from kubelet.
- Remove the `--iptables-drop-bit` and `--iptables-masquerade-bit` arguments from kubelet.
- Kubelet will continue to create at least one IPTables chain as a hint to `iptables-wrapper` and other components that need to know whether the system is using `legacy` or `nft` iptables.
- For security-backward-compatibility, kubelet will continue (for now) to create a rule to block "martian packets" that would otherwise be allowed by `route_localnet` (even though kubelet itself does not set `route_localnet` or expect it to be set). (See discussion in Martian Packet Blocking below.)
- Update kube-proxy to no longer rely on chains created by kubelet:
  - Rewrite packet-dropping rules to not depend on a `KUBE-MARK-DROP` chain created by kubelet.
  - Kube-proxy should create its own copy of kubelet's "martian packet" fix, when running in a mode that needs that fix.
- Ensure that the proxy implementations in `kpng` get updated similarly.
- Document for the future that Kubernetes's IPTables chains (other than the IPTables mode "hint" chain) are meant for its internal use only, do not constitute API, and should not be relied on by external components.
- Ensure that users and third-party software that make assumptions about kubelet's and kube-proxy's IPTables rules have time to get updated for the changes in rule creation.
- Changing `KUBE-MARK-MASQ` and `KUBE-MARK-DROP` to use the connection mark or connection label rather than the packet mark (as discussed in kubernetes #78948).
Kubelet currently creates five IPTables chains: two to help with masquerading packets, two to help with dropping packets, and one for purely-internal purposes (a rough sketch of the rules in these chains appears after the list):

- `KUBE-MARK-MASQ` and `KUBE-POSTROUTING`

  `KUBE-MARK-MASQ` marks packets as needing to be masqueraded (by setting a configurable bit of the "packet mark"). `KUBE-POSTROUTING` checks the packet mark and calls `-j MASQUERADE` on the packets that were previously marked for masquerading.

  These chains were formerly used for HostPort handling in dockershim, but are no longer used by kubelet. Kube-proxy (in iptables or ipvs mode) creates identical copies of both of these chains, which it uses for service handling.

- `KUBE-MARK-DROP` and `KUBE-FIREWALL`

  `KUBE-MARK-DROP` marks packets as needing to be dropped by setting a different configurable bit on the packet mark. `KUBE-FIREWALL` checks the packet mark and calls `-j DROP` on the packets that were previously marked for dropping.

  These chains have always been created by kubelet, but were only ever used by kube-proxy.

  (`KUBE-FIREWALL` also contains a rule blocking certain "martian packets"; see below.)

- `KUBE-KUBELET-CANARY`, which is used by the `utiliptables.Monitor` functionality to notice when the iptables rules have been flushed and kubelet needs to recreate its rules.
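The sketch below shows the general shape of these rules, using the default mark bits (masquerade bit 14 = `0x4000`, drop bit 15 = `0x8000`). It is illustrative only; the real rules also include comments and an extra `KUBE-POSTROUTING` step that clears the mark before masquerading.

```sh
# Sketch only; not the literal kubelet/kube-proxy rules.
iptables -t nat    -A KUBE-MARK-MASQ   -j MARK --set-xmark 0x4000/0x4000
iptables -t nat    -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
iptables -t nat    -A KUBE-POSTROUTING -j MASQUERADE

iptables -t nat    -A KUBE-MARK-DROP   -j MARK --set-xmark 0x8000/0x8000
iptables -t filter -A KUBE-FIREWALL    -m mark --mark 0x8000/0x8000 -j DROP
```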
The reason that the `MARK` chains exist is that most of kube-proxy's service-processing logic occurs in subchains of the `OUTPUT` and `PREROUTING` chains of the `nat` table (because those are the only places you can call the `DNAT` target from, to redirect packets from a service IP to a pod IP), but neither masquerading nor dropping packets can occur at that point in IPTables processing: dropping packets can only occur from chains in the `filter` table, while masquerading can only occur from the `POSTROUTING` chain of the `nat` table (because the kernel can't know what IP to masquerade the packet to until after it has decided what interface it will send it out on, which necessarily can't happen until after it knows whether it is going to DNAT it). To work around this, when Kubernetes wants to masquerade or drop a packet, it just marks it as needing to be masqueraded or dropped later, and then one of the later chains handles the actual masquerading/dropping.
This approach was not necessary; kube-proxy could have just duplicated its matching logic between multiple IPTables chains, eg, having a rule saying "a packet sent to `192.168.0.11:80` should be redirected to `10.0.20.18:8080`" in one chain and a rule saying "a packet whose original pre-NAT destination was `192.168.0.11:80` should be masqueraded" in another.

But using `KUBE-MARK-MASQ` allows kube-proxy to keep its logic more efficient, compact, and well-organized than it would be otherwise; it can just create a pair of adjacent rules saying "a packet sent to `192.168.0.11:80` should be redirected to `10.0.20.18:8080`, and should also be masqueraded".
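As a hypothetical sketch (not kube-proxy's literal output), such a pair of adjacent rules might look like the following, using the example addresses above; `EXAMPLE-CHAIN` is a made-up chain name, and `KUBE-MARK-MASQ` is assumed to already exist.

```sh
# Mark the packet for later masquerading, then DNAT it, in two adjacent
# rules that share the same match.
iptables -t nat -A EXAMPLE-CHAIN -d 192.168.0.11/32 -p tcp --dport 80 -j KUBE-MARK-MASQ
iptables -t nat -A EXAMPLE-CHAIN -d 192.168.0.11/32 -p tcp --dport 80 -j DNAT --to-destination 10.0.20.18:8080
```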
In theory, `KUBE-MARK-DROP` is useful for the same reason as `KUBE-MARK-MASQ`, although in practice, it turns out to be unnecessary. Kube-proxy currently drops packets in two cases:

- When using the `LoadBalancerSourceRanges` feature on a service, any packet to that service that comes from outside the valid source IP ranges is dropped.
- When using `Local` external traffic policy, if a connection to a service arrives on a node that has no local endpoints for that service, it is dropped.
In the `LoadBalancerSourceRanges` case, it would not be difficult to avoid using `KUBE-MARK-DROP`. The current rule logic creates a per-service `KUBE-FW-` chain that checks against each allowed source range and forwards traffic from those ranges to the `KUBE-SVC-` or `KUBE-XLB-` chain for the service. At the end of the chain, it calls `KUBE-MARK-DROP` on any unmatched packets.

To do this without `KUBE-MARK-DROP`, we simply remove the `KUBE-MARK-DROP` call at the end of the `KUBE-FW-` chain, and add a new rule in the `filter` table, matching on the load balancer IP and port, and calling `-j DROP`. Any traffic that matched one of the allowed source ranges in the `KUBE-FW-` chain would have been DNAT'ed to a pod IP by this point, so any packet that still has the original load balancer IP as its destination when it reaches the `filter` table must be a packet that was not accepted, so we can drop it.
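For illustration, the replacement rule might look roughly like this; it is a sketch, not the exact rule kube-proxy generates, with `EXAMPLE-FIREWALL` as a made-up filter-table chain and `192.168.0.11:80` as a made-up load balancer IP/port.

```sh
# Any packet that reaches the filter table still addressed to the load
# balancer IP/port was not accepted (and DNAT'ed) by the KUBE-FW- chain,
# so drop it here.
iptables -t filter -A EXAMPLE-FIREWALL -d 192.168.0.11/32 -p tcp --dport 80 -j DROP
```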
In the external traffic policy case, things are even simpler; we can just move the existing traffic-dropping rule from the `nat` table to the `filter` table and call `DROP` directly rather than `KUBE-MARK-DROP`, and we will get exactly the same effect. (Indeed, we already do it that way when calling `REJECT`, so this will actually make things more consistent.)
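Again as a rough sketch (with a made-up NodePort and chain name, not kube-proxy's exact output), the moved rule would drop directly in the `filter` table:

```sh
# With Local external traffic policy and no local endpoints, drop the
# traffic in the filter table instead of marking it in the nat table.
iptables -t filter -A EXAMPLE-EXTERNAL-SERVICES -m addrtype --dst-type LOCAL -p tcp --dport 30080 -j DROP
```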
Some users or third-party software may expect that certain IPTables chains always exist on nodes in a Kubernetes cluster.
For example, the CNI portmap plugin provides an `"externalSetMarkChain"` option that is explicitly intended to be used with `"KUBE-MARK-MASQ"`, to make the plugin use Kubernetes's iptables rules instead of creating its own. Although in most cases kube-proxy will also create `KUBE-MARK-MASQ`, kube-proxy may not start up until after some other components, and some users may be running a network plugin that has its own proxy implementation rather than using kube-proxy. Thus, users may end up in a situation where some software is trying to use a `KUBE-MARK-MASQ` chain that does not exist.

Most of these external components have simply copied kube-proxy's use of `KUBE-MARK-MASQ` without understanding why it is used that way in kube-proxy. But because they are generally doing much less IPTables processing than kube-proxy does, they could fairly easily be rewritten to not use the packet mark, and just have slightly-redundant `PREROUTING` and `POSTROUTING` rules instead.
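For example, a simple hostport-style component could do something like the following instead of using `KUBE-MARK-MASQ`. This is a hypothetical sketch with made-up addresses and ports; a real component would scope the `POSTROUTING` rule more narrowly, eg, to hairpin traffic.

```sh
# Repeat the match in PREROUTING and POSTROUTING instead of sharing state
# through the packet mark.
iptables -t nat -A PREROUTING  -p tcp --dport 8080 -j DNAT --to-destination 10.0.20.18:8080
iptables -t nat -A POSTROUTING -p tcp -d 10.0.20.18/32 --dport 8080 -j MASQUERADE
```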
Another problem is the iptables-wrapper script for detecting whether the system is using iptables-legacy or iptables-nft. Currently it assumes that kubelet will always have created some iptables chains before the wrapper script runs, and so it can decide which iptables backend to use based on which one kubelet used. If kubelet stopped creating chains entirely, then iptables-wrapper might find itself in a situation where there were no chains in either set of tables.
To help this script (and other similar components), we should have kubelet continue to always create at least one IPTables chain, with a well-known name.
In `iptables` mode, kube-proxy sets the `net.ipv4.conf.all.route_localnet` sysctl, so that it is possible to connect to NodePort services via 127.0.0.1. (This is somewhat controversial, and doesn't work under IPv6, but that's a story for another KEP.) This creates a security hole (kubernetes #90259) and so we add an iptables rule to block the insecure case while allowing the "useful" case (kubernetes #91569). In keeping with historical confusion around iptables rule ownership, this rule was added to kubelet even though the behavior it is fixing is in kube-proxy.
Since kube-proxy is the one that is creating this problem, it ought to be the one creating the fix for it as well, and we should make kube-proxy create this filtering rule itself.
However, it is possible that users of other proxy implementations (or hostport implementations) may be setting `route_localnet` based on kube-proxy's example and may depend on the security provided by the existing kubelet rule. Thus, we should probably have kubelet continue to create this rule as well, at least for a while. (The rule is idempotent, so it doesn't even necessarily require keeping kubelet's definition and kube-proxy's in sync.) We can consider removing it again at some point in the future.
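For reference, the rule in question is roughly of this shape (a sketch, not necessarily the exact rule):

```sh
# Drop packets addressed to loopback that neither came from loopback nor
# belong to a DNAT'ed (or established/related) connection.
iptables -t filter -A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 \
    -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
```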
We cannot remove `KUBE-MARK-DROP` from kubelet until we know that kubelet cannot possibly be running against a version of kube-proxy that requires it.
Thus, the process will be:
Kubelet will begin creating a new `KUBE-IPTABLES-HINT` chain in the `mangle` table, to be used as a hint to external components about which iptables API the system is using. (We use the `mangle` table, not `nat` or `filter`, because given how the iptables API works, even just checking for the existence of a chain is slow if the table it is in is very large.)

(This happened in 1.24: kubernetes #109059.)
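An external component can then pick an iptables backend by checking which one has the hint chain, along these lines. This is an illustrative sketch of the kind of logic `iptables-wrapper` uses, not its exact script.

```sh
# KUBE-IPTABLES-HINT lives in the mangle table, which is small, so this
# check is cheap even on nodes with many iptables rules.
if iptables-nft -t mangle -S KUBE-IPTABLES-HINT >/dev/null 2>&1; then
    mode=nft
elif iptables-legacy -t mangle -S KUBE-IPTABLES-HINT >/dev/null 2>&1; then
    mode=legacy
else
    mode=unknown   # fall back to some other heuristic
fi
```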
To help ensure that external components have time to remove any dependency on `KUBE-MARK-DROP` well before this feature goes Beta, we will document the upcoming changes in a blog post and the next set of release notes. (And perhaps other places?)

We will also document the new `KUBE-IPTABLES-HINT` chain and its intended use, as well as the best practices for detecting the system iptables mode in previous releases.
Kube-proxy will be updated to not use `KUBE-MARK-DROP`, as described above. (This change is unconditional; it is not feature-gated, because it is more of a cleanup/bugfix than a new feature.) We should also ensure that kpng gets updated.
Kubelet's behavior will not change by default, but if you enable the `IPTablesOwnershipCleanup` feature gate, then:

- It will stop creating `KUBE-MARK-DROP`, `KUBE-MARK-MASQ`, `KUBE-POSTROUTING`, and `KUBE-KUBELET-CANARY`. (`KUBE-FIREWALL` will remain, but will only be used for the "martian packet" rule, not for processing the drop mark.)
- It will warn that the `--iptables-masquerade-bit` and `--iptables-drop-bit` flags are deprecated and have no effect.

(Importantly, those kubelet flags will not claim to be deprecated when the feature gate is disabled, because in that case, if the user was previously overriding `--iptables-masquerade-bit`, it is important that they keep overriding it in both kubelet and kube-proxy for as long as both are redundantly creating the chains.)
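For testing the Alpha behavior, the gate can be enabled explicitly, for example via the kubelet command line (it can also be set through the `featureGates` field of the kubelet configuration); this is just the standard feature gate mechanism, shown here as a sketch:

```sh
# Enable the (Alpha) feature gate on the kubelet command line.
# (Other kubelet flags omitted.)
kubelet --feature-gates=IPTablesOwnershipCleanup=true ...
```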
The behavior is the same as in Alpha, except that the feature gate is enabled by default.
As long as we wait 2 releases between Alpha and Beta, then it is guaranteed that when the user upgrades to Beta, the currently-running kube-proxy will be one that does not require the `KUBE-MARK-DROP` chain to exist, so the upgrade will work correctly even if nodes end up with an old kube-proxy and a new kubelet at some point.
The feature gate is now locked in the enabled state.
Most of the IPTables-handling code in kubelet can be removed (along with the warnings in kube-proxy about keeping that code in sync with the kubelet code).
Kubelet will now unconditionally warn that the `--iptables-masquerade-bit` and `--iptables-drop-bit` flags are deprecated and have no effect, and that they will be going away soon.
The deprecated kubelet flags can be removed.
We may eventually remove the "martian packet" blocking rule from Kubelet, but there is no specific plan for this at this time.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
We discovered a while back that our existing e2e tests do not properly test the cases that are expected to result in dropped packets. (The tests still pass even when we don't drop the packets: kubernetes #85572.) However, attempting to fix this resulted in the discovery that there is not any easy way to test these rules. In the `LoadBalancerSourceRanges` case, the drop rule will never get hit on GCP (unless there is a bug in the GCP CloudProvider or the cloud load balancers). (The drop rule can get hit in a bare-metal environment when using a third-party load balancer like MetalLB, but we have no way of testing this in Kubernetes CI.) In the traffic policy case, the drop rule is needed during periods when kube-proxy and the cloud load balancers are out of sync, but there is no good way to reliably trigger this race condition for e2e testing purposes.
However, we can manually test the new rules (eg, by killing kube-proxy before updating a service to ensure that kube-proxy and the cloud load balancer will remain out of sync), and then once we are satisfied that the rules do what we expect them to do, we can use the unit tests to ensure that we continue to generate the same (or functionally equivalent) rules in the future.
The unit tests in `pkg/proxy/iptables/proxier_test.go` ensure that we are generating the iptables rules that we expect to, and the new tests already added in 1.25 allow us to assert specifically that particular packets would behave in particular ways.

Thus, for example, although we can't reproduce the race conditions mentioned above in an e2e environment, we can at least confirm that if a packet arrived on a node which it shouldn't have because of this race condition, the iptables rules we generate would route it to a `DROP` rule, rather than delivering or rejecting it.
- `pkg/proxy/iptables`: 06-21 - 65.1%
There are no existing integration tests of the proxy code and no plans to add any.
As discussed above, it is not possible to test this functionality via e2e tests in our CI environment.
- Tests and code are implemented as described above
- Documentation of the upcoming changes in appropriate places.

- Two releases have passed since Alpha
- Running with the feature gate enabled causes no problems with any core kubernetes components.
- The SIG is not aware of any problems with third-party components that would justify delaying Beta. (For example, Beta might be delayed if a commonly-used third-party component breaks badly when the feature gate is enabled, but the SIG might choose not to delay Beta for a bug involving a component which is not widely used, or which can be worked around by upgrading to a newer version of that component, or by changing its configuration.)

- At least one release has passed since Beta
- The SIG is not aware of any problems with third-party components that would justify delaying GA. (For example, if the feature breaks a third-party component which is no longer maintained and not likely to ever be fixed, then there is no point in delaying GA because of it.)
Other than version skew (below), there are no upgrade / downgrade issues; kube-proxy recreates all of the rules it uses from scratch on startup, so for the purposes of this KEP there is no real difference between starting a fresh kube-proxy and upgrading an existing one.
As long as we wait two releases between Alpha and Beta, then all allowed skewed versions of kubelet and kube-proxy will be compatible with each other.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `IPTablesOwnershipCleanup`
  - Components depending on the feature gate:
    - kubelet
When the feature gate is enabled, kubelet will no longer create the IPTables chains/rules that it used to. This may cause problems with third-party components in Alpha but these problems are expected to be ironed out before moving to Beta.
Yes
Nothing unexpected
No... there is no real difference between enabling the feature in an existing cluster vs creating a cluster where it was always enabled.
The most likely cause of a rollout failure would be a third-party component that depended on one of the no-longer-existing IPTables chains; most likely this would be a CNI plugin (either the default network plugin or a chained plugin) or some other networking-related component (NetworkPolicy implementation, service mesh, etc).
It is impossible to predict exactly how this third-party component would fail in this case, but it would likely impact already running workloads.
If the default network plugin (or plugin chain) depends on the missing iptables chains, it is possible that all `CNI_ADD` calls would fail and it would become impossible to start new pods, in which case kubelet's `started_pods_errors_total` would start to climb. However, "impossible to start new pods" would likely be noticed quickly without metrics anyway...
For the most part, since failures would likely be in third-party components, it would be the metrics of those third-party components that would be relevant to diagnosing the problem. Since the problem is likely to manifest in the form of iptables calls failing because they reference non-existent chains, a metric for "number of iptables errors" or "time since last successful iptables update" might be useful in diagnosing problems related to this feature. (However, it is also quite possible that the third-party components in question would have no relevant metrics, and errors would be exposed only via log messages.)
Yes.
When considering only core kubernetes components, and only alpha-or-later releases, upgrade/downgrade and enablement/disablement create no additional complications beyond clean installs; kube-proxy simply doesn't care about the additional rules that kubelet may or may not be creating any more.
Upgrades from pre-alpha to beta-or-later or downgrades from beta-or-later to pre-alpha are not supported, and for this reason we waited 2 releases after alpha to go to beta.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
This KEP will eventually remove two kubelet command-line arguments, but not until after the feature is GA.
The feature is not "used by workloads"; when enabled, it is always in effect and affects the cluster as a whole.
- Other (treat as last resort)
  - Details: The feature is not supposed to have any externally-visible effect. If anything is not working, it is likely to be a third-party component, so it is impossible to say what a failure might look like.
N/A / Unchanged from previous releases. The expected end result of this enhancement is that no externally-measurable behavior changes.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Other (treat as last resort)
  - Details: N/A / Unchanged from previous releases.
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
This does not change kubelet or kube-proxy's dependencies.
No
No
No
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
This KEP does not change the way that either kubelet or kube-proxy reacts to such a scenario.
As above, the only expected failure mode is that a third-party component expects kubelet to have created the chains that it no longer does, in which case the third-party component will react in some way we cannot predict.
N/A
- Initial proposal: 2022-01-23
- Updated: 2022-03-27, 2022-04-29
- Merged as `implementable`: 2022-06-10
- Updated: 2022-07-26 (feature gate rename)
- Alpha release (1.25): 2022-08-23
The primary drawback is the risk of third-party breakage. But the current state, where kubelet creates IPTables chains that it does not use, and kube-proxy depends on someone else creating IPTables chains that it does use, is clearly wrong.
The description of the Alpha stage above suggests that we will change the code in kube-proxy without feature-gating it, because the change to kube-proxy (dropping packets directly rather than depending on `KUBE-MARK-DROP`) is more of a refactoring/bugfix than it is a "feature". However, it would be possible to feature-gate this instead.
This would extend the rollout by a few more releases, because we would not be able to move the kubelet feature gate to Beta until 2 releases after the kube-proxy feature gate went to Beta.
Rather than dropping `KUBE-MARK-DROP`, we could just make kube-proxy create it itself rather than depending on kubelet to create it. This would potentially be slightly more third-party-component-friendly (if there are third-party components using `KUBE-MARK-DROP`, which it's not clear that there are).

This is a more complicated approach though, because moving `KUBE-MARK-DROP` requires also moving the ability to override the choice of mark bit, and if a user is overriding it in kubelet, they must override it to the same value in kube-proxy or things will break. So we need to warn users about this in advance before we can start changing things, and we need separate feature gates moving out of sync for kubelet and kube-proxy. The end result is that this would take two release cycles longer than the no-`KUBE-MARK-DROP` approach:
- First release

  - Kubelet warns users who pass `--iptables-drop-bit` that they should also pass the same option to kube-proxy.

    If the `KubeletIPTablesCleanup` feature gate is enabled (which it is not by default), kubelet does not create any iptables chains.

  - Kube-proxy is updated to accept `--iptables-drop-bit`. If the `KubeProxyIPTablesCleanup` feature gate is disabled (which it is by default), kube-proxy mostly ignores the new flag, but it does compare the `KUBE-MARK-DROP` rule that kubelet created against the rule that it would have created, and warns the user if they don't match. (That is, it warns the user if they are overriding `--iptables-drop-bit` in kubelet but not in kube-proxy.)

    When the feature gate is enabled, kube-proxy creates `KUBE-MARK-DROP`, etc, itself (using exactly the same rules kubelet would, such that it is possible to run with `KubeProxyIPTablesCleanup` enabled but `KubeletIPTablesCleanup` disabled).

- Two releases later...

  - `KubeProxyIPTablesCleanup` moves to Beta. (`KubeletIPTablesCleanup` remains Alpha.) Since kubelet and kube-proxy have been warning about `--iptables-drop-bit` for 2 releases now, everyone upgrading to this version should already have seen the warnings and updated their kube-proxy config if they need to.

    By default (with no manual feature gate overrides), kubelet and kube-proxy are now both creating identical `KUBE-MARK-DROP` rules.

- Two releases after that...

  - `KubeProxyIPTablesCleanup` moves to GA.
  - `KubeletIPTablesCleanup` moves to Beta. Since kube-proxy has been creating its own `KUBE-MARK-DROP` chain for 2 releases now, everyone upgrading to this version should already have a kube-proxy that creates `KUBE-MARK-DROP`, so there is no chance of there temporarily being no `KUBE-MARK-DROP` due to version skew during the upgrade.
  - When running with `KubeletIPTablesCleanup` enabled, kubelet warns that `--iptables-masquerade-bit` and `--iptables-drop-bit` are deprecated.

- One release after that...

  - `KubeletIPTablesCleanup` moves to GA.
  - Kubelet unconditionally warns about the deprecated flags.

- Two releases after that

  - We remove the deprecated flags.
As discussed in kubernetes #82125, the original plan had been to move the maintenance of both "mark" chains into kubelet, rather than into kube-proxy.
This would still get rid of the code duplication, and it would also let us avoid problems with external components, by declaring that now they can always assume the existence of `KUBE-MARK-MASQ`, rather than that they should not.

But kube-proxy already runs into problems with `KUBE-MARK-DROP`, where if it finds that the chain has been deleted (eg, by a system firewall restart), it is unable to recreate it properly and thus operates in a degraded state until kubelet fixes the chain. This problem would be much worse with `KUBE-MARK-MASQ`, which kube-proxy uses several orders of magnitude more often than it uses `KUBE-MARK-DROP`.
(There was also other discussion in #82125 about reasons why kubelet is not the right place to be doing low-level networking setup.)