Race condition in pod start order results in non-working clusterIPs when nf_tables must be used #1037

Closed
ffuerste opened this issue Aug 13, 2020 · 1 comment · Fixed by #1058
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ffuerste
Contributor

What happened:

Current major versions of Linux distributions don't support iptables-legacy anymore; instead, nf_tables is used (e.g. RHEL8 or Debian Buster).
Having only nf_tables available leads to a race condition around starting kube-proxy and node-local-dns in the correct order after a node is started.

Currently node-local-dns supports only iptables-legacy; for nf_tables support an open/blocked issue exists.
kube-proxy supports both iptables modes (see here) and determines during its startup phase which one to use (see here).
This means that if the node-local-dns pod on a node starts first and creates its iptables-legacy rules, kube-proxy finds these legacy rules and starts using the legacy mode too. Unfortunately, kube-proxy uses some chains which the kubelet creates when it starts, e.g. the chain KUBE-MARK-DROP. Because the OS offers nf_tables only, the kubelet creates these chains with nf_tables and not with iptables-legacy.
If kube-proxy now starts in iptables-legacy mode, it tries to write to the kubelet's chains and fails, because it cannot find the nf_tables chains:

kubectl -n kube-system logs $(kubectl -n kube-system get pods -o wide | grep proxy | grep cp-0 | awk '{print $1}') -f
W0813 15:17:48.190861       1 server_others.go:559] Unknown proxy mode "", assuming iptables proxy
I0813 15:17:48.203720       1 node.go:136] Successfully retrieved node IP: 172.16.10.5
I0813 15:17:48.203751       1 server_others.go:186] Using iptables Proxier.
I0813 15:17:48.205552       1 server.go:583] Version: v1.18.6
I0813 15:17:48.206007       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0813 15:17:48.206845       1 config.go:315] Starting service config controller
I0813 15:17:48.206863       1 shared_informer.go:223] Waiting for caches to sync for service config
I0813 15:17:48.206886       1 config.go:133] Starting endpoints config controller
I0813 15:17:48.206898       1 shared_informer.go:223] Waiting for caches to sync for endpoints config
I0813 15:17:48.307068       1 shared_informer.go:230] Caches are synced for endpoints config 
I0813 15:17:48.307209       1 shared_informer.go:230] Caches are synced for service config 
E0813 15:17:48.347064       1 proxier.go:1555] Failed to execute iptables-restore: exit status 2 (iptables-restore v1.8.3 (legacy): Couldn't load target `KUBE-MARK-DROP':No such file or directory

Error occurred at line: 84
Try `iptables-restore -h' or 'iptables-restore --help' for more information.
)
I0813 15:17:48.347141       1 proxier.go:825] Sync failed; retrying in 30s

For reference, see

Because of this, kube-proxy enters an endless loop, retrying to write to the chain KUBE-MARK-DROP, and never creates the iptables rules for clusterIPs.
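
The missing chain can be confirmed directly on the affected node; a minimal sketch, assuming both the iptables-legacy and iptables-nft front-ends are available on the host:

# The kubelet created its chains via nf_tables, so only the nft backend lists them:
iptables-nft -t nat -L KUBE-MARK-DROP -n
# ...while the legacy backend, which kube-proxy picked, cannot see the chain:
iptables-legacy -t nat -L KUBE-MARK-DROP -n    # "No chain/target/match by that name"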

On the other hand, if kube-proxy starts before node-local-dns creates its iptables-legacy rules, kube-proxy creates all rules using nf_tables. In this case the chain KUBE-MARK-DROP exists and everything works as expected.
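
For context, which backend kube-proxy ends up using essentially comes down to checking which backend already contains rules and picking that one. A simplified sketch of the idea, not the exact detection script shipped in the kube-proxy image:

# Count existing rules per backend and pick the busier one (simplified sketch).
legacy_rules=$(iptables-legacy-save 2>/dev/null | grep -c '^-')
nft_rules=$(iptables-nft-save 2>/dev/null | grep -c '^-')
if [ "${legacy_rules:-0}" -gt "${nft_rules:-0}" ]; then
    echo "kube-proxy would pick legacy mode"
else
    echo "kube-proxy would pick nft mode"
fi
# If node-local-dns has already written its iptables-legacy rules, this check
# lands on legacy mode, even though the kubelet's chains live in nf_tables.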

What is the expected behavior:
Working clusterIPs after each node start.

How to reproduce the issue:
Reboot nodes to test the race condition.

Alternatively, for a running node, delete the running kube-proxy pod (as sketched after the commands below) and flush the nf_tables rules on the host:

#remove/flush all rules & delete chains
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT
iptables -P FORWARD ACCEPT
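
Deleting the kube-proxy pod on that node can look like this (the grep pattern and the node name cp-0 mirror the log command above and are environment-specific); the recreated pod then re-runs its mode detection against the flushed host rules:

kubectl -n kube-system delete pod $(kubectl -n kube-system get pods -o wide | grep proxy | grep cp-0 | awk '{print $1}')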

Anything else we need to know?
The same root cause is also hitting Calico, see here.

I tested using kube-proxy in IPVS mode. Unfortunately, here I hit another blocking issue for Azure (most likely for other public cloud providers as well) regarding services of type LoadBalancer. Hence, using kube-proxy in IPVS mode is not an option.

Because the root cause is the unpatched node-local-dns pod, I think a good option could be to disable its deployment for now, maybe by introducing node-local-dns as a feature in KubeOne which can be deactivated (e.g. like PodSecurityPolicies)?
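
Purely as an illustration of that suggestion: such a switch could sit next to the existing feature toggles in the KubeOne manifest. The nodeLocalDNS field below is hypothetical and does not exist in KubeOne today:

# Hypothetical kubeone.yaml fragment (field name made up for illustration),
# modeled after existing feature toggles such as podSecurityPolicy:
cat >> kubeone.yaml <<'EOF'
features:
  nodeLocalDNS:      # hypothetical, not an existing KubeOne field
    enable: false
EOF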

Information about the environment:
KubeOne version: 1.0.0-beta.2
Operating system: RHEL8
Provider you're deploying cluster on: Azure
Operating system you're deploying on: CentOS8

ffuerste added the kind/bug label Aug 13, 2020
kron4eg removed their assignment Aug 14, 2020
@kron4eg
Member

kron4eg commented Aug 14, 2020

Sorry, I wasn't able to reproduce this problem no matter how many times I rebooted the VMs. Which Kubernetes version do you use?
