Upgrade to systemd 243+ breaks pod networking with AWS CNI due to veth MAC Address getting overwritten #278

Closed
jaysiyani opened this issue Nov 19, 2020 · 23 comments

@jaysiyani

Description

Our Flatcar image was auto-updated from 2512.4.0 to 2605.5.0, which somehow broke the node's ability to talk to pods running on it.

Impact

Pods on worker nodes are unable to communicate with the API server pods on the master nodes.

Environment and steps to reproduce

  1. Set-up:
  • Kubernetes Client Version: version.Info Major:"1", Minor:"19", GitVersion:"v1.19.1"
  • Kubernetes Server Version: version.Info Major:"1", Minor:"16", GitVersion:"v1.16.13"
  • Running on AWS instances using Flatcar 2605.5.0 (also tested with 2605.7.0)
  • Cilium v1.7.5 (also tested with Cilium v1.8.5)
  • AWS VPC CNI (v1.6.3)
  2. Task: Reach a pod running on the Node

  3. Action(s):
    a. Upgrading from Flatcar 2512.4.0 to 2605.5.0

  4. Error: The node cannot reach the pod running on it.

Node (ip-10-64-52-104.eu-west-1.compute.internal ) to POD (10.64.36.243) on Master-newFC (ip-10-64-32-253.eu-west-1.compute.internal)

tracepath 10.64.36.243
 1?: [LOCALHOST]                                         pmtu 9001
 1:  ip-10-64-32-253.eu-west-1.compute.internal            0.503ms 
 1:  ip-10-64-32-253.eu-west-1.compute.internal            0.464ms 
 2:  no reply
 3:  no reply
 4:  no reply
 5:  no reply
 6:  no reply
...
30:  no reply
     Too many hops: pmtu 9001
     Resume: pmtu 9001 

Expected behavior

Node (ip-10-64-52-104.eu-west-1.compute.internal ) to POD (10.64.33.129) on Master-oldFC (ip-10-64-34-191.eu-west-1.compute.internal)

tracepath 10.64.33.129
 1?: [LOCALHOST]                                         pmtu 9001
 1:  ip-10-64-34-191.eu-west-1.compute.internal            0.538ms 
 1:  ip-10-64-34-191.eu-west-1.compute.internal            0.460ms 
 2:  ip-10-64-33-129.eu-west-1.compute.internal            0.475ms reached
     Resume: pmtu 9001 hops 2 back 2 

Additional information
Cilium-monitor output when trying to run tracepath on a node with a pod running on it

level=info msg="Initializing dissection cache..." subsys=monitor
-> endpoint 1077 flow 0xd4db6b68 identity 1->66927 state new ifindex 0 orig-ip 10.64.32.253: 10.64.32.253:36282 -> 10.64.39.43:44444 udp
-> stack flow 0xa466c6d3 identity 66927->1 state related ifindex 0 orig-ip 0.0.0.0: 10.64.39.43 -> 10.64.32.253 DestinationUnreachable(Port)

tcpdump on the node while trying to reach a pod running on it:

15:18:00.676152 IP ip-10-64-32-253.eu-west-1.compute.internal.58914 > ip-10-64-52-104.eu-west-1.compute.internal.4240: Flags [.], ack 548860955, win 491, options [nop,nop,TS val 3987550058 ecr 3030925508], length 0
15:18:00.676520 IP ip-10-64-52-104.eu-west-1.compute.internal.4240 > ip-10-64-32-253.eu-west-1.compute.internal.58914: Flags [.], ack 1, win 489, options [nop,nop,TS val 3030955756 ecr 3987534941], length 0
15:18:00.919448 IP ip-10-64-52-104.eu-west-1.compute.internal.4240 > ip-10-64-32-253.eu-west-1.compute.internal.58914: Flags [.], ack 1, win 489, options [nop,nop,TS val 3030955999 ecr 3987534941], length 0
15:18:00.919497 IP ip-10-64-32-253.eu-west-1.compute.internal.58914 > ip-10-64-52-104.eu-west-1.compute.internal.4240: Flags [.], ack 1, win 491, options [nop,nop,TS val 3987550301 ecr 3030955756], length 0
15:18:01.465589 IP ip-10-64-52-104.eu-west-1.compute.internal.34294 > ip-10-64-36-243.eu-west-1.compute.internal.44448: UDP, length 8973
15:18:01.465630 IP ip-10-64-52-104.eu-west-1.compute.internal.34294 > ip-10-64-36-243.eu-west-1.compute.internal.44448: UDP, length 8973
15:18:01.465647 IP ip-10-64-36-243.eu-west-1.compute.internal > ip-10-64-52-104.eu-west-1.compute.internal: ICMP ip-10-64-36-243.eu-west-1.compute.internal udp port 44448 unreachable, length 556
@invidian
Member

Hm, duplicate of #181?

@jaysiyani
Author

Yeah, looks similar, but the quick fix doesn't seem to work: echo 'net.ipv4.conf.lxc*.rp_filter = 0' | sudo tee -a /etc/sysctl.d/90-override.conf && sudo systemctl restart systemd-sysctl

@invidian
Member

@jaysiyani perhaps this must be applied to all interfaces, also inside containers? Or try rebooting the machine.

@jaysiyani
Author

Interestingly, after a reboot this works, but as the pod gets rescheduled (on the same node) with a different IP it stops working again.

@invidian
Member

invidian commented Nov 23, 2020

@jaysiyani blind guess, but maybe you also hit #279? Maybe you can try a workaround from #279 (comment) ?

@pothos
Member

pothos commented Nov 24, 2020

I don't know the details of Cilium networking, but I guess that's what is needed here. Is there any difference in the routing table (ip route get POD-IP-ADDRESS) before and after it stops working? It should give the interface in the host network namespace where the pod is reachable, and I would expect it to be updated to the right interface (but even this may not be the case with Cilium in general). The next thing to look out for is where the packet is dropped. Does Cilium still use iptables (I expect it to be using bpf on the tc layer)?
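
A minimal sketch of those checks, reusing the pod IP from the report above (the interface name is a placeholder):

ip route get 10.64.36.243            # should name the host-side interface the pod is reachable through
ip -d link show dev <pod-interface>  # placeholder; compare state/flags before and after it breaks
sudo iptables -L -v -n               # do any packet counters move while reproducing the problem?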

@jaysiyani
Author

@jaysiyani blind guess, but maybe you also hit #279? Maybe you can try a workaround from #279 (comment) ?

Our file permissions on /opt seem fine.

@jaysiyani
Author

Interestingly, after a reboot this works, but as the pod gets rescheduled (on the same node) with a different IP it stops working again.

Very strange, as now I can't seem to get it to work at all, even after a reboot.

@jaysiyani
Author

Also, it looks like the settings from /usr/lib/sysctl.d/50-default.conf get overridden by /usr/lib/sysctl.d/baselayout.conf:

sysctl --system
* Applying /usr/lib/sysctl.d/50-coredump.conf ...
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
* Applying /usr/lib/sysctl.d/50-default.conf ...
kernel.sysrq = 16
kernel.core_uses_pid = 1
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.default.promote_secondaries = 1
net.core.default_qdisc = fq_codel
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
fs.protected_regular = 1
fs.protected_fifos = 1
* Applying /usr/lib/sysctl.d/50-pid-max.conf ...
kernel.pid_max = 4194304
* Applying /etc/sysctl.d/90-override.conf ...
* Applying /usr/lib/sysctl.d/baselayout.conf ...
net.ipv4.ip_forward = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
kernel.kptr_restrict = 1
fs.protected_regular = 0
fs.protected_fifos = 0
* Applying /etc/sysctl.d/netfilter.conf ...
net.netfilter.nf_conntrack_tcp_be_liberal = 1
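
For reference: sysctl --system applies files sorted by file name, and a file in /etc/sysctl.d/ only takes precedence over one in /usr/lib/sysctl.d/ when it has the same name, which is why baselayout.conf is applied after 90-override.conf here. One way to make an rp_filter override stick (a sketch, not a verified fix for this problem) is to shadow the vendor file:

# Copy the vendor file so the /etc copy shadows it, then relax rp_filter in the copy:
sudo cp /usr/lib/sysctl.d/baselayout.conf /etc/sysctl.d/baselayout.conf
sudo sed -i 's/rp_filter = 1/rp_filter = 0/' /etc/sysctl.d/baselayout.conf
sudo systemctl restart systemd-sysctl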

@jaysiyani
Author

@jaysiyani perhaps this must be applied to all interfaces, also inside containers? Or try rebooting the machine.

How can I apply this inside containers?

@invidian
Member

How can I apply this inside containers?

If you set it on the host before the containers start, they will automatically pick it up. This is why I suggested a reboot.
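
For a pod that is already running, the setting can also be applied from the host by entering the pod's network namespace (a sketch; the <pod-pid> placeholder is any process running inside the pod):

# Find a PID inside the pod (e.g. via crictl/docker inspect), then:
sudo nsenter -t <pod-pid> -n sysctl -w net.ipv4.conf.all.rp_filter=0
sudo nsenter -t <pod-pid> -n sysctl net.ipv4.conf.all.rp_filter   # verify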

@dvulpe

dvulpe commented Nov 29, 2020

We're seeing a similar issue with AWS CNI and Flatcar 2605.8.0. Network connectivity appears to work with 2512.5.0 but fails with newer versions of Flatcar.

I have created a repository with Terraform driven by Terratest here, which can be used to further troubleshoot and/or verify any fixes.

@gregsymons

I can also confirm that the 2605 series has broken the AWS CNI. Pods with host networking (e.g. kube-proxy or the CNI daemonset itself) are able to communicate outside the node fine, but pods without host networking fail to connect to anything as if the packets are being dropped, whether the destination is in-cluster, out-of-cluster, or even on the same node. This includes calls to the kubernetes.default service that exposes the control plane in the cluster.

@margamanterola
Contributor

Hi! First of all, thanks Dan for the reproduction case. I've used it and was able to verify that indeed this breaks when switching from 2512 to 2605. BTW, the repro case uses AWS CNI, not Cilium, so it already confirms what Greg commented.

I spent quite a few hours trying to figure out what exactly the problem is, but I wasn't able to find the root cause. In the repro case, when using 2605 the coredns pod is unable to send or receive packets. They get dropped, but I couldn't find what's causing the drop.

I tried comparing the sysctl values across both versions and overriding some of those that were different, but that had no effect. The generated firewall rules have a few differences, including a comment that says that the Cluster IP is not reachable, but it's unclear whether the differences are cause or effect of the problem.

@margamanterola
Contributor

I've spent another day poking at this and I haven't yet found the root cause, but at least I've reduced the list of suspects. I tried installing different Flatcar versions (from the alpha channel) to figure out when this broke. I was very suspicious that the culprit was the switch to the 5.4 kernel, but it turned out that was not the case.

To be able to poke at the problem, I deployed a debugging pod by running kubectl run -it net --image=praqma/network-multitool -- sh. I then tried things like pinging between the pod and the node running that pod, reaching the internet, etc.

What I tested:

  • 2466.0.0 (Last alpha release with 4.19)
    • Ping to the node works, ping 8.8.8.8 works, dig www.google.com works, coredns works
  • 2492.0.0 (First alpha release with 5.4)
    • Works like 2466.0.0 once selinux is disabled (see below)
  • 2513.0.0 (Next alpha, fixes the selinux issue, upgrades systemd)
    • Kernel 5.4.41, systemd 243
    • Ping to the node fails, ping 8.8.8.8 fails, dig www.google.com fails, coredns fails with timeouts: Get https://10.100.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.100.0.1:443: i/o timeout

So, by disabling selinux for the 2492 version, I have a working setup with kernel 5.4 and systemd 241. The 2513 version introduced a bunch of changes; the current main suspect is systemd 243. In particular, there were a lot of changes around systemd-networkd. The veth network interfaces are unmanaged by systemd in both cases, so the difference should be in the eth0 interface or the network stack in general.

In my tests I found that if I pinged from the host to the pod and captured the ping with tcpdump, I could see the request going to the pod and the reply going back to the node, but the ping command never saw that reply. This is the tcpdump capture (ping just reports 100% packet loss):

core@ip-192-168-0-212 ~ $ sudo tcpdump -i any -vv -n icmp
17:14:47.131391 IP (tos 0x0, ttl 64, id 4684, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.212 > 192.168.1.90: ICMP echo request, id 13872, seq 1, length 64
17:14:47.131412 IP (tos 0x0, ttl 64, id 9318, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.90 > 192.168.0.212: ICMP echo reply, id 13872, seq 1, length 64

One interesting difference is in the PREROUTING mangle table. In a working host, it will look like this:

$ sudo iptables -t mangle -L -v  
Chain PREROUTING (policy ACCEPT 123K packets, 106M bytes)
 pkts bytes target     prot opt in     out     source               destination         
64900   98M CONNMARK   all  --  eth0   any     anywhere             anywhere             /* AWS, primary ENI */ ADDRTYPE match dst-type LOCAL limit-in CONNMARK [unsupported revision]
17103 1200K CONNMARK   all  --  eni+   any     anywhere             anywhere             /* AWS, primary ENI */ CONNMARK [unsupported revision]
    0     0 CONNMARK   all  --  vlan+  any     anywhere             anywhere             /* AWS, primary ENI */ CONNMARK [unsupported revision]

In a broken host, it will look like this:

$ sudo iptables -t mangle -L -v 
Chain PREROUTING (policy ACCEPT 150K packets, 97M bytes)
 pkts bytes target     prot opt in     out     source               destination         
 105K   94M CONNMARK   all  --  eth0   any     anywhere             anywhere             /* AWS, primary ENI */ ADDRTYPE match dst-type LOCAL limit-in CONNMARK [unsupported revision]
    0     0 CONNMARK   all  --  eni+   any     anywhere             anywhere             /* AWS, primary ENI */ CONNMARK [unsupported revision]
    0     0 CONNMARK   all  --  vlan+  any     anywhere             anywhere             /* AWS, primary ENI */ CONNMARK [unsupported revision]

Notice how the eni line is at 0 on the broken host. When I tried tracing with iptables, I noticed that in the working case I saw the trace of the packet coming in, hitting this PREROUTING chain as its first step, while in the broken case I saw nothing...

So apparently the packet gets lost after being captured by tcpdump but before hitting iptables... What happens in between?
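
For reference, the iptables tracing mentioned here can be reproduced with the TRACE target in the raw table (a sketch; the pod IP is taken from the capture above, and where the trace output lands depends on whether the legacy or the nft backend is in use):

sudo iptables -t raw -A PREROUTING -p icmp -s 192.168.1.90 -j TRACE
# legacy backend: trace lines appear in the kernel log; nft backend: use "nft monitor trace"
sudo journalctl -k -f | grep TRACE
sudo iptables -t raw -D PREROUTING -p icmp -s 192.168.1.90 -j TRACE   # clean up afterwards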

@dvulpe

dvulpe commented Dec 29, 2020

I've had some more time to look into this issue more closely, and I think I may have found something.

In 2512.5.0 with systemd 241 the /usr/lib/systemd/network/99-default.link file contains:

...
[Link]
NamePolicy=keep kernel database onboard slot path
MACAddressPolicy=persistent

In 2513.0.0 with systemd 243 the /usr/lib/systemd/network/99-default.link file contains:

...
[Match]
OriginalName=*

[Link]
NamePolicy=keep kernel database onboard slot path
MACAddressPolicy=persistent

MACAddressPolicy=persistent appears to instruct systemd to generate a stable MAC address for virtual devices, which appears to cause the kernel to drop packets.

A workaround is described here and appears to solve the underlying issue - tested successfully with the current stable Flatcar 2605.10.0.

@pothos
Member

pothos commented Dec 31, 2020

@dvulpe Thanks for finding this out! Maybe a start would be to use MACAddressPolicy=none for PCI interfaces on AWS. If other setups are also affected, it seems more feasible to have the issue fixed in general…
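
A drop-in in that spirit might look like this (a sketch; matching on the ena driver used by most AWS ENIs is an assumption, and the file name is arbitrary):

# /etc/systemd/network/10-aws-eni.link
[Match]
Driver=ena

[Link]
NamePolicy=keep kernel database onboard slot path
MACAddressPolicy=none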

@margamanterola
Contributor

margamanterola commented Jan 4, 2021

I've tested this and verified that indeed MACAddressPolicy is the likely culprit. On a broken node, I added:

core@ip-192-168-4-227 ~ $ cat /etc/systemd/network/50-veth.link 
[Match]
Driver=veth

[Link]
MACAddressPolicy=none

And after rebooting, new virtual interfaces work correctly. We had dealt with a similar issue for flannel here: flatcar-archive/coreos-overlay#282. In my tests, I also tried matching by name (eni*) and it also solved the problem.

As in the linked flannel issue, the problem is visible when using ip monitor all and catching the interface getting created:

[LINK]4: eni4bd1086a8e2@if3: <BROADCAST,MULTICAST> mtu 9001 qdisc noop state DOWN group default 
    link/ether fe:f3:b1:0e:64:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[LINK]4: eni4bd1086a8e2@if3: <BROADCAST,MULTICAST> mtu 9001 qdisc noop state DOWN group default 
    link/ether 0e:33:4a:64:18:29 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[LINK]4: eni4bd1086a8e2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default 
    link/ether 0e:33:4a:64:18:29 brd ff:ff:ff:ff:ff:ff link-netnsid 0

We have a few files that instruct systemd not to manage links for various CNIs. We do this by matching interface names in the .network files, so we need to keep listing more names with each different CNI. Also, at least in the case of flannel, that was not enough: to keep the MAC address from being managed, we had to add an explicit .link file with the MACAddressPolicy=none setting.

These are the network files that currently state links should be unmanaged:

core@ip-192-168-4-227 /usr/lib64/systemd/network $ grep -ir -b1 unmanaged .
./ipsec-vti.network-69-[Link]
./ipsec-vti.network:76:Unmanaged=yes
--
./yy-azure-sriov-coreos.network-556-[Link]
./yy-azure-sriov-coreos.network:563:Unmanaged=yes
--
./yy-azure-sriov.network-557-[Link]
./yy-azure-sriov.network:564:Unmanaged=yes
--
./calico.network-79-[Link]
./calico.network:86:Unmanaged=yes
--
./50-flannel.network-23-[Link]
./50-flannel.network:30:Unmanaged=yes
--
./cni.network-74-[Link]
./cni.network:81:Unmanaged=yes

I think we should add a file like the one I showed above (matching on the veth driver) to our default setup. Alternatively, if we think that's too broad, we could also match by name.
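
For comparison, the name-matched variant might look like this (a sketch; the eni*/vlan* prefixes are taken from the AWS CNI rules quoted earlier, and other CNIs would need their own globs added):

# /etc/systemd/network/50-cni-veth.link
[Match]
OriginalName=eni* vlan*

[Link]
MACAddressPolicy=none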

margamanterola pushed a commit to flatcar/init that referenced this issue Jan 4, 2021
When a veth device is created, the CNI in charge of bringing the device
up will set a MAC address. If `MACAddressPolicy=persistent` is set,
systemd will change it to a different one, causing dropped packets due
to mismatches.

With this change, the address set when the device is created will remain
untouched by systemd.

See flatcar/Flatcar#278 for more information.
margamanterola pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Jan 4, 2021
margamanterola pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Jan 5, 2021
margamanterola pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Jan 6, 2021
margamanterola pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Jan 6, 2021
@margamanterola
Contributor

I've applied the fix to all Flatcar branches. I've verified that the test case provided by @dvulpe passes with this fix applied.

This fix will be included in the next set of Flatcar releases.

@margamanterola
Contributor

The fix got released yesterday in all channels (2605.11.0, 2705.1.1, 2748.0.0). Please test your setups and let us know if this solves the issue or not.

@jaysiyani
Author

Thanks @marga-kinvolk, I deployed the latest image today and the tests I conducted before are now passing!

@dvulpe

dvulpe commented Jan 13, 2021

Thanks @marga-kinvolk - I've verified 2605.11.0 and it worked great!

@margamanterola
Contributor

Awesome, thanks everyone. I'm now closing this bug (and will also retitle it a bit to make the issue clearer). If you encounter further issues with running Flatcar with EKS, please file new bugs.

Thanks!!

@margamanterola margamanterola changed the title Upgrading from 2512.4.0 to 2605.5.0 or higher breaks networking inside the Kubernetes Node Upgrade to systemd 243+ breaks pod networking with AWS CNI due to veth MAC Address getting overwritten Jan 14, 2021
pothos pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Feb 1, 2021
zmrow added a commit to zmrow/bottlerocket that referenced this issue Mar 3, 2022
This change adds a default network configuration .link file that
`systemd-udev` will use when configuring new interfaces.  It contains
the default list of policies that are used when naming interfaces, as
well as the policy by which the MAC address should be set.

Bottlerocket packages its own version of this file rather than the
default from systemd for a few reasons: 1) Bottlerocket does not
create/use a udev hwdb (we disable the option in systemd compile flags),
so we remove this option from the NamePolicy list; 2) CNI plugins can be
confused when MAC addresses change for virtual interfaces, so
Bottlerocket sets the default MACAddressPolicy to "none", which
directs systemd not to attempt to manage the MAC. Hardware usually has
a MAC, and veth devices used by CNI generally get a MAC generated by the
plugin.

Additional information about the MAC address issue:
systemd/systemd#3374 (comment)
flatcar/Flatcar#278
flatcar/init#33
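
Based on that description, the shipped file would look roughly like this (a reconstruction from the commit message above, not the actual Bottlerocket source):

# Reconstruction of the default .link file described above
[Match]
OriginalName=*

[Link]
# "database" is dropped from the default NamePolicy because Bottlerocket ships no udev hwdb
NamePolicy=keep kernel onboard slot path
# Leave MAC handling to the kernel / the CNI plugin
MACAddressPolicy=none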