Upgrade to systemd 243+ breaks pod networking with AWS CNI due to veth MAC Address getting overwritten #278
Comments
Hm, duplicate of #181?
Yeah, it looks similar, but the quick fix doesn't seem to work.
@jaysiyani perhaps this must be applied to all interfaces, also inside containers? Or try rebooting the machine.
Interestingly, after a reboot this works, but once the pod gets rescheduled (on the same node) with a different IP, it stops working again.
@jaysiyani blind guess, but maybe you also hit #279? Maybe you can try a workaround from #279 (comment)?
I don't know the details of Cilium networking, but I guess that's what is needed here. Is there any difference in the routing table?
Our file permissions on /opt seem fine.
Very strange. Now I can't seem to get it to work at all, even after a reboot.
Also, looks like:
How can I apply this inside containers?
If you set it on the host before the containers start, they will automatically pick it up. This is why I suggested a reboot.
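For illustration only, here is a minimal sketch of applying a host-level setting so that containers created afterwards pick it up. The sysctl key is just a placeholder; the actual setting is the one from the #279 workaround, which is not reproduced here.

```sh
# Placeholder key standing in for the workaround referenced in #279
echo 'net.ipv4.conf.all.rp_filter = 0' | sudo tee /etc/sysctl.d/99-cni-workaround.conf

# Re-apply all sysctl.d files on the running host
sudo sysctl --system

# Recreate the affected pods (or reboot) so their network setup sees the new value
kubectl delete pod <pod-name>
```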
We're seeing a similar issue with AWS CNI, and I have created a repository with Terraform driven by Terratest here, which can be used to further troubleshoot and/or verify any fixes.
I can also confirm that the 2605 series has broken the AWS CNI. Pods with host networking (e.g. kube-proxy or the CNI daemonset itself) are able to communicate outside the node fine, but pods without host networking fail to connect to anything, as if the packets are being dropped, whether the destination is in-cluster, out-of-cluster, or even on the same node. This includes calls to the Kubernetes API server.
Hi! First of all, thanks Dan for the reproduction case. I've used it and was able to verify that this indeed breaks when switching from 2512 to 2605. BTW, the repro case uses AWS CNI, not Cilium, so it already confirms what Greg commented. I spent quite a few hours trying to figure out what exactly the problem is, but I wasn't able to find the root cause. In the repro case, when using 2605 the coredns pod is unable to send or receive packets. They get dropped, but I couldn't find what's causing the drop. I tried comparing the sysctl values across both versions and overriding some of those that were different, but that had no effect. The generated firewall rules have a few differences, including a comment that says that the Cluster IP is not reachable, but it's unclear whether the differences are cause or effect of the problem.
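For anyone repeating that comparison, a rough sketch of how such values can be gathered and diffed (the file names are illustrative):

```sh
# Run on both a working (2512) and a broken (2605) node, then copy the files to one place
sysctl -a 2>/dev/null | sort > sysctl-$(hostname).txt
iptables-save -c > iptables-$(hostname).txt

# Offline comparison
diff sysctl-node-2512.txt sysctl-node-2605.txt
diff iptables-node-2512.txt iptables-node-2605.txt
```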
I've spent another day poking at this and I haven't yet found the root cause, but at least I've reduced the list of suspects. I tried installing different Flatcar versions (from the alpha channel) to figure out when this broke. I was very suspicious that the culprit was the switch to the 5.4 kernel, but it turned out that was not the case. To be able to poke at the problem, I deployed a debugging pod.
What I tested:
So, by disabling SELinux on the 2492 version, I have a working setup with kernel 5.4 and systemd 241. The 2513 version introduced a bunch of changes; the current main suspect is systemd 243. In particular, there were a lot of changes around systemd-networkd. The veth network interfaces are unmanaged by systemd in both cases, so the difference should be in the eth0 interface or the network stack in general. In my tests I found that if I pinged from the host to the pod and captured the ping with tcpdump, I could see the request going to the pod and the reply going back to the node, but the ping command never saw that reply. This is the tcpdump capture (ping just says 100% packets lost):
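(The capture itself is not reproduced here.) As a sketch, these are the kinds of commands used for such a capture; the pod IP and the host-side veth name are placeholders:

```sh
# From the host, ping the pod (10.2.0.15 is a placeholder pod IP)
ping -c 3 10.2.0.15

# In another shell, capture ICMP on the host side of the veth pair;
# -e prints link-layer headers, so a MAC address mismatch becomes visible
tcpdump -eni eni3a52ce6b9f1 icmp
```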
One interesting difference is in the PREROUTING mangle table. On a working host, it looks like this:
On a broken host, it looks like this:
Notice how the eni line is at 0 on the broken host. When I tried tracing with iptables, in the working case I saw the trace of the packet coming in, hitting this PREROUTING table as its first step, while in the broken case I saw nothing... So apparently the packet gets lost after being captured by tcpdump but before hitting iptables. What happens in between?
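For reference, a sketch of the commands behind this kind of inspection (the exact chains and rules depend on the CNI):

```sh
# Per-rule packet counters in the mangle PREROUTING chain
iptables -t mangle -L PREROUTING -v -n

# Trace ICMP packets through netfilter; trace output goes to the kernel log
# (with nft-based iptables, use `xtables-monitor --trace` instead)
iptables -t raw -I PREROUTING -p icmp -j TRACE
journalctl -k -f | grep -i trace

# Remove the trace rule afterwards
iptables -t raw -D PREROUTING -p icmp -j TRACE
```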
I've had some more time to look into this issue more closely, and I think I may have found something.
A workaround is described here and appears to solve the underlying issue; tested successfully with the current stable Flatcar.
@dvulpe Thanks for finding this out! Maybe a start would be to use `MACAddressPolicy=none`.
I've tested this and verified that indeed the `MACAddressPolicy=none` setting solves the problem.
And after rebooting, new virtual interfaces work correctly. We had dealt with a similar issue for flannel here: flatcar-archive/coreos-overlay#282. In my tests, I also tried matching by name. As in the linked flannel issue, the problem is visible when using
We have a few files that instruct systemd not to manage links for various CNIs; we do this by matching on interface names in `.network` files (a rough sketch follows after this comment). These are the network files that currently state that links should be unmanaged:
I think we should add a file like the one I showed above (matching on the driver).
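For context on the existing approach mentioned above (matching interface names in `.network` files), a rough sketch of such an "unmanaged" network file; the file name and the interface name pattern are illustrative, not the actual files shipped by Flatcar:

```ini
# /usr/lib/systemd/network/50-cni-veth.network (illustrative path)
[Match]
Name=veth*

[Link]
Unmanaged=yes
```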
When a veth device is created, the CNI in charge of bringing the device up will set a MAC address. If `MACAddressPolicy=persistent` is set, systemd will change it to a different one, causing dropped packets due to the mismatch. With this change, the address set when the device is created will remain untouched by systemd. See flatcar/Flatcar#278 for more information.
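A minimal sketch of a `.link` file along those lines, assuming a match on the veth driver (the file name and path are illustrative and may differ from what Flatcar actually ships):

```ini
# /etc/systemd/network/50-veth-mac-none.link (illustrative path)
[Match]
Driver=veth

[Link]
# Keep the MAC address assigned by the CNI plugin; don't let udev rewrite it
MACAddressPolicy=none
```

New veth devices created after a reboot (or after `udevadm control --reload`) should then keep the MAC address set by the CNI.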
Needed for flatcar/Flatcar#278
I've applied the fix to all Flatcar branches. I've verified that the test case provided by @dvulpe passes with this fix applied. The fix will be included in the next set of Flatcar releases.
The fix got released yesterday in all channels (2605.11.0, 2705.1.1, 2748.0.0). Please test your setups and let us know if this solves the issue or not.
Thanks @marga-kinvolk, I deployed the latest image today and the tests I conducted before are now passing!
Thanks @marga-kinvolk - I've verified 2605.11.0 and it worked great!
Awesome, thanks everyone. I'm now closing this bug (and will also retitle it a bit to make the issue clearer). If you encounter further issues running Flatcar with EKS, please file new bugs. Thanks!
This change adds a default network configuration .link file that `systemd-udev` will use when configuring new interfaces. It contains the default list of policies that are used when naming interfaces, as well as the policy by which the MAC address should be set. Bottlerocket packages its own version of this file rather than the default from systemd for a few reasons: 1) Bottlerocket does not create/use a udev hwdb (we disable the option in systemd compile flags), so we remove this option from the NamePolicy list; 2) CNI plugins can be confused when MAC addresses change for virtual interfaces, so Bottlerocket sets the default `MACAddressPolicy` to "none", which directs systemd not to attempt to manage the MAC. Hardware usually has a MAC, and veth devices used by CNI generally get a MAC generated by the plugin. Additional information about the MAC address issue: systemd/systemd#3374 (comment), flatcar/Flatcar#278, flatcar/init#33
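Based on that description, the packaged file would look roughly like the sketch below; the exact contents in Bottlerocket may differ (the stock systemd default additionally lists `database` in `NamePolicy` and uses `MACAddressPolicy=persistent`):

```ini
# 99-default.link (sketch, not the literal Bottlerocket file)
[Link]
NamePolicy=keep kernel onboard slot path
MACAddressPolicy=none
```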
Description
Our Flatcar image was auto-updated from 2512.4.0 to 2605.5.0; this somehow broke the node's ability to talk to pods running on it.
Impact
Pods on worker nodes are not able to communicate with the API server pods on the master nodes.
Environment and steps to reproduce
Task: Reach a pod running on the Node
Action(s):
a. Upgrade from Flatcar 2512.4.0 to 2605.5.0
Error: The node cannot reach the pod running on it.
Expected behavior
Additional information
Cilium-monitor output when trying to run `tracepath` on a node with a pod running on it.
TCP dump on the node trying to reach a pod running on it.
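For context, a sketch of how that diagnostic output can be gathered; the pod IP and the Cilium invocation are assumptions, and on Flatcar tcpdump is typically run from a toolbox container:

```sh
# From the node, probe the path to a pod running on it (10.2.0.15 is a placeholder)
tracepath -n 10.2.0.15

# Watch Cilium drop events while reproducing (command form varies by Cilium version)
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop

# Capture traffic to/from the pod on the node, including link-layer headers
tcpdump -eni any host 10.2.0.15
```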