
Completely forego the usage of macvlans, as they wreak havoc in ARP/IPv6 ND when more than one macvlan is created from a bridge interface #2403

Closed
delandtj opened this issue Aug 21, 2024 · 17 comments
Labels: priority_major, type_feature (New feature or request)

Comments

@delandtj
Contributor

Creating a macvlan from a bridge interface seems to conflict with ARP/ND handling for the bridge: both the bridge (obviously) and the macvlan carry their own bridging code, and somewhere along the path things no longer fit together.
Creating macvlans from physical interfaces doesn't seem to cause problems, but creating them from bridge interfaces (when there is more than one and the interfaces need to communicate with each other) does weird things: neighbors get lost, and as such routes get lost, and as such connections get lost.

Some comments needed here:

  • If a macvlan or macvtap is created on a bridge for a workload that is still alive, we will have to see if we want to leave it like that or recreate it. Or just tell the user to recreate his workload.
  • Zdbs are using the old container code, so I guess we can do that 'quite' easily?
  • All other workloads are taps?
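For reference, a minimal sketch of the setup described above, written in Go with the vishvananda/netlink library (a common choice for this kind of network plumbing; whether zos uses it here is an assumption). All device names ("br-test", "mvl1", "mvl2") are made up for illustration: two bridge-mode macvlans created on top of one bridge is the combination reported to break ARP/IPv6 ND between the endpoints.

```go
package main

import "github.com/vishvananda/netlink"

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	// Create a bridge, comparable to the per-network bridges on a node.
	brAttrs := netlink.NewLinkAttrs()
	brAttrs.Name = "br-test"
	must(netlink.LinkAdd(&netlink.Bridge{LinkAttrs: brAttrs}))
	br, err := netlink.LinkByName("br-test")
	must(err)
	must(netlink.LinkSetUp(br))

	// Create two bridge-mode macvlans on top of that bridge.
	for _, name := range []string{"mvl1", "mvl2"} {
		attrs := netlink.NewLinkAttrs()
		attrs.Name = name
		attrs.ParentIndex = br.Attrs().Index
		mvl := &netlink.Macvlan{LinkAttrs: attrs, Mode: netlink.MACVLAN_MODE_BRIDGE}
		must(netlink.LinkAdd(mvl))
		must(netlink.LinkSetUp(mvl))
		// In zos each macvlan would then be moved into a workload namespace
		// (netlink.LinkSetNsFd); traffic between the two endpoints is where
		// neighbor entries reportedly go missing.
	}
}
```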
@delandtj delandtj added this to 3.15.x Aug 21, 2024
@delandtj delandtj changed the title Completely forego the usage of macvlans, as they wreak havoc in ARP/IPv6 ND when mure than one macvlan is created form a bridge interface Completely forego the usage of macvlans, as they wreak havoc in ARP/IPv6 ND when more than one macvlan is created form a bridge interface Aug 21, 2024
@scottyeager

There might be some info about the sources of conflict here (just a note for reference).

So far as I understand, this affects the Mycelium interface for Zdbs. Are VMs also affected? If I read correctly here, it seems so:

- A macvlan interface created from that bridge and moved inside the network resource namespace (will always be called `br-my`)

  • we will have to see if we want to leave it like that or recreate it. Or just tell the user to recreate his workload.

I think it's fine to leave existing VMs as is and let the users recreate if they want. If the user is experiencing a problem due to this, they probably have found another communication route and are happy enough with their VM as is.

For Zdb, ideally the endpoints could retain all the same IP addresses and nothing changes for the users. If that's not possible, oh well. There's not much usage yet anyway.

@LeeSmet
Contributor

LeeSmet commented Aug 22, 2024

VMs always use regular TAP interfaces (NOT macvtaps), and therefore these should be fine. The problem exists primarily for zdbs because, for mycelium, a packet traverses 2 macvlans in bridge mode which are themselves enslaved by a bridge. It appears to be specifically the second macvlan that causes issues when acting as a router: "regular" public IPv6 has the same setup (a macvlan from br-pub to the zdb ns and a macvlan from br-pub to the public ns), but there the packet "exits" on the bridge, since the bridge enslaves a physical interface which is the path to the router (so for regular IPv6 the public namespace is not involved).

@ashraffouda
Collaborator

This is already handled for zos4; we still need to do it in zos3.

@delandtj
Contributor Author

Any status on zos3 for this?
Keep in mind that once that is done (in both zos4 and zos3) we can also remove the dummy devices, as that problem is gone too then.

@mik-tf

mik-tf commented Oct 7, 2024

Hi everyone,

I had a talk with @despiegk and we can't wait until 3.16 to fix this macvlan bug, as it is currently a blocking bug, i.e. it blocks Mycelium utilization. Thus it is now set as a critical priority.

We can't wait for a forklift upgrade. We just need to do a patch.

@ashraffouda @Omarabdul3ziz can you guys check this?
@xmonader tagging you here so you can coordinate the steps if needed.

Thanks everyone.

@ashraffouda
Collaborator

Switching away from macvlans will not fix the problem until every single node is rebooted, because nodes will not drop their existing macvlans until they are rebooted.

@LeeSmet
Contributor

LeeSmet commented Oct 8, 2024

So basically all macvlan devices attached to a bridge (which should be all of them at this point, iirc) need to be replaced by a veth pair with one end plugged into the bridge. While it is true that the network won't adapt automatically just because new code is uploaded until the node is rebooted, this can be solved by writing a bit of migration code which looks for all macvlans and switches them for veth pairs. Alternatively, the migration code can look for a single known macvlan device; if that is found, it knows the other macvlans need to be switched out, and finally the known one should be changed to a veth pair as well. This will allow the devices to be changed with minimal downtime for workloads and without the need to reboot the node. Finally, the dummy interfaces we use to force the bridges to be up can be cleaned up as well after this.
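For illustration, a minimal sketch of this kind of migration pass in Go, assuming the vishvananda/netlink library. The helper replaceWithVeth and the peer-naming scheme are made up, and restoring addresses and namespace placement (which a real migration must handle) is omitted.

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	links, err := netlink.LinkList()
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		// Only macvlans created on top of a bridge need to be swapped out.
		if l.Type() != "macvlan" {
			continue
		}
		parent, err := netlink.LinkByIndex(l.Attrs().ParentIndex)
		if err != nil || parent.Type() != "bridge" {
			continue
		}
		if err := replaceWithVeth(l, parent); err != nil {
			fmt.Printf("failed to migrate %s: %v\n", l.Attrs().Name, err)
		}
	}
}

// replaceWithVeth (hypothetical helper) deletes the macvlan and creates a veth
// pair with the same workload-facing name, plugging the peer end into the
// bridge the macvlan was created from.
func replaceWithVeth(old netlink.Link, bridge netlink.Link) error {
	name := old.Attrs().Name
	if err := netlink.LinkDel(old); err != nil {
		return err
	}
	attrs := netlink.NewLinkAttrs()
	attrs.Name = name
	veth := &netlink.Veth{
		LinkAttrs: attrs,
		PeerName:  "p-" + name, // illustrative; real names must fit IFNAMSIZ (15 chars)
	}
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}
	peer, err := netlink.LinkByName(veth.PeerName)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetMaster(peer, bridge); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(peer); err != nil {
		return err
	}
	return netlink.LinkSetUp(veth)
}
```

The same pass could also drop the dummy interfaces mentioned above once each bridge has at least one veth port keeping it up.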

@mik-tf

mik-tf commented Oct 8, 2024

As I understand, this means that @LeeSmet has a working solution for this.

@ashraffouda @Omarabdul3ziz
Can you implement the changes needed? If something is missing, what would be needed to fix this ASAP? Thanks.

@xmonader xmonader added type_feature New feature or request and removed priority_critical labels Oct 9, 2024
@mik-tf

mik-tf commented Oct 9, 2024

@xmonader, I added back priority_major as @delandtj set it initially, if we don't want to set it as critical.

Please let's discuss and act as needed to move this forward. I'll be glad to help.

@xmonader xmonader added this to 3.15.x Oct 9, 2024
@xmonader xmonader moved this to In Progress in 3.15.x Oct 9, 2024
@xmonader xmonader removed this from the 3.13 milestone Oct 9, 2024
@xmonader xmonader added this to the 3.12 milestone Oct 9, 2024
@ashraffouda
Collaborator

We started changing the zdb part today, but it's not finished yet; there are some issues we are still debugging.

@Omarabdul3ziz
Contributor

Work is done on the zdb workload: it now uses a veth pair instead of a macvlan for all of its links (ygg/myc/pub).
[screenshots attached showing the updated zdb links]

I will continue working on the other macvlans used for the network resource and the wiring to the ndmz namespace.
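For context, a rough sketch of what the veth wiring for one such link could look like, in Go with the vishvananda/netlink and vishvananda/netns libraries. The bridge name, device names, and namespace name below are illustrative, not the actual zos identifiers.

```go
package main

import (
	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

func main() {
	// Host-side bridge for the link (name assumed for illustration).
	bridge, err := netlink.LinkByName("br-myc")
	if err != nil {
		panic(err)
	}

	// Create the veth pair: one end stays on the host and is enslaved by the
	// bridge, the peer end is handed to the zdb namespace.
	attrs := netlink.NewLinkAttrs()
	attrs.Name = "b-zdb-myc"
	veth := &netlink.Veth{LinkAttrs: attrs, PeerName: "zdb-myc"}
	if err := netlink.LinkAdd(veth); err != nil {
		panic(err)
	}
	if err := netlink.LinkSetMaster(veth, bridge); err != nil {
		panic(err)
	}
	if err := netlink.LinkSetUp(veth); err != nil {
		panic(err)
	}

	// Move the peer into the zdb namespace; addressing is done in there.
	peer, err := netlink.LinkByName("zdb-myc")
	if err != nil {
		panic(err)
	}
	ns, err := netns.GetFromName("zdb-ns") // hypothetical namespace name
	if err != nil {
		panic(err)
	}
	defer ns.Close()
	if err := netlink.LinkSetNsFd(peer, int(ns)); err != nil {
		panic(err)
	}
}
```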

@Omarabdul3ziz Omarabdul3ziz moved this from In Progress to Pending Review in 3.15.x Oct 14, 2024
@ashraffouda ashraffouda moved this from Pending Review to In Verification in 3.15.x Oct 14, 2024
@ashraffouda ashraffouda moved this from In Verification to Pending Review in 3.15.x Oct 14, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending Review to In Verification in 3.15.x Oct 15, 2024
@scottyeager

Since the patch to implement this was applied to mainnet, I don't have any more issues connecting to Zdbs over Mycelium. Looks good.

@xmonader
Collaborator

> Since the patch to implement this was applied to mainnet, I don't have any more issues connecting to Zdbs over Mycelium. Looks good.

Right now the macvlan removal has only been applied on devnet, so that may mean something else was going on.

@scottyeager

Ah, my mistake. I should refine my statement to be that, while the gratuitous packet loss did not appear in my recent tests, there are still big latency spikes on testnet.

However, my testing on devnet today shows that there are still large amounts of packet loss when connecting to Zdbs over Mycelium.

Here are some examples from ping tests. These were performed from a VM running on node 128 on devnet. The IPs are the Mycelium IPs of Zdbs running on the indicated node ids, as returned by Zos after creating Zdb deployments. Note the gaps in the sequence numbers.

Devnet node 12

64 bytes from 5b2:39e5:4286:5af0:8301:6cb5:43f0:5b77: icmp_seq=128 ttl=58 time=64.6 ms
64 bytes from 5b2:39e5:4286:5af0:8301:6cb5:43f0:5b77: icmp_seq=153 ttl=59 time=48.1 ms

Devnet node 28

64 bytes from 53c:6017:e9c1:9aee:e514:bea6:1a64:d38: icmp_seq=253 ttl=59 time=44.6 ms
64 bytes from 53c:6017:e9c1:9aee:e514:bea6:1a64:d38: icmp_seq=310 ttl=59 time=132 ms

Devnet node 31

64 bytes from 4b6:c798:a:34e1:6a16:7067:d299:11b: icmp_seq=186 ttl=58 time=59.1 ms
64 bytes from 4b6:c798:a:34e1:6a16:7067:d299:11b: icmp_seq=269 ttl=58 time=46.3 ms

Looks like we need some further investigation to be sure of the root cause and why the proposed fix did not address it.

@LeeSmet
Contributor

LeeSmet commented Oct 25, 2024

Node 12: [screenshot]
Node 28: [screenshot]
Node 31: [screenshot]

> Looks like we need some further investigation to be sure of the root cause and why the proposed fix did not address it.

I concur, since it seems that all listed nodes still exhibit the old behavior of using macvlans

@xmonader
Collaborator

xmonader commented Oct 28, 2024

@scottyeager can you try after that reboot? The migration code to auto-migrate without needing a reboot is still in progress.

@scottyeager

Looking much better now. The three nodes that were rebooted now have 0% packet loss. Node 14, which was not rebooted, is still showing high packet loss (lower right):

[screenshot of ping results]

@github-project-automation github-project-automation bot moved this from In Verification to Done in 3.15.x Oct 29, 2024