problems with handling of FrameTooBigError for broadcast packets #419
Comments
Is swallowing FrameTooBigError for broadcast/multicast really the right solution? We don't have to follow slavishly what the Linux kernel does. What's wrong with injecting icmp "fragmentation needed" packets for broadcast/multicast? #3 is somewhat related. |
I'm not suggesting following what the Linux kernel does just because Linux does it, but because it sets a precedent that it seems wise to follow. Due to the lack of support in Linux, applications cannot be relying on ICMP frag-needed responses to broadcast/multicast, and given its problematic nature for weave, it seems best to give up on it.

Of course, fragmentation of IP broadcast/multicast packets is a marginal case, as broadcast packets are never routed, and multicast packets are only routed if you go to the trouble of setting up a multicast routing daemon, which few people ever do.

But I've done some experiments to elucidate the Linux behaviour. I wrote a trivial UDP sender program which sets IP_MTU_DISCOVER to DO or DONT as required (to avoid being affected by /proc/sys/net/ipv4/ip_no_pmtu_disc). It also sets IP_MULTICAST_TTL to 10. To observe the results, I simply used tcpdump. For broadcast:
Broadcast packets are never forwarded by routers, so there are no more cases to consider for them. And because routers do not forward them, they will not fragment them, and PMTU discovery does not apply: applications cannot be doing PMTU discovery for broadcast packets. So attempting to send an ICMP frag-needed in this case serves no purpose. Without a significant change in approach, weave cannot replicate the behaviour applications expect for DF broadcast packets (i.e. EMSGSIZE or best-effort delivery).

For multicast packets, the interaction with the MTU on the originating machine is the same as for broadcast packets: without DF, the packet is fragmented; with DF, the send fails with EMSGSIZE.

Next, I set up smcroute in order to test multicast routing. The host Linux kernel on my laptop is acting as a router between the internal virtual network to which VMs are attached, and my home network. The sending program is run within a VM, and results are observed from another machine on my home network. smcroute runs on the host, configured to enable multicast routing of 225.3.2.1 from the virtual network to the real network. I have manually set the MTU of the interface to the home network to 900, while the virtual bridge remains at 1500, so that sufficiently large multicast packets must be fragmented to be forwarded by the host kernel. I confirmed that multicast routing was working for small packets. The results for large multicast packets are:
I conclude that applications cannot be doing PMTU discovery for IP multicast by means of frag-needed responses. |
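For anyone wanting to repeat the experiment, a sender along the lines described above could look roughly like the following. This is a minimal, Linux-only sketch and not the program actually used: the flag names, the default port 9999 and the helper names `pmtudiscMode` and `configure` are all assumptions made for illustration. It sets IP_MTU_DISCOVER to IP_PMTUDISC_DO or IP_PMTUDISC_DONT, sets IP_MULTICAST_TTL to 10, and sends one datagram of the requested size.

```go
package main

import (
	"flag"
	"log"
	"net"
	"syscall"
)

// pmtudiscMode maps "should DF be set?" to the Linux IP_MTU_DISCOVER value.
func pmtudiscMode(df bool) int {
	if df {
		return syscall.IP_PMTUDISC_DO // always set DF, never fragment locally
	}
	return syscall.IP_PMTUDISC_DONT // never set DF, fragment locally if needed
}

// configure applies the two socket options mentioned in the comment above.
// The function name is hypothetical.
func configure(conn *net.UDPConn, df bool) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var optErr error
	ctlErr := raw.Control(func(fd uintptr) {
		optErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP,
			syscall.IP_MTU_DISCOVER, pmtudiscMode(df))
		if optErr == nil {
			optErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP,
				syscall.IP_MULTICAST_TTL, 10)
		}
	})
	if ctlErr != nil {
		return ctlErr
	}
	return optErr
}

func main() {
	// All flags and defaults here are assumptions for this sketch.
	dst := flag.String("dst", "225.3.2.1:9999", "destination address:port")
	size := flag.Int("size", 1400, "UDP payload size in bytes")
	df := flag.Bool("df", true, "use IP_PMTUDISC_DO (true) or IP_PMTUDISC_DONT (false)")
	flag.Parse()

	addr, err := net.ResolveUDPAddr("udp4", *dst)
	if err != nil {
		log.Fatal(err)
	}
	conn, err := net.DialUDP("udp4", nil, addr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if err := configure(conn, *df); err != nil {
		log.Fatal(err)
	}
	// With DF requested and a payload larger than the local MTU allows,
	// this write is where an EMSGSIZE error would surface.
	if _, err := conn.Write(make([]byte, *size)); err != nil {
		log.Fatalf("send failed: %v", err)
	}
}
```

Running it with different `-size` and `-df` values while watching tcpdump on the receiving side is enough to see whether a datagram was fragmented, dropped, or rejected at send time.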
IIRC, for non-broadcast/multicast destinations, the kernel will do the following: a) when IP_MTU_DISCOVER is set to IP_PMTUDISC_WANT, the kernel fragments the outbound packet according to the cached PMTU, and sets DF
Which of a) and b) do not happen for multicast/broadcast destinations? |
I was in the process of writing a response to this based on the behaviour of Linux and BSD, when I found a line in RFC1122 that I previously overlooked: (section 3.2.2, top of page 38):
That means that we should not send "frag needed" in response to broadcast/multicast packets, even if we were to implement the proposal in #3. As things stand, where we exchange the source and destination IP addresses when generating a "frag needed" ICMP packet, we are contravening another MUST in RFC1122 (section 3.2.1.3, middle of page 30):
And note later in the same section:
Both the Linux and BSD stacks implement this requirement. This answers your question (b): when the weave router generates a "frag needed" ICMP packet in response to a broadcast/multicast, it gets silently discarded.

And what if we were to follow the proposal in #3 and generate these packets anyway, even though 3.2.2 says we mustn't, so IP stacks won't be expecting them? I haven't found a rule in RFC1122 to say that they must be discarded by the recipient. The BSD IP code does discard them silently, early on in the ICMP receive path. It's less obvious what the Linux IP code does with them. It may well ignore them. If it doesn't, it probably counts as a bug.

To answer (a): Linux has no relevant special handling when sending multicast or broadcast packets. With IP_MTU_DISCOVER set to IP_PMTUDISC_WANT, there will be no path MTU known for a broadcast destination, and as a consequence, fragmentation is not performed at the source and DF is always set. And so, if the packet is too large, it is silently dropped. |
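To make the two RFC1122 rules discussed above concrete, here is a hedged sketch of the checks they imply: do not generate an ICMP error for a datagram that was addressed to an IP broadcast/multicast address or sent as a link-layer broadcast, and treat a datagram whose source address is broadcast/multicast as invalid. The function names `fragNeededAllowed` and `hasGroupSource` are made up for this example and are not weave code. Only IPv4 multicast and the limited broadcast address 255.255.255.255 are recognised; detecting directed broadcasts would need the subnet mask, which this sketch does not have.

```go
package main

import (
	"fmt"
	"net"
)

// isGroupIP reports whether ip is an IPv4 multicast address or the limited
// broadcast address. Directed broadcasts are NOT detected (no mask here).
func isGroupIP(ip net.IP) bool {
	ip4 := ip.To4()
	if ip4 == nil {
		return false
	}
	return ip4.IsMulticast() || ip4.Equal(net.IPv4bcast)
}

// fragNeededAllowed reports whether an ICMP "frag needed" may be generated
// for a datagram with destination dst. sentAsLinkBroadcast says whether the
// frame carrying it had a broadcast/multicast destination MAC.
func fragNeededAllowed(dst net.IP, sentAsLinkBroadcast bool) bool {
	if sentAsLinkBroadcast {
		return false
	}
	if dst.To4() == nil {
		return false // not IPv4: outside the scope of this sketch
	}
	return !isGroupIP(dst)
}

// hasGroupSource reports whether src is a broadcast/multicast address,
// i.e. a source that a receiving stack is expected to discard silently.
func hasGroupSource(src net.IP) bool {
	return isGroupIP(src)
}

func main() {
	for _, s := range []string{"10.0.0.7", "225.3.2.1", "255.255.255.255"} {
		ip := net.ParseIP(s)
		fmt.Printf("%-16s frag-needed allowed: %-5v group source: %v\n",
			s, fragNeededAllowed(ip, false), hasGroupSource(ip))
	}
}
```

Swapping source and destination on a broadcast frame yields exactly the case `hasGroupSource` flags, which is why the resulting ICMP packet is dropped by the receiving stack.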
Good find re RFC1122. That settles it then. Apologies for the noise. Carry on as you were :) |
While scrutinizing handleUDPPacketFunc, I've noticed some issues when an attempt to relay a broadcast frame fails with FrameTooBigError (https://github.com/zettio/weave/blob/c74ffe6369b13b509d3941979eabfcef1531f5be/router/router.go#L349).
First, the ICMP "fragmentation needed" packet is generated simply by swapping the source and destination IP and ethernet addresses from the original frame. For a broadcast frame, this means we produce a packet with a broadcast/multicast source IP and broadcast source MAC. I expect that this contravenes some specification at both layers. Worse, it leads to a bug in weave: when the ICMP packet arrives at its destination peer, the broadcast source MAC will be inserted into the MAC cache, breaking the assumption that the MAC cache never contains broadcast MACs. So further broadcasts on that peer will be treated by weave as a unicast to the peer that encountered the FrameTooBigError. (Admittedly, this will be fiddly to reproduce in practice: It will need three peers connected as A--B--C, where the B--C link is narrower than A--B.)
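Independently of how the ICMP generation is fixed, the MAC cache invariant mentioned above could also be defended at the point of insertion. The following is only a sketch of that idea: `macCache`, `newMacCache` and `learn` are hypothetical names and do not correspond to weave's actual cache implementation. It refuses to learn any source MAC with the group bit set (the least significant bit of the first octet), which covers both the broadcast address and multicast MACs.

```go
package main

import (
	"fmt"
	"net"
)

// isGroupMAC reports whether mac is a 6-byte Ethernet address with the
// group (I/G) bit set. ff:ff:ff:ff:ff:ff and all multicast MACs qualify.
func isGroupMAC(mac net.HardwareAddr) bool {
	return len(mac) == 6 && mac[0]&0x01 != 0
}

// macCache is a toy stand-in for a MAC-to-peer table (hypothetical type).
type macCache struct {
	peers map[string]string // MAC in string form -> peer name
}

func newMacCache() *macCache {
	return &macCache{peers: make(map[string]string)}
}

// learn records that src was seen via peer. It returns false, and records
// nothing, when src is a broadcast/multicast MAC.
func (c *macCache) learn(src net.HardwareAddr, peer string) bool {
	if isGroupMAC(src) {
		return false
	}
	c.peers[src.String()] = peer
	return true
}

func main() {
	c := newMacCache()
	bcast := net.HardwareAddr{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	ucast := net.HardwareAddr{0x02, 0x00, 0x00, 0x00, 0x00, 0x01}
	fmt.Println("learned broadcast source:", c.learn(bcast, "B"))
	fmt.Println("learned unicast source:", c.learn(ucast, "B"))
}
```

A guard like this would only mask the symptom; the frame with the broadcast source should not be produced in the first place.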
Second, because FrameTooBigError is returned as an error from RelayBroadcast, it means that relaying of a broadcast may be cut short. It would be better if it attempted to relay the broadcast on all links, even if some of them hit PMTU limits.
I haven't been able to find an RFC or similarly authoritative source that says how DF and broadcast/multicast are supposed to interact. But the Linux kernel never produces ICMP "fragmentation needed" packets in response to broadcast/multicast packets (https://github.com/torvalds/linux/blob/master/net/ipv4/icmp.c#L577). I propose that weave should behave similarly, swallowing the FrameTooBigError for broadcasts so that it attempts to relay on all links.
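The proposed behaviour can be sketched as follows. This is an illustration only, not the actual RelayBroadcast code: the `link` interface, the `mtuLink` fake and the `EPMTU` field on `FrameTooBigError` are assumptions made so the example is self-contained. The loop skips links that report FrameTooBigError and carries on with the rest; any other error still aborts the relay, which is one possible choice rather than something settled here.

```go
package main

import (
	"errors"
	"fmt"
)

// FrameTooBigError stands in for weave's error of the same name.
// The EPMTU field is an assumption for this sketch.
type FrameTooBigError struct{ EPMTU int }

func (e FrameTooBigError) Error() string {
	return fmt.Sprintf("frame too big (effective PMTU %d)", e.EPMTU)
}

// link is a hypothetical interface for "something a frame can be sent on".
type link interface {
	forward(frame []byte) error
}

// mtuLink is a fake link that rejects frames longer than mtu and counts
// the frames it accepted.
type mtuLink struct {
	mtu  int
	sent int
}

func (l *mtuLink) forward(frame []byte) error {
	if len(frame) > l.mtu {
		return FrameTooBigError{EPMTU: l.mtu}
	}
	l.sent++
	return nil
}

// relayBroadcast tries every link. FrameTooBigError is swallowed so that a
// narrow link does not cut the broadcast short; other errors are returned.
func relayBroadcast(links []link, frame []byte) error {
	for _, l := range links {
		err := l.forward(frame)
		var tooBig FrameTooBigError
		if errors.As(err, &tooBig) {
			continue // no ICMP for broadcasts; move on to the next link
		}
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	wide1, narrow, wide2 := &mtuLink{mtu: 1500}, &mtuLink{mtu: 900}, &mtuLink{mtu: 1500}
	err := relayBroadcast([]link{wide1, narrow, wide2}, make([]byte, 1200))
	fmt.Println("error:", err)
	fmt.Println("frames sent per link:", wide1.sent, narrow.sent, wide2.sent)
}
```

With the narrow link in the middle of the list, the links after it still receive the frame, which is the point of the proposal.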