problems with handling of FrameTooBigError for broadcast packets #419

dpw · 2015-02-25T13:21:31Z

While scrutinizing handleUDPPacketFunc, I've noticed some issues when an attempt to relay a broadcast frame fails with FrameTooBigError (https://github.com/zettio/weave/blob/c74ffe6369b13b509d3941979eabfcef1531f5be/router/router.go#L349).

First, the ICMP "fragmentation needed" packet is generated simply by swapping the source and destination IP and ethernet addresses from the original frame. For a broadcast frame, this means we produce a packet with a broadcast/multicast source IP and broadcast source MAC. I expect that this contravenes some specification at both layers. Worse, it leads to a bug in weave: when the ICMP packet arrives at its destination peer, the broadcast source MAC will be inserted into the MAC cache, breaking the assumption that the MAC cache never contains broadcast MACs. So further broadcasts on that peer will be treated by weave as a unicast to the peer that encountered the FrameTooBigError. (Admittedly, this will be fiddly to reproduce in practice: It will need three peers connected as A--B--C, where the B--C link is narrower than A--B.)

Second, because FrameTooBigError is returned as an error from RelayBroadcast, it means that relaying of a broadcast may be cut short. It would be better if it attempted to relay the broadcast on all links, even if some of them hit PMTU limits.

I haven't been able to find an RFC or similarly authoritative source that says how DF and unicast/multicast are supposed to interact. But the Linux kernel never produces ICMP "fragmentation needed" packets in response to broadcast/multicast packets (https://github.com/torvalds/linux/blob/master/net/ipv4/icmp.c#L577). I propose that weave should behave similarly, swallowing the FrameTooBigError for broadcasts so that it attempts to relay on all links.

rade · 2015-02-27T10:28:43Z

Is swallowing FrameTooBigError for broadcast/multicast really the right solution? We don't have to follow slavishly what the Linux kernel does. What's wrong with injecting ICMP "fragmentation needed" packets for broadcast/multicast?

#3 is somewhat related.

dpw · 2015-02-27T21:21:26Z

I'm not suggesting following what the Linux kernel does just because Linux does it, but because it sets a precedent that it seems wise to follow. Due to the lack of support in Linux, applications cannot be relying on ICMP frag-needed responses to broadcast/multicast, and given its problematic nature for weave, it seems best to give up on it.

Of course, fragmentation of IP broadcast/multicast packets is a marginal case, as broadcast packets are never routed, and multicast packets are only routed if you go to the trouble of setting up a multicast routing daemon, which few people ever do. But I've done some experiments to elucidate the Linux behaviour. I wrote a trivial UDP sender program which sets IP_MTU_DISCOVER to DO or DONT as required (to avoid being affected by /proc/sys/net/ipv4/ip_no_pmtu_disc). It also sets IP_MULTICAST_TTL to 10. To observe the results, I simply used tcpdump.

For broadcast:

If you attempt to send a UDP packet to 255.255.255.255, without DF, and it is larger than the MTU of the outgoing interface, it gets fragmented.
If you attempt to send a UDP packet to 255.255.255.255, with DF, and it is larger than the MTU of the outgoing interface, the send fails with EMSGSIZE.

Broadcast packets are never forwarded by routers, so there are no more cases to consider for them. And because routers do not forward them, they will not fragment them, and PMTU discovery does not apply - applications cannot be doing PMTU discovery for broadcast packets. So attempting to send an ICMP frag-needed in this case serves no purpose. Without a significant change in approach, weave cannot replicate the behaviour applications expect for DF broadcast packets (i.e. EMSGSIZE or best-effort delivery).

For multicast packets, the interaction with the MTU on the originating machine is the same as for broadcast packets: Without DF, the packet is fragmented; with DF, the send fails with EMSGSIZE.

Next, I set up smcroute in order to test multicast routing. The host Linux kernel on my laptop is acting as a router between the internal virtual network to which VMs are attached, and my home network. The sending program is run within a VM, and results are observed from another machine on my home network. smcroute runs on the host, configured to enable multicast routing of 225.3.2.1 from the virtual network to the real network. I have manually set the MTU of the interface to the home network to 900, while the virtual bridge remains at 1500, so that sufficiently large multicast packets must be fragmented to be forwarded by the host kernel. I confirmed that multicast routing was working for small packets.

The results for large multicast packets are:

Without DF, the packet is fragmented during forwarding.
With DF, the packet is dropped at the router, and no ICMP frag-needed packet is sent back to the sending machine. The FragFails field in /proc/net/snmp increments with each unrouted packet.

I conclude that applications cannot be doing PMTU discovery for IP multicast by means of frag-needed responses.

rade · 2015-02-28T15:31:43Z

IIRC, for non-broadcast/multicast destinations, the kernel will do the following

a) when IP_MTU_DISCOVER is set to IP_PMTUDISC_WANT, the kernel fragments the outbound packet according to the cached PMTU, and sets DF
b) the kernel pays attention to incoming "frag needed" packets, updating the cached PMTU for the destination that appears as the source of the "frag needed" packet

Which of a) and b) do not happen for multicast/broadcast destinations?

dpw · 2015-03-04T17:35:25Z

Which of a) and b) do not happen for multicast/broadcast destinations?

I was in the process of writing a response to this based on the behaviour of Linux and BSD, when I found a line in RFC1122 that I had previously overlooked (section 3.2.2, top of page 38):

An ICMP error message MUST NOT be sent as the result of receiving:
[...]

a datagram destined to an IP broadcast or IP multicast address, or

That means that we should not send "frag needed" in response to broadcast/multicast packets, even if we were to implement the proposal in #3.

As things stand, where we exchange the source and destination IP addresses when generating a "frag needed" ICMP packet, we are contravening another MUST in RFC1122 (section 3.2.1.3, middle of page 30):

When a host sends any datagram, the IP source address MUST be one of its own IP addresses (but not a broadcast or multicast address).

And note later in the same section:

A host MUST silently discard an incoming datagram containing an IP source address that is invalid by the rules of this section.

Both the Linux and BSD stacks implement this requirement. This answers your question (b): when the weave router generates a "frag needed" ICMP packet in response to a broadcast/multicast, it gets silently discarded.

And what if we were to follow the proposal in #3 and generate these packets, even though 3.2.2 says we mustn't and so IP stacks won't be expecting them? I haven't found a rule in RFC1122 to say that they must be discarded by the recipient. The BSD IP code does discard them silently, early on in the ICMP receive path. It's less obvious what the Linux IP code does with them. It may well ignore them. If it doesn't, it probably counts as a bug.

To answer (a): Linux has no relevant special handling when sending multicast or broadcast packets. With IP_MTU_DISCOVER set to IP_PMTUDISC_WANT, there will be no path MTU known for a broadcast destination, and as a consequence, fragmentation is not performed at the source and DF is always set. And so, if the packet is too large, it is silently dropped.

rade · 2015-03-04T17:57:15Z

Good find re RFC1122. That settles it then. Apologies for the noise. Carry on as you were :)

rade added the bug label Feb 25, 2015

dpw added the in progress label Feb 26, 2015

dpw self-assigned this Feb 26, 2015

dpw mentioned this issue Feb 27, 2015

don't lie about source of injected icmp 3.4 #3

Open

dpw mentioned this issue Mar 4, 2015

419 broadcast frametoobigerror #433

Merged

rade closed this as completed in 0f7f49d Mar 6, 2015

rade removed the in progress label Mar 6, 2015

dpw removed their assignment Apr 8, 2015

rade modified the milestone: 0.10.0 Apr 18, 2015

rade mentioned this issue Oct 5, 2015

large multicast packets get dropped #1507

Closed
