Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"eth0: hw csum failure" logging #2723

Closed
wants to merge 3 commits into from

Conversation

6by9
Copy link
Contributor

@6by9 6by9 commented Oct 17, 2018

For dicussion. I'll see if the person on #2713 who has been reproducing this reliably can get some more logging for us before suggesting we merge this generally.
This will have a small performance hit as it s/w checksums every packet.

6by9 added 3 commits October 17, 2018 14:04
Simple logging in the driver should the checksum offload value
not match a software check.
This will have a small performance impact, but is necessary to
try and identify why we are getting "eth0: hw csum failure"
messages.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
Simple logging in the driver should the checksum offload value
not match a software check.
This will have a small performance impact, but is necessary to
try and identify why we are getting "eth0: hw csum failure"
messages.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

@MilhouseVH : @popcornmix has said that you have been seeing these "hw csum failure" messages and are in a position to create kernel images fairly easily.
Is there any chance you could pick up these commits and see whether they dump out some useful information for us? Don't mind if it is against 4.18 or 4.19 - whatever you think is more likely to be able to provoke the issue.

There's a slight reluctance to merge this to the main tree as it will have a small performance hit on every single received packet, but if we can't gain any data from other sources then we may have to. We still can't reproduce :-(

@MilhouseVH
Copy link

@6by9 yep, I'll include them in my 4.18.y test builds from tonight, and also perform my own testing - I've found that with 4.19-rc8 (my LibreELEC test build #1016x, without any reverts) the following is sufficient to trigger a kernel backtrace LibreELEC:

echo hello | curl -F 'f:1=<-' http://ix.io

which then produces the following backtraces (all of them - log is clear until the above command is executed):

http://ix.io/1poD

A 4.19-rc8 kernel with the two reverts 5ca4ac2 and f2c248f does not produce a backtrace.

@MilhouseVH
Copy link

ix.io is also a trigger for last night's 4.18.14 without any reverts.

@MilhouseVH
Copy link

This is an ix.io-triggered backtrace when using the latest rpi-4.18.y branch and this PR:

http://ix.io/1poM

I'm not seeing any additional logging (which may suggest the problem is elsewhere?)

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

Thanks.
I'm building 4.19-rc8 to see if I can get anything from that. The curl ... ix.io isn't throwing any errors for me with either LAN chip for me on 4.14.76.

Your callstacks do include a load of nf (netfilter) functions. Have you got any odd iptables entries?
My default Raspbian build only has accept all

pi@raspberrypi:~ $ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

I'm not seeing any additional logging (which may suggest the problem is elsewhere?)

That is great news - many thanks.
I'll add a dump to the same point that logs the "eth0: hw csum failure" message and hope nothing has been chopped off that is necessary.

@MilhouseVH
Copy link

Yes in LibreELEC we have a few additional rules:

rpi22:~ # iptables -S
-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER-USER
-N private-subnets
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i tether -p udp -m conntrack --ctstate NEW -j ACCEPT
-A INPUT -j private-subnets
-A FORWARD -i tether -m conntrack --ctstate NEW,RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o tether -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -j private-subnets
-A private-subnets -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A private-subnets -s 10.0.0.0/8 -i eth+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 172.16.0.0/12 -i eth+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 192.168.0.0/16 -i eth+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 10.0.0.0/8 -i en+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 172.16.0.0/12 -i en+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 192.168.0.0/16 -i en+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 10.0.0.0/8 -i wl+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 172.16.0.0/12 -i wl+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 192.168.0.0/16 -i wl+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 10.0.0.0/8 -i tether -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 172.16.0.0/12 -i tether -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 192.168.0.0/16 -i tether -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 10.0.0.0/8 -i docker+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 172.16.0.0/12 -i docker+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -s 192.168.0.0/16 -i docker+ -m conntrack --ctstate NEW -j ACCEPT
-A private-subnets -j REJECT --reject-with icmp-port-unreachable

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

Hmm, that may be the clue.
I'll sort the packet dump patch first, and then play with iptables.

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

Sorry, messing with branches having switched to testing 4.19.

https://github.com/6by9/linux/tree/rpi-4.19.y-net top commit should log the content of the packet that caused the hw csum failure.

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

No, I can't get 4.19-rc8 to throw the error either, even with all your iptables rules loaded. I hope you have more success.

@MilhouseVH
Copy link

Current rpi-4.19.y (c2f4561) plus rpi-4.19.y...6by9:rpi-4.19.y-net, triggered with echo hello | curl -F 'f:1=<-' http://ix.io: http://ix.io/1ppj

@MilhouseVH
Copy link

ix.io will also to trigger a hw csum failure on RPi1 with LibreELEC 4.19-rc8.

This is the kernel log from an RPi1 with rpi-4.19.y plus rpi-4.19.y-net, same "hello" ix.io trigger:

http://ix.io/1ppy

My network configuration is nothing fancy, aside probably from the LibreELEC iptables rules. It's RPi1/RPi3+ using genuine RPF white PSUs, both wired with CAT5

  1. into a Netgear GS608 v3 (8-port unmanaged switch)

s1

  1. which is uplinked (CAT5 -> CAT6 -> CAT5) to a DGND4000 router/gateway

s2

  1. that is using BT ADSL broadband for internet access.

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

A big thank you for that.
No time to do a full analysis now, but that seems to have all the IP packet there. 192.168.0.18 appears to be one end of the link, and 42 ac 0b 49 the other (66.172.11.73 of ix.io I suspect). Protocol 06 is TCP. We'll see what else is in there when I have a few more minutes.

@MilhouseVH
Copy link

Yes, these are my LAN IP addresses:

DGND4000: 192.168.0.1
LibreELEC RPi1: 192.168.0.8
LibreELEC RPi3+: 192.168.0.18

There's another RPi3+ (192.168.0.4, wired into the DGND4000) running Raspbian (Stretch, 4.14.49-v7+ #1120) and dnsmasq (DNS + DHCP) for the entire LAN, just in case that is relevant.

And yes, 66.172.11.73 appears to be ix.io:

neil@nm-linux:~/projects$ host ix.io
ix.io has address 66.172.11.73
ix.io mail is handled by 10 ix.io.

@pelwell
Copy link
Contributor

pelwell commented Oct 17, 2018

The traffic is a sequence of single byte TCP packets from the web server to the client. The bytes make up part of a URL: tp://ix.. Apart from the data byte and sequence numbers, the packets are the same. Decoding the first gives:

Internet Protocol Version 4
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x02 (DSCP: CS0, ECN: ECT(0))
Total Length: 53
Identification: 0xaa6e (43630)
Flags: 0x02 (Don't Fragment)
Fragment offset: 0
Time to live: 47
Protocol: TCP (6)
Header checksum: 0x92ad
Header checksum status: Unverified
Source: 66.172.11.73
Destination: 192.168.0.8
Source GeoIP: Unknown
Destination GeoIP: Unknown
Transmission Control Protocol
Source Port: 80
Destination Port: 36648
Stream index: 0
TCP Segment Len: 1
Sequence number: 0xf68f7040
Next sequence number: 2
Acknowledgment number: 0x9d629ad5
1000 .... = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
Window size value: 243
Calculated window size: 243
Window size scaling factor: -1 (unknown)
Checksum: 0x96e3
Checksum Status: Unverified
Urgent pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
TCP Option - No-Operation (NOP)
TCP Option - No-Operation (NOP)
TCP Option - Timestamps: TSval 667403842, TSecr 1616240471
SEQ/ACK analysis
Bytes in flight: 1
Bytes sent since last PSH flag: 1
TCP payload (1 byte)
TCP segment data (1 byte)

I think it's odd (because it's so inefficient) to be pushing single bytes at this point unless the receive window is closed.

@pelwell
Copy link
Contributor

pelwell commented Oct 17, 2018

I get a similar pattern of packet sizes running curl on my Mac. The first packet contains 157 bytes of payload - the full HTTP header, two blank lines and an 'h'. The second is just a 't', as is the third. The fourth is 'p', etc.

The total length of the first packet is 223 bytes - nowhere near the MTU.

@MilhouseVH
Copy link

MilhouseVH commented Oct 17, 2018

4.14.62 #0821 doesn't have the hw csum failure when testing with ix.io

4.18.4 #0822 does have the hw csum failure when testing with ix.io (http://ix.io/1pqn)

So ix.io wasn't/isn't a problem with 4.14.y (I stopped building 4.14.y after 4.14.62).

@pelwell
Copy link
Contributor

pelwell commented Oct 17, 2018

I'm convinced the problem will be at our end, but it may be triggered by a peculiarity in the source. Perhaps the small packets are because the content is being generated a byte at a time by a server script, and the first gets the header prepended.

@MilhouseVH
Copy link

By the way, even though I'm getting these hw csum failures with ix.io, I AM able to upload stuff successfully and there's no obvious problem at the command line. The uploaded content also appears to be uncorrupted, so the kernel messages are "noise", at least in this case.

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

By the way, even though I'm getting these hw csum failures with ix.io, I AM able to upload stuff successfully and there's no obvious problem at the command line. The uploaded content also appears to be uncorrupted, so the kernel messages are "noise", at least in this case.

Yup, from https://github.com/raspberrypi/linux/blob/rpi-4.14.y/net/core/datagram.c#L761 I don't see it aborting/discarding the packet after logging the error, although I suspect the caller will be looking at the result.
I suspect that it is that the stack has recomputed the csum through software because the hw has reported a csum error, and decided to ignore the hw offloaded result, but confirming that would take a fair amount of digging.

@6by9
Copy link
Contributor Author

6by9 commented Oct 17, 2018

Thanks Phil for doing the analysis - I had to go out for the evening.

I'm convinced the problem will be at our end, but it may be triggered by a peculiarity in the source. Perhaps the small packets are because the content is being generated a byte at a time by a server script, and the first gets the header prepended.

It is only showing up after the change "pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends".
The logic in that all holds together, but the fact that 6by9@4a098c7#diff-40fa73c9d0083043c6576dd2b40511e4 doesn't log an error when we check the hw offload csum in the driver. That implies something else is handling the csums wrong, and I'm not sure how that can be us.

Plan for the morning will be to throw a WARN into pskb_trim_rcsum_slow to see where it gets called from, as that would be the point that the CHECKSUM_COMPLETE was previously getting cleared.
It may be telling that the packets are all an odd number of bytes in length. The checksum is the 1's complement sum of the 16bit words in the packet, with a 0 padding byte added at the end if you have an odd number of bytes. I'd expect that to go wrong far more often if it were that trivial though.

I'll also grab Milhouse's nightly build and see if that allows us to reproduce it on the corporate network (it may be networking infrastructure related).

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

I get a similar pattern of packet sizes running curl on my Mac. The first packet contains 157 bytes of payload - the full HTTP header, two blank lines and an 'h'. The second is just a 't', as is the third. The fourth is 'p', etc.

The total length of the first packet is 223 bytes - nowhere near the MTU.

Testing here in the office with my normal setup (isolated network nat'ed to the corporate network via Ubuntu on VirtualBox) I get a 220 byte packet (166 of TCP data) with the full HTTP header and http://ix.. Then a 62 byte packet (8 bytes of TCP) with io/1IXN<cr>.
My Ubuntu laptop on the guest Wifi I get the header and h, then t, t, p, t, :, \, i, x, ., io/1IXN<cr>.
Switch to the Pi direct to the corporate network and I see the same as the laptop.

I suspect VirtualBox's routing may be being smarter than it should :-(
Still no csum errors though.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

72534f6 pushed to my rpi-4.19.y-net branch to dump the packet and parameters on any call to pskb_trim_rcsum_slow (the function that was modified by upstream).

I'm seeing it called on some multicast packets, but not when trying curl. The packets tend to be being trimmed by 1 or 6 bytes, and all instances for me are zero padded so the checksum doesn't change.
Wireshark is indeed showing that excess bytes have been captured on the wire which are not part of the IP packet, which is why ip_rcv_core is trimming them from the frame being passed upwards.

@MilhouseVH Any chance you can try it?

There's obviously some subtle thing I'm missing at the moment. I may head home early and try it on my home broadband to see if the change in networking setup helps.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

Looking in more detail at the trimmed multicasts, trimming by 6 bytes is happening on IGMPv3 membership reports/join messages from another Pi. They have been sent as 60 bytes on the wire, but only 40bytes of IP + 14 bytes of ethernet header. The remaining 6 bytes are observed to be 0.
Looking at the same packet on a PC and I see the same padding.
That makes sense as ethernet has a minimum frame length of 64 bytes (inc 4 byte CRC).

The other multicast messages being trimmed by 1 are typically SSDP messages. Both Windows and the Pi agree that they are 214 bytes on the wire: 14 bytes of ethernet header, and 199 bytes of IP, with one byte left over (Wireshark incorrectly tags it as VSS-monitoring ethernet trailer).
If it's reported by both Windows and the Pi, then I'd say that there is a reasonable chance that the source device sent it. If the source device sent it.
Then again if the hardware has computed the checksum over the whole frame, then subtracting the sum of whatever is in there is also valid.

I'm still much confused over this :-/

@MilhouseVH
Copy link

@MilhouseVH Any chance you can try it?

http://ix.io/1psZ

ix.io test performed at [ 61.071106].

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

Perfect. Thank you.

I'm suspecting there's a bug in csum_sub.
csum was a248 now aa00a0ce. That looks like the result has been previously compressed down to the bottom 16 bits, which may make the wrapping in the subtraction wonky. (It's meant to be a 16bit sum, but is computed in 32 bit chunks for efficiency. The aa00a0ce needs to be passed through csum_fold to reduce it).

Disecting the buffer first to confirm that the hw offload csum is valid for the full buffer length.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

Head and brick walls. Endianness screws things over too.
Taking the raw packet

[   61.586930] pskb_trim_rcsum_slow called, csum was 704b now a8006e03, skb->len 58, delta 5
[   61.587730] raw data: 00000000: 45 02 00 35 63 dd 40 00 2f 06 d9 34 42 ac 0b 49  E..5c.@./..4B..I
[   61.587748] raw data: 00000010: c0 a8 00 12 00 50 df ac 16 a0 90 56 40 6e 07 7f  .....P.....V@n..
[   61.587759] raw data: 00000020: 80 18 00 f3 d9 a9 00 00 01 01 08 0a 28 c2 ec e1  ............(...
[   61.587770] raw data: 00000030: dc 1b 9d c7 2f
                                                  d5 7e fc da 75                    ..../.~..u

I make the checksum of the complete packet 0x4b70. Endian-swap and I'll match the 704b that was logged.

Remove the last 5 bytes and I compute the checksum of that packet to be 0xf128. It works if you do longhand, or compute the checksum of 00 d5 7e fc da 75 (0x5a47), invert (0xa5b8) and add it to 0x4b70.

0xa8006e03 folds down into 0x1604, so I don't see that computation being correct when I compute it to be 0xf128 (or is it 0x28f1).

I'll try to parcel this packet up into a form that I can replay at a Pi. This looks to be all about underflows/overflows/carrys of the checksum maths, so the network conditions (IP addresses, port numbers, anything) will be affecting it.
I have no idea where the trailer has come from, but it's there and seems to be the thing that is screwing up the checksums.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

EUREKA!!! eth0: hw csum failure using the recreation of that packet.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

To save those needing to recreate it, packet capture of 1 packet attached.
Pi needs MAC address b8 27 eb 2c 7a 91, and ip address 192.168.0.18.

edited.zip

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

Not endian-swapped in the logging - 0x704b is incorrectly byte swapped from the driver for some reason.

@pelwell
Copy link
Contributor

pelwell commented Oct 18, 2018

		skb->csum = ntohs((u16)(rx_cmd_b >> RX_CMD_B_CSUM_SHIFT));

Why would any hardware designer in their right mind make the content of a register host order dependent? In fact, how could they?

@pelwell
Copy link
Contributor

pelwell commented Oct 18, 2018

Scratch that - they are packet trailers, so in memory.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

skb->csum = ntohs((u16)(rx_cmd_b >> RX_CMD_B_CSUM_SHIFT));

is used on every other packet fine, just not this one. skb->csum is on the host system and needs to be handled there correctly as a 16 bit value, hence needing ntohs.
Someone somewhere appears to be swapping it. I have the patch in this PR that software validates the hw checksum in the lan78xx driver, so I'm expecting that it will be OK.

I'm adding logging in the driver to print skb->csum for every packet now, but unfortunately touched checksum.h so it's a bigger rebuild than otherwise.

Further note that I'm on a 3B+ in this case.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

Life is very confusing as to which way around is correct to deal with this data! The checksum computed by the driver as 704b is the correct value that the kernel is wanting. It is therefore wanting to subtract 58000247 / 5a47 to give 28f1.

I think I have concluded that it is pskb_trim_rcsum_slow / skb_checksum going wrong should the bit to be clipped start on an odd boundary. Amending it to read

int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
{
	if (skb->ip_summed == CHECKSUM_COMPLETE) {
		int delta = skb->len - len;
		__wsum old_csum = skb->csum;
		__wsum frag_csum = skb_checksum(skb, len, delta, 0);

		if (len & 1) {
			u32 sum = (__force u32)frag_csum;
			sum = (sum & 0xffff) + (sum >> 16);
			sum = (sum & 0xffff) + (sum >> 16);
			frag_csum = ((sum & 0xFF) <<8) | ((sum&0xFF00) >> 8);

		}
		skb->csum = csum_sub(skb->csum, frag_csum);
		if (net_ratelimit()) {
			WARN(1, "%s called, csum was %x now %x (subtracted %x), skb->len %d, delta %d",
			     __func__, old_csum, skb->csum,
			     skb_checksum(skb, len, delta, 0), skb->len, delta);
			print_hex_dump(KERN_ERR, "raw data: ",
				       DUMP_PREFIX_OFFSET, 16, 1, skb->data,
				       skb->len, true);
		}
	}
	return __pskb_trim(skb, len);
}

and I appear to get no csum failures.

skb_checksum should be doing this itself as it tries to compensate for the starting offset, at least in the lib version of checksum.c
I fully expect that arm has an optimised version of that call, but I haven't gone digging as yet.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

OK, daft thought. Seeing as the suspicion is that this is generic, do we have a > 4.14.71, or 4.18 / 4.19 x86/x86_64 kernel build that we can throw this packet at (after suitably modifying the MAC address)?

I know I don't have one immediately to hand, but do the Milhouse x86_64 builds fit the bill?
If it fails on x86/x86_64, then we're well in the clear and we can just throw it upstream.

@MilhouseVH
Copy link

My current LibreELEC x86_64 build is 4.18.14: http://forum.kodi.tv/showthread.php?tid=298462&pid=2783648#pid2783648

There is a LibreELEC 4.19-rc8 x86_64 build here: https://forum.kodi.tv/showthread.php?tid=298462&pid=2783419#pid2783419

I'm not sure how to "replay" this data, but if you can write LibreELEC 8.90.006 to a USB memory stick (4GB+) then you boot into run mode, and from there update the memory stick using either of the two "Milhouse" builds (copy the tar file into the "Update" share, then reboot from the memory stick - it will automatically use the last selected boot mode, ie. run when booting from memory stick)

I've tested ix.io on a Skylake NUC (4.18.14/4.19-rc8, ip 192.168.0.20, Intel WiFi), but no panic. Same with a Revo 3700 (4.18.14/4.19-rc8, ip 192.168.0.12, wired r8169 ethernet) - no panic.

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

Thanks Milhouse. I've now got to think what spare x86 kit I have available. Some old Acer Revo's 3610's (Atom and NVidia ION) are probably about all the sacrificial stuff.

There is also the fact that there are network driver options to force alignment of packets within the IP stack, so that may be masking stuff on other platforms.

If you fancied trying it, then the easiest approach I've found is to add an ethernet interface with a back-to-back connection to the device being tested so switches don't filter out messages on you. There's no DHCP etc, so the source device doesn't start sending any extraneous junk. Add the relevant static IP address to the DUT.
Replay with sudo bittwist -i <interface> <file> (bittwist is in the Debian repos).
The pcap file format is actually very simple to hack. In this case we only wnat to replace the MAC address (b8:27:eb:2c:7a:91) with that of the DUT.

Please feel under no obligation to try this. Now we can reproduce it I should be able to make progress. Having said that, I'm out of the office tomorrow, so it may have to wait for next week.

@MilhouseVH
Copy link

Yeah sorry, I'm not likely to be able to run those tests on my x86_64 kit but I can of course test any commits on the RPi1/RPi3+ as and when you think you may have something worth testing. Part of the problem with me cobbling something together on my x86_64 kit is that the lack of a panic is likely to be due to my own ineptitude as different network hardware etc.

LibreELEC 8 and 9 should work with your Revo 3610 - it's an ION1 GPU which should be supported. Make sure ssh is enabled (username: root, password: libreelec) in the Kodi first run wizard. LibreELEC is not a general purpose distribution so a utility like bitttwist isn't available but sometimes copying a statically linked binary from Debian to LibreELEC has been known to work.

Booting LibreELEC from the USB memory stick in run mode is not destructive - LibreELEC will use available space on the USB memory stick for storage, and also save settings (ie. network config etc.) to the memory stick which simplifies repeated testing. There's also live mode which boots entirely from RAM and is again not destructive.

Just don't run the default installer mode through to completion, as that will format your internal storage (after several confirmation prompts...)

@6by9
Copy link
Contributor Author

6by9 commented Oct 18, 2018

No problem, it was just one of those daft thoughts as I was about to leave the office. I'm very appreciative of the efforts you've put in to help us on this one.

I've got a couple of ideas over things to check on the Pi kernel, and I'll take my Revo into the office on Monday to have a play.

@catschulze
Copy link
Contributor

catschulze commented Oct 20, 2018

I think I have reproduced how the incrorrect checksum in the example above was calculated by the kernel code:

[ 61.586930] pskb_trim_rcsum_slow called, csum was 704b now a8006e03, skb->len 58, delta 5
[ 61.587730] raw data: 00000000: 45 02 00 35 63 dd 40 00 2f 06 d9 34 42 ac 0b 49 E..5c.@./..4B..I
[ 61.587748] raw data: 00000010: c0 a8 00 12 00 50 df ac 16 a0 90 56 40 6e 07 7f .....P.....V@n..
[ 61.587759] raw data: 00000020: 80 18 00 f3 d9 a9 00 00 01 01 08 0a 28 c2 ec e1 ............(...
[ 61.587770] raw data: 00000030: dc 1b 9d c7 2f
d5 7e fc da 75 ..../.~..

  • Calculate the checksum of the full packet: 4502+0035+63DD+4000+2F06+D934+42AC+0B49+C0A8+0012+0050+DFAC+16A0+9056+406E+077F+8018+00F3+D9A9+0000+0101+080A+28C2+ECE1+DC1B+9DC7+2FD5+7EFC+DA75=A4B66, compress into 16 bit -> A+4B66=4B70. Matches what @6by9 computed above.
  • Calculate the checksum of the truncated packet: 4502+0035+63DD+4000+2F06+D934+42AC+0B49+C0A8+0012+0050+DFAC+16A0+9056+406E+077F+8018+00F3+D9A9+0000+0101+080A+28C2+ECE1+DC1B+9DC7+2F00=8F120, compress into 16 bit -> 8+F120=F128. Matches what @6by9 computed above.
  • Calculate the checksum of the part to be removed at the end: D57E+FCDA+7500=24758, compress into 16 bit -> 2+4758=475A. (Note that this is the naive and incorrect calculation that will explain the wrong result the kernel produces -- as @6by9 commented, the correct calculation should be 00D5+7EFC+DA75=15A46 aka 5A47 instead, which happens to be the byte-swapped value.)
  • Try to calculate the checksum of the truncated packet by subtracting the checksum of the part to the removed from the checksum of the full packet: 4B70-475A, rewrite as sum using complement -> 4B70+B8A5=10415, compress into 16 bit -> 0416. This is the compressed value of the incorrect kernel checksum value A8006E03 (after endianness conversion), as already seen in @6by9 's comment above. (A800+6E03=11603, 1+1603=1604)

So, what has happened? The root cause of our problems is that the IP checksum is computed over 16 bit data, and we ended up calculating the first part (4B70) starting from an even address, but the second part (475A) starting from an odd address.

When looking at the kernel code, I saw a comment that warned that incremental checksum calculations are only allowed if all fragments, except the last, have even length. Now I understand why -- in the example above, the first fragment has odd length, causing the second fragment to start at an odd address, messing up high and low bytes in the checksum calculation.

Edit: @6by9 already suspected something similar above ("skb_checksum should be doing this itself as it tries to compensate for the starting offset"). However, I am not sure whether it is actually skb_checksum's job to "compensate for the starting offset" in the sense we need it in this case here: skb_checksum seems to assume that the range of bytes it is checksumming starts at an even offset with respect to the network packet, but not necessarily with respect to memory addresses, and provides special treatment for the latter. The correction of skb_checksum's result for the fragment to be removed in @6by9 's update to pskb_trim_rcsum_slow() takes care of the former, and seems to be required to get both aspects covered correctly, in addition to the current code in the kernel.

Why is this not hitting a larger number of people? I can only assume that in many cases the fragment to be removed consists of zeros only (as in the dumps provided by @MilhouseVH
that were processed correctly and did not cause hw csum errors to be generated).

@6by9
Copy link
Contributor Author

6by9 commented Oct 21, 2018

@catschulze I agree with your analysis. What skb_checksum gets up to seems to be the crux of it.

The comments do say that all fragments except the last have to be even length, but the code appears at first glance to dispute that. The C reference version does check the start offset for odd/even, as does what I believe to be the ARM optimised version. The check does seem to be pretty noddy of checking the address of the buffer start for odd/even, when I'm not clear on the requirement for the buffer to be aligned. There are patches for 9514 which do fix the alignment.
Where I was getting lost was in __skb_checksum which iterates over the pages/fragments of the buffer.

I was going to mock up an SKB with buffer 00 d5 7e fc da 75 and checking that checksumming that with the 00 and with an offset of 1 gave the same results. Mocking up an SKB seems to be more involved than one expects. Totally overwriting a buffer may be quicker and easier.

Anyway, I'll be emailing net-dev tomorrow with what we've found.

@catschulze
Copy link
Contributor

catschulze commented Oct 21, 2018

@6by9 Looking at the prototype of do_csum()

static unsigned int do_csum(const unsigned char *buff, int len)

you can see that this function just receives the buffer location and length, but not the relative offset from the start of the packet. When working with (non-final) fragments of odd length, you would need both to correctly compute the checksum update:

  • You need to know whether the buffer starts at an odd address to check whether you can read the entire packet in 16 bit (or 32 bit) chunks (for increased efficiency or even CPU access alignment restrictions on certain architectures), or whether the first byte (or bytes) needs to be treated differently. For buffer content AA BB CC, this means calculating 00AA+BBCC or AABB+CCxx, and later byte-swapping the checksum in case we started with a single byte (to get the result as if we had calculated AABB+CCxx in both cases). This is what do_csum() does, see below. (And this is also why the function needs to be aware of the host's endianness, to use 00AA or AA00 as first part of the checksum. Note that the rest of the checksum always runs with the native endianness and does not convert to network byte order.)

odd = 1 & (unsigned long) buff;
if (odd) {
[...]
if (odd)
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);

  • When dealing with (non-final) fragments of odd length (what Linux seems to have avoided so far, before the problematic "performance patch" we are now dealing with was introduced), you will also need to know whether the fragment started at an even or odd offset from the beginning of the entire packet. This is required because an additional byte swap in the fragment checksum is required for fragments starting at an odd offset (see the example in my last comment above). The current code in Linux' network stack does not consider that yet, but your (@6by9) patch does:

if (len & 1) {
[...]
frag_csum = ((sum & 0xFF) <<8) | ((sum&0xFF00) >> 8);

Note that len is not memory address related (as the check involving buff in the first code fragment above was), but instead contains the required information of the offset of the fragment's start from the start of the entire packet.

(BTW, alignment of the packet or fragments is irrelevant to our discussion here; do_csum() will deal correctly with arbitrary alignment of the fragment's data buffer, and your code does not have to care at all, because it works with logical packet related offsets only. The only influence alignment has is efficiency related, because 32 bit alignment of fragments makes it easier for do_csum() to run with native 32 bit operations.)

IMO you should ask on net-dev to get a version of your patch (maybe excluding the debugging output) merged. Also provide a capture of the problematic packet, with non-zero padding and odd lengths of both initial fragment and padding -- if I am not mistaken, this effect should also be observable on other architectures as well (x86_64).

@MilhouseVH
Copy link

I can confirm that the hw csum failure problem is solved - at least for ix.io - with both 4.18.y and 4.19.y by reverting e5c9741 and 2a2fbe1, then applying the fix from Dimitris: https://www.spinics.net/lists/netdev/msg530020.html

@catschulze
Copy link
Contributor

@MilhouseVH Dimitris' patch from net-dev looks like an equivalent, but cleaner solution to what @6by9 had found above. It changes the call to csum_sub() into a call to csum_block_sub(), which gets len as additional argument and is therefore able to detect that an additional byte swap of the value returned from skb_checksum() is required in certain cases.

@6by9 It seems that this is now on its way to the stable kernel series, see LKML: https://lkml.org/lkml/2018/10/21/80. That means we might just have to wait for 4.14.79 to be released to get the full fix (until then, the revert 2a2fbe1 in our 4.14 tree should already avoid these error messages, with somewhat reduced efficiency due to discarding the hardware checksum). (As indicated by @MilhouseVH, 2a2fbe1 (and e5c9741?) should be reverted once we have Dimitris' patch.)

@6by9
Copy link
Contributor Author

6by9 commented Oct 22, 2018

@MilhouseVH Huge thanks for all your testing. That commit from Dimitris is spot on with where we were heading (it helps to know the calls that are available!)
e7bbc87 "net: udp: fix handling of CHECKSUM_COMPLETE packets" shouldn't really have been merged but looked like it might have been related. It ought to stay reverted until it is merged via the normal paths.

@catschulze I'd been looking at the do_csum code and seeing the use of odd as indirectly picking up the offset, but as you say it is only the optimisation for working on larger/more efficient operations and then deliberately swaps the butes at the end if odd.
Yes I think it will have broken all architectures and really depended on what level of checksum offload the LAN adapter supports and whether it got presented with padded data. Nice to know it had been tested so thoroughly.
2a2fbe1 only dropped us back to the same behaviour as we've had for the previous N years, so it's not as if it has a significant performance hit.

I'm going to close this PR and update the other issues relating to this. Thanks again for the assistance.

@6by9 6by9 closed this Oct 22, 2018
@6by9
Copy link
Contributor Author

6by9 commented Oct 22, 2018

@MilhouseVH Huge thanks for all your testing. That commit from Dimitris is spot on with where we were heading (it helps to know the calls that are available!)
e7bbc87 "net: udp: fix handling of CHECKSUM_COMPLETE packets" shouldn't really have been merged but looked like it might have been related. It ought to stay reverted until it is merged via the normal paths.

@catschulze I'd been looking at the do_csum code and seeing the use of odd as indirectly picking up the offset, but as you say it is only the optimisation for working on larger/more efficient operations and then deliberately swaps the butes at the end if odd.
Yes I think it will have broken all architectures and really depended on what level of checksum offload the LAN adapter supports and whether it got presented with padded data. Nice to know it had been tested so thoroughly.
2a2fbe1 only dropped us back to the same behaviour as we've had for the previous N years, so it's not as if it has a significant performance hit.

I'm going to close this PR and update the other issues relating to this. Thanks again for the assistance.

@MilhouseVH
Copy link

Thanks @6by9 and @catschulze for all your time digging into this!

@6by9 6by9 deleted the rpi-4.18.y-net branch July 2, 2019 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants