kernel 4.18.4 eth0: hw csum failure - Pi3b+ only #2659
Comments
How repeatable is the error? Do you see it every boot? Always after playing a video? and let us know if problem still occurs? |
The errors start reliably several minutes after boot. |
Can you test just |
|
I assume this is on a 3B+. |
I'm using rpi3+, no vlan configured. |
Module normally only gets loaded when you do configure VLAN tagging. |
In LibreELEC 9 with rpi-4.18.y, the config from LibreELEC/LibreELEC.tv#2913 is being used. A quick grep of the config for VLAN shows the following:
The same VLAN options are being used for rpi-4.14.y which doesn't appear to have any problems. |
Ok. If it already did so in previous versions, then that's probably not the problem (Have not seen any checksum problems in 4.17 or 4.18 myself, even though running network intensive software) |
I guess it's something we (@polojoe) can rule out - I added
No, me neither - my network seems to be frustratingly boring and problem free. :( |
Blacklist doesn't work for me |
I can confirm this issue happens on RPi 2 as well on kernel version 4.14.71. "Fixable" with
but that isn't a real solution. EDIT: I didn't bisect it, but the issue isn't present in 4.14.18. I'm going to use that version for now; it got broken somewhere between 4.14.18 and 4.14.71. |
I'm getting this without VLANs in use. I am using NAT and iptables, but not VLANs. RPi 3, |
@therealbstern and does the |
It did. |
@6by9 Any thoughts? Still seems to be a checksum offload issue in there. I vaguely remember some traffic on the linux netdev mailing list for this driver on checksums.... |
I'm going to be very strict here or things are going to get very confused. A set of patches for SMSC95xx (i.e. all Pis EXCEPT the Pi 3B+, so not this issue) has been proposed on net-dev https://www.spinics.net/lists/netdev/msg526520.html, with https://www.spinics.net/lists/netdev/msg526518.html relating to a checksum generation failure (i.e. on transmit) should the csum end up in the last 4 bytes of the packet. A respin of the patches has been requested, so they aren't in a shape that we can merge at the moment. |
I noticed this problem on my Pi 3b + and I can confirm that in my case the issue comes from outside. |
@gmzed Please confirm which kernel version you are running. I think I'd currently describe them as troublesome rather than necessarily bad. If something is producing genuinely bad packets then it would cause issues on almost all systems. To all seeing this message: Do you get the same callstack as in polojoe's original kernel log, showing that it was triggered via
netdev_rx_csum_fault which logs the message is called from several places in the IP stack, so we need to isolate the exact path being taken. |
@6by9 my kernel: 4.14.74-v7+ |
Thank you.
So straight TCP v4. Looking at polojoe's, the nf in |
Also started to see this issue lately, but I'm using the 3 model B, which uses smsc95xx. This is with the latest 4.18 kernel tree (rpi patches + stable).
|
Looking at my local network, the service generating this issue was iceccd (running on Ubuntu 16.04). After disabling iceccd on the Ubuntu machine I'm no longer getting the hw csum errors. Logs from tcpdump:
It is unclear whether this is really an issue at the sender or at the receiver. |
PLEASE DO NOT COMMENT ON THIS THREAD IF YOU ARE NOT USING A Pi 3B+. lan78xx and smsc95xx are two distinct drivers. As both are made by Microchip they may have common failures, however diagnosis MUST remain independent as it may well be totally independent issues. #2712 raised for SMSC9514 related issues. |
@gmzed If you have such a reliable way to reproduce this, could you run Wireshark or tcpdump on the Pi to capture all incoming traffic (nothing sensitive please!) whilst you trigger the issue?
|
@6by9 Here you are. It seems to me that the "TCP keep alive" packets are the cause of the issue in this case. |
@gmzed Thank you - I'll have a look. |
Not much progress here I'm afraid, and my network knowledge is sufficiently rusty to mean I need to go away and check a load of things. The transactions look bizarre, and I'm not sure I agree with Wireshark's diagnosis that they are Keep-alive ACKs, as those would normally be 0 length whilst these are length 1. Keep-alives are typically tens of minutes to hours apart. Just looking at the port 49761 <> 888 transactions,
As I say, for a keep-alive I would have expected length 0. The flip side is that I'd also expect an exponential backoff on TCP retries. @gmzed Would you be prepared to drop back to a 4.14 kernel to see if you can trigger it there under the same situation? There are differences in the lan78xx driver between our 4.14 and 4.18 branches, but nothing fundamental around offloading. If it isn't reproducible there then it must be something odd in the core. |
Forgive the intrusion, but that sounds like the kind of traffic pattern you would get if the receiver's window was full. Senders are meant to periodically probe the window by trying to send one byte, and receivers ACK the previous position if they have no space. The proof should be in the window size parameter of the ACK packets. |
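To make the distinction between a keep-alive and a window probe concrete, here is a minimal Python sketch of how a capture could be classified. The rules are an assumption modelled on the kind of heuristics a protocol analyser applies, not anything taken from the capture in this thread; `Segment`, `classify`, `next_expected_seq` and `peer_window` are hypothetical names, and sequence-number wraparound is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical, simplified view of one TCP segment from a capture."""
    seq: int     # sequence number of the first payload byte
    length: int  # payload length in bytes


def classify(seg: Segment, next_expected_seq: int, peer_window: int) -> str:
    """Classify a tiny segment (assumed heuristics, wraparound ignored).

    next_expected_seq: the next sequence number the receiver expects.
    peer_window: the last window size advertised by the receiver.
    """
    # A probe carries one new byte while the receiver advertises no space.
    if seg.length == 1 and seg.seq == next_expected_seq and peer_window == 0:
        return "zero-window probe"
    # A keep-alive re-sends the byte just before the expected position
    # (or nothing at all), so it never carries new data.
    if seg.length in (0, 1) and seg.seq == next_expected_seq - 1:
        return "keep-alive"
    return "data"


if __name__ == "__main__":
    print(classify(Segment(seq=1000, length=1), 1000, 0))
    print(classify(Segment(seq=999, length=1), 1000, 65535))
```

The point of the sketch is that payload length alone cannot separate the two cases: the advertised window in the ACKs and the sequence number relative to the expected position are what decide it.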
Always happy to have extra sets of eyes. Pasting the tcpdump interpretation of that one session.
The outgoing cksums will be incorrect due to checksum tx offload being active. |
From #2713, it appears that issues on 4.14 have occurred at 4.14.71. Could those seeing issues on 4.14 try reverting to 4.14.70 ( For 4.18 it isn't so easy, but having looked at the commits between 4.14.70 - 71 I'd be suspecting 88078d9, so if someone could try 4.18 with that commit reverted then it'd be a useful test. |
@6by9
backtrace from dmesg:
|
I also tried on my second pi (pi 2) and I can confirm that the problem also occurs on the 4.14.74-v7+ and after the change to 4.14.70-v7+ the problem does not occur. |
Edit: I'm on 4.14.71-v7+ hash1145 - not sure if I should file a separate bug or not since it looks from the above discussion as though the root cause could be the same commit in 4.14.71. I'm seeing this too - on a Pi 3B+ that I've just rebuilt running stock Raspbian 2018-10-09. With samba, mediaplayer (a Java-based OpenHome / uPnP music player outputting to an attached iQaudio DAC) and Kodi (for playing videos to an HDMI-connected TV). The only changes I've made to the network config from stock Raspbian are to disable ipv6 in cmdline.txt, disable onboard bluetooth and onboard wifi via dtoverlay in config.txt, and disable EEE via dtparam in config.txt. I was seeing this before the rebuild in 4.14.71+ but could not be sure until I rebuilt that it wasn't my corrupt SD card that was causing it. Sadly, it seems it is indeed this bug. Here's an example apparently getting triggered on receipt of a TCP packet:
I've installed ethtool and have disabled hardware checksum offload for the rx path in the meantime, although for some reason those errors only occurred so far from boot up to 48 minutes after boot. I've not investigated the traffic that is triggering this. Assuming the cause is [88078d9], is that on the rx path only? Wondering if I should disable checksum offload for tx as well. |
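For anyone wanting to script the same workaround, here is a minimal Python sketch that builds and runs the `ethtool -K` command for toggling checksum offload. The helper names `offload_command` and `set_offload` are hypothetical, and the interface name `eth0` is taken from the log message in this thread; the sketch assumes `ethtool` is installed and that the script runs with root privileges.

```python
import subprocess
from typing import List


def offload_command(iface: str, feature: str, enable: bool) -> List[str]:
    """Build the ethtool argument list for one checksum offload direction.

    feature is "rx" (receive checksumming) or "tx" (transmit checksumming).
    """
    if feature not in ("rx", "tx"):
        raise ValueError("feature must be 'rx' or 'tx'")
    return ["ethtool", "-K", iface, feature, "on" if enable else "off"]


def set_offload(iface: str, feature: str, enable: bool) -> None:
    """Apply the setting; raises CalledProcessError if ethtool fails."""
    subprocess.run(offload_command(iface, feature, enable), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe to try.
    print(" ".join(offload_command("eth0", "rx", False)))
```

Settings applied this way do not survive a reboot, so the call would need to be repeated at boot for as long as the workaround is wanted.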
This reverts commit 88078d9. Various people have been reporting seeing "eth0: hw csum failure" and callstacks dumped in the kernel log on 4.18, and since 4.14.71, on both SMSC9514 and LAN7800 adapters. This commit appears to be the reason, but potentially due to an issue further down the stack. Revert whilst investigating the trigger. raspberrypi#2713 raspberrypi#2659 raspberrypi#2712 Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
This reverts commit 6bf32cd. Various people have been reporting seeing "eth0: hw csum failure" and callstacks dumped in the kernel log on 4.18, and since 4.14.71, on both SMSC9514 and LAN7800 adapters. This commit appears to be the reason, but potentially due to an issue further down the stack. Revert whilst investigating the trigger. raspberrypi#2713 raspberrypi#2659 raspberrypi#2712 Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
The suspected upstream commit is in the receive path only, supposedly an optimisation for fragmented IP packets and checksum offload. Because it is so deeply embedded in the core of the networking stack, fully understanding the usage can take a while. We are looking at reverting it in 4.14 to deal with the majority of users. 4.18 may retain the patch for now because it isn't the main kernel version and we still haven't managed to reproduce this. If those people hitting the issue can provide descriptions of their systems and what they are doing at the time (the simpler the better), then it would help. |
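For readers unfamiliar with why a receive-path checksum optimisation is delicate, here is a minimal Python sketch of the underlying arithmetic. It is not the kernel's code: `ones_sum`, `ones_sub`, `swap16` and `trim_sum` are hypothetical names, and the only thing assumed is the standard 16-bit one's-complement Internet checksum (RFC 1071). It shows how a sum covering a whole buffer can be adjusted when trailing bytes are trimmed, and that the adjustment must account for whether the trim starts at an odd or an even offset.

```python
def fold(total: int) -> int:
    """Fold carries back into 16 bits (end-around carry)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def ones_sum(data: bytes) -> int:
    """16-bit one's-complement sum over big-endian words, zero-padded."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    return fold(total)


def ones_sub(a: int, b: int) -> int:
    """One's-complement subtraction: a plus the complement of b."""
    return fold(a + (~b & 0xFFFF))


def swap16(x: int) -> int:
    """Swap the two bytes of a 16-bit value."""
    return ((x & 0xFF) << 8) | (x >> 8)


def trim_sum(full_sum: int, trimmed: bytes, offset: int) -> int:
    """Sum of the first `offset` bytes, given the sum of the whole buffer
    and the bytes that were trimmed from its end."""
    tail = ones_sum(trimmed)
    if offset % 2:
        # The tail started on an odd byte, so its bytes sat in the
        # opposite halves of each 16-bit word in the full buffer.
        tail = swap16(tail)
    return ones_sub(full_sum, tail)


if __name__ == "__main__":
    data = bytes(range(1, 11))
    print(hex(trim_sum(ones_sum(data), data[7:], 7)), hex(ones_sum(data[:7])))
```

Whether the failures in this thread come from this particular detail is not established here; the sketch only shows that incremental checksum adjustments of this kind are easy to get subtly wrong.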
This reverts commit 6bf32cd. Various people have been reporting seeing "eth0: hw csum failure" and callstacks dumped in the kernel log on 4.18, and since 4.14.71, on both SMSC9514 and LAN7800 adapters. This commit appears to be the reason, but potentially due to an issue further down the stack. Revert whilst investigating the trigger. raspberrypi/linux#2713 raspberrypi/linux#2659 raspberrypi/linux#2712 Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org> Signed-off-by: ahmedradaideh <ahmed.radaideh@gmail.com>
Latest rpi-update kernel reverts the offending commit. Please test and confirm if issue is fixed. |
To clarify, the rpi-4.18.y tree includes a patch from upstream which was thought to address the problem. The original commit is still present. rpi-4.14.y - now available via rpi-update - includes the reversion. |
Now confirmed that the upstream patch does NOT fix the issue, therefore 4.18 is still going to exhibit issues. |
The root cause of this has been found upstream and a fix should trickle down in the next few days. When they hit we'll revert the reverts. |
kernel: Revert Revert net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends See: raspberrypi/linux#2659 kernel: config: Add CONFIG_USBIP_VUDC See: #353 kernel: mmc/bcm2835-sdhost: Recover from MMC_SEND_EXT_CSD See: raspberrypi/linux#2728 kernel: overlays: pi3-disable-bt: Clear out bt_pins node kernel: Revert rtc: pcf8523: properly handle oscillator stop bit See: #1065 bootcode: Extend TEST_UNIT_READY timeout to 20 seconds, some hard drives take a really long time See: #898 firmware: video_render: Treat an empty buffer with ENDOFFRAME set as a flush firmware: dispmanx: Add option to ignore all layers lower than the current layer
kernel: Revert Revert net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends See: raspberrypi/linux#2659 kernel: config: Add CONFIG_USBIP_VUDC See: raspberrypi/firmware#353 kernel: mmc/bcm2835-sdhost: Recover from MMC_SEND_EXT_CSD See: raspberrypi/linux#2728 kernel: overlays: pi3-disable-bt: Clear out bt_pins node kernel: Revert rtc: pcf8523: properly handle oscillator stop bit See: raspberrypi/firmware#1065 bootcode: Extend TEST_UNIT_READY timeout to 20 seconds, some hard drives take a really long time See: raspberrypi/firmware#898 firmware: video_render: Treat an empty buffer with ENDOFFRAME set as a flush firmware: dispmanx: Add option to ignore all layers lower than the current layer
Latest rpi-update kernel removes the revert and makes use of the upstream fix. |
I can confirm that after rpi-update the problem is fixed on my Pi 3B+. Thanks for your effort! |
Upgrading from 4.14.71-v7+ to 4.14.79-v7+ resolved the issue. |
upgraded to 4.14.79-v7+ on my 3b+ and the kernel panic is gone, thanks!
|
Closing this issue as questions answered/issue resolved. |
I get the following errors since upgrading from 4.14 to 4.18:
eth0: hw csum failure
Kernel log: http://ix.io/1l8g
The RPi is connected directly to my ISP router.