kernel 4.18.4 eth0: hw csum failure - Pi3b+ only #2659

Closed
polojoe opened this issue Aug 24, 2018 · 44 comments

Comments

@polojoe

polojoe commented Aug 24, 2018

I get the following error
eth0: hw csum failure
since upgrading from 4.14 to 4.18.

kernel log http://ix.io/1l8g

The RPi is connected directly to my ISP router.

@popcornmix
Collaborator

How repeatable is the error? Do you see it every boot? Always after playing a video?
Can you try running:
ethtool -K eth0 rx off ; ethtool -K eth0 tx off

and let us know if problem still occurs?
Ping @6by9

@polojoe
Author

polojoe commented Aug 24, 2018

The errors reliably start several minutes after boot.
ethtool -K eth0 rx off ; ethtool -K eth0 tx off
solves errors in kernel log

@popcornmix
Collaborator

Can you test just ethtool -K eth0 rx off and just ethtool -K eth0 tx off to see which is fixing the error?

@polojoe
Author

polojoe commented Aug 24, 2018

ethtool -K eth0 rx off
does the trick
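
To confirm which checksum offloads are actually enabled before and after the workaround, the feature list printed by `ethtool -k eth0` (lower-case `-k`) can be parsed. The following is only a minimal Python sketch: it assumes the usual `name: on|off` line format, and the helper names `parse_offload_state` and `read_offload_state` are made up for illustration.

```python
import subprocess


def parse_offload_state(ethtool_output):
    """Return {'rx-checksumming': bool, 'tx-checksumming': bool} from `ethtool -k` text."""
    state = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name in ("rx-checksumming", "tx-checksumming"):
            # The value may carry a suffix such as "[fixed]"; only the first word matters.
            words = value.split()
            state[name] = bool(words) and words[0] == "on"
    return state


def read_offload_state(iface="eth0"):
    """Run `ethtool -k <iface>` (ethtool must be installed) and parse the result."""
    result = subprocess.run(["ethtool", "-k", iface],
                            capture_output=True, text=True, check=True)
    return parse_offload_state(result.stdout)
```

Calling `read_offload_state("eth0")` after `ethtool -K eth0 rx off` would be expected to report `rx-checksumming` as `False` while leaving `tx-checksumming` unchanged.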

@6by9
Contributor

6by9 commented Aug 24, 2018

I assume this is on a 3B+.
Are you running VLANs? That was the only time we were previously seeing these errors, and there was a patch that swapped to s/w checksum offload in those conditions.

@polojoe
Author

polojoe commented Aug 25, 2018

I'm using rpi3+, no vlan configured.
Direct lan connection to router with integrated 4 port 100 Mbit switch.

@maxnet
Contributor

maxnet commented Aug 25, 2018

I'm using rpi3+, no vlan configured.

[    4.827651] 8021q: 802.1Q VLAN Support v1.8
[    4.827687] 8021q: adding VLAN 0 to HW filter on device eth0

Module normally only gets loaded when you do configure VLAN tagging.
Or do you have a custom kernel configuration in which it is statically built into kernel?

@MilhouseVH

Or do you have a custom kernel configuration in which it is statically built into kernel?

In LibreELEC 9 with rpi-4.18.y, the config from LibreELEC/LibreELEC.tv#2913 is being used.

A quick grep of the config for VLAN shows the following:

neil@nm-linux:~/projects/pullrequest_repos/LibreELEC.tv/projects/RPi/devices/RPi2$ git grep VLAN
linux/linux.arm.conf:# CONFIG_BRIDGE_VLAN_FILTERING is not set
linux/linux.arm.conf:CONFIG_VLAN_8021Q=m
linux/linux.arm.conf:# CONFIG_VLAN_8021Q_GVRP is not set
linux/linux.arm.conf:# CONFIG_VLAN_8021Q_MVRP is not set
linux/linux.arm.conf:CONFIG_MACVLAN=m
linux/linux.arm.conf:# CONFIG_IPVLAN is not set

The same VLAN options are being used for rpi-4.14.y which doesn't appear to have any problems.

@maxnet
Contributor

maxnet commented Aug 26, 2018

The same VLAN options are being used for rpi-4.14.y which doesn't appear to have any problems.

Ok.
Probably connman doing something that pulls in the module then.
Module is not loaded on distributions with other network managers like Raspbian's dhcpcd.

If it already did so in previous versions, then that's probably not the problem.
But perhaps you can still try adding module_blacklist=8021q to cmdline.txt, to see if it's any different without it?

(Have not seen any checksum problems in 4.17 or 4.18 myself, even though running network intensive software)

@MilhouseVH

But perhaps you can still try adding module_blacklist=8021q to cmdline.txt

I guess it's something we (@polojoe) can rule out - I added module_blacklist=8021q to the LibreELEC kernel command line and can confirm the VLAN module is no longer loaded. LibreELEC (on an RPi3+, wired ethernet) appears to be behaving normally without it.

(Have not seen any checksum problems in 4.17 or 4.18 myself, even though running network intensive software)

No, me neither - my network seems to be frustratingly boring and problem free. :(

@polojoe
Author

polojoe commented Aug 26, 2018

Blacklist doesn't work for me
Commandline http://ix.io/1lfr
Kernel log http://ix.io/1lfq

@OrN

OrN commented Sep 26, 2018

I can confirm this issue happens on RPi 2 as well on kernel version 4.14.71. "Fixable" with

ethtool -K eth0 rx off

but that isn't a real solution.

EDIT: I didn't bisect it, but the issue isn't present in 4.14.18. Going to be using it for now, it got broken somewhere between 4.14.18 and 4.14.71.

@therealbstern

therealbstern commented Oct 2, 2018

I'm getting this without VLANs in use. I am using NAT and iptables, but not VLANs.

RPi 3, Linux beret 4.14.72-v7+ #1146 SMP Wed Sep 26 16:58:28 BST 2018 armv7l GNU/Linux

@JamesH65
Contributor

JamesH65 commented Oct 2, 2018

@therealbstern and does the ethtool -K eth0 rx off make the issue go away?

@therealbstern

It did.

@JamesH65
Contributor

@6by9 Any thoughts? Still seems to be a checksum offload issue in there. I vaguely remember some traffic on the linux netdev mailing list for this driver on checksums....

@6by9
Contributor

6by9 commented Oct 10, 2018

I'm going to be very strict here or things are going to get very confused.
The original issue is reported against a 3B+ and the 4.18 kernel, and is therefore using the lan78xx driver. Reports of it also affecting 4.14 on any other version of Pi (using the smsc95xx driver) are therefore a different issue as it is a different kernel version and different chip. Please raise a new issue with all the details and logs.

A set of patches for SMSC95xx (i.e. all Pis EXCEPT the Pi 3B+, so not this issue) has been proposed on netdev (https://www.spinics.net/lists/netdev/msg526520.html), with https://www.spinics.net/lists/netdev/msg526518.html relating to a checksum generation failure (i.e. on transmit) that occurs if the csum ends up in the last 4 bytes of the packet.
Is it plausible that you have a Pi (not 3B+) on the network sending corrupt packets and so the receiving device may legitimately be flagging a checksum failure? There are very few packets that fall into this corner case, and it wouldn't be cleared by disabling the offload (depending on exactly how a csum failure is reported when having gone via the software route).

A respin has been requested on the patches so they aren't in a shape that we can merge them at the moment.

@JamesH65 JamesH65 changed the title from "kernel 4.18.4 eth0: hw csum failure" to "kernel 4.18.4 eth0: hw csum failure - Pi3b+ only" on Oct 10, 2018

@gmzed

gmzed commented Oct 10, 2018

I noticed this problem on my Pi 3b + and I can confirm that in my case the issue comes from outside.
"eth0: hw csum failure" only appears when I connect to my Pi remotely via openvpn terminated on my UBNT EdgerouterX.
When I connect via vpn directly terminated on Raspberry, the problem does not occur ... So it looks like my EdgerouterX produces bad packets when I link through openvpn (?).

@6by9
Contributor

6by9 commented Oct 10, 2018

@gmzed Please confirm which kernel version you are running.

I think I'd currently describe them as troublesome rather than necessarily bad. If something is producing genuinely bad packets then it would cause issues on almost all systems.
It is fair to say there is an issue in some part of the checksum offload, but at the moment we haven't identified what, the logged messages don't contain enough data to deduce it, and the majority of networks (including ours) aren't triggering it. I suspect we need to drop some more logging into the kernel when it does occur to try and identify the situation more accurately.

To all seeing this message: Do you get the same callstack as in polojoe's original kernel log, showing that it was triggered via

netdev_rx_csum_fault
__skb_checksum_complete
nf_ip_checksum
nf_checksum
tcp_error
nf_conntrack_in
ipv4_conntrack_in
nf_hook_slow
ip_rcv
__netif_receive_skb_core
__netif_receive_skb
process_backlog
net_rx_action

netdev_rx_csum_fault which logs the message is called from several places in the IP stack, so we need to isolate the exact path being taken.

@gmzed

gmzed commented Oct 10, 2018

@6by9 my kernel: 4.14.74-v7+

Oct 10 13:06:51 pi3 kernel: [ 864.539387] eth0: hw csum failure
Oct 10 13:06:51 pi3 kernel: [ 864.539398] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G C 4.14.74-v7+ #1149
Oct 10 13:06:51 pi3 kernel: [ 864.539404] Hardware name: BCM2835
Oct 10 13:06:51 pi3 kernel: [ 864.539421] [<8010ffd4>] (unwind_backtrace) from [<8010c240>] (show_stack+0x20/0x24)
Oct 10 13:06:51 pi3 kernel: [ 864.539438] [<8010c240>] (show_stack) from [<80788604>] (dump_stack+0xd4/0x118)
Oct 10 13:06:51 pi3 kernel: [ 864.539455] [<80788604>] (dump_stack) from [<8068e870>] (netdev_rx_csum_fault+0x44/0x48)
Oct 10 13:06:51 pi3 kernel: [ 864.539474] [<8068e870>] (netdev_rx_csum_fault) from [<80681144>] (__skb_checksum_complete+0xbc/0xc0)
Oct 10 13:06:51 pi3 kernel: [ 864.539495] [<80681144>] (__skb_checksum_complete) from [<80701224>] (tcp_v4_rcv+0x604/0xe5c)
Oct 10 13:06:51 pi3 kernel: [ 864.539513] [<80701224>] (tcp_v4_rcv) from [<806d8054>] (ip_local_deliver_finish+0xe4/0x330)
Oct 10 13:06:51 pi3 kernel: [ 864.539530] [<806d8054>] (ip_local_deliver_finish) from [<806d890c>] (ip_local_deliver+0x54/0xdc)
Oct 10 13:06:51 pi3 kernel: [ 864.539546] [<806d890c>] (ip_local_deliver) from [<806d851c>] (ip_rcv_finish+0x27c/0x4e0)
Oct 10 13:06:51 pi3 kernel: [ 864.539561] [<806d851c>] (ip_rcv_finish) from [<806d8cb0>] (ip_rcv+0x31c/0x514)
Oct 10 13:06:51 pi3 kernel: [ 864.539578] [<806d8cb0>] (ip_rcv) from [<8068be4c>] (__netif_receive_skb_core+0x340/0xc84)
Oct 10 13:06:51 pi3 kernel: [ 864.539596] [<8068be4c>] (__netif_receive_skb_core) from [<8068e9e4>] (__netif_receive_skb+0x20/0x7c)
Oct 10 13:06:51 pi3 kernel: [ 864.539614] [<8068e9e4>] (__netif_receive_skb) from [<8068ead8>] (process_backlog+0x98/0x148)
Oct 10 13:06:51 pi3 kernel: [ 864.539629] [<8068ead8>] (process_backlog) from [<80692df0>] (net_rx_action+0x2e8/0x45c)
Oct 10 13:06:51 pi3 kernel: [ 864.539645] [<80692df0>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
Oct 10 13:06:51 pi3 kernel: [ 864.539660] [<80101694>] (__do_softirq) from [<80123870>] (irq_exit+0x108/0x164)
Oct 10 13:06:51 pi3 kernel: [ 864.539676] [<80123870>] (irq_exit) from [<801759a8>] (__handle_domain_irq+0x70/0xc4)
Oct 10 13:06:51 pi3 kernel: [ 864.539692] [<801759a8>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
Oct 10 13:06:51 pi3 kernel: [ 864.539706] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<807a41bc>] (__irq_svc+0x5c/0x7c)
Oct 10 13:06:51 pi3 kernel: [ 864.539713] Exception stack(0x80c01ef0 to 0x80c01f38)
Oct 10 13:06:51 pi3 kernel: [ 864.539723] 1ee0: 00000000 03ee7ee8 36436000 00000000
Oct 10 13:06:51 pi3 kernel: [ 864.539735] 1f00: 80c00000 80c03dcc 80c03d68 80c885b2 00000001 80b60a30 b77ffa00 80c01f4c
Oct 10 13:06:51 pi3 kernel: [ 864.539745] 1f20: 80c04174 80c01f40 80108a4c 80108a50 60000013 ffffffff
Oct 10 13:06:51 pi3 kernel: [ 864.539762] [<807a41bc>] (__irq_svc) from [<80108a50>] (arch_cpu_idle+0x34/0x4c)
Oct 10 13:06:51 pi3 kernel: [ 864.539780] [<80108a50>] (arch_cpu_idle) from [<807a393c>] (default_idle_call+0x34/0x48)
Oct 10 13:06:51 pi3 kernel: [ 864.539797] [<807a393c>] (default_idle_call) from [<801614b8>] (do_idle+0xd8/0x150)
Oct 10 13:06:51 pi3 kernel: [ 864.539811] [<801614b8>] (do_idle) from [<801617cc>] (cpu_startup_entry+0x28/0x2c)
Oct 10 13:06:51 pi3 kernel: [ 864.539826] [<801617cc>] (cpu_startup_entry) from [<8079d664>] (rest_init+0xbc/0xc0)
Oct 10 13:06:51 pi3 kernel: [ 864.539842] [<8079d664>] (rest_init) from [<80b00df8>] (start_kernel+0x3d4/0x3e0)

@6by9
Contributor

6by9 commented Oct 10, 2018

Thank you.
To clean that callstack up:

netdev_rx_csum_fault
__skb_checksum_complete
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
ip_rcv
__netif_receive_skb_core
__netif_receive_skb
process_backlog
net_rx_action

So straight TCP v4.

Looking at polojoe's, the nf in nf_ip_checksum is netfilter, so I suspect iptables stuff is running on his system (slightly unexpected if it is just a LibreELEC box).
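
For anyone wanting to compare their own trace against the two paths above, the clean-up can be automated. This is a rough Python sketch, assuming the 32-bit ARM backtrace format seen in the logs in this thread (`[<addr>] (callee) from [<addr>] (caller+off/size)`); the names `FRAME_RE`, `DUMPER_FRAMES` and `clean_callstack` are hypothetical.

```python
import re

# Matches the callee half of each frame, e.g.
#   [<8068e870>] (netdev_rx_csum_fault) from [<80681144>] (__skb_checksum_complete+0xbc/0xc0)
# An optional module suffix such as "[nf_conntrack]" is skipped.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] \((\w+)(?: \[\w+\])?\) from ")

# Frames that belong to the stack dumper itself rather than the network path.
DUMPER_FRAMES = {"unwind_backtrace", "show_stack", "dump_stack"}


def clean_callstack(log_text, stop_at="net_rx_action"):
    """Return function names from the first backtrace in log_text, innermost first."""
    names = []
    for match in FRAME_RE.finditer(log_text):
        name = match.group(1)
        if name in DUMPER_FRAMES:
            continue
        names.append(name)
        if name == stop_at:
            break
    return names
```

Because it scans the whole text rather than line by line, it should also cope with logs that were pasted as a single long line. A trace that goes through `nf_conntrack_in` points at the netfilter path, one that goes through `tcp_v4_rcv` or `__udp4_lib_rcv` at local delivery.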

@ricardosalveti

Also started to see this issue lately, but I'm using the 3 model B, which uses smsc95xx.

This is with the latest 4.18 kernel tree (rpi patches + stable).

[   10.597513] eth0: hw csum failure
[   10.604290] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C        4.18.13 #1
[   10.611202] Hardware name: Raspberry Pi 3 Model B Rev 1.2 (DT)
[   10.618165] Call trace:
[   10.625057]  dump_backtrace+0x0/0x188
[   10.631958]  show_stack+0x24/0x30
[   10.638828]  dump_stack+0x80/0xa4
[   10.645602]  netdev_rx_csum_fault+0x40/0x50
[   10.652378]  __skb_checksum_complete+0xd0/0xd8
[   10.659126]  __udp4_lib_rcv+0x130/0xba0
[   10.665802]  udp_rcv+0x30/0x40
[   10.672388]  ip_local_deliver_finish+0x108/0x278
[   10.678972]  ip_local_deliver+0xf8/0x108
[   10.685525]  ip_rcv_finish+0x204/0x460
[   10.692043]  ip_rcv+0x34c/0x410
[   10.698457]  __netif_receive_skb_core+0x4a4/0xb70
[   10.704927]  __netif_receive_skb+0x3c/0x88
[   10.711325]  process_backlog+0xac/0x170
[   10.717719]  net_rx_action+0x124/0x3a8
[   10.724149]  __do_softirq+0x168/0x3a8
[   10.730500]  irq_exit+0xb8/0xd8
[   10.736739]  __handle_domain_irq+0x9c/0x108
[   10.743038]  bcm2836_arm_irqchip_handle_irq+0x68/0xc8
[   10.749403]  el1_irq+0xb0/0x128
[   10.755726]  arch_cpu_idle+0x38/0x1a8
[   10.762089]  do_idle+0x240/0x258
[   10.768406]  cpu_startup_entry+0x2c/0x30
[   10.774654]  rest_init+0xd0/0xdc
[   10.780775]  start_kernel+0x488/0x4b0

@ricardosalveti

ricardosalveti commented Oct 10, 2018

Looking at my local network, the service triggering this issue was iceccd (running on Ubuntu 16.04). After disabling iceccd on the Ubuntu machine, I'm no longer getting the hw csum errors.

Logs from tcpdump:

21:12:03.771275 IP (tos 0x0, ttl 64, id 57986, offset 0, flags [DF], proto UDP (17), length 29)
    192.168.1.44.58909 > 192.168.1.255.8765: UDP, length 1
E.....@.@......,......"=.       S. ............../..

It is unclear whether this is really an issue on the sender side or on the receiver side.

@6by9
Contributor

6by9 commented Oct 11, 2018

PLEASE DO NOT COMMENT ON THIS THREAD IF YOU ARE NOT USING A Pi 3B+.

lan78xx and smsc95xx are two distinct drivers. As both are made by Microchip they may have common failures, however diagnosis MUST remain independent as these may well be totally independent issues.

#2712 raised for SMSC9514 related issues.

@6by9
Contributor

6by9 commented Oct 11, 2018

@gmzed If you have such a reliable way to reproduce this, could you run Wireshark or tcpdump on the Pi to capture all incoming traffic (nothing sensitive please!) whilst you trigger the issue?
I've tried several approaches now and failed to reproduce anything.

sudo tcpdump -i eth0 -n -x -w capture.dmp should both capture the data to capture.dmp, and decode it. If you can post capture.dmp to dropbox, Google Drive, or other file sharing site then I can take a look at it.

@gmzed

gmzed commented Oct 12, 2018

@6by9 Here you are. It seems to me that the "TCP keep alive" packets are the cause of the issue in this case.
capture4.zip

@6by9
Contributor

6by9 commented Oct 12, 2018

@gmzed Thank you - I'll have a look.

@6by9
Contributor

6by9 commented Oct 12, 2018

Not much progress here I'm afraid, and my network knowledge is sufficiently rusty to mean I need to go away and check a load of things.

The transactions look bizarre, and I'm not sure I agree with Wireshark's diagnosis that they are Keep-alive ACKs, as those would normally be 0 length whilst these are length 1. Keep-alives are typically tens of minutes to hours apart.
I'm more tending towards them being retries due to the other end not receiving the ACK.

Just looking at the port 49761 <> 8888 transactions,

  • Frame 60 is len 1 being sent, seq num 377196760
  • Frame 61 is the ACK saying the next packet would be seq num 377196761.
  • Frame 122 is 10 seconds later, len 1 byte at seq num 377196760, so the same as frame 60.
  • Frame 123 is the ACK again.
  • Frame 170/171 is 10 seconds later again. Same thing.
  • Frame 229/230 is 10 seconds later again. Same thing.

As I say, for a keep-alive I would have expected length 0. Flip side is that I'd also expect an exponential backoff on TCP retries.
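
The "constant gap versus exponential backoff" distinction can be checked mechanically on the send times of the repeated segment. A small Python sketch, assuming the times have already been extracted from the capture; `classify_resends` and its `tolerance` parameter are hypothetical names, and the second input is invented purely for comparison.

```python
def classify_resends(times_s, tolerance=0.25):
    """Classify the gaps between repeated sends of one segment.

    times_s: send times in seconds, oldest first.
    Returns "constant" if every gap is within `tolerance` (as a fraction) of the
    first gap, "backoff" if every gap is roughly double the previous one,
    otherwise "irregular".
    """
    gaps = [later - earlier for earlier, later in zip(times_s, times_s[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three send times")

    def close(value, target):
        return abs(value - target) <= tolerance * target

    if all(close(gap, gaps[0]) for gap in gaps):
        return "constant"
    if all(close(cur, 2 * prev) for prev, cur in zip(gaps, gaps[1:])):
        return "backoff"
    return "irregular"


# Frames 60, 122, 170 and 229 from the list above, roughly 10 seconds apart.
print(classify_resends([0.0, 10.0, 20.0, 30.0]))   # constant
# A hypothetical retransmission timer that doubles each time, for comparison.
print(classify_resends([0.0, 1.0, 3.0, 7.0]))      # backoff
```

A constant gap fits a timer-driven probe better than loss-driven retransmission, which is the tension described above.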

@gmzed Would you be prepared to drop back to a 4.14 kernel to see if you can trigger it there under the same situation? There are differences in the lan78xx driver between our 4.14 and 4.18 branches, but nothing fundamental around offloading. If it isn't reproducible there then it must be something odd in the core.

@pelwell
Contributor

pelwell commented Oct 12, 2018

Forgive the intrusion, but that sounds like the kind of traffic pattern you would get if the receiver's window was full. Senders are meant to periodically probe the window by trying to send one byte, and receivers ACK the previous position if they have no space. The proof should be in the window size parameter of the ACK packets.

@6by9
Contributor

6by9 commented Oct 12, 2018

Always happy to have extra sets of eyes.
Window is 253 on inbound frames, 274 on outbound.

Pasting the tcpdump interpretation of that one session.

$ tcpdump -# -r capture4.dmp 'tcp port 49761' -x -v
reading from file capture4.dmp, link-type EN10MB (Ethernet)
    1  10:30:46.777659 IP (tos 0x0, ttl 127, id 10385, offset 0, flags [DF], proto TCP (6), length 41)
    172.17.1.2.49761 > 192.168.1.10.8888: Flags [.], cksum 0x46c2 (correct), seq 377196760:377196761, ack 660292998, win 253, length 1
	0x0000:  4500 0029 2891 4000 7f06 6478 ac11 0102
	0x0010:  c0a8 010a c261 22b8 167b 90d8 275b 4586
	0x0020:  5010 00fd 46c2 0000 0000 9dd5 1ac1
    2  10:30:46.778417 IP (tos 0x0, ttl 64, id 159, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.10.8888 > 172.17.1.2.49761: Flags [.], cksum 0x6eec (incorrect -> 0xc1ed), ack 1, win 274, options [nop,nop,sack 1 {0:1}], length 0
	0x0000:  4500 0034 009f 4000 4006 cb5f c0a8 010a
	0x0010:  ac11 0102 22b8 c261 275b 4586 167b 90d9
	0x0020:  8010 0112 6eec 0000 0101 050a 167b 90d8
	0x0030:  167b 90d9
    3  10:30:56.790582 IP (tos 0x0, ttl 127, id 10391, offset 0, flags [DF], proto TCP (6), length 41)
    172.17.1.2.49761 > 192.168.1.10.8888: Flags [.], cksum 0x46c2 (correct), seq 0:1, ack 1, win 253, length 1
	0x0000:  4500 0029 2897 4000 7f06 6472 ac11 0102
	0x0010:  c0a8 010a c261 22b8 167b 90d8 275b 4586
	0x0020:  5010 00fd 46c2 0000 0000 773c 281f
    4  10:30:56.791295 IP (tos 0x0, ttl 64, id 160, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.10.8888 > 172.17.1.2.49761: Flags [.], cksum 0x6eec (incorrect -> 0xc1ed), ack 1, win 274, options [nop,nop,sack 1 {0:1}], length 0
	0x0000:  4500 0034 00a0 4000 4006 cb5e c0a8 010a
	0x0010:  ac11 0102 22b8 c261 275b 4586 167b 90d9
	0x0020:  8010 0112 6eec 0000 0101 050a 167b 90d8
	0x0030:  167b 90d9
    5  10:31:06.796615 IP (tos 0x0, ttl 127, id 10397, offset 0, flags [DF], proto TCP (6), length 41)
    172.17.1.2.49761 > 192.168.1.10.8888: Flags [.], cksum 0x46c2 (correct), seq 0:1, ack 1, win 253, length 1
	0x0000:  4500 0029 289d 4000 7f06 646c ac11 0102
	0x0010:  c0a8 010a c261 22b8 167b 90d8 275b 4586
	0x0020:  5010 00fd 46c2 0000 0000 0800 0ea6
    6  10:31:06.797302 IP (tos 0x0, ttl 64, id 161, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.10.8888 > 172.17.1.2.49761: Flags [.], cksum 0x6eec (incorrect -> 0xc1ed), ack 1, win 274, options [nop,nop,sack 1 {0:1}], length 0
	0x0000:  4500 0034 00a1 4000 4006 cb5d c0a8 010a
	0x0010:  ac11 0102 22b8 c261 275b 4586 167b 90d9
	0x0020:  8010 0112 6eec 0000 0101 050a 167b 90d8
	0x0030:  167b 90d9
    7  10:31:16.802849 IP (tos 0x0, ttl 127, id 10403, offset 0, flags [DF], proto TCP (6), length 41)
    172.17.1.2.49761 > 192.168.1.10.8888: Flags [.], cksum 0x46c2 (correct), seq 0:1, ack 1, win 253, length 1
	0x0000:  4500 0029 28a3 4000 7f06 6466 ac11 0102
	0x0010:  c0a8 010a c261 22b8 167b 90d8 275b 4586
	0x0020:  5010 00fd 46c2 0000 0000 5ba1 de02
    8  10:31:16.803307 IP (tos 0x0, ttl 64, id 162, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.10.8888 > 172.17.1.2.49761: Flags [.], cksum 0x6eec (incorrect -> 0xc1ed), ack 1, win 274, options [nop,nop,sack 1 {0:1}], length 0
	0x0000:  4500 0034 00a2 4000 4006 cb5c c0a8 010a
	0x0010:  ac11 0102 22b8 c261 275b 4586 167b 90d9
	0x0020:  8010 0112 6eec 0000 0101 050a 167b 90d8
	0x0030:  167b 90d9

The outgoing cksums will be incorrect due to checksum tx offload being active.
Wireshark claims all buffers except the first two are TCP Keep-alive frames.
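
The checksum verdicts tcpdump prints can be reproduced by hand from the hex above. The Python sketch below recomputes the TCP checksum of frames 1 and 2 using the standard pseudo-header method; the function names are made up for illustration, and it assumes plain IPv4 with the header length taken from the IHL field.

```python
def ones_complement_sum(data):
    """16-bit ones' complement sum of data, padding an odd length with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksums(ip_packet):
    """Return (pseudo_header_sum, stored_checksum, expected_checksum) for an IPv4/TCP packet."""
    header_len = (ip_packet[0] & 0x0F) * 4
    total_len = int.from_bytes(ip_packet[2:4], "big")
    segment = ip_packet[header_len:total_len]      # drops any trailing Ethernet padding
    # Pseudo-header: source address, destination address, zero, protocol, TCP length.
    pseudo = ip_packet[12:20] + bytes([0, ip_packet[9]]) + len(segment).to_bytes(2, "big")
    stored = int.from_bytes(segment[16:18], "big")
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    expected = 0xFFFF ^ ones_complement_sum(pseudo + zeroed)
    return ones_complement_sum(pseudo), stored, expected


# Frames 1 and 2 from the dump above (hex copied from the tcpdump -x output).
inbound = bytes.fromhex(
    "4500002928914000" "7f066478ac110102" "c0a8010ac26122b8" "167b90d8275b4586"
    "501000fd46c20000" "00009dd51ac1")
outbound = bytes.fromhex(
    "45000034009f4000" "4006cb5fc0a8010a" "ac11010222b8c261" "275b4586167b90d9"
    "801001126eec0000" "0101050a167b90d8" "167b90d9")

for name, packet in (("inbound", inbound), ("outbound", outbound)):
    pseudo, stored, expected = tcp_checksums(packet)
    print(f"{name}: pseudo-header sum 0x{pseudo:04x}, stored 0x{stored:04x}, expected 0x{expected:04x}")
# inbound: pseudo-header sum 0x6ee1, stored 0x46c2, expected 0x46c2
# outbound: pseudo-header sum 0x6eec, stored 0x6eec, expected 0xc1ed
```

For the outbound frame the stored value equals the bare pseudo-header sum, which is consistent with the checksum having been left for the hardware to complete on transmit. The inbound frame verifies correctly over its 21-byte TCP segment; the last 5 bytes in its hex dump lie beyond the IP total length of 41.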

@6by9
Contributor

6by9 commented Oct 13, 2018

From #2713, it appears that issues on 4.14 have occurred at 4.14.71.

Could those seeing issues on 4.14 try reverting to 4.14.70 (sudo rpi-update e880e627d11af2b17be3efb6d105e3c28cd63867) and see if they still get the issues. Moving forward the one step to 4.14.71 (sudo rpi-update c919d632ddc2a88bcb87b7d0cddd61446d1a36bf) is expected to show up the problems.

For 4.18 it isn't so easy, but having looked at the commits between 4.14.70 - 71 I'd be suspecting 88078d9, so if someone could try 4.18 with that commit reverted then it'd be a useful test.

@iMouath

iMouath commented Oct 14, 2018

Could those seeing issues on 4.14 try reverting to 4.14.70 (sudo rpi-update e880e627d11af2b17be3efb6d105e3c28cd63867) and see if they still get the issues. Moving forward the one step to 4.14.71 (sudo rpi-update c919d632ddc2a88bcb87b7d0cddd61446d1a36bf) is expected to show up the problems.

@6by9
Can confirm the errors have stopped showing after I reverted to e880e627d11af2b17be3efb6d105e3c28cd63867
I was on 4.14.74-v7+ #1149 then after reverting 4.14.70-v7+ #1144
Note: I'm using bonding driver for eth0 and wlan0 under systemd-networkd (thought that might be helpful) Raspberry Pi 3 B

c919d632ddc2a88bcb87b7d0cddd61446d1a36bf Linux piHole 4.14.71-v7+ #1145 does indeed bring the error back

backtrace from dmesg:

[Oct14 03:28] bond0: hw csum failure
[  +0.000032] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.71-v7+ #1145
[  +0.000014] Hardware name: BCM2835
[  +0.000042] [<8010ffd4>] (unwind_backtrace) from [<8010c240>] (show_stack+0x20/0x24)
[  +0.000019] [<8010c240>] (show_stack) from [<80787f24>] (dump_stack+0xd4/0x118)
[  +0.000021] [<80787f24>] (dump_stack) from [<8068e20c>] (netdev_rx_csum_fault+0x44/0x48)
[  +0.000028] [<8068e20c>] (netdev_rx_csum_fault) from [<80680ae0>] (__skb_checksum_complete+0xbc/0xc0)
[  +0.000020] [<80680ae0>] (__skb_checksum_complete) from [<80732b04>] (nf_ip_checksum+0xd4/0x130)
[  +0.000116] [<80732b04>] (nf_ip_checksum) from [<7f3c2a4c>] (udp_error+0x138/0x1c8 [nf_conntrack])
[  +0.000195] [<7f3c2a4c>] (udp_error [nf_conntrack]) from [<7f3bb994>] (nf_conntrack_in+0xec/0x560 [nf_conntrack])
[  +0.000112] [<7f3bb994>] (nf_conntrack_in [nf_conntrack]) from [<7f4052dc>] (ipv4_conntrack_in+0x28/0x2c [nf_conntrack_ipv4])
[  +0.000031] [<7f4052dc>] (ipv4_conntrack_in [nf_conntrack_ipv4]) from [<806cfe38>] (nf_hook_slow+0x4c/0xd0)
[  +0.000020] [<806cfe38>] (nf_hook_slow) from [<806d8750>] (ip_rcv+0x460/0x514)
[  +0.000019] [<806d8750>] (ip_rcv) from [<8068b7e8>] (__netif_receive_skb_core+0x340/0xc84)
[  +0.000019] [<8068b7e8>] (__netif_receive_skb_core) from [<8068e380>] (__netif_receive_skb+0x20/0x7c)
[  +0.000020] [<8068e380>] (__netif_receive_skb) from [<8068e474>] (process_backlog+0x98/0x148)
[  +0.000016] [<8068e474>] (process_backlog) from [<8069278c>] (net_rx_action+0x2e8/0x45c)
[  +0.000017] [<8069278c>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[  +0.000017] [<80101694>] (__do_softirq) from [<80123870>] (irq_exit+0x108/0x164)
[  +0.000022] [<80123870>] (irq_exit) from [<80175984>] (__handle_domain_irq+0x70/0xc4)
[  +0.000021] [<80175984>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[  +0.000015] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<807a3abc>] (__irq_svc+0x5c/0x7c)
[  +0.000007] Exception stack(0x80c01ef0 to 0x80c01f38)
[  +0.000010] 1ee0:                                     00000000 01971e24 36a17000 00000000
[  +0.000012] 1f00: 80c00000 80c03dcc 80c03d68 80c885b2 00000001 80b60a30 b75ffa00 80c01f4c
[  +0.000011] 1f20: 80c04174 80c01f40 80108a4c 80108a50 60000013 ffffffff
[  +0.000017] [<807a3abc>] (__irq_svc) from [<80108a50>] (arch_cpu_idle+0x34/0x4c)
[  +0.000022] [<80108a50>] (arch_cpu_idle) from [<807a323c>] (default_idle_call+0x34/0x48)
[  +0.000017] [<807a323c>] (default_idle_call) from [<80161494>] (do_idle+0xd8/0x150)
[  +0.000015] [<80161494>] (do_idle) from [<801617a8>] (cpu_startup_entry+0x28/0x2c)
[  +0.000015] [<801617a8>] (cpu_startup_entry) from [<8079cf64>] (rest_init+0xbc/0xc0)
[  +0.000017] [<8079cf64>] (rest_init) from [<80b00df8>] (start_kernel+0x3d4/0x3e0)

@gmzed

gmzed commented Oct 14, 2018

I also tried on my second pi (pi 2) and I can confirm that the problem also occurs on the 4.14.74-v7+ and after the change to 4.14.70-v7+ the problem does not occur.

@ghost

ghost commented Oct 14, 2018

Edit: I'm on 4.14.71-v7+ hash1145 - not sure if I should file a separate bug or not since it looks from the above discussion as though the root cause could be the same commit in 4.14.71.

I'm seeing this too - on a Pi 3B+ that I've just rebuilt running stock Raspbian 2018-10-09. With samba, mediaplayer (a Java-based OpenHome / uPnP music player outputting to an attached iQaudio DAC) and Kodi (for playing videos to an HDMI-connected TV). The only change I've made to the network config from stock Raspbian is to disable ipv6 in cmdline.txt, disable onboard bluetooth and onboard wifi via dtoverlay in config.txt, and disable EEE via dtparam in config.txt. I was seeing this before the rebuild in 4.14.71+ but could not be sure until I rebuilt that it wasn't my corrupt SDcard that was causing it. Sadly, it seems it is indeed this bug. Here's an example apparently getting triggered on receipt of a TCP packet:

[ 1179.732318] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.71-v7+ #1145
[ 1179.732337] Hardware name: BCM2835
[ 1179.732383] [<8010ffd4>] (unwind_backtrace) from [<8010c240>] (show_stack+0x20/0x24)
[ 1179.732401] [<8010c240>] (show_stack) from [<80787f24>] (dump_stack+0xd4/0x118)
[ 1179.732423] [<80787f24>] (dump_stack) from [<8068e20c>] (netdev_rx_csum_fault+0x44/0x48)
[ 1179.732447] [<8068e20c>] (netdev_rx_csum_fault) from [<80680ae0>] (__skb_checksum_complete+0xbc/0xc0)
[ 1179.732472] [<80680ae0>] (__skb_checksum_complete) from [<80700b8c>] (tcp_v4_rcv+0x604/0xe5c)
[ 1179.732506] [<80700b8c>] (tcp_v4_rcv) from [<806d79b0>] (ip_local_deliver_finish+0xe4/0x330)
[ 1179.732528] [<806d79b0>] (ip_local_deliver_finish) from [<806d8268>] (ip_local_deliver+0x54/0xdc)
[ 1179.732551] [<806d8268>] (ip_local_deliver) from [<806d7e78>] (ip_rcv_finish+0x27c/0x4e0)
[ 1179.732572] [<806d7e78>] (ip_rcv_finish) from [<806d860c>] (ip_rcv+0x31c/0x514)
[ 1179.732590] [<806d860c>] (ip_rcv) from [<8068b784>] (__netif_receive_skb_core+0x2dc/0xc84)
[ 1179.732608] [<8068b784>] (__netif_receive_skb_core) from [<8068e380>] (__netif_receive_skb+0x20/0x7c)
[ 1179.732625] [<8068e380>] (__netif_receive_skb) from [<8068e474>] (process_backlog+0x98/0x148)
[ 1179.732649] [<8068e474>] (process_backlog) from [<8069278c>] (net_rx_action+0x2e8/0x45c)
[ 1179.732669] [<8069278c>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[ 1179.732692] [<80101694>] (__do_softirq) from [<80123870>] (irq_exit+0x108/0x164)
[ 1179.732711] [<80123870>] (irq_exit) from [<80175984>] (__handle_domain_irq+0x70/0xc4)
[ 1179.732728] [<80175984>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[ 1179.732746] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<807a3abc>] (__irq_svc+0x5c/0x7c)
[ 1179.732760] Exception stack(0x80c01ef0 to 0x80c01f38)
[ 1179.732774] 1ee0:                                     00000000 048b86c8 36437000 00000000
[ 1179.732791] 1f00: 80c00000 80c03dcc 80c03d68 80c885b2 00000001 80b60a30 b77ffa00 80c01f4c
[ 1179.732801] 1f20: 80c04174 80c01f40 80108a4c 80108a50 60000013 ffffffff
[ 1179.732819] [<807a3abc>] (__irq_svc) from [<80108a50>] (arch_cpu_idle+0x34/0x4c)
[ 1179.732838] [<80108a50>] (arch_cpu_idle) from [<807a323c>] (default_idle_call+0x34/0x48)
[ 1179.732862] [<807a323c>] (default_idle_call) from [<80161494>] (do_idle+0xd8/0x150)
[ 1179.732881] [<80161494>] (do_idle) from [<801617a8>] (cpu_startup_entry+0x28/0x2c)
[ 1179.732901] [<801617a8>] (cpu_startup_entry) from [<8079cf64>] (rest_init+0xbc/0xc0)
[ 1179.732924] [<8079cf64>] (rest_init) from [<80b00df8>] (start_kernel+0x3d4/0x3e0)

I've installed ethtool and have disabled hardware checksum offload for the rx path in the meantime, although for some reason those errors only occurred so far from boot up to 48 minutes after boot. I've not investigated the traffic that is triggering this.

Assuming the cause is [88078d9], is that on the rx path only? Wondering if I should disable checksum offload for tx as well.

6by9 added a commit to 6by9/linux that referenced this issue Oct 15, 2018
This reverts commit 88078d9.

Various people have been reporting seeing "eth0: hw csum failure"
and callstacks dumped in the kernel log on 4.18, and since 4.14.71,
on both SMSC9514 and LAN7800 adapters.
This commit appears to be the reason, but potentially due to an
issue further down the stack. Revert whilst investigating the
trigger.

raspberrypi#2713
raspberrypi#2659
raspberrypi#2712

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
6by9 added a commit to 6by9/linux that referenced this issue Oct 15, 2018
This reverts commit 6bf32cd.

Various people have been reporting seeing "eth0: hw csum failure"
and callstacks dumped in the kernel log on 4.18, and since 4.14.71,
on both SMSC9514 and LAN7800 adapters.
This commit appears to be the reason, but potentially due to an
issue further down the stack. Revert whilst investigating the
trigger.

raspberrypi#2713
raspberrypi#2659
raspberrypi#2712

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
@6by9
Contributor

6by9 commented Oct 15, 2018

The suspected upstream commit is in the receive path only, supposedly an optimisation for fragmented IP packets and checksum offload. Being quite so embedded in the core of the networking stack, fully understanding the usage can take a while.

We are looking at reverting it in 4.14 to deal with the majority of users.

4.18 may retain the patch for now because it isn't the main kernel version and we still haven't managed to reproduce this. If those people hitting the issue can provide descriptions of their systems and what they are doing at the time (simpler the better), then it would help.
The commit text would point in the direction of when receiving fragmented IP packets, but packet captures that have been provided so far don't appear to be fragmented.
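
The thread does not spell out the root cause, so the following is only an illustration of the kind of arithmetic involved when a checksum computed over a whole receive buffer has to be adjusted after trailing bytes are trimmed. It is a Python sketch of ones' complement subtraction, not the kernel's code; all function names are hypothetical. The data is frame 1 from the capture discussed earlier, whose IP total length is 41 (odd) and which carries 5 further bytes after that.

```python
def csum(data):
    """Folded 16-bit ones' complement sum; an odd length is padded with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_add(a, b):
    """Ones' complement addition of two folded 16-bit sums."""
    total = a + b
    return (total & 0xFFFF) + (total >> 16)


def swap_bytes(value):
    return ((value & 0xFF) << 8) | (value >> 8)


def csum_after_trim(full_sum, trimmed, offset):
    """Remove the contribution of `trimmed`, which began at byte `offset`, from full_sum."""
    tail = csum(trimmed)
    if offset % 2:
        # Bytes that started at an odd offset sat in the opposite half of each
        # 16-bit word, so their standalone sum must be byte-swapped first.
        tail = swap_bytes(tail)
    return csum_add(full_sum, 0xFFFF ^ tail)   # subtracting is adding the complement


full = bytes.fromhex(
    "4500002928914000" "7f066478ac110102" "c0a8010ac26122b8" "167b90d8275b4586"
    "501000fd46c20000" "00009dd51ac1")
kept, trailing = full[:41], full[41:]

with_swap = csum_after_trim(csum(full), trailing, offset=41)
without_swap = csum_add(csum(full), 0xFFFF ^ csum(trailing))
print(with_swap == csum(kept))      # True
print(without_swap == csum(kept))   # False
```

The point of the sketch is only that an odd trim offset needs the byte swap; whether that is what triggered the messages reported here is not established in this thread.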

pelwell pushed a commit that referenced this issue Oct 15, 2018
This reverts commit 6bf32cd.

Various people have been reporting seeing "eth0: hw csum failure"
and callstacks dumped in the kernel log on 4.18, and since 4.14.71,
on both SMSC9514 and LAN7800 adapters.
This commit appears to be the reason, but potentially due to an
issue further down the stack. Revert whilst investigating the
trigger.

#2713
#2659
#2712

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
ahmedradaideh pushed a commit to ahmedradaideh/Pi-Kernel that referenced this issue Oct 15, 2018
This reverts commit 6bf32cd.

Various people have been reporting seeing "eth0: hw csum failure"
and callstacks dumped in the kernel log on 4.18, and since 4.14.71,
on both SMSC9514 and LAN7800 adapters.
This commit appears to be the reason, but potentially due to an
issue further down the stack. Revert whilst investigating the
trigger.

raspberrypi/linux#2713
raspberrypi/linux#2659
raspberrypi/linux#2712

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
Signed-off-by: ahmedradaideh <ahmed.radaideh@gmail.com>
@popcornmix
Collaborator

Latest rpi-update kernel reverts the offending commit. Please test and confirm if issue is fixed.

@pelwell
Contributor

pelwell commented Oct 15, 2018

To clarify, the rpi-4.18.y tree includes a patch from upstream which was thought to address the problem. The original commit is still present. rpi-4.14.y - now available via rpi-update - includes the reversion.

@6by9
Contributor

6by9 commented Oct 15, 2018

Now confirmed that the upstream patch does NOT fix the issue, therefore 4.18 is still going to exhibit issues.
4.14 via rpi-update should be clean due to the reversion.

@6by9
Contributor

6by9 commented Oct 22, 2018

The root cause of this has been found upstream and a fix should trickle down in the next few days. When they hit we'll revert the reverts.

popcornmix added a commit to raspberrypi/firmware that referenced this issue Nov 5, 2018
kernel: Revert Revert net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
See: raspberrypi/linux#2659

kernel: config: Add CONFIG_USBIP_VUDC
See: #353

kernel: mmc/bcm2835-sdhost: Recover from MMC_SEND_EXT_CSD
See: raspberrypi/linux#2728

kernel: overlays: pi3-disable-bt: Clear out bt_pins node

kernel: Revert rtc: pcf8523: properly handle oscillator stop bit
See: #1065

bootcode: Extend TEST_UNIT_READY timeout to 20 seconds, some hard drives take a really long time
See: #898

firmware: video_render: Treat an empty buffer with ENDOFFRAME set as a flush

firmware: dispmanx: Add option to ignore all layers lower than the current layer
popcornmix added a commit to Hexxeh/rpi-firmware that referenced this issue Nov 5, 2018
kernel: Revert Revert net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
See: raspberrypi/linux#2659

kernel: config: Add CONFIG_USBIP_VUDC
See: raspberrypi/firmware#353

kernel: mmc/bcm2835-sdhost: Recover from MMC_SEND_EXT_CSD
See: raspberrypi/linux#2728

kernel: overlays: pi3-disable-bt: Clear out bt_pins node

kernel: Revert rtc: pcf8523: properly handle oscillator stop bit
See: raspberrypi/firmware#1065

bootcode: Extend TEST_UNIT_READY timeout to 20 seconds, some hard drives take a really long time
See: raspberrypi/firmware#898

firmware: video_render: Treat an empty buffer with ENDOFFRAME set as a flush

firmware: dispmanx: Add option to ignore all layers lower than the current layer
@popcornmix
Collaborator

Latest rpi-update kernel removes the revert and makes use of the upstream fix.
It would be helpful if affected users could report whether it is still okay.

@treuter

treuter commented Nov 8, 2018

I can confirm that after rpi-update the problem is fixed on my Pi 3B+.

Thanks for your effort!

@spattinson

Upgrading from 4.14.71-v7+ to 4.14.79-v7+ resolved it.
syslog grew ~400MB in only a few hours after install
Thanks!

@ElTopo

ElTopo commented Nov 8, 2018

upgraded to 4.14.79-v7+ on my 3b+ and the kernel panic is gone, thanks!

uname -a
Linux lxlrpi 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7l GNU/Linux

@JamesH65
Contributor

Closing this issue as questions answered/issue resolved.
