Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcp25xxfd reports errors in the kernel log #6

Open
enmasse opened this issue May 16, 2019 · 69 comments
Open

mcp25xxfd reports errors in the kernel log #6

enmasse opened this issue May 16, 2019 · 69 comments

Comments

@enmasse
Copy link

enmasse commented May 16, 2019

Hi,

I'm evaluating the PICAN FD Duo board from Skpang. To test it out, I've connected can0 to can1 and running cangen -g 0 -m can0. After a short period of time, the command will exit with a message write: No buffer space available.

Kernel:
root@signalserver-canfd:~# uname -a Linux signalserver-canfd 4.14.98ms7-v7+ #75 SMP Sat Feb 16 22:22:33 UTC 2019 armv7l GNU/Linux

Looking in the kernel log I can see the following messages:

[  323.947188] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[  323.963571] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
[  343.344348] can: controller area network core (rev 20170425 abi 9)
[  343.344444] NET: Registered protocol family 29
[  343.352870] can: raw protocol (rev 20170425)
[  343.361683] NOHZ: local_softirq_pending 08
[  343.363120] NOHZ: local_softirq_pending 08
[  343.366264] NOHZ: local_softirq_pending 08
[  343.367370] NOHZ: local_softirq_pending 08
[  343.368599] NOHZ: local_softirq_pending 08
[  343.369537] NOHZ: local_softirq_pending 08
[  343.370786] NOHZ: local_softirq_pending 08
[  343.374552] NOHZ: local_softirq_pending 08
[  343.375805] NOHZ: local_softirq_pending 08
[  343.376777] NOHZ: local_softirq_pending 08
[  402.700957] mcp25xxfd spi0.1 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[  402.701055] mcp25xxfd spi0.1 can0: tefif: fifo 1 not pending - tef data: id: 0b066825 flags: 00000217, ts: 04947926 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.702035] mcp25xxfd spi0.1 can0: tefif: fifo 2 not pending - tef data: id: 03c34c04 flags: 00000435, ts: 04947d64 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.703176] mcp25xxfd spi0.1 can0: tefif: fifo 3 not pending - tef data: id: 00000479 flags: 0000068b, ts: 04948196 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.703214] mcp25xxfd spi0.1 can0: tefif: fifo 4 not pending - tef data: id: 14d04342 flags: 0000089c, ts: 049485bf - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.704475] mcp25xxfd spi0.1 can0: tefif: fifo 5 not pending - tef data: id: 00000253 flags: 00000ac0, ts: 049489d3 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.704512] mcp25xxfd spi0.1 can0: tefif: fifo 6 not pending - tef data: id: 177f51db flags: 00000c9e, ts: 04948e1a - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.704546] mcp25xxfd spi0.1 can0: tefif: fifo 7 not pending - tef data: id: 041ee9fb flags: 00000e15, ts: 04949245 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced

I've tried different values for -g, but it just takes slightly longer to fail. Also I've tried to half spimaxfrequency following the suggestion in the kernel log.

@msperl
Copy link
Owner

msperl commented May 19, 2019

this may be due to some bugs - please try the latest version I have just published

@msperl msperl closed this as completed May 19, 2019
@enmasse
Copy link
Author

enmasse commented May 20, 2019

No, this is still happening:

[  156.474627] can: controller area network core (rev 20170425 abi 9)
[  156.474726] NET: Registered protocol family 29
[  156.484418] can: raw protocol (rev 20170425)
[  156.488317] NOHZ: local_softirq_pending 08
[  156.488422] NOHZ: local_softirq_pending 08
[  156.491721] NOHZ: local_softirq_pending 08
[  156.491800] NOHZ: local_softirq_pending 08
[  156.491874] NOHZ: local_softirq_pending 08
[  156.491919] NOHZ: local_softirq_pending 08
[  156.492017] NOHZ: local_softirq_pending 08
[  156.492044] NOHZ: local_softirq_pending 08
[  156.492169] NOHZ: local_softirq_pending 08
[  156.492294] NOHZ: local_softirq_pending 08
[  169.584554] mcp25xxfd spi0.1 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[  169.584676] mcp25xxfd spi0.1 can0: tefif: fifo 6 not pending - tef data: id: 081f6bd7 flags: 00000c18, ts: 04485841 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  169.585603] mcp25xxfd spi0.1 can0: tefif: fifo 1 not pending - tef data: id: 10d8d200 flags: 000002dd, ts: 04485c89 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  248.568010] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  248.568151] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  248.568219] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  248.570112] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  248.570152] mcp25xxfd spi0.1 can0: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[1]+  Killed                  cangen -g 1 -m can0
root@signalserver-canfd:~# uname -a
Linux signalserver-canfd 4.19.40ms7-v7+ #81 SMP Sat May 11 17:33:45 UTC 2019 armv7l GNU/Linux

@enmasse
Copy link
Author

enmasse commented May 20, 2019

I'd be happy to help out debugging or providing more information to help you solve this issue.

@msperl
Copy link
Owner

msperl commented May 20, 2019

I would need a bit more details!

Things like:
spi clock speed configured
Description of test setup
Description of test - how are you testing?
Description of can bus parameters (baud,...)
Output of head /sys/kernel/debug/mcp25xxfd*/spi*
Output of ip -details link show can0
Output of uname -a
Version of code - is it really 8.2?

Other things that may be of interest - some that i can not think of right now

@msperl
Copy link
Owner

msperl commented May 20, 2019

Looking at it some more I came to realize:

You seem to be using:

  • a Raspberry Pi 3*
  • using spi 0.1 - so i guess that that you use spi-bcm2835 with (at least) 2 chip selects

It leaves open some other questions...

@msperl msperl reopened this May 20, 2019
@enmasse
Copy link
Author

enmasse commented May 22, 2019

Configuration file:
config.txt

File I use to set up interfaces:
ifup-canfd.txt

Output from /sys/kernel/debug/mcp25xxfd*/spi* as you requested:

root@signalserver-canfd:~# find /sys/kernel/debug/mcp25xxfd-spi0.0/ -name "spi_*" -exec echo "{}:" \; -exec cat "{}" \;
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_crc_read_split:
0
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_crc_read:
1
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_use_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_normal_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.0/spi_setup_speed_hz:
10000000
root@signalserver-canfd:~# find /sys/kernel/debug/mcp25xxfd-spi0.1/ -name "spi_*" -exec echo "{}:" \; -exec cat "{}" \;
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_crc_read_split:
0
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_crc_read:
1
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_use_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_normal_speed_hz:
20000000
/sys/kernel/debug/mcp25xxfd-spi0.1/spi_setup_speed_hz:
10000000

Output from ip -details as you requested:

root@signalserver-canfd:~# ip -details link show type can
3: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
    link/can  promiscuity 0 
    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 
	  bitrate 500000 sample-point 0.800 
	  tq 25 prop-seg 31 phase-seg1 32 phase-seg2 16 sjw 1
	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
	  dbitrate 2000000 dsample-point 0.800 
	  dtq 25 dprop-seg 7 dphase-seg1 8 dphase-seg2 4 dsjw 1
	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
	  clock 40000000numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
4: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
    link/can  promiscuity 0 
    can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 
	  bitrate 500000 sample-point 0.800 
	  tq 25 prop-seg 31 phase-seg1 32 phase-seg2 16 sjw 1
	  mcp25xxfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
	  dbitrate 2000000 dsample-point 0.800 
	  dtq 25 dprop-seg 7 dphase-seg1 8 dphase-seg2 4 dsjw 1
	  mcp25xxfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
	  clock 40000000numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

Output from uname -a:

root@signalserver-canfd:~# uname -a 
Linux signalserver-canfd 4.19.40ms7-v7+ #81 SMP Sat May 11 17:33:45 UTC 2019 armv7l GNU/Linux

Not sure how to show you that I really run your software. I pulled the kernel directly from your release page. :)

Image of test setup:
20190522_130421

Reproducing the error:
cangen -g 1 -m can0 & candump can1

Tell me if you need anything else.

@enmasse
Copy link
Author

enmasse commented May 22, 2019

From config.txt:

dtparam=i2c_arm=on
dtparam=i2c_vc=on
dtparam=spi=on
dtoverlay=mcp2517fd-can0,oscillator=40000000,interrupt=25
dtoverlay=mcp2517fd-can1,oscillator=40000000,interrupt=05

That can1 is tied to interrupt 05 seems a little bit odd, but it's according to instructions from SkPang. On the previous version of the PICAN DUO shield the interrupts for the can interfaces were in sequence.

@enmasse
Copy link
Author

enmasse commented Jun 4, 2019

Please let me know if you're able to reproduce the error or if I could assist in isolating the bug further.

@irrwisch1
Copy link

I was having the same issue and as a workaround I reverted those two commits that optimize TEF reading
d86b484#diff-8a495b05736235e49b1a602651e1a187
65c507a#diff-8a495b05736235e49b1a602651e1a187

The problems (hopefully) seems to be gone now without those two commits, however I didn't dig into it much further.

@enmasse
Copy link
Author

enmasse commented Jul 2, 2019

Thanks @jlanger, I will try to do the same when I figure out how to build the kernel.

@enmasse
Copy link
Author

enmasse commented Jul 3, 2019

I recompiled the kernel from commit cf733ff just before the TEF optimizations as just reverting the commits resulted in some compile errors that I didn't care to investigate.
It looks promising. I'm able to run cangen -g 1 -m can0 & candump can1 in my setup without the error reoccurring.

...after a while I get this message though:
write: No buffer space available

But it could be just a matter of increasing txqueuelen.

Edit: When running the code from this commit, with increased txqueuelen I get a kernel panic. I can post it on request, but because it's not the latest there might be a fix in a later commit.

@irrwisch1
Copy link

I also did some more testing and it specifically seems to be this commit
65c507a#diff-8a495b05736235e49b1a602651e1a187
That's making trouble. The first TEF optimization commit seems to work fine.

@enmasse: Yes if you revert that commit you need to re-add some variable definitions in mcp25xxfd_can_debugfs.c to make it build again.

@enmasse
Copy link
Author

enmasse commented Jul 4, 2019

@jlanger, it depends on what you mean when you say first or second commit, they're dated in the opposite order of when they're commited. :)

I reverted only commit 65c507a but then I still get the error.

@enmasse
Copy link
Author

enmasse commented Jul 8, 2019

Even with the two TEF commits reverted I was able to reproduce the kernel panic.

@omdathetkan
Copy link

Could this be related to the issue described in the MCP2517FD Errata?

TX MAB underflow due to long delays between SPI bytes
In case of a TX MAB underflow, the device willnotify the application by setting SERRIF and MODIF and by transitioning to Restricted Opera-tion or Listen Only mode (depending on CiCON.SERR2LOM

@johandc
Copy link

johandc commented Nov 19, 2019

We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error.

@xiangtang-1992
Copy link

I also have same issue on mcp2517fd and rpi3. @msperl do you have any update to solve this issue.

@xiangtang-1992
Copy link

Even with the two TEF commits reverted I was able to reproduce the kernel panic.

do you have already solve this issue? can frame stop and need reboot

@xiangtang-1992
Copy link

Hi,

I'm evaluating the PICAN FD Duo board from Skpang. To test it out, I've connected can0 to can1 and running cangen -g 0 -m can0. After a short period of time, the command will exit with a message write: No buffer space available.

Kernel:
root@signalserver-canfd:~# uname -a Linux signalserver-canfd 4.14.98ms7-v7+ #75 SMP Sat Feb 16 22:22:33 UTC 2019 armv7l GNU/Linux

Looking in the kernel log I can see the following messages:

[  323.947188] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[  323.963571] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
[  343.344348] can: controller area network core (rev 20170425 abi 9)
[  343.344444] NET: Registered protocol family 29
[  343.352870] can: raw protocol (rev 20170425)
[  343.361683] NOHZ: local_softirq_pending 08
[  343.363120] NOHZ: local_softirq_pending 08
[  343.366264] NOHZ: local_softirq_pending 08
[  343.367370] NOHZ: local_softirq_pending 08
[  343.368599] NOHZ: local_softirq_pending 08
[  343.369537] NOHZ: local_softirq_pending 08
[  343.370786] NOHZ: local_softirq_pending 08
[  343.374552] NOHZ: local_softirq_pending 08
[  343.375805] NOHZ: local_softirq_pending 08
[  343.376777] NOHZ: local_softirq_pending 08
[  402.700957] mcp25xxfd spi0.1 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[  402.701055] mcp25xxfd spi0.1 can0: tefif: fifo 1 not pending - tef data: id: 0b066825 flags: 00000217, ts: 04947926 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.702035] mcp25xxfd spi0.1 can0: tefif: fifo 2 not pending - tef data: id: 03c34c04 flags: 00000435, ts: 04947d64 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.703176] mcp25xxfd spi0.1 can0: tefif: fifo 3 not pending - tef data: id: 00000479 flags: 0000068b, ts: 04948196 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.703214] mcp25xxfd spi0.1 can0: tefif: fifo 4 not pending - tef data: id: 14d04342 flags: 0000089c, ts: 049485bf - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.704475] mcp25xxfd spi0.1 can0: tefif: fifo 5 not pending - tef data: id: 00000253 flags: 00000ac0, ts: 049489d3 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.704512] mcp25xxfd spi0.1 can0: tefif: fifo 6 not pending - tef data: id: 177f51db flags: 00000c9e, ts: 04948e1a - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[  402.704546] mcp25xxfd spi0.1 can0: tefif: fifo 7 not pending - tef data: id: 041ee9fb flags: 00000e15, ts: 04949245 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced

I've tried different values for -g, but it just takes slightly longer to fail. Also I've tried to half spimaxfrequency following the suggestion in the kernel log.

I also have same issue with you. do you have already solve this issue?

@Skypodo
Copy link

Skypodo commented Dec 17, 2019

We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error.

@johandc Do you have to restart the interface or your system in this case?

@johandc
Copy link

johandc commented Dec 17, 2019

@Skypodo yes. If dn and if up again.

@Skypodo
Copy link

Skypodo commented Dec 17, 2019

@johandc Did it improve than in any way? Did you notice any differences in behaviour between MCP2517FD and MCP2518FD?
I noticed that reducing the SPI frequency (i.e. 2.5 MHz) or increasing the message interval (~1k messages per second) increases the error rate.

@johandc
Copy link

johandc commented Dec 17, 2019

@Skypodo

I don't know, it feels very random to me when the error hits, so its subjective. But i felt like there was very little difference between the two chips.

There are two issues, as far as i can see.

  1. That the error occurs and
  2. That the driver does not try to recover automatically.

If the driver would receover itself, i could accept to loose a can frame once in a while. But having it locking up, is the worst possible scenario.

@Skypodo
Copy link

Skypodo commented Jan 6, 2020

@johandc That has been also my experience for now. Once the driver is stuck, there is no way it recovers and you have to turn the interface off and on again.
Over the holidays I've done some long-term tests to see how it might behave in our particular use case. SPI was running at 10 MHz and both of the CAN nodes sent random frames each 10/10.07 ms. This was running for 16 days without major errors. I had some bus errors occasionally but they were very rare and the driver managed to keep going.
For my use case this is sufficient for now.

@DanielOgorchock
Copy link

I'm not sure if it's related to this particular issue, but I ran into a problem with the driver not recovering from a BUS_OFF state. The mcp2518fd cancels all pending transmissions in this state, but the driver thinks they're still pending. This leads to a lockup since as far as the driver's concerned there's no free TX queues.

The driver needs to cancel its pending transmissions when a BUS_OFF error is detected (and report them as failed to userspace). The problem is that sometimes there's weird race conditions between detecting the BUS_OFF state change and queuing a new frame for TX asynchronously, so it's really difficult to figure out which frames are still queued or not for TX. An occasional check for lockup (if no frame TX complete interrupts have occurred in X seconds) can somewhat work around that issue.

@heikoschmidttki
Copy link

As we are facing the same problems I've investigated further in the issue with not pending FIFOs and the problem is located in the optimized interrupt handler mcp25xxfd_can_tx_handle_int_tefif_optimized().

If you remove the handling and just use mcp25xxfd_can_tx_handle_int_tefif_conservative() in mcp25xxfd_can_tx_handle_int_tefif() the classical (and more SPI bandwith consuming) handling is used and the driver should work stable.

The problem with the optimized version seems to be a timing issue. Now and then - with our application even after hours of running w/o problems - more than one TEF is read in advance, meaning not only the signalled CiTEFUA (that's not even used in the driver I think) is read, but also consecutive TEFs are readout. This works fine most of the times, but anytime after the read a TEFINT is triggered that leads to a readout of not ready TEFs. We believe that this is due to some readout while not all TEFs have been written by the chip completely, so the TEFNEIF flag is set after the readout itself has already taken place.

Once in this situation, every further send of CAN(-FD) messages lead to a readout of a wrong TEF, a missing interrupt and - in the end - no idle TX-FIFOs, which in turn make the driver stop sending. ifdown and ifup lead to re-initialization, so the interface is working again.

I've created a support case with microchip to clarify if a readout other than at the CiTEFUA address for the TEF is safe according to concurrent updates, as for the moment we cannot prove this (mcp25xxfd datasheet and reference manual is not clear at this point). I'll keep this updated if I get any news.

@johandc
Copy link

johandc commented Jan 15, 2020

That is excellent news. At least there is a workaround for now. Thank very much for looking into this.

@jbakuwel
Copy link

@t-kopp I've applied that patch as well. So far only single bit ECC errors:

kernel: [16664.150326] mcp25xxfd spi0.0: MCP2517 successfully initialized.
kernel: [16664.164735] mcp25xxfd spi1.0: MCP2517 successfully initialized.
kernel: [16767.939714] mcp25xxfd spi1.0 can1: CAN Bus error experienced
kernel: [16767.940215] mcp25xxfd spi1.0 can1: CAN Bus error experienced
kernel: [16768.035113] mcp25xxfd spi1.0 can1: CAN Bus error experienced
kernel: [16865.444674] mcp25xxfd spi1.0: ECC single bit error at 4ac
kernel: [17046.915653] mcp25xxfd spi1.0: ECC single bit error at 46c
kernel: [17046.915885] mcp25xxfd spi1.0: ECC single bit error at 46c
kernel: [17149.678267] mcp25xxfd spi1.0: ECC single bit error at 48c
kernel: [17329.608207] mcp25xxfd spi1.0: ECC single bit error at 46c
kernel: [19289.850859] mcp25xxfd spi1.0: ECC single bit error at 48c
kernel: [20154.126151] mcp25xxfd spi1.0: ECC single bit error at 498

I'll let the test run overnight and will report back.

@Skypodo
Copy link

Skypodo commented Jan 31, 2020

@jbakuwel Do you have real CAN nodes connected to the RP4 or some PC and a CAN interface?
I don't think that I've ever encountered ECC errors during all of my tests. The only issue I have with the current driver + 3 patches are Bus errors and Bus offs. I can see on my oscillioscope that in Bus off state, the bus is still busy but my receiving host PC does no longer receive messages.
Do you mind telling us things like SPI speed and CAN bus configuration?
The tests I've done so far using 20 MHz SPI and ip link set can0 up type can fd on bitrate 250000 dbitrate 1000000 prop-seg 47 dprop-seg 0 phase-seg1 16 dphase-seg1 15 phase-seg2 16 dphase-seg2 4 sjw 16 dsjw 4 sample-point 0.8 dsample-point 0.8. My mcp generates messages with cangen -g 1.07 -i -x -m can0 and the host (PCAN-USB Pro FD connected to a PC) is listening.

@jbakuwel
Copy link

jbakuwel commented Jan 31, 2020

@t-kopp Here are the preceding single bit ECC errors and the double bit ECC error that seized the driver; this time after approx 3 1/2 hrs.

[16865.444674] mcp25xxfd spi1.0: ECC single bit error at 4ac
[17046.915653] mcp25xxfd spi1.0: ECC single bit error at 46c
[17046.915885] mcp25xxfd spi1.0: ECC single bit error at 46c
[17149.678267] mcp25xxfd spi1.0: ECC single bit error at 48c
[17329.608207] mcp25xxfd spi1.0: ECC single bit error at 46c
[19289.850859] mcp25xxfd spi1.0: ECC single bit error at 48c
[20154.126151] mcp25xxfd spi1.0: ECC single bit error at 498
[23310.035113] mcp25xxfd spi1.0: ECC single bit error at 46c
[23310.035388] mcp25xxfd spi1.0: ECC single bit error at 46c
[23331.935150] mcp25xxfd spi1.0: ECC double bit error at 468
[23331.935386] mcp25xxfd spi1.0: ECC single bit error at 468
[25758.842745] mcp25xxfd spi1.0: ECC single bit error at 4bc
[26124.689480] mcp25xxfd spi1.0: ECC single bit error at 4b8
[29335.705842] mcp25xxfd spi1.0: ECC double bit error at 4b8

@Skypodo Yes, I have real CAN nodes connnected. See here for more details. My first test: can0 <--wire--> can1 using cangen -g 0 can0 and candump can1 ran fine for an extended period of time.

@jbakuwel
Copy link

@t-kopp More debug info. I found a way with the live CAN system to reproduce the issue quicker (not having to wait for an hour or longer. Contrary to what I've said before, single ECC errors also have an effect on the live CAN system (it flags an error condition) even though everything seems to continue to work until the double ECC error hits.

[62829.089841] mcp25xxfd spi0.0: MCP2517 successfully initialized.
[62829.105868] mcp25xxfd spi1.0: MCP2517 successfully initialized.
[62885.417987] mcp25xxfd spi1.0 can1: CAN Bus error experienced
[62885.418975] mcp25xxfd spi1.0 can1: CAN Bus error experienced
[62885.597993] mcp25xxfd spi1.0 can1: CAN Bus error experienced
[62923.048377] mcp25xxfd spi1.0: ECC single bit error at 4a8
[62978.371973] mcp25xxfd spi1.0: ECC single bit error at 498
[63076.625240] mcp25xxfd spi1.0: unidentified system interrupt - intf = 391a1119
[63076.625356] mcp25xxfd spi1.0: ECC double bit error at 498

And as always the driver seized up after this. We're a bit stuck with our project until we fix this issue, happy to help in any way to further debug this, especially now I have a quick way to reproduce the problem.

@t-kopp
Copy link

t-kopp commented Feb 3, 2020

@jbakuwel I have not encountered any ECC errors with or without external CANnodes. What is your way to reproduce the issue quickly?
Can you check what the addresses that are flagged are being used for? Using the debugfs you can find the fifo addresses.
Can you check if enabling the CRC for the SPI helps? (Via the module param)

@marckleinebudde
Copy link

On the mcp2517fd I can generate CRC errors reliable. To reproduce I setup a bus with two CAN nodes, the receiving mcp2517fd is in ifdown state, on the other node I've queued 32 frames with cangen.
Once I bring the mcp2517fd ifup, it receives the messages and reliable generates CRC errors.

@jbakuwel
Copy link

jbakuwel commented Feb 3, 2020

@t-kopp Please see here for the details about my test setup. I'll see if I can get the requested information for you tomorrow (it's getting late here in NZ).

@johandc
Copy link

johandc commented Feb 3, 2020

@t-kopp @jbakuwel
Are there any indications that this ECC problem is related to the original issue with the TEF's? If not, then i think a separate issue should be created to focus on the ECC problem. Also, remember that the MCP2517 has an errata regarding some race condition on the SPI interface, and have since been replaced with the MCP2518. So maybe we could test the TEF's patch on the 2518 and see if it is stable, and get it merged.

@marckleinebudde
Copy link

marckleinebudde commented Feb 3, 2020

I'm currently rewriting the driver. Feel free to have a look (as in testing) at it: https://github.com/marckleinebudde/linux/tree/mcp25xxfd-rpi

It's currently based on the latest rpi-4.19 kernel and my plans are to get it merged into mainline.

@msperl
Copy link
Owner

msperl commented Feb 3, 2020

Sorry for having bin quiet for so long - busy with other projects and vacation.

As for ECC: My guess is that you are trying to access a RAM address that has not been written to prior to the first read after enabling ECC.

But that seems strange, as I have cleared all ram in mcp25xxfd_ecc_enable.

Maybe we hit another bug with concurrent write of TEF data and the speculative read the same address via SPI that results in this ECC issue (SDRAM access being blocked) - but what is strange is that I have never seen this issue during my testing, But then I have kept count(TEF) = count(TX) and avoiding possible rollover issues...

@jbakuwel
Copy link

jbakuwel commented Feb 3, 2020

@johandc I can't say for sure that it is related but I have a feeling it could be. This:

kernel: [56043.592373] mcp25xxfd spi1.0: ECC single bit error at 800
kernel: [56618.710533] mcp25xxfd spi1.0: unidentified system interrupt - intf = 391a1109
kernel: [56618.710640] mcp25xxfd spi1.0: ECC double bit error at 800

suggests a bug that is triggered by a (perhaps) rare timing constraint.

@marckleinebudde I'll have a look at your driver hopefully tonight (NZ time). I have to squeeze this in as I'm having a busy week.

@msperl I rely on the driver and/or it's defaults to access the RAM and enable (or disable ECC). My Python script talks to the python-can library which in turn interfaces with the can devices created by the driver (ie. my main work is a few layers separated from the driver).

I do admit I do not know every detail about the way all this fits together but am puzzled a bit by the fact that my live CAN system is able to pick up on single bit ECC errors. Are these not auto corrected? As mentioned before, I have found a reliable way to reproduce this (and the other problems) with a test that resembles replacing the wires connected two CAN nodes with a RPi with Seeed Studio CAN hat running a Python script that simply forwards frames received on can0 to can1 and vice versa.

The unidentified system interrupt does not always show up but could well be a tell tale leading to what appears to be ECC errors. The Python script continues to forward frames after single bit ECC errors but the driver stops transmitting after the first double bit ECC error from which there is no recovery. It could well be a data corruption that occurs in rare conditions on a busy CAN bus which shows itself as ECC errors.

@marckleinebudde
Copy link

@msperl I observed the ECC error with my driver and it clears the RAM entirely on ifup. Only the mcp2517fd shows this problem (the mcp2518fd doesn't). And the ECC shows only up when the chip receives a huge burst of frames during ifup.

@marckleinebudde
Copy link

@jbakuwel Yes, single bit ECC errors are detected and corrected and if enabled the corresponding interrupt is propagated to the linux driver.

@marckleinebudde
Copy link

marckleinebudde commented Feb 3, 2020

@msperl I observed the ECC error with my driver and it clears the RAM entirely on ifup. Only the mcp2517fd shows this problem (the mcp2518fd doesn't). And the ECC shows only up when the chip receives a huge burst of frames during ifup.

This commit seems to fix the ECC in startup problem:
marckleinebudde/linux@08d2a5d

@jbakuwel
Copy link

jbakuwel commented Feb 4, 2020

@marckleinebudde and @msperl in my test scenario my live CAN system is initially switched off. Then I bring the interfaces up, start the Python script and then start the live CAN system. After a while (sometimes 1 hour sometimes 3 hours) ECC errors start to show up. I've discovered that increasing the traffic on the CAN bus a bit (via the live CAN system) reliably creates the ECC errors pretty much straight away.

@jbakuwel
Copy link

jbakuwel commented Feb 4, 2020

@marckleinebudde If single bit ECC errors are corrected, why would the live CAN system be able to detect those (as it does)? The driver logs a warning in syslog but (I assume) the frame was not dropped but a) corrected and b) passed on to the userland software which then sent it onwards onto the other CAN bus (in my test).

@marckleinebudde
Copy link

@jbakuwel The ECC errors you see in your system (ECC single bit error at 800 and ECC double bit error at 800) are not errors on the wire, but inside the on-chip-RAM of the MCP2517FD.

@jbakuwel
Copy link

jbakuwel commented Feb 4, 2020

@t-kopp It will take me a bit longer to get the requested debug info as the SD card crapped out on which I've been doing my experiments. I thought buying Samsung EVO's would be a safe bet....

@jbakuwel
Copy link

jbakuwel commented Feb 4, 2020

@marckleinebudde Am I mistaken thinking that "internal" errors inside the on-chip-RAM should not be visible to the CAN nodes (at least not the single bit ones that can be corrected)? Yet my live CAN system notices something is wrong by flagging "a fault" (sorry it's not more verbose than that).

@marckleinebudde
Copy link

@jbakuwel

Am I mistaken thinking that "internal" errors inside the on-chip-RAM should not be visible to the CAN nodes

Depends. If your ECC problems are created in the same way as described below, RX'ed CAN frames are broken. They will be passed broken into the user space and if you send them via the same or another CAN interface to the next CAN node, this node will receive CAN frames with wrong contents. From the CAN bus point of view this transmission is perfectly OK.

So far the only ECC errors I have seen were in my driver and in the RAM associated with the RX FIFO. (As noted in #6 (comment), I've changed the driver so that this doesn't happen any more.) My current theory is that the RX-Process in the Chip that moves the data into the RAM and the SPI read commands that read data from RAM interfere, resulting in data being written to the wrong address without updating the ECC. The ECC in chip is designed to correct single bit flips and detect double bit flips. When you overwrite data (without updating the ECC) there's probably more than 2 bit flips. What the chip does with this is unknown.

In my test case I send 32 CAN frames with incrementing data bytes and I see more than half of them totally broken. I can show you an example later today.

(at least not the single bit ones that can be corrected)?

Yes. But if overwriting happens as speculated above the, there are probably more than 2 bit flips....

@jbakuwel
Copy link

jbakuwel commented Feb 4, 2020

@marckleinebudde Thanks for that. I agree with your theory. It is a bit odd though that, in my tests, I first see single bit ECC errors, and later double bit ECC errors. I've not seen a double ECC error first. The driver does not recover after the first double bit ECC error; maybe it does not recover due to more than 2 flipped bits. Unfortunately my new fancy Samsung EVO plus SD card suddenly died, so I'll have to rebuild the system in an otherwise very busy week. May take me a few days...

@jbakuwel
Copy link

@t-kopp Here are the test results with kernel 4.19.102-v7l+, current Seeed Studio driver (which includes this patch) as well as this patch applied. Note the mode switches are new; didn't see them in previous tests.

The test started at Feb 13 09:01:57 ran successfully (ie. the live CAN system not detecting faults) until Feb 13 11:59:27 due to kernel: [11077.498185] mcp25xxfd spi1.0: ECC double bit error at 470.

Feb 13 10:30:17 test kernel: [ 5726.727061] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 10:30:17 test kernel: [ 5726.727680] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 3 to 6
...snap...
Feb 13 10:52:48 test kernel: [ 7077.957237] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 10:52:48 test kernel: [ 7077.957391] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 3 to 6
...snap...
Feb 13 10:58:10 test kernel: [ 7399.667045] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
...snap...
Feb 13 11:14:12 test kernel: [ 8362.349562] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
...snap...
Feb 13 11:20:05 test kernel: [ 8715.137548] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
...snap...
Feb 13 11:36:02 test kernel: [ 9672.350554] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
...snap...
Feb 13 11:55:06 test kernel: [10815.755773] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
...snap...
Feb 13 11:59:27 test kernel: [11077.498185] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.498353] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 11:59:27 test kernel: [11077.498430] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.498450] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.498586] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 13 11:59:27 test kernel: [11077.498637] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.498730] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 6
Feb 13 11:59:27 test kernel: [11077.498816] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 13 11:59:27 test kernel: [11077.498863] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.498955] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 6
Feb 13 11:59:27 test kernel: [11077.499003] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 13 11:59:27 test kernel: [11077.499050] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.499142] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 6
Feb 13 11:59:27 test kernel: [11077.499190] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 13 11:59:27 test kernel: [11077.499237] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.499343] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.499587] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 13 11:59:27 test kernel: [11077.499738] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.499894] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 13 11:59:27 test kernel: [11077.500012] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.500031] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.500173] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 11:59:27 test kernel: [11077.500250] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.500268] mcp25xxfd spi1.0: ECC double bit error at 470
Feb 13 11:59:27 test kernel: [11077.500411] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 11:59:27 test kernel: [11077.500487] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.500639] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 11:59:27 test kernel: [11077.500716] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.500871] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 11:59:27 test kernel: [11077.500948] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.501099] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 13 11:59:27 test kernel: [11077.501170] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 13 11:59:27 test kernel: [11077.501519] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 13 11:59:27 test kernel: [11077.501618] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 13 11:59:28 test /testd: 20200213-115928: v->b: CanError('Transmit buffer full')

@t-kopp
Copy link

t-kopp commented Feb 14, 2020

@jbakuwel mhm, am I missing something or are you saying you applied the two patches two the driver from Marc? Both patches were for the driver from Martin so they should not apply here.
In any case I'll try to get your test-scripts running on my setup to reproduce the problems.

@jbakuwel
Copy link

jbakuwel commented Feb 14, 2020

@t-kopp I did not apply the patches to Marc's driver. Obviously :-). I haven't tried but I think the patch would have failed. I haven't applied the first patch as that was already done; I simply cloned a fresh copy from the Seeed Studio driver (I had to rebuilt the system).

I also tested @marckleinebudde 's kernel and driver (first) but kept the 102 kernel running; Marc mentioned some beneficial (my interpretation) backported kernel SPI patches; see here for the details.

Maybe we need one place where we discuss these (likely?) related issues? We are currently discussing it in @msperl 's repo, as well as in Seed Studio's.

@jbakuwel
Copy link

jbakuwel commented Feb 23, 2020

@t-kopp, @msperl Have you been able to make progress on this issue? Do you think there's a chance we can get this resolved in a reasonable timeframe for the MCP2517FD chipset? It's a real show stopper for us, and if it's not at all certain we can resolve this, we'll have to look for other hardware. If there's anything else I can do to help, please let me know.

@t-kopp
Copy link

t-kopp commented Feb 24, 2020

@jbakuwel I haven't been able to reproduce it yet, no. And I agree with your comment from before that it makes sense to discuss in one place. Maybe you can open a new issue and we discuss it there?
As far as I can see the ECC errors are not related to either bug they are currently being discussed in. (Both are about the TEF issue)

Were you able to find out the answers to the questions from above? Which FIFOs are causing the errors (TX/RX?) and does enabling the SPI CRC help?
About your setup - in the Seed studio entry you don't seem to hit double bit ecc errors anymore (at least the latest logs in the bug report over there show RX fifo and lots of single bit errors but no double errors)
Do the mode switches happen with both drivers for you?

@jbakuwel
Copy link

@t-kopp Which module parameter enables SPI CRC? I've tried a few (use_spi_crc=1 for example) but so far no luck. Can you please let me know how to find the FIFO's that are causing the errors? DebugFS is here but it's not clear to me where I need to look for answer your questions.

I'm still seeing double bit ECC errors (see above), for example:

Feb 13 11:59:27 test kernel: [11077.500268] mcp25xxfd spi1.0: ECC double bit error at 470

I don't think the mode switches are happening with Marc's driver. I'm about to test the latest branch and will let you know.

Would you prefer to have the issue (to discuss these possibly related issues) in this repo or here?

@jbakuwel
Copy link

@t-kopp would you be able to help with this ?

Unfortunately, none of the documentation tells you what to do in case of an ECC error. :(

heikoschmidttki added a commit to heikoschmidttki/linux-rpi that referenced this issue Dec 7, 2020
… reads non signalled TEFs which is undefined behavior according to Microchip

In function mcp25xxfd_can_tx_handle_int_tefif() the call to
mcp25xxfd_can_tx_handle_int_tefif_optimized() is removed as it acesses
transmit FIFOs that are not signalled by the TEFIF. This leads to
undefined behavior and wrong reads, leading to the driver's internal TEF
state tot get out of sync and transmit further frames is blocked (TEFIFs
not freed).

The problem is discussed in Github issue
msperl#6.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests