mcp25xxfd reports errors in the kernel log #6
Comments
This may be due to some bugs. Please try the latest version I have just published. |
No, this is still happening:
|
I'd be happy to help out debugging or providing more information to help you solve this issue. |
I would need a few more details! Things like: Other things that may be of interest, including some that I cannot think of right now |
Looking at it some more I came to realize: You seem to be using:
It leaves open some other questions... |
Configuration file: File I use to set up interfaces: Output from /sys/kernel/debug/mcp25xxfd*/spi* as you requested:
Output from ip -details as you requested:
Output from uname -a:
Not sure how to show you that I really run your software. I pulled the kernel directly from your release page. :) Reproducing the error: Tell me if you need anything else. |
From config.txt:
That can1 is tied to interrupt 05 seems a little bit odd, but it's according to instructions from SkPang. On the previous version of the PICAN DUO shield the interrupts for the can interfaces were in sequence. |
Please let me know if you're able to reproduce the error or if I could assist in isolating the bug further. |
I was having the same issue and as a workaround I reverted those two commits that optimize TEF reading. The problems (hopefully) seem to be gone now without those two commits; however, I didn't dig into it much further. |
Thanks @jlanger, I will try to do the same when I figure out how to build the kernel. |
I recompiled the kernel from commit cf733ff just before the TEF optimizations as just reverting the commits resulted in some compile errors that I didn't care to investigate. ...after a while I get this message though: But it could be just a matter of increasing Edit: When running the code from this commit, with increased txqueuelen I get a kernel panic. I can post it on request, but because it's not the latest there might be a fix in a later commit. |
I also did some more testing and it specifically seems to be this commit @enmasse: Yes if you revert that commit you need to re-add some variable definitions in mcp25xxfd_can_debugfs.c to make it build again. |
Even with the two TEF commits reverted I was able to reproduce the kernel panic. |
Could this be related to the issue described in the MCP2517FD Errata?
|
We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error. |
I also have the same issue on mcp2517fd and rpi3. @msperl, do you have any update on solving this issue? |
Have you already solved this issue? CAN frames stop and a reboot is needed. |
I have the same issue as you. Have you already solved this issue? |
@johandc Do you have to restart the interface or your system in this case? |
@Skypodo Yes. ifdown and ifup again. |
@johandc Did it improve then in any way? Did you notice any differences in behaviour between MCP2517FD and MCP2518FD? |
I don't know, it feels very random to me when the error hits, so it's subjective. But I felt like there was very little difference between the two chips. There are two issues, as far as I can see.
If the driver would recover by itself, I could accept losing a CAN frame once in a while. But having it lock up is the worst possible scenario. |
@johandc That has been also my experience for now. Once the driver is stuck, there is no way it recovers and you have to turn the interface off and on again. |
I'm not sure if it's related to this particular issue, but I ran into a problem with the driver not recovering from a BUS_OFF state. The mcp2518fd cancels all pending transmissions in this state, but the driver thinks they're still pending. This leads to a lockup since as far as the driver's concerned there's no free TX queues. The driver needs to cancel its pending transmissions when a BUS_OFF error is detected (and report them as failed to userspace). The problem is that sometimes there's weird race conditions between detecting the BUS_OFF state change and queuing a new frame for TX asynchronously, so it's really difficult to figure out which frames are still queued or not for TX. An occasional check for lockup (if no frame TX complete interrupts have occurred in X seconds) can somewhat work around that issue. |
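The occasional lockup check mentioned above could also be approximated from userspace. The following is a minimal sketch, assuming the application polls the interface's `tx_packets` counter under `/sys/class/net/<iface>/statistics/` and knows whether it still has frames queued. The class name, the timeout value and the polling approach are hypothetical illustrations, not part of the driver and not the driver-side fix discussed in this thread.

```python
import time


class TxStallDetector:
    """Flags a stall when frames are pending but the TX counter stops moving."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_count = None
        self.last_progress = None

    def update(self, tx_packets, pending, now=None):
        """Feed the current TX counter; return True if a stall is suspected."""
        now = time.monotonic() if now is None else now
        if self.last_count is None or tx_packets != self.last_count or not pending:
            # First sample, counter moved, or nothing queued: reset the timer.
            self.last_count = tx_packets
            self.last_progress = now
            return False
        return (now - self.last_progress) >= self.timeout_s


def read_tx_packets(iface):
    """Read the kernel's TX packet counter for a network interface."""
    with open(f"/sys/class/net/{iface}/statistics/tx_packets") as f:
        return int(f.read())
```

When `update()` returns True, a supervising script could apply the workaround reported in this thread and take the interface down and up again.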
As we are facing the same problems, I've investigated the issue with not pending FIFOs further, and the problem is located in the optimized interrupt handler If you remove the handling and just use The problem with the optimized version seems to be a timing issue. Now and then (with our application even after hours of running without problems) more than one TEF is read in advance, meaning that not only the TEF at the signalled CiTEFUA (which I think is not even used in the driver) is read, but consecutive TEFs are read out as well. This works fine most of the time, but at some point after such a read a TEFINT is triggered that leads to a readout of TEFs that are not ready. We believe this happens because the readout takes place while the chip has not yet written all TEFs completely, so the TEFNEIF flag is set after the readout itself has already taken place. Once in this situation, every further transmission of CAN(-FD) messages leads to a readout of a wrong TEF, a missing interrupt and, in the end, no idle TX FIFOs, which in turn makes the driver stop sending. ifdown and ifup lead to re-initialization, so the interface works again. I've created a support case with Microchip to clarify whether a readout of the TEF at an address other than CiTEFUA is safe with respect to concurrent updates, as for the moment we cannot prove this (the mcp25xxfd datasheet and reference manual are not clear on this point). I'll keep this updated if I get any news. |
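The conservative readout described above (only read a TEF entry once the chip has signalled it, never read ahead) can be illustrated with a toy model. This is a sketch under stated assumptions: `FakeTef` is a hypothetical stand-in for the chip, its method names are invented for illustration, and the code is neither the driver's implementation nor a model of the chip's real timing.

```python
from collections import deque


class FakeTef:
    """Toy stand-in for the chip's Transmit Event FIFO (not real hardware)."""

    def __init__(self):
        self._entries = deque()

    def chip_writes(self, seq):
        """Model the chip finishing a TEF entry for a transmitted frame."""
        self._entries.append(seq)

    def not_empty(self):
        """Model the TEFNEIF status flag."""
        return bool(self._entries)

    def read_at_user_address(self):
        """Model an SPI read of the entry at CiTEFUA (the oldest entry)."""
        return self._entries[0]

    def increment_tail(self):
        """Model setting UINC to release the entry that was just read."""
        self._entries.popleft()


def drain_signalled_tefs(tef):
    """Read one entry per confirmed not-empty flag; never read ahead."""
    completed = []
    while tef.not_empty():
        completed.append(tef.read_at_user_address())
        tef.increment_tail()
    return completed
```

The trade-off is one status check per entry instead of a single bulk read, which costs SPI bandwidth but avoids touching entries the chip may still be writing.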
That is excellent news. At least there is a workaround for now. Thank you very much for looking into this. |
@t-kopp I've applied that patch as well. So far only single bit ECC errors: kernel: [16664.150326] mcp25xxfd spi0.0: MCP2517 successfully initialized. I'll let the test run overnight and will report back. |
@jbakuwel Do you have real CAN nodes connected to the RP4 or some PC and a CAN interface? |
@t-kopp Here are the preceding single bit ECC errors and the double bit ECC error that seized the driver; this time after approx 3 1/2 hrs. [16865.444674] mcp25xxfd spi1.0: ECC single bit error at 4ac @Skypodo Yes, I have real CAN nodes connected. See here for more details. My first test: can0 <--wire--> can1 using cangen -g 0 can0 and candump can1 ran fine for an extended period of time. |
@t-kopp More debug info. I found a way with the live CAN system to reproduce the issue quicker (not having to wait for an hour or longer). Contrary to what I've said before, single ECC errors also have an effect on the live CAN system (it flags an error condition) even though everything seems to continue to work until the double ECC error hits. [62829.089841] mcp25xxfd spi0.0: MCP2517 successfully initialized. And as always the driver seized up after this. We're a bit stuck with our project until we fix this issue, happy to help in any way to further debug this, especially now I have a quick way to reproduce the problem. |
@jbakuwel I have not encountered any ECC errors with or without external CANnodes. What is your way to reproduce the issue quickly? |
On the mcp2517fd I can generate CRC errors reliably. To reproduce, I set up a bus with two CAN nodes; the receiving mcp2517fd is in ifdown state, and on the other node I've queued 32 frames with |
@t-kopp @jbakuwel |
I'm currently rewriting the driver. Feel free to have a look (as in testing) at it: https://github.com/marckleinebudde/linux/tree/mcp25xxfd-rpi It's currently based on the latest rpi-4.19 kernel and my plans are to get it merged into mainline. |
Sorry for having been quiet for so long; I was busy with other projects and vacation. As for ECC: My guess is that you are trying to access a RAM address that has not been written to prior to the first read after enabling ECC. But that seems strange, as I have cleared all RAM in mcp25xxfd_ecc_enable. Maybe we hit another bug with a concurrent write of TEF data and the speculative read of the same address via SPI that results in this ECC issue (SDRAM access being blocked). What is strange is that I have never seen this issue during my testing. But then I have kept count(TEF) = count(TX) and avoided possible rollover issues... |
@johandc I can't say for sure that it is related but I have a feeling it could be. This: kernel: [56043.592373] mcp25xxfd spi1.0: ECC single bit error at 800 suggests a bug that is triggered by a (perhaps) rare timing constraint. @marckleinebudde I'll have a look at your driver hopefully tonight (NZ time). I have to squeeze this in as I'm having a busy week. @msperl I rely on the driver and/or its defaults to access the RAM and enable (or disable) ECC. My Python script talks to the python-can library, which in turn interfaces with the CAN devices created by the driver (i.e. my main work is a few layers separated from the driver). I admit I do not know every detail about the way all this fits together, but I am a bit puzzled by the fact that my live CAN system is able to pick up on single bit ECC errors. Are these not auto-corrected? As mentioned before, I have found a reliable way to reproduce this (and the other problems) with a test that amounts to replacing the wires connecting two CAN nodes with an RPi with a Seeed Studio CAN hat running a Python script that simply forwards frames received on can0 to can1 and vice versa. The unidentified system interrupt does not always show up but could well be a telltale leading to what appear to be ECC errors. The Python script continues to forward frames after single bit ECC errors, but the driver stops transmitting after the first double bit ECC error, from which there is no recovery. It could well be data corruption that occurs in rare conditions on a busy CAN bus and shows itself as ECC errors. |
@msperl I observed the ECC error with my driver, and it clears the RAM entirely on ifup. Only the mcp2517fd shows this problem (the mcp2518fd doesn't). And the ECC error only shows up when the chip receives a huge burst of frames during ifup. |
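For anyone wanting to reproduce the forwarding test described above, here is a minimal sketch of such a bridge. It is an assumption-laden stand-in: the original script uses python-can and was not posted, whereas this version uses only the standard library's raw SocketCAN support, handles classic CAN frames only (no CAN FD), and the function names are made up for illustration.

```python
import select
import socket
import struct

CAN_FRAME_FMT = "=IB3x8s"  # struct can_frame: id, dlc, padding, 8 data bytes
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)


def unpack_frame(raw):
    """Split a raw classic CAN frame into (can_id, data)."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, raw)
    return can_id, data[:dlc]


def pack_frame(can_id, data):
    """Build a raw classic CAN frame from an id and up to 8 data bytes."""
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))


def open_can(iface):
    """Open a raw SocketCAN socket bound to one interface (Linux only)."""
    sock = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    sock.bind((iface,))
    return sock


def forward_once(src, dst):
    """Copy one frame from src to dst unchanged and return it decoded."""
    raw = src.recv(CAN_FRAME_SIZE)
    dst.send(raw)
    return unpack_frame(raw)


def bridge(iface_a="can0", iface_b="can1"):
    """Forward frames in both directions until interrupted."""
    a, b = open_can(iface_a), open_can(iface_b)
    peer = {a: b, b: a}
    while True:
        readable, _, _ = select.select([a, b], [], [])
        for sock in readable:
            forward_once(sock, peer[sock])
```

Calling `bridge()` on a system with both interfaces up forwards traffic until the process is stopped. Because frames are copied byte for byte, any corruption introduced on the receive side would be passed on to the other bus unchanged.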
@jbakuwel Yes, single bit ECC errors are detected and corrected and if enabled the corresponding interrupt is propagated to the linux driver. |
This commit seems to fix the ECC in startup problem: |
@marckleinebudde and @msperl in my test scenario my live CAN system is initially switched off. Then I bring the interfaces up, start the Python script and then start the live CAN system. After a while (sometimes 1 hour sometimes 3 hours) ECC errors start to show up. I've discovered that increasing the traffic on the CAN bus a bit (via the live CAN system) reliably creates the ECC errors pretty much straight away. |
@marckleinebudde If single bit ECC errors are corrected, why would the live CAN system be able to detect those (as it does)? The driver logs a warning in syslog but (I assume) the frame was not dropped but a) corrected and b) passed on to the userland software which then sent it onwards onto the other CAN bus (in my test). |
@jbakuwel The ECC errors you see in your system ( |
@t-kopp It will take me a bit longer to get the requested debug info as the SD card crapped out on which I've been doing my experiments. I thought buying Samsung EVO's would be a safe bet.... |
@marckleinebudde Am I mistaken thinking that "internal" errors inside the on-chip-RAM should not be visible to the CAN nodes (at least not the single bit ones that can be corrected)? Yet my live CAN system notices something is wrong by flagging "a fault" (sorry it's not more verbose than that). |
Depends. If your ECC problems are created in the same way as described below, RX'ed CAN frames are broken. They will be passed broken into the user space and if you send them via the same or another CAN interface to the next CAN node, this node will receive CAN frames with wrong contents. From the CAN bus point of view this transmission is perfectly OK. So far the only ECC errors I have seen were in my driver and in the RAM associated with the RX FIFO. (As noted in #6 (comment), I've changed the driver so that this doesn't happen any more.) My current theory is that the RX-Process in the Chip that moves the data into the RAM and the SPI read commands that read data from RAM interfere, resulting in data being written to the wrong address without updating the ECC. The ECC in chip is designed to correct single bit flips and detect double bit flips. When you overwrite data (without updating the ECC) there's probably more than 2 bit flips. What the chip does with this is unknown. In my test case I send 32 CAN frames with incrementing data bytes and I see more than half of them totally broken. I can show you an example later today.
Yes. But if overwriting happens as speculated above, there are probably more than 2 bit flips.... |
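To make the "correct single bit flips, detect double bit flips" behaviour concrete, here is a small sketch of a generic SECDED code (extended Hamming with 4 data bits). This is only an illustration of the principle: the ECC scheme actually used inside the MCP2517FD is not described in this thread, so the layout and function names below are assumptions.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    bits = [0] * 8  # index 0: overall parity, 1..7: Hamming(7,4) positions
    for pos, shift in zip((3, 5, 6, 7), range(4)):
        bits[pos] = (nibble >> shift) & 1
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p) % 2
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(bits[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(bits) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume exactly one and repair it.
        bits[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detectable only.
        return "uncorrectable", None
    nibble = sum(bits[pos] << shift for pos, shift in zip((3, 5, 6, 7), range(4)))
    return status, nibble
```

In this generic code, three flipped bits can look exactly like a single flip and be "corrected" to the wrong value, which fits the concern that data overwritten without an ECC update may not be flagged reliably.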
@marckleinebudde Thanks for that. I agree with your theory. It is a bit odd though that, in my tests, I first see single bit ECC errors, and later double bit ECC errors. I've not seen a double ECC error first. The driver does not recover after the first double bit ECC error; maybe it does not recover due to more than 2 flipped bits. Unfortunately my new fancy Samsung EVO plus SD card suddenly died, so I'll have to rebuild the system in an otherwise very busy week. May take me a few days... |
@t-kopp Here are the test results with kernel 4.19.102-v7l+, current Seeed Studio driver (which includes this patch) as well as this patch applied. Note the mode switches are new; didn't see them in previous tests. The test started at Feb 13 09:01:57 ran successfully (ie. the live CAN system not detecting faults) until Feb 13 11:59:27 due to kernel: [11077.498185] mcp25xxfd spi1.0: ECC double bit error at 470. Feb 13 10:30:17 test kernel: [ 5726.727061] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3 |
@jbakuwel mhm, am I missing something or are you saying you applied the two patches to the driver from Marc? Both patches were for the driver from Martin so they should not apply here. |
@t-kopp I did not apply the patches to Marc's driver. Obviously :-). I haven't tried but I think the patch would have failed. I haven't applied the first patch as that was already done; I simply cloned a fresh copy from the Seeed Studio driver (I had to rebuild the system). I also tested @marckleinebudde 's kernel and driver (first) but kept the 102 kernel running; Marc mentioned some beneficial (my interpretation) backported kernel SPI patches; see here for the details. Maybe we need one place where we discuss these (likely?) related issues? We are currently discussing it in @msperl 's repo, as well as in Seeed Studio's. |
@t-kopp, @msperl Have you been able to make progress on this issue? Do you think there's a chance we can get this resolved in a reasonable timeframe for the MCP2517FD chipset? It's a real show stopper for us, and if it's not at all certain we can resolve this, we'll have to look for other hardware. If there's anything else I can do to help, please let me know. |
@jbakuwel I haven't been able to reproduce it yet, no. And I agree with your comment from before that it makes sense to discuss in one place. Maybe you can open a new issue and we discuss it there? Were you able to find out the answers to the questions from above? Which FIFOs are causing the errors (TX/RX?) and does enabling the SPI CRC help? |
@t-kopp Which module parameter enables SPI CRC? I've tried a few (use_spi_crc=1 for example) but so far no luck. Can you please let me know how to find the FIFOs that are causing the errors? DebugFS is here but it's not clear to me where I need to look to answer your questions. I'm still seeing double bit ECC errors (see above), for example: Feb 13 11:59:27 test kernel: [11077.500268] mcp25xxfd spi1.0: ECC double bit error at 470 I don't think the mode switches are happening with Marc's driver. I'm about to test the latest branch and will let you know. Would you prefer to have the issue (to discuss these possibly related issues) in this repo or here? |
… reads non signalled TEFs which is undefined behavior according to Microchip
In function mcp25xxfd_can_tx_handle_int_tefif() the call to mcp25xxfd_can_tx_handle_int_tefif_optimized() is removed as it accesses transmit FIFOs that are not signalled by the TEFIF. This leads to undefined behavior and wrong reads, causing the driver's internal TEF state to get out of sync and blocking the transmission of further frames (TEFIFs not freed). The problem is discussed in GitHub issue msperl#6.
Hi,
I'm evaluating the PICAN FD Duo board from Skpang. To test it out, I've connected can0 to can1 and am running cangen -g 0 -m can0. After a short period of time, the command exits with the message: write: No buffer space available.
Kernel:
root@signalserver-canfd:~# uname -a
Linux signalserver-canfd 4.14.98ms7-v7+ #75 SMP Sat Feb 16 22:22:33 UTC 2019 armv7l GNU/Linux
Looking in the kernel log I can see the following messages:
I've tried different values for -g, but it just takes slightly longer to fail. I've also tried to halve spimaxfrequency, following the suggestion in the kernel log.
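The "No buffer space available" message corresponds to the ENOBUFS error, which a raw CAN socket returns when the interface's transmit queue is full. A minimal sketch of how a sender might tell a briefly full queue apart from a queue that never drains is shown below; the function name, retry count and delay are hypothetical, and retrying does not fix the driver lockup reported here.

```python
import errno
import time


def send_with_retry(sock, raw, retries=50, delay_s=0.001, sleep=time.sleep):
    """Send one raw frame, backing off while the TX queue reports ENOBUFS."""
    for _ in range(retries):
        try:
            return sock.send(raw)
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise
            sleep(delay_s)  # queue full: give the driver time to drain it
    raise TimeoutError("TX queue stayed full; the interface may be stuck")
```

If the retries run out, the queue is not draining at all, which matches the stuck state described in this issue rather than ordinary back-pressure from sending with no gap between frames.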