Arasan SD host fifo overrun causes corruption #415
Comments
Interesting. Is this "Arasan module" something that's baked into the GPU hardware, or is it something that gets loaded from start.elf (giving the possibility of a firmware-update fix) ? |
It's the Arasan SD/SDIO host controller IP block as described in the BCM2835 peripheral datasheet. The ARM uses this SD host controller in Linux. |
Why do the corruption problems get worse when overclocking? How is the Arasan controller affected by this? Could the defect really lie on the Broadcom side instead? |
When overclocking the core clock you are changing the ratio between the core and emmc clocks. These are the two clocks used to clock data into and out of the fifo... |
Why does |
No, it's just a normal part of starting up the SD card. |
We're testing a change that may fix some cases of sdcard corruption. You will need the latest firmware. e.g.
To enable, you want in config.txt:
that runs the sdcard clock off the same pll as the gpu core which reduces problems in the clock domain crossing. If you are not overclocking core_freq that is simple. E.g.
If you are overclocking core_freq, you should disable dynamic overclock for now (which may set your warranty bit). e.g.
Then try whatever normally causes sdcard corruption, and report back if anything has changed. edit: this is enabled by default with latest firmware. There is no need to apply any of these settings now. |
hopefully this will help with Kingston Class 10 SD cards. Old ones seemed to be OK but new ones seem very problematic and plagued with timeout issues. Perhaps a standard initramfs with fsck on mount fail is a good idea for mainline kernels? |
Would be very interested in whether this helps... I found on my desk it fails to write to the SD card no matter which one, I tried three different manufacturers! The sudo BRANCH=next rpi-update should fix anything to do with this clock crossing issue ... Be very interested if people are still seeing corruption with this update as well... |
Latest master tree has been updated to support this, so just rpi-update will allow you to test this. If you have suffered from corruption, then please update and try the config options suggested. As always with testing, backing up the sdcard first is advised. |
My Raspi is freezing with arm_freq=1000. My settings are: core_freq=250 force_turbo=1 emmc_pll_core=1. Is that right? |
No, you want init_emmc_clock to match core_freq. So:
(in your case force_turbo is not essential as you are not overclocking core_freq). |
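The rule in the reply above (init_emmc_clock should match core_freq) can be checked mechanically. This is only a minimal sketch: it assumes config.txt consists of key=value lines, that core_freq is given in MHz and init_emmc_clock in Hz (the units are an assumption, based on the 250000000 value that vcgencmd reports elsewhere in this thread), and the helper names are hypothetical.

```python
def parse_config(text):
    """Parse config.txt-style 'key=value' lines into a dict, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def emmc_matches_core(settings):
    """True if emmc_pll_core=1 and init_emmc_clock equals core_freq.

    Assumes core_freq is in MHz and init_emmc_clock is in Hz.
    Returns False if either value is missing or not an integer.
    """
    if settings.get("emmc_pll_core") != "1":
        return False
    try:
        core_hz = int(settings["core_freq"]) * 1_000_000
        emmc_hz = int(settings["init_emmc_clock"])
    except (KeyError, ValueError):
        return False
    return core_hz == emmc_hz
```

Note that this sketch treats a missing core_freq as a mismatch rather than assuming a default value.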
Early days, but this looks promising for me on a model A and SD card that suffered corruption from overclocking in the past. I did notice that the model A is giving lousy i/o performance on the sdcard, but further testing has shown that is unrelated to this fix. |
See: raspberrypi/linux#415 (comment) kernel: enable MROUTE options and CONFIG_CRYPTO_AES_ARM See: raspberrypi/linux#430 See: raspberrypi/linux#358 firmware: audio_render: check for output space more frequently to avoid underrun See: http://forum.stmlabs.com/showthread.php?tid=10118
The latest firmware update enables emmc_pll_core by default. Please test and report. |
Default doesn't seem to work for me; using vcgencmd measure_clock emmc gives me frequency(47)=250000000 with config.txt as follows: but I get frequency(47)=500000000 when I add the below to config.txt this is on: |
@RobfromJoppa (*) You may find you spot the emmc_freq a little lower as the PLL is changing during a core->turbo or core->idle transition. |
ok, thanks. Wasn't sure as core shows frequency(1)=500000000 regardless. |
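For anyone scripting this check, the `frequency(47)=250000000` lines quoted above are easy to parse. A minimal sketch, assuming the output always has exactly that shape; the function name is hypothetical.

```python
import re

# Matches lines such as "frequency(47)=250000000".
_CLOCK_RE = re.compile(r"^frequency\((\d+)\)=(\d+)$")


def parse_measure_clock(output):
    """Return (clock_id, hertz) from one line of measure_clock output."""
    match = _CLOCK_RE.match(output.strip())
    if match is None:
        raise ValueError(f"unexpected measure_clock output: {output!r}")
    return int(match.group(1)), int(match.group(2))
```

When comparing the emmc and core values, a small tolerance may be sensible, since the reply above notes that the emmc reading can be a little lower while the PLL is changing.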
@RobfromJoppa good to hear. |
We can confirm that this fix works well. We have some RPis (especially the ones with Hynix memory) that were reliably corrupting the SD card when overvoltage was required. This does not happen anymore with recent firmware. |
I am experiencing some bizarre issues with the SD card on my model B. It is an AUSDH32GCL10-RA1, and I am constantly experiencing FS corruption (this card and the Pi are brand new. This is my first endeavor with RPi, and it isn't proving to be enjoyable). I am going to try the above-mentioned fix and hopefully it will work. So tired of EXT4 FS errors and Input/Output errors when using apt on Raspbian. |
What is the output of uname -a? What does the error line look like in dmesg (is it an error -110)? |
Hi, I reformatted the card (gparted, fat32), ran a freshly downloaded NOOBS (network install) and installed Raspbian. I cannot run any commands like uname because I get no prompt (it hangs while booting, at mounting local file systems). I tried the settings from user popcornmix (appended to the default BOOT/config.txt), but cannot verify that the settings are applied. I'm new to RPi and have been googling around, but found no other solution. |
Frank-W that's an interesting error, so you can't even boot the very second time (you can clearly boot into NOOBS the first time, but either the second boot is failing or the NOOBS image is failing). Have you tried using an image to install Raspbian to see if the same problem occurs (this will identify whether it is a problem with booting or with NOOBS)? Is this a brand new SD card? Are you UK based? Thanks Gordon |
The NOOBS screen (press shift) is still present, and I can also enter it. By the serial, this is the card (for detailed investigation): sudo dd bs=4M if=2014-01-07-wheezy-raspbian.img of=/dev/sdb Now I have a nicer partition table (sdb1 and sdb2); with NOOBS I got sdb3, sdb5 and sdb6. Trying to boot... yes, the sdcard flashed with the image boots without any errors. Maybe the strange partition table created by NOOBS was at fault. I'm from Germany... |
Thanks for the updated info. Would you mind trying NOOBS again (on the same card) but this time using the full NOOBS download rather than using the network-install option? |
As you suggested, I tried with full NOOBS (not lite = network install) and got the same error on the first reboot after installing Raspbian: mmc0: Timeout waiting for hardware interrupt - cmd12. This is the same card that works if I flash the img directly. After I installed Raspbian I got the following partition table (gparted): /dev/sdb1 fat32 (recovery) 1.34GiB fdisk -l /dev/sdb Device Boot Start End Blocks Id System Maybe it's the wrong partition type for the extended partition (DOS-compatible)? ID 0x85 instead of the normal 0x5... or it's an addressing problem (rootfs is the last partition). |
@ghollingworth has this issue been resolved? If yes, then please close this issue. |
@ghollingworth ping... |
We no longer use the Arasan block for sdcard so I don't believe this issue is present. |
There is currently no reference count being held on the PHY driver, which makes it possible to remove the PHY driver module while the PHY state machine is running and polling the PHY. This could cause crashes similar to this one to show up: [ 43.361162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000140 [ 43.361162] IP: phy_state_machine+0x32/0x490 [ 43.361162] PGD 59dc067 [ 43.361162] PUD 0 [ 43.361162] [ 43.361162] Oops: 0000 [#1] SMP [ 43.361162] Modules linked in: dsa_loop [last unloaded: broadcom] [ 43.361162] CPU: 0 PID: 1299 Comm: kworker/0:3 Not tainted 4.10.0-rc5+ #415 [ 43.361162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014 [ 43.361162] Workqueue: events_power_efficient phy_state_machine [ 43.361162] task: ffff880006782b80 task.stack: ffffc90000184000 [ 43.361162] RIP: 0010:phy_state_machine+0x32/0x490 [ 43.361162] RSP: 0018:ffffc90000187e18 EFLAGS: 00000246 [ 43.361162] RAX: 0000000000000000 RBX: ffff8800059e53c0 RCX: ffff880006a15c60 [ 43.361162] RDX: ffff880006782b80 RSI: 0000000000000000 RDI: ffff8800059e5428 [ 43.361162] RBP: ffffc90000187e48 R08: ffff880006a15c40 R09: 0000000000000000 [ 43.361162] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800059e5428 [ 43.361162] R13: ffff8800059e5000 R14: 0000000000000000 R15: ffff880006a15c40 [ 43.361162] FS: 0000000000000000(0000) GS:ffff880006a00000(0000) knlGS:0000000000000000 [ 43.361162] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.361162] CR2: 0000000000000140 CR3: 0000000005979000 CR4: 00000000000006f0 [ 43.361162] Call Trace: [ 43.361162] process_one_work+0x1b4/0x3e0 [ 43.361162] worker_thread+0x43/0x4d0 [ 43.361162] ? __schedule+0x17f/0x4e0 [ 43.361162] kthread+0xf7/0x130 [ 43.361162] ? process_one_work+0x3e0/0x3e0 [ 43.361162] ? 
kthread_create_on_node+0x40/0x40 [ 43.361162] ret_from_fork+0x29/0x40 [ 43.361162] Code: 56 41 55 41 54 4c 8d 67 68 53 4c 8d af 40 fc ff ff 48 89 fb 4c 89 e7 48 83 ec 08 e8 c9 9d 27 00 48 8b 83 60 ff ff ff 44 8b 73 98 <48> 8b 90 40 01 00 00 44 89 f0 48 85 d2 74 08 4c 89 ef ff d2 8b Keep references on the PHY driver module right before we are going to utilize it in phy_attach_direct(), and conversely when we don't use it anymore in phy_detach(). Signed-off-by: Mao Wenan <maowenan@huawei.com> [florian: rebase, rework commit message] Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Description
With specific (currently unknown) timing situations it has been found that writing to the SD card can result in failure and filesystem corruption due to the fifo full flag being incorrectly sampled between two different clock domains.
Information
With specific SD cards (I've got one from popcornmix) it seems there is a specific error that always happens when trying to write long continuous sequences of data. The error was diagnosed using my SD card protocol analyser (that I wrote using some very cool and very special hardware!) and shows that the data written to the SD card was invalid and that whole sectors of data are missing from the output...
This was done using a piece of test code that directly opened a partition on the SD card and wrote 1MiB of pseudo random sequence (PRS). The protocol analyser then checked the write data to make sure the least significant bit was correct (currently I've only made it sample one data bit). An error was found in the data written to the card from the Arasan module.
I then checked the data actually written to the SDCard by plugging into my linux box and writing it out to a file and finally did a binary comparison between the data read from the card and the pseudo random sequence we were meant to write. There were clear errors in the write data stream such that we missed whole sectors of data.
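The read-back comparison described above can be sketched as follows. This is an illustration only: the original test code and its PRS generator are not shown in this issue, so the seeded generator, the 512-byte sector size and the function names here are all assumptions.

```python
import random

SECTOR_SIZE = 512  # bytes; assumed sector size


def make_prs(n_bytes, seed=1):
    """Generate a reproducible pseudo random byte sequence.

    The generator is a stand-in; the issue does not say which PRS was used.
    """
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n_bytes))


def bad_sectors(expected, actual, sector_size=SECTOR_SIZE):
    """Return indices of sectors where the read-back data differs from the PRS."""
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    bad = []
    for offset in range(0, len(expected), sector_size):
        if expected[offset:offset + sector_size] != actual[offset:offset + sector_size]:
            bad.append(offset // sector_size)
    return bad
```

In use, `expected` would be the regenerated PRS and `actual` the bytes dumped from the card partition; a sector that was never written shows up as a mismatching index.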
MEMORY -> DMA -> FIFO -> ARASAN MODULE -> SDCard
The above shows the flow of data from the memory to the SDCard. I know that the PRS was correct in memory and I get an interrupt from the DMA module when it has finished transmitting the block of data so I know that it was written successfully into the FIFO. But the data is not being written out to the SDCard and we're missing blocks as small as a single sector and up to 1024 bytes at a time (the FIFO is 1024 bytes x 2).
So the only possible issue here is that the DREQ signal output from the ARASAN module to the DMA controller is wrong. This is the signal that tells the DMA whether there is any space for it to write data... But there is no window size feedback, so, as in the classical network model, it is possible for us to overflow that fifo.
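To see why a late or mis-sampled "space available" signal is enough to lose data, here is a toy tick-based model. It is a deliberate simplification and not a description of the real hardware: it assumes the producer (DMA) sees a full flag that is `flag_delay` ticks old and that the consumer (SD side) drains one word every `drain_every` ticks. All names and parameters are hypothetical.

```python
from collections import deque


def simulate(capacity, total_words, flag_delay, drain_every):
    """Return how many words are lost when the producer acts on a stale full flag."""
    level = 0    # words currently held in the FIFO
    written = 0  # words the producer believes it has written
    lost = 0     # words written while the FIFO was really full
    # History of the true full flag; the producer sees the oldest entry.
    seen = deque([False] * (flag_delay + 1), maxlen=flag_delay + 1)
    tick = 0
    while written < total_words:
        tick += 1
        # Consumer side: drain one word every drain_every ticks.
        if tick % drain_every == 0 and level > 0:
            level -= 1
        seen.append(level >= capacity)
        if not seen[0]:  # producer thinks there is space
            written += 1
            if level < capacity:
                level += 1
            else:
                lost += 1  # overrun: the flag was out of date
    return lost
```

With `flag_delay=0` the producer always sees the true flag and nothing is lost; with any delay, words are dropped once the FIFO fills, which matches the symptom of whole chunks missing from the written data.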
Workaround
To confirm the issue I first switched to PIO mode. This can be done fairly easily by making sdhci-bcm2708.c always return 0 from the dmaable() function, so that DMA is not used at all.
Unfortunately there is a further bug that affects PIO mode. The STATUS register takes a number of SD clock cycles to clock the fifo status through, and therefore when we fill the fifo in sdhci_transfer_pio we cannot believe the value of SDHCI_SPACE_AVAILABLE. It will take a number of cycles to come through to that register (and for some reason it may not ever reset...), so we also need to add an unconditional break to this function.
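One reading of this workaround can be illustrated with a toy model: write a single block, then stop trusting the status bit until the next interrupt. This is not the kernel code. `FakeHost`, `transfer_pio` and `run` are hypothetical names, and the "status stays stale for N reads" behaviour is an assumption made purely for illustration.

```python
class FakeHost:
    """Toy controller whose space-available status lags behind the real buffer state."""

    def __init__(self, stale_reads):
        self.stale_reads = stale_reads  # reads that wrongly report space after a write
        self.buffer_free = True
        self.accepted = 0
        self.dropped = 0
        self._stale_left = 0

    def space_available(self):
        if self.buffer_free:
            return True
        if self._stale_left > 0:
            self._stale_left -= 1
            return True  # stale value: the buffer is really full
        return False

    def write_block(self):
        if self.buffer_free:
            self.buffer_free = False
            self.accepted += 1
            self._stale_left = self.stale_reads
        else:
            self.dropped += 1  # written into a full buffer

    def interrupt(self):
        """The card has consumed the block; the buffer is free again."""
        self.buffer_free = True
        self._stale_left = 0


def transfer_pio(host, blocks_left, break_after_one):
    """Write blocks while status reports space; optionally stop after one block."""
    sent = 0
    while blocks_left > 0 and host.space_available():
        host.write_block()
        blocks_left -= 1
        sent += 1
        if break_after_one:
            break  # the workaround: wait for the next interrupt instead
    return sent


def run(total_blocks, break_after_one, stale_reads=2):
    """Drive the transfer interrupt by interrupt; return (accepted, dropped)."""
    host = FakeHost(stale_reads)
    remaining = total_blocks
    while remaining > 0:
        remaining -= transfer_pio(host, remaining, break_after_one)
        host.interrupt()
    return host.accepted, host.dropped
```

In this model the unconditional break costs one interrupt per block, which is consistent with the poor throughput mentioned for this workaround.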
This is only a first stab workaround because it means we end up making the SD card throughput poor! We'd prefer not to do this if possible!
Commit
Not yet committed. Need a solution that works for writing and reading without losing too much performance!