Arasan SD host fifo overrun causes corruption #415

Closed
ghollingworth opened this issue Nov 2, 2013 · 29 comments

Comments

@ghollingworth

Description

Under specific (currently unknown) timing conditions, writing to the SD card can fail and corrupt the filesystem because the FIFO-full flag is incorrectly sampled across two different clock domains.

Information

With specific SD cards (I've got one from popcornmix) it seems there is a specific error that always happens when trying to write long continuous sequences of data. The error was diagnosed using my SD card protocol analyser (that I wrote using some very cool and very special hardware!) and shows that the data written to the SD card was invalid and that whole sectors of data are missing from the output...

This was done using a piece of test code that directly opened a partition on the SD card and wrote 1MiB of pseudo random sequence (PRS). The protocol analyser then checked the write data to make sure the least significant bit was correct (currently I've only made it sample one data bit), and it found an error in the data written to the card from the Arasan module.

I then checked the data actually written to the SD card by plugging it into my Linux box and writing it out to a file, and finally did a binary comparison between the data read from the card and the pseudo random sequence we were meant to write. There were clear errors in the write data stream such that we missed whole sectors of data.
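The compare step described above can be sketched roughly as follows. The original test code is not shown in this thread, so everything here is an assumption for illustration: the 16-bit LFSR used as the pseudo random sequence, the 512-byte sector size, and the function names `prs_bytes` and `mismatched_sectors`.

```python
SECTOR = 512  # assumed sector size in bytes


def prs_bytes(n, seed=0xACE1):
    """Generate n bytes from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11).

    The polynomial and seed are illustrative; the real test may have used
    a different sequence.
    """
    state = seed
    out = bytearray()
    for _ in range(n):
        byte = 0
        for _ in range(8):
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            byte = (byte << 1) | (state & 1)
        out.append(byte)
    return bytes(out)


def mismatched_sectors(expected, actual, sector=SECTOR):
    """Return the indices of sectors whose readback differs from the expected data."""
    bad = []
    for offset in range(0, len(expected), sector):
        if expected[offset:offset + sector] != actual[offset:offset + sector]:
            bad.append(offset // sector)
    return bad
```

In practice `actual` would be the file read back from the card on another machine, and a non-empty result points at the sectors that never made it to the card.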

MEMORY -> DMA -> FIFO -> ARASAN MODULE -> SDCard

The above shows the flow of data from the memory to the SDCard. I know that the PRS was correct in memory and I get an interrupt from the DMA module when it has finished transmitting the block of data so I know that it was written successfully into the FIFO. But the data is not being written out to the SDCard and we're missing blocks as small as a single sector and up to 1024 bytes at a time (the FIFO is 1024 bytes x 2).

So the only possible issue here is that the DREQ signal output from the Arasan module to the DMA controller is wrong. This is the signal that tells the DMA whether there is any space for it to write data. But there is no window-size feedback, so, as in the classic networking model, it is possible for us to overflow that FIFO.
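To illustrate why a late "space available" signal is enough to lose data, here is a toy simulation. It is not a model of the actual Arasan hardware: the FIFO depth, the drain rate, the `lag` in ticks and the function name `simulate` are all hypothetical. The producer (standing in for the DMA) writes whenever the flag it sees says there is space, but that flag is `lag` ticks old.

```python
from collections import deque


def simulate(capacity=8, lag=0, drain_every=4, ticks=200):
    """Count words lost when the producer trusts a stale 'space available' flag."""
    level = 0   # words currently held in the FIFO
    lost = 0    # words written while the FIFO was actually full
    in_flight = deque([True] * lag)  # flag values still on their way to the producer
    for t in range(ticks):
        # The consumer (card side) drains one word every `drain_every` ticks.
        if t % drain_every == 0 and level > 0:
            level -= 1
        # The current flag enters the pipeline; the producer sees the one from `lag` ticks ago.
        in_flight.append(level < capacity)
        sees_space = in_flight.popleft()
        if sees_space:
            if level < capacity:
                level += 1
            else:
                lost += 1  # overflow: the flag was out of date
    return lost
```

With `lag=0` the producer always sees the true state and nothing is lost; with any positive lag and a producer faster than the consumer, writes land in an already full FIFO.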

Workaround

To confirm the issue I first switched to PIO mode. This can be done fairly easily by making the dmaable() function in sdhci-bcm2708.c always return 0, so that DMA is not used at all.

Unfortunately there is a further bug that affects PIO mode. The STATUS register takes a number of SD clock cycles to clock the FIFO status through, so when we fill the FIFO in sdhci_transfer_pio we cannot trust the value of SDHCI_SPACE_AVAILABLE. It takes a number of cycles to come through to that register (and for some reason it may never reset...), so we also need to add an unconditional break to this function.

This is only a first stab workaround because it means we end up making the SD card throughput poor! We'd prefer not to do this if possible!

Commit

Not yet committed. Need a solution that works for writing and reading without losing too much performance!

@lurch
Contributor

lurch commented Nov 4, 2013

Interesting. Is this "Arasan module" something that's baked into the GPU hardware, or is it something that gets loaded from start.elf (giving the possibility of a firmware-update fix) ?

@P33M
Contributor

P33M commented Nov 4, 2013

It's the Arasan SD/SDIO host controller IP block as described in the BCM2835 peripheral datasheet. The ARM uses this SD host controller in Linux.

@zerxy

zerxy commented Nov 4, 2013

Why do the corruption problems get worse when overclocking? How is the Arasan controller affected by this? Could the defect really lie on the Broadcom side instead?

@ghollingworth
Author

When overclocking the core clock you are changing the ratio between the core and emmc clocks, these are the two clocks used to clock data into and out of the fifo...

@zerxy

zerxy commented Nov 4, 2013

Why does the message "mmc0: read SD Status register (SSR) after x attempts" (where, in my experience, x can be anything up to 10) appear during the majority of boots? Is this somehow related to the corruption issue?

@ghollingworth
Author

No, it's just a normal part of starting up the SD card.

@popcornmix
Collaborator

We're testing a change that may fix some cases of sdcard corruption. You will need the latest firmware. e.g.

sudo rpi-update

To enable, you want in config.txt:

emmc_pll_core=1

This runs the sdcard clock off the same PLL as the GPU core, which reduces problems in the clock domain crossing.
You probably also want to set the sdcard clock to the same frequency as the GPU core clock.

If you are not overclocking core_freq that is simple. E.g.

init_emmc_clock=250000000

If you are overclocking core_freq, you should disable dynamic overclock for now (which may set your warranty bit). e.g.

force_turbo=1
core_freq=500
init_emmc_clock=500000000

Then try whatever normally causes sdcard corruption, and report back if anything has changed.

edit: this is enabled by default with latest firmware. There is no need to apply any of these settings now.

@samnazarko
Contributor

hopefully this will help with Kingston Class 10 SD cards. Old ones seemed to be OK but new ones seem very problematic and plagued with timeout issues.

Perhaps a standard initramfs with fsck on mount fail is a good idea for mainline kernels?

@ghollingworth
Author

Would be very interested in whether this helps... I found on my desk it fails to write to the SD card no matter which one, I tried three different manufacturers!

The sudo BRANCH=next rpi-update should fix anything to do with this clock crossing issue ... Be very interested if people are still seeing corruption with this update as well...

@popcornmix
Collaborator

The latest master tree has been updated to support this, so just rpi-update will allow you to test this.

If you have suffered from corruption, then please update and try the config options suggested.
Even if you don't experience corruption we'd still like as many testers of these options as possible, as we'd like to enable this as a default, but need confirmation that the config options are safe.

As always with testing, backing up the sdcard first is advised.

@xnosek00

My Raspi freezes with arm_freq=1000,
so I have to use arm_freq=950.

My settings are then:
arm_freq=950

core_freq=250
sdram_freq=450
over_voltage=6

force_turbo=1
init_emmc_clock=450000000

emmc_pll_core=1

Is that right?

@popcornmix
Collaborator

No, you want init_emmc_clock to match core_freq. So:

arm_freq=950
core_freq=250
sdram_freq=450
over_voltage=6
init_emmc_clock=250000000
emmc_pll_core=1

(in your case force_turbo is not essential as you are not overclocking core_freq).
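The rule given in this reply (init_emmc_clock in Hz should equal core_freq in MHz) can be checked mechanically. The following is a minimal sketch; the helper names `parse_config` and `emmc_matches_core` are made up for illustration, and the 250 MHz default for an unset core_freq is an assumption taken from the example earlier in the thread.

```python
def parse_config(text):
    """Parse key=value lines from config.txt-style text, ignoring '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def emmc_matches_core(text, default_core_mhz=250):
    """Return True if init_emmc_clock (Hz) equals core_freq (MHz) in the given text.

    default_core_mhz is an assumed fallback used when core_freq is not set.
    """
    cfg = parse_config(text)
    core_hz = int(cfg.get("core_freq", default_core_mhz)) * 1_000_000
    emmc = cfg.get("init_emmc_clock")
    return emmc is not None and int(emmc) == core_hz
```

Run against the settings posted above, the version with init_emmc_clock=450000000 fails the check and the corrected version with init_emmc_clock=250000000 passes.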

@RobfromJoppa

Early days, but this looks promising for me on a model A and SD card that suffered corruption from overclocking in the past.

I did notice that the model A is giving lousy i/o performance on the sdcard, but further testing has shown that is unrelated to this fix.

popcornmix pushed a commit to raspberrypi/firmware that referenced this issue Nov 11, 2013
See: raspberrypi/linux#415 (comment)

kernel: enable MROUTE options and CONFIG_CRYPTO_AES_ARM
See: raspberrypi/linux#430
See: raspberrypi/linux#358

firmware: audio_render: check for output space more frequently to avoid underrun
See: http://forum.stmlabs.com/showthread.php?tid=10118
popcornmix pushed a commit to Hexxeh/rpi-firmware that referenced this issue Nov 11, 2013
See: raspberrypi/linux#415 (comment)

kernel: enable MROUTE options and CONFIG_CRYPTO_AES_ARM
See: raspberrypi/linux#430
See: raspberrypi/linux#358

firmware: audio_render: check for output space more frequently to avoid underrun
See: http://forum.stmlabs.com/showthread.php?tid=10118
@popcornmix
Collaborator

The latest firmware update enables emmc_pll_core by default.
It should handle turbo mode automatically, so the additional config options are no longer required.

Please test and report.

@RobfromJoppa

Default doesn't seem to work for me; using vcgencmd measure_clock emmc gives me frequency(47)=250000000 with config.txt as follows:
pi@pi-a2 ~ $ cat /boot/config.txt |grep -v '#'
arm_freq=1000
gpu_mem=64
core_freq=500
sdram_freq=500
over_voltage=6
force_turbo=1

but I get frequency(47)=500000000 when I add the below to config.txt
emmc_pll_core=1
init_emmc_clock=500000000

this is on:
pi@pi-a2 ~ $ uname -a && vcgencmd version
Linux pi-a2 3.10.18+ #594 PREEMPT Wed Nov 13 17:59:34 GMT 2013 armv6l GNU/Linux
Nov 12 2013 23:43:03
Copyright (c) 2012 Broadcom
version 4aee5454c7955e7bc0bbb152ca4c0e26e75376e1 (clean) (release)

@popcornmix
Collaborator

@RobfromJoppa
That's working as expected. The emmc clock is set to the idle core frequency(*).
We've found that running the core faster than the emmc clock is safe, as long as both run from the same PLL.

(*) You may occasionally see a slightly lower emmc frequency, as the PLL is changing during a core->turbo or core->idle transition.

@RobfromJoppa

ok, thanks. Wasn't sure as core shows frequency(1)=500000000 regardless.
This fix does seem to have eliminated sd card corruption for me, and is running fine on another device with the rootfs on USB for the last few days.

@popcornmix
Collaborator

@RobfromJoppa good to hear.

@julianscheel
Contributor

We can confirm that this fix works well. We have some RPis (especially the ones with Hynix memory) that were reliably corrupting the SD card when over voltage was required. This does not happen anymore with recent firmware.

@thebigredgeek

I am experiencing some bizarre issues with the SD card on my model B. It is an AUSDH32GCL10-RA1, and I am constantly experiencing FS corruption (this card and the Pi are brand new. This is my first endeavor with RPi, and it isn't proving to be enjoyable). I am going to try the above mentioned fix and hopefully it will work. So tired of EXT4 FS errors and Input/Output errors when using apt on Raspbian.

@ghollingworth
Author

What is the output of uname -a?

What does the error line look like in dmesg (is it an error -110)?

@frank-w
Contributor

frank-w commented Feb 11, 2014

Hi,
I got the same error (-110) using a Samsung 16GB SD card.

I reformatted the card (gparted, FAT32), ran a freshly downloaded NOOBS (network install) and installed Raspbian. I cannot run any commands like uname because I never get a prompt (it hangs during boot while mounting local file systems).
Before that I got an error that debugfs cannot be created, but that doesn't seem too bad :)

I tried the settings from user popcornmix (appended to the default BOOT/config.txt), but cannot verify that the settings are applied.

I'm new to the RPi and have been googling around, but found no other solution.

@ghollingworth
Author

Frank-W, that's an interesting error: so you can't even boot the second time (you can clearly boot into NOOBS the first time, but either the second boot is failing or the NOOBS image is failing).

Have you tried installing Raspbian from an image to see if the same problem occurs? (This will identify whether it is a problem with booting or with NOOBS.) Is this a brand new SD card?

Are you UK based?

Thanks

Gordon

@frank-w
Contributor

frank-w commented Feb 12, 2014

The NOOBS screen (press Shift) is still present, and I can also enter it.
I got the Pi from a friend together with the SD card and power adapter. OpenELEC was working well before, but I cannot install anything on it, so I want to install Raspbian.
Where do I get the Raspbian image, and how do I install it? OK, found it at http://www.raspberrypi.org/downloads ;)
I downloaded it and ran
echo "9d0afbf932ec22e3c29d793693f58b0406bcab86"&&sha1sum 2014-01-07-wheezy-raspbian.zip
Both checksums match.

Going by the serial number, this is the card (for detailed investigation):
http://www.samsung.com/latin_en/consumer/monitor-peripherals-printer/memory-storage/external-drives/MB-SPAGA/US
I ran "badblocks -w -s -b 4096 /dev/sdb" on this card and got no errors (2 passes, 0xaa and 0x55).

sudo dd bs=4M if=2014-01-07-wheezy-raspbian.img of=/dev/sdb

Now I have a nicer partition table (sdb1 and sdb2); with NOOBS I got sdb3, sdb5 and sdb6. Trying to boot... yes, the SD card flashed with the image boots without any errors. Maybe the strange partition table created by NOOBS was at fault.
Thanks for your support.

I'm from Germany...

@lurch
Contributor

lurch commented Feb 15, 2014

Thanks for the updated info. Would you mind trying NOOBS again (on the same card) but this time using the full NOOBS download rather than using the network-install option?

@frank-w
Contributor

frank-w commented Feb 21, 2014

As you suggested, I tried the full NOOBS (not the lite/network version) and got the same error on the first reboot after installing Raspbian.

mmc0: Timeout waiting for hardware interrupt - cmd12.
error -110 sending stop command, original cmd response 0x900, card status 0xf00

This is the same card that works when the image is flashed directly.

After I installed Raspbian I got the following partition table (gparted):

/dev/sdb1   fat32 (recovery)   1.34 GiB
free space                     3.85 MiB
/dev/sdb2   extended           13.68 GiB
free space                     4.00 MiB
/dev/sdb5   fat32 (boot)       60.00 MiB
free space                     4.00 MiB
/dev/sdb6   ext4 (root)        13.62 GiB
/dev/sdb3   ext4 (settings)    32 MiB

fdisk -l /dev/sdb
Disk /dev/sdb: 16.2 GB, 16172187648 bytes
4 heads, 16 sectors/track, 493536 cylinders, total 31586304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000981cb

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     2818359     1408156    e  W95 FAT16 (LBA)
/dev/sdb2         2826240    31520767    14347264   85  Linux extended
/dev/sdb3        31520768    31586303       32768   83  Linux
/dev/sdb5         2834432     2957311       61440    c  W95 FAT32 (LBA)
/dev/sdb6         2965504    31520767    14277632   83  Linux

Maybe it's the wrong partition type for the extended partition (DOS-compatible)? It has ID 0x85 instead of the usual 0x5... or it's an addressing problem (the rootfs is the last partition).

@Ruffio

Ruffio commented Aug 10, 2016

@ghollingworth has this issue been resolved? If yes, then please close this issue.

@Ruffio

Ruffio commented Aug 29, 2016

@ghollingworth ping...

@popcornmix
Collaborator

We no longer use the Arasan block for sdcard so I don't believe this issue is present.

popcornmix pushed a commit that referenced this issue Feb 13, 2017
There is currently no reference count being held on the PHY driver,
which makes it possible to remove the PHY driver module while the PHY
state machine is running and polling the PHY. This could cause crashes
similar to this one to show up:

[   43.361162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
[   43.361162] IP: phy_state_machine+0x32/0x490
[   43.361162] PGD 59dc067
[   43.361162] PUD 0
[   43.361162]
[   43.361162] Oops: 0000 [#1] SMP
[   43.361162] Modules linked in: dsa_loop [last unloaded: broadcom]
[   43.361162] CPU: 0 PID: 1299 Comm: kworker/0:3 Not tainted 4.10.0-rc5+ #415
[   43.361162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[   43.361162] Workqueue: events_power_efficient phy_state_machine
[   43.361162] task: ffff880006782b80 task.stack: ffffc90000184000
[   43.361162] RIP: 0010:phy_state_machine+0x32/0x490
[   43.361162] RSP: 0018:ffffc90000187e18 EFLAGS: 00000246
[   43.361162] RAX: 0000000000000000 RBX: ffff8800059e53c0 RCX:
ffff880006a15c60
[   43.361162] RDX: ffff880006782b80 RSI: 0000000000000000 RDI:
ffff8800059e5428
[   43.361162] RBP: ffffc90000187e48 R08: ffff880006a15c40 R09:
0000000000000000
[   43.361162] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff8800059e5428
[   43.361162] R13: ffff8800059e5000 R14: 0000000000000000 R15:
ffff880006a15c40
[   43.361162] FS:  0000000000000000(0000) GS:ffff880006a00000(0000)
knlGS:0000000000000000
[   43.361162] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.361162] CR2: 0000000000000140 CR3: 0000000005979000 CR4:
00000000000006f0
[   43.361162] Call Trace:
[   43.361162]  process_one_work+0x1b4/0x3e0
[   43.361162]  worker_thread+0x43/0x4d0
[   43.361162]  ? __schedule+0x17f/0x4e0
[   43.361162]  kthread+0xf7/0x130
[   43.361162]  ? process_one_work+0x3e0/0x3e0
[   43.361162]  ? kthread_create_on_node+0x40/0x40
[   43.361162]  ret_from_fork+0x29/0x40
[   43.361162] Code: 56 41 55 41 54 4c 8d 67 68 53 4c 8d af 40 fc ff ff
48 89 fb 4c 89 e7 48 83 ec 08 e8 c9 9d 27 00 48 8b 83 60 ff ff ff 44 8b
73 98 <48> 8b 90 40 01 00 00 44 89 f0 48 85 d2 74 08 4c 89 ef ff d2 8b

Keep references on the PHY driver module right before we are going to
utilize it in phy_attach_direct(), and conversely when we don't use it
anymore in phy_detach().

Signed-off-by: Mao Wenan <maowenan@huawei.com>
[florian: rebase, rework commit message]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this issue Feb 27, 2017
See: raspberrypi/linux#415 (comment)

kernel: enable MROUTE options and CONFIG_CRYPTO_AES_ARM
See: raspberrypi/linux#430
See: raspberrypi/linux#358

firmware: audio_render: check for output space more frequently to avoid underrun
See: http://forum.stmlabs.com/showthread.php?tid=10118
popcornmix pushed a commit that referenced this issue Jun 17, 2017
[ Upstream commit cafe8df ]

There is currently no reference count being held on the PHY driver,
which makes it possible to remove the PHY driver module while the PHY
state machine is running and polling the PHY. This could cause crashes
similar to this one to show up:

[   43.361162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
[   43.361162] IP: phy_state_machine+0x32/0x490
[   43.361162] PGD 59dc067
[   43.361162] PUD 0
[   43.361162]
[   43.361162] Oops: 0000 [#1] SMP
[   43.361162] Modules linked in: dsa_loop [last unloaded: broadcom]
[   43.361162] CPU: 0 PID: 1299 Comm: kworker/0:3 Not tainted 4.10.0-rc5+ #415
[   43.361162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[   43.361162] Workqueue: events_power_efficient phy_state_machine
[   43.361162] task: ffff880006782b80 task.stack: ffffc90000184000
[   43.361162] RIP: 0010:phy_state_machine+0x32/0x490
[   43.361162] RSP: 0018:ffffc90000187e18 EFLAGS: 00000246
[   43.361162] RAX: 0000000000000000 RBX: ffff8800059e53c0 RCX:
ffff880006a15c60
[   43.361162] RDX: ffff880006782b80 RSI: 0000000000000000 RDI:
ffff8800059e5428
[   43.361162] RBP: ffffc90000187e48 R08: ffff880006a15c40 R09:
0000000000000000
[   43.361162] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff8800059e5428
[   43.361162] R13: ffff8800059e5000 R14: 0000000000000000 R15:
ffff880006a15c40
[   43.361162] FS:  0000000000000000(0000) GS:ffff880006a00000(0000)
knlGS:0000000000000000
[   43.361162] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   43.361162] CR2: 0000000000000140 CR3: 0000000005979000 CR4:
00000000000006f0
[   43.361162] Call Trace:
[   43.361162]  process_one_work+0x1b4/0x3e0
[   43.361162]  worker_thread+0x43/0x4d0
[   43.361162]  ? __schedule+0x17f/0x4e0
[   43.361162]  kthread+0xf7/0x130
[   43.361162]  ? process_one_work+0x3e0/0x3e0
[   43.361162]  ? kthread_create_on_node+0x40/0x40
[   43.361162]  ret_from_fork+0x29/0x40
[   43.361162] Code: 56 41 55 41 54 4c 8d 67 68 53 4c 8d af 40 fc ff ff
48 89 fb 4c 89 e7 48 83 ec 08 e8 c9 9d 27 00 48 8b 83 60 ff ff ff 44 8b
73 98 <48> 8b 90 40 01 00 00 44 89 f0 48 85 d2 74 08 4c 89 ef ff d2 8b

Keep references on the PHY driver module right before we are going to
utilize it in phy_attach_direct(), and conversely when we don't use it
anymore in phy_detach().

Signed-off-by: Mao Wenan <maowenan@huawei.com>
[florian: rebase, rework commit message]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>