dwc_otg: driver does not handle data toggle errors on SPLIT transactions #241

P33M · 2013-03-02T20:29:17Z

Hello (again)

A number of threads and random crashes reported by users share a common undercurrent:
#195

http://www.raspberrypi.org/phpBB3/viewtopic.php?f=28&t=16280
http://www.raspberrypi.org/phpBB3/viewtopic.php?f=44&t=8010&hilit=serial+toggle

There are two main themes here:

- a broken USB device (FT232BM/BL / FT8U232 / various other weird and wonderful ones) consistently causes the Pi to lock up if the device is accessed
- sporadic crashes are reported after days or weeks of use with otherwise functional devices

The constant factor in both cases is that the devices are full-speed devices plugged into a high-speed hub, be it the model B ports or a downstream hub. The crash is caused by a data toggle error condition, arising either because the broken USB device sends a packet with the wrong data PID, or because genuine variations in the aether cause a device to drop a packet.

I have an FT232BL device, and the Pi locks up as a result of a usb_control_msg that is sent when the serial port is opened in any garden-variety terminal emulator. It works perfectly well if the speed is forced to Full, either via a command line parameter or by plugging the device directly into a model A.

If a data toggle error occurs when the device is behind a transaction translator, it appears that the HC gets halted with the datatglerr interrupt bit set. The driver currently does not handle this at all well and goes into an infinite loop. The relevant interrupt handler is not called from handle_hc_chhltd_intr_dma.
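
To make the missing dispatch concrete, here is a minimal model of the decision a channel-halt handler has to make. It is a sketch and not the dwc_otg source: `classify_halt`, the `halt_cause` enum, the `check_toggle` flag and the `M_HCINT_*` bit positions are all hypothetical names and values chosen for illustration.

```c
#include <stdint.h>

/* Illustrative bit positions for this model only; treat them as assumptions. */
#define M_HCINT_CHHLTD     (1u << 1)
#define M_HCINT_XACTERR    (1u << 7)
#define M_HCINT_DATATGLERR (1u << 10)

enum halt_cause { HALT_XACTERR, HALT_DATATGLERR, HALT_UNKNOWN };

/*
 * Decide which handler a halted channel should be routed to.
 * check_toggle == 0 ignores the toggle bit, modelling the behaviour
 * described above; check_toggle != 0 routes it to a toggle handler.
 */
enum halt_cause classify_halt(uint32_t hcint, int check_toggle)
{
    if (hcint & M_HCINT_XACTERR)
        return HALT_XACTERR;
    if (check_toggle && (hcint & M_HCINT_DATATGLERR))
        return HALT_DATATGLERR;
    return HALT_UNKNOWN; /* no handler runs, so the condition is never cleared */
}
```

In this model, a halt with only the toggle bit set comes back as `HALT_UNKNOWN` unless the check is enabled, which corresponds to the case where the driver never makes progress.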

The data toggle error interrupt handler as it currently exists is basically a stub, and does nothing but disable the interrupt. For a split transaction, I think the correct recovery is to restart the split and flip the expected data toggle bit in the HC's register.
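
The recovery suggested here can be sketched on a small model of a host channel. This is only an illustration under stated assumptions: `struct model_channel`, its fields and `recover_split_toggle_error` are hypothetical and do not correspond to the real driver structures or registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one host channel; not the real dwc_otg structures. */
enum data_pid { PID_DATA0 = 0, PID_DATA1 = 1 };

struct model_channel {
    enum data_pid expected_pid; /* toggle the host expects next */
    bool do_split;              /* transfer goes through a transaction translator */
    bool complete_split;        /* false: start-split phase, true: complete-split */
    bool halted;                /* channel has been halted by the controller */
};

/*
 * Sketch of the proposed recovery: on a toggle error during a split,
 * flip the expected toggle and restart from the start-split phase.
 * Returns true if the channel was re-armed.
 */
bool recover_split_toggle_error(struct model_channel *ch)
{
    assert(ch != NULL);
    if (!ch->do_split)
        return false; /* the non-split case is outside this sketch */

    ch->expected_pid = (ch->expected_pid == PID_DATA0) ? PID_DATA1 : PID_DATA0;
    ch->complete_split = false; /* restart the split from the beginning */
    ch->halted = false;         /* re-enable the channel */
    return true;
}
```

The model deliberately leaves non-split channels untouched, since the suggestion above only covers split transactions.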

ghollingworth · 2013-03-02T20:59:22Z

Might be best to leave this to me since I'm changing split transactions so
significantly

Gordon

P33M · 2013-03-03T15:02:15Z

Well, I have done a basic fix: add the requisite interrupt handling in handle_hc_chhltd_intr_dma and perform steps equivalent to those for a xacterr channel halt. This serves as a retry mechanism for up to 3 transfer attempts, which is enough for both control and bulk transfers to sort themselves out.

Note that the data toggle "error" is also used to reset the QTD error count for non-split IN transfers as per previous investigation into #217 - patch leaves this functionality unaltered.
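
The two behaviours described above (a bounded retry for splits, and a reset of the error count for non-split IN transfers) can be modelled roughly as follows. This is a sketch under assumptions, not the patch itself: `struct model_qtd`, `on_toggle_error`, the enums and `MAX_ATTEMPTS` are hypothetical names, and the exact threshold comparison used by the driver is not shown in this thread.

```c
/* From the thread: up to 3 transfer attempts before giving up. */
#define MAX_ATTEMPTS 3

/* Hypothetical stand-in for the per-transfer error counter. */
struct model_qtd { int error_count; };

enum xfer_kind { XFER_SPLIT, XFER_NONSPLIT_IN };
enum halt_action { ACTION_RETRY, ACTION_FAIL };

/* Decide what to do when a channel halts with a data toggle error. */
enum halt_action on_toggle_error(struct model_qtd *qtd, enum xfer_kind kind)
{
    if (kind == XFER_NONSPLIT_IN) {
        /* Pre-existing behaviour described above: reset the error count. */
        qtd->error_count = 0;
        return ACTION_RETRY;
    }

    /* Split case: treat it like a transaction error and count the attempt. */
    qtd->error_count++;
    if (qtd->error_count >= MAX_ATTEMPTS)
        return ACTION_FAIL;
    return ACTION_RETRY;
}
```

With a fresh counter, the split path in this model retries twice and fails on the third consecutive error, giving three attempts in total.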

With this patch, my FT232BL now works! This device evidently powers on with its data PID counters in the wrong state - on the first access I get a toggle error for both control and bulk IN endpoints but subsequently all other transfers complete fine.
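
The "one error on first access, then fine" pattern is what a simple toggle model predicts when the device starts out of step with the host. The sketch below assumes that the recovery makes the host adopt the device's toggle and that both sides then advance in lockstep; `count_toggle_errors` and the `pid` enum are hypothetical names for this illustration only.

```c
enum pid { DATA0, DATA1 };

static enum pid flip(enum pid p)
{
    return p == DATA0 ? DATA1 : DATA0;
}

/*
 * Count the toggle errors seen while completing `transfers` packets on one
 * endpoint, given the toggle the host expects and the one the device sends.
 */
int count_toggle_errors(enum pid host_expected, enum pid device_next, int transfers)
{
    int errors = 0;

    for (int i = 0; i < transfers; i++) {
        if (device_next != host_expected) {
            errors++;
            /* Assumed recovery: flip the expectation and retry the packet. */
            host_expected = flip(host_expected);
        }
        /* Packet accepted: both sides advance their toggle. */
        host_expected = flip(host_expected);
        device_next = flip(device_next);
    }
    return errors;
}
```

Starting the host at `DATA0` and the device at `DATA1` yields a single error in this model regardless of how many packets follow, while matching start states yield none.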

ghollingworth · 2013-03-06T07:19:13Z

Excellent,

I've recently got most of my FIQ code scheduling the split transactions,
which, depending on how long the tail of random interactions between
different ports / hubs turns out to be, means we should be close to getting
a proper solution...

Gordon

P33M · 2013-03-07T12:01:45Z

Closing issue: positive noises on the forum about the patch.


P33M mentioned this issue Mar 3, 2013

dwc_otg: add handling of SPLIT transaction data toggle errors #242

Merged

P33M closed this as completed Mar 7, 2013

m-kozlowski mentioned this issue May 4, 2015

Backport "dwc_otg: add handling of SPLIT transaction data toggle errors" radxa/linux-rockchip#6

Merged
