Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

USB: File writing hang regression for big files #3452

Closed
Willy-JL opened this issue Feb 15, 2024 · 14 comments · Fixed by #3705
Closed

USB: File writing hang regression for big files #3452

Willy-JL opened this issue Feb 15, 2024 · 14 comments · Fixed by #3705
Assignees
Labels
Core+Services HAL, furi & core system services USB BadUSB + physical USB interface

Comments

@Willy-JL
Copy link
Contributor

Willy-JL commented Feb 15, 2024

Describe the bug.

Have 10+ user reports, including first hand, of qFlipper consistently hanging during the update process while writing resources.tar (although at random points throughout the write operation). I tested back through the commit history and narrowed it down to #3358. In particular, reverting the USB CDC parameter changes fixes the issues.

Reproduction

  1. Connect to qFlipper
  2. Start an update operation (eg. Install from file)
  3. At some random point through resources.tar, writes hang
  4. qFlipper reports timeout, update fails

Target

Flipper USB -> qFlipper on all operating systems

Logs

In Flipper serial debug logs, write operations flow through normally until they suddenly stop, for no apparent reason, and a few seconds later qFlipper reports a timeout error. No other information is present in the logs.

Anything else?

The PR in question mentions "switch from 2x unidirectional data endpoints to 1x bidirectional".
Seems like this is what is causing issues, although not clear why. Is there any more info on this change?

@skotopes skotopes added USB BadUSB + physical USB interface Core+Services HAL, furi & core system services labels Feb 16, 2024
@Willy-JL
Copy link
Contributor Author

Willy-JL commented Mar 4, 2024

I have also seen multiple reports of DFU CRC mismatches or of Invalid Manifests, when trying to install via USB when the current firmware has the aforementioned USB CDC changes. Currently trying to test whether they are directly linked, but seems suspicious that there's no reports of these symptoms when installing via USB on firmware that doesn't have the changes to USB CDC parameters...

I think it would be beneficial to double check them before they hit an RC...

@doomwastaken
Copy link
Member

Greetings. We were unable to reproduce it on our test systems, could you provide additional details:

OS version you are using
Firmware and target firmware you are trying to flash
What kind of pc/laptop you are using so we can investigate further.

We will continue testing on other systems meanwhile

@Willy-JL
Copy link
Contributor Author

Willy-JL commented Mar 5, 2024

So the DFU CRC mismatches I have not been able to replicate myself, but noticed that people reporting it are under similar conditions to the qFlipper write timeouts. The qFlipper timeouts however I can consistently replicate myself.

I am using Fedora Linux 37, but I have had it happen first hand on Windows too, and heard of it happening on Mac too.

Installed firmware that causes these issues is anything with those USB CDC parameters being changed. Now, I have been testing on latest OFW dev branch at the time of writing, and also Momentum dev branch after I re-insert the USB CDC changes. At the time of reporting this issue, I was testing on OFW dev branch, and on XFW dev branch too before reverting the USB CDC changes which fixed it. Also at the time of reporting, there were reports of it happening on RM too, and there too removing USB CDC changes fixed it (quote "FIX GLITCHY INSTALL" from his git history).

Target firmware is anything. The issue is random, so a firmware with a bigger resources.tar will have a higher chance of encountering the issue. While installing Momentum / XFW (around 6MB of resources.tar), it happens 4 out of 5 times, at a random point through the writing of resources.tar. I have gotten it to happen while installing OFW dev branch too, though it is a lower chance due to the dramatically smaller resources.tar.

Pc is anything. I have observed it first hand on my desktop pc running a ryzen 5600 and b550 mobo, both on front and back usb ports, on both Fedora and Windows. Also on my laptop, a huawei matebook with ryzen 3500u, on both arch linux and windows. And as i said, have heard from dozens of people on very diverse platforms, though by now it has gone quiet due to reverting those parameter changes downstream fixing the issue entirely.

To get some rough expectations of frequency of this issue, i tested on my main desktop machine on fedora. Installed firmware is OFW dev branch (1070064).

  • While trying to install OFW dev branch again ("Reinstall" button in qFlipper): 1/10 times roughly
  • While trying to install MFW/XFW dev branch (6MB resources) ("Install from file..." button): 5/10 times roughly

once again, due to the size of the resources.tar providing a larger window for the bug to manifest itself.

I mean no disrespect, but I strongly feel that this is a problem on the flipper side, rather than on the pc side. After the issue occurs, there is no way to re-establish USB functionality until a flipper reboot. You can connect to another pc, restart the pc, restart qflipper, but nothing will help. The USB interface of the flipper is simply stuck.

@Willy-JL
Copy link
Contributor Author

Willy-JL commented Mar 5, 2024

after compiling ofw dev branch locally with DEBUG=1 and attaching to debugger after the hang (in this case crash) reveals that it is having issues at applications/services/rpc/rpc_cli.c:75 furi_assert(fed_bytes == size_received);. i dont understand the exact function of this code, but what my first uneducated guesses are:

  • writing data is having some issues, so data received at other end is not same as flipper sent?
  • compiling with DEBUG=1 gets furi_assert() to actually work, so it crashes rather than continuing silently
  • on builds with DEBUG=0 (ofw dev branch with qflipper, or aforementioned MFW/XFW/RM) it continues silently but later on encounters a catastrophic failure due to the previous incorrect communications.

@doomwastaken
Copy link
Member

Thank you for such a complete feedback, I will try to replicate this issue myself. Will get back to you today

@Willy-JL
Copy link
Contributor Author

Willy-JL commented Mar 11, 2024

just wanted to say that this is still an issue. got also multiple reports from people installing momentum firmware with the webupdater that it stops while uploading resources.tar to flipper. this doesnt seem to be an issue with momentum, since they are on ofw, trying to install another firmware, and it is failing to upload the file, rather than installing it.

@Willy-JL Willy-JL changed the title USB: qFlipper update hang regression in lengthy file writes USB: File writing hang regression for big files Mar 11, 2024
@Willy-JL
Copy link
Contributor Author

please, is there ANY info on what that USB CDC change did? because so far, ive seen no info on it, no benefits from it (no faster speeds or anything), and just random file writes failing and hanging the USB interface indefinitely.

@doomwastaken
Copy link
Member

Was not able to ever reproduce it on any of the systems. My colleague managed to get it once on home windows pc, however it involved using a faulty SD card and fully wiping it prior to update attempt.

Are you using any usb hub by any chance?

@Willy-JL
Copy link
Contributor Author

There is no hub. Using otherwise the same system and setup, with the CDC changes included it happens quite often when writing large files over rpc, with the CDC changes reverted it never happens. There are no other variables, it's just "CDC parameter changes" or "no CDC parameter changes" in the currently installed firmware, that affect this issue.

Ive even heard now from people struggling to install Momentum because of this, after telling them to downgrade to OFW 0.97.1 to be sure that this bug isn't included, and they couldn't even downgrade! From OFW downgrading to lower OFW to fix the issue, even that wasn't wanting to work, they had to DFU.

@doomwastaken
Copy link
Member

Alright, thank you for details.
So far I've tested it on 5 separate machines, Win 10, Win 11, Ubuntu, M1 mac 12.5, Intel mac 11.6 and 14.4 m1 mac (same machine), oddly enough no matter the amount of attempts, I was not able to reproduce.

Could you be able to take stack trace on official firmware?

@Willy-JL
Copy link
Contributor Author

Willy-JL commented Mar 14, 2024

oddly enough, ive only gotten the crash once, which is what i posted. its usually just a hang. flipper continues to work, but the USB interface is dead. my assumption is that the crash i got there is one of the many possible side effects from silent corruption caused by those changes we are debating over, while the most common outcome is the USB interface locking up and qflipper timing out.

i will try to find the time to get more info, but with all due respect, we are talking about just 4 parameter changes, only 4 lines, that were changed with no additional context, it worked perfectly before, and it randomly breaks after, with no explanation or known improvement by changing those 4 numbers.

works:

#define CDC0_RXD_EP 0x01
#define CDC0_NTF_EP 0x83
#define CDC1_TXD_EP 0x85
#define CDC1_NTF_EP 0x86

breaks:

#define CDC0_RXD_EP 0x02
#define CDC0_NTF_EP 0x81
#define CDC1_TXD_EP 0x84
#define CDC1_NTF_EP 0x83

the linked PR #3358 that introduced this aimed to fix cat /dev/ttyACM0 causing crashes. even reverting those CDC parameter changes to the working ones, it is still fixed. the fix for the cat /dev/ttyACM0 crash was the other part of that PR, that ensured the input line of CLI didnt grow too big and run out of memory.

the CDC parameter changes were only denoted as

USB CDC: switch from 2x unidirectional data endpoints to 1x bidirectional

with no known improvement, bugfix, or other benefit related to the change. and yet, it causes issues to many users, ive seen atleast 20 thus far myself.

@feldlef
Copy link

feldlef commented Mar 20, 2024

sorry! Ill go stand in a corner....I forgot to write that i had tested that file. Got same issue.

But just to be sure I tried it again and now the 0.97.1 worked flawlessly..
After that the mntm worked as well!

Sorry for time-waste and thanks for all that you do!!!

@thisplantguy
Copy link

thisplantguy commented Mar 23, 2024

Apologies for my standalone ticket, I discovered this browsing another user's open ticket. I am experiencing the same issue, and needed to downgrade my OFW to 0.97.1 in order for the Momentum FW to be installed. Quite curious.

17 [default] Binding on background is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
17 [default] Binding on contentItem is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
18 [default] Binding on contentItem is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
18 [default] Binding on background is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
19 [default] Binding on background is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
19 [default] Binding on contentItem is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
20 [default] Binding on background is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
20 [default] Binding on contentItem is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
20 [default] Binding on indicator is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
21 [default] Binding on background is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
21 [default] Binding on background is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
22 [default] Binding on animation is not deferred as requested by the DeferredPropertyNames class info because one or more of its sub-objects contain an id.
182 [APP] qFlipper version 1.3.3 commit bfce851d 2023-11-10T01:40:31
182 [APP] OS info: macOS 14.4 14.4 23.4.0 Qt 6.3.1
198 [REG] Detected new device: VID_0x483:PID_0x5740
235 [DBG] Trying serial port at /dev/cu.Bluetooth-Incoming-Port
235 [DBG] Trying serial port at /dev/tty.Bluetooth-Incoming-Port
235 [DBG] Trying serial port flip_Omaiygo at /dev/cu.usbmodemflip_Omaiygo1
235 [DBG] Using serial port flip_Omaiygo at /dev/cu.usbmodemflip_Omaiygo1
239 [RPC] Starting RPC session...
307 [UPD] Fetched update information from https://update.flipperzero.one/qFlipper/directory.json
355 [UPD] Fetched update information from https://update.flipperzero.one/firmware/directory.json
358 [RPC] RPC session started successfully.
358 [RPC] (1) System Protobuf Version START
414 [DBG] Detected protobuf version: 0.22
414 [RPC] (1) System Protobuf Version SUCCESS
414 [RPC] (2) Property Get START
548 [RPC] (2) Property Get SUCCESS
548 [RPC] (3) Storage Info @/ext START
600 [RPC] (3) Storage Info @/ext SUCCESS
600 [RPC] (4) Storage Stat @/ext/Manifest START
653 [RPC] (4) Storage Stat @/ext/Manifest SUCCESS
654 [RPC] (5) System Get DateTime START
706 [DBG] Flipper time skew is -1373 milliseconds
707 [RPC] (5) System Get DateTime SUCCESS
707 [RPC] (6) System Set DateTime START
759 [RPC] (6) System Set DateTime SUCCESS
760 [RPC] Stopping RPC session...
771 [RPC] RPC session stopped successfully.
771 [REG] Registering the device
771 [DEV] Version: 0.99.1-rc commit: 628c055 radio: 1.17.3
771 [BKD] Current device changed to "Omaiygo"
872 [RPC] Starting RPC session...
992 [RPC] RPC session started successfully.
992 [RPC] (1) Gui Start ScreenStream START
1048 [RPC] (1) Gui Start ScreenStream SUCCESS
47019 [DEV] Full Update @Omaiygo START
47019 [RPC] (2) Gui Start VirtualDisplay START
47020 [UTL] Region Provisioning @Omaiygo START
47076 [RPC] (2) Gui Start VirtualDisplay SUCCESS
47165 [DBG] Detected region: "US"
47165 [DBG] Allowed bands: QList("F_US_CA_304", "F_US_CA_AU_NZ_433", "F_US_CA_AU_NZ_915")
47165 [RPC] (3) Storage Write @/int/.region_data START
47244 [RPC] (3) Storage Write @/int/.region_data SUCCESS
47244 [UTL] Region Provisioning @Omaiygo SUCCESS
47247 [UTL] Refresh Storage Info @Omaiygo START
47247 [RPC] (4) Storage Info @/ext START
47489 [RPC] (4) Storage Info @/ext SUCCESS
47490 [RPC] (5) Storage Stat @/ext/Manifest START
47543 [RPC] (5) Storage Stat @/ext/Manifest SUCCESS
47628 [UTL] Refresh Storage Info @Omaiygo SUCCESS
47629 [ZIP] Uncompressing file with size of 6494623 bytes...
47677 [ZIP] Uncompression finished : No Error
47683 [UTL] Create Path @/ext/update/f7-update-mntm-001 START
47684 [RPC] (6) Storage Stat @/ext/update START
47736 [RPC] (6) Storage Stat @/ext/update SUCCESS
47737 [RPC] (7) Storage MkDir @/ext/update START
48161 [RPC] (7) Storage MkDir @/ext/update SUCCESS
48161 [RPC] (8) Storage MkDir @/ext/update/f7-update-mntm-001 START
48578 [RPC] (8) Storage MkDir @/ext/update/f7-update-mntm-001 SUCCESS
48578 [UTL] Create Path @/ext/update/f7-update-mntm-001 SUCCESS
48579 [UTL] Upload 7 entries @Omaiygo START
48579 [RPC] (9) Storage Write @/ext/update/f7-update-mntm-001/firmware.dfu START
55270 [RPC] (9) Storage Write @/ext/update/f7-update-mntm-001/firmware.dfu SUCCESS
55270 [RPC] (10) Storage Write @/ext/update/f7-update-mntm-001/firstboot.bin START
55342 [RPC] (10) Storage Write @/ext/update/f7-update-mntm-001/firstboot.bin SUCCESS
55343 [RPC] (11) Storage Write @/ext/update/f7-update-mntm-001/radio.bin START
56279 [RPC] (11) Storage Write @/ext/update/f7-update-mntm-001/radio.bin SUCCESS
56280 [RPC] (12) Storage Write @/ext/update/f7-update-mntm-001/resources.tar START
136226 [UTL] Upload 7 entries @Omaiygo ERROR: Operation timeout (generic)
136226 [DEV] Full Update @Omaiygo ERROR: Operation timeout (generic)
136226 [BKD] Current operation finished with error: "Operation timeout (generic)"
136227 [RPC] (12) Storage Write @/ext/update/f7-update-mntm-001/resources.tar ERROR: Operation timeout (generic)
499540 [APP] qFlipper exited
499557 [default] "There are still "1" items in the process of being created at engine destruction."

@Willy-JL
Copy link
Contributor Author

indeed, the issue seems resolved with #3705. thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Core+Services HAL, furi & core system services USB BadUSB + physical USB interface
Projects
None yet
5 participants