zephyr: mcuboot swap using move will brick device when power interrupted during move #1588
Comments
We do reset testing on the simulator, with interruption on every possible flash operation, but it's also possible something is being missed. It would be helpful if you could gist an |
Hi Utzig, Thank you for your reply.
Please see the attached hexes of the last sectors (from when the partitions were still equally sized) |
Actually, come to think of it. The image trailer is exactly the same except for the Another thought though: If the error occurs during the move, not the swap it would result in a state that isn't described in this section of the documentation, since The
This then causes the code to return and never reach |
I may have found an issue in the code, although I'm not certain yet. In
But I suppose the status for swap using move should be written to the secondary image instead of the primary, else the swap status isn't correctly tracked, no? |
The primary slot is where status is tracked. Looking at your hex dumps, it seems that at the time the issue happened no swapping had been performed, only the move up of the sectors. This can be seen by the lack of information on lines starting with Your setup seems straightforward enough, no encrypted image, no external flash, so it should be "easy" for me to try reproducing here. Are you sending new images during the test, or are you always using the same ones, flashed only at the beginning of the test, and just starting a new upgrade over and over? Do you do resets in the middle of the upgrade or only after it is done? Did you try on multiple hardware boards or are you always using the same one? |
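A rough illustration of how progress can be read back from a status area after a reset. This is a sketch only: the one-byte-per-step layout, the `ERASED_BYTE` value, and the function name `count_written_entries` are assumptions made for the example and are not taken from MCUboot's actual trailer format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ERASED_BYTE 0xFFu /* assumption: erased flash reads back as 0xFF */

/*
 * Count the leading progress entries that have been programmed in a
 * status region. Hypothetical layout: one byte per completed step,
 * written strictly in order, so the first erased byte marks the end.
 */
static size_t count_written_entries(const uint8_t *region, size_t len)
{
    assert(region != NULL);

    size_t n = 0;
    while (n < len && region[n] != ERASED_BYTE) {
        n++;
    }
    return n;
}
```

With a model like this, a dump in which only the first few entries are programmed and everything after them is still erased would indicate that the upgrade stopped early, which is the kind of reading of the hex dumps described above.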
We have a central Linux device that updates all the others. An update of all devices is done by updating some of the others, updating the Nordic, running a couple of seconds of small other functions, and then rebooting the central device, which causes 2 power dips/reboots for the nrf52840. It then waits around 40s, though this varies a little, and everything starts again. Sometimes it takes 25 attempts to reproduce it, sometimes 75, but every time I can get it to fail within 100 tries. As far as I can read in the code, whenever
However, if the first sector (the headers) hasn't been swapped yet, the header of the primary partition will never be in the correct location, therefore resulting in the error of not being able to read the header. Because of this error, it will never complete the partial swap. This is unfortunate, because I'd expect it to complete, but still it should actually set the swap type to As you can see in the logs, the swap type of the primary partition is set to |
The relevant error seems to be this last line.
It does take into consideration that if a previous move was done, the header will be located in the second sector, but maybe your second sector doesn't have the data for some reason? Could you also provide a dump of the first two sectors of the primary slot after the error happens? |
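The idea that the header location depends on whether the move already happened can be sketched as below. This is a simplified model, not MCUboot code: `struct move_progress`, `primary_header_offset`, and the 0x1000 sector size (taken from the offset reported in this issue) are all hypothetical, and the model assumes the first sector is the last one to be moved up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 0x1000u /* assumption: matches the 0x1000 shift seen in the report */

/* Hypothetical progress record, as recovered from persisted status. */
struct move_progress {
    uint32_t moves_done;  /* "move up" steps recorded so far */
    uint32_t moves_total; /* sectors that have to be moved up */
};

/*
 * Offset of the image header inside the primary slot. In this model the
 * header sector is moved last, so the header sits in the second sector
 * only once every move has been recorded.
 */
static uint32_t primary_header_offset(const struct move_progress *p)
{
    assert(p != NULL);
    assert(p->moves_done <= p->moves_total);

    bool all_moved = (p->moves_total > 0u) && (p->moves_done == p->moves_total);
    return all_moved ? SECTOR_SIZE : 0u;
}
```

The point of the sketch is that the answer is derived only from persisted progress, so it comes out the same on the first pass and after any reset.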
Sure, these are the first two sectors: |
Hmmm, as expected it does have the image magic at 0x4000, which I assume is the second sector in the slot. So there must be another error. I did re-check the simulator for the swap move upgrade testing yesterday and it indeed does test a reset on every step, which includes the place where your issue is happening, so I'll resort to doing some "manual" testing. |
@ImaraSpeek I was able to reproduce the issue and already identified why it happens. It does not seem to be the easiest thing to fix, but I'll try to provide a patch for upstream during this weekend, and paste here a diff against sdk-mcuboot 1.9.99-ncs1 for your testing. |
Awesome, thank you! Will test it and get back to you
|
Hey Fabio,
Could you still post the diff for me so I could try it?
Cheers
|
Once it's done, I will! |
Fix a swap corruption which occurs on the swap move algorithm when a reset happens exactly at the point after the last move up, and its status update. On restart the image headers should be read at the 2nd sector of the primary slot, but due to lacking initialization it is read on the first sector, and then fails. This error was masked on the simulator because of the use of a global variable, which retained its value on a "reset simulation". Fixes mcu-tools#1588 Signed-off-by: Fabio Utzig <utzig@apache.org>
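To make the "masked on the simulator" part concrete, here is a small sketch of how a global that survives a simulated reset can hide a missing initialization. None of the names below (`g_header_offset`, `boot_buggy`, `boot_fixed`, `real_reset`) come from MCUboot; they are hypothetical stand-ins, and the 0x1000 sector size is assumed from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 0x1000u /* assumption: sector size from the report */

/* Hypothetical global remembering where the image header lives. */
static uint32_t g_header_offset;

/*
 * Buggy start-up path: the offset is set only when the move is performed
 * during this boot. When resuming after a reset it is never re-derived
 * from the progress recorded in flash.
 */
static uint32_t boot_buggy(bool move_done_in_flash, bool perform_move_now)
{
    /* A move cannot be both already recorded and still pending. */
    assert(!(move_done_in_flash && perform_move_now));
    if (perform_move_now) {
        g_header_offset = SECTOR_SIZE;
    }
    return g_header_offset;
}

/* Fixed path: always initialize the offset from persisted state. */
static uint32_t boot_fixed(bool move_done_in_flash, bool perform_move_now)
{
    assert(!(move_done_in_flash && perform_move_now));
    g_header_offset = (move_done_in_flash || perform_move_now) ? SECTOR_SIZE : 0u;
    return g_header_offset;
}

/* A real power cycle clears RAM; a reset simulated inside one process does not. */
static void real_reset(void)
{
    g_header_offset = 0u;
}
```

Calling `boot_buggy` twice in the same process (first performing the move, then resuming) returns the second-sector offset both times, because the global kept its value. Only after `real_reset()` does the buggy path fall back to offset 0, which mirrors why the failure showed up on hardware but not in the simulated resets.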
@ImaraSpeek I opened #1597 to fix this issue. You can either get the diff from there, or here https://gist.github.com/utzig/1fd28580f980d3e6bbae097b1d85fdb5, which I already checked against |
Better yet, try this one: https://gist.github.com/utzig/97276ce3447c09c884f7d779b05ea132. This patch should also work with encrypted images, and should be fine without encryption as well. |
Thank you so much! I have already run the test successfully 180 times, where normally it would fail within 80 runs. I will complete the test for at least 500 runs to be sure, but it looks good! Will this be released based on 1.9.99-ncs1 or 1.9.99-ncs2? Our production has already moved on to be based on 1.9.99-ncs2, but I don't think we'll be ready to make a bigger jump between versions. |
The usual process is to merge here, sync Zephyr's mcuboot fork, I guess wait for a Zephyr release, and then NCS is updated, or something like that. Maybe they can do it without a Zephyr release, I am not sure. Anyways it should take some time, but this code has not changed much since that NCS version you are using, so would be better to just manage the patch on your tree, and once it goes out it will end up being the same code. |
Will do, thanks again for the help. I ran 500 successful updates, so I'm seeing this as fixed :) |
Dear all,
I'm working with the nrfconnect sdk on the nrf52840 and we are experiencing issues on the nrfconnect sdk 2.0.0 which uses sdk-mcuboot 1.9.99-ncs1.
We are using the swap using move algorithm with a swap_type test to update the firmware on our devices. Our bootloader is never updated (at least not for now) and we have a primary and a secondary partition with only a single image, so no multi-image swaps. If we start the swap and a reboot accidentally happens (which happened at our factory and is what I'm currently reproducing at home), the device is sometimes (once every 50 times or so) unrecoverably bricked. If I take a look at the device's flash I see that the image header has moved by 0x1000 bytes. From then on the primary partition image header cannot be read and it will never recover from that (see logs). Shouldn't the image trailer at least signal a failed swap so it can recover using the second partition?
These are the logs from when it recovers from a power interruption:
I suppose I'd need to enable the MCUBOOT_BOOTSTRAP flag in order to recover from this, but I'd expect it to detect that there is no header and thus look for the header in the second sector anyway. Otherwise this issue is unrecoverable.
Best and thanks in advance,
Imara
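The fallback expected in the report (look in the second sector when the first one has no header) could look roughly like the sketch below. It only illustrates that expectation and is not how MCUboot locates the header. `find_header_offset` and `read_le32` are hypothetical helpers, the 0x1000 sector size is taken from the offset mentioned above, and the magic is assumed to be stored little-endian at the start of the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMAGE_MAGIC 0x96f3b83du /* MCUboot image header magic */
#define SECTOR_SIZE 0x1000u     /* assumption: sector size from the report */

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Probe the first two sectors of a slot dump for the image magic.
 * Returns the byte offset of the first match, or -1 if neither sector
 * starts with a valid magic.
 */
static int32_t find_header_offset(const uint8_t *slot, size_t slot_len)
{
    assert(slot != NULL);

    for (uint32_t sector = 0; sector < 2u; sector++) {
        size_t off = (size_t)sector * SECTOR_SIZE;
        if (off + 4u <= slot_len && read_le32(slot + off) == IMAGE_MAGIC) {
            return (int32_t)off;
        }
    }
    return -1;
}
```

Run against a dump of the primary slot, a result of 0x1000 would correspond to the state described above, where the header has been moved up by one sector.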