Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zephyr: mcuboot swap using move will brick device when power interrupted during move #1588

Closed
ImaraSpeek opened this issue Jan 30, 2023 · 18 comments · Fixed by #1597
Closed

Comments

@ImaraSpeek
Copy link

ImaraSpeek commented Jan 30, 2023

Dear all,

I'm working with the nrfconnect sdk on the nrf52840 and we are experiencing issues on the nrfconnect sdk 2.0.0 which uses sdk-mcuboot 1.9.99-ncs1.

We are using a swap using move algorithm with a swap_type test to update our firmware on the devices. Our bootloader is never updated (at least not for now) and we have a primary and secondary partition with only a single image, so no mult-image swaps. If we start the swap and accidentally a reboot happens (which happened at our factory and is what i'm currently reproducing at home), it will sometimes occur (once every 50 times or so) that it unrecoverably bricks the device. If I take a look at the device's flash I see that the image header is moved 0x1000 bytes. From then on the primary partition image header cannot be read and it will never recover from that (see logs). Shouldn't the image trailer at least sign for a swap failed so it can recover using the second partition?

uart:~$ uart:~$ *** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x1, image_ok=0x1
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Swap type: test
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Starting swap using move algorithm.
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
E: Unable to find bootable image
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
E: Unable to find bootable image
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
E: Unable to find bootable image
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

These are the logs from when it recovers from a power interruption:

uart:~$ [00:00:45.716,552] <inf> ftp: Finish transfer 0
uart:~$ Requesting upgrade

uart:~$ uart:~$ *** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.


Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46


Firmware signature verified.

Firmware version 641


Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x1, image_ok=0x1
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Swap type: test
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Starting swap using move algorithm.
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.


Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46


Firmware signature verified.

Firmware version 641


Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
I: Starting swap using move algorithm.
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.


Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46


Firmware signature verified.

Firmware version 641


Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
I: Starting swap using move algorithm.
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Bootloader chainload address offset: 0x23000
I: Jumping to the first image slot

I suppose I'd need to enable the MCUBOOT_BOOTSTRAP flag in order to recover from this, but I'd expect it to detect it to not have a header and thus look for the header in the second sector anyways. Otherwise this issue is unrecoverable.

Best and thanks in advance,
Imara

@ImaraSpeek ImaraSpeek changed the title zephyr: mcuboot swap using move will brack when power interrupted during move zephyr: mcuboot swap using move will brick device when power interrupted during move Jan 30, 2023
@utzig
Copy link
Member

utzig commented Jan 31, 2023

  1. When reproducing this, are you doing those resets manually?
  2. Are both partitions on the internal flash, or are you also using external flash?
  3. Are the images encrypted?
  4. Approximately how big are the images and how big are the partitions?

We do reset testing on the simulator, with interruption on every possible flash operation, but it's also possible something is being missed. It would be helpful if you could gist an hexdump of the last sector of the primary slot after the error happens, because that's where the progress of the swap is written, so we can inspect what MCUboot was doing when it failed.

@ImaraSpeek
Copy link
Author

ImaraSpeek commented Jan 31, 2023

Hi Utzig,

Thank you for your reply.

  1. A reboot is issued whenever the entire device is updated. This causes the power to the nrf52840 to also briefly be cut, causing a reboot of the nrf52840.
  2. Both partitions are on internal flash
  3. The images are not encrypted
  4. The bin image is 0x5FB40, the mcuboot_primary is 0x6D000, the mcuboot_primary_app starts 0x200 later and is 0x6CE00. The mcuboot_secondary starts right after and is 0x6D000. I've already tried increasing the primary partition to 0x6E000 so that it'd be a sector larger than the secondary, but that didn't change anything yet. The hex I've added comes from the prior situation where both partitions are equal size (which should also be allowed according to the documentation).

Please see the attached hexes of the last sectors (from when the partitions were still equally sized)

failed_swap_hexes.zip

@ImaraSpeek
Copy link
Author

ImaraSpeek commented Jan 31, 2023

Actually, come to think of it. The image trailer is exactly the same except for the copy_done and image_ok flags (I think these are the values in the memory, but I'm not sure). This means that the faulty image trailer indicates that a swap has been successfully done, but the image hasn't been confirmed. Is is possible that this image trailer is left from a previous swap and isn't correctly cleaned up after a swap? In my tests, I only ever reproduce this issue after > 25 updates, so that'd mean that many successful swaps happened prior to the faulty case.

Another thought though: If the error occurs during the move, not the swap it would result in a state that isn't described in this section of the documentation, since copy_done isn't set yet. Does that mean that when the move (instead of the swap) is interrupted, it will never set the outcome to BOOT_SWAP_TYPE_REVERT and thus not recover?

The bool_status_is_reset will also think the status is reset when the swap is still busy moving the sectors us. Therefor it is required to read all headers, even though the header of the primary image has moved.

bool
boot_status_is_reset(const struct boot_status *bs)
{
    return (bs->op == BOOT_STATUS_OP_MOVE &&
            bs->idx == BOOT_STATUS_IDX_0 &&
            bs->state == BOOT_STATUS_STATE_0);
}
#ifdef MCUBOOT_SWAP_USING_MOVE
        /*
         * Must re-read image headers because the boot status might
         * have been updated in the previous function call.
         */
        rc = boot_read_image_headers(state, !boot_status_is_reset(bs), bs);
#ifdef MCUBOOT_BOOTSTRAP
        /* When bootstrapping it's OK to not have image magic in the primary slot */
        if (rc != 0 && (BOOT_CURR_IMG(state) != BOOT_PRIMARY_SLOT ||
                boot_check_header_erased(state, BOOT_PRIMARY_SLOT) != 0)) {
#else
        if (rc != 0) {
#endif

            /* Continue with next image if there is one. */
            BOOT_LOG_WRN("Failed reading image headers; Image=%u",
                    BOOT_CURR_IMG(state));
            BOOT_SWAP_TYPE(state) = BOOT_SWAP_TYPE_NONE;
            return;
        }
#endif

This then causes the code to return and never reach boot_complete_partial_swap as far as I can tell.

@ImaraSpeek
Copy link
Author

I may have found an issue in the code although I'm not certain yet.

in loader.c it says the following:

    /* NOTE: The first sector copied (that is the last sector on slot) contains
     *       the trailer. Since in the last step the primary slot is erased, the
     *       first two status writes go to the scratch which will be copied to
     *       the primary slot!
     */

#if MCUBOOT_SWAP_USING_SCRATCH
    if (bs->use_scratch) {
        /* Write to scratch. */
        area_id = FLASH_AREA_IMAGE_SCRATCH;
    } else {
#endif
        /* Write to the primary slot. */
        area_id = FLASH_AREA_IMAGE_PRIMARY(BOOT_CURR_IMG(state));
#if MCUBOOT_SWAP_USING_SCRATCH
    }
#endif

But I suppose the status for swap using move should be written to the secondary image instead of the primary, else the swap status isn't correctly tracked, no?

@utzig
Copy link
Member

utzig commented Feb 1, 2023

But I suppose the status for swap using move should be written to the secondary image instead of the primary, else the swap status isn't correctly tracked, no?

The primary slot is where status is tracked.

Looking at your hex dumps, it seems the time the issue happened there was no swapping performed, only the move up of the sectors. This can be seen by the lack of information on lines starting with 0xFBC1 upto 0xFEC1. My only suspicion so far is that it could be due to a flash write error resulting in an assert and an unexpected reset, but I can't imagine why it would not recover in this particular case.

Your setup seems straight-forward enough, no encrypted image, no external flash, so should be "easy" for me to try reproducing here. Are you sending new images during the test, or you are always using the same ones, only flashed at the beginning of the test, and just starting a new upgrade over and over? Do you do resets in the middle of the upgrade or only after it is done? Did you try on multiple hardware boards or are you always using the same?

@ImaraSpeek
Copy link
Author

ImaraSpeek commented Feb 2, 2023

We have a central linux device that updates all others. An update of all devices is done by updating some others, updating the nordic, a couple of seconds of small other functions and then a reboot of the central device, causing 2 power dips/reboots for the nrf52840. It then waits around 40s, but this varies a little and everything starts again. Sometimes it takes 25 goes to reproduce it, sometimes 75, but every time I can get it to fail within 100 tries.

As far as I can read in the code, whenever boot_status_is_reset(bs) returns false, a partial swap is detected and therefor, the headers must be re-read:

#ifdef MCUBOOT_SWAP_USING_MOVE
        /*
         * Must re-read image headers because the boot status might
         * have been updated in the previous function call.
         */
        rc = boot_read_image_headers(state, !boot_status_is_reset(bs), bs);

However, if the first sector (the headers) haven't been swapped yet, the header of the primary partition will never be in the correct location, therefor resulting in the error of not being able to read the header. Because of this error, it will never complete the partial swap. This is unfortunate, because I'd expect it to complete, but still it should actually set the swap type to NONE and then recover the second partition on a reboot, but this also doesn't happen.

As you can see in the logs, the swap type of the primary partition is set to NONE, but it seems this is never written to flash and doesn't survive reboots. This also seems like a problem.

@utzig
Copy link
Member

utzig commented Feb 2, 2023

I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0

The relevant error seems to be this last line.

However, if the first sector (the headers) haven't been swapped yet, the header of the primary partition will never be in the correct location, therefor resulting in the error of not being able to read the header.

It does take into consideration that if a previous move was done that header will be located into the second sector, but maybe your second sector doesn't have the data for some reason? Could you also provide a dump of the first two sectors of the primary slot after the error happens?

@ImaraSpeek
Copy link
Author

Sure, these are the first two sectors:

reproduced_issue_2.0.0sdk_5.1.0fw_firsttwosector.zip

@utzig
Copy link
Member

utzig commented Feb 3, 2023

Sure, these are the first two sectors:

reproduced_issue_2.0.0sdk_5.1.0fw_firsttwosector.zip

Hmmm, as expected it does have the image magic at 0x4000, which I assume is the second sector in the slot. So there must be another error. I did re-check the simulator for the swap move upgrade testing yesterday and it indeed does test a reset on every step, which includes the place where your issue is happening, so I'll resort to doing some "manual" testing.

@utzig
Copy link
Member

utzig commented Feb 4, 2023

@ImaraSpeek I was able to reproduced the issue and already identified why it happens. It does not seem to be the easiest thing to fix, but I'll try to provide a patch for upstream during this weekend, and paste here a diff against sdk-mcuboot 1.9.99-ncs1 for your testing.

@ImaraSpeek
Copy link
Author

ImaraSpeek commented Feb 4, 2023 via email

@ImaraSpeek
Copy link
Author

ImaraSpeek commented Feb 6, 2023 via email

@utzig
Copy link
Member

utzig commented Feb 6, 2023

Could you still post the diff for me so I could try it?

Once it's done, I will!

utzig added a commit to utzig/mcuboot that referenced this issue Feb 7, 2023
Fix a swap corruption which occurs on the swap move algorithm when a
reset happens exactly at the point after the last move up, and its
status update. On restart the image headers should be read at the 2nd
sector of the primary slot, but due to lacking initialization it is
read on the first sector, and then fails. This error was masked on the
simulator because of the use of a global variable, which retained its
value on a "reset simulation".

Fixes mcu-tools#1588

Signed-off-by: Fabio Utzig <utzig@apache.org>
@utzig
Copy link
Member

utzig commented Feb 7, 2023

@ImaraSpeek I opened #1597 to fix this issue. You can either get the diff from there, or here https://gist.github.com/utzig/1fd28580f980d3e6bbae097b1d85fdb5, which I already checked against sdk-mcuboot-1.9.99-ncs1. Please report on your findings!

utzig added a commit to utzig/mcuboot that referenced this issue Feb 8, 2023
Fix a swap corruption which occurs on the swap move algorithm when a
reset happens exactly at the point after the last move up, and its
status update. On restart the image headers should be read at the 2nd
sector of the primary slot, but due to lacking initialization it is
read on the first sector, and then fails. This error was masked on the
simulator because of the use of a global variable, which retained its
value on a "reset simulation".

Fixes mcu-tools#1588

Signed-off-by: Fabio Utzig <utzig@apache.org>
@utzig
Copy link
Member

utzig commented Feb 8, 2023

Better yet, try this one: https://gist.github.com/utzig/97276ce3447c09c884f7d779b05ea132. This patch should also work with encrypted images, and should be fine without encryption as well.

@ImaraSpeek
Copy link
Author

Thank you so much! I have successfully run the test already 180 times, where normally it would fail < 80 times. I will complete the test for at least 500 times to be sure, but it looks good! Will this be released based on 1.9.99-ncs1 or 1.9.99-ncs2? Our production has already moved on to be based on 1.9.99-ncs2, but I don't think we'll be ready to make a bigger jump between versions.

@utzig
Copy link
Member

utzig commented Feb 8, 2023

Thank you so much! I have successfully run the test already 180 times, where normally it would fail < 80 times. I will complete the test for at least 500 times to be sure, but it looks good! Will this be released based on 1.9.99-ncs1 or 1.9.99-ncs2? Our production has already moved on to be based on 1.9.99-ncs2, but I don't think we'll be ready to make a bigger jump between versions.

The usual process is to merge here, sync Zephyr's mcuboot fork, I guess wait for a Zephyr release, and then NCS is updated, or something like that. Maybe they can do it without a Zephyr release, I am not sure. Anyways it should take some time, but this code has not changed much since that NCS version you are using, so would be better to just manage the patch on your tree, and once it goes out it will end up being the same code.

@ImaraSpeek
Copy link
Author

ImaraSpeek commented Feb 9, 2023

Will do, thanks again for the help. I ran 500 successful updates, so I'm seeing this as fixed :)

utzig added a commit to utzig/mcuboot that referenced this issue Feb 9, 2023
Fix a swap corruption which occurs on the swap move algorithm when a
reset happens exactly at the point after the last move up, and its
status update. On restart the image headers should be read at the 2nd
sector of the primary slot, but due to lacking initialization it is
read on the first sector, and then fails. This error was masked on the
simulator because of the use of a global variable, which retained its
value on a "reset simulation".

Fixes mcu-tools#1588

Signed-off-by: Fabio Utzig <utzig@apache.org>
utzig added a commit that referenced this issue Feb 9, 2023
Fix a swap corruption which occurs on the swap move algorithm when a
reset happens exactly at the point after the last move up, and its
status update. On restart the image headers should be read at the 2nd
sector of the primary slot, but due to lacking initialization it is
read on the first sector, and then fails. This error was masked on the
simulator because of the use of a global variable, which retained its
value on a "reset simulation".

Fixes #1588

Signed-off-by: Fabio Utzig <utzig@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants