Serviceability of Edge Devices is often limited or non-existent, which makes it challenging to troubleshoot device problems following a failed software or operating system upgrade.
To mitigate these problems, MicroShift uses greenboot,
the Generic Health Check Framework for systemd
on rpm-ostree
based systems.
If a failure is detected, the system is expected to boot into the last known
working configuration using rpm-ostree
rollback facilities.
This functionality benefits the users by reducing the risk of being locked out of an Edge Device when upgrades take place. Users should not experience significant interruption of service in case of a failed upgrade.
MicroShift includes the 40_microshift_running_check.sh
health check script to
validate that all the required MicroShift services are up and running. The health
check script is packaged in the separate mandatory microshift-greenboot
RPM,
which has an explicit dependency on the greenboot
RPM.
The health check script is installed into the /etc/greenboot/check/required.d
directory. It is not executed during system boot if the greenboot
package is not present.
The 40_microshift_pre_rollback.sh
pre-rollback script is installed into the
/etc/greenboot/red.d
directory, to be executed right before the system rollback
takes place. The script performs MicroShift pod and OVN cleanup to avoid potential
conflicts with the software rolled back to a previous version.
The existing MicroShift data and container images are not affected by this operation.
In addition, if the greenboot-default-health-check
RPM subpackage is installed,
it provides additional health check scripts
verifying that DNS and ostree
services can be accessed.
Greenboot redirects all the script output to the system log, accessible via the following commands:
journalctl -u greenboot-healthcheck.service
for the health check procedure
journalctl -u redboot-task-runner.service
for the pre-rollback procedure
If the health check script exits with a non-zero status, the boot is declared as failed. The script performs the following validations in order.
Validation | Pass | Fail |
---|---|---|
Check the script runs with 'root' permissions | Next | exit 0 |
Check microshift.service is enabled | Next | exit 0 |
Wait for microshift.service to be active (not failed) | Next | exit 1 |
Wait for Kubernetes API health endpoints to be OK | Next | exit 1 |
Wait for any Pod to start | Next | exit 1 |
For each core namespace, wait for images to be pulled | Next | exit 1 |
For each core namespace, wait for Pods to be ready | Next | exit 1 |
For each core namespace, check Pods not restarting | exit 0 | exit 1 |
The pre-rollback script runs the sudo microshift-cleanup-data --ovn
command
to prepare the system for a potential software downgrade.
If the system is not booted using the
ostree
file system, the health check and pre-rollback procedures still run, but no rollback is possible if an upgrade fails.
The wait period in each health check validation starts from a base time of 5 minutes
and is incremented by the base wait period after each boot in the verification
loop. It is possible to override the base wait period with the
MICROSHIFT_WAIT_TIMEOUT_SEC
environment variable in the /etc/greenboot/greenboot.conf
configuration file alongside other Greenboot Configuration
settings.
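The arithmetic described above can be sketched as follows. This is an illustration only, not the code of the shipped health check script: the function name wait_timeout_for_attempt and its 1-based attempt argument are hypothetical, and how the real script determines the current boot attempt is not covered here. The sketch only assumes that the base value comes from MICROSHIFT_WAIT_TIMEOUT_SEC and defaults to 300 seconds.

```shell
#!/bin/bash
# Sketch only: hypothetical helper, not part of MicroShift or greenboot.

# wait_timeout_for_attempt ATTEMPT
# ATTEMPT is the 1-based boot number within the verification loop (assumption).
# Prints the wait period in seconds: base on the first boot, then one more
# base period for every following boot.
wait_timeout_for_attempt() {
    local attempt="$1"
    # Default base of 5 minutes, overridable as described in the text.
    local base="${MICROSHIFT_WAIT_TIMEOUT_SEC:-300}"
    echo $(( base * attempt ))
}
```

With the default base, the first boot would wait up to 300 seconds per validation and the third boot up to 900 seconds. A line such as MICROSHIFT_WAIT_TIMEOUT_SEC=600 in /etc/greenboot/greenboot.conf would double those values.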
Some 3rd party user workloads may become operational before the upgrade is declared valid and potentially create or update data on the device. If a rollback is performed subsequently, there is a risk of data loss because the file system is reverted to its state before the upgrade. One of the ways to mitigate this problem is to have 3rd party workloads wait until a boot is declared successful. The boot status can be checked with the following command, where boot_success=1 indicates a successful boot.
$ sudo grub2-editenv - list | grep ^boot_success
boot_success=1
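One way a workload could wait for a successful boot is to poll the boot_success variable before starting. The following is a minimal sketch under stated assumptions: the helper names read_grubenv, boot_success_value and wait_for_boot_success are hypothetical, and the script is assumed to run as root so that grub2-editenv can be called without sudo.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers, not shipped with MicroShift or greenboot.

# Print the GRUB environment; assumes the caller is root.
read_grubenv() {
    grub2-editenv - list
}

# Print the value of boot_success (empty if the variable is absent).
boot_success_value() {
    read_grubenv | awk -F= '$1 == "boot_success" { print $2 }'
}

# wait_for_boot_success TIMEOUT_SEC [INTERVAL_SEC]
# Returns 0 once boot_success=1 is seen, 1 if TIMEOUT_SEC elapses first.
wait_for_boot_success() {
    local timeout="$1" interval="${2:-5}" waited=0
    while true; do
        if [ "$(boot_success_value)" = "1" ]; then
            return 0
        fi
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep "${interval}"
        waited=$(( waited + interval ))
    done
}
```

A workload start-up wrapper could then call wait_for_boot_success 1800 and refuse to start when it returns non-zero. The 1800 second limit is an arbitrary example value.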
Note that the MicroShift health check script only performs validation of the core MicroShift services. Users should install their own workload validation scripts using
greenboot
facilities to ensure successful operation after system upgrades.
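A user workload validation script could follow the same pattern as the MicroShift one: wait for the workload to become ready and exit with a non-zero status otherwise. The sketch below is hypothetical throughout. The my-app namespace, the function names and the 300 second timeout are placeholders, and it assumes that the oc client is installed and that KUBECONFIG already points at a kubeconfig with access to the cluster.

```shell
#!/bin/bash
# Sketch only: a hypothetical workload check that could be installed into
# /etc/greenboot/check/required.d under a name of the user's choosing.

# Succeeds when every pod in the namespace becomes Ready within the timeout.
# Assumes 'oc' is installed and KUBECONFIG is set.
pods_ready() {
    local ns="$1" timeout_sec="$2"
    oc wait --for=condition=Ready pod --all -n "${ns}" --timeout="${timeout_sec}s"
}

# run_workload_check NAMESPACE TIMEOUT_SEC
# Returns 0 when the workload is ready, 1 otherwise.
run_workload_check() {
    local ns="$1" timeout_sec="$2"
    echo "Waiting up to ${timeout_sec}s for pods in '${ns}' to be ready"
    if pods_ready "${ns}" "${timeout_sec}"; then
        echo "Workload in '${ns}' is ready"
        return 0
    fi
    echo "Workload in '${ns}' is not ready" >&2
    return 1
}

# The installed script would end with a line such as:
#   run_workload_check my-app 300 || exit 1
```

Because the messages go to standard output and standard error, they would appear in the system log together with the rest of the health check output.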
The default configuration of the systemd
journal service stores the data in
the volatile /run/log/journal
directory, which does not persist across system
reboots. To monitor greenboot
activities across system boots, it is recommended to
enable journal data persistence by creating the /var/log/journal
directory
and setting limits on the maximum journal data size.
Run the following commands to configure journal data persistence and size limits.
sudo mkdir -p /etc/systemd/journald.conf.d
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/microshift.conf &>/dev/null
[Journal]
Storage=persistent
SystemMaxUse=1G
RuntimeMaxUse=1G
EOF