rpm-ostree upgrade fails in edge-commit RHEL-9.6 #4593

Open
mcattamoredhat opened this issue Feb 3, 2025 · 31 comments

Comments

@mcattamoredhat
Contributor

Describe the bug
Our CI has detected RHEL-9.6 edge-commit fails after the changes introduced by PR #4569
rpm-ostree upgrade fails to upgrade the system.
After the ostree image/commit upgrade is built, the edge system detects that an upgrade is available, but after rpm-ostree upgrade and a reboot, the system rolls back to the previous deployment and the update is not applied.

Environment

  • OS version (/etc/os-release and /etc/redhat-release):
    source /etc/os-release
    NAME='Red Hat Enterprise Linux'
    VERSION='9.6 (Plow)'
    ID=rhel
    ID_LIKE=fedora
    VERSION_ID=9.6
    PLATFORM_ID=platform:el9
    PRETTY_NAME='Red Hat Enterprise Linux 9.6 Beta (Plow)'
    ANSI_COLOR='0;31'
    LOGO=fedora-logo-icon
    CPE_NAME=cpe:/o:redhat:enterprise_linux:9::baseos
    HOME_URL=https://www.redhat.com/
    DOCUMENTATION_URL=https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9
    BUG_REPORT_URL=https://issues.redhat.com/
    REDHAT_BUGZILLA_PRODUCT='Red Hat Enterprise Linux 9'
    REDHAT_BUGZILLA_PRODUCT_VERSION=9.6
    REDHAT_SUPPORT_PRODUCT='Red Hat Enterprise Linux'
    REDHAT_SUPPORT_PRODUCT_VERSION='9.6 Beta'
  • osbuild-composer version (rpm -qi osbuild-composer)
    $ rpm -qa | grep osbuild
    osbuild-composer-debugsource-130-1.20250129git008b43e.el9.x86_64
    osbuild-composer-debuginfo-130-1.20250129git008b43e.el9.x86_64
    python3-osbuild-137-1.el9.noarch
    osbuild-selinux-137-1.el9.noarch
    osbuild-137-1.el9.noarch
    osbuild-depsolve-dnf-137-1.el9.noarch
    osbuild-composer-core-130-1.20250129git008b43e.el9.x86_64
    osbuild-luks2-137-1.el9.noarch
    osbuild-lvm2-137-1.el9.noarch
    osbuild-ostree-137-1.el9.noarch
    osbuild-composer-worker-130-1.20250129git008b43e.el9.x86_64
    osbuild-composer-130-1.20250129git008b43e.el9.x86_64
    osbuild-composer-tests-130-1.20250129git008b43e.el9.x86_64
    osbuild-composer-core-debuginfo-130-1.20250129git008b43e.el9.x86_64
    osbuild-composer-tests-debuginfo-130-1.20250129git008b43e.el9.x86_64
    osbuild-composer-worker-debuginfo-130-1.20250129git008b43e.el9.x86_64

To Reproduce
Steps to reproduce the behavior:

  • Build edge-commit artifact in RHEL-9.6
  • Build ostree image/commit upgrade artifact
  • Apply the upgrade using rpm-ostree upgrade and reboot the system.

Expected behavior
The system is able to apply the upgrade commit.

Additional context
In this example the upgrade hash is:

$ curl http://192.168.100.1/repo/refs/heads/rhel/9/x86_64/edge
75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
$ sudo virsh console osbuild-composer-ostree-test-4b6e4700-ce4b-48d7-8c25-811f4876b923
Connected to domain 'osbuild-composer-ostree-test-4b6e4700-ce4b-48d7-8c25-811f4876b923'
Escape character is ^] (Ctrl + ])

vm login: admin
Password: 
Last login: Wed Jan 29 12:11:27 on ttyS0
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:34:37Z)
                   Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b

The edge system detects that an upgrade is available, but rpm-ostree upgrade fails and the system rolls back to 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b:

[admin@vm ~]$ sudo rpm-ostree upgrade
1 metadata, 0 content objects fetched; 401 B transferred in 0 seconds; 0 bytes content written
Staging deployment... done
Freed: 7.8 kB (pkgcache branches: 1)
Added:
  wget-1.21.1-8.el9_4.x86_64
Run "systemctl reboot" to start a reboot
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
                     Diff: 1 added

● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:34:37Z)
                   Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b

After the reboot, the system fails to upgrade and rolls back to ostree:1:

Red Hat Enterprise Linux 9.6 Beta (Plow)
Kernel 5.14.0-547.el9.x86_64 on an x86_64

vm login: admin
Password: 
Last login: Wed Jan 29 12:20:35 on ttyS0
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:34:37Z)
                   Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
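A quick way to confirm the rollback without reading the status output by eye is to compare the booted commit against the hash the repo ref points at. This is a minimal sketch that assumes the plain-text `rpm-ostree status` layout shown above, where the booted deployment is the entry marked with `●`; the function names are made up for illustration.

```shell
#!/bin/sh
# Print the commit hash of the booted deployment, reading
# `rpm-ostree status` text on stdin. The booted entry starts with "●".
booted_commit() {
    awk '
        /^●/ { booted = 1; next }                      # header of the booted deployment
        booted && $1 == "Commit:" { print $2; exit }   # first Commit: line after it
    '
}

# Succeed only if the booted commit equals the expected hash ($1).
upgrade_applied() {
    [ "$(booted_commit)" = "$1" ]
}

# Example (not run here):
#   want=$(curl -s http://192.168.100.1/repo/refs/heads/rhel/9/x86_64/edge)
#   rpm-ostree status | upgrade_applied "$want" || echo "still on the old commit"
```

With the status output above, `booted_commit` prints the 583f1f50... hash, so the comparison against the 75d95ee9... upgrade hash fails, which matches the rollback being reported.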

It seems the system is failing to remount the file system:

[admin@vm ~]$ sudo journalctl --no-pager --boot=-1 -xe | grep FAIL
Jan 29 12:25:35 localhost systemd[1]: systemd-remount-fs.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost greenboot[736]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
Jan 29 12:25:38 localhost greenboot[736]: Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Jan 29 12:25:38 localhost systemd[1]: greenboot-healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost greenboot[801]: Boot Status is RED - Health Check FAILURE!
Jan 29 12:25:38 localhost greenboot-status[822]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
Jan 29 12:25:38 localhost greenboot-status[822]: Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Jan 29 12:25:38 localhost greenboot-status[822]: Boot Status is RED - Health Check FAILURE!
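Since greenboot and greenboot-status both log the same script failures, it can help to reduce the journal to one line per failing check. A minimal sketch, assuming the message format seen in the log above (`Script '<name>' FAILURE (exit code '<n>')`); the function name is hypothetical.

```shell
#!/bin/sh
# Read journal text on stdin and print "<script> <exit code>" once for
# each greenboot health check that reported FAILURE.
greenboot_failures() {
    sed -n "s/.*greenboot[^:]*: Script '\([^']*\)' FAILURE (exit code '\([0-9]*\)').*/\1 \2/p" |
        sort -u
}

# Example (not run here):
#   sudo journalctl --no-pager --boot=-1 | greenboot_failures
```

For the previous boot shown above this would list 01_update_platforms_check.sh with exit code 1 and 02_watchdog.sh with exit code 4.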
@miabbott

miabbott commented Feb 3, 2025

I think this might be related to ostreedev/ostree#3193

@runcom you were looking at systemd-remount-fs.service failures recently; maybe you have some insight

@runcom
Member

runcom commented Feb 3, 2025

I’ll check it out, maybe composefs? Although, what is greenboot exit code 4 too? @say-paul

@runcom
Member

runcom commented Feb 3, 2025

This may be relevant ostreedev/ostree#3193 (comment)

@mcattamoredhat do you know what exactly changed in the new snapshot? rpm-ostree? Just ostree? Can you print versions and also provide the content of /etc/fstab and /proc/cmdline

@runcom
Member

runcom commented Feb 4, 2025

it seems that changes in https://github.com/osbuild/osbuild-composer/pull/4569/files are tests only, so how did you reproduce this @mcattamoredhat ? 🤔 I'm trying with 9.6 nightlies repo enabled, building a commit and upgrade (using a raw image to install)

@runcom
Member

runcom commented Feb 4, 2025

since ostree.sh uses anaconda, this may be relevant: https://bugzilla.redhat.com/show_bug.cgi?id=2332319, if we conclude that systemd-remount-fs.service is what's causing this issue (still not sure, and I say this because there's no bootc involved here, nor composefs enabled)

@runcom
Member

runcom commented Feb 4, 2025

I think the remount service is a red herring tho - it seems it's greenboot that fails and triggers the rollback 🤔

@runcom
Member

runcom commented Feb 4, 2025

This is what Mario has: the system is installed using Anaconda, but there's no bootc nor composefs (cc @cgwalters for the similar failure). We'll try without the / line in /etc/fstab. Also, it seems to me there's some sort of network failure in rpm-ostree/rpm-ostreed

[admin@vm ~]$ !2
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/ostree/edge-commit-68c3fe04cf09b3082bbe68c4d771a4ec122ea9cba2c5c0ef850740a227691aaf/vmlinuz-5.14.0-547.el9.x86_64 net.ifnames=0 modprobe.blacklist=vc4 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=ttyS0,115200 root=UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 rw ostree=/ostree/boot.1/edge-commit/68c3fe04cf09b3082bbe68c4d771a4ec122ea9cba2c5c0ef850740a227691aaf/1
[admin@vm ~]$ !3
cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Tue Feb  4 11:08:58 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 /                       xfs     defaults        0 0
UUID=d9ac8899-d8a4-43cc-93cf-290ed9892683 /boot                   xfs     defaults        0 0

[admin@vm ~]$ sudo journalctl --boot=-1 --no-pager -eu systemd-remount-fs.service
Feb 04 11:22:49 localhost systemd-remount-fs[619]: mount: /: cannot remount /dev/vda2 read-write, is write-protected.
Feb 04 11:22:49 localhost systemd-remount-fs[617]: /usr/bin/mount for / exited with exit status 32.
Feb 04 11:22:49 localhost systemd[1]: systemd-remount-fs.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:22:49 localhost systemd[1]: systemd-remount-fs.service: Failed with result 'exit-code'.
Feb 04 11:22:49 localhost systemd[1]: Failed to start Remount Root and Kernel File Systems.
[admin@vm ~]$ 

[admin@vm ~]$ sudo journalctl --no-pager --boot=-1 -eu rpm-ostreed.service
Feb 04 11:22:52 localhost systemd[1]: Starting rpm-ostree System Management Daemon...
Feb 04 11:22:52 localhost rpm-ostree[772]: error: Error receiving data: Connection reset by peer
Feb 04 11:22:52 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:22:52 localhost systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Feb 04 11:22:52 localhost systemd[1]: Failed to start rpm-ostree System Management Daemon.

@runcom
Member

runcom commented Feb 4, 2025

the watchdog check is a required one so that failing means we rollback too (update platforms checks instead is just wanted so shouldn't cause the rollback)
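The required-versus-wanted rule described here can be written down as a tiny decision function. This is a simplified model of the rule as stated in this comment, not greenboot's actual code; the input format (one `<kind> <exit code>` pair per line) and the function name are invented for the sketch.

```shell
#!/bin/sh
# Simplified model: a failing "required" check turns the boot RED (which
# leads to a rollback); a failing "wanted" check is only reported.
# Input on stdin: one line per check, "<required|wanted> <exit code>".
boot_status() {
    awk '
        $1 == "required" && $2 != 0 { red = 1 }
        END { print (red ? "RED" : "GREEN") }
    '
}

# The two failures from this issue: 02_watchdog.sh (required, exit 4)
# and 01_update_platforms_check.sh (wanted, exit 1).
#   printf 'required 4\nwanted 1\n' | boot_status
```

Under this model the update platforms failure alone would leave the boot GREEN; it is the watchdog script's non-zero exit that produces RED.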

@mcattamoredhat
Contributor Author

After commenting out the / line in /etc/fstab, the system still rolls back:

[admin@vm ~]$ cat /etc/fstab 

#
# /etc/fstab
# Created by anaconda on Tue Feb  4 11:08:58 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
# UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 /                       xfs     defaults        0 0
UUID=d9ac8899-d8a4-43cc-93cf-290ed9892683 /boot                   xfs     defaults        0 0

Upgrade commit is 1b5ba91f75c7ab115882be3f83c3668f2c88b2f7850e92fc2355a361839f594d

[admin@vm ~]$ sudo rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-02-04T11:06:00Z)
                   Commit: f63986e3544dffa98c3b97de358b26de4499f7993096c49efbdd11041d347a13

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-02-04T11:17:59Z)
                   Commit: 1b5ba91f75c7ab115882be3f83c3668f2c88b2f7850e92fc2355a361839f594d

@cgwalters
Contributor

> Our CI has detected RHEL-9.6 edge-commit fails after the changes introduced by PR #4569

I am not an expert in this repo but that would seem to be a surprising cause.

> Feb 04 11:22:49 localhost systemd-remount-fs[619]: mount: /: cannot remount /dev/vda2 read-write, is write-protected.

"is write-protected" here means we got EROFS from mount which usually means the physical block device is read-only. Is the CI system here only providing a read-only virtio device? I'd look for more logs related to that.
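One way to check the read-only block device theory from inside the guest is the kernel's per-device `ro` flag in sysfs. A minimal sketch; `SYSFS_ROOT` is a hypothetical override added only so the function can be exercised against a fake directory tree, and the function name is made up.

```shell
#!/bin/sh
# Print "ro" if the kernel reports the block device as read-only, else "rw".
# $1 is the kernel device name without /dev/, for example vda2.
device_mode() {
    flag_file="${SYSFS_ROOT:-/sys}/class/block/$1/ro"
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo ro
    else
        echo rw
    fi
}

# Example (not run here):
#   device_mode vda2
```

If this prints `rw` for the root partition, the EROFS from the remount is more likely coming from somewhere other than the virtio device itself.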

@runcom
Member

runcom commented Feb 4, 2025

right, although I just think that the rollback isn't related - greenboot doesn't check for that so whatever happens it's greenboot
@say-paul

@runcom
Member

runcom commented Feb 4, 2025

so it seems that dbus can't start for some reason, which makes rpm-ostreed non-functional:

Feb 04 11:19:56 localhost systemd[1]: Starting D-Bus System Message Bus...
Feb 04 11:19:56 localhost systemd[717]: dbus-broker.service: Failed to set up mount namespacing: /run/systemd/unit-root/dev: Read-only file system
Feb 04 11:19:56 localhost systemd[717]: dbus-broker.service: Failed at step NAMESPACE spawning /usr/bin/dbus-broker-launch: Read-only file system
Feb 04 11:19:56 localhost systemd[1]: dbus-broker.service: Main process exited, code=exited, status=226/NAMESPACE
Feb 04 11:19:56 localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Listening on D-Bus System Message Bus Socket.
Feb 04 11:19:58 localhost systemd[1]: Starting rpm-ostree System Management Daemon...
Feb 04 11:19:58 localhost systemd[1]: dbus-broker.service: Start request repeated too quickly.
Feb 04 11:19:58 localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Failed to start D-Bus System Message Bus.
Feb 04 11:19:58 localhost systemd[1]: dbus.socket: Failed with result 'service-start-limit-hit'.
Feb 04 11:19:58 localhost rpm-ostree[776]: error: Error receiving data: Connection reset by peer
Feb 04 11:19:58 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:19:58 localhost systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Failed to start rpm-ostree System Management Daemon.
Feb 04 11:19:58 localhost 02_watchdog.sh[775]: Job for rpm-ostreed.service failed because the control process exited with error code.
Feb 04 11:19:58 localhost 02_watchdog.sh[775]: See "systemctl status rpm-ostreed.service" and "journalctl -xeu rpm-ostreed.service" for details.
Feb 04 11:19:58 localhost 02_watchdog.sh[772]: parse error: Invalid numeric literal at line 1, column 3
Feb 04 11:19:58 localhost 02_watchdog.sh[771]: error: Loading sysroot: exit status: 1
Feb 04 11:19:58 localhost greenboot[738]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...

@cgwalters
Contributor

That looks like a symptom of missing /tmp as a tmpfs...which is part of the reference base image: https://gitlab.com/fedora/bootc/base-images/-/blame/main/tier-0/basic-fixes.yaml?ref_type=heads#L4

@runcom
Member

runcom commented Feb 4, 2025

> That looks like a symptom of missing /tmp as a tmpfs...which is part of the reference base image: https://gitlab.com/fedora/bootc/base-images/-/blame/main/tier-0/basic-fixes.yaml?ref_type=heads#L4

uhm, but this is not bootc 😄 I'm seeing a bunch of tmpfs issues indeed

Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.X11-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.ICE-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.XIM-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.font-unix": Read-only file system
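To tell whether /tmp is its own tmpfs mount or just a directory on the (read-only) root, the mount table can be queried directly. A minimal sketch that reads a table in `/proc/mounts` format; the function name is hypothetical.

```shell
#!/bin/sh
# Print the filesystem type mounted at mount point $1, reading a table in
# /proc/mounts format (device, mount point, type, options, ...).
# $2 optionally names the table file; prints nothing if $1 is not a mount.
# If the path is mounted more than once, the last entry wins.
mount_fstype() {
    awk -v mp="$1" '
        $2 == mp { fstype = $3 }
        END { if (fstype != "") print fstype }
    ' "${2:-/proc/mounts}"
}

# Example (not run here):
#   mount_fstype /tmp
```

Output of `tmpfs` means /tmp is a separate tmpfs; empty output means /tmp lives on whatever filesystem holds its parent, which would be consistent with the "Read-only file system" errors above.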

@runcom
Member

runcom commented Feb 4, 2025

@thozza @achilleas-k do you know more here from the top of your heads? 👼

@runcom
Member

runcom commented Feb 4, 2025

maybe slightly related as a change in ostree ostreedev/ostree#3366 ?

@runcom
Member

runcom commented Feb 4, 2025

This seems to be the case for us now ostreedev/ostree#3366 (comment) @cgwalters

@runcom
Member

runcom commented Feb 4, 2025

ostreedev/ostree#3353 (comment)

so ostree-2024.10 may be breaking for us as we upgrade from .9 to that w/o having a prepare-root.conf w/ composefs disabled.
Maybe upgrading straight to 2025.1 is gonna fix it?
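To see whether a given tree carries a prepare-root.conf with composefs disabled, the `enabled` key of the `[composefs]` section can be read out. This is a sketch under assumptions: the file locations `/usr/lib/ostree/prepare-root.conf` and `/etc/ostree/prepare-root.conf` are taken from the ostree-prepare-root documentation rather than from this thread, the copy that matters at boot may be the one inside the initramfs, and the function name is made up.

```shell
#!/bin/sh
# Print the value of "enabled" from the [composefs] section of the INI
# file named by $1; print nothing if the section or key is absent.
composefs_enabled() {
    awk -F '=' '
        /^[ \t]*\[/ {
            # Track whether the current section is [composefs].
            in_section = ($0 ~ /^[ \t]*\[composefs\][ \t]*$/)
            next
        }
        in_section && $1 ~ /^[ \t]*enabled[ \t]*$/ {
            gsub(/[ \t]/, "", $2)
            print $2
            exit
        }
    ' "$1"
}

# Example (not run here; path is an assumption, see above):
#   composefs_enabled /usr/lib/ostree/prepare-root.conf
```

Empty output means the file does not set the key at all, which is the situation described in this comment.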

@cgwalters
Contributor

Ah yes, sorry. We withdrew 2024.10 from Fedora bodhi, but not C{9,10}S as there's no real "undo" button there. In any case 2025.1 is already queued to ship in 9.6 and beyond.

@cgwalters
Contributor

Ugh, yeah 2025.1 is stuck in QE, will try to get that fixed

@runcom
Member

runcom commented Feb 4, 2025

so we need the snapshots here to at least target 20250201 - that snapshot contains ostree-2025.1 cc @thozza
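When checking which ostree a snapshot delivers, note that these versions compare component-wise: 2024.10 is newer than 2024.9 and older than 2025.1. A minimal sketch of such a comparison for two-part `YEAR.N` versions; the function name is hypothetical.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2, comparing the two
# dot-separated components numerically (so 2024.10 > 2024.9).
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, ".")
        split(want, w, ".")
        if (h[1] + 0 != w[1] + 0) exit !(h[1] + 0 > w[1] + 0)
        exit !(h[2] + 0 >= w[2] + 0)
    }'
}

# Example (not run here):
#   version_at_least "$(rpm -q --qf '%{VERSION}' ostree)" 2025.1 \
#       || echo "ostree older than 2025.1"
```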

@achilleas-k
Member

achilleas-k commented Feb 4, 2025

We have snapshots from 20250201, the PR is still open though: #4591

Quick look shows me that the rpm-ostree version there is 2025.4 (for RHEL 9.6).

@cgwalters
Contributor

> Quick look shows me that the rpm-ostree version there is 2025.4 (for RHEL 9.6).

It's ostree, not rpm-ostree at issue here

@achilleas-k
Member

Right, my mistake. In that case it's 2025.1.

@mcattamoredhat
Contributor Author

After the snapshot update to 20250201, the edge-commit test in RHEL-9.6 is still failing: https://artifacts.osci.redhat.com/testing-farm/8876a623-b410-499c-affd-727dbb89054f/work-edge-x86-commitqqviedpt/tmt/plans/edge-test/edge-x86-commit/execute/data/guest/default-0/tmt/tests/edge-test-1/output.txt

After reproducing this failure locally, it seems anaconda fails to install the bootloader:

Installing boot loader
..
Performing post-installation setup tasks
================================================================================
================================================================================
Question

 The following error occurred while installing the boot loader. The system will
 not be bootable. Would you like to ignore this and continue with installation?
 
 failed to write boot loader configuration

An unknown error has occured, look at the /tmp/anaconda-tb* file(s) for more details


===============================================================================

ne 311, in start
    item.start()
  File "/usr/lib64/python3.9/site-packages/pyanaconda/installation_tasks.py", line 311, in start
    item.start()
  File "/usr/lib64/python3.9/site-packages/pyanaconda/installation.py", line 399, in run_installation
    queue.start()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 275, in run
    threading.Thread.run(self)
pyanaconda.modules.common.errors.installation.BootloaderInstallationError: failed to write boot loader configuration

What do you want to do now?
1) Report Bug
2) Debug
3) Run shell
4) Quit
 

@cgwalters
Contributor

Do you have bootupd in your tree? See https://pagure.io/workstation-ostree-config/pull-request/600# that explicitly excludes it. If the Edge-9.6 setup isn't ready for bootupd then at the current time the package needs to be excluded.

(All this pain will go away when we consolidate on a reference, tested base image defined as a container image going forward)
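Whether bootupd ended up in the tree can be checked from any package list for the commit, for example `rpm -qa` on a system deployed from it. A minimal sketch; the function name is made up, and the pattern assumes the usual name-version-release.arch lines.

```shell
#!/bin/sh
# Succeed if the package list on stdin (one package per line, optionally
# indented) contains the bootupd package.
has_bootupd() {
    grep -Eq '^[[:space:]]*bootupd-[0-9]'
}

# Example (not run here):
#   rpm -qa | has_bootupd && echo "bootupd is in the tree"
```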

@runcom
Member

runcom commented Feb 5, 2025

Right, we actually did that with fedora before we supported it afaict osbuild/images#918

PR for rhel osbuild/images#1195
PR here for integration #4597

@mcattamoredhat
Contributor Author

Excluding bootupd with osbuild/images#1195 fixes the bootloader issue.

Nevertheless, edge-commit in RHEL-9.6 is still failing, we will continue debugging.

@runcom
Member

runcom commented Feb 6, 2025

Seems like we're now hitting a greenboot issue somehow - @say-paul is on it (but excluding bootupd actually fixes the anaconda failure)

@runcom
Member

runcom commented Feb 6, 2025

@runcom
Member

runcom commented Feb 7, 2025

the fix for greenboot has been merged - we'll keep this open until we bump the snapshot again to pick up the new version (thanks everybody for the help!)
