LVMActivationController fails repeatedly with exit status 5 #9365
Comments
There is a good chance this is a dup of #9300, but the details are different, so I filed a separate issue. |
same, 1.8.0, drive used for rook-ceph, noticed the loss of osds as /dev/mapper nodes not configured
|
I can't reproduce this fully, not sure what is the Ceph version and OSD setup:
|
I am using rook-ceph v1.15.2, using the cluster and operator charts. Here's the storage configuration:
Here's my dv output:
|
@smira OK, I think I know the root cause. At some point I made a change to the node, either a firmware change or more likely the kernel to a new version or config, that impacted PCIe enumeration. This resulted in a different assignment of nvme device nodes to physical devices (e.g. nvme1 became nvme2 and vice-versa). lvm keeps "pvs_online" files when When Because of the device reordering that happened on the node, this check fails:
The -- I think the lvm2 design assumes that these "pvs_online" files do not survive reboot. Indeed, according to the FHS 3
Unfortunately, while I think the right fix for this issue, and potentially many other issues, is to fix Talos's rootfs to have |
@jfroy I actually found the problem yesterday, and never got back to the issue. It's even worse a bit. There are two separate issues:
|
So the issue takes two Talos reboots to reproduce - e.g. if I install Rook/Ceph with encrypted drives (to trigger LVM, otherwise it creates bluestore directly on the device), everything is fine (as Ceph activates LVM itself on creation) On the second reboot, everything is activated correctly, as
|
See siderolabs/talos#9365 This allows to break dependency on `/var` availability, and also workaround issue with `/var/run` being persistent on Talos right now (which is going to be fixed as well). Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
For new installs, simply symlink to `/run` (which is `tmpfs`). For old installs, simulate by cleaning up the contents. Fixes siderolabs#9432 Related to siderolabs#9365 Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
See siderolabs/talos#9365 This allows to break dependency on `/var` availability, and also workaround issue with `/var/run` being persistent on Talos right now (which is going to be fixed as well). Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> (cherry picked from commit ae205aa)
For new installs, simply symlink to `/run` (which is `tmpfs`). For old installs, simulate by cleaning up the contents. Fixes siderolabs#9432 Related to siderolabs#9365 Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> (cherry picked from commit f711907)
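To illustrate the "old installs" path mentioned in the commit message above (simulating a `tmpfs` by cleaning up the contents of the directory at boot), here is a minimal Python sketch. It is not the Talos implementation (Talos is written in Go), and the function name `clear_directory_contents` and the demo file names are hypothetical, chosen only for illustration. The idea is that stale runtime state, such as the LVM "pvs_online" files discussed in this thread, does not survive into the next boot.

```python
import os
import shutil
import tempfile


def clear_directory_contents(path: str) -> int:
    """Remove every entry inside `path` but keep `path` itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    # Snapshot the listing first so we do not mutate the directory
    # while still iterating over it.
    with os.scandir(path) as it:
        entries = list(it)
    for entry in entries:
        # Do not follow symlinks: remove the link itself, not its target.
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.unlink(entry.path)
        removed += 1
    return removed


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for a persistent /var/run.
    # The layout below is an assumption made up for this example.
    demo = tempfile.mkdtemp()
    os.makedirs(os.path.join(demo, "lvm", "pvs_online"))
    open(os.path.join(demo, "lvm", "pvs_online", "stale-entry"), "w").close()
    print(clear_directory_contents(demo), os.listdir(demo))
    os.rmdir(demo)
```

On a real system this would have to run early in boot, before anything (including LVM activation) reads from the directory, which is presumably why new installs simply symlink to `/run` instead.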
Bug Report
Description
On a single-node cluster with Talos v1.8.0 and a Rook-Ceph cluster composed of 8 encrypted disks with one OSD per disk, Talos fails to activate the LVM volumes at boot. The activation appears to be retried indefinitely, and the controller makes no progress.
In beta versions of v1.8.0, the lvm volumes also did not become active at boot, but there was also no attempt to activate them. The new code for that was introduced late in the 1.8 cycle (see #9300) and this is the first time I ran a build with the new controller.
Workaround
I can run a privileged Alpine pod and issue `vgchange -a y` to activate all the lvm volumes. It does not allow the controller to make progress, but Ceph OSDs do start and the Ceph cluster does become healthy.
Logs
Environment
support.zip