Tolerance of pre-existing software-RAID #38

Open · wants to merge 11 commits into master
Conversation


@ydirson ydirson commented Jan 26, 2023

The primary goal of this PR is to prevent a pre-existing software-RAID configuration from blocking the use of the disks that were part of that RAID. Previously those disks were, at best, not shown as available; at worst they caused installation errors when selected (for disks whose partitions were part of a RAID), or even produced successful-looking installations that did not boot (when attempting to install on a RAID whose members were partitions).

The basic idea is to prevent pre-existing RAID volumes from being assembled automatically, and to add an explicit option to trigger such assembly, so that a system installed on RAID can still be upgraded or restored. While RAIDs are not assembled, the installer also refrains from offering to upgrade/restore a system installed on RAID.
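As a minimal sketch of the detection side (assuming mdadm is available in the installer environment; `list_raid_member_disks` is a hypothetical helper, not code from this PR), identifying disks that carry an md superblock could look like this:

```python
import subprocess

def list_raid_member_disks(candidate_disks):
    """Return the subset of disks that carry an md superblock (illustrative sketch)."""
    members = []
    for disk in candidate_disks:
        # "mdadm --examine" exits non-zero when no RAID superblock is found
        rc = subprocess.call(["mdadm", "--examine", "--quiet", disk],
                             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if rc == 0:
            members.append(disk)
    return members
```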

This filtering-out of restore choices is extended to any backup partition referring to an inaccessible device. As a side effect, this makes it impossible to start a "restore" operation of a too-old version, which would be doomed to fail because of a dom0 device-id change (as seen e.g. in https://xcp-ng.org/forum/topic/6573/upgrade-xenserver-7-to-xcp-ng-8-2-1).

Note that this PR is based on #20, to help the user understand which choices are actually presented. The display can be limited to the commits relevant to this PR.

Note that some cases are explicitly not addressed:

  • filtering out disks whose partitions are in a RAID -- this would only be needed once such a RAID is assembled, and we only offer to assemble pre-existing RAIDs when there are whole disks with a RAID superblock; so such disks would only appear when there is both a (supported) disk-level RAID and an (unsupported) partition-level RAID. At that point the user is very likely already working on the supported RAID, so it is not clear there is really a problem left to address here
  • preventing installation onto a partition-level RAID -- such RAIDs would only appear to be available in the same situation

…for user

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
Adding the disk vendor and model would be useful, but could take too much
horizontal space, especially at the good old 80x25 resolution, so we use
a shorter string with just the disk size for disambiguation.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
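For illustration only (a hypothetical helper, not this commit's exact code), a label that fits an 80-column console might simply append the size:

```python
def disk_label(device, size_bytes):
    # e.g. disk_label("sda", 512 * 10**9) -> "sda - 512 GB"
    return "%s - %d GB" % (device, size_bytes // 10**9)
```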
Reasons to have this here include:
* not requiring changes to the srpm when the set of installed files changes
* making installer development faster by deploying directly to a filesystem

This installation process still depends by default on the sm package
being installed, but through the SM_ROOT variable it can now be run on a
dev machine where this is not the case; it only needs an unpacked
version of the package.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
mdadm output is like "MD_DEVNAME=127", which results in getHumanDiskName
giving a name of "RAID: 127(sda,sdb)".  This fixes the problem and gets
us "RAID: md127(sda,sdb)" instead.

Also slightly improve readability with tuple unpacking.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
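A simplified sketch of the fix (a hypothetical helper, not the installer's actual getHumanDiskName): parse mdadm's KEY=VALUE export output and prefix the device name with "md":

```python
def human_raid_name(mdadm_export_output, member_names):
    # mdadm --detail --export prints lines such as "MD_DEVNAME=127"
    fields = dict(line.split("=", 1)
                  for line in mdadm_export_output.splitlines() if "=" in line)
    return "RAID: md%s(%s)" % (fields.get("MD_DEVNAME", "?"), ",".join(member_names))

# human_raid_name("MD_DEVNAME=127\nMD_LEVEL=raid1", ["sda", "sdb"])
#   -> 'RAID: md127(sda,sdb)'
```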
This prevents 65-md-incremental.rules from assembling every RAID it
identifies; we want to have control over that, so we can reuse a
former RAID member for other purposes.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
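One common way to neutralise a packaged udev rule is to mask it with an empty file of the same name in /etc/udev/rules.d/; whether this PR does exactly that is an assumption, but a sketch of the technique looks like this:

```python
import os

def mask_md_incremental_rule():
    # An empty /etc/udev/rules.d/65-md-incremental.rules overrides the copy
    # shipped in /usr/lib/udev/rules.d/, so udev no longer auto-assembles
    # arrays while the installer scans block devices.
    override = "/etc/udev/rules.d/65-md-incremental.rules"
    os.makedirs(os.path.dirname(override), exist_ok=True)
    open(override, "w").close()
```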
This causes the "Back" button in primary-disk selection to trigger a
rescan for products.  This makes it possible for the user to enable
layers of block-device composition (e.g. "md" software RAID, not part
of this PR) and then scan those RAID volumes for products.

This could even be used for arbitrary manual block-device setup from shell
on alternate console.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
It is valuable for a user who expects an existing product to be notified that
none was found.  Moreover, this will allow the user to reach the "assemble
RAID" choice even when no product was detected.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
… RAID

This adds an option to enable *all* RAID devices found.  This is less
flexible than a more selective mechanism would be, but it is really simple
and covers all known use-cases.

It filters out from the choices presented to the user both "update of" and
"restore to" entries referring to a disk that is part of an md RAID
(though in practice only the "update" case is likely to appear).  This
avoids the case where a user gets those entries before explicitly
assembling RAIDs, and thus avoids breaking a RAID by mistake.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
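A rough sketch of the two pieces described above (the helper names and the shape of `entries` are assumptions for illustration, not the PR's exact code):

```python
import subprocess

def assemble_all_raids():
    # "mdadm --assemble --scan" assembles every array described by the
    # superblocks it finds -- the simple, all-or-nothing option taken here.
    subprocess.call(["mdadm", "--assemble", "--scan"])

def filter_out_raid_members(entries, raid_members):
    # Drop "update of" / "restore to" entries whose disk is an md member,
    # so the user cannot break a not-yet-assembled RAID by mistake.
    return [(label, disk) for (label, disk) in entries if disk not in raid_members]
```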
…er to assemble RAID

This commit is kept separate until it is decided that this is the way to go,
but it would better be squashed.

Introduce installer feature flags, and hide the RAID-assembling
feature behind the `raid-assemble` flag.
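A hypothetical sketch of the feature-flag side (the real PR may derive the flag differently, e.g. from the build configuration); the name HAS_RAID_ASSEMBLE matches the diff shown further down:

```python
# constants.py (sketch)
FEATURES = set()                     # assumed: populated from build/branding configuration
HAS_RAID_ASSEMBLE = "raid-assemble" in FEATURES
```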
…ound

This avoids issues where users attempt to upgrade from a very old
version and a dom0 kernel-driver evolution has changed the
/dev/disk/by-id/ identifier for the disk, or where the backup became
invalid after a hardware change.

Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
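A hedged sketch of the filtering idea: ignore backups whose recorded disk (typically a /dev/disk/by-id/... path) no longer resolves to an existing block device. The (backup, disk_path) pairing is an assumption for illustration:

```python
import os

def accessible_backups(backups):
    # keep only backups whose referenced device still exists on this system
    return [(backup, disk) for (backup, disk) in backups
            if os.path.exists(os.path.realpath(disk))]
```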
Signed-off-by: Yann Dirson <yann.dirson@vates.fr>
```diff
@@ -229,7 +229,7 @@ def get_installation_type(answers):
     # RAID members, for filtering out from upgradable-products and
     # backups, and to decide whether to propose to activate existing RAID.
     raid_members = []
-    if "assemble-raid" not in answers:
+    if constants.HAS_RAID_ASSEMBLE and "assemble-raid" not in answers:
```
@ydirson (Contributor, Author) commented:
The feature flag that gates presenting the "assemble RAID" option also disables, when turned off, the safety measure of filtering out upgrade/restore entries that the user should not select, since they do live on a RAID volume. I think it is a bad idea to allow the user to select those, but hiding those options without letting the user assemble RAID volumes looks like a bad idea too. I would rather make the behavior added by this PR the default, and not add this feature flag at all.

Another Contributor replied:
As XenServer doesn't support installing on software RAID at all, it is better to keep the feature flag to hide the "assemble RAID" option. The button would be too confusing for users in a XenServer context.

@ydirson ydirson left a comment:

As it stands, there is a problem in the chaining of screens: we rely on LEFT_BACKWARDS from get_installation_type() to bring us back to the (new) scan_existing() step, which:

  1. prevents getting back to the EULA / hardware-warning screens
  2. means that, in the (likely rare) case where the overwrite_warning() step triggers, the rescan after RAID assembly would not happen at all

This is summarized in this sketch:
[sketch: sam-tolerate-raid-current]

Our proposal to address this is to bring all of existing-product scanning, overwrite warning, and RAID activation inside the get_installation_type() step, so that REPEAT_STEP does the rescan before re-presenting choices, and LEFT_BACKWARDS behaves as expected.
We identified one (small) downside: getting back from the primary-disk selection screen will trigger a rescan before bringing us back to the installation-type selection.

This is summarized in this sketch:
[sketch: sam-tolerate-raid-proposal]

Since this change is a bit disruptive, we're looking for feedback before implementing it.
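To make the proposal concrete, here is a very rough, self-contained sketch of the restructured step (REPEAT_STEP / LEFT_BACKWARDS are the uicontroller results named above; the helper functions are stand-ins, not the installer's real API):

```python
LEFT_BACKWARDS, REPEAT_STEP, RIGHT_FORWARDS = -1, 0, 1

def scan_existing():                                  # stand-in: rescan products and backups
    return [], []

def present_choices(answers, products, backups):      # stand-in for the selection dialog
    return RIGHT_FORWARDS

def assemble_all_raids():                             # stand-in: e.g. mdadm --assemble --scan
    pass

def get_installation_type(answers):
    # Product scanning, overwrite warning and RAID activation all live inside
    # this single step, so REPEAT_STEP naturally rescans before re-presenting
    # choices, and LEFT_BACKWARDS still reaches the EULA / hardware-warning
    # screens.  (overwrite_warning() handling elided for brevity.)
    products, backups = scan_existing()
    choice = present_choices(answers, products, backups)
    if choice == "assemble raid":
        assemble_all_raids()
        return REPEAT_STEP        # the ui controller re-enters this step, triggering a rescan
    return choice
```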

stormi commented Jul 7, 2023

This feature is one we would like in XCP-ng 8.3. As usual, we favour a common design and implementation over XCP-ng-specific patches.

Could someone at XenServer review our proposal above (last comment) so that we may quickly decide how to move forward?

MarkSymsCtx commented Jul 7, 2023

> Could someone at XenServer review our proposal above (last comment) so that we may quickly decide how to move forward?

The biggest problem is likely to be that XenServer doesn't support software RAID at all and explicitly refuses to install on software RAID.

This is also quite a sizeable chunk of code which would not be tested within a XenServer environment, so the risk of XenServer making a change elsewhere that leads to a regression in this code is sufficiently far from zero to be of concern to both parties.

stormi commented Jul 10, 2023

> The biggest problem is likely to be that XenServer doesn't support software RAID at all and explicitly refuses to install on software RAID.
>
> This is also quite a sizeable chunk of code which would not be tested within a XenServer environment, so the risk of XenServer making a change elsewhere that leads to a regression in this code is sufficiently far from zero to be of concern to both parties.

You raise a crucial question: do only XenServer's needs count, or can XCP-ng contributions also - in a measured way - introduce features which XenServer doesn't need?

We don't mean to flood you with PRs you don't care about for XenServer. But in a healthy collaboration, we also need some of our PRs to be accepted when they don't pose a big risk of regression or bitrot for XenServer.

XCP-ng has supported software RAID for dom0 almost since the beginning. We have experience with it and can maintain and test the related code. There is indeed a risk in it not being tested first at XenServer, but we already carry that risk, because software-RAID support in XCP-ng currently lives as out-of-branch patches, which is worse.

Also, the primary objective of this PR is not supporting software RAID: it is making the installer able to install on a disk even if that disk was previously used for software RAID. This situation is unlikely when installing on a brand-new server, but more likely when re-using hardware, as XCP-ng users often do. We see it as an improvement XenServer benefits from, even if in a small way: the installer should not assemble software RAID by default, as it currently does.

The only parts specific to XCP-ng are the support for detecting existing installations and backups on software RAID, and the "assemble RAID" button, which won't show up unless the feature is enabled.

I understand this pull request is not very interesting for XenServer, and our proposal in the comments even causes a workflow change which may affect existing test suites. I just hope it is acceptable. It falls under "trying to merge our legacy patches into common code and maintain them there, so that we can focus on more important common matters". It's too late for XCP-ng to remove software-RAID support for dom0: it is a popular feature. So either we can maintain it in the upstream code, or we have to keep maintaining our own branches, with potential conflicts and breakage each time upstream code changes.

@MarkSymsCtx commented:
> You raise a crucial question: do only XenServer's needs count, or can XCP-ng contributions also - in a measured way - introduce features which XenServer doesn't need?

I didn't mean to imply that only XenServer's or only XCP-ng's requirements count. But where we have quite divergent requirements, as here, there is a problem for both parties: ensuring that code not used by one party does not inadvertently get broken by unrelated changes. A robust set of unit tests would go a long way towards ensuring that regressions are not introduced by future changes.

stormi commented Jul 10, 2023

I definitely agree about testing the installer more thoroughly, ideally as GitHub Actions on PRs. Do you already have plans in this regard? Maybe as part of the Python 3 port?
