Installer pre-flight validations #346
Conversation
Add ability to run pre-flight validations in the installer.
> ## Open Questions [optional]
>
> - Will the validations be run automatically on `cluster create` or will it be
>   an explicit action?
Preflight checks should be in the default code path: when one fails, it helps communicate why the installation cannot proceed so the customer can fix the issue and run the installer again.
There should also be an explicit way to skip preflight checks when required. Sometimes customers might want to skip them because of a specific environment requirement, for example a customer trying an OpenShift installation on a new platform we do not support, where the last preflight check fails but the customer is fine with installing anyway and then manually fixing whatever made the check fail.
While I agree that in general we'll want the validations to run by default, as they do today, not everyone will be happy running all validations if it makes the deployment significantly longer, and we may want to explicitly enable some of them instead. I'm thinking in particular about the extra checks that validate cloud performance or the ability to pull container images.
Agreed on the ability to bypass checks. Depending on how we implement the validations, that could mean either turning them off or making their failures non-fatal.
The pre-flight checks should add enough value that they belong in the default path. However, if you have checks which take a long time and you do not want them enabled by default, then you can split the pre-flight checks into two parts: one part which runs every time, and some extra tests which can be opted into.
But ideally pre-flight checks should not take a long time to run. They should also not make any changes to the platform, and they add so much value that they should run every time the installer is run.
I guess the prechecks should be optional, but encouraged to be executed. If you have a testing environment where you usually deploy OCP, it is pointless to run those prechecks every time. So I think a `--skip-prechecks` flag would work: if it is not present, the prechecks are executed. Also, the prechecks should be able to be executed by themselves, like `openshift-install run-prechecks` or something. My 2 cents.
So right now, I'm leaning towards splitting the pre-flight validations into 2 buckets:
- the core validations, enabled all the time - this includes all of the current validations in the installer.
- the extra validations, enabled on demand, for more involved checks that require booting a node. The performance validations are a good example of such checks, as they can take quite some time, especially on bare metal.

Now, I'm not sure what the best way to enable the extra validations is... It depends on whether we want to perform the installation after running the extra validations:
- a `--dry-run` flag that runs all the validations but does not perform the installation
- a `--with-extra-validations` or similar flag, enabling the extra validations and then performing the installation.
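To make that split concrete, here is a rough sketch in Go of how core and extra validations could be registered and selected. All names here are hypothetical, not the installer's actual code:

```go
package preflight

import (
	"context"
	"fmt"
)

// Validation is one pre-flight check. Extra marks the more involved,
// opt-in checks (e.g. booting a node to benchmark storage).
type Validation struct {
	Name  string
	Extra bool
	Run   func(ctx context.Context) error
}

// RunAll executes the core validations, plus the extra ones when requested.
// It collects all failures rather than stopping at the first one, so the
// user gets a complete picture in a single run.
func RunAll(ctx context.Context, checks []Validation, withExtra bool) []error {
	var failures []error
	for _, v := range checks {
		if v.Extra && !withExtra {
			continue // extra validations are opt-in
		}
		if err := v.Run(ctx); err != nil {
			failures = append(failures, fmt.Errorf("%s: %w", v.Name, err))
		}
	}
	return failures
}
```

With something like this, a `--dry-run` path would call `RunAll(ctx, checks, true)` and stop, while a `--with-extra-validations` path would do the same and then continue with the installation.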
I think I've captured this conversation in my latest patch. Please check.
> an explicit action?
> - If explicit action, how will the validations be enabled? E.g. adding
>   a `--dry-run` option to the installer vs. a separate subcommand or even
>   a separate binary.
Having an explicit flag like `--dry-run` to do the pre-flight checks would help, IMO. But I am not too opinionated about it, as the pre-flight checks should be in the default code path.
> DNS to reach the cloud's enpoints
> - necessary cloud services are available
> - storage performance
Pre-flight checks should be idempotent in nature. We should mention this in the goals.
Indeed, very good point.
> succeed. This goal is relatively easy to implement for public cloud platforms
> where we can make assumptions about services being available, or performance
> meeting requirements, however this is not the case with private clouds where
> each cloud is unique.
Public clouds might need fewer checks, however many might apply to them as well: think quota, or permissions. Some public cloud offerings even differ based on the geographical zone. I wouldn't exclude public clouds a priori from this enhancement
Agree, public cloud may be of lower priority but there are many checks in place in OCM for OSD that could help bootstrap something in the AWS and GCP spaces. There are many variables driving limits/quotas that impact if installation will be successful including support plan and region. Explicitly checking limits up front is critical to success and we strive to tell our customers it will fail due to account/project limits before we kick off the installer.
> As an administrator of an OpenShift cluster, I would like to verify that my
> OpenStack cloud meets all the performance, service and networking requirements
> necessary for a successful deployment.
As noted below, we want to spin up a VM mimicking a master node. To me this means that, either implicitly or explicitly, we are validating an `install-config` as well.
To be even clearer about this point, I'd add a reference to the aforementioned OpenShift admin wanting to check that his cluster's configuration is valid and appropriate for the target infrastructure.
I guess this is OK for OpenStack, but for bare metal it would be almost like executing a complete installation: boot one of the bare metal masters/workers with a 'special image' that can execute the prechecks, run the checks, report back and tear down the bare metal host... including managing the PXE setup and everything else required to boot the bare metal host.
(references to the bastion VM have been removed from this draft)
I think the baremetal and OpenStack teams are trying to solve some very similar problems here! Deployment failures are hard to debug, so more validations are better. Our approach has been to check the quality of the data in the installer as much as possible, but then when things fail that we couldn't validate from the installer to just make sure the user understands why.
I've elaborated a bunch on what that looks like in #328.
Keep in mind, I'm coming from the baremetal perspective which is decidedly different than other on-premise platforms. In our case, the ephemeral validation host would be tough to provide. We could do a VM, but then that's just the same as a bootstrap host IMHO.
So, to take the pull secret validation example: if the credentials don't work, why not just make sure that information gets from the bootstrap node to the user in the installer output, and have the installer fail as early as possible?
> Node with the master flavor on the user provisioned network:
> - pull container images
> - validate networking and cloud connectivity
> - run fio
What's fio?
Yes that's the one. It's a benchmarking tool for storage, and is the recommended tool to check that storage performance meets the requirements for etcd:
https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#system-requirements
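For context, here is a minimal Go sketch of what such a check could look like, shelling out to `fio` with the small, synced write pattern the etcd FAQ uses as its example. It assumes `fio` is installed on the node, and leaves output parsing aside; a real validation would compare the fdatasync latency percentiles against etcd's recommendations:

```go
package preflight

import (
	"context"
	"fmt"
	"os/exec"
)

// benchmarkEtcdDisk runs fio against dir with small synced writes, roughly
// mimicking etcd's WAL workload, and returns the raw fio output.
func benchmarkEtcdDisk(ctx context.Context, dir string) (string, error) {
	cmd := exec.CommandContext(ctx, "fio",
		"--rw=write",
		"--ioengine=sync",
		"--fdatasync=1",
		"--directory="+dir,
		"--size=22m",
		"--bs=2300",
		"--name=etcd-preflight",
	)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return string(out), fmt.Errorf("fio failed: %w", err)
	}
	return string(out), nil
}
```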
> - the tenant has adequate quota and the flavors' specifications are within the
>   recommended ranges.
> - for user-provided networks, check the subnets have a DHCP server and valid
>   DNS to reach the cloud's enpoints
What's a "user-provided" network? Does this mean machine networks? How do you validate DHCP works? What are cloud endpoints? The metadata URI?
nit: endpoint
"user-provided" refers to networks not created by the installer.
https://github.com/openshift/installer/blob/master/docs/user/aws/customization.md#installing-to-existing-vpc--subnetworks
https://github.com/openshift/installer/blob/master/docs/user/gcp/customization.md#installing-to-existing-networks--subnetworks
On the OpenStack platform, we also allow passing the machine network with `machinesSubnet`, and attaching additional networks to the VMs with `additionalNetworkIDs` for multi-NIC VMs.
The cloud endpoints refer to the URLs you use to talk to your cloud's API. I'll expand this section a bit, hopefully clarifying things.
> recommended ranges.
> - for user-provided networks, check the subnets have a DHCP server and valid
>   DNS to reach the cloud's enpoints
> - necessary cloud services are available
Which ones?
This is platform-dependent.
> - for user-provided networks, check the subnets have a DHCP server and valid
>   DNS to reach the cloud's enpoints
> - necessary cloud services are available
> - storage performance
How is this tested and quantified?
> #### On-premise deployments
>
> As an administrator of an OpenShift cluster, I would like to verify that my
> OpenStack cloud meets all the performance, service and networking requirements
Is this enhancement request just for OpenStack? It's in the installer directory so I thought it was a generic framework other platforms might be able to use.
Correct, the enhancement is for all platforms. This user story uses the OpenStack example because that's the one I'm most familiar with, but it could be any other platform.
> Node with the master flavor on the user provisioned network:
> - pull container images
> - validate networking and cloud connectivity
How?
> Validation failure should result in actionable action, for example failure
> message could provide pointers on how to fix the error.
Suggested change: "Validation failures should clearly indicate the root cause of an error, and if applicable, suggest a known solution to it."
> #### Pre-provision a node
>
> Node with the master flavor on the user provisioned network:
At what stage in the installation would this node get created? Between the networking and nodes getting stood up? Or before anything for the cluster gets stood up at all?
I think we should make these checks run as a systemd unit on the bootstrap node, rather than spinning up a new node and creating a new install phase.
> environment doesn't match the recommendation and we may want the installer to
> go on with the deployment still.
>
> In that case, the validation can be marked as optional, meaning failure of the
How do you intend to set up the system to mark tests optional? Command line flags? A YAML file?
@mandre Thanks for your proposal. I think we can start implementing it! My suggestions for the implementation: for core validations (platform and machines) we can start using https://github.com/go-playground/validator. I don't like the idea of using a standalone VM for extra validations (i.e. perform core validations, boot a VM, upload tests there, execute them, collect the results, destroy the VM, and, if everything is okay, begin provisioning the infrastructure). First, it would break the existing installation workflow. Second, we would need to create additional resources for the VM (a floating IP and an RHCOS image, for instance), which would lead to code duplication.
@Fedosin using the bootstrap VM to run the extra validations would indeed remove some of the complexity around provisioning the resources to run the checks, and is a good idea worth exploring. I'd be careful, though, about treating the bootstrap node as equivalent to a master node: every platform is free to provision the bootstrap node however it likes, so you can't assume, for instance, that the storage the bootstrap node gets is similar to what the master nodes get.
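For illustration, a small sketch of the struct-tag style of core validation that go-playground/validator enables. The struct and tags below are made up for the example; they are not the installer's actual install-config types, and the thresholds are assumptions:

```go
package main

import (
	"fmt"

	"github.com/go-playground/validator/v10"
)

// machinePool is a hypothetical slice of an install config, used only to
// show how declarative rules could cover the "core" checks.
type machinePool struct {
	Name         string `validate:"required"`
	Replicas     int    `validate:"gte=1"`
	MachineCIDR  string `validate:"required,cidr"`
	FlavorVCPUs  int    `validate:"gte=4"`     // assumed minimum for masters
	FlavorRAMMiB int    `validate:"gte=16384"` // assumed minimum for masters
}

func main() {
	validate := validator.New()

	// FlavorVCPUs deliberately violates the gte=4 rule to show the error output.
	pool := machinePool{Name: "master", Replicas: 3, MachineCIDR: "10.0.0.0/16", FlavorVCPUs: 2, FlavorRAMMiB: 16384}
	if err := validate.Struct(pool); err != nil {
		for _, ferr := range err.(validator.ValidationErrors) {
			fmt.Printf("invalid %s: failed %q check\n", ferr.Namespace(), ferr.Tag())
		}
	}
}
```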
> Core validations are the ones we already know. They run every time, as it is
> done today.
>
> The extra validations will be enabled on demand via a flag when running the
No new flags to the installer binary, please. The installer is configured using `install-config.yaml` and will continue to be.
(force-pushed from 8128788 to 86cf3c3)
Addressed comments and simplified to match the first implemented iteration.
/remove-lifecycle rotten @mandre PTAL
/retitle Installer pre-flight validations
(force-pushed from 86cf3c3 to 677912a)
* Remove reference to the bastion VM
* Remove reference to the validations being optional
(force-pushed from 677912a to 313d8a8)
> The installer already performs some validations:
> - checks all required fields are set and that the data is in the right format
> - some basic validation, such as networks do not overlap
you can add to the list:
- Quota validations
- Flavor validations (when applicable)
But since the list isn't exhaustive, feel free to ignore my comment.
What you say is true: we now have those. However we implemented them in the context of this very Enhancement, so I think it's fine to list them under "goals", line 66
> However, this doesn't check that the environment is suitable to install OpenShift:
> - pull secret is valid to fetch the container images
> - the tenant has adequate quota and the flavors' specifications are within the
- quotas are checked in 4.7
- flavors are checked in 4.6
> Since pre-flight validations are only run at install time, and not on cluster
> upgrade/downgrade, caution should be used when migrating existing validations
> to this "pre-flight" framework.
We'll have use-cases where we can't afford increasing the deployment time: ephemeral, best-effort deployments like CI, where we know that if there is a failure no debugging will happen; another potential example is a small edge site.
We need a toggle to disable pre-flight validations (probably in install-config?).
We now have the unsupported flag `OPENSHIFT_INSTALL_SKIP_PREFLIGHT_VALIDATIONS=1`.
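How the installer consumes that variable is up to the implementation; a trivial Go sketch of the kind of guard such an escape hatch implies (not the actual installer code):

```go
package preflight

import (
	"fmt"
	"os"
)

// skipRequested reports whether the user asked to bypass pre-flight
// validations via the unsupported environment variable mentioned above.
func skipRequested() bool {
	return os.Getenv("OPENSHIFT_INSTALL_SKIP_PREFLIGHT_VALIDATIONS") == "1"
}

// maybeRun executes the validations unless the escape hatch is set.
func maybeRun(run func() error) error {
	if skipRequested() {
		fmt.Fprintln(os.Stderr, "WARNING: skipping pre-flight validations (unsupported)")
		return nil
	}
	return run()
}
```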
> The code for the framework and the validations will have unit tests.
>
> In addition, we will enable the validations checks in CI in order to exercise
> them and potentially highlight issues with the underlying CI infrastructure.
It would be great to be able to run pre-flight validations separately, for 2 use-cases:
- in CI, we would most likely disable pre-flight validations everywhere if they take too much time from the CI job and can cause timeout issues (it happened a lot in OpenStack CI).
- a customer who wants to first check their infrastructure, then run the install.

Therefore I propose that we have a subcommand like `openshift-install validate pre-flight`, for example.
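As a rough illustration of that proposal: the installer's CLI is built with cobra, so a dedicated subcommand could be wired roughly like the sketch below. The command names and structure are hypothetical, not how openshift-install is actually organized:

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	root := &cobra.Command{Use: "openshift-install"}

	validate := &cobra.Command{
		Use:   "validate",
		Short: "Run validations without installing",
	}

	preflight := &cobra.Command{
		Use:   "pre-flight",
		Short: "Run the pre-flight validations against the target infrastructure",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Placeholder: this is where the pre-flight framework would be invoked.
			fmt.Println("running pre-flight validations...")
			return nil
		},
	}

	validate.AddCommand(preflight)
	root.AddCommand(validate)

	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}
```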
A few comments inline, but this is a serious start I think. Thanks!
> The pre-flight validations should not alter the target infrastructure nor leave
> behind any new resource.
If cloud providers are in scope, consider that EC2 in a region isn't usable until you try to provision an instance. This initial provision will fail and the region has to be approved for use by AWS. Depending on the support plan and the region this can take minutes or days, and I am not aware of an API to check this. Are "checks" like this in scope, where the implementation is to try creating some infrastructure and then tear it down if successful?
Hey folks! This is an interesting proposal. As a user, I'd like to share my experience here. I just spent a week trying to understand why my attempts to create a cluster kept failing, and what I learnt is that I had been using a few invalid manifests. I'm working on getting the Cilium installer operator working with OpenShift, and that needs to be done by means of customising manifests.
References:
> We do not envision the need for the users to write additional validations. As
> a consequence, the validations do not need to be loaded on startup and will be
> compiled into the `openshift-install` binary. This may be revisited later.
It may not be possible to run all desirable validations from the installer. For bare metal IPI, we cannot always assume the host running the installer can talk to the baseboard management controllers (BMCs) used to manage power on the host. The bootstrap VM does need access to those BMCs, so we could run a validation step there that verified that the credentials for accessing the BMCs are correct and that the BMCs support the features required (especially virtual media support).
@openshift-bot: Closed this PR.