Installer pre-flight validations #346
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
--- | ||
title: pre-flight-validations | ||
authors: | ||
- @mandre | ||
- @pierreprinetti | ||
reviewers: | ||
- EmilienM | ||
- Fedosin | ||
- LalatenduMohanty | ||
- abhinavdahiya | ||
- iamemilio | ||
- stbenjam | ||
approvers: | ||
- TBD | ||
creation-date: 2020-05-18 | ||
last-updated: 2020-12-17 | ||
status: implementable | ||
--- | ||
|
||
# Pre-flight validations | ||
|
||
## Release Signoff Checklist | ||
|
||
- [v] Enhancement is `implementable` | ||
- [v] Design details are appropriately documented from clear requirements | ||
- [v] Test plan is defined | ||
- [x] Graduation criteria for dev preview, tech preview, GA | ||
- [x] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
||
## Summary | ||
|
||
One of the guiding principles of OCP 4 is that the installation should always | ||
succeed. This goal is relatively easy to achieve on public cloud platforms, | ||
where we can make assumptions about services being available and performance | ||
meeting requirements. This is not the case with private clouds, where each | ||
cloud is unique. | ||
|
||
It is currently not possible to tell with confidence whether an installation | ||
will succeed on the following platforms: | ||
- BareMetal | ||
- OpenStack | ||
- oVirt | ||
- vSphere | ||
|
||
For this purpose, we propose to implement a framework allowing the installer to | ||
run pre-flight validations that verify that all prerequisites for a successful | ||
OpenShift installation are met in the selected environment. | ||
|
||
## Motivation | ||
|
||
This section is for explicitly listing the motivation, goals and non-goals of | ||
this proposal. Describe why the change is important and the benefits to users. | ||
|
||
### Goals | ||
|
||
By having an automated way to identify potential issues early, before the | ||
deployment even starts, we want to reduce wasted time, wasted resources, and | ||
customer escalations, and to improve the perception of on-premise OpenShift | ||
deployments. | ||
|
||
The installer already performs some validations: | ||
- checks all required fields are set and that the data is in the right format | ||
- some basic validation, such as checking that networks do not overlap | ||
you can add to the list:
But since the list isn't exhaustive, feel free to ignore my comment.
What you say is true: we now have those. However we implemented them in the context of this very Enhancement, so I think it's fine to list them under "goals", line 66 |
||
|
||
However, these validations do not check that the environment is suitable for installing OpenShift: | ||
- pull secret is valid to fetch the container images | ||
- the tenant has adequate quota and the flavors' specifications are within the | ||
recommended ranges. | ||
- for user-provided networks, the subnets have a DHCP server and valid | ||
DNS to reach the cloud's endpoints | ||
- required cloud services are available | ||
- storage performance meets the requirements | ||
|
||
The pre-flight validations should not alter the target infrastructure nor leave | ||
behind any new resource. | ||
Comment on lines +73 to +74:
If cloud providers are in scope consider that EC2 in a region isn't usable until you try to provision an instance. This initial provision will fail and region has to be approved for use by AWS. Depending on the support plan and the region this can take minutes or days. I am not aware that an API to check this has been added. Are "checks" like this in scope where the implementation is to try creating some infra then tearing it down if successful? |
||
|
||
Pre-flight checks should be idempotent in nature. We should mention this in the goals.
Indeed, very good point. |
||
### Non-Goals | ||
|
||
Implementing every useful validation we can think of is out of scope: | ||
- The potential number of validations is huge | ||
- We expect more validations to be added over time as new issues are | ||
discovered. | ||
|
||
## Proposal | ||
|
||
This is where we get down to the nitty gritty of what the proposal actually is. | ||
|
||
### User Stories | ||
|
||
#### On-premise deployments | ||
|
||
As an OpenShift administrator installing a new OpenShift cluster, I want the | ||
installation process to fail early when the requirements are not met for a | ||
successful installation. In such a case, I also want clear and actionable error | ||
messages right in the Installer output. | ||
|
||
As noted below, we want to spin a VM mimicking a To be even clearer about this point, I'd add a reference to the aforementioned OpenShift admin wanting to check that his cluster's configuration is valid and appropriate for the target infrastructure.
I guess this is ok for OpenStack, but for bare metal it would be almost like if a complete installation is executed (boot one of the bare metal masters/workers with a 'special image' that can execute the prechecks, run the checks, report back & tear down the baremetal... including how to manage the pxe, etc. that is required to boot the bare metal host)
(references to the bastion VM have been removed from this draft) |
||
#### CI debugging | ||
|
||
As an OpenShift developer, I would like to rapidly identify failures caused by | ||
transient environmental issues. | ||
|
||
### Implementation Details/Notes/Constraints | ||
|
||
The validations must leave the environment unaltered. | ||
|
||
The framework should allow implementing checks common to all platforms as well | ||
as per-platform checks. | ||
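
One way to picture the split between common and per-platform checks is a small registry that keeps the two groups apart and merges them for the platform being installed. The Go sketch below is only an illustration under that assumption: `Check`, `Registry`, `AddCommon`, `AddForPlatform` and `ChecksFor` are hypothetical names, not the installer's actual types, and the platform keys are example strings.

```go
package main

import "fmt"

// Check is a hypothetical unit of pre-flight validation: a name plus a
// function that returns nil when the requirement is met.
type Check struct {
	Name string
	Run  func() error
}

// Registry keeps checks shared by every platform separate from checks
// that only apply to one platform. All names here are illustrative.
type Registry struct {
	common      []Check
	perPlatform map[string][]Check
}

// NewRegistry returns an empty registry ready to accept checks.
func NewRegistry() *Registry {
	return &Registry{perPlatform: make(map[string][]Check)}
}

// AddCommon registers a check that runs on every platform.
func (r *Registry) AddCommon(c Check) {
	r.common = append(r.common, c)
}

// AddForPlatform registers a check that runs only on the given platform.
func (r *Registry) AddForPlatform(platform string, c Check) {
	r.perPlatform[platform] = append(r.perPlatform[platform], c)
}

// ChecksFor returns the common checks followed by the checks registered
// for the given platform. Unknown platforms only get the common checks.
func (r *Registry) ChecksFor(platform string) []Check {
	specific := r.perPlatform[platform]
	checks := make([]Check, 0, len(r.common)+len(specific))
	checks = append(checks, r.common...)
	checks = append(checks, specific...)
	return checks
}

func main() {
	reg := NewRegistry()
	// Placeholder checks that always pass; real checks would only read
	// from the target infrastructure and never modify it.
	reg.AddCommon(Check{Name: "pull-secret", Run: func() error { return nil }})
	reg.AddForPlatform("openstack", Check{Name: "quota", Run: func() error { return nil }})
	reg.AddForPlatform("vsphere", Check{Name: "datastore", Run: func() error { return nil }})

	for _, c := range reg.ChecksFor("openstack") {
		fmt.Println(c.Name)
	}
}
```

Because the checks are plain values registered in code, this shape fits validations that are compiled into the binary rather than loaded at startup.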
|
||
We do not envision the need for the users to write additional validations. As | ||
a consequence, the validations do not need to be loaded on startup and will be | ||
compiled into the `openshift-install` binary. This may be revisited later. | ||
It may not be possible to run all desirable validations from the installer. For bare metal IPI, we cannot always assume the host running the installer can talk to the baseboard management controllers (BMCs) used to manage power on the host. The bootstrap VM does need access to those BMCs, so we could run a validation step there that verified that the credentials for accessing the BMCs are correct and that the BMCs support the features required (especially virtual media support). |
||
|
||
#### Enabling the validations | ||
|
||
The validations will run automatically, right after the `install-config.yaml` | ||
syntax validation. | ||
|
||
#### Reporting errors | ||
|
||
A failed validation typically causes the installer to fail early and not | ||
proceed with the deployment of OpenShift. | ||
|
||
The installer will report all found failures at once, and will not stop on the | ||
first validation error. | ||
|
||
Validation failures should clearly indicate the root cause of an error, and if | ||
applicable, suggest a solution to it. | ||
|
||
### Risks and Mitigations | ||
|
||
Depending on the number and the nature of the validations, the installation | ||
time might end up increasing noticeably. Because the goal is to fail early, | ||
time-consuming validations should seldom be considered for inclusion. | ||
|
||
Since pre-flight validations are only run at install time, and not on cluster | ||
upgrade/downgrade, caution should be used when migrating existing validations | ||
to this "pre-flight" framework. | ||
We'll have use-cases where we can't afford increasing the deployment time (ephemeral deployments, kind of best effort, like in CI, where we know if there is a failure, debug won't happen; another potential example, small edge site? etc). We need a toggle to disable pre-flight-validations (probably in install-config?).
We now have the unsupported flag |
||
|
||
## Design Details | ||
|
||
### Test Plan | ||
|
||
The code for the framework and the validations will have unit tests. | ||
|
||
In addition, we will enable the validation checks in CI in order to exercise | ||
them and potentially highlight issues with the underlying CI infrastructure. | ||
It would be great to be able to run pre-flight validations separately, for 2 use-cases:
Therefore I propose that we have a subcommand like |
||
|
||
## Upgrade / Downgrade Strategy | ||
|
||
Pre-flight validations are only run when installing a new OpenShift cluster. | ||
|
||
## Implementation History | ||
|
||
The pre-flight validation framework is implemented in OpenShift v4.6. New | ||
validations should be added in new releases, to increase the coverage of | ||
existing requirements and to cover new requirements. | ||
|
||
## Drawbacks | ||
|
||
Running the validations will increase the time it takes to run the installer. | ||
|
||
## Alternatives | ||
|
||
Only perform input validation as it is done today and rely on runtime errors to | ||
troubleshoot deployment issues. | ||
|
||
The pre-flight validations can give a good indication of whether a deployment | ||
has a chance to succeed at a given time; however, they can't catch all | ||
potential issues. Any change in the environment invalidates the previous | ||
validation results. That is why it is important to also report runtime errors | ||
in a way that is easy to understand. | ||
|
||
The pre-flight validations and [improved | ||
debuggability](https://github.com/openshift/enhancements/pull/328) of | ||
deployment errors are not mutually exclusive; the two are complementary. |
Public clouds might need fewer checks, however many might apply to them as well: think quota, or permissions. Some public cloud offerings even differ based on the geographical zone. I wouldn't exclude public clouds a priori from this enhancement
Agree, public cloud may be of lower priority but there are many checks in place in OCM for OSD that could help bootstrap something in the AWS and GCP spaces. There are many variables driving limits/quotas that impact if installation will be successful including support plan and region. Explicitly checking limits up front is critical to success and we strive to tell our customers it will fail due to account/project limits before we kick off the installer.