Installer pre-flight validations #346

Closed
wants to merge 4 commits into from
174 changes: 174 additions & 0 deletions enhancements/installer/pre-flight-validations.md
@@ -0,0 +1,174 @@
---
title: pre-flight-validations
authors:
- @mandre
- @pierreprinetti
reviewers:
- EmilienM
- Fedosin
- LalatenduMohanty
- abhinavdahiya
- iamemilio
- stbenjam
approvers:
- TBD
creation-date: 2020-05-18
last-updated: 2020-12-17
status: implementable
---

# Pre-flight validations

## Release Signoff Checklist

- [v] Enhancement is `implementable`
- [v] Design details are appropriately documented from clear requirements
- [v] Test plan is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [x] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

One of the guiding principles of OCP 4 is that the installation should always
succeed. This goal is relatively easy to reach on public cloud platforms, where
we can assume that services are available and that performance meets the
requirements. This is not the case with private clouds, where each cloud is
unique.

Review comment by @pierreprinetti (Member), May 27, 2020:

Public clouds might need fewer checks, however many might apply to them as well: think quota, or permissions. Some public cloud offerings even differ based on the geographical zone. I wouldn't exclude public clouds a priori from this enhancement

Review comment (Member):

Agree, public cloud may be of lower priority but there are many checks in place in OCM for OSD that could help bootstrap something in the AWS and GCP spaces. There are many variables driving limits/quotas that impact if installation will be successful including support plan and region. Explicitly checking limits up front is critical to success and we strive to tell our customers it will fail due to account/project limits before we kick off the installer.


It is currently not possible to tell with confidence if an installation will be
successful or not for the following platforms:
- BareMetal
- OpenStack
- oVirt
- vSphere

For this purpose, we propose to implement a framework allowing the installer to
run pre-flight validations in order to certify that all pre-requisites are met
to successfully install OpenShift in the selected environment.

## Motivation

This section is for explicitly listing the motivation, goals and non-goals of
this proposal. Describe why the change is important and the benefits to users.

### Goals

By identifying potential issues automatically and early, before the deployment
has even started, we want to reduce wasted time, wasted resources, and customer
escalations, and to improve the perception of OpenShift deployments on-premise.

The installer already performs some validations:
- it checks that all required fields are set and that the data is in the right
format
- it performs some basic validation, such as checking that networks do not
overlap
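
As an illustration of the network overlap check mentioned in the list above, here is a minimal Go sketch. It is not the installer's actual code: `checkOverlap` and the network names are hypothetical. It relies on the fact that two CIDR blocks are either nested or disjoint, so they overlap exactly when one contains the other's base address.

```go
package main

import (
	"fmt"
	"net"
)

// checkOverlap is a hypothetical helper, not the installer's implementation.
// It returns an error if the two named CIDR blocks share any address.
func checkOverlap(nameA, cidrA, nameB, cidrB string) error {
	_, a, err := net.ParseCIDR(cidrA)
	if err != nil {
		return fmt.Errorf("%s: %w", nameA, err)
	}
	_, b, err := net.ParseCIDR(cidrB)
	if err != nil {
		return fmt.Errorf("%s: %w", nameB, err)
	}
	// CIDR blocks are either nested or disjoint, so testing each base
	// address against the other block is sufficient.
	if a.Contains(b.IP) || b.Contains(a.IP) {
		return fmt.Errorf("%s (%s) overlaps %s (%s)", nameA, a, nameB, b)
	}
	return nil
}

func main() {
	// Example values only; they are not taken from the proposal.
	fmt.Println(checkOverlap("machineNetwork", "10.0.0.0/16", "serviceNetwork", "10.0.128.0/17"))
	fmt.Println(checkOverlap("machineNetwork", "10.0.0.0/16", "serviceNetwork", "172.30.0.0/16"))
}
```

A check like this only reads the configuration, so it needs no access to the target infrastructure.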

Review comment (Member):

you can add to the list:

  • Quota validations
  • Flavor validations (when occurs)

But since the list isn't exhaustive, feel free to ignore my comment.

Review comment (Member):

What you say is true: we now have those. However we implemented them in the context of this very Enhancement, so I think it's fine to list them under "goals", line 66


However, these validations do not check that the environment is suitable for installing OpenShift:
- pull secret is valid to fetch the container images
- the tenant has adequate quota and the flavors' specifications are within the
recommended ranges.

Review comment (Member):

  • quotas are checked in 4.7
  • flavors are checked in 4.6

- for user-provided networks, check the subnets have a DHCP server and valid
DNS to reach the cloud's endpoints
- required cloud services are available
- required storage performance
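
To make the quota item in the list above concrete, the following Go sketch compares the resources an installation would need against a snapshot of the tenant's quota. Everything here is an assumption made for illustration: the `Usage` type, the `checkQuota` function, the resource names, and the convention that a negative limit means "unlimited" are not taken from the installer. The check is read-only, in line with the requirement that validations leave the infrastructure untouched.

```go
package main

import (
	"fmt"
	"sort"
)

// Usage is a hypothetical snapshot of one quota-limited resource.
type Usage struct {
	Limit int // a negative limit is treated as "unlimited" (assumption)
	InUse int
}

// checkQuota returns one message per resource whose free quota is below
// what the installation requires. It never modifies anything.
func checkQuota(required map[string]int, current map[string]Usage) []string {
	names := make([]string, 0, len(required))
	for name := range required {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic report order

	var shortfalls []string
	for _, name := range names {
		u, ok := current[name]
		if !ok {
			shortfalls = append(shortfalls, fmt.Sprintf("%s: no quota information available", name))
			continue
		}
		if u.Limit < 0 {
			continue // unlimited
		}
		if free := u.Limit - u.InUse; free < required[name] {
			shortfalls = append(shortfalls, fmt.Sprintf(
				"%s: need %d, only %d available (limit %d, in use %d)",
				name, required[name], free, u.Limit, u.InUse))
		}
	}
	return shortfalls
}

func main() {
	// Made-up numbers, for illustration only.
	required := map[string]int{"instances": 6, "cores": 28}
	current := map[string]Usage{
		"instances": {Limit: 10, InUse: 7},
		"cores":     {Limit: -1, InUse: 40},
	}
	for _, msg := range checkQuota(required, current) {
		fmt.Println(msg)
	}
}
```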

The pre-flight validations should not alter the target infrastructure nor leave
behind any new resource.

Review comment on lines +73 to +74 (Member):

If cloud providers are in scope consider that EC2 in a region isn't usable until you try to provision an instance. This initial provision will fail and region has to be approved for use by AWS. Depending on the support plan and the region this can take minutes or days. I am not aware that an API to check this has been added. Are "checks" like this in scope where the implementation is to try creating some infra then tearing it down if successful?


Review comment (Member):

Pre-flight checks should be idempotent in nature. We should mention this in the goals.

Review comment (Member, Author):

Indeed, very good point.

### Non-Goals

Implementing every useful validation we can think of is out-of-scope:
- The potential number of validations is huge
- We expect more validations to be added over time as new issues are
discovered.

## Proposal

This is where we get down to the nitty gritty of what the proposal actually is.

### User Stories

#### On-premise deployments

As an OpenShift administrator installing a new OpenShift cluster, I want the
installation process to fail early when the requirements are not met for a
successful installation. In such a case, I also want clear and actionable error
messages right in the Installer output.

Review comment (Member):

As noted below, we want to spin a VM mimicking a master node. To me this means that either implicitly or explicitly, we are validating an install-config as well.

To be even clearer about this point, I'd add a reference to the aforementioned OpenShift admin wanting to check that his cluster's configuration is valid and appropriate for the target infrastructure.

Review comment:

I guess this is ok for OpenStack, but for bare metal it would be almost as if a complete installation were executed: boot one of the bare metal masters/workers with a 'special image' that can execute the prechecks, run the checks, report back, and tear down the bare metal host, including managing the PXE setup and anything else required to boot the host.

Review comment (Member):

(references to the bastion VM have been removed from this draft)

#### CI debugging

As an OpenShift developer, I would like to rapidly identify failures caused by
transient environmental issues.

### Implementation Details/Notes/Constraints

The validations must leave the environment unaltered.

The framework should allow implementing checks common to all platforms as well
as per-platform checks.

We do not envision the need for the users to write additional validations. As
a consequence, the validations do not need to be loaded on startup and will be
compiled into the `openshift-install` binary. This may be revisited later.
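
One way to picture a framework with common and per-platform checks that are compiled into the binary is a pair of static tables. The Go sketch below is only an assumption about how such a framework could be shaped; `Config`, `Validation`, `validationsFor`, and the check names are all hypothetical and do not describe the actual installer code.

```go
package main

import (
	"errors"
	"fmt"
)

// Config stands in for the parsed install-config (hypothetical).
type Config struct {
	Platform   string
	PullSecret string
}

// Validation is a single named pre-flight check (hypothetical).
type Validation struct {
	Name string
	Run  func(Config) error
}

// Checks that apply to every platform.
var commonValidations = []Validation{
	{Name: "pull-secret-present", Run: func(c Config) error {
		if c.PullSecret == "" {
			return errors.New("pull secret is empty")
		}
		return nil
	}},
}

// Checks that apply to one platform only. The entry below is a placeholder.
var platformValidations = map[string][]Validation{
	"openstack": {
		{Name: "openstack-placeholder", Run: func(Config) error { return nil }},
	},
}

// validationsFor returns the common checks followed by the checks
// registered for the given platform, if any.
func validationsFor(platform string) []Validation {
	out := append([]Validation{}, commonValidations...)
	return append(out, platformValidations[platform]...)
}

func main() {
	cfg := Config{Platform: "openstack"}
	for _, v := range validationsFor(cfg.Platform) {
		fmt.Printf("%s: %v\n", v.Name, v.Run(cfg))
	}
}
```

Because the tables are ordinary package-level variables, nothing has to be discovered or loaded at startup; adding a validation means adding an entry and rebuilding.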

Review comment (Contributor):

It may not be possible to run all desirable validations from the installer. For bare metal IPI, we cannot always assume the host running the installer can talk to the baseboard management controllers (BMCs) used to manage power on the host. The bootstrap VM does need access to those BMCs, so we could run a validation step there that verified that the credentials for accessing the BMCs are correct and that the BMCs support the features required (especially virtual media support).


#### Enabling the validations

The validations will run automatically, right after the `install-config.yaml`
syntax validation.

#### Reporting errors

A failed validation typically causes the installer to fail early and not
proceed with the deployment of OpenShift.

The installer will report all found failures at once, and will not stop on the
first validation error.

Validation failures should clearly indicate the root cause of an error, and if
applicable, suggest a solution to it.
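
The "report everything at once" behaviour described above can be sketched in Go as a loop that collects failures instead of returning on the first one. This is an illustrative sketch, not the installer's code: the `check` type, `runAll`, and the sample checks with their messages are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// check is a hypothetical named validation.
type check struct {
	name string
	run  func() error
}

// sampleChecks holds made-up checks; two of them fail on purpose, and each
// failure message names the cause and suggests a fix.
var sampleChecks = []check{
	{"quota", func() error {
		return errors.New("need 3 instances, only 1 available; raise the quota or free some instances")
	}},
	{"dns", func() error { return nil }},
	{"flavor", func() error {
		return errors.New("flavor memory is below the recommended range; choose a larger flavor")
	}},
}

// runAll runs every check and returns all failures, in order.
// It deliberately does not stop at the first error.
func runAll(checks []check) []error {
	var failures []error
	for _, c := range checks {
		if err := c.run(); err != nil {
			failures = append(failures, fmt.Errorf("%s: %w", c.name, err))
		}
	}
	return failures
}

func main() {
	failures := runAll(sampleChecks)
	for _, err := range failures {
		fmt.Println("pre-flight validation failed:", err)
	}
	if len(failures) > 0 {
		fmt.Printf("%d validation(s) failed; not proceeding with the deployment\n", len(failures))
	}
}
```

Collecting failures this way lets an administrator fix every reported problem in one pass rather than rerunning the installer once per error.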

### Risks and Mitigations

Depending on the number and the nature of the validations, the installation
time might end up increasing noticeably. Because the goal is to fail early,
time-consuming validations should seldom be considered for inclusion.

Since pre-flight validations are only run at install time, and not on cluster
upgrade/downgrade, caution should be used when migrating existing validations
to this "pre-flight" framework.

Review comment (Member):

We'll have use cases where we can't afford to increase the deployment time: ephemeral, best-effort deployments such as CI, where we know that no debugging will happen after a failure, or possibly small edge sites.

We need a toggle to disable pre-flight-validations (probably in install-config?).

Review comment (Member):

We now have the unsupported flag OPENSHIFT_INSTALL_SKIP_PREFLIGHT_VALIDATIONS=1


## Design Details

### Test Plan

The code for the framework and the validations will have unit tests.

In addition, we will enable the validation checks in CI in order to exercise
them and potentially highlight issues with the underlying CI infrastructure.

Review comment (Member):

It would be great to be able to run pre-flight validations separately, for 2 use-cases:

  • in CI, we would most likely disable pre-flight validations everywhere if they take too much time from the CI job and cause timeout issues (this happened a lot in OpenStack CI).
  • a customer who wants to first check their infrastructure, then run the install

Therefore I propose that we have a subcommand like openshift-install validate pre-flight for example.


## Upgrade / Downgrade Strategy

Pre-flight validations are only run when installing a new OpenShift cluster.

## Implementation History

The pre-flight validation framework is implemented in OpenShift v4.6. New
validations should be added in new releases, to increase the coverage of
existing requirements and to cover new requirements.

## Drawbacks

Running the validations will increase the time it takes to run the installer.

## Alternatives

Only perform input validation as it is done today and rely on runtime errors to
troubleshoot deployment issues.

The pre-flight validations can give a good indication of whether a deployment
has a chance to succeed at a given time, but they can't catch all potential
issues. Any change in the environment invalidates the previous validation
results. That is why it is important to also report runtime errors in a way
that is easy to understand.

The pre-flight validations and [improved
debuggability](https://github.com/openshift/enhancements/pull/328) of
deployment errors are not mutually exclusive; the two are complementary.