
Installer pre-flight validations #346

Closed
wants to merge 4 commits into from

Conversation

mandre
Member

@mandre mandre commented May 27, 2020

Add ability to run pre-flight validations in the installer.

@mandre mandre changed the title Installer pre-flight validations → [WIP] Installer pre-flight validations May 27, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2020
## Open Questions [optional]

- Will the validations be run automatically on `cluster create` or will it be
an explicit action?
Member

@LalatenduMohanty LalatenduMohanty May 27, 2020


Preflight checks should be in the default code path: when a check fails, it communicates why, which helps the customer fix the issue and run the installer again.

Also, there should be an explicit way to skip preflight checks when required. Customers might sometimes want to skip them because of specific environment requirements. For example, a customer trying an OpenShift installation on a new platform we do not support might fail only the last preflight check, but be fine with installing anyway and then manually fixing whatever made the check fail.

Member Author


While I agree that in general we'll want the validations to run by default, as is done today, not everyone will be happy running all validations if it makes the deployment significantly longer, and we may want to explicitly enable some of them instead. I'm thinking in particular of the extra checks that validate cloud performance or the ability to pull container images.

Agreed on the ability to bypass checks. Depending on how we implement validations, we'd either turn off validations or make their failures non-fatal.

Member


The pre-flight checks should add enough value that they belong in the default path. However, if some checks take a long time and you do not want them enabled by default, you can split the pre-flight checks into two parts: one part that runs every time, and some extra tests that can be opted into.

Member

@LalatenduMohanty LalatenduMohanty May 29, 2020


But ideally pre-flight checks should not take long to run. They also should not make any changes to the platform, and they add so much value that they should run every time the installer is run.


I guess the prechecks should be optional, but encouraged. If you have a testing environment where you usually deploy OCP, it is pointless to run those prechecks every time. So I think we should have a --skip-prechecks flag; when it is not present, the prechecks are executed. Also, the prechecks should be runnable by themselves, like openshift-install run-prechecks or something. My 2 cents.

Member Author


So right now, I'm leaning towards splitting the pre-flight validations into 2 buckets:

  • the core validations, enabled all the time - this includes all of the current validations in the installer.
  • the extra validations, enabled on demand, for more involved checks that require booting a node. The performance validations are a good example: they can take quite some time, especially on bare metal.

Now, I'm not sure what the best way to enable the extra validations is. It depends on whether we want to perform the installation after running the extra validations:

  • a --dry-run flag runs all the validations without performing the installation
  • a --with-extra-validations or similar flag enables the extra validations and performs the installation.

Member Author


I think I've captured this conversation in my latest patch. Please check.

an explicit action?
- If explicit action, how will the validations be enabled? E.g. adding
a `--dry-run` option to the installer vs. a separate subcommand or even
a separate binary.
Member


Having an explicit option like --dry-run to run the pre-flight checks would help, IMO. But I am not too opinionated about it, as the pre-flight check should be in the default code path.

DNS to reach the cloud's enpoints
- necessary cloud services are available
- storage performance

Member


Pre-flight checks should be idempotent in nature. We should mention this in the goals.

Member Author


Indeed, very good point.

succeed. This goal is relatively easy to implement for public cloud platforms
where we can make assumptions about services being available, or performance
meeting requirements, however this is not the case with private clouds where
each cloud is unique.
Member

@pierreprinetti pierreprinetti May 27, 2020


Public clouds might need fewer checks, but many would apply to them as well: think quota, or permissions. Some public cloud offerings even differ based on the geographical zone. I wouldn't exclude public clouds a priori from this enhancement.

Member


Agree, public cloud may be of lower priority, but there are many checks in place in OCM for OSD that could help bootstrap something in the AWS and GCP spaces. There are many variables driving limits/quotas that affect whether installation will be successful, including support plan and region. Explicitly checking limits up front is critical to success, and we strive to tell our customers it will fail due to account/project limits before we kick off the installer.

As an administrator of an OpenShift cluster, I would like to verify that my
OpenStack cloud meets all the performance, service and networking requirements
necessary for a successful deployment.

Member


As noted below, we want to spin up a VM mimicking a master node. To me this means that, either implicitly or explicitly, we are validating an install-config as well.

To be even clearer about this point, I'd add a reference to the aforementioned OpenShift admin wanting to check that their cluster's configuration is valid and appropriate for the target infrastructure.


I guess this is OK for OpenStack, but for bare metal it would be almost like executing a complete installation: boot one of the bare-metal masters/workers with a 'special image' that can execute the prechecks, run the checks, report back and tear down the bare-metal host, including managing the PXE setup and everything else required to boot it.

Member


(references to the bastion VM have been removed from this draft)

Member

@stbenjam stbenjam left a comment


I think the baremetal and OpenStack teams are trying to solve some very similar problems here! Deployment failures are hard to debug, so more validations are better. Our approach has been to check the quality of the data in the installer as much as possible, and then, when something fails that we couldn't validate from the installer, to make sure the user understands why.

I've elaborated a bunch on what that looks like in #328.

Keep in mind, I'm coming from the baremetal perspective, which is decidedly different from other on-premise platforms. In our case, the ephemeral validation host would be tough to provide. We could do a VM, but then that's just the same as a bootstrap host, IMHO.

So, to take the pull secret validation example: if the pull secret doesn't work, why not just make sure that information gets from the bootstrap node to the user in the installer output, and have the installer fail as early as possible?

Node with the master flavor on the user provisioned network:
- pull container images
- validate networking and cloud connectivity
- run fio
Member


What's fio?

Member Author


It's a benchmarking tool for storage, and the recommended tool to check that storage performance meets the requirements for etcd:
https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#system-requirements

- the tenant has adequate quota and the flavors' specifications are within the
recommended ranges.
- for user-provided networks, check the subnets have a DHCP server and valid
DNS to reach the cloud's enpoints
Member


What's a "user-provided" network? Does this mean machine networks? How do you validate DHCP works? What are cloud endpoints? The metadata URI?

Member


nit: endpoint

Member Author


"user-provided" refers to networks not created by the installer.

https://github.com/openshift/installer/blob/master/docs/user/aws/customization.md#installing-to-existing-vpc--subnetworks
https://github.com/openshift/installer/blob/master/docs/user/gcp/customization.md#installing-to-existing-networks--subnetworks

On the OpenStack platform, we also allow passing the machine network with machinesSubnet and attaching additional networks to the VMs with additionalNetworkIDs, for multi-NIC VMs.

The cloud endpoints refer to the URLs you use to talk to your cloud's API. I'll expand this section a bit, hopefully clarifying things.

recommended ranges.
- for user-provided networks, check the subnets have a DHCP server and valid
DNS to reach the cloud's enpoints
- necessary cloud services are available
Member


Which ones?

Member Author


This is platform-dependent.

- for user-provided networks, check the subnets have a DHCP server and valid
DNS to reach the cloud's enpoints
- necessary cloud services are available
- storage performance
Member


How is this tested and quantified?


#### On-premise deployments

As an administrator of an OpenShift cluster, I would like to verify that my
OpenStack cloud meets all the performance, service and networking requirements
Member


Is this enhancement request just for OpenStack? It's in the installer directory so I thought it was a generic framework other platforms might be able to use.

Member Author


Correct, the enhancement is for all platforms. This user story uses the OpenStack example because that's the platform I'm most familiar with, but it could be any other platform.


Node with the master flavor on the user provisioned network:
- pull container images
- validate networking and cloud connectivity
Member


How?

Comment on lines 131 to 132
Validation failure should result in actionable action, for example failure
message could provide pointers on how to fix the error.

@iamemilio iamemilio Jun 4, 2020


Suggested change
Validation failure should result in actionable action, for example failure
message could provide pointers on how to fix the error.
Validation failures should clearly indicate the root cause of an error, and if applicable, suggest a known solution to it.
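The suggested wording could translate into code along these lines; a hypothetical Go sketch, not installer API:

```go
package main

import "fmt"

// ValidationError is a made-up error type: a failed validation carries
// both the root cause and, when one is known, a suggested fix, so the
// installer can print an actionable message.
type ValidationError struct {
	Check      string
	Cause      string
	Suggestion string
}

func (e *ValidationError) Error() string {
	msg := fmt.Sprintf("validation %q failed: %s", e.Check, e.Cause)
	if e.Suggestion != "" {
		msg += "\nsuggested fix: " + e.Suggestion
	}
	return msg
}

func main() {
	err := &ValidationError{
		Check:      "quota",
		Cause:      "tenant quota allows only 2 instances, 6 required",
		Suggestion: "request a quota increase or reduce the machine pool size",
	}
	fmt.Println(err)
}
```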


#### Pre-provision a node

Node with the master flavor on the user provisioned network:


At what stage in the installation would this node get created? Between the networking and nodes getting stood up? Or before anything for the cluster gets stood up at all?


@iamemilio iamemilio Jun 17, 2020


I think we should make these checks run as a systemd unit on the bootstrap node, rather than spinning up a new node and creating a new install phase.

environment doesn't match the recommendation and we may want the installer to
go on with the deployment still.

In that case, the validation can be marked as optional, meaning failure of the


How do you intend to set up the system to mark tests optional? Command line flags? A YAML file?

@Fedosin
Contributor

Fedosin commented Jul 6, 2020

@mandre Thanks for your proposal. I think we can start implementing it!

My suggestions for the implementation:

For core validations (platform and machines) we can start using https://github.com/go-playground/validator
This is a powerful framework that allows checking struct fields with built-in and custom validators.
For non-config validations we will implement new assets, as is done for AWS and Google Cloud quotas: openshift/installer#3820
This should simplify the existing architecture and make it easier to develop new checks.

I don't like the idea of using a standalone VM for extra validations (i.e. perform the core validations, boot a VM, upload tests there, execute them, collect the results, destroy the VM, and, if everything is okay, begin provisioning the infrastructure). First, it would break the existing installation workflow. Second, we would need to create additional resources for the VM (a floating IP and an RHCOS image, for instance), which would lead to code duplication.
I think it's easier to reuse the bootstrap machine for extra validations. In general I see this as: we attach custom scripts to the bootstrap ignition file; they are executed before bootkube.sh and inform us about all potential problems. This allows us to use the existing infrastructure and not implement custom provisioners, since the bootstrap machine is equal to the masters. It also solves the problem of the additional resources. The drawback of this approach is that we perform the checks after the infrastructure is deployed. But I think that's okay, because they're done at the very beginning.

@mandre
Member Author

mandre commented Jul 6, 2020

@Fedosin Using the bootstrap VM to run the extra validations would indeed remove some of the complexity around provisioning resources for the checks, and it's a good idea worth exploring. However, I'd be careful saying that the bootstrap node is equivalent to the master nodes; depending on the platform this might not be the case. As I understand it, every platform is free to provision the bootstrap node however they like, so you can't assume, for instance, that the storage the bootstrap node gets is similar to what the master nodes get.
We could state that platforms that want to use this feature need to use nodes with similar specs for their bootstrap and master nodes.

Core validations are the ones we already know. They run every time, as it is
done today.

The extra validations will be enabled on demand via a flag when running the
Contributor


No new flags to the installer binary, please. The installer is configured using install-config.yaml and will continue to be.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2020
@pierreprinetti pierreprinetti force-pushed the validations branch 2 times, most recently from 8128788 to 86cf3c3 on December 1, 2020 10:19
@openshift-ci-robot openshift-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 1, 2020
@pierreprinetti
Member

Addressed comments and simplified to match the first implemented iteration.

  • Removed reference to the bastion VM
  • Removed reference to the validations being optional
  • Added a drawback: target infra required at create manifests

/remove-lifecycle rotten

@mandre PTAL

@pierreprinetti
Member

/retitle Installer pre-flight validations

@openshift-ci-robot openshift-ci-robot changed the title [WIP] Installer pre-flight validations → Installer pre-flight validations Dec 1, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2020
* Remove reference to the bastion VM
* Remove reference to the validations being optional

The installer already performs some validations:
- checks all required fields are set and that the data is in the right format
- some basic validation, such as networks do not overlap
Member


you can add to the list:

  • Quota validations
  • Flavor validations (where applicable)

But since the list isn't exhaustive, feel free to ignore my comment.

Member


What you say is true: we now have those. However, we implemented them in the context of this very enhancement, so I think it's fine to list them under "Goals", line 66.


However, this doesn't check that the environment is suitable to install OpenShift:
- pull secret is valid to fetch the container images
- the tenant has adequate quota and the flavors' specifications are within the
Member


  • quotas are checked in 4.7
  • flavors are checked in 4.6


Since pre-flight validations are only run at install time, and not on cluster
upgrade/downgrade, caution should be used when migrating existing validations
to this "pre-flight" framework.
Member


We'll have use-cases where we can't afford increasing the deployment time: ephemeral, best-effort deployments like CI, where we know that if there is a failure, no debugging will happen; another potential example is a small edge site.

We need a toggle to disable pre-flight-validations (probably in install-config?).

Member


We now have the unsupported flag OPENSHIFT_INSTALL_SKIP_PREFLIGHT_VALIDATIONS=1

The code for the framework and the validations will have unit tests.

In addition, we will enable the validations checks in CI in order to exercise
them and potentially highlight issues with the underlying CI infrastructure.
Member


It would be great to be able to run pre-flight validations separately, for 2 use-cases:

  • in CI, we would (most likely) disable pre-flight validations everywhere if they take too much time from the CI job and can cause timeout issues (this happened a lot in OpenStack CI).
  • a customer who wants to first check their infrastructure, then run the install

Therefore I propose that we have a subcommand, like openshift-install validate pre-flight, for example.

Member

@EmilienM EmilienM left a comment


A few comments inline, but this is a serious start I think. Thanks!

Comment on lines +67 to +68
The pre-flight validations should not alter the target infrastructure nor leave
behind any new resource.
Member


If cloud providers are in scope, consider that EC2 in a region isn't usable until you try to provision an instance. This initial provision will fail, and the region has to be approved for use by AWS. Depending on the support plan and the region, this can take minutes or days. I am not aware of an API being added to check this. Are "checks" like this in scope, where the implementation is to try creating some infra and then tear it down if successful?

@errordeveloper

errordeveloper commented Jan 27, 2021

Hey folks! This is an interesting proposal. As a user, I'd like to share my experience here. I just spent a week trying to understand why my attempts to create a cluster kept failing, and what I learnt is that I had been using a few invalid manifests. I'm working on getting the Cilium installer operator working with OpenShift, and that needs to be done by customising install-config.yaml and adding custom manifests. I had a few issues with my manifests, and the result was the bootstrap CVO being blocked until it timed out. I wish there was a way to pre-validate manifests before going on the long journey of creating a cluster and debugging the outcome. Bootkube logs are not easily accessible, and one needs to know that that's the place to look in certain situations. I got there in the end, but I wish there had been a much earlier indication of where the initial cluster provisioning failure occurred.
Perhaps this is something add-on authors like myself can handle; however, it's much more difficult for add-on users, so please do not underestimate the importance of addressing these usability issues.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign miciah after the PR has been reviewed.
You can assign the PR to them by writing /assign @miciah in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


We do not envision the need for the users to write additional validations. As
a consequence, the validations do not need to be loaded on startup and will be
compiled into the `openshift-install` binary. This may be revisited later.
Contributor


It may not be possible to run all desirable validations from the installer. For bare metal IPI, we cannot always assume the host running the installer can talk to the baseboard management controllers (BMCs) used to manage power on the hosts. The bootstrap VM does need access to those BMCs, so we could run a validation step there that verifies that the credentials for accessing the BMCs are correct and that the BMCs support the required features (especially virtual media support).

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 19, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jul 19, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 19, 2021

@openshift-bot: Closed this PR.

