Add enhancement: IPI kubevirt provider #417

Closed
wants to merge 8 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
336 changes: 336 additions & 0 deletions enhancements/installer/kubevirt-ipi.md
---
title: KubeVirt-platform-provider
authors:
- "@ravidbro"
reviewers:

approvers:

creation-date: 2020-07-14
last-updated: 2020-07-14
status: implementable
---

# KubeVirt-platform-provider

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [OpenShift/docs]

## Open Questions [optional]

## Summary

This document describes how [KubeVirt][kubevirt-website] becomes an infra provider for OpenShift.

`KubeVirt` is a virtualization platform running as an extension of Kubernetes.

We want to create a tenant cluster on top of an existing OpenShift/Kubernetes cluster by creating
a KubeVirt virtual machine for every node of the tenant cluster (masters and workers),
together with the other OpenShift/Kubernetes resources needed, so that **users** (not admins) of the infra cluster
can create a tenant cluster as if it were an application running on the infra cluster.
We will implement all the components needed for the installer, as well as a cluster-api provider
for the machine-api, to allow post-install resizing of the cluster.


## Motivation

- Achieve true multi-tenancy of OpenShift, where each tenant has a dedicated control plane
and full control over its configuration, allowing each user to install a different version
with a different configuration, such as permission settings and installed operators.

### Goals

- Provide a way to install OpenShift on KubeVirt infrastructure using
the installer, i.e. an IPI installation. (install-time)
- Implement a cluster-api provider for scaling and managing the cluster
nodes (used by IPI, and useful for UPI and for node management/fencing). (post-install)
- Provide multi-tenancy and isolation between the tenant clusters.
- Provide tenant clusters with different versions and different configurations, such as permission settings
and installed operators.

### Non-Goals
- Implement UPI flow.

## Proposal

This provider enables the OpenShift Installer to provision VM resources in
KubeVirt infrastructure that will be used as the masters and workers of the cluster. It
will also create the bootstrap machine and the configuration needed to get
the initial cluster running, by supplying a DNS service and load balancing.

We want to approach deployment on KubeVirt as a cloud deployment, similar to the
deployments we have on public clouds such as AWS and GCP, rather than as a virtualization platform:
the machines' network will be private, and the relevant endpoints will be exposed outside the
cluster with platform services where we can, or with pods deployed in the infrastructure cluster that supply services
such as DNS and load balancing.

We see two main network options for deployment over KubeVirt:

- Deploy the tenant cluster on the pods network and use OpenShift services and routes to provide
DNS and load balancing. This option requires the infra cluster to be OpenShift running KubeVirt, rather than plain Kubernetes.
- Deploy the tenant cluster on a secondary network (using Multus) and provide DNS and load balancing
the same way as other KNI networking deployments, with HAProxy, CoreDNS, and keepalived running on the
tenant cluster VMs. See the [baremetal ipi networking doc][baremetal-ipi-networking].

**Review comment:** Will we be able to support clusters using third-party network SDNs like Calico or similar?

**Author reply:** We will be able to support only CNIs that are supported by CNV. Right now, the only supported option is Multus with the bridge CNI. In theory, every Multus CNI that supplies DHCP and routes in and out should work too.

**Review comment (Contributor):** Doesn't this break "the machine's network will be private" from the first paragraph?

**Author reply:** I don't think so; a secondary Multus network with its own bridge and its own CIDR (e.g. using the whereabouts IPAM) will be a private network for the cluster.



### Implementation Details/Notes/Constraints [optional]

1. Survey

The installation starts, and right after the user supplies their public SSH key
and chooses `KubeVirt`, the installer asks for all the relevant details
of the installation: the **kubeconfig** of the infrastructure OpenShift, the **namespace**, the **storageClass**,
the **networkName (NAD)**, and other KubeVirt-specific attributes.
The installer validates that it can communicate with the API; otherwise it will fail to proceed.

**Review comment:** Have you thought through the negative flow and remediation when a failure occurs? Ideally the user knows exactly why it failed and how to diagnose, address, and fix it. We also need to retain any input values the user chose or entered up to that point so they do not have to enter them again. The same concept should apply to all input values we require.

**Reply (Contributor):** @robyoungky I don't see how this should be different from other platforms. In the end the user will have the same user experience as with any other IPI. (While I agree this information can probably be improved, it is out of this enhancement's scope.)

**Review comment:** The installer config file from any of the IPI installers is persisted and can be used to deploy the cluster again.


With that, the survey continues to the general cluster name, domain name, and
the rest of the non-KubeVirt-specific questions.
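
To make the survey inputs concrete, the sketch below shows what the KubeVirt-specific platform section collected by the survey could look like as installer Go types; the type and field names are illustrative assumptions, not the final installer API.

```go
// Hypothetical sketch of the KubeVirt-specific install-config fields gathered
// by the survey; names are illustrative, not the final installer API.
package kubevirt

// Platform holds the parameters the survey asks for when the user selects KubeVirt.
type Platform struct {
	// InfraClusterKubeconfig is the kubeconfig of the infrastructure OpenShift cluster.
	InfraClusterKubeconfig string `json:"infraClusterKubeconfig"`
	// Namespace is the pre-created namespace on the infra cluster that will hold
	// all tenant-cluster resources (VMs, DataVolumes, Secrets, Services).
	Namespace string `json:"namespace"`
	// StorageClass is the infra-cluster storage class used for the VM boot volumes.
	StorageClass string `json:"storageClass"`
	// NetworkName is the NetworkAttachmentDefinition (NAD) the VMs attach to when
	// the secondary-network (Multus) option is used.
	NetworkName string `json:"networkName"`
}
```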

2. Resource creation - Terraform

Terraform uses the Kubernetes provider to create:

- A [DataVolume CR][data-volumes] with the RHCOS image

  *Note:* In a disconnected environment the user will need to provide a local image that the installer
  can upload to the namespace.
- Secrets for the Ignition configs of the VMs (see the sketch below)
- 1 bootstrap machine
- 3 masters

Only on network option 1 (pods network):
- Services and routes for DNS and LB

**Review comment (Contributor):** Isn't the bootstrap Ignition config going to be too large to fit into a Secret?

**Reply (Contributor):** Ignore this. I'm getting my orders of magnitude mixed up.
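
As an illustration of the Ignition Secrets mentioned above (the object shape, not the actual Terraform code), here is a minimal Go sketch; the naming scheme, label key, and data key are assumptions.

```go
package resources

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ignitionSecret wraps a rendered Ignition config in the Secret object that the
// Terraform kubernetes provider would create on the infra cluster. The naming
// scheme and the "userdata" key are illustrative assumptions.
func ignitionSecret(namespace, clusterID, role string, ignition []byte) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterID + "-" + role + "-ignition",
			Namespace: namespace,
			Labels:    map[string]string{"tenantcluster": clusterID},
		},
		Data: map[string][]byte{"userdata": ignition},
	}
}
```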

3. Bootstrap

The bootstrap VM has a large Ignition config, set by Terraform as Secrets that are visible
on the infra OpenShift. KubeVirt boots the VM with that content as a ConfigDrive, and the
bootstrapping begins when the `bootkube.service` systemd service starts.

This process is described more thoroughly in the [installer overview document](https://github.com/OpenShift/installer/blob/37b99d8c9a3878bac7e8a94b6b0113fad6ffb77a/docs/user/overview.md#cluster-installation-process).

4. Masters bootstrap

Master VMs boot using a stub Ignition config and wait early in the Ignition service
to load their full Ignition config from a URL. That URL is
`https://<internal-api-vip>/config/master`, which is not available until
the **bootstrap** VM exposes it; it takes a few minutes until it does.

When the MachineConfigServer is available on the bootstrap VM, the masters pull their Ignition config,
boot up, join the tenant cluster as masters, and start scheduling pods.
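
For illustration, a minimal sketch of how such a stub ("pointer") Ignition config could be built; the spec version and TLS details depend on the RHCOS release, so treat the exact fields as assumptions.

```go
package ignition

import "encoding/json"

// stubIgnition builds the small pointer Ignition config that the master VMs boot
// with: it only tells Ignition to fetch the real config from the Machine Config
// Server once the bootstrap VM exposes it. A real stub also embeds the cluster's
// root CA under ignition.security.tls.certificateAuthorities; that part is
// omitted here for brevity.
func stubIgnition(mcsURL string) ([]byte, error) {
	stub := map[string]interface{}{
		"ignition": map[string]interface{}{
			"version": "3.1.0", // example spec version; must match the RHCOS image in use
			"config": map[string]interface{}{
				"merge": []map[string]string{
					{"source": mcsURL}, // e.g. https://<internal-api-vip>:22623/config/master
				},
			},
		},
	}
	return json.Marshal(stub)
}
```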

5. Workers bootstrap

After the masters and the control plane are up, we will scale the MachineSet so that
the machine-api-operator creates the workers.


### Risks and Mitigations

- Network

  - Pods network option
    - (OCP gap) The ports 22623/22624 that are used by the MCS are blocked on the
      pods network, which prevents the nodes from pulling their Ignition config and updates.
    - (KubeVirt gap) Interface binding - currently the only supported binding on the pods
      network is masquerade, which means that all nodes are behind NAT, each VM
      behind the NAT of its own pod.
    - (OpenShift/KubeVirt gap) Static IP - OpenShift assumes that node IP addresses are static,
      but KubeVirt VMs change IP between restarts.

**Review comment:** On a related topic, some customers insist on allocating IPs to VMs through static addressing; DHCP is not allowed on their production networks.

**Author reply:** Is that something that is supported on other platforms? I don't see a way in the API/YAML to supply an IP per node. Also, it contradicts the concept of the machine-API with MachineSets: a MachineSet is cattle, not a pet; it has a 'replicas' property, which is a number, and when you modify that number VMs are created or destroyed. I don't see how that can work with static IPs. What am I missing here?

  - Secondary network option (Multus)
    - With this approach the admin of the infra cluster will need to be involved in
      the creation of each new tenant cluster, since NADs need to be created and
      nmstate will probably also need to be used to create the topology on the hosts.
      In this proposal, we assume that the admin created the namespace and all network resources
      before running the installer, and that the created networkName (NAD) is the input for the installer
      (a sketch of such a NAD follows).
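
For illustration, a minimal sketch of such a NetworkAttachmentDefinition, built as an unstructured object: it uses the bridge CNI with the whereabouts IPAM mentioned in the discussion above, while the bridge name and CIDR are illustrative assumptions.

```go
package resources

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// tenantNetworkNAD sketches the NAD the infra-cluster admin would create up front
// for a tenant cluster: a bridge CNI attachment with the whereabouts IPAM plugin
// providing a private CIDR. Bridge name and range are illustrative.
func tenantNetworkNAD(namespace, name string) *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "k8s.cni.cncf.io/v1",
		"kind":       "NetworkAttachmentDefinition",
		"metadata":   map[string]interface{}{"name": name, "namespace": namespace},
		"spec": map[string]interface{}{
			"config": `{
  "cniVersion": "0.3.1",
  "type": "bridge",
  "bridge": "br-tenant",
  "ipam": {
    "type": "whereabouts",
    "range": "192.168.100.0/24"
  }
}`,
		},
	}}
}
```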


- Storage

  A CSI driver for `KubeVirt` is not available yet.

**Review comment:** One of the other risks to call out is the etcd performance and latency requirements. We've had issues where people deploy OCP clusters in virtual environments with insufficient hardware, and they have all sorts of problems installing the OCP clusters. Worse, the cluster install may go fine, but the cluster goes unhealthy after a few days. I'm not sure what the right answer is, but we should have a discussion about how we can make this easier to validate and troubleshoot.

**Author reply:** Good point. I don't know how to solve it, especially when we are running on bare metal and not on a public cloud.

**Review comment:** Dynamic storage provisioning should be part of the MVP. We previously made this mistake with OCP on RHV IPI.

**Author reply:** AFAIK, the KubeVirt CSI driver is planned for Feb 2021, but I guess you know better on this effort.

**Author reply:** If this is a blocker even for pre-GA versions then we have a problem.

**Review comment:** This is not a blocker; we can release this feature without KubeVirt CSI.



## Design Details

- Namespaces

For each tenant cluster we will create a namespace with the ClusterID.

*Open question: should the namespace creation be done by the user or by the installer?*

**Review comment:** IPI implies we do everything. However, some users may have prescribed naming schemes that we should conform to. I'd recommend we give them the option to specify a name, but we create it.

**Author reply:** Since we moved to the Multus networking option, we are no longer able to create the namespace, because the installer's input is a NAD resource name and that resource must already exist in the namespace before we start the installation.


- Images

- Option 1 - For each namespace we will create a DataVolume (CDI CRD) with the RHCOS image, which will be cloned for
each VM, masters and workers (see the sketch below).
- Option 2 - We will use a URL and each VM will pull the image for itself, without the need for cloning.
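
A minimal sketch of the Option 1 DataVolume, built as an unstructured object; the exact CDI schema should be checked against the CDI version in use, and the name, size, and URL are illustrative assumptions.

```go
package resources

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// rhcosDataVolume sketches the base DataVolume (CDI CRD) that imports the RHCOS
// image into the tenant cluster's namespace; the VM boot disks would then be
// cloned from it. Field values are illustrative.
func rhcosDataVolume(namespace, storageClass, imageURL string) *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "cdi.kubevirt.io/v1beta1",
		"kind":       "DataVolume",
		"metadata":   map[string]interface{}{"name": "rhcos-base", "namespace": namespace},
		"spec": map[string]interface{}{
			// In a disconnected environment the source would point at a locally uploaded image instead.
			"source": map[string]interface{}{"http": map[string]interface{}{"url": imageURL}},
			"pvc": map[string]interface{}{
				"storageClassName": storageClass,
				"accessModes":      []interface{}{"ReadWriteOnce"},
				"resources": map[string]interface{}{
					"requests": map[string]interface{}{"storage": "32Gi"},
				},
			},
		},
	}}
}
```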

- Network

#### Option 1 - Pods network
- Set the cluster baseDomain to svc.cluster.local so that the services we create
as LoadBalancers will have the expected FQDN `<service-name>.<namespace>.svc.cluster.local`.
- Create VMs with one interface on the pods network.
- Create a headless service for each VM to create DNS records for internal communication
between the nodes (see the sketch below).
- Create services for 'api' and 'api-int' as load balancers across the masters,
with the MCS port (22623) and the API server port (6443).
- Set the ingress domain name of the default router as a subdomain of the ingress domain
of the infra OCP.
- In the underlying OCP, create a route for each route in the provisioned cluster, with
the same hostname value as the route on the provisioned cluster.
Alternatively, if the infra OCP supports wildcard routes, then one route of
type subdomain can be defined for all the routes to the provisioned cluster.
- Isolation will be achieved by creating network policies that allow traffic
only between VMs (pods) that belong to the same provisioned cluster.
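
A minimal sketch of the Services described above: a headless Service per VM for node DNS records, and an 'api' Service load-balancing the API server and MCS ports across the masters. The label selectors are illustrative assumptions.

```go
package resources

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// nodeHeadlessService sketches the per-VM headless Service that gives each tenant
// node a stable DNS record (<vm-name>.<namespace>.svc.cluster.local) on the pods
// network. The label key is illustrative.
func nodeHeadlessService(namespace, vmName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: vmName, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: DNS resolves to the VM pod IP
			Selector:  map[string]string{"kubevirt.io/vm": vmName},
		},
	}
}

// apiService sketches the 'api' / 'api-int' Service that load-balances the API
// server (6443) and the Machine Config Server (22623) across the masters.
func apiService(namespace, clusterID string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "api", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"tenantcluster-role": clusterID + "-master"},
			Ports: []corev1.ServicePort{
				{Name: "api", Port: 6443, TargetPort: intstr.FromInt(6443)},
				{Name: "machine-config", Port: 22623, TargetPort: intstr.FromInt(22623)},
			},
		},
	}
}
```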

#### Option 2 - Secondary network (Multus)
- Create VMs attached to the secondary network (NAD) that was configured.
- Isolation is achieved by the secondary network itself; it is up to the admin to decide how to create the
secondary networks, which can use different VLANs/VXLANs/etc.

- Storage

The VMs' boot volumes will be PVs allocated from the infra cluster.

For PVs requested by pods running on the tenant cluster we have a few options:

#### Option 1 - Direct storage CSI
The provisioned cluster will use CSI to attach storage over the network directly to the VM guests.
This can be the OCS CSI driver, consuming storage from OCS installed on the infra
OpenShift as an OCS tenant, or any other external storage.

#### Option 2 - KubeVirt CSI driver
Develop a CSI driver for the KubeVirt platform.

This driver should forward requests to the infra cluster to allocate a PV
from the infra cluster storageClass and attach it to the relevant VM, where the PV will be exposed to the guest
as a block device that the driver then attaches to the requesting pods.

- Anti-affinity

The VMs will be scheduled with anti-affinity rules between the masters and between the workers, so that
we strive to spread the masters across the infra cluster nodes (and likewise for the workers)
to reduce the risk that an outage of one worker node in the infra cluster causes a major failure for a tenant.

**Review comment:** We definitely need to do this, to prevent the cluster from being unrecoverable in the event of the loss of two master nodes. We can use soft affinity, so things can come up in a demo environment.
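
A minimal sketch of such a rule, using preferred (soft) pod anti-affinity as suggested in the comment above so that a small demo environment can still schedule all VMs; the label keys are illustrative assumptions.

```go
package resources

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// softAntiAffinity sketches the preferred (soft) pod anti-affinity that would be
// placed on the VM pods of a given role ("master" or "worker") so the scheduler
// spreads them across infra-cluster nodes. Label keys are illustrative.
func softAntiAffinity(clusterID, role string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							"tenantcluster": clusterID,
							"role":          role,
						},
					},
					TopologyKey: "kubernetes.io/hostname", // spread across infra nodes
				},
			}},
		},
	}
}
```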

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:

- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

**Review comment:** Negative flow outcomes and recovery, with user and automated remediation.

**Review comment:** Upgrades across different versions of OCP will be interesting. How far will we allow versions to drift? For example, could we support an OCP 4.2 tenant cluster alongside an OCP 4.10 cluster, all hosted on an OCP 4.8 cluster?

**Author reply:** I don't see a reason why not; we are not relying on any feature/resource that doesn't exist in every OCP 4.x version.

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

**Review comment (Contributor):** I think there should be a test plan here.

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA`
- Deprecation

Clearly define what graduation means.

#### Examples

TODO

[maturity levels][maturity-levels].
**Review comment (Contributor):** I'd like to see what dev preview entails here.


##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Approved review by the installer team.

**Review comment (Contributor):** There is no real test plan above, so it feels like this also needs more detail.

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

##### Removing a deprecated feature

TODO
- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

TODO
If applicable, how will the component be upgraded and downgraded? Make sure this
is in the test plan.


Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to make use of the enhancement?

### Version Skew Strategy

TODO
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.

## Implementation History

Sep 2020 - Presented a fully working POC

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

Similar to the `Drawbacks` section the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.

## Infrastructure Needed [optional]

- CI
Running an end-to-end job is a must for this feature to graduate, and it is a
non-trivial task. KubeVirt is not a cloud solution, and we need to provide a setup
for job invocations. We are starting by deploying a static OCP deployment on GCP
as the infra cluster.


[baremetal-ipi-networking]: https://github.com/OpenShift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md
[kubevirt-website]: https://kubevirt.io/
[data-volumes]: https://github.com/kubevirt/containerized-data-importer#datavolumes