---
title: KubeVirt-platform-provider
authors:
  - "@ravidbro"
reviewers:
approvers:
creation-date: 2020-07-14
last-updated: 2020-07-14
status: implementable
---

# KubeVirt-platform-provider

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Open Questions [optional]

## Summary

This document describes how [KubeVirt][kubevirt-website] becomes an infrastructure provider for OpenShift.

`KubeVirt` is a virtualization platform running as an extension of Kubernetes.

We want to create a tenant cluster on top of an existing OpenShift/Kubernetes cluster by having
KubeVirt create a virtual machine for every node in the tenant cluster (masters and workers),
along with the other OpenShift/Kubernetes resources that are needed, so that **users** (not admins)
of the infra cluster can create a tenant cluster as if it were an application running on the infra cluster.
We will implement all the components needed by the installer, as well as a cluster-api provider
for the machine-api, to allow post-install operations such as resizing the cluster.

## Motivation

- Achieve true multi-tenancy of OpenShift, where each tenant has a dedicated control plane
  and full control over its configuration, allowing each user to install a different version
  with a different configuration, such as permission settings and installed operators.

### Goals

- Provide a way to install OpenShift on KubeVirt infrastructure using
  the installer, i.e. an IPI installation. (install-time)
- Implement a cluster-api provider for scaling and managing the cluster
  nodes (used by IPI, useful for UPI, and for node management/fencing). (post-install)
- Provide multi-tenancy and isolation between the tenant clusters.
- Provide tenant clusters with different versions and different configurations, such as
  permission settings and installed operators.

### Non-Goals

- Implement a UPI flow.

## Proposal

This provider enables the OpenShift installer to provision VM resources in
KubeVirt infrastructure that will be used as the masters and workers of the cluster. It
will also create the bootstrap machine and the configuration needed to get
the initial cluster running, by supplying a DNS service and load balancing.

We want to approach deployment on KubeVirt as deployment on a cloud, similar to the
deployments we have on public clouds such as AWS and GCP, rather than on a virtualization platform:
the machines' network will be private, and the relevant endpoints will be exposed outside the
cluster using platform services where possible, or by pods deployed in the infrastructure cluster
that supply services such as DNS and load balancing.

We see two main network options for deployment over KubeVirt:

- Deploy the tenant cluster on the pods network and use OpenShift services and routes to provide
  DNS and load balancing. This option requires the infra cluster to be OpenShift running KubeVirt,
  not plain Kubernetes.
- Deploy the tenant cluster on a secondary network (using Multus) and provide DNS and load balancing
  in the same way as other KNI networking deployments, using HAProxy, CoreDNS and keepalived running
  on the tenant cluster VMs. See the [baremetal IPI networking doc][baremetal-ipi-networking].

> **Reviewer question:** Will we be able to support clusters using third-party network SDNs like Calico or similar?
>
> **Response:** We will be able to support only CNIs that are supported by CNV.

> **Reviewer question:** Doesn't the secondary network option break "the machine's network will be private" from the first paragraph?
>
> **Response:** I don't think so; a secondary Multus network with its own bridge and its own CIDR (e.g. using the whereabouts IPAM) will be a private network for the cluster.

### Implementation Details/Notes/Constraints [optional]

1. Survey

The installation starts as usual; right after the user supplies their public SSH key
and chooses `KubeVirt`, the installer asks for all the relevant details of the installation:
the **kubeconfig** for the infrastructure OpenShift, the **namespace**, **storageClass**,
**networkName (NAD)**, and other KubeVirt-specific attributes.
The installer will validate that it can communicate with the API; otherwise it will fail to proceed.

> **Reviewer question:** Have you thought through the negative flow and remediation when a failure occurs?
> Ideally the user knows exactly why this failed and how to diagnose, address and fix it, and we need to
> retain any input values the user chose or entered up to this point so they do not have to enter them
> again. The same concept should apply for all input values we require.
>
> **Response:** I don't see how this should be different from other platforms; the user will have the same
> experience as with any other IPI. The installer config file from any of the IPI installers is persisted
> and can be used to deploy the cluster again.

With that, the survey continues to the general cluster name, domain name, and
the rest of the non-KubeVirt-specific questions.
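
For illustration, the resulting `install-config.yaml` might carry a platform section along the following
lines. The fields under `platform.kubevirt` are not defined by this proposal, so the names and values
below are assumptions, shown only to make the survey inputs concrete:

```yaml
apiVersion: v1
baseDomain: example.com            # tenant cluster base domain
metadata:
  name: tenant1                    # tenant cluster name
platform:
  kubevirt:                        # hypothetical field names, illustrative only
    namespace: tenant1             # namespace in the infra cluster that will hold the VMs
    storageClassName: standard     # infra cluster storageClass used for boot volumes
    networkName: tenant1-nad       # NetworkAttachmentDefinition (secondary network option)
    infraClusterKubeconfigSecret: tenant1-infra-kubeconfig   # how the infra kubeconfig could be referenced
pullSecret: '...'
sshKey: '...'
```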

2. Resource creation - Terraform

Terraform uses the Kubernetes provider to create:

- a [DataVolume CR][data-volumes] with the RHCOS image (sketched below)

  *Note:* In a disconnected environment the user will need to provide a local image that the installer
  can upload to the namespace.
- secrets for the Ignition configs of the VMs
- 1 bootstrap machine
- 3 masters

Only for network option 1 (pods network):
- services and routes for DNS and LB

> **Reviewer question:** Isn't the bootstrap Ignition config going to be too large to fit into a secret?
>
> **Response:** Ignore this, I was getting my orders of magnitude mixed up.
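
As a sketch of the DataVolume mentioned in the list above, following the upstream CDI DataVolume CRD
(resource names and the image URL are placeholders, not values defined by this proposal):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: tenant1-rhcos              # illustrative name for the per-namespace base image
  namespace: tenant1
spec:
  source:
    http:
      url: https://example.com/rhcos.x86_64.qcow2.gz   # RHCOS image location (placeholder)
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
    storageClassName: standard
```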

3. Bootstrap

The bootstrap VM has a large Ignition config that is set by Terraform as a secret and is visible
as a secret on the infra OpenShift. KubeVirt boots the VM with that content provided as a ConfigDrive,
and bootstrapping begins when the `bootkube.service` systemd unit starts.

This process is described more thoroughly in the [installer overview document](https://github.com/openshift/installer/blob/37b99d8c9a3878bac7e8a94b6b0113fad6ffb77a/docs/user/overview.md#cluster-installation-process).

4. Masters bootstrap

Master VMs boot using a stub Ignition config that waits early in the Ignition service to load the
full Ignition config from a URL. That URL is
`https://<internal-api-vip>/config/master`, which is not available until
the **bootstrap** VM exposes it; it takes a few minutes until it does.

When the MachineConfigServer is available on the bootstrap VM, the masters pull their Ignition config,
boot up, join the tenant cluster as masters, and start scheduling pods.
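
A minimal sketch of such a stub (pointer) Ignition config, assuming the Ignition 3.x spec. The real
config generated by the installer also embeds the cluster's root CA (omitted here), and the MCS
listens on port 22623, as noted in the risks section below:

```json
{
  "ignition": {
    "version": "3.1.0",
    "config": {
      "merge": [
        { "source": "https://<internal-api-vip>:22623/config/master" }
      ]
    }
  }
}
```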

5. Workers bootstrap

After the masters are up and the control plane is running, we will scale the MachineSet so that
the machine-api-operator creates the workers.
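
A rough sketch of what such a worker MachineSet could look like. The machine-api MachineSet envelope is
standard, but the contents of `providerSpec.value` for a KubeVirt provider are not defined by this
document, so those fields are purely illustrative assumptions:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: tenant1-worker
  namespace: openshift-machine-api
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: tenant1-worker
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: tenant1-worker
    spec:
      providerSpec:
        value:
          # Hypothetical KubeVirt provider fields - illustrative only, not a defined API
          sourcePvcName: tenant1-rhcos              # base RHCOS image to clone for each worker
          storageClassName: standard                # infra cluster storageClass
          networkName: tenant1-nad                  # secondary network (Multus option)
          requestedMemory: 8Gi
          requestedCPU: 4
          ignitionSecretName: tenant1-worker-user-data
```

Scaling this MachineSet up or down would then drive the machine-api components to create or delete
worker VMs in the infra cluster namespace.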

### Risks and Mitigations

- Network

  - Pods network option

    - (OCP gap) The ports 22623/22624 that are used by the MCS are blocked on the
      pods network, preventing the nodes from pulling Ignition configs and updates.
    - (KubeVirt gap) Interface binding - currently the only supported binding on the pods
      network is masquerade, which means that all nodes are behind NAT, each VM
      behind the NAT of its own pod.
    - (OpenShift/KubeVirt gap) Static IP - OpenShift assumes that node IP addresses are static,
      but KubeVirt VMs change IP between restarts.

      > **Reviewer note:** On a related topic, some customers insist on allocating IPs to VMs through
      > static addressing; DHCP is not allowed on their production networks.
      >
      > **Response:** Is that something that is supported on other platforms?

  - Secondary network option (Multus)

    - With this approach the admin of the infra cluster will need to be involved in
      the creation of each new tenant cluster, since NADs need to be created and
      nmstate will probably also need to be used to create the topology on the hosts.
      In this proposal, we assume that the admin has created the namespace and all network resources
      before running the installer, and the created networkName (NAD) is an input to the installer.

- Storage

  The CSI driver for `KubeVirt` is not available yet.

  > **Reviewer note:** Dynamic storage provisioning should be part of the MVP; we previously made this
  > mistake with OCP on RHV IPI.
  >
  > **Response:** AFAIK, the KubeVirt CSI driver is planned for Feb 2021. This is not a blocker; we can
  > release this feature without the KubeVirt CSI driver.

  > **Reviewer note:** One of the other risks to call out is the etcd performance and latency
  > requirements. We've had issues where people deploy OCP clusters in virtual environments with
  > insufficient hardware and have all sorts of problems installing the clusters; worse, the install
  > may go fine, but the cluster goes unhealthy after a few days. We should discuss how to make this
  > easier to validate and troubleshoot.
  >
  > **Response:** Good point; I don't know how to solve it yet, especially when we are running on
  > bare metal and not on a public cloud.

## Design Details

- Namespaces

  For each tenant cluster we will create a namespace with the ClusterID.

  *Open question: should the namespace creation be done by the user or by the installer?*

  > **Reviewer note:** IPI implies we do everything. However, some users may have prescribed naming
  > schemes that we should conform to; I'd recommend we give them the option to specify a name, but
  > we create it.
  >
  > **Response:** Since we moved to the networking option of using Multus, we can no longer create the
  > namespace: the input to the installer is a NAD resource name, and that resource must exist in the
  > namespace before we start the installation.

- Images

  - Option 1 - For each namespace we will create a DataVolume (CDI CRD) with the RHCOS image, which will
    be cloned for each VM, masters and workers (see the sketch below).
  - Option 2 - We will use a URL and each VM will pull the image for itself, without the need for cloning.
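
For Option 1, the per-VM boot disk could be expressed as another DataVolume whose source is the base
RHCOS PVC created earlier; the names here are illustrative:

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: tenant1-master-0-boot
  namespace: tenant1
spec:
  source:
    pvc:
      namespace: tenant1
      name: tenant1-rhcos          # base RHCOS DataVolume/PVC created once per namespace
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
    storageClassName: standard
```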

- Network

#### Option 1 - Pods network

- Set the cluster baseDomain to `svc.cluster.local`, so that the services we create
  as load balancers will have the expected FQDN `<service-name>.<namespace>.svc.cluster.local`.
- Create VMs with one interface on the pods network.
- Create headless services for each VM to create DNS records for internal communication
  between the nodes.
- Create services for 'api' and 'api-int' as load balancers between the masters,
  with the MCS port (22623) and the API server port (6443) (see the sketch below).
- Set the ingress domain name for the default router as a subdomain of the ingress domain
  of the infra OCP.
- Create, in the underlying OCP, a route for each route in the provisioned cluster, with the
  hostname value of the route on the provisioned cluster.
  Alternatively, if the infra OCP supports wildcard routes, one route of
  type subdomain can be defined for all the routes to the provisioned cluster.
- Isolation will be achieved by creating network policies that allow traffic
  only between VMs (pods) that belong to the same provisioned cluster.
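
A sketch of the 'api' load-balancing Service and one headless per-VM Service for this option. The
selector labels are assumptions about how the tenant VMs (and therefore their virt-launcher pods)
would be labeled; they are not defined by this proposal:

```yaml
# Load-balancing Service for the tenant 'api' endpoint (API server and MCS)
apiVersion: v1
kind: Service
metadata:
  name: tenant1-api
  namespace: tenant1
spec:
  selector:
    tenant-cluster: tenant1        # illustrative labels placed on the master VMs/pods
    role: master
  ports:
    - name: api
      port: 6443
      targetPort: 6443
    - name: machine-config-server
      port: 22623
      targetPort: 22623
---
# Headless Service that gives a single node VM a stable in-cluster DNS name
apiVersion: v1
kind: Service
metadata:
  name: tenant1-master-0
  namespace: tenant1
spec:
  clusterIP: None
  selector:
    kubevirt.io/domain: tenant1-master-0   # assumed to match the virt-launcher pod of this VM
  ports:
    - name: api
      port: 6443
```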

#### Option 2 - Secondary network (Multus)

- Create VMs attached to the secondary network (NAD) that was configured (an example NAD is sketched below).
- Isolation is achieved by the secondary network; it is up to the admin to decide how to create the
  secondary networks, which can use different VLANs/VXLANs/etc.
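
An example NetworkAttachmentDefinition for such a secondary network, assuming a Linux bridge CNI with
the whereabouts IPAM plugin (the bridge name and CIDR are illustrative; the admin chooses the actual
topology):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: tenant1-nad
  namespace: tenant1
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "tenant1-net",
      "type": "bridge",
      "bridge": "br-tenant1",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.10.0/24"
      }
    }
```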

- Storage

  The VMs' boot volumes will be PVs allocated from the infra cluster.

  For PVs requested by pods running on the tenant cluster we have a few options:

#### Option 1 - Direct storage CSI

The provisioned cluster will use CSI to attach storage over the network to the VM guests.
This can be the OCS CSI driver, consuming storage from OCS installed on the infra
OpenShift as a tenant of OCS, or any other external storage.

#### Option 2 - KubeVirt CSI driver

Develop a CSI driver for the KubeVirt platform.

This driver should forward requests to the infra cluster to allocate a PV from the infra cluster
storageClass and attach it to the relevant VM, where the PV will be exposed to the guest
as a block device that the driver will attach to the requesting pods.

- Anti-affinity

  The VMs will be scheduled with anti-affinity rules between the masters and between the workers, so
  that we strive to spread the masters across the infra cluster nodes, and the same for the workers,
  to reduce the risk that an outage of one worker in the infra cluster causes a major failure for a tenant.

  > **Reviewer note:** We definitely need to do this, to prevent the cluster from becoming unrecoverable
  > in the event of the loss of two master nodes. We can use soft affinity, so things can come up in a
  > demo environment.
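
A sketch of how soft (preferred) anti-affinity could be expressed on a master VM. The KubeVirt
VirtualMachine API exposes the standard pod affinity fields on the VMI template; the labels and
resource values here are illustrative assumptions:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: tenant1-master-0
  namespace: tenant1
spec:
  running: true
  template:
    metadata:
      labels:
        tenant-cluster: tenant1    # illustrative labels, also used by the anti-affinity selector
        role: master
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # soft rule, still schedulable on small infra
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    tenant-cluster: tenant1
                    role: master
      domain:
        resources:
          requests:
            memory: 16Gi
        devices:
          disks:
            - name: boot
              disk:
                bus: virtio
      volumes:
        - name: boot
          dataVolume:
            name: tenant1-master-0-boot   # cloned boot DataVolume from the Images section
```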

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

> **Reviewer notes:** I think there should be a real test plan here. It should cover negative flow
> outcomes and recovery, with user and automated remediation. Upgrades of different versions of OCP
> will also be interesting: how far will we allow versions to drift, e.g. could we support an OCP 4.2
> tenant cluster alongside an OCP 4.10 cluster, all hosted on an OCP 4.8 cluster?
>
> **Response:** I don't see a reason why not; we are not relying on any feature/resource that doesn't
> exist in any OCP 4.x version.

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA`
- Deprecation

Clearly define what graduation means.

#### Examples

TODO

[maturity levels][maturity-levels].

> **Reviewer note:** I'd like to see what dev preview entails here.

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Approved review by the installer team

> **Reviewer note:** There is no real test plan above, so this also needs more detail.

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

##### Removing a deprecated feature

TODO
- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

TODO
If applicable, how will the component be upgraded and downgraded? Make sure this
is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade in order to make use of the enhancement?

### Version Skew Strategy

TODO
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
  in the kubelet? How does an n-2 kubelet without this feature available behave
  when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
  or CNI may require updating that component before the kubelet.

## Implementation History

Sep 2020 - Presented a fully working POC

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

Similar to the `Drawbacks` section, the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.

## Infrastructure Needed [optional]

- CI

  Running an end-to-end job is a must for this feature to graduate, and it is a
  non-trivial task. KubeVirt is not a cloud solution, and we need to provide a setup
  for the job invocation. We are starting with deploying a static OCP deployment on GCP
  as the infra cluster.

[baremetal-ipi-networking]: https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md
[kubevirt-website]: https://kubevirt.io/
[data-volumes]: https://github.com/kubevirt/containerized-data-importer#datavolumes