---
title: infrastructure-external-platform-type
authors:
reviewers:
approvers:
api-approvers:
creation-date: 2022-09-06
last-updated: 2022-10-25
tracking-link:
see-also:
---
- Enhancement is implementable
- Design details are appropriately documented from clear requirements
- Test plan is defined
- Operational readiness criteria is defined
- Graduation criteria for dev preview, tech preview, GA
- User-facing documentation is created in openshift-docs
To reduce the amount of time Red Hat engineers spend directly involved in third-party engagements to add new platforms to the OpenShift product, this enhancement describes how we will add a new "External"
platform type that will allow third parties to self-serve and integrate with OpenShift without the need to modify
any core payload components and without the need for direct involvement of OpenShift engineers.
Historically, the Kubernetes project contained a significant amount of code for handling integration with various cloud providers (AWS, GCP, vSphere). These pieces were developed and released as part of Kubernetes core. However, over time, the community concluded that this approach does not scale well and should be changed. The community put a lot of energy into introducing a mechanism that allows cloud providers and community members to build, develop, test, and release provider-specific components independently of the Kubernetes core.
With regard to OpenShift's integrations with cloud providers, at the moment much of the provider-specific behaviour is encoded in the OpenShift codebase within API definitions, operator logic, and installer program code. This creates quite a lot of obstacles for RH partners and community members in their attempts to add new cloud providers to OpenShift, and makes RH engineering involvement practically unavoidable.
Lately, there have been several initiatives around making OpenShift more composable and flexible. For example, Capabilities selection and Platform Operators are significant steps in this direction. However, despite having these powerful instruments, it is still necessary to land code in the OpenShift codebase for the technical enablement of a new cloud provider, which might be hard or nearly impossible for external contributors.
Imagine some regional or special-purpose cloud has created an infrastructure platform that resembles AWS but has its own API that is different from AWS's. They would like to give their users the best OpenShift experience possible, but integrating their code into a Red Hat release is not possible for them. Using the "External" platform, capabilities, and platform operators, they can still deliver this functionality by creating their own Cloud Controller Managers, CSI drivers, network topology operators, Machine API controllers, and OpenShift configurations. This allows these cloud providers to supply the best OpenShift experience while also developing their components without being tied to Red Hat's internal processes or keeping a fork of a significant part of the OpenShift code base.
- As a cloud-provider affiliated engineer / platform integrator / RH partner, I want to have a mechanism to signal OpenShift's built-in operators about additional cloud-provider specific components so that I can inject my own platform-specific controllers into OpenShift to improve the integration between OpenShift and my cloud provider.

  We are aware of examples of supplementing OKD installations with custom machine-api controllers; however, users experience a lot of difficulties on this path due to the necessity of, essentially, reverse engineering, manual management of generic MAPI controllers, and so on.
- As a cloud provider whose platform is not integrated into OpenShift, I want to have the Cloud Controller Manager for my infrastructure running in OpenShift from the initial install. Having a platform type that allows for the addition of operators or components which perform platform-specific functionality would help me to create better integrations between OpenShift and my infrastructure platform.
- remove the necessity to make changes in "openshift/api", "openshift/library-go" and dependent infrastructure-related components during basic cloud-provider integration with OpenShift
- make the cloud provider integration process more accessible and simpler for external contributors as well as for RH engineers
- provide an overview of projected changes to affected components that will be planned for a later phase of development
- introduce a somewhat "neutral" platform type, which would serve as a signal about an underlying generic cloud-provider presence
- describe concrete delivery mechanisms for cloud-provider specific components
- cover new infrastructure provider enablement from the RHCOS side
- describe specific changes for each affected component, aside from the no-op reaction to the new "External" platform type
Our main goal is to simplify the integration process for new cloud providers in OpenShift/OKD. To achieve this, we propose adding a new "External" PlatformType, along with the respective Spec and Status structures, in openshift/api.
Such a generic platform type will serve as a signal for built-in OpenShift operators about an underlying cloud-provider presence. Related PlatformSpec and PlatformStatus type structures will serve as a source of generic configuration information for the OpenShift-specific operators.
Having that special platform type will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace and/or supplement the core Red Hat components.
Splitting the project into phases would be natural to make the implementation process smoother. A reader can find the proposed phase breakdown in OCPPLAN-9429.
This document intends to describe the initial phases of this project. The proposed initial course of action:
- Update "openshift/api" with adding "External" PlatformType
- Ensure that all Red Hat operators tolerate the "External" platform and treat it the same as the "None" platform
Next phase which is out of the scope for this EP:
- Modify operators for doing specific things when seeing the "External" platform. It will be briefly described in the Affected Components below. However, this should be addressed in separate EPs on a per-component basis.
There are several topics in this area that it would be wise to defer to upcoming phases, namely:
- Define missing capabilities and their concrete behaviour, for example, add a "capability" for machine-api
- Precisely define the reaction of the operators listed below for the "External" platform type
- Define and document concrete mechanisms for supplementing a cluster with provider-specific components at installation time (CCM, MAPI controller, DNS controller)
- Research the necessity for engagement and api extension for "on-prem"-like in-cluster network infrastructure for the "External" platform. This will depend on demand from partner cloud providers and their cloud capabilities (the presence of a load-balancer-like concept, for example).
At the moment, the Infrastructure resource serves as the primary source of information about the underlying infrastructure provider and provider-specific parameters, specifically through the PlatformSpec and PlatformStatus parts of the resource.
Because PlatformSpec and PlatformStatus are defined as "discriminated unions" and require the platform type to be encoded within "openshift/api" beforehand, the initial technical enablement of a new cloud provider requires significant involvement and effort from Red Hat engineers and is effectively impossible without Red Hat engineering engagement.
Since a lot of infrastructure related components (such as CCCMO, Machine API operator, Machine Config Operator, Windows Machine Config Operator and so on) require information about, at least, the presence of an underlying cloud provider, the "None" platform does not fit well as a signal in such a situation.
A special, built-in, and somewhat generic platform type that will signal about the presence of an underlying infrastructure platform without platform-specific details will help to reduce the number of changes across OCP repositories and simplify the initial integration work for non-Red Hat contributors. Such a new ability will allow smaller / regional cloud providers to build and test their integration with OpenShift with considerably less effort.
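For illustration, the following is a minimal, hedged sketch (not taken from any particular operator) of the pattern described above: built-in operators read the cluster-scoped Infrastructure resource and branch on the platform type, so any provider that is not part of the predefined enum falls through to an "unknown platform" path.

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// describePlatform shows the typical decision pattern: the platform type drives
// whether provider-specific logic is enabled, disabled ("None"), or unknown.
func describePlatform(infra *configv1.Infrastructure) string {
	switch infra.Status.PlatformStatus.Type {
	case configv1.AWSPlatformType, configv1.GCPPlatformType, configv1.AzurePlatformType:
		return "built-in cloud platform: provider-specific logic is enabled"
	case configv1.NonePlatformType:
		return "no cloud platform: operators assume no provider is present"
	default:
		// Everything else is effectively invisible to built-in operators today,
		// which is the gap the proposed "External" platform type is meant to close.
		return "unknown platform: most operators no-op or degrade"
	}
}

func main() {
	infra := &configv1.Infrastructure{
		Status: configv1.InfrastructureStatus{
			PlatformStatus: &configv1.PlatformStatus{Type: configv1.NonePlatformType},
		},
	}
	fmt.Println(describePlatform(infra))
}
```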
Additionally, there are difficulties that are present today due to having a predefined list of platforms. A few examples:
- No defined mechanism to set the `--cloud-provider=external` arg for kubelet/KCM/apiserver without merges and further revendoring of "openshift/api"; at the moment, decision-making here is tied to the PlatformType.
- No way to extend machine-api and deliver a new provider without merges to the "openshift/machine-api-operator" and "openshift/api" repos.
In the future, to some degree, an approach based on capabilities selection might help to solve the issue and provide an option for supplementing some platform-dependent components. However, some integral parts of OpenShift cannot be disabled and still require a signal about the underlying platform, for example, KCM and MCO, with respect to enabling an external cloud controller manager.
One possible example of the interaction between capabilities and the "External" platform type is the MachineAPI and the Machine API Operator. At this point, there is no use in running the MachineAPI if there is no machine controller (which is heavily cloud-provider dependent). When the platform type is set to "External" and the machine-api capability is enabled, the Machine API Operator will deploy only generic, cloud-independent controllers (such as the machine-healthcheck, machineset, and node-link controllers). The platform-specific components would be deployed through a separate mechanism. Such behaviour will simplify initial cloud-platform enablement and reduce the need to reverse engineer and replicate work that has already been done by Red Hat engineers.
This section enumerates OpenShift's components and briefly elaborates on the future plans around this proposal. During the initial implementation we must ensure that all OpenShift components treat the "External" platform in the same way as "None" in order to ensure a consistent baseline across OpenShift components.
In the future, we will need to change the behavior of OpenShift components on a case-by-case basis so that they can function harmoniously with supplemental provider-specific components from an infrastructure provider or, if a component manages something else (e.g. kubelet, KCM), adjust its behaviour (for example, set the `--cloud-provider=external` arg for the kubelet).
Specific component changes will be described in detail within separate enhancement documents on a per-component basis.
A significant part of the code around PlatformType handling lives in "openshift/library-go".
This code is responsible for decisions around kubelet and KCM flags: specifically, the IsCloudProviderExternal function is used for those decisions (within the MCO and KCMO respectively). The same code is also used for decision-making about CCM operator engagement.
This piece should be changed to react appropriately to the "External" platform type. During the first phases, it will need to behave the same as in the case of the "None" platform type. In upcoming phases, it will need to respect additional parameters from the "External" platform configuration (see the API Extensions section).
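As a rough illustration only (the actual IsCloudProviderExternal helper in library-go has a different signature that has changed across releases), the intended phase 1 behaviour can be sketched as follows: the "External" platform takes the same branch as "None", and only in later phases would it consult the ExternalPlatformStatus described in the API Extensions section.

```go
package cloudprovider

import (
	configv1 "github.com/openshift/api/config/v1"
)

// isCloudProviderExternal is an illustrative stand-in for library-go's
// IsCloudProviderExternal helper, showing only the decision this EP cares about.
func isCloudProviderExternal(platform *configv1.PlatformStatus) bool {
	if platform == nil {
		return false
	}
	switch platform.Type {
	case configv1.ExternalPlatformType:
		// Phase 1: identical to "None".
		// Later phases: return true only when
		// platform.External.CloudControllerManager.State == CloudControllerManagerExternal.
		return false
	case configv1.NonePlatformType:
		return false
	default:
		// Built-in platforms keep their existing, per-platform answers
		// (details omitted from this sketch).
		return false
	}
}
```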
Just as the MCO manages the kubelet, the Kube Controller Manager Operator manages KCM (kube-controller-manager) deployments.
Historically, the Kube Controller Manager was home to cloud-specific control loops. This logic is engaged by setting the appropriate flags on the KCM executable, such as:
```
...
--cloud-provider=azure
--cloud-config=/etc/kubernetes/cloud.conf
...
```
To engage an external Cloud Controller Manager, no `--cloud-provider` flag should be set for the KCM executable.
In the context of this EP, no particular changes will be needed in the operator itself; the changes made in library-go, with a further dependency update, should suffice.
Currently, the MCO sets kubelet flags based on the PlatformStatus and PlatformType. The `--cloud-provider` flag in particular is crucial for Cloud Controller Manager engagement within the cluster.
Initially, the new "External" platform should be treated similarly to PlatformType "None" by the MCO, which should not set any cloud-specific flags for the kubelet.
Then, down the road (during phase 3), the MCO is expected to use the "External" platform type and its configuration as a signal about the underlying platform and cloud controller manager presence, and operate accordingly.
For an explicit signal about the necessity to set the `--cloud-provider=external` flag for the kubelet, we will use the `CloudControllerManager` field of the `ExternalPlatformStatus`, which is described in the API Extensions section below.
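A minimal sketch of that decision, assuming the ExternalPlatformStatus proposed in the API Extensions section (this is not the MCO's actual template code): only an explicit "External" CCM state results in `--cloud-provider=external` being rendered.

```go
package mco

import (
	configv1 "github.com/openshift/api/config/v1"
)

// kubeletCloudProviderFlag returns the cloud-provider flag the MCO would render
// into the kubelet configuration for the "External" platform type.
func kubeletCloudProviderFlag(platform *configv1.PlatformStatus) string {
	if platform == nil || platform.Type != configv1.ExternalPlatformType {
		// Other platform types keep their existing handling (not shown here).
		return ""
	}
	ext := platform.External
	if ext != nil && ext.CloudControllerManager.State == configv1.CloudControllerManagerExternal {
		// An external CCM is expected: kubelet must defer node initialization to it.
		return "--cloud-provider=external"
	}
	// State "None" or omitted: behave exactly like the "None" platform, no flag is set.
	return ""
}
```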
The Windows Machine Config Operator configures Windows instances into nodes, enabling Windows container workloads to run within OCP clusters.
Its behaviour relies on the Machine Config Operator, since the Windows-related machinery uses MCO-rendered ignition files (there are plans to switch this to use MachineConfigs) to extract and then use some kubelet flags, including the `--cloud-provider` one.
Initially, the new "External" platform should be treated similarly to PlatformType "None" by the WMCO.
It is important to note that the WMCO has specific behaviour for the `None` platform type: it sets the `--node-ip` flag to a user-provided IP address, which requires additional configuration.
For other supported platform types, the WMCO relies on the MachineAPI to figure out IP addresses, or does not set this flag at all.
This seems acceptable for the initial phase, but during phase 3 this behaviour should be revised and changed to provide users with additional knobs to configure it or, perhaps, to check MachineAPI engagement when making the decision.
The Cloud Controller Manager Operator (CCCMO) is responsible for deploying the platform-specific Cloud Controller Manager, as well as for handling a number of OpenShift-specific peculiarities (such as populating proxy settings for CCMs, syncing credentials, and so on).
Code from library-go is used for decision-making about operator engagement. If library-go's IsCloudProviderExternal function indicates that the cloud provider is external but the operator encounters a platform it is not aware of, it will go into the 'Degraded' state.
During the first phases of the "External" platform type enablement, this operator should simply be disabled. This might be done with changes within library-go and further dependency updates or, better, by adding a respective check within the operator itself.
In the future, when the delivery mechanism for CCMs is defined, the operator might be engaged for deploying a user-provided cloud controller manager; however, this is a subject for upcoming design work.
The Machine API Operator is responsible for deploying and maintaining the set of machine-api related controllers, such as:
- machineset controller
- nodelink controller
- machine health check controller
- machine controller
From the list above, only the "machine controller" is cloud-provider dependent; however, for now, the Machine API Operator won't deploy anything if it encounters "None" or an unrecognized platform type.
In the future, "External" platform type, in conjunction with an enabled capability, would serve as a signal for Machine Api Operator to deploy only provider-agnostic controllers, which would leave room for the user to supplement only the machine controller and not to reverse engineer and replicate everything that MAO does.
The Cluster Storage Operator goes into a no-op state if it encounters PlatformType "None" or an unknown PlatformType.
At this point, nothing requires storage to be present during cluster installation, and storage (CSI) drivers might be supplemented later, via OLM or some other way, as a day-two operation.
No particular changes in regards to the "External" platform type introduction are expected there.
The Cloud Credential Operator is responsible for handling the `CredentialsRequest` custom resource.
`CredentialsRequest`s allow OpenShift components to request credentials for a particular cloud provider.
On unsupported platforms, the operator goes into no-op mode, which is technically mostly equivalent to the "Manual" mode.
The Cloud Credential Operator uses an "actuator pattern" and, in theory, might be extended in the future to react to the new "External" platform type and allow users to supplement their own platform-specific credentials management logic.
During the initial enablement phases of the "External" platform type, no specific actions will be needed there, since the CCO goes into no-op mode when it encounters an unrecognized platform.
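For context, the actuator pattern mentioned above boils down to a per-platform interface roughly like the hypothetical sketch below (type and method names are illustrative, not the operator's exact interface); a future extension for the "External" platform would amount to letting a third party register such an implementation.

```go
package cco

import "context"

// CredentialsRequest is a stand-in for the operator's CredentialsRequest type.
type CredentialsRequest struct{}

// Actuator is a hypothetical, simplified version of the per-cloud interface the
// Cloud Credential Operator uses to reconcile CredentialsRequests.
type Actuator interface {
	// Create provisions cloud credentials for the request.
	Create(ctx context.Context, cr *CredentialsRequest) error
	// Update reconciles existing credentials with the requested permissions.
	Update(ctx context.Context, cr *CredentialsRequest) error
	// Delete removes credentials when the request is deleted.
	Delete(ctx context.Context, cr *CredentialsRequest) error
	// Exists reports whether credentials for the request are already present.
	Exists(ctx context.Context, cr *CredentialsRequest) (bool, error)
}
```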
For the image registry, the storage backend configuration decision is platform specific. With the "None" platform type, the CIRO goes into a no-op state, which means that no registry will be deployed in such a case. At the moment, the image registry is configured with EmptyDir storage for unknown platform types.
Image Registry storage options will be configured to use PVC-backed or external storage systems (such as Ceph or S3-compatible object storage) as a day-two operation.
For now, it seems that no particular action for the "External" platform type is needed within the Image Registry Operator, since we're providing enough possibilities to customize Image Registry storage backend.
Within the Ingress Operator, the PlatformType affects two things:
- Choosing the EndpointPublishingStrategy, which is `HostNetworkStrategyType` for the "None" and unknown PlatformTypes.
- Creating a DNS provider on some platforms. This logic does not engage for the "None" or unknown PlatformTypes. The DNS provider is used to create a wildcard DNS record for ingress when using `LoadBalancerServiceStrategyType`; it is not used for `HostNetworkStrategyType`.
With regard to the EndpointPublishingStrategy, the cluster admin can configure the Cluster Ingress Operator to use `LoadBalancerServiceStrategyType` as a day-two operation.
The operator itself creates `Service` objects with the correct provider-specific annotations; the actual handling of such objects happens in a provider-specific cloud controller manager.
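As an illustration of that day-two change (assuming the operator.openshift.io/v1 IngressController API; a sketch, not a prescribed procedure), the admin would switch the publishing strategy to a load-balancer service that the partner's CCM then fulfils:

```go
package ingress

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// loadBalancerPublishingStrategy builds the endpoint publishing strategy an
// admin could set on an IngressController as a day-two operation; the resulting
// Service of type LoadBalancer is then handled by the provider-specific CCM.
func loadBalancerPublishingStrategy() *operatorv1.EndpointPublishingStrategy {
	return &operatorv1.EndpointPublishingStrategy{
		Type: operatorv1.LoadBalancerServiceStrategyType,
		LoadBalancer: &operatorv1.LoadBalancerStrategy{
			// External scope exposes the ingress load balancer outside the cluster.
			Scope: operatorv1.ExternalLoadBalancer,
		},
	}
}
```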
Within the Cluster Network Operator, several things depend on the PlatformType:
- The component called CNCC (cloud-network-config-controller) contains the majority of the platform-specific logic. The CNO makes the decision about CNCC deployment based on the PlatformType value. The CNCC itself is mainly responsible for attaching/detaching private IP addresses to VMs associated with Kubernetes nodes, which requires interaction with cloud-provider APIs. Currently, the CNCC is deployed on the GCP, Azure, AWS, and OpenStack platforms; other platforms, such as IBM or Alibaba, do not engage the CNCC for the moment.
- There are also several platform-specific hacks, such as the access restriction to a metadata service, but these are not entirely connected with the operator itself and are more CNI-plugin specific.
For phase 1 of this project, there seems to be no particular action or API knobs needed regarding the addition of the "External" platform type; we just need to ensure that the CNO is non-reactive to the "External" platform type and behaves the same as in the "None" platform case. In the future, we might want to make the CNO more tolerant of partner CNCC implementations and design a way to configure platform-specific CNI behaviour.
During phase 1, the proposed changes are intended to have no effect, and the "External" platform type should be handled the same as the "None", so no specific user interaction is expected.
A new optional constant of the `PlatformType` type will be added to "openshift/api".
```go
const (
	...
	// ExternalPlatformType represents a generic infrastructure provider.
	// Provider-specific components should be supplemented separately.
	ExternalPlatformType PlatformType = "External"
	...
)
```
Additionally, the respective external platform spec and status should be added to the infrastructure resource.
```go
// ExternalPlatformSpec holds the desired state for the generic External infrastructure provider.
type ExternalPlatformSpec struct {
	// PlatformName holds an arbitrary string representing the cloud provider name, expected to be set at installation time.
	// It is intended for informational purposes only and is not expected to be used for decision-making.
	// +kubebuilder:default:="Unknown"
	// +optional
	PlatformName string `json:"platformName,omitempty"`
}

type PlatformSpec struct {
	...
	// External contains settings specific to the generic External infrastructure provider.
	// +optional
	External *ExternalPlatformSpec `json:"external,omitempty"`
}
```
For the sake of consistency, status should be introduced as well, and will define the settings set at installation time:
```go
type CloudControllerManagerState string

const (
	// Cloud Controller Manager is enabled and expected to be supplied.
	// Signaling that kubelets and other CCM consumers should use --cloud-provider=external.
	CloudControllerManagerExternal CloudControllerManagerState = "External"
	// Cloud Controller Manager is not expected to be supplied.
	// Signaling that kubelets and other CCM consumers should not set the --cloud-provider flag.
	CloudControllerManagerNone CloudControllerManagerState = "None"
)

// CloudControllerManagerStatus holds the state of Cloud Controller Manager (a.k.a. CCM or CPI) related settings
// +kubebuilder:validation:XValidation:rule="(has(self.state) == has(oldSelf.state)) || (!has(oldSelf.state) && self.state != \"External\")",message="state may not be added or removed once set"
type CloudControllerManagerStatus struct {
	// state determines whether or not an external Cloud Controller Manager is expected to
	// be installed within the cluster.
	// https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
	//
	// When set to "External", new nodes will be tainted as uninitialized when created,
	// preventing them from running workloads until they are initialized by the cloud controller manager.
	// When omitted or set to "None", new nodes will not be tainted
	// and no extra initialization from the cloud controller manager is expected.
	// +kubebuilder:validation:Enum="";External;None
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="state is immutable once set"
	// +optional
	State CloudControllerManagerState `json:"state"`
}

// ExternalPlatformStatus holds the current status of the generic External infrastructure provider.
// +kubebuilder:validation:XValidation:rule="has(self.cloudControllerManager) == has(oldSelf.cloudControllerManager)",message="cloudControllerManager may not be added or removed once set"
type ExternalPlatformStatus struct {
	// CloudControllerManager contains settings specific to the external Cloud Controller Manager (a.k.a. CCM or CPI).
	// When omitted or set to "None", new nodes will not be tainted
	// and no extra initialization from the cloud controller manager is expected.
	// +optional
	CloudControllerManager CloudControllerManagerStatus `json:"cloudControllerManager"`
}

type PlatformStatus struct {
	...
	// External contains settings specific to the generic External infrastructure provider.
	// +optional
	External *ExternalPlatformStatus `json:"external,omitempty"`
}
```
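Putting the pieces together, a hedged example of what an installer or partner tooling might set on the Infrastructure resource for a hypothetical provider (the name "ACME Cloud" is purely illustrative):

```go
package example

import (
	configv1 "github.com/openshift/api/config/v1"
)

// externalInfrastructure shows the spec and status fields this enhancement adds,
// populated as they might look at installation time for an external provider
// that ships its own cloud controller manager.
func externalInfrastructure() (configv1.PlatformSpec, configv1.PlatformStatus) {
	spec := configv1.PlatformSpec{
		Type: configv1.ExternalPlatformType,
		External: &configv1.ExternalPlatformSpec{
			// Informational only; not used for decision-making.
			PlatformName: "ACME Cloud",
		},
	}
	status := configv1.PlatformStatus{
		Type: configv1.ExternalPlatformType,
		External: &configv1.ExternalPlatformStatus{
			CloudControllerManager: configv1.CloudControllerManagerStatus{
				// Kubelets will be started with --cloud-provider=external and
				// new nodes will be tainted until the partner CCM initializes them.
				State: configv1.CloudControllerManagerExternal,
			},
		},
	}
	return spec, status
}
```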
There is a concern that random customers will use this feature out of context and create a support burden. However, using such a platform type in conjunction with capabilities selection could give us a clear signal about how to properly triage third-party created replacements for Red Hat components.
The current approach of having a statically defined platform list is already in place, quite transparent, and battle-hardened. Changing it by adding a new, far less specific platform type would mean a significant step away from the current design patterns in this area.
Also, the future support strategy is not completely clear, given our plans around the enablement of third-party platform-specific components whose codebase would be mostly out of our control.
- Should we explicitly communicate that the "External" platform is something that we do not support yet?
Q: Should we gate the "External" platform addition behind a feature gate by generating a separate CRD for TechPreviewNoUpgrade (TPNU) clusters?
A: There seems to be a soft consensus that we do not need to gate these changes behind Tech Preview if it is not necessary. Because operators are intended to react to the "External" platform the same way as to the "None" one during the first phases, gating these API extensions does not seem needed.
Related discussion: 1
Q: Should we invest in preparing CI workflows that perform UPI installation with the "None"/"External" platform types on AWS or GCP, or would the existing vSphere-based workflows be enough?
A: Adding a job on one additional cloud platform to ensure that the "External" platform type works as intended looks reasonable now, but we should mainly rely on in-repo functional tests on a per-component basis and avoid creating new e2e workflows as much as possible.
Q: Do we need an API for the MAPI components similar to the proposed CCM one to allow users to choose how the MAPI components are deployed?
A: We do not see an absolute necessity to add such API knobs right now. For the moment, the combination of the "External" platform and an enabled MAPI capability looks like a sufficient signal. By the nature of the capabilities mechanism (i.e., no MAO deployment, no MAPI CRDs), if the Machine API Operator is running and detects the "External" platform, that is a signal to it that someone is going to run a machine controller.
Given that the capabilities "API" is already added and in use, introducing additional knobs which would interfere with it would mean more code changes and the need to establish a mechanism for communicating with users (i.e., what should handle such an API if MAPI is disabled by the capabilities mechanism).
However, if during upcoming phases we discover a need to add an API field specifically to help a user install their machine API controller, we could update this EP, or a new enhancement document could be created to provide a proper exploration of the interaction with capabilities and other operation modes.
Related discussions: 1
During the first stages, we must ensure that OpenShift built-in operators whose behaviour depends on the platform type treat the "External" platform type the same way as "None". To achieve this, the existing infrastructure and mechanisms employed for exercising topologies with the "None" platform type might be used.
At the time of writing, the only workflow that tests the "None" platform is upi-conf-vsphere-platform-none-ovn. Based on this workflow, a new one using the "External" platform type, with a respective set of jobs, should be added to ensure that we do not disrupt the current OpenShift operation.
However, given that vSphere is the only platform where we exercise cluster installation with platform "None" specified, it would be beneficial to develop additional workflows using a provider with better capacity and performance (AWS or GCP, perhaps).
Given that `infrastructure.config.openshift.io` has already been released and has to be supported during the whole major release lifecycle, this change will be GA from the beginning.
However, component behaviour changes might be gated behind feature gates. Specific graduation criteria for each component should be defined separately in the respective enhancement documents.
N/A
N/A
N/A
This enhancement does not introduce any additional components, it just describes changes in "openshift/api".
The PlatformType is expected to be set once during cluster installation and is not expected to be changed, so adding a new platform type should not affect the upgrade/downgrade process for existing clusters.
Since the PlatformType is set as a day-zero operation during cluster installation and is not expected to be changed during the cluster lifecycle, version skew should not be a concern here.
It should be no different from other PlatformTypes. Scalability, API throughput and availability should not be affected. In the first phases, it is expected to work the same way as for PlatformType "None".
- OpenShift built-in operators would not recognize the new PlatformType and would go Degraded or crash. This would break new cluster installations with the new "External" platform type.
During the first phases of this effort, support procedures should be no different from clusters with PlatformType set to "None".
Due to priority changes for the Installer Flexibility effort, the decision was made to temporarily remove `CloudControllerManagerSpec` from the `Infrastructure` resource.
OCP 4.13 will not contain the `CloudControllerManagerSpec` part.
- Remove `CloudControllerManagerSpec` from the `ExternalPlatformSpec`, because this setting is not meant to be updated after cluster installation
- Introduce `CloudControllerManagerStatus` to hold the setting, which will be defined at installation time
- `CloudControllerManagerStatus` is available behind the `TechPreviewNoUpgrade` feature set
- Because of the feature set, the CEL validation on `ExternalPlatformStatus` couldn't be defined; we must set it once the feature set is lifted
Leave things as is, i.e., encode every new cloud platform statically into "openshift/api" from the beginning of a technical enablement process.
We could proceed to leverage PlatformType "None"; however, there are some difficulties that need to be worked around somehow, some examples:
- No defined mechanism to set the `--cloud-provider=external` arg for kubelet/KCM/apiserver without merges and further revendoring of "openshift/api"; at the moment, decision-making here is tied to the PlatformType.
  - This might be solved by creating additional documentation and mechanisms for propagating and controlling additional flags on kubelet/KCM/apiserver.
  - A possible approach is proposed in this EP, but an alternative API / mechanism is possible.
- No way to extend machine-api and deliver a new provider without merges to the "openshift/machine-api-operator" and "openshift/api" repos.
  - This might be solved by teaching the MachineAPI operator to deploy platform-independent components regardless of the platform type.
- Additional CI workflows and a set of CI jobs that exercise OpenShift installation with the new "External" platform.