## Refactor Cloud Provider out of Kubernetes Core

As Kubernetes has grown tremendously, it has become difficult for the different cloud providers (currently 7) to make changes and iterate quickly. Moreover, the cloud providers are constrained by the Kubernetes build/release life cycle. This proposal aims to move towards a Kubernetes code base where cloud-provider-specific code lives outside the core repository, in "official" repositories maintained by the cloud providers themselves.

### 1. Current use of Cloud Provider

The following components have cloud provider dependencies:

1. kube-controller-manager
2. kubelet
3. kube-apiserver

#### Cloud Provider in Kube-Controller-Manager

The kube-controller-manager runs many controller loops:

- nodeController
- volumeController
- routeController
- serviceController
- replicationController
- endpointController
- resourceQuotaController
- namespaceController
- deploymentController
- etc.

Among these controller loops, the following are cloud provider dependent:

- nodeController
- volumeController
- routeController
- serviceController

The nodeController uses the cloud provider to check whether a node has been deleted from the cloud. If the cloud provider reports a node as deleted, the controller immediately deletes the node from Kubernetes. This check removes the need to wait a fixed amount of time before concluding that an inactive node is actually dead.

The volumeController uses the cloud provider to create, delete, attach, and detach volumes. For instance, the logic for provisioning, attaching, and detaching an EBS volume resides in the AWS cloud provider, and the volumeController uses this code to perform its operations.

The routeController configures routes for hosts in the cloud provider.

The serviceController maintains a list of currently active nodes and is responsible for creating and deleting load balancers in the underlying cloud.

#### Cloud Provider in Kubelet

The kubelet has the following cloud provider dependencies:

- Finding the cloud node name of the host that the kubelet is running on, in order to:
  1. Obtain the config map for the kubelet, if one already exists
  2. Uniquely identify the current node using a nodeInformer
  3. Instantiate a reference to the current node object
- Finding the InstanceID, ProviderID, ExternalID, and zone info of the node object while initializing it (see the sketch after this list)
  > **Review comment:** It's not clear how this information is going to be discovered in the proposal below. The zone and ID are generally needed in order to call the cloud provider to get more information about the node.
  >
  > **Review comment:** Anything the node needs to bootstrap could be fed in as flags/config locally. Example: hostname override. There is a class of variables that need to be known to get things going and are exceedingly unlikely to change, but are also difficult to express generically. Node setup could get the region/zone from metadata or an API or a config volume (OpenStack) or SMBIOS or wherever, and pass it to the kubelet. We don't need an internal API for this.
- Periodically polling the cloud provider to find out whether the node has any new IP addresses associated with it
- Setting a condition that makes the node unschedulable until cloud routes are configured
- Allowing the cloud provider to post-process DNS settings
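
The sketch below illustrates the kind of lookups behind that initialization step. The interfaces are simplified stand-ins for the `Instances` and `Zones` portions of `cloudprovider.Interface` (the real method signatures differ), and the label keys are only meant to show where the answers end up.

```go
// Package kubeletinit sketches the cloud lookups the kubelet performs while
// initializing its Node object. Simplified for illustration; not kubelet code.
package kubeletinit

// Instances is a trimmed-down stand-in for the cloud provider's Instances interface.
type Instances interface {
	CurrentNodeName(hostname string) (string, error) // cloud node name of this host
	InstanceID(nodeName string) (string, error)      // provider-specific instance ID
	ExternalID(nodeName string) (string, error)      // externally visible ID
}

// Zones is a trimmed-down stand-in for the cloud provider's Zones interface.
type Zones interface {
	GetZone() (zone string, region string, err error)
}

// InitialNodeLabels shows how the cloud answers feed the initial Node object:
// identity fields plus the zone/region labels the scheduler relies on.
func InitialNodeLabels(hostname string, inst Instances, zones Zones) (map[string]string, error) {
	nodeName, err := inst.CurrentNodeName(hostname)
	if err != nil {
		return nil, err
	}
	instanceID, err := inst.InstanceID(nodeName)
	if err != nil {
		return nil, err
	}
	zone, region, err := zones.GetZone()
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"kubernetes.io/hostname":                   nodeName,
		"failure-domain.beta.kubernetes.io/zone":   zone,
		"failure-domain.beta.kubernetes.io/region": region,
		"example.com/instance-id":                  instanceID, // illustrative key only
	}, nil
}
```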

#### Cloud Provider in Kube-apiserver

Finally, in the kube-apiserver, the cloud provider is used to transfer SSH keys to all of the nodes and, within an admission controller, to set labels on persistent volumes.

### 2. Strategy for refactoring Kube-Controller-Manager

In order to create a 100% cloud-independent controller manager, the controller-manager will be split into multiple binaries:

1. Cloud-dependent controller-manager binaries
2. Cloud-independent controller-manager binaries - this is the existing `kube-controller-manager` that ships with Kubernetes releases

The cloud-dependent binaries will run the loops that rely on a cloud provider, as a Kubernetes system service. The remaining controllers will run in the cloud-independent controller manager.

The decision to run entire controller loops, rather than only the small parts that rely on the cloud provider, was made because it keeps the implementation simple. Otherwise, the shared data structures and utility functions would have to be disentangled and carefully separated to avoid concurrency issues. Among other things, this approach prevents code duplication and improves development velocity.

Note that the controller loop implementations will continue to reside in the core repository. Each takes a cloudprovider.Interface as an input to its constructor. A vendor-maintained cloud-controller-manager binary can link these controllers in, and they serve as a reference implementation of the controllers.
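
As a rough illustration of that wiring, here is a hedged sketch of a vendor-maintained binary. `CloudProvider` is a local stand-in for the real `cloudprovider.Interface`, and the `routeController` shown here is only a placeholder for the shared controller loops; none of these names are from the actual code base.

```go
// Sketch of a vendor-maintained cloud-controller-manager: the vendor supplies
// the provider implementation, the controller loops come from shared code.
package main

import (
	"log"
	"time"
)

// CloudProvider stands in for cloudprovider.Interface; the real interface
// also exposes Instances(), Zones(), Routes(), LoadBalancer(), etc.
type CloudProvider interface {
	ProviderName() string
}

// exampleCloud is a hypothetical vendor implementation.
type exampleCloud struct{}

func (exampleCloud) ProviderName() string { return "examplecloud" }

// routeController is a placeholder for a shared, cloud-dependent controller
// loop that lives in the core repository and receives the provider in its
// constructor.
type routeController struct {
	cloud CloudProvider
}

func newRouteController(cloud CloudProvider) *routeController {
	return &routeController{cloud: cloud}
}

// Run reconciles until stop is closed; a real loop would call into c.cloud.
func (c *routeController) Run(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(10 * time.Second):
			log.Printf("reconciling routes via provider %q", c.cloud.ProviderName())
		}
	}
}

func main() {
	stop := make(chan struct{})
	// The vendor injects its provider; everything else is shared controller code.
	newRouteController(exampleCloud{}).Run(stop)
}
```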

There are four controllers that rely on cloud-provider-specific code: the node controller, service controller, route controller, and attach/detach controller. Copies of each of these controllers have been bundled together into one binary. This cloud-dependent binary registers itself as a controller and runs the cloud-specific controller loops with the user agent "external-controller-manager".

The routeController and serviceController are entirely cloud specific, so it is straightforward to move these two controller loops out of the cloud-independent binary and into the cloud-dependent binary.

The nodeController does a lot more than just talk to the cloud. It performs the following operations:

1. CIDR management
2. Monitoring node status
3. Node pod eviction

> **Review comment:** Do you plan to break up the NodeController?

While monitoring node status, if the status reported by the kubelet is either `ConditionUnknown` or `ConditionFalse`, the controller checks whether the node has been deleted from the cloud provider. If it has, the controller deletes the node object without waiting for the `monitorGracePeriod` amount of time. This is the only operation that needs to be moved into the cloud-dependent controller manager.
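
The following is a minimal sketch of just that cloud-dependent branch, using illustrative local interfaces rather than the real controller types: the existence check against the cloud, and the immediate deletion of the node object when the instance is gone.

```go
// Package nodecheck sketches the cloud-dependent step of node monitoring.
// The interfaces are illustrative stand-ins, not the real controller code.
package nodecheck

import (
	"fmt"
	"log"
)

// NodeCondition mirrors the ready-condition states mentioned above.
type NodeCondition string

const (
	ConditionTrue    NodeCondition = "True"
	ConditionFalse   NodeCondition = "False"
	ConditionUnknown NodeCondition = "Unknown"
)

// CloudInstances stands in for the part of the cloud provider the check
// needs: whether the backing instance still exists.
type CloudInstances interface {
	InstanceExists(nodeName string) (bool, error)
}

// NodeDeleter stands in for the API call that removes the Node object.
type NodeDeleter interface {
	DeleteNode(nodeName string) error
}

// MaybeDeleteNode is invoked only when the kubelet-reported condition is
// False or Unknown; if the cloud no longer knows the instance, the node
// object is deleted immediately instead of waiting for monitorGracePeriod.
func MaybeDeleteNode(name string, ready NodeCondition, cloud CloudInstances, api NodeDeleter) error {
	if ready == ConditionTrue {
		return nil
	}
	exists, err := cloud.InstanceExists(name)
	if err != nil {
		return fmt.Errorf("checking node %s in cloud: %v", name, err)
	}
	if exists {
		return nil // fall back to the normal grace-period handling
	}
	log.Printf("node %s no longer exists in the cloud; deleting node object", name)
	return api.DeleteNode(name)
}
```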

Finally, the attachDetachController is tricky: it is not simple to disentangle it from the controller-manager. It will therefore be addressed with Flex Volumes (discussed in a separate section below).

### 3. Strategy for refactoring Kubelet

The majority of the kubelet's calls to the cloud are made during the initialization of the Node object. The other uses are configuring routes (in the case of GCE), scrubbing DNS, and periodically polling for IP addresses.

All of the above steps except Node initialization can be moved into a controller. Specifically, IP address polling and route configuration can be moved into the cloud-dependent controller manager.

Scrubbing DNS, after discussion with @thockin, was found to be redundant, so it can be disregarded. It is being removed.

Finally, Node initialization needs to be addressed, and this is the trickiest part. Pods will be scheduled even on uninitialized nodes, which can lead to pods being scheduled in incompatible zones and other strange errors. Therefore, an approach is needed where the kubelet can create a Node but mark it as "NotReady", and some asynchronous process can later update it and mark it as ready. This is now possible because of the concept of taints.

> **Review comment:** @mml this could be a solution to the problem we talked about in our meeting on Wednesday (that it is possible for pods to be scheduled to nodes that aren't ready/properly configured yet).

This approach requires the kubelet to be started with known taints. These make the node unschedulable until they are removed. The external cloud controller manager will asynchronously update the node objects and remove the taints, as sketched below.
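
A minimal sketch of that hand-off follows, using simplified local types instead of the real Kubernetes API objects. The taint key is illustrative, and the real flow would of course go through the Kubernetes API rather than mutating an in-memory struct.

```go
// Package taintflow sketches the taint hand-off between the kubelet and the
// external cloud controller manager. Types are simplified for illustration.
package taintflow

// Taint mirrors the shape of a Kubernetes taint.
type Taint struct {
	Key    string
	Effect string // e.g. "NoSchedule"
}

// Node is a trimmed-down Node object: just the taints plus the cloud-specific
// fields the external controller manager is responsible for filling in.
type Node struct {
	Name       string
	ProviderID string
	Labels     map[string]string
	Taints     []Taint
}

// uninitializedTaint is an illustrative key; the concrete key is an
// implementation detail of the cloud-controller-manager work.
const uninitializedTaint = "example.cloudprovider.kubernetes.io/uninitialized"

// RegisterNode is the kubelet side: create the Node already tainted, so the
// scheduler keeps workloads off it until the cloud controller acts.
func RegisterNode(name string) *Node {
	return &Node{
		Name:   name,
		Labels: map[string]string{},
		Taints: []Taint{{Key: uninitializedTaint, Effect: "NoSchedule"}},
	}
}

// InitializeNode is the cloud-controller-manager side: asynchronously fill in
// the cloud-specific fields, then remove the taint to admit workloads.
func InitializeNode(n *Node, providerID, zone, region string) {
	n.ProviderID = providerID
	n.Labels["failure-domain.beta.kubernetes.io/zone"] = zone
	n.Labels["failure-domain.beta.kubernetes.io/region"] = region

	kept := n.Taints[:0]
	for _, t := range n.Taints {
		if t.Key != uninitializedTaint {
			kept = append(kept, t)
		}
	}
	n.Taints = kept
}
```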

> **Review comment:** How does this interact with static pods, e.g. kube-proxy?
>
> **Review comment:** AFAIK taints are not comprehended by the kubelet, so static pods would start.
>
> **Review comment:** I wouldn't count on that behavior. Any pods that want to run when a NoExecute taint is present will need a toleration.

### 4. Strategy for refactoring Kube-ApiServer

The kube-apiserver uses the cloud provider for two purposes:

1. Distributing SSH keys - this can be moved to the cloud-dependent controller manager
2. The admission controller for PVs - this can be refactored using the taints approach used for the kubelet

> **Review comment:** See also the admission extension proposal: #132

### 5. Strategy for refactoring Volumes

Volumes need cloud providers, but they only need *specific* cloud providers. The majority of the volume management logic resides in the controller manager, and these controller loops need to be moved into the cloud-controller-manager. The cloud-controller-manager also needs a mechanism to read its initialization parameters from the cloud config. This can be done via config maps, as sketched below.
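
As a hedged sketch of that mechanism, the cloud-controller-manager could take a `--cloud-config` path (mirroring the existing controller-manager flag) with a ConfigMap mounted at that path; the mount path and pod wiring below are assumptions, not part of this proposal.

```go
// Sketch: the cloud-controller-manager reading its provider configuration
// from a file path that a ConfigMap is mounted at. Paths are illustrative.
package main

import (
	"flag"
	"log"
	"os"
)

func main() {
	cloudConfig := flag.String("cloud-config",
		"/etc/kubernetes/cloud-config/cloud.conf", // assumed ConfigMap mount point
		"path to the cloud provider configuration file")
	flag.Parse()

	data, err := os.ReadFile(*cloudConfig)
	if err != nil {
		log.Fatalf("reading cloud config: %v", err)
	}
	// Provider-specific code parses its own format (INI, YAML, ...) from here.
	log.Printf("loaded %d bytes of cloud config", len(data))
}
```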

There is an entirely different approach to refactoring volumes: Flex Volumes. There is an ongoing effort to move all of the volume logic from the controller-manager into plugins called Flex Volumes. In the Flex Volumes world, all of the vendor-specific code will be packaged in a separate binary as a plugin. After discussion with @thockin, this was deemed the best approach to remove all cloud provider dependencies for volumes from the Kubernetes core.

### 6. Deployment, Upgrades and Downgrades

This change will add new binaries to the list of binaries required to run Kubernetes. The change will be designed such that these binaries can be installed via `kubectl apply -f`, after which the appropriate instances of the binaries will be running.

> **Review comment:** The addon manager isn't currently required, but we need an equivalent that is.

##### 6.1 Upgrading kubelet and proxy

The kubelet and the proxy run on every node in the Kubernetes cluster. Based on your setup (systemd or other), you can follow the normal upgrade steps for them. This change does not affect the kubelet and proxy upgrade steps for your setup.

##### 6.2 Upgrading plugins

Plugins such as CNI and Flex Volumes can be upgraded just as you normally upgrade them. This change does not affect the plugin upgrade steps for your setup.

##### 6.3 Upgrading Kubernetes core services

The master node components (kube-controller-manager, kube-scheduler, kube-apiserver, etc.) can be upgraded just as you normally upgrade them. This change does not affect the upgrade steps for these components.

##### 6.4 Applying the cloud-controller-manager

This is the only step of the upgrade process that is different. In order to complete the upgrade, you need to apply the cloud-controller-manager deployment to the setup. A deployment descriptor file will be provided with this change. You need to apply this change using

```
kubectl apply -f cloud-controller-manager.yml
```

> **Review comment:** What happens when we update the master and get the new controller, but the nodes have not been updated? How does the taints/initializers stuff work in the interim? Do we need an interlock like we have done before? E.g. one release has both old and new master logic and switches only when all nodes report updated kubelets, and then the subsequent release removes the old logic?
>
> **Review comment:** @thockin, we can prevent this scenario where we've updated the master but the nodes haven't been updated by adding the taint logic into the master itself. If the master had a node admission controller that added these taints, then we'll only need to update the master; any nodes that had the old logic would've talked to the cloud provider themselves, and the nodes with the new logic wouldn't need to add any new params. There is a flip side, where there will be updated nodes but older masters will not have the admission controller for adding taints. To fix this, we'll need the interlock approach in the cloud-controller-manager: the CCM will retroactively process nodes and update cloud-specific details in the first release, and then this logic will be removed in the subsequent release. I thought of this approach of using admission controllers instead of setting the taint in the kubelet because of an issue faced by @luxas: it was restrictively difficult for kubeadm to update every kubelet with new params, and he suggested that we use a different approach instead. After discussing, we found that the admission controller approach is preferable.
>
> **Review comment:** Yes, using an admission controller that taints every node before the node has initialized is preferable.

This will start the cloud-specific controller manager in your Kubernetes setup.

The downgrade steps are also the same as before for all components except the cloud-controller-manager. In the case of the cloud-controller-manager, the deployment should be deleted using

```
kubectl delete -f cloud-controller-manager.yml
```

### 7. Roadmap

##### 7.1 Transition plan

Release 1.6: Add the first implementation of the cloud-controller-manager binary. This binary's purpose is to let users run two controller managers and report any issues they uncover that we might have missed. It also doubles as a reference implementation of the external cloud controller manager for the future. Since the cloud-controller-manager runs the cloud-specific controller loops, it is important to ensure that the kube-controller-manager does not run these loops as well. This is done by leaving the `--cloud-provider` flag unset in the kube-controller-manager. At this stage, the cloud-controller-manager will still be "beta" and optional.

Release 1.7: In this release, all of the supported turnups will be converted to use the cloud controller manager by default. At this point users will still be allowed to opt out. Users will be expected to run the monolithic cloud controller binary. The cloud-controller-manager will continue to use the existing library, but code will be factored out to reduce literal duplication between the controller-manager and the cloud-controller-manager. A deprecation announcement will be made to inform users to switch to the cloud-controller-manager.

Release 1.8: The main change aimed for this release is to break up the various cloud providers into individual binaries. Users will still be allowed to opt out. There will be a second warning to inform users about the deprecation of the `--cloud-provider` option in the controller-manager.

Release 1.9: All of the legacy cloud providers will be completely removed in this version.

> **Review comment:** This will be December - it sucks to plan a year ahead, but this is a huge and important changeset.

##### 7.2 Code/Library Evolution

* Break the controller-manager into two binaries. One binary will be the existing controller-manager, and the other will run only the cloud-specific loops, with no other changes. The new cloud-controller-manager will still load all the cloudprovider libraries, and will therefore allow users to choose which cloud provider to use.
* Move the cloud-specific parts of the kubelet out using the external admission controller pattern mentioned in the sections above.
* The cloud controller will then be made into a library. It will take a cloudprovider.Interface as an argument to its constructor. Individual cloud provider binaries will be created using this library.
* Cloud-specific operations will be moved out of the kube-apiserver using the external admission controller pattern mentioned above.
* All cloud-specific volume controller loops (the attach, detach, and provision operation controllers) will be switched to using Flex Volumes. Flex Volumes do not need in-tree cloud-specific calls.
* As the final step, all of the cloud-provider-specific code will be moved out of tree.

> **Review comment:** Note that we will very, very likely move away from this model to one where ConfigMaps can be shared between nodes, which will obviate the ConfigMap dependency on the node name. For more info see kubernetes/kubernetes#29459.