From 7a9eb716ae9c234dda7be401fa294ea75da733ac Mon Sep 17 00:00:00 2001
From: Pengfei Ni
Date: Thu, 12 Jul 2018 11:29:58 +0800
Subject: [PATCH] Add topology-aware provisioning and terms of FD/AZ

---
 .../0018-20180711-azure-availability-zones.md | 64 +++++++++++++++----
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/sig-azure/0018-20180711-azure-availability-zones.md b/sig-azure/0018-20180711-azure-availability-zones.md
index 42f6deb60f8..a99abc00172 100644
--- a/sig-azure/0018-20180711-azure-availability-zones.md
+++ b/sig-azure/0018-20180711-azure-availability-zones.md
@@ -49,10 +49,17 @@ This proposal aims to add [Azure Availability Zones (AZ)](https://azure.microsof
 The proposal includes required changes to support availability zones for various functions in Azure cloud provider and AzureDisk volumes:
 
-- Detect availability zones automatically when registering new nodes and node's label `failure-domain.beta.kubernetes.io/zone` will be replaced with AZ instead of fault domain
+- Detect availability zones automatically when registering new nodes (by kubelet or node controller); the node's `failure-domain.beta.kubernetes.io/zone` label will be set to the AZ instead of the fault domain
 - LoadBalancer and PublicIP will be provisioned with zone redundant
 - `GetLabelsForVolume` interface will be implemented for Azure managed disks and it will also be added to `PersistentVolumeLabel` admission controller so as to support DynamicProvisioningScheduling
 
+> Note that, unlike in most other cases, fault domains and availability zones mean different things on Azure:
+>
+> - A Fault Domain (FD) is essentially a rack of servers. Servers in the same FD share subsystems such as network, power and cooling.
+> - Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking.
+>
+> An Availability Zone in an Azure region is a combination of a fault domain and an update domain (similar to an FD, but for updates: when a deployment is upgraded, the upgrade is carried out one update domain at a time). For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains.
+
 ### Non-scopes
 
 Provisioning Kubernetes masters and nodes with availability zone support is not included in this proposal. It should be done in the provisioning tools (e.g. acs-engine). Azure cloud provider will auto-detect the node's availability zone if `availabilityZones` option is configured for the Azure cloud provider.
@@ -94,7 +101,7 @@ Note that with standard SKU LoadBalancer, `primaryAvailabitySetName` and `primar
 
 ## Node registration
 
-When nodes are started, kubelet automatically adds labels to them with region and zone information:
+When registering new nodes, the kubelet (with the built-in cloud provider) or the node controller (with an external cloud provider) automatically adds region and zone labels to them:
 
 - Region: `failure-domain.beta.kubernetes.io/region=centralus`
 - Zone: `failure-domain.beta.kubernetes.io/zone=centralus-1`
@@ -139,13 +146,13 @@ Note that zonal PublicIPs are not supported. We may add this easily if there’r
 
 ## AzureDisk
 
-When Azure managed disks are created, the `PersistentVolumeLabel` admission controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate`) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
+When Azure managed disks are created, the `PersistentVolumeLabel` admission controller or the PV controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate`, or `PV.NodeAffinity` in the future) will then ensure that pods claiming a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
 
 > Note that only managed disks are supported. Blob disks don't support availability zones on Azure.
 
-### PVLabeler
+### PVLabeler interface
 
-`PVLabeler` interface should be implemented for AzureDisk:
+To set the AzureDisk zone labels correctly (required by cloud-controller-manager's PersistentVolumeLabelController), the Azure cloud provider should implement the [PVLabeler](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L212) interface:
 
 ```go
 // PVLabeler is an abstract, pluggable interface for fetching labels for volumes
@@ -154,7 +161,7 @@ type PVLabeler interface {
 }
 ```
 
-It should return the region and zone of the AzureDisk, e.g.
+It should return the region and zone for the AzureDisk (a sketch of the implementation follows below), e.g.
 
 - `failure-domain.beta.kubernetes.io/region=centralus`
 - `failure-domain.beta.kubernetes.io/zone=centralus-1`
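+
+A minimal sketch of how the Azure cloud provider might satisfy this interface is shown below. It is illustrative only: the `getDiskZoneByName` helper (which would look up the managed disk's zone from the ARM API) and the use of the provider's configured `Location` are assumptions, not the final implementation:
+
+```go
+// Sketch only: getDiskZoneByName is a hypothetical helper that queries the ARM API
+// for the managed disk and returns its availability zone (e.g. "centralus-1").
+func (az *Cloud) GetLabelsForVolume(ctx context.Context, pv *v1.PersistentVolume) (map[string]string, error) {
+	// Ignore volumes that are not Azure managed disks; blob disks have no zones.
+	if pv.Spec.AzureDisk == nil || pv.Spec.AzureDisk.Kind == nil || *pv.Spec.AzureDisk.Kind != v1.AzureManagedDisk {
+		return nil, nil
+	}
+
+	zone, err := az.getDiskZoneByName(ctx, pv.Spec.AzureDisk.DiskName)
+	if err != nil {
+		return nil, err
+	}
+
+	// Return the same failure-domain labels that are set on nodes; az.Location is
+	// the Azure region from the cloud provider configuration.
+	return map[string]string{
+		"failure-domain.beta.kubernetes.io/region": az.Location,
+		"failure-domain.beta.kubernetes.io/zone":   zone,
+	}, nil
+}
+```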
@@ -167,9 +174,9 @@ NAME             CAPACITY   ACCESSMODES   STATUS    CLAIM            REASON    AGE
 pv-managed-abc   5Gi        RWO           Bound     default/claim1             46s       failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1
 ```
 
-### PersistentVolumeLabel
+### PersistentVolumeLabel admission controller
 
-Besides PVLabeler interface, [PersistentVolumeLabel](https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission.go) admission controller should also updated with AzureDisk support, so that new PVs could be applied with above labels automatically.
+The cloud provider's `PVLabeler` interface is only used when cloud-controller-manager is running. For the built-in Azure cloud provider, the [PersistentVolumeLabel](https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission.go) admission controller should also be updated with AzureDisk support (a sketch follows the snippet below), so that new PVs also get the above labels applied automatically.
 
 ```go
 func (l *persistentVolumeLabel) Admit(a admission.Attributes) (err error) {
@@ -185,9 +192,17 @@ func (l *persistentVolumeLabel) Admit(a admission.Attributes) (err error) {
 }
 ```
 
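+The change could mirror how the plugin already looks up labels for other cloud disks. The sketch below is illustrative only: the helper name `findAzureDiskLabels`, the `azureLabeler` field and its `LabelsForAzureDisk` method are assumptions about how the wiring might look, not the final code:
+
+```go
+// Sketch only: called from Admit() when volume.Spec.AzureDisk is set, mirroring
+// the existing per-cloud label helpers in this admission plugin. azureLabeler and
+// LabelsForAzureDisk are hypothetical stand-ins for the real lookup through the
+// Azure cloud provider.
+func (l *persistentVolumeLabel) findAzureDiskLabels(volume *api.PersistentVolume) (map[string]string, error) {
+	// Only Azure managed disks carry zone information; blob disks are skipped.
+	if volume.Spec.AzureDisk == nil || volume.Spec.AzureDisk.Kind == nil ||
+		*volume.Spec.AzureDisk.Kind != api.AzureManagedDisk {
+		return nil, nil
+	}
+
+	// Resolve the disk's region and zone via the Azure cloud provider and return
+	// them keyed by the failure-domain labels shown in the PVLabeler section.
+	return l.azureLabeler.LabelsForAzureDisk(volume.Spec.AzureDisk.DiskName)
+}
+```
+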
+> Note: the PersistentVolumeLabel admission controller will be deprecated, and cloud-controller-manager is preferred once it reaches GA (probably v1.13 or v1.14).
+
 ### StorageClass
 
-Note that the above interfaces are only applied to AzureDisk PV, not StorageClass. For AzureDisk StorageClass, we should add a new optional parameter `zone` and `zones` (must not be used at the same time) for specifying which zones should be used to provision AzureDisk:
+Note that the above interfaces only apply to AzureDisk persistent volumes, not to StorageClass. For the AzureDisk StorageClass, the following three new options will be added to support zone-aware and [topology-aware](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) provisioning:
+
+- `zoned`: indicates whether new disks are provisioned with AZ. Default is `true`.
+- `zone` and `zones`: indicate which zones should be used to provision new disks (zone-aware provisioning). They can only be set if `zoned` is not false and `allowedTopologies` is not set.
+- `allowedTopologies`: indicates which topologies are allowed for [topology-aware](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) provisioning. It can only be set if `zoned` is not false and `zone`/`zones` are not set.
+
+An example StorageClass for zone-aware provisioning is:
 
 ```yaml
 apiVersion: storage.k8s.io/v1
@@ -200,20 +215,43 @@ metadata:
 parameters:
   kind: Managed
   storageaccounttype: Premium_LRS
+  # only one of zone and zones may be specified
   zone: "centralus-1"
   # zones: "centralus-1,centralus-2,centralus-3"
 provisioner: kubernetes.io/azure-disk
 ```
 
-If multiple zones are specified, then new AzureDisk will be provisioned with zone chosen arbitrarily among them.
+Another example StorageClass, for topology-aware provisioning, is:
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  annotations:
+  labels:
+    kubernetes.io/cluster-service: "true"
+  name: managed-premium
+parameters:
+  kind: Managed
+  storageaccounttype: Premium_LRS
+provisioner: kubernetes.io/azure-disk
+allowedTopologies:
+- matchLabelExpressions:
+  - key: failure-domain.beta.kubernetes.io/zone
+    values:
+    - centralus-1
+    - centralus-2
+```
+
+An AzureDisk can only be created in one specific zone, so if multiple zones are specified in the StorageClass, new disks will be provisioned in a zone chosen arbitrarily among them.
 
-If both zone and zones are not specified, then new AzureDisk will be provisioned with zone chosen by round-robin across all active zones, which means
+If no zones are specified and `zoned` is not false, then new disks will be provisioned in a zone chosen by round-robin across all active zones, which means
 
-- If there are no zoned nodes, then AzureDisk will also be provisioned without zone
+- If there are no zoned nodes, then a `no zoned nodes` error will be reported
 - Zoned AzureDisk will only be provisioned when there are zoned nodes
 - If there are multiple zones, then those zones are chosen by round-robin
 
-Note that there are risks if the cluster is running with both zoned and non-zoned nodes. In such case, AzureDisk is always zoned, and it can't be attached to non-zoned nodes. This means
+Note that there are risks if the cluster is running with both zoned and non-zoned nodes. In such a case, zoned AzureDisks can't be attached to non-zoned nodes. This means
 - new pods with zoned AzureDisks are always scheduled to zoned nodes
 - old pods using non-zoned AzureDisks can't be scheduled to zoned nodes
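+
+To illustrate the constraint above, the (purely illustrative) manifests below create a claim against the `managed-premium` class and a pod using it; once the disk is provisioned in a zone, the pod can only be scheduled to nodes carrying the matching `failure-domain.beta.kubernetes.io/zone` label:
+
+```yaml
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: claim-zoned
+spec:
+  accessModes: ["ReadWriteOnce"]
+  storageClassName: managed-premium
+  resources:
+    requests:
+      storage: 5Gi
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nginx-zoned
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    volumeMounts:
+    - name: data
+      mountPath: /data
+  volumes:
+  - name: data
+    persistentVolumeClaim:
+      claimName: claim-zoned
+```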