Add topology-aware provisioning and terms of FD/AZ
feiskyer committed Jul 12, 2018
1 parent ae17837 commit 7a9eb71
Showing 1 changed file with 51 additions and 13 deletions.
64 changes: 51 additions & 13 deletions sig-azure/0018-20180711-azure-availability-zones.md
@@ -49,10 +49,17 @@ This proposal aims to add [Azure Availability Zones (AZ)](https://azure.microsof

The proposal includes required changes to support availability zones for various functions in Azure cloud provider and AzureDisk volumes:

- Detect availability zones automatically when registering new nodes and node's label `failure-domain.beta.kubernetes.io/zone` will be replaced with AZ instead of fault domain
- Detect availability zones automatically when registering new nodes (by kubelet or the node controller), and set the node label `failure-domain.beta.kubernetes.io/zone` to the AZ instead of the fault domain
- LoadBalancer and PublicIP will be provisioned as zone redundant
- The `GetLabelsForVolume` interface will be implemented for Azure managed disks, and AzureDisk support will also be added to the `PersistentVolumeLabel` admission controller so as to support DynamicProvisioningScheduling

> Note that, unlike in most cases, fault domains and availability zones mean different things on Azure:
>
> - A Fault Domain (FD) is essentially a rack of servers. It consumes subsystems like network, power, cooling etc.
> - Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking.
>
> An Availability Zone in an Azure region is a combination of a fault domain and an update domain (similar to an FD, but for updates: when a deployment is upgraded, the upgrade is carried out one update domain at a time). For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains.

### Non-scopes

Provisioning Kubernetes masters and nodes with availability zone support is not included in this proposal; it should be done by the provisioning tools (e.g. acs-engine). The Azure cloud provider will auto-detect a node's availability zone if the `availabilityZones` option is configured for it.
@@ -94,7 +101,7 @@ Note that with standard SKU LoadBalancer, `primaryAvailabilitySetName` and `primar

## Node registration

When nodes are started, kubelet automatically adds labels to them with region and zone information:
When registering new nodes, kubelet (with the built-in cloud provider) or the node controller (with an external cloud provider) automatically adds region and zone labels to them (see the sketch after this list):

- Region: `failure-domain.beta.kubernetes.io/region=centralus`
- Zone: `failure-domain.beta.kubernetes.io/zone=centralus-1`
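
A minimal sketch of how the zone label value could be derived during registration, assuming the region, availability zone and fault domain have already been read from the Azure instance metadata (the helper name below is illustrative, not the actual cloud provider code):

```go
package azure

import (
	"fmt"
	"strings"
)

// nodeZoneLabel returns the value for failure-domain.beta.kubernetes.io/zone.
// A VM placed in an Availability Zone gets "<region>-<zone>", e.g. "centralus-1";
// a VM without an AZ keeps its fault domain, preserving the old behavior.
func nodeZoneLabel(region, availabilityZone, faultDomain string) string {
	if availabilityZone != "" {
		return fmt.Sprintf("%s-%s", strings.ToLower(region), availabilityZone)
	}
	return faultDomain
}
```
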
@@ -139,13 +146,13 @@ Note that zonal PublicIPs are not supported. We may add this easily if there’r

## AzureDisk

When Azure managed disks are created, the `PersistentVolumeLabel` admission controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate`) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
When Azure managed disks are created, the `PersistentVolumeLabel` admission controller or PV controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate` or `PV.NodeAffinity` in the future) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.

> Note that only managed disks are supported. Blob disks don't support availability zones on Azure.
### PVLabeler
### PVLabeler interface

`PVLabeler` interface should be implemented for AzureDisk:
To set up AzureDisk's zone labels correctly (as required by cloud-controller-manager's PersistentVolumeLabelController), the Azure cloud provider should implement the [PVLabeler](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L212) interface:

```go
// PVLabeler is an abstract, pluggable interface for fetching labels for volumes
type PVLabeler interface {
	GetLabelsForVolume(ctx context.Context, pv *v1.PersistentVolume) (map[string]string, error)
}
```

It should return the region and zone of the AzureDisk, e.g.
It should return the region and zone for the AzureDisk, e.g.

- `failure-domain.beta.kubernetes.io/region=centralus`
- `failure-domain.beta.kubernetes.io/zone=centralus-1`
@@ -167,9 +174,9 @@

```
NAME             CAPACITY   ACCESSMODES   STATUS    CLAIM            REASON    AGE       LABELS
pv-managed-abc   5Gi        RWO           Bound     default/claim1             46s       failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1
```
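
A rough sketch of how the Azure cloud provider could implement this interface for managed disks; the stand-in type and the `getDiskZone` helper below are hypothetical, standing in for a lookup of the disk resource via the Azure API:

```go
package azure

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// azureCloud is a minimal stand-in for the Azure cloud provider.
type azureCloud struct {
	location string // e.g. "centralus"
}

// getDiskZone is a hypothetical helper that would query the managed disk
// resource and return its availability zone ("1", "2", ...), or "" if the
// disk is not zoned.
func (c *azureCloud) getDiskZone(ctx context.Context, diskName string) (string, error) {
	// The real implementation would call the Azure disks API; stubbed here.
	return "1", nil
}

// GetLabelsForVolume returns region/zone labels for AzureDisk PVs.
func (c *azureCloud) GetLabelsForVolume(ctx context.Context, pv *v1.PersistentVolume) (map[string]string, error) {
	// Only AzureDisk-backed PVs are labeled; blob disks carry no zone.
	if pv.Spec.AzureDisk == nil {
		return nil, nil
	}
	zone, err := c.getDiskZone(ctx, pv.Spec.AzureDisk.DiskName)
	if err != nil {
		return nil, err
	}
	labels := map[string]string{
		"failure-domain.beta.kubernetes.io/region": c.location,
	}
	if zone != "" {
		labels["failure-domain.beta.kubernetes.io/zone"] = fmt.Sprintf("%s-%s", c.location, zone)
	}
	return labels, nil
}
```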

### PersistentVolumeLabel
### PersistentVolumeLabel admission controller

Besides PVLabeler interface, [PersistentVolumeLabel](https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission.go) admission controller should also updated with AzureDisk support, so that new PVs could be applied with above labels automatically.
The cloud provider's `PVLabeler` interface only applies when cloud-controller-manager is used. For the built-in Azure cloud provider, the [PersistentVolumeLabel](https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission.go) admission controller should also be updated with AzureDisk support, so that new PVs also get the above labels applied automatically.

```go
func (l *persistentVolumeLabel) Admit(a admission.Attributes) (err error) {
	// ...
	// AzureDisk support is added here, mirroring the existing GCE PD and
	// AWS EBS cases.
	if volume.Spec.AzureDisk != nil {
		labels, err := l.findAzureDiskLabels(volume)
		if err != nil {
			return admission.NewForbidden(a, fmt.Errorf("error querying AzureDisk volume %s: %v", volume.Spec.AzureDisk.DiskName, err))
		}
		volumeLabels = labels
	}
	// ...
}
```
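
For reference, a rough sketch of the corresponding lookup, written here as a free function for brevity; in the real admission plugin this would be a method on `persistentVolumeLabel` that lazily initializes the Azure cloud provider, so the wiring shown below is an assumption:

```go
package label

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/cloudprovider"
)

// findAzureDiskLabels reuses the cloud provider's PVLabeler to fetch the
// region/zone labels for an AzureDisk-backed PV.
func findAzureDiskLabels(labeler cloudprovider.PVLabeler, volume *v1.PersistentVolume) (map[string]string, error) {
	// Skip anything that is not an AzureDisk volume.
	if volume.Spec.AzureDisk == nil {
		return nil, nil
	}
	return labeler.GetLabelsForVolume(context.TODO(), volume)
}
```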

> Note: the PersistentVolumeLabel admission controller will be deprecated, and cloud-controller-manager is preferred after its GA (probably v1.13 or v1.14).

### StorageClass

Note that the above interfaces are only applied to AzureDisk PV, not StorageClass. For AzureDisk StorageClass, we should add a new optional parameter `zone` and `zones` (must not be used at the same time) for specifying which zones should be used to provision AzureDisk:
Note that the above interfaces only apply to AzureDisk persistent volumes, not to StorageClass. For the AzureDisk StorageClass, we should add a few new options for zone-aware and [topology-aware](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) provisioning. The following three new options will be added to the AzureDisk StorageClass (see the validation sketch after this list):

- `zoned`: indicates whether new disks are provisioned in an availability zone. Default is `true`.
- `zone` and `zones`: indicate which zones should be used to provision new disks (zone-aware provisioning). They can only be set if `zoned` is not false and `allowedTopologies` is not set.
- `allowedTopologies`: indicates which topologies are allowed for [topology-aware](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) provisioning. It can only be set if `zoned` is not false and `zone`/`zones` are not set.
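
A minimal sketch of how the provisioner could validate these mutually exclusive options before creating a disk (the type and function names below are illustrative assumptions, not the in-tree implementation):

```go
package azuredisk

import "fmt"

// zoneParameters captures the zone-related StorageClass options described above.
type zoneParameters struct {
	zoned                bool     // "zoned" parameter, defaults to true
	zone                 string   // "zone" parameter
	zones                []string // "zones" parameter, comma-separated in the class
	hasAllowedTopologies bool     // whether allowedTopologies is set on the class
}

// validateZoneParameters enforces the rules above: zone/zones and
// allowedTopologies require zoned disks, and only one way of restricting
// zones may be used at a time.
func validateZoneParameters(p zoneParameters) error {
	if !p.zoned && (p.zone != "" || len(p.zones) > 0 || p.hasAllowedTopologies) {
		return fmt.Errorf("zone, zones and allowedTopologies can only be used with zoned disks")
	}
	if p.zone != "" && len(p.zones) > 0 {
		return fmt.Errorf("only one of zone and zones may be specified")
	}
	if (p.zone != "" || len(p.zones) > 0) && p.hasAllowedTopologies {
		return fmt.Errorf("zone/zones and allowedTopologies are mutually exclusive")
	}
	return nil
}
```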

An example of a zone-aware provisioning storage class is:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  # only one of zone and zones is allowed
  zone: "centralus-1"
  # zones: "centralus-1,centralus-2,centralus-3"
provisioner: kubernetes.io/azure-disk
```
If multiple zones are specified, then new AzureDisk will be provisioned with zone chosen arbitrarily among them.

Another example of a topology-aware provisioning storage class is:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
provisioner: kubernetes.io/azure-disk
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - centralus-1
    - centralus-2
```

An AzureDisk can only be created in one specific zone, so if multiple zones are specified in the storage class, new disks will be provisioned with a zone chosen arbitrarily among them.
If both zone and zones are not specified, then new AzureDisk will be provisioned with zone chosen by round-robin across all active zones, which means
If no zones are specified and `zoned` is not false, then new disks will be provisioned with a zone chosen by round-robin across all active zones (see the sketch after this list), which means:

- If there are no zoned nodes, then AzureDisk will also be provisioned without zone
- If there are no zoned nodes, then a `no zoned nodes` error will be reported
- Zoned AzureDisk will only be provisioned when there are zoned nodes
- If there are multiple zones, then those zones are chosen by round-robin
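
A minimal sketch of the round-robin choice, assuming the set of active zones has been discovered from node labels; spreading by a hash of the PVC name (as some other in-tree provisioners do) gives a stable, roughly round-robin distribution:

```go
package azuredisk

import (
	"errors"
	"hash/fnv"
	"sort"
)

// chooseZoneForDisk picks the zone for a new disk from the active zones.
func chooseZoneForDisk(activeZones []string, pvcName string) (string, error) {
	if len(activeZones) == 0 {
		// No zoned nodes in the cluster: surface an error instead of
		// silently creating a non-zoned disk.
		return "", errors.New("no zoned nodes")
	}
	sort.Strings(activeZones) // deterministic ordering across calls
	h := fnv.New32a()
	h.Write([]byte(pvcName))
	idx := int(h.Sum32() % uint32(len(activeZones)))
	return activeZones[idx], nil
}
```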

Note that there are risks if the cluster is running with both zoned and non-zoned nodes. In such case, AzureDisk is always zoned, and it can't be attached to non-zoned nodes. This means
Note that there are risks if the cluster is running with both zoned and non-zoned nodes. In such a case, zoned AzureDisks can't be attached to non-zoned nodes. This means:

- new pods with zoned AzureDisks are always scheduled to zoned nodes
- old pods using non-zoned AzureDisks can't be scheduled to zoned nodes
