KEP for Azure Availability Zones #2364
Conversation
/assign @brendandburns @khenidak @colemickens
/cc @justaugustus @andyzhangx @kubernetes/sig-azure-misc @kubernetes/sig-storage-feature-requests
### PersistentVolumeLabel

Besides the PVLabeler interface, the [PersistentVolumeLabel](https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission.go) admission controller should also be updated with AzureDisk support, so that new PVs get the above labels applied automatically.
This admission controller is deprecated now and I believe its replacement is in the cloud controller manager
Sorry, didn't notice this has been deprecated. Cloud controller manager will GA in v1.13 or v1.14. What's the replacement before its GA?
I'm not sure if there is a replacement in the meantime, but both methods should be implemented
@msau42 you're right:

- if cloud-controller-manager is used, the PVLabeler interface is used by the persistent volume label controller
- otherwise, this admission controller is used.

Will add this info to both this part and the PVLabeler part.
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zone: "centralus-1"
The new topology-aware dynamic provisioning feature adds a new allowedTopologies field to StorageClass, which is intended to replace the zone/zones parameters.
The topology-aware dynamic provisioning feature is still alpha in v1.12 if I'm not wrong, so these two parameters are still required before its GA, right?

And thanks for the reminder. The allowedTopologies feature should also be added here.
Correct. Just wanted to make sure you know that this area is changing; as part of beta, we want to get all three in-tree cloud provider plugins to start moving to the new method as well.
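For illustration, a zoned AzureDisk StorageClass built on allowedTopologies might look roughly like the sketch below (a hypothetical example using the beta topology API and the `centralus-1` zone from above; not text from the KEP itself):

```yaml
# Sketch only: restrict provisioning to a single zone via allowedTopologies
# instead of the zone/zones parameters (field names assume the beta API).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-centralus-1   # hypothetical name
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - centralus-1
```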
## AzureDisk

When Azure managed disks are created, the `PersistentVolumeLabel` admission controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate`) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
Eventually the VolumeZone predicate will be deprecated and replaced by PV.NodeAffinity
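As a rough sketch of that replacement, a PV pinned to a zone would carry a node affinity stanza along these lines (hypothetical values; the label key shown is the pre-GA zone label):

```yaml
# Sketch only: PV.spec.nodeAffinity restricting the volume to nodes in one zone,
# which the scheduler can enforce instead of the VolumeZonePredicate.
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - centralus-1
```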
The proposal includes required changes to support availability zones for various functions in the Azure cloud provider and AzureDisk volumes:

- Detect availability zones automatically when registering new nodes, and set the node's `failure-domain.beta.kubernetes.io/zone` label to the AZ instead of the fault domain
Rather than having the kubelet self-register this, can the node controller pull this information and add it into the node object?
## Node registration

When nodes are started, kubelet automatically adds labels to them with region and zone information:
Allowing nodes to self-label has been problematic (enumerated in #911). Is there a reason we wouldn't have the node controller pull this info from the cloud provider and manage the labels on the node objects?
cc @mikedanese
There are actually two ways supported today:

- Without cloud-controller-manager (kubelet --cloud-provider=azure), kubelet registers itself with zone labels
- With cloud-controller-manager (kubelet --cloud-provider=external), the node controller adds zone labels

Until cloud-controller-manager GAs (probably v1.13 or v1.14), both ways should be supported. I should probably state this more clearly.
> Until cloud-controller-manager GAs (probably v1.13 or v1.14), both ways should be supported

I agree we need to support the old way until we can transition away from it, but I don't think we should expand it to new labels if possible.
> I agree we need to support the old way until we can transition away from it, but I don't think we should expand it to new labels if possible

Agreed, this is only for existing zone labels.
> Agreed, this is only for existing zone labels.

Great, thanks for confirming.
It's not clear to me: is this proposing changes to Kubernetes concepts or adding new labels, or is it just changing the Azure implementation with no impact to the rest of the system?
I'm not sure that changes that stay within the Azure implementation need a KEP?
The proposal includes required changes to support availability zones for various functions in the Azure cloud provider and AzureDisk volumes:

- Detect availability zones automatically when registering new nodes, and set the node's `failure-domain.beta.kubernetes.io/zone` label to the AZ instead of the fault domain
I don't understand the statement. You want to replace a standard label with ... what?
Azure is setting the fault domain in the `failure-domain.beta.kubernetes.io/zone` label today. It will be replaced with the AZ. Fault domain and AZ are different things on Azure.
Currently, Azure nodes are registered with the label `failure-domain.beta.kubernetes.io/zone=faultDomain`.

The fault domain is formatted as a number (e.g. `1` or `2`), which is the same format as an AZ (e.g. `1` or `3`). If AZs used the same format as fault domains, there would be scheduling issues for clusters with both AZ and non-AZ nodes. So AZs will use a different format in Kubernetes: `<region>-<AZ>`, e.g. `centralus-1`.
Maybe you should lead in with some details about what AZ means and what faultDomain means? Why are they not interchangeable terms (they are both pretty commonly used to mean the same thing)
Good catch. This should definitely be addressed.

Added a new commit covering topology-aware provisioning and the difference between fault domains and availability zones.
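To make the proposed format concrete, a zoned node registered under this scheme might carry labels like the following (a sketch with a hypothetical node in the `centralus` region, zone 1; a non-AZ node would keep the plain fault-domain number instead):

```yaml
# Sketch only: zone-related labels on a node object in an AZ-enabled region.
metadata:
  labels:
    failure-domain.beta.kubernetes.io/region: centralus
    failure-domain.beta.kubernetes.io/zone: centralus-1   # <region>-<AZ>, not the fault domain
```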
Hey @thockin, good question. The KEP process is somewhat vague on what constitutes a KEP: https://kubernetes.io/docs/imported/community/keps/#what-type-of-work-should-be-tracked-by-a-kep

I think that it is clearly within the scope of a SIG to track something as a KEP if they want to:

> As the local bodies of governance, SIGs should have broad latitude in describing what constitutes an enhancement which should be tracked through the KEP process.

But it's also clear that I don't think this meets the guidelines for things that are required to be a KEP:

> any technical effort (refactoring, major architectural change) that will impact a large section of the development community should also be communicated widely. The KEP process is suited for this even if it will have zero impact on the typical user or operator.

...

> Enhancements that have major impacts on multiple SIGs should use the KEP process.

So I guess given that this KEP has been started, we may as well finish it, but going forward SIG-Azure will develop guidelines for what should and should not be a KEP:

> SIGs may find that helpful to enumerate what does not require a KEP rather than what does.

Sound good?
@brendandburns / @thockin -- So from a Features Maintainer perspective, Feature Owners are required to submit design proposals for units of work that qualify as Features, i.e., those tracked within …

From a SIG Azure Chair perspective, anything that moves us along the line of transparency w.r.t. the SIG roadmap is a step in the right direction. That said, I'm happy to firm up guidelines on expectations for KEP submission in SIG Azure! :)
@@ -139,13 +146,13 @@ Note that zonal PublicIPs are not supported. We may add this easily if there’r

## AzureDisk

-When Azure managed disks are created, the `PersistentVolumeLabel` admission controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate`) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
+When Azure managed disks are created, the `PersistentVolumeLabel` admission controller or PV controller automatically adds zone labels to them. The scheduler (via `VolumeZonePredicate` or `PV.NodeAffinity` in the future) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
PV controller is actually the "PV label controller in cloud-controller-manager" (not to be confused with the PV controller in kube-controller-manager).
The admission controller needs to be updated to also add PV.NodeAffinity in addition to the zone labels.
Yep, exactly. Updated with that information.
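Putting those pieces together, a dynamically provisioned zoned AzureDisk PV would end up with both the zone labels and a matching node affinity; a rough sketch with hypothetical names and values (the disk URI is a truncated placeholder):

```yaml
# Sketch only: a zoned AzureDisk PV after the PV label controller / admission
# controller has added zone labels and PV.NodeAffinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example-azure-disk                       # hypothetical
  labels:
    failure-domain.beta.kubernetes.io/region: centralus
    failure-domain.beta.kubernetes.io/zone: centralus-1
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  azureDisk:
    kind: Managed
    diskName: example-disk                           # hypothetical
    diskURI: /subscriptions/.../disks/example-disk   # truncated placeholder
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - centralus-1
```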
@thockin @brendandburns @justaugustus Thanks for the clarification on the KEP process. Per Brendan's suggestion, could we continue this KEP and develop guidelines in sig-azure?

Absolutely @feiskyer. Let's keep this going!

@thockin @brendandburns @justaugustus @ddebroy Addressed comments and squashed commits. PTAL
Thank you @feiskyer. /lgtm
Overall LGTM, but I'd like to hold pending a review from Kal or Cole as well.

/LGTM

Awesome, thanks for the review, Kal! /hold cancel

Ooooooo, I'm going to need to add an

Fixed in #2399 /approve

@jdumars @calebamiles @idvoretskyi -- could you
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: idvoretskyi, justaugustus.
Thanks @idvoretskyi 🎉

Thanks @idvoretskyi and @justaugustus
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add availability zones support to Azure managed disks

**What this PR does / why we need it**: Continuation of the [Azure Availability Zone feature](kubernetes/enhancements#586). This PR adds availability zone support for Azure managed disks and their storage class. Zoned managed disks are enabled by default if there are zoned nodes in the cluster. The zone can also be customized by the `zone` or `zones` parameter, e.g.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  name: managed-disk-zone-1
parameters:
  zone: "southeastasia-1"
  # zones: "southeastasia-1,southeastasia-2"
  cachingmode: None
  kind: Managed
  storageaccounttype: Standard_LRS
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

All zoned AzureDisk PVs will also be labeled with their availability zone, e.g.

```sh
$ kubectl get pvc pvc-azuredisk-az-1
NAME                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
pvc-azuredisk-az-1   Bound     pvc-5ad0c7b8-8f0b-11e8-94f2-000d3a07de8c   5Gi        RWO            managed-disk-zone-1   2h

$ kubectl get pv pvc-5ad0c7b8-8f0b-11e8-94f2-000d3a07de8c --show-labels
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                        STORAGECLASS          REASON    AGE       LABELS
pvc-5ad0c7b8-8f0b-11e8-94f2-000d3a07de8c   5Gi        RWO            Delete           Bound     default/pvc-azuredisk-az-1   managed-disk-zone-1             2h        failure-domain.beta.kubernetes.io/region=southeastasia,failure-domain.beta.kubernetes.io/zone=southeastasia-1
```

**Special notes for your reviewer**: See also the [KEP](kubernetes/community#2364). The DynamicProvisioningScheduling feature will be added in a following PR.

**Release note**:

```release-note
Azure managed disks now support availability zones and new parameters `zoned`, `zone` and `zones` are added for AzureDisk storage class.
```

/kind feature
/sig azure
/assign @brendandburns @khenidak @andyzhangx
KEP for Azure Availability Zones
> - Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking.
>
> An Availability Zone in an Azure region is a combination of a fault domain and an update domain (the latter is like a fault domain, but for updates: when upgrading a deployment, it is carried out one update domain at a time). For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains.
Is there further classification, like does an update domain have multiple fault domains?
No, each VM is assigned a single update domain.
The KEP adds Azure Availability Zones support to Kubernetes, which includes:

- The node label `failure-domain.beta.kubernetes.io/zone` will be set to the AZ instead of the fault domain
- The `GetLabelsForVolume` interface will be implemented for Azure managed disks, and it will also be added to the `PersistentVolumeLabel` admission controller so as to support DynamicProvisioningScheduling

/sig azure
/sig storage