Enable dynamic configue max number of PDs allowed on a node based on machine type #53461

jingxu97 · 2017-10-04T21:15:46Z

Currently, for cloud provider including gce, aws, and azure, there is a
hardcoded number to limit the max number of PDs allowed on a node.
However, gce has changed this number based on machine type. This PR
allows scheduler to automatically get this number based on the machine
type of the given node.

fixes issue #24317

Release note:

Enable dynamic configuration of maximum number of persistent disks allowed on a node based on machine type for scheduler to use to check the feasibility of assigning pod to node.

msau42 · 2017-10-04T23:44:53Z

pkg/cloudprovider/providers/gce/gce_disks.go

@@ -54,6 +55,8 @@ const (
 	replicaZoneURITemplateSingleZone = "%s/zones/%s" // {gce.projectID}/zones/{disk.Zone}
 )

+var DiskNumberLimit = []int{16, 32, 64, 128}


Can we make this a map instead? with key = num cpus, and value = pd limit?

I am not sure about how many cpus possibly support

Or at least define an enum to represent the index?

msau42 · 2017-10-04T23:47:06Z

plugin/pkg/scheduler/algorithm/predicates/predicates.go

+	if node != nil {
+		instanceType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
+	}
+	maxVolumes := c.maxPDCount(instanceType)


Can we pass in Node object instead? Then each cloud provider can use whatever label or annotation they have to determine the limit.

changed to pass a node

msau42 · 2017-10-04T23:50:50Z

Do we still want to support the ability for the user to override the limit by setting a flag?

jingxu97 · 2017-10-05T21:28:23Z

@msau42 I checked again, there is no such flag for user to overwrite the number

msau42 · 2017-10-06T01:45:44Z

pkg/cloudprovider/providers/gce/gce_disks.go

 )

+var DiskNumberLimit = []int{16, 32, 64, 128}


Add a comment that the values correspond to the indexes above

msau42 · 2017-10-06T01:46:25Z

pkg/cloudprovider/providers/gce/gce_disks.go

+func MaxPDCount(node *v1.Node) int {
+	machineType := ""
+	if node != nil {
+		machineType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]


Could Labels be nil?

msau42 · 2017-10-06T01:54:20Z

plugin/pkg/scheduler/algorithm/predicates/predicates_test.go

 			maxVols:      4,
 			fits:         true,
 			test:         "fits when node capacity >= new pod's EBS volumes",
 		},
 		{
 			newPod:       twoVolPod,
-			existingPods: []*v1.Pod{oneVolPod},
+			existingPods: []*v1.Pod{oneVolPod, twoVolPod, splitVolsPod},


Why do you need to add a splitVolsPod?

There's no issue if newPod is also in existingPods?

msau42 · 2017-10-06T01:55:17Z

plugin/pkg/scheduler/algorithm/predicates/predicates_test.go

@@ -1805,14 +1989,14 @@ func TestEBSVolumeCountConflicts(t *testing.T) {
 	}{
 		{
 			newPod:       oneVolPod,
-			existingPods: []*v1.Pod{twoVolPod, oneVolPod},
+			existingPods: []*v1.Pod{twoVolPod},
 			maxVols:      4,


This field should be removed since it's not used anymore

sttts · 2017-10-06T10:23:58Z

/unassign

msau42 · 2017-10-09T01:20:18Z

plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go

-				// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
-				maxVols := getMaxVols(aws.DefaultMaxEBSVolumes)
-				return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
+				return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, aws.MaxPDCount, args.PVInfo, args.PVCInfo)


I found the code where you can override the default max pd count with the environment variable

All calls to 'getMaxVols' are dropped in this PR, which reads environment variable 'KUBE_MAX_PD_VOLS'.
Is it expected? If so, we'd better add a release note for deprecating it, or switch back to still allow the override.

This is documented here: https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/devel/scheduler_algorithm.md

I added the logic of checking environment variable back to the code. Thanks!

jdumars · 2017-10-11T14:01:35Z

@karataliu could you look at this from the Azure perspective? Is the default max of 16 appropriate?

karataliu · 2017-10-12T02:05:37Z

@jdumars That looks fine since this PR only moves 'DefaultMaxAzureDiskVolumes'(16) from 'defaults.go' to 'azure.go', which won't cause behavior change. I could create a separate PR to calc the value based on node type.

Also, if dynamic config is done, the following issue could be addressed: Azure/acs-engine#186

jdumars · 2017-10-12T19:12:31Z

@karataliu that would be extremely helpful! Thank you for looking into this.

jingxu97 · 2017-10-17T21:38:27Z

@msau42 comments are addressed. PTAL. Thanks!

msau42 · 2017-10-17T22:17:44Z

plugin/pkg/scheduler/algorithm/predicates/predicates_test.go

+			existingPods: onePodList_15,
+			node:         small_node,
+			fits:         true,
+			test:         "doesn't fit when node capacity < new pod's GCE volumes",


fix description

msau42 · 2017-10-17T22:21:16Z

plugin/pkg/scheduler/algorithm/predicates/predicates_test.go

@@ -2006,7 +2251,8 @@ func TestEBSVolumeCountConflicts(t *testing.T) {
 	expectedFailureReasons := []algorithm.PredicateFailureReason{ErrMaxVolumeCountExceeded}

 	for _, test := range tests {
-		pred := NewMaxPDVolumeCountPredicate(filter, test.maxVols, pvInfo, pvcInfo)
+		os.Setenv(KubeMaxPDVols, strconv.Itoa(test.maxVols))


Do you need to restore the previous value like in the original test?

fixed, thanks!

jdumars · 2017-10-26T14:33:18Z

@khenidak PTAL

k8s-github-robot · 2017-10-26T14:33:39Z

@jingxu97 PR needs rebase

khenidak · 2017-11-02T00:54:15Z

if I understand correctly this change will require (1) linking scheduler with Cloud Provider code (2) making sure that scheduler is bootstrapped with cloud config (to allow provider to work correctly).

The first creates one more dependency everybody is trying to walk away from (by out of tree of provider). The second will force users to revisit all the existing clusters to upgrade to whatever version carries this change. in addition to visiting all the bootstrap tooling/scripts to enable that for new clusters.

Additionally Cloud Providers (all of them) are constantly changing/adding VM sizes and shapes (with support for larger # of disks). With this change users have to wait to new kubernetes release versions to support new sizes/shapes. With this limitation i really think we shouldn't go ahead with this PR.

What i am proposing is to keep the predicate side of code but instead of using cloud provider to resolve max-pd-count, the scheduler depends on well-know configuration map in kube-system namespace, this configuration map carries a tuple as the following

# with example data (below data for sampling not accuracy)
PD Spec (Name) | Instance-type       | Count
azureDisk            |   Standard_D2_v2 | 4
gcDisk                 |  n1-highcpu-2      | 4
awsDisk               | instance-name    | 6

a default value for instances that are not in the table (possible overridden as currently done by env var)

Users can then modify the table as needed, or alternatively cloud provider can an publish updated table (As a config map) in json format for users to apply it on clusters using kubectl apply

/sig azure
@jdumars @brendanburns

brendandburns · 2017-11-02T03:21:31Z

@jingxu97 I agree with @khenidak here.

We should have the predicate use a ConfigMap for configuration, rather than relying on an in-tree cloud-provider.

Not only does this provide better de-coupling, but it will also work for additional scenarios (like on-prem SAN, etc) where we don't have an equivalent cloud provider.

Can we rework this PR so that it uses a ConfigMap from the kube-system namespace instead of modifying the cloud provider?

Thanks!

@kubernetes/sig-architecture-pr-reviews

jdumars · 2017-11-02T19:23:30Z

@bgrant0607 this one needs an eyeball

bgrant0607 · 2017-11-08T22:09:18Z

The cloudprovider API is frozen. How would this be done with an external cloud provider?

jingxu97 · 2017-11-08T23:16:57Z

@brendanburns @khenidak @bgrant0607 thanks a lot for the comments. I will try to rework this PR based on your suggestions.

k8s-github-robot · 2017-11-16T08:27:21Z

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@jdef @jingxu97 @msau42

Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.9 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.
priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

dims · 2017-11-16T17:36:20Z

@jingxu97 @jdumars Can we please move this out of 1.9?

jdumars · 2017-11-16T17:43:13Z

@dims done.

andyzhangx · 2017-11-21T05:51:41Z

pkg/cloudprovider/providers/azure/azure.go

+	// DefaultMaxAzureVolumes defines the maximum number of PD Volumes for Azure
+	// Larger Azure VMs can actually have much more disks attached.
+	// TODO We should determine the max based on VM size
+	DefaultMaxAzureVolumes = 16


I have checked on azure DefaultMaxAzureVolumes should be 32

ymsaout · 2017-12-29T12:10:44Z

@jingxu97 Will you be able to do something for 1.10 ?

jingxu97 · 2018-01-02T23:55:34Z

@ymsaout Yes, we will work on it for 1.10.

khenidak · 2018-02-06T17:17:48Z

ping @jingxu97 is still moving forward?

ymsaout · 2018-02-28T11:45:41Z

Is there a sticking point from your view @jingxu97 ?

bgrant0607 · 2018-05-01T01:38:08Z

cc @cheftako, who is working on cloud provider extraction

https://github.com/kubernetes/community/blob/master/keps/0002-controller-manager.md

kkmsft · 2018-06-22T17:14:44Z

Looks stalled and since would like to move forward with this, I would like to take over this. Ping @jingxu97

msau42 · 2018-06-22T17:22:00Z

This is available in 1.11 as an alpha feature: kubernetes/enhancements#554

jdumars · 2018-09-02T20:08:10Z

Does this PR still need to be open? @jingxu97

andyzhangx · 2018-09-03T02:36:24Z

@jdumars It's already resoved by #64154, and this feature is beta in v1.12

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 4, 2017

k8s-github-robot assigned jdef and sttts Oct 4, 2017

jingxu97 added this to the v1.9 milestone Oct 4, 2017

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2017

jingxu97 assigned msau42 Oct 4, 2017

msau42 reviewed Oct 4, 2017

View reviewed changes

jingxu97 force-pushed the Oct/maxPD branch from f1c7e07 to 9eebf57 Compare October 5, 2017 20:52

k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 5, 2017

jingxu97 force-pushed the Oct/maxPD branch 2 times, most recently from 73b7435 to 7090665 Compare October 5, 2017 21:27

msau42 reviewed Oct 6, 2017

View reviewed changes

k8s-ci-robot unassigned sttts Oct 6, 2017

msau42 reviewed Oct 9, 2017

View reviewed changes

jingxu97 force-pushed the Oct/maxPD branch from 7090665 to fc1b933 Compare October 17, 2017 21:46

msau42 reviewed Oct 17, 2017

View reviewed changes

jingxu97 force-pushed the Oct/maxPD branch from fc1b933 to cfa8c86 Compare October 17, 2017 22:38

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2017

jingxu97 force-pushed the Oct/maxPD branch from cfa8c86 to d3fcb07 Compare October 18, 2017 21:08

k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2017

jingxu97 force-pushed the Oct/maxPD branch from d3fcb07 to 732355a Compare October 19, 2017 21:20

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2017

k8s-ci-robot added the sig/azure label Nov 2, 2017

bgrant0607 removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2017

k8s-github-robot added the milestone/incomplete-labels label Nov 14, 2017

jdumars modified the milestones: v1.9, next-candidate Nov 16, 2017

andyzhangx reviewed Nov 21, 2017

View reviewed changes

msau42 mentioned this pull request Dec 11, 2017

Add DigitalOcean provisioner kubernetes-retired/external-storage#470

Merged

ravisantoshgudimetla mentioned this pull request Feb 8, 2018

Considering PVC as a resource for balanced resource utilization in the scheduler #58232

Closed

gonzochic mentioned this pull request Feb 21, 2018

Disk error when pods are mounting a certain amount of volumes on a node Azure/AKS#201

Closed

jdumars closed this Sep 24, 2018

Enable dynamic configue max number of PDs allowed on a node based on machine type #53461

Enable dynamic configue max number of PDs allowed on a node based on machine type #53461

Conversation

jingxu97 commented Oct 4, 2017 • edited Loading

Choose a reason for hiding this comment

jingxu97 Oct 5, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

msau42 commented Oct 4, 2017

jingxu97 commented Oct 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sttts commented Oct 6, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdumars commented Oct 11, 2017

karataliu commented Oct 12, 2017 • edited Loading

jdumars commented Oct 12, 2017

jingxu97 commented Oct 17, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdumars commented Oct 26, 2017

k8s-github-robot commented Oct 26, 2017

khenidak commented Nov 2, 2017 • edited Loading

brendandburns commented Nov 2, 2017 • edited Loading

jdumars commented Nov 2, 2017

bgrant0607 commented Nov 8, 2017

jingxu97 commented Nov 8, 2017

k8s-github-robot commented Nov 16, 2017

dims commented Nov 16, 2017

jdumars commented Nov 16, 2017

Choose a reason for hiding this comment

ymsaout commented Dec 29, 2017

jingxu97 commented Jan 2, 2018

khenidak commented Feb 6, 2018

ymsaout commented Feb 28, 2018

bgrant0607 commented May 1, 2018

kkmsft commented Jun 22, 2018

msau42 commented Jun 22, 2018

jdumars commented Sep 2, 2018

andyzhangx commented Sep 3, 2018

jingxu97 commented Oct 4, 2017 •

edited

Loading

jingxu97 Oct 5, 2017 •

edited

Loading

karataliu commented Oct 12, 2017 •

edited

Loading

khenidak commented Nov 2, 2017 •

edited

Loading

brendandburns commented Nov 2, 2017 •

edited

Loading