
Add initial design proposal for Scheduling Policy #1937

Closed
wants to merge 10 commits into from

Conversation

arnaudmz
Contributor

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 16, 2018
@k8s-ci-robot
Contributor

@arnaudmz: GitHub didn't allow me to request PR reviews from the following users: yastij.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Ref: kubernetes/kubernetes#61185

/assign @bsalamat @liggitt
/cc @yastij

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 16, 2018
@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 16, 2018
@yastij
Member

yastij commented Mar 16, 2018

also cc @timothysc


SchedulingPolicy resources are supposed to apply in a deny-all-except approach. They are designed to apply in an additive way (i.e. ANDed). From a pod's perspective, a pod can use one or N of the allowed items.

An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.
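
For illustration, a minimal sketch of how such a policy and its RBAC binding could look, mirroring the PodSecurityPolicy `use`-verb pattern discussed further down; the API group, resource name and field names are placeholders, not a settled API:

```yaml
# Hypothetical sketch: field names, API group and resource name are illustrative only.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: team-a-schedpol
spec:
  allowedSchedulerNames: ["default-scheduler"]
  allowedPriorityClassNames: ["normal-priority"]
  allowedNodeSelectors:
    disk: ["ssd", "hdd"]
---
# RBAC grants the "use" verb on the policy, as with PodSecurityPolicy.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-team-a-schedpol
  namespace: team-a
rules:
- apiGroups: ["policy"]
  resources: ["schedulingpolicies"]
  verbs: ["use"]
  resourceNames: ["team-a-schedpol"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: use-team-a-schedpol
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: use-team-a-schedpol
subjects:
- kind: ServiceAccount
  name: default
  namespace: team-a
```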
Member

Absence of node selectors is also problematic. The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes. Any replacement for that mechanism would need to provide the same capability.

Member

SGTM

Contributor Author

Please see below some proposal which could go this way.

Member

@k82cn Mar 18, 2018

The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes.

Will that introduce an order dependence between the two admission controllers? For example, podnodeselector could add a denied nodeselector after this admission controller runs; similarly for podtolerationrestrict. Is it possible to combine those admission controllers into one, or at least document it clearly?

It's arguable that the cluster admin should configure it correctly, but that will take time to troubleshoot :)


An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.

All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed.
Member

Clarify what "merged" means. That seems potentially problematic, especially in case of computing coverage of conflicting scheduling components (policy A allowed this toleration, policy B allowed that toleration, policy C required nodeSelector component master=false, policy D allows nodeSelector component master=true, etc)

Contributor Author

As long as there were no required components or default values, merging was quite trivial, but given that need, I guess we'll have to work on it.

I'm thinking of some ways like:

  • allowed-like rules keep their additive behaviour:
    if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b => the merge produces a nodeSelector key that can be in [a,b]

  • require-like rules take precedence over allowed values => allowed values not present in the required values end up not allowed

  • for default-like and required-like conditions, we could consider either weighting policies or sorting policies (by name?) and applying a last-seen-wins rule.

Any thoughts?
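
To make these merge rules concrete, a rough sketch (hypothetical field names, not a settled API):

```yaml
# Policy A (hypothetical)
spec:
  nodeSelectors:
    allowed:
      disk: ["ssd"]
    required:
      beta.kubernetes.io/arch: ["amd64"]
---
# Policy B (hypothetical)
spec:
  nodeSelectors:
    allowed:
      disk: ["hdd"]
    default:
      beta.kubernetes.io/arch: amd64
# Merged result under the rules above:
# - allowed is additive: disk may be "ssd" or "hdd"
# - required restricts: beta.kubernetes.io/arch must be picked from ["amd64"]
# - for default-like entries, a weighting or last-seen-wins rule would decide
```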

Member

@yastij Mar 17, 2018

@arnaudmz @liggitt - to me, if a user specifies a required nodeselector master=true and adds a default nodeselector master=false under another policy, I'd expect the required one to supersede the default one.

Member

I'd prefer to have allow, deny and ignore concepts; so the result is that there must be no deny term for the requirements.

  • allow: passed by the policy
  • deny: rejected by the policy
  • ignore: the policy did not include a term for it.

if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b => the merge produces nodeSelector key can be in [a,b]

That's error-prone and complex, especially in some corner cases.

Member

@yastij Mar 19, 2018

@k82cn - I'd go with require, allow, default. To me, if something isn't explicitly stated in the SchedulingPolicy, it is denied by default.

Also, do you have any use cases for ignore policies?

cc @bsalamat

Member

I mean the single term; so the request passes only if all terms pass :)

spec:
  allowedSchedulerNames: # Describes scheduler names that are allowed
  allowedPriorityClasseNames: # Describes priority class names that are allowed
  allowedNodeSelectors: # Describes node selectors that can be used
Member

Discuss default vs required vs allowed vs forbidden

Typically, fencing nodes via selector involves requiring a specific set of labels/values (in addition to whatever else the pod wants), e.g. master=false,compute=true


As anti-affinity rules are really time-consuming, we must be able to restrict their usage with `allowedNodeSelectors`.

If `allowedNodeSelectors` is totally absent from the spec, no node selector is allowed.
Member

This doesn't make sense. A pod with no nodeSelector targets the most nodes possible. Adding more selectors constrains a pod.

Generally, you want to require a set of nodeSelector labels be present, error if the pod tries to specify nodeSelector components that conflict with that required set, and allow the pod to specify any additional nodeSelector components it wants. That is what the current podnodeselector admission plugin does

Contributor Author

Do you think we could do it this way?
I'm taking nodeSelector as a simple example to start with:

apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  nodeSelectors:
    required:
      beta.kubernetes.io/arch: ["amd64", "arm"] # pick one of those mandatory values
    default:
      beta.kubernetes.io/os: linux # Here is the default value unless specified
    allowed:
      failure-domain.beta.kubernetes.io/region: [] # any value can be specified

Given the deny-by-default design, some kind of forbidden subsection actually wouldn't make sense.

Member

@arnaudmz - I agree, given the design, forbidden doesn't make much sense.

@liggitt
Member

liggitt commented Mar 16, 2018

Cc @kubernetes/sig-auth-proposals

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Mar 16, 2018
metadata:
  name: my-schedpol
spec:
  allowedSchedulerNames: # Describes scheduler names that are allowed
Member

As Jordan has also mentioned, none of these fields should have the "allowed" prefix. They should be "schedulerNames", "priorityClassNames", etc. Then the spec for each one should have a "condition" (or a similar word) that can be set to one of "allowed", "forbidden", "default", or "required".
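
A quick sketch of the suggested shape (purely illustrative, not a settled API):

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  schedulerNames:
    condition: allowed          # one of: allowed, forbidden, default, required
    values: ["default-scheduler", "my-scheduler"]
  priorityClassNames:
    condition: default
    values: ["normal-priority"]
```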

Member

@liggitt @bsalamat - by "default", do we mean: if nothing is specified, add the element from the policy? (e.g. if an SP specifies a nodeSelector with a ruleType default, all pods with no nodeSelector will be mutated to specify the nodeSelector from the SP)?

Member

Yes, that's what I meant. One more point to add is that, in Kubernetes, we usually apply Pod policies at the granularity of namespaces. So, a user should be able to specify the namespace(s) that these rules are applied to. For example, the default priority class of Pods in namespace "ns-1" is "pc-1".

Member

@yastij Mar 16, 2018

Usually, policies such as PSP do not hold a namespace. RBAC enables this, as users can create roles that enable the verb "use" on the policy. cc @liggitt @arnaudmz

Contributor Author

@yastij: yes, that was the point of mimicking the PSP RBAC principle: using RoleBindings or ClusterRoleBindings to apply the policies at serviceaccount, namespace or cluster scope.

@bsalamat
Member

cc/ @bgrant0607


@davidopp
Member

I see that you're using the same mechanism as for PodSecurityPolicy (in particular using the use verb) to specify which users/service accounts a policy applies to, but have you considered other approaches that might be more intuitive/simpler? For example, just having the SchedulingPolicy specify a selector over namespaces? I guess the question is whether these policies should be applied based on who created the pod in question, or where the pod in question was created.

@arnaudmz
Contributor Author

Yes, scoping Scheduling Policies by namespace could make sense.

From a cluster administrator's perspective, it seemed they might have to create a schedpol each time they create an end-user-oriented namespace, to ensure users can/must provide the expected scheduling information. I'm thinking of multi-arch clusters, where beta.kubernetes.io/arch will have to be allowed (probably even enforced) cluster-wide. Non-scoped, RBAC-based schedpols make this use case a non-event.

However, if other use cases tend to favor namespaced schedpols (like ResourceQuotas and LimitRanges are), why not.
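
A hypothetical illustration of that multi-arch case, with a single non-namespaced policy bound cluster-wide (names and field layout are placeholders):

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: multi-arch
spec:
  nodeSelectors:
    required:
      beta.kubernetes.io/arch: ["amd64", "arm64"]   # every pod must pick one of these
---
# Bound cluster-wide once, instead of one schedpol per end-user namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: use-multi-arch
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: use-multi-arch          # a ClusterRole granting "use" on the policy, as sketched earlier
subjects:
- kind: Group
  name: system:serviceaccounts
  apiGroup: rbac.authorization.k8s.io
```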

@mikedanese
Member

/assign @tallclair

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 20, 2018

In this example, only nodeAffinities (required and preferred) are allowed but no podAffinities nor podAntiAffinities.

## Multiple SchedulingPolicies considerations
Member

@yastij Mar 20, 2018

@arnaudmz @liggitt @bsalamat @tallclair @davidopp - these are the options available to handle conflicts.

concerns for me:

  • Option 1 is way too complicated to handle.
  • Option 2 is more deterministic but won't let users split a SchedulingPolicy across multiple resources.

Member

Why do we have to merge them? Isn't going through the policies one by one enough?

Member

@k82cn - it is not a proper "merge", i.e. we do not aggregate SchedulingPolicy resources into a new one. We'll be going through the policies and resolving conflicts to compute the set of rules; these options are the possible ways of dealing with such conflicts.

Member

Great

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 20, 2018
@bsalamat
Member

keep the selectors at the policy level, not inside the matches. This greatly simplifies things I think (I'm in favor of your suggestion of having a labelSelector with subfields for namespaceSelector and podSelector)

Intuitively, a label selector and a namespace selector belong to the "Matcher". Can you give an example of the case that moving the label selector to the policy simplifies things?

having the stringSelector without any semantic, I think that the anyOf/NoneOf pushes some complexity to the cluster admin.

Again, an example helps understand your point.

to summarize:
policies are ANDed
simple elements inside the match are ORed (example schedulerNames)

IMO, we shouldn't care about ANDing and ORing. We should just follow these rules:

  1. A pod must match all the provided fields of the Matcher to be considered a match for the policy.
  2. Policies are sorted by their priority and then by their action (deny is ahead of allow).
  3. We evaluate the policies in order. As soon as a policy is matched against a pod, its action is returned.

some elements inside the match are more complex, for example:

Yes, for more complex fields, our matchers will need to be more complex, but in the case of your example, I think it should be broken into multiple rules. There is no point in making a single rule complex and letting it match many pods, instead of creating multiple simple rules that match the same set of pods.
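
To illustrate the evaluation order described above (hypothetical manifests; field names and priority values follow the draft but are illustrative only):

```yaml
# Evaluated first: higher priority, and deny is ordered ahead of allow.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: deny-critical-priority
spec:
  priority: cluster
  action: deny
  rules:
    namespaces:
    - key: team
      operator: Exists
    priorityClassNames:
    - match: "critical-priority"
---
# Evaluated second: only reached if the policy above did not match the pod.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: allow-default-scheduler
spec:
  priority: default
  action: allow
  rules:
    schedulerNames:
      match: ["default-scheduler"]
# A pod in a namespace carrying the "team" label that requests priorityClassName
# "critical-priority" matches every field of the first policy's matcher, so "deny"
# is returned and the second policy is never evaluated.
```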

@yastij
Member

yastij commented Aug 21, 2018

SGTM, updating the design doc, once done I'll ping for approval.

@yastij
Member

yastij commented Aug 22, 2018

The SIG thinks that this should be a new resource under the policy API group, but as it is an API matter, can @kubernetes/api-approvers comment on this?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 22, 2018

# Overview

### syntaxic standards
Member

This section is empty now (please remove)


### empty match and unset fields:

We follow Kubernetes conventions: an empty field matches everything; an unset one matches nothing
Member

I think this is backwards. Unset should match everything, empty should match the empty value.

Care will need to be taken when the unset value and empty value are the same. For example, consider:

SchedulerName []string

If I have SchedulerName = nil (or equivalently []string{}), then that MUST match any scheduler name. If I want to explicitly match the unset SchedulerName, I would need to set SchedulerName = []string{''}.

Unfortunately I think this will need to be considered on a case-by-case basis, as the behavior depends on both the governed field type (the podspec field), and the policy field type.
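
A concrete illustration of that distinction, following the semantics described in this comment (hypothetical snippets):

```yaml
# schedulerNames omitted entirely (unset): the scheduler name is not constrained.
rules: {}
---
# schedulerNames present but empty: must also match any scheduler name, because a
# nil slice and an empty slice cannot be reliably distinguished after conversion.
rules:
  schedulerNames:
    match: []
---
# Explicitly matching a blank/unset schedulerName requires the empty string as a value.
rules:
  schedulerNames:
    match: [""]
```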

Member

Any API that distinguishes between a nil slice and an empty slice is likely to be broken. The conversion semantics do not always retain the difference, and I strongly urge you to not distinguish.


## SchedulingPolicy

Proposed API group: `policy/v1alpha1`
Member

I think we should consider making this policy.k8s.io and putting it in a new policy repo.

Member

cc @kubernetes/api-approvers


### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
Member

^^

spec:
  priority: <exception,cluster,default>
  action: <allow,deny>
  rules:
Member

nit: I don't love the name rules for this field (I'd expect it to be of []Rule type), but I don't have any good suggestions.... maybe criteria?

Alternatively we could just flatten it and put the rules directly in the spec.
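
For comparison, a rough sketch of the two shapes being discussed here (illustrative only):

```yaml
# Nested under a "rules" (or "criteria") field:
spec:
  priority: default
  action: allow
  rules:
    schedulerNames:
      match: ["default-scheduler"]
---
# Flattened: the matchers live directly in the spec.
spec:
  priority: default
  action: allow
  schedulerNames:
    match: ["default-scheduler"]
```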

match: ["green-scheduler","my-scheduler"]
```

An empty list of schedulerNames will allow usage of all schedulers:
Member

Remove this. It's redundant with what's already been stated.

Member

True.

Toleration usage can be regulated using fine-grained rules with the `tolerations` field. If multiple `tolerations` are specified, the pod will be scheduled if one of the tolerations is satisfied.


#### Allowed
Member

Example. Same everywhere else.


#### Allowed

This allows pods with the following tolerations:
Member

It's unclear from this how the matching works. Does this only match pods that have exactly this set of tolerations? Or a pod that has any of these tolerations (including none?). What if I wanted to require a toleration? I think we need to be careful how we define the semantics for things that match lists (or maps).

Member

From what we've said, I think that all the matches should be satisfied to return the corresponding action.

Member

I think I agree. It's a match if the pod has all the tolerations listed. If the pod also has other tolerations, it's still a match.

@ericavonb Aug 23, 2018

How does this work with tolerations using set-based selectors? Does it require an exact match for each selector or does it check for equivalent selectors?

Example:
For a pod:

apiVersion: v1
kind: Pod
metadata:
  name: with-tolerations
  namespace: team
spec:
  tolerations:
  - key: "key1"
    operator: "In"
    value: ["value1a", "value1b"]
    effect: "NoSchedule"
  - key: "key2"
    operator: "In"
    value: ["value2a", "value2b"]
    effect: "NoSchedule"
  - key: "key3"
    operator: "Nin"
    value: ["value3a", "value4b"]
    effect: "NoExecute"
  containers:
  - name: with-tolerations
    image: k8s.gcr.io/pause:2.0

and policy:

apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow   
  rules:
    namespaces:
      - key: team
        operator: Exists
    podSelector: {}    
    tolerations:
       match:
        - key: "key1"
          operator: "Nin"
          value: ["value1c"]
          effects: "NoSchedule"
        - key: "key2"
          operator: "Exists"
          effects: "NoSchedule"
        - key: "key3"
          operator: "Nin"
          value: ["value3a"]
          effect: "NoExecute"

should this policy match this pod?

Member

Semantic matching could get complex for users. I think for set-based values, we can simplify and follow these rules:

  1. Keys must be equal.
  2. Operators must be equal.
  3. policy.spec.tolerations.match.value must be a subset of pod.spec.tolerations.value

If all of the above rules match, we consider the policy a match.

@tallclair, what do you think?
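
Applying those three rules to a small example (hypothetical set-based tolerations, as in the example above; note that the follow-up comments below lean toward requiring an exact match instead):

```yaml
# Pod toleration (hypothetical set-based form):
tolerations:
- key: "key1"
  operator: "In"
  values: ["value1a", "value1b"]
  effect: "NoSchedule"
---
# Policy match entry:
tolerations:
  match:
  - key: "key1"           # rule 1: keys are equal
    operator: "In"        # rule 2: operators are equal
    values: ["value1a"]   # rule 3: the policy's values are a subset of the pod's values
    effect: "NoSchedule"
# All three rules hold, so this entry is considered a match for the pod's toleration.
```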

Member

I think this is a hard problem, that deserves some thoughtful analysis and validation against real use cases.

policy.spec.tolerations.match.value must be a subset of ...

This is tricky because the toleration semantics sort of depend on the operator. For example, if I want to express "allow pods that are confined to these nodes", allowing a subset makes sense. However, if I want to say "allow pods that are excluded from these nodes", I don't want to allow pods that are only excluded from a subset.

Member

Maybe we should just require an exact match in this case. (modulo sort)

Member

@tallclair - we can go down this path, subset matching can be performed by splitting into multiple policies, any thoughts @bsalamat @ericavonb @arnaudmz ?

Member

sgtm, please update the proposal to capture that decision.

operator: Exists
podSelector: {}
priorityClasseNames:
- match: "critical-priority"
Member

should be a list?

In this example, pods cannot be scheduled if they have all of the following at the same time:
- `disk: ssd` nodeSelector
- `disk: hdd` nodeSelector
- `failure-domain.beta.kubernetes.io/region` nodeSelector with any value.
Member

What if they have these 3, plus one other? Or only 2? See above about matching maps.

Member

It depends on the algorithm; as we said, we should match everything to take the action? cc @bsalamat

Member

I agree with match-all for a given rule. One could create multiple policies to cover "OR" cases. For example, if you want to deny either nodeSelector "disk: ssd" or nodeSelector "disk: premium", you could create two policies with the deny action. One has

nodeSelectors:
  match:
    - disk: "ssd"

and the other:

nodeSelectors:
  match:
    - disk: "premium"
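
Spelled out as a full manifest (hypothetical API group and field layout), the first of those two deny policies might look like this:

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: deny-ssd-selector
spec:
  action: deny
  rules:
    nodeSelectors:
      match:
      - disk: "ssd"
# A second, otherwise identical policy matching disk: "premium" covers the OR case.
```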

Member

Please update the proposal to capture a decision here.

Member

@thockin left a comment

Without commenting on the proposal itself, the general guidance right now is to define new resources as CRDs unless we can demonstrate why that is infeasible.

@saad-ali @jingxu97 for a couple other early adopters


### empty match and unset fields:

We follow Kubernetes conventions: an empty field matches everything; an unset one matches nothing
Member

Any API that distinguishes between a nil slice and an empty slice is likely to be broken. The conversion semantics do not always retain the difference, and I strongly urge you to not distinguish.

@bgrant0607
Member

My suggestion at this point is to use a CRD (as @thockin mentioned) and admission webhook, implement a prototype under the kubernetes-sigs org, and collect user feedback.

Even before that, though, I suggest writing a user guide that covers critical use cases to help you understand how cluster admins and app operators would use this and what abstractions should be presented to each persona in order to map workloads to cluster resources.

@@ -56,7 +58,10 @@ for policy in sortedPolicies:
if policy matches pod: // all specified policy rules match
return policy.action
```
note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
note that:
- rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
Member

nit: I would write in active form:
rules of policies with higher priority supersede lower priority rules if they both match.

Member

Yes looks better

note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
note that:
- rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
- matching is done statically (i.e. we don't interpret logical operators).
Member

This is not clear. You should refer the reader to a section that provides more details/examples.

Member

@bsalamat - maybe the nodeAffinity section with the examples?


### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
Member

Should this be changed based on the recent change on how an empty/unset field is matched?


```

In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`)
Member

s/that nodeAffinity/with nodeAffinity/ ?

The matching here should match against the nodeAffinity rules of pods. The description is vague. I guess we need to specify that a match happens when a Pod has:

  1. All the "required" and "preferred" sections.
  2. Each section has the same keys and the same operators.
  3. Values must be the same or subset of those of the pod.
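
A small illustration of those three conditions (hypothetical values; the policy field layout mirrors the examples in this proposal):

```yaml
# Policy nodeAffinity match entry:
nodeAffinities:
  requiredDuringSchedulingIgnoredDuringExecution:
    match:
    - key: authorized-region
      operator: NotIn
      values: ["eu-1"]
---
# A matching pod carries the same required section, with the same key and operator,
# and with values equal to or a superset of the policy's values:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: authorized-region
          operator: NotIn
          values: ["eu-1", "us-1"]
```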

Member

True.

@yastij
Member

yastij commented Sep 12, 2018

@bsalamat - updated the proposal with latest comments

Member

@bsalamat left a comment

I am fine with the current proposal. Please make sure that @tallclair's points are addressed too.

@@ -70,7 +75,7 @@ Proposed API group: `policy/v1alpha1`

### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it automatically allowed.
Member

s/it automatically/it is automatically/

Contributor Author

Done


```

In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when a pod has:
Member

s/pods that nodeAffinity/pods with nodeAffinity/

@yastij
Member

yastij commented Sep 20, 2018

/lgtm

@bsalamat - updated, the proposal seems ready to merge to me.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 20, 2018
@justaugustus
Member

/kind kep

podSelector: {}
schedulerNames:
match: []
```
Member

This second policy needs a description.

namespaces:
- key: team
operator: Exists
podSelector: {}
Member

nit: drop the pod selector, it should be optional, and isn't used in this example

operator: Exists
podSelector: {}
schedulerNames:
match: []
Member

If I want to require that the scheduler name be left blank, would that be: match: ['']?


#### Allowed

This allows pods with the following tolerations:
Member

sgtm, please update the proposal to capture that decision.


### Priority classes

Priority class usage can be regulated using fine-grain rules with `priorityClasseName` field.
Member

Please fix.

In this example, pods cannot be scheduled if they have all of the following at the same time:
- `disk: ssd` nodeSelector
- `disk: hdd` nodeSelector
- `failure-domain.beta.kubernetes.io/region` nodeSelector with any value.
Member

Please update the proposal to capture a decision here.


- All the "required" and "preferred" sections.
- Each section has the same keys and the same operators.
- Values must be the same or subset of those of the pod.
Member

Can you give an example of where subset matching would be desirable?

- require that pods under a namespace run on dedicated nodes
- Restrict usage of some `PriorityClass`
- Restrict usage to a specific set of schedulers.
- enforcing pod affinity or anti-affinity rules on some particular namespace.
Member

This doesn't really count as a use case. Why do administrators want to do this?

For example, I suspect a common use case might be: "pods are not allowed to set namespaces on PodAffinityTerms" (i.e. they cannot have affinity or antiaffinity with pods outside their namespace). Given the current approach to specifying affinity & anti affinity policy, I'm not sure that's possible to express.

podSelector: {}
nodeAffinities:
requiredDuringSchedulingIgnoredDuringExecution:
match:
Member

I'm still having a hard time wrapping my head around the semantics of applying policy to such a deeply nested & logical field. For instance, it seems like it could be useful to have: matchAny, matchAll, and matchNone for these, along with wildcard matching on the specific fields.

As an example, consider the case: pods aren't allowed to schedule to the eu-region nodes.

action: deny
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["eu-1", "eu-2"]

Well, you probably want subset matching so that a user can't just do this:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["eu-1", "eu-2", "us-1"]
- key: region
  operator: "NotIn"
  values: ["us-1"]

On the other hand, you also want to prevent doing this: (suppose the full set of regions include eu-1, eu-2, us-1, us-2)

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "NotIn"
  values: ["us-1", "us-2"]

I guess what you really need to say is "this pod must be scheduled outside of the eu nodes", so the policy should be:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1", "us-2"]

Now, suppose a user wants to specifically schedule a pod in "us-1" - the admin wants to allow it:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1"]

As is, this fails because ["us-1", "us-2"] is not a subset of ["us-1"]. They could do:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1", "us-2"]
- key: region
  operator: "In"
  values: ["us-1"]

So that the required rule is there, and then further scope it down, but that conflicts with some of the other matching semantics we've already declared.


Anyhow, this was a bit rambly - but the point I'm trying to convey is that the composition & matching semantics of these fields really depend on the type of operator being used, and the specific affinity. I.e. the semantics for "podAntiAffinity" and "tolerations" should probably be different from those of pod & node affinity.

I just don't see these nuances covered in the current proposal, and I'm not totally sure they can be cleanly expressed in this model.

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@yastij
Member

yastij commented Nov 20, 2018

@justaugustus ACK, thanks

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.