
Add initial design proposal for Scheduling Policy #1937

Closed
wants to merge 10 commits into from

Conversation

arnaudmz
Contributor

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 16, 2018
@k8s-ci-robot
Contributor

@arnaudmz: GitHub didn't allow me to request PR reviews from the following users: yastij.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Ref: kubernetes/kubernetes#61185

/assign @bsalamat @liggitt
/cc @yastij

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 16, 2018
@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 16, 2018
@yastij
Member

yastij commented Mar 16, 2018

also cc @timothysc


SchedulingPolicy resources are supposed to apply in a deny-all-except approach. They are designed to apply in an additive way (i.e. ANDed). From a pod's perspective, a pod can use one or N of the allowed items.

An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.
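
For illustration, a minimal sketch of how such a policy and its RBAC binding could look, mirroring the PodSecurityPolicy `use`-verb pattern discussed further down; the API group, resource name and field names are placeholders, not a settled API:

```yaml
# Hypothetical sketch: field names, API group and resource name are illustrative only.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: team-a-schedpol
spec:
  allowedSchedulerNames: ["default-scheduler"]
  allowedPriorityClassNames: ["normal-priority"]
  allowedNodeSelectors:
    disk: ["ssd", "hdd"]
---
# RBAC grants the "use" verb on the policy, as with PodSecurityPolicy.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-team-a-schedpol
  namespace: team-a
rules:
- apiGroups: ["policy"]
  resources: ["schedulingpolicies"]
  verbs: ["use"]
  resourceNames: ["team-a-schedpol"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: use-team-a-schedpol
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: use-team-a-schedpol
subjects:
- kind: ServiceAccount
  name: default
  namespace: team-a
```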
Member

Absence of node selectors is also problematic. The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes. Any replacement for that mechanism would need to provide the same capability.

Member

SGTM

Contributor Author

Please see below some proposal which could go this way.

Member

@k82cn Mar 18, 2018

The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes.

Will that introduce an order dependence between the two admission controllers? For example, podnodeselector could add a denied nodeselector after this admission controller runs; similarly for podtolerationrestrict. Is it possible to combine those admission controllers into one, or at least document it clearly?

It's arguable that the cluster admin should configure it correctly, but that will take time to troubleshoot :)


An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.

All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed.
Member

Clarify what "merged" means. That seems potentially problematic, especially in case of computing coverage of conflicting scheduling components (policy A allowed this toleration, policy B allowed that toleration, policy C required nodeSelector component master=false, policy D allows nodeSelector component master=true, etc)

Contributor Author

As long as there were no required components or default values, merging was quite trivial, but given that need, I guess we'll have to work on it.

I'm thinking of some ways like:

  • allowed-like rules keep their additive behaviour:
    if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b => the merge produces a nodeSelector key that can be in [a,b]

  • require-like rules take precedence over allowed values => allowed values not present in the required values end up not allowed

  • for default-like and required-like conditions, we could consider either weighting policies or sorting policies (by name?) and applying a last-seen-wins rule.

Any thoughts?
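
To make these merge rules concrete, a rough sketch (hypothetical field names, not a settled API):

```yaml
# Policy A (hypothetical)
spec:
  nodeSelectors:
    allowed:
      disk: ["ssd"]
    required:
      beta.kubernetes.io/arch: ["amd64"]
---
# Policy B (hypothetical)
spec:
  nodeSelectors:
    allowed:
      disk: ["hdd"]
    default:
      beta.kubernetes.io/arch: amd64
# Merged result under the rules above:
# - allowed is additive: disk may be "ssd" or "hdd"
# - required restricts: beta.kubernetes.io/arch must be picked from ["amd64"]
# - for default-like entries, a weighting or last-seen-wins rule would decide
```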

Member

@yastij Mar 17, 2018

@arnaudmz @liggitt - to me, if a user specifies a required nodeselector master=true and adds a default nodeselector master=false under another policy, I'd expect the required one to supersede the default one.

Member

I'd prefer to have allow, deny and ignore concepts; so the result is that there must be no deny term for the requirements.

  • allow: passed by the policy
  • deny: rejected by the policy
  • ignore: the policy did not include a term for it.

if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b => the merge produces nodeSelector key can be in [a,b]

That's error-prone and complex, especially in some corner cases.

Member

@yastij Mar 19, 2018

@k82cn - I'd go with require, allow, default. To me, if something isn't explicitly stated in the SchedulingPolicy, it is denied by default.

Also, do you have any use cases for ignore policies?

cc @bsalamat

Member

I mean the single term; so the request passes only if all terms pass :)

spec:
  allowedSchedulerNames: # Describes scheduler names that are allowed
  allowedPriorityClasseNames: # Describes priority class names that are allowed
  allowedNodeSelectors: # Describes node selectors that can be used
Member

Discuss default vs required vs allowed vs forbidden

Typically, fencing nodes via selector involves requiring a specific set of labels/values (in addition to whatever else the pod wants), e.g. master=false,compute=true


As anti-affinity rules are really time-consuming, we must be able to restrict their usage with `allowedNodeSelectors`.

If `allowedNodeSelectors` is totally absent from the spec, no node selector is allowed.
Member

This doesn't make sense. A pod with no nodeSelector targets the most nodes possible. Adding more selectors constrains a pod.

Generally, you want to require a set of nodeSelector labels be present, error if the pod tries to specify nodeSelector components that conflict with that required set, and allow the pod to specify any additional nodeSelector components it wants. That is what the current podnodeselector admission plugin does

Contributor Author

Do you think we could do it this way?
I'm taking nodeSelector as a simple example to start with:

apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  nodeSelectors:
    required:
      beta.kubernetes.io/arch: ["amd64", "arm"] # pick one of those mandatory values
    default:
      beta.kubernetes.io/os: linux # Here is the default value unless specified
    allowed:
      failure-domain.beta.kubernetes.io/region: [] # any value can be specified

Given the deny-by-default design, some kind of forbidden subsection actually wouldn't make sense.

Member

@arnaudmz - I agree, given the design, forbidden doesn't make much sense.

@liggitt
Member

liggitt commented Mar 16, 2018

Cc @kubernetes/sig-auth-proposals

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Mar 16, 2018
metadata:
  name: my-schedpol
spec:
  allowedSchedulerNames: # Describes scheduler names that are allowed
Member

As Jordan has also mentioned, none of these fields should have the "allowed" prefix. They should be "schedulerNames", "priorityClassNames", etc. Then the spec for each one should have a "condition" (or a similar word) that can be set to one of "allowed", "forbidden", "default", or "required".
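
A quick sketch of the suggested shape (purely illustrative, not a settled API):

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  schedulerNames:
    condition: allowed          # one of: allowed, forbidden, default, required
    values: ["default-scheduler", "my-scheduler"]
  priorityClassNames:
    condition: default
    values: ["normal-priority"]
```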

Member

@liggitt @bsalamat - by "default", do we mean: if nothing is specified, add the element from the policy? (e.g. if an SP specifies a nodeSelector with a ruleType default, all pods with no nodeSelector will be mutated to specify the nodeSelector from the SP)?

Member

Yes, that's what I meant. One more point to add is that, in Kubernetes, we usually apply Pod policies at the granularity of namespaces. So, a user should be able to specify the namespace(s) that these rules are applied to. For example, the default priority class of Pods in namespace "ns-1" is "pc-1".

Member

@yastij Mar 16, 2018

Usually, policies such as PSP do not hold a namespace. RBAC enables this, as users can create roles that enable the verb "use" on the policy. cc @liggitt @arnaudmz

Contributor Author

@yastij: yes, that was the point of mimicking the PSP RBAC principle: using RoleBindings or ClusterRoleBindings to apply the policies at serviceaccount, namespace or cluster scope.

@bsalamat
Member

cc/ @bgrant0607


@davidopp
Member

I see that you're using the same mechanism as for PodSecurityPolicy (in particular using the use verb) to specify which users/service accounts a policy applies to, but have you considered other approaches that might be more intuitive/simpler? For example, just having the SchedulingPolicy specify a selector over namespaces? I guess the question is whether these policies should be applied based on who created the pod in question, or where the pod in question was created.

@arnaudmz
Contributor Author

Yes, scoping Scheduling Policies by namespace could make sense.

From a cluster administrator's perspective, it seemed they might have to create a schedpol each time they create an end-user-oriented namespace, to ensure users can/must provide the expected scheduling information. I'm thinking of multi-arch clusters, where beta.kubernetes.io/arch will have to be allowed (probably even enforced) cluster-wide. Non-scoped, RBAC-based schedpols make this use case a non-event.

However, if other use cases tend to favor namespaced schedpols (like ResourceQuotas and LimitRanges are), why not.
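
A hypothetical illustration of that multi-arch case, with a single non-namespaced policy bound cluster-wide (names and field layout are placeholders):

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: multi-arch
spec:
  nodeSelectors:
    required:
      beta.kubernetes.io/arch: ["amd64", "arm64"]   # every pod must pick one of these
---
# Bound cluster-wide once, instead of one schedpol per end-user namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: use-multi-arch
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: use-multi-arch          # a ClusterRole granting "use" on the policy, as sketched earlier
subjects:
- kind: Group
  name: system:serviceaccounts
  apiGroup: rbac.authorization.k8s.io
```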

@mikedanese
Member

/assign @tallclair

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 20, 2018

In this example, only nodeAffinities (required and preferred) are allowed but no podAffinities nor podAntiAffinities.

## Multiple SchedulingPolicies considerations
Member

@yastij Mar 20, 2018

@arnaudmz @liggitt @bsalamat @tallclair @davidopp - these are the options available to handle conflicts.

concerns for me:

  • Option 1 is way too complicated to handle.
  • Option 2 is more deterministic but won't let users split a SchedulingPolicy across multiple resources.

Member

Why do we have to merge them? Isn't going through the policies one by one enough?

Member

@k82cn - it is not a proper "merge", i.e. we do not aggregate SchedulingPolicy resources into a new one. We'll be going through the policies and resolving conflicts to compute the set of rules; these options are the possible ways of dealing with such conflicts.

Member

Great

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 20, 2018
@bsalamat
Member

keep the selectors at the policy level, not inside the matches. This greatly simplifies things I think (I'm in favor of your suggestion of having a labelSelector with subfields for namespaceSelector and podSelector)

Intuitively, a label selector and a namespace selector belong to the "Matcher". Can you give an example of the case that moving the label selector to the policy simplifies things?

having the stringSelector without any semantic, I think that the anyOf/NoneOf pushes some complexity to the cluster admin.

Again, an example helps understand your point.

to summarize:
policies are ANDed
simple elements inside the match are ORed (example schedulerNames)

IMO, we shouldn't care about ANDing and ORing. We should just follow these rules:

  1. A pod must match all the provided fields of the Matcher to be considered a match for the policy.
  2. Policies are sorted by their priority and then by their action (deny is ahead of allow).
  3. We evaluate the policies in order. As soon as a policy is matched against a pod, its action is returned.

some elements inside the match are more complex, for example:

Yes, for more complex fields, our matchers will need to be more complex, but in the case of your example, I think it should be broken into multiple rules. There is no point in making a single rule complex and letting it match many pods, instead of creating multiple simple rules that match the same set of pods.
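
To illustrate the evaluation order described above (hypothetical manifests; field names and priority values follow the draft but are illustrative only):

```yaml
# Evaluated first: higher priority, and deny is ordered ahead of allow.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: deny-critical-priority
spec:
  priority: cluster
  action: deny
  rules:
    namespaces:
    - key: team
      operator: Exists
    priorityClassNames:
    - match: "critical-priority"
---
# Evaluated second: only reached if the policy above did not match the pod.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: allow-default-scheduler
spec:
  priority: default
  action: allow
  rules:
    schedulerNames:
      match: ["default-scheduler"]
# A pod in a namespace carrying the "team" label that requests priorityClassName
# "critical-priority" matches every field of the first policy's matcher, so "deny"
# is returned and the second policy is never evaluated.
```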

@yastij
Member

yastij commented Aug 21, 2018

SGTM, updating the design doc, once done I'll ping for approval.

@yastij
Member

yastij commented Aug 22, 2018

The SIG thinks that this should be a new resource under the policy API group, but as it is an API matter, can @kubernetes/api-approvers comment on this?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 22, 2018

# Overview

### syntaxic standards
Member

This section is empty now (please remove)


### empty match and unset fields:

We follow Kubernetes conventions: an empty field matches everything; an unset one matches nothing
Member

I think this is backwards. Unset should match everything, empty should match the empty value.

Care will need to be taken when the unset value and empty value are the same. For example, consider:

SchedulerName []string

If I have SchedulerName = nil (or equivalently []string{}), then that MUST match any scheduler name. If I want to explicitly match the unset SchedulerName, I would need to set SchedulerName = []string{''}.

Unfortunately I think this will need to be considered on a case-by-case basis, as the behavior depends on both the governed field type (the podspec field), and the policy field type.
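
A concrete illustration of that distinction, following the semantics described in this comment (hypothetical snippets):

```yaml
# schedulerNames omitted entirely (unset): the scheduler name is not constrained.
rules: {}
---
# schedulerNames present but empty: must also match any scheduler name, because a
# nil slice and an empty slice cannot be reliably distinguished after conversion.
rules:
  schedulerNames:
    match: []
---
# Explicitly matching a blank/unset schedulerName requires the empty string as a value.
rules:
  schedulerNames:
    match: [""]
```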

Member

Any API that distinguishes between a nil slice and an empty slice is likely to be broken. The conversion semantics do not always retain the difference, and I strongly urge you to not distinguish.


## SchedulingPolicy

Proposed API group: `policy/v1alpha1`
Member

I think we should consider making this policy.k8s.io and putting it in a new policy repo.

Member

cc @kubernetes/api-approvers


### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
Member

^^

spec:
  priority: <exception,cluster,default>
  action: <allow,deny>
  rules:
Member

nit: I don't love the name rules for this field (I'd expect it to be of []Rule type), but I don't have any good suggestions.... maybe criteria?

Alternatively we could just flatten it and put the rules directly in the spec.
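
For comparison, a rough sketch of the two shapes being discussed here (illustrative only):

```yaml
# Nested under a "rules" (or "criteria") field:
spec:
  priority: default
  action: allow
  rules:
    schedulerNames:
      match: ["default-scheduler"]
---
# Flattened: the matchers live directly in the spec.
spec:
  priority: default
  action: allow
  schedulerNames:
    match: ["default-scheduler"]
```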

match: ["green-scheduler","my-scheduler"]
```

An empty list of schedulerNames will allow usage of all schedulers:
Member

Remove this. It's redundant with what's already been stated.

Member

True.

Toleration usage can be regulated using fine-grained rules with the `tolerations` field. If multiple `tolerations` are specified, the pod will be scheduled if one of the tolerations is satisfied.


#### Allowed
Member

Example. Same everywhere else.


#### Allowed

This allows pods with the following tolerations:
Member

It's unclear from this how the matching works. Does this only match pods that have exactly this set of tolerations? Or a pod that has any of these tolerations (including none?). What if I wanted to require a toleration? I think we need to be careful how we define the semantics for things that match lists (or maps).

Member

From what we've said, I think that all the matches should be satisfied to return the corresponding action.

Member

I think I agree. It's a match if the pod has all the tolerations listed. If the pod also has other tolerations, it's still a match.

@ericavonb Aug 23, 2018

How does this work with tolerations using set-based selectors? Does it require an exact match for each selector or does it check for equivalent selectors?

Example:
For a pod:

apiVersion: v1
kind: Pod
metadata:
  name: with-tolerations
  namespace: team
spec:
  tolerations:
  - key: "key1"
    operator: "In"
    value: ["value1a", "value1b"]
    effect: "NoSchedule"
  - key: "key2"
    operator: "In"
    value: ["value2a", "value2b"]
    effect: "NoSchedule"
  - key: "key3"
    operator: "Nin"
    value: ["value3a", "value4b"]
    effect: "NoExecute"
  containers:
  - name: with-tolerations
    image: k8s.gcr.io/pause:2.0

and policy:

apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow   
  rules:
    namespaces:
      - key: team
        operator: Exists
    podSelector: {}    
    tolerations:
       match:
        - key: "key1"
          operator: "Nin"
          value: ["value1c"]
          effects: "NoSchedule"
        - key: "key2"
          operator: "Exists"
          effects: "NoSchedule"
        - key: "key3"
          operator: "Nin"
          value: ["value3a"]
          effect: "NoExecute"

should this policy match this pod?

Member

Semantic matching could get complex for users. I think for set-based values, we can simplify and follow these rules:

  1. Keys must be equal.
  2. Operators must be equal.
  3. policy.spec.tolerations.match.value must be a subset of pod.spec.tolerations.value

If all of the above rules match, we consider the policy a match.

@tallclair, what do you think?
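
Applying those three rules to a small example (hypothetical set-based tolerations, as in the example above; note that the follow-up comments below lean toward requiring an exact match instead):

```yaml
# Pod toleration (hypothetical set-based form):
tolerations:
- key: "key1"
  operator: "In"
  values: ["value1a", "value1b"]
  effect: "NoSchedule"
---
# Policy match entry:
tolerations:
  match:
  - key: "key1"           # rule 1: keys are equal
    operator: "In"        # rule 2: operators are equal
    values: ["value1a"]   # rule 3: the policy's values are a subset of the pod's values
    effect: "NoSchedule"
# All three rules hold, so this entry is considered a match for the pod's toleration.
```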

Member

I think this is a hard problem, that deserves some thoughtful analysis and validation against real use cases.

policy.spec.tolerations.match.value must be a subset of ...

This is tricky because the toleration semantics sort of depend on the operator. For example, if I want to express "allow pods that are confined to these nodes", allowing a subset makes sense. However, if I want to say "allow pods that are excluded from these nodes", I don't want to allow pods that are only excluded from a subset.

Member

Maybe we should just require an exact match in this case. (modulo sort)

Member

@tallclair - we can go down this path, subset matching can be performed by splitting into multiple policies, any thoughts @bsalamat @ericavonb @arnaudmz ?

Member

sgtm, please update the proposal to capture that decision.

operator: Exists
podSelector: {}
priorityClasseNames:
- match: "critical-priority"
Member

should be a list?

In this example, pods cannot be scheduled if they have all of the following at the same time:
- `disk: ssd` nodeSelector
- `disk: hdd` nodeSelector
- `failure-domain.beta.kubernetes.io/region` nodeSelector with any value.
Member

What if they have these 3, plus one other? Or only 2? See above about matching maps.

Member

It depends on the algorithm; as we said, we should match everything to take the action? cc @bsalamat

Member

I agree with match-all for a given rule. One could create multiple policies to cover "OR" cases. For example, if you want to deny either nodeSelector "disk: ssd" or nodeSelector "disk: premium", you could create two policies with the deny action. One has

nodeSelectors:
  match:
    - disk: "ssd"

and the other:

nodeSelectors:
  match:
    - disk: "premium"
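
Spelled out as a full manifest (hypothetical API group and field layout), the first of those two deny policies might look like this:

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: deny-ssd-selector
spec:
  action: deny
  rules:
    nodeSelectors:
      match:
      - disk: "ssd"
# A second, otherwise identical policy matching disk: "premium" covers the OR case.
```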

Member

Please update the proposal to capture a decision here.

Member

@thockin left a comment

Without commenting on the proposal itself, the general guidance right now is to define new resources as CRDs unless we can demonstrate why that is infeasible.

@saad-ali @jingxu97 for a couple other early adopters


### empty match and unset fields:

We follow Kubernetes conventions: an empty field matches everything; an unset one matches nothing
Member

Any API that distinguishes between a nil slice and an empty slice is likely to be broken. The conversion semantics do not always retain the difference, and I strongly urge you to not distinguish.

@bgrant0607
Member

My suggestion at this point is to use a CRD (as @thockin mentioned) and admission webhook, implement a prototype under the kubernetes-sigs org, and collect user feedback.

Even before that, though, I suggest writing a user guide that covers critical use cases to help you understand how cluster admins and app operators would use this and what abstractions should be presented to each persona in order to map workloads to cluster resources.

@@ -56,7 +58,10 @@ for policy in sortedPolicies:
if policy matches pod: // all specified policy rules match
return policy.action
```
note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
note that:
- rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
Member

nit: I would write in active form:
rules of policies with higher priority supersede lower priority rules if they both match.

Member

Yes looks better

note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
note that:
- rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
- matching is done statically (i.e. we don't interpret logical operators).
Member

This is not clear. You should refer the reader to a section that provides more details/examples.

Member

@bsalamat - maybe the nodeAffinity section with the examples?


### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
Member

Should this be changed based on the recent change on how an empty/unset field is matched?


```

In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`)
Member

s/that nodeAffinity/with nodeAffinity/ ?

The matching here should match against the nodeAffinity rules of pods. The description is vague. I guess we need to specify that a match happens when a Pod has:

  1. All the "required" and "preferred" sections.
  2. Each section has the same keys and the same operators.
  3. Values must be the same or subset of those of the pod.
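
A small illustration of those three conditions (hypothetical values; the policy field layout mirrors the examples in this proposal):

```yaml
# Policy nodeAffinity match entry:
nodeAffinities:
  requiredDuringSchedulingIgnoredDuringExecution:
    match:
    - key: authorized-region
      operator: NotIn
      values: ["eu-1"]
---
# A matching pod carries the same required section, with the same key and operator,
# and with values equal to or a superset of the policy's values:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: authorized-region
          operator: NotIn
          values: ["eu-1", "us-1"]
```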

Member

True.

@yastij
Member

yastij commented Sep 12, 2018

@bsalamat - updated the proposal with latest comments

Member

@bsalamat left a comment

I am fine with the current proposal. Please make sure that @tallclair's points are addressed too.

@@ -70,7 +75,7 @@ Proposed API group: `policy/v1alpha1`

### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it automatically allowed.
Member

s/it automatically/it is automatically/

Contributor Author

Done


```

In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when a pod has:
Member

s/pods that nodeAffinity/pods with nodeAffinity/

@yastij
Member

yastij commented Sep 20, 2018

/lgtm

@bsalamat - updated, the proposal seems ready to merge to me.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 20, 2018
@justaugustus
Member

/kind kep

podSelector: {}
schedulerNames:
match: []
```
Member

This second policy needs a description.

namespaces:
- key: team
operator: Exists
podSelector: {}
Member

nit: drop the pod selector, it should be optional, and isn't used in this example

operator: Exists
podSelector: {}
schedulerNames:
match: []
Member

If I want to require that the scheduler name be left blank, would that be: match: ['']?


#### Allowed

This allows pods with the following tolerations:
Member

sgtm, please update the proposal to capture that decision.


### Priority classes

Priority class usage can be regulated using fine-grain rules with `priorityClasseName` field.
Member

Please fix.

In this example, pods cannot be scheduled if they have all of the following at the same time:
- `disk: ssd` nodeSelector
- `disk: hdd` nodeSelector
- `failure-domain.beta.kubernetes.io/region` nodeSelector with any value.
Member

Please update the proposal to capture a decision here.


- All the "required" and "preferred" sections.
- Each section has the same keys and the same operators.
- Values must be the same or subset of those of the pod.
Member

Can you give an example of where subset matching would be desirable?

- require that pods under a namespace run on dedicated nodes
- Restrict usage of some `PriorityClass`
- Restrict usage to a specific set of schedulers.
- enforcing pod affinity or anti-affinity rules on some particular namespace.
Member

This doesn't really count as a use case. Why do administrators want to do this?

For example, I suspect a common use case might be: "pods are not allowed to set namespaces on PodAffinityTerms" (i.e. they cannot have affinity or antiaffinity with pods outside their namespace). Given the current approach to specifying affinity & anti affinity policy, I'm not sure that's possible to express.

podSelector: {}
nodeAffinities:
requiredDuringSchedulingIgnoredDuringExecution:
match:
Member

I'm still having a hard time wrapping my head around the semantics of applying policy to such a deeply nested & logical field. For instance, it seems like it could be useful to have: matchAny, matchAll, and matchNone for these, along with wildcard matching on the specific fields.

As an example, consider the case: pods aren't allowed to schedule to the eu-region nodes.

action: deny
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["eu-1", "eu-2"]

Well, you probably want subset matching so that a user can't just do this:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["eu-1", "eu-2", "us-1"]
- key: region
  operator: "NotIn"
  values: ["us-1"]

On the other hand, you also want to prevent doing this: (suppose the full set of regions include eu-1, eu-2, us-1, us-2)

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "NotIn"
  values: ["us-1", "us-2"]

I guess what you really need to say is "this pod must be scheduled outside of the eu nodes", so the policy should be:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1", "us-2"]

Now, suppose a user wants to specifically schedule a pod in "us-1" - the admin wants to allow it:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1"]

As is, this fails because ["us-1", "us-2"] is not a subset of ["us-1"]. They could do:

requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1", "us-2"]
- key: region
  operator: "In"
  values: ["us-1"]

So that the required rule is there, and then further scope it down, but that conflicts with some of the other matching semantics we've already declared.


Anyhow, this was a bit rambly - but the point I'm trying to convey is that the composition & matching semantics of these fields really depend on the type of operator being used, and the specific affinity. I.e. the semantics for "podAntiAffinity" and "tolerations" should probably be different from those of pod & node affinity.

I just don't see these nuances covered in the current proposal, and I'm not totally sure they can be cleanly expressed in this model.

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@yastij
Member

yastij commented Nov 20, 2018

@justaugustus ACK, thanks

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.