Add initial design proposal for Scheduling Policy #1937

# Scheduling Policy

_Status: Draft_

_Authors: @arnaudmz, @yastij_

_Reviewers: @bsalamat, @tallclair, @liggitt_

# Objectives

- Define the concept of scheduling policies
- Propose their initial design and scope

## Non-Goals

- How taints / tolerations work.
- How NodeSelector works.
- How node / pod affinity / anti-affinity rules work.
- How several schedulers can be used within a single cluster.
- How priority classes work.
- How to set defaults in Kubernetes.
> **Review comment:** Is there a mechanism for opting out that should be discussed here? That is, when the policy object is present, is there a way to temporarily disable it without deleting it?
>
> **Review comment:** Do you have a use case in mind that couldn't be accomplished with label selectors?

# Background

While architecting real-life Kubernetes clusters, we encountered contexts where role isolation (between cluster administration and plain namespace usage in a multi-tenant setting) could be improved. Today, no restriction can be placed on tolerations, priority class usage, nodeSelector, or anti-affinity based on user permissions (RBAC).

The identified use cases aim to give administrators a way to restrict users or namespaces. Scheduling policies allow administrators to:

- Restrict execution of specific applications (which are namespace-scoped) to certain nodes
- Create policies that prevent users from even attempting to schedule workloads onto masters, to maximize security
- Require that pods in a namespace run on dedicated nodes
> **Review comment:** How to specify dedicated nodes? Use node labels?
>
> **Review comment:** Taint/toleration and nodeSelector.
- Restrict usage of certain `PriorityClass`es
- Restrict usage to a specific set of schedulers
- Enforce pod affinity or anti-affinity rules in particular namespaces
> **Review comment:** This doesn't really count as a use case. Why do administrators want to do this?
>
> For example, I suspect a common use case might be: "pods are not allowed to set namespaces on PodAffinityTerms" (i.e. they cannot have affinity or anti-affinity with pods outside their namespace). Given the current approach to specifying affinity and anti-affinity policy, I'm not sure that's possible to express.

Mandatory values must also be enforced by scheduling policies in the case of:

- multi-arch (amd64, arm64) or multi-OS (Linux, Windows) clusters (will also be handled later by the [RuntimeClass]() KEP)
- multi-az / region / failure domain clusters

# Overview

The SchedulingPolicy will live out-of-tree under the kubernetes-sigs org and will use a CRD-based approach.

## Syntactic standards

### Empty match and unset fields

An empty field and an unset field both match everything.

### How policies are computed

Policies are computed using the following algorithm:

```
sortedPolicies = sort_by_priority(sort_deny_first(policies))
for policy in sortedPolicies:
    if policy matches pod: // all specified policy rules match
        return policy.action
```
Note that:
- rules of a higher-priority policy supersede those of a lower-priority policy when both match;
- matching is done statically, i.e. logical operators are not interpreted (see the nodeAffinity section for more details);
- matching is considered true if a subset of a set-based field is matched (see the sketch below).
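
The following Go sketch illustrates this evaluation order. It is not part of the proposal's API: the `Policy` and `Pod` types, the integer priority mapping, and the `matches` helper are illustrative placeholders.

```go
package schedulingpolicy

import "sort"

// Action mirrors spec.action in the proposed object.
type Action string

const (
	Allow Action = "allow"
	Deny  Action = "deny"
)

// Policy is a placeholder for a SchedulingPolicy; here priority is an
// integer (e.g. exception > cluster > default mapped to descending values).
type Policy struct {
	Priority int
	Action   Action
	// rule fields (schedulerNames, tolerations, ...) omitted
}

// Pod is a placeholder for the fields the rules inspect.
type Pod struct{}

// matches reports whether every rule field set on the policy matches the
// pod; unset or empty rule fields match everything (placeholder only).
func matches(p Policy, pod Pod) bool { return true }

// Evaluate walks policies from highest to lowest priority, deny before
// allow within the same priority, and returns the first matching action.
// When no policy matches, the pod is allowed.
func Evaluate(policies []Policy, pod Pod) Action {
	sort.SliceStable(policies, func(i, j int) bool {
		if policies[i].Priority != policies[j].Priority {
			return policies[i].Priority > policies[j].Priority
		}
		return policies[i].Action == Deny && policies[j].Action == Allow
	})
	for _, p := range policies {
		if matches(p, pod) {
			return p.Action
		}
	}
	return Allow
}
```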



# Detailed Design


## SchedulingPolicy

Proposed API group: `policy/v1alpha1`
> **Review comment:** I think we should consider making this `policy.k8s.io` and putting it in a new policy repo.
>
> **Review comment:** cc @kubernetes/api-approvers



### SchedulingPolicy content

A SchedulingPolicy spec is composed of optional fields that express scheduling rules. If a field is absent from a `SchedulingPolicy`, it is automatically allowed.

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  priority: <exception,cluster,default>
  action: <allow,deny>
  rules:
    namespaceSelector:
    - key: value
      operator: operator
    podSelector:
      key: value
    schedulerNames:
      match: []
    priorityClassNames:
      match: []
    tolerations:
      match: []
    nodeSelectors:
      match: []
    nodeAffinities:
      match: []
    podAntiAffinities:
      match: []
    podAffinities:
      match: []
```

> **Review comment:** nit: I don't love the name `rules` for this field (I'd expect it to be of `[]Rule` type), but I don't have any good suggestions... maybe `criteria`? Alternatively we could just flatten it and put the rules directly in the spec.
>
> **Review comment:** Not sure about `podAffinities`. What kind of use cases do you have in mind for pod affinities in scheduling policy?
>
> **Review comment (@yastij, Jun 8, 2018):** The use case I have seen is a cache-trashing application: you want to enforce that it specifies a podAffinity to other cache-trashing applications in order to co-localize them.
>
> **Review comment:** In the case of a cache-trashing application, I think it makes more sense to limit them to a set of nodes, instead of forcing them to specify podAffinity towards one another.

- `allow`: when a policy matches, the pod is allowed
- `deny`: when a policy matches, the pod is denied
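
For illustration only, the spec above could map onto Go API types roughly as follows. The field and type names are assumptions drawn from the YAML, not a committed API:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// SchedulingPolicy is a sketch of the proposed CRD's Go types.
type SchedulingPolicy struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec SchedulingPolicySpec `json:"spec"`
}

type SchedulingPolicySpec struct {
	// Priority is one of "exception", "cluster" or "default".
	Priority string `json:"priority,omitempty"`
	// Action is "allow" or "deny".
	Action string `json:"action"`
	Rules  Rules  `json:"rules,omitempty"`
}

// Rules collects the optional matching criteria; absent fields match everything.
type Rules struct {
	NamespaceSelector  []metav1.LabelSelectorRequirement `json:"namespaceSelector,omitempty"`
	PodSelector        map[string]string                 `json:"podSelector,omitempty"`
	SchedulerNames     *StringMatch                      `json:"schedulerNames,omitempty"`
	PriorityClassNames *StringMatch                      `json:"priorityClassNames,omitempty"`
	// tolerations, nodeSelectors, nodeAffinities, podAffinities and
	// podAntiAffinities would follow the same match-list pattern.
}

// StringMatch holds a list of accepted values; a nil or empty list matches everything.
type StringMatch struct {
	Match []string `json:"match,omitempty"`
}
```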


### Scoping

The `spec.rules.namespaceSelector` and `spec.rules.podSelector` attributes scope the set of pods a policy applies to.

Pods must match both the `spec.rules.namespaceSelector` and `spec.rules.podSelector` rules to be constrained by a policy.
Both selectors are optional. If absent, all pods in all namespaces are targeted by the policy.
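
A minimal sketch of this scoping check, assuming label-selector semantics from `k8s.io/apimachinery` (the proposal does not prescribe an implementation):

```go
package schedulingpolicy

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// inScope reports whether a pod is constrained by a policy: both the
// namespace selector and the pod selector must match, and an absent
// (nil) selector matches everything, as described above.
func inScope(nsSelector, podSelector *metav1.LabelSelector,
	nsLabels, podLabels map[string]string) (bool, error) {
	match := func(sel *metav1.LabelSelector, lbls map[string]string) (bool, error) {
		if sel == nil {
			return true, nil // absent selector targets everything
		}
		s, err := metav1.LabelSelectorAsSelector(sel)
		if err != nil {
			return false, err
		}
		return s.Matches(labels.Set(lbls)), nil
	}
	ok, err := match(nsSelector, nsLabels)
	if err != nil || !ok {
		return false, err
	}
	return match(podSelector, podLabels)
}
```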

> **Review comment:** How about `default`? `default` would specify what value the specification should get by default when a pod does not specify it.
>
> **Review comment (@yastij, Jun 8, 2018):** Defaults will be addressed in another design proposal. The reason is that conflict resolution for defaults isn't the same as for policy.

### Scheduler name

If `schedulerNames` is absent from `allowed`, `default` or `required`, no scheduler is allowed by this specific policy.
> **Review comment:** `required` is no more? I'm not sure what this means... see my comment above about matching the empty value.
>
> Regarding "no scheduler is allowed by this specific policy": if this is a common use case, we may want a way to express "any value that is set" (i.e. "deny any value that is set").
>
> **Review comment:** Hmm, I guess that would be a use for `noneOf`, which I said I didn't think we need...
>
> **Review comment:** Isn't it enough to allow an empty scheduler name with higher priority and deny a nil scheduler name with lower priority to implement "no scheduler name is allowed"?
>
> **Review comment:** Sort of; the problem comes when you want to have multiple rules like that. Now your exception case needs to encompass all the allowed field options. But maybe that's OK? Again, it would help to have some use cases to work through.
>
> **Review comment:** +1 for use cases and the actual policy/policies needed to accomplish each use case. Bonus points for reasoning through how those policies would interact with other policies present.



#### Allowed
Allow pods in namespaces labeled `team` to use either the `green-scheduler` or the `my-scheduler` scheduler (selected by setting `spec.schedulerName` in the pod definition, e.g. `spec.schedulerName: "my-scheduler"`):

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    schedulerNames:
      match: ["green-scheduler", "my-scheduler"]
```

> **Review comment:** nit: drop the pod selector; it should be optional, and isn't used in this example.

An empty list of `schedulerNames` will allow usage of all schedulers:

> **Review comment:** Remove this. It's redundant with what's already been stated.
>
> **Review comment:** True.

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    schedulerNames:
      match: []
```

> **Review comment:** If I want to require that the scheduler name be left blank, would that be `match: ['']`?

> **Review comment:** This second policy needs a description.

Note: this policy has no effect, since everything is allowed when no policy is set.

### Tolerations

Toleration usage can be regulated using fine-grained rules in the `tolerations` field. If multiple `tolerations` are specified, the pod will be scheduled if one of the tolerations is satisfied.
> **Review comment (@easeway, May 24, 2018):** After reading scheduler names and tolerations, I feel it would be better to change `required` to `require_any`, for less confusion.
>
> **Review comment:** Since the matching behavior is the same for all action types, would you also want `allow_any` and `deny_any`?

#### Allowed

This allows pods with the following tolerations:
> **Review comment:** It's unclear from this how the matching works. Does this only match pods that have exactly this set of tolerations? Or a pod that has any of these tolerations (including none)? What if I wanted to require a toleration? I think we need to be careful how we define the semantics for things that match lists (or maps).
>
> **Review comment:** From what we've said, I think that all the matches should be satisfied to return the corresponding action.
>
> **Review comment:** I think I agree. It's a match if the pod has all the tolerations listed. If the pod also has other tolerations, it's still a match.
>
> **Review comment (@ericavonb, Aug 23, 2018):** How does this work with tolerations using set-based selectors? Does it require an exact match for each selector or does it check for equivalent selectors?
>
> Example: for a pod
>
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       name: with-tolerations
>       namespace: team
>     spec:
>       tolerations:
>       - key: "key1"
>         operator: "In"
>         value: ["value1a", "value1b"]
>         effect: "NoSchedule"
>       - key: "key2"
>         operator: "In"
>         value: ["value2a", "value2b"]
>         effect: "NoSchedule"
>       - key: "key3"
>         operator: "Nin"
>         value: ["value3a", "value4b"]
>         effect: "NoExecute"
>       containers:
>       - name: with-tolerations
>         image: k8s.gcr.io/pause:2.0
>
> and policy
>
>     apiVersion: policy/v1alpha1
>     kind: SchedulingPolicy
>     metadata:
>       name: mySchedulingPolicy
>     spec:
>       action: allow
>       rules:
>         namespaces:
>         - key: team
>           operator: Exists
>         podSelector: {}
>         tolerations:
>           match:
>           - key: "key1"
>             operator: "Nin"
>             value: ["value1c"]
>             effects: "NoSchedule"
>           - key: "key2"
>             operator: "Exists"
>             effects: "NoSchedule"
>           - key: "key3"
>             operator: "Nin"
>             value: ["value3a"]
>             effect: "NoExecute"
>
> should this policy match this pod?
>
> **Review comment:** Semantic matching could get complex for users. I think for set-based values, we can simplify and follow these rules:
>
> 1. Keys must be equal.
> 2. Operators must be equal.
> 3. `policy.spec.tolerations.match.value` must be a subset of `pod.spec.tolerations.value`.
>
> If all of the above rules match, we consider the policy a match. @tallclair, what do you think?
>
> **Review comment:** I think this is a hard problem that deserves some thoughtful analysis and validation against real use cases.
>
> Regarding "`policy.spec.tolerations.match.value` must be a subset of ...": this is tricky because the tolerance is sort of dependent on the operator. For example, if I want to express "allow pods that are confined to these nodes", allowing a subset makes sense. However, if I want to say "allow pods that are excluded from these nodes", I don't want to allow pods that are only excluded from a subset.
>
> **Review comment:** Maybe we should just require an exact match in this case (modulo sort).
>
> **Review comment:** @tallclair - we can go down this path; subset matching can be performed by splitting into multiple policies. Any thoughts @bsalamat @ericavonb @arnaudmz?
>
> **Review comment:** sgtm, please update the proposal to capture that decision.

- tolerations that tolerate taints with key `mykey`, value `value`, and the `NoSchedule` effect.
- tolerations that tolerate taints with key `other_key` and the `NoExecute` effect.

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    tolerations:
      match:
      - key: "mykey"
        operator: "Equal"
        value: "value"
        effect: "NoSchedule"
      - key: "other_key"
        operator: "Exists"
        effect: "NoExecute"
```
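
The review discussion above converged on requiring the pod to carry every toleration listed in the rule, with individual entries compared exactly. A minimal Go sketch of that check, using the core `v1.Toleration` type (one possible reading, not a settled semantic):

```go
package schedulingpolicy

import corev1 "k8s.io/api/core/v1"

// tolerationsMatch reports whether the pod carries every toleration listed
// in the policy rule. Extra tolerations on the pod do not break the match,
// and an empty rule matches every pod. Entries are compared exactly
// (key, operator, value, effect), following the exact-match direction
// from the review discussion.
func tolerationsMatch(ruleTolerations, podTolerations []corev1.Toleration) bool {
	for _, want := range ruleTolerations {
		found := false
		for _, have := range podTolerations {
			if have.Key == want.Key &&
				have.Operator == want.Operator &&
				have.Value == want.Value &&
				have.Effect == want.Effect {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```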


### Priority classes

Priority class usage can be regulated using fine-grained rules in the `priorityClassNames` field.



##### Allow

In this example, we only allow the `critical-job` priority.


```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    priorityClassNames:
      match: ["critical-job"]
```


### Node Selector

The `nodeSelectors` field makes it possible to allow or deny pods based on their nodeSelector.

#### Examples

##### Complete policy

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: deny
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    nodeSelectors:
      match:
      - failure-domain.beta.kubernetes.io/region: ""
      - disk: "ssd"
      - disk: "premium"
```


In this example, pods cannot be scheduled if they have all of the following at the same time (see the sketch after this list):
- a `disk: ssd` nodeSelector
- a `disk: premium` nodeSelector
- a `failure-domain.beta.kubernetes.io/region` nodeSelector with any value
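
A sketch of that map-matching check, following the "match all entries of a rule" reading discussed in the review thread below; an empty value in a match entry matches any value for that key. This is illustrative only:

```go
package schedulingpolicy

// nodeSelectorsMatch reports whether the pod's nodeSelector contains every
// key/value pair listed in the policy rule. An empty value in the rule
// matches any value for that key, and extra nodeSelector entries on the
// pod do not affect the result.
func nodeSelectorsMatch(rule []map[string]string, podNodeSelector map[string]string) bool {
	for _, entry := range rule {
		for key, want := range entry {
			got, ok := podNodeSelector[key]
			if !ok {
				return false
			}
			if want != "" && want != got {
				return false
			}
		}
	}
	return true
}
```
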
> **Review comment:** What if they have these 3, plus one other? Or only 2? See above about matching maps.
>
> **Review comment:** It depends on the algorithm; as we said, we should match everything to take the action? cc @bsalamat
>
> **Review comment:** I agree with "match all" for a given rule. One could create multiple policies to cover "OR" cases. For example, if you want to deny either nodeSelector `disk: ssd` or nodeSelector `disk: premium`, you could create two policies with the deny action. One has
>
>     nodeSelectors:
>       match:
>         - disk: "ssd"
>
> and the other:
>
>     nodeSelectors:
>       match:
>         - disk: "premium"
>
> **Review comment:** Please update the proposal to capture a decision here.



### Affinity rules

Since affinity and anti-affinity rules are expensive to evaluate, we must be able to restrict their usage. For each type (`nodeAffinities`, `podAffinities`, `podAntiAffinities`), a SchedulingPolicy can list allowed/denied constraints (`requiredDuringSchedulingIgnoredDuringExecution`
or `preferredDuringSchedulingIgnoredDuringExecution`).

#### Examples

##### Basic policy



```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    nodeAffinities:
      requiredDuringSchedulingIgnoredDuringExecution:
        match:
        - key: "authorized-region"
          operator: "NotIn"
          values: ["eu-2", "us-1"]
        - key: "PCI-region"
          operator: "Exists"
          values: []
      preferredDuringSchedulingIgnoredDuringExecution:
        match:
        - key: "flavor"
          operator: In
          values: ["m1.small", "m1.medium"]
```

> **Review comment:** I'm still having a hard time wrapping my head around the semantics of applying policy to such a deeply nested and logical field. For instance, it seems like it could be useful to have matchAny, matchAll, and matchNone for these, along with wildcard matching on the specific fields.
>
> As an example, consider the case: pods aren't allowed to schedule to the eu-region nodes.
>
>     action: deny
>     requiredDuringSchedulingIgnoredDuringExecution:
>     - key: region
>       operator: "In"
>       values: ["eu-1", "eu-2"]
>
> Well, you probably want subset matching so that a user can't just do this:
>
>     requiredDuringSchedulingIgnoredDuringExecution:
>     - key: region
>       operator: "In"
>       values: ["eu-1", "eu-2", "us-1"]
>     - key: region
>       operator: "NotIn"
>       values: ["us-1"]
>
> On the other hand, you also want to prevent doing this (suppose the full set of regions includes eu-1, eu-2, us-1, us-2):
>
>     requiredDuringSchedulingIgnoredDuringExecution:
>     - key: region
>       operator: "NotIn"
>       values: ["us-1", "us-2"]
>
> I guess what you really need to say is "this pod must be scheduled outside of the eu nodes", so the policy should be:
>
>     requiredDuringSchedulingIgnoredDuringExecution:
>     - key: region
>       operator: "In"
>       values: ["us-1", "us-2"]
>
> Now, suppose a user wants to specifically schedule a pod in "us-1" and the admin wants to allow it:
>
>     requiredDuringSchedulingIgnoredDuringExecution:
>     - key: region
>       operator: "In"
>       values: ["us-1"]
>
> As is, this fails because ["us-1", "us-2"] is not a subset of ["us-1"]. They could do:
>
>     requiredDuringSchedulingIgnoredDuringExecution:
>     - key: region
>       operator: "In"
>       values: ["us-1", "us-2"]
>     - key: region
>       operator: "In"
>       values: ["us-1"]
>
> so that the required rule is there, and then further scope it down, but that conflicts with some of the other matching semantics we've already declared.
>
> Anyhow, this was a bit rambly, but the point I'm trying to convey is that the composition and matching semantics of these fields really depend on the type of operator being used and the specific affinity. That is, the semantics for podAntiAffinity and tolerations should probably be different from those of pod and node affinity.
>
> I just don't see these nuances covered in the current proposal, and I'm not totally sure they can be cleanly expressed in this model.

In this example, we allow pods with a nodeAffinity that selects nodes whose `authorized-region` label has neither the `eu-2` nor the `us-1` value, or nodes that have the `PCI-region` label set. On those filtered nodes, we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when:

- The pod has all of the "required" and "preferred" sections.
- Each section has the same keys and the same operators.
- Values are the same as, or a subset of, those of the pod (see the sketch below).
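
A sketch of the per-requirement check as currently written (same key and operator, policy values a subset of the pod's values), using the core `v1.NodeSelectorRequirement` type. The review comments note that this likely needs per-operator refinement, so treat it purely as an illustration:

```go
package schedulingpolicy

import corev1 "k8s.io/api/core/v1"

// requirementMatches reports whether a single policy match entry is satisfied
// by at least one expression in the pod's nodeAffinity term: same key, same
// operator, and the policy's values are a subset of the pod's values.
func requirementMatches(policyReq corev1.NodeSelectorRequirement,
	podReqs []corev1.NodeSelectorRequirement) bool {
	for _, podReq := range podReqs {
		if podReq.Key != policyReq.Key || podReq.Operator != policyReq.Operator {
			continue
		}
		if isSubset(policyReq.Values, podReq.Values) {
			return true
		}
	}
	return false
}

// isSubset reports whether every element of sub appears in super.
func isSubset(sub, super []string) bool {
	set := make(map[string]struct{}, len(super))
	for _, v := range super {
		set[v] = struct{}{}
	}
	for _, v := range sub {
		if _, ok := set[v]; !ok {
			return false
		}
	}
	return true
}
```
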
> **Review comment:** Can you give an example of where subset matching would be desirable?




# References
- [Pod affinity/anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
- [Pod priorities](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/)
- [Taints and tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/)
- [Using multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)