# Scheduling Policy

_Status: Draft_

_Authors: @arnaudmz, @yastij_

_Reviewers: @bsalamat, @tallclair, @liggitt_

# Objectives

- Define the concept of scheduling policies
- Propose their initial design and scope

## Non-Goals

- How taints / tolerations work.
- How NodeSelector works.
- How node / pod affinity / anti-affinity rules work.
- How several schedulers can be used within a single cluster.
- How priority classes work.
- How to set defaults in Kubernetes.

# Background

While architecting real-life Kubernetes clusters, we encountered contexts where role isolation (between administration and plain namespace usage in a multi-tenant context) could be improved. So far, no restriction can be applied to tolerations, priority class usage, nodeSelector, or (anti-)affinity rules based on user permissions (RBAC).

The identified use cases aim to ensure that administrators have a way to restrict users or namespaces. Scheduling policies allow administrators to:

- Restrict execution of specific applications (which are namespace-scoped) to certain nodes
- Create policies that prevent users from even attempting to schedule workloads onto masters, to maximize security
- Require that pods under a namespace run on dedicated nodes
- Restrict usage of some `PriorityClass`
- Restrict usage to a specific set of schedulers
- Enforce pod affinity or anti-affinity rules on a particular namespace

Scheduling policies must also be able to enforce mandatory values in the case of:

- Multi-arch (amd64, arm64) or multi-OS (Linux, Windows) clusters (will also be handled later by the [RuntimeClass]() KEP)
- Multi-AZ / region / failure-domain clusters

# Overview

The SchedulingPolicy will live out-of-tree, under the kubernetes-sigs org. It will use a CRD-based approach.

## Syntactic standards

### Empty match and unset fields

An empty field and an unset field both match everything.

### How policies are computed

Policies are computed using the following algorithm:

```
sortedPolicies = sort_by_priority(sort_deny_first(policies))
for policy in sortedPolicies:
  if policy matches pod: // all specified policy rules match
    return policy.action
```

Note that:

- Rules of policies with higher priority supersede lower-priority rules if they both match.
- Matching is done statically, i.e. we don't interpret logical operators (see the nodeAffinity section for more details).
- Matching is considered true if a subset of a set-based field is matched.
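
To make the evaluation order concrete, here is a minimal sketch using the API shape detailed below; the scheduler name and the assumption that the `exception` priority outranks `cluster` are illustrative, not normative:

```yaml
# Cluster-wide policy: denies pods based on their scheduler name.
# An empty match list matches every scheduler name.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: deny-all-schedulers
spec:
  priority: cluster
  action: deny
  rules:
    schedulerNames:
      match: []
---
# Exception policy: allows pods that request `my-scheduler`.
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: allow-my-scheduler
spec:
  priority: exception
  action: allow
  rules:
    schedulerNames:
      match: ["my-scheduler"]
```

Under these assumptions, a pod that sets `spec.schedulerName: my-scheduler` matches both policies; the higher-priority allow is evaluated first and wins. A pod requesting any other scheduler only matches the deny policy and is rejected.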

# Detailed Design

## SchedulingPolicy

Proposed API group: `policy/v1alpha1`

### SchedulingPolicy content

The SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy`, it is automatically allowed.

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  priority: <exception,cluster,default>
  action: <allow,deny>
  rules:
    namespaceSelector:
    - key: value
      operator: operator
    podSelector:
      key: value
    schedulerNames:
      match: []
    priorityClassNames:
      match: []
    tolerations:
      match: []
    nodeSelectors:
      match: []
    nodeAffinities:
      match: []
    podAntiAffinities:
      match: []
    podAffinities:
      match: []
```

- `allow`: when a policy matches, the pod is allowed.
- `deny`: when a policy matches, the pod is denied.

### Scoping

The `spec.rules.namespaceSelector` and `spec.rules.podSelector` attributes scope the policy application range.

Pods must match both the `spec.rules.namespaceSelector` and `spec.rules.podSelector` matching rules to be constrained by a policy.
Both `spec.rules.namespaceSelector` and `spec.rules.podSelector` are optional. If absent, all pods and all namespaces are targeted by the policy.
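
As a sketch of how scoping composes (the `team` and `app: web` labels are illustrative), the following policy is only evaluated against pods labeled `app: web` that live in namespaces carrying a `team` label; any other pod is unaffected by it:

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: scoped-policy
spec:
  action: allow
  rules:
    # Scope: namespaces that carry a `team` label ...
    namespaceSelector:
    - key: team
      operator: Exists
    # ... and, within them, only pods labeled app=web.
    podSelector:
      app: web
    schedulerNames:
      match: ["default-scheduler"]
```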

### Scheduler name

If `schedulerNames` is absent from a policy's rules, no scheduler is allowed by that specific policy.

#### Allowed
Allow pods in namespaces labeled `team` to use either the `green-scheduler` or the `my-scheduler` scheduler (selected by setting `spec.schedulerName` in the pod definition):

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    schedulerNames:
      match: ["green-scheduler","my-scheduler"]
```
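
For reference, a pod allowed by this policy just sets the standard `spec.schedulerName` field; the `team-a` namespace is assumed to carry the `team` label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: green-pod
  namespace: team-a   # assumed to be labeled with `team`
spec:
  schedulerName: green-scheduler
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.1
```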

An empty list of `schedulerNames` will allow usage of all schedulers:

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    schedulerNames:
      match: []
```

Note: this policy has no effect, as we allow all when no policy is set.

### Tolerations

Toleration usage can be regulated using fine-grained rules with the `tolerations` field. If multiple `tolerations` are specified, the pod will be scheduled if one of the tolerations is satisfied.

#### Allowed

This policy allows pods with the following tolerations:
- Tolerations that tolerate taints with the key `mykey`, the value `value`, and the `NoSchedule` effect.
- Tolerations that tolerate taints with the key `other_key` and the `NoExecute` effect.

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    tolerations:
      match:
      - key: "mykey"
        operator: "Equal"
        value: "value"
        effect: "NoSchedule"
      - key: "other_key"
        operator: "Exists"
        effect: "NoExecute"
```
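
For illustration, a pod that carries both tolerations listed in the policy above would be matched and allowed (the `team-a` namespace is assumed to carry the `team` label):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerating-pod
  namespace: team-a   # assumed to be labeled with `team`
spec:
  tolerations:
  - key: "mykey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  - key: "other_key"
    operator: "Exists"
    effect: "NoExecute"
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.1
```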

### Priority classes

Priority class usage can be regulated using fine-grained rules with the `priorityClassNames` field.

#### Allow

In this example, we only allow the `critical-priority` priority class.

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    priorityClassNames:
      match: ["critical-priority"]
```
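
Assuming a `PriorityClass` named `critical-priority` exists in the cluster, a pod matched by this policy would reference it through the standard `priorityClassName` field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
  namespace: team-a   # assumed to be labeled with `team`
spec:
  priorityClassName: critical-priority
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.1
```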

### Node Selector

The `nodeSelectors` field makes it possible to specify which pods are allowed or denied based on their nodeSelectors.

#### Examples

##### Complete policy

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: deny
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    nodeSelectors:
      match:
      - failure-domain.beta.kubernetes.io/region: ""
      - disk: "ssd"
      - disk: "premium"
```

In this example, pods cannot be scheduled if they have all of the following at the same time:

- a `disk: ssd` nodeSelector
- a `disk: premium` nodeSelector
- a `failure-domain.beta.kubernetes.io/region` nodeSelector with any value.
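
To make the match-all semantics concrete, here is a sketch of a pod that sets only some of the listed selectors (the region value is illustrative); because it does not carry every entry in the rule, the deny action does not apply to it:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
  namespace: team-a   # assumed to be labeled with `team`
spec:
  nodeSelector:
    disk: "ssd"
    failure-domain.beta.kubernetes.io/region: "eu-west-1"
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.1
```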

### Affinity rules

As (anti-)affinity rules are computationally expensive, we must be able to restrict their usage. For each type (`nodeAffinities`, `podAffinities`, `podAntiAffinities`), a SchedulingPolicy can list allowed/denied constraints (`requiredDuringSchedulingIgnoredDuringExecution` or `preferredDuringSchedulingIgnoredDuringExecution`).

#### Examples

##### Basic policy

```yaml
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    nodeAffinities:
      requiredDuringSchedulingIgnoredDuringExecution:
        match:
- key: "authorized-region" | ||
operator: "NotIn" | ||
values: ["eu-2", "us-1"] | ||
- key: "PCI-region" | ||
operator: "Exists" | ||
values: [] | ||
preferredDuringSchedulingIgnoredDuringExecution: | ||
match: | ||
- key: "flavor" | ||
operator: In | ||
values: ["m1.small", "m1.medium"] | ||
|
||
``` | ||

In this example, we allow pods whose nodeAffinity selects nodes that have the `authorized-region` label without the `eu-2` or `us-1` values, or nodes that have the `PCI-region` label set. On those filtered nodes, we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). A pod is matched when:

- It has all of the "required" and "preferred" sections.
- Each section has the same keys and the same operators.
- The values are the same as, or a subset of, those of the pod.
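
As a sketch (the `team-a` namespace is assumed to carry the `team` label), a pod carrying the same required and preferred sections, with the same keys, operators, and values, would be matched by this policy:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: region-aware-pod
  namespace: team-a   # assumed to be labeled with `team`
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "authorized-region"
            operator: "NotIn"
            values: ["eu-2", "us-1"]
          - key: "PCI-region"
            operator: "Exists"
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: "flavor"
            operator: "In"
            values: ["m1.small", "m1.medium"]
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.1
```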

# References

- [Pod affinity/anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
- [Pod priorities](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/)
- [Taints and tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/)
- [Using multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)