Add initial design proposal for Scheduling Policy #1937
Conversation
@arnaudmz: GitHub didn't allow me to request PR reviews from the following users: yastij. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
also cc @timothysc
SchedulingPolicy resources are supposed to apply in a deny-all-except approach. They are designed to apply in an additive way (i.e and'ed). From Pod's perspective, a pod can use one or N of the allowed items.

An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.
Absence of node selectors is also problematic. The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes. Any replacement for that mechanism would need to provide the same capability.
SGTM
Please see below a proposal which could go this way.
The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes.
Will that introduce an order dependence between the two admission controllers? For example, `podnodeselector` adds some denied nodeSelector after this admission controller; similar to `podtolerationrestrict`. Is it possible to combine those admissions into one, or at least document it clearly?
It's arguable that the cluster admin should configure it correctly, but that'll take time to troubleshoot :)
All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed.
Clarify what "merged" means. That seems potentially problematic, especially in case of computing coverage of conflicting scheduling components (policy A allowed this toleration, policy B allowed that toleration, policy C required nodeSelector component master=false, policy D allows nodeSelector component master=true, etc)
As long as there were no required components nor default values, merging was quite trivial, but given that need, I guess we'll have to work on it.
I'm thinking of some ways like:
- allowed-like rules keep having an additive behaviour: if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b, the merge produces a nodeSelector key that can be in [a,b]
- require-like rules prevail over allowed values: allowed values not present in required values are eventually not allowed
- for default-like and required-like conditions, we could consider either weighting policies or sorting policies (by name?) and applying a last-seen-wins rule.
Any thoughts?
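For illustration, a rough sketch (field names such as `nodeSelectors.allowed` and `required` are hypothetical, mirroring the ideas above) of how two policies could combine under these rules:

```
# Policy A
spec:
  nodeSelectors:
    allowed:
      disk: ["ssd"]        # A allows disk=ssd
---
# Policy B
spec:
  nodeSelectors:
    allowed:
      disk: ["hdd"]        # B allows disk=hdd
    required:
      zone: ["eu-1"]       # required values prevail over what is merely allowed
# Merged view for a pod granted both policies: disk may be "ssd" or "hdd" (additive),
# and zone must be "eu-1" (required wins).
```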
Prefer to have `allow`, `deny` and `ignore` concepts; so the result will be no `deny` term for the requirements.
- `allow`: passed by policy
- `deny`: rejected by policy
- `ignore`: the policy did not include a term for it.

if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b => the merge produces nodeSelector key can be in [a,b]

That's error-prone and complex; especially some corner cases.
I mean the single term; so the request is passed only if all terms passed :)
spec:
  allowedSchedulerNames: # Describes schedulers names that are allowed
  allowedPriorityClasseNames: # Describes priority classe names that are allowed
  allowedNodeSelectors: # Describes node selectors that can be used
Discuss default vs required vs allowed vs forbidden
Typically, fencing nodes via selector involves requiring a specific set of labels/values (in addition to whatever else the pod wants), e.g. master=false,compute=true
As anti-affinity rules are really time-consuming, we must be able to restrict their usage with `allowedNodeSelectors`.

If `allowedNodeSelectors` is totally absent from the spec, no node selector is allowed.
This doesn't make sense. A pod with no nodeSelector targets the most nodes possible. Adding more selectors constrains a pod.
Generally, you want to require a set of nodeSelector labels be present, error if the pod tries to specify nodeSelector components that conflict with that required set, and allow the pod to specify any additional nodeSelector components it wants. That is what the current podnodeselector admission plugin does
Do you think we could do it this way?
I'm taking `nodeSelector` as a simple example to start with:
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  nodeSelectors:
    required:
      beta.kubernetes.io/arch: ["amd64", "arm"] # pick one of those mandatory values
    default:
      beta.kubernetes.io/os: linux # Here is the default value unless specified
    allowed:
      failure-domain.beta.kubernetes.io/region: [] # any value can be specified
Given the deny-by-default design, some kind of `forbidden` subsection actually wouldn't make sense.
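For illustration, a minimal sketch (purely hypothetical, following the `my-schedpol` example above) of a pod that this policy would admit:

```
apiVersion: v1
kind: Pod
metadata:
  name: sample-workload
spec:
  nodeSelector:
    beta.kubernetes.io/arch: "arm"                       # one of the "required" values
    failure-domain.beta.kubernetes.io/region: "eu-west"  # "allowed" key, any value accepted
    # beta.kubernetes.io/os is omitted, so the policy's "default" value would be injected
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
```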
@arnaudmz - I agree, given the design, `forbidden` doesn't make much sense.
Cc @kubernetes/sig-auth-proposals
metadata:
  name: my-schedpol
spec:
  allowedSchedulerNames: # Describes schedulers names that are allowed
As Jordan has also mentioned, none of these fields should have the "allowed" prefix. They should be "schedulerNames", "priorityClassNames", etc. Then the spec for each one should have a "condition" (or a similar word) that can be set to one of the "allowed", "forbidden", "default", or "required".
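A rough sketch of that shape (the `condition`/`values` names below are placeholders, not a settled API):

```
spec:
  schedulerNames:
  - condition: allowed            # one of: allowed, forbidden, default, required
    values: ["default-scheduler"]
  priorityClassNames:
  - condition: default
    values: ["normal-priority"]
  - condition: allowed
    values: ["critical-priority"]
```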
Yes, that's what I meant. One more point to add is that, in Kubernetes, we usually apply Pod policies at the granularity of namespaces. So, a user should be able to specify the namespace that these rules are applied to. For example, the default priority class of Pods in namespace "ns-1" is "pc-1".
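For instance, a namespaced sketch of that example (assuming, hypothetically, that the resource can be namespaced and uses the `condition` shape discussed above):

```
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: default-priority
  namespace: ns-1
spec:
  priorityClassNames:
  - condition: default
    values: ["pc-1"]      # pods in ns-1 get priority class pc-1 unless they set one
```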
@yastij: yes, that was the point of mimicking the PSP RBAC principle: using RoleBindings or ClusterRoleBindings to apply the policies at serviceaccount, namespace or cluster scope.
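To make that binding concrete, a sketch of the RBAC objects (assuming the PSP-style `use` verb carries over to `schedulingpolicies`; resource and policy names are illustrative):

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: use-my-schedpol
rules:
- apiGroups: ["policy"]
  resources: ["schedulingpolicies"]
  resourceNames: ["my-schedpol"]
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-use-my-schedpol
  namespace: team          # a RoleBinding scopes the grant to one namespace;
roleRef:                   # a ClusterRoleBinding would grant it cluster-wide
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: use-my-schedpol
subjects:
- kind: ServiceAccount
  name: default
  namespace: team
```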
cc/ @bgrant0607
I see that you're using the same mechanism as for PodSecurityPolicy (in particular using the …)
Yes, scoping Scheduling Policies by namespace could make sense. From a cluster administrator's perspective, it seemed they might have to create a schedpol each time they create an end-user-oriented namespace to ensure they can/must provide the expected scheduling information. I'm thinking of multi-arch clusters, where beta.kubernetes.io/arch will have to be allowed (probably even enforced) cluster-wide. Non-scoped RBAC-based schedpols make this use case a non-event. However, if other use cases tend to favor namespaced schedpols (like resourcequotas and limitranges are), why not.
/assign @tallclair
In this example, only nodeAffinities (required and preferred) are allowed but no podAffinities nor podAntiAffinities.

## Multiple SchedulingPolicies considerations
@arnaudmz @liggitt @bsalamat @tallclair @davidopp - these are the options available to handle conflicts.
concerns for me:
- Option 1 is way too complicated to handle.
- Option 2 is more deterministic but won't let users split `SchedulingPolicy` across multiple resources.
Why do we have to merge them? Isn't going through the policies one by one enough?
@k82cn - it is not a proper "merge", i.e. we do not aggregate `SchedulingPolicy` resources into a new one. We'll be going through policies and resolving conflicts to compute the set of rules; these options are the possible ways of dealing with such conflicts.
Great
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Intuitively, a label selector and a namespace selector belong to the "Matcher". Can you give an example of a case where moving the label selector to the policy simplifies things?
Again, an example helps understand your point.
IMO, we shouldn't care about ANDing and ORing. We should just follow these rules:
Yes, for more complex fields, our matchers will need to be more complex, but in the case of your example, I think it should be broken into multiple rules. There is no point in making a single rule complex and letting it match many pods, instead of creating multiple simple rules that match the same set of pods.
SGTM, updating the design doc, once done I'll ping for approval.
The SIG thinks that this should be a new resource under the policy API group, but as it is an API matter, can @kubernetes/api-approvers comment on this?
# Overview

### syntaxic standards
This section is empty now (please remove)
### empty match and unset fields:

We follow Kubernetes conventions, an empty field matches everything an unset one matches nothing
I think this is backwards. Unset should match everything, empty should match the empty value.
Care will need to be taken when the unset value and empty value are the same. For example, consider:
`SchedulerName []string`
If I have `SchedulerName = nil` (or equivalently `[]string{}`), then that MUST match any scheduler name. If I want to explicitly match the unset SchedulerName, I would need to set `SchedulerName = []string{''}`.
Unfortunately I think this will need to be considered on a case-by-case basis, as the behavior depends on both the governed field type (the podspec field) and the policy field type.
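As a purely illustrative sketch of those two cases in policy YAML (field shapes assumed from the examples elsewhere in this proposal):

```
schedulerNames:
  match: []       # empty / unset list: matches any scheduler name
```
versus
```
schedulerNames:
  match: [""]     # explicit empty string: matches only pods that leave schedulerName unset
```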
Any API that distinguishes between a nil slice and an empty slice is likely to be broken. The conversion semantics do not always retain the difference, and I strongly urge you to not distinguish.
## SchedulingPolicy

Proposed API group: `policy/v1alpha1`
I think we should consider making this `policy.k8s.io` and putting it in a new policy repo.
cc @kubernetes/api-approvers
### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
^^
spec:
  priority: <exception,cluster,default>
  action: <allow,deny>
  rules:
nit: I don't love the name `rules` for this field (I'd expect it to be of `[]Rule` type), but I don't have any good suggestions... maybe `criteria`?
Alternatively we could just flatten it and put the rules directly in the spec.
match: ["green-scheduler","my-scheduler"] | ||
``` | ||
|
||
An empty list of schedulerNames will allow usage of all schedulers: |
Remove this. It's redundant with what's already been stated.
True.
Toleration usage can be regulated using fine-grain rules with `tolerations` field. If specifying multiple `tolerations`, pod will be scheduled if one of the tolerations is satisfied.

#### Allowed
Example. Same everywhere else.
#### Allowed

This allows pods with tolerations the following:
It's unclear from this how the matching works. Does this only match pods that have exactly this set of tolerations? Or a pod that has any of these tolerations (including none?). What if I wanted to require a toleration? I think we need to be careful how we define the semantics for things that match lists (or maps).
From what we've said, I think that all the matches should be satisfied to return the corresponding action.
I think I agree. It's a match if the pod has all the tolerations listed. If the pod also has other tolerations, it's still a match.
How does this work with tolerations using set-based selectors? Does it require an exact match for each selector or does it check for equivalent selectors?
Example:
For a pod:
apiVersion: v1
kind: Pod
metadata:
  name: with-tolerations
  namespace: team
spec:
  tolerations:
  - key: "key1"
    operator: "In"
    value: ["value1a", "value1b"]
    effect: "NoSchedule"
  - key: "key2"
    operator: "In"
    value: ["value2a", "value2b"]
    effect: "NoSchedule"
  - key: "key3"
    operator: "Nin"
    value: ["value3a", "value4b"]
    effect: "NoExecute"
  containers:
  - name: with-tolerations
    image: k8s.gcr.io/pause:2.0
and policy:
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: mySchedulingPolicy
spec:
  action: allow
  rules:
    namespaces:
    - key: team
      operator: Exists
    podSelector: {}
    tolerations:
      match:
      - key: "key1"
        operator: "Nin"
        value: ["value1c"]
        effects: "NoSchedule"
      - key: "key2"
        operator: "Exists"
        effects: "NoSchedule"
      - key: "key3"
        operator: "Nin"
        value: ["value3a"]
        effect: "NoExecute"
should this policy match this pod?
Semantic matching could get complex for users. I think for set-based values, we can simplify and follow these rules:
- Keys must be equal.
- Operators must be equal.
- `policy.spec.tolerations.match.value` must be a subset of `pod.spec.tolerations.value`
If all of the above rules match, we consider the policy a match.
@tallclair, what do you think?
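To illustrate those rules (keys and values below are made up, following the set-based example above), a pod toleration and a policy entry that would count as a match:

```
# Pod toleration
tolerations:
- key: "dedicated"
  operator: "In"
  value: ["team-a", "team-b"]
  effect: "NoSchedule"
# Policy rule: same key, same operator, and its value list is a subset of the pod's,
# so this toleration rule matches.
tolerations:
  match:
  - key: "dedicated"
    operator: "In"
    value: ["team-a"]
    effect: "NoSchedule"
```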
I think this is a hard problem that deserves some thoughtful analysis and validation against real use cases.
`policy.spec.tolerations.match.value` must be a subset of ...
This is tricky because the tolerance is sort of dependent on the operator. For example, if I want to express "allow pods that are defined to these nodes", allowing a subset makes sense. However, if I want to say "allow pods that are excluded from these nodes", I don't want to allow pods that are only excluded from a subset.
Maybe we should just require an exact match in this case. (modulo sort)
@tallclair - we can go down this path, subset matching can be performed by splitting into multiple policies, any thoughts @bsalamat @ericavonb @arnaudmz ?
sgtm, please update the proposal to capture that decision.
  operator: Exists
podSelector: {}
priorityClasseNames:
- match: "critical-priority"
should be a list?
In this example, pods cannot be scheduled if they have all of the following at the same time:
- `disk: ssd` nodeSelector
- `disk: hdd` nodeSelector
- `failure-domain.beta.kubernetes.io/region` nodeSelector with any value.
What if they have these 3, plus one other? Or only 2? See above about matching maps.
It depends on the algorithm; as we said, we should match everything to take the action? cc @bsalamat
I agree with match all for a given rule. One could create multiple policies to cover "OR" cases. For example, if you want to deny any of nodeSelector "disk: ssd" or nodeSelector "disk: premium", you could create two policies with deny action. One has
nodeSelectors:
  match:
  - disk: "ssd"
and the other:
nodeSelectors:
  match:
  - disk: "premium"
Please update the proposal to capture a decision here.
My suggestion at this point is to use a CRD (as @thockin mentioned) and admission webhook, implement a prototype under the kubernetes-sigs org, and collect user feedback. Even before that, though, I suggest writing a user guide that covers critical use cases to help you understand how cluster admins and app operators would use this and what abstractions should be presented to each persona in order to map workloads to cluster resources.
@@ -56,7 +58,10 @@ for policy in sortedPolicies:
  if policy matches pod: // all specified policy rules match
    return policy.action
```
note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
note that:
- rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
nit: I would write in active form:
rules of policies with higher priority supersede lower priority rules if they both match.
Yes looks better
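As a sketch of that behaviour (assuming `exception` sorts above `cluster` in the priority ordering, which the proposal still needs to spell out; names and fields are illustrative):

```
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: cluster-wide-deny
spec:
  priority: cluster
  action: deny
  rules:
    schedulerNames:
      match: ["my-scheduler"]
---
apiVersion: policy/v1alpha1
kind: SchedulingPolicy
metadata:
  name: team-exception
spec:
  priority: exception        # assumed higher priority: its allow supersedes the deny above
  action: allow
  rules:
    schedulerNames:
      match: ["my-scheduler"]
```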
note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
note that:
- rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
- matching is done statically (i.e. we don't interpret logical operators).
This is not clear. You should refer the reader to a section that provides more details/examples.
@bsalamat - maybe the nodeAffinity section with the examples?
### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
Should this be changed based on the recent change on how an empty/unset field is matched?
```

In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`)
s/that nodeAffinity/with nodeAffinity/ ?
The matching here should match against the nodeAffinity rules of pods. The description is vague. I guess we need to specify that a match happens when a Pod has:
- All the "required" and "preferred" sections.
- Each section has the same keys and the same operators.
- Values must be the same or subset of those of the pod.
True.
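For illustration (the label key and values are made up), a pod nodeAffinity term and a policy entry that satisfy the three conditions listed above:

```
# Pod spec fragment
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: authorized-region
          operator: NotIn
          values: ["eu-1", "us-1", "ap-1"]
# Policy fragment: same key and operator, and the policy's values are a subset of
# the pod's, so this matches.
nodeAffinities:
  requiredDuringSchedulingIgnoredDuringExecution:
    match:
    - key: authorized-region
      operator: NotIn
      values: ["eu-1", "us-1"]
```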
@bsalamat - updated the proposal with latest comments
I am fine with the current proposal. Please make sure that @tallclair points are addressed too.
@@ -70,7 +75,7 @@ Proposed API group: `policy/v1alpha1`

### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it automatically allowed.
s/it automatically/it is automatically/
Done
```

In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when a pod has:
s/pods that nodeAffinity/pods with nodeAffinity/
/lgtm @bsalamat - updated, the proposal seems ready to merge to me.
/kind kep
podSelector: {}
schedulerNames:
  match: []
```
This second policy needs a description.
namespaces:
- key: team
  operator: Exists
podSelector: {}
nit: drop the pod selector, it should be optional, and isn't used in this example
  operator: Exists
podSelector: {}
schedulerNames:
  match: []
If I want to require that the scheduler name be left blank, would that be `match: ['']`?
### Priority classes

Priority class usage can be regulated using fine-grain rules with `priorityClasseName` field.
Please fix.
- All the "required" and "preferred" sections.
- Each section has the same keys and the same operators.
- Values must be the same or subset of those of the pod.
Can you give an example of where subset matching would be desirable?
- require that a pods under a namespace run on dedicated nodes
- Restrict usage of some `PriorityClass`
- Restrict usage to a specific set of schedulers.
- enforcing pod affinity or anti-affinity rules on some particular namespace.
This doesn't really count as a use case. Why do administrators want to do this?
For example, I suspect a common use case might be: "pods are not allowed to set namespaces on PodAffinityTerms" (i.e. they cannot have affinity or antiaffinity with pods outside their namespace). Given the current approach to specifying affinity & anti affinity policy, I'm not sure that's possible to express.
podSelector: {}
nodeAffinities:
  requiredDuringSchedulingIgnoredDuringExecution:
    match:
I'm still having a hard time wrapping my head around the semantics of applying policy to such a deeply nested & logical field. For instance, it seems like it could be useful to have `matchAny`, `matchAll`, and `matchNone` for these, along with wildcard matching on the specific fields.
As an example, consider the case: pods aren't allowed to schedule to the eu-region nodes.
action: deny
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["eu-1", "eu-2"]
Well, you probably want subset matching so that a user can't just do this:
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["eu-1", "eu-2", "us-1"]
- key: region
  operator: "NotIn"
  values: ["us-1"]
On the other hand, you also want to prevent doing this: (suppose the full set of regions include eu-1, eu-2, us-1, us-2)
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "NotIn"
  values: ["us-1", "us-2"]
I guess what you really need to say is "this pod must be scheduled outside of the eu nodes", so the policy should be:
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1", "us-2"]
Now, suppose a user wants to specifically schedule a pod in "us-1" - the admin wants to allow it:
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1"]
As is, this fails because `["us-1", "us-2"]` is not a subset of `["us-1"]`. They could do:
requiredDuringSchedulingIgnoredDuringExecution:
- key: region
  operator: "In"
  values: ["us-1", "us-2"]
- key: region
  operator: "In"
  values: ["us-1"]
So that the required rule is there, and then further scope it down, but that conflicts with some of the other matching semantics we've already declared.
Anyhow, this was a bit rambly - but the point I'm trying to convey is that the composition & matching semantics of these fields really depend on the type of operator being used, and the specific affinity. I.e. the semantics for "podAntiAffinity" and "tolerations" should probably be different from those of pod & node affinity.
I just don't see these nuances covered in the current proposal, and I'm not totally sure they can be cleanly expressed in this model.
REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus. Any questions regarding this move should be directed to that thread and not asked on GitHub.
@justaugustus ACK, thanks
KEPs have moved to k/enhancements. Any questions regarding this move should be directed to that thread and not asked on GitHub.
@justaugustus: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Ref: kubernetes/kubernetes#61185
/assign @bsalamat @liggitt
/cc @yastij