From 1e64624cbf4bf98646ec0ef30d85f9ea55c17a00 Mon Sep 17 00:00:00 2001
From: Arnaud MAZIN
Date: Fri, 16 Mar 2018 11:02:32 +0100
Subject: [PATCH 01/10] Add initial design proposal for Scheduling Policy

---
 .../scheduling/scheduling-policy.md | 355 ++++++++++++++++++
 1 file changed, 355 insertions(+)
 create mode 100644 contributors/design-proposals/scheduling/scheduling-policy.md

diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md
new file mode 100644
index 00000000000..a0e1a824fe6
--- /dev/null
+++ b/contributors/design-proposals/scheduling/scheduling-policy.md
@@ -0,0 +1,355 @@
+# Scheduling Policy
+
+_Status: Draft_
+_Author: @arnaudmz, @yastij_
+_Reviewers: @bsalamat, @liggitt_
+
+# Objectives
+
+- Define the concept of scheduling policies
+- Propose their initial design and scope
+
+## Non-Goals
+
+- How taints / tolerations work
+- How NodeSelector works
+- How node / pod affinity / anti-affinity rules work
+- How several schedulers can be used within a single cluster
+- How priority classes work
+
+# Background
+
+During real-life Kubernetes architecting we encountered contexts where role-isolation (between administration and simple namespace usage in a multi-tenant context) could be improved. So far, no restriction is possible on toleration, priority class usage, nodeSelector, anti-affinity depending on user permissions (RBAC).
+
+Identified use-cases aim to ensure that administrators have a way to restrict users or namespaces when:
+- using schedulers,
+- placing pods on specific nodes (master roles for instance),
+- using specific priority classes,
+- expressing pod affinity or anti-affinity rules.
+
+# Overview
+
+Implementing SchedulingPolicy implies:
+- Creating a new resource named **SchedulingPolicy** (schedpol)
+- Creating an **AdmissionController** that behaves on a deny-all-except basis
+- Allowing SchedulingPolicies to be used by pods through RoleBindings or ClusterRoleBindings
+
+# Detailed Design
+
+SchedulingPolicy resources are supposed to apply in a deny-all-except approach. They are designed to apply in an additive way (i.e. ANDed). From a Pod's perspective, a pod can use one or N of the allowed items.
+
+An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify the requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.
+
+All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed.
+
+## SchedulingPolicy
+
+Proposed API group: `extensions/v1alpha1`
+
+SchedulingPolicy is a cluster-scoped resource (not namespaced).
+
+### SchedulingPolicy content
+
+The SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this schedpol won't allow any item from the missing field.
+
+```yaml
+apiVersion: extensions/v1alpha1
+kind: SchedulingPolicy
+metadata:
+  name: my-schedpol
+spec:
+  allowedSchedulerNames: # Describes scheduler names that are allowed
+  allowedPriorityClasseNames: # Describes priority class names that are allowed
+  allowedNodeSelectors: # Describes node selectors that can be used
+  allowedTolerations: # Describes tolerations that can be used
+  allowedAffinities: # Describes affinities that can be used
+```
+
+### Scheduler name
+
+It should be possible to allow users to use only specific schedulers using the `allowedSchedulerNames` field.
+
+If `allowedSchedulerNames` is absent from a SchedulingPolicy, no scheduler is allowed by this specific policy.
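To illustrate the deny-all-except evaluation and the RBAC-scoped merge described above, here is a minimal Python sketch (the helper names and dictionary shapes are hypothetical, not the actual admission-controller code):

```python
def allowed_scheduler_names(policies):
    """Union of schedulerName allowances across all usable policies.

    Returns None to mean "any scheduler": a policy listing an empty
    allowedSchedulerNames array allows all schedulers.
    """
    merged = set()
    for policy in policies:
        names = policy.get("allowedSchedulerNames")
        if names is None:   # field absent: this policy allows none
            continue
        if names == []:     # empty list: allow-all
            return None
        merged.update(names)
    return merged

def admit_scheduler_name(pod_scheduler_name, policies):
    # Deny-all-except: reject unless some usable policy allows the name.
    allowed = allowed_scheduler_names(policies)
    return allowed is None or pod_scheduler_name in allowed
```

A pod whose `spec.schedulerName` is not covered by any of the policies its serviceaccount can `use` would be rejected by the validating admission phase.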
+
+#### Examples
+
+Allow serviceaccounts to use either the default-scheduler (which is used by specifying `spec.schedulerName` in the pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`):
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedSchedulerNames:
+  - default-scheduler
+  - my-scheduler
+```
+
+Allow all schedulers:
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedSchedulerNames: []
+```
+
+### Tolerations
+
+Toleration usage can be allowed using fine-grained rules with the `allowedTolerations` field. If multiple `allowedTolerations` are specified, a pod is admitted as soon as one of them is satisfied.
+
+If `allowedTolerations` is absent from a SchedulingPolicy, no toleration is allowed.
+
+#### Examples
+
+##### Fine-grain allowedTolerations
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedTolerations:
+  - keys: ["mykey"]
+    operators: ["Equal"]
+    values: ["value"]
+    effects: ["NoSchedule"]
+  - keys: ["other_key"]
+    operators: ["Exists"]
+    effects: ["NoExecute"]
+```
+This example allows tolerations in the following forms:
+- tolerations that tolerate taints with a key named `mykey`, a value `value`, and a `NoSchedule` effect.
+- tolerations that tolerate taints with the key `other_key` and a `NoExecute` effect.
+
+##### Coarse-grain allowedTolerations
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedTolerations:
+  - keys: []
+    operators: []
+    values: []
+    effects: ["PreferNoSchedule"]
+  - keys: []
+    operators: ["Exists"]
+    effects: ["NoSchedule"]
+```
+This example allows tolerations in the following forms:
+- tolerations that tolerate all `PreferNoSchedule` taints with any value.
+- tolerations that tolerate taints based on any key existence with effect `NoSchedule`.
+Also note that this SchedulingPolicy does not allow tolerating `NoExecute` taints.
+
+### Priority classes
+
+Administrators must be able to restrict users to specific priority classes using the `allowedPriorityClasseNames` field.
+
+If `allowedPriorityClasseNames` is absent from a SchedulingPolicy, no priority class is allowed.
+
+#### Examples
+
+##### Only allow a single priority class
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedPriorityClasseNames:
+  - high-priority
+```
+In this example, only the `high-priority` PriorityClass is allowed.
+
+##### Allow all priorities
+
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedPriorityClasseNames: []
+```
+In this example, all priority classes are allowed.
+
+### Node Selector
+
+Administrators must be able to restrict which node selectors can be used with the `allowedNodeSelectors` field.
+
+If `allowedNodeSelectors` is totally absent from the spec, no node selector is allowed.
+
+#### Examples
+
+##### Fine-grained policy
+
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedNodeSelectors:
+    disk: ["ssd"]
+    region: [] # means any value
+```
+In this example, pods can be scheduled only if they either:
+- have no nodeSelector,
+- have a `disk: ssd` nodeSelector,
+- and / or have a `region` nodeSelector with any value.
+
+### Affinity rules
+
+As affinity and anti-affinity rules are computationally expensive for the scheduler, administrators must be able to restrict their usage with `allowedAffinities`.
+`allowedAffinities` is supposed to keep a coarse-grained approach to allowing affinities. For each type (`nodeAffinities`, `podAffinities`, `podAntiAffinities`) a SchedulingPolicy can list allowed constraints (`requiredDuringSchedulingIgnoredDuringExecution`
+or `preferredDuringSchedulingIgnoredDuringExecution`).
+
+If `allowedAffinities` is totally absent from the spec, no affinity is allowed, regardless of its kind.
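As an illustration of this coarse-grained check, a small Python sketch (the dictionary shape mirrors the `allowedAffinities` field; function name and shapes are assumptions, not part of the proposal):

```python
def affinity_allowed(allowed_affinities, kind, constraint):
    """Check one affinity entry against an allowedAffinities mapping.

    allowed_affinities maps a kind ("nodeAffinities", "podAffinities",
    "podAntiAffinities") to a list of allowed constraint names.
    An absent kind allows nothing; an empty list allows both the hard
    and the soft constraint for that kind.
    """
    if not allowed_affinities or kind not in allowed_affinities:
        return False
    listed = allowed_affinities[kind]
    return listed == [] or constraint in listed

# Mirrors the "Basic policy" example below: node affinities restricted
# to the hard constraint, both pod anti-affinity constraints allowed.
policy = {
    "nodeAffinities": ["requiredDuringSchedulingIgnoredDuringExecution"],
    "podAntiAffinities": [
        "requiredDuringSchedulingIgnoredDuringExecution",
        "preferredDuringSchedulingIgnoredDuringExecution",
    ],
}
```

Under this sketch, any `podAffinities` entry in a pod would be rejected, since the kind is absent from the policy.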
+
+#### Examples
+
+##### Basic policy
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedAffinities:
+    nodeAffinities:
+    - requiredDuringSchedulingIgnoredDuringExecution
+    podAntiAffinities:
+    - requiredDuringSchedulingIgnoredDuringExecution
+    - preferredDuringSchedulingIgnoredDuringExecution
+```
+
+##### Allow-all policy
+In this example, all affinities are allowed:
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedAffinities:
+    nodeAffinities: []
+    podAffinities: []
+    podAntiAffinities: []
+```
+
+If a sub-item of allowedAffinities is absent from a SchedulingPolicy, it is not allowed, e.g.:
+```yaml
+Kind: SchedulingPolicy
+spec:
+  allowedAffinities:
+    nodeAffinities: []
+```
+In this example, only nodeAffinities (both soft and hard) are allowed; podAffinities and podAntiAffinities are not.
+
+### When both `allowedNodeSelectors` and `nodeAffinities` are specified
+
+Using both `allowedNodeSelectors` and `nodeAffinities` is not recommended, as the latter is far more permissive.
+
+## Default SchedulingPolicies
+
+### Restricted policy
+Here is a reasonable policy that might be allowed for any cluster without specific needs:
+```yaml
+apiVersion: extensions/v1alpha1
+kind: SchedulingPolicy
+metadata:
+  name: restricted
+spec:
+  allowedSchedulerNames: ["default-scheduler"]
+```
+It only allows usage of the default scheduler: no tolerations, nodeSelectors, nor affinities.
+
+Multi-arch (x86_64, arm) or multi-OS (Linux, Windows) clusters might also allow the following nodeSelectors:
+```yaml
+apiVersion: extensions/v1alpha1
+kind: SchedulingPolicy
+metadata:
+  name: restricted
+spec:
+  allowedSchedulerNames: ["default-scheduler"]
+  allowedNodeSelectors:
+    beta.kubernetes.io/arch: []
+    beta.kubernetes.io/os: []
+```
+
+### Privileged Policy
+
+This is the privileged SchedulingPolicy; it allows usage of all schedulers, priority classes, nodeSelectors, affinities and tolerations.
+
+```yaml
+apiVersion: extensions/v1alpha1
+kind: SchedulingPolicy
+metadata:
+  name: privileged
+spec:
+  allowedSchedulerNames: []
+  allowedPriorityClasseNames: []
+  allowedNodeSelectors: {}
+  allowedTolerations:
+  - keys: []      # any keys
+    operators: [] # => Equivalent to ["Exists", "Equal"]
+    values: []    # any values
+    effects: []   # => Equivalent to ["PreferNoSchedule", "NoSchedule", "NoExecute"]
+  allowedAffinities:
+    nodeAffinities: []
+    podAffinities: []
+    podAntiAffinities: []
+```
+
+## RBAC
+Use of a SchedulingPolicy is granted through RBAC with the verb `use`, and is evaluated when pods are admitted.
+
+The following default ClusterRoles / ClusterRoleBindings are provisioned to ensure that at least the default-scheduler can be used.
+
+RBAC objects are auto-provisioned at cluster creation / upgrade.
+
+This ClusterRole allows the use of the default scheduler:
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  annotations:
+    rbac.authorization.kubernetes.io/autoupdate: "true"
+  labels:
+    kubernetes.io/bootstrapping: rbac-defaults
+  name: sp:restricted
+rules:
+- apiGroups: ['extensions']
+  resources: ['schedulingpolicies']
+  verbs: ['use']
+  resourceNames:
+  - restricted
+```
+
+This ClusterRoleBinding ensures any serviceaccount can use the default-scheduler:
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  annotations:
+    rbac.authorization.kubernetes.io/autoupdate: "true"
+  labels:
+    kubernetes.io/bootstrapping: rbac-defaults
+  name: sp:restricted
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: sp:restricted
+subjects:
+- kind: Group
+  name: system:authenticated
+  apiGroup: rbac.authorization.k8s.io
+```
+
+This RoleBinding ensures that kube-system pods can run with no scheduling restriction:
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  annotations:
+    rbac.authorization.kubernetes.io/autoupdate: "true"
+  labels:
+
kubernetes.io/bootstrapping: rbac-defaults + name: sp:kube-system-privileged + namespace: kube-system +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: sp:privileged +subjects: +- kind: Group + name: system:serviceaccounts:kube-system + apiGroup: rbac.authorization.k8s.io +``` +# References +- [Pod affinity/anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) +- [Pod priorities](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) +- [Taints and tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) +- [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/) +- [Using multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/) From 094397982156d86614d6b41762c4a1c236f315f7 Mon Sep 17 00:00:00 2001 From: Arnaud MAZIN Date: Tue, 20 Mar 2018 11:31:40 +0100 Subject: [PATCH 02/10] Iterate after first remarks, several merging options are presented, to be discussed --- .../scheduling/scheduling-policy.md | 618 +++++++++++++++--- 1 file changed, 521 insertions(+), 97 deletions(-) diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md index a0e1a824fe6..23d034fac55 100644 --- a/contributors/design-proposals/scheduling/scheduling-policy.md +++ b/contributors/design-proposals/scheduling/scheduling-policy.md @@ -1,7 +1,9 @@ # Scheduling Policy _Status: Draft_ + _Author: @arnaudmz, @yastij_ + _Reviewers: @bsalamat, @liggitt_ # Objectives @@ -19,7 +21,7 @@ _Reviewers: @bsalamat, @liggitt_ # Background -During real-life Kubernetes architecting we encountered contexts where role-isolation (between administration and simple namespace usage in a multi-tenant context) could be improved. 
So far, no restriction is possible on toleration, priority class usage, nodeSelector, anti-affinity depending on user permissions (RBAC). +During real-life Kubernetes architecturing we encountered contexts where role-isolation (between administration and simple namespace usage in a multi-tenant context) could be improved. So far, no restriction is possible on toleration, priority class usage, nodeSelector, anti-affinity depending on user permissions (RBAC). Identified use-cases aim to ensure that administrators have a way to restrict users or namepace when - using schedulers, @@ -27,6 +29,10 @@ Identified use-cases aim to ensure that administrators have a way to restrict us - using specific priority classes, - expressing pod affinity or anti-affinity rules. +Mandatory (and optionally default) values must also be enforced by scheduling policies in case of: +- mutli-arch (amd64, arm64) of multi-os (Linux, Windows) clusters +- multi-az / region / failure domain clusters + # Overview Implementing SchedulingPolicy implies: @@ -36,9 +42,16 @@ Implementing SchedulingPolicy implies: # Detailed Design -SchedulingPolicy resources are supposed to apply in a deny-all-except approach. They are designed to apply in an additive way (i.e and'ed). From Pod's perspective, a pod can use one or N of the allowed items. +SchedulingPolicy resource specs are composed of several main attributes: +- **Required** scheduling components. These list the manadory NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations and optionally valid values that have to be provided to allow scheduling. +- **Allowed** scheduling components. These list the optional components that can be specified in pods definition. +- **Default** scheduling components. These list default values to set unless specified in pods definition. 
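One way to picture how the three groups interact for a single-valued field such as `spec.schedulerName` is the following Python sketch (an illustrative interpretation with hypothetical shapes, not the proposed implementation):

```python
def admit(pod_value, policy):
    """policy = {"required": [...], "allowed": [...], "default": value};
    any key may be absent.  Returns True when the pod is admitted."""
    # Mutating phase: inject the default when the pod specifies nothing.
    value = pod_value if pod_value is not None else policy.get("default")
    required = policy.get("required")
    if required is not None:
        # One of the required values must be picked;
        # required values are implicitly allowed.
        return value in required
    allowed = policy.get("allowed")
    if value is None:
        return True                 # nothing requested, nothing defaulted
    if allowed is None:
        # Section absent: only a value coming from `default` passes.
        return value == policy.get("default")
    return allowed == [] or value in allowed  # empty list = allow all
```

For example, a pod with no `schedulerName` under a policy that only defaults to `my-scheduler` gets that value injected and admitted, while an explicit name outside the `allowed` list is rejected.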
-An AdmissionController must be added to the validating phase and must reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations. +SchedulingPolicy resources are supposed to apply in a deny-all-except approach. + +An AdmissionController must be added to the mutating phase to +- add default values if unspecified for NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations, +- reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations. All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed. @@ -58,131 +71,255 @@ kind: SchedulingPolicy metadata: name: my-schedpol spec: - allowedSchedulerNames: # Describes schedulers names that are allowed - allowedPriorityClasseNames: # Describes priority classe names that are allowed - allowedNodeSelectors: # Describes node selectors that can be used - allowedTolerations: # Describes tolerations that can be used - allowedAffinities: # Describes affinities that can be used + required: + schedulerNames: # Describes schedulers names that are required + priorityClasseNames: # Describes priority class names that are required + nodeSelectors: # Describes node selectors that must be used + affinities: # Describes affinities that must be used + allowed: + schedulerNames: # Describes schedulers names that are allowed + priorityClasseNames: # Describes priority class names that are allowed + nodeSelectors: # Describes node selectors that can be used + tolerations: # Describes tolerations that can be used + affinities: # Describes affinities that can be used + default: + schedulerName: # Describes default scheduler name + priorityClasseName: # Describes default priority class name + nodeSelector: # 
Describes default node selector + tolerations: # Describes default tolerations + affinity: # Describes default affinity ``` +### required +Elements here are required, pods won't schedule if they aren't present. Also note that if something is required it is also allowed. + +### allowed +Elements here are allowed, the policy allows the presence of these elements. From Pod's perspective, a pod can use one or N of the allowed items. + +### default +If pods do not specify one, the elements here will be added + + ### Scheduler name -It should be possible to allow users to use only specific schedulers using `allowedSchedulerNames` field. +If `schedulerNames` is absent from `allowed`, `default` or `required`, no scheduler is allowed by this specific policy. -If `allowedSchedulerNames` is absent from SchedulingPolicy, no scheduler is allowed by this specific policy. +#### required -#### Examples +Require that pods use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`): +```yaml +Kind: SchedulingPolicy +spec: + required: + SchedulerNames: ["green-scheduler", "my-scheduler"] +``` -Allow serviceaccounts to use either the default-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`): +An empty list of schedulerNames here is not a valid syntax: ```yaml Kind: SchedulingPolicy spec: - allowedSchedulerNames: - - default-scheduler - - my-scheduler + required: + SchedulerNames: [] # equivalent to ["default-scheduler"] or to not specifying this item ``` -Allow all schedulers: +#### allowed +Allow pods to use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`): ```yaml Kind: SchedulingPolicy spec: - allowedSchedulerNames: 
[] + allowed: + SchedulerNames: ["green-scheduler", "my-scheduler"] +``` + +An empty list of schedulerNames will allow usage of all schedulers: +```yaml +Kind: SchedulingPolicy +spec: + allowed: + SchedulerNames: [] +``` + +#### default +pods will default use either the my-scheduler if nothing is specified in `spec.schedulerName`: +```yaml +Kind: SchedulingPolicy +spec: + default: + SchedulerName: "my-scheduler" ``` ### Tolerations -Toleration usage can be allowed using fine-grain rules with `allowedTolerations` field. If specifying multiple `allowedTolerations`, pod will be scheduled if one of the allowedTolerations is satisfied. +Toleration usage can be regulated using fine-grain rules with `tolerations` field. If specifying multiple `tolerations`, pod will be scheduled if one of the tolerations is satisfied. -If `allowedTolerations` is absent from SchedulingPolicy, no toleration is allowed. -#### Examples +#### Allowed -##### Fine-grain allowedTolerations -```yaml -Kind: SchedulingPolicy -spec: - allowedTolerations: - - keys: ["mykey"] - operators: ["Equal"] - values: ["value"] - effects: ["NoSchedule"] - - keys: ["other_key"] - operators: ["Exists"] - effects: ["NoExecute"] -``` -This example allows tolerations in the following forms: +This allows requires tolerations in the following forms: - tolerations that tolerates taints with key named `mykey` that has a value `value` and with a `NoSchedule` effect. - tolerations that tolerates taints with key `other_key` that has a `NoExecute` effect. 
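The toleration matching described here can be pictured with a short Python sketch (shapes mirror the examples; treating an empty or absent rule dimension as "any value" is an assumption drawn from the coarse-grain examples):

```python
def _matches(item, allowed_list):
    # An empty (or absent) list in a rule dimension means "any".
    return allowed_list == [] or item in allowed_list

def toleration_allowed(toleration, rules):
    """A pod toleration passes if at least one rule matches it on every
    dimension (keys / operators / values / effects)."""
    return any(
        _matches(toleration.get("key"), rule.get("keys", []))
        and _matches(toleration.get("operator"), rule.get("operators", []))
        and _matches(toleration.get("value"), rule.get("values", []))
        and _matches(toleration.get("effect"), rule.get("effects", []))
        for rule in rules
    )

# Rules mirroring the fine-grain tolerations example:
rules = [
    {"keys": ["mykey"], "operators": ["Equal"],
     "values": ["value"], "effects": ["NoSchedule"]},
    {"keys": ["other_key"], "operators": ["Exists"],
     "effects": ["NoExecute"]},
]
```

A toleration must fully satisfy one rule; matching different rules on different dimensions is not enough.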
-##### Coarse-grain allowedTolerations +##### Fine-grain allowed tolerations + ```yaml Kind: SchedulingPolicy spec: - allowedTolerations: - - keys: [] - operators: [] - values: [] - effects: ["PreferNoSchedule"] - - keys: [] - operators: ["Exists"] - effects: ["NoSchedule"] + allowed: + tolerations: + - keys: ["mykey"] + operators: ["Equal"] + values: ["value"] + effects: ["NoSchedule"] + - keys: ["other_key"] + operators: ["Exists"] + effects: ["NoExecute"] ``` -This example allows tolerations in the following forms: + +Here we allow tolerations in the following forms: - tolerations that tolerates all `PreferNoSchedule` taints with any value. - tolerations that tolerates taints based on any key existence with effect `NoSchedule`. Also note that this SchedulingPolicy does not allow tolerating NoExecute taints. +##### Coarse-grain allowed tolerations -### Priority classes +```yaml +Kind: SchedulingPolicy +spec: + allowed: + tolerations: + - keys: [] + operators: [] + values: [] + effects: ["PreferNoSchedule"] + - keys: [] + operators: ["Exists"] + effects: ["NoSchedule"] +``` +an empty list of toleration allows all types of tolerations: -We must be able to enforce users to use specific priority classes by using the `allowedPriorityClasseNames` field. +```yaml +Kind: SchedulingPolicy +spec: + allowed: + tolerations: [] +``` -If `allowedPriorityClasseNames` is absent from SchedulingPolicy, no priority class is allowed. +Which is equivalent to: -#### Examples +```yaml +Kind: SchedulingPolicy +spec: + allowed: + tolerations: + - keys: [] + operators: [] + values: [] + effects: [] +``` + +#### default + +if no toleration is not specified, the following `SchedulingPolicy` will add a: +- toleration for and Taints. +- toleration for Taint. 
-##### Only allow a single priority class ```yaml Kind: SchedulingPolicy spec: - allowedPriorityClasseNames: - - high-priority + default: + tolerations: + - key: "mykey" + operator: "Equal" + values: ["value","other_value"] + effect: "NoSchedule" + - key: "other_key" + operator: "Exists" + effect: "NoExecute" ``` -In this example, only the `high-priority` PriorityClass is allowed. +note: an empty array of toleration is not a valid syntax for default toleration. + +### Priority classes + +Priority class usage can be regulated using fine-grain rules with `priorityClasseName` field. + +##### Only allow a single priority class +```yaml +Kind: SchedulingPolicy +spec: + required: + priorityClasseNames: ["high-priority"] + default: + priorityClasseName: "high-priority" +``` +In this example, only the `high-priority` PriorityClass is enforced by default. ##### Allow all priorities ```yaml Kind: SchedulingPolicy spec: - allowedPriorityClasseNames: [] + allowed: + priorityClasseNames: [] ``` -In this example, all priority classes are allowed. +In this example, all priority classes are allowed, but not mandatory. -### Node Selector +Note: an empty list of required priorityClasseNames is considered as invalid +```yaml +Kind: SchedulingPolicy +spec: + required: + priorityClasseNames: [] +``` -As anti-affinity rules are really time-consuming, we must be able to restrict their usage with `allowedNodeSelectors`. -If `allowedNodeSelectors` is totally absent from the spec, no node selector is allowed. +### Node Selector + +The `nodeSelector` fields in `required`, `default` and `allowed` sections allow to precise what nodeSelectors are mandatory, possible and may provide default values if not set. As for other components, `required` nodeSelectors are automatically considered as allowed. 
#### Examples -##### Fine-grained policy +##### Complete policy + +```yaml +Kind: SchedulingPolicy +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] # pick one + default: # if not set, inject this to pods definitions + nodeSelector: + beta.kubernetes.io/arch: "amd64" + allowed: # other optional allowed nodeSelectors + nodeSelectors: + disk: ["ssd", "hdd"] + failure-domain.beta.kubernetes.io/region: [] # means any value +``` + +In this example, pods can be scheduled if they: +- have no nodeSelector at all. The default `beta.kubernetes.io/arch=amd64` will then be assigned. +- have a nodeSelector `beta.kubernetes.io/arch=amd64` or `beta.kubernetes.io/arch=arm64` + +They can also optionally have: +- `disk: ssd` nodeSelector, +- `disk: hdd` nodeSelector, +- `failure-domain.beta.kubernetes.io/region` nodeSelector with any value. + +##### Allowed-only policy ```yaml Kind: SchedulingPolicy spec: - allowedNodeSelectors: - disk: ["ssd"] - region: [] # means any value + allowed: # other optional allowed nodeSelectors + nodeSelectors: + failure-domain.beta.kubernetes.io/zone: ["eu-west-1a", "eu-west-1b", "eu-west-1c"] ``` -In this example, pods can be scheduled only if they either: -- have no nodeSelector -- or have a `disk: ssd` nodeSelector -- and / or have a `region` key nodeSelector with any value + +In this example, pods can be scheduled if they: +- have no nodeSelector at all. +- `failure-domain.beta.kubernetes.io/zone` nodeSelector with a value in the three listed: `eu-west-1a`, `eu-west-1b` or `eu-west-1c`. 
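The nodeSelector rules walked through above can be sketched as follows (a minimal Python illustration assuming defaulting has already been applied by the mutating phase; the function name and shapes are hypothetical):

```python
def node_selector_admitted(selector, required, allowed):
    """Every required key must be present with one of its listed
    values; every other key must appear in `allowed`, where an empty
    value list means "any value"."""
    for key, values in required.items():
        if selector.get(key) not in values:
            return False
    for key, value in selector.items():
        if key in required:
            continue  # required keys are implicitly allowed
        if key not in allowed:
            return False
        if allowed[key] != [] and value not in allowed[key]:
            return False
    return True

# Mirrors the "Complete policy" example:
required = {"beta.kubernetes.io/arch": ["amd64", "arm64"]}
allowed = {
    "disk": ["ssd", "hdd"],
    "failure-domain.beta.kubernetes.io/region": [],  # any value
}
```

A pod carrying an extra key outside `allowed` (say `gpu: "true"`) would be rejected even if its required selectors are valid.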
### Affinity rules @@ -195,66 +332,355 @@ If `allowedAffinities` is totally absent from the spec, no affinity is allowed w #### Examples ##### Basic policy + ```yaml Kind: SchedulingPolicy spec: - allowedAffinities: - nodeAffinities: - - requiredDuringSchedulingIgnoredDuringExecution - podAntiAffinities: - - requiredDuringSchedulingIgnoredDuringExecution - - preferredDuringSchedulingIgnoredDuringExecution + required: + affinities: + nodeAffinities: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - keys: ["beta.kubernetes.io/arch"] + operators: ["In"] + values: ["amd64", "arm64"] + allowed: + affinities: + nodeAffinities: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - keys: ["failure-domain.beta.kubernetes.io/region","kubernetes.io/authorized-region"] + operators: ["In","NotIn"] + values: ["eu-2", "us-1"] + podAntiAffinities: {} + default: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - keys: ["beta.kubernetes.io/arch"] + operators: ["In"] + values: ["amd64"] +``` + +In this example, we allow: +- hard NodeAffinity based on + - `beta.kubernetes.io/arch` if value is `amd64` or `arm64`. Defaults to `amd64` if unspecified. + - (optionally) any combination of specified in `allowed` secion. 
+- All podAntiAffinities +- No podAffinities + +```yaml +Kind: SchedulingPolicy +spec: + allowed: + affinities: + nodeAffinities: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - keys: ["failure-domain.beta.kubernetes.io/region","kubernetes.io/authorized-region"] + operators: ["In","NotIn"] + values: ["eu-2", "us-1"] + - matchExpressions: + - keys: ["failure-domain.beta.kubernetes.io/zone"] + operators: ["NotIn"] + values: ["dc1", "dc2"] + podAntiAffinities: {} ``` +This example highlights the case where you don't want a full combinatory, here we allow the same affinities as the previous example, in addition, +we allow a match expression that checks that some zone labels are not in the specified values. + + ##### Allow-all policy In this example, all affinities are allowed: + ```yaml Kind: SchedulingPolicy spec: - allowedAffinities: - nodeAffinities: [] - podAffinities: [] - podAntiAffinities: [] + allowed: + affinities: {} +``` + +Which is equivalent to: + +```yaml +Kind: SchedulingPolicy +spec: + allowed: + affinities: + nodeAffinities: {} + podAffinities: {} + podAntiAffinities: {} ``` If a sub-item of allowedAffinities is absent from SchedulingPolicy, it is not allowed e.g: + ```yaml Kind: SchedulingPolicy spec: allowedAffinities: - nodeAffinities: [] + nodeAffinities: {} +``` + +In this example, only nodeAffinities (required and preferred) are allowed but no podAffinities nor podAntiAffinities. + +## Multiple SchedulingPolicies considerations + +several merging strategies are being considered. + +### Option1: smart deep merge + +If RBAC permissions provide a serviceaccount a way to use several schedpols, conflict resolution must occur. 
The proposed behaviour is: + +- `required` fields use an alphabetic order and then a first-seen-wins strategy for each sub-keys +- `allowed` fields are additive +- `default` fields use an alphabetic order and then a first-seen-wins strategy for each sub-keys + +for instance, if we have the following two schedpols that apply: + +```yaml +Kind: SchedulingPolicy +metadata: + name: schedpol-a # first in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] + allowed: + nodeSelectors: + disk: ["ssd"] + default: + nodeSelector: + beta.kubernetes.io/arch: "amd64" +--- +Kind: SchedulingPolicy +metadata: + name: schedpol-b # second in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64", "i386"] + beta.kubernetes.io/os: ["Linux", "Windows"] + priorityClasseNames: ["bronze", "gold", "silver"] + allowed: + nodeSelectors: + disk: ["sata"] + default: + nodeSelector: + beta.kubernetes.io/arch: "i386" + beta.kubernetes.io/os: "Linux" + priorityClasseName: "bronze" +``` + +The merged applied schedpol will be: + +```yaml +Kind: SchedulingPolicy +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] + beta.kubernetes.io/os: ["Linux", "Windows"] + priorityClasseNames: ["bronze", "gold", "silver"] + allowed: + nodeSelectors: + disk: ["ssd", "sata"] + default: + nodeSelector: + beta.kubernetes.io/arch: "amd64" + beta.kubernetes.io/os: "Linux" + priorityClasseName: "bronze" +``` + +### Option2: first-seen wins, ever +In this option the strategy is at SchedulingPolicy level. 
+ +for instance, if we have the following two schedpols that apply: + +```yaml +Kind: SchedulingPolicy +metadata: + name: schedpol-a # first in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] + allowed: + nodeSelectors: + disk: ["ssd"] + default: + nodeSelector: + beta.kubernetes.io/arch: "amd64" +--- +Kind: SchedulingPolicy +metadata: + name: schedpol-b # second in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64", "i386"] + beta.kubernetes.io/os: ["Linux", "Windows"] + priorityClasseNames: ["bronze", "gold", "silver"] + allowed: + nodeSelectors: + disk: ["sata"] + default: + nodeSelector: + beta.kubernetes.io/arch: "i386" + beta.kubernetes.io/os: "Linux" + priorityClasseName: "bronze" +``` + +The merged applied schedpol will be: + +```yaml +Kind: SchedulingPolicy +metadata: + name: schedpol-a # first in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] + allowed: + nodeSelectors: + disk: ["ssd"] + default: + nodeSelector: + beta.kubernetes.io/arch: "amd64" ``` -In this example, only soft and hard nodeAffinities are allowed. -### When both `allowedNodeSelectors` and `nodeAffinities` are specified +### Option3: simple merge + +In this strategy, merge is performed on a first-seen-wins on second-level entries. 
+for instance, if we have the following two schedpols that apply: + +```yaml +Kind: SchedulingPolicy +metadata: + name: schedpol-a # first in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] + allowed: + nodeSelectors: + disk: ["ssd"] + default: + nodeSelector: + beta.kubernetes.io/arch: "amd64" +--- +Kind: SchedulingPolicy +metadata: + name: schedpol-b # second in alphabetic order +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64", "i386"] + beta.kubernetes.io/os: ["Linux", "Windows"] + priorityClasseNames: ["bronze", "gold", "silver"] + allowed: + nodeSelectors: + disk: ["sata"] + default: + nodeSelector: + beta.kubernetes.io/arch: "i386" + beta.kubernetes.io/os: "Linux" + priorityClasseName: "bronze" +``` -Use of both `allowedNodeSelectors` and `nodeAffinities` is not recommended as the latter being way more permissive. +The merged applied schedpol will be: +```yaml +Kind: SchedulingPolicy +spec: + required: + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] + priorityClasseNames: ["bronze", "gold", "silver"] + allowed: + nodeSelectors: + disk: ["ssd"] + default: + nodeSelector: + beta.kubernetes.io/arch: "amd64" + priorityClasseName: "bronze" +``` ## Default SchedulingPolicies ### Restricted policy Here is a reasonable policy that might be allowed for any cluster without specific needs: + ```yaml apiVersion: extensions/valpha1 kind: SchedulingPolicy metadata: name: restricted spec: - allowedSchedulerNames: ["default-scheduler"] + allowed: + schedulerNames: ["default-scheduler"] ``` + It only allows usage of the default scheduler, no tolerations, nodeSelectors nor affinities. 
Multi-archi (x86_64, arm) or multi-OS (Linux, Windows) clusters might also allow the following nodeSelectors: + ```yaml apiVersion: extensions/valpha1 kind: SchedulingPolicy metadata: - name: restricted + name: restricted-multiarch-by-node-selector +spec: + required: + schedulerNames: ["default-scheduler"] + nodeSelectors: + beta.kubernetes.io/arch: ["amd64", "arm64"] # pick one in required values + beta.kubernetes.io/os: ["Linux", "Windows"] # pick one + default: # if not set, inject to pods definitions + schedulerName: "default-scheduler" + nodeSelector: + beta.kubernetes.io/arch: "amd64" + beta.kubernetes.io/os: "Linux" +``` + +```yaml +apiVersion: extensions/valpha1 +kind: SchedulingPolicy +metadata: + name: restricted-multiarch-by-affinity spec: - allowedSchedulerNames: ["default-scheduler"] - allowedNodeSelectors: - beta.kubernetes.io/arch: [] - beta.kubernetes.io/os: [] + required: + schedulerNames: ["default-scheduler"] + affinities: + nodeAffinities: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - keys: ["beta.kubernetes.io/arch"] + operators: ["In"] + values: ["amd64", "arm64"] + - matchExpressions: + - keys: ["beta.kubernetes.io/os"] + operators: ["In"] + values: ["Linux", "Windows"] + default: # if not set, inject to pods definitions + schedulerName: "default-scheduler" + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: "beta.kubernetes.io/arch" + operator: "In" + values: ["amd64"] + - matchExpressions: + - key: "beta.kubernetes.io/os" + operator: "In" + values: ["Linux"] ``` ### Privileged Policy @@ -267,18 +693,12 @@ kind: SchedulingPolicy metadata: name: privileged spec: - allowedSchedulerNames: [] - allowedPriorityClasseNames: [] - allowedNodeSelectors: {} - allowedTolerations: - - keys: [] # any keys - operators: [] # => Equivalent to ["Exists", "Equals"] - values: [] # any values - effects: [] # => Equivalent to 
["PreferNoSchedule", "NoSchedule", "NoExecute"] - allowedAffinities: - nodeAffinities: [] - podAffinities: [] - podAntiAffinities: [] + allowed: + schedulerNames: [] + priorityClasseNames: [] + nodeSelectors: {} + tolerations: [] + affinities: {} ``` ## RBAC @@ -290,6 +710,7 @@ RBAC objects are going to be auto-provisioned at cluster creation / upgrade. This ClusterRole allows the use of the default scheduler: + ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole @@ -308,6 +729,7 @@ rules: ``` This ClusterRoleBinding ensures any serviceaccount can use the default-scheduler: + ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding @@ -328,6 +750,7 @@ subjects: ``` This RoleBinding ensures that kube-system pods can run with no scheduling restriction: + ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding @@ -347,6 +770,7 @@ subjects: name: system:serviceaccounts:kube-system apiGroup: rbac.authorization.k8s.io ``` + # References - [Pod affinity/anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) - [Pod priorities](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) From 8ed464f5f993fad1d612bf2cd23ccecef91318c9 Mon Sep 17 00:00:00 2001 From: Yassine TIJANI Date: Tue, 20 Mar 2018 16:20:48 +0100 Subject: [PATCH 03/10] state clearly that the merge is just going through the policies to compute authorizations --- contributors/design-proposals/scheduling/scheduling-policy.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md index 23d034fac55..e3169c258dd 100644 --- a/contributors/design-proposals/scheduling/scheduling-policy.md +++ b/contributors/design-proposals/scheduling/scheduling-policy.md @@ -2,7 +2,7 @@ _Status: Draft_ -_Author: @arnaudmz, @yastij_ +_Authors: @arnaudmz, @yastij_ _Reviewers: 
@bsalamat, @liggitt_
@@ -431,6 +431,8 @@ In this example, only nodeAffinities (required and preferred) are allowed but no

 ## Multiple SchedulingPolicies considerations

+NOTE: here a merge is the set of resulting authorizations obtained by going through the available policies (i.e. we do not aggregate policies into a newly created `SchedulingPolicy`).
+
 Several merging strategies are being considered.

 ### Option1: smart deep merge

From d129432b0dd6bd0932cbce34aa2e10204254232f Mon Sep 17 00:00:00 2001
From: Yassine TIJANI
Date: Mon, 16 Apr 2018 20:12:25 +0200
Subject: [PATCH 04/10] update schedulingPolicy with all inputs
---
 .../scheduling/scheduling-policy.md | 929 +++++++-----------
 1 file changed, 332 insertions(+), 597 deletions(-)

diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md
index e3169c258dd..dcde3b48224 100644
--- a/contributors/design-proposals/scheduling/scheduling-policy.md
+++ b/contributors/design-proposals/scheduling/scheduling-policy.md
@@ -13,92 +13,168 @@ _Reviewers: @bsalamat, @liggitt_

 ## Non-Goals

-- How taints / tolerations work
-- How NodeSelector works
-- How node / pod affinity / anti-affinity rules work
-- How several schedulers can be used within a single cluster
-- How priority classes work
+- How taints / tolerations work.
+- How NodeSelector works.
+- How node / pod affinity / anti-affinity rules work.
+- How several schedulers can be used within a single cluster.
+- How priority classes work.
+- How to set defaults in Kubernetes.

 # Background

 During real-life Kubernetes architecting we encountered contexts where role-isolation (between administration and simple namespace usage in a multi-tenant context) could be improved. So far, no restriction is possible on toleration, priority class usage, nodeSelector, anti-affinity depending on user permissions (RBAC).
-Identified use-cases aim to ensure that administrators have a way to restrict users or namepace when
-- using schedulers,
-- placing pods on specific nodes (master roles for instance),
-- using specific priority classes,
+Identified use-cases aim to ensure that administrators have a way to restrict users or namespaces. They allow administrators to:
+
+- Restrict execution of specific applications (which are namespace-scoped) to certain nodes
+- Create policies that prevent users from even attempting to schedule workloads onto masters, to maximize security
+- Require that pods under a namespace run on dedicated nodes
+- Restrict usage of some `PriorityClass` names
+- Restrict usage to a specific set of schedulers.
+- Place pods on specific nodes (master roles for instance).
+- Use specific priority classes.
 - expressing pod affinity or anti-affinity rules.

-Mandatory (and optionally default) values must also be enforced by scheduling policies in case of:
+Mandatory (and optionally default) values must also be enforced by scheduling policies in case of:
+
 - multi-arch (amd64, arm64) or multi-OS (Linux, Windows) clusters
 - multi-az / region / failure domain clusters

 # Overview

-Implementing SchedulingPolicy implies:
-- Creating a new resource named **SchedulingPolicy** (schedpol)
-- Creating an **AdmissionController** that dehaves on a deny-all-but basis
-- Allow SchedulingPolicy to be used by pods using RoleBindings or ClusterRoleBindings
+### Syntactic standards

-# Detailed Design
+```yaml
+apiVersion: policy/v1alpha1
+kind:
+metadata:
+  name:
+spec:
+  bindingMode: # Describes a bindingMode (any or all)
+  namespaces: # a list of namespaces that the policy targets (optional)
+  - ns1
+  - ns2
+  namespaceSelector: # a selector to match a set of namespaces (optional)
+    key: value
+  rules: # rules that must be satisfied (optional)
+    fieldA: # name of the field (optional)
+    - match: # Describes how the rule is matched (required)
+      - elt1 # elements here could be objects like tolerations or strings like SchedulerName
+      action: # Describes the action that should be taken (allowed, denied or required)
+
+```
+
+### Policy composition
+
+`bindingMode` describes how policies are composed:
+
+- any: at least one policy must have its rules satisfied
+- all: all policies MUST have their rules satisfied
+
+If we have a heterogeneous bindingMode across policies (i.e. some policies with any and others with all),
+then the most restrictive one (i.e. the all bindingMode) is applied.
+
+### Unset fields
+
+If a field is not specified in the policy, no rule applies (i.e. everything is allowed).
+
+### Empty match
+
+Policies should usually distinguish empty matches, since their meaning depends on the action:
+
+- require/deny: no rule applies.
+- allow: allows everything.
+
+
+### Inside the matches
+
+The match field can express almost any structure, but some things should be considered:
+
+- the structure should match as closely as possible what you take action on.
+- match elements are ANDed.
+- the `*` wildcard is usually needed in the allow section. It should be represented with an empty value (e.g. an empty array).
+
+
+### Conflict handling for policies
+
+There are two kinds of conflicts in policies:
+
+- structural conflicts: these exist when any rule has the same matchingExpression and opposed actions (deny and require)
+
+- semantic conflicts: these exist due to the semantics of the fields (e.g. requiring a NodeSelector `master=true` and also requiring `master=false`)
+
+
+Structural conflicts must be handled at creation time; semantic conflicts, on the other hand, should be handled at runtime: detect the conflict and emit an event stating that the policies could not be satisfied.
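The creation-time check for structural conflicts described above can be sketched as follows. This is a minimal illustration with hypothetical data shapes: a rule set is rejected when two entries on the same field carry structurally identical match expressions but opposed actions.

```python
# Sketch of creation-time structural-conflict detection: reject a set of
# rules when two entries on the same field carry the same match expression
# but opposed actions ("required" vs "denied"). Data shapes are illustrative.

import json

def _canonical(match):
    # Deterministic serialization so structurally equal matches compare equal.
    return json.dumps(match, sort_keys=True)

def has_structural_conflict(rules):
    """rules: {field_name: [{"match": ..., "action": ...}, ...]}"""
    for entries in rules.values():
        actions_by_match = {}
        for entry in entries:
            actions_by_match.setdefault(_canonical(entry["match"]), set()).add(entry["action"])
        if any({"required", "denied"} <= actions for actions in actions_by_match.values()):
            return True
    return False

conflicting = {
    "nodeSelectors": [
        {"match": [{"master": ["true"]}], "action": "required"},
        {"match": [{"master": ["true"]}], "action": "denied"},
    ],
}
```

Semantic conflicts (e.g. the `master=true` / `master=false` example) would pass this check, which is why the proposal defers them to runtime event reporting.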
+
+
+### required vs allowed vs denied
+
+Policies may have overlapping rules; to handle this, policies are computed in the following order:

-SchedulingPolicy resource specs are composed of several main attributes:
-- **Required** scheduling components. These list the manadory NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations and optionally valid values that have to be provided to allow scheduling.
-- **Allowed** scheduling components. These list the optional components that can be specified in pods definition.
-- **Default** scheduling components. These list default values to set unless specified in pods definition.
+- compute what is denied.
+- compute what is required.
+- compute what is allowed.

-SchedulingPolicy resources are supposed to apply in a deny-all-except approach.
+They should also obey the following rules:

-An AdmissionController must be added to the mutating phase to
-- add default values if unspecified for NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations,
-- reject pod scheduling if the serviceaccount running the pod is not allowed to specify requested NodeSelectors, Scheduler-Name, Anti-Affinity rules, Priority class, and Tolerations.
+- everything that is required is by definition allowed.
+- everything that is not denied is not automatically allowed: to be allowed, a rule must not be denied AND must be allowed.
+
+
+# Detailed Design

-All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed.

 ## SchedulingPolicy

-Proposed API group: `extensions/v1alpha1`
+Proposed API group: `policy/v1alpha1`

-SchedulingPolicy is a cluster-scoped resource (not namespaced).

 ### SchedulingPolicy content

-SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this schedpol won't allow any item from the missing field.
+SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field. ```yaml -apiVersion: extensions/valpha1 +apiVersion: policy/v1alpha1 kind: SchedulingPolicy metadata: - name: my-schedpol -spec: - required: - schedulerNames: # Describes schedulers names that are required - priorityClasseNames: # Describes priority class names that are required - nodeSelectors: # Describes node selectors that must be used - affinities: # Describes affinities that must be used - allowed: - schedulerNames: # Describes schedulers names that are allowed - priorityClasseNames: # Describes priority class names that are allowed - nodeSelectors: # Describes node selectors that can be used - tolerations: # Describes tolerations that can be used - affinities: # Describes affinities that can be used - default: - schedulerName: # Describes default scheduler name - priorityClasseName: # Describes default priority class name - nodeSelector: # Describes default node selector - tolerations: # Describes default tolerations - affinity: # Describes default affinity + name: mySchedulingPolicy +spec: + bindingMode: all + namespaces: + - default + rules: + schedulerNames: + - match: [] + action: + priorityClassNames: + - match: [] + action: + tolerations: + - match: [] + action: + nodeSelectors: + - match: [] + action: + nodeAffinities: + - match: [] + action: + podAntiAffinities: + - match: [] + action: + podAffinities: + - match: [] + action: ``` + ### required Elements here are required, pods won't schedule if they aren't present. Also note that if something is required it is also allowed. ### allowed Elements here are allowed, the policy allows the presence of these elements. From Pod's perspective, a pod can use one or N of the allowed items. 
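The precedence between deny, require and allow described above can be sketched as a small predicate. This is a hypothetical simplification in which each action has already been reduced to a set of concrete values:

```python
# Sketch of the precedence rules: denies are checked first, everything
# required is implicitly allowed, and a value is permitted only if it is
# allowed (explicitly or via required) AND not denied. Sets are illustrative.

def is_permitted(value, denied=frozenset(), required=frozenset(), allowed=frozenset()):
    if value in denied:
        return False  # deny wins over everything
    # Being merely "not denied" is not enough; the value must be allowed
    # explicitly or via a requirement.
    return value in required or value in allowed

denied = {"master-scheduler"}
required = {"green-scheduler"}
allowed = {"my-scheduler"}
```

Note that an unlisted value is rejected even though nothing denies it, matching the rule that "not denied" does not imply "allowed".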
-### default -If pods do not specify one, the elements here will be added - +### deny +If pods specify one of these, the pod won't schedule to a node, we won't dive into deny as it is the exact opposite of required. ### Scheduler name @@ -107,46 +183,73 @@ If `schedulerNames` is absent from `allowed`, `default` or `required`, no schedu #### required Require that pods use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`): + ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - required: - SchedulerNames: ["green-scheduler", "my-scheduler"] + bindingMode: all + namespaces: + - default + rules: + schedulerNames: + - match: ["green-scheduler","my-scheduler"] + action: required ``` -An empty list of schedulerNames here is not a valid syntax: +An empty list of schedulerNames has no effect, as pod use the `defaultScheduler` if no scheduler is specified: + ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - required: - SchedulerNames: [] # equivalent to ["default-scheduler"] or to not specifying this item + bindingMode: all + namespaces: + - default + rules: + schedulerNames: + - match: [] + action: required ``` #### allowed -Allow pods to use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`): +Allow pods to use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`) in the namespace `default`: + ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - allowed: - 
SchedulerNames: ["green-scheduler", "my-scheduler"] + bindingMode: all + namespaces: + - default + rules: + schedulerNames: + - match: ["green-scheduler","my-scheduler"] + action: allowed ``` An empty list of schedulerNames will allow usage of all schedulers: -```yaml -Kind: SchedulingPolicy -spec: - allowed: - SchedulerNames: [] -``` -#### default -pods will default use either the my-scheduler if nothing is specified in `spec.schedulerName`: ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - default: - SchedulerName: "my-scheduler" + bindingMode: all + namespaces: + - default + rules: + schedulerNames: + - match: [] + action: allowed ``` @@ -154,6 +257,33 @@ spec: Toleration usage can be regulated using fine-grain rules with `tolerations` field. If specifying multiple `tolerations`, pod will be scheduled if one of the tolerations is satisfied. +#### required + +This allows to require toleration in the following forms of: + +- tolerations that tolerates taints with key named `projectA-dedicated` with all effects. + +##### Fine-grain allowed tolerations + +```yaml +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy +spec: + bindingMode: all + namespaces: + - projectA + rules: + tolerations: + - match: + - keys: ["projectA-dedicated"] + operators: ["Exists"] + effects: [] + action: required +``` + + an empty list of matches has no effect (i.e. do not require anything). 
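The fine-grained toleration match above can be read as follows. This is a sketch with field names mirroring the YAML (they are illustrative, not the real API), and with empty lists acting as wildcards per the proposal's `*` convention:

```python
# Sketch: check one pod toleration against a policy match expression.
# An empty list in the match behaves as a wildcard; field names mirror
# the YAML above and are illustrative.

def toleration_matches(toleration, match):
    def ok(permitted, actual):
        # Empty list = wildcard, otherwise the actual value must be listed.
        return not permitted or actual in permitted
    return (ok(match.get("keys", []), toleration.get("key"))
            and ok(match.get("operators", []), toleration.get("operator"))
            and ok(match.get("values", []), toleration.get("value"))
            and ok(match.get("effects", []), toleration.get("effect")))

match = {"keys": ["projectA-dedicated"], "operators": ["Exists"], "effects": []}
```

With this match, a toleration for the `projectA-dedicated` key with any effect is accepted, while one for another key is not.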
#### Allowed @@ -164,17 +294,25 @@ This allows requires tolerations in the following forms: ##### Fine-grain allowed tolerations ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - allowed: + bindingMode: all + namespaces: + - default + rules: tolerations: - - keys: ["mykey"] - operators: ["Equal"] - values: ["value"] - effects: ["NoSchedule"] - - keys: ["other_key"] - operators: ["Exists"] - effects: ["NoExecute"] + - match: + - keys: ["mykey"] + operators: ["Equal"] + values: ["value"] + effects: ["NoSchedule"] + - keys: ["other_key"] + operators: ["Exists"] + effects: ["NoExecute"] + action: allowed ``` Here we allow tolerations in the following forms: @@ -184,142 +322,136 @@ Also note that this SchedulingPolicy does not allow tolerating NoExecute taints. ##### Coarse-grain allowed tolerations -```yaml -Kind: SchedulingPolicy -spec: - allowed: - tolerations: - - keys: [] - operators: [] - values: [] - effects: ["PreferNoSchedule"] - - keys: [] - operators: ["Exists"] - effects: ["NoSchedule"] -``` + an empty list of toleration allows all types of tolerations: ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - allowed: - tolerations: [] + bindingMode: all + namespaces: + - default + rules: + tolerations: + - match: [] + action: allowed ``` Which is equivalent to: + ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - allowed: + bindingMode: all + namespaces: + - default + rules: tolerations: - - keys: [] - operators: [] - values: [] - effects: [] + - match: + - keys: [] + operators: [] + values: [] + effects: [] + action: allowed ``` -#### default - -if no toleration is not specified, the following `SchedulingPolicy` will add a: -- toleration for and Taints. -- toleration for Taint. 
-```yaml -Kind: SchedulingPolicy -spec: - default: - tolerations: - - key: "mykey" - operator: "Equal" - values: ["value","other_value"] - effect: "NoSchedule" - - key: "other_key" - operator: "Exists" - effect: "NoExecute" -``` -note: an empty array of toleration is not a valid syntax for default toleration. ### Priority classes Priority class usage can be regulated using fine-grain rules with `priorityClasseName` field. -##### Only allow a single priority class +##### required + +this example requires a `priorityClass` that is either `high-priority` or `critical-job` ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - required: - priorityClasseNames: ["high-priority"] - default: - priorityClasseName: "high-priority" + bindingMode: all + namespaces: + - default + rules: + priorityClasseNames: + - match: ["high-priority","critical-job"] + action: required ``` -In this example, only the `high-priority` PriorityClass is enforced by default. -##### Allow all priorities -```yaml -Kind: SchedulingPolicy -spec: - allowed: - priorityClasseNames: [] -``` -In this example, all priority classes are allowed, but not mandatory. -Note: an empty list of required priorityClasseNames is considered as invalid +##### Allow + +In this example, we only allow the `critical-job` priority. + + ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - required: - priorityClasseNames: [] + bindingMode: all + namespaces: + - default + rules: + priorityClasseNames: + - match: ["critical-priority"] + action: allowed ``` + + ### Node Selector -The `nodeSelector` fields in `required`, `default` and `allowed` sections allow to precise what nodeSelectors are mandatory, possible and may provide default values if not set. As for other components, `required` nodeSelectors are automatically considered as allowed. 
+The `nodeSelector` field makes it possible to specify what nodeSelectors are required, allowed and denied. As for other components, `required` nodeSelectors are automatically considered as allowed. #### Examples ##### Complete policy ```yaml -Kind: SchedulingPolicy +apiVersion: policy/v1alpha1 +kind: SchedulingPolicy +metadata: + name: mySchedulingPolicy spec: - required: + bindingMode: all + namespaces: + - default + rules: nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] # pick one - default: # if not set, inject this to pods definitions - nodeSelector: - beta.kubernetes.io/arch: "amd64" - allowed: # other optional allowed nodeSelectors - nodeSelectors: - disk: ["ssd", "hdd"] - failure-domain.beta.kubernetes.io/region: [] # means any value + - match: + - beta.kubernetes.io/arch: ["amd64", "arm64"] + - team: [] + action: required + - match: + - failure-domain.beta.kubernetes.io/region: [] + - disk: ["ssd", "hdd"] + action: allowed + ``` -In this example, pods can be scheduled if they: -- have no nodeSelector at all. The default `beta.kubernetes.io/arch=amd64` will then be assigned. -- have a nodeSelector `beta.kubernetes.io/arch=amd64` or `beta.kubernetes.io/arch=arm64` -They can also optionally have: +In this example, pods can be scheduled if they match the two following: +- a nodeSelector `beta.kubernetes.io/arch=amd64` or `beta.kubernetes.io/arch=arm64`. +- a nodeSelector with the key `team`. + +They can also optionally specify: - `disk: ssd` nodeSelector, - `disk: hdd` nodeSelector, - `failure-domain.beta.kubernetes.io/region` nodeSelector with any value. -##### Allowed-only policy - -```yaml -Kind: SchedulingPolicy -spec: - allowed: # other optional allowed nodeSelectors - nodeSelectors: - failure-domain.beta.kubernetes.io/zone: ["eu-west-1a", "eu-west-1b", "eu-west-1c"] -``` - -In this example, pods can be scheduled if they: -- have no nodeSelector at all. 
-- `failure-domain.beta.kubernetes.io/zone` nodeSelector with a value in the three listed: `eu-west-1a`, `eu-west-1b` or `eu-west-1c`. ### Affinity rules @@ -333,449 +465,52 @@ If `allowedAffinities` is totally absent from the spec, no affinity is allowed w ##### Basic policy -```yaml -Kind: SchedulingPolicy -spec: - required: - affinities: - nodeAffinities: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - keys: ["beta.kubernetes.io/arch"] - operators: ["In"] - values: ["amd64", "arm64"] - allowed: - affinities: - nodeAffinities: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - keys: ["failure-domain.beta.kubernetes.io/region","kubernetes.io/authorized-region"] - operators: ["In","NotIn"] - values: ["eu-2", "us-1"] - podAntiAffinities: {} - default: - affinity: - nodeAffinity: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - keys: ["beta.kubernetes.io/arch"] - operators: ["In"] - values: ["amd64"] -``` - -In this example, we allow: -- hard NodeAffinity based on - - `beta.kubernetes.io/arch` if value is `amd64` or `arm64`. Defaults to `amd64` if unspecified. - - (optionally) any combination of specified in `allowed` secion. 
-- All podAntiAffinities -- No podAffinities - -```yaml -Kind: SchedulingPolicy -spec: - allowed: - affinities: - nodeAffinities: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - keys: ["failure-domain.beta.kubernetes.io/region","kubernetes.io/authorized-region"] - operators: ["In","NotIn"] - values: ["eu-2", "us-1"] - - matchExpressions: - - keys: ["failure-domain.beta.kubernetes.io/zone"] - operators: ["NotIn"] - values: ["dc1", "dc2"] - podAntiAffinities: {} -``` - -This example highlights the case where you don't want a full combinatory, here we allow the same affinities as the previous example, in addition, -we allow a match expression that checks that some zone labels are not in the specified values. - - -##### Allow-all policy -In this example, all affinities are allowed: -```yaml -Kind: SchedulingPolicy -spec: - allowed: - affinities: {} -``` - -Which is equivalent to: - -```yaml -Kind: SchedulingPolicy -spec: - allowed: - affinities: - nodeAffinities: {} - podAffinities: {} - podAntiAffinities: {} -``` - -If a sub-item of allowedAffinities is absent from SchedulingPolicy, it is not allowed e.g: - -```yaml -Kind: SchedulingPolicy -spec: - allowedAffinities: - nodeAffinities: {} -``` - -In this example, only nodeAffinities (required and preferred) are allowed but no podAffinities nor podAntiAffinities. - -## Multiple SchedulingPolicies considerations - -NOTE: here a merge is the set of resulting authorizations after going through the available policies (i.e. we do not aggregate policies into a newly created `SchedulingPolicy`) - -several merging strategies are being considered. - -### Option1: smart deep merge - -If RBAC permissions provide a serviceaccount a way to use several schedpols, conflict resolution must occur. 
The proposed behaviour is: - -- `required` fields use an alphabetic order and then a first-seen-wins strategy for each sub-keys -- `allowed` fields are additive -- `default` fields use an alphabetic order and then a first-seen-wins strategy for each sub-keys - -for instance, if we have the following two schedpols that apply: - -```yaml -Kind: SchedulingPolicy -metadata: - name: schedpol-a # first in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] - allowed: - nodeSelectors: - disk: ["ssd"] - default: - nodeSelector: - beta.kubernetes.io/arch: "amd64" ---- -Kind: SchedulingPolicy -metadata: - name: schedpol-b # second in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64", "i386"] - beta.kubernetes.io/os: ["Linux", "Windows"] - priorityClasseNames: ["bronze", "gold", "silver"] - allowed: - nodeSelectors: - disk: ["sata"] - default: - nodeSelector: - beta.kubernetes.io/arch: "i386" - beta.kubernetes.io/os: "Linux" - priorityClasseName: "bronze" -``` - -The merged applied schedpol will be: - -```yaml -Kind: SchedulingPolicy -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] - beta.kubernetes.io/os: ["Linux", "Windows"] - priorityClasseNames: ["bronze", "gold", "silver"] - allowed: - nodeSelectors: - disk: ["ssd", "sata"] - default: - nodeSelector: - beta.kubernetes.io/arch: "amd64" - beta.kubernetes.io/os: "Linux" - priorityClasseName: "bronze" -``` - -### Option2: first-seen wins, ever -In this option the strategy is at SchedulingPolicy level. 
- -for instance, if we have the following two schedpols that apply: ```yaml -Kind: SchedulingPolicy -metadata: - name: schedpol-a # first in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] - allowed: - nodeSelectors: - disk: ["ssd"] - default: - nodeSelector: - beta.kubernetes.io/arch: "amd64" ---- -Kind: SchedulingPolicy -metadata: - name: schedpol-b # second in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64", "i386"] - beta.kubernetes.io/os: ["Linux", "Windows"] - priorityClasseNames: ["bronze", "gold", "silver"] - allowed: - nodeSelectors: - disk: ["sata"] - default: - nodeSelector: - beta.kubernetes.io/arch: "i386" - beta.kubernetes.io/os: "Linux" - priorityClasseName: "bronze" -``` - -The merged applied schedpol will be: - -```yaml -Kind: SchedulingPolicy -metadata: - name: schedpol-a # first in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] - allowed: - nodeSelectors: - disk: ["ssd"] - default: - nodeSelector: - beta.kubernetes.io/arch: "amd64" -``` - -### Option3: simple merge - -In this strategy, merge is performed on a first-seen-wins on second-level entries. 
-for instance, if we have the following two schedpols that apply: - -```yaml -Kind: SchedulingPolicy -metadata: - name: schedpol-a # first in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] - allowed: - nodeSelectors: - disk: ["ssd"] - default: - nodeSelector: - beta.kubernetes.io/arch: "amd64" ---- -Kind: SchedulingPolicy -metadata: - name: schedpol-b # second in alphabetic order -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64", "i386"] - beta.kubernetes.io/os: ["Linux", "Windows"] - priorityClasseNames: ["bronze", "gold", "silver"] - allowed: - nodeSelectors: - disk: ["sata"] - default: - nodeSelector: - beta.kubernetes.io/arch: "i386" - beta.kubernetes.io/os: "Linux" - priorityClasseName: "bronze" -``` - -The merged applied schedpol will be: - -```yaml -Kind: SchedulingPolicy -spec: - required: - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] - priorityClasseNames: ["bronze", "gold", "silver"] - allowed: - nodeSelectors: - disk: ["ssd"] - default: - nodeSelector: - beta.kubernetes.io/arch: "amd64" - priorityClasseName: "bronze" -``` -## Default SchedulingPolicies - -### Restricted policy -Here is a reasonable policy that might be allowed for any cluster without specific needs: - -```yaml -apiVersion: extensions/valpha1 +apiVersion: policy/v1alpha1 kind: SchedulingPolicy metadata: - name: restricted -spec: - allowed: - schedulerNames: ["default-scheduler"] -``` - -It only allows usage of the default scheduler, no tolerations, nodeSelectors nor affinities. 
- -Multi-archi (x86_64, arm) or multi-OS (Linux, Windows) clusters might also allow the following nodeSelectors: + name: mySchedulingPolicy +spec: + bindingMode: all + namespaces: + - default + rules: + nodeAffinities: + - match: + - keys: ["failure-domain.beta.kubernetes.io/region","authorized-region"] + operators: ["In","NotIn"] + values: ["eu-2", "us-1"] + type: "requiredDuringSchedulingIgnoredDuringExecution" + action: allowed + - match: + - keys: ["beta.kubernetes.io/arch"] + operators: ["In"] + values: ["amd64", "arm64"] + type: "requiredDuringSchedulingIgnoredDuringExecution" + action: required + podAntiAffinities: + - match: [] + action: allowed -```yaml -apiVersion: extensions/valpha1 -kind: SchedulingPolicy -metadata: - name: restricted-multiarch-by-node-selector -spec: - required: - schedulerNames: ["default-scheduler"] - nodeSelectors: - beta.kubernetes.io/arch: ["amd64", "arm64"] # pick one in required values - beta.kubernetes.io/os: ["Linux", "Windows"] # pick one - default: # if not set, inject to pods definitions - schedulerName: "default-scheduler" - nodeSelector: - beta.kubernetes.io/arch: "amd64" - beta.kubernetes.io/os: "Linux" ``` -```yaml -apiVersion: extensions/valpha1 -kind: SchedulingPolicy -metadata: - name: restricted-multiarch-by-affinity -spec: - required: - schedulerNames: ["default-scheduler"] - affinities: - nodeAffinities: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - keys: ["beta.kubernetes.io/arch"] - operators: ["In"] - values: ["amd64", "arm64"] - - matchExpressions: - - keys: ["beta.kubernetes.io/os"] - operators: ["In"] - values: ["Linux", "Windows"] - default: # if not set, inject to pods definitions - schedulerName: "default-scheduler" - affinity: - nodeAffinity: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - key: "beta.kubernetes.io/arch" - operator: "In" - values: ["amd64"] - - matchExpressions: - - key: 
"beta.kubernetes.io/os" - operator: "In" - values: ["Linux"] -``` - -### Privileged Policy - -This is the privileged SchedulingPolicy, it allows usage of all schedulers, priority classes, nodeSelectors, affinities and tolerations. - -```yaml -apiVersion: extensions/valpha1 -kind: SchedulingPolicy -metadata: - name: privileged -spec: - allowed: - schedulerNames: [] - priorityClasseNames: [] - nodeSelectors: {} - tolerations: [] - affinities: {} -``` - -## RBAC -SchedulingPolicy are supposed to be allowed using the verb `use` to apply at pod runtime - -the following default ClusterRoles / ClusterRoleBindings are supposed to be provisioned to ensure at least the default-scheduler can be used. - -RBAC objects are going to be auto-provisioned at cluster creation / upgrade. - - -This ClusterRole allows the use of the default scheduler: - -```yaml -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - annotations: - rbac.authorization.kubernetes.io/autoupdate: "true" - labels: - kubernetes.io/bootstrapping: rbac-defaults - name: sp:restricted -rules: -- apiGroups: ['extensions'] - resources: ['schedulingpolicies'] - verbs: ['use'] - resourceNames: - - restricted -``` +In this example, we allow: +- hard NodeAffinity based on + - `beta.kubernetes.io/arch` if value is `amd64` or `arm64`. + - any combination of specified in `allowed` rule. 
+- All podAntiAffinities +- No podAffinities -This ClusterRoleBinding ensures any serviceaccount can use the default-scheduler: -```yaml -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRoleBinding -metadata: - annotations: - rbac.authorization.kubernetes.io/autoupdate: "true" - labels: - kubernetes.io/bootstrapping: rbac-defaults - name: sp:restricted -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: ClusterRole - name: sp:restricted -subjects: -- kind: Group - name: system:authenticated - apiGroup: rbac.authorization.k8s.io -``` +##### By default behavior -This RoleBinding ensures that kube-system pods can run with no scheduling restriction: +by default no `SchedulingPolicy` is created, so any workload will be running as expected (i.e. no restriction apply). -```yaml -apiVersion: rbac.authorization.k8s.io/v1 -kind: RoleBinding -metadata: - annotations: - rbac.authorization.kubernetes.io/autoupdate: "true" - labels: - kubernetes.io/bootstrapping: rbac-defaults - name: sp:kube-system-privileged - namespace: kube-system -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: ClusterRole - name: sp:privileged -subjects: -- kind: Group - name: system:serviceaccounts:kube-system - apiGroup: rbac.authorization.k8s.io -``` # References - [Pod affinity/anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) - [Pod priorities](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) - [Taints and tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/) -- [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/) - [Using multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/) From aad9f2fcad32b42ceb0df074028962b2b8c76f01 Mon Sep 17 00:00:00 2001 From: Yassine TIJANI Date: Thu, 17 May 2018 20:20:58 +0200 Subject: [PATCH 05/10] delete ambiguous statement --- 
 contributors/design-proposals/scheduling/scheduling-policy.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md
index dcde3b48224..bab6ae7f86b 100644
--- a/contributors/design-proposals/scheduling/scheduling-policy.md
+++ b/contributors/design-proposals/scheduling/scheduling-policy.md
@@ -118,7 +118,7 @@ Policies may have overlapping rules, to handle this policies are computed in the
 they should also obey to the following rules:

 - everything that is required is by definition allowed.
-- everything that is not denied is not automatically allowed: to be allowed a rule must not be denied AND must be allowed.
+- everything that is not denied is not automatically allowed.

 # Detailed Design

From bfc1bb9e5ad2db785bdc514da274aaeece47e2ad Mon Sep 17 00:00:00 2001
From: Yassine TIJANI
Date: Sun, 22 Jul 2018 22:07:54 +0200
Subject: [PATCH 06/10] updating schedulingPolicy proposal after aligning

---
 .../scheduling/scheduling-policy.md | 273 ++++++++----------
 1 file changed, 125 insertions(+), 148 deletions(-)

diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md
index bab6ae7f86b..b9551fe26fc 100644
--- a/contributors/design-proposals/scheduling/scheduling-policy.md
+++ b/contributors/design-proposals/scheduling/scheduling-policy.md
@@ -4,7 +4,7 @@
 _Status: Draft_

 _Authors: @arnaudmz, @yastij_
-_Reviewers: @bsalamat, @liggitt_
+_Reviewers: @bsalamat, @tallclair, @liggitt_

 # Objectives

@@ -31,13 +31,11 @@ Identified use-cases aim to ensure that administrators have a way to restrict us
 - require that pods under a namespace run on dedicated nodes
 - Restrict usage of some `PriorityClass`
 - Restrict usage to a specific set of schedulers.
-- placing pods on specific nodes (master roles for instance).
-- using specific priority classes.
-- expressing pod affinity or anti-affinity rules.
+- enforcing pod affinity or anti-affinity rules on a particular namespace.

-Also Mandatory (and optionally default) values must also be enforced by scheduling policies in case of:
+Mandatory values must also be enforced by scheduling policies in case of:

-- mutli-arch (amd64, arm64) of multi-os (Linux, Windows) clusters
+- multi-arch (amd64, arm64) or multi-OS (Linux, Windows) clusters (will also be handled later by the [RuntimeClass]() KEP)
 - multi-az / region / failure domain clusters

 # Overview

@@ -50,75 +48,75 @@ kind:
 metadata:
   name:
 spec:
-  bindingMode: # Describes a bindingMode (any or all)
-  namespaces: # a list of namespaces that the policy targets (optional)
-  - ns1
-  - ns1
-  namespaceSelector: # a selector to match a set of namespaces (optional)
-    key: value
+  priority:
+  namespaces:
+    matchNames:
+    - ns1
+    - ns2
+    matchLabels:
+      key: value
+  action: # Describes the action that should be taken (allowed, denied or required)
   rules: # rules that must be satisfied (optional)
     fieldA: # name of the field (optional)
     - match: # Describes how the rule is matched (required)
       - elt1 # elements here could be objects like tolerations or strings like SchedulerName
-      action: # Describes the action that should be taken (allowed, denied or required)

 ```

-### policy composition
-
-`bindingMode` Describes how policies should be composed:
-- any: any policy that has its rules satisfied
-- all: all policies MUST have its rules satisfied
+### empty match and unset fields:

-if we have a heterogenous bindingMode across policies (i.e. some policies with any and others with all).
-Then the most restrictive one (i.e the all bindingMode) is applied.
+if a field is set to empty in the policy, except when the action is `required`, it matches everything and the corresponding action applies.

 ### unset fields:

-if a field was not specified in the policy no rule should apply (i.e. everything is allowed).
-
-### empty match:
-
-Usually policies should distinguish between empty matches, since it depends on the action:
-
-- require/deny: no rule apply.
-
-- allow: allows everything.
-
+When a field is not specified, it is automatically allowed. This makes it easy to roll out the feature for existing clusters, as it makes everything allowed in the cluster when no policy is created.

 ### inside the matches:

-The match field can express pretty much any structure you want to, but there's some things that should be considered:
+At the policy level, the match field can express pretty much any structure you want to, but there are some things that should be considered:

 - the structure should match as much as possible what you try to take action on.
-- match elements are ANDed.
-- the `*` wildcard is usually needed in the allow section. It should be represented with an empty value (e.g. empty array).
+- match elements are combined.
+
+Example:
+
+```
+- match:
+  - keys: ["projectA-dedicated","projectB-dedicated"]
+    operators: ["Exists"]
+    effects: []
+```
+
+- This matches the structure of a toleration.
+- This match means the following:
+  - The toleration `projectA-dedicated` with the operator `Exists`
+  - The toleration `projectB-dedicated` with the operator `Exists`

-### conflict handling for policies
+### policy composition and conflict handling

-There is two kinds of conflict in policies:
+Policies are composed by ANDing them; note that rules of policies from a lower priority are superseded by ones from a higher priority if there is a conflict.

-- structural conflicts: these exists when any rule has the same matchingExpression and opposed actions (deny and require)
+There are two kinds of conflicts to handle in policies of the same priority level: structural and semantic conflicts.

-- semantical conflicts: these exists due to semantic of the fields (e.g. require a NodeSelector `master=true` and also require `master=false`)
-Structural conflicts must handled at creation time, on the other hand, semantical conflicts should be handled at runtime: detect that there's a conflict and emit an event stating that policies couldn't be satisfied due to a conflict.
+Structural conflicts must be handled at creation time as much as possible; semantic conflicts, on the other hand, should be handled at runtime: detect that there is a conflict and emit an event stating that policies couldn't be satisfied because of it.

-### required vs allowed vs denied
+
+### how policies are computed

 Policies may have overlapping rules, to handle this policies are computed in the following order:

-- compute what is denied.
-- compute what is required.
-- compute what was allowed.
+- compute policies at `exception` priority.
+- compute policies at `cluster` priority.
+- compute policies at `user` priority.
+- compute policies at `default` priority.
+
+If a policy doesn't specify a priority, the default priority applies.

 they should also obey to the following rules:

 - everything that is required is by definition allowed.
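The priority-ordered computation above can be sketched in code. This is an illustrative sketch only, not part of the proposed API: policies are modeled as plain dicts and `matches` is a hypothetical predicate standing in for the policy's rule matching.

```python
# Hypothetical sketch of the priority-ordered evaluation described above:
# `exception` policies are computed first, then `cluster`, `user`, and
# `default`; a policy without an explicit priority falls into `default`.
PRIORITY_ORDER = ["exception", "cluster", "user", "default"]

def effective_action(policies, pod):
    """Return the action of the first matching policy, walking priorities
    from `exception` down to `default`; no match means the pod is allowed."""
    for level in PRIORITY_ORDER:
        for policy in policies:
            if policy.get("priority", "default") == level and policy["matches"](pod):
                return policy["action"]
    return "allowed"  # no policy matched: everything is allowed
```

With this sketch, an `exception`-priority policy always wins over a `cluster`-priority one, matching the order listed above.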
# Detailed Design @@ -139,31 +137,29 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all + priority: namespaces: - - default + matchNames: + - ns1 + - ns2 + matchLabels: + key: value + action: rules: schedulerNames: - - match: [] - action: + match: [] priorityClassNames: - - match: [] - action: + match: [] tolerations: - - match: [] - action: + match: [] nodeSelectors: - - match: [] - action: + match: [] nodeAffinities: - - match: [] - action: + match: [] podAntiAffinities: - - match: [] - action: + match: [] podAffinities: - - match: [] - action: + match: [] ``` @@ -190,13 +186,13 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: required rules: schedulerNames: - - match: ["green-scheduler","my-scheduler"] - action: required + match: ["green-scheduler","my-scheduler"] ``` An empty list of schedulerNames has no effect, as pod use the `defaultScheduler` if no scheduler is specified: @@ -207,18 +203,18 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: required rules: schedulerNames: - - match: [] - action: required + match: [] ``` #### allowed -Allow pods to use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`) in the namespace `default`: +Allow pods to use either the `green-scheduler` (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`) in the namespace `default`: ```yaml apiVersion: policy/v1alpha1 @@ -226,13 +222,13 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: allowed rules: schedulerNames: - - match: 
["green-scheduler","my-scheduler"] - action: allowed + match: ["green-scheduler","my-scheduler"] ``` An empty list of schedulerNames will allow usage of all schedulers: @@ -243,15 +239,16 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: allowed rules: schedulerNames: - - match: [] - action: allowed + match: [] ``` +note: this policy has no effect as we allow all when no policy is set. ### Tolerations @@ -259,11 +256,10 @@ Toleration usage can be regulated using fine-grain rules with `tolerations` fiel #### required -This allows to require toleration in the following forms of: +This requires toleration in the following forms of: - tolerations that tolerates taints with key named `projectA-dedicated` with all effects. - -##### Fine-grain allowed tolerations +- tolerations that tolerates taints with key named `node-misc` with `NoSchedule` effect. ```yaml apiVersion: policy/v1alpha1 @@ -271,19 +267,22 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - projectA + matchNames: + - projectA + action: required rules: tolerations: - - match: + match: - keys: ["projectA-dedicated"] operators: ["Exists"] effects: [] - action: required + - keys: ["node-misc"] + operators: ["Exists"] + effects: ["NoSchedule"] ``` - an empty list of matches has no effect (i.e. do not require anything). + note: an empty list of matches has no effect (i.e. do not require anything). #### Allowed @@ -291,20 +290,19 @@ This allows requires tolerations in the following forms: - tolerations that tolerates taints with key named `mykey` that has a value `value` and with a `NoSchedule` effect. - tolerations that tolerates taints with key `other_key` that has a `NoExecute` effect. 
-##### Fine-grain allowed tolerations - ```yaml apiVersion: policy/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - projectA + action: allowed rules: tolerations: - - match: + match: - keys: ["mykey"] operators: ["Equal"] values: ["value"] @@ -312,7 +310,6 @@ spec: - keys: ["other_key"] operators: ["Exists"] effects: ["NoExecute"] - action: allowed ``` Here we allow tolerations in the following forms: @@ -331,13 +328,13 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - projectA + action: allowed rules: tolerations: - - match: [] - action: allowed + match: [] ``` Which is equivalent to: @@ -349,17 +346,17 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - projectA + action: allowed rules: tolerations: - - match: + match: - keys: [] operators: [] values: [] effects: [] - action: allowed ``` @@ -378,13 +375,13 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: required rules: priorityClasseNames: - - match: ["high-priority","critical-job"] - action: required + match: ["high-priority","critical-job"] ``` @@ -400,13 +397,13 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: allowed rules: priorityClasseNames: - match: ["critical-priority"] - action: allowed ``` @@ -426,28 +423,19 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: required rules: nodeSelectors: - - match: - - beta.kubernetes.io/arch: ["amd64", "arm64"] - - team: [] - action: required - - match: - - failure-domain.beta.kubernetes.io/region: [] - - disk: ["ssd", "hdd"] - action: allowed - + 
match: + - failure-domain.beta.kubernetes.io/region: ["northeurope"] + - disk: ["ssd", "premium"] ``` -In this example, pods can be scheduled if they match the two following: -- a nodeSelector `beta.kubernetes.io/arch=amd64` or `beta.kubernetes.io/arch=arm64`. -- a nodeSelector with the key `team`. - -They can also optionally specify: +In this example, pods cannot be scheduled if they match one of the following: - `disk: ssd` nodeSelector, - `disk: hdd` nodeSelector, - `failure-domain.beta.kubernetes.io/region` nodeSelector with any value. @@ -473,40 +461,29 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - bindingMode: all namespaces: - - default + matchNames: + - default + action: required rules: nodeAffinities: - - match: - - keys: ["failure-domain.beta.kubernetes.io/region","authorized-region"] - operators: ["In","NotIn"] - values: ["eu-2", "us-1"] - type: "requiredDuringSchedulingIgnoredDuringExecution" - action: allowed - - match: - - keys: ["beta.kubernetes.io/arch"] - operators: ["In"] - values: ["amd64", "arm64"] - type: "requiredDuringSchedulingIgnoredDuringExecution" - action: required - podAntiAffinities: - - match: [] - action: allowed + requiredDuringSchedulingIgnoredDuringExecution: + match: + - keys: ["failure-domain.beta.kubernetes.io/region","authorized-region"] + operator: "NotIn" + values: ["eu-2", "us-1"] + - keys: ["PCI-region"] + operator: "Exists" + values: [] + preferredDuringSchedulingIgnoredDuringExecution: + match: + - keys: ["flavor"] + operator: In + values: ["m1.small", "m1.medium"] ``` -In this example, we allow: -- hard NodeAffinity based on - - `beta.kubernetes.io/arch` if value is `amd64` or `arm64`. - - any combination of specified in `allowed` rule. -- All podAntiAffinities -- No podAffinities - - -##### By default behavior - -by default no `SchedulingPolicy` is created, so any workload will be running as expected (i.e. no restriction apply). 
+In this example, we require pods to use nodeAffinity to select nodes having `failure-domain.beta.kubernetes.io/region` or `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`) # References From f7465aa3ee72e4df0f30d4f6ca88ca076dfa7597 Mon Sep 17 00:00:00 2001 From: Yassine TIJANI Date: Thu, 23 Aug 2018 00:47:42 +0200 Subject: [PATCH 07/10] updating the proposal with the newest algorithm and structures --- .../scheduling/scheduling-policy.md | 341 ++++-------------- 1 file changed, 75 insertions(+), 266 deletions(-) diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md index b9551fe26fc..5d37e8ce2e4 100644 --- a/contributors/design-proposals/scheduling/scheduling-policy.md +++ b/contributors/design-proposals/scheduling/scheduling-policy.md @@ -42,81 +42,22 @@ Also Mandatory values must also be enforced by scheduling policies in case of: ### syntaxic standards -```yaml -apiVersion: policy/v1alpha1 -kind: -metadata: - name: -spec: - priority: - namespaces: - matchNames: - - ns1 - - ns2 - matchLabels: - key: value - action: # Describes the action that should be taken (allowed, denied or required) - rules: # rules that must be satisfied (optional) - fieldA: # name of the field (optional) - - match: # Describes how the rule is matched (required) - - elt1 # elements here could be objects like tolerations or strings like SchedulerName - -``` - - ### empty match and unset fields: -if a field is set to empty in the policy, except when the action is `required`, it should match everything then the corresponding action will apply. - -### unset fields: - -When a field is not specified it is automatically allowed. 
This makes it easy to rollout the feature for existing clusters, as it makes everything allowed in the cluster when no policy is created.
-
-### inside the matches:
+We follow Kubernetes conventions: an empty field matches everything, an unset one matches nothing.

-At the policy level, the match field can express pretty much any structure you want to, but there's some things that should be considered:
+### how policies are computed

-- the structure should match as much as possible what you try to take action on.
-- match elements are combined.
+Policies are computed using the following algorithm:

-example :
+```
+sortedPolicies = sort_by_priority(sort_deny_first(policies))
+for policy in sortedPolicies:
+  if policy matches pod: // all specified policy rules match
+    return policy.action
+```
+ note that rules of policies from a lower priority are superseded by ones from a higher priority if they match.

-```
-- match:
-  - keys: ["projectA-dedicated","projectB-dedicated"]
-    operators: ["Exists"]
-    effects: []
-```
-
-- This matches the structure of toleration
-- This match means the following : t
-  - The toleration `projectA-dedicated` with the operator `Exists`
-  - The toleration `projectB-dedicated` with the operator `Exists`
-
-### policy composition and conflict handling
-
-Policies are composed by ANDing them, note that rules of policies from a lower priority are superseded by ones from a higher priority if there is a conflict.
-
-There is two kinds of conflict to handle in policies of the same priority level: Structural and semantical conflicts.
-
-Structural conflicts must handled at creation time as much as possible, on the other hand, semantical conflicts should be handled at runtime: detect that there's a conflict and emit an event stating that policies couldn't be satisfied due to a conflict.
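A runnable sketch of the algorithm quoted above, under two stated assumptions that the proposal leaves open: priorities are modeled as integers (higher wins), and `sort_deny_first` is realized by ordering deny before allow within the same priority level.

```python
# Sketch of the evaluation loop: policies are sorted by priority (higher
# first) with deny policies ahead of allow policies at equal priority, and
# the first policy whose rules all match the pod decides the action.
def evaluate(policies, pod):
    ordered = sorted(
        policies,
        key=lambda p: (-p.get("priority", 0), 0 if p["action"] == "deny" else 1),
    )
    for policy in ordered:
        if policy["matches"](pod):  # all specified policy rules match
            return policy["action"]
    return "allow"  # no policy matched: the pod is unrestricted
```

The deny-first ordering means a matching deny beats a matching allow of the same priority, while a higher-priority allow still supersedes a lower-priority deny.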
- - -### how policies are computed - -Policies may have overlapping rules, to handle this policies are computed in the following order: - -- compute policies at `exception` priority. -- compute policies at `cluster` priority. -- compute policies at `user` priority. -- compute policies at `default` priority. - -If a policy doesn't specify a priority the default priority applies. - -they should also obey to the following rules: - -- everything that is required is by definition allowed. # Detailed Design @@ -137,15 +78,14 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - priority: - namespaces: - matchNames: - - ns1 - - ns2 - matchLabels: - key: value - action: + priority: + action: rules: + namespaceSelector: + - key: value + operator: operator + podSelector: + key: value schedulerNames: match: [] priorityClassNames: @@ -162,59 +102,24 @@ spec: match: [] ``` +- allow: When a policy matches the pod is allowed +- deny: when a policy matches the pod is denied + -### required -Elements here are required, pods won't schedule if they aren't present. Also note that if something is required it is also allowed. +### Scoping -### allowed -Elements here are allowed, the policy allows the presence of these elements. From Pod's perspective, a pod can use one or N of the allowed items. +The `spec.rules.namespaceSelector` and `spec.rules.podSelector` attributes scope the policy application range. -### deny -If pods specify one of these, the pod won't schedule to a node, we won't dive into deny as it is the exact opposite of required. +Pods must match both `spec.namespaceSelector` and `spec.podSelector` matching rules to be constrained by a policy. +Both `spec.namespaceSelector` and `spec.podSelector` are optional. If absent, all pods and all namespaces are targeted by the policy. ### Scheduler name If `schedulerNames` is absent from `allowed`, `default` or `required`, no scheduler is allowed by this specific policy. 
-#### required - -Require that pods use either the green-scheduler (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`): -```yaml -apiVersion: policy/v1alpha1 -kind: SchedulingPolicy -metadata: - name: mySchedulingPolicy -spec: - namespaces: - matchNames: - - default - action: required - rules: - schedulerNames: - match: ["green-scheduler","my-scheduler"] -``` - -An empty list of schedulerNames has no effect, as pod use the `defaultScheduler` if no scheduler is specified: - -```yaml -apiVersion: policy/v1alpha1 -kind: SchedulingPolicy -metadata: - name: mySchedulingPolicy -spec: - namespaces: - matchNames: - - default - action: required - rules: - schedulerNames: - match: [] -``` - - -#### allowed -Allow pods to use either the `green-scheduler` (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`) in the namespace `default`: +#### Allowed +Allow pods to use either the `green-scheduler` (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`) in namespaces labeled `team`: ```yaml apiVersion: policy/v1alpha1 @@ -222,11 +127,12 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - namespaces: - matchNames: - - default - action: allowed + action: allow rules: + namespaces: + - key: team + operator: Exists + podSelector: {} schedulerNames: match: ["green-scheduler","my-scheduler"] ``` @@ -239,11 +145,12 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - namespaces: - matchNames: - - default - action: allowed + action: allow rules: + namespaces: + - key: team + operator: Exists + podSelector: {} schedulerNames: match: [] ``` @@ -254,39 +161,10 @@ note: this policy has no effect as we allow all when no policy is set. 
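The allowed-scheduler check above reduces to a membership test with the empty-list-matches-everything convention; a minimal illustrative sketch (the function name and shapes are not part of the proposal):

```python
# Illustrative check for the schedulerNames rule: an empty match list
# matches every scheduler, otherwise the pod's spec.schedulerName must
# appear in the list.
def scheduler_name_matches(match, scheduler_name):
    return not match or scheduler_name in match
```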
Toleration usage can be regulated using fine-grain rules with `tolerations` field. If specifying multiple `tolerations`, pod will be scheduled if one of the tolerations is satisfied. -#### required - -This requires toleration in the following forms of: - -- tolerations that tolerates taints with key named `projectA-dedicated` with all effects. -- tolerations that tolerates taints with key named `node-misc` with `NoSchedule` effect. - -```yaml -apiVersion: policy/v1alpha1 -kind: SchedulingPolicy -metadata: - name: mySchedulingPolicy -spec: - namespaces: - matchNames: - - projectA - action: required - rules: - tolerations: - match: - - keys: ["projectA-dedicated"] - operators: ["Exists"] - effects: [] - - keys: ["node-misc"] - operators: ["Exists"] - effects: ["NoSchedule"] -``` - - note: an empty list of matches has no effect (i.e. do not require anything). #### Allowed -This allows requires tolerations in the following forms: +This allows pods with tolerations the following: - tolerations that tolerates taints with key named `mykey` that has a value `value` and with a `NoSchedule` effect. - tolerations that tolerates taints with key `other_key` that has a `NoExecute` effect. @@ -296,96 +174,28 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - namespaces: - matchNames: - - projectA - action: allowed - rules: - tolerations: - match: - - keys: ["mykey"] - operators: ["Equal"] - values: ["value"] - effects: ["NoSchedule"] - - keys: ["other_key"] - operators: ["Exists"] - effects: ["NoExecute"] -``` - -Here we allow tolerations in the following forms: -- tolerations that tolerates all `PreferNoSchedule` taints with any value. -- tolerations that tolerates taints based on any key existence with effect `NoSchedule`. -Also note that this SchedulingPolicy does not allow tolerating NoExecute taints. 
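The toleration rules above can be expressed as a small matcher. This sketch assumes the empty-list-as-wildcard convention used throughout the proposal; the function and field names are illustrative, not the proposed API.

```python
# Illustrative matcher: a pod toleration is accepted if at least one match
# entry accepts its key, operator, value, and effect; an empty list in a
# match entry acts as a wildcard for that field.
def toleration_allowed(matches, toleration):
    def accepts(allowed_values, actual):
        return not allowed_values or actual in allowed_values
    return any(
        accepts(m.get("keys", []), toleration.get("key"))
        and accepts(m.get("operators", []), toleration.get("operator"))
        and accepts(m.get("values", []), toleration.get("value"))
        and accepts(m.get("effects", []), toleration.get("effect"))
        for m in matches
    )
```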
- -##### Coarse-grain allowed tolerations - - -an empty list of toleration allows all types of tolerations: - -```yaml -apiVersion: policy/v1alpha1 -kind: SchedulingPolicy -metadata: - name: mySchedulingPolicy -spec: - namespaces: - matchNames: - - projectA - action: allowed - rules: - tolerations: - match: [] -``` - -Which is equivalent to: - - -```yaml -apiVersion: policy/v1alpha1 -kind: SchedulingPolicy -metadata: - name: mySchedulingPolicy -spec: - namespaces: - matchNames: - - projectA - action: allowed + action: allow rules: + namespaces: + - key: team + operator: Exists + podSelector: {} tolerations: match: - - keys: [] - operators: [] - values: [] - effects: [] + - key: "mykey" + operator: "Equal" + value: "value" + effects: "NoSchedule" + - key: "other_key" + operator: "Exists" + effect: "NoExecute" ``` - ### Priority classes Priority class usage can be regulated using fine-grain rules with `priorityClasseName` field. -##### required - -this example requires a `priorityClass` that is either `high-priority` or `critical-job` - -```yaml -apiVersion: policy/v1alpha1 -kind: SchedulingPolicy -metadata: - name: mySchedulingPolicy -spec: - namespaces: - matchNames: - - default - action: required - rules: - priorityClasseNames: - match: ["high-priority","critical-job"] -``` - - - ##### Allow In this example, we only allow the `critical-job` priority. @@ -397,21 +207,20 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - namespaces: - matchNames: - - default - action: allowed + action: allow rules: + namespaces: + - key: team + operator: Exists + podSelector: {} priorityClasseNames: - - match: ["critical-priority"] + - match: "critical-priority" ``` - - ### Node Selector -The `nodeSelector` field makes it possible to specify what nodeSelectors are required, allowed and denied. As for other components, `required` nodeSelectors are automatically considered as allowed. 
+The `nodeSelector` field makes it possible to specify what pods are allowed or denied based on their nodeSelectors. #### Examples @@ -423,32 +232,31 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - namespaces: - matchNames: - - default - action: required + action: deny rules: + namespaces: + - key: team + operator: Exists + podSelector: {} nodeSelectors: match: - - failure-domain.beta.kubernetes.io/region: ["northeurope"] - - disk: ["ssd", "premium"] + - failure-domain.beta.kubernetes.io/region: "" + - disk: "ssd" + - disk: "premium" ``` -In this example, pods cannot be scheduled if they match one of the following: -- `disk: ssd` nodeSelector, -- `disk: hdd` nodeSelector, +In this example, pods cannot be scheduled if they have all of the following at the same time: +- `disk: ssd` nodeSelector +- `disk: hdd` nodeSelector - `failure-domain.beta.kubernetes.io/region` nodeSelector with any value. ### Affinity rules -As anti-affinity rules are really time-consuming, we must be able to restrict their usage with `allowedAffinities`. -`allowedAffinities` is supposed to keep a coarse-grained approach in allowing affinities. For each type (`nodeAffinities`, `podAffinities`, `podAntiAffinities`) a schedulingpolicy can list allowed constraints (`requiredDuringSchedulingIgnoredDuringExecution` +As anti-affinity rules are really time-consuming, we must be able to restrict their usage for each type (`nodeAffinities`, `podAffinities`, `podAntiAffinities`) a schedulingpolicy can list allowed/denied constraints (`requiredDuringSchedulingIgnoredDuringExecution` or `requiredDuringSchedulingIgnoredDuringExecution`). -If `allowedAffinities` is totally absent from the spec, no affinity is allowed whatever its kind. 
- #### Examples ##### Basic policy @@ -461,29 +269,30 @@ kind: SchedulingPolicy metadata: name: mySchedulingPolicy spec: - namespaces: - matchNames: - - default - action: required + action: allow rules: + namespaces: + - key: team + operator: Exists + podSelector: {} nodeAffinities: requiredDuringSchedulingIgnoredDuringExecution: match: - - keys: ["failure-domain.beta.kubernetes.io/region","authorized-region"] + - key: "authorized-region" operator: "NotIn" values: ["eu-2", "us-1"] - - keys: ["PCI-region"] + - key: "PCI-region" operator: "Exists" values: [] preferredDuringSchedulingIgnoredDuringExecution: match: - - keys: ["flavor"] + - key: "flavor" operator: In values: ["m1.small", "m1.medium"] ``` -In this example, we require pods to use nodeAffinity to select nodes having `failure-domain.beta.kubernetes.io/region` or `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`) +In this example, we require pods to use nodeAffinity to select nodes having `authorized-region` without `eu-1` or `us-1` values, or nodes having `PCI-region` label set. 
On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`)

# References

From f968f81e6df34b495b9e0c45262736c8fbcbe158 Mon Sep 17 00:00:00 2001
From: Yassine TIJANI
Date: Fri, 7 Sep 2018 21:29:46 +0200
Subject: [PATCH 08/10] add CRD-based approach and static matching

---
 .../scheduling/scheduling-policy.md           | 22 ++++++++++++++-----
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md
index 5d37e8ce2e4..ce419aef691 100644
--- a/contributors/design-proposals/scheduling/scheduling-policy.md
+++ b/contributors/design-proposals/scheduling/scheduling-policy.md
@@ -40,11 +40,13 @@ Also Mandatory values must also be enforced by scheduling policies in case of:

 # Overview

-### syntaxic standards
+The SchedulingPolicy will live out-of-tree, under the kubernetes-sigs org. It will use a CRD-based approach.
+
+## syntactic standards

 ### empty match and unset fields:

-We follow Kubernetes conventions, an empty field matches everything an unset one matches nothing
+An empty field and an unset one both match everything.

 ### how policies are computed

@@ -56,7 +58,10 @@
 for policy in sortedPolicies:
   if policy matches pod: // all specified policy rules match
     return policy.action
 ```
-  note that rules of policies from a lower priority are superseeded by ones from a higher priority if they match.
+  note that:
+- rules of policies with higher priority supersede lower priority rules if they both match.
+- matching is done statically, i.e. we don't interpret logical operators (see the nodeAffinity section for more details).
+- matching is considered true if a subset of a set-based field is matched.



@@ -70,7 +75,7 @@ Proposed API group: `policy/v1alpha1`

 ### SchedulingPolicy content

-SchedulingPolicy spec is composed of optional fields that allow scheduling rules.
If a field is absent from a SchedulingPolicy, this `SchedulingPolicy` won't allow any item from the missing field.
+SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it automatically allowed.

 ```yaml
 apiVersion: policy/v1alpha1
 kind: SchedulingPolicy
 metadata:
   name: mySchedulingPolicy
@@ -214,7 +219,7 @@ spec:
       operator: Exists
     podSelector: {}
     priorityClasseNames:
-    - match: "critical-priority"
+      match: "critical-priority"


@@ -292,7 +297,12 @@ spec:
 ```

-In this example, we require pods to use nodeAffinity to select nodes having `authorized-region` without `eu-2` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`)
+In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-2` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when a pod has:
+
+- All the "required" and "preferred" sections.
+- Each section has the same keys and the same operators.
+- Values must be the same or a subset of those of the pod.
+

# References

From 00596bc1a1f0342acc1d813269e77ddacc07d631 Mon Sep 17 00:00:00 2001
From: Arnaud MAZIN
Date: Sat, 15 Sep 2018 07:38:08 +0200
Subject: [PATCH 09/10] Typos

---
 contributors/design-proposals/scheduling/scheduling-policy.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md
index ce419aef691..aadd946bd97 100644
--- a/contributors/design-proposals/scheduling/scheduling-policy.md
+++ b/contributors/design-proposals/scheduling/scheduling-policy.md
@@ -75,7 +75,7 @@ Proposed API group: `policy/v1alpha1`

 ### SchedulingPolicy content

-SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it automatically allowed.
+SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it is automatically allowed.

 ```yaml
 apiVersion: policy/v1alpha1
@@ -297,7 +297,7 @@ spec:
 ```

-In this example, we allow pods that nodeAffinity to select nodes having `authorized-region` without `eu-2` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when a pod has:
+In this example, we allow pods with nodeAffinity to select nodes having `authorized-region` without `eu-2` or `us-1` values, or nodes having `PCI-region` label set. On those filtered nodes we require the pod to prefer nodes with the lowest compute capabilities (`m1.small` or `m1.medium`). The matching is done when a pod has:

 - All the "required" and "preferred" sections.
 - Each section has the same keys and the same operators.
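Patch 08 above specifies policy evaluation as first-match-wins over priority-sorted policies ("sort, then return the action of the first matching policy"). A minimal Go sketch of that semantics follows; `policy`, `evaluate`, and the `Matches` predicate are illustrative names for this sketch only, not part of the proposed API:

```go
package main

import (
	"fmt"
	"sort"
)

// Action mirrors the proposal's allow/deny actions.
type Action string

const (
	Allow Action = "allow"
	Deny  Action = "deny"
)

// policy is an illustrative stand-in for a SchedulingPolicy:
// Priority orders evaluation, and Matches reports whether all of
// the policy's rules match the pod (here reduced to a predicate
// over a flat map of pod attributes).
type policy struct {
	Name     string
	Priority int
	Action   Action
	Matches  func(pod map[string]string) bool
}

// evaluate returns the action of the highest-priority matching
// policy, as in the pseudocode: sort by priority, first match wins.
// Policies with higher priority therefore supersede lower ones.
func evaluate(policies []policy, pod map[string]string, defaultAction Action) Action {
	sorted := append([]policy(nil), policies...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority // highest priority first
	})
	for _, p := range sorted {
		if p.Matches(pod) {
			return p.Action
		}
	}
	return defaultAction
}

func main() {
	pod := map[string]string{"schedulerName": "green-scheduler"}
	policies := []policy{
		{Name: "low", Priority: 1, Action: Allow,
			Matches: func(p map[string]string) bool { return true }},
		{Name: "high", Priority: 10, Action: Deny,
			Matches: func(p map[string]string) bool { return p["schedulerName"] == "green-scheduler" }},
	}
	// Both policies match, but the higher-priority one wins.
	fmt.Println(evaluate(policies, pod, Allow))
}
```

Note that `sort.SliceStable` keeps the relative order of policies with equal priority, so ties are broken deterministically by input order; the actual tie-breaking rule for equal-priority policies is left open by the proposal.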
From 74e3cb23a1048c3baa88751edb0c82e64ad4a78f Mon Sep 17 00:00:00 2001 From: Yassine TIJANI Date: Thu, 20 Sep 2018 22:56:32 +0200 Subject: [PATCH 10/10] move to policy.k8s.io apiGroup --- .../scheduling/scheduling-policy.md | 20 ++++++++----------- 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/contributors/design-proposals/scheduling/scheduling-policy.md b/contributors/design-proposals/scheduling/scheduling-policy.md index aadd946bd97..52de7635fbb 100644 --- a/contributors/design-proposals/scheduling/scheduling-policy.md +++ b/contributors/design-proposals/scheduling/scheduling-policy.md @@ -70,7 +70,7 @@ for policy in sortedPolicies: ## SchedulingPolicy -Proposed API group: `policy/v1alpha1` +Proposed API group: `policy.k8s.io/v1alpha1` ### SchedulingPolicy content @@ -78,7 +78,7 @@ Proposed API group: `policy/v1alpha1` SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a `SchedulingPolicy` it is automatically allowed. ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy @@ -120,14 +120,11 @@ Both `spec.namespaceSelector` and `spec.podSelector` are optional. If absent, al ### Scheduler name -If `schedulerNames` is absent from `allowed`, `default` or `required`, no scheduler is allowed by this specific policy. 
- - #### Allowed Allow pods to use either the `green-scheduler` (which is used by specifying `spec.schedulerName` in pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`) in namespaces labeled `team`: ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy @@ -142,10 +139,9 @@ spec: match: ["green-scheduler","my-scheduler"] ``` -An empty list of schedulerNames will allow usage of all schedulers: ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy @@ -174,7 +170,7 @@ This allows pods with tolerations the following: - tolerations that tolerates taints with key `other_key` that has a `NoExecute` effect. ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy @@ -207,7 +203,7 @@ In this example, we only allow the `critical-job` priority. ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy @@ -232,7 +228,7 @@ The `nodeSelector` field makes it possible to specify what pods are allowed or d ##### Complete policy ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy @@ -269,7 +265,7 @@ or `requiredDuringSchedulingIgnoredDuringExecution`). ```yaml -apiVersion: policy/v1alpha1 +apiVersion: policy.k8s.io/v1alpha1 kind: SchedulingPolicy metadata: name: mySchedulingPolicy