Add initial design proposal for Scheduling Policy #1937

Closed · wants to merge 10 commits

contributors/design-proposals/scheduling/scheduling-policy.md — 355 additions, 0 deletions
# Scheduling Policy

_Status: Draft_
_Authors: @arnaudmz, @yastij_
_Reviewers: @bsalamat, @liggitt_

**Comment (Member):**

true 😄


# Objectives

- Define the concept of scheduling policies
- Propose their initial design and scope

## Non-Goals

- How taints / tolerations work
- How NodeSelector works
- How node / pod affinity / anti-affinity rules work
- How several schedulers can be used within a single cluster
- How priority classes work

# Background

While architecting real-life Kubernetes clusters, we encountered contexts where role isolation (between administration and plain namespace usage in a multi-tenant context) could be improved. So far, no restriction is possible on tolerations, priority class usage, nodeSelectors, or anti-affinity based on user permissions (RBAC).

Identified use-cases aim to ensure that administrators have a way to restrict users or namespaces when:
**Comment (Contributor):**

These aren't great use cases - I really expected a more end-Kubernetes-user focus:

1. Allow administrators to restrict execution of specific applications (which are namespace scoped) to certain nodes
2. Allow administrators to create policies that prevent users from even attempting to schedule workloads onto masters, to maximize security

etc.

Something as critical as policy needs a lot more use case design before we even get into implementation details.

**Comment (Contributor):**

I'm probably not going to be happy until at least 300 lines of this doc is a detailed justification for the design space and what we are actually trying to build. If that exists elsewhere, please link it here.

**Comment (Member):**

@smarterclayton - Indeed, there's another use case I've got with @arnaudmz; we'll add them. I'll try to put up a design section to give an overview of this policy, SGTY?


- using schedulers,
- placing pods on specific nodes (master roles for instance),
- using specific priority classes,
- expressing pod affinity or anti-affinity rules.
**Comment (Member):**

I'm confused by these last three items - those are things that are specified on the pod, not the policy? It seems like you already covered the policy use-cases? (except for the last one, maybe "Enforce anti-affinity requirements between pods in specific namespaces")

**Comment (Member):**

Yes, I'll do a rewrite on this.


# Overview

Implementing SchedulingPolicy implies:
- Creating a new resource named **SchedulingPolicy** (schedpol)
- Creating an **AdmissionController** that behaves on a deny-all-except basis
**Comment (Member):**

I don't understand this sentence.

**Comment (Member):**

dehaves->behaves?

- Allowing SchedulingPolicies to be used by pods via RoleBindings or ClusterRoleBindings
**Comment (Member):**

I think we should consider applying the policies to namespaces. That's more aligned with similar K8s policies, such as quota.

**Comment (author):**

The point was to be aligned with the PodSecurityPolicy principles:

- Apply cluster-wide to ease the administrator's job when creating new namespaces
- As soon as it is enforced in admission controllers, it applies in a restrictive way to ensure administrators won't leave any leaks in permissions

If I understand well, you seem more fond of a non-breaking approach.

**Comment (Member):**

@bsalamat - I'm not sure we want to do that; integrating with RBAC would be better in terms of experience (e.g. granting cluster-wide usage of a SchedulingPolicy).

cc @tallclair @liggitt

**Comment (@ravisantoshgudimetla, Mar 21, 2018):**

I think we should support namespaced policies as well. For example, we want schedulerB to be used only by namespaceB - how do we restrict that using the current policy?

I think we should have 2 policies, or at least have a field in this spec. The global scheduling policies are applied to every pod, and some of the fields in a global policy cannot be overridden by a local policy (created at namespace level). Even if the local policy is beyond the scope of the current proposal, we should include the fields which cannot be overridden at namespace level, if we go this route.

**Comment (Member):**

@ravisantoshgudimetla - this is all enabled by RBAC (RoleBindings allow the verb "use" on a SchedulingPolicy in a specific namespace; ClusterRoleBindings, on the other hand, allow cluster-wide usage of a SchedulingPolicy; having a namespaceSelector is not viable for this use case).

**Comment (author):**

@ravisantoshgudimetla: to be very precise, enforcing schedulerB to be used only by namespaceB can be achieved by:

1. Create a SchedulingPolicy, say policyB:

```yaml
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: policyB
spec:
  allowed:
    schedulerNames: ["default-scheduler", "schedulerB"]
```

2. Create a ClusterRole allowing use of policyB:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: policyB
rules:
- apiGroups: ['extensions']
  resources: ['schedulingpolicies']
  verbs:     ['use']
  resourceNames:
  - policyB
```

3. Create a RoleBinding that applies to all service accounts in namespace namespaceB:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: policyB
  namespace: namespaceB
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: policyB
subjects:
- kind: Group
  name: system:serviceaccounts:namespaceB
  apiGroup: rbac.authorization.k8s.io
```

Other service accounts (in any other namespace) will fall back to the default restricted policy, which forbids the use of this scheduler.

**Comment (@ravisantoshgudimetla, Mar 21, 2018):**

@arnaudmz @yastij Thanks. Clearly my example is not complex enough. My question is more along the lines of: how do we ensure that certain scheduling attributes, like nodeSelector, come from the namespace rather than from whoever created the pod? For example, as of now there is a namespace-level whitelist for tolerations. How can we tell whether a toleration is valid or not until pod creation happens?


**Comment (Member):**

The use of RBAC to determine which PodSecurityPolicy applies is one of the most confusing things we've done in the entire system, which is saying a lot.

Another model to consider is NetworkPolicy:
https://kubernetes.io/docs/concepts/services-networking/network-policies/


# Detailed Design

SchedulingPolicy resources are meant to apply in a deny-all-except approach. They are designed to apply in an additive way (i.e., AND'ed). From a pod's perspective, a pod can use one or N of the allowed items.

An AdmissionController must be added to the validating phase; it must reject a pod if the service account running the pod is not allowed to specify the requested nodeSelectors, scheduler name, anti-affinity rules, priority class, or tolerations.
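For illustration (a hypothetical pod and namespace, not part of the proposal), assuming a restrictive policy that only allows the default scheduler and no tolerations, a pod like the following would be rejected by the validating admission controller:

```yaml
# Hypothetical pod, for illustration only: under a policy that allows only
# "default-scheduler" and no tolerations, both fields below would be denied.
apiVersion: v1
kind: Pod
metadata:
  name: demo
  namespace: team-a            # assumed namespace
spec:
  schedulerName: my-scheduler  # not in the allowed scheduler names -> rejected
  tolerations:
  - key: dedicated
    operator: Equal
    value: infra
    effect: NoSchedule         # no allowedTolerations entry matches -> rejected
  containers:
  - name: app
    image: nginx
```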
**Comment (Member):**

Absence of node selectors is also problematic. The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes. Any replacement for that mechanism would need to provide the same capability.

**Comment (Member):**

SGTM

**Comment (author):**

Please see below a proposal that could go this way.

**Comment (@k82cn, Mar 18, 2018):**

> The current podnodeselector admission plugin allows admins to force specific nodeselectors onto pods to constrain them to a subset of nodes.

Will that introduce an order dependence between the two admission controllers? For example, podnodeselector adds some denied nodeSelector after this admission controller; similarly for podtolerationrestriction. Is it possible to combine those admission controllers into one? Or document it clearly.

It's arguable that the cluster admin should configure it correctly, but that'll take time for troubleshooting :)


All usable scheduling policies (allowed by RBAC) are merged before evaluating if scheduling constraints defined in pods are allowed.
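As a rough sketch of what merging could look like for allow-style fields (the exact semantics are still debated in the comments below; policy names here are invented), two policies granted to the same service account might combine additively:

```yaml
# Illustrative only: two policies whose allow-lists are merged (union) before
# the pod's scheduling constraints are evaluated.
kind: SchedulingPolicy
metadata:
  name: policy-a
spec:
  allowedSchedulerNames: ["default-scheduler"]
---
kind: SchedulingPolicy
metadata:
  name: policy-b
spec:
  allowedSchedulerNames: ["my-scheduler"]
# Effective result for pods covered by both policies: schedulerName may be
# either "default-scheduler" or "my-scheduler".
```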
**Comment (Member):**

Clarify what "merged" means. That seems potentially problematic, especially in case of computing coverage of conflicting scheduling components (policy A allowed this toleration, policy B allowed that toleration, policy C required nodeSelector component master=false, policy D allows nodeSelector component master=true, etc.)

**Comment (author):**

As long as there were no required components or default values, merging was quite trivial, but given that need, I guess we'll have to work on it.

I'm thinking of rules like:

- allowed-like rules keep having an additive behaviour: if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b, the merge produces a nodeSelector key that can be in [a,b]
- require-like rules prevail over allowed values: allowed values not present in required values are ultimately not allowed
- for default-like and required-like conditions, we could consider either weighting policies or sorting policies (by name?) and applying a last-seen-wins rule.

Any thoughts?

**Comment (@yastij, Mar 17, 2018):**

@arnaudmz @liggitt - to me, if a user specifies a required nodeSelector master=true and a default nodeSelector master=false is added under another policy, I'd expect the required one to supersede the default one.

**Comment (Member):**

I'd prefer to have allow, deny and ignore concepts, so that the result is no deny term for the requirements:

- allow: passed by policy
- deny: rejected by policy
- ignore: the policy did not include a term for it.

> if policy A allows nodeSelector key=a and policy B allows nodeSelector key=b => the merge produces nodeSelector key can be in [a,b]

That's error-prone and complex, especially in some corner cases.

**Comment (@yastij, Mar 19, 2018):**

@k82cn - I'd go with require, allow, default. To me, if something isn't explicitly stated in the SchedulingPolicy it is denied by default.

Also, do you have any use cases for ignore policies?

cc @bsalamat

**Comment (Member):**

I mean the single term; the request passes only if all terms pass :)

**Comment (Member):**

This doesn't work with the workflow of having a default cluster-wide policy and then granting specific users (or namespaces) elevated privileges. See Policy matching - union or intersection for a breakdown.

**Comment (Member):**

Other than pod restriction (#1950), are there other policies we're trying to align with here, approach-wise?


## SchedulingPolicy

Proposed API group: `extensions/v1alpha1`
**Comment (Member):**

Extensions is locked down. This should be in the policy group.


SchedulingPolicy is a cluster-scoped resource (not namespaced).

### SchedulingPolicy content

SchedulingPolicy spec is composed of optional fields that allow scheduling rules. If a field is absent from a SchedulingPolicy, that policy won't allow any item for the missing field.
**Comment (Member):**

You're using a policy intersection approach to handle multiple policies, but locking down fields by default breaks composition since there's no way to open them back up.

**Comment (Member):**

+1

What composition scenarios do we expect?


```yaml
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  allowedSchedulerNames: # Describes scheduler names that are allowed
```
**Comment (Member):**

As Jordan has also mentioned, none of these fields should have the "allowed" prefix. They should be "schedulerNames", "priorityClassNames", etc. Then the spec for each one should have a "condition" (or a similar word) that can be set to one of "allowed", "forbidden", "default", or "required".

**Comment (Member):**

@liggitt @bsalamat - by "default", do we mean that if nothing is specified, the element of the policy is added? (e.g. if an SP specifies a nodeSelector with a ruleType of default, all pods with no nodeSelector will be mutated to carry the nodeSelector from the SP)?

**Comment (Member):**

Yes, that's what I meant. One more point to add: in Kubernetes, we usually apply pod policies at the granularity of namespaces. So, a user should be able to specify the namespaces to which these rules are applied. For example, the default priority class of pods in namespace "ns-1" is "pc-1".

**Comment (@yastij, Mar 16, 2018):**

Usually policies such as PSP do not hold a namespace. RBAC enables this, as users can create roles that enable the verb "use" on the policy. cc @liggitt @arnaudmz

**Comment (author):**

@yastij: yes, that was the point of mimicking the PSP RBAC principle: using RoleBindings or ClusterRoleBindings to apply the policies at service account, namespace or cluster scope.

```yaml
  allowedPriorityClasseNames: # Describes priority class names that are allowed
  allowedNodeSelectors: # Describes node selectors that can be used
```
**Comment (Member):**

Discuss default vs required vs allowed vs forbidden.

Typically, fencing nodes via selector involves requiring a specific set of labels/values (in addition to whatever else the pod wants), e.g. master=false,compute=true

```yaml
  allowedTolerations: # Describes tolerations that can be used
  allowedAffinities: # Describes affinities that can be used
```

### Scheduler name

It should be possible to allow users to use only specific schedulers via the `allowedSchedulerNames` field.

If `allowedSchedulerNames` is absent from SchedulingPolicy, no scheduler is allowed by this specific policy.

#### Examples

Allow service accounts to use either the default-scheduler (which is used by specifying `spec.schedulerName` in the pod definition) or the `my-scheduler` scheduler (by specifying `spec.schedulerName: "my-scheduler"`):
```yaml
kind: SchedulingPolicy
spec:
  allowedSchedulerNames:
  - default-scheduler
  - my-scheduler
```
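For reference, a pod opting into the second scheduler would carry the name in its spec; this is a plain pod sketch, not part of the proposed API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: uses-my-scheduler
spec:
  schedulerName: my-scheduler  # allowed by the policy above
  containers:
  - name: app
    image: nginx
```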


Allow all schedulers:
```yaml
kind: SchedulingPolicy
spec:
  allowedSchedulerNames: []
```
**Comment (Member):**

This second policy needs a description.



### Tolerations

Toleration usage can be allowed using fine-grain rules with the `allowedTolerations` field. If multiple `allowedTolerations` entries are specified, a pod's tolerations are accepted if one of the allowedTolerations is satisfied.

If `allowedTolerations` is absent from SchedulingPolicy, no toleration is allowed.

#### Examples

##### Fine-grain allowedTolerations
```yaml
kind: SchedulingPolicy
spec:
  allowedTolerations:
  - keys: ["mykey"]
    operators: ["Equal"]
    values: ["value"]
    effects: ["NoSchedule"]
  - keys: ["other_key"]
    operators: ["Exists"]
    effects: ["NoExecute"]
```
This example allows tolerations in the following forms:
- tolerations that tolerate taints with the key `mykey`, the value `value`, and the `NoSchedule` effect.
- tolerations that tolerate taints with the key `other_key` and the `NoExecute` effect.

##### Coarse-grain allowedTolerations
```yaml
kind: SchedulingPolicy
spec:
  allowedTolerations:
  - keys: []
    operators: []
    values: []
    effects: ["PreferNoSchedule"]
  - keys: []
    operators: ["Exists"]
    effects: ["NoSchedule"]
```
This example allows tolerations in the following forms:
- tolerations that tolerate all `PreferNoSchedule` taints with any value.
- tolerations that tolerate taints based on any key's existence, with effect `NoSchedule`.

Also note that this SchedulingPolicy does not allow tolerating `NoExecute` taints.
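As an illustration (a hypothetical pod, not part of the proposal), the first toleration below matches the second coarse-grain rule and would be accepted, while the second one tolerates a `NoExecute` taint and would be rejected:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo
spec:
  tolerations:
  - key: dedicated
    operator: Exists
    effect: NoSchedule   # matches the second coarse-grain rule -> allowed
  - key: critical
    operator: Exists
    effect: NoExecute    # no rule allows NoExecute -> rejected
  containers:
  - name: app
    image: nginx
```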


### Priority classes

We must be able to restrict users to specific priority classes using the `allowedPriorityClasseNames` field.

If `allowedPriorityClasseNames` is absent from SchedulingPolicy, no priority class is allowed.

#### Examples

##### Only allow a single priority class
```yaml
kind: SchedulingPolicy
spec:
  allowedPriorityClasseNames:
  - high-priority
```
In this example, only the `high-priority` PriorityClass is allowed.
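A pod admitted under this policy would reference the class by name (an illustrative pod sketch, not part of the proposal):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority  # any other class name would be rejected
  containers:
  - name: app
    image: nginx
```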


##### Allow all priorities

```yaml
kind: SchedulingPolicy
spec:
  allowedPriorityClasseNames: []
```
In this example, all priority classes are allowed.

### Node Selector

We must be able to restrict which node selectors can be used with the `allowedNodeSelectors` field.

If `allowedNodeSelectors` is totally absent from the spec, no node selector is allowed.
**Comment (Member):**

This doesn't make sense. A pod with no nodeSelector targets the most nodes possible. Adding more selectors constrains a pod.

Generally, you want to require a set of nodeSelector labels be present, error if the pod tries to specify nodeSelector components that conflict with that required set, and allow the pod to specify any additional nodeSelector components it wants. That is what the current podnodeselector admission plugin does.

**Comment (author):**

Do you think we could do it this way? I'm taking nodeSelector as a simple example to start with:

```yaml
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: my-schedpol
spec:
  nodeSelectors:
    required:
      beta.kubernetes.io/arch: ["amd64", "arm"] # pick one of those mandatory values
    default:
      beta.kubernetes.io/os: amd64 # here is the default value unless specified
    allowed:
      failure-domain.beta.kubernetes.io/region: [] # any value can be specified
```

Given the deny-by-default design, some kind of forbidden subsection actually wouldn't make sense.

**Comment (Member):**

@arnaudmz - I agree; given the design, forbidden doesn't make much sense.


#### Examples

##### Fine-grained policy

```yaml
kind: SchedulingPolicy
spec:
  allowedNodeSelectors:
    disk: ["ssd"]
    region: [] # means any value
```
In this example, pods can be scheduled only if they:
- have no nodeSelector,
- have a `disk: ssd` nodeSelector,
- and/or have a `region` nodeSelector key with any value.

### Affinity rules

As anti-affinity rules are really time-consuming, we must be able to restrict their usage with `allowedAffinities`.
`allowedAffinities` is meant to keep a coarse-grained approach to allowing affinities. For each type (`nodeAffinities`, `podAffinities`, `podAntiAffinities`), a SchedulingPolicy can list the allowed constraints (`requiredDuringSchedulingIgnoredDuringExecution` or `preferredDuringSchedulingIgnoredDuringExecution`).

If `allowedAffinities` is totally absent from the spec, no affinity of any kind is allowed.

#### Examples

##### Basic policy
```yaml
kind: SchedulingPolicy
spec:
  allowedAffinities:
    nodeAffinities:
    - requiredDuringSchedulingIgnoredDuringExecution
    podAntiAffinities:
    - requiredDuringSchedulingIgnoredDuringExecution
    - preferredDuringSchedulingIgnoredDuringExecution
```

##### Allow-all policy
In this example, all affinities are allowed:
```yaml
kind: SchedulingPolicy
spec:
  allowedAffinities:
    nodeAffinities: []
    podAffinities: []
    podAntiAffinities: []
```

If a sub-item of `allowedAffinities` is absent from the SchedulingPolicy, it is not allowed, e.g.:
**Comment (Member):**

This is confusing. You're saying that if no sub-items are specified they're all allowed, but as soon as you specify one, the others are implicitly denied?

```yaml
kind: SchedulingPolicy
spec:
  allowedAffinities:
    nodeAffinities: []
```
In this example, only soft and hard nodeAffinities are allowed.

### When both `allowedNodeSelectors` and `nodeAffinities` are specified

Use of both `allowedNodeSelectors` and `nodeAffinities` is not recommended, as the latter is far more permissive.
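To illustrate why (a hypothetical pod, assuming a policy that only allows the `disk: ssd` node selector but also allows required node affinities), node affinity can express placement that the node-selector rules were meant to prevent:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-bypass
spec:
  # No nodeSelector, so the allowedNodeSelectors rules are trivially satisfied,
  # yet the affinity below still steers the pod away from ssd nodes.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk
            operator: NotIn
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
```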

## Default SchedulingPolicies

### Restricted policy
Here is a reasonable policy that might be applied to any cluster without specific needs:
**Comment (Member):**

I've already lost track of which fields are closed by default, and which are open. I'm worried this is too difficult to reason about.

```yaml
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: restricted
spec:
  allowedSchedulerNames: ["default-scheduler"]
```
It only allows use of the default scheduler; no tolerations, nodeSelectors, or affinities are permitted.

Multi-arch (x86_64, arm) or multi-OS (Linux, Windows) clusters might also allow the following nodeSelectors:
```yaml
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: restricted
spec:
  allowedSchedulerNames: ["default-scheduler"]
  allowedNodeSelectors:
    beta.kubernetes.io/arch: []
    beta.kubernetes.io/os: []
```

### Privileged Policy

This is the privileged SchedulingPolicy; it allows usage of all schedulers, priority classes, nodeSelectors, affinities and tolerations.

```yaml
apiVersion: extensions/v1alpha1
kind: SchedulingPolicy
metadata:
  name: privileged
spec:
  allowedSchedulerNames: []
  allowedPriorityClasseNames: []
  allowedNodeSelectors: {}
  allowedTolerations:
  - keys: [] # any keys
    operators: [] # => equivalent to ["Exists", "Equal"]
    values: [] # any values
    effects: [] # => equivalent to ["PreferNoSchedule", "NoSchedule", "NoExecute"]
  allowedAffinities:
    nodeAffinities: []
    podAffinities: []
    podAntiAffinities: []
```

## RBAC
SchedulingPolicies are meant to be granted with the verb `use` so that they apply when pods are created.
**Comment (Member):**

I strongly discourage this approach. See Policy Binding.

**Comment (Member):**

@tallclair @bsalamat @liggitt @smarterclayton - I'm fine with having a list of namespaces + a namespace selector.


The following default ClusterRoles / ClusterRoleBindings are supposed to be provisioned to ensure at least the default-scheduler can be used.

RBAC objects are going to be auto-provisioned at cluster creation / upgrade.


This ClusterRole allows the use of the default scheduler:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: sp:restricted
rules:
- apiGroups: ['extensions']
  resources: ['schedulingpolicies']
  verbs: ['use']
  resourceNames:
  - restricted
```

This ClusterRoleBinding ensures any serviceaccount can use the default-scheduler:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: sp:restricted
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sp:restricted
subjects:
- kind: Group
  name: system:authenticated
  apiGroup: rbac.authorization.k8s.io
```

This RoleBinding ensures that kube-system pods can run with no scheduling restriction:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: sp:kube-system-privileged
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sp:privileged
subjects:
- kind: Group
  name: system:serviceaccounts:kube-system
  apiGroup: rbac.authorization.k8s.io
```
# References
- [Pod affinity/anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
- [Pod priorities](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/)
- [Taints and tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/)
- [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)
- [Using multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)