Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically add system critical priority classes at cluster boostrapping #60519

Merged
merged 3 commits into from
Mar 27, 2018

Conversation

bsalamat
Copy link
Member

@bsalamat bsalamat commented Feb 27, 2018

What this PR does / why we need it:
We had two PriorityClasses that were hardcoded and special cased in our code base. These two priority classes never existed in API server. Priority admission controller had code to resolve these two names. This PR removes the hardcoded PriorityClasses and adds code to create these PriorityClasses automatically when API server starts.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60178

ref/ #57471

Special notes for your reviewer:

Release note:

Automatically add system critical priority classes at cluster boostrapping.

/sig scheduling

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 27, 2018
@k8s-github-robot k8s-github-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Feb 27, 2018
@bsalamat
Copy link
Member Author

@bsalamat bsalamat force-pushed the auto_prio_class branch 2 times, most recently from fe3f75a to b7359ae Compare February 27, 2018 20:02
@@ -168,7 +168,7 @@ func IsCriticalPodBasedOnPriority(ns string, priority int32) bool {
if ns != kubeapi.NamespaceSystem {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to this PR, but critical pods based on priority should not be limited to the kube-system namespace... that namespace should not have any special significance as far as the API is concerned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said, that's unrelated this to PR, but K8s docs clearly state that critical pods can only exist in system namespace only.

Copy link
Member

@liggitt liggitt Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K8s docs clearly state that critical pods can only exist in system namespace only

that is left over from critical pods being determined from the uncontrollable pod annotation. that limitation should not apply to the priority field

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will file a separate issue for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filed: #60596

return admission.NewForbidden(a, fmt.Errorf("the name of the priority class is a reserved name for system use only: %v", pc.Name))
// API server adds system critical priority classes at bootstrapping. We should
// not enforce restrictions on adding system level priority classes for API server.
if userInfo := a.GetUserInfo(); userInfo == nil || userInfo.GetName() != user.APIServerUser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't an appropriate use of this username. if you want to require additional authorization to set certain priority classes, pass in an authorizer and perform the authorization check (see PodSecurityPolicyPlugin for an example of obtaining and using the authorizer). The API server always has superuser permissions when making API calls to itself, and will pass any authorization check, but cluster-admins should be able to modify these objects as well.

I also find it odd to do this enforcement in admission... why not in the rest handler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at PodSecurityPolicyPlugin. I am not sure if the authorization done there is enough though. We don't want to allow any user-defined priority classes to start with "system-" or have a value in the system range (>1 billion).
Performing the check in the rest handler can be done too if there is a way to obtain userInfo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you let the rest handler have the map of allowed system priorities and only allow those to be created. On upgrade, any newly allowed system priority classes would be bootstrapped by a new apiserver using the loopback connection, which would always speak to a rest handler that agreed on what system priority classes were allowed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you don't need to worry about userinfo at all. Anyone authorized to create priority classes can create the system priority classes the apiserver expects to be bootstrapped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually a very good idea. I changed the PR accordingly. The new changes are under a new commit.

}
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
return admission.NewForbidden(a, fmt.Errorf("priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only"))
}
}
// If the new PriorityClass tries to be the default priority, make sure that no other priority class is marked as default.
if pc.GlobalDefault {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated, but it still appears we're trying to enforce a singleton GlobalDefault... I thought we dropped that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never dropped it. We just added code to handle the case of having more than one global default when they are added due to race in HA clusters.

Copy link
Member

@liggitt liggitt Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, that means admission can reject updates that swap which one is the default

  1. apply a file containing two priority classes, a and b (default)
  2. apply an update of the file with a (default) and b (non-default)

admission rejects if the update to a happens first. it would also reject if the update to b happened first but the watch event didn't make it into the informer-fed cache the admission plugin consulted.

this guard seems more problematic than helpful to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there are races causing one to add more than one global default, but are you suggesting that it is better to allow people to add as many global default priority classes as they like?

Copy link
Member

@liggitt liggitt Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we cannot guarantee a single default (which we can't), and we have a deterministic and documented behavior when there's more than one (which we do), then breaking legitimate updates in order to half-accomplish correctness doesn't seem worthwhile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. We can remove it and document the behavior. I will do that in a separate PR to ensure we are not mixing the two.

@@ -49,6 +63,64 @@ func (p RESTStorageProvider) v1alpha1Storage(apiResourceConfigSource serverstora
return storage
}

func (p RESTStorageProvider) PostStartHook() (string, genericapiserver.PostStartHookFunc, error) {
return PostStartHookName, AddSystemPriorityClasses(), nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can i delete them? what happens if i do? will they come back only on restart?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are deleted, no more new Pods with these priority classes can be created. These PriorityClasses are added back on API server restart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it also means no new critical pods could be created. Can we avoid that? Say something like Delete would return error, if priorityclass name is one of systemclustercritical or systemnodecritical? Not sure if that takes flexibility away from user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't have a validation mechanism in our REST layer for delete operations. So, I am not sure if it is possible to prevent deletion of these priority classes.

Copy link
Contributor

@ravisantoshgudimetla ravisantoshgudimetla Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information. I did a quicksearch in the codebase and it seems there are few things like immortalNamespaces(basically 'kube-system' and 'defaut' etc), where at the admission plugin level we are prohibiting user to delete them.(https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L77). Wondering if we could do the same thing but I think it can be a different admission controller from priority AdmissionController.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inside the impl of Delete (which the priorityclassstorage would need to override, put the guard in, then delegate, like:

func (r *REST) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	// if name is in set of expected system priority classes, forbid and return
	return r.store.Delete(ctx, name, options)
}

Copy link
Contributor

@ravisantoshgudimetla ravisantoshgudimetla Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Thanks @liggitt. Unrelated to this change, when do we use this mechanism vs one I mentioned earlier for kube-system(is this mechanism only for top level objects?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @liggitt and @ravisantoshgudimetla. I added the check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updating.

allErrs = append(allErrs, apivalidation.ValidateObjectMeta(&pc.ObjectMeta, false, apivalidation.NameIsDNSSubdomain, field.NewPath("metadata"))...)
// If the priorityClass starts with a system prefix, it must be one of the
// predefined system priority classes.
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) && !scheduling.IsKnownSystemPriorityClass(pc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since both the name and value are immutable, comparison to the bootstrap priority classes should only be done on create.

update should not concern itself with these checks, since during rolling upgrades or in downgraded apiserver cases, you might need to update/delete a system priority class bootstrapped by a newer apiserver.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is already the case. ValidatePriorityClassUpdate is called to validate updates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strategy.ValidateUpdate calls this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. I will remove the call.

for _, spc := range systemPriorityClasses {
spcCopy := *spc
spcCopy.Description = ""
if reflect.DeepEqual(spcCopy, pcCopy) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only compare name and value. descriptions, annotations, labels, etc, we don't care about.

also, if the value mismatches, surfacing a better error message is important ("system priority class %s may only have value %d", etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that I used DeepEqual was to make it resilient to changes of fields in PriorityClass. Anyway, I changed it to compare the three fields only, but I didn't add more logic to report errors for each priority class. Reporting better errors would make the logic hard to maintain. These prioirty classes are supposed to be created by the API server automatically and not by humans.

Copy link
Member Author

@bsalamat bsalamat Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, adding more detailed error does not make the function much different. Let me add it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. PTAL.

@bsalamat bsalamat force-pushed the auto_prio_class branch 3 times, most recently from 0291abb to 494dabf Compare March 1, 2018 21:31
if spc.Value != pc.Value {
return false, fmt.Errorf("value of %v PrioityClass must be %v", spc.Name, spc.Value)
}
if spc.GlobalDefault != pc.GlobalDefault {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we care about this either... just the name and value. If they want to make the default for pods in their cluster "system-node-critical", that shouldn't matter to us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting such a high priority as default could cause eviction of system critical components of a cluster. We should prevent catastrophic human errors as much as possible.

} else if pc.Value > scheduling.HighestUserDefinablePriority {
// Similarly, if the Value is in the system range, it must be one of the
// predefined system priority classes.
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it doesn't start with the SystemPriorityClassPrefix, there's no need to check this here

// predefined system priority classes.
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
allErrs = append(allErrs, field.Forbidden(field.NewPath("metadata", "Name"), "priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only. error: "+err.Error()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"name" should match the json field

@@ -56,3 +61,14 @@ var _ rest.ShortNamesProvider = &REST{}
func (r *REST) ShortNames() []string {
return []string{"pc"}
}

// Delete ensures that system priority classes are not deleted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have guard for update as well or else we will able to update system priority class to lower value and then create pods that have higher priority than system priority?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update validation already prevents changing value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bsalamat
Copy link
Member Author

bsalamat commented Mar 5, 2018

@liggitt @aveshagarwal @ravisantoshgudimetla @derekwaynecarr
Any other comments on this PR?

Copy link
Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@derekwaynecarr
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2018
@liggitt
Copy link
Member

liggitt commented Mar 6, 2018

/lgtm

@liggitt
Copy link
Member

liggitt commented Mar 6, 2018

(needs squash)

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2018
@bsalamat
Copy link
Member Author

bsalamat commented Mar 6, 2018

squashed commits.
Thanks all again for your reviews.

@bsalamat
Copy link
Member Author

@liggitt Could you please re-lgtm this after squash?

@bsalamat
Copy link
Member Author

ping

@liggitt
Copy link
Member

liggitt commented Mar 22, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 22, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, derekwaynecarr, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@tengqm
Copy link
Contributor

tengqm commented Mar 23, 2018

/test pull-kubernetes-integration

@bsalamat
Copy link
Member Author

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 60519, 61099, 61218, 61166, 61714). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 71050b6 into kubernetes:master Mar 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

System-cluster-critical and system-node-critical priorityClasses should be in API storage
9 participants