
API modifications for in-place pod resize KEP #1883

Merged

Conversation

thockin
Member

@thockin thockin commented Jul 1, 2020

This was originally not really intended to be committed, but to get the questions debated and answered. Now that most of the details are hammered out, we should probably keep it.

@vinaykul

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 1, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 1, 2020
Why is "allocated" spec? The need for special admission control is a strong
smell. Why doesn't this belong in status? If there's a compelling reason,
this KEP should say so.
<<[/UNRESOLVED]>>
Member

Summarizing our conversation from kubernetes/kubernetes#92127:

  • API conventions require Status to be reconstructible by observation if lost
  • If ResourcesAllocated were in Status and it were lost or erased, it cannot (as of today) be reconstructed purely from observations on the node - memory requests in particular.
  • We discussed it at length in sig-node, explored the alternative of node-local checkpointing instead of Spec change, and eventually felt Spec.ResourcesAllocated was the best option - KEP discussion: KEP: in-place update of pod resources #686 (comment)

Member Author

I tried to read that comment thread, but there's a lot of unwritten context, so I am sorry if I rehash. I feel pretty strongly that spec is wrong for this, and all the machinery around locking users and controllers to specific fields under spec reinforces that feeling.

We use the "reconstructible from observation" criteria for status, though we're not always 100% consistent, and the truth is that status being erased is actually a very small risk. Is that corner case really worth this complexity?

What is the corner case? Can we spell it out really clearly?

Consider an abstract resource.

Time t:

  • Node has allocatable=4
  • Node has 3 pods, each with request=1, allocated=1, actual=1
  • Sum(all allocated) = 3 => OK

Time t+1:

  • Pod p1 request is changed to 2
  • Sum(other allocated, new request) == 4 => OK
  • Kubelet sets allocated to 2

Time t+2:

  • Pod status is wiped and kubelet crashes before actuating the change
  • Kubelet observes pods p2, p3 with request=1 and their cgroups confirm that
  • Kubelet sets those pods to allocated=1, actual=1
  • Kubelet observes pod p1 with request= 2 but the cgroup says 1
  • Kubelet can infer that this was in-flight and set allocated=2, actual=1

Time t+3:

  • Kubelet actuates p1's resize
  • Kubelet sets actual=2

That seems fine. So is the problem centered around multiple concurrent resizes? Or am I just missing it?

Member

Time t+2:

  • Pod status is wiped and kubelet crashes before actuating the change
  • Kubelet observes pods p2, p3 with request=1 and their cgroups confirm that

That is not possible today for memory request; it is not stored anywhere in the cgroups. I'm not sure whether ephemeral-storage has similar discoverability issues (not in scope of this KEP, but someone asked to support it).

Anyway, in the above scenario at t+2, we now end up in AddPod without knowledge of what was actually granted to p2, p3 (and p1) in the first place. Their requests might have changed in the interim, which isn't any good.

Apologies I dumped the top of that rather huge discussion thread on you. A cleaner analysis of how this works with a way to store allocated is in the following comment and below: #686 (comment)

Member Author

That is not possible today for memory request, it is not stored anywhere in the cgroups

This is a good point, which I missed. I'm still not convinced that pushing this back into the user-facing API is the right tradeoff. I know kubelet doesn't like local checkpoints, but I maintain that it is unavoidable for exactly these sorts of reasons.

That thread basically says "we'll use a local checkpoint", then does all of the analysis for that path, showing that it is the better option, then says "we've changed our mind" and proposes the worse option. What am I missing?

That said you DO know the limits from cgroups and you can re-calculate how you would have reacted to any resize request to arrive at the same decision.

Guaranteed class is handled because request == limit. So let's look at burstable class. Request must be <= limit at all times. So you have to consider:

  • if memory.limit_in_bytes == limit, then this pod is done resizing (unless we also tweak other knobs?)
  • if memory.limit_in_bytes > limit, then this pod must need to be shrunk
  • if memory.limit_in_bytes < limit, then this pod must need to be grown

Right?
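A minimal Go sketch of the limit inference described in the bullets above; the helper names and values are hypothetical, and (as pointed out in the reply below) this only works for limits, not requests:

```go
// Illustrative only: infer the resize state of a container's memory *limit*
// by comparing the desired limit from the pod spec against the value the
// cgroup currently enforces. cgroupMemLimit/desiredMemLimit are hypothetical
// inputs, not kubelet APIs.
package main

import "fmt"

type resizeState string

const (
	resizeDone   resizeState = "done"         // cgroup already matches the desired limit
	resizeShrink resizeState = "needs-shrink" // cgroup limit is higher than desired
	resizeGrow   resizeState = "needs-grow"   // cgroup limit is lower than desired
)

// inferMemLimitState mirrors the three bullets above; it works only for
// limits, because the cgroup does not record the memory *request*.
func inferMemLimitState(cgroupMemLimit, desiredMemLimit int64) resizeState {
	switch {
	case cgroupMemLimit == desiredMemLimit:
		return resizeDone
	case cgroupMemLimit > desiredMemLimit:
		return resizeShrink
	default:
		return resizeGrow
	}
}

func main() {
	fmt.Println(inferMemLimitState(2<<30, 1<<30)) // needs-shrink
}
```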

Member
@vinaykul vinaykul Jul 3, 2020

In all fairness, local checkpoints with atomic file-writes were my preference and showed up in my analysis on that thread. SIG-node folks were against it from the get-go based on their history with Borg; I'm not privy to the details.

And correct on the above burstable case for limits. Missing requests is the issue.

Adding ResourcesAllocated to the Spec served two purposes:

  1. Allowed kubelet to handle restarts with a single source-of-truth.
  2. Allowed user/VPA to determine (via timeouts) if resize was not accepted (Resources != ResourcesAllocated at timer expiry)

After implementing it, imho, it looks quite clean.
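For illustration, a minimal sketch of the second point above - how a VPA-like watcher could use a timeout against Resources vs ResourcesAllocated. The struct shape and helper are assumptions, not the KEP's actual types:

```go
// Illustrative only: decide (via a timeout) whether a resize was not accepted
// under the Spec.ResourcesAllocated design discussed above.
package main

import (
	"fmt"
	"time"
)

type container struct {
	Resources          map[string]string // desired, set by user/VPA
	ResourcesAllocated map[string]string // acknowledged by kubelet
}

// resizeTimedOut reports whether the deadline has passed without the kubelet
// acknowledging the desired resources.
func resizeTimedOut(c container, requestedAt time.Time, timeout time.Duration) bool {
	if time.Since(requestedAt) < timeout {
		return false // still within the window; keep waiting
	}
	for k, v := range c.Resources {
		if c.ResourcesAllocated[k] != v {
			return true // timer expired and the kubelet never ACKed the change
		}
	}
	return false
}

func main() {
	c := container{
		Resources:          map[string]string{"cpu": "2"},
		ResourcesAllocated: map[string]string{"cpu": "1"},
	}
	fmt.Println(resizeTimedOut(c, time.Now().Add(-time.Minute), 30*time.Second)) // true
}
```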

Contributor

I'm resisting the urge to respond to everything... I'll just say that I think the crux of our disagreement is scheduler involvement, not checkpointing (which I have repeatedly said I'm fine with) or users clamoring for the functionality the field provides (which I have said won't be the case).

The path forward depends on what you would accept, so maybe some hypotheticals would help. What would your response be if...:

  1. We went to sig-scheduling and sig-node, and got agreement that having scheduler involvement was an important and useful feature, and that we wanted the scheduler making resource allocation decisions.
  2. We fell back to our plan B, which is checkpointing resourcesAllocated, and moving the field as-is to pod status.

Or we can try to come up with a new design that probably relies on a checkpoint, and exposes different information in status.

Contributor

At sig-node today, we reached consensus on the second option. So if that works for you, we can move forward. Derek did point out an issue that we will have to figure out an answer for, but both Dawn and Derek were OK with the switch.

Member

@thockin FYI, here's an overview of the node-checkpointing-based approach (@dashpole's second option above that we discussed in sig-node this morning).

I'll be working on the details of leveraging the node checkpointing framework and defining what checkpoint state (internal to kubelet) looks like.

Member Author

I am sorry I missed this update!! I was waiting for an email...sigh.

I looked at the overview, and I think it looks good. We should consider the problem you described (flapping to deny the scheduler access) and decide to either do it or file an issue for later evaluation. The workaround I proposed is crude, but maybe enough.

Let's focus on the other areas now :)

Member
@vinaykul vinaykul Aug 24, 2020

I am sorry I missed this update!! I was waiting for an email...sigh.

I looked at the overview, and I think it looks good. We should consider the problem you described (flapping to deny the scheduler access) and decide to either do it or file an issue for later evaluation. The workaround I proposed is crude, but maybe enough.

Let's focus on the other areas now :)

Great, thanks for reviewing the doc!

I should be able to give some time to this starting in September. I'll do a new PR that takes in much of the stuff already implemented in PR 92127 and applies this API modification: ResourcesAllocated moves from Spec to Status, and kubelet patches Status instead of Spec (it already has permission to modify Status). I don't expect it to be a major change requiring me to retest a lot of stuff, except for the new checkpointing code in kubelet.

For the flapping issue, the risk is low since VPA is the primary consumer, and we don't expect it to indulge in any funny business. So I'm leaning towards filing a tracking issue and noting it down as a risk in the KEP.

<<[UNRESOLVED]>>
Do we want to define a subresource for setting `resources` on pods? This would
allow us to grant RBAC access to things like VPA without allowing full write
access to pod spec.
Member

Adding a subresource was our initial proposal, based on the /binding precedent. At the Live API review session at the K8s contributor summit in San Diego, @liggitt said adding ResourcesAllocated was reasonable, but in the post-session chat, we all felt that we were starting to add subresources willy-nilly.

And after playing with the admission controller, it offers good advantages - e.g. the ability to handle down-level clients that may drop ResourcesAllocated on write, in addition to access control.

Member Author

Sorry, we're talking about 2 things. Ignore "allocated" for this thread. :)

I'm asking about a sub-resource for pod.spec.containers[*].resources which would allow things like VPA to be granted access to set resources but not anything else in pod.spec.

Member

VPA is like any other external user in this capacity. And companies may have their own solutions in-lieu of VPA. In any case, how does it help us if we don't have ResourcesAllocated or another similar alternative?

Member Author

Give VPA-ish things PUT/PATCH access to pods/resources and GET/LIST/WATCH access to pods.

They can read the status for "actual" resources and write the /resources subresource to request a resize.

Without the sub-resource you have to grant access to VPA-ish things to PUT/PATCH all of pod spec - that seems worse in every way?
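For illustration, a Go sketch of the RBAC grant described above, assuming a hypothetical pods/resize subresource name (the actual name is part of the open discussion):

```go
// Illustrative only: grant a VPA-like controller write access to a resize
// subresource and read-only access to pods, without full write access to spec.
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func vpaResizeRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "vpa-resizer"},
		Rules: []rbacv1.PolicyRule{
			{
				// Write access only to the hypothetical resize subresource,
				// not to the rest of pod spec.
				APIGroups: []string{""},
				Resources: []string{"pods/resize"},
				Verbs:     []string{"update", "patch"},
			},
			{
				// Read-only access to pods, so the controller can watch
				// desired resources in spec and actual resources in status.
				APIGroups: []string{""},
				Resources: []string{"pods"},
				Verbs:     []string{"get", "list", "watch"},
			},
		},
	}
}

func main() {
	fmt.Println(vpaResizeRole().Name)
}
```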

Member

We are limiting what can be changed by the user in validation. There is a limited set of fields that users can patch, e.g. image and annotations. There's no subresource for changing those fields.

Member Author

I think that's right - there are lots of ways that RBAC can get configured, but the main point is that VPA doesn't get FULL access to write to pod.spec.

Member Author

Also, after speaking with @liggitt we realized that this is not a pod-specific thing. If we defined a /resize subresource, that could apply to Deployments, ReplicaSets, etc too.

Member Author

Are there any objections to defining a subresource for this?

Member

Are there any objections to defining a subresource for this?

Not from me. I'm all for it. In our original user scenario a Job spawns the Pods, and we could benefit from the ability to resize the Job spec based on external triggers.

Member Author

OK, great to me. How are we going to proceed - do you want me to finish this KEP-PR or do you want to take it over and update it yourself (preferred :) ?

* if viable, kubelet sets `allocated` = `resources` immediately (ACK) and sets
`actual` = `resources` once resized
* if not viable, kubelet does not change `allocated` and instead drops an event
and maybe a new status field or condition
Member

This is how it works - actual = ContainerStatus.Resources.
We had a PodCondition proposed at one point in the KEP, but removed it for simplicity - discussion thread here: #686 (comment)

Member Author

I disagree with the logic there - it's been a common trend that "more information is generally better". If kubelet has considered a resize request and decided it can't do that right now, we REALLY should tell users and consumers. Otherwise users will be forced to go to kubelet logs to figure out why the heck their change was accepted by "the system" (the API) but isn't happening.

We could argue you don't need a condition if it is just for humans (e.g. load-balancer creation is just events) but actually, we're now considering more status for LB creation, too. Just as an analogy.

Member

I'll bring this up again in the next sig-node meeting. I had a lot of pushback on this, mainly due to concerns about putting a point-in-time result in a pod condition, whether it was fully reconstructible, and the added complexity.

I let it go because I felt that if my concerns around not having it were to materialize, it could always be added at a later point.

Are you available next Tuesday July 7th at 10-10:30 am for sig-node meeting?

Member Author

Unfortunately I have 2 other SIG meetings at the same time. I do agree that we can always add this, but I'd put money on the notion that we WILL need it, and the arguments don't hold water for me.

All conditions are point-in-time - that's why they have a timestamp.

All conditions are reconstructible - they are status after all.

This is kind of what they are FOR. Something about this pod needs to be noted.

Member

No worries. I'll bring it up anyways - having your voice would help a lot!

@@ -287,6 +367,11 @@ Pod with ResizePolicy set to NoRestart for all its Containers.
- Evict the Pod to trigger a replacement Pod with new desired resources,
- Do nothing and let Kubelet back off and later retry the in-place resize.

<<[UNRESOLVED]>>
I'm suggesting that there be a real signal indicating "I tried to resize, but I
can't" beyond the lack of update to `allocated`.
Member

I tried - #686 (comment)

Member Author

See my argument above - lack of signal is a poor signal, IMO

<<[/UNRESOLVED]>>

<<[UNRESOLVED]>>
`resizePolicy: NoRestart` is not obvious. To me that reads as "you may not
Member Author

On phrasing: is there a more self-documenting expression we can use?

Member

How about LiveUpdate or LiveResize?

Member Author

I got a chance to speak with @dchen1107 and @dashpole today. I'll update more comment threads in a minute.

On this topic we seem to have converged - the semantic is really "I don't need a restart, but a restart might happen anyway". Cool.

I suck at naming things, so grain-of-salt time.

resizePolicy: NoRestart is OK, as long as comments are clear.

resizePolicy: RestartNotRequired is better but verbose. Since it is the default, maybe this is OK?
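For illustration, a small Go sketch of the candidate names from this thread as API-style constants, on a deliberately simplified container struct (not the KEP's actual Go types):

```go
// Illustrative only: the spellings discussed above all mean "I don't need a
// restart for this resize, but one might happen anyway".
package main

import "fmt"

type ResizePolicy string

const (
	NoRestart          ResizePolicy = "NoRestart"
	RestartNotRequired ResizePolicy = "RestartNotRequired"
	LiveUpdate         ResizePolicy = "LiveUpdate"
	LiveResize         ResizePolicy = "LiveResize"
)

// container is a simplified stand-in for v1.Container; the KEP's real field
// shape may differ.
type container struct {
	Name         string
	ResizePolicy map[string]ResizePolicy // per-resource policy, e.g. "cpu", "memory"
}

func main() {
	c := container{
		Name:         "app",
		ResizePolicy: map[string]ResizePolicy{"cpu": RestartNotRequired, "memory": RestartNotRequired},
	}
	fmt.Println(c)
}
```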

Member

I suck at naming things, so grain-of-salt time.

That makes two of us :)

@thockin
Member Author

thockin commented Jul 6, 2020

@vinaykul I got to speak with @dchen1107 and @dashpole today, and I will try to find time with @liggitt to understand different positions.

My goal is not to sand-bag you, so I will try to wrap it up soon and either make concrete proposals or else waive my concerns.

In the meantime, I am going to update this PR with the things I THINK we're agreeing on, so we can focus more :)

@thockin
Member Author

thockin commented Jul 6, 2020

I just pushed a 2nd commit which removed a bunch of comments and made the edits around semantics and naming (presumptuous of me). Still several points to discuss.

@thockin
Member Author

thockin commented Jul 7, 2020

My concerns come down to three threads:

  1. spec vs status+checkpoint

  2. subresource for /resize

  3. Conditions and Events

Member Author
@thockin thockin left a comment

I appreciate that we're taking one topic at a time :)

Why is "allocated" spec? The need for special admission control is a strong
smell. Why doesn't this belong in status? If there's a compelling reason,
this KEP should say so.
<<[/UNRESOLVED]>>
Member Author

resourcesAllocated is the desired result. There is literally a loop in the kubelet that looks at resourcesAllocated and applies it to existing containers.

We're kind of arguing semantics. It's KUBELET's desired result. Pod.spec is not about what kubelet wants, it is about what the user wants. The user sets their intentions in resources, right?

[skip this section if you want] I was philosophizing

You know me better than to expect I would skip philosophizing :)

about this over my long weekend... I actually think the most "kubernetes-style" way to handle this sort of intermediate desired state is to have hierarchies of objects (like deployment vs replicaset) in which changes to the upper layer (deployment) can be processed/validated by an intermediary controller before being propagated to the lower layer object (replicaset). The Spec of the replicaset is the desired result for the replicaset, even though it isn't necessarily the desired result of the deployment. In theory, we could do away with the late initialization pattern all-together with this approach. We could have had UnscheduledPod (without pod.nodeName) and Pod (with pod.nodeName) and eliminated the need for binding. The scheduler would just create fully specified Pod objects based on UnscheduledPod objects once it has made a scheduling decision. I imagine we decided against this for performance reasons...

There was once a BoundPod resource, which isn't quite the same but gets at the same result. That was a monolithic thing - imagine one object with 100 podspecs in it. Yikes!

I'm not sure such a model as you propose was considered, frankly, but whether it was or wasn't, it's not where we landed.

That said, it may be worth considering for this. What if we had a PodResize "pseudo-imperative" resource? I am not sure it solves any of the problems we have here. We still have a mutable resources field in pod which suffers the same RBAC weirdness.

With that previous design, scheduler had to rely on max(Spec.Resources.Requests, Status.ResourcesAllocated)

You still have to, don't you? If I submit a resize and kubelet takes 2 seconds to update the allocated field, I don't want to start jamming more pods in there and overflow it. It seems to me that an unambiguous signal is best. Kubelet has either said what is happening or will soon say what is happening. No timeouts, no inferences.

You don't have to look at status.Resources, just spec.Resources and spec.ResourcesAllocated. This is because changes to container requests are generally accepted (unlike changes to limits, which may not be possible due to current usage). If in the future the scheduler writes the ResourcesAllocated field, it only needs to look at ResourcesAllocated to do scheduling.

My point was that you still need to take a max(). No one field is sufficient, whether it is from spec or status isn't really important.

Given a spec (desired) and status (current) and condition (opinion about what spec-status disparity means), I think you can still make the same decisions.

If status is blank, you can't assume anything about the pod and you need to wait for a status update.

If spec == status, no confusion

If spec != status && condition(ResizeInProgress) - you know you can take the max().

If spec != status && condition(ResizeInfeasible) - you know you can just use status and ignore spec.

If spec != status && condition(ResizeDeferred) - you know the resize might be feasible, but what that means depends on context. The scheduler might "score" that node lower if it would depend on deferred-resizes.

Is there philosophical or technical objection to the idea of using Conditions to indicate "in progress", "deferred", and "infeasible"?
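A rough, purely illustrative Go sketch of the decision table above; the signal names and the chargeFor helper are assumptions, not proposed API:

```go
// Illustrative only: what a scheduler-side accounting function might charge a
// pod for, given desired (spec) resources, observed (status) resources, and a
// hypothetical per-pod resize signal.
package main

import "fmt"

type resizeSignal string

const (
	resizeUnknown    resizeSignal = "" // no signal yet
	resizeInProgress resizeSignal = "InProgress"
	resizeDeferred   resizeSignal = "Deferred"
	resizeInfeasible resizeSignal = "Infeasible"
)

// chargeFor returns how much of a resource the scheduler should consider
// occupied by the pod, and whether it can decide at all yet.
func chargeFor(specVal, statusVal int64, sig resizeSignal, hasStatus bool) (int64, bool) {
	switch {
	case !hasStatus:
		return 0, false // status is blank: wait for an update before assuming anything
	case specVal == statusVal:
		return specVal, true // no confusion
	case sig == resizeInfeasible:
		return statusVal, true // resize won't happen; ignore spec
	default: // in progress or deferred: assume the worst case
		return max(specVal, statusVal), true
	}
}

func max(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	charge, ok := chargeFor(2, 1, resizeInProgress, true)
	fmt.Println(charge, ok) // 2 true
}
```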

I'm generally against conditions which describe equality of two separately-controlled "things", and prefer separate observations of those "things". Assuming, for example, "in progress" == false implies spec.Resources == status.Resources, as soon as spec.Resources is updated by the user or VPA, the condition becomes incorrect. You could try and use probe and transition times for the condition. But even then, there can be races between the kubelet and writers. For example:

kubelet reads spec.CPUs == 1
Writer writes spec.CPUs == 2
kubelet observes spec.CPUs == 1
kubelet writes status.conditions["in progress"] == false
Even with probe timestamps, in the above case the probe timestamp would occur after the update timestamp.

Conditions also don't do a good job of describing multiple in-progress updates. If I update CPUs 1 -> 2, and then 2 -> 3, the conditions only apply to the last update I made. CPUs 1 -> 2 may be accepted by the kubelet, and 2 -> 3 may be deferred, but I can't tell that 1 -> 2 has been accepted. With an explicit description of what the kubelet has "admitted", I can verify that an explicit update (1 -> 2) has been accepted.

Excellent points. Conditions are a poor fit for what I am describing here, but not only for the reasons you cite. Conditions really want to be orthogonal, and these are not. If we pursue this model, it may be better to actually enumerate this as a distinct thing in the API.

As for races, there are always races in our eventually-consistent model. We get away with it by defining the edge-cases and making sure that losing the race will still self-recover.

In the case you lay out, consumers could observe that spec.resources != status.resources AND the resize indicator said no resize was in progress, and it was not deferred, and it was not infeasible. So clearly it hasn't been updated.

But there's STILL a race - if there was another resize in progress, a client could observe that condition and assume it applies to the wrong thing. I was trying to stay very high-level, but maybe we'd actually want a map of resource name to resize-status which is automatically cleared on any update to the corresponding spec.resources.

So in the timeline you listed:

kubelet reads spec.CPUs == 1
user writes spec.CPUs = 2
apiserver sets status.resize["cpu"] = Unknown
kubelet tries to set status.resize["cpu"] = 1 (why?)
resourceVersion blocks kubelet from writing
kubelet observes change to status.resize["cpu"] and must decide what to do
kubelet sets status.resize["cpu"] to one of (pending, deferred, infeasible)

Or maybe we use metadata.generation as part of the resize status. "pending as of 1234" while the current generation is "1235" means there are potentially unacknowledged changes. But this is very coarse, so I don't like it

Note that through all this - the API is only carrying either user-intention or re-observable state. If you need to keep track of kubelet's allocated field beyond this, that's still "internal".

Also, what is the difference between deferred and infeasible? If deferred implies "will retry" and infeasible means "won't retry", then we would only ever use deferred.

I assumed (maybe wrongly) that Deferred means something like "the CPUs you asked for exist on the machine but are in use, so I will try again the next time a Pod is deleted", whereas Infeasible means something like "You asked for 17 cpus on a 16 core node, this will never work and I won't bother to re-evaluate it". Does that make any sense?
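A minimal sketch of the per-resource resize map floated above, with the three outcome states plus an unacknowledged zero value; all names are illustrative only, not settled API:

```go
// Illustrative only: a map of resource name to resize status in pod status,
// cleared automatically on any write to the corresponding spec.resources.
package main

import "fmt"

type ResizeStatus string

const (
	// ProposedResize: spec changed; kubelet has not yet said anything.
	ProposedResize ResizeStatus = ""
	// InProgressResize: kubelet has accepted the resize and is actuating it.
	InProgressResize ResizeStatus = "InProgress"
	// DeferredResize: feasible in principle (e.g. the CPUs exist but are in
	// use); kubelet will re-evaluate when resources free up.
	DeferredResize ResizeStatus = "Deferred"
	// InfeasibleResize: can never fit on this node (e.g. 17 CPUs on a 16-core
	// node); kubelet will not re-evaluate.
	InfeasibleResize ResizeStatus = "Infeasible"
)

// podStatus is a simplified stand-in for v1.PodStatus carrying the map.
type podStatus struct {
	Resize map[string]ResizeStatus
}

func main() {
	st := podStatus{Resize: map[string]ResizeStatus{"cpu": DeferredResize}}
	fmt.Println(st.Resize["cpu"]) // Deferred
}
```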

@dashpole
Contributor

dashpole commented Jul 20, 2020

The user sets their intentions in resources, right?

That is the common case for sure. I'm not arguing users will often set resources allocated, or even should generally set it. I'm only arguing that

  1. The field has clearly defined behavior, such that users could set it.
    a. To reiterate, setting resourcesAllocated means "run with these resources, or don't run at all". This is as opposed to spec.resources, which is "what I would prefer to run with". Setting both results in "start with resourcesAllocated, but then try and change to resources".
  2. The field is not linked to an observation of the state of any "thing" that makes up the pod (containers, sandboxes, images, etc).
  3. The field is generally used as a description of what should be. Updating the field results in "the system" making changes in response.

I can understand putting this information in Status, even if I don't like it.
We don't appear to have been consistent on such fields. Spec.NodeName is the closest precedent IMO, but there are other fields, such as spec.Priority (the integer, not the name), which are not generally expected to be set by users (expected to be set by admission controllers). On the other hand, something like status.nominatedNodeName, or even status.QoS could be argued to meet the above points.

I was trying to stay very high-level, but maybe we'd actually want a map of resource name to resize-status which is automatically cleared on any update to the corresponding spec.resources.

Keep in mind that there can be multiple resizes in-progress on a single resource.

I believe the problem of tracking a single update is solved decently by the deployment with status.ObservedGeneration, which I think is what you are suggesting above. But it does then make it harder for users to reason about.


Given spec.Resources and status.Resources, we can already determine if a resize is "in progress", so given all the hoops a user has to jump through to correctly make use of it (assuming we implemented ObservedGeneration or something else), I think it is less useful than the fields we already have. The same goes for "Infeasible" which can be discovered by looking at node.status.Allocatable and pod.spec.Resources. FWIW, I haven't seen anyone ask for an "infeasible" status either.

The two main unanswered questions (given just spec.resources and status.resources) are:

  1. why hasn't my update been applied?
  2. what portion of the node is "occupied" by a given pod?

The current possible answers to 1 are either "because there isn't room for it" or "because the updates haven't been applied yet". We can either answer this with a "here's what there is room for" (~status.resourcesAllocated) or an A or B condition (~"deferred") + observed generation.

Answering 2 is harder to answer because "occupied" is really about what set of resources would pass the scheduler's scheduling algorithm, and the kubelet's admission algorithm. Today, this question is easy to answer, because the scheduler and kubelet both use the same definition of occupied: spec.resources.requests. But if now kubelet is using (internal) resourcesAllocated checkpoint, and scheduler is using max(spec.resources, status.resources), we can get different answers for resizes vs creates. Creates go through the scheduler, and thus "occupied" == max(spec, status). Resizes only involve the kubelet, so "occupied" == (internal) resources allocated. For something like kubectl top no, we would probably display it just using spec.resources, which would make nodes appear overcommitted by requests.

The broader point is that without resourcesAllocated somewhere in the API, we lose our sum(pod resources) < node allocatable invariant. It technically is still maintained in a file on the node, but I can't use it for chargeback, or for determining occupancy of a node.

TBH, doing neither, and just having events also seems like a reasonable position for alpha. My preference is roughly (spec.ResourcesAllocated > nothing/punt until beta > status.ResourcesAllocated > status.conditions["deferred"]).

Member Author
@thockin thockin left a comment

We need to break that back-and-forth.

Why is "allocated" spec? The need for special admission control is a strong
smell. Why doesn't this belong in status? If there's a compelling reason,
this KEP should say so.
<<[/UNRESOLVED]>>
Member Author

Scheduler can't write to pod spec - it can only use pods/binding to set
NodeName. I'm still going to argue that resize should be a subresource, so
that VPA has very granular control, so maybe that can cover both spec.resources
and spec.preferredResources together? How would the flow work?

I'm remembering now why we settled on the design we did... Having spec.Resources == preferred, and spec.ResourcesAllocated == required makes the access control easier, as VPA/users interact with spec.Resources, and scheduler/kubelet (depending on what we decide) interacts with spec.ResourcesAllocated.

Given that, let's go back to spec.Resources (preferred) and spec.RequiredResources (old resources allocated). There would be a subresource for setting spec.RequiredResources, which kubelet/scheduler would use to update spec.RequiredResources based on spec.Resources.

Just to be clear, this path requires:

  • a subresource (e.g. /resize?) for users and VPA-like controllers
  • a subresource (e.g. /allocate?) for kubelet and/or scheduler
  • clear semantics for what happens when a user sets RequiredResources != Resources

It could maybe be simplified to a single sub-resource to cover both fields, at the cost of expanded power granted.

That's net-new functionality for the scheduler that seems out-of-place to me.

The benefit of having the scheduler determine feasibility of resizes is that it can serialize them with scheduling decisions. Otherwise, a high rate of resizes makes it likely that scheduled pods will be rejected on arrival.

I think this is a solution looking for a problem.

This isn't awful but it still feels like a hypothetical feature as an excuse to not checkpoint.

It would be a useful feature to have the scheduler make feasibility decisions. It was originally one of the core requirements of the KEP before sig-node blocked it unless it was dropped from the design. That being said, a field doesn't need to be a "feature" to be in spec. It just needs to describe desired state, and have consistent behavior when set.

Every field has to justify its own cost. If you can implement a feature with a field or without adding API, you should ALMOST ALWAYS choose the option without.

I don't buy the "excuse not to checkpoint" argument. There are many things in the API that could have been checkpointed, but that doesn't mean they should be. NodeName could have been checkpointed in a PV attached to the scheduler, which the scheduler "observes" periodically, and writes to the pod status.

Huh? NodeName is an actual feature, which is used by people. And if my time machine were working, I would happily go back and re-apply our current understanding of API semantics to ancient (in k8s terms) decisions. That isn't useful. Even if that were a better design, "We did it that way before" is one of the weakest arguments to do it again. All things being equal, defer to precedent, sure. But all things are not equal (to me, yet).

Or, the priority admission controller could, instead of initializing spec.Priority integer during admission, periodically re-resolve the priority of the priority class, and write it to status.

Again, that was intentional - changing priority at runtime is something we should consider very very carefully. The design intentionally snapshots the value.

The question isn't "could this field have been recorded locally in the controller, and written to status?". The question is "does this field describe desired state" (or what should be), or is this an observation of the thing represented by the object?

"is it state or control" is not the ONLY criteria, or even the primary criteria for being in the API. Before that is even asked "do we need it in the API" must be evaluated, and that's where I have not been convinced.

I'm saying that the field doesn't allow the user to control anything that is particularly useful, and that the suggested feature wherein this field IS useful is itself not self-evidently useful.

Here I am willing to be swayed, but what I am seeing is not super compelling. Am I just too far away from it to see the value?

I still think that proper status signals are needed BUT that doesn't serve as a
replacement for checkpoint. The only corner-case is "what if status gets
nuked". That doesnt ACTUALLY happen, but we use it as a litms test. Are we
going to let that force us down a path we don't know we want?

The spec vs status litmus test exists for a reason. It is so that writing controllers is simple and reliable. They take desired state from a source of truth, observe the actual state of the thing, take corrective actions, and report the actual state as status. Your example below illustrates my point nicely. If status is required to know the desired state (i.e. what should be), the above controller strategy no longer works, as the controller would have to ensure state is written to durable storage before taking corrective action.

I think we're crossing streams here. I'm not saying status is carrying desired state at all. I am saying spec.resources is the only desired state that matters. If spec.resources != status.resources AND there's no further status, a controller MUST assume the worst case.

If you're trying to schedule on to that node, you have to assume the larger of those 2 is correct. If spec > status, that implies that kubelet is about to start the resize. If spec < status, that implies that kubelet has not yet actuated the downwards-resize.

Unless I am still missing something, the only place where this is wrong is when kubelet is not planning to actuate an upwards-resize (i.e. infeasible or similar). That's useful information to surface on its own.

I think the question at hand is, at its core, whether spec.requiredResources (fka spec.resourcesAllocated) is the best representation of that. If kubelet changes it before I look at it (NB: it's a race) then I can know "this resize has been accepted". If kubelet has not updated it, I can only know "kubelet has not updated it". I'm saying that this is not sufficient, and that patching that deficiency ALSO obviates spec.requiredResources.

You are (I think, please correct me) arguing that spec.requiredResources is useful on its own. That's what I am not believing.

"Down a path we don't know we want" is fair. I do think my position (field in spec + scheduler involvement) is a reasonable path that we should want. However, if the kubelet is forever the only controller for resizes, i'm less certain. I can see arguments for efficiency (no extra write, just included in status), and simplicity of implementation (no subresources, although we now need checkpoints).

I don't buy that this should be part of the scheduler. To my eyes it is very marginal value for significant complexity. We're better off having a distinct re-scheduler or something that recognizes bad situations and plays whack-a-mole with them.

t7: kubelet wakes up
t8: kubelet looks at p1, finds status.resources != spec.resources
also finds status.resize = Pending
t9: kubelet adds p1 at req=10
t10: kubelet adds p2 at req=5

We probably don't want to do admission based on the contents of pod status for many reasons. For one, pod status often lags considerably behind what it has admitted. For example, assume node has 1 core:

  1. Admit pod A w/ 1000m cpu
  2. Update pod A status: status.resources.cpu == 1000m
  3. Resize pod A to 500m cpu. Accepted by the kubelet!
  4. Admit pod B w/ 500m cpu

I think the error is here. You should not schedule it until the resize has been actuated, and you should not admit it.

  1. Kubelet restarts
  2. Pod A admitted w/ status.resources.cpu == 1000m
  3. Pod B rejected w/ request == 500m cpu

What we can do, which is what we have discussed previously, is checkpoint the "accepted" or "admitted" decisions the kubelet makes. That way, admission doesn't rely on pod status, just the local file.

That's right, to me. I don't mean to trivialize the problem of checkpointing, but it's "Not the API's Problem" that kubelet made a decision and then forgot about it.
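A rough sketch, under stated assumptions, of the node-local checkpoint idea discussed here: the kubelet records the resources it has admitted in a local file and uses that file, not pod status, when re-admitting pods after a restart. The path, format, and helper names are all hypothetical:

```go
// Illustrative only: persist and reload the kubelet's "allocated" decisions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// allocatedCheckpoint maps pod UID -> resource name -> admitted quantity (millis).
type allocatedCheckpoint map[string]map[string]int64

const checkpointPath = "/tmp/pod_resize_checkpoint.json" // hypothetical location

func saveCheckpoint(c allocatedCheckpoint) error {
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	tmp := checkpointPath + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, checkpointPath) // atomic file replace, as discussed earlier
}

func loadCheckpoint() (allocatedCheckpoint, error) {
	data, err := os.ReadFile(checkpointPath)
	if err != nil {
		if os.IsNotExist(err) {
			return allocatedCheckpoint{}, nil // fresh node, nothing admitted yet
		}
		return nil, err
	}
	var c allocatedCheckpoint
	return c, json.Unmarshal(data, &c)
}

func main() {
	_ = saveCheckpoint(allocatedCheckpoint{"pod-a-uid": {"cpu": 500}})
	c, _ := loadCheckpoint()
	fmt.Println(c["pod-a-uid"]["cpu"]) // 500: admission uses this, not pod status
}
```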

To recap (trying to make a fair statement here, please point out if I am not even-handed):

  1. I think we want clear signal that a change to spec.resources has been accepted (ACK or NACK, but not "no-signal").
  2. I think we want as little API surface as we can get away with.
  • I do not think spec.requiredResources is the best representation of (1), and we will need API surface specifically for that.
  • You think spec.requiredResources enables additional capabilities that justify adding it.
  • I am "unconvinced" that the value of spec.requiredResources is sufficient (i.e. if fails (2) for me

On to Vinay's response:

@vinaykul:

  • Regarding max(resources, resourcesAllocated) or max(spec, status): It is well intentioned but gives a user some control over how cluster resource is applied. Suppose a node with 4G mem has a guaranteed Pod1 (1G,1G), and an evil user starts up a burstable pod P2(1G,-). Now, that user can request 3.1G, which will lead to a ResizeInfeasible/Deferred condition. Observing that, the user switches back to 1G, which would clear that condition, then quickly follows it by requesting 3.1G ... and keeps toggling to take advantage of durations during which the scheduler is not selecting that node. This way it can potentially block off an additional 2G with some probability without "paying" for it.

This is a very interesting point. Your first problem is that you have an evil user... :)

Seriously though, it makes me think. If this is a criteria, and it seems reasonable, we need to balance it against the risk of scheduler assigning pods to nodes that have other resizes pending that would make the new scheduling decision infeasible.

In other words, it's a bigger problem. How do you allow a user to request a resource increase without DoS'ing and without "false-feasibility" when everything is fundamentally async.

One path could be to use the max(spec, status) unless no node is feasible, in which case you fall back to the nodes where spec < status (assuming resources will be released) and finally nodes where status < spec. If we have a proper signal for "infeasible" those nodes could be part of the first set.

  • PreferredResources (assuming that it is Requests + Limits) sounds like a clean option; let me chew on this a little. Can we use /resize to update spec.Resources if we go with this model? This would avoid needing to do access control in admission plugins, but wouldn't we still need to do access control somewhere outside RBAC on what fields someone is allowed to write - something similar to the NodeRestriction plugin to allow the node to write only to the Resources field?

IF we go with 2 spec fields, I think one sub-resource could carry both and we just assert that any controller that can set one can set the other. That seems DRAMATICALLY better than arbitrary spec access and better than admission control. We'd want to discuss that with security people. I don't think 2 sub-resources would be justified, especially because the generality of the second one goes way down, but I could be wrong.

So how do we break this impasse, folks?

Do we drag in a 3rd-party API reviewer to weigh in? Do we drag in a sig-node lead to explain why I am wrong and this "feature" is worthwhile? Both?

I'd like to wrap this topic up - we have more to discuss.

@vinaykul vinaykul mentioned this pull request Feb 5, 2021
@vinaykul
Member

vinaykul commented Feb 5, 2021

New commit adds a version of the timeline from the earlier discussion. PTAL.

I'm happy to merge and iterate or pin down the last niggles:

  1. Per-container resize vs per-pod. I think you have me convinced that per-pod could work. Check my logic.
  2. Field naming resourcesAllocated vs resources.allocated (might need a type change but not a big deal)

@thockin I'm still leaning towards reporting resize at the podStatus level. It makes things simpler for the scheduler, and much easier for a user/watcher such as VPA to consume. About the resourcesAllocated field naming, I'm happy with either option - your call.

Thanks for driving this. It looks very good overall. I can submit a downstream PR to fix some of the minor typos and corrections, and also to address remaining items 1 & 2 above.

@vinaykul
Member

vinaykul commented Feb 5, 2021

/lgtm

@k8s-ci-robot
Contributor

@vinaykul: changing LGTM is restricted to collaborators

In response to this:

/lgtm


@thockin thockin force-pushed the in-place-pod-resize-questions branch from 04ef821 to d902d37 on February 5, 2021 19:48
Add a per-pod "resize" field and make a bunch of clarifications on the
overall flow of an in-place update.
@thockin thockin force-pushed the in-place-pod-resize-questions branch from d902d37 to 104a062 on February 5, 2021 19:51
@thockin
Member Author

thockin commented Feb 5, 2021

I did a squash and edited the commit message. PTAL. Last one, I hope.

@vinaykul
Member

vinaykul commented Feb 5, 2021

/lgtm

@k8s-ci-robot
Contributor

@vinaykul: changing LGTM is restricted to collaborators

In response to this:

/lgtm


@vinaykul
Member

vinaykul commented Feb 5, 2021

I believe it will need /approve from Derek or Dawn.

@vinaykul
Member

vinaykul commented Feb 9, 2021

/assign @derekwaynecarr
/assign @dchen1107

denotes the Node resources **allocated** to the Pod and its Containers,
* Pod.Status.ContainerStatuses[i].Resources (new object, type
* Pod.Status.ContainerStatuses[i].ResourcesAllocated (new field, type
v1.ResourceList) denotes the Node resources **allocated** to the Pod and its
Contributor

FYI a ResourceList type is already in use in Kustomize's configuration functions. The GV isn't v1 of course, but I think there's still some potential for confusion.

@dchen1107
Member

dchen1107 commented Apr 6, 2021

We discussed this offline and at SIG Node several times, and converged on status+checkpoint vs spec. I hope we can make more progress on this feature in 1.22. Let's continue to discuss the definition of the resize subresource in the next phase. Thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2021
Member
@derekwaynecarr derekwaynecarr left a comment

/approve
/lgtm

this is a good plan for alpha, and we can iterate on the questions I asked during implementation.

* Secondary: allow users to specify which Pods and Containers can be resized
without a restart.
* Secondary: allow users to specify which Containers can be resized without a
restart.

### Non-Goals
Member

I know we have a non-goal around pods that failed resizing, but given its adoption, can we include a mention of a non-goal to handle nodes that are configured with cpu manager, memory manager, and topology manager policies beyond the default?

Member

We should determine how we handle emptyDir memory-backed volumes.

With the SizeMemoryBackedVolumes feature (should go beta in 1.22), the emptyDir is sized to match pod memory.

I think we can handle resizing emptyDir volumes in beta or later.

will allow users to grant RBAC access to controllers like VPA without allowing
full write access to pod specs.

The exact API here is TBD.
Member

I agree that having a resize subresource is a blocker for beta.

can mark Containers as safe (or unsafe) for in-place resource update. Kubelet
uses it to determine the required action.

Note: `RestartNotRequired` does not *guarantee* that a container won't be
Member

It will be good to clarify this during implementation.

Is it the kubelet or the container runtime that is stopping the container? I anticipated that the kubelet sends an update-resources call to the runtime, and the runtime should error if that update cannot be applied. It would have to be the role of the kubelet, I assume, to decide whether the restart should or should not occur. The runtime should not stop containers without being told to stop them by the kubelet.

Member

Thinking on this more, I think the runtime must not choose to stop the container if it is unable to apply the new resources without doing so. The kubelet must be the actor that sees it cannot be done and asks for the stop/start to occur.

* Pod.Spec.Containers[i].ResourcesAllocated (new object, type v1.ResourceList)
denotes the Node resources **allocated** to the Pod and its Containers,
* Pod.Status.ContainerStatuses[i].Resources (new object, type
* Pod.Status.ContainerStatuses[i].ResourcesAllocated (new field, type
Member

Just for clarity: we are saying this change is to the ContainerStatus struct, but we would have entries for pod.status.initContainerStatuses and pod.status.ephemeralContainerStatuses?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, thockin, vinaykul


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.