-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for online resizing of persistent volumes #1535
Conversation
/sig storage |
@mlmhl: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @gnufied |
|
||
#### Fetch All Pods Using a PVC | ||
|
||
We need to know which pods are using a volume with specified name, but `ExpandController` doesn't cache this relationship. To achieve this goal, We can share the `DesiredStateOfWorld` with `AttachDetachController`. In detail, reflector the `pkg/controller/volume/attachdetach/cache` package to `pkg/controller/volume/cache`, and add a `GetPodsInVolume` method to `DesiredStateOfWorld`. Expand controller use this method to fetch pods using the volume with specified name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in word reflector
? I am still unclear though - how will this cache become available to ExpandController
since it will be a internal property of AttachDetachController
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the proposal to add more detailed description of this reflector operation.
And a preliminary implementation can be found from these commits: reflector
remove dependence
The `GetPodsInVolume` method will look like this: | ||
|
||
```go | ||
func (dsw *desiredStateOfWorld) GetPodsInVolume(volumeName v1.UniqueVolumeName) []*v1.Pod { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really use ActualStateOfWorld rather than dsow here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. I think we should get the pod-volume-relationship(which pod use which volume) from DesiredStateOfWorld
as ActualStateOfWorld
doesn't cache this relationship, and get the volume-node-relationship(which volume mount to which node) from ActualStateOfWorld
. The current proposal misses the latter one, I will update the proposal later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add a detailed explanation for this subject to the proposal.
6756f2a
to
bca098f
Compare
### Prerequisite | ||
|
||
* `sig-api-machinery` need to allow pod's annotation update from kubelet. | ||
* `sig-api-machinery` need to allow pod's get and annotation update from expand-controller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @liggitt
* This modification is possible because `DesiredStateOfWorld` and `ActualStateOfWorld` only use `volumePluginMgr` to fetch volume name, inside the `DesiredStateOfWorld.AddPod` and `ActualStateOfWorld.AddVolumeNode` method, we can get the volume name outside and transform to them as a method parameter. | ||
|
||
* Then we can create a common `DesiredStateOfWorld` and `ActualStateOfWorld`, implemented as fields of the `ControllerContext` object, instead of creating them inside `NewAttachDetachController`. After this, `AttachDetachController` and `ExpandController` can both access it through `ControllerContext`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any example of this design being used previously? I am just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we can create a common
DesiredStateOfWorld
andActualStateOfWorld
, implemented as fields of theControllerContext
object
no... controller context is not intended to be a shared mutable struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this refactoring is really a bit trick. Do we have any better idea to share ASW/DSW?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets list out all the ideas we discussed in Github issue, so as we know all the available options. The other ways of implementing this is:
- We discussed volumemanager on kubelet side could watch for PVCs and perform FS resize when certain condition is met. The downside of this approach is - in a large enough cluster (think 5000 nodes), it may not be good idea to create such a large number of watches. Having said that, volumemanager need not watch for all PVCs - it can only watch for PVCs which are in its actual_state_of_world. I am not sure, performance penalty of watching all PVCs VS watching only a small subset of PVCs.
- Another option is, ExpandController can build its own object graph of pvc/pod/node mapping. This will enable us to determine pod/node where PVC is being used without sharing a cache with Attach/Detach controller.
- We have kicked around idea of adding node name to the PVC for awhile now after attach is done. This will require an API change, but it will enable us to determine pvc->node relationship rather quickly.
|
||
To achieve this goal, we need to: | ||
|
||
* reflector the `pkg/controller/volume/attachdetach/cache` package to `pkg/controller/volume/cache`, and remove the parameter `volumePluginMgr` of `NewDesiredStateOfWorld` and `NewActualStateOfWorld` function, so that we can create a new `DesiredStateOfWorld` and `ActualStateOfWorld` without any dependence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean reflector
or you mean refactor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, typo in word, I've fixed this.
|
||
Kubelet need to periodically loop through the list of active pods to detect which volume requires file system resizing. We use a `VolumeToResize` object to represent a volume resizing requests and introduce a new interface `VolumeFsResizeMap` rather than `DesiredStateOfWorld` to cache such `VolumeToResize` objects, since we treat it as a request queue, not a state. | ||
|
||
Besides, we add a `VolumeFsResizeMapPopulator` interface as the producer of `VolumeFsResizeMap`, it runs an asynchronous periodic loop to populate the volumes need file system resize. In each loop, `VolumeFsResizeMapPopulator` traverses each pod's annotation, and generate a `VolumeToResize` object for each `volumeFSResizingAnnotation`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be a new package or part of volume-manager's interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand what you mean. We can put VolumeFsResizeMap
interface under package pkg/kubelet/volumemanager/cache
, just like the ASW and DSW, and put VolumeFsResizeMapPopulator
interface under package pkg/kubelet/volumemanager/populator
, just like the DesiredStateOfWorldPopulator
. Then pkg/kubelet/volumemanager/reconciler.Reconciler
can work as a consumer of VolumeFsResizeMap
. These two interfaces will look like this:
type VolumeFsResizeMap interface {
AddVolumeForPod(volumeName string, pod *v1.Pod)
GetVolumesToResize() []VolumeToResize
MarkAsFileSystemResized(pod *v1.Pod, volumeName string) error
}
type VolumeFsResizeMapPopulator interface {
Run(stopCh <-chan struct{})
}
eedc43b
to
2affa0e
Compare
I probably miss something very obvious, but why do we need new annotations? Right now, kubelet already knows that a PV needs filesystems resize, because it does offline resize. And it also can update the PVC and "finish" the resize so user knows it's done. So why a new annotation and a new controller? |
|
||
#### Volume Resize Request Detect | ||
|
||
Kubelet need to periodically loop through the list of active pods to detect which volume requires file system resizing. We use a `VolumeToResize` object to represent a volume resizing requests and introduce a new interface `VolumeFsResizeMap` rather than `DesiredStateOfWorld` to cache such `VolumeToResize` objects, since we treat it as a request queue, not a state. This interface will look like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kubelet already loops through all mounts in VolumeManager. It can see that a volume needs fs resize there. And based on whether the mount exists in its ASW, it can see if it's going to perform online or offline resize.
In addition, fs resize must be coordinated with VolumeManager anyway, since VolumeManager should not touch the volume (e.g. unmount it) during resize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kubelet already loops through all mounts in VolumeManager. It can see that a volume needs fs resize there.
The problem is that the volume objects cached in ASW
is only updated when pod starting, and won't update after pod started any more, it means that if we resize the volume after pod already started, VolumeManager
can't realize this change from ASW
.
In addition, fs resize must be coordinated with VolumeManager anyway, since VolumeManager should not touch the volume (e.g. unmount it) during resize.
We append the resize operation to OperationExecutor
so that VolumeManager
can't touch the volume before resize operation finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original discussion in this issue: kubernetes/kubernetes#57185
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think - what we can do is. When we are iterating through pods in volumemanager's desired_state_of_world for mount and call PodExistsInVolume
function, then that function can make an additional API call to check if underlying volume needs resizing. Currently PodExistsInVolume
already raises an error remountRequired
when some change requires us to remount the volume. So we can introduce a new error resizeRequired
which will be raised by the function and that in turn will cause file system resize on the kubelet.
This will require no API change or annotation and everything will work online as expected. thanks @jsafrane for pointing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential problem with ^ approach is - the reconciler loop turns every 100ms and say if a node has 20 pods with volumes, then we will be making 20 GET requests every 100ms, because there is no informer inside volume_manager. Again - I am not sure how badly it will affect us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add a lastUpdateTimestamp
for each volumes, and use an independent loop interval with the reconciler
, this can reduce the request counts, but I am not sure how much help this can have
Kubelet only watches for pod and only fetch PVC object from apiserver when pod starting, won't get latest PVC object again after pod is already started. So if we resize the volume after pod started, kubelet can't realize this request. We add a annotation to pod to inform kubelet some volumes need resizing, but as @gnufied mentioned, this is just one of some available approaches to inform kubelet. |
/ok-to-test |
af37d56
to
6bf5bf4
Compare
Hi guys @jsafrane , @yujuhong , we changed this proposal again as we found that Actually current PV/PVC reprocess mechanism has a hidden performance problem, so we opened a separate issue to track this problem: #63274 cc @gnufied |
In light of @mlmhl finding that we are already fetching pvc/pv from api-server on every pod sync, I do agree that we do not need PVC cache similar to configmaps etc. This proposal can definitely work without introducing new api calls. I propose we tackle the bug/issue with PVC/PV getting fetched all the time in separate github issue - kubernetes/kubernetes#63274 that has been opened. cc @saad-ali |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm-ish, however you should note somewhere what happens when volume is being resized and at the same time a pod is deleted and the same volume is being unmounted. IMO unmount must wait (should not be started) until resize finishes. Reconciler(?) should check that.
In each loop, for each volume in DSW, `Reconciler` checks whether this volume exists in ASW or not. If volume exists but marked as file system resizing required, ASW returns a `fsResizeRequiredError` error. then `Reconciler` triggers a file system resizing operation for this volume. The code snippet looks like this: | ||
|
||
```go | ||
if cache.IsFSResizeRequiredError(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like passing random return values via errors, they should be reserved to real errors / exceptions. But we can solve that in implementation phase and not in this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we use a remountRequiredError
to indicate a volume need remount, so I add a resizeRequiredError
to keep consistency. As you said we can review this again in implementation phase.
@jsafrane yeah I think there is a note in original resize proposal that if pvc has any of resize-in-progress conditions then we will wait for it to clear before unmounting the volume. |
Automatic merge from submit-queue (batch tested with PRs 62460, 64480, 63774, 64540, 64337). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Implement kubelet side online file system resize for volume **What this PR does / why we need it**: Implement kubelet side online file system resize. xref - [kubernetes/feature#531](kubernetes/enhancements#531) proposal - [kubernetes/community#1535](kubernetes/community#1535) **Release note**: ```release-note Implement kubelet side online file system resizing ```
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
@msau42: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This is a supplement for this proposal.