Proposal for online resizing of persistent volumes #1535

mlmhl · 2017-12-20T12:16:19Z

This is a supplement for this proposal.

mlmhl · 2017-12-20T12:16:29Z

/sig storage

k8s-ci-robot · 2017-12-20T12:17:21Z

@mlmhl: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-pr-reviews

In response to this:

@kubernetes/sig-storage-pr-reviews @gnufied

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mlmhl · 2017-12-20T12:20:49Z

/assign @gnufied

gnufied · 2017-12-20T17:04:57Z

contributors/design-proposals/storage/online-grow-volume-size.md

+
+#### Fetch All Pods Using a PVC
+
+We need to know which pods are using a volume with specified name, but `ExpandController` doesn't cache this relationship. To achieve this goal, We can share the `DesiredStateOfWorld` with `AttachDetachController`. In detail, reflector the `pkg/controller/volume/attachdetach/cache` package to `pkg/controller/volume/cache`, and add a `GetPodsInVolume` method to `DesiredStateOfWorld`. Expand controller use this method to fetch pods using the volume with specified name.


Typo in the word "reflector"? I am still unclear, though: how will this cache become available to ExpandController, since it will be an internal property of AttachDetachController?

I've updated the proposal to add a more detailed description of this refactoring operation.
A preliminary implementation can be found in these commits: "reflector" and "remove dependence".

gnufied · 2017-12-20T17:05:35Z

contributors/design-proposals/storage/online-grow-volume-size.md

+The `GetPodsInVolume` method will look like this:
+
+```go
+func (dsw *desiredStateOfWorld) GetPodsInVolume(volumeName v1.UniqueVolumeName) []*v1.Pod {


We should really use ActualStateOfWorld rather than dsow here.

Yeah, you are right. I think we should get the pod-volume relationship (which pod uses which volume) from DesiredStateOfWorld, as ActualStateOfWorld doesn't cache this relationship, and get the volume-node relationship (which volume is mounted on which node) from ActualStateOfWorld. The current proposal misses the latter; I will update the proposal later.

I have added a detailed explanation of this to the proposal.
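
A minimal sketch of the split described above, assuming simplified stand-in types: the pod-volume relationship is read from the desired state and the volume-node relationship from the actual state. The struct fields and the `GetNodesForVolume` name are hypothetical and only illustrate the idea; the real caches in `pkg/controller/volume/attachdetach/cache` are more involved.

```go
package main

import "fmt"

// desiredStateOfWorld is a hypothetical, simplified stand-in for the
// controller's DSW; it only tracks which pods reference which volume.
type desiredStateOfWorld struct {
	podsByVolume map[string][]string // unique volume name -> pod keys
}

// actualStateOfWorld is a hypothetical, simplified stand-in for the
// controller's ASW; it only tracks where each volume is attached.
type actualStateOfWorld struct {
	nodesByVolume map[string][]string // unique volume name -> node names
}

// GetPodsInVolume returns the pods that use the named volume.
func (dsw *desiredStateOfWorld) GetPodsInVolume(volumeName string) []string {
	return dsw.podsByVolume[volumeName]
}

// GetNodesForVolume (assumed name) returns the nodes the volume is attached to.
func (asw *actualStateOfWorld) GetNodesForVolume(volumeName string) []string {
	return asw.nodesByVolume[volumeName]
}

func main() {
	dsw := &desiredStateOfWorld{podsByVolume: map[string][]string{"vol-1": {"default/web-0"}}}
	asw := &actualStateOfWorld{nodesByVolume: map[string][]string{"vol-1": {"node-a"}}}
	fmt.Println(dsw.GetPodsInVolume("vol-1"), asw.GetNodesForVolume("vol-1"))
}
```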

gnufied · 2018-01-02T22:44:39Z

contributors/design-proposals/storage/online-grow-volume-size.md

+### Prerequisite
+
+* `sig-api-machinery` need to allow pod's annotation update from kubelet. 
+* `sig-api-machinery` need to allow pod's get and annotation update from expand-controller.


cc @liggitt

gnufied · 2018-01-02T22:47:08Z

contributors/design-proposals/storage/online-grow-volume-size.md

+* This modification is possible because `DesiredStateOfWorld` and `ActualStateOfWorld` only use `volumePluginMgr` to fetch volume name, inside the `DesiredStateOfWorld.AddPod` and `ActualStateOfWorld.AddVolumeNode` method, we can get the volume name outside and transform to them as a method parameter.
+
+* Then we can create a common `DesiredStateOfWorld` and `ActualStateOfWorld`, implemented as fields of the `ControllerContext` object, instead of creating them inside `NewAttachDetachController`. After this, `AttachDetachController` and `ExpandController` can both access it through `ControllerContext`.
+


Is there any example of this design being used previously? I am just curious.

Then we can create a common DesiredStateOfWorld and ActualStateOfWorld, implemented as fields of the ControllerContext object

no... controller context is not intended to be a shared mutable struct

Yeah, this refactoring is really a bit tricky. Do we have a better way to share the ASW/DSW?

Let's list all the ideas we discussed in the GitHub issue, so that we know all the available options. The other ways of implementing this are:

We discussed that volumemanager on the kubelet side could watch PVCs and perform the FS resize when certain conditions are met. The downside of this approach is that in a large enough cluster (think 5000 nodes), creating such a large number of watches may not be a good idea. Having said that, volumemanager need not watch all PVCs; it can watch only the PVCs that are in its actual_state_of_world. I am not sure how the performance penalty of watching all PVCs compares with that of watching only a small subset.

Another option is that ExpandController builds its own object graph of the pvc/pod/node mapping. This would let us determine the pod/node where a PVC is being used without sharing a cache with the Attach/Detach controller.

We have kicked around the idea of adding the node name to the PVC after attach is done for a while now. This would require an API change, but it would let us determine the pvc->node relationship rather quickly.
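
The second option (ExpandController keeping its own pvc/pod/node graph) could look roughly like the sketch below. All type and method names here are hypothetical, pods and PVCs are identified by plain string keys, and in a real controller the `AddPod`/`DeletePod` calls would presumably be driven by pod informer events.

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical, trimmed-down view of a pod.
type podInfo struct {
	name     string
	nodeName string
	pvcNames []string
}

// pvcGraph indexes pods by the PVCs they reference.
type pvcGraph struct {
	pods      map[string]podInfo
	podsByPVC map[string]map[string]struct{}
}

func newPVCGraph() *pvcGraph {
	return &pvcGraph{
		pods:      map[string]podInfo{},
		podsByPVC: map[string]map[string]struct{}{},
	}
}

// AddPod records (or refreshes) the edges from a pod to its PVCs.
func (g *pvcGraph) AddPod(p podInfo) {
	g.DeletePod(p.name) // drop stale edges if the pod was seen before
	g.pods[p.name] = p
	for _, pvc := range p.pvcNames {
		if g.podsByPVC[pvc] == nil {
			g.podsByPVC[pvc] = map[string]struct{}{}
		}
		g.podsByPVC[pvc][p.name] = struct{}{}
	}
}

// DeletePod removes a pod and all of its edges.
func (g *pvcGraph) DeletePod(name string) {
	p, ok := g.pods[name]
	if !ok {
		return
	}
	for _, pvc := range p.pvcNames {
		delete(g.podsByPVC[pvc], name)
		if len(g.podsByPVC[pvc]) == 0 {
			delete(g.podsByPVC, pvc)
		}
	}
	delete(g.pods, name)
}

// NodesForPVC returns the sorted, de-duplicated nodes running pods that use the PVC.
func (g *pvcGraph) NodesForPVC(pvc string) []string {
	seen := map[string]struct{}{}
	for podName := range g.podsByPVC[pvc] {
		if n := g.pods[podName].nodeName; n != "" {
			seen[n] = struct{}{}
		}
	}
	nodes := make([]string, 0, len(seen))
	for n := range seen {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)
	return nodes
}

func main() {
	g := newPVCGraph()
	g.AddPod(podInfo{name: "default/web-0", nodeName: "node-a", pvcNames: []string{"default/data"}})
	fmt.Println(g.NodesForPVC("default/data"))
}
```

The trade-off is the one raised in the comment: the controller pays for an extra in-memory index, but no cache has to be shared with the Attach/Detach controller.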

gnufied · 2018-01-02T22:47:39Z

contributors/design-proposals/storage/online-grow-volume-size.md

+
+ To achieve this goal, we need to:
+
+* reflector the `pkg/controller/volume/attachdetach/cache` package to `pkg/controller/volume/cache`, and remove the parameter `volumePluginMgr` of `NewDesiredStateOfWorld` and `NewActualStateOfWorld` function, so that we can create a new `DesiredStateOfWorld` and `ActualStateOfWorld` without any dependence. 


Do you mean "reflector" or "refactor"?

Sorry, that was a typo; I've fixed it.

gnufied · 2018-01-02T22:49:44Z

contributors/design-proposals/storage/online-grow-volume-size.md

+
+Kubelet need to periodically loop through the list of active pods to detect which volume requires file system resizing. We use a `VolumeToResize` object to represent a volume resizing requests and introduce a new interface `VolumeFsResizeMap` rather than `DesiredStateOfWorld` to cache such `VolumeToResize` objects, since we treat it as a request queue, not a state.
+
+Besides, we add a `VolumeFsResizeMapPopulator` interface as the producer of `VolumeFsResizeMap`, it runs an asynchronous periodic loop to populate the volumes need file system resize. In each loop, `VolumeFsResizeMapPopulator` traverses each pod's annotation, and generate a `VolumeToResize` object for each `volumeFSResizingAnnotation`.


Will this be a new package or part of volume-manager's interface?

I'm not sure I understand what you mean. We can put the VolumeFsResizeMap interface under package pkg/kubelet/volumemanager/cache, just like the ASW and DSW, and put the VolumeFsResizeMapPopulator interface under package pkg/kubelet/volumemanager/populator, just like the DesiredStateOfWorldPopulator. Then pkg/kubelet/volumemanager/reconciler.Reconciler can work as a consumer of VolumeFsResizeMap. These two interfaces will look like this:

    type VolumeFsResizeMap interface {
        AddVolumeForPod(volumeName string, pod *v1.Pod)
        GetVolumesToResize() []VolumeToResize
        MarkAsFileSystemResized(pod *v1.Pod, volumeName string) error
    }

    type VolumeFsResizeMapPopulator interface {
        Run(stopCh <-chan struct{})
    }
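
For illustration, an in-memory implementation of the `VolumeFsResizeMap` methods quoted above might look like the sketch below. To keep it self-contained it uses a hypothetical local `Pod` type instead of `*v1.Pod`, and the fields of `VolumeToResize` are assumptions, since the thread does not define them.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical stand-in for *v1.Pod.
type Pod struct {
	UID  string
	Name string
}

// VolumeToResize identifies one pending file system resize request.
// Its fields are assumed for this sketch.
type VolumeToResize struct {
	VolumeName string
	PodUID     string
}

// volumeFsResizeMap is a mutex-guarded set of pending requests.
type volumeFsResizeMap struct {
	mu      sync.Mutex
	pending map[VolumeToResize]struct{}
}

func newVolumeFsResizeMap() *volumeFsResizeMap {
	return &volumeFsResizeMap{pending: map[VolumeToResize]struct{}{}}
}

// AddVolumeForPod queues a resize request; adding the same request twice is a no-op.
func (m *volumeFsResizeMap) AddVolumeForPod(volumeName string, pod *Pod) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.pending[VolumeToResize{VolumeName: volumeName, PodUID: pod.UID}] = struct{}{}
}

// GetVolumesToResize returns a snapshot of pending requests in no particular order.
func (m *volumeFsResizeMap) GetVolumesToResize() []VolumeToResize {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]VolumeToResize, 0, len(m.pending))
	for v := range m.pending {
		out = append(out, v)
	}
	return out
}

// MarkAsFileSystemResized removes a completed request, or reports that none was pending.
func (m *volumeFsResizeMap) MarkAsFileSystemResized(pod *Pod, volumeName string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	key := VolumeToResize{VolumeName: volumeName, PodUID: pod.UID}
	if _, ok := m.pending[key]; !ok {
		return fmt.Errorf("no pending resize for volume %q of pod %q", volumeName, pod.UID)
	}
	delete(m.pending, key)
	return nil
}

func main() {
	m := newVolumeFsResizeMap()
	pod := &Pod{UID: "uid-1", Name: "web-0"}
	m.AddVolumeForPod("vol-1", pod)
	fmt.Println(len(m.GetVolumesToResize()), m.MarkAsFileSystemResized(pod, "vol-1"))
}
```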

jsafrane · 2018-01-25T13:50:42Z

I'm probably missing something very obvious, but why do we need new annotations? Right now, kubelet already knows that a PV needs a filesystem resize, because it does offline resize. It can also update the PVC and "finish" the resize so the user knows it's done. So why a new annotation and a new controller?

jsafrane · 2018-01-25T13:53:37Z

contributors/design-proposals/storage/online-grow-volume-size.md

+
+#### Volume Resize Request Detect
+
+Kubelet need to periodically loop through the list of active pods to detect which volume requires file system resizing. We use a `VolumeToResize` object to represent a volume resizing requests and introduce a new interface `VolumeFsResizeMap` rather than `DesiredStateOfWorld` to cache such `VolumeToResize` objects, since we treat it as a request queue, not a state. This interface will look like this:


Kubelet already loops through all mounts in VolumeManager. It can see that a volume needs fs resize there. And based on whether the mount exists in its ASW, it can see if it's going to perform online or offline resize.

In addition, fs resize must be coordinated with VolumeManager anyway, since VolumeManager should not touch the volume (e.g. unmount it) during resize.

Kubelet already loops through all mounts in VolumeManager. It can see that a volume needs fs resize there.

The problem is that the volume objects cached in the ASW are only updated when a pod starts and are not updated again afterwards. This means that if we resize the volume after the pod has already started, VolumeManager cannot detect the change from the ASW.

In addition, fs resize must be coordinated with VolumeManager anyway, since VolumeManager should not touch the volume (e.g. unmount it) during resize.

We append the resize operation to OperationExecutor so that VolumeManager can't touch the volume before the resize operation has finished.

The original discussion in this issue: kubernetes/kubernetes#57185
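
The coordination described here (no unmount while a resize is running on the same volume) can be pictured as a per-volume "one operation at a time" guard. The sketch below is a simplified illustration with hypothetical names; it is not the kubelet's actual OperationExecutor code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errOperationPending is returned when a volume already has a running operation.
var errOperationPending = errors.New("an operation is already pending for this volume")

// pendingOperations allows at most one in-flight operation per volume.
type pendingOperations struct {
	mu      sync.Mutex
	running map[string]string // volume name -> operation name
}

func newPendingOperations() *pendingOperations {
	return &pendingOperations{running: map[string]string{}}
}

// Start registers an operation, or fails if another one is still running
// on the same volume.
func (p *pendingOperations) Start(volumeName, opName string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if cur, ok := p.running[volumeName]; ok {
		return fmt.Errorf("%w: %s", errOperationPending, cur)
	}
	p.running[volumeName] = opName
	return nil
}

// Finish clears the in-flight operation for the volume.
func (p *pendingOperations) Finish(volumeName string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.running, volumeName)
}

func main() {
	ops := newPendingOperations()
	_ = ops.Start("vol-1", "resize")
	// The unmount is rejected until the resize calls Finish.
	fmt.Println(ops.Start("vol-1", "unmount"))
	ops.Finish("vol-1")
	fmt.Println(ops.Start("vol-1", "unmount"))
}
```

With a guard like this, a caller that gets the pending error can simply retry on the next reconciler pass.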

I think what we can do is this: when we iterate through pods in volumemanager's desired_state_of_world for mounting and call the PodExistsInVolume function, that function can make an additional API call to check whether the underlying volume needs resizing. Currently PodExistsInVolume already raises a remountRequired error when some change requires us to remount the volume. So we can introduce a new resizeRequired error, raised by the same function, which in turn triggers the file system resize on the kubelet.

This will require no API change or annotation, and everything will work online as expected. Thanks @jsafrane for pointing it out.

One potential problem with the ^ approach: the reconciler loop runs every 100ms, so if a node has, say, 20 pods with volumes, we will be making 20 GET requests every 100ms, because there is no informer inside volume_manager. Again, I am not sure how badly it will affect us.

Maybe we can add a lastUpdateTimestamp for each volume and use a loop interval independent of the reconciler's. This would reduce the request count, but I am not sure how much it would help.
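
The lastUpdateTimestamp idea can be sketched as a small per-volume throttle: the reconciler still ticks every 100ms, but a GET for a given volume is only issued when the configured interval has elapsed since the previous one. The names below are hypothetical, and `simulateChecks` only exists to count how many GETs a run of reconciler ticks would produce.

```go
package main

import (
	"fmt"
	"time"
)

// resizeCheckThrottle remembers when each volume was last checked.
type resizeCheckThrottle struct {
	interval  time.Duration
	lastCheck map[string]time.Time
}

func newResizeCheckThrottle(interval time.Duration) *resizeCheckThrottle {
	return &resizeCheckThrottle{interval: interval, lastCheck: map[string]time.Time{}}
}

// ShouldCheck reports whether the volume is due for another API lookup and,
// if so, records now as its last check time.
func (t *resizeCheckThrottle) ShouldCheck(volumeName string, now time.Time) bool {
	last, seen := t.lastCheck[volumeName]
	if seen && now.Sub(last) < t.interval {
		return false
	}
	t.lastCheck[volumeName] = now
	return true
}

// simulateChecks runs `ticks` reconciler passes spaced tickMillis apart with a
// fake clock and returns how many lookups the throttle lets through for one volume.
func simulateChecks(ticks, tickMillis, intervalMillis int) int {
	throttle := newResizeCheckThrottle(time.Duration(intervalMillis) * time.Millisecond)
	now := time.Unix(0, 0)
	checks := 0
	for i := 0; i < ticks; i++ {
		if throttle.ShouldCheck("vol-1", now) {
			checks++
		}
		now = now.Add(time.Duration(tickMillis) * time.Millisecond)
	}
	return checks
}

func main() {
	// 50 reconciler ticks of 100ms with a 1s check interval.
	fmt.Println(simulateChecks(50, 100, 1000))
}
```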

mlmhl · 2018-01-26T02:30:08Z

@jsafrane

I probably miss something very obvious, but why do we need new annotations?

Kubelet only watches pods and fetches the PVC object from the apiserver only when a pod starts; it does not fetch the latest PVC object again once the pod is running. So if we resize the volume after the pod has started, kubelet will not notice the request.

We add an annotation to the pod to inform kubelet that some volumes need resizing, but as @gnufied mentioned, this is just one of several available ways to inform kubelet.

cblecker · 2018-01-29T17:29:41Z

/ok-to-test

mlmhl · 2018-04-28T06:55:44Z

Hi guys @jsafrane, @yujuhong, we changed this proposal again because we found that Kubelet's VolumeManager already fetches the PV/PVC during every pod resync loop, see here and here. So we decided to reuse this code and use the fetched PV/PVC objects to check whether a volume requires online resizing, instead of adding a PVC cache.

Actually, the current PV/PVC reprocessing mechanism has a hidden performance problem, so we opened a separate issue to track it: #63274

cc @gnufied
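
As a rough illustration of checking the fetched PV/PVC objects, the sketch below assumes that a file system resize is needed when the PV's capacity is larger than the capacity reported in the PVC's status. That comparison, the struct, and the function name are assumptions made for this example; the thread itself does not spell out the exact condition.

```go
package main

import "fmt"

// volumeSizes is a hypothetical, flattened view of the fetched objects,
// with capacities expressed in bytes.
type volumeSizes struct {
	pvCapacityBytes        int64 // size of the (already expanded) PV
	pvcStatusCapacityBytes int64 // size last reported in the PVC status
}

// needsFSResize reports whether the file system still has to grow to match
// the underlying volume (assumed condition, see the note above).
func needsFSResize(v volumeSizes) bool {
	return v.pvCapacityBytes > v.pvcStatusCapacityBytes
}

func main() {
	const gib = int64(1) << 30
	fmt.Println(needsFSResize(volumeSizes{pvCapacityBytes: 20 * gib, pvcStatusCapacityBytes: 10 * gib}))
}
```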

gnufied · 2018-04-28T12:09:38Z

In light of @mlmhl's finding that we are already fetching the pvc/pv from the api-server on every pod sync, I agree that we do not need a PVC cache similar to the one for configmaps etc. This proposal can definitely work without introducing new API calls.

I propose we tackle the issue of the PVC/PV being fetched all the time in the separate GitHub issue that has been opened, kubernetes/kubernetes#63274. cc @saad-ali

gnufied · 2018-05-19T13:28:59Z

/lgtm

cc @saad-ali @msau42 @jsafrane let's try to get this merged for 1.11. I do not think there are any remaining blockers here.

jsafrane

lgtm-ish, however you should note somewhere what happens when volume is being resized and at the same time a pod is deleted and the same volume is being unmounted. IMO unmount must wait (should not be started) until resize finishes. Reconciler(?) should check that.

jsafrane · 2018-05-21T07:03:53Z

contributors/design-proposals/storage/online-grow-volume-size.md

+In each loop, for each volume in DSW, `Reconciler` checks whether this volume exists in ASW or not. If volume exists but marked as file system resizing required, ASW returns a `fsResizeRequiredError` error. then `Reconciler` triggers a file system resizing operation for this volume. The code snippet looks like this:
+
+```go
+if cache.IsFSResizeRequiredError(err) {


I don't like passing random return values via errors; they should be reserved for real errors / exceptions. But we can solve that in the implementation phase and not in this proposal.

Currently we use a remountRequiredError to indicate that a volume needs a remount, so I added a resizeRequiredError for consistency. As you said, we can review this again in the implementation phase.
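
The error-as-signal pattern under discussion can be sketched as follows. The type and helper mirror the names used in the proposal snippet (`fsResizeRequiredError`, `IsFSResizeRequiredError`), but the state struct and the simplified `podExistsInVolume` signature are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// fsResizeRequiredError signals that a mounted volume needs a file system resize.
type fsResizeRequiredError struct {
	volumeName string
}

func (e fsResizeRequiredError) Error() string {
	return fmt.Sprintf("volume %q requires a file system resize", e.volumeName)
}

// IsFSResizeRequiredError reports whether err carries the resize signal.
func IsFSResizeRequiredError(err error) bool {
	var target fsResizeRequiredError
	return errors.As(err, &target)
}

// mountedVolume and actualState are hypothetical, minimal stand-ins for the ASW.
type mountedVolume struct {
	fsResizeRequired bool
}

type actualState struct {
	mounted map[string]*mountedVolume
}

// podExistsInVolume returns whether the volume is mounted, plus the resize
// signal when the mounted volume was flagged.
func (asw *actualState) podExistsInVolume(volumeName string) (bool, error) {
	v, ok := asw.mounted[volumeName]
	if !ok {
		return false, nil
	}
	if v.fsResizeRequired {
		return true, fsResizeRequiredError{volumeName: volumeName}
	}
	return true, nil
}

func main() {
	asw := &actualState{mounted: map[string]*mountedVolume{"vol-1": {fsResizeRequired: true}}}
	exists, err := asw.podExistsInVolume("vol-1")
	fmt.Println(exists, IsFSResizeRequiredError(err))
}
```

A plain boolean or status return value would avoid overloading `error`, which is the objection raised in the review; the sketch only shows the variant that stays consistent with the existing remount signal.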

gnufied · 2018-05-21T10:57:22Z

@jsafrane yeah, I think there is a note in the original resize proposal that if the PVC has any resize-in-progress condition, we will wait for it to clear before unmounting the volume.

Automatic merge from submit-queue (batch tested with PRs 62460, 64480, 63774, 64540, 64337). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Implement kubelet side online file system resize for volume

**What this PR does / why we need it**: Implement kubelet side online file system resize.

xref - [kubernetes/feature#531](kubernetes/enhancements#531)
proposal - [kubernetes/community#1535](kubernetes/community#1535)

**Release note**:
```release-note
Implement kubelet side online file system resizing
```

fejta-bot · 2018-08-30T01:47:53Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2018-09-29T02:10:00Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

gnufied · 2018-11-07T20:47:16Z

/remove-lifecycle rotten

fejta-bot · 2019-02-05T21:37:07Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

mlmhl · 2019-02-06T09:37:59Z

/remove-lifecycle stale

fejta-bot · 2019-05-07T09:45:43Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

agolomoodysaada · 2019-05-07T15:22:25Z

/remove-lifecycle stale
still relevant right?

msau42 · 2019-05-07T15:29:11Z

This has been replaced by https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190125-online-growing-persistent-volume-size.md

/close

k8s-ci-robot · 2019-05-07T15:29:14Z

@msau42: Closed this PR.

In response to this:

This has been replaced by https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190125-online-growing-persistent-volume-size.md

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 20, 2017

k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Dec 20, 2017

k8s-ci-robot assigned gnufied Dec 20, 2017

This was referenced Dec 20, 2017

Add support for online resizing of PVs kubernetes/enhancements#531

Closed

Support online file system resize for PVs kubernetes/kubernetes#57185

Closed

gnufied reviewed Dec 20, 2017

mlmhl force-pushed the pvc-online-resize branch 6 times, most recently from 6756f2a to bca098f Compare December 22, 2017 06:37

gnufied reviewed Jan 2, 2018

mlmhl force-pushed the pvc-online-resize branch 3 times, most recently from eedc43b to 2affa0e Compare January 3, 2018 10:39

gnufied mentioned this pull request Jan 25, 2018

Perform resize of mounted volume if necessary kubernetes/kubernetes#58794

Merged

jsafrane reviewed Jan 25, 2018

proposal for online resizing of persistent volumes

6bf5bf4

mlmhl force-pushed the pvc-online-resize branch from af37d56 to 6bf5bf4 Compare April 28, 2018 06:51

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2018

jsafrane reviewed May 21, 2018

angapov approved these changes Jun 1, 2018

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2018

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 29, 2018

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2018

wongma7 mentioned this pull request Jan 28, 2019

Online Growing Persistent Volume Size KEP kubernetes/enhancements#737

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2019

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2019

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2019

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2019

k8s-ci-robot closed this May 7, 2019

toshipp mentioned this pull request Sep 2, 2024

feat: controller only expansion topolvm/topolvm#898

Closed

