- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed [optional]
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
A user may expand a PersistentVolumeClaim (PVC) to a size which is not supported by the underlying storage provider. In that case the expansion controller typically retries the expansion forever and keeps failing. We want to make it easier for users to recover from expansion failures, so that they can retry volume expansion with values that may succeed.
To enable recovery, this proposal relaxes API validation for PVC objects to allow a reduction in the requested size. This KEP also adds a new field called `pvc.Status.AllocatedResources` to allow resource quota to be tracked accurately if a user reduces the size of their PVC.
Sometimes a user may expand a PVC to a size which is not supported by the underlying storage provider. Let's say the PVC size was 10GB and the user expanded it to 500GB, but the underlying storage provider cannot support a 500GB volume for some reason. We want to make it easier for the user to retry resizing by requesting a more reasonable value, say 100GB.
The problem with allowing users to reduce the requested size is that a malicious user can use this ability to abuse the quota system. They can do so by rapidly expanding and then shrinking the PVC. In general, both CSI and in-tree volume plugins have been designed to never perform an actual shrink on the underlying volume, but such churn can fool the quota system into believing the user is using less storage than they actually are.
To solve this problem, we propose that although users are allowed to reduce the size of their PVC (as long as the requested size is greater than `pvc.Status.Capacity`), quota calculation will use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)`, and a reduction in `pvc.Status.AllocatedResources` is only done by the resize controller after a previously issued expansion has failed with some kind of terminal failure.
- Allow users to cancel previously issued volume expansion requests, assuming they are not yet successful or have failed.
- Allow users to retry an expansion request with a smaller value than the originally requested size in `pvc.Spec.Resources`, assuming previous requests are not yet successful or have failed.
- As part of this KEP we do not intend to actually support shrinking of the underlying volume.
- As part of this KEP, when a user intends to recover from volume expansion failures, they must recover to a size slightly larger than the original size of the volume. For example, if a user expanded a volume from `10Gi` to `100Gi` and expansion to `100Gi` is failing, then we only permit recovery to a size greater than `10Gi`, for example `11Gi` or anything higher than `10Gi`.
As part of this proposal, we are mainly proposing the following changes:
- Relax API validation on PVC update so that reducing `pvc.Spec.Resources` is allowed, as long as the requested size is `> pvc.Status.Capacity`.
- Add a new field in the PVC API called `pvc.Status.AllocatedResources` for the purpose of tracking used storage size. By default only the api-server or the resize controller can set this field.
- Add a new field in the PVC API called `pvc.Status.ResizeStatus` for the purpose of tracking the status of volume expansion. Following are possible values of the newly introduced `ResizeStatus` field:
  - `nil` // General default because the type is a pointer.
  - empty // When expansion is complete, the empty string is set by the resize controller or kubelet.
  - `ControllerExpansionInProgress` // Set when the external resize controller starts expanding the volume.
  - `ControllerExpansionFailed` // Set when expansion has failed in the external resize controller with a terminal error. Transient errors don't set `ControllerExpansionFailed`.
  - `NodeExpansionPending` // Set when the external resize controller has finished expanding the volume but further expansion is needed on the node.
  - `NodeExpansionInProgress` // Set when kubelet starts expanding the volume.
  - `NodeExpansionFailed` // Set when expansion has failed in kubelet with a terminal error. Transient errors don't set `NodeExpansionFailed`.
- Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for a PVC.
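To make the quota rule concrete, here is a minimal Go sketch of the `max()` evaluation. This is an illustration only, not the actual quota-evaluator code; the helper name `storageUsage` is hypothetical:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// storageUsage returns max(specRequest, allocated): the value the quota
// controller is proposed to charge for a PVC's storage. allocated is nil
// when pvc.Status.AllocatedResources has never been populated.
func storageUsage(specRequest resource.Quantity, allocated *resource.Quantity) resource.Quantity {
	if allocated == nil || specRequest.Cmp(*allocated) > 0 {
		return specRequest
	}
	return *allocated
}

func main() {
	spec := resource.MustParse("20Gi")       // pvc.Spec.Resources.Requests["storage"]
	allocated := resource.MustParse("100Gi") // pvc.Status.AllocatedResources["storage"]
	used := storageUsage(spec, &allocated)
	fmt.Println(used.String()) // prints "100Gi"
}
```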
We are trying to reduce the number of state changes that can happen when volume expansion fails in either the kubelet or the external-resizer.
We are considering the following gRPC error codes as "infeasible":
- `INVALID_ARGUMENT`
- `OUT_OF_RANGE`
- `NOT_FOUND`
In the external-resizer, if `ControllerExpandVolume` fails with any of the error codes above, controller expansion will be marked as failed and resizing will be retried at a slower rate. For all other errors, an event will be generated and a condition will be added to the PVC that expansion has failed, but the state change will not be recorded in `allocatedResourceStatus`.
On the node side, `allocatedResourceStatus` will only be updated with a failed expansion if:
- `NodeExpandVolume` failed with one of the infeasible error codes from above.
- `NodeExpandVolume` failed with a final error and there is a pending PVC size change request from the user.

This will allow the external-resizer to recover safely from node expansion failures too.
![New flow kubelet](./Expanding%20volume%20-%20Kubelet%20Loop.png)
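A rough sketch of how such a classification could be implemented is shown below. This is illustrative only; `isInfeasibleError` is a hypothetical helper name, not the actual external-resizer code:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isInfeasibleError reports whether a CSI RPC error indicates that the
// requested size can never be satisfied, i.e. retrying the same request is
// pointless and the user should be allowed to lower the requested size.
func isInfeasibleError(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		// Not a gRPC status error; treat it as a transient failure.
		return false
	}
	switch st.Code() {
	case codes.InvalidArgument, codes.OutOfRange, codes.NotFound:
		return true
	default:
		return false
	}
}

func main() {
	err := status.Error(codes.OutOfRange, "requested size exceeds backend limit")
	fmt.Println(isInfeasibleError(err)) // true
}
```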
After some discussion with sig-storage folks, we are proposing that we rename `pvc.Status.ResizeStatus` to `pvc.Status.AllocatedResourceStatus` and make it a map. The new API will look like:
```go
type ClaimResourceStatus string

const (
	PersistentVolumeClaimControllerResizeInProgress ClaimResourceStatus = "ControllerResizeInProgress"
	PersistentVolumeClaimControllerResizeFailed     ClaimResourceStatus = "ControllerResizeFailed"
	PersistentVolumeClaimNodeResizePending          ClaimResourceStatus = "NodeResizePending"
	PersistentVolumeClaimNodeResizeInProgress       ClaimResourceStatus = "NodeResizeInProgress"
	PersistentVolumeClaimNodeResizeFailed           ClaimResourceStatus = "NodeResizeFailed"
)

// PersistentVolumeClaimStatus represents the status of a PV claim.
type PersistentVolumeClaimStatus struct {
	// Phase represents the current phase of the PersistentVolumeClaim.
	// +optional
	Phase PersistentVolumeClaimPhase
	// AccessModes contains all ways the volume backing the PVC can be mounted.
	// +optional
	AccessModes []PersistentVolumeAccessMode
	// Represents the actual resources of the underlying volume.
	// +optional
	Capacity ResourceList
	// +optional
	AllocatedResources ResourceList
	// AllocatedResourceStatus stores a map of the resources that are being
	// expanded and the possible status that each resource can be in.
	// An example may be:
	// pvc.status.allocatedResourceStatus["storage"] = "ControllerResizeInProgress"
	AllocatedResourceStatus map[ResourceName]ClaimResourceStatus
}
```
- User creates a 10Gi PVC by setting `pvc.spec.resources.requests["storage"] = "10Gi"`.
- User increases the 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeInProgress`.
- Expansion to 100Gi fails, and hence `pv.Spec.Capacity` and `pvc.Status.Capacity` stay at 10Gi.
- Expansion controller sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeFailed`.
- User requests a size of 20Gi.
- Expansion controller notices that the previous expansion to the last known `allocatedResources` failed, so it sets the new `allocatedResources` to `20Gi`.
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report the new size as `20Gi`.
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
- User increases the 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeInProgress`.
- Since expansion operations in the control-plane are a NO-OP, expansion in the control-plane succeeds and `pv.Spec` is set to `100Gi`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeResizePending`.
- Expansion starts on the node and kubelet sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeResizeInProgress`.
- Expansion fails on the node with a final error.
- Kubelet sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeResizeFailed`.
- Since the PVC has `pvc.Status.AllocatedResourceStatus['storage']` set to `NodeResizeFailed`, kubelet will stop retrying node expansion.
- At this point kubelet will wait for `pvc.Status.AllocatedResourceStatus['storage']` to be `NodeResizePending`.
- User requests a size of 20Gi.
- Expansion controller starts expanding the volume and sees that the last expansion failed on the node and the driver does not have control-plane expansion.
- Expansion controller sets `pvc.Status.AllocatedResources` to `20Gi`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeInProgress`.
- Since expansion operations in the control-plane are a NO-OP, expansion in the control-plane succeeds and `pv.Spec` is set to `20Gi`.
- Expansion succeeds on the node with the latest `allocatedResources` and `pvc.Status.Size` is set to `20Gi`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeResizePending`.
- Kubelet can now retry expansion and expansion on the node succeeds.
- Kubelet sets `pvc.Status.AllocatedResourceStatus['storage']` to the empty string and `pvc.Status.Capacity` to the new value.
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.
- User increases the 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeInProgress`.
- At this point `pv.Spec.Capacity` and `pvc.Status.Capacity` stay at 10Gi until the resize is finished.
- While the storage backend is resizing the volume, the user requests a size of 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`.
- Expansion controller notices that the previous expansion to the last known `allocatedResources` is still in progress.
- Expansion controller keeps expanding the volume to `allocatedResources`.
- Expansion to `100Gi` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report the new size as `100Gi`.
- Although `pvc.Spec.Resources` reports the size as `20Gi`, expansion to `20Gi` is never attempted.
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
- User requests a PVC of 10.1GB but the underlying storage provider provisions a volume of 11GB after rounding.
- The size recorded in `pvc.Status.Capacity` and `pv.Spec.Capacity` is however still 10.1GB.
- User expands the PVC to 100GB.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `89.9GB` to used quota.
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100GB` (before expanding).
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeInProgress`.
- At this point `pv.Spec.Capacity` and `pvc.Status.Capacity` stay at 10.1GB until the resize is finished.
- While the resize is in progress, the expansion controller crashes and loses state.
- User reduces the size of the PVC to 10.5GB.
- Expansion controller notices that the previous expansion to the last known `allocatedResources` is still in progress.
- Expansion controller starts expanding the volume to the last known `allocatedResources` (which is `100GB`).
- Expansion to `100GB` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report the new size as `100GB`.
- Although `pvc.Spec.Resources` reports the size as `10.5GB`, expansion to `10.5GB` is never attempted.
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100GB.
- User increases the 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerResizeInProgress`.
- At this point `pv.Spec.Capacity` and `pvc.Status.Capacity` stay at 10Gi until the resize is finished.
- While the storage backend is resizing the volume, the user requests a size of 200Gi by changing `pvc.spec.resources.requests["storage"] = "200Gi"`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `100Gi` to used quota.
- Since `pvc.Status.AllocatedResourceStatus['storage']` is `ControllerResizeInProgress`, the expansion controller still chooses the last `pvc.Status.AllocatedResources` as the new size.
- User reduces the size back to `20Gi`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and returns `100Gi` of used quota.
- Expansion controller notices that the previous expansion to the last known `allocatedResources` is still in progress.
- Expansion controller keeps expanding the volume to `allocatedResources`.
- Expansion to `100Gi` succeeds and `pv.Spec.Capacity` and `pvc.Status.Capacity` report the new size as `100Gi`.
- Although `pvc.Spec.Resources` reports the size as `20Gi`, expansion to `20Gi` is never attempted.
- Quota controller sees no change in storage usage by the PVC because `pvc.Status.AllocatedResources` is 100Gi.
- Once expansion is initiated, lowering of the requested size is only allowed up to a value greater than `pvc.Status.Capacity`. It is not possible to go entirely back to the previously requested size. In practice this should not be a problem, because the user can retry expansion with a slightly higher value than `pvc.Status.Capacity` and still recover from a previously failing expansion request.
We propose that by relaxing validation on PVC update to allow users to reduce `pvc.Spec.Resources`, it becomes possible to cancel previously issued expansion requests or retry expansion with a lower value if the previous request has not been successful. In general, we know that volume plugins are designed to never perform actual shrinking of the volume, for both in-tree and CSI volumes. Moreover, if a previously issued expansion has been successful and the user reduces the PVC request size, both CSI and in-tree plugins are designed to return a successful NO-OP response. So reducing the requested size is a safe operation and will never result in data loss or actual shrinking of the volume.
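A minimal sketch of the relaxed validation rule, assuming the check compares the new request against both the old request and `pvc.Status.Capacity`; `validatePVCSizeUpdate` is a hypothetical helper, not the actual apiserver validation code:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// validatePVCSizeUpdate sketches the relaxed rule: the requested size may
// shrink compared to the old request, but only to a value that is still
// greater than the current pvc.Status.Capacity, so the underlying volume is
// never asked to actually shrink.
func validatePVCSizeUpdate(oldRequest, newRequest, statusCapacity resource.Quantity) error {
	if newRequest.Cmp(oldRequest) >= 0 {
		return nil // growing (or keeping) the request was always allowed
	}
	if newRequest.Cmp(statusCapacity) > 0 {
		return nil // shrinking the request is fine while it stays above actual capacity
	}
	return fmt.Errorf("requested size %s must be greater than current capacity %s",
		newRequest.String(), statusCapacity.String())
}

func main() {
	oldReq := resource.MustParse("100Gi")
	newReq := resource.MustParse("20Gi")
	capacity := resource.MustParse("10Gi")
	fmt.Println(validatePVCSizeUpdate(oldReq, newReq, capacity)) // <nil>
}
```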
We do, however, have a problem with quota calculation: if a previously issued expansion is successful but not recorded (or only partially recorded) in the api-server and the user reduces the requested size of the PVC, then the quota controller will treat it as actual shrinking of the volume and (incorrectly) reduce the storage usage attributed to the user. Since we know the actual size of the volume only after performing expansion (either on the node or in the controller), allowing quota to be reduced on a PVC size reduction would allow a user to abuse the quota system.
To solve the aforementioned problem, we propose that a new field be added to the PVC, called `pvc.Status.AllocatedResources`. When the user expands the PVC and the expansion-controller starts volume expansion, it will set `pvc.Status.AllocatedResources` to the user-requested value in `pvc.Spec.Resources` before performing expansion, and it will set `pvc.Status.AllocatedResourceStatus[storage]` to `ControllerResizeInProgress`. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)`, which will ensure that abusing quota is not possible.
The resizing operation in the external resize controller will always work towards fulfilling the size recorded in `pvc.Status.AllocatedResources`, and only when the previous operation has finished (i.e. `pvc.Status.AllocatedResourceStatus[storage]` is nil) or when the previous operation has failed with a terminal error will it use the new user-requested value from `pvc.Spec.Resources`.
Kubelet, on the other hand, will only expand volumes for which `pvc.Status.AllocatedResourceStatus[storage]` is in the `NodeResizePending` or `NodeResizeInProgress` state and `pv.Spec.Capacity > pvc.Status.Capacity`. If a volume expansion fails in kubelet with a terminal error (which sets the `NodeResizeFailed` state), then it must wait for the resize controller in the external-resizer to reconcile the state and put it back in `NodeResizePending`.
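A rough sketch of this gating check is shown below; `shouldNodeExpand` is a hypothetical helper and the status strings follow the names used in this KEP, not necessarily the actual kubelet code:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// shouldNodeExpand sketches the condition under which kubelet would attempt
// node expansion: the resize status must be NodeResizePending or
// NodeResizeInProgress, and the PV spec capacity must be larger than what the
// PVC status currently reports.
func shouldNodeExpand(resizeStatus string, pvSpecCapacity, pvcStatusCapacity resource.Quantity) bool {
	if resizeStatus != "NodeResizePending" && resizeStatus != "NodeResizeInProgress" {
		return false
	}
	return pvSpecCapacity.Cmp(pvcStatusCapacity) > 0
}

func main() {
	fmt.Println(shouldNodeExpand("NodeResizePending",
		resource.MustParse("100Gi"), resource.MustParse("10Gi"))) // true
	fmt.Println(shouldNodeExpand("NodeResizeFailed",
		resource.MustParse("100Gi"), resource.MustParse("10Gi"))) // false
}
```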
When the user reduces `pvc.Spec.Resources`, the expansion-controller will set `pvc.Status.AllocatedResources` to the lower value only if one of the following is true (a sketch of this decision follows the list):

- `pvc.Status.AllocatedResourceStatus[storage]` is `ControllerResizeFailed` (indicating that the previous expansion to the last known `allocatedResources` failed with a final error) and the previous control-plane expansion has not succeeded.
- `pvc.Status.AllocatedResourceStatus[storage]` is `NodeResizeFailed` and the SP supports node-only expansion (indicating that the previous expansion to the last known `allocatedResources` failed on the node with a final error).
- `pvc.Status.AllocatedResourceStatus[storage]` is `nil` or empty and the previous `ControllerExpandVolume` has not succeeded.
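A sketch of this decision, mirroring the bullets above; `mayLowerAllocatedResources` is a hypothetical helper and does not reproduce the actual `expand_and_recover.go` logic:

```go
package main

import "fmt"

// mayLowerAllocatedResources sketches the conditions under which the external
// resize controller would accept a lower user-requested size and shrink
// pvc.Status.AllocatedResources accordingly.
func mayLowerAllocatedResources(resizeStatus string, controllerExpansionSucceeded, nodeOnlyExpansion bool) bool {
	switch resizeStatus {
	case "ControllerResizeFailed":
		// Previous control-plane expansion hit a terminal error and never succeeded.
		return !controllerExpansionSucceeded
	case "NodeResizeFailed":
		// Previous expansion failed on the node; only safe for node-only expansion.
		return nodeOnlyExpansion
	case "":
		// Status is nil/empty: lower only if the previous ControllerExpandVolume
		// never succeeded.
		return !controllerExpansionSucceeded
	default:
		return false
	}
}

func main() {
	fmt.Println(mayLowerAllocatedResources("ControllerResizeFailed", false, false)) // true
	fmt.Println(mayLowerAllocatedResources("NodeResizePending", false, false))      // false
}
```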
Note: Whenever the resize controller or kubelet modifies `pvc.Status` (such as when setting both `AllocatedResources` and `pvc.Status.AllocatedResourceStatus`), it is expected that all changes to `pvc.Status` are submitted as part of the same patch request to avoid race conditions.
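A sketch of how both fields could be written in one status patch using client-go. This is illustrative only: the JSON field names follow the KEP text above and the helper name is hypothetical:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// patchPVCStatus updates AllocatedResources and AllocatedResourceStatus in a
// single patch against the status subresource, so other controllers never
// observe one field without the other.
func patchPVCStatus(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	patch := []byte(`{"status":{` +
		`"allocatedResources":{"storage":"100Gi"},` +
		`"allocatedResourceStatus":{"storage":"ControllerResizeInProgress"}}}`)
	_, err := client.CoreV1().PersistentVolumeClaims(ns).
		Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Println("not running in a cluster:", err)
		return
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := patchPVCStatus(context.TODO(), client, "default", "my-pvc"); err != nil {
		fmt.Println("patch failed:", err)
	}
}
```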
The complete expansion and recovery flow of both control-plane and kubelet is documented in the attached PDF, which captures the flow via state diagrams and is much more exhaustive than the text above.
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
In external-resizer:
- pkg/controller/expand_and_recover.go: 09/30/2024 - 78.3%
None
There are already e2e tests running that verify correctness of this feature - https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-cos-alpha-features
- Alpha in 1.23 behind the `RecoverVolumeExpansionFailure` feature gate, which defaults to `false`.
- Beta in 1.29: We are going to move this to beta with enhanced e2e and more stability improvements.
- There are already e2e tests running that verify correctness of this feature - https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-cos-alpha-features
Not Applicable
To use this feature, the `RecoverVolumeExpansionFailure` feature gate should be enabled in `kubelet`, `external-resizer`, `api-server` and `kube-controller-manager`.
If for some reason `external-resizer` is running with an older version while the core Kubernetes components have been upgraded, then volume expansion will follow the old behaviour and no recovery from volume expansion failure will be possible.
If `external-resizer` has been upgraded to the latest version (and the feature gate is enabled), but core Kubernetes components such as `api-server` or `kubelet` have not been updated, then no recovery will be possible either, because `api-server` will not allow users to reduce the size of the PVC.
If `api-server` and `external-resizer` have been upgraded and have the feature gate enabled, but `kubelet` is still older, then for volume types which require expansion on the node the resize status will not be updated correctly until the `kubelet` version is upgraded. This will not block the volume from being mounted, but it does mean that the resize operation will not fully finish.
- We will add events to both PV and PVC objects when a user reduces the PVC size.
- How can this feature be enabled / disabled in a live cluster?
  - Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: `RecoverVolumeExpansionFailure`
    - Components depending on the feature gate: kube-apiserver, kube-controller-manager, csi-external-resizer, kubelet
  - Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node?
- Does enabling the feature change any default behavior? It allows users to reduce the size of a PVC in `pvc.spec.resources`. In general this was not permitted before, so it should not break any existing automation. It also means that if `pvc.Status.AllocatedResources` is available, it will be used for calculating quota.

  To facilitate an older kubelet, the external resize controller will set `pvc.Status.AllocatedResourceStatus[storage]` to `''` after the entire expansion process is complete. This ensures that the resize status is updated after expansion is complete even with an older kubelet. No recovery from expansion failure will be possible in this case, and the workaround will be removed once the feature goes GA.

  One more thing to keep in mind: enabling this feature in kubelet while keeping it disabled in external-resizer will cause all volume expansion operations to get stuck (a similar thing will happen when the feature moves to beta and kubelet is newer but the external-resizer sidecar is older). This limitation will be documented in external-resizer.

- Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)? Yes, the feature gate can be disabled once enabled. However, quota usage present in the cluster will be based off a field (`pvc.Status.AllocatedResources`) which is no longer updated. Currently, without this feature, quota calculation is entirely based on `pvc.Spec.Resources`; when the feature is enabled it is based off `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)`, so when the feature is disabled the cluster might be reporting stale quota. To fix this issue, cluster admins can re-create `ResourceQuota` objects so that the quota controller recomputes the quota using `pvc.Spec.Resources`.

- What happens if we reenable the feature if it was previously rolled back? It should be possible to re-enable the feature after disabling it. When the feature is disabled and re-enabled, users will be able to reduce the size of `pvc.Spec.Resources` to cancel a previously issued expansion, but in case `pvc.Spec.Resources` reports a lower value than what was reported in `pvc.Status.AllocatedResources` (which would mean the resize controller previously tried to expand this volume to a bigger size), quota calculation will be based off `pvc.Status.AllocatedResources`.

- Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling and disabling feature gates. However, we will write extensive unit tests to test the behaviour of the code when the feature gate is disabled and when it is enabled.

- How can a rollout fail? Can it impact already running workloads? This change should not impact existing workloads, and it requires user interaction via reducing the PVC capacity.

- What specific metrics should inform a rollback? No specific metric, but if expansion of PVCs is getting stuck (which can be verified from `pvc.Status.Conditions`) then the user should plan a rollback.

- Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested? We have not fully tested upgrade and rollback, but as part of the beta process we will have it tested.

- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? This feature deprecates no existing functionality.
Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
Since the feature requires user interaction, reducing the size of the PVC is only supported if this feature gate is enabled.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - controller expansion operation duration:
    - Metric name: `storage_operation_duration_seconds{operation_name=expand_volume}`
    - [Optional] Aggregation method: percentile
    - Components exposing the metric: kube-controller-manager
  - controller expansion operation errors:
    - Metric name: `storage_operation_errors_total{operation_name=expand_volume}`
    - [Optional] Aggregation method: cumulative counter
    - Components exposing the metric: kube-controller-manager
  - node expansion operation duration:
    - Metric name: `storage_operation_duration_seconds{operation_name=volume_fs_resize}`
    - [Optional] Aggregation method: percentile
    - Components exposing the metric: kubelet
  - node expansion operation errors:
    - Metric name: `storage_operation_errors_total{operation_name=volume_fs_resize}`
    - [Optional] Aggregation method: cumulative counter
    - Components exposing the metric: kubelet
- Other (treat as last resort)
  - Details:
After this feature is rolled out, there should not be any increase in the 95-99 percentile of both `expand_volume` and `volume_fs_resize` durations. The error rate for the `storage_operation_errors_total` metric should also not increase.
Are there any missing metrics that would be useful to have to improve observability of this feature?
We are planning to add new counter metrics that will record success and failure of recovery operations. In cases where recovery fails, the counter will keep increasing until an admin action resolves the error. The tentative name of the metric is `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`. The reason for using the PV name as a label is that we do not expect this feature to be used in a cluster very often, and hence it should be okay to use the names of PVs that were recovered this way.
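A sketch of how such a counter might be registered and incremented with Prometheus client_golang; the metric name follows the tentative name above, and the registration code is illustrative rather than the actual implementation:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// volumeRecoveryTotal is a sketch of the proposed recovery counter, labelled
// by the outcome of the recovery and the name of the recovered volume.
var volumeRecoveryTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "operation_operation_volume_recovery_total",
		Help: "Number of times a PVC was recovered from a failed expansion.",
	},
	[]string{"state", "volume_name"},
)

func main() {
	prometheus.MustRegister(volumeRecoveryTotal)
	// Recorded when the resize controller successfully recovers a volume.
	volumeRecoveryTotal.WithLabelValues("success", "pvc-abce").Inc()
}
```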
- Does this feature depend on any specific services running in the cluster?
  For CSI volumes this feature requires users to run the latest version of the `external-resizer` sidecar, which should also have the `RecoverVolumeExpansionFailure` feature enabled.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
Potentially yes. If a user expands a PVC and expansion fails on the node, then the expansion controller in the control-plane must intervene and verify whether it is safe to retry expansion on the kubelet. This requires round-trips between kubelet and control-plane and hence more API calls.
None
Potentially yes. Since the expansion operation is idempotent and the expansion controller must verify whether it is safe to retry expansion with lower values, there could be additional calls to CSI drivers (and hence cloud providers).
Since this feature adds new fields to the PersistentVolumeClaim object, it will increase the size of the object:
- API type(s): PersistentVolumeClaim
- Estimated increase in size: 40B
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
In some cases, if expansion fails on the node, the overall expansion may take longer, since it now requires correction from the control-plane.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
It should not result in increase of resource usage.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Enabling this feature should not result in resource exhaustion of node resources.
The Troubleshooting section serves the Playbook role as of now. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now we leave it here though.
This section must be completed when targeting beta graduation to a release.
- How does this feature react if the API server and/or etcd is unavailable? If the API server or etcd is not available midway through a request, then some of the modifications to API objects cannot be saved. But this should be a recoverable state, because the expansion operation is idempotent and the controller will retry the operation and succeed whenever the API server comes back.
- What are other known failure modes?
  - No recovery is possible if the volume has been expanded on the control-plane and is only failing on the node.
    - Detection: Expansion is stuck with `AllocatedResourceStatus['storage']` in `NodeResizePending` or `NodeResizeFailed`.
    - Mitigations: This should not affect any existing PVCs, but this was already broken in some sense, and if the volume has been expanded in the control-plane then we can't allow users to shrink their PVCs because that would violate the quota.
    - Diagnostics: Expansion is stuck with `AllocatedResourceStatus['storage']` in `NodeResizePending` or `NodeResizeFailed`.
    - Testing: There are some unit tests for this failure mode.
- What steps should be taken if SLOs are not being met to determine the problem? If an admin notices an increase in failing expansion operations via the aforementioned metrics or by observing `pvc.Status.AllocatedResourceStatus['storage']`, then:
  - Check events on the PVC and observe why PVC expansion is failing.
  - Gather logs from kubelet and external-resizer and check for problems.
- 2020-01-27 Initial KEP pull request submitted
- 2023-02-03 Changed the API by renaming `pvc.Status.ResizeStatus` to `pvc.Status.AllocatedResourceStatus` and converting it to a map.
- 2024-09-12 Propose move to beta status.
- One drawback is that while this KEP allows a user to reduce the requested PVC size, it does not allow reducing the size all the way back to the original size recorded in `pvc.Status.Capacity`. The reason is that currently the resize controller and kubelet work on a volume one after the other, and we would need special logic in both kubelet and control-plane to process a reduction in size all the way to the original value. This is somewhat racy, and we need a signaling mechanism between control-plane and kubelet for when `pvc.Status.AllocatedResources` needs to be adjusted. We plan to revisit and address this in the next release.
There were several alternatives considered before proposing this KEP.
`pvc.Status.Capacity` can't be used for tracking quota because `pvc.Status.Capacity` is calculated after binding happens, which could be when the pod is started. This would allow a user to overcommit, because quota won't reflect an accurate value until the PVC is bound to a PV.
We also considered whether it is worth leaving this problem as it is and letting admins fix it manually, since the problem that this KEP fixes is not very frequent (there have not been any reports from end users about this bug). But there are a couple of problems with this:
- A workflow that depends on restarting the pods after resizing is successful can't complete if resizing is stuck forever. This is the case with the current design of the StatefulSet expansion KEP.
- If a storage type only supports node expansion and it keeps failing, then the PVC could become unusable.
As mentioned above, one limitation of this KEP is that users can only recover up to a size greater than the previous value recorded in `pvc.Status.Capacity`.