
Resizer and Modifier Controllers race to patch PVC, causing loser to restart reconciliation loop #458

ElijahQuinones opened this issue Dec 12, 2024 · 2 comments


ElijahQuinones commented Dec 12, 2024

If a cluster operator attempts to resize and modify a volume at the same time (by patching the PVC), either the ExpandVolume or the ModifyVolume call will be delayed by retry-interval-start.

This is because the external-resizer's modifyController and resizeController reconciliation loops both attempt to patch the same PVC to mark their operation as in progress. The loser must restart its reconciliation loop after waiting retry-interval-start, with the error `can't patch status of PVC ebs-5935/pvc-d5jhc with Operation cannot be fulfilled on persistentvolumeclaims \"pvc-d5jhc\": the object has been modified; please apply your changes to the latest version and try again`.

For modifying the volume, the patch attempt happens in three places: `markControllerModifyVolumeStatus`, `updateConditionBasedOnError`, and finally `markControllerModifyVolumeCompleted`.

For resizing the volume, it happens in these three places: `markPVCAsFSResizeRequired`, `markPVCResizeInProgress`, and `markPVCResizeFinished`.

These are the full logs with a retry-interval-start of 4 seconds on the external-resizer:

I1212 16:57:30.697020       1 event.go:389] "Event occurred" object="ebs-5935/pvc-d5jhc" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="VolumeModify" message="external resizer is modifying volume pvc-d5jhc with vac ebs-volume-tester-45bxg"
E1212 16:57:30.704249       1 controller.go:314] "Error syncing PVC" err="marking pvc \"ebs-5935/pvc-d5jhc\" as resizing failed: Mark PVC \"ebs-5935/pvc-d5jhc\" as resize as in progress failed: can't patch status of  PVC ebs-5935/pvc-d5jhc with Operation cannot be fulfilled on persistentvolumeclaims \"pvc-d5jhc\": the object has been modified; please apply your changes to the latest version and try again"
I1212 16:57:34.712984       1 event.go:389] "Event occurred" object="ebs-5935/pvc-d5jhc" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="Resizing" message="External resizer is resizing volume pvc-69a28244-127b-4c61-81e1-edaaa8ae2e51"
I1212 16:57:36.172525       1 event.go:389] "Event occurred" object="ebs-5935/pvc-d5jhc" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="VolumeModifySuccessful" message="external resizer modified volume pvc-d5jhc with vac ebs-volume-tester-45bxg successfully "
E1212 16:57:37.325585       1 controller.go:314] "Error syncing PVC" err="resize volume \"pvc-69a28244-127b-4c61-81e1-edaaa8ae2e51\" by resizer \"ebs.csi.aws.com\" failed: rpc error: code = Internal desc = Could not resize volume \"vol-078c2da88041a73f0\": rpc error: code = Internal desc = Could not modify volume \"vol-078c2da88041a73f0\": volume \"vol-078c2da88041a73f0\" in OPTIMIZING state, cannot currently modify"
I1212 16:57:37.325781       1 event.go:389] "Event occurred" object="ebs-5935/pvc-d5jhc" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeResizeFailed" message="resize volume \"pvc-69a28244-127b-4c61-81e1-edaaa8ae2e51\" by resizer \"ebs.csi.aws.com\" failed: rpc error: code = Internal desc = Could not resize volume \"vol-078c2da88041a73f0\": rpc error: code = Internal desc = Could not modify volume \"vol-078c2da88041a73f0\": volume \"vol-078c2da88041a73f0\" in OPTIMIZING state, cannot currently modify"
I1212 16:57:37.333980       1 event.go:389] "Event occurred" object="ebs-5935/pvc-d5jhc" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="Resizing" message="External resizer is resizing volume pvc-69a28244-127b-4c61-81e1-edaaa8ae2e51"
E1212 16:57:39.965349       1 controller.go:314] "Error syncing PVC" err="resize volume \"pvc-69a28244-127b-4c61-81e1-edaaa8ae2e51\" by resizer \"ebs.csi.aws.com\" failed: rpc error: code = Internal desc = Could not resize volume \"vol-078c2da88041a73f0\": rpc error: code = Internal desc = Could not modify volume \"vol-078c2da88041a73f0\": volume \"vol-078c2da88041a73f0\" in OPTIMIZING state, cannot currently modify"

.....

This affects the EBS CSI Driver when retry-interval-start is large, because the driver attempts to coalesce the ExpandVolume and ModifyVolume RPCs into one EC2 ModifyVolume call (due to AWS' 6-hour volume modification cooldown). That coalescing is not possible if one of these RPCs has to wait out retry-interval-start because of this issue.

Happy to work on this issue if we agree it is worth solving.

@xing-yang
Contributor

/assign @gnufied


ElijahQuinones commented Dec 16, 2024

As discussed in the Kubernetes CSI Implementation Team Standup, we do not want to pull a fresh PVC before each of these patches: the failure we would otherwise see is an indicator that the controller's view of the world is stale, so it is much safer to restart the reconciliation loop. Additionally, we do not want to set addResourceVersionCheck to false for these calls, because the conflict signals that we do in fact need to retry the reconciliation loop; this is intended behavior. The best step for the aws-ebs-csi-driver in this case, as discussed, is to keep retry-interval-start at 1 second to avoid this issue. Combining the resize controller and modify controller work queues was considered as a long-term solution, though this would complicate EBS CSI Driver request coalescing.
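In deployment terms, keeping the backoff base at its 1-second default might look like the following sidecar fragment (illustrative only; the image tag and container layout are assumptions, but `--retry-interval-start` is a real external-resizer flag):

```yaml
# Illustrative external-resizer sidecar fragment: pinning the backoff base
# so a lost patch race costs the losing controller at most ~1s before it
# reconciles again.
containers:
  - name: csi-resizer
    image: registry.k8s.io/sig-storage/csi-resizer:v1.12.0
    args:
      - --csi-address=$(ADDRESS)
      - --retry-interval-start=1s   # the default; avoid raising it
```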
