Random delay before readyToUse=true #225
Comments
/assign @xing-yang
Is this tested with my fix #224?
Can you also check the status of the content with "kubectl describe volumesnapshotcontent"? Is ReadyToUse in the content already set to "true" while ReadyToUse in the volumesnapshot is still "false", and is there a long delay until they are in sync? I noticed this behavior early on and fixed it. If it still happens to you, I need to see logs from both the sidecar and the snapshot controller.
I just reproduced this with the canary version of the controller and the #224 version of the sidecar. During the period between when the snapshot is created and readyToUse=true on the volumesnapshot, the volumesnapshotcontent object has no status subresource at all. Here are the 2 logs from the test run:

Here is what the objects look like during the problem:

$ kubectl get volumesnapshot snap2 -o yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2019-12-31T01:29:02Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 2
  name: snap2
  namespace: default
  resourceVersion: "875"
  selfLink: /apis/snapshot.storage.k8s.io/v1beta1/namespaces/default/volumesnapshots/snap2
  uid: 144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
spec:
  source:
    persistentVolumeClaimName: pvc2
  volumeSnapshotClassName: snapshot-class
status:
  boundVolumeSnapshotContentName: snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  readyToUse: false

$ kubectl get volumesnapshotcontent snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1 -o yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotContent
metadata:
  creationTimestamp: "2019-12-31T01:29:02Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  generation: 1
  name: snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  resourceVersion: "873"
  selfLink: /apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotcontents/snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  uid: 62a724f8-8656-40ab-afd0-5b614460509a
spec:
  deletionPolicy: Delete
  driver: csi.test.net
  source:
    volumeHandle: pvc-22ff1db8-4f00-4f1a-b086-d4c9cd9cd3fe
  volumeSnapshotClassName: snapshot-class
  volumeSnapshotRef:
    apiVersion: snapshot.storage.k8s.io/v1beta1
    kind: VolumeSnapshot
    name: snap2
    namespace: default
    resourceVersion: "866"
    uid: 144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
I guess you meant "readyToUse=false" in the above paragraph? The status of the VolumeSnapshot is updated based on the status of the VolumeSnapshotContent, so what you are seeing is "in sync": the VolumeSnapshotContent is created before we try to create the physical snapshot on the storage system, to avoid leaking. So this does not look like a status-synchronization delay issue. I'll take a look at the logs.
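For illustration, a simplified sketch of the ordering described above. This is not the actual snapshot controller code; every helper here is a hypothetical, injected stand-in.

```go
// Package flowsketch illustrates the ordering described above: the content
// object is persisted before the physical snapshot is cut, and the
// VolumeSnapshot status is only filled in from the VolumeSnapshotContent
// status afterward.
package flowsketch

// Content is a hypothetical, stripped-down VolumeSnapshotContent.
type Content struct {
	Name       string
	ReadyToUse *bool
}

// syncSnapshot persists the content object first (so a failure never leaks an
// untracked snapshot on the backend), then cuts the physical snapshot, and
// only then writes the content status and mirrors it into the VolumeSnapshot.
func syncSnapshot(
	createContent func() (*Content, error), // create the VolumeSnapshotContent API object
	cutSnapshot func(*Content) (bool, error), // call the CSI driver; returns readyToUse
	updateContentStatus func(*Content, bool) error, // write content.status
	updateSnapshotStatus func(*Content, bool) error, // copy it into the VolumeSnapshot status
) error {
	content, err := createContent()
	if err != nil {
		return err
	}
	ready, err := cutSnapshot(content)
	if err != nil {
		return err
	}
	if err := updateContentStatus(content, ready); err != nil {
		return err
	}
	return updateSnapshotStatus(content, ready)
}
```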
I just mean from the time the snapshot is created until it's ready to use. In this case it appears to be 49 seconds, between 01:29:02 and 01:29:51. The relevant line in the log is line 96 of external-snapshotter.log, where the controller fails to update the status. It seems to be losing a race with something else that updates the same object, but I'm not sure what it's racing with. It's also not clear why the controller can't just retry the update immediately after losing the race.
Regarding "Also it's not clear why the controller can't just retry the update immediately after losing the race", there were some concerns raised during reviews over retries added immediately after an API object update error. The decision was to revisit this issue after 2.0 release. See this issue here: #214. It's probably not obvious what that is about. You can go to the original PR where the issue is raised and search for "requeue". Let me mention this issue in that issue so they are tracked together. |
Yeah, a retry might not be appropriate and a requeue could be the better strategy, as long as it syncs again quickly, at least in the case of losing an API race. For other error conditions that are likely to persist for some time, I think an exponential backoff would be better than automatically waiting a whole minute.
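As an illustration of that suggestion, here is a minimal sketch of a per-item exponential backoff built on client-go's workqueue package. This is assumed wiring rather than the actual controller; the 100ms base delay and one-minute cap are illustrative values, and syncFn is a hypothetical stand-in for the per-snapshot sync function.

```go
package backoffsketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newSnapshotQueue builds a rate-limited workqueue whose per-item retry delay
// grows exponentially from 100ms up to a one-minute cap.
func newSnapshotQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, time.Minute),
	)
}

// processNext drains one key from the queue. On failure the key is requeued
// with backoff; on success the failure count is reset so the next error starts
// from the short base delay again.
func processNext(q workqueue.RateLimitingInterface, syncFn func(key string) error) bool {
	item, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(item)

	key := item.(string)
	if err := syncFn(key); err != nil {
		q.AddRateLimited(key) // retry soon, backing off on repeated failures
		return true
	}
	q.Forget(key) // success: reset this key's backoff
	return true
}
```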
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@xing-yang I think this one is fixed now with the requeue logic improvements. Can we close this, or is there a remaining issue to solve?
Yes, we can close it now. Thanks!
My CSI plugin always returns readyToUse=true, because it simply blocks in CreateSnapshot() until the snapshot is created (typically 1 second or less). Usually the volumesnapshot object in k8s reflects readyToUse=true immediately, but with some randomness it sometimes shows up as readyToUse=false and only gets corrected after about a minute.
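For illustration, here is a minimal sketch of the driver-side pattern described above (blocking in CreateSnapshot() until the backend has finished, then reporting ReadyToUse: true). It is not the author's actual plugin; the server type and the waitForBackendSnapshot helper are hypothetical.

```go
package driversketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// server is a hypothetical CSI controller server.
type server struct{}

// waitForBackendSnapshot is a hypothetical helper that blocks until the
// storage backend has finished cutting the snapshot, then returns its ID and size.
func waitForBackendSnapshot(ctx context.Context, sourceVolumeID, name string) (string, int64, error) {
	// ... talk to the storage backend here ...
	return "snap-" + name, 0, nil
}

// CreateSnapshot blocks until the snapshot is fully created, so the response
// can always report ReadyToUse: true.
func (s *server) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error) {
	snapID, sizeBytes, err := waitForBackendSnapshot(ctx, req.GetSourceVolumeId(), req.GetName())
	if err != nil {
		return nil, err
	}
	return &csi.CreateSnapshotResponse{
		Snapshot: &csi.Snapshot{
			SnapshotId:     snapID,
			SourceVolumeId: req.GetSourceVolumeId(),
			SizeBytes:      sizeBytes,
			ReadyToUse:     true, // the snapshot is complete by the time we return
		},
	}, nil
}
```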
Here is a log that illustrates this happening:
external-snapshotter.log
Notice at 05:10:00, the CreateSnapshot() RPC returns success with readyToUse=true. However, there's an error on line 96 of the log:
snapshot_controller.go:325] error updating volume snapshot content status for snapshot snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b: snapshot controller failed to update snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b": the object has been modified; please apply your changes to the latest version and try again.
48 seconds later, the controller retries, and successfully updates the object.
I have two issues with this behavior: (1) why was readyToUse ever set to false, if the CreateSnapshot() RPC returned readyToUse=true on the first try? And (2) the long wait before retrying seems unnecessary, because this is just an API race with something else modifying the same snapshotcontent object; we could retry the update right after the error, or requeue the operation to run very soon afterward, instead of waiting. 48 seconds is a long time to wait in an automated sequence of steps that is waiting for the snapshot to be usable.
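To make that concrete, here is a minimal sketch that treats a lost API race specially: a conflict error is requeued almost immediately via the workqueue's AddAfter, while other errors go through rate-limited, backed-off requeueing. This is assumed wiring rather than the actual controller, and the 100ms delay is an illustrative value.

```go
package conflictsketch

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/workqueue"
)

// handleSyncError requeues a snapshot key after a failed sync. A conflict
// ("the object has been modified; please apply your changes to the latest
// version and try again") just means we lost an API race, so the key is
// requeued after a short delay instead of waiting for the next full resync;
// other errors go through the queue's rate limiter and back off.
func handleSyncError(q workqueue.RateLimitingInterface, key string, err error) {
	if apierrors.IsConflict(err) {
		q.AddAfter(key, 100*time.Millisecond) // lost race: retry the update almost immediately
		return
	}
	q.AddRateLimited(key)
}
```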