Random delay before readyToUse=true #225
Comments
/assign @xing-yang
Is this tested with my fix #224?
Can you also check the status of the content with "kubectl describe volumesnapshotcontent"? Is ReadyToUse in the content already set to "true" while ReadyToUse in the volumesnapshot is still "false", and is there a long delay until they are in sync? I noticed this behavior early on and fixed it. If it still happens to you, I need to see logs from both the sidecar and the snapshot controller.
I just reproduced this with the canary version of the controller and the #224 version of the sidecar. During the period between when the snapshot is created and readyToUse=true on the volumesnapshot, the volumesnapshotcontent object has no status subresource at all. Here are the 2 logs from the test run:

Here is what the objects look like during the problem:

$ kubectl get volumesnapshot snap2 -o yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2019-12-31T01:29:02Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 2
  name: snap2
  namespace: default
  resourceVersion: "875"
  selfLink: /apis/snapshot.storage.k8s.io/v1beta1/namespaces/default/volumesnapshots/snap2
  uid: 144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
spec:
  source:
    persistentVolumeClaimName: pvc2
  volumeSnapshotClassName: snapshot-class
status:
  boundVolumeSnapshotContentName: snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  readyToUse: false

$ kubectl get volumesnapshotcontent snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1 -o yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotContent
metadata:
  creationTimestamp: "2019-12-31T01:29:02Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  generation: 1
  name: snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  resourceVersion: "873"
  selfLink: /apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotcontents/snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  uid: 62a724f8-8656-40ab-afd0-5b614460509a
spec:
  deletionPolicy: Delete
  driver: csi.test.net
  source:
    volumeHandle: pvc-22ff1db8-4f00-4f1a-b086-d4c9cd9cd3fe
  volumeSnapshotClassName: snapshot-class
  volumeSnapshotRef:
    apiVersion: snapshot.storage.k8s.io/v1beta1
    kind: VolumeSnapshot
    name: snap2
    namespace: default
    resourceVersion: "866"
    uid: 144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
I guess you meant "readyToUse=false" in the above paragraph? The status of the VolumeSnapshot is updated based on the status of the VolumeSnapshotContent, so what you are seeing is "in sync": the VolumeSnapshotContent is created before we try to create the physical snapshot on the storage system, to avoid leaking. So this does not look like a status-synchronization delay issue. I'll take a look at the logs.
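For illustration, a simplified sketch of the ordering described above. This is not the actual snapshot controller code; every helper here is a hypothetical, injected stand-in.

```go
// Package flowsketch illustrates the ordering described above: the content
// object is persisted before the physical snapshot is cut, and the
// VolumeSnapshot status is only filled in from the VolumeSnapshotContent
// status afterward.
package flowsketch

// Content is a hypothetical, stripped-down VolumeSnapshotContent.
type Content struct {
	Name       string
	ReadyToUse *bool
}

// syncSnapshot persists the content object first (so a failure never leaks an
// untracked snapshot on the backend), then cuts the physical snapshot, and
// only then writes the content status and mirrors it into the VolumeSnapshot.
func syncSnapshot(
	createContent func() (*Content, error), // create the VolumeSnapshotContent API object
	cutSnapshot func(*Content) (bool, error), // call the CSI driver; returns readyToUse
	updateContentStatus func(*Content, bool) error, // write content.status
	updateSnapshotStatus func(*Content, bool) error, // copy it into the VolumeSnapshot status
) error {
	content, err := createContent()
	if err != nil {
		return err
	}
	ready, err := cutSnapshot(content)
	if err != nil {
		return err
	}
	if err := updateContentStatus(content, ready); err != nil {
		return err
	}
	return updateSnapshotStatus(content, ready)
}
```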
I just mean from the time the snapshot is created until it's ready to use. In this case it appears to be 49 seconds, between 01:29:02 and 01:29:51. The relevant line in the log is line 96 of external-snapshotter.log, where the controller fails to update the status. It seems to be losing a race with something else that updates the same object, but I'm not sure what it's racing with. It's also not clear why the controller can't just retry the update immediately after losing the race.
Regarding "Also it's not clear why the controller can't just retry the update immediately after losing the race", there were some concerns raised during reviews over retries added immediately after an API object update error. The decision was to revisit this issue after 2.0 release. See this issue here: #214. It's probably not obvious what that is about. You can go to the original PR where the issue is raised and search for "requeue". Let me mention this issue in that issue so they are tracked together. |
Yeah, a retry might not be appropriate and a requeue could be the better strategy, as long as it syncs again quickly, at least in the case of losing an API race. For other error conditions that are likely to persist for some time, I think an exponential backoff would be better than automatically waiting a whole minute.
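As an illustration of that suggestion, here is a minimal sketch of a per-item exponential backoff built on client-go's workqueue package. This is assumed wiring rather than the actual controller; the 100ms base delay and one-minute cap are illustrative values, and syncFn is a hypothetical stand-in for the per-snapshot sync function.

```go
package backoffsketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newSnapshotQueue builds a rate-limited workqueue whose per-item retry delay
// grows exponentially from 100ms up to a one-minute cap.
func newSnapshotQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, time.Minute),
	)
}

// processNext drains one key from the queue. On failure the key is requeued
// with backoff; on success the failure count is reset so the next error starts
// from the short base delay again.
func processNext(q workqueue.RateLimitingInterface, syncFn func(key string) error) bool {
	item, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(item)

	key := item.(string)
	if err := syncFn(key); err != nil {
		q.AddRateLimited(key) // retry soon, backing off on repeated failures
		return true
	}
	q.Forget(key) // success: reset this key's backoff
	return true
}
```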
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@xing-yang I think this one is fixed now with the requeue logic improvements. Can we close this, or is there a remaining issue to solve?
Yes, we can close it now. Thanks!
My CSI plugin always returns readyToUse=true, because it simply blocks in CreateSnapshot() until the snapshot is created (typically 1 second or less). Usually the volumesnapshot object in k8s reflects readyToUse=true immediately, but with some randomness it sometimes shows up as readyToUse=false and only gets corrected after about a minute.
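For illustration, here is a minimal sketch of the driver-side pattern described above (blocking in CreateSnapshot() until the backend has finished, then reporting ReadyToUse: true). It is not the author's actual plugin; the server type and the waitForBackendSnapshot helper are hypothetical.

```go
package driversketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// server is a hypothetical CSI controller server.
type server struct{}

// waitForBackendSnapshot is a hypothetical helper that blocks until the
// storage backend has finished cutting the snapshot, then returns its ID and size.
func waitForBackendSnapshot(ctx context.Context, sourceVolumeID, name string) (string, int64, error) {
	// ... talk to the storage backend here ...
	return "snap-" + name, 0, nil
}

// CreateSnapshot blocks until the snapshot is fully created, so the response
// can always report ReadyToUse: true.
func (s *server) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error) {
	snapID, sizeBytes, err := waitForBackendSnapshot(ctx, req.GetSourceVolumeId(), req.GetName())
	if err != nil {
		return nil, err
	}
	return &csi.CreateSnapshotResponse{
		Snapshot: &csi.Snapshot{
			SnapshotId:     snapID,
			SourceVolumeId: req.GetSourceVolumeId(),
			SizeBytes:      sizeBytes,
			ReadyToUse:     true, // the snapshot is complete by the time we return
		},
	}, nil
}
```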
Here is a log that illustrates this happening:
external-snapshotter.log
Notice at 05:10:00, the CreateSnapshot() RPC returns success with readyToUse=true. However, there's an error on line 96 of the log:
snapshot_controller.go:325] error updating volume snapshot content status for snapshot snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b: snapshot controller failed to update snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b": the object has been modified; please apply your changes to the latest version and try again.
48 seconds later, the controller retries, and successfully updates the object.
I have two issues with this behavior: (1) why was readyToUse ever set to false, if the CreateSnapshot() RPC returned readyToUse=true on the first try? And (2) the long wait before retrying seems unnecessary, because this is just an API race with something else modifying the same snapshotcontent object; we could retry the update right after the error, or requeue the operation to run very soon afterward, instead of waiting. 48 seconds is a long time to wait in an automated sequence of steps that is waiting for the snapshot to be usable.
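To make that concrete, here is a minimal sketch that treats a lost API race specially: a conflict error is requeued almost immediately via the workqueue's AddAfter, while other errors go through rate-limited, backed-off requeueing. This is assumed wiring rather than the actual controller, and the 100ms delay is an illustrative value.

```go
package conflictsketch

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/workqueue"
)

// handleSyncError requeues a snapshot key after a failed sync. A conflict
// ("the object has been modified; please apply your changes to the latest
// version and try again") just means we lost an API race, so the key is
// requeued after a short delay instead of waiting for the next full resync;
// other errors go through the queue's rate limiter and back off.
func handleSyncError(q workqueue.RateLimitingInterface, key string, err error) {
	if apierrors.IsConflict(err) {
		q.AddAfter(key, 100*time.Millisecond) // lost race: retry the update almost immediately
		return
	}
	q.AddRateLimited(key)
}
```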