
Random delay before readyToUse=true #225

Closed
bswartz opened this issue Dec 30, 2019 · 14 comments
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@bswartz
Contributor

bswartz commented Dec 30, 2019

My CSI plugin always returns readyToUse=true, because it simply blocks in CreateSnapshot() until the snapshot is created (typically 1 second or less). Usually the volumesnapshot object in k8s reflects readyToUse=true immediately, but sometimes, seemingly at random, it shows up as readyToUse=false and only gets corrected after about a minute.
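
For illustration, here's a minimal sketch of a CreateSnapshot() handler that behaves this way; the Driver struct and backendClient interface are hypothetical placeholders, not my actual driver code:

```go
// Illustrative sketch only: a CreateSnapshot() that blocks until the backend
// snapshot exists and always reports ReadyToUse=true. The backendClient
// interface and Driver struct are hypothetical placeholders.
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"github.com/golang/protobuf/ptypes"
)

// backendClient is a hypothetical stand-in for the storage backend API.
type backendClient interface {
	// TakeSnapshot blocks until the snapshot exists on the storage system
	// (typically under a second) and returns its ID and size.
	TakeSnapshot(ctx context.Context, volumeID, name string) (id string, sizeBytes int64, err error)
}

type Driver struct {
	backend backendClient
}

func (d *Driver) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error) {
	// Block until the snapshot is fully created; there is no async cutting phase.
	id, size, err := d.backend.TakeSnapshot(ctx, req.GetSourceVolumeId(), req.GetName())
	if err != nil {
		return nil, err
	}
	return &csi.CreateSnapshotResponse{
		Snapshot: &csi.Snapshot{
			SnapshotId:     id,
			SourceVolumeId: req.GetSourceVolumeId(),
			CreationTime:   ptypes.TimestampNow(),
			SizeBytes:      size,
			// The snapshot is already usable by the time we return, so the
			// driver never reports ReadyToUse=false.
			ReadyToUse: true,
		},
	}, nil
}
```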

Here is a log that illustrates this happening:
external-snapshotter.log

Notice that at 05:10:00, the CreateSnapshot() RPC returns success with readyToUse=true. However, there's an error on line 96 of the log:

snapshot_controller.go:325] error updating volume snapshot content status for snapshot snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b: snapshot controller failed to update snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b": the object has been modified; please apply your changes to the latest version and try again.

48 seconds later, the controller retries, and successfully updates the object.

I have two issues with this behavior: (1) Why was readyToUse ever set to false if the CreateSnapshot() RPC returned readyToUse=true on the first try? (2) The long wait before retrying seems unnecessary, because this is just an API race with something else modifying the same volumesnapshotcontent object. We could retry the update right after the error, or requeue the operation to run very soon afterward instead of waiting. 48 seconds is a long time to wait in an automated sequence of steps that's waiting for the snapshot to become usable.
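
To make (2) concrete, here is a rough sketch of the kind of immediate retry I have in mind, using client-go's retry.RetryOnConflict; the getContent/updateContentStatus parameters and the types below are simplified stand-ins, not the controller's actual clientset code:

```go
// Sketch only: retry a VolumeSnapshotContent status update immediately when
// the API server returns 409 Conflict ("the object has been modified"),
// instead of requeuing and waiting for a later resync. The getContent and
// updateContentStatus parameters are hypothetical stand-ins for the
// generated snapshot clientset calls, and the types below are simplified.
package controller

import (
	"k8s.io/client-go/util/retry"
)

type VolumeSnapshotContentStatus struct {
	ReadyToUse *bool
}

type VolumeSnapshotContent struct {
	Name   string
	Status *VolumeSnapshotContentStatus
}

func setContentReady(
	name string,
	getContent func(name string) (*VolumeSnapshotContent, error),
	updateContentStatus func(*VolumeSnapshotContent) error,
) error {
	ready := true
	// RetryOnConflict re-runs the closure (with short default backoff steps)
	// only while the returned error is a conflict, so a lost API race is
	// resolved within milliseconds rather than after a long requeue delay.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the object to pick up the latest resourceVersion.
		content, err := getContent(name)
		if err != nil {
			return err
		}
		content.Status = &VolumeSnapshotContentStatus{ReadyToUse: &ready}
		return updateContentStatus(content)
	})
}
```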

@xing-yang
Collaborator

/assign @xing-yang

@xing-yang
Collaborator

Was this tested with my fix #224?
Can you provide the corresponding logs from the snapshot-controller?

@xing-yang
Collaborator

Can you also check the status of the content ("kubectl describe volumesnapshotcontent")? Is ReadyToUse in the content already set to "true" while ReadyToUse in the volumesnapshot is still "false"? Is there a long delay until they are in sync? I noticed this behavior early on and fixed it. If it still happens to you, I need to see logs from both the sidecar and the snapshot controller.

@bswartz
Contributor Author

bswartz commented Dec 31, 2019

I just reproduced this with the canary version of the controller and the #224 version of the sidecar. During the period between when the snapshot is created and when readyToUse=true appears on the volumesnapshot, the volumesnapshotcontent object has no status subresource at all.

Here are the 2 logs from the test run:
snapshot-controller.log
external-snapshotter.log

Here is what the objects look like during the problem:

$ kubectl get volumesnapshot snap2 -o yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2019-12-31T01:29:02Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 2
  name: snap2
  namespace: default
  resourceVersion: "875"
  selfLink: /apis/snapshot.storage.k8s.io/v1beta1/namespaces/default/volumesnapshots/snap2
  uid: 144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
spec:
  source:
    persistentVolumeClaimName: pvc2
  volumeSnapshotClassName: snapshot-class
status:
  boundVolumeSnapshotContentName: snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  readyToUse: false
$ kubectl get volumesnapshotcontent snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1 -o yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotContent
metadata:
  creationTimestamp: "2019-12-31T01:29:02Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  generation: 1
  name: snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  resourceVersion: "873"
  selfLink: /apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotcontents/snapcontent-144ed5ae-cffd-4fa3-975c-abe4bbfdfac1
  uid: 62a724f8-8656-40ab-afd0-5b614460509a
spec:
  deletionPolicy: Delete
  driver: csi.test.net
  source:
    volumeHandle: pvc-22ff1db8-4f00-4f1a-b086-d4c9cd9cd3fe
  volumeSnapshotClassName: snapshot-class
  volumeSnapshotRef:
    apiVersion: snapshot.storage.k8s.io/v1beta1
    kind: VolumeSnapshot
    name: snap2
    namespace: default
    resourceVersion: "866"
    uid: 144ed5ae-cffd-4fa3-975c-abe4bbfdfac1

@xing-yang
Collaborator

I just reproduced this with the canary version of the controller and the #224 version of the sidecar. During the period between when the snapshot is created and when readyToUse=true appears on the volumesnapshot, the volumesnapshotcontent object has no status subresource at all.

I guess you meant "readyToUse=false" in the above paragraph?

The status of the VolumeSnapshot is updated based on the status of the VolumeSnapshotContent. So what you are seeing is "in sync", since the VolumeSnapshotContent is created before we try to create the physical snapshot on the storage system, to avoid leaking. So it looks like this is not a status-synchronization delay issue. I'll take a look at the logs.

@bswartz
Contributor Author

bswartz commented Dec 31, 2019

I just mean the time from when the snapshot is created until it's ready to use. In this case it appears to be 49 seconds, between 01:29:02 and 01:29:51.

The relevant line in the log is line 96 of the external-snapshotter.log, where the controller fails to update the status. It seems to be losing a race with something else that updates the same object, but I'm not sure what it's racing with. Also it's not clear why the controller can't just retry the update immediately after losing the race.

@xing-yang
Collaborator

Regarding "it's not clear why the controller can't just retry the update immediately after losing the race": some concerns were raised during reviews about retries added immediately after an API object update error. The decision was to revisit this after the 2.0 release; see #214. It's probably not obvious from that issue what it is about, so you can go to the original PR where it was raised and search for "requeue". Let me mention this issue there so they are tracked together.

@bswartz
Contributor Author

bswartz commented Dec 31, 2019

Yeah, a retry might not be appropriate and a requeue could be the better strategy, as long as it syncs again quickly, at least in the case of losing an API race.

For other error conditions that are likely to persist for some time, I think an exponential backoff would be better than automatically waiting a whole minute.
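
To sketch what I mean, something like client-go's per-item exponential failure rate limiter would retry quickly at first and only back off if the error keeps recurring; the delay values here are just illustrative, not a concrete proposal:

```go
// Sketch: requeue failed syncs with per-item exponential backoff instead of a
// fixed long delay. The delay values are illustrative only.
package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

func newSnapshotContentQueue() workqueue.RateLimitingInterface {
	// First retry after ~100ms, doubling on each subsequent failure, capped at 1 minute.
	rateLimiter := workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, time.Minute)
	return workqueue.NewNamedRateLimitingQueue(rateLimiter, "snapshot-contents")
}

func finishSync(queue workqueue.RateLimitingInterface, key string, syncErr error) {
	// Done must be called once processing of an item obtained via Get() finishes.
	defer queue.Done(key)
	if syncErr != nil {
		// Transient failure (e.g. a lost API race): retry soon, backing off
		// exponentially only if the error persists.
		queue.AddRateLimited(key)
		return
	}
	// Success: reset this key's failure count so a future failure retries quickly again.
	queue.Forget(key)
}
```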

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2020
@xing-yang
Collaborator

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 29, 2020
@bswartz
Contributor Author

bswartz commented Jul 29, 2020

@xing-yang I think this one is fixed now with the requeue logic improvements. Can we close this, or is there a remaining issue to solve?

@xing-yang
Collaborator

Yes, we can close it now. Thanks!
