ws-manager: volume snapshot metric is not accurate for stops #10334
Comments
I think only the controller can watch for events, no?
Yes. If we haven't moved to ws-manager-mk2 yet, we could add another controller, volume_snapshot_controller, just like we have pod_controller, and handle the event in ws-manager monitoring. After that, notify the waiter in the finalizeWorkspaceContent function (we could probably use a Go channel for the notification); see the sketch below.
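A rough sketch (not actual ws-manager code) of that channel hand-off, assuming a hypothetical notifier type shared between the controller and the finalize path; all names here are illustrative:

```go
// Hypothetical sketch of the channel-based notification described above.
// None of these names exist in ws-manager today; they only illustrate the idea.
package notify

import (
	"context"
	"sync"
	"time"
)

// SnapshotReadyNotifier would be shared between a volume_snapshot_controller
// and the waiter in finalizeWorkspaceContent.
type SnapshotReadyNotifier struct {
	mu      sync.Mutex
	waiters map[string]chan struct{}
}

func NewSnapshotReadyNotifier() *SnapshotReadyNotifier {
	return &SnapshotReadyNotifier{waiters: map[string]chan struct{}{}}
}

// channelFor returns (creating it if needed) the channel for a snapshot name.
func (n *SnapshotReadyNotifier) channelFor(name string) chan struct{} {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch, ok := n.waiters[name]
	if !ok {
		ch = make(chan struct{})
		n.waiters[name] = ch
	}
	return ch
}

// MarkReady would be called from the controller's event handler once it sees
// status.readyToUse == true. It must be called at most once per snapshot,
// because closing an already closed channel panics.
func (n *SnapshotReadyNotifier) MarkReady(name string) {
	close(n.channelFor(name))
}

// WaitReady is what the finalize path would call instead of back-off polling.
func (n *SnapshotReadyNotifier) WaitReady(ctx context.Context, name string, timeout time.Duration) bool {
	select {
	case <-n.channelFor(name):
		return true
	case <-time.After(timeout):
		return false
	case <-ctx.Done():
		return false
	}
}
```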
@kylos101 I propose we move this task out of the durability epic and into the ws-manager-mk2 epic. I think it will make more sense to do this after mk2 is done, rather than adding potential complexity now (I think).
It impacts workspace stopping: from my testing, it generally took about 15 seconds longer to wait (from 45 seconds to 60 seconds), so the volume_snapshot metric is not accurate. However, since most users care more about start time and not losing data, this issue is an enhancement. If we plan to move to ws-manager-mk2 this year, I am okay with moving this issue out of the PVC epic.
@jenting I see, can you change the title for this issue? For example … Also, which metric is used to measure snapshot restore on workspace start? I ask because you mention this only impacts stopping... but don't we use snapshots (restore or create) in both cases (start and stop)? We don't have any concrete plans yet to move to mk2. This highlights why moving would help, but none of these are customer use cases in demand right now. What would the effort look like, time-wise, to do this in mk1? (Note: it seems like this is safe to move, but I'd like to socialize some of the above questions first.)
The metric is the volume snapshot seconds metric. Yes, we use snapshots for both workspace start and stop.
Correct me if I'm wrong, Pavel. From my point of view, the mk1 and mk2 efforts should be similar.
Is your feature request related to a problem? Please describe
When implementing #10195, we realized that the measured time for the VolumeSnapshotContent to become Ready might not be accurate because we use an exponential back-off retry mechanism.
For example, the exponential back-off retry intervals are roughly 100ms, 150ms, 225ms, 337.5ms, ..., 40 seconds, and 60 seconds.
If we watch the VolumeSnapshotContent object, we observe that it becomes ready at about 45 seconds. However, the time metric would be recorded as 60 seconds, because that is when the next exponential back-off retry fires.
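A minimal sketch of the timing artifact, assuming a back-off that roughly matches the intervals above (100ms initial duration, factor 1.5); this is not the actual ws-manager polling code:

```go
// Simulates how back-off polling rounds the observed readiness time up to the
// next retry tick. The parameters are assumptions chosen to reproduce the
// intervals listed above, not ws-manager's real configuration.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond, // first retry interval
		Factor:   1.5,                    // 100ms, 150ms, 225ms, 337.5ms, ...
		Steps:    20,
		Cap:      60 * time.Second,
	}

	start := time.Now()
	readyAt := start.Add(45 * time.Second) // pretend the snapshot becomes ready at 45s

	_ = wait.ExponentialBackoff(backoff, func() (bool, error) {
		// The real code would re-fetch the VolumeSnapshotContent and check
		// status.readyToUse; here we only simulate the timing.
		return time.Now().After(readyAt), nil
	})

	// The recorded duration lands on the first poll after the 45s mark,
	// which can be noticeably later than the actual readiness time.
	fmt.Printf("observed ready after ~%v (actual readiness at 45s)\n", time.Since(start).Round(time.Second))
}
```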
Describe the behaviour you'd like
If possible, we could change the exponential back-off mechanism's factor, or use the Kubernetes client Watch method to measure more accurately how long we wait for the VolumeSnapshotContent to become Ready (see the sketch below).
This would reduce the time spent stopping the workspace as well as give us more accurate metrics.
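A hedged sketch of the watch-based alternative, using the dynamic client so it does not depend on a particular external-snapshotter client version; the function and parameter names are placeholders:

```go
// Illustrative only: observe VolumeSnapshotContent readiness via a Watch so the
// duration is recorded when the event arrives, not on the next back-off retry.
package snapshotwatch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// WaitForSnapshotContentReady watches a single VolumeSnapshotContent
// (cluster-scoped) and returns how long it took to become ready.
func WaitForSnapshotContentReady(ctx context.Context, cfg *rest.Config, contentName string) (time.Duration, error) {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return 0, err
	}

	// snapshot.storage.k8s.io/v1 assumes a recent external-snapshotter;
	// older clusters may still serve v1beta1.
	gvr := schema.GroupVersionResource{
		Group:    "snapshot.storage.k8s.io",
		Version:  "v1",
		Resource: "volumesnapshotcontents",
	}

	start := time.Now()
	w, err := client.Resource(gvr).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + contentName,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		obj, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		ready, found, err := unstructured.NestedBool(obj.Object, "status", "readyToUse")
		if err == nil && found && ready {
			// Measured at the moment the event arrives, not at the next retry tick.
			return time.Since(start), nil
		}
	}
	return 0, fmt.Errorf("watch closed before VolumeSnapshotContent %s became ready", contentName)
}
```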
Describe alternatives you've considered
N/A
Additional context
#10195
#7901