ws-manager: volume snapshot metric is not accurate for stops #10334

Closed
Tracked by #7901
jenting opened this issue May 30, 2022 · 7 comments · Fixed by #10820
Assignees
Labels
team: workspace Issue belongs to the Workspace team

Comments

@jenting (Contributor) commented May 30, 2022

Is your feature request related to a problem? Please describe

When implementing #10195, we realized that the time we measure for the VolumeSnapshotContent to become Ready might not be accurate, because we use an exponential back-off retry mechanism.

For example, the exponential back-off retry periods are roughly 100 ms, 150 ms, 225 ms, 337.5 ms, ..., 40 s, and 60 s.
If we watch the VolumeSnapshotContent object directly, we observe that it becomes ready at about 45 seconds. However, the time metric for the VolumeSnapshotContent would be recorded as 60 seconds, because that is when the next round of the exponential back-off retry is triggered.
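
To illustrate the quantization, here is a minimal sketch using wait.ExponentialBackoff from k8s.io/apimachinery; the exact back-off parameters ws-manager uses may differ, and the 45-second readiness point is simulated:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Back-off roughly matching the periods above: 100ms * 1.5^n, capped at 60s.
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond,
		Factor:   1.5,
		Steps:    30,
		Cap:      60 * time.Second,
	}

	start := time.Now()
	readyAt := 45 * time.Second // pretend the object actually becomes Ready here

	_ = wait.ExponentialBackoff(backoff, func() (bool, error) {
		// The condition only runs at retry boundaries, so the measured
		// duration snaps to the first check after 45s, not to 45s itself.
		return time.Since(start) >= readyAt, nil
	})
	fmt.Println("measured:", time.Since(start)) // well past the actual 45s
}
```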

Describe the behaviour you'd like

If possible, we could change the exponential back-off mechanism's factor, or use the Kubernetes client's Watch method to measure more accurately how long we wait for the VolumeSnapshotContent to become Ready.

It would reduce the time spent stopping the workspace as well as give us more accurate metrics.
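
As an illustration, a watch-based wait might look roughly like the sketch below. It assumes the external-snapshotter clientset; the function and variable names are hypothetical, not ws-manager's actual code:

```go
package snapshotwait

import (
	"context"
	"fmt"
	"time"

	volumesnapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	snapclientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// waitForContentReady blocks until the named VolumeSnapshotContent reports
// status.readyToUse=true, returning how long that actually took.
func waitForContentReady(ctx context.Context, client snapclientset.Interface, name string) (time.Duration, error) {
	start := time.Now()
	w, err := client.SnapshotV1().VolumeSnapshotContents().Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return 0, fmt.Errorf("watch closed before %s became ready", name)
			}
			vsc, ok := ev.Object.(*volumesnapshotv1.VolumeSnapshotContent)
			if !ok {
				continue
			}
			if vsc.Status != nil && vsc.Status.ReadyToUse != nil && *vsc.Status.ReadyToUse {
				// The duration reflects the actual readiness event,
				// not the next back-off boundary.
				return time.Since(start), nil
			}
		}
	}
}
```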

Describe alternatives you've considered

N/A

Additional context

#10195
#7901

@jenting jenting added the team: workspace Issue belongs to the Workspace team label May 30, 2022
@jenting jenting changed the title Enhance the way we wait for the VolumeSnapshotContent becomes Ready [ws-manager] Enhance the way we wait for the VolumeSnapshotContent becomes Ready May 30, 2022
@sagor999 (Contributor)

I think only a controller can watch for events, no?
So this is something that ws-manager mk2 would allow us to do. Correct me if I am wrong here? @jenting

@jenting (Contributor, Author) commented Jun 21, 2022

I think only a controller can watch for events, no? So this is something that ws-manager mk2 would allow us to do. Correct me if I am wrong here? @jenting

Yes. While we haven't moved to ws-manager-mk2, we could add another controller, volume_snapshot_controller, just like the existing pod_controller, and handle the event in ws-manager's monitoring. After that, it would notify the waiter in the finalizeWorkspaceContent function (we could probably use a Go channel for the notification).
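
A minimal sketch of that channel-based hand-off, assuming a controller event handler calls NotifyReady when status.readyToUse flips to true (all names here are hypothetical):

```go
package snapshotctrl

import (
	"context"
	"sync"
)

// SnapshotWaiter lets finalizeWorkspaceContent block until the
// volume snapshot controller observes readiness.
type SnapshotWaiter struct {
	mu      sync.Mutex
	waiters map[string]chan struct{} // keyed by VolumeSnapshot name
}

func NewSnapshotWaiter() *SnapshotWaiter {
	return &SnapshotWaiter{waiters: make(map[string]chan struct{})}
}

// Wait returns a channel that is closed once the named VolumeSnapshot
// is observed ready.
func (s *SnapshotWaiter) Wait(name string) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch, ok := s.waiters[name]
	if !ok {
		ch = make(chan struct{})
		s.waiters[name] = ch
	}
	return ch
}

// NotifyReady is called from the controller's event handler when
// status.readyToUse flips to true. Closing the channel wakes all waiters.
func (s *SnapshotWaiter) NotifyReady(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ch, ok := s.waiters[name]; ok {
		close(ch)
		delete(s.waiters, name)
	}
}

// waitExample shows how the waiter side would block on readiness.
func waitExample(ctx context.Context, s *SnapshotWaiter, name string) error {
	select {
	case <-s.Wait(name):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```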

@sagor999 (Contributor)

@kylos101 I propose to move this task out of the durability epic and instead move it into the ws-manager-mk2 epic. I think it will make more sense to do that after mk2 is done, rather than adding potential complexity now (I think).

@kylos101 (Contributor)

@sagor999 why move? I ask because @jenting created it while working on the original epic.

@jenting what do you think of deferring this work? I see you created the issue. What is the impact to users if we wait to do this? How will they be negatively impacted?

@jenting (Contributor, Author) commented Jun 21, 2022

It impacts workspace stopping: from my testing, it generally took 15 more seconds to wait (from 45 seconds to 60 seconds), so the volume_snapshot metric is not accurate.

However, since most users care more about the start time and about not losing data, this issue is an enhancement.

If we plan to move to ws-manager-mk2 this year, I am okay with moving this issue out of the PVC epic.

@kylos101 (Contributor)

@jenting I see, can you change the title of this issue? For example, [ws-manager] volume snapshot metric is not accurate for stops would be more accurate.

Also, which metric is used to measure snapshot restore on workspace start? I ask because you mention this only impacts stopping... but don't we use snapshots (restore or create) in both cases (start and stop)?

We don't have any concrete plans yet to move to mk2. This highlights why moving would help, but none of these are customer use cases that are in demand now.

What would the effort be like, time-wise, to do this in mk1?
What about in mk2?
I'm trying to understand the trade-offs.
As @sagor999 mentioned, with mk1 it'll be more complex (I assume both to implement and to support). How much easier would mk2 be?

(note: it seems like this is safe to move, but I'd like to socialize some of the above questions first)

@jenting jenting changed the title [ws-manager] Enhance the way we wait for the VolumeSnapshotContent becomes Ready [ws-manager] volume snapshot metric is not accurate for stops Jun 22, 2022
@jenting (Contributor, Author) commented Jun 22, 2022

Also, which metric is used to measure snapshot restore on workspace start? I ask because you mention this only impacts stopping... but don't we use snapshots (restore or create) in both cases (start and stop)?

The metric for volume snapshot seconds is gitpod_ws_manager_volume_snapshot_seconds.

Yes, we use snapshots for both workspace start and stop (a rough sketch of how such duration metrics are recorded follows the list):

  • workspace start (if a volume snapshot is present): the metric name is gitpod_ws_manager_volume_restore_seconds.
  • workspace stop: the metric name is gitpod_ws_manager_volume_snapshot_seconds.
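
For reference, recording such a duration histogram with the Prometheus Go client might look roughly like this; only the metric name is from ws-manager, while the buckets and registration details are assumptions:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var volumeSnapshotSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "gitpod_ws_manager_volume_snapshot_seconds",
	Help:    "Time until a volume snapshot becomes ready on workspace stop.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 8), // 1s..128s (assumed)
})

func init() {
	prometheus.MustRegister(volumeSnapshotSeconds)
}

// observeSnapshot records how long the snapshot took; with back-off
// polling this value is quantized to retry boundaries, with a watch
// it reflects the actual readiness event.
func observeSnapshot(start time.Time) {
	volumeSnapshotSeconds.Observe(time.Since(start).Seconds())
}
```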

Correct me if I'm wrong, Pavel. From my point of view, the mk1 and mk2 efforts should be similar:

  • Add a new volume snapshot controller to watch the VolumeSnapshot.
  • When the VolumeSnapshot is ready, notify the waiter so the finalizer process can continue (see the sketch after this list).
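
A rough controller-runtime sketch of those two steps, reusing the hypothetical SnapshotWaiter from the earlier comment (mk2 is expected to be controller-based, but none of this is the actual mk2 code):

```go
package snapshotctrl

import (
	"context"

	volumesnapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// VolumeSnapshotReconciler watches VolumeSnapshot objects and unblocks
// the finalizer's waiter as soon as one becomes ready.
type VolumeSnapshotReconciler struct {
	client.Client
	Waiter *SnapshotWaiter // the channel-based waiter sketched earlier
}

func (r *VolumeSnapshotReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var vs volumesnapshotv1.VolumeSnapshot
	if err := r.Get(ctx, req.NamespacedName, &vs); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Reconcile fires on every status change, so readiness is observed
	// when it happens rather than at the next back-off boundary.
	if vs.Status != nil && vs.Status.ReadyToUse != nil && *vs.Status.ReadyToUse {
		r.Waiter.NotifyReady(vs.Name)
	}
	return ctrl.Result{}, nil
}

func (r *VolumeSnapshotReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&volumesnapshotv1.VolumeSnapshot{}).
		Complete(r)
}
```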

@jenting jenting changed the title [ws-manager] volume snapshot metric is not accurate for stops ws-manager: volume snapshot metric is not accurate for stops Jun 22, 2022
@jenting jenting self-assigned this Jun 22, 2022
@jenting jenting moved this to In Progress in 🌌 Workspace Team Jun 22, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team Jun 23, 2022