[ws-manager] observe when backup or content restore fails #10330

Closed
kylos101 opened this issue May 29, 2022 · 6 comments · Fixed by #10342
Labels
team: workspace Issue belongs to the Workspace team

Comments

@kylos101
Contributor

Is your feature request related to a problem? Please describe

We lack the ability to track failures for backups and content restore.

Describe the behaviour you'd like

Add metrics so that we can observe trends with backup and content restore success and failure. Perhaps counters? Four in total:

  1. backup success
  2. backup failure
  3. restore success
  4. restore failure
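As a rough illustration of the four counters listed above, here is a minimal in-memory sketch. It is not ws-manager code: `ContentMetrics`, `Operation`, `Outcome`, and `run_and_observe` are hypothetical names, and a real implementation would presumably register the counters with whatever metrics library the component already uses.

```python
from collections import Counter
from enum import Enum


class Operation(str, Enum):
    BACKUP = "backup"
    RESTORE = "restore"


class Outcome(str, Enum):
    SUCCESS = "success"
    FAILURE = "failure"


class ContentMetrics:
    """Hypothetical stand-in for four monotonically increasing counters."""

    def __init__(self) -> None:
        # One counter per (operation, outcome) pair, four pairs in total.
        self._counts: Counter = Counter()

    def observe(self, operation: Operation, outcome: Outcome) -> None:
        self._counts[(operation, outcome)] += 1

    def value(self, operation: Operation, outcome: Outcome) -> int:
        return self._counts[(operation, outcome)]


def run_and_observe(metrics: ContentMetrics, operation: Operation, action):
    """Run action(), record success or failure, and re-raise any error."""
    try:
        result = action()
    except Exception:
        metrics.observe(operation, Outcome.FAILURE)
        raise
    metrics.observe(operation, Outcome.SUCCESS)
    return result
```

Keying one counter by operation and outcome, rather than declaring four separate counters, is only a design choice for the sketch; either form yields the same four series.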

Describe alternatives you've considered

Consult with @sagor999 or @jenting; they are working on the durability epic (PVC) and may have alternative ideas.

Additional context

We added metrics to time content init and finalize via #9355

We lack metrics to track content init or finalize failures, aside from seeing at a high level that a workspace start or stop failed, without necessarily knowing if it was related to content.

@kylos101 kylos101 added the team: workspace Issue belongs to the Workspace team label May 29, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team May 29, 2022
@kylos101
Contributor Author

@jenting Pavel is out on vacation. Can you think of an alternative for this to help us observe backup and restore success and failure?

@jenting
Contributor

jenting commented May 30, 2022

Roger that.

@jenting jenting self-assigned this May 30, 2022
@atduarte
Contributor

atduarte commented May 30, 2022

For context, this issue arose from observing this case. The main goal is to create a metric that represents data loss.

Would these metrics catch the following data loss scenario?

  1. Workspace starts and stops, backup A is done 👍
  2. Workspace restarts, loads backup A, node goes down, no backup is started 👎
  3. Workspace restarts, loads backup A instead of the actual latest changes. 👎

@kylos101
Contributor Author

kylos101 commented Jun 1, 2022

> For context, this issue arose from observing this case. The main goal is to create a metric that represents data loss.
>
> Would these metrics catch the following data loss scenario?
>
>   1. Workspace starts and stops, backup A is done 👍
>   2. Workspace restarts, loads backup A, node goes down, no backup is started 👎
>   3. Workspace restarts, loads backup A instead of the actual latest changes. 👎

Implementing counter metrics for backup and restore success and failure should allow us to compare:

  1. Successful workspace starts minus successful workspace backups, to find missing backups.
  2. Same as 1 (a restart)
  3. This is an interesting scenario, @atduarte can you share a more concrete example timeline so we can discuss with @jenting?
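To make the comparison in item 1 concrete, here is a small sketch that replays the quoted scenario as a list of events and computes the gap. The event names, the `missing_backups` helper, and the `still_running` adjustment are all assumptions for illustration, not existing metrics.

```python
def missing_backups(successful_starts: int,
                    successful_backups: int,
                    still_running: int = 0) -> int:
    """Starts that ended without a successful backup.

    Workspaces that are still running have not had a chance to back up yet,
    so they are subtracted (an assumption of this sketch). Never negative.
    """
    return max(successful_starts - successful_backups - still_running, 0)


# Hypothetical event log for the three-step scenario quoted above.
events = [
    "start_success",   # step 1: workspace starts
    "backup_success",  # step 1: workspace stops, backup A is done
    "start_success",   # step 2: workspace restarts and loads backup A
    # step 2: node goes down, so no backup event is ever emitted
    "start_success",   # step 3: workspace restarts from the stale backup A
]

starts = events.count("start_success")
backups = events.count("backup_success")

# The workspace from step 3 is assumed to be still running.
gap = missing_backups(starts, backups, still_running=1)
print(f"starts={starts} backups={backups} missing_backups={gap}")
```

In this replay no backup was ever attempted in step 2, so a backup failure counter on its own would stay at zero; only the difference between starts and backups exposes the missing backup.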

@atduarte
Contributor

atduarte commented Jun 1, 2022

> Same as #1 (a restart)

Wrong link?

> This is an interesting scenario, @atduarte can you share a more concrete example timeline so we can discuss with @jenting?

Do you mean an example where it happened?

@kylos101
Contributor Author

kylos101 commented Jun 2, 2022

> > Same as #1 (a restart)
>
> Wrong link?

Yes, wrong link, 1 as in your first use case (I removed the link).

> > This is an interesting scenario, @atduarte can you share a more concrete example timeline so we can discuss with @jenting?
>
> Do you mean an example where it happened?

I mean a numbered list of all the steps leading to the outcome you want to measure: the initial workspace start, some number of restarts, and then a failure. In other words, it looked like you were combining the three use cases above, but I was having trouble following.

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Jun 6, 2022
3 participants