Add a Grafana plot to monitor disk usage on the home directory #1119

choldgraf · 2022-03-15T22:19:33Z

Background and proposal

In #1081 we had a hub outage because the cluster had run out of disk space, causing user launches to fail.

Running out of disk space is a common concern for our hubs, and we should set up a Grafana plot so that we can monitor and potentially send alerts when disk space is low.

Implementation guide and constraints

There are two related issues here, and we may want to solve them independently if need be:

First, have a plot that shows disk space usage / remaining on a cluster. This could be an upstream contribution in https://github.com/jupyterhub/grafana-dashboards
Second, add an alert for this, ref: Cloud usage monitoring and alerting infrastructure and process #328
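
As a rough illustration of the two pieces above (a usage figure, then a threshold that could drive an alert), here is a minimal Python sketch. It assumes the home directory volume is mounted on the machine running the script; `summarize_disk`, `is_low_on_space`, and the 90% threshold are hypothetical names and values chosen for the example, not something the Grafana dashboards provide.

```python
import shutil
from typing import NamedTuple


class DiskSummary(NamedTuple):
    total_bytes: int
    used_bytes: int
    free_bytes: int
    used_fraction: float


def summarize_disk(path: str) -> DiskSummary:
    """Return total/used/free bytes for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return DiskSummary(usage.total, usage.used, usage.free, used_fraction)


def is_low_on_space(summary: DiskSummary, max_used_fraction: float = 0.9) -> bool:
    """True when the used fraction meets or exceeds the (assumed) threshold."""
    return summary.used_fraction >= max_used_fraction


if __name__ == "__main__":
    summary = summarize_disk(".")
    print(f"used: {summary.used_fraction:.1%}, free bytes: {summary.free_bytes}")
    if is_low_on_space(summary):
        print("warning: disk space is low")
```

In a real deployment the threshold check would more likely live in an alerting rule than in a script; the sketch only shows the arithmetic involved.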

Updates and ongoing work

No response

sgibson91 · 2022-03-16T10:27:22Z

This may be a useful issue to move upstream? https://github.com/jupyterhub/grafana-dashboards

choldgraf · 2022-03-16T18:16:10Z

@sgibson91 yep definitely agree that an upstream improvement would be the best place for this, if we know that it is generalizable to many deployments. Have updated top comment with a ref to https://github.com/jupyterhub/grafana-dashboards

yuvipanda · 2022-03-23T23:17:24Z

This is a bit complicated on Azure or AWS, since we're using a managed storage service there (AzureFile or EFS) and we'll need an exporter specifically for those that Prometheus can scrape.
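
For a sense of what such an exporter would have to produce, here is a minimal Python sketch that renders capacity and usage numbers in the Prometheus text exposition format. How the numbers are obtained from AzureFile or EFS is deliberately left out, since that depends on the provider. The metric names (`example_volume_capacity_bytes`, `example_volume_used_bytes`) and the `render_volume_metrics` helper are hypothetical, not taken from any existing exporter.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_volume_metrics(volume: str, capacity_bytes: int, used_bytes: int) -> str:
    """Render two gauges for one volume in Prometheus text exposition format."""
    label = f'volume="{escape_label_value(volume)}"'
    lines = [
        "# HELP example_volume_capacity_bytes Total capacity of the volume in bytes.",
        "# TYPE example_volume_capacity_bytes gauge",
        f"example_volume_capacity_bytes{{{label}}} {capacity_bytes}",
        "# HELP example_volume_used_bytes Bytes currently used on the volume.",
        "# TYPE example_volume_used_bytes gauge",
        f"example_volume_used_bytes{{{label}}} {used_bytes}",
    ]
    # The exposition format expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_volume_metrics("home-nfs", 100, 40), end="")
```

A real exporter would serve this text over HTTP on a `/metrics` endpoint and refresh the numbers on each scrape.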

choldgraf · 2022-03-23T23:49:30Z

@yuvipanda aren't we planning to move to Google's managed filesystem service as well?

ref: Move off manually maintained NFS servers on GCP #1105

GeorgianaElena · 2022-05-13T13:44:04Z

Update: I opened a PR upstream to add three dashboards from https://grafana.com/grafana/dashboards/11454 that track some PVC stats. I've deployed them to the 2i2c Grafana.

I know it's not exactly disk usage, but it gives an intuition about what is going on and doesn't care about storage type.
Curious if you find it useful.

damianavila · 2022-05-16T22:43:13Z

> Curious if you find it useful.

I think it provides useful information. The second one is a per-user usage rate, correct?

choldgraf · 2022-05-20T08:34:49Z

These all seem useful to me as well. I am curious what "daily usage" means though: does it mean "data written to disk"?

GeorgianaElena · 2022-05-20T12:50:25Z

As I mentioned, the graphs are ported from https://grafana.com/grafana/dashboards/11454 and not my creation, but the query that generates that graph is rate(kubelet_volume_stats_used_bytes[1d]).

Didn't find any official docs for kubelet_volume_stats_used_bytes other than these.

So, from what I understand, the daily usage graph shows the rate at which the used bytes change on a particular volume, averaged over one day. And yes, I believe each of the prod (home-nfs) entries there corresponds to a user. And since the home-nfs PVs have a retain policy, they don't ever get destroyed.
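
To make the query quoted above more concrete: as far as I understand PromQL, a `rate(...[1d])` expression reports an average per-second change over the trailing day. The Python below is a simplified sketch of that idea on a few made-up samples. `average_rate` and the sample values are hypothetical. The real `rate()` also extrapolates to the window edges and treats any decrease as a counter reset, so for a gauge such as `kubelet_volume_stats_used_bytes` its output can differ from this plain slope whenever usage shrinks (PromQL's `deriv()` is the function documented for gauges).

```python
from typing import Sequence, Tuple

# A sample is (unix_timestamp_seconds, value_in_bytes).
Sample = Tuple[float, float]

SECONDS_PER_DAY = 86_400


def average_rate(samples: Sequence[Sample]) -> float:
    """Average per-second change between the first and last sample.

    This is a plain slope, not a reimplementation of PromQL's rate().
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_first, v_first = samples[0]
    t_last, v_last = samples[-1]
    elapsed = t_last - t_first
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return (v_last - v_first) / elapsed


if __name__ == "__main__":
    # Made-up data: a volume that grows by 864,000 bytes over one day.
    samples = [(0, 1_000_000), (43_200, 1_500_000), (86_400, 1_864_000)]
    per_second = average_rate(samples)
    print(per_second)                    # 10.0 bytes per second
    print(per_second * SECONDS_PER_DAY)  # 864000.0 bytes per day
```

Multiplying the per-second figure by the number of seconds in a day gives the net growth over that day, which is probably the most intuitive way to read the "daily usage" panel.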

consideRatio · 2023-02-08T10:36:09Z

This is done!

damianavila · 2023-02-15T11:05:53Z

> This is done!

I think this is the proper ref (for a future reader): #1992

choldgraf mentioned this issue Mar 15, 2022

[Incident] UToronto cluster ran out of disk space #1081

Closed

4 tasks

GeorgianaElena added this to Cloud usage monitoring and alerting infrastructure and process May 5, 2022

GeorgianaElena moved this to Todo in Cloud usage monitoring and alerting infrastructure and process May 5, 2022

GeorgianaElena added the 🏷️ monitoring label May 12, 2022

GeorgianaElena moved this from Todo to In Progress in Cloud usage monitoring and alerting infrastructure and process May 16, 2022

GeorgianaElena moved this from In Progress to Todo in Cloud usage monitoring and alerting infrastructure and process Jun 20, 2022

choldgraf removed the 🏷️ monitoring label Sep 16, 2022

consideRatio closed this as completed Feb 8, 2023

github-project-automation bot moved this from Todo to Done in Cloud usage monitoring and alerting infrastructure and process Feb 8, 2023

damianavila added this to DEPRECATED Engineering and Product Backlog Feb 15, 2023

damianavila moved this to Complete in DEPRECATED Engineering and Product Backlog Feb 15, 2023
