
[ENH] - Setup alerts for uptime monitoring #2558

Closed
aktech opened this issue Jul 9, 2024 · 2 comments

aktech (Member) commented Jul 9, 2024

Feature description

After we have set up some sort of uptime monitoring (#2557), it would be very useful to set up alerts on it, such as:

  • Email
  • Slack, etc. (a rough notification sketch follows this list)
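
As a very rough, hand-rolled sketch of what such a notification could look like (in practice this would more likely be a Grafana/Alertmanager rule; the endpoint and webhook URLs below are placeholders, not actual Nebari values):

```python
# Minimal polling sketch, NOT a final design: probe a health endpoint and post
# to a Slack incoming webhook when it fails. Both URLs are placeholders.
import requests

HEALTH_URL = "https://nebari.example.com/hub/health"            # placeholder endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook

def check_and_alert() -> None:
    try:
        healthy = requests.get(HEALTH_URL, timeout=10).status_code == 200
    except requests.RequestException:
        healthy = False
    if not healthy:
        # Email could be handled the same way via smtplib instead of a webhook.
        requests.post(SLACK_WEBHOOK, json={"text": f"ALERT: {HEALTH_URL} is down"}, timeout=10)

if __name__ == "__main__":
    check_and_alert()
```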

Value and/or benefit

  • Know immediately when the system's health is compromised

Anything else?

No response

kcpevey (Contributor) commented Jul 18, 2024

Things I think would be interesting to see:

  • notification when things go down (might not be possible since we're monitoring from within; we might have to settle for some sort of heartbeat push update)
  • heartbeat check on conda-store (see the heartbeat sketch at the end of this comment)
  • how long the current users have been on the platform (to identify when someone has accidentally left a server running)
  • Current apps
    • number of apps running
    • total number of servers used by these apps
    • uptime for each app
  • Daily/weekly stats
    • total number of users per day
    • average uptime per user
    • total disk usage for user storage/shared directories
    • total disk usage by conda-store
    • max/average memory consumption of jhub server and conda-store server

Some of these will likely be outside the scope of the initial POC. I imagine some of them may require storing data over time, which is likely out of scope as well. I couldn't tell which could be achieved by scraping Loki logs and which would require storing additional information, so I've just thrown all the ideas out there.
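
For the conda-store heartbeat idea above, a rough sketch of what the check could look like; the URL is a placeholder rather than a confirmed conda-store endpoint, and in practice the result would be pushed to Prometheus or a status page instead of printed:

```python
# Heartbeat sketch under the assumption that conda-store exposes an HTTP
# endpoint we can probe; the URL below is a placeholder, not a verified API.
import time
import requests

CONDA_STORE_URL = "https://nebari.example.com/conda-store/"  # placeholder

def heartbeat_loop(interval_seconds: int = 60) -> None:
    while True:
        start = time.monotonic()
        try:
            up = requests.get(CONDA_STORE_URL, timeout=10).status_code == 200
        except requests.RequestException:
            up = False
        latency = time.monotonic() - start
        # Replace this print with a push to whatever monitoring backend we choose.
        print(f"conda-store up={up} latency={latency:.2f}s")
        time.sleep(interval_seconds)
```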

kenafoster (Contributor) commented

All of Kim's ideas above are crucial. A conda-store heartbeat would be a top priority, since in practice it has been the service most likely to be unavailable.

Some additional suggestions

  • GOAL: conda-store performance
    • Using a simple, consistent spec file, periodically submit a build job and record the elapsed time (a rough timing sketch follows this list).
    • In addition to alerting on any lapses (builds getting stuck in the "queued" state), we can also use this performance metric as a proxy for monitoring worker pods' resource availability and any upstream issues.
  • GOAL: Monitor VM utilization for cost optimization
    • Monitor auto-scaling events within each K8s node pool. Alert when a node pool hits its maximum size (see the node-count sketch after this list). Monitor/report the number of nodes running per node group over time.
    • Aggregate utilization stats (CPU, RAM) per node in each node group. This could help measure whether our node sizes are too big and wasting money (for example, a worker pool's node size might be 4 vCPUs and 64 GiB RAM while the Nebari deployment's usage pattern is usually a single user or app pod requiring only 0.5 CPU and 2 GiB RAM).
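
A rough sketch of the periodic probe-build timing idea from the first goal. `submit_build` and `build_status` are hypothetical callables standing in for whichever conda-store API calls we end up using; the sketch only shows the timing and stuck-in-queue alerting shape:

```python
# Probe-build timing sketch. submit_build() and build_status() are hypothetical
# stand-ins for real conda-store API calls; only the timing/alerting shape matters.
import time

QUEUED_ALERT_SECONDS = 15 * 60  # assumption: alert if the build sits queued this long

def timed_probe_build(submit_build, build_status, poll_seconds: int = 30) -> float:
    """Submit a small fixed-spec build and return its elapsed wall-clock time."""
    build_id = submit_build()            # hypothetical: POST the consistent spec file
    start = time.monotonic()
    while True:
        status = build_status(build_id)  # hypothetical: "QUEUED", "BUILDING", "COMPLETED", ...
        elapsed = time.monotonic() - start
        if status == "COMPLETED":
            return elapsed               # record this as the performance metric
        if status == "QUEUED" and elapsed > QUEUED_ALERT_SECONDS:
            raise RuntimeError("probe build stuck in the queued state")  # raise an alert
        time.sleep(poll_seconds)
```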

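And a sketch of the node-pool-at-maximum alert from the second goal, using the official `kubernetes` Python client; the node-group label key and the maximum sizes are assumptions that would differ per cloud provider and Nebari configuration:

```python
# Node-pool-at-max sketch. The label key and MAX_NODES values are assumptions;
# real values depend on the cloud provider and the Nebari deployment config.
from collections import Counter
from kubernetes import client, config

NODE_GROUP_LABEL = "eks.amazonaws.com/nodegroup"    # assumption: AWS-style node-group label
MAX_NODES = {"general": 1, "user": 5, "worker": 5}  # assumption: configured pool maximums

def node_pools_at_max() -> list[str]:
    config.load_incluster_config()  # or config.load_kube_config() when run outside the cluster
    nodes = client.CoreV1Api().list_node().items
    counts = Counter(n.metadata.labels.get(NODE_GROUP_LABEL, "unknown") for n in nodes)
    # Any pool whose current node count has reached its configured maximum should alert.
    return [pool for pool, count in counts.items() if count >= MAX_NODES.get(pool, float("inf"))]
```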