
[ENH] - Setup alerts for uptime monitoring #2558

Closed
aktech opened this issue Jul 9, 2024 · 2 comments

aktech (Member) commented Jul 9, 2024

Feature description

After we have set up some sort of uptime monitoring (#2557), it would be very useful to set up alerts on it, such as:

  • Email
  • Slack, etc. (a rough notification sketch follows this list)
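
As a very rough, hand-rolled sketch of what such a notification could look like (in practice this would more likely be a Grafana/Alertmanager rule; the endpoint and webhook URLs below are placeholders, not actual Nebari values):

```python
# Minimal polling sketch, NOT a final design: probe a health endpoint and post
# to a Slack incoming webhook when it fails. Both URLs are placeholders.
import requests

HEALTH_URL = "https://nebari.example.com/hub/health"            # placeholder endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook

def check_and_alert() -> None:
    try:
        healthy = requests.get(HEALTH_URL, timeout=10).status_code == 200
    except requests.RequestException:
        healthy = False
    if not healthy:
        # Email could be handled the same way via smtplib instead of a webhook.
        requests.post(SLACK_WEBHOOK, json={"text": f"ALERT: {HEALTH_URL} is down"}, timeout=10)

if __name__ == "__main__":
    check_and_alert()
```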

Value and/or benefit

  • Know immediately when the system's health is compromised

Anything else?

No response

kcpevey (Contributor) commented Jul 18, 2024

Things I think would be interesting to see:

  • notification when things go down (might not be possible since we're monitoring from within; we might have to settle for some sort of heartbeat push update)
  • heartbeat check on conda-store (see the heartbeat sketch at the end of this comment)
  • how long the current users have been on the platform (to identify when someone has accidentally left a server running)
  • Current apps
    • number of apps running
    • total number of servers used by these apps
    • uptime for each app
  • Daily/weekly stats
    • total number of users per day
    • average uptime per user
    • total disk usage for user storage/shared directories
    • total disk usage by conda-store
    • max/average memory consumption of jhub server and conda-store server

Some of these will likely be outside the scope of the initial POC. I imagine some of them may require storing data over time, which is likely out of scope as well. I couldn't tell which could be achieved by scraping Loki logs and which would require storing additional information, so I've just thrown all the ideas out there.
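
For the conda-store heartbeat idea above, a rough sketch of what the check could look like; the URL is a placeholder rather than a confirmed conda-store endpoint, and in practice the result would be pushed to Prometheus or a status page instead of printed:

```python
# Heartbeat sketch under the assumption that conda-store exposes an HTTP
# endpoint we can probe; the URL below is a placeholder, not a verified API.
import time
import requests

CONDA_STORE_URL = "https://nebari.example.com/conda-store/"  # placeholder

def heartbeat_loop(interval_seconds: int = 60) -> None:
    while True:
        start = time.monotonic()
        try:
            up = requests.get(CONDA_STORE_URL, timeout=10).status_code == 200
        except requests.RequestException:
            up = False
        latency = time.monotonic() - start
        # Replace this print with a push to whatever monitoring backend we choose.
        print(f"conda-store up={up} latency={latency:.2f}s")
        time.sleep(interval_seconds)
```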

kenafoster (Contributor) commented

All of Kim's ideas above are crucial. A conda-store heartbeat would be a top priority, since in practice it has been the service most likely to be unavailable.

Some additional suggestions

  • GOAL: conda-store performance
    • Using a simple, consistent spec file, periodically submit a build job and record the elapsed time (a rough timing sketch follows this list).
    • In addition to alerting on any lapses (builds getting stuck in the "queued" state), we can also use this performance metric as a proxy for monitoring worker pods' resource availability and any upstream issues.
  • GOAL: Monitor VM utilization for cost optimization
    • Monitor auto-scaling events within each K8s node pool. Alert when a node pool hits its maximum size (see the node-count sketch after this list). Monitor/report the number of nodes running per node group over time.
    • Aggregate utilization stats (CPU, RAM) per node in each node group. This could help measure whether our node sizes are too big and wasting money (for example, a worker pool's node size might be 4 vCPUs and 64 GiB RAM while the Nebari deployment's usage pattern is usually a single user or app pod requiring only 0.5 CPU and 2 GiB RAM).
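
A rough sketch of the periodic probe-build timing idea from the first goal. `submit_build` and `build_status` are hypothetical callables standing in for whichever conda-store API calls we end up using; the sketch only shows the timing and stuck-in-queue alerting shape:

```python
# Probe-build timing sketch. submit_build() and build_status() are hypothetical
# stand-ins for real conda-store API calls; only the timing/alerting shape matters.
import time

QUEUED_ALERT_SECONDS = 15 * 60  # assumption: alert if the build sits queued this long

def timed_probe_build(submit_build, build_status, poll_seconds: int = 30) -> float:
    """Submit a small fixed-spec build and return its elapsed wall-clock time."""
    build_id = submit_build()            # hypothetical: POST the consistent spec file
    start = time.monotonic()
    while True:
        status = build_status(build_id)  # hypothetical: "QUEUED", "BUILDING", "COMPLETED", ...
        elapsed = time.monotonic() - start
        if status == "COMPLETED":
            return elapsed               # record this as the performance metric
        if status == "QUEUED" and elapsed > QUEUED_ALERT_SECONDS:
            raise RuntimeError("probe build stuck in the queued state")  # raise an alert
        time.sleep(poll_seconds)
```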

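And a sketch of the node-pool-at-maximum alert from the second goal, using the official `kubernetes` Python client; the node-group label key and the maximum sizes are assumptions that would differ per cloud provider and Nebari configuration:

```python
# Node-pool-at-max sketch. The label key and MAX_NODES values are assumptions;
# real values depend on the cloud provider and the Nebari deployment config.
from collections import Counter
from kubernetes import client, config

NODE_GROUP_LABEL = "eks.amazonaws.com/nodegroup"    # assumption: AWS-style node-group label
MAX_NODES = {"general": 1, "user": 5, "worker": 5}  # assumption: configured pool maximums

def node_pools_at_max() -> list[str]:
    config.load_incluster_config()  # or config.load_kube_config() when run outside the cluster
    nodes = client.CoreV1Api().list_node().items
    counts = Counter(n.metadata.labels.get(NODE_GROUP_LABEL, "unknown") for n in nodes)
    # Any pool whose current node count has reached its configured maximum should alert.
    return [pool for pool, count in counts.items() if count >= MAX_NODES.get(pool, float("inf"))]
```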