This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

send regular GPU utilization report with CronJob #5281

Merged
suiguoxin merged 29 commits into microsoft:master from suiguoxin:cronjob on Feb 7, 2021


@suiguoxin suiguoxin (Member) commented Feb 2, 2021

  • Create a Kubernetes CronJob that:
    • queries Prometheus regularly for GPU utilization data
    • sends cluster-usage alerts to alert-manager
    • leaves the handling of those alerts to alert-handler
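
A rough sketch of what such a job could run is shown here, for illustration only; it is not the code merged in this PR. It uses the Prometheus instant-query endpoint (`/api/v1/query`) and the Alertmanager v2 endpoint (`/api/v2/alerts`). The service URLs, the `task_gpu_percent` metric, the 7-day window, and the alert labels are all assumptions.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoints and query; adjust to the actual cluster.
PROMETHEUS_URL = "http://prometheus:9090"
ALERTMANAGER_URL = "http://alertmanager:9093"
GPU_UTIL_QUERY = "avg(avg_over_time(task_gpu_percent[7d]))"  # assumed metric name


def build_query_url(base_url, promql):
    """Return the Prometheus instant-query URL for a PromQL expression."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_scalar(response):
    """Extract the first sample value from an instant-query JSON response."""
    results = response["data"]["result"]
    if not results:
        return None
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(results[0]["value"][1])


def build_alert(utilization):
    """Build an Alertmanager v2 payload (a list of alerts) for the report."""
    return [{
        # Label names and values are illustrative, not taken from the PR.
        "labels": {"alertname": "cluster-usage", "severity": "info"},
        "annotations": {"summary": f"Average GPU utilization: {utilization:.2f}%"},
    }]


def send_report():
    """Query Prometheus once and forward the result to Alertmanager."""
    with urllib.request.urlopen(build_query_url(PROMETHEUS_URL, GPU_UTIL_QUERY)) as resp:
        utilization = parse_scalar(json.load(resp))
    if utilization is None:
        return  # no data, nothing to report
    request = urllib.request.Request(
        ALERTMANAGER_URL + "/api/v2/alerts",
        data=json.dumps(build_alert(utilization)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request).close()
```

The CronJob would call `send_report()` once per scheduled run; routing the resulting alert to email is left to the alert-manager and alert-handler configuration.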

This feature was previously implemented by @Binyang2014 here; this PR just adapts that code into alert-manager.

The alert mail: [image]

@suiguoxin suiguoxin requested a review from Binyang2014 February 2, 2021 01:58
@coveralls coveralls commented Feb 2, 2021


Coverage decreased (-0.04%) to 33.896% when pulling f31f8cd on suiguoxin:cronjob into c26313e on microsoft:master.

@suiguoxin suiguoxin mentioned this pull request Feb 2, 2021
55 tasks
@Binyang2014 Binyang2014 (Contributor) left a comment

Can an admin disable this report by commenting out the config?

@suiguoxin suiguoxin (Member, Author) commented

> Can an admin disable this report by commenting out the config?

The code has been refactored. With this change, the report is enabled only when the schedule field is set.
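
As an illustration of that gating, a minimal sketch might look like the following. The `cluster-utilization` key and the dict-shaped config are assumptions made for the example, not the PR's actual config layout; only the schedule field is taken from the comment above.

```python
def report_enabled(config):
    """Return True only when a non-empty schedule is configured."""
    # "cluster-utilization" is a hypothetical section name. A section whose
    # fields are all commented out may load as None, so fall back to {}.
    section = config.get("cluster-utilization") or {}
    return bool(section.get("schedule"))
```

With this check, commenting out the schedule line (or the whole section) leaves the report disabled, so the CronJob would not be created.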

@suiguoxin suiguoxin requested a review from Binyang2014 February 3, 2021 07:55
@suiguoxin suiguoxin merged commit edc67c2 into microsoft:master Feb 7, 2021
@suiguoxin suiguoxin deleted the cronjob branch February 7, 2021 05:22
@suiguoxin suiguoxin mentioned this pull request Mar 9, 2021