1340: Added prometheus counters for all the jobs #1365
pulling latest changes from kubeflow/tf-operator to deepak-muley/tf-operator
TODO: 1. Decide whether we should rename the tf_operator-specific counters (is backward compatibility needed?) 2. Update counters in all other places
/hold
Hence separated out the CRDs into their own folder; the same goes for make uninstall
TODO: Need to find out whether other files like job.go are needed
@Jeffwan, to keep the code consistent across all the operator controllers, we need to consolidate or remove obsolete code from controller.v1/tensorflow. In particular, files like controller.go, job.go, pod.go and status.go do not exist in pytorch, xgboost and mxnet, which makes the code confusing to read. I have moved all the Prometheus counters consistently into the _controller.go files of each operator. I need help identifying whether the extra files above are needed and whether we can consolidate them to make the code consistent across all directories.
Overall looks good to me. One thing I am not sure about: do we want the framework to be
part of the metric name, or should we make it a label?
We can consider which way makes dashboard building easier.
var (
	mxJobsCreatedCount = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "training_operator_mxjobs_created_total",
Do we want mxjobs to be part of the metric name, or a label?
I have the same comment. What would the Prometheus query look like in each case?
BTW, can you extract the manifest updates into a separate PR? Let's split this into two so that each PR can concentrate on one issue.
Thank you for doing this @deepak-muley!
I added a few comments.
* separated out the manifests fix from #1365
* Update Makefile
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@deepak-muley Please rebase the manifest change from master
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info. |
/hold
@deepak-muley Did you rebase onto upstream master? It seems commits from master have ended up in your dev branch.
Ref: #1340
Testing
The following was observed on http://localhost:8080/metrics:
training_operator_tfjobs_created_total{job_namespace="test-tf-operator"} 2
training_operator_tfjobs_successful_total{job_namespace="test-tf-operator"} 3
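To check output like the above automatically instead of by eye, a small reader for the Prometheus text format can help. This is a simplified sketch using only the standard library: parseSamples is a hypothetical helper, and it assumes sample lines without timestamps and without spaces inside label values. The sample text is copied from the observation above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples reads lines of the form `name{labels} value` and returns a
// map from the series (name plus labels) to its value. Comment lines
// (# HELP, # TYPE) and blank lines are skipped.
// Assumption: no timestamps and no spaces inside label values.
func parseSamples(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The value is everything after the last space on the line.
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[line[:idx]] = v
	}
	return samples, sc.Err()
}

func main() {
	// Sample lines taken from the Testing section of this PR.
	const scraped = `training_operator_tfjobs_created_total{job_namespace="test-tf-operator"} 2
training_operator_tfjobs_successful_total{job_namespace="test-tf-operator"} 3`

	samples, err := parseSamples(scraped)
	if err != nil {
		panic(err)
	}
	for series, value := range samples {
		fmt.Println(series, value)
	}
}
```

For anything beyond a quick check, a full parser for the exposition format would be a safer choice than this line splitter.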
The following was observed on http://localhost:8081/healthz:
ok
The following was observed on http://localhost:8081/readyz:
ok
TODO:
1. Decide whether we should rename the tf_operator-specific counters
(is backward compatibility needed?)