1340: Added prometheus counters for all the jobs #1365

Closed

Conversation

deepak-muley
Contributor

@deepak-muley deepak-muley commented Aug 15, 2021

Ref: #1340

Testing

TODO:
1. Decide whether to rename the tf_operator-specific counters
(backward compatibility needed?)

pulling latest changes from kubeflow/tf-operator to deepak-muley/tf-operator
TODO:
1. Decide whether to rename the tf_operator-specific counters
   (backward compatibility needed?)
2. Update counters at all other places
@google-oss-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign johnugeorge after the PR has been reviewed.
You can assign the PR to them by writing /assign @johnugeorge in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aws-kf-ci-bot
Contributor

Hi @deepak-muley. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepak-muley
Contributor Author

/hold

@deepak-muley deepak-muley marked this pull request as ready for review August 16, 2021 05:56
@deepak-muley
Contributor Author

@Jeffwan, to keep the code consistent across all the operator controllers, we need to consolidate or remove obsolete code from controller.v1/tensorflow. In particular, none of the files controller.go, job.go, pod.go, and status.go exist in pytorch, xgboost, or mxnet, so reading the code becomes confusing. I have moved all the Prometheus counters consistently into the _controller.go files of each operator. I need help identifying whether the extra files above are needed and whether we can consolidate them to make the code consistent across all directories.

Member

@Jeffwan Jeffwan left a comment

Overall looks good to me. One thing I am not sure about: do we want the framework to be part of the metric name, or should we make it a label?

We can consider which way makes dashboard building easier.

var (
	mxJobsCreatedCount = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "training_operator_mxjobs_created_total",
Member

Do we want mxjobs to be part of the metric name, or a label?

Member

I have the same comment. What would the Prometheus query look like in each case?
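
For illustration only, the two options raised in this thread could be compared by writing out the PromQL a dashboard would need under each scheme. The sketch below is not code from this PR: only the metric name training_operator_mxjobs_created_total appears in the diff, while the shared metric name training_operator_jobs_created_total and the "framework" label are hypothetical names assumed here to show the label-based alternative.

```go
package main

import "fmt"

// Scheme A: the framework is baked into the metric name, as in
// training_operator_mxjobs_created_total from the diff under review.
func nameSchemeMetric(framework string) string {
	return fmt.Sprintf("training_operator_%ss_created_total", framework)
}

// Scheme B (hypothetical): one shared metric with a "framework" label.
// Neither this metric name nor the label name comes from the PR.
const labelSchemeMetric = "training_operator_jobs_created_total"

// perFrameworkQueryA selects a single framework by metric name.
func perFrameworkQueryA(framework string) string {
	return fmt.Sprintf("sum(%s)", nameSchemeMetric(framework))
}

// perFrameworkQueryB selects a single framework with a label matcher.
func perFrameworkQueryB(framework string) string {
	return fmt.Sprintf("sum(%s{framework=%q})", labelSchemeMetric, framework)
}

// totalQueryA needs a regex over __name__ to span every framework.
func totalQueryA() string {
	return `sum({__name__=~"training_operator_.+jobs_created_total"})`
}

// totalQueryB sums the single shared metric across all label values.
func totalQueryB() string {
	return fmt.Sprintf("sum(%s)", labelSchemeMetric)
}

// breakdownQueryB splits the shared metric per framework in one query.
func breakdownQueryB() string {
	return fmt.Sprintf("sum by (framework) (%s)", labelSchemeMetric)
}

func main() {
	fmt.Println("A, one framework: ", perFrameworkQueryA("mxjob"))
	fmt.Println("B, one framework: ", perFrameworkQueryB("mxjob"))
	fmt.Println("A, all frameworks:", totalQueryA())
	fmt.Println("B, all frameworks:", totalQueryB())
	fmt.Println("B, per framework: ", breakdownQueryB())
}
```

Under these assumptions, a single-framework panel is equally simple either way. The difference shows up for cross-framework totals and breakdowns, where the name-based scheme has to match on __name__ with a regex and the label-based scheme can aggregate one metric by label.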

@Jeffwan
Member

Jeffwan commented Aug 16, 2021

BTW, can you extract the manifest updates into a separate PR? Let's split this into two so that each can concentrate on one issue.

Member

@andreyvelich andreyvelich left a comment

Thank you for doing this @deepak-muley!
I added a few comments.

@deepak-muley deepak-muley changed the title 1340: WIP: Added prometheus counters for all the jobs 1340: Added prometheus counters for all the jobs Aug 16, 2021
deepak-muley added a commit to deepak-muley/tf-operator that referenced this pull request Aug 17, 2021
@deepak-muley
Contributor Author

BTW, can you extract the manifest updates into a separate PR? Let's split this into two so that each can concentrate on one issue.

#1368

google-oss-robot pushed a commit that referenced this pull request Aug 17, 2021
* separated out the manifests fix from #1365

* Update Makefile

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@Jeffwan
Member

Jeffwan commented Aug 17, 2021

@deepak-muley Please rebase onto master to pick up the manifest change.

andreyvelich and others added 4 commits August 17, 2021 13:17
* separated out the manifests fix from kubeflow#1365

* Update Makefile

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
TODO:
1. Decide whether to rename the tf_operator-specific counters
   (backward compatibility needed?)
2. Update counters at all other places
TODO:
1. Decide whether to rename the tf_operator-specific counters
   (backward compatibility needed?)
2. Update counters at all other places
@google-cla

google-cla bot commented Aug 17, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@deepak-muley
Contributor Author

/hold

@Jeffwan
Member

Jeffwan commented Aug 17, 2021

@deepak-muley Did you rebase onto upstream master? It seems that master commits have come into your dev branch.

7 participants