This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Proposal for exposing generic prometheus metrics in common operator #22

Open
ywskycn opened this issue Apr 23, 2019 · 11 comments

Comments

@ywskycn
Member

ywskycn commented Apr 23, 2019

Proposal

Add generic metrics (for jobs, pods, and so on) to the common operator, so that operators built on the common operator can enable and use them directly.

Motivation

To track job-level metrics today, we have to add Prometheus metric code inside each job operator. For example, to know how many TFJobs were created in the last hour, we need to add a Counter inside tf-operator. This need is common and applies to different operators. As we're moving shared code into the common operator, we could add the metric-related code there as well, so that every operator built on the common one can use it.

Details

For metric definition and registration, we will add a new metrics folder and define all metrics there. Some preliminary metrics include the number of jobs/pods/services created, durations of various operations, etc.
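As a rough illustration of what a central metrics definition could look like, here is a minimal sketch. To stay dependency-free it uses a small hand-rolled labeled counter rather than the Prometheus Go client library; a real implementation would most likely use the client library's own counter types instead. The type `CounterVec`, the metric names, and the `kind` label are all assumptions made for this example, not names from the common operator.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// CounterVec is a hand-rolled stand-in for a labeled counter.
// It is NOT the Prometheus client library type; it only illustrates the idea.
type CounterVec struct {
	mu     sync.Mutex
	name   string
	help   string
	label  string
	values map[string]float64
}

// NewCounterVec creates a counter partitioned by a single label.
func NewCounterVec(name, help, label string) *CounterVec {
	return &CounterVec{name: name, help: help, label: label, values: make(map[string]float64)}
}

// Inc adds one to the series identified by labelValue.
func (c *CounterVec) Inc(labelValue string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[labelValue]++
}

// Value returns the current count for labelValue (0 if never incremented).
func (c *CounterVec) Value(labelValue string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[labelValue]
}

// Expose renders the counter in the Prometheus text exposition format,
// with series sorted by label value so the output is deterministic.
func (c *CounterVec) Expose() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", c.name, c.help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %g\n", c.name, c.label, k, c.values[k])
	}
	return b.String()
}

// Hypothetical metric definitions that would live in the metrics folder.
var (
	JobsCreated     = NewCounterVec("common_operator_jobs_created_total", "Number of jobs created.", "kind")
	PodsCreated     = NewCounterVec("common_operator_pods_created_total", "Number of pods created.", "kind")
	ServicesCreated = NewCounterVec("common_operator_services_created_total", "Number of services created.", "kind")
)

func main() {
	JobsCreated.Inc("TFJob")
	JobsCreated.Inc("TFJob")
	JobsCreated.Inc("PyTorchJob")
	fmt.Print(JobsCreated.Expose())
}
```

Keeping every definition in one place means each operator only chooses a label value (its job kind here) instead of declaring its own counters.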

For metrics updating:

  • For pods/services, we can directly add related metric code inside job_controller/pod.go and job_controller/service.go.
  • For jobs, to track the numbers, we may need to watch the creation events, similar to controller_watches.
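The two update paths listed above might be sketched as follows: pod and service counters are bumped directly in the creation code path, while the job counter is bumped from an add-event handler. This is only a sketch under assumptions. `Controller`, `Metrics`, `CreatePod`, `CreateService`, and `OnJobAdd` are hypothetical stand-ins, not the actual job_controller API, and the Kubernetes API calls are elided.

```go
package main

import (
	"fmt"
	"sync"
)

// Metrics holds creation counts keyed by resource type ("job", "pod", "service").
// Hypothetical helper for this sketch; a real version would wrap Prometheus counters.
type Metrics struct {
	mu      sync.Mutex
	created map[string]int
}

// NewMetrics returns an empty set of creation counters.
func NewMetrics() *Metrics {
	return &Metrics{created: make(map[string]int)}
}

// IncCreated records one creation of the given resource type.
func (m *Metrics) IncCreated(resource string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.created[resource]++
}

// Created returns how many creations were recorded for the resource type.
func (m *Metrics) Created(resource string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.created[resource]
}

// Controller is a hypothetical stand-in for the common job controller.
type Controller struct {
	metrics *Metrics
}

// CreatePod stands in for the pod-creation path (job_controller/pod.go in the proposal).
func (c *Controller) CreatePod(name string) error {
	// A real controller would call the Kubernetes API here and
	// only count the pod once that call has succeeded.
	c.metrics.IncCreated("pod")
	return nil
}

// CreateService stands in for the service-creation path (job_controller/service.go).
func (c *Controller) CreateService(name string) error {
	// Same pattern as CreatePod: count after a successful create.
	c.metrics.IncCreated("service")
	return nil
}

// OnJobAdd stands in for a handler invoked when a job creation event is observed.
func (c *Controller) OnJobAdd(jobName string) {
	c.metrics.IncCreated("job")
}

func main() {
	c := &Controller{metrics: NewMetrics()}
	c.OnJobAdd("mnist")
	_ = c.CreatePod("mnist-worker-0")
	_ = c.CreatePod("mnist-worker-1")
	_ = c.CreateService("mnist-worker-0")
	fmt.Println("jobs:", c.metrics.Created("job"),
		"pods:", c.metrics.Created("pod"),
		"services:", c.metrics.Created("service"))
}
```

The asymmetry is deliberate: the controller creates pods and services itself, so it can count at the call site, whereas jobs are created by users and can only be observed through events.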

As the common project is still under active development, some of the details discussed above may change later. Comments are very welcome, @jlewi @richardsliu @gaocegege @jian-he .

@issue-label-bot

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.93. Please mark this comment with 👍 or 👎 to give our bot feedback!

@gaocegege
Member

/cc @terrytangyuan

The feature LGTM.

@jian-he
Contributor

jian-he commented Apr 23, 2019

lgtm, +1

@terrytangyuan
Member

terrytangyuan commented Apr 23, 2019

Sounds great to me. This would be a good way to standardize metrics collection. We could also expose some utility methods that operators can use to collect operator-specific custom metrics, which leads to shared best practices and standards across operators.

@richardsliu
Contributor

Sounds great to me.

/cc @jlewi

@johnugeorge
Member

Great. LGTM
One problem that I see is the limited information in Job Controller. If we design the common interfaces well, this is possible.

@gaocegege
Member

> One problem that I see is the limited information in Job Controller. If we design the common interfaces well, this is possible.

Sure. kubebuilder supports the feature, thus I think we can also implement it in common-operator if we design it well.

@merlintang
Contributor

LGTM, this looks so good.

@yeya24

yeya24 commented Oct 18, 2019

Any progress on this issue?

@gaocegege
Member

@yeya24 AFAIK, there is no one working on it now.

@terrytangyuan
Member

Hi all, I added a detailed outline of the Prometheus metrics we plan to cover in the common operator in #77. Please take a look; any feedback would be appreciated.

georgkaleido pushed a commit to georgkaleido/common that referenced this issue Jun 9, 2022