Add proposal for Prometheus metrics coverage #77

terrytangyuan · 2020-04-23T20:19:56Z

This provides a detailed outline of the Prometheus metrics we plan to cover in the common operator. Related issue: #22.

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

kubeflow-bot · 2020-04-23T20:20:01Z

This change is

terrytangyuan · 2020-04-23T20:22:49Z

/cc @ywskycn @Jeffwan @gaocegege @richardsliu @johnugeorge @merlintang @jian-he @carmark

docs/prometheus-metrics.md

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

terrytangyuan · 2020-04-27T18:44:30Z

Thanks everyone for the comments! I've converted the lists to tables which include the metric name, type, and description. I also added a few additional metrics as suggested. Hopefully it's much clearer now. Please take another look.

docs/prometheus-metrics.md

Jeffwan · 2020-04-29T23:24:04Z

docs/prometheus-metrics.md

+| up | Gauge | Keep-Alive check (maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series))) |
+
+Note that some of the above metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet
+integration which reports to Prometheus through our prometheus-operator installation.


I want to confirm the scope. This is outside the operator: by default, cAdvisor exposes these metrics and users can consume them on their own.

Yes, but I think it's good to document this here so we know that we don't need to report these metrics ourselves.

Jeffwan · 2020-04-29T23:33:43Z

docs/prometheus-metrics.md

+
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| from_created_to_completed_job_duration_seconds_total | Counter | The duration between job created and job completed in seconds |


Minor: I am wondering whether we should rename this to job_duration_from_created_to_completed_seconds_total. Also, it seems it would be good to use "completed" and "deleted" as labels, but a duration involves two states, so it would be a little hard to query. I think adding labels to the metrics to distinguish them makes sense.

I am following the naming practices outlined here: https://prometheus.io/docs/practices/naming/. I prefer the current naming without labels as it's more intuitive, but we can certainly revisit and revise it later.
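One way to read the Counter type for this metric: paired with `completed_jobs_total`, the accumulated seconds give an average job duration as a ratio of two counters. The sketch below assumes that pairing, which this thread does not state explicitly. The `Counter` class and the `observe_completed_job` helper are hypothetical stand-ins written for illustration, not a real Prometheus client library.

```python
class Counter:
    """Hypothetical stand-in for a monotonic counter: the value never decreases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


# Stands in for from_created_to_completed_job_duration_seconds_total.
duration_seconds_total = Counter()
# Stands in for completed_jobs_total.
completed_jobs_total = Counter()


def observe_completed_job(created_at, completed_at):
    """Record one completed job, given creation and completion times in seconds."""
    duration_seconds_total.inc(completed_at - created_at)
    completed_jobs_total.inc()


observe_completed_job(created_at=100.0, completed_at=160.0)  # 60 s job
observe_completed_job(created_at=200.0, completed_at=320.0)  # 120 s job

# Average duration is the ratio of the two counters.
average_duration = duration_seconds_total.value / completed_jobs_total.value
print(average_duration)
```

Because both values only ever increase, the same ratio can be taken over any time window by comparing the increase of each counter across that window.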

Jeffwan · 2020-04-29T23:34:35Z

Besides the minor comments above, this looks good to me. Let's wait to see if anyone else has feedback.

terrytangyuan · 2020-04-30T14:00:23Z

/assign @gaocegege @johnugeorge

docs/prometheus-metrics.md

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

terrytangyuan · 2020-04-30T20:01:56Z

@yeya24 Thanks! Great suggestions. I have updated the metric types in the doc.

PTAL @gaocegege @johnugeorge @Jeffwan

yeya24 · 2020-04-30T20:21:05Z

docs/prometheus-metrics.md

+| completed_jobs_total | Counter | The total number of completed jobs |
+| restarted_jobs_total | Counter | The total number of restarted jobs |
+| pending_jobs_total | Gauge | The total number of pending jobs |
+| failed_jobs_total | Counter | The total number of failed jobs |


@terrytangyuan Forgot to mention this one. Do you think it would be more appropriate to make this a Gauge as well? Do you want to represent historical failures or the currently failed jobs?

Can we list the metric labels in this doc as well? They are important and useful, too. For example, we could combine pending jobs, running jobs, and failed jobs into one metric, job_status{status="pending/failed/running"}. WDYT?

Let's keep it as is for now so that naming stays consistent between metrics in the past tense and metrics in the present continuous tense. There are no labels yet because, under a single labeled metric, it would be hard to distinguish the two tenses and to choose a different metric type for each.
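For reference, the single-metric alternative suggested above could look roughly like this in the Prometheus text exposition format. This is only a sketch: the job names and the `render_job_status` helper are hypothetical, the `status` values are taken from the comment above, and the proposal as written keeps separate unlabeled metrics.

```python
from collections import Counter

# Hypothetical snapshot of current job states; names are illustrative only.
jobs = {
    "job-a": "pending",
    "job-b": "running",
    "job-c": "failed",
    "job-d": "running",
}


def render_job_status(jobs, statuses=("pending", "running", "failed")):
    """Render one gauge with a `status` label in the text exposition format."""
    counts = Counter(jobs.values())
    lines = [
        "# HELP job_status Number of jobs currently in each status",
        "# TYPE job_status gauge",
    ]
    for status in statuses:
        # Emit every status, including zero counts, so each series always exists.
        lines.append(f'job_status{{status="{status}"}} {counts.get(status, 0)}')
    return "\n".join(lines)


output = render_job_status(jobs)
print(output)
```

A single labeled gauge like this reports only current state, so cumulative values such as `failed_jobs_total` would still need their own Counter, which is the mismatch in metric types raised in the reply above.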

Jeffwan · 2020-05-01T20:15:01Z

/lgtm

terrytangyuan · 2020-05-01T21:11:20Z

/approve

k8s-ci-robot · 2020-05-01T21:11:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Add proposal for Prometheus metrics coverage

cf4c0b1

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

k8s-ci-robot requested review from gaocegege and richardsliu April 23, 2020 20:20

k8s-ci-robot added the size/M label Apr 23, 2020

terrytangyuan mentioned this pull request Apr 23, 2020

Proposal for exposing generic prometheus metrics in common operator #22

Open

k8s-ci-robot requested review from carmark, Jeffwan, jian-he, johnugeorge, merlintang and ywskycn April 23, 2020 20:22

gaocegege reviewed Apr 24, 2020

Jeffwan reviewed Apr 24, 2020

Jeffwan reviewed Apr 24, 2020

merlintang reviewed Apr 24, 2020

terrytangyuan added 2 commits April 27, 2020 14:23

Convert to table and add metric names

3b2daaa

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

Add metric types

4893e7a

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

johnugeorge reviewed Apr 28, 2020

Remove common_operator_is_leader

49a0e8a

Jeffwan reviewed Apr 29, 2020

Jeffwan approved these changes Apr 29, 2020

k8s-ci-robot assigned gaocegege and johnugeorge Apr 30, 2020

yeya24 reviewed Apr 30, 2020

yeya24 reviewed Apr 30, 2020

Address comments

1aa73e7

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

yeya24 reviewed Apr 30, 2020

k8s-ci-robot assigned Jeffwan May 1, 2020

k8s-ci-robot added the lgtm label May 1, 2020

k8s-ci-robot added the approved label May 1, 2020

k8s-ci-robot merged commit 1e61243 into kubeflow:master May 1, 2020

terrytangyuan deleted the prom-metrics-doc branch May 1, 2020 21:14

georgkaleido pushed a commit to georgkaleido/common that referenced this pull request Jun 9, 2022

chore(deps): update all Yarn dependencies (2021-12-01) (kubeflow#77)

759749d
