Skip to content
This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Add proposal for Prometheus metrics coverage #77

Merged
merged 5 commits into from
May 1, 2020

Conversation

terrytangyuan
Copy link
Member

This provides a detailed outline of the Prometheus metrics we plan to coverage in common operator. Related issue: #22.

Signed-off-by: terrytangyuan terrytangyuan@gmail.com

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@kubeflow-bot
Copy link

This change is Reviewable

@terrytangyuan
Copy link
Member Author

docs/prometheus-metrics.md Outdated Show resolved Hide resolved
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@terrytangyuan
Copy link
Member Author

Thanks everyone for the comments! I've converted the lists to tables which include the metric name, type, and description. I also added a few additional metrics as suggested. Hopefully it's much clearer now. Please take another look.

| up | Gauge | Keep-Alive check (maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series))) |

Note that some of the above metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet
integration which reports to Prometheus through our prometheus-operator installation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to make sure the scope. This is outside operator. By default cadvisor expose the metrics and user can use these by their own.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think it's good to document this here so we know that we don't need to report these metrics by ourselves.


| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| from_created_to_completed_job_duration_seconds_total | Counter | The duration between job created and job completed in seconds |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I am thinking if we should change to job_duration_from_created_to_complated_seconds_total. Another thing is seems it would be good to use complete deleted as labels, but duration requires two and it would be a little bit hard to query. I think adding labels into metrics to distinguish them makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am following the naming practice outlined here: https://prometheus.io/docs/practices/naming/. I prefer the current naming without label as it's more intuitive but we can certainly revisit/revise later.

@Jeffwan
Copy link
Member

Jeffwan commented Apr 29, 2020

Beside above minor comments, it looks good to me. Wait to see if someone else has the feedback

@terrytangyuan
Copy link
Member Author

/assign @gaocegege @johnugeorge

docs/prometheus-metrics.md Outdated Show resolved Hide resolved
docs/prometheus-metrics.md Outdated Show resolved Hide resolved
docs/prometheus-metrics.md Outdated Show resolved Hide resolved
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@terrytangyuan
Copy link
Member Author

terrytangyuan commented Apr 30, 2020

@yeya24 Thanks! Great suggestions. I have updated the metric types in the doc.

PTAL @gaocegege @johnugeorge @Jeffwan

| completed_jobs_total | Counter | The total number of completed jobs |
| restarted_jobs_total | Counter | The total number of restarted jobs |
| pending_jobs_total | Gauge | The total number of pending jobs |
| failed_jobs_total | Counter | The total number of failed jobs |
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@terrytangyuan Forgot to mention this one. Do you think it is more appropriate to make this Gauge as well? Do you want to represent the history failures or the current failed jobs?

Can we list the metrics label in this doc as well? This is important and useful, too. Like we can combine pending jobs running jobs and failed jobs into one metric job_status{status="pending/failed/running"}, WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it as it is for now so that the metrics are consistent for metrics with past tense v.s. metrics with present continuous tense. Currently there are no labels yet as it's hard to differentiate metrics with two different tenses and choose different metric types for those metrics.

@Jeffwan
Copy link
Member

Jeffwan commented May 1, 2020

/lgtm

@terrytangyuan
Copy link
Member Author

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1e61243 into kubeflow:master May 1, 2020
@terrytangyuan terrytangyuan deleted the prom-metrics-doc branch May 1, 2020 21:14
georgkaleido pushed a commit to georgkaleido/common that referenced this pull request Jun 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants