This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Add proposal for Prometheus metrics coverage #77

Merged 5 commits on May 1, 2020
58 changes: 58 additions & 0 deletions docs/prometheus-metrics.md
@@ -0,0 +1,58 @@
# Prometheus Metrics Coverage

We plan to collect a rich set of metrics in kubeflow/common's `JobController` using [Prometheus](https://prometheus.io/).
The goal is to report generic metrics (e.g. metrics related to pods/jobs/services) during the lifecycle of `JobController` so that:

* Other operators built on top of it will automatically report Prometheus metrics without additional effort;
* Users of Kubeflow distributed training operators can more easily monitor operator performance and behavior using a consistent set of metrics across the different operators.

This document outlines the list of Prometheus metrics we plan to cover in `JobController`.

## Pod Metrics

The following metrics related to the lifecycle of pods will be reported:

* The total number of created pods
* The total number of restarted pods
* The total number of deleted pods
* The total number of failed pods
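
The lifecycle totals above map naturally onto Prometheus counters, which only ever increase. The following is a minimal, stdlib-only sketch of that idea, not the proposed implementation: a real `JobController` would presumably use the Prometheus Go client library instead. The `counter` type, the `expose` helper, and the metric names (`job_controller_pods_created_total` and so on) are all hypothetical and chosen only for illustration; the output follows the Prometheus text exposition format.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// counter is a hypothetical, monotonically increasing metric that mirrors
// the semantics of a Prometheus counter.
type counter struct {
	val  uint64 // accessed atomically; kept first for 64-bit alignment
	name string
	help string
}

// Inc adds one to the counter; safe for concurrent use.
func (c *counter) Inc() { atomic.AddUint64(&c.val, 1) }

// Value returns the current count.
func (c *counter) Value() uint64 { return atomic.LoadUint64(&c.val) }

// expose renders the counters in the Prometheus text exposition format.
func expose(cs ...*counter) string {
	var b strings.Builder
	for _, c := range cs {
		fmt.Fprintf(&b, "# HELP %s %s\n", c.name, c.help)
		fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
		fmt.Fprintf(&b, "%s %d\n", c.name, c.Value())
	}
	return b.String()
}

// Hypothetical metric names; the proposal does not fix a naming scheme.
var (
	podsCreated   = &counter{name: "job_controller_pods_created_total", help: "Total number of created pods"}
	podsRestarted = &counter{name: "job_controller_pods_restarted_total", help: "Total number of restarted pods"}
	podsDeleted   = &counter{name: "job_controller_pods_deleted_total", help: "Total number of deleted pods"}
	podsFailed    = &counter{name: "job_controller_pods_failed_total", help: "Total number of failed pods"}
)

func main() {
	// Simulate a few lifecycle events that the controller might observe.
	podsCreated.Inc()
	podsCreated.Inc()
	podsFailed.Inc()
	podsRestarted.Inc()

	fmt.Print(expose(podsCreated, podsRestarted, podsDeleted, podsFailed))
}
```

Because the counters live in the shared controller code, any operator built on it would increment them simply by going through the common pod create, restart, and delete paths.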

The following metrics will be reported on each pod:

* CPU usage
* GPU usage
* Memory usage
* Network usage
* I/O usage
* Keep-Alive check
* Is-leader check
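
Unlike the lifecycle totals, per-pod values such as resource usage or the keep-alive and is-leader checks can go up and down, so they resemble Prometheus gauges labeled by pod. The sketch below is a stdlib-only illustration under that assumption; the `gaugeVec` type, the `pod` label, and the metric name `job_controller_pod_is_leader` are hypothetical, and boolean checks are encoded as 0 or 1, which is a common convention rather than something the proposal specifies.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// gaugeVec is a hypothetical gauge with one value per pod name.
type gaugeVec struct {
	mu     sync.Mutex
	name   string
	values map[string]float64 // keyed by pod name
}

func newGaugeVec(name string) *gaugeVec {
	return &gaugeVec{name: name, values: make(map[string]float64)}
}

// Set records the latest value for the given pod, replacing any earlier one.
func (g *gaugeVec) Set(pod string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[pod] = v
}

// expose renders the gauge in the Prometheus text exposition format,
// with pods sorted by name so the output is deterministic.
func (g *gaugeVec) expose() string {
	g.mu.Lock()
	defer g.mu.Unlock()

	pods := make([]string, 0, len(g.values))
	for pod := range g.values {
		pods = append(pods, pod)
	}
	sort.Strings(pods)

	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s gauge\n", g.name)
	for _, pod := range pods {
		fmt.Fprintf(&b, "%s{pod=%q} %g\n", g.name, pod, g.values[pod])
	}
	return b.String()
}

// boolToFloat encodes a boolean check as 1 (true) or 0 (false).
func boolToFloat(ok bool) float64 {
	if ok {
		return 1
	}
	return 0
}

func main() {
	isLeader := newGaugeVec("job_controller_pod_is_leader") // hypothetical name
	isLeader.Set("worker-0", boolToFloat(true))
	isLeader.Set("worker-1", boolToFloat(false))
	fmt.Print(isLeader.expose())
}
```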

## Job Metrics

The following metrics related to the lifecycle of jobs will be reported:

* The total number of created jobs
* The total number of deleted jobs
* The total number of completed jobs
* The total number of restarted jobs
* The total number of pending jobs
* The total number of failed jobs
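
One way to read the list above is that most entries are running totals, while the number of pending jobs rises and falls as jobs start. The sketch below illustrates that reading only; it is an assumption, not something the proposal states. The `jobEvent` values, the `jobMetrics` type, and the rule that a job leaves the pending state when it starts running are all hypothetical, and the type is not safe for concurrent use.

```go
package main

import "fmt"

// jobEvent is a hypothetical lifecycle event observed by the controller.
type jobEvent string

const (
	jobCreated   jobEvent = "created"
	jobRunning   jobEvent = "running" // job leaves the pending state
	jobRestarted jobEvent = "restarted"
	jobCompleted jobEvent = "completed"
	jobFailed    jobEvent = "failed"
	jobDeleted   jobEvent = "deleted"
)

// jobMetrics keeps running totals per event plus a current pending count.
type jobMetrics struct {
	totals  map[jobEvent]int // only ever increase
	pending int              // can go up and down
}

func newJobMetrics() *jobMetrics {
	return &jobMetrics{totals: make(map[jobEvent]int)}
}

// observe updates the metrics for a single lifecycle event.
func (m *jobMetrics) observe(e jobEvent) {
	switch e {
	case jobCreated:
		m.totals[e]++
		m.pending++ // assumption: a new job starts out pending
	case jobRunning:
		if m.pending > 0 {
			m.pending--
		}
	case jobRestarted, jobCompleted, jobFailed, jobDeleted:
		m.totals[e]++
	}
}

func main() {
	m := newJobMetrics()
	for _, e := range []jobEvent{jobCreated, jobCreated, jobRunning, jobCompleted} {
		m.observe(e)
	}
	fmt.Println("created:", m.totals[jobCreated])
	fmt.Println("completed:", m.totals[jobCompleted])
	fmt.Println("pending:", m.pending)
}
```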

## Service Metrics

The following metrics related to the lifecycle of services will be reported:

* The total number of succeeded service creations
* The total number of failed service creations
* The total number of restarted service creations
* The total number of service patches
* The total number of deleted services

## Scheduling Metrics

The following metrics related to scheduling will be reported:

* The total number of created pod disruption budgets
* The total number of deleted pod disruption budgets
* The total number of created pod groups
* The total number of deleted pod groups