# Metrics

## Updating metrics

When new metrics, labels, or exporters are added for Prometheus to scrape, make sure the following list is updated as well, so that it stays clear which metrics and labels are needed and which are not.
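
A quick way to cross-check this list against what is actually being scraped is Prometheus' label-values API, which returns every metric name currently stored. The following is only a sketch: it assumes the `prometheus-prometheus-0` pod name and port `9090` used elsewhere in this document, and that nothing else is bound to that port locally.

```bash
# expose the Prometheus HTTP API locally
kubectl port-forward prometheus-prometheus-0 9090:9090 &

# list every stored metric name, one per line, and filter for a prefix of interest
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | tr ',' '\n' | grep cortex_
```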

The following is a list of metrics that are currently in use.

### Cortex metrics

  1. cortex_in_flight_requests with the following labels:
    1. api_name
  2. cortex_async_request_count with the following labels:
    1. api_name
    2. api_kind
    3. status_code
  3. cortex_async_active with the following labels:
    1. api_name
    2. api_kind
  4. cortex_async_queued with the following labels:
    1. api_name
    2. api_kind
  5. cortex_async_in_flight with the following labels:
    1. api_name
    2. api_kind
  6. cortex_async_latency_bucket with the following labels:
    1. api_name
    2. api_kind
  7. cortex_batch_succeeded with the following labels:
    1. api_name
  8. cortex_batch_failed with the following labels:
    1. api_name
  9. cortex_time_per_batch_sum with the following labels:
    1. api_name
  10. cortex_time_per_batch_count with the following labels:
    1. api_name
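
As an example of how these fit together, the `cortex_time_per_batch_sum`/`cortex_time_per_batch_count` pair is typically combined into an average batch duration per API. This is a sketch that assumes the port-forward from the previous example is still running and uses an arbitrary 5-minute window:

```bash
# average time per batch, per api_name, over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (api_name) (rate(cortex_time_per_batch_sum[5m])) / sum by (api_name) (rate(cortex_time_per_batch_count[5m]))'
```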

### Istio metrics

  1. istio_requests_total with the following labels:
    1. destination_service
    2. response_code
  2. istio_request_duration_milliseconds_bucket with the following labels:
    1. destination_service
    2. le
  3. istio_request_duration_milliseconds_sum with the following labels:
    1. destination_service
  4. istio_request_duration_milliseconds_count with the following labels:
    1. destination_service
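
The `le` label on `istio_request_duration_milliseconds_bucket` is what makes latency quantiles computable. A sketch (same port-forward assumption, arbitrary window and quantile):

```bash
# approximate p99 request latency in milliseconds, per destination service
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (destination_service, le) (rate(istio_request_duration_milliseconds_bucket[5m])))'
```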

### Kubelet metrics

  1. container_cpu_usage_seconds_total with the following labels:
    1. pod
  2. container_memory_working_set_bytes with the following labels:
    1. pod

### Kube-state-metrics metrics

  1. kube_pod_container_resource_requests with the following labels:
    1. pod
    2. resource
  2. kube_pod_info with the following labels:
    1. pod
  3. kube_deployment_status_replicas_available with the following labels:
    1. deployment
  4. kube_job_status_active with the following labels:
    1. job_name
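
The Kubelet and kube-state-metrics series are typically joined on their common `pod` label; for example, a rough view of the memory working set relative to the memory request (sketch, same assumptions as above):

```bash
# memory working set as a fraction of requested memory, per pod
# (rough: sums every matching series for a pod)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (pod) (container_memory_working_set_bytes) / sum by (pod) (kube_pod_container_resource_requests{resource="memory"})'
```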

### DCGM metrics

  1. DCGM_FI_DEV_GPU_UTIL with the following labels:
    1. pod
  2. DCGM_FI_DEV_FB_USED with the following labels:
    1. pod
  3. DCGM_FI_DEV_FB_FREE with the following labels:
    1. pod
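
These DCGM gauges map GPU usage back to pods; for example (sketch, same port-forward assumption; DCGM reports framebuffer sizes in MiB):

```bash
# average GPU utilization (%) per pod
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=avg by (pod) (DCGM_FI_DEV_GPU_UTIL)'

# framebuffer memory used (MiB) per pod
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum by (pod) (DCGM_FI_DEV_FB_USED)'
```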

### Node metrics

  1. node_cpu_seconds_total with the following labels:
    1. job
    2. mode
    3. instance
    4. cpu
  2. node_load1 with the following labels:
    1. job
    2. instance
  3. node_load5 with the following labels:
    1. job
    2. instance
  4. node_load15 with the following labels:
    1. job
    2. instance
  5. node_exporter_build_info with the following labels:
    1. job
    2. instance
  6. node_memory_MemTotal_bytes with the following labels:
    1. job
    2. instance
  7. node_memory_MemFree_bytes with the following labels:
    1. job
    2. instance
  8. node_memory_Buffers_bytes with the following labels:
    1. job
    2. instance
  9. node_memory_Cached_bytes with the following labels:
    1. job
    2. instance
  10. node_memory_MemAvailable_bytes with the following labels:
    1. job
    2. instance
  11. node_disk_read_bytes_total with the following labels:
    1. job
    2. instance
    3. device
  12. node_disk_written_bytes_total with the following labels:
    1. job
    2. instance
    3. device
  13. node_disk_io_time_seconds_total with the following labels:
    1. job
    2. instance
    3. device
  14. node_filesystem_size_bytes with the following labels:
    1. job
    2. instance
    3. fstype
    4. mountpoint
    5. device
  15. node_filesystem_avail_bytes with the following labels:
    1. job
    2. instance
    3. fstype
    4. device
  16. node_network_receive_bytes_total with the following labels:
    1. job
    2. instance
    3. device
  17. node_network_transmit_bytes_total with the following labels:
    1. job
    2. instance
    3. device
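
Taken together, these node-exporter series answer node-level capacity questions, e.g. memory utilization per node (sketch, same port-forward assumption):

```bash
# fraction of memory in use per node (1 means fully used)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'
```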

### Prometheus rules for the node exporter

  1. instance:node_cpu_utilisation:rate1m from the following metrics:
    1. node_cpu_seconds_total with the following labels:
      1. job
      2. mode
  2. instance:node_num_cpu:sum from the following metrics:
    1. node_cpu_seconds_total with the following labels:
      1. job
  3. instance:node_load1_per_cpu:ratio from the following metrics:
    1. node_load1 with the following labels:
      1. job
  4. instance:node_memory_utilisation:ratio from the following metrics:
    1. node_memory_MemTotal_bytes with the following labels:
      1. job
    2. node_memory_MemAvailable_bytes with the following labels:
      1. job
  5. instance:node_vmstat_pgmajfault:rate1m with the following metrics:
    1. node_vmstat_pgmajfault with the following labels:
      1. job
  6. instance_device:node_disk_io_time_seconds:rate1m with the following metrics:
    1. node_disk_io_time_seconds_total with the following labels:
      1. job
      2. device
  7. instance_device:node_disk_io_time_weighted_seconds:rate1m with the following metrics:
    1. node_disk_io_time_weighted_seconds with the following labels:
      1. job
      2. device
  8. instance:node_network_receive_bytes_excluding_lo:rate1m with the following metrics:
    1. node_network_receive_bytes_total with the following labels:
      1. job
      2. device
  9. instance:node_network_transmit_bytes_excluding_lo:rate1m with the following metrics:
    1. node_network_transmit_bytes_total with the following labels:
      1. job
      2. device
  10. instance:node_network_receive_drop_excluding_lo:rate1m with the following metrics:
    1. node_network_receive_drop_total with the following labels:
      1. job
      2. device
  11. instance:node_network_transmit_drop_excluding_lo:rate1m with the following metrics:
    1. node_network_transmit_drop_total with the following labels:
      1. job
      2. device
  12. cluster:cpu_utilization:ratio with the following metrics:
    1. instance:node_cpu_utilisation:rate1m
    2. instance:node_num_cpu:sum
  13. cluster:load1:ratio with the following metrics:
    1. instance:node_load1_per_cpu:ratio
  14. cluster:memory_utilization:ratio with the following metrics:
    1. instance:node_memory_utilisation:ratio
  15. cluster:vmstat_pgmajfault:rate1m with the following metrics:
    1. instance:node_vmstat_pgmajfault:rate1m
  16. cluster:network_receive_bytes_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_receive_bytes_excluding_lo:rate1m
  17. cluster:network_transmit_bytes_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_transmit_bytes_excluding_lo:rate1m
  18. cluster:network_receive_drop_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_receive_drop_excluding_lo:rate1m
  19. cluster:network_transmit_drop_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_transmit_drop_excluding_lo:rate1m
  20. cluster:disk_io_utilization:ratio with the following metrics:
    1. instance_device:node_disk_io_time_seconds:rate1m
  21. cluster:disk_io_saturation:ratio with the following metrics:
    1. instance_device:node_disk_io_time_weighted_seconds:rate1m
  22. cluster:disk_space_utilization:ratio with the following metrics:
    1. node_filesystem_size_bytes with the following labels:
      1. job
      2. fstype
      3. mountpoint
    2. node_filesystem_avail_bytes with the following labels:
      1. job
      2. fstype
      3. mountpoint
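
To confirm that these recording rules are loaded and evaluating cleanly, the standard rules endpoint of the Prometheus API can be inspected, and any rule can be queried by name like a normal series (sketch, same port-forward assumption):

```bash
# inspect the loaded recording rules and their evaluation health
curl -s 'http://localhost:9090/api/v1/rules' | less

# query one of the cluster-level rules directly
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=cluster:cpu_utilization:ratio'
```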

## Re-introducing dropped metrics/labels

If you need to add some metrics/labels back for a particular use case, comment out every `metricRelabelings:` section (except the one in the `prometheus-operator.yaml` file), determine which metrics/labels you want to add back (e.g. by using the Explore view in Grafana), and then re-edit the appropriate `metricRelabelings:` sections to account for the un-dropped metrics/labels.
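
A sketch of that workflow from the command line; `manifests/` is a placeholder for wherever the `metricRelabelings:` sections live in your deployment, and the final check assumes the port-forward used earlier in this document:

```bash
# locate every metricRelabelings: section (adjust the path to your layout)
grep -rn 'metricRelabelings:' manifests/

# after re-applying the configuration, confirm the metric is scraped again
# (replace the grep pattern with the metric you re-introduced)
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | tr ',' '\n' | grep some_metric_name
```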

# Prometheus Analysis

## Go Pprof

To analyse the memory allocations of Prometheus, run `kubectl port-forward prometheus-prometheus-0 9090:9090`, and then run `go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap`. Once you get the interactive `(pprof)` prompt, you can run `top` or `dot` for a more detailed breakdown of the memory usage.
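
The same steps as a copy-pasteable block (pod name, port, and flags as above):

```bash
# expose the Prometheus HTTP endpoint locally
kubectl port-forward prometheus-prometheus-0 9090:9090

# in another terminal: fetch the heap profile and open the interactive pprof prompt
go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap
# at the (pprof) prompt, `top` lists the largest in-use allocations
# and `dot` emits a call-graph view of them
```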

## TSDB

To analyse the TSDB of Prometheus, exec into the `prometheus-prometheus-0` pod, `cd` into `/tmp`, and run the following code block:

```bash
wget https://github.com/prometheus/prometheus/releases/download/v1.7.3/prometheus-1.7.3.linux-amd64.tar.gz
tar -xzf prometheus-*
cd prometheus-*
./tsdb analyze /prometheus | less
```

Useful link: https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality

Alternatively, you can go to `localhost:9090` -> Status -> TSDB Status in the Prometheus UI, but it's not as complete as running the binary analysis above.