# Metrics

## Updating metrics

When new metrics, labels, or exporters are added for Prometheus to scrape, make sure the following list is updated as well, so that it stays clear which metrics and labels are needed and which are not.
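
A quick way to cross-check this list against what is actually being scraped is Prometheus' label-values API, which returns every metric name currently stored. The following is only a sketch: it assumes the `prometheus-prometheus-0` pod name and port `9090` used elsewhere in this document, and that nothing else is bound to that port locally.

```bash
# expose the Prometheus HTTP API locally
kubectl port-forward prometheus-prometheus-0 9090:9090 &

# list every stored metric name, one per line, and filter for a prefix of interest
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | tr ',' '\n' | grep cortex_
```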

The following is a list of metrics that are currently in use.

### Cortex metrics

  1. cortex_in_flight_requests with the following labels:
    1. api_name
  2. cortex_async_request_count with the following labels:
    1. api_name
    2. api_kind
    3. status_code
  3. cortex_async_active with the following labels:
    1. api_name
    2. api_kind
  4. cortex_async_queued with the following labels:
    1. api_name
    2. api_kind
  5. cortex_async_in_flight with the following labels:
    1. api_name
    2. api_kind
  6. cortex_async_latency_bucket with the following labels:
    1. api_name
    2. api_kind
  7. cortex_batch_succeeded with the following labels:
    1. api_name
  8. cortex_batch_failed with the following labels:
    1. api_name
  9. cortex_time_per_batch_sum with the following labels:
    1. api_name
  10. cortex_time_per_batch_count with the following labels:
    1. api_name
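
As an example of how these fit together, the `cortex_time_per_batch_sum`/`cortex_time_per_batch_count` pair is typically combined into an average batch duration per API. This is a sketch that assumes the port-forward from the previous example is still running and uses an arbitrary 5-minute window:

```bash
# average time per batch, per api_name, over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (api_name) (rate(cortex_time_per_batch_sum[5m])) / sum by (api_name) (rate(cortex_time_per_batch_count[5m]))'
```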

### Istio metrics

  1. istio_requests_total with the following labels:
    1. destination_service
    2. response_code
  2. istio_request_duration_milliseconds_bucket with the following labels:
    1. destination_service
    2. le
  3. istio_request_duration_milliseconds_sum with the following labels:
    1. destination_service
  4. istio_request_duration_milliseconds_count with the following labels:
    1. destination_service
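
The `le` label on `istio_request_duration_milliseconds_bucket` is what makes latency quantiles computable. A sketch (same port-forward assumption, arbitrary window and quantile):

```bash
# approximate p99 request latency in milliseconds, per destination service
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (destination_service, le) (rate(istio_request_duration_milliseconds_bucket[5m])))'
```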

### Kubelet metrics

  1. container_cpu_usage_seconds_total with the following labels:
    1. pod
  2. container_memory_working_set_bytes with the following labels:
    1. pod

### Kube-state-metrics metrics

  1. kube_pod_container_resource_requests with the following labels:
    1. pod
    2. resource
  2. kube_pod_info with the following labels:
    1. pod
  3. kube_deployment_status_replicas_available with the following labels:
    1. deployment
  4. kube_job_status_active with the following labels:
    1. job_name
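
The Kubelet and kube-state-metrics series are typically joined on their common `pod` label; for example, a rough view of the memory working set relative to the memory request (sketch, same assumptions as above):

```bash
# memory working set as a fraction of requested memory, per pod
# (rough: sums every matching series for a pod)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (pod) (container_memory_working_set_bytes) / sum by (pod) (kube_pod_container_resource_requests{resource="memory"})'
```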

### DCGM metrics

  1. DCGM_FI_DEV_GPU_UTIL with the following labels:
    1. pod
  2. DCGM_FI_DEV_FB_USED with the following labels:
    1. pod
  3. DCGM_FI_DEV_FB_FREE with the following labels:
    1. pod
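
These DCGM gauges map GPU usage back to pods; for example (sketch, same port-forward assumption; DCGM reports framebuffer sizes in MiB):

```bash
# average GPU utilization (%) per pod
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=avg by (pod) (DCGM_FI_DEV_GPU_UTIL)'

# framebuffer memory used (MiB) per pod
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum by (pod) (DCGM_FI_DEV_FB_USED)'
```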

### Node metrics

  1. node_cpu_seconds_total with the following labels:
    1. job
    2. mode
    3. instance
    4. cpu
  2. node_load1 with the following labels:
    1. job
    2. instance
  3. node_load5 with the following labels:
    1. job
    2. instance
  4. node_load15 with the following labels:
    1. job
    2. instance
  5. node_exporter_build_info with the following labels:
    1. job
    2. instance
  6. node_memory_MemTotal_bytes with the following labels:
    1. job
    2. instance
  7. node_memory_MemFree_bytes with the following labels:
    1. job
    2. instance
  8. node_memory_Buffers_bytes with the following labels:
    1. job
    2. instance
  9. node_memory_Cached_bytes with the following labels:
    1. job
    2. instance
  10. node_memory_MemAvailable_bytes with the following labels:
    1. job
    2. instance
  11. node_disk_read_bytes_total with the following labels:
    1. job
    2. instance
    3. device
  12. node_disk_written_bytes_total with the following labels:
    1. job
    2. instance
    3. device
  13. node_disk_io_time_seconds_total with the following labels:
    1. job
    2. instance
    3. device
  14. node_filesystem_size_bytes with the following labels:
    1. job
    2. instance
    3. fstype
    4. mountpoint
    5. device
  15. node_filesystem_avail_bytes with the following labels:
    1. job
    2. instance
    3. fstype
    4. device
  16. node_network_receive_bytes_total with the following labels:
    1. job
    2. instance
    3. device
  17. node_network_transmit_bytes_total with the following labels:
    1. job
    2. instance
    3. device
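
Taken together, these node-exporter series answer node-level capacity questions, e.g. memory utilization per node (sketch, same port-forward assumption):

```bash
# fraction of memory in use per node (1 means fully used)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'
```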

### Prometheus rules for the node exporter

  1. instance:node_cpu_utilisation:rate1m from the following metrics:
    1. node_cpu_seconds_total with the following labels:
      1. job
      2. mode
  2. instance:node_num_cpu:sum from the following metrics:
    1. node_cpu_seconds_total with the following labels:
      1. job
  3. instance:node_load1_per_cpu:ratio from the following metrics:
    1. node_load1 with the following labels:
      1. job
  4. instance:node_memory_utilisation:ratio from the following metrics:
    1. node_memory_MemTotal_bytes with the following labels:
      1. job
    2. node_memory_MemAvailable_bytes with the following labels:
      1. job
  5. instance:node_vmstat_pgmajfault:rate1m with the following metrics:
    1. node_vmstat_pgmajfault with the following labels:
      1. job
  6. instance_device:node_disk_io_time_seconds:rate1m with the following metrics:
    1. node_disk_io_time_seconds_total with the following labels:
      1. job
      2. device
  7. instance_device:node_disk_io_time_weighted_seconds:rate1m with the following metrics:
    1. node_disk_io_time_weighted_seconds with the following labels:
      1. job
      2. device
  8. instance:node_network_receive_bytes_excluding_lo:rate1m with the following metrics:
    1. node_network_receive_bytes_total with the following labels:
      1. job
      2. device
  9. instance:node_network_transmit_bytes_excluding_lo:rate1m with the following metrics:
    1. node_network_transmit_bytes_total with the following labels:
      1. job
      2. device
  10. instance:node_network_receive_drop_excluding_lo:rate1m with the following metrics:
    1. node_network_receive_drop_total with the following labels:
      1. job
      2. device
  11. instance:node_network_transmit_drop_excluding_lo:rate1m with the following metrics:
    1. node_network_transmit_drop_total with the following labels:
      1. job
      2. device
  12. cluster:cpu_utilization:ratio with the following metrics:
    1. instance:node_cpu_utilisation:rate1m
    2. instance:node_num_cpu:sum
  13. cluster:load1:ratio with the following metrics:
    1. instance:node_load1_per_cpu:ratio
  14. cluster:memory_utilization:ratio with the following metrics:
    1. instance:node_memory_utilisation:ratio
  15. cluster:vmstat_pgmajfault:rate1m with the following metrics:
    1. instance:node_vmstat_pgmajfault:rate1m
  16. cluster:network_receive_bytes_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_receive_bytes_excluding_lo:rate1m
  17. cluster:network_transmit_bytes_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_transmit_bytes_excluding_lo:rate1m
  18. cluster:network_receive_drop_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_receive_drop_excluding_lo:rate1m
  19. cluster:network_transmit_drop_excluding_lo:rate1m with the following metrics:
    1. instance:node_network_transmit_drop_excluding_lo:rate1m
  20. cluster:disk_io_utilization:ratio with the following metrics:
    1. instance_device:node_disk_io_time_seconds:rate1m
  21. cluster:disk_io_saturation:ratio with the following metrics:
    1. instance_device:node_disk_io_time_weighted_seconds:rate1m
  22. cluster:disk_space_utilization:ratio with the following metrics:
    1. node_filesystem_size_bytes with the following labels:
      1. job
      2. fstype
      3. mountpoint
    2. node_filesystem_avail_bytes with the following labels:
      1. job
      2. fstype
      3. mountpoint
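
To confirm that these recording rules are loaded and evaluating cleanly, the standard rules endpoint of the Prometheus API can be inspected, and any rule can be queried by name like a normal series (sketch, same port-forward assumption):

```bash
# inspect the loaded recording rules and their evaluation health
curl -s 'http://localhost:9090/api/v1/rules' | less

# query one of the cluster-level rules directly
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=cluster:cpu_utilization:ratio'
```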

## Re-introducing dropped metrics/labels

If you need to add some metrics/labels back for a particular use case, comment out every `metricRelabelings:` section (except the one in the `prometheus-operator.yaml` file), determine which metrics/labels you want to add back (e.g. by using the Explore view in Grafana), and then re-edit the appropriate `metricRelabelings:` sections to account for the un-dropped metrics/labels.
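
A sketch of that workflow from the command line; `manifests/` is a placeholder for wherever the `metricRelabelings:` sections live in your deployment, and the final check assumes the port-forward used earlier in this document:

```bash
# locate every metricRelabelings: section (adjust the path to your layout)
grep -rn 'metricRelabelings:' manifests/

# after re-applying the configuration, confirm the metric is scraped again
# (replace the grep pattern with the metric you re-introduced)
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | tr ',' '\n' | grep some_metric_name
```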

# Prometheus Analysis

## Go Pprof

To analyse the memory allocations of Prometheus, run `kubectl port-forward prometheus-prometheus-0 9090:9090`, and then run `go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap`. Once you get the interactive `(pprof)` prompt, you can run `top` or `dot` for a more detailed breakdown of the memory usage.
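
The same steps as a copy-pasteable block (pod name, port, and flags as above):

```bash
# expose the Prometheus HTTP endpoint locally
kubectl port-forward prometheus-prometheus-0 9090:9090

# in another terminal: fetch the heap profile and open the interactive pprof prompt
go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap
# at the (pprof) prompt, `top` lists the largest in-use allocations
# and `dot` emits a call-graph view of them
```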

## TSDB

To analyse the TSDB of Prometheus, exec into the `prometheus-prometheus-0` pod, `cd` into `/tmp`, and run the following code block:

```bash
wget https://github.com/prometheus/prometheus/releases/download/v1.7.3/prometheus-1.7.3.linux-amd64.tar.gz
tar -xzf prometheus-*
cd prometheus-*
./tsdb analyze /prometheus | less
```

Useful link: https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality

Alternatively, you can go to `localhost:9090` -> Status -> TSDB Status in the Prometheus UI, but it's not as complete as running the binary analysis above.