When new metrics/labels/exporters are added to be scraped by Prometheus, make sure the following list is updated as well, so it stays clear which metrics/labels are needed and which are not.
The following is a list of metrics that are currently in use; a sketch for cross-checking it against what Prometheus is actually scraping follows the list.
- `cortex_in_flight_requests` with the following labels:
  - `api_name`
- `cortex_async_request_count` with the following labels:
  - `api_name`
  - `api_kind`
  - `status_code`
- `cortex_async_active` with the following labels:
  - `api_name`
  - `api_kind`
- `cortex_async_queued` with the following labels:
  - `api_name`
  - `api_kind`
- `cortex_async_in_flight` with the following labels:
  - `api_name`
  - `api_kind`
- `cortex_async_latency_bucket` with the following labels:
  - `api_name`
  - `api_kind`
- `cortex_batch_succeeded` with the following labels:
  - `api_name`
- `cortex_batch_failed` with the following labels:
  - `api_name`
- `cortex_time_per_batch_sum` with the following labels:
  - `api_name`
- `cortex_time_per_batch_count` with the following labels:
  - `api_name`
- `istio_requests_total` with the following labels:
  - `destination_service`
  - `response_code`
- `istio_request_duration_milliseconds_bucket` with the following labels:
  - `destination_service`
  - `le`
- `istio_request_duration_milliseconds_sum` with the following labels:
  - `destination_service`
- `istio_request_duration_milliseconds_count` with the following labels:
  - `destination_service`
- `container_cpu_usage_seconds_total` with the following labels:
  - `pod`
- `container_memory_working_set_bytes` with the following labels:
  - `pod`
- `kube_pod_container_resource_requests` with the following labels:
  - `pod`
  - `resource`
- `kube_pod_info` with the following labels:
  - `pod`
- `kube_deployment_status_replicas_available` with the following labels:
  - `deployment`
- `kube_job_status_active` with the following labels:
  - `job_name`
- `DCGM_FI_DEV_GPU_UTIL` with the following labels:
  - `pod`
- `DCGM_FI_DEV_FB_USED` with the following labels:
  - `pod`
- `DCGM_FI_DEV_FB_FREE` with the following labels:
  - `pod`
- `node_cpu_seconds_total` with the following labels:
  - `job`
  - `mode`
  - `instance`
  - `cpu`
- `node_load1` with the following labels:
  - `job`
  - `instance`
- `node_load5` with the following labels:
  - `job`
  - `instance`
- `node_load15` with the following labels:
  - `job`
  - `instance`
- `node_exporter_build_info` with the following labels:
  - `job`
  - `instance`
- `node_memory_MemTotal_bytes` with the following labels:
  - `job`
  - `instance`
- `node_memory_MemFree_bytes` with the following labels:
  - `job`
  - `instance`
- `node_memory_Buffers_bytes` with the following labels:
  - `job`
  - `instance`
- `node_memory_Cached_bytes` with the following labels:
  - `job`
  - `instance`
- `node_memory_MemAvailable_bytes` with the following labels:
  - `job`
  - `instance`
- `node_disk_read_bytes_total` with the following labels:
  - `job`
  - `instance`
  - `device`
- `node_disk_written_bytes_total` with the following labels:
  - `job`
  - `instance`
  - `device`
- `node_disk_io_time_seconds_total` with the following labels:
  - `job`
  - `instance`
  - `device`
- `node_filesystem_size_bytes` with the following labels:
  - `job`
  - `instance`
  - `fstype`
  - `mountpoint`
  - `device`
- `node_filesystem_avail_bytes` with the following labels:
  - `job`
  - `instance`
  - `fstype`
  - `device`
- `node_network_receive_bytes_total` with the following labels:
  - `job`
  - `instance`
  - `device`
- `node_network_transmit_bytes_total` with the following labels:
  - `job`
  - `instance`
  - `device`
- `instance:node_cpu_utilisation:rate1m` from the following metrics:
  - `node_cpu_seconds_total` with the following labels:
    - `job`
    - `mode`
- `instance:node_num_cpu:sum` from the following metrics:
  - `node_cpu_seconds_total` with the following labels:
    - `job`
- `instance:node_load1_per_cpu:ratio` from the following metrics:
  - `node_load1` with the following labels:
    - `job`
- `instance:node_memory_utilisation:ratio` from the following metrics:
  - `node_memory_MemTotal_bytes` with the following labels:
    - `job`
  - `node_memory_MemAvailable_bytes` with the following labels:
    - `job`
- `instance:node_vmstat_pgmajfault:rate1m` from the following metrics:
  - `node_vmstat_pgmajfault` with the following labels:
    - `job`
- `instance_device:node_disk_io_time_seconds:rate1m` from the following metrics:
  - `node_disk_io_time_seconds_total` with the following labels:
    - `job`
    - `device`
- `instance_device:node_disk_io_time_weighted_seconds:rate1m` from the following metrics:
  - `node_disk_io_time_weighted_seconds` with the following labels:
    - `job`
    - `device`
- `instance:node_network_receive_bytes_excluding_lo:rate1m` from the following metrics:
  - `node_network_receive_bytes_total` with the following labels:
    - `job`
    - `device`
- `instance:node_network_transmit_bytes_excluding_lo:rate1m` from the following metrics:
  - `node_network_transmit_bytes_total` with the following labels:
    - `job`
    - `device`
- `instance:node_network_receive_drop_excluding_lo:rate1m` from the following metrics:
  - `node_network_receive_drop_total` with the following labels:
    - `job`
    - `device`
- `instance:node_network_transmit_drop_excluding_lo:rate1m` from the following metrics:
  - `node_network_transmit_drop_total` with the following labels:
    - `job`
    - `device`
- `cluster:cpu_utilization:ratio` from the following metrics:
  - `instance:node_cpu_utilisation:rate1m`
  - `instance:node_num_cpu:sum`
- `cluster:load1:ratio` from the following metrics:
  - `instance:node_load1_per_cpu:ratio`
- `cluster:memory_utilization:ratio` from the following metrics:
  - `instance:node_memory_utilisation:ratio`
- `cluster:vmstat_pgmajfault:rate1m` from the following metrics:
  - `instance:node_vmstat_pgmajfault:rate1m`
- `cluster:network_receive_bytes_excluding_low:rate1m` from the following metrics:
  - `instance:node_network_receive_bytes_excluding_lo:rate1m`
- `cluster:network_transmit_bytes_excluding_lo:rate1m` from the following metrics:
  - `instance:node_network_transmit_bytes_excluding_lo:rate1m`
- `cluster:network_receive_drop_excluding_lo:rate1m` from the following metrics:
  - `instance:node_network_receive_drop_excluding_lo:rate1m`
- `cluster:network_transmit_drop_excluding_lo:rate1m` from the following metrics:
  - `instance:node_network_transmit_drop_excluding_lo:rate1m`
- `cluster:disk_io_utilization:ratio` from the following metrics:
  - `instance_device:node_disk_io_time_seconds:rate1m`
- `cluster:disk_io_saturation:ratio` from the following metrics:
  - `instance_device:node_disk_io_time_weighted_seconds:rate1m`
- `cluster:disk_space_utilization:ratio` from the following metrics:
  - `node_filesystem_size_bytes` with the following labels:
    - `job`
    - `fstype`
    - `mountpoint`
  - `node_filesystem_avail_bytes` with the following labels:
    - `job`
    - `fstype`
    - `mountpoint`
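When updating the list, it can help to cross-check it against the metric names Prometheus is actually ingesting. A minimal sketch, assuming the `prometheus-prometheus-0` pod name used below and that `curl` and `jq` are available locally:

```bash
# Forward the Prometheus API port, then list every scraped metric name
# and filter it down to the metric families tracked in the list above.
kubectl port-forward prometheus-prometheus-0 9090:9090 &
sleep 2
curl -s 'http://localhost:9090/api/v1/label/__name__/values' \
  | jq -r '.data[]' \
  | grep -E '^(cortex_|istio_|container_|kube_|DCGM_|node_)' \
  | sort
```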
If you need to add some metrics/labels back for a particular use case, comment out every `metricRelabelings:` section (except the one from the `prometheus-operator.yaml` file), determine which metrics/labels you want to add back (e.g. by using Grafana's Explore page), and then re-edit the appropriate `metricRelabelings:` sections to account for the un-dropped metrics/labels.
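To see which `metricRelabelings:` sections are currently applied (and therefore which ones to comment out or re-edit), something along these lines can help — a sketch that assumes the prometheus-operator ServiceMonitor/PodMonitor CRDs are installed in the cluster:

```bash
# Dump all ServiceMonitor/PodMonitor definitions and show their metricRelabelings:
# sections with some context, so the relevant resources can be identified.
kubectl get servicemonitors,podmonitors --all-namespaces -o yaml \
  | grep -n -A 10 'metricRelabelings:'
```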
To analyse the memory allocations of Prometheus, run `kubectl port-forward prometheus-prometheus-0 9090:9090`, and then run `go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap`. Once you get the interactive prompt, you can run `top`, or `dot` for a more detailed hierarchy of the memory usage.
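Put together, the sequence looks roughly like this:

```bash
# Forward the Prometheus API port to localhost.
kubectl port-forward prometheus-prometheus-0 9090:9090 &

# Fetch the heap profile and open the interactive pprof prompt,
# reporting in-use memory rather than cumulative allocations.
go tool pprof -symbolize=remote -inuse_space localhost:9090/debug/pprof/heap

# At the (pprof) prompt:
#   top   shows the largest in-use allocations
#   dot   renders the allocation hierarchy as a graph
```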
To analyse the TSDB of Prometheus, exec into the `prometheus-prometheus-0` pod, `cd` into `/tmp`, and run the following code-block:
```bash
wget https://github.com/prometheus/prometheus/releases/download/v1.7.3/prometheus-1.7.3.linux-amd64.tar.gz
tar -xzf prometheus-*
cd prometheus-*
./tsdb analyze /prometheus | less
```
Useful link: https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality
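Newer Prometheus images also bundle `promtool`, so the download may not be necessary — a sketch, assuming the `promtool` shipped in the pod's image is recent enough to include the `tsdb analyze` subcommand:

```bash
# Run the bundled analyzer directly against the TSDB data directory inside the pod.
kubectl exec prometheus-prometheus-0 -c prometheus -- promtool tsdb analyze /prometheus | less
```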
Or you can go to `localhost:9090` -> `Status` -> `TSDB Status`, but it's not as complete as running a binary analysis.