This document describes instruments and labels for common system level metrics in OpenTelemetry. Consider the general metric semantic conventions when creating instruments not explicitly defined in the specification.
- Metric Instruments
  - `system.cpu.` - Processor metrics
  - `system.memory.` - Memory metrics
  - `system.paging.` - Paging/swap metrics
  - `system.disk.` - Disk controller metrics
  - `system.filesystem.` - Filesystem metrics
  - `system.network.` - Network metrics
  - `system.process.` - Aggregate system process metrics
  - `system.{os}.` - OS Specific System Metrics
Description: System level processor metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key(s) | Label Values |
|---|---|---|---|---|---|---|
| system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. |
| | | | | | cpu | CPU number [0..n-1] |
| system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. |
| | | | | | cpu | CPU number [0..n-1] |
Description: System level memory metrics. This does not include paging/swap memory.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
|---|---|---|---|---|---|---|
| system.memory.usage | | By | UpDownSumObserver | Int64 | state | used, free, cached, etc. |
| system.memory.utilization | | 1 | ValueObserver | Double | state | used, free, cached, etc. |
Description: System level paging/swap memory metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
|---|---|---|---|---|---|---|
| system.paging.usage | Unix swap or Windows pagefile usage | By | UpDownSumObserver | Int64 | state | used, free |
| system.paging.utilization | | 1 | ValueObserver | Double | state | used, free |
| system.paging.faults | | {faults} | SumObserver | Int64 | type | major, minor |
| system.paging.operations | | {operations} | SumObserver | Int64 | type | major, minor |
| | | | | | direction | in, out |
Description: System level disk performance metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
|---|---|---|---|---|---|---|
| system.disk.io | | By | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | read, write |
| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | read, write |
| system.disk.io_time<sup>1</sup> | Time disk spent activated | s | SumObserver | Double | device | (identifier) |
| system.disk.operation_time<sup>2</sup> | Sum of the time each operation took to complete | s | SumObserver | Double | device | (identifier) |
| | | | | | direction | read, write |
| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | read, write |
<sup>1</sup> The real elapsed time ("wall clock") used in the I/O path (time from operations running in parallel is not counted). Measured as:
- Linux: Field 13 from procfs-diskstats
- Windows: The complement of the "Disk\% Idle Time" performance counter: `uptime * (100 - "Disk\% Idle Time") / 100`
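
The Windows formula above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `disk_io_time_seconds` and its parameters are hypothetical, and the caller is assumed to have already obtained the uptime in seconds and the idle-time counter as a percentage. The sketch does not query any performance counter itself.

```python
def disk_io_time_seconds(uptime_seconds: float, idle_time_percent: float) -> float:
    """Estimate system.disk.io_time (seconds) from an idle-time percentage.

    Both arguments are assumed to be supplied by the caller (hypothetical API).
    """
    if not 0.0 <= idle_time_percent <= 100.0:
        raise ValueError("idle_time_percent must be within [0, 100]")
    # Mirrors the formula in the footnote:
    #   uptime * (100 - "Disk\% Idle Time") / 100
    return uptime_seconds * (100.0 - idle_time_percent) / 100.0
```

For example, a disk that was idle 75% of a 1000 second uptime would report 250 seconds of `system.disk.io_time`.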
<sup>2</sup> Because it is the sum of time each request took, parallel-issued requests each contribute to make the count grow. Measured as:
- Linux: Fields 7 & 11 from procfs-diskstats
- Windows: "Avg. Disk sec/Read" perf counter multiplied by "Disk Reads/sec" perf counter (similar for Writes)
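
The Linux measurements in the two footnotes above could be read roughly as follows. This is a minimal sketch, not a reference implementation: `parse_diskstats_line` is a hypothetical helper, and it assumes that field numbers are 1-based and count the major number, minor number, and device name as fields 1 to 3, and that the time fields are reported in milliseconds.

```python
from typing import Dict


def parse_diskstats_line(line: str) -> Dict[str, object]:
    """Extract io_time and operation_time (in seconds) from one diskstats line."""
    fields = line.split()
    if len(fields) < 13:
        raise ValueError("expected at least 13 whitespace-separated fields")
    return {
        # Field 3 (assumed): device name, used as the `device` label.
        "device": fields[2],
        # Field 13: time spent doing I/O, converted from ms (assumed) to s.
        "io_time": int(fields[12]) / 1000.0,
        # Fields 7 and 11: time spent reading / writing, split by `direction`.
        "operation_time": {
            "read": int(fields[6]) / 1000.0,
            "write": int(fields[10]) / 1000.0,
        },
    }
```

Each returned value maps onto one data point of `system.disk.io_time` or `system.disk.operation_time`, with `device` and `direction` as labels.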
Description: System level filesystem metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
|---|---|---|---|---|---|---|
| system.filesystem.usage | | By | UpDownSumObserver | Int64 | device | (identifier) |
| | | | | | state | used, free, reserved |
| | | | | | type | ext4, tmpfs, etc. |
| | | | | | mode | rw, ro, etc. |
| | | | | | mountpoint | (path) |
| system.filesystem.utilization | | 1 | ValueObserver | Double | device | (identifier) |
| | | | | | state | used, free, reserved |
| | | | | | type | ext4, tmpfs, etc. |
| | | | | | mode | rw, ro, etc. |
| | | | | | mountpoint | (path) |
Description: System level network metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
|---|---|---|---|---|---|---|
| system.network.dropped<sup>1</sup> | Count of packets that are dropped or discarded even though there was no error | {packets} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.errors<sup>2</sup> | Count of network errors detected | {errors} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.io | | By | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) |
| | | | | | protocol | tcp, udp, etc. |
| | | | | | state | e.g. for tcp |
<sup>1</sup> Measured as:
- Linux: the `drop` column in `/proc/net/dev` (source).
- Windows: `InDiscards`/`OutDiscards` from `GetIfEntry2`.

<sup>2</sup> Measured as:
- Linux: the `errs` column in `/proc/net/dev` (source).
- Windows: `InErrors`/`OutErrors` from `GetIfEntry2`.
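
On Linux, the `drop` and `errs` columns mentioned in the footnotes above could be extracted along these lines. This is a sketch under assumptions: `parse_net_dev` is a hypothetical helper, and it assumes the usual layout of the file, namely two header lines followed by one `name: counters...` line per interface, with eight receive counters (bytes, packets, errs, drop, ...) followed by eight transmit counters in the same leading order.

```python
from typing import Dict


def parse_net_dev(text: str) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Map each interface to its dropped/errors counts per direction."""
    result: Dict[str, Dict[str, Dict[str, int]]] = {}
    # Skip the two header lines (assumed layout).
    for line in text.splitlines()[2:]:
        name, sep, counters = line.partition(":")
        if not sep:
            continue  # not an interface line
        values = [int(v) for v in counters.split()]
        if len(values) < 12:
            continue  # unexpected layout, ignore rather than guess
        result[name.strip()] = {
            # Receive columns: errs at index 2, drop at index 3.
            # Transmit columns start at index 8: errs at 10, drop at 11.
            "dropped": {"receive": values[3], "transmit": values[11]},
            "errors": {"receive": values[2], "transmit": values[10]},
        }
    return result
```

The interface name corresponds to the `device` label, and the `receive`/`transmit` keys correspond to the `direction` label of `system.network.dropped` and `system.network.errors`.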
Description: System level aggregate process metrics. For metrics at the individual process level, see process metrics.
Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
---|---|---|---|---|---|---|
system.process.count | Total number of processes in each state | {processes} | UpDownSumObserver | Int64 | status | running, sleeping, etc. |
Instrument names for system level metrics that have different and conflicting
meaning across multiple OSes should be prefixed with `system.{os}.` and
follow the hierarchies listed above for different entities like CPU, memory,
and network.
For example, the UNIX load average over a given interval is not well standardized, and its value may vary across UNIX-like OSes even under similar load:
> Without getting into the vagaries of every Unix-like operating system in existence, the load average more or less represents the average number of processes that are in the running (using the CPU) or runnable (waiting for the CPU) states. One notable exception exists: Linux includes processes in uninterruptible sleep states, typically waiting for some I/O activity to complete. This can markedly increase the load average on Linux systems.

(source of quote, Linux source code)
An instrument for load average over 1 minute on Linux could be named
`system.linux.cpu.load_1m`, reusing the `cpu` name proposed above and having
an `{os}` prefix to split this metric across OSes.
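
The naming rule can be illustrated with a short sketch. The helper names below are hypothetical, and deriving the `{os}` segment by lowercasing `platform.system()` is an assumption made for illustration, not something this document prescribes. The sketch only builds the name and reads a value; it does not register an instrument with any metrics API.

```python
import os
import platform
from typing import Optional, Tuple


def os_specific_instrument_name(entity_and_metric: str,
                                os_name: Optional[str] = None) -> str:
    """Build a name of the form system.{os}.<entity>.<metric>."""
    if os_name is None:
        # Assumption: use the lowercased platform name as the {os} segment.
        os_name = platform.system()
    return f"system.{os_name.lower()}.{entity_and_metric}"


def observe_load_1m() -> Optional[Tuple[str, float]]:
    """Return (instrument name, 1-minute load average), or None if unavailable."""
    if not hasattr(os, "getloadavg"):
        return None  # os.getloadavg is not provided on every platform
    try:
        load_1m = os.getloadavg()[0]
    except OSError:
        return None  # the load average could not be obtained
    return os_specific_instrument_name("cpu.load_1m"), load_1m
```

Calling `os_specific_instrument_name("cpu.load_1m", "Linux")` yields `system.linux.cpu.load_1m`, matching the example name above.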