System metrics semantic conventions (#937)
* System metrics semantic conventions

  Conventions from [OTEP 119](open-telemetry/oteps#119)
* change process count to UpDownSumObserver
* fix system.cpu.utilization, use better example
* first several comments
* add description columns, update units to UCUM
* markdown-toc
* clarify OS process level metrics
* clarify load average exapmle
* move general conventions + OTEP 108 into README.md
* renamed swap -> paging
* add addition fs labels
* fix links
* fix link
* Update specification/metrics/semantic_conventions/README.md

  Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
* Update specification/metrics/semantic_conventions/README.md

  Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
* Apply suggestions from code review

  Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
* fix tigran comments
* add disk io_time and operation_time
* add descriptions/footnotes for dropped packets and net errors
* lint, more info for net dropped packets/errors
* "dropped_packets" -> "dropped"
* Apply suggestions from James' code review

  Co-authored-by: James Bebbington <jbebbington@google.com>
* comments from James' code review
* clarify windows perf counter
* Update specification/metrics/semantic_conventions/README.md

  Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
* reflow text

Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
Co-authored-by: James Bebbington <jbebbington@google.com>
Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
1 parent b48cb0c, commit 3146dc0
Showing 4 changed files with 358 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,118 @@ | ||
# Metrics Semantic Conventions | ||
|
||
TODO: Add semantic conventions for metric names and labels. | ||
The following semantic conventions surrounding metrics are defined: | ||
|
||
Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md), | ||
OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own | ||
[Resource Semantic Conventions](../../resource/semantic_conventions/README.md). | ||
* [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics. | ||
* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics. | ||
* [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics. | ||
* [Runtime Environment Metrics](runtime-environment-metrics.md): Semantic conventions and instruments for runtime environment metrics. | ||
|
||
Apart from semantic conventions for metrics and | ||
[traces](../../trace/semantic_conventions/README.md), OpenTelemetry also | ||
defines the concept of overarching [Resources](../../resource/sdk.md) with | ||
their own [Resource Semantic | ||
Conventions](../../resource/semantic_conventions/README.md). | ||
|
||
## General Guidelines | ||
|
||
Metric names and labels exist within a single universe and a single | ||
hierarchy. Metric names and labels MUST be considered within the universe of | ||
all existing metric names. When defining new metric names and labels, | ||
consider the prior art of existing standard metrics and metrics from | ||
frameworks/libraries. | ||
|
||
Associated metrics SHOULD be nested together in a hierarchy based on their | ||
usage. Define a top-level hierarchy for common metric categories: for OS | ||
metrics, like CPU and network; for app runtimes, like GC internals. Libraries | ||
and frameworks should nest their metrics into a hierarchy as well. This aids | ||
in discovery and ad hoc comparison. This allows a user to find similar metrics | ||
given a certain metric. | ||
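
As a rough illustration of how a dot-separated hierarchy helps a user find similar metrics, the sketch below groups names by their parent namespace and looks up siblings of a given metric. The helper names `build_index` and `siblings` are hypothetical, and the sample list mixes names used elsewhere in these conventions with one made-up runtime metric.

```python
from collections import defaultdict

# Sample metric names. The runtime entry is made up for illustration only.
METRIC_NAMES = [
    "system.cpu.time",
    "system.cpu.utilization",
    "system.memory.usage",
    "system.memory.utilization",
    "system.network.io",
    "runtime.cpython.gc.collections",  # hypothetical metric name
]


def build_index(names):
    """Group metric names by parent namespace (everything before the last dot)."""
    index = defaultdict(list)
    for name in names:
        parent = name.rpartition(".")[0]
        index[parent].append(name)
    return index


def siblings(name, index):
    """Return the other metrics that share the parent namespace of `name`."""
    parent = name.rpartition(".")[0]
    return [other for other in index[parent] if other != name]


index = build_index(METRIC_NAMES)
print(siblings("system.cpu.time", index))
print(siblings("system.memory.usage", index))
```

Because related instruments share a prefix, a plain string split is enough to discover them; no separate catalog of relationships is assumed.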
|
||
The hierarchical structure of metrics defines the namespacing. Supporting | ||
OpenTelemetry artifacts define the metric structures and hierarchies for some | ||
categories of metrics, and these can assist decisions when creating future | ||
metrics. | ||
|
||
Common labels SHOULD be consistently named. This aids in discoverability and | ||
disambiguates similar labels to metric names. | ||
|
||
["As a rule of thumb, **aggregations** over all the dimensions of a given | ||
metric **SHOULD** be | ||
meaningful,"](https://prometheus.io/docs/practices/naming/#metric-names) as | ||
Prometheus recommends. | ||
|
||
Semantic ambiguity SHOULD be avoided. Use prefixed metric names in cases | ||
where similar metrics have significantly different implementations across the | ||
breadth of all existing metrics. For example, every garbage collected runtime | ||
has slightly different strategies and measures. Using a single set of metric | ||
names for GC, not divided by the runtime, could create dissimilar comparisons | ||
and confusion for end users. (For example, prefer `runtime.java.gc.*` over | ||
`runtime.gc.*`.) Measures of many operating system metrics are similarly | ||
ambiguous. | ||
|
||
Conventional metrics or metrics that have their units included in | ||
OpenTelemetry metadata (e.g. `metric.WithUnit` in Go) SHOULD NOT include the | ||
units in the metric name. Units may be included when doing so provides additional | ||
meaning to the metric name. Metrics MUST, above all, be understandable and | ||
usable. | ||
|
||
## General Metric Semantic Conventions | ||
|
||
The following semantic conventions aim to keep naming consistent. They | ||
provide guidelines for most of the cases in this specification and should be | ||
followed for other instruments not explicitly defined in this document. | ||
|
||
### Instrument Naming | ||
|
||
- **limit** - an instrument that measures the constant, known total amount of | ||
something should be called `entity.limit`. For example, `system.memory.limit` | ||
for the total amount of memory on a system. | ||
|
||
- **usage** - an instrument that measures an amount used out of a known total | ||
(**limit**) amount should be called `entity.usage`. For example, | ||
`system.memory.usage` with label `state = used | cached | free | ...` for the | ||
amount of memory in each state. Where appropriate, the sum of **usage** | ||
over all label values SHOULD be equal to the **limit**. | ||
|
||
A measure of the amount of an unlimited resource consumed is differentiated | ||
from **usage**. | ||
|
||
- **utilization** - an instrument that measures the *fraction* of **usage** | ||
out of its **limit** should be called `entity.utilization`. For example, | ||
`system.memory.utilization` for the fraction of memory in use. Utilization | ||
values are in the range `[0, 1]`. | ||
|
||
- **time** - an instrument that measures passage of time should be called | ||
`entity.time`. For example, `system.cpu.time` with label `state = idle | user | ||
| system | ...`. **time** measurements are not necessarily wall time and can | ||
be less than or greater than the real wall time between measurements. | ||
|
||
**time** instruments are a special case of **usage** metrics, where the | ||
**limit** can usually be calculated as the sum of **time** over all label | ||
values. **utilization** for time instruments can be derived automatically | ||
using metric event timestamps. For example, `system.cpu.utilization` is | ||
defined as the difference in `system.cpu.time` measurements divided by the | ||
elapsed time. | ||
|
||
- **io** - an instrument that measures bidirectional data flow should be | ||
called `entity.io` and have labels for direction. For example, | ||
`system.network.io`. | ||
|
||
- Other instruments that do not fit the above descriptions may be named more | ||
freely. For example, `system.paging.faults` and `system.network.packets`. | ||
Units do not need to be specified in the names since they are included during | ||
instrument creation, but can be added if there is ambiguity. | ||
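
The relationship between **limit**, **usage**, **utilization**, and **time** described above can be sketched with plain arithmetic. This is a minimal sketch, not an OpenTelemetry API: the function names, the choice of which `state` values count as "in use", and all sample values are assumptions made for illustration. The CPU sample assumes a single logical CPU, so the per-state deltas sum to the elapsed time.

```python
def utilization(usage_by_state, limit, used_states=("used",)):
    """Fraction of `limit` taken by the states counted as in use, in [0, 1]."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    used = sum(usage_by_state[state] for state in used_states)
    return used / limit


def cpu_utilization(prev, curr, elapsed_seconds):
    """Per-state utilization: change in cumulative CPU time over elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        state: (curr[state] - prev[state]) / elapsed_seconds for state in curr
    }


# system.memory.usage split by the `state` label (bytes; made-up values).
memory_usage = {"used": 6 * 2**30, "cached": 1 * 2**30, "free": 1 * 2**30}
# system.memory.limit: here the usage over all states sums to the limit.
memory_limit = sum(memory_usage.values())
# system.memory.utilization, counting only the "used" state (an assumption).
print(utilization(memory_usage, memory_limit))

# Two cumulative system.cpu.time samples taken 10 seconds apart (seconds).
prev = {"idle": 100.0, "user": 40.0, "system": 10.0}
curr = {"idle": 105.0, "user": 44.0, "system": 11.0}
# system.cpu.utilization per state over that interval.
print(cpu_utilization(prev, curr, 10.0))
```

On a machine with several logical CPUs, the summed CPU time deltas can exceed the wall-clock interval, which is why **time** values are not tied to wall time.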
|
||
### Units | ||
|
||
Units should follow the [UCUM](http://unitsofmeasure.org/ucum.html) (more | ||
clarification is needed; see | ||
[#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)). | ||
|
||
- Instruments for **utilization** metrics (that measure the fraction out of a | ||
total) are dimensionless and SHOULD use the default unit `1` (the unity). | ||
- Instruments that measure an integer count of something SHOULD use the | ||
default unit `1` (the unity) and | ||
[annotations](https://ucum.org/ucum.html#para-curly) with curly braces to | ||
give additional meaning. For example `{packets}`, `{errors}`, `{faults}`, | ||
etc. |
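
To make the two unit rules above concrete, the following sketch picks a unit string for dimensionless instruments and checks whether a unit is the unity or a curly-brace annotation. It is not a UCUM parser. The `kind` categories, the function names, and the pairing of instruments with `{packets}` and `{faults}` are assumptions for illustration.

```python
def default_unit(kind, annotation=None):
    """Pick a unit string for a dimensionless instrument.

    kind: "utilization" or "count" (hypothetical categories for this sketch).
    annotation: optional word wrapped in curly braces for counts.
    """
    if kind == "utilization":
        return "1"
    if kind == "count":
        return "{" + annotation + "}" if annotation else "1"
    raise ValueError(f"unsupported kind: {kind}")


def is_dimensionless(unit):
    """True for the unity `1` or a lone curly-brace annotation like `{errors}`."""
    return unit == "1" or (
        len(unit) > 2 and unit.startswith("{") and unit.endswith("}")
    )


# Assumed instrument-to-unit pairings, for illustration only.
units = {
    "system.cpu.utilization": default_unit("utilization"),
    "system.network.packets": default_unit("count", "packets"),
    "system.paging.faults": default_unit("count", "faults"),
}
for name, unit in units.items():
    print(name, unit, is_dimensionless(unit))
```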
specification/metrics/semantic_conventions/process-metrics.md: 22 additions & 0 deletions
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Semantic Conventions for OS Process Metrics | ||
|
||
This document describes instruments and labels for common OS process level | ||
metrics in OpenTelemetry. Also consider the [general metric semantic | ||
conventions](README.md#general-metric-semantic-conventions) when creating | ||
instruments not explicitly defined in this document. OS process metrics are | ||
not related to the runtime environment of the program, and should take | ||
measurements from the operating system. For runtime environment metrics see | ||
[semantic conventions for runtime environment | ||
metrics](runtime-environment-metrics.md). | ||
|
||
<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` --> | ||
|
||
<!-- toc --> | ||
|
||
- [Metric Instruments](#metric-instruments) | ||
|
||
<!-- tocstop --> | ||
|
||
## Metric Instruments | ||
|
||
TODO |
specification/metrics/semantic_conventions/runtime-environment-metrics.md: 44 additions & 0 deletions
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Semantic Conventions for Runtime Environment Metrics | ||
|
||
This document includes semantic conventions for runtime environment level | ||
metrics in OpenTelemetry. Also consider the [general | ||
metric](README.md#general-metric-semantic-conventions), [system | ||
metrics](system-metrics.md) and [OS Process metrics](process-metrics.md) | ||
semantic conventions when instrumenting runtime environments. | ||
|
||
<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` --> | ||
|
||
<!-- toc --> | ||
|
||
- [Metric Instruments](#metric-instruments) | ||
* [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment) | ||
|
||
<!-- tocstop --> | ||
|
||
## Metric Instruments | ||
|
||
Runtime environments vary widely in their terminology, implementation, and | ||
relative values for a given metric. For example, Go and Python are both | ||
garbage collected languages, but comparing heap usage between the Go and | ||
CPython runtimes directly is not meaningful. For this reason, this document | ||
does not propose any standard top-level runtime metric instruments. See [OTEP | ||
108](https://github.com/open-telemetry/oteps/pull/108/files) for additional | ||
discussion. | ||
|
||
### Runtime Environment Specific Metrics - `runtime.{environment}.` | ||
|
||
Metrics specific to a certain runtime environment should be prefixed with | ||
`runtime.{environment}.` and follow the semantic conventions outlined in | ||
[general metric semantic | ||
conventions](README.md#general-metric-semantic-conventions). Authors of | ||
runtime instrumentations are responsible for the choice of `{environment}` to | ||
avoid ambiguity when interpreting a metric's name or values. | ||
|
||
For example, some programming languages have multiple runtime environments | ||
that vary significantly in their implementation, like [Python which has many | ||
implementations](https://wiki.python.org/moin/PythonImplementations). For | ||
such languages, consider using specific `{environment}` prefixes to avoid | ||
ambiguity, like `runtime.cpython.` and `runtime.pypy.`. | ||
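
One way an instrumentation author might apply the `runtime.{environment}.` prefix is sketched below. The segment pattern, the helper names, and the `gc.collections` suffix are assumptions for illustration; only `platform.python_implementation()` is a real standard-library call, used here to tell CPython from PyPy and other implementations.

```python
import platform
import re

# Assumed naming rule for this sketch: lowercase letters, digits, underscores.
_SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")


def runtime_metric_name(environment, *parts):
    """Build a `runtime.{environment}.`-prefixed name from its segments."""
    if not parts:
        raise ValueError("at least one segment is required after the environment")
    segments = (environment,) + parts
    for segment in segments:
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid name segment: {segment!r}")
    return ".".join(("runtime",) + segments)


def current_environment():
    """Name the running Python implementation, e.g. 'cpython' or 'pypy'."""
    return platform.python_implementation().lower()


# `gc.collections` is a made-up metric suffix used only for illustration.
print(runtime_metric_name("cpython", "gc", "collections"))
print(runtime_metric_name(current_environment(), "gc", "collections"))
```

Deriving the prefix from the running implementation keeps metrics from different Python runtimes in separate namespaces without hard-coding the environment name.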
|
||
There are other dimensions even within a given runtime environment to | ||
consider, for example pthreads vs green thread implementations. |