System metrics semantic conventions (#937)

* System metrics semantic conventions

  Conventions from [OTEP 119](open-telemetry/oteps#119)
* change process count to UpDownSumObserver
* fix system.cpu.utilization, use better example
* first several comments
* add description columns, update units to UCUM
* markdown-toc
* clarify OS process level metrics
* clarify load average example
* move general conventions + OTEP 108 into README.md
* renamed swap -> paging
* add additional fs labels
* fix links
* fix link
* Update specification/metrics/semantic_conventions/README.md

  Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
* Update specification/metrics/semantic_conventions/README.md

  Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
* Apply suggestions from code review

  Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
* fix tigran comments
* add disk io_time and operation_time
* add descriptions/footnotes for dropped packets and net errors
* lint, more info for net dropped packets/errors
* "dropped_packets" -> "dropped"
* Apply suggestions from James' code review

  Co-authored-by: James Bebbington <jbebbington@google.com>
* comments from James' code review
* clarify windows perf counter
* Update specification/metrics/semantic_conventions/README.md

  Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
* reflow text

Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
Co-authored-by: James Bebbington <jbebbington@google.com>
Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
open-telemetry · Oct 15, 2020 · 3146dc0
1 parent b48cb0c
Showing 4 changed files with 358 additions and 4 deletions.
diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md
@@ -1,7 +1,118 @@
 # Metrics Semantic Conventions
 
-TODO: Add semantic conventions for metric names and labels.
+The following semantic conventions surrounding metrics are defined:
 
-Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md),
-OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own
-[Resource Semantic Conventions](../../resource/semantic_conventions/README.md).
+* [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics.
+* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics.
+* [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics.
+* [Runtime Environment Metrics](runtime-environment-metrics.md): Semantic conventions and instruments for runtime environment metrics.
+
+Apart from semantic conventions for metrics and
+[traces](../../trace/semantic_conventions/README.md), OpenTelemetry also
+defines the concept of overarching [Resources](../../resource/sdk.md) with
+their own [Resource Semantic
+Conventions](../../resource/semantic_conventions/README.md).
+
+## General Guidelines
+
+Metric names and labels exist within a single universe and a single
+hierarchy. Metric names and labels MUST be considered within the universe of
+all existing metric names. When defining new metric names and labels,
+consider the prior art of existing standard metrics and metrics from
+frameworks/libraries.
+
+Associated metrics SHOULD be nested together in a hierarchy based on their
+usage. Define a top-level hierarchy for common metric categories: for OS
+metrics, like CPU and network; for app runtimes, like GC internals. Libraries
+and frameworks should nest their metrics into a hierarchy as well. This aids
+in discovery and ad hoc comparison. This allows a user to find similar metrics
+given a certain metric.
+
+The hierarchical structure of metrics defines the namespacing. Supporting
+OpenTelemetry artifacts define the metric structures and hierarchies for some
+categories of metrics, and these can assist decisions when creating future
+metrics.
+
+Common labels SHOULD be consistently named. This aids in discoverability and
+disambiguates similar labels to metric names.
+
+["As a rule of thumb, **aggregations** over all the dimensions of a given
+metric **SHOULD** be
+meaningful,"](https://prometheus.io/docs/practices/naming/#metric-names) as
+Prometheus recommends.
+
+Semantic ambiguity SHOULD be avoided. Use prefixed metric names in cases
+where similar metrics have significantly different implementations across the
+breadth of all existing metrics. For example, every garbage collected runtime
+has slightly different strategies and measures. Using a single set of metric
+names for GC, not divided by the runtime, could create dissimilar comparisons
+and confusion for end users. (For example, prefer `runtime.java.gc.*` over
+`runtime.gc.*`.) Measures of many operating system metrics are similarly
+ambiguous.
+
+Conventional metrics or metrics that have their units included in
+OpenTelemetry metadata (e.g. `metric.WithUnit` in Go) SHOULD NOT include the
+units in the metric name. Units may be included when they provide additional
+meaning to the metric name. Metrics MUST, above all, be understandable and
+usable.
+
+## General Metric Semantic Conventions
+
+The following semantic conventions aim to keep naming consistent. They
+provide guidelines for most of the cases in this specification and should be
+followed for other instruments not explicitly defined in this document.
+
+### Instrument Naming
+
+- **limit** - an instrument that measures the constant, known total amount of
+something should be called `entity.limit`. For example, `system.memory.limit`
+for the total amount of memory on a system.
+
+- **usage** - an instrument that measures an amount used out of a known total
+(**limit**) amount should be called `entity.usage`. For example,
+`system.memory.usage` with label `state = used | cached | free | ...` for the
+amount of memory in each state. Where appropriate, the sum of **usage**
+over all label values SHOULD be equal to the **limit**.
+
+  A measure of the amount of an unlimited resource consumed is differentiated
+  from **usage**.
+
+- **utilization** - an instrument that measures the *fraction* of **usage**
+out of its **limit** should be called `entity.utilization`. For example,
+`system.memory.utilization` for the fraction of memory in use. Utilization
+values are in the range `[0, 1]`.
+
+- **time** - an instrument that measures passage of time should be called
+`entity.time`. For example, `system.cpu.time` with label `state = idle | user
+| system | ...`. **time** measurements are not necessarily wall time and can
+be less than or greater than the real wall time between measurements.
+
+  **time** instruments are a special case of **usage** metrics, where the
+  **limit** can usually be calculated as the sum of **time** over all label
+  values. **utilization** for time instruments can be derived automatically
+  using metric event timestamps. For example, `system.cpu.utilization` is
+  defined as the difference in `system.cpu.time` measurements divided by the
+  elapsed time.
+
+- **io** - an instrument that measures bidirectional data flow should be
+called `entity.io` and have labels for direction. For example,
+`system.network.io`.
+
+- Other instruments that do not fit the above descriptions may be named more
+freely. For example, `system.paging.faults` and `system.network.packets`.
+Units do not need to be specified in the names since they are included during
+instrument creation, but can be added if there is ambiguity.
+
+### Units
+
+Units should follow the [UCUM](http://unitsofmeasure.org/ucum.html) standard
+(further clarification is tracked in
+[#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)).
+
+- Instruments for **utilization** metrics (that measure the fraction out of a
+total) are dimensionless and SHOULD use the default unit `1` (the unity).
+- Instruments that measure an integer count of something SHOULD use the
+default unit `1` (the unity) and
+[annotations](https://ucum.org/ucum.html#para-curly) with curly braces to
+give additional meaning. For example `{packets}`, `{errors}`, `{faults}`,
+etc.
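
Reader's note (not part of the commit): the **usage**, **limit**, **utilization** and **time** conventions added to README.md above can be illustrated with a small sketch. The Python below is a hypothetical illustration, not an OpenTelemetry API. The function names `memory_utilization` and `cpu_utilization` are assumptions made for this note, and the CPU helper assumes cumulative `system.cpu.time` readings for a single logical CPU.

```python
from typing import Dict


def memory_utilization(usage_by_state: Dict[str, int], limit: int) -> Dict[str, float]:
    """Return the fraction of `limit` consumed by each `state` label value.

    Mirrors the README wording: utilization is usage divided by limit,
    so each value lies in [0, 1] when usage never exceeds the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return {state: used / limit for state, used in usage_by_state.items()}


def cpu_utilization(
    prev_time: Dict[str, float],
    curr_time: Dict[str, float],
    elapsed_seconds: float,
) -> Dict[str, float]:
    """Derive per-state utilization from two cumulative time readings.

    Follows the README definition: the difference in `system.cpu.time`
    measurements divided by the elapsed time. Every state in `curr_time`
    must also be present in `prev_time`.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        state: (curr_time[state] - prev_time[state]) / elapsed_seconds
        for state in curr_time
    }
```

For instance, with a hypothetical 8 GiB limit split as 6 GiB `used`, 1 GiB `cached` and 1 GiB `free`, the usage values sum to the limit and `memory_utilization` reports 0.75 for `used`. Keeping utilization as a derived quantity means only the cumulative **time** or **usage** values need to be recorded.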
diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md
@@ -0,0 +1,22 @@
+# Semantic Conventions for OS Process Metrics
+
+This document describes instruments and labels for common OS process level
+metrics in OpenTelemetry. Also consider the [general metric semantic
+conventions](README.md#general-metric-semantic-conventions) when creating
+instruments not explicitly defined in this document. OS process metrics are
+not related to the runtime environment of the program, and should take
+measurements from the operating system. For runtime environment metrics see
+[semantic conventions for runtime environment
+metrics](runtime-environment-metrics.md).
+
+<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->
+
+<!-- toc -->
+
+- [Metric Instruments](#metric-instruments)
+
+<!-- tocstop -->
+
+## Metric Instruments
+
+TODO
diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md
@@ -0,0 +1,44 @@
+# Semantic Conventions for Runtime Environment Metrics
+
+This document includes semantic conventions for runtime environment level
+metrics in OpenTelemetry. Also consider the [general
+metric](README.md#general-metric-semantic-conventions), [system
+metrics](system-metrics.md) and [OS Process metrics](process-metrics.md)
+semantic conventions when instrumenting runtime environments.
+
+<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->
+
+<!-- toc -->
+
+- [Metric Instruments](#metric-instruments)
+  * [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment)
+
+<!-- tocstop -->
+
+## Metric Instruments
+
+Runtime environments vary widely in their terminology, implementation, and
+relative values for a given metric. For example, Go and Python are both
+garbage collected languages, but comparing heap usage between the Go and
+CPython runtimes directly is not meaningful. For this reason, this document
+does not propose any standard top-level runtime metric instruments. See [OTEP
+108](https://github.com/open-telemetry/oteps/pull/108/files) for additional
+discussion.
+
+### Runtime Environment Specific Metrics - `runtime.{environment}.`
+
+Metrics specific to a certain runtime environment should be prefixed with
+`runtime.{environment}.` and follow the semantic conventions outlined in
+[general metric semantic
+conventions](README.md#general-metric-semantic-conventions). Authors of
+runtime instrumentations are responsible for the choice of `{environment}` to
+avoid ambiguity when interpreting a metric's name or values.
+
+For example, some programming languages have multiple runtime environments
+that vary significantly in their implementation, like [Python which has many
+implementations](https://wiki.python.org/moin/PythonImplementations). For
+such languages, consider using specific `{environment}` prefixes to avoid
+ambiguity, like `runtime.cpython.` and `runtime.pypy.`.
+
+There are other dimensions even within a given runtime environment to
+consider, for example pthreads vs green thread implementations.
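
Reader's note (not part of the commit): the `runtime.{environment}.` prefix rule above can be sketched as a small name builder. This is a hypothetical helper, not an OpenTelemetry API. The function name `runtime_metric_name`, the example metric `gc.collections`, and the lowercase segment pattern are assumptions made for illustration; the commit does not define which characters a name segment may contain.

```python
import re

# Assumed segment shape for this sketch only: lowercase letters, digits
# and underscores, starting with a letter. The spec text does not set this.
_SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")


def runtime_metric_name(environment: str, *parts: str) -> str:
    """Build a metric name of the form `runtime.{environment}.<parts...>`.

    `environment` should identify a specific runtime (for example
    `cpython` or `pypy`) rather than a language, so that metrics from
    different implementations are not compared by accident.
    """
    if not parts:
        raise ValueError("at least one name segment is required after the environment")
    segments = ("runtime", environment) + parts
    for segment in segments:
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid name segment: {segment!r}")
    return ".".join(segments)
```

Calling `runtime_metric_name("cpython", "gc", "collections")` and `runtime_metric_name("pypy", "gc", "collections")` yields two distinct names, which keeps the CPython and PyPy measurements in separate namespaces.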