Metric naming conventions #108

Merged
22 commits
a59a7fd
Proposal for metric naming conventions
tedpennings May 21, 2020
cece358
Add Node example metrics
tedpennings May 21, 2020
63443ef
Node.js instead of Node
tedpennings May 21, 2020
17b8993
Rename file, add Prometheus quote
tedpennings May 22, 2020
444cb79
Second round of revisions
tedpennings May 28, 2020
1727642
Working group feedback
tedpennings May 28, 2020
1ead372
More feedback changes
tedpennings May 28, 2020
6238766
Minor clarifications
tedpennings May 28, 2020
0914ded
Word choice
tedpennings May 28, 2020
509b6e5
Whitespace to check CLA status
tedpennings May 28, 2020
4316780
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
a9fdbfe
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
b60dfa7
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
3fb9258
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
17ab176
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
0ba3d76
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
5a31caf
Update text/metrics/0108-naming-conventions.md
tedpennings Jun 11, 2020
f2c990e
Code review feedback, remove discussion section
tedpennings Jun 11, 2020
2191122
Remove some discussion topics, and fix an example
tedpennings Jun 11, 2020
1d61a72
Merge branch 'master' into metric-naming-conventions
bogdandrutu Jul 10, 2020
8b88b36
Rename OTEP 108 to metric naming _guidelines_
justinfoote Jul 14, 2020
faa0138
Merge branch 'master' into metric-naming-conventions
yurishkuro Jul 17, 2020
151 changes: 151 additions & 0 deletions text/metrics/0108-naming-conventions.md
@@ -0,0 +1,151 @@
# Metric naming conventions

## Purpose

Metric names and labels are the primary read-interface to metric data. The names and taxonomy need to be understandable and discoverable during routine exploration -- and this becomes critical during incidents.

## Guidelines

Namespace similar metrics together. Define top-level namespaces for common metric categories: for the operating system, such as CPU and network; for application runtimes, such as the JVM. This aids discovery and ad hoc comparison.

Provide consistent names for common labels. This provides discoverability and disambiguation similar to metric names.

"As a rule of thumb, [aggregations] over all the dimensions of a given metric should be meaningful," as Prometheus recommends.

Avoid semantic ambiguity. Use namespaced metric names in cases where similar metrics have significantly different implementations across the full range of metrics. For example, every garbage-collected runtime has slightly different strategies and measures. Using common metric names for GC, not namespaced by the runtime, could create dissimilar comparisons and confusion for end users. Measures of operating system memory raise the same concern.

## Conventions

All metrics have a limited set of common labels:
* `system.host`

### Operating System

#### CPU

`system.cpu` is the core metric name.

Standard usage labels include, non-exhaustively,
* `cpu.idle`
* `cpu.user`
* `cpu.sys`
* `cpu.real`
* `cpu.iowait`
* `cpu.nice`
* `cpu.interrupt`
* `cpu.softirq`
* `cpu.steal`

A user can derive total CPU capacity by summing `system.cpu` across all labels.

A user can derive CPU utilization by summing all values for the `cpu.idle` label and comparing that with all `system.cpu` values across all labels.
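
As a rough illustration of the two derivations above, the following Python sketch sums a hypothetical set of `system.cpu` readings. The sample values, the unit (CPU-seconds), and the function names are assumptions made for the example; they are not part of this proposal.

```python
# Hypothetical readings of `system.cpu`, one value per usage label.
# The numbers and the unit (CPU-seconds) are made up for illustration.
cpu_seconds = {
    "cpu.idle": 720.0,
    "cpu.user": 180.0,
    "cpu.sys": 60.0,
    "cpu.iowait": 30.0,
    "cpu.nice": 10.0,
}


def total_capacity(samples):
    """Total CPU capacity: the sum of `system.cpu` across all usage labels."""
    return sum(samples.values())


def utilization(samples):
    """Fraction of capacity that was not idle, between 0.0 and 1.0."""
    total = total_capacity(samples)
    if total == 0:
        return 0.0  # no data yet; avoid dividing by zero
    return 1.0 - samples.get("cpu.idle", 0.0) / total


print(total_capacity(cpu_seconds))
print(round(utilization(cpu_seconds), 2))
```

The derivation only works if the usage labels are mutually exclusive, so that no CPU time is counted under two labels.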

`system.cpu` may include labels for per-core measures:
* `cpu.core.[0-n]`, eg, `cpu.core.3`

Cores should be reported ordinally as ordered by the operating system. It is recommended that values begin at 0.

It is recommended that per-core labels not be reported by default, to reduce cardinality; a user should opt in via configuration.
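
One way to picture the opt-in is a configuration flag that defaults to off. The sketch below is only an assumption about how an implementation might look; `CpuMetricsConfig`, `per_core`, and `cpu_labels` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class CpuMetricsConfig:
    """Hypothetical configuration; per-core reporting is off by default."""
    per_core: bool = False


def cpu_labels(usage, core, config):
    """Return the labels to attach to one `system.cpu` measurement.

    `usage` is a usage kind such as "idle"; `core` is the ordinal
    assigned by the operating system, starting at 0.
    """
    labels = [f"cpu.{usage}"]
    if config.per_core:
        # Only emitted after an explicit opt-in, to limit cardinality.
        labels.append(f"cpu.core.{core}")
    return labels


default_labels = cpu_labels("idle", 3, CpuMetricsConfig())
opted_in_labels = cpu_labels("idle", 3, CpuMetricsConfig(per_core=True))
```

With the default configuration only the usage label is produced; the per-core label appears once `per_core` is set.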

#### Network

`system.network` is the core metric name.

Standard labels include, non-exhaustively,
* `network.sent`
* `network.received`

`system.network` may include labels for per-NIC measures:
* `network.nic.[0-n]`, eg, `network.nic.3`

NICs should be reported ordinally as ordered by the operating system. It is recommended that values begin at 0.

Interfaces may also be reported by OS name, eg, `en3` in the label `network.nic.en3`.
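
The two per-NIC forms described above (ordinal and OS name) could be produced by a single helper. This is a minimal sketch; `nic_label` is a hypothetical function name, and accepting either an integer or a string is a design choice made for the example.

```python
def nic_label(identifier):
    """Build a per-NIC label from an ordinal (int) or an OS interface name (str)."""
    if isinstance(identifier, int):
        if identifier < 0:
            raise ValueError("NIC ordinals are expected to start at 0")
        return f"network.nic.{identifier}"
    if not identifier:
        raise ValueError("interface name must not be empty")
    return f"network.nic.{identifier}"


ordinal_label = nic_label(3)     # ordinal as ordered by the operating system
named_label = nic_label("en3")   # interface name as reported by the OS
```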

It is recommended that per-NIC labels not be reported by default, to reduce cardinality; a user should opt in via configuration.

TODO: what other network labels? dropped? how about low-level things like window size?

#### Memory

`system.memory` is the core metric name.

Standard labels include, non-exhaustively,
* `memory.free`
* `memory.resident`
* `memory.shared`
* `memory.private`

TODO: how can a user derive total memory? how about memory utilization? memory allocations that are reported in more than one label may make this difficult.

#### More system metrics

Possibilities:
* Disk
* Filesystems
* Load
* Per-process
* Processes and threads?

Note for discussion: see this [excellent reference guide](https://docs.google.com/spreadsheets/d/11qSmzD9e7PnzaJPYRFdkkKbjTLrAKmvyQpjBjpJsR2s/edit#gid=0)

### Caution

Operating systems will report different labels for common metrics based on their architecture. Queries should be scoped to a host or cluster running the same operating system to avoid aggregating dissimilar measures.
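
To make the caution concrete, the sketch below aggregates one label only across hosts that run the same operating system. The sample records, the `os` field, and the `sum_label` helper are assumptions for illustration; this proposal does not define how the operating system is recorded.

```python
# Hypothetical samples of `system.cpu`; the `os` field is an assumption.
samples = [
    {"host": "web-1", "os": "linux", "label": "cpu.idle", "value": 700.0},
    {"host": "web-2", "os": "linux", "label": "cpu.idle", "value": 650.0},
    {"host": "db-1", "os": "freebsd", "label": "cpu.idle", "value": 900.0},
]


def sum_label(samples, label, os_name):
    """Sum one label across hosts that share the same operating system."""
    return sum(
        s["value"]
        for s in samples
        if s["label"] == label and s["os"] == os_name
    )


linux_idle = sum_label(samples, "cpu.idle", "linux")
```

Here `db-1` is left out of the Linux total, so the result compares like with like.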


#### TODO lower-level system metrics to consider
* CPU interrupts
* System calls
* Swap and paging

### Application Runtime

All runtime metric names should be reported with a namespace that includes the name of the runtime, eg, `runtime.go.*`.
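
A small helper can keep runtime metric names inside the expected namespace. The sketch below is an assumption, not part of the proposal: `runtime_metric_name` is a hypothetical function, and the token rule (lowercase letters, digits, underscores) is only one possible answer to the separator questions raised at the end of this document.

```python
import re

# Assumed token rule: lowercase letters, digits and underscores only.
_TOKEN = re.compile(r"^[a-z][a-z0-9_]*$")


def runtime_metric_name(runtime, metric):
    """Build a name of the form `runtime.<runtime>.<metric>`."""
    for token in (runtime, metric):
        if not _TOKEN.match(token):
            raise ValueError(f"invalid name token: {token!r}")
    return f"runtime.{runtime}.{metric}"


go_name = runtime_metric_name("go", "goroutines")
node_name = runtime_metric_name("nodejs", "heap_alloc")
```

Under this rule a runtime written as `Node.js` is rejected, because the dot would introduce an extra namespace level.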

#### Go

All Go runtime metrics should be within the `runtime.go.*` namespace.

Common metrics include:
* `runtime.go.goroutines`
* `runtime.go.heap_alloc`
* `runtime.go.gc` with labels `gc.count`

#### Java

All Java runtime metrics should be within the `runtime.java.*` namespace.

* `runtime.java.threads` with optional labels for thread pools, eg, `runtime.java.thread_pool.[name]`
* `runtime.java.heap_alloc`
* `runtime.java.gc` with labels `gc.count` and `gc.time`

#### Node.js

All Node.js runtime metrics should be within the `runtime.nodejs.*` namespace.

* `runtime.nodejs.gc` with labels `pause.time` and `pause.count`
* `runtime.nodejs.heap_alloc` with labels `heap.total_size`, `heap.available_size`, `heap.used_heap_size` **TODO** confirm
* `runtime.nodejs.event_loop` with TBD labels **TODO**

Note: We use `nodejs` here to disambiguate from the `node` term in Kubernetes and elsewhere.

#### More runtimes

**TODO** ...et al... please contribute more runtimes as comments

## User-defined metrics

All user-defined metric conventions below are **recommendations**.

We recommend that users namespace their metrics into logical groups, eg, `shopping_cart.add_item`, `shopping_cart.remove_item`, `shopping_cart.increase_quantity`, and so forth.
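
Namespacing makes related metrics easy to find by prefix. The following sketch groups a list of metric names by their top-level namespace; `group_by_namespace` and the `checkout.submit_order` name are hypothetical and exist only for the example.

```python
from collections import defaultdict


def group_by_namespace(metric_names):
    """Group metric names by the part before the first `.` separator."""
    groups = defaultdict(list)
    for name in metric_names:
        namespace, _, leaf = name.partition(".")
        groups[namespace].append(leaf)
    return dict(groups)


names = [
    "shopping_cart.add_item",
    "shopping_cart.remove_item",
    "shopping_cart.increase_quantity",
    "checkout.submit_order",  # hypothetical second namespace
]
grouped = group_by_namespace(names)
```

All three shopping cart metrics end up under the `shopping_cart` key, which is the kind of discoverability the recommendation aims for.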

We recommend that users consider common labels for their organization. For example, an organization may wish to track the performance of their systems for a specific customer organization; in this case, a common `customer.organization` label could be applied generically.
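
Applying a common label generically could look like merging an organization-wide label set into each metric's own labels. This is a sketch under assumptions: `with_common_labels`, the label values, and the rule that metric-specific labels win on conflict are all illustrative choices.

```python
# Hypothetical organization-wide labels; the value is made up.
COMMON_LABELS = {"customer.organization": "acme"}


def with_common_labels(labels, common=COMMON_LABELS):
    """Merge common labels into a metric's own labels.

    Assumption: a metric-specific label overrides a common label
    that uses the same key.
    """
    merged = dict(common)
    merged.update(labels)
    return merged


add_item_labels = with_common_labels({"item.category": "books"})
```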

Member

Common labels SHOULD be consistently named

What does "common labels" mean? Common in what context/scope? A single service? An organization?

Contributor

I was hoping to avoid a debate over naming label keys. For example, you have a label key named "service" and have used it on some metrics, and I have a label key named "service" and have used it on some different metrics. How are we to know that those labels are not the same? The answer would be to add namespacing of labels. I recall the OpenCensus guidelines were to prefix your label names with a DNS prefix that you own. So I might have a lightstep.com/service label and you might have an uber.com/service label. For this to create a good user experience, I'd like the DNS prefix to not display by default. What would you like to see, @yurishkuro?

Member

The way I read this guidance is: if we have a label that should be added to many different categories of metric instrument, and that label's semantic meaning is the same across all those categories, its name should be consistent.

The most obvious example I can think of would be status, whose value will be a CanonicalSpanStatus.

As a user, I would find it intuitive when searching my metrics in my UI to always find the success/failure information under the same status label.

I'm not sure I understand the example service label. Would it be the name of the service being instrumented? If so, perhaps we would want some semantic conventions around how to apply Resource attributes as metric labels.

If this is the case, I'm not sure if we need to change this line. Is this guidance not clear enough? What wording would make our meaning more clear?

Contributor

I think it's safe to say we can merge this and debate this topic again as we modify the specification.

## Questions for PR review

* Separators
  * namespace separators, eg, `runtime.go`
  * word-token separators inside a metric name, eg, `heap_alloc`

* What about things that overlap with tracing span data like upstream/downstream callers or originating systems?