Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is direction dimension a mistake in semantic conventions? #2589

Closed
tigrannajaryan opened this issue May 30, 2022 · 13 comments · Fixed by #2617
Closed

Is direction dimension a mistake in semantic conventions? #2589

tigrannajaryan opened this issue May 30, 2022 · 13 comments · Fixed by #2617
Assignees
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory

Comments

@tigrannajaryan
Copy link
Member

We have metrics that have a "direction" attribute, for example disk metrics.

This seems to be in contradiction to the general guidelines about attributes

"As a rule of thumb, aggregations over all the attributes of a given metric SHOULD be meaningful," as Prometheus recommends.

I am not sure what aggregation is meaningful for "direction" attribute.

Should the attribute be removed and the metrics split into 2, one for read and one for write?

It does look pretty natural to express this dimension as an attribute, it is low cardinality, it groups 2 very related measurements. Is this an indication that the guidelines need to be adjusted or this is an exception from the rule? Or perhaps there is a meaningful aggregation of "direction" that I just don't see?

@tigrannajaryan tigrannajaryan added spec:metrics Related to the specification/metrics directory area:semantic-conventions Related to semantic conventions labels May 30, 2022
@reyang
Copy link
Member

reyang commented May 31, 2022

+1, also in most cases the removal of this dimension would result in meaningless value (e.g. what does a "total read/write bytes" mean for a disk?)

In addition, many disk/NICs have extra operations beyond read and write (I just took a screenshot from my laptop):

image

@bogdandrutu
Copy link
Member

@tigrannajaryan does the aggregation not represent "total" bytes send/received over the network? Is that not meaningful?

@tigrannajaryan
Copy link
Member Author

@tigrannajaryan does the aggregation not represent "total" bytes send/received over the network? Is that not meaningful?

Yes, probably meaningful for network (e.g. total bytes/throughput can be interesting to know as a percentage of bidirectional bandwidth available for half-duplex physical connections).

Is it also meaningful for disks? I am not sure what's the meaning of total combined bytes written and read for disk. Sure, both are bytes flowing from/to the same device, maybe that's useful somehow?

I am not against keeping the direction if we think it is meaningful (but see also @reyang comment which shows we do not fully understand the allowed values for the dimension).

@codeboten
Copy link
Contributor

codeboten commented Jun 1, 2022

Speaking specifically to @reyang's example, it would appear that aggregating read/writes/other would be meaningful as per the following description of "I/O other bytes" :

"I/O Other Bytes" is "The number of bytes transferred in input/output operations generated by a process that are neither reads nor writes, including file, network, and device I/Os

@jmacd
Copy link
Contributor

jmacd commented Jun 1, 2022

I believe direction was a mistake in our semantic conventions. I support undo-ing it.
Prometheus documentation addresses this specifically:

Read/write and send/receive are best as separate metrics, rather than as a label. This is usually because you care about only one of them at a time, and it is easier to use them that way.

I consider this a good exception to the rule otherwise given by Prometheus quoted above:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics.

The issue about directional transfers tests this rule: we are able to find ways that the sum() of network traffic is meaningful and even useful. The usefulness/meaningfulness needs to be weighed against potential harm due to "uselessness". If we believe "usually you care about only one of them at a time", and we believe the common mode of viewing metrics is to aggregate them, there's real harm in having a direction label as it will obscure two useful things in a commonly-useless way.

Is there an underlying logic we can use for similar situations? Here's one answer: in what sense are network bytes read and written the same? In what sense are they different? It makes sense to sum() or rate(sum()) these counters when they are the same, and it does not make to sum() or rate(sum()) these counters when they are not the same. Of course, a byte is a byte, so the sum is meaningful (units are "By"). However, if a metric is being used as a proxy variable for load -- and the load induced by a byte read is different than the load induced by a byte written, then it is harmful/useless to combine them (units are "{bytes_read}" and "{bytes_written}").

@carlosalberto
Copy link
Contributor

I believe direction was a mistake in our semantic conventions. I support undo-ing it.

At least we should start by addressing this with upstream metrics addition (and yes, later on update the existing ones ;) )

codeboten pushed a commit to codeboten/opentelemetry-specification that referenced this issue Jun 15, 2022
This PR looks to address the concerns voiced in open-telemetry#2589. It adds new metrics that include the direction in their names, and marks existing metrics as deprecated.

Fix open-telemetry#2589
@bertysentry
Copy link
Contributor

I have mixed feelings about this. If you take the network traffic example, you can calculate the bandwidth utilization with something like: max(rate(system.network.io)) / hw.network.speed, where the max() function applies to each direction of the system.network.io metric, for a full-duplex link.

I don't see how we could calculate the same bandwidth utilization metric from system.network.io.receive and system.network.io.transmit.

Also, as noted by @reyang, having separate metrics is reasonable when you have a strictly defined set of dimensions. It does fit well with network (receive vs transmit), but doesn't even for a simple metric like system.disk.io.

Last point: What are the drawbacks of using a direction attribute versus a separate metric? As explained above and by others, there are a few minor benefits of using the direction attribute, but I can't see anything in the "cons" column. Performance? Readability? Consistency?

@bertysentry
Copy link
Contributor

This change will require changes in 7 receivers for consistency:

  • activedirectorydsreceiver
  • elasticsearchreceiver
  • hostmetricsreceiver
  • iisreceiver
  • kubeletstatsreceiver
  • mongodbatlasreceiver
  • nsxtreceiver

@reyang
Copy link
Member

reyang commented Aug 16, 2022

During the Aug. 16th, 2022 Specification SIG meeting, there were discussions about network metrics - based on the comment here folks believe that Disk I/O should NOT have a dimension/attribute called direction, the question is whether Network I/O should do the same or not, and issue #2726 suggested Network I/O should keep direction.

Here are what I've found so far about Windows operating systems (NT kernel based):

  1. "Network" is a vague term, NT models network performance counters as several things:
  2. Process only has I/O Read/Write/Other Bytes/Operations, it doesn't have direct connection to the underlying Network Interface nor Network Adapter. And it does have "Other Bytes" (I can imagine a process trying to listen on a TCP port, which is considered as control bytes but not Rx/Tx).

@sebastien-rosset
Copy link
Contributor

Shouldn't direction be renamed to something like hw.io_direction?

@bertysentry
Copy link
Contributor

@sebastien-rosset I don't see a reason to use a specific attribute for hardware metrics. direction is specified at the general level for IO metrics so it shall be used in specific semantic conventions that cover IOs as well.

@sebastien-rosset
Copy link
Contributor

sebastien-rosset commented Jan 20, 2023

The spec states the instrument should have attributes for direction; oddly, plural is used, and it's not stating the attribute should be named direction.

io - an instrument that measures bidirectional data flow should be called entity.io and have attributes for direction. For example, system.network.io.

That would also break the semantic rules:

  1. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/README.md#general-guidelines and
  2. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-naming.md#attribute-naming

Metric names and attributes exist within a single universe and a single hierarchy. Metric names and attributes MUST be considered within the universe of all existing metric names. When defining new metric names and attributes, consider the prior art of existing standard metrics and metrics from frameworks/libraries.

Perhaps it should be hw.io_direction. Another namespace could have a different concept of "direction" such as up, down, or East, West, left, right, etc. For example a wind sensor could have a direction attribute.

In addition, the spec states:

Names SHOULD NOT coincide with namespaces. For example if service.instance.id is an attribute name then it is no longer valid to have an attribute named service.instance because service.instance is already a namespace. Because of this rule be careful when choosing names: every existing name prohibits existence of an equally named namespace in the future, and vice versa: any existing namespace prohibits existence of an equally named attribute key in the future.

Defining lots of attributes with no name hierarchy has the side effect of preventing any future use of that word as a namespace. Currently, I can see these attributes are declared with a flat naming structure: direction, vendor, model, serial_number, vendor, firmware_version, bios_version, driver_version, firmware_version, id, name, parent, chemistry, capacity, limit_type, state, sensor_location, type, task, raid_level, logical_addresses, physical_addresses, smart_attribute.

capacity is defined as the capacity in Watts-hours or Amper-hours, which is highly specific to batteries. However, the word capacity is contextual, it could mean something entirely different in another namespace. It seems that by allowing names to be flat, more flat-names will be created in short order, which inevitably will lead to collisions or other naming problems. At some point, shouldn't there be a registry of namespaces, similar to what was done at the IETF or IEEE?

Under the semantic guidelines, HTTP, RPC, Database, System, Process, Runtime Environment, FaaS are consistently using hierarchical names (with state noted as an exception). Only "Hardware" uses flat names liberally.

Out of curiosity, I looked at all the possible values for state:

state: idle | user | system | ...
state: used | cached | free | ...
state: idle, used (for database connection pool)
state: idle, user, system, interrupt, etc
state: used, free, cached, etc.
state: used, free, reserved
state: close, close_wait, closing, delete, established, fin_wait_1, fin_wait_2, last_ack, listen, syn_recv, syn_sent, time_wait.
state: system, user, wait
state: ok, degraded, failed
state: charging, discharging
state: ok, degraded, failed, charging, discharging
state: ok, degraded, failed, predicted_failure
state: ok, degraded, failed, open
state: remaining
state: ok, degraded, failed, needs_cleaning

@sebastien-rosset
Copy link
Contributor

I created #3129 since it's a different issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants