Expand event-model description to more clearly delineate instrument v… #1614
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -146,26 +146,45 @@ OpenTelemetry fragments metrics into three interacting models: | |
|
||
### Event Model | ||
|
||
This specification uses as its foundation a | ||
[Metrics API consisting of 6 model instruments](api.md), each having distinct | ||
semantics, that were prototyped in several OpenTelemetry SDKs between July 2019 | ||
and June 2020. The model instruments and their specific use-cases are meant to | ||
anchor our understanding of the OpenTelemetry data model and are divided into | ||
three categories: | ||
|
||
- Synchronous vs. Asynchronous. The act of calling a Metrics API in a | ||
synchronous context means the application/library calls the SDK, typically having | ||
associated trace context and baggage; an Asynchronous instrument is called at | ||
collection time, through a callback, and lacks context. | ||
- Adding vs. Grouping. Whereas adding instruments express a sum, grouping | ||
instruments characterize a group of measurements. The numbers passed to adding | ||
instruments are divisible, in the algebraic sense, while the numbers passed | |
to grouping instruments are generally not. Adding instrument values are always | ||
parts of a sum, while grouping instrument values are individual measurements. | ||
- Monotonic vs. Non-Monotonic. The adding instruments are categorized by whether | ||
the derivative of the quantity they express is non-negative. Monotonic | ||
instruments are primarily useful for monitoring a rate value, whereas | ||
non-monotonic instruments are primarily useful for monitoring a total value. | ||
The event model is where recording of data happens. Its foundation is made of | ||
[Instruments](api.md), which are used to record data observations via events. | ||
These raw events are then transformed in some fashion before being sent to some | ||
other system. OpenTelemetry metrics are designed such that the same instrument | ||
and events can be used in different ways to generate metric streams. | ||
|
||
![Events → Streams](img/model-event-layer.png) | ||
|
||
Even though observation events could be reported directly to a backend, in | |
practice this would be infeasible due to the sheer volume of data used in | |
observability systems and the limited network/CPU resources available for | |
telemetry collection. The best example of this is the Histogram metric, where | |
raw events are recorded in a compressed format rather than as individual | |
timeseries. | |
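To illustrate the compression described in this paragraph, here is a minimal Python sketch, not part of the specification. The `HistogramPoint` class, its field names, the `summarize` helper, and the bucket boundaries are all hypothetical; the sketch only shows how an arbitrary number of raw events can collapse into a fixed-size summary.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class HistogramPoint:
    """Fixed-size summary of many raw events (hypothetical shape)."""

    bounds: List[float]  # ascending, inclusive upper bucket boundaries
    bucket_counts: List[int] = field(default_factory=list)
    count: int = 0
    total: float = 0.0

    def __post_init__(self) -> None:
        # One bucket per boundary, plus one overflow bucket.
        self.bucket_counts = [0] * (len(self.bounds) + 1)

    def record(self, value: float) -> None:
        # bisect_left places a value equal to a boundary in that boundary's bucket.
        self.bucket_counts[bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value


def summarize(events: Iterable[float], bounds: Iterable[float]) -> HistogramPoint:
    """Fold raw observation events into a single histogram point."""
    point = HistogramPoint(bounds=list(bounds))
    for value in events:
        point.record(value)
    return point
```

However many events are recorded, the exported point carries only `len(bounds) + 1` bucket counts, a count, and a sum.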
|
||
While OpenTelemetry provides flexibility in how instruments can be transformed | ||
into metric streams, the instruments are defined such that a reasonable default | ||
mapping can be provided. | ||
|
||
The [OpenTelemetry metric instruments](api.md) are designed around the | ||
|
||
following concerns: | ||
|
||
|
||
- Synchronous vs. Asynchronous collection. | ||
- Synchronous instruments are those where an application/library records a | ||
metric data point inline. In this scenario, OpenTelemetry *can* attach | ||
context (e.g. baggage) to recorded data points. | ||
- Asynchronous instruments are those where OpenTelemetry (not the application) | ||
will execute a callback (or other similar mechanism) to pull data points | ||
on demand. This is generally done at regular intervals. | ||
|
||
- Adding vs. Grouping aggregation. | ||
- Adding instruments express a sum. All points recorded via this instrument | ||
are parts of a whole. | ||
- Grouping instruments characterize a group of measurements. All points | ||
Where does this leave things like quantiles?

Quantile is a form of grouping.

I still believe this is a useful concept, but I am worried we've lost the connection to the API design by now. There's something about how Gauge and Histogram instruments are the same semantics with different default aggregations: both group individual measurements. The point is that Gauge and Histogram inputs are semantically different than Sum inputs, because of individuality. This is meant to help the user choose between Counter and Histogram, for example. @reyang I mean to follow up on last week's API/SDK SIG meeting, in which we discussed some of this.

The goal of this is NOT to define what instruments the API uses, but show the flexibility of the model to adapt to differing instruments. I'm thinking about doing the following:
This specification should be about WHAT metrics mean, not what instruments are available in the API.
||
recorded via this instrument are individual measurements. | ||
- Monotonic vs. Non-Monotonic (adding instruments only). These instruments are | ||
|
||
categorized by whether the derivative of the quantity they express is | ||
non-negative. | ||
- Monotonic instruments are primarily useful for monitoring a rate value. | ||
- Non-monotonic instruments are primarily useful for monitoring a total value. | ||
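The three concerns listed above can be sketched in a few lines of Python. This is an illustration only and does not reflect the OpenTelemetry API: `InstrumentKind`, `SyncAdder`, and `AsyncObserver` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class InstrumentKind:
    """Hypothetical descriptor for the three axes discussed above."""

    synchronous: bool
    adding: bool  # False means grouping
    monotonic: bool = False  # only meaningful for adding instruments

    def __post_init__(self) -> None:
        if self.monotonic and not self.adding:
            raise ValueError("monotonicity applies to adding instruments only")


class SyncAdder:
    """Synchronous adding instrument: the application records inline."""

    def __init__(self, name: str, monotonic: bool) -> None:
        self.name = name
        self.kind = InstrumentKind(synchronous=True, adding=True, monotonic=monotonic)
        self.events: List[Tuple[str, float]] = []

    def add(self, delta: float) -> None:
        # A monotonic sum may never decrease, so negative parts are rejected.
        if self.kind.monotonic and delta < 0:
            raise ValueError("monotonic instruments accept non-negative values only")
        self.events.append((self.name, delta))


class AsyncObserver:
    """Asynchronous instrument: the SDK pulls a value through a callback."""

    def __init__(self, name: str, callback: Callable[[], float],
                 adding: bool, monotonic: bool = False) -> None:
        self.name = name
        self.callback = callback
        self.kind = InstrumentKind(synchronous=False, adding=adding, monotonic=monotonic)

    def collect(self) -> Tuple[str, float]:
        # Called by the SDK (not the application), typically at regular intervals.
        return (self.name, self.callback())
```

In this sketch both instrument kinds produce the same (instrument, number) pairs; they differ only in who triggers the observation and in which values are legal.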
|
||
In the Event model, the primary data are (instrument, number) points, originally | ||
observed in real time or on demand (for the synchronous and asynchronous cases, | ||
|
Does a `total_latency` aggregation ever make sense? It may not be the best idea to include an example like this.

I do want an example that shows how a single instrument could turn into all of Sum, Histogram + Gauge.
While I can synthesize harebrained scenarios where I think "total latency" might make sense (e.g. ridiculous statistics like "How many days have users waited for our website in aggregate?"), you're right that it's not a super useful example. I'm going to take a day to brainstorm and look for other ideas on this example.

If the example was a count (e.g., `request_size`), then max, sum, and histogram outputs all make sense.

However, one drawback with this example (sum alongside a histogram) is that a histogram data point already contains a sum, so exposing a separate metric with the sum is not very useful. The max function makes a better example: you could export the maximum over `[1m]`, `[10m]`, `[1hr]`, and so on. (Related: open-telemetry/opentelemetry-proto#279)

The example could export one histogram by `metric_by_a_and_b{attributeA,attributeB}` and one histogram by `metric_by_a_and_c{attributeA,attributeC}`, maybe? I would emphasize that when doing this kind of output, separate metric names MUST be used to avoid metric data being recombined with itself.

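A small Python sketch may make the point about separate metric names concrete. It is an illustration under stated assumptions: the `aggregate` helper and the sample `request_size` events are hypothetical, and it uses sums and maxima rather than histograms to stay short. Only the stream names and attribute names come from the comment above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Event = Tuple[Dict[str, str], int]  # (attributes, request_size in bytes)


def aggregate(events: Iterable[Event], keep: Set[str],
              reducer: Callable[[List[int]], int]) -> Dict[tuple, int]:
    """Drop every attribute not in `keep`, then reduce values per remaining key."""
    grouped: Dict[tuple, List[int]] = defaultdict(list)
    for attrs, value in events:
        key = tuple(sorted((k, v) for k, v in attrs.items() if k in keep))
        grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


# Hypothetical raw events from a single instrument.
events: List[Event] = [
    ({"attributeA": "x", "attributeB": "1", "attributeC": "p"}, 100),
    ({"attributeA": "x", "attributeB": "2", "attributeC": "p"}, 250),
    ({"attributeA": "y", "attributeB": "1", "attributeC": "q"}, 40),
]

# Two projections of the same events, each exported under its own name.
streams = {
    "metric_by_a_and_b": aggregate(events, {"attributeA", "attributeB"}, sum),
    "metric_by_a_and_c": aggregate(events, {"attributeA", "attributeC"}, sum),
}

# A maximum over the same events, as suggested for the better example.
max_by_a_and_c = aggregate(events, {"attributeA", "attributeC"}, max)
```

Each named stream accounts for every event exactly once, so its total equals the raw total. If both projections were exported under one name and later summed together, every event would be counted twice, which is the recombination problem the comment warns about.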
I updated to use request size. I agree that histogram already has sum. I've also added some caveats to the image to denote it's meant to show the power, not a practical scenario.
PTAL at the new verbiage. Also, I don't want to throw the baby out with the bathwater here. Look at the whole context of the doc as an introduction to the space of metrics. We need to both:
We do not need to specify the OTel API or its behavior here. We only need to specify what concepts are allowed in OTel metric streams. Lots of these nuances and details belong in the API docs.
I see these docs serving two users primarily:
Instrumentation users who are generating metrics should be using the API specification.