-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metrics API with RFC 0003 #87
Conversation
Comments for Meter More comments Add more comments Fix typos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API looks good. There may be more questions when we start to attach an SDK and think about how to efficiently perform pre-aggregation when it is requested.
updater_function: The callback function to execute. | ||
""" | ||
|
||
def remove_time_series(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the remove
method should take a Timeseries object, instead of a LabelSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no problem with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might revisit this, if we decide to remove TimeSeries
entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we treat Metric
as a registry for these value containers (TimeSeries
right now, but 👍 to changing this name or axing the class completely) then I don't think we should expect to get the same instance for the same label values at different times.
Another argument for keeping this signature: We expect users to call this to stop reporting a metric for a given set of label values. If they've got the time series they can use its label values without making another call to get_or_create_timeseries
here.
""" | ||
|
||
@abstractmethod | ||
def get_default_time_series(self) -> 'object': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, would you agree that this is equivalent to self.get_or_create_time_series([])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Open Questions section of RFC 0003, this is stated as "should we eliminate GetDefaultHandle()".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I have full context on the metrics API enough to be a good approver or otherwise.
But my thoughts on the UX side of things are:
- consolidating create methods into a single method might be worth it.
- allowing strings for shortcuts to things like LabelKey, LabelValue saves a ton of typing on the consumer side.
- I think time series is a misnomer, since it looks like just a way to get an aggregator with pre-populated tag values. I feel like even the spec should change to clarify that usage.
for the exported metric are deferred. | ||
""" | ||
|
||
def create_float_counter(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on passing in the float or int as types to the create method?
Might also be a good use case for enums:
def create(self, metric_type=METRIC_TYPE.counter, data_type=float, ...)
""" | ||
|
||
@abstractmethod | ||
def get_or_create_time_series(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is probably for the RFC or OTel spec, but is there a better description than timeseries? I think most metric systems don't expose a nuance between a timeseries and an individual logging of a measure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best explanation for the name "timeseries" is that something got lost in the translation from OC code to the OT spec.
TimeSeries
used to be a list of aMetric
's timestamped measurements.- Gauges and counters aren't meant to record multiple measurements per export interval, just report the current value (for each set of labels) at export time. To be consistent with
Metric
s, they could export each value as a single-pointTimeSeries
. - Fast-forward to the OT spec: we've kept the name, but changed the underlying logic.
TimeSeries
is now a container for single values, andGauge
s are nowMetric
s.
I may be missing something here, but I don't see a reason that these should still be called "time series".
|
||
# Metrics sent to some exporter | ||
COUNTER_METRIC_TESTING = COUNTER.get_or_create_time_series(LABEL_VALUE_TESTING) | ||
COUNTER_METRIC_STAGING = COUNTER.get_or_create_time_series(LABEL_VALUE_STAGING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API still feel too complicated to me, but I don't think that's a reflection of this PR.
I guess maybe a question to the spec, but why are we not attempting to expose a simpler consumer API? for example, referencing prometheus/client_python:
from prometheus_client import Counter
c = Counter('my_requests_total', 'HTTP Failures', ['method', 'endpoint'])
c.labels('get', '/').inc()
c.labels('post', '/submit').inc()
The behavior behind the hood, using OTel terminology:
- the instantation of a global Meter() object, creating the gauge.
- the creation of time series with the labels provided
- the aggregation of the values, pivoted by the labels
Using that to clarify my thinking, I think the things that feel out of place to me are:
- time series. What does that really mean, to the end user? It seems like the time series object that is returned back is a counter with specific labels built in. I feel like a method with a name like
counter.with_predefined_labels(LABEL_VALUE_STAGING)
might make more sense.
I think if the terminology were improved a little bit, that might help things. in addition, maybe providing some shorthand, such as LabelValues being a LabelValue object or a string. If strings were accepted, you'd probably have an interface almost identical to prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a great feature to pilot in this repo (in another PR) and push up to specs or RFCs for discussion.
"""Returns a `CounterTimeSeries` with a cumulative float value.""" | ||
|
||
|
||
class CounterInt(Metric): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity: how will these APIs be used in practice? I kind of imagine that we wouldn't hook in different implementations of a counter or gauge: the behavior feels pretty clear and might be more beneficial to remain consistent.
Again an example that makes opentelemetry-sdk basically a must to use opentelemetry-api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We see this problem with almost every class in the metrics package. Tracing has separate API and SDK packages largely because vendors want to be able to change the implementation. Metrics has separate API and SDK packages to be consistent with tracing, but if (e.g.) vendors are going to change metrics, they're much more likely to add new aggregations or metric types than to change these built-in classes.
This touches on a deeper issue regarding the API/SDK separation. The API already includes some implementation to support context propagation without an SDK, so how do we decide where to draw the line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a great question to discuss in the specification. I agree with all the points you've raised here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current metrics RFC discusses the semantic connection between metrics APIs and trace context. The current specification describes three kinds of metric "instrument", cumulative, gauge, and measure, which are pure API objects. Handles (known as TimeSeries
currently) are SDK-specific implementations of the Add()
, Set()
, and Record()
APIs. Nothing is said about aggregations at the API level apart from the the standard interpretation given to each, which is to say that by default cumulative metric instruments aggregate a sum, gauge metric instruments aggregate a last-value, and measure instruments aggregate a rate of some distribution.
open-telemetry/oteps#29
By separating the API from the implementation, it should be possible for to define an exporter for direct integration with existing metrics client libraries. This will require some sort of new metrics event exporter to be specified.
METER = Meter() | ||
LABEL_KEYS = [LabelKey("environment", | ||
"the environment the application is running in")] | ||
MEASURE = METER.create_float_measure("idle_cpu_percentage", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be an api to allow direct recording, if no label values are passed?
Would simplify the creation quite a bit for a trivial counter
idle_cpu = METER.create_float_measure("idle_cpu_percentage", "cpu idle over time", 'percentage")
idle_cpu.record(psutil.cpu_times_percent().idle)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added to the Open Questions section of RFC 0003 this question. I am in favor of the idea.
Good stuff by the way! I think it's only minor issues and maybe requiring less terminology to start, the API is getting simplerl |
+1 same here. |
@toumorokoshi I agree that "Timeseries" is not a great term. In the Go repository, it currently stands at "Handle". In OpenCensus, it was "Entry". I also don't like the method "GetOrCreate" for this thing, whatever we call it, because "GetOrCreate" implies some behavior not the semantics implied. In a streaming implementation, there will be nothing to create, after which "get" by itself will not be very descriptive. I'd name the method "With", following the terminology in Prometheus. Then the code will read
|
LGTM?
|
that the metric is associated with. | ||
|
||
Returns: | ||
A new `GaugeFloat` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor personal opinion here, it might be better to use create_float_guage
with FloatGauge
rather than mixing "float gauge" and "gauge float". Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I think I will leave this for the next PR, seeing as we are getting rid of TimeSeries and these naming changes should be done all at once for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current metrics RFC PR open-telemetry/oteps#29, I've renamed TimeSeries
to Handle
following a consensus reached in the 8/21 working session, but it's just a name change relative to the code here.
import typing | ||
|
||
|
||
class CounterTimeSeries: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I have the same question regarding the TimeSeries
name :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In open-telemetry/oteps#29 this becomes Handle
. Please review.
""" | ||
def __init__(self, | ||
value: str) -> None: | ||
self.value = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's any good reasons to have mandatory containers for label values? Currently it seems to be engineered like that just for the sake of completeness.
As far as I understand majority of industry solutions, there's no support for anything but strings for labels, and it's unlikely to change in the years to come.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good point and a common ask. I think it will make sense to simply use strings for the label values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think each language should decide how to handle this question. If there is some kind of built-in support for key-value in the language, they should prefer that. In the Go repository, we have a common core.KeyValue
type that can be used as both an event label and a metric label. In logging (events), it's more common to have non-string values.
I'm not sure the API needs to specify anything about how a vendor should behave when, say, a numeric value is passed as the value for a label (and the compiler didn't prevent it). I would say the default behavior should coerce values to a string silently, not something to worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In overall picture of the world, at this point I believe the original distinction between "gauges" vs "counters" seems to have lost its meaning. "Counter" is now basically the same as "gauge" with some extra artificial limitations. Can we move forward and just deal with "aggregates" vs "raw measurements", agreeing on some term for both (previously suggested "metric" and "measurement" seems to fit quite well)?
@GreyCat The counter metric will be changed to a "cumulative" metric, in which the value has an option of either going up or down. This will be done in a separate PR. Cumulative expresses a computation of a sum. The meaning, purpose and behavior of these types of metrics are inherently different than those of a gauge. For those reasons, I believe there is a strong argument for keeping the distinction between the two. |
@lzchen
Formally, the very same is allowed for "counters" now too. I don't quite follow how observing through a callback is related to these scenarios in whole: it is a very generic mechanism, applicable to all possible emissions at all.
It actually can be (and will be). If in a certain interval we've got 3 readings of CPU usage of 40%, then 50%, then 60%, we'll happily have a sum of 40+50+60 = 150, count of 3, and thus derive that average is 150 / 3 = 50%. It's not any different from other use cases proposed for counters. There are quite a few systems that don't make this distinction, and, what's even worse from my point of view, some systems (coming from statsd semantics, namely) use that "gauge" vs "counter" distinction for totally different purpose, having a notion of "flush interval" and either resetting (for counter) or not resetting (for gauge) on the expiration of flush interval.
Makes sense, thanks!
Not really. Cumulative just expresses that there is a state, and we're aiming to have multiple ways to keep that state: that's summing, keeping min/max/last, keeping tabs on a histograms / distributions, etc. |
I was referring to capturing the initial readings of CPU Usage. For example if the user wanted to see the CPU usage simply at certain points in time (instead of aggregating), in which the value will be set through a callback.
In this context, cumulative represents a specific type of preaggregation, which we represent with as only a sum. It's purpose is to represent metrics when the value is a quantity, the sum is of primary interest, the event count and distribution are not of primary interest. For the cases you have listed (histograms, distributions, etc.), the measure metric should be used. But going back to your question of |
@GreyCat I hope the current RFC 0003 and its new sibling (unnumbered, probably 0007) on metric handles explain the intent of the specification, which is to separate the API from the implementation with a clear semantic explanation. The essence of this is that there are three kinds of object with distinct verbs Add(), Set(), and Record(). Handles of these are the way an SDK can control the behavior behind these events. Note that PR29 removes RFC 0004-metric-configurable-aggregation.md, which addressed the topics you raised. We decided this could be left out of the API, although it's something I personally like very much. Currently the RFC (0003) says what the default aggregation will be for each of the three kinds of metric, and does not specify any way to influence this behavior at the site where metrics objects are used. Carrying over from OpenCensus, there is a desire to specify a "view" API for application code to declare various levels of detail in metrics reporting. As you suggest, sometimes we're interested in recording both the count and the sum for a cumulative metric, as opposed to only the count. The SDK is certainly capable of changing this behavior. It may be possible for the application to configure this behavior (via "views", maybe). Should the application be capable of suggesting or recommending the aggregation itself? That's the topic of 0004 that is currently off the table. An SDK can do this as it pleases. Please comment on PR29, thanks! |
Thanks for the focus here! Loving the direction this is going. |
This is a continuation of [#68], with the initial Metrics API as well as Metrics RFC 0003 outlined here: open-telemetry/oteps#4
The RFC included three changes:
MeasureBatch
class to support recording of multiple observed values simultaneously.TODO:
An example recording of raw statistics:
An example of pre-aggregated metrics: