Metrics API with RFC 0003 #87

lzchen · 2019-08-14T22:59:38Z

This is a continuation of [#68], with the initial Metrics API as well as Metrics RFC 0003 outlined here: open-telemetry/oteps#4

The RFC included three changes:

Get rid of the Measure and Measurement classes and have Measure be another Metric (like Gauge and Counter)
Have the ability to pass in pre-defined label values (that match the label keys when creating the metric) when creating the time series for a metric. This is an important optimization, as programs with long-lived objects can compute pre-defined label values once, rather than once per call site.
The former raw stats API supported all-or-none recording of interdependent measurements; the RFC introduces a MeasureBatch class to support recording multiple observed values simultaneously.

TODO:

Decisions on supporting Resource and Components (should they simply be labels?)
DistributedContext (might be a label)
Decisions for MeasureBatch function signature (how to record)

An example recording of raw statistics:

METER = Meter()
LABEL_KEYS = [LabelKey("environment",
                       "the environment the application is running in")]
MEASURE = METER.create_float_measure("idle_cpu_percentage",
                                     "cpu idle over time",
                                     "percentage",
                                     LastValueAggregation)
LABEL_VALUE_TESTING = [LabelValue("Testing")]
LABEL_VALUE_STAGING = [LabelValue("Staging")]

# Metrics sent to some exporter
MEASURE_METRIC_TESTING = MEASURE.get_or_create_time_series(LABEL_VALUE_TESTING)
MEASURE_METRIC_STAGING = MEASURE.get_or_create_time_series(LABEL_VALUE_STAGING)

# record individual measures
idle = psutil.cpu_times_percent().idle
MEASURE_METRIC_STAGING.record(idle)

# record multiple observed values
batch = MeasureBatch()
batch.record([(MEASURE_METRIC_TESTING, idle), (MEASURE_METRIC_STAGING, idle)])
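
How `MeasureBatch.record` should behave is still listed as a TODO above. One possible reading of "all-or-none" is sketched below with hypothetical toy classes (`ToyMeasureTimeSeries`, `ToyMeasureBatch`): every value is validated before any is applied, so a bad value leaves all series untouched. This is an assumption for illustration, not the signature proposed in the RFC.

```python
from typing import List, Sequence, Tuple


class ToyMeasureTimeSeries:
    """Hypothetical stand-in that stores raw recorded values."""

    def __init__(self) -> None:
        self.values: List[float] = []

    def record(self, value: float) -> None:
        self.values.append(value)


class ToyMeasureBatch:
    """Hypothetical batch: validates everything first, then records."""

    def record(
        self, pairs: Sequence[Tuple[ToyMeasureTimeSeries, float]]
    ) -> None:
        # First pass: reject the whole batch if any value is unusable.
        for _, value in pairs:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError("batch rejected: non-numeric value")
        # Second pass: validation passed, so apply every value.
        for series, value in pairs:
            series.record(float(value))


testing = ToyMeasureTimeSeries()
staging = ToyMeasureTimeSeries()
batch = ToyMeasureBatch()
batch.record([(testing, 97.5), (staging, 97.5)])

try:
    batch.record([(testing, 12.0), (staging, "n/a")])
except TypeError:
    pass  # the valid first pair was not recorded either
```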

An example of pre-aggregated metrics:

METER = Meter()
LABEL_KEYS = [LabelKey("environment",
                       "the environment the application is running in")]
COUNTER = METER.create_int_counter("sum numbers",
                                   "sum numbers over time",
                                   "number",
                                   LABEL_KEYS)
LABEL_VALUE_TESTING = [LabelValue("Testing")]
LABEL_VALUE_STAGING = [LabelValue("Staging")]

# Metrics sent to some exporter
COUNTER_METRIC_TESTING = COUNTER.get_or_create_time_series(LABEL_VALUE_TESTING)
COUNTER_METRIC_STAGING = COUNTER.get_or_create_time_series(LABEL_VALUE_STAGING)

for i in range(100):
    COUNTER_METRIC_STAGING.add(i)
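
The pattern in this example, binding label values once and reusing the returned object, can be modeled without the API package. The following is a minimal, self-contained sketch, not the code in this PR: `ToyCounter` and `ToyCounterTimeSeries` are hypothetical stand-ins, label values are plain strings for brevity, and returning the same instance for the same label values is an assumption of the sketch.

```python
from typing import Dict, Sequence, Tuple


class ToyCounterTimeSeries:
    """Hypothetical stand-in for a counter bound to fixed label values."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int) -> None:
        self.value += amount


class ToyCounter:
    """Hypothetical metric handing out one time series per label-value set."""

    def __init__(self, name: str, label_keys: Sequence[str]) -> None:
        self.name = name
        self.label_keys = tuple(label_keys)
        self._series: Dict[Tuple[str, ...], ToyCounterTimeSeries] = {}

    def get_or_create_time_series(
        self, label_values: Sequence[str]
    ) -> ToyCounterTimeSeries:
        if len(label_values) != len(self.label_keys):
            raise ValueError("label values must match label keys")
        key = tuple(label_values)
        # Reuse the existing series so call sites can bind labels once.
        if key not in self._series:
            self._series[key] = ToyCounterTimeSeries()
        return self._series[key]


counter = ToyCounter("sum_numbers", ["environment"])
staging = counter.get_or_create_time_series(["Staging"])
for i in range(100):
    staging.add(i)
```

The label lookup happens once, outside the loop; the loop body only touches the bound object.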

jmacd

The API looks good. There may be more questions when we start to attach an SDK and think about how to efficiently perform pre-aggregation when it is requested.

jmacd · 2019-08-15T23:20:51Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+            updater_function: The callback function to execute.
+        """
+
+    def remove_time_series(self,


I wonder if the remove method should take a Timeseries object, instead of a LabelSet?

I have no problem with this.

Might revisit this, if we decide to remove TimeSeries entirely.

If we treat Metric as a registry for these value containers (TimeSeries right now, but 👍 to changing this name or axing the class completely) then I don't think we should expect to get the same instance for the same label values at different times.

Another argument for keeping this signature: We expect users to call this to stop reporting a metric for a given set of label values. If they've got the time series they can use its label values without making another call to get_or_create_timeseries here.

jmacd · 2019-08-15T23:23:55Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+        """
+
+    @abstractmethod
+    def get_default_time_series(self) -> 'object':


Just checking, would you agree that this is equivalent to self.get_or_create_time_series([])?

In the Open Questions section of RFC 0003, this is stated as "should we eliminate GetDefaultHandle()".
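
The equivalence asked about here can be sketched as follows, assuming a hypothetical `ToyGauge` whose time series are keyed by the tuple of label values; the default series is then simply the one keyed by the empty tuple. Whether the real API keeps `get_default_time_series` is, as noted, an open question in the RFC.

```python
from typing import Dict, Sequence, Tuple


class ToyGaugeTimeSeries:
    """Hypothetical gauge series holding a single current value."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class ToyGauge:
    def __init__(self) -> None:
        self._series: Dict[Tuple[str, ...], ToyGaugeTimeSeries] = {}

    def get_or_create_time_series(
        self, label_values: Sequence[str]
    ) -> ToyGaugeTimeSeries:
        key = tuple(label_values)
        if key not in self._series:
            self._series[key] = ToyGaugeTimeSeries()
        return self._series[key]

    def get_default_time_series(self) -> ToyGaugeTimeSeries:
        # Under this assumption the default is just the empty label set.
        return self.get_or_create_time_series([])


gauge = ToyGauge()
gauge.get_default_time_series().set(42.0)
```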

toumorokoshi

I don't think I have full context on the metrics API enough to be a good approver or otherwise.

But my thoughts on the UX side of things are:

consolidating create methods into a single method might be worth it.
allowing strings for shortcuts to things like LabelKey, LabelValue saves a ton of typing on the consumer side.
I think time series is a misnomer, since it looks like just a way to get an aggregator with pre-populated tag values. I feel like even the spec should change to clarify that usage.

toumorokoshi · 2019-08-16T05:10:31Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+    for the exported metric are deferred.
+    """
+
+    def create_float_counter(self,


thoughts on passing in the float or int as types to the create method?

Might also be a good use case for enums:

def create(self, metric_type=METRIC_TYPE.counter, data_type=float, ...)
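
A fuller version of this suggestion might look like the sketch below. `MetricType`, `ToyMetric`, and `ToyMeter` are hypothetical names invented for illustration; the PR itself keeps separate `create_float_counter`-style methods.

```python
from enum import Enum
from typing import Sequence, Tuple, Type, Union


class MetricType(Enum):
    """Hypothetical enum naming the metric kinds discussed in this PR."""

    COUNTER = "counter"
    GAUGE = "gauge"
    MEASURE = "measure"


class ToyMetric:
    """Hypothetical metric that remembers how it was created."""

    def __init__(
        self,
        name: str,
        metric_type: MetricType,
        data_type: Type[Union[int, float]],
        label_keys: Tuple[str, ...],
    ) -> None:
        self.name = name
        self.metric_type = metric_type
        self.data_type = data_type
        self.label_keys = label_keys


class ToyMeter:
    def create(
        self,
        name: str,
        metric_type: MetricType = MetricType.COUNTER,
        data_type: Type[Union[int, float]] = float,
        label_keys: Sequence[str] = (),
    ) -> ToyMetric:
        # One entry point instead of one create_* method per kind and type.
        if data_type not in (int, float):
            raise ValueError("data_type must be int or float")
        return ToyMetric(name, metric_type, data_type, tuple(label_keys))


meter = ToyMeter()
requests = meter.create("requests", MetricType.COUNTER, int, ["environment"])
idle_cpu = meter.create("idle_cpu_percentage", MetricType.MEASURE)
```

The trade-off is that a single method can no longer give each metric kind its own precise return type, which the separate methods can.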

toumorokoshi · 2019-08-16T05:16:11Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+    """
+
+    @abstractmethod
+    def get_or_create_time_series(self,


this is probably for the RFC or OTel spec, but is there a better description than timeseries? I think most metric systems don't expose a nuance between a timeseries and an individual logging of a measure.

I think the best explanation for the name "timeseries" is that something got lost in the translation from OC code to the OT spec.

TimeSeries used to be a list of a Metric's timestamped measurements.

Gauges and counters aren't meant to record multiple measurements per export interval, just report the current value (for each set of labels) at export time. To be consistent with Metrics, they could export each value as a single-point TimeSeries.

Fast-forward to the OT spec: we've kept the name, but changed the underlying logic. TimeSeries is now a container for single values, and Gauges are now Metrics.

I may be missing something here, but I don't see a reason that these should still be called "time series".

toumorokoshi · 2019-08-17T03:47:44Z

opentelemetry-api/src/opentelemetry/metrics/examples/pre_aggregated.py

+
+# Metrics sent to some exporter
+COUNTER_METRIC_TESTING = COUNTER.get_or_create_time_series(LABEL_VALUE_TESTING)
+COUNTER_METRIC_STAGING = COUNTER.get_or_create_time_series(LABEL_VALUE_STAGING)


This API still feels too complicated to me, but I don't think that's a reflection of this PR.

I guess this is maybe a question for the spec, but why are we not attempting to expose a simpler consumer API? For example, referencing prometheus/client_python:

from prometheus_client import Counter
c = Counter('my_requests_total', 'HTTP Failures', ['method', 'endpoint'])
c.labels('get', '/').inc()
c.labels('post', '/submit').inc()

The behavior under the hood, using OTel terminology:

the instantiation of a global Meter() object, creating the counter.

the creation of time series with the labels provided

the aggregation of the values, pivoted by the labels

Using that to clarify my thinking, I think the things that feel out of place to me are:

time series. What does that really mean, to the end user? It seems like the time series object that is returned back is a counter with specific labels built in. I feel like a method with a name like counter.with_predefined_labels(LABEL_VALUE_STAGING) might make more sense.

I think if the terminology were improved a little bit, that might help things. In addition, maybe provide some shorthand, such as allowing each label value to be either a LabelValue object or a string. If strings were accepted, you'd probably have an interface almost identical to Prometheus.

This sounds like a great feature to pilot in this repo (in another PR) and push up to specs or RFCs for discussion.
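
As a rough illustration of the shorthand discussed in this thread, the sketch below accepts either plain strings or wrapper objects for label values. All names are hypothetical: `ShorthandCounter` and `BoundCounter` do not exist in this PR, `with_predefined_labels` is only the name floated above, and the `LabelValue` class here is a toy wrapper rather than the one from the API package.

```python
from typing import Dict, Sequence, Tuple, Union


class LabelValue:
    """Toy wrapper standing in for the LabelValue class discussed here."""

    def __init__(self, value: str) -> None:
        self.value = value


class BoundCounter:
    """Hypothetical counter with its label values already attached."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


class ShorthandCounter:
    def __init__(self, name: str, label_keys: Sequence[str]) -> None:
        self.name = name
        self.label_keys = tuple(label_keys)
        self._bound: Dict[Tuple[str, ...], BoundCounter] = {}

    def with_predefined_labels(
        self, *label_values: Union[str, LabelValue]
    ) -> BoundCounter:
        # Accept either plain strings or LabelValue wrappers.
        key = tuple(
            v.value if isinstance(v, LabelValue) else v for v in label_values
        )
        if len(key) != len(self.label_keys):
            raise ValueError("expected one value per label key")
        if key not in self._bound:
            self._bound[key] = BoundCounter()
        return self._bound[key]


c = ShorthandCounter("my_requests_total", ["method", "endpoint"])
c.with_predefined_labels("get", "/").inc()
c.with_predefined_labels(LabelValue("get"), LabelValue("/")).inc()
c.with_predefined_labels("post", "/submit").inc()
```

With strings accepted, the call sites read almost the same as the Prometheus client example quoted earlier in the thread.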

toumorokoshi · 2019-08-17T03:52:21Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+        """Returns a `CounterTimeSeries` with a cumulative float value."""
+
+
+class CounterInt(Metric):


Just out of curiosity: how will these APIs be used in practice? I kind of imagine that we wouldn't hook in different implementations of a counter or gauge: the behavior feels pretty clear and might be more beneficial to remain consistent.

Again, this is an example that makes opentelemetry-sdk basically a must for using opentelemetry-api.

We see this problem with almost every class in the metrics package. Tracing has separate API and SDK packages largely because vendors want to be able to change the implementation. Metrics has separate API and SDK packages to be consistent with tracing, but if (e.g.) vendors are going to change metrics, they're much more likely to add new aggregations or metric types than to change these built-in classes.

This touches on a deeper issue regarding the API/SDK separation. The API already includes some implementation to support context propagation without an SDK, so how do we decide where to draw the line?

Sounds like a great question to discuss in the specification. I agree with all the points you've raised here.

The current metrics RFC discusses the semantic connection between metrics APIs and trace context. The current specification describes three kinds of metric "instrument" (cumulative, gauge, and measure), which are pure API objects. Handles (currently known as TimeSeries) are SDK-specific implementations of the Add(), Set(), and Record() APIs. Nothing is said about aggregations at the API level apart from the standard interpretation given to each: by default, cumulative metric instruments aggregate a sum, gauge metric instruments aggregate a last value, and measure instruments aggregate a rate of some distribution.
open-telemetry/oteps#29

By separating the API from the implementation, it should be possible to define an exporter for direct integration with existing metrics client libraries. This will require some sort of new metrics event exporter to be specified.
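
To make the three verbs and their default interpretations concrete, here is a minimal sketch with hypothetical handle classes. The method names are lower-cased to follow Python convention (the RFC writes Add(), Set(), and Record()), and keeping raw values for the measure is an assumption standing in for "some distribution"; an actual SDK is free to aggregate differently.

```python
from typing import List, Optional


class ToyCumulativeHandle:
    """add(): the default interpretation is a running sum."""

    def __init__(self) -> None:
        self.total = 0.0

    def add(self, value: float) -> None:
        self.total += value


class ToyGaugeHandle:
    """set(): the default interpretation is the last value seen."""

    def __init__(self) -> None:
        self.last: Optional[float] = None

    def set(self, value: float) -> None:
        self.last = value


class ToyMeasureHandle:
    """record(): individual values kept so a distribution can be derived."""

    def __init__(self) -> None:
        self.values: List[float] = []

    def record(self, value: float) -> None:
        self.values.append(value)


bytes_sent = ToyCumulativeHandle()
queue_depth = ToyGaugeHandle()
latency_ms = ToyMeasureHandle()

for size in (100.0, 250.0):
    bytes_sent.add(size)
for depth in (3.0, 7.0):
    queue_depth.set(depth)
for latency in (12.0, 30.0, 18.0):
    latency_ms.record(latency)
```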

toumorokoshi · 2019-08-17T03:58:59Z

opentelemetry-api/src/opentelemetry/metrics/examples/raw.py

+METER = Meter()
+LABEL_KEYS = [LabelKey("environment",
+                       "the environment the application is running in")]
+MEASURE = METER.create_float_measure("idle_cpu_percentage",


should there be an api to allow direct recording, if no label values are passed?

Would simplify the creation quite a bit for a trivial counter

idle_cpu = METER.create_float_measure("idle_cpu_percentage", "cpu idle over time", "percentage")
idle_cpu.record(psutil.cpu_times_percent().idle)

I added this question to the Open Questions section of RFC 0003. I am in favor of the idea.

toumorokoshi · 2019-08-17T04:03:22Z

Good stuff, by the way! I think it's only minor issues, and maybe requiring less terminology to start. The API is getting simpler.

reyang · 2019-08-17T04:57:08Z

allowing strings for shortcuts to things like LabelKey, LabelValue saves a ton of typing on the consumer side.

+1 same here.

jmacd · 2019-08-19T19:53:25Z

@toumorokoshi I agree that "Timeseries" is not a great term. In the Go repository, it currently stands at "Handle". In OpenCensus, it was "Entry".

I also don't like the method name "GetOrCreate" for this thing, whatever we call it, because "GetOrCreate" describes an implementation behavior rather than the intended semantics. In a streaming implementation, there will be nothing to create, after which "get" by itself will not be very descriptive.

I'd name the method "With", following the terminology in Prometheus.

Then the code will read

  handle = metric.With(labels...)
  handle.Set(value, labels...)

lzchen · 2019-08-22T23:55:40Z

LGTM?
TODOs:

Keeping/removing TimeSeries
How to handle Aggregations
Which metric types are we supporting
Terminology

reyang · 2019-08-23T18:02:27Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+                that the metric is associated with.
+
+        Returns:
+            A new `GaugeFloat`


Minor personal opinion here: it might be better to use create_float_gauge with FloatGauge rather than mixing "float gauge" and "gauge float". Up to you.

Makes sense. I think I will leave this for the next PR, seeing as we are getting rid of TimeSeries and these naming changes should be done all at once for consistency.

In the current metrics RFC PR open-telemetry/oteps#29, I've renamed TimeSeries to Handle following a consensus reached in the 8/21 working session, but it's just a name change relative to the code here.

reyang · 2019-08-23T18:13:39Z

opentelemetry-api/src/opentelemetry/metrics/time_series.py

+import typing
+
+
+class CounterTimeSeries:


I guess I have the same question regarding the TimeSeries name :)

In open-telemetry/oteps#29 this becomes Handle. Please review.

GreyCat · 2019-09-03T14:28:36Z

opentelemetry-api/src/opentelemetry/metrics/__init__.py

+    """
+    def __init__(self,
+                 value: str) -> None:
+        self.value = value


I wonder if there's any good reasons to have mandatory containers for label values? Currently it seems to be engineered like that just for the sake of completeness.

As far as I understand, the majority of industry solutions support nothing but strings for labels, and that's unlikely to change in the years to come.

This is a good point and a common ask. I think it will make sense to simply use strings for the label values.

I think each language should decide how to handle this question. If there is some kind of built-in support for key-value in the language, they should prefer that. In the Go repository, we have a common core.KeyValue type that can be used as both an event label and a metric label. In logging (events), it's more common to have non-string values.

I'm not sure the API needs to specify anything about how a vendor should behave when, say, a numeric value is passed as the value for a label (and the compiler didn't prevent it). I would say the default behavior should coerce values to a string silently, not something to worry about.
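
The silent coercion described here amounts to very little code. A minimal sketch, assuming a hypothetical helper named `normalize_label_values` that is not part of this PR:

```python
from typing import Iterable, Tuple


def normalize_label_values(values: Iterable[object]) -> Tuple[str, ...]:
    """Coerce every label value to str so the resulting key is uniform."""
    return tuple(str(value) for value in values)


# A number and a boolean slip in where strings were expected.
key = normalize_label_values(["Staging", 8080, True])
```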

GreyCat

In the overall picture of the world, at this point I believe the original distinction between "gauges" and "counters" has lost its meaning. "Counter" is now basically the same as "gauge" with some extra artificial limitations. Can we move forward and just deal with "aggregates" vs "raw measurements", agreeing on some term for both (the previously suggested "metric" and "measurement" seem to fit quite well)?

lzchen · 2019-09-03T16:15:14Z

@GreyCat
Thanks for the review. Gauge metrics express a pre-calculated value that is either "set" or observed through a callback. These metrics usually cannot be represented using a sum or rate because the measurement interval is arbitrary (such as getting cpu_usage). As well, when setting a gauge explicitly, it occurs within an implicit or explicit context, as opposed to the callback, where there is no context.

The counter metric will be changed to a "cumulative" metric, in which the value has an option of either going up or down. This will be done in a separate PR.

Cumulative expresses a computation of a sum. The meaning, purpose and behavior of these types of metrics are inherently different than those of a gauge. For those reasons, I believe there is a strong argument for keeping the distinction between the two.

GreyCat · 2019-09-03T17:12:18Z

@lzchen
Sorry, I don't quite follow.

Gauge metrics express a pre-calculated value that is either "set" or observed through a callback.

Formally, the very same is allowed for "counters" now too. I don't quite follow how observing through a callback relates to these scenarios as a whole: it is a very generic mechanism, applicable to any kind of emission.

These metrics usually cannot be represented using a sum or rate because the measurement interval is arbitrary (such as getting cpu_usage).

It actually can be (and will be). If in a certain interval we've got 3 readings of CPU usage of 40%, then 50%, then 60%, we'll happily have a sum of 40+50+60 = 150, count of 3, and thus derive that average is 150 / 3 = 50%.

It's not any different from other use cases proposed for counters. There are quite a few systems that don't make this distinction and, what's even worse from my point of view, some systems (namely those coming from statsd semantics) use the "gauge" vs "counter" distinction for a totally different purpose: they have a notion of a "flush interval" and either reset (for a counter) or don't reset (for a gauge) when the flush interval expires.
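
The flush-interval behavior described in this paragraph can be sketched as follows. `ToyFlushingStore` is a hypothetical toy, not statsd code; it only models the reset-versus-keep distinction named above.

```python
from typing import Dict


class ToyFlushingStore:
    """Hypothetical store modeling flush-interval semantics."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}
        self.gauges: Dict[str, float] = {}

    def incr(self, name: str, amount: int = 1) -> None:
        self.counters[name] = self.counters.get(name, 0) + amount

    def gauge(self, name: str, value: float) -> None:
        self.gauges[name] = value

    def flush(self) -> Dict[str, Dict[str, float]]:
        snapshot = {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
        }
        # Counters restart from zero each interval; gauges keep their value.
        for name in self.counters:
            self.counters[name] = 0
        return snapshot


store = ToyFlushingStore()
store.incr("requests", 3)
store.gauge("cpu_usage", 50.0)
first = store.flush()
second = store.flush()
```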

The counter metric will be changed to a "cumulative" metric, in which the value has an option of either going up or down. This will be done in a separate PR.

Makes sense, thanks!

Cumulative expresses a computation of a sum.

Not really. Cumulative just expresses that there is a state, and we're aiming to have multiple ways to keep that state: summing, keeping min/max/last, keeping tabs on histograms / distributions, etc.
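
The arithmetic in this comment (readings of 40, 50, and 60 giving a sum of 150, a count of 3, and an average of 50) and the idea of keeping several kinds of state side by side can be sketched with a hypothetical `ToySummary` class. It is an illustration only and does not correspond to any class in this PR.

```python
from typing import Optional


class ToySummary:
    """Hypothetical aggregate state kept for one set of label values."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self.minimum: Optional[float] = None
        self.maximum: Optional[float] = None
        self.last: Optional[float] = None

    def record(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = (
            value if self.minimum is None else min(self.minimum, value)
        )
        self.maximum = (
            value if self.maximum is None else max(self.maximum, value)
        )
        self.last = value

    def average(self) -> Optional[float]:
        # Derived at read time from the stored sum and count.
        return self.total / self.count if self.count else None


cpu_usage = ToySummary()
for reading in (40.0, 50.0, 60.0):
    cpu_usage.record(reading)
```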

lzchen · 2019-09-03T18:03:15Z

@GreyCat

It actually can be (and will be). If in a certain interval we've got 3 readings of CPU usage of 40%, then 50%, then 60%, we'll happily have a sum of 40+50+60 = 150, count of 3, and thus derive that average is 150 / 3 = 50%.

I was referring to capturing the initial readings of CPU usage. For example, the user may want to see the CPU usage simply at certain points in time (instead of aggregating), in which case the value will be set through a callback.

Not really. Cumulative just expresses that there is a state, and we're aiming to have multiple ways to keep that state: summing, keeping min/max/last, keeping tabs on histograms / distributions, etc.

In this context, cumulative represents a specific type of pre-aggregation, which we represent only as a sum. Its purpose is to represent metrics where the value is a quantity and the sum is of primary interest, while the event count and distribution are not. For the cases you have listed (histograms, distributions, etc.), the measure metric should be used.

But going back to your question of counter vs. gauge: the introduction of the cumulative metric, which will be done in a separate PR, should give a clearer picture of the distinction between that metric and gauge.

jmacd · 2019-09-03T20:05:14Z

@GreyCat I hope the current RFC 0003 and its new sibling (unnumbered, probably 0007) on metric handles explain the intent of the specification, which is to separate the API from the implementation with a clear semantic explanation. The essence of this is that there are three kinds of object with distinct verbs Add(), Set(), and Record(). Handles of these are the way an SDK can control the behavior behind these events.

See open-telemetry/oteps#29

Note that PR29 removes RFC 0004-metric-configurable-aggregation.md, which addressed the topics you raised. We decided this could be left out of the API, although it's something I personally like very much.

Currently the RFC (0003) says what the default aggregation will be for each of the three kinds of metric, and does not specify any way to influence this behavior at the site where metrics objects are used. Carrying over from OpenCensus, there is a desire to specify a "view" API for application code to declare various levels of detail in metrics reporting. As you suggest, sometimes we're interested in recording both the count and the sum for a cumulative metric, as opposed to only the count. The SDK is certainly capable of changing this behavior. It may be possible for the application to configure this behavior (via "views", maybe). Should the application be capable of suggesting or recommending the aggregation itself? That's the topic of 0004 that is currently off the table. An SDK can do this as it pleases. Please comment on PR29, thanks!

toumorokoshi · 2019-09-12T21:05:55Z

Thanks for the focus here! Loving the direction this is going.

lzchen added 15 commits July 30, 2019 14:52

Create functions

6ca4274

Comments for Meter More comments Add more comments Fix typos

fix lint

b23cec1

Fix lint

981eece

fix typing

8ea9709

Remove options, constructors, seperate labels

00b4f11

Consistent naming for float and int

34c87ce

Abstract time series

df8ae34

Use ABC

a2561ac

Fix typo

1ece493

Fix docs

ce9268a

seperate measure classes

f5f9f01

Add examples

74a1815

fix lint

0a0b8ee

Update to RFC 0003

555bf50

Add spancontext, measurebatch

d6b1113

lzchen requested review from c24t, carlosalberto, Oberon00 and reyang as code owners August 14, 2019 22:59

lzchen requested a review from toumorokoshi August 14, 2019 22:59

lzchen mentioned this pull request Aug 14, 2019

Metrics API #68

Closed

lzchen added 2 commits August 15, 2019 10:27

Merge branch 'master' of https://github.com/open-telemetry/openteleme…

c819109

…try-python into metrics-rfc

Fix docs

18cfc71

jmacd approved these changes Aug 15, 2019

toumorokoshi reviewed Aug 17, 2019

carlosalberto reviewed Aug 19, 2019

opentelemetry-api/src/opentelemetry/metrics/aggregation.py (outdated, resolved)

Merge branch 'master' of https://github.com/open-telemetry/openteleme…

f646555

…try-python into metrics-rfc

lzchen added 5 commits August 22, 2019 13:26

skip examples

66c0a56

white space

e2c4a7e

fix spacing

2fb7646

fix imports

eb711cb

fix imports

baa3a32

reyang reviewed Aug 23, 2019

GreyCat reviewed Sep 3, 2019

lzchen added 2 commits September 3, 2019 11:10

Merge branch 'master' of https://github.com/open-telemetry/openteleme…

5c30a9c

…try-python into metrics-rfc

LabelValues to str

211b20c

lzchen requested a review from a-feld as a code owner September 3, 2019 18:22

lzchen added 2 commits September 3, 2019 11:35

Black formatting

bffe040

fix isort

0759b9a

jmacd mentioned this pull request Sep 11, 2019

Acceptance PRs for proposed OTEPs open-telemetry/oteps#44

Closed

Remove aggregation

44d62f8

toumorokoshi self-requested a review September 12, 2019 21:06

toumorokoshi approved these changes Sep 12, 2019

lzchen added 3 commits September 12, 2019 14:09

Fix names

c5ab2df

Remove aggregation from docs

50d2de5

Fix lint

d79bc7d

lzchen merged commit fb11568 into open-telemetry:master Sep 12, 2019

lzchen mentioned this pull request Sep 19, 2019

Add metrics API #48

Closed

srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request Nov 1, 2020

Improve Tracer API docs (open-telemetry#87)

06ac7aa