Lang metrics export #544

Sushisource · 2023-05-08T23:54:39Z

This takes a crack at what it will look like to export metrics to lang.

DRAFT since it's still subject to change, though the existing implementation being backed by the new interface does work.

Sushisource · 2023-05-08T23:55:22Z

core-api/src/telemetry/metrics.rs

+/// Implementors of this trait are expected to be defined in each language's bridge.
+/// The implementor is responsible for the allocation/instantiation of new metric meters which
+/// Core has requested.
+pub trait CoreMeter: Send + Sync + Debug {


Most public API stuff is in this file

cretz · 2023-05-09T13:09:20Z

core-api/src/telemetry/metrics.rs

+/// Core has requested.
+pub trait CoreMeter: Send + Sync + Debug {
+    fn new_attributes(&self, attribs: MetricsAttributesOptions) -> MetricAttributes;
+    fn counter(&self, name: &str) -> Arc<dyn Counter>;


Are there any perf concerns with the cost of dyn for metrics? Is this dynamically dispatched for each metrics call?

See other answer. Kinda need to to keep things sane. That's not the overhead I'm worried about so much as attribute passing.

cretz · 2023-05-09T13:10:10Z

core-api/src/telemetry/metrics.rs

+/// Core has requested.
+pub trait CoreMeter: Send + Sync + Debug {
+    fn new_attributes(&self, attribs: MetricsAttributesOptions) -> MetricAttributes;
+    fn counter(&self, name: &str) -> Arc<dyn Counter>;


Can you just return Counter here and make thread-safety the implementer's problem? I don't think we need to force Arc here.

I can't. I actually had quite the struggle with the type system on trying to get this to be as little overhead as possible. Traits cannot return existential types (can't return impl Counter) yet. I would need to make the trait generic, or have an associated type, but that would infect the entire rest of the code with generics just for this purpose which would be no bueno.

Thus, these methods have to return some kind of pointer, so might as well be Arc here since Box is just going to get shoved in one anyway

The only other option would make the instrument types also be enums, which is another spot where there can be a mis-match between the attribute type and the instrument type, which is sorta sad. But, given that can already happen, I'd probably be fine with enum-ing them too.

Word, seems like a language limitation then. Definitely don't want generic "metric" enum with generic value setters or any of that. If you don't believe the cost of Arc + dyn would be more than, say, having counter_add(&self, name: &str, value: u64, attributes &MetricAttributes) on this trait directly (expecting lang to memoize counter based on name and maybe attributes), then no prob.

I don't mean enum instrument, I mean enum counter (& histogram, gauge) which is either otel-in-core backed, or lang-backed.

The lang-backed version would essentially do what you suggest there, except it would store an internal u64 or something which would be used for a pre-hashed or indexed lookup lang side rather than hashing the name every time. (just like the attributes are)

cretz · 2023-05-09T13:15:10Z

core-api/src/telemetry/metrics.rs

+}
+
+pub trait Counter: Send + Sync {
+    fn add(&self, value: u64, attributes: &MetricAttributes);


Providing attributes at metric record time can get a bit expensive for naive implementations that need to group by collective attributes. There is a cost associated with not pre-binding attributes. Can we consider an alteration in the API (even if the primary impl of that doesn't use it that way) that associates metric attributes with the metric instead of the individual value? And can you have with_new_attrs on each of these metrics instead? This way implementers can implement these metric impls having already created the counter with its attributes. And it also discourages users of these metrics from adding any attribute willy-nilly at record time.

The default OTel impl can just store these attrs with the counter and provide them at record time. I know that shifting Arc to be a caller problem and having you keep the attributes until record time is going to require you make your own struct wrapper for the default OTel impl of each metric instead of the easier impl Counter for metrics::Counter<u64>, but I think it makes for a cleaner general-purpose API.

Having thought about this, I now fear that the way the metrics are recorded may preclude this. While counter.with_new_attrs(attrs).add(123) is just as cheap as counter.add(123, attrs), if you are always calling with_new_attrs on every record time anyways then there's no value here. So this can be ignored.

Just know that many non-OTel implementations are required to group by these attributes before applying the value. But otherwise, this can be ignored.

Looks like this will work with the Otel JS metrics API: https://github.com/open-telemetry/opentelemetry-js/blob/b8b63083869fb5850fd4830ded0fc509a2cd0f0a/api/src/metrics/Metric.ts#L65

We'll lose the otel context but other than that everything else should work the way it is.

Yes. This passing is actually the part I'm more concerned about perf wise. But, I forgot to write this in the description, you'll notice I make it so lang allocates attributes on it's side and then they can be re-used: https://github.com/temporalio/sdk-core/pull/544/files#diff-3d2660d932700d89b3484aac58d79baa90dc4ae978995144d0fa9dceb1e0ce1eR91

So new attributes will be created like that for things which change infrequently. Things which potentially change every call have to get passed anyway. The downside of course is lang has to maintain that cache, but it should be fine for it to be add-only, so pretty simple.

That lang controls creating this set (e.g. w/ pre-built hashing) does seem to help, so long as the Rust side doesn't commonly create this attribute set on each recording (here and there can make sense though).

Go Prometheus for instance takes the labels (what we call attributes here) and uses fnv1-a to hash them for grouping purposes, but is optimized to do that ahead of time if the caller does it ahead of time. And so they don't take labels at record time to discourage dynamic labelling.

I'm ok with this as is so long as core code tries its best within reason to reuse the attribute set.

I'm missing how lang is supposed to maintain this cache, aren't most attributes defined core side?

When new attributes need to get defined, they are send over as strings -- and then the return value from that is an ID that is used to refer to them in the future. Attributes that aren't expected to be re-used may also be passed as strings on a per-call basis.

cretz · 2023-05-09T13:17:01Z

core-api/src/telemetry/metrics.rs

+                // TODO: Pass in? Hard to make work with trait
+                ctx: opentelemetry::Context::current(),


Just documenting that users can use thread-locals or thread contexts or whatever is enough I think.

bergundy

Overall this LGTM, I'll try to integrate this with TS when I get some non-versioning time.
Maybe let's leave this PR open until we have the first use case.

core-api/src/telemetry.rs

cretz · 2023-07-31T13:58:25Z

core/src/telemetry/metrics.rs

+
+/// Buffers [MetricEvent]s for periodic consumption by lang
+#[derive(Debug)]
+pub struct MetricsCallBuffer {


Is this visible outside the crate?

Indeed it is

Sushisource · 2023-08-03T01:25:58Z

Needs #544 to get merged first

cretz · 2023-08-03T12:22:28Z

Needs #544 to get merged first

I assume you mean #584

cretz · 2023-08-07T19:27:59Z

core-api/src/telemetry/metrics.rs

+}
+
+pub trait Gauge: Send + Sync {
+    // When referring to durations, this value is in millis


Just crawling this PR again...is there a use of gauge for durations?

I don't think so at the moment

👍 I vote we remove the comment and when we do have duration based gauges (I can't think of when) we can revisit how to expose.

cretz · 2023-08-07T19:29:25Z

core-api/src/telemetry/metrics.rs

+}
+
+pub trait Histogram: Send + Sync {
+    // When referring to durations, this value is in millis


Is there a case where we use histograms for anything but durations? In the Go SDK, we were able to get away with three metric types: counter, gauge, and timer (which is histogram, but we called it timer and clearly used durations). I wonder if a similar thing would make sense here.

I am worried that people will want to know whether you are recording a duration or a numeric value.

They very easily could be non-durations, and either way it's a histogram. The language is clear with "when" - I don't think this is an issue. At the end of the day it's still a histogram regardless of what unit you've chosen.

But I think it's helpful for our users to know whether they're recording a duration or not. We can say "may be milliseconds or just some other number" but then they're gonna have some extra conditionals checking string metric name against a known list to know it's a duration. I think we should have a separate concept for duration-based histograms and non-duration-based histograms (and if there are only duration-based histograms ever used, we can wait until we need the non-duration-based to add it).

I disagree. There aren't a ton of other metrics APIs that do this. There are metric types, which generally correlate to how things are aggregated, and there are units. Confusing units with the method of aggregation is I think not an appropriate design. Or, perhaps more importantly, at the end of the day these are the aggregation methods we have available to us on the backend, so it makes sense to be clear about which one's getting used.

If we would like to accept units, sure, that's totally reasonable, but IMO it's a definite no-no to consider two metric types to be different simply because they are using different units.

If we would like to accept units, sure, that's totally reasonable, but IMO it's a definite no-no to consider two metric types to be different simply because they are using different units.

👍 I will happily take units. I just figured separate traits was the best way to represent separate units instead of int + enum or something. I wasn't wanting different metric types on the downstream system (they both still become the same histograms). For example, with Prometheus in some languages people create either a "timer" or a "histogram", but the former is just sugar over the latter with the type as duration.

Any way you can let the user know the histogram is for a duration vs non-duration would be helpful to the potential implementer.

FYIW I feels like Prometheus encourage us to take the wrong direction, and we should not follow it's lead simply on the principle that "most engines out there" are doing it that way. Metric emitters should focus on what those values really means (ie to them), rather than how they will be stored or presented.

The fact is that some place in the code is either emitting time values, or something else, but in the end, the unit for that metric is unambiguous for the caller and should therefore also be clearly specified as part the "metric" contract. The data is a duration, expressed in ms. The fact that it will likely be presented as an histogram is of lesser significancy.

FYIW I feels like Prometheus encourage us to take the wrong direction, and we should not follow it's lead simply on the principle that "most engines out there" are doing it that way.

It's not about a specific backend, that was just an example. It's not taking me in a certain direction. Very simply, we should type our datatypes as specifically as is useful. Knowing that something is a duration and not just an integer is useful. Being able to translate durations separately from integers is useful.

I gave an example of a backend where that translation is useful. Surely we agree that knowing something is a duration is useful to a user? And if we agree on that, surely we agree a duration type (or enum or different trait or whatever) is a better way of knowing that than checking a predefined string literal?

the unit for that metric is unambiguous for the caller

Not true if someone has to == "milliseconds" in their code. That's pretty ambiguous. Hope we don't add "seconds" next. Or is it "Milliseconds" or "ms", ug, who knows, hopefully the docs are less ambiguous than our data typing.

The data is a duration, expressed in ms.

More like the data is an integer, expressed with an arbitrary string you must know to assume ms.

Having said all of that, if the team believes people need to compare against a string literal if they need to do anything with the duration value specifically, ok. But please document what the literal is that we use in core for this value. I know we're gonna have to document that lang side.

FYIW I feels like Prometheus encourage us to take the wrong direction, and we should not follow it's lead simply on the principle that "most engines out there" are doing it that way.

It's not about a specific backend, that was just an example. It's not taking me in a certain direction. Very simply, we should type our datatypes as specifically as is useful. Knowing that something is a duration and not just an integer is useful. Being able to translate durations separately from integers is useful.

Hey, easy! We're pretty much saying the same 😄. What we have today is what I see as wrong, notably because it treats metrics data as raw numbers to be aggregated into some buckets, instead of focussing on the actual meaning of these values from the emitter's POV. And I think that mainly come from our Prometheus heritage.

the unit for that metric is unambiguous for the caller

Not true if someone has to == "milliseconds" in their code. That's pretty ambiguous. Hope we don't add "seconds" next. Or is it "Milliseconds" or "ms", ug, who knows, hopefully the docs are less ambiguous than our data typing.

Again, I'm saying the same. At any specific code location that emit metrics, the unit in which values are provided are very likely non-ambiguous. For example, a specific code line emitting a metric entry knows that the value it is pushing is a duration, expressed in milliseconds. But ambiguity begins immediately after that point. Different call sites may be emitting entries for the same metric... Are all of these call sites emitting values in the same unit? Supposing that the metric is getting presented as an histogram, are bucket boundaries defined using the correct unit? Is that unit clear to the end user?

All of these assume some form of API contract, which is purely implicit at the moment. With pluggable, lang-side metric exporters, this implicitness will only become more penalizing than before, especially considering that the target system may have different ways of aggregating/presenting values.

Can we at least document all the string units we set

Core doesn't even set any units, right now, but I can add that to this PR.

Anyway, my main point here is it doesn't really make much sense to expose the unit type safety stuff at the Core level, because lang can do that, and the translation to calls into core is quite easy to get right. I agree of course that if users want to specify unit & associated value types in lang that that makes sense.

The only thing I could do in Core would maybe be something like accepting std::time::Duration for some kinda time-specific histogram, but even then I'm just immediately going to toss that out the window when I make the real metric call, and lang should've already done that with it's type.

IE: I don't think any of this should result in any change to this PR, besides specifying units & descriptions in Core's own metrics.

The only thing I could do in Core would maybe be something like accepting std::time::Duration for some kinda time-specific histogram, but even then I'm just immediately going to toss that out the window when I make the real metric call,

But I, the language dev, am not going to toss that out the window. I am going to give it to the user's metric impl and they're gonna be happy I differentiated.

But we've discussed enough, string is good enough I suppose. Let's just document the string core uses for its durations so I can have a hardcoded conditional for stronger typing lang side.

Sushisource · 2023-08-10T21:40:15Z

I realized the underling library update seems to have imposed some naming changes on gauges too. I want to make sure those are resolved before I merged this.

cretz · 2023-08-11T13:31:12Z

core-api/src/telemetry/metrics.rs

+    pub description: Cow<'static, str>,
+    /// Unit information that will appear in metadata if the backend supports it
+    #[builder(setter(into), default = "\"\".into()")]
+    pub unit: Cow<'static, str>,


Can we document the unit string core uses for its durations? We will have to document this lang side so that when users use durations or when users write a backend, they know to match (though lang side may be more strongly typed than Rust here I suspect for differing unit types, which is a weird statement).

Sushisource requested a review from a team as a code owner May 8, 2023 23:54

Sushisource commented May 8, 2023

View reviewed changes

cretz reviewed May 9, 2023

View reviewed changes

bergundy approved these changes May 9, 2023

View reviewed changes

cretz approved these changes May 9, 2023

View reviewed changes

Sushisource force-pushed the metrics-to-lang branch from 5710c0f to ce24c9d Compare May 16, 2023 17:27

Sushisource mentioned this pull request May 17, 2023

[Feature Request] Possibly add config options for exporting time metrics in seconds w/ unit suffix #554

Closed

mjameswh mentioned this pull request Jul 19, 2023

[Feature Request] Forward Core's metrics to TypeScript-land temporalio/sdk-typescript#1177

Open

cretz mentioned this pull request Jul 24, 2023

[Feature Request] Metric fowarding support and custom metric support temporalio/sdk-python#354

Closed

cretz reviewed Jul 24, 2023

View reviewed changes

core-api/src/telemetry.rs Outdated Show resolved Hide resolved

Sushisource force-pushed the metrics-to-lang branch 2 times, most recently from 980edcd to 748343d Compare July 26, 2023 23:14

cretz reviewed Jul 31, 2023

View reviewed changes

Sushisource force-pushed the metrics-to-lang branch from 748343d to e2773a0 Compare August 3, 2023 16:04

mjameswh approved these changes Aug 3, 2023

View reviewed changes

cretz mentioned this pull request Aug 4, 2023

Trying to start a workflow from asp.net and the client hangs temporalio/sdk-dotnet#44

Closed

cretz reviewed Aug 7, 2023

View reviewed changes

cretz reviewed Aug 11, 2023

View reviewed changes

Sushisource added 9 commits August 17, 2023 12:00

First pass

ec27485

Fix merge problem

d8a131a

Eliminate MetricsExporter enum in favor of CoreMeter implementations

3d19475

Implement metric buffer

3905847

Handle merge conflicts from upgrade

2e0a294

Make everything work properly with new otel APIs / changes from other PR

d982619

Remove dedicated metrics thread

732af39

Make new attribute creation more efficient with merging

bbab705

Add test to verify behavior of metrics context w/ lang buffer

b37ee6b

Sushisource added 4 commits August 17, 2023 12:00

Lint fix

cab7e51

Add descriptions and units to created instruments

868d29b

Ensure counter suffixes aren't added

1112950

Added units/descriptions to Core metrics

c4f88a3

Sushisource force-pushed the metrics-to-lang branch from 88c1053 to c4f88a3 Compare August 17, 2023 19:00

Sushisource enabled auto-merge (squash) August 17, 2023 19:02

Sushisource merged commit cac33da into temporalio:master Aug 17, 2023

Sushisource deleted the metrics-to-lang branch August 17, 2023 20:18

This was referenced Aug 18, 2023

[Feature Request] Advanced metrics support temporalio/sdk-dotnet#126

Closed

[Feature Request] Advanced metrics support temporalio/sdk-python#369

Closed

mjameswh mentioned this pull request Aug 29, 2023

[Feature Request] Add support for custom metrics temporalio/sdk-typescript#1229

Closed

mjameswh mentioned this pull request Nov 16, 2023

chore: Update SDK Core to 617612aa temporalio/sdk-typescript#1294

Merged

		// TODO: Pass in? Hard to make work with trait
		ctx: opentelemetry::Context::current(),

Lang metrics export #544

Lang metrics export #544

Uh oh!

Conversation

Sushisource commented May 8, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz May 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sushisource May 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz May 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sushisource May 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz May 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bergundy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sushisource commented Aug 3, 2023

Uh oh!

cretz commented Aug 3, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sushisource Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz May 9, 2023 •

edited

Loading

Sushisource May 9, 2023 •

edited

Loading

cretz May 9, 2023 •

edited

Loading

Sushisource May 9, 2023 •

edited

Loading

cretz May 9, 2023 •

edited

Loading

cretz Aug 8, 2023 •

edited

Loading

Sushisource Aug 8, 2023 •

edited

Loading

cretz Aug 9, 2023 •

edited

Loading

cretz Aug 11, 2023 •

edited

Loading

mjameswh Aug 11, 2023 •

edited

Loading

Sushisource Aug 11, 2023 •

edited

Loading

cretz Aug 11, 2023 •

edited

Loading