
Emit critical time and processing time meters #6953

Closed
wants to merge 1 commit from the otel-critical-time branch

Conversation

andreasohlund
Member

No description provided.

@andreasohlund andreasohlund added this to the 9.1.0 milestone Feb 16, 2024

This comment was marked as outdated.

@github-actions github-actions bot added the stale label Mar 18, 2024
@andreasohlund andreasohlund changed the title from "Emit critcal time and processing time via open telemetry" to "Emit critical time and processing time meters" Mar 20, 2024
@github-actions github-actions bot removed the stale label Mar 21, 2024
NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");

internal static readonly Histogram<double> CriticalTime =
NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.criticaltime", "ms", "The time in milliseconds between when the message was sent until processed by the endpoint.");

Any chance this will make it into the NSB v9 RTM release?

Member

@bbrandt, we started working on some of those metrics in #7048. The goal is to release them in 9.1. We don't have an ETA yet, but we're actively working on it.

@andreasohlund
Member Author

The scope for v9 is locked (will be released shortly) so this will likely go into 9.1.

In the meantime, the shim in Particular/docs.particular.net#6452 should do the trick.

We might make a few minor changes, like splitting the retry counter into separate ones for delayed and immediate retries, but all in all it should be roughly the same.

Hope this makes sense.

This comment was marked as outdated.

@github-actions github-actions bot added the stale label Apr 22, 2024
@bbrandt

bbrandt commented Apr 25, 2024

With the failures metric able to have many different possible "FailureType" outcomes, this is a bit like how HTTP requests can have many different result statuses. The main difference is that HTTP status codes are a well-defined list of possibilities, whereas FailureType can vary per application/domain.

It would be a breaking change, but if we consider failures to be a non-binary outcome, would it make more sense to organize the success/failure outcome into a single result metric, as described here?
https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-a-single-metric-for-non-binary-outcomes
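
For illustration, the single-metric shape from that article might look like this with System.Diagnostics.Metrics; all names here are hypothetical and not proposals for NServiceBus:

using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class ResultMetricsSketch
{
    static readonly Meter Meter = new("MyApp.Messaging");

    // One counter; the outcome travels as a tag instead of separate
    // successes/failures instruments.
    static readonly Counter<long> Results =
        Meter.CreateCounter<long>("myapp.messaging.results", description: "Messages processed, tagged by result.");

    public static void RecordSuccess() =>
        Results.Add(1, new KeyValuePair<string, object>("result", "success"));

    public static void RecordFailure(string failureType) =>
        Results.Add(1,
            new KeyValuePair<string, object>("result", "failure"),
            new KeyValuePair<string, object>("failure.type", failureType));
}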

@github-actions github-actions bot removed the stale label Apr 26, 2024
@andreasohlund
Member Author

With the failures metric able to have many different possible "FailureType" outcomes, this is a bit like how HTTP requests can have many different result statuses. The main difference is that HTTP status codes are a well-defined list of possibilities, whereas FailureType can vary per application/domain.

I'm not sure they are exactly equivalent, given that we don't have success codes but rather a single succeeded state.

In essence, you are talking about adding a "Result" metric with a status attribute that could be either failed or succeeded, right?

Can you elaborate on how that would simplify your use case?

@bbrandt

bbrandt commented Apr 26, 2024

Oh, yes. The case is the same as this one: determining what percentage of my requests are failures:
Particular/docs.particular.net#6572 (comment)

nservicebus_messaging_failures alone is an interesting metric, but a rate of 2 failures per second when I have 1000 fetches/second flowing through may be a lower-priority issue, whereas a flow rate of 3 fetches/second with 2 failures/second makes it much more important to know that 2/3rds of my requests are failing. When feeding this through alerting there are other factors, such as the business criticality of the message, but I'm ignoring that for now.

Given all that, you are right. This is quite different from HTTP status codes. NSB failures/successes/total is more like the binary outcome case in the blog article, where the recommendation is:

exposing errors and totals is usually the best choice

That is exactly what you've done already. Some may argue that the successes metric is unnecessary, but every metric exporter out there has a way to filter out metrics, so anyone who doesn't want it can exclude it.

Sorry for cluttering up your PR for "critical time and processing time meters". This was probably not the appropriate place for this conversation, but thanks for letting me "think aloud" here anyway.

@andreasohlund
Member Author

Thanks for taking the time to chime in, @bbrandt. It helps us vet our ideas as well, much appreciated ❤️

@bbrandt

bbrandt commented May 25, 2024

@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release.

References:
pyrra-dev/pyrra#667 (comment)
open-telemetry/opentelemetry-specification#2977 (comment)
https://stackoverflow.com/questions/75974699/should-request-duration-metrics-be-in-seconds-or-milliseconds-potential-discre
open-telemetry/opentelemetry-dotnet#4797
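
To make the suggestion concrete, here is a minimal sketch of what the seconds-based definition could look like, reusing the instrument names from this PR; only the unit and the description change, and the recorded value stays a double:

internal static readonly Histogram<double> ProcessingTime =
    NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "s", "The time in seconds between when the message was pulled from the queue until processed by the endpoint.");

// ...and the corresponding recording site switches to TotalSeconds:
Meters.ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalSeconds, tags);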

@bbrandt

bbrandt commented May 25, 2024

It may also be worthwhile to align with the OpenTelemetry Semantic Conventions for Messaging Metrics when it is stable:
https://opentelemetry.io/docs/specs/semconv/messaging/messaging-metrics/

new(MeterTags.MessageType, messageTypes ?? "")
});

Meters.ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalMilliseconds, tags);
Member

@bbrandt I think you're referring to this, right? From what I understand, they recommend adding a suffix to identify the unit of measurement. I'm not sure if this value can be a double/float; we likely still want to report duration in floating point when reporting in seconds. The current code also passes a double (the result of TotalMilliseconds).

Member

btw, for histograms, using seconds likely wouldn't make much sense for message processing. Wouldn't you want millisecond granularity at the base of a histogram?


btw, for histograms, using seconds likely wouldn't make much sense for message processing. Wouldn't you want millisecond granularity at the base of a histogram?

This is addressed in the comments on open-telemetry/opentelemetry-dotnet#4797.
If you create a histogram with ms as the unit, the default histogram boundaries are { 0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000 }, and if you create it with seconds as the unit, the default boundaries are those same values divided by 1000.
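
For anyone who needs sub-second resolution while still reporting in seconds, the OpenTelemetry .NET SDK also lets the consumer override the boundaries per instrument with a view; a minimal sketch, where the meter name and the boundary values are assumptions:

using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("NServiceBus.Core") // meter name assumed here
    .AddView("nservicebus.messaging.processingtime", new ExplicitBucketHistogramConfiguration
    {
        // Illustrative sub-second buckets, expressed in seconds.
        Boundaries = new double[] { 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 }
    })
    .AddConsoleExporter()
    .Build();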


@bbrandt I think you're referring to this, right? From what I understand, they recommend adding a suffix to identify the unit of measurement. I'm not sure if this value can be a double/float; we likely still want to report duration in floating point when reporting in seconds. The current code also passes a double (the result of TotalMilliseconds).

Yes. Value will be reported as a double.

My preference is to use Stopwatch (https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.stopwatch?view=net-8.0) instead of DateTime.UtcNow, but this is a minor difference.
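
For illustration, a Stopwatch-based measurement reported in seconds might look like the following sketch; it assumes the seconds-based ProcessingTime instrument discussed above and a tags variable from the surrounding behavior:

var stopwatch = Stopwatch.StartNew();
try
{
    await next(context).ConfigureAwait(false);
}
finally
{
    // Stopwatch is monotonic, so the measurement is unaffected by system clock
    // adjustments, and Elapsed.TotalSeconds keeps the recorded value a double.
    Meters.ProcessingTime.Record(stopwatch.Elapsed.TotalSeconds, tags);
}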

@@ -16,4 +16,10 @@ class Meters

internal static readonly Counter<long> TotalFailures =
NServiceBusMeter.CreateCounter<long>("nservicebus.messaging.failures", description: "Total number of messages processed unsuccessfully by the endpoint.");

internal static readonly Histogram<double> ProcessingTime =
NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");
Member

What is the relationship between NServiceBusMeter.CreateHistogram and OpenTelemetry?


See CreateHistogram:
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter.createhistogram?view=net-8.0

NServiceBusMeter is of type Meter:
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter?view=net-8.0

There is actually no direct reference to the OpenTelemetry SDK from Meter in .NET or from NServiceBus, even though there is an endpointConfiguration.EnableOpenTelemetry() method.

It was actually a great decision by the .NET team to keep System.Diagnostics.Metrics decoupled from OpenTelemetry, seeing how many unified observability/telemetry tools have come before it. There's no guarantee OpenTelemetry will be the last, but there seems to be a great deal more buy-in across the industry this time. That's all interesting, but not important.

What is important is that Meter surfaces information in a way that can be consumed by OTel as well as by any proprietary or future standard tools.

Sorry about any mistakes; I'm on my phone.
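
To make that concrete, here is a minimal sketch of a consumer that listens to these instruments with nothing but System.Diagnostics.Metrics; the meter name "NServiceBus.Core" is an assumption:

using System;
using System.Diagnostics.Metrics;

// A plain MeterListener subscribing to the endpoint's instruments without any
// reference to the OpenTelemetry SDK.
var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "NServiceBus.Core") // assumed meter name
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name}: {value}"));
listener.Start();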

@mauroservienti
Member

@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release.

References: pyrra-dev/pyrra#667 (comment), open-telemetry/opentelemetry-specification#2977 (comment), https://stackoverflow.com/questions/75974699/should-request-duration-metrics-be-in-seconds-or-milliseconds-potential-discre, open-telemetry/opentelemetry-dotnet#4797

@bbrandt, for our understanding, is this #7059 what you mean?

@bbrandt

bbrandt commented Jun 6, 2024

@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release.

References: pyrra-dev/pyrra#667 (comment), open-telemetry/opentelemetry-specification#2977 (comment), https://stackoverflow.com/questions/75974699/should-request-duration-metrics-be-in-seconds-or-milliseconds-potential-discre, open-telemetry/opentelemetry-dotnet#4797

@bbrandt, for our understanding, is this #7059 what you mean?

Yes. The main thing is the shift towards seconds as the standard unit used by OpenTelemetry and Prometheus for metrics.

Another more subjective thing to consider is whether we should be following the upcoming OpenTelemetry semantic convention for metric and tag/label names:
https://opentelemetry.io/docs/specs/semconv/messaging/messaging-metrics/

The histogram messaging.receive.duration would be for both successful and failed messages, with the existence of the error.type label as the discriminator. Histograms natively include a counter, so the current success and failed counters would be redundant.

Going with a standard means that if I have some handlers using NSB and other handlers using Kafka messaging, they could theoretically be monitored or aggregated together easily.

The argument against is that this is not yet a finalized standard and is subject to change.
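
For illustration only, an instrument aligned with the draft convention above could be shaped roughly like this; the names follow the spec as referenced here and are subject to change, and the tag values are assumptions:

internal static readonly Histogram<double> ReceiveDuration =
    NServiceBusMeter.CreateHistogram<double>("messaging.receive.duration", "s", "Duration of the receive operation, successful or failed.");

// A single histogram covers both outcomes; error.type is only added on failure.
ReceiveDuration.Record(stopwatch.Elapsed.TotalSeconds,
    new KeyValuePair<string, object>("messaging.system", "nservicebus"),
    new KeyValuePair<string, object>("error.type", ex.GetType().FullName));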

@mauroservienti
Member

We raised the following PRs to introduce handling, processing, and critical time:

I'm going ahead and closing this one.

@mauroservienti mauroservienti deleted the otel-critical-time branch June 20, 2024 17:11
@bbrandt

bbrandt commented Jun 20, 2024

@mauroservienti Another very useful metric I have implemented is a gauge for active handlers:

    public static readonly UpDownCounter<long> ActiveHandlers =
        NServiceBusMeter.CreateUpDownCounter<long>("nservicebus.messaging.active_handlers", description: "Number of handlers executing concurrently.");

For reference, this is the replacement behavior I have for this change. It also combines the success, failure, and processing metrics into a single processing-time metric, with FailureType indicating success or the various failure states:

// Usings for the author's own types (EmitNServiceBusMetrics, ServiceBusEndpointOptions,
// TryGetMessageType) are omitted here.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Options;
using NServiceBus.Pipeline;

public class ReceiveDiagnosticsV2Behavior : IBehavior<IIncomingPhysicalMessageContext, IIncomingPhysicalMessageContext>
{
    private readonly string _queueName;

    public ReceiveDiagnosticsV2Behavior(IOptions<ServiceBusEndpointOptions> options)
    {
        _queueName = options.Value.EndpointName;
    }

    public async Task Invoke(IIncomingPhysicalMessageContext context, Func<IIncomingPhysicalMessageContext, Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        context.MessageHeaders.TryGetMessageType(out var messageTypes);
        var queueNameTag = new KeyValuePair<string, object>(EmitNServiceBusMetrics.Tags.QueueName, _queueName ?? string.Empty);
        var messageTypeTag = new KeyValuePair<string, object>(EmitNServiceBusMetrics.Tags.MessageType, messageTypes ?? string.Empty);

        // Track concurrently executing handlers and total fetches per queue/message type.
        EmitNServiceBusMetrics.ActiveHandlers.Add(1, queueNameTag, messageTypeTag);
        EmitNServiceBusMetrics.TotalFetched.Add(1, queueNameTag, messageTypeTag);
        try
        {
            await next(context).ConfigureAwait(false);

            // Success path recorded into the same histogram; the empty FailureType
            // value used here is an assumption.
            EmitNServiceBusMetrics.ProcessingTime.Record(stopwatch.Elapsed.TotalSeconds,
                queueNameTag,
                messageTypeTag,
                new(EmitNServiceBusMetrics.Tags.FailureType, string.Empty));
        }
        catch (Exception ex) when (!ex.IsCausedBy(context.CancellationToken))
        {
            // Failure path: the exception type discriminates the outcome.
            var failureTimeSeconds = stopwatch.Elapsed.TotalSeconds;
            EmitNServiceBusMetrics.ProcessingTime.Record(failureTimeSeconds,
                queueNameTag,
                messageTypeTag,
                new(EmitNServiceBusMetrics.Tags.FailureType, ex.GetType()));
            throw;
        }
        finally
        {
            EmitNServiceBusMetrics.ActiveHandlers.Add(-1, queueNameTag, messageTypeTag);
        }
    }
}

internal static class ExceptionExtensions
{
#pragma warning disable PS0003 // A parameter of type CancellationToken on a non-private delegate or method should be optional
    public static bool IsCausedBy(this Exception ex, CancellationToken cancellationToken) => ex is OperationCanceledException && cancellationToken.IsCancellationRequested;
#pragma warning restore PS0003 // A parameter of type CancellationToken on a non-private delegate or method should be optional
}
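
If anyone wants to try the snippet above, a behavior like this is typically registered on the endpoint's pipeline; a minimal sketch, where the description string is arbitrary and endpointConfiguration is the usual EndpointConfiguration instance:

endpointConfiguration.Pipeline.Register(
    typeof(ReceiveDiagnosticsV2Behavior),
    "Emits active-handler, fetch, and processing-time metrics");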

@mauroservienti
Member

Thanks for the input, @bbrandt. We'll leave that out of 9.1. I raised #7083 to keep track of it.

@SzymonPobiega SzymonPobiega removed this from the 9.1.0 milestone Jul 18, 2024