
Emit critical time and processing time meters #6953

Closed
wants to merge 1 commit from the otel-critical-time branch

Conversation

andreasohlund
Member

No description provided.

@andreasohlund andreasohlund added this to the 9.1.0 milestone Feb 16, 2024

This comment was marked as outdated.

@github-actions github-actions bot added the stale label Mar 18, 2024
@andreasohlund andreasohlund changed the title from "Emit critcal time and processing time via open telemetry" to "Emit critical time and processing time meters" Mar 20, 2024
@github-actions github-actions bot removed the stale label Mar 21, 2024
NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");

internal static readonly Histogram<double> CriticalTime =
NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.criticaltime", "ms", "The time in milliseconds between when the message was sent until processed by the endpoint.");

Any chance this will make it into the NSB v9 RTM release?

Member

@bbrandt, we started working on some of those metrics in #7048. The goal is to release them in 9.1. We don't have an ETA yet, but we're actively working on it.

@andreasohlund
Member Author

The scope for v9 is locked (will be released shortly) so this will likely go into 9.1.

In the meantime, the shim in Particular/docs.particular.net#6452 should do the trick.

We might make a few minor changes, like splitting the retry counter into separate ones for delayed and immediate retries, but all in all it should be roughly the same.

Hope this makes sense.

This comment was marked as outdated.

@github-actions github-actions bot added the stale label Apr 22, 2024
@bbrandt

bbrandt commented Apr 25, 2024

With the failures metric able to have many different possible "FailureType" outcomes, this is a bit like how HTTP requests can have many different result statuses. The main difference is that HTTP status codes are a well-defined list of possibilities, whereas FailureType can vary per application/domain.

It would be a breaking change, but if we consider failures to be a non-binary outcome, would it make more sense to organize the success/failure outcome into a single result metric, as described here?
https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-a-single-metric-for-non-binary-outcomes
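
For illustration, the single-metric shape from that article might look like this with System.Diagnostics.Metrics; all names here are hypothetical and not proposals for NServiceBus:

using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class ResultMetricsSketch
{
    static readonly Meter Meter = new("MyApp.Messaging");

    // One counter; the outcome travels as a tag instead of separate
    // successes/failures instruments.
    static readonly Counter<long> Results =
        Meter.CreateCounter<long>("myapp.messaging.results", description: "Messages processed, tagged by result.");

    public static void RecordSuccess() =>
        Results.Add(1, new KeyValuePair<string, object>("result", "success"));

    public static void RecordFailure(string failureType) =>
        Results.Add(1,
            new KeyValuePair<string, object>("result", "failure"),
            new KeyValuePair<string, object>("failure.type", failureType));
}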

@github-actions github-actions bot removed the stale label Apr 26, 2024
@andreasohlund
Member Author

With the failures metric able to have many different possible "FailureType" outcomes, this is a bit like how HTTP requests can have many different result statuses. The main difference is that HTTP status codes are a well-defined list of possibilities, whereas FailureType can vary per application/domain.

I'm not sure they are exactly equivalent, given that we don't have success codes but rather a single succeeded state.

In essence, you are talking about adding a "Result" metric with a status attribute that could be either failed or succeeded, right?

Can you elaborate on how that would simplify your use case?

@bbrandt

bbrandt commented Apr 26, 2024

Oh, yes. The case is the same as this one: determining what percentage of my requests are failures:
Particular/docs.particular.net#6572 (comment)

nservicebus_messaging_failures alone is an interesting metric, but a rate of 2 failures per second when I have 1000 fetches/second flowing through may be a lower-priority issue, whereas a flow rate of 3 fetches/second with 2 failures/second makes it much more important to know that 2/3rds of my requests are failing. When feeding this through alerting there are other factors, such as the business criticality of the message, but I'm ignoring that for now.

Given all that, you are right. This is quite different from HTTP status codes. NSB failures/successes/total is more like the binary outcome case in the blog article, where the recommendation is:

exposing errors and totals is usually the best choice

That is exactly what you've done already. Some may argue that the successes metric is unnecessary, but every metric exporter out there has a way to filter out metrics, so anyone who doesn't want it can exclude it.

Sorry for cluttering up your PR for "critical time and processing time meters". This was probably not the appropriate place for this conversation, but thanks for letting me "think aloud" here anyway.

@andreasohlund
Member Author

Thanks for taking the time to chime in, @bbrandt. It helps us vet our ideas as well, much appreciated ❤️

@bbrandt

bbrandt commented May 25, 2024

@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release.

References:
pyrra-dev/pyrra#667 (comment)
open-telemetry/opentelemetry-specification#2977 (comment)
https://stackoverflow.com/questions/75974699/should-request-duration-metrics-be-in-seconds-or-milliseconds-potential-discre
open-telemetry/opentelemetry-dotnet#4797
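
To make the suggestion concrete, here is a minimal sketch of what the seconds-based definition could look like, reusing the instrument names from this PR; only the unit and the description change, and the recorded value stays a double:

internal static readonly Histogram<double> ProcessingTime =
    NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "s", "The time in seconds between when the message was pulled from the queue until processed by the endpoint.");

// ...and the corresponding recording site switches to TotalSeconds:
Meters.ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalSeconds, tags);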

@bbrandt

bbrandt commented May 25, 2024

It may also be worthwhile to align with the OpenTelemetry Semantic Conventions for Messaging Metrics when it is stable:
https://opentelemetry.io/docs/specs/semconv/messaging/messaging-metrics/

new(MeterTags.MessageType, messageTypes ?? "")
});

Meters.ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalMilliseconds, tags);
Member

@bbrandt I think you're referring to this, right? From what I understand, they recommend adding a suffix to identify the unit of measurement. I'm not sure if this value can be a double/float; we likely still want to report duration in floating point when reporting in seconds. The current code also passes a double (the result of TotalMilliseconds).

Member

btw, for histograms, using seconds likely wouldn't make much sense for message processing. Wouldn't you want millisecond granularity at the base of a histogram?


btw, for histograms, using seconds likely wouldn't make much sense for message processing. Wouldn't you want millisecond granularity at the base of a histogram?

This is addressed in the comments on open-telemetry/opentelemetry-dotnet#4797.
If you create a histogram with ms as the unit, the default histogram boundaries are { 0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000 }, and if you create it with seconds as the unit, the default boundaries are those same values divided by 1000.
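
For anyone who needs sub-second resolution while still reporting in seconds, the OpenTelemetry .NET SDK also lets the consumer override the boundaries per instrument with a view; a minimal sketch, where the meter name and the boundary values are assumptions:

using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("NServiceBus.Core") // meter name assumed here
    .AddView("nservicebus.messaging.processingtime", new ExplicitBucketHistogramConfiguration
    {
        // Illustrative sub-second buckets, expressed in seconds.
        Boundaries = new double[] { 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 }
    })
    .AddConsoleExporter()
    .Build();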


@bbrandt I think you're referring to this, right? From what I understand, they recommend adding a suffix to identify the unit of measurement. I'm not sure if this value can be a double/float; we likely still want to report duration in floating point when reporting in seconds. The current code also passes a double (the result of TotalMilliseconds).

Yes. Value will be reported as a double.

My preference is to use Stopwatch (https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.stopwatch?view=net-8.0) instead of DateTime.UtcNow, but this is a minor difference.
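
For illustration, a Stopwatch-based measurement reported in seconds might look like the following sketch; it assumes the seconds-based ProcessingTime instrument discussed above and a tags variable from the surrounding behavior:

var stopwatch = Stopwatch.StartNew();
try
{
    await next(context).ConfigureAwait(false);
}
finally
{
    // Stopwatch is monotonic, so the measurement is unaffected by system clock
    // adjustments, and Elapsed.TotalSeconds keeps the recorded value a double.
    Meters.ProcessingTime.Record(stopwatch.Elapsed.TotalSeconds, tags);
}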

@@ -16,4 +16,10 @@ class Meters

internal static readonly Counter<long> TotalFailures =
NServiceBusMeter.CreateCounter<long>("nservicebus.messaging.failures", description: "Total number of messages processed unsuccessfully by the endpoint.");

internal static readonly Histogram<double> ProcessingTime =
NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");
Member

What is the relationship between NServiceBusMeter.CreateHistogram and OpenTelemetry?


See CreateHistogram:
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter.createhistogram?view=net-8.0

NServiceBusMeter is of type Meter:
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter?view=net-8.0

There is actually no direct reference to the OpenTelemetry SDK from Meter in .NET or from NServiceBus, even though there is an endpointConfiguration.EnableOpenTelemetry() method.

It was actually a great decision by the .NET team to keep System.Diagnostics.Metrics decoupled from OpenTelemetry, seeing how many unified observability/telemetry tools have come before it. There's no guarantee OpenTelemetry will be the last, but there seems to be a great deal more buy-in across the industry this time. That's all interesting, but not important.

What is important is that Meter surfaces information in a way that can be consumed by OTel as well as by any proprietary or future standard tools.

Sorry about any mistakes; I'm on my phone.
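
To make that concrete, here is a minimal sketch of a consumer that listens to these instruments with nothing but System.Diagnostics.Metrics; the meter name "NServiceBus.Core" is an assumption:

using System;
using System.Diagnostics.Metrics;

// A plain MeterListener subscribing to the endpoint's instruments without any
// reference to the OpenTelemetry SDK.
var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "NServiceBus.Core") // assumed meter name
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name}: {value}"));
listener.Start();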

@mauroservienti
Member

@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release.

References: pyrra-dev/pyrra#667 (comment), open-telemetry/opentelemetry-specification#2977 (comment), https://stackoverflow.com/questions/75974699/should-request-duration-metrics-be-in-seconds-or-milliseconds-potential-discre, open-telemetry/opentelemetry-dotnet#4797

@bbrandt, for our understanding, is this #7059 what you mean?

@bbrandt

bbrandt commented Jun 6, 2024

@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release.

References: pyrra-dev/pyrra#667 (comment), open-telemetry/opentelemetry-specification#2977 (comment), https://stackoverflow.com/questions/75974699/should-request-duration-metrics-be-in-seconds-or-milliseconds-potential-discre, open-telemetry/opentelemetry-dotnet#4797

@bbrandt, for our understanding, is this #7059 what you mean?

Yes. The main thing is the shift towards seconds as the standard unit used by OpenTelemetry and Prometheus for metrics.

Another more subjective thing to consider is whether we should be following the upcoming OpenTelemetry semantic convention for metric and tag/label names:
https://opentelemetry.io/docs/specs/semconv/messaging/messaging-metrics/

The histogram messaging.receive.duration would be for both successful and failed messages, with the existence of the error.type label as the discriminator. Histograms natively include a counter, so the current success and failed counters would be redundant.

Going with a standard means that if I have some handlers using NSB and other handlers using Kafka messaging, they could theoretically be monitored or aggregated together easily.

The argument against is that this is not yet a finalized standard and is subject to change.
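
For illustration only, an instrument aligned with the draft convention above could be shaped roughly like this; the names follow the spec as referenced here and are subject to change, and the tag values are assumptions:

internal static readonly Histogram<double> ReceiveDuration =
    NServiceBusMeter.CreateHistogram<double>("messaging.receive.duration", "s", "Duration of the receive operation, successful or failed.");

// A single histogram covers both outcomes; error.type is only added on failure.
ReceiveDuration.Record(stopwatch.Elapsed.TotalSeconds,
    new KeyValuePair<string, object>("messaging.system", "nservicebus"),
    new KeyValuePair<string, object>("error.type", ex.GetType().FullName));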

@mauroservienti
Member

We raised the following PRs to introduce handling, processing, and critical time:

I'm going ahead and closing this one.

@mauroservienti mauroservienti deleted the otel-critical-time branch June 20, 2024 17:11
@bbrandt

bbrandt commented Jun 20, 2024

@mauroservienti Another very useful metric I have implemented is a gauge for active handlers:

    public static readonly UpDownCounter<long> ActiveHandlers =
        NServiceBusMeter.CreateUpDownCounter<long>("nservicebus.messaging.active_handlers", description: "Number of handlers executing concurrently.");

For reference, this is the replacement behavior I have for this change. It also combines the success, failure, and processing metrics into a single processing-time metric, with FailureType indicating success or the various failure states:

// Usings for the author's own types (EmitNServiceBusMetrics, ServiceBusEndpointOptions,
// TryGetMessageType) are omitted here.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Options;
using NServiceBus.Pipeline;

public class ReceiveDiagnosticsV2Behavior : IBehavior<IIncomingPhysicalMessageContext, IIncomingPhysicalMessageContext>
{
    private readonly string _queueName;

    public ReceiveDiagnosticsV2Behavior(IOptions<ServiceBusEndpointOptions> options)
    {
        _queueName = options.Value.EndpointName;
    }

    public async Task Invoke(IIncomingPhysicalMessageContext context, Func<IIncomingPhysicalMessageContext, Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        context.MessageHeaders.TryGetMessageType(out var messageTypes);
        var queueNameTag = new KeyValuePair<string, object>(EmitNServiceBusMetrics.Tags.QueueName, _queueName ?? string.Empty);
        var messageTypeTag = new KeyValuePair<string, object>(EmitNServiceBusMetrics.Tags.MessageType, messageTypes ?? string.Empty);

        // Track concurrently executing handlers and total fetches per queue/message type.
        EmitNServiceBusMetrics.ActiveHandlers.Add(1, queueNameTag, messageTypeTag);
        EmitNServiceBusMetrics.TotalFetched.Add(1, queueNameTag, messageTypeTag);
        try
        {
            await next(context).ConfigureAwait(false);

            // Success path recorded into the same histogram; the empty FailureType
            // value used here is an assumption.
            EmitNServiceBusMetrics.ProcessingTime.Record(stopwatch.Elapsed.TotalSeconds,
                queueNameTag,
                messageTypeTag,
                new(EmitNServiceBusMetrics.Tags.FailureType, string.Empty));
        }
        catch (Exception ex) when (!ex.IsCausedBy(context.CancellationToken))
        {
            // Failure path: the exception type discriminates the outcome.
            var failureTimeSeconds = stopwatch.Elapsed.TotalSeconds;
            EmitNServiceBusMetrics.ProcessingTime.Record(failureTimeSeconds,
                queueNameTag,
                messageTypeTag,
                new(EmitNServiceBusMetrics.Tags.FailureType, ex.GetType()));
            throw;
        }
        finally
        {
            EmitNServiceBusMetrics.ActiveHandlers.Add(-1, queueNameTag, messageTypeTag);
        }
    }
}

internal static class ExceptionExtensions
{
#pragma warning disable PS0003 // A parameter of type CancellationToken on a non-private delegate or method should be optional
    public static bool IsCausedBy(this Exception ex, CancellationToken cancellationToken) => ex is OperationCanceledException && cancellationToken.IsCancellationRequested;
#pragma warning restore PS0003 // A parameter of type CancellationToken on a non-private delegate or method should be optional
}
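
If anyone wants to try the snippet above, a behavior like this is typically registered on the endpoint's pipeline; a minimal sketch, where the description string is arbitrary and endpointConfiguration is the usual EndpointConfiguration instance:

endpointConfiguration.Pipeline.Register(
    typeof(ReceiveDiagnosticsV2Behavior),
    "Emits active-handler, fetch, and processing-time metrics");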

@mauroservienti
Member

Thanks for the input, @bbrandt. We'll leave that out of 9.1. I raised #7083 to keep track of it.

@SzymonPobiega SzymonPobiega removed this from the 9.1.0 milestone Jul 18, 2024