Emit critical time and processing time meters #6953
Conversation
internal static readonly Histogram<double> ProcessingTime =
    NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");

internal static readonly Histogram<double> CriticalTime =
    NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.criticaltime", "ms", "The time in milliseconds between when the message was sent until processed by the endpoint.");
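Critical time, as described above, spans from when the message was sent to when processing completes. A rough sketch of how it could be computed from the sent-time header follows; the header name is NServiceBus's standard `NServiceBus.TimeSent`, but the wire date format string and the surrounding `context`, `Meters`, and `tags` variables are assumptions, not this PR's actual wiring.

```csharp
// Sketch only: derive critical time from the NServiceBus.TimeSent header at the end of processing.
// Requires System.Globalization; the exact wire format string below is an assumption.
if (context.MessageHeaders.TryGetValue("NServiceBus.TimeSent", out var timeSentHeader) &&
    DateTimeOffset.TryParseExact(timeSentHeader, "yyyy-MM-dd HH:mm:ss:ffffff Z",
        CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal, out var timeSent))
{
    Meters.CriticalTime.Record((DateTimeOffset.UtcNow - timeSent).TotalMilliseconds, tags);
}
```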
Any chance this will make it into the NSB v9 release that is RTM?
The scope for v9 is locked (it will be released shortly), so this will likely go into 9.1. In the meantime, the shim in Particular/docs.particular.net#6452 should do the trick. We might make a few minor changes, like splitting the retry counter into separate ones for delayed and immediate retries, but all in all it should be roughly the same. Hope this makes sense.
With the failures metric able to have many different possible "FailureType" outcomes, this is a bit like how HTTP requests can have many different result statuses. The main difference is that HTTP status is a well-defined list of possibilities, while FailureType can vary per application/domain. It would be a breaking change, but if we could consider failures to be a non-binary outcome, would it make more sense to organize the success/failure outcome into a single result metric as described here?
Not sure they are exactly equivalent, given that we don't have success codes but rather a single succeeded state. In essence, you are talking about adding a "Result" metric with a status attribute that could be either failed or succeeded, right? Can you elaborate on how that would simplify your use case?
Oh, yes. The case is the same as this one: determining what percentage of my requests are failures:
Given all that, you are right. This is quite different from HTTP status codes. NSB failures/successes/total is more like the binary outcome case in the blog article, where the recommendation is:
That is exactly what you've done already. Some may argue … Sorry for cluttering up your PR for "critical time and processing time meters". This was probably not the appropriate place for this conversation, but thanks for letting me "think aloud" here anyway.
Thanks for taking the time to chime in @bbrandt, it helps us vet our ideas as well. Much appreciated ❤️
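For concreteness, the single result metric discussed above could look roughly like the following. This is a sketch only, not code from this PR; the meter name, counter name, and tag keys are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class ResultMetricSketch
{
    // Hypothetical meter and instrument names, chosen only for illustration.
    static readonly Meter Meter = new("NServiceBus.Core.Sketch");

    static readonly Counter<long> MessagingResults =
        Meter.CreateCounter<long>("nservicebus.messaging.results",
            description: "Messages processed, tagged with the processing result.");

    public static void RecordSuccess() =>
        MessagingResults.Add(1, new KeyValuePair<string, object?>("nservicebus.result", "succeeded"));

    public static void RecordFailure(Exception ex) =>
        MessagingResults.Add(1,
            new KeyValuePair<string, object?>("nservicebus.result", "failed"),
            new KeyValuePair<string, object?>("nservicebus.failure_type", ex.GetType().FullName));
}
```

The failure rate then falls out of a single instrument (failed / total) instead of dividing two separate counters.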
@andreasohlund To avoid a breaking change later, I recommend switching these units from milliseconds to seconds before release. References: |
It may also be worthwhile to align with the OpenTelemetry Semantic Conventions for Messaging Metrics once they are stable.
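For illustration, the suggested unit change would amount to something like the sketch below. It reuses the `NServiceBusMeter`, `Meters`, `e`, and `tags` names from the PR's own code; only the unit string and the recorded value change, and the value stays a double.

```csharp
// Declare the histogram with "s" as the unit instead of "ms".
internal static readonly Histogram<double> ProcessingTime =
    NServiceBusMeter.CreateHistogram<double>(
        "nservicebus.messaging.processingtime",
        unit: "s",
        description: "The time in seconds between when the message was pulled from the queue until processed by the endpoint.");

// ...and record seconds (still a floating-point value) at the call site:
Meters.ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalSeconds, tags);
```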
    new(MeterTags.MessageType, messageTypes ?? "")
});

Meters.ProcessingTime.Record((e.CompletedAt - e.StartedAt).TotalMilliseconds, tags);
@bbrandt I think you're referring to this, right? From what I understand, they recommend adding a suffix to identify the unit of measurement. I'm not sure if this value can be a double/float; we likely still want to report duration as a floating-point value when reporting in seconds. The current code also passes a double (the result of TotalMilliseconds).
btw, for histograms, using seconds likely wouldn't make much sense for message processing. Wouldn't you want millisecond granularity at the base of a histogram?
> btw, for histograms, using seconds likely wouldn't make much sense for message processing. Wouldn't you want millisecond granularity at the base of a histogram?
This is addressed in comments for open-telemetry/opentelemetry-dotnet#4797.
If you create a histogram with ms as the unit, the default histogram boundaries are { 0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000 }, and if you create it with seconds as the unit, the defaults are those same values divided by 1000.
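If millisecond granularity is the concern, the consuming application can also override the default buckets through an OpenTelemetry SDK view rather than relying on the unit. A minimal sketch follows; the meter name "NServiceBus.Core" and the bucket values are assumptions, not something this PR defines.

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Override the default histogram buckets for the processing time instrument via a view,
// so a second-based unit can still keep sub-second resolution.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("NServiceBus.Core") // meter name is an assumption
    .AddView("nservicebus.messaging.processingtime",
        new ExplicitBucketHistogramConfiguration
        {
            // Boundaries expressed in seconds, down to 1 ms.
            Boundaries = new double[] { 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 }
        })
    .Build(); // an exporter would be added here in a real setup
```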
> @bbrandt I think you're referring to this, right? From what I understand, they recommend adding a suffix to identify the unit of measurement. I'm not sure if this value can be a double/float; we likely still want to report duration as a floating-point value when reporting in seconds. The current code also passes a double (the result of TotalMilliseconds).
Yes. The value will be reported as a double.
My preference is to use https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.stopwatch?view=net-8.0 instead of DateTime.UtcNow, but this is a minor difference.
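A sketch of the Stopwatch-based variant being suggested; the surrounding `next`, `context`, `Meters`, and `tags` names are assumed from the pipeline code quoted above, and this is not the PR's actual implementation.

```csharp
// Stopwatch uses a monotonic clock, so the measurement is not skewed by system clock
// adjustments the way a DateTime.UtcNow subtraction can be. Requires System.Diagnostics.
var stopwatch = Stopwatch.StartNew();
try
{
    await next(context).ConfigureAwait(false);
}
finally
{
    Meters.ProcessingTime.Record(stopwatch.Elapsed.TotalMilliseconds, tags);
}
```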
@@ -16,4 +16,10 @@ class Meters

internal static readonly Counter<long> TotalFailures =
    NServiceBusMeter.CreateCounter<long>("nservicebus.messaging.failures", description: "Total number of messages processed unsuccessfully by the endpoint.");

internal static readonly Histogram<double> ProcessingTime =
    NServiceBusMeter.CreateHistogram<double>("nservicebus.messaging.processingtime", "ms", "The time in milliseconds between when the message was pulled from the queue until processed by the endpoint.");
What is the relationship between NServiceBusMeter.CreateHistogram and OpenTelemetry?
See CreateHistogram:
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter.createhistogram?view=net-8.0
NServiceBusMeter is of type Meter:
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter?view=net-8.0
There is actually no direct reference to the OpenTelemetry SDK from Meter in .NET or from NServiceBus, even though there is an endpointConfiguration.EnableOpenTelemetry() method.
It was actually a great decision by the .NET team to keep System.Diagnostics.Metrics decoupled from OpenTelemetry, seeing how many unified observability/telemetry tools have come before it. There's no guarantee OpenTelemetry will be the last, but there seems to be a great deal more buy-in across the industry this time. That's all in the realm of interesting, but not important.
What is important is that Meter surfaces information in a way that can be consumed by OTel as well as by any proprietary or future standard tools.
Sorry about any mistakes; I'm writing this on my phone.
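To make the decoupling concrete: any consumer, OpenTelemetry or otherwise, can subscribe to these instruments through System.Diagnostics.Metrics.MeterListener. A minimal sketch follows; the meter name "NServiceBus.Core" is an assumption.

```csharp
using System;
using System.Diagnostics.Metrics;

// A plain MeterListener consuming the endpoint's measurements without any OpenTelemetry dependency.
var listener = new MeterListener
{
    InstrumentPublished = (instrument, l) =>
    {
        if (instrument.Meter.Name == "NServiceBus.Core") // meter name is an assumption
        {
            l.EnableMeasurementEvents(instrument);
        }
    }
};

listener.SetMeasurementEventCallback<double>((instrument, measurement, tags, state) =>
    Console.WriteLine($"{instrument.Name} = {measurement}"));

listener.Start();
```

An OpenTelemetry MeterProvider does the same thing under the hood when you call AddMeter with the meter name.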
@bbrandt, for our understanding, is this #7059 what you mean?
Yes. The main thing is the shift towards seconds as the standard unit used by OpenTelemetry and Prometheus for metrics. Another, more subjective thing to consider is whether we should follow the upcoming OpenTelemetry semantic conventions for metric and tag/label names: The histogram … Going with a standard means that if I have handlers using NSB and other handlers using Kafka messaging, they could theoretically be monitored or aggregated together easily. The argument against is that this is not a finalized standard and is subject to change.
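As a sketch of what convention-aligned naming might look like: the metric name and attribute keys below come from the draft messaging semantic conventions and may change; they are not part of this PR, and `elapsed` and `queueName` are placeholders.

```csharp
// Convention-style name and unit ("s") instead of nservicebus.messaging.processingtime / "ms".
internal static readonly Histogram<double> ProcessDuration =
    NServiceBusMeter.CreateHistogram<double>(
        "messaging.process.duration",
        unit: "s",
        description: "Duration of message processing, in seconds.");

// Recorded with convention attributes rather than NServiceBus-specific tag names:
ProcessDuration.Record(elapsed.TotalSeconds,
    new KeyValuePair<string, object?>("messaging.system", "nservicebus"),
    new KeyValuePair<string, object?>("messaging.destination.name", queueName));
```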
We raised the following PRs to introduce handling, processing, and critical time:
I'm going ahead and closing this one.
@mauroservienti Another very useful metric I have implemented is a gauge for active handlers:

public static readonly UpDownCounter<long> ActiveHandlers =
    NServiceBusMeter.CreateUpDownCounter<long>("nservicebus.messaging.active_handlers", description: "Number of handlers executing concurrently.");

For reference, this is the replacement behavior I have for this change, as well as for combining success, failure, and processing into a single processing time metric:

public class ReceiveDiagnosticsV2Behavior : IBehavior<IIncomingPhysicalMessageContext, IIncomingPhysicalMessageContext>
{
    private readonly string _queueName;

    public ReceiveDiagnosticsV2Behavior(IOptions<ServiceBusEndpointOptions> options)
    {
        _queueName = options.Value.EndpointName;
    }

    public async Task Invoke(IIncomingPhysicalMessageContext context, Func<IIncomingPhysicalMessageContext, Task> next)
    {
        var stopwatch = Stopwatch.StartNew();

        context.MessageHeaders.TryGetMessageType(out var messageTypes);

        var queueNameTag = new KeyValuePair<string, object>(EmitNServiceBusMetrics.Tags.QueueName, _queueName ?? string.Empty);
        var messageTypeTag = new KeyValuePair<string, object>(EmitNServiceBusMetrics.Tags.MessageType, messageTypes ?? string.Empty);

        EmitNServiceBusMetrics.ActiveHandlers.Add(1, queueNameTag, messageTypeTag);
        EmitNServiceBusMetrics.TotalFetched.Add(1, queueNameTag, messageTypeTag);

        try
        {
            await next(context).ConfigureAwait(false);
        }
        catch (Exception ex) when (!ex.IsCausedBy(context.CancellationToken))
        {
            var failureTimeSeconds = stopwatch.Elapsed.TotalSeconds;
            EmitNServiceBusMetrics.ProcessingTime.Record(failureTimeSeconds,
                queueNameTag,
                messageTypeTag,
                new(EmitNServiceBusMetrics.Tags.FailureType, ex.GetType()));
            throw;
        }
        finally
        {
            EmitNServiceBusMetrics.ActiveHandlers.Add(-1, queueNameTag, messageTypeTag);
        }
    }
}

internal static class ExceptionExtensions
{
#pragma warning disable PS0003 // A parameter of type CancellationToken on a non-private delegate or method should be optional
    public static bool IsCausedBy(this Exception ex, CancellationToken cancellationToken) => ex is OperationCanceledException && cancellationToken.IsCancellationRequested;
#pragma warning restore PS0003 // A parameter of type CancellationToken on a non-private delegate or method should be optional
}
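The EmitNServiceBusMetrics class referenced above is not included in the comment; below is a hypothetical sketch of what it could look like. The class, meter name, instrument names, and tag keys are all assumptions made for illustration, not an NServiceBus contract.

```csharp
using System.Diagnostics.Metrics;

internal static class EmitNServiceBusMetrics
{
    // Hypothetical meter name; replace with whatever the hosting application registers.
    static readonly Meter Meter = new("MyCompany.NServiceBus.Metrics");

    public static readonly UpDownCounter<long> ActiveHandlers =
        Meter.CreateUpDownCounter<long>("nservicebus.messaging.active_handlers",
            description: "Number of handlers executing concurrently.");

    public static readonly Counter<long> TotalFetched =
        Meter.CreateCounter<long>("nservicebus.messaging.fetched",
            description: "Total number of messages fetched from the queue.");

    public static readonly Histogram<double> ProcessingTime =
        Meter.CreateHistogram<double>("nservicebus.messaging.processingtime", unit: "s",
            description: "Time spent processing a message, in seconds.");

    // Tag keys are illustrative only.
    public static class Tags
    {
        public const string QueueName = "nservicebus.queue";
        public const string MessageType = "nservicebus.message_type";
        public const string FailureType = "nservicebus.failure_type";
    }
}
```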