More precise option to identify CPU spikes/burst using Open Telemetry #109056

sourabh1007 · 2024-10-20T17:22:05Z

Background

We are currently working on instrumenting the Cosmos DB SDK to capture CPU usage and related performance metrics in a way that is compatible with OpenTelemetry. Customers can leverage the following libraries for system usage metrics:

OpenTelemetry.Instrumentation.Runtime 1.9.0
OpenTelemetry.Instrumentation.Process 0.5.0-beta.6
Built-in Metrics and Diagnostics in .NET

These libraries provide CPU and memory usage metrics, but the export interval is user-defined and is often set to one minute or more.

Given that customer experience is highly sensitive to performance fluctuations, we’ve observed that frequent, short CPU and memory spikes—often unrelated to core processes—are a major factor in high-latency Cosmos DB operations.

Questions

Is there a way to fine-tune the aggregation/export interval for these metrics to capture more granular data?
Are there specific metrics available to help identify short-lived CPU and memory spikes?
Are there any metrics that can assist in detecting thread starvation?

ref. Azure/azure-cosmos-dotnet-v3#4818

noahfalk · 2024-10-22T03:18:26Z

Answers to your specific questions are below, but in general I would say that metrics are not historically the tool used for this task. It's probably more typical for devs to use profiling tools (local-dev or production variants) or to enable higher-verbosity networking events, such as HTTP client instrumentation or potentially kernel events.

Writing custom logic that takes high-frequency measurements of CPU or allocation should certainly be possible, but it comes with the tradeoffs profilers typically deal with: measurement overhead, and how to aggregate or store the large amount of data that is produced.

Is there a way to fine-tune the aggregation/export interval for these metrics to capture more granular data?

If you are using OpenTelemetry to collect and transmit the metric data, then OTel's logic controls how frequently this occurs. You may get a more complete answer by asking in the OTel repo, but I believe the only way to have metrics with different reporting intervals is to create more than one pipeline. There was also a recent question here which discusses similar things.

(If you are not using OpenTelemetry, .NET's APIs can be polled as frequently as you want to call them, but then your own in-proc code is responsible for deciding what to do with the results.)
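As a rough illustration of that polling approach, here is a minimal, language-neutral sketch written in Python rather than .NET. `CpuSpikeSampler` and its methods are hypothetical names, not part of any SDK; in .NET the CPU reading would come from whichever API you choose to poll. The sketch samples process CPU time at a rate the caller picks and keeps only the peak utilization seen since the last report, which is one way to limit how much data the in-proc code has to store.

```python
import time


class CpuSpikeSampler:
    """Hypothetical helper: polls process CPU time and tracks peak utilization."""

    def __init__(self, cpu_clock=time.process_time, wall_clock=time.monotonic):
        # Clocks are injectable so the logic can be exercised deterministically.
        self._cpu_clock = cpu_clock
        self._wall_clock = wall_clock
        self._last_cpu = cpu_clock()
        self._last_wall = wall_clock()
        self.peak = 0.0

    def sample(self) -> float:
        """Return CPU utilization since the previous sample.

        The value is CPU seconds used per wall-clock second, so it can
        exceed 1.0 when several cores are busy at the same time.
        """
        cpu, wall = self._cpu_clock(), self._wall_clock()
        elapsed = wall - self._last_wall
        used = cpu - self._last_cpu
        self._last_cpu, self._last_wall = cpu, wall
        utilization = used / elapsed if elapsed > 0 else 0.0
        self.peak = max(self.peak, utilization)
        return utilization

    def take_peak(self) -> float:
        """Return the highest utilization seen since the last call, then reset."""
        peak, self.peak = self.peak, 0.0
        return peak
```

A caller might invoke `sample()` from a background timer every 100 ms or so and report `take_peak()` once per export interval. The faster the polling, the higher the measurement overhead, which is the tradeoff mentioned above.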

Are there specific metrics available to help identify short-lived CPU and memory spikes?

There is a resource monitoring feature (https://learn.microsoft.com/en-us/dotnet/core/diagnostics/diagnostic-resource-monitoring#example-resource-monitoring-usage) that I believe has configurable options for sub-sampling. This is the only thing I am aware of that has a sampling rate independent of the rate at which the metric is reported. However, it still reports a cumulative total over all samples, so if your goal is to detect short-duration outliers this may not help.
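To make the cumulative-total concern concrete, here is a small worked example in Python. The numbers are made up for illustration and are not measurements: a one-minute window sampled once per second, with a two-second burst at full CPU on top of an otherwise quiet process.

```python
# Assumed per-second CPU utilization samples for one 60-second window:
# 58 quiet seconds at 5% and a 2-second burst at 100%.
samples = [0.05] * 58 + [1.0] * 2

# What a cumulative or averaged report over the whole window would show.
average = sum(samples) / len(samples)

# What a per-sample maximum would show.
peak = max(samples)

print(f"window average: {average:.1%}, window peak: {peak:.0%}")
```

The window average comes out at roughly 8%, so the burst is invisible in an aggregated value, while the per-sample peak still shows 100%. Sub-sampling only helps detect spikes if the aggregation keeps something like a maximum or a histogram rather than only a total.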

Are there any metrics that can assist in detecting thread starvation?

If you are looking for threadpool starvation, this guide discusses some relevant metrics: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/debug-threadpool-starvation. If you meant that you want to detect runnable threads that aren't being scheduled quickly by the OS, I'm not aware of any off-the-shelf metric in .NET that does that.
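For the second case, one rough approximation is to measure how late a thread wakes up from a fixed sleep. This technique is not taken from the linked guide, and the sketch below is written in Python for neutrality. `measure_wakeup_lateness` is a hypothetical name, and the lateness it reports mixes OS scheduling delay with timer resolution, so treat the result as a hint rather than a precise metric.

```python
import time


def measure_wakeup_lateness(interval_s, iterations,
                            sleep=time.sleep, clock=time.monotonic):
    """Sleep `interval_s` repeatedly and record how late each wakeup was.

    Returns a list of lateness values in seconds (never negative).
    `sleep` and `clock` are injectable so the logic can be tested
    without real delays.
    """
    lateness = []
    for _ in range(iterations):
        start = clock()
        sleep(interval_s)
        overshoot = (clock() - start) - interval_s
        lateness.append(max(0.0, overshoot))
    return lateness


if __name__ == "__main__":
    results = measure_wakeup_lateness(0.01, 20)
    print(f"worst wakeup lateness: {max(results) * 1000:.2f} ms")
```

A consistently large worst-case value while the process is under load suggests that threads are waiting to be scheduled, but on its own it cannot distinguish CPU contention from coarse timer granularity.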

dotnet-issue-labeler bot added the needs-area-label label (An area label is needed to ensure this gets routed to the appropriate area owners) on Oct 20, 2024

dotnet-policy-service bot added the untriaged label (New issue has not been triaged by the area owner) on Oct 20, 2024

teo-tsirpanis added the question label (Answer questions and provide assistance, not an issue with source code or documentation.) and the area-System.Diagnostics.Metric label, and removed the needs-area-label label, on Oct 21, 2024
