More precise option to identify CPU spikes/burst using Open Telemetry #109056

Open
sourabh1007 opened this issue Oct 20, 2024 · 1 comment
Labels
area-System.Diagnostics.Metric question Answer questions and provide assistance, not an issue with source code or documentation. untriaged New issue has not been triaged by the area owner

Comments


sourabh1007 commented Oct 20, 2024

Background

We are currently working on instrumenting the Cosmos DB SDK to capture CPU usage and related performance metrics in a way that is compatible with OpenTelemetry. Customers can leverage the following libraries for system usage metrics:

OpenTelemetry.Instrumentation.Runtime 1.9.0
OpenTelemetry.Instrumentation.Process 0.5.0-beta.6
Built-in Metrics and Diagnostics in .NET

These libraries provide CPU and memory usage metrics, but the export interval is user-defined and is often set to one minute or more.

Given that customer experience is highly sensitive to performance fluctuations, we’ve observed that frequent, short CPU and memory spikes—often unrelated to core processes—are a major factor in high-latency Cosmos DB operations.

Questions

Is there a way to fine-tune the aggregation/export interval for these metrics to capture more granular data?
Are there specific metrics available to help identify short-lived CPU and memory spikes?
Are there any metrics that can assist in detecting thread starvation?

ref. Azure/azure-cosmos-dotnet-v3#4818

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 20, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 20, 2024
@teo-tsirpanis teo-tsirpanis added question Answer questions and provide assistance, not an issue with source code or documentation. area-System.Diagnostics.Metric and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 21, 2024
Member

noahfalk commented Oct 22, 2024

Answers to your specific questions are below, but in general I would say that metrics are not historically the tool used for this task. It's probably more typical for devs to use profiling tools (local-dev or production variants), or to enable higher-verbosity networking events such as HTTP client instrumentation, or potentially kernel events.

Writing custom logic that takes high-frequency measurements of CPU or allocation should certainly be possible, but it comes with the tradeoffs profilers typically deal with: measurement overhead, and how to aggregate or store the large amount of data that is produced.

Is there a way to fine-tune the aggregation/export interval for these metrics to capture more granular data?

If you are using OpenTelemetry to collect and transmit the metric data, then OTel's logic controls how frequently this occurs. You may get a more complete answer by asking in the OTel repo, but I believe the only way to have metrics with different reporting intervals is to create more than one pipeline. There was also a recent question here which discusses similar things.

(If you are not using OpenTelemetry, .NET's APIs can be polled as frequently as you want to call them, but then your own in-proc code is responsible for deciding what to do with the results.)

Are there specific metrics available to help identify short-lived CPU and memory spikes?

There is a resource monitoring feature (https://learn.microsoft.com/en-us/dotnet/core/diagnostics/diagnostic-resource-monitoring#example-resource-monitoring-usage) that I believe has configurable options to do sub-sampling. This is the only thing I am aware of that has a sampling rate independent of the rate at which the metric is reported. However, it still reports a cumulative total over all samples, so if your goal is to detect short-duration outliers this may not help.
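
To illustrate why a cumulative figure can hide a short burst, here is a small arithmetic sketch in Python. It is not tied to any .NET or OpenTelemetry API. The 60-sample window, the 5% baseline, and the single 100% sample are made-up numbers, and `summarize` is a hypothetical helper.

```python
from statistics import mean


def summarize(samples):
    """Return (average, peak) for a window of CPU utilization samples, in percent."""
    if not samples:
        raise ValueError("need at least one sample")
    return mean(samples), max(samples)


# Hypothetical window: 60 one-second samples, mostly idle, with one 1-second burst.
window = [5.0] * 59 + [100.0]
average, peak = summarize(window)

print(f"average over window: {average:.2f}%")
print(f"peak sample:         {peak:.2f}%")
```

With these assumed numbers the window average stays below 7% even though one sample hit 100%, so anything that reports only a total or an average per export interval would not reveal the spike. Recording a per-interval maximum (or a histogram of samples) is one way to keep that information, at the cost of more in-process work.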

Are there any metrics that can assist in detecting thread starvation?

If you are looking for threadpool starvation, this guide discusses some relevant metrics: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/debug-threadpool-starvation. If you meant that you want to detect runnable threads that aren't being scheduled quickly by the OS, I'm not aware of any off-the-shelf metric in .NET that does that.

