More precise option to identify CPU spikes/bursts using OpenTelemetry #109056
Labels: area-System.Diagnostics.Metric, question, untriaged
Background
We are currently working on instrumenting the Cosmos DB SDK to capture CPU usage and related performance metrics in a way that is compatible with OpenTelemetry. Customers can leverage the following libraries for system usage metrics:
OpenTelemetry.Instrumentation.Runtime 1.9.0
OpenTelemetry.Instrumentation.Process 0.5.0-beta.6
Built-in Metrics and Diagnostics in .NET
These libraries provide CPU and memory usage metrics, but the export interval is user-defined, often set to one minute or more.
Customer experience is highly sensitive to performance fluctuations, and we have observed that frequent, short CPU and memory spikes, often unrelated to core processes, are a major factor in high-latency Cosmos DB operations.
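To illustrate the concern, here is a minimal sketch of how a short burst can disappear when a one-minute window is reported as a single averaged value. The numbers are made up for illustration (one reading per second, a 3-second burst), and `summarize_window` is a hypothetical helper, not part of any of the libraries above. Whether a given instrument averages, sums, or reports the last value depends on the instrument and exporter, so treat this as the averaging case only.

```python
from statistics import mean


def summarize_window(samples):
    """Return (average, peak) CPU percent for one export window."""
    return mean(samples), max(samples)


# Hypothetical data: one reading per second over a 60-second export window,
# idle at 10% except for a 3-second burst at 100%.
window = [10.0] * 57 + [100.0] * 3

average, peak = summarize_window(window)
print(f"average={average:.1f}% peak={peak:.1f}%")  # average=14.5% peak=100.0%
```

In this sketch the averaged value (14.5%) looks healthy even though the process was saturated for three seconds, which is the kind of spike we would like to be able to detect.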
Questions
Is there a way to fine-tune the aggregation/export interval for these metrics to capture more granular data?
Are there specific metrics available to help identify short-lived CPU and memory spikes?
Are there any metrics that can assist in detecting thread starvation?
ref. Azure/azure-cosmos-dotnet-v3#4818