Metrics for System.Runtime #85372
CC @reyang
@JamesNK FYI in OpenTelemetry .NET we've implemented these:
Having these in future versions of the runtime would be awesome; the existing OpenTelemetry instrumentation libraries can do a runtime detection and leverage the runtime instrumentation if it is there. Eventually, as old versions of the runtime get deprecated, we'll land in a better situation where we don't need a separate instrumentation library as things are "baked in".
I think OTel would still have something because .NET counters won't follow the OTel naming standard. However, the implementation should be very simple because built-in counters will provide all the information needed.
@JamesNK is this work you were planning to pursue yourself, or just recording the request?
Recording the request.
@JamesNK I moved this to the Future milestone. Please let me know if there is strong demand to have this in .NET 8.0.
The main reason to implement these as metrics in 8 is so that we can wean people off EventCounters and onto the metrics instead. As these are the main process-wide counters, getting them converted will be a major signal towards that goal. There are likely few counters that need many dimensions here, as most are process-wide. We should evaluate the work in comparison to the infrastructure needed to implement it.
Hey folks! I am interested in trying to help implement this. |
Hi @omajid, glad to have help! I'm guessing that most of the work on this feature will be investigating design options and trying to get a consensus on the best design rather than writing the implementation code. If that is something you are interested in taking a stab at, that's great. If you are interested in having someone else work through the design first, that's fine too, but I don't know when that would occur. If you did want to pursue the design part, these are the major questions that come to mind right now:
My hunch is that, yes, some kinds of changes are going to be appealing, but we need to figure out what the impacts of different kinds of changes are, whether there is anything we can do to make migration easier, and then figure out which changes seem worthwhile. For (3) my guess is that we would make it a static singleton, but we need to figure out how that intersects with the DI Meter work and the new Meter config work, so there might be stalls where one design needs to wait for stuff to resolve in the other, or they have to be resolved simultaneously. I think there is design inspiration we could take from https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process. For changes/removals to existing counters there is also some past discussion here: #77530. So if all this sounds like something you still want to dive into, I think a first step would be to create an initial proposal (in a gist or a PR'ed markdown doc) describing what instruments would be exposed. Thanks!
Hey, @noahfalk! Thanks for the various links. These are great questions. I have been prototyping an implementation and I came up with some similar questions (and some possible answers to what you asked). I wouldn't mind helping with the design, though I am not a runtime or OpenTelemetry expert. Advice from anyone more familiar with this is more than welcome. I have been looking at OpenTelemetry's docs as a great starting point from which to evaluate design ideas.
I think if we are creating a Metrics based implementation for first-class support for OpenTelemetry, we should take advantage of that and provide similar (or additional) information, but in a way that is easier to consume and/or feels more natural for anyone looking to consume it via an OpenTelemetry-compatible tool. The opentelemetry-dotnet-contrib docs almost match the existing EventCounters of System.Runtime, with a few differences.
This isn't currently listed as something used in the OpenTelemetry docs, and isn't done in the EventCounter implementation either. So I think we can pass on this for a first stab? If we find some good use cases, we should consider adding Histograms for those.
Yes. In fact, I think we have to. Otherwise we provide OpenTelemetry-compatible metrics but violate all assumptions in the ecosystem, making things harder to parse and use. For example, all our metrics via EventCounters have a single name, but OpenTelemetry expects metrics to be namespaced via dots:
There's also prior art in the form of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime, which does a great job creating a hierarchy. Though I see that https://github.com/dotnet/runtime/pull/85447/files does things differently, and I am not sure why.
I think we should? We aren't putting our users on a path to success by including things that are confusing or likely to be misinterpreted. Especially in a new/fresh design. I also think we should leave out various
I hadn't really thought about this. The current design (eg, looking at the output of
This shouldn't matter from a usage point of view, right? Could we make a static singleton for now and later switch to DI without breaking users?
No worries. I think it's fine to toss out ideas and then get feedback on them. If we need folks with certain areas of expertise, we'll try to find them. Ultimately, if no consensus forms and a contentious decision needs to be made, I can make it.
If we add a histogram in the future where the Meter previously had no similar instrument defined, that seems straightforward and easy to postpone. What feels less straightforward is adding a counter or gauge now, then later deciding it would have been better to define that instrument as a histogram. For example, we might propose an ObservableGauge that was an average GC pause duration, then later think: oops, maybe that should have been a GC pause duration histogram instead.
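A minimal sketch of that contrast using the `System.Diagnostics.Metrics` API (the meter and instrument names here are hypothetical, not a proposal):

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Example.Runtime");

// Option A: expose only an average pause gauge. Cheap, but the shape of the
// distribution (tail latencies, outliers) is lost at the point of collection.
double lastAveragePauseMs = 0.0;
meter.CreateObservableGauge("example.gc.pause.average", () => lastAveragePauseMs, unit: "ms");

// Option B: expose a pause-duration histogram. Averages, counts, and
// percentiles can all be derived from it later, but not the other way around,
// which is why shipping the gauge first and the histogram later duplicates data.
var pauseDuration = meter.CreateHistogram<double>("example.gc.pause.duration", unit: "ms");
pauseDuration.Record(1.5); // hypothetical pause measurement
```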
A common pattern that has arisen with the OTel work is that .NET will have some pre-existing convention or naming scheme, then OTel defines a new scheme that isn't consistent. No matter which one we choose, it will always be inconsistent with something: either inconsistent with OTel recommendations or inconsistent with a .NET developer's past experience of the platform. When this happens we try to make a judgement call about which behavior more .NET developers are going to prefer in the long run, and often we do wind up favoring .NET self-consistency over OTel consistency. The pattern we've landed on in other places (example) is that we are staying consistent with the .NET metric naming convention rather than switching to the OTel naming convention. I'm expecting we'd do the same here. For folks who want something that conforms tightly to OTel naming and semantic conventions, the instrumentation packages from OTel, such as https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime, better fill that role right now. I expect what we'll want to build fairly soon (but not as part of this PR) are mechanisms that make schema conversion very easy so that users can get the data into whatever shape they need.
Whoops, that is actually the one I meant to link to above rather than Process. Glad you found it anyways :)
Above I mentioned how we can't be consistent with both past-precedent and with OTel so we had to make a choice. As to why we choose this way, a few reasons:
Yeah, that sounds pretty reasonable to me as well.
I think the one place this came up in the past was in the discussion of GC-related metrics. One set of customers will ask for fairly detailed metrics in a specific area, but I worried that if we add too much to the runtime Meter it will be confusing for users who have simple needs. I think the direction things were going, though, is that we shouldn't worry too much about adding more detailed metrics, as long as most users are seeing the metrics via dashboards and docs that can guide them in slowly rather than seeing raw dumps of every available instrument.
One area it might matter is with the Meter config work. For example, in logging there is no concept of a static singleton
A couple of questions:
Recently I've been assuming that the runtime counters will be a static singleton Meter defined in System.Diagnostics.DiagnosticSource, so no dependency on DI there. However, if we want to listen to it via the Meter config work, that would take a DI dependency. Other ways, such as MeterListener, OpenTelemetry, or external tools, do not require DI.
I think a good portion of the counters are the result of EventCounters compensating for not having dimensions. For example there are 5 different heap size metrics (gen0, gen1, gen2, LOH, POH) that could probably be a single metric with a dimension. I'd suggest we start with a design that doesn't explicitly namespace them and if it still feels overwhelming then we think on how we'd split them.
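A sketch of that consolidation, assuming hypothetical metric and tag names and using the public `GC.GetGCMemoryInfo()` API as the data source:

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Example.Runtime");
string[] generations = { "gen0", "gen1", "gen2", "loh", "poh" };

// One instrument with a "generation" tag instead of five separate
// gen0/gen1/gen2/loh/poh heap-size counters.
meter.CreateObservableUpDownCounter<long>("example.gc.heap.size", () =>
{
    var info = GC.GetGCMemoryInfo();
    int count = Math.Min(info.GenerationInfo.Length, generations.Length);
    var measurements = new Measurement<long>[count];
    for (int i = 0; i < count; i++)
    {
        measurements[i] = new Measurement<long>(
            info.GenerationInfo[i].SizeAfterBytes,
            new KeyValuePair<string, object?>("generation", generations[i]));
    }
    return measurements;
}, unit: "bytes");
```

A consumer can then aggregate across the tag to get total heap size, or filter on `generation` to recover the per-generation series.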
Probably. I like OpenTelemetry's approach of separating these metrics as 'Process' metrics rather than 'Runtime'. These high-level stats like CPU-usage and VM-usage are measurements the OS tracks for all processes rather than anything specific to a language runtime like Java, Python, or .NET.
I'm hoping we don't have a huge number of them + we can provide better guidance than we currently do. Today our docs mostly say "Here is what each counter measures". I think we should get to the point where the docs say "These are the counters we think are most useful for health monitoring, here is a default dashboard you can use, here is how you might use this data to start an investigation of different common problems..."
I'm hoping we aren't going to have so many that more elaborate naming schemes are needed, but I certainly don't rule it out. I'd propose starting with something that looks like OTel's runtime metrics but using .NET's traditional naming conventions.
I think looking at what OpenTelemetry did with runtime metrics is a good starting point. I'm guessing we'd land somewhere quite similar.
From #79459 (comment):
cc @Maoni0 for the last metric
I highly recommend getting input from OTel experts, e.g. @lmolkova, on counters, tags, and naming. It was a great help when putting together the aspnetcore metrics. Also, we should document the metrics on learn.microsoft.com and in the OTel semantic conventions docs. With the aspnetcore metrics there are lightweight docs on learn.microsoft.com, and links to detailed docs in the OTel semantic conventions for people who want more detail.
dotnet-counters connects to the MetricsEventSource which uses a MeterListener internally to obtain the data. There shouldn't be any alternative path to get the data from arbitrary Meters (excluding truly shady approaches like private reflection).
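For reference, a minimal sketch of that in-process path: a `MeterListener` subscribing to a Meter's instruments, which is the same mechanism `MetricsEventSource` uses internally (the meter and instrument names here are illustrative):

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Example.Runtime");
var allocations = meter.CreateCounter<long>("example.allocations");

long observed = 0;
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Opt in to instruments from the Meter we care about.
    if (instrument.Meter.Name == "Example.Runtime")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) => observed += value);
listener.Start();

allocations.Add(42); // the callback runs synchronously on this thread
Console.WriteLine(observed); // prints 42
```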
+1. I think that is the path we are already on by virtue of posting the sem-conv proposal in the OTel repo.
@noahfalk Yeah, I was heading in that direction myself, although I was hoping to avoid it if there was some clever way. One thing I did consider this morning was whether |
Btw @lmolkova is currently out, but she is scheduled to be back in a week. I'm glad to get other feedback but I do want to get her feedback specifically on this one :)
I'm out for two weeks starting on Monday, but I will keep an eye on these discussions. I'll continue playing with a POC to implement this, and once we have a design, we can determine what (if anything) needs to be prepared for API review, etc. |
@noahfalk @stevejgordon looking at the PR open-telemetry/semantic-conventions#1035, I see the proposal is missing at least three metrics compared to what we expose in https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters.
Is it intentional that we don't want to include these?
@tarekgh, not really. @noahfalk suggested starting with the metrics exposed via the existing OTel contrib library, so I didn't review the runtime event source. We can consider proposing those, too, or they could be added later. @noahfalk An alternative implementation I've been thinking about last night is whether we should consider adding an |
Can those GC events be received by an EventListener? I'd imagine that the events are fired to ETW from the unmanaged part of the runtime, rather than by EventSource, and that makes them invisible to EventListener. |
@KalleOlaviNiemitalo, I believe they are piped through and can be observed as per this post from @Maoni0. The reason I am considering this as an option is it opens the door to collecting GC duration and perhaps some other useful metrics if we base at least some of them on these richer events. |
I see; the events are buffered in unmanaged EventPipe code, and a thread in managed code pulls them via EventPipeInternal.GetNextEvent, so the runtime doesn't need to call managed code in the middle of garbage collection. |
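A rough sketch of the mechanism being discussed: an `EventListener` enabling the GC keyword (`0x1`) on the runtime's native event provider. The provider name, keyword value, and event names come from the runtime's ETW/EventPipe provider; whether the overhead is worth it is a separate question.

```csharp
using System.Diagnostics.Tracing;

// Enables GC events (keyword 0x1) from the runtime's native event provider.
// Events arrive on a dedicated EventPipe dispatch thread, not during the GC itself.
sealed class GcEventListener : EventListener
{
    public volatile bool FoundRuntimeProvider;

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
        {
            FoundRuntimeProvider = true;
            EnableEvents(source, EventLevel.Informational, (EventKeywords)0x1 /* GC keyword */);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // GCStart_V2/GCEnd_V1 carry the collection depth (generation) in the payload.
        if (eventData.EventName is "GCStart_V2" or "GCEnd_V1")
            Console.WriteLine($"{eventData.EventName} (depth {eventData.Payload?[1]})");
    }
}
```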
I would deliberately not include this one. OpenTelemetry includes WorkingSet, CPU, and other OS-level metrics in a separate group of process metrics. I think it's fine if we had a built-in implementation of process metrics too; I just wouldn't lump them in the same Meter with the runtime metrics.
I'd be fine with it as long as @Maoni0 is. It also raises the question if we only want the gen0 value of this or do we want higher generation budgets too.
This metric has history as being confusing and I think folks would be better off observing the rate of change in the clr.gc.pause.time metric. We did look at adding this to the OTel metrics in the past and decided against it. Some past discussion.
Although functionally it works, I'd worry you are going to incur higher perf overheads for no clear benefit. Creating the first EventListener in the process requires a thread to pump the events for a callback, plus blocks of virtual memory are allocated to store the buffered events prior to dispatching them.
Catching up on the discussion here. Having CLR metrics would be great! I've shared some feedback on the open-telemetry/semantic-conventions#1035 and happy to help polish names and attributes. One thing I hope we can discuss more is if translating existing counters is enough. I believe in some cases we can do better - specifically when we want to measure duration or a distribution of something:
Some of these can be addressed incrementally, but if we'd rather eventually report GC pause time as a histogram, we should not add it as a counter now - adding a histogram in the future will introduce duplication, as it will allow deriving all the counts from it.
All of those things are theoretically possible (and sound nice!) but two big caveats:
I'm glad to contemplate brand new metrics and think about them as part of a roadmap, but I'd put implementing them as part of a different feature that we do in a future .NET release.
This seems like an awkward place for us to be. The GC pause time counter is a pretty important metric that many people count on. If we omit it I think it could be a hard sell that anyone should bother with the built-in runtime counters at all for .NET 9. What are the concerns around adding a pause time Counter now and potentially in the future adding a histogram? I see that one is derivable from the other so enabling both isn't optimal storage efficiency, but it doesn't seem like it would rise to the level that I'd expect most .NET developers to care about it. If they did care about it couldn't they easily opt to drop one of the two metrics?
Thanks for the context @noahfalk ! To confirm my understanding:
It makes sense and sounds good! From .NET 9 timeline perspective, would it make sense to limit the scope to the most basic metrics (or at least make sure they are in)? WDYT? |
Yep!
I think the performance penalty for histograms largely depends on how frequently the thing they are measuring occurs and how good of a job we do at optimizing everything. There is some level of performance overhead where it's OK to still be on by default, so we'd probably decide case-by-case which ones were efficient/optimized enough and which ones should be opt-in.
Yeah, I'm hoping we would be able to include all the metrics that OTel's separate runtime instrumentation offers. For CPU and memory it seemed like OTel recommended having a separate 'Runtime' and 'Process' meters which seemed fine by me.
From my side, I agree that it would be ideal to port the existing OTel contrib runtime metrics as the initial set with the goal that consumers no longer need an extra instrumentation package to collect the "core" metrics. The timescales are tight if the implementation gets much more complex than mostly porting the contrib implementation. I'll try to prototype this during the week, but I think finalising the conventions is a blocker to getting this in before the .NET 9 cutoff.
I think having runtime metrics in .NET is something that would have impact for years, and we should not rush it for the sake of eliminating the existing OTel library. I'm happy to work with you on semconv and make sure it's not a blocker. Again, I don't think we need to port all the metrics, and by reducing the scope we can significantly reduce the complexity.
It's not primarily about eliminating that library. My main motivation is to get OTel-style metrics available out of the box for all .NET projects. We've already done this for a bunch of the stack in .NET 8, but runtime metrics stick out like a sore thumb now. Also, using the OTel SDK isn't the only way that folks see metrics; we want to provide them in other experiences like dotnet-counters and codeless instrumentation, which only happens when they are built in. Right now we've got 17 proposed metrics, all with a working reference implementation, docs, and customers using them right now, and they've already gotten through a previous attempt at making them align nicely with the OTel semantic conventions. Spending a few weeks to review them seems like plenty of time, no?
Reduce complexity for whom? I think it's pretty likely that if we cut any of the current OTel runtime instrumentation metrics (and provide no obvious alternative), then next release I'll have GH issues and emails from .NET customers saying "what happened to XYZ metric?" We already cut a bunch of those metrics in the past when we switched from Windows Performance Counters to EventCounters, and then customers told us where we went too far. A couple examples here and here.
I've started an initial port of the metrics. For now, I've gone with:

```csharp
private const string MeterName = "System.Diagnostics.Runtime";
```

@tarekgh Regarding the .NET API review process, will this name also need discussion there?
If we are not exposing any public API, we don't need to go through the design review. For Meter name, if the involved people @noahfalk @lmolkova @reyang @CodeBlanch @JamesNK @cijothomas agree on the meter name, that will be enough to proceed. The proposed name looks good to me as it conforms to the guidelines. The name passed to the Meter constructor should be unique to distinguish it from other Meters. We recommend OpenTelemetry naming guidelines, which use dotted hierarchical names. Assembly names or namespace names for code being instrumented are usually a good choice. If an assembly adds instrumentation for code in a second, independent assembly, the name should be based on the assembly that defines the Meter, not the assembly whose code is being instrumented. |
Metric names are public API. They're not a .NET API and don't need to go through the standard .NET review process, but they definitely need to be reviewed.
I would propose "System.Runtime" for the Meter name, the same as we are currently using for our runtime EventCounters. We've done the same pattern where we have EventCounters registered in EventSources named "System.Net.Http", "System.Net.NameResolution", "Microsoft.AspNetCore.Hosting", etc. and we named the Meter identically. Effectively it means that folks only need to remember one name, rather than one name for the EventCounters based data and a 2nd name for the Meter based data. It also implies that as soon as the Meter shows up named "System.Runtime", that is what dotnet-counters is going to show by default, because it has a hard-coded default on that name plus a policy that automatically prefers Meter over EventSource when both exist with the same name.
Not sure if you were proposing some form of review beyond Tarek having pinged folks above, who will hopefully weigh in on this issue via comments? That's all I was planning on. Also adding @samsp-msft to the review ping list :)
@noahfalk, I've put together an initial draft of the PR for the implementation in my fork. Would you mind taking a look and seeing if it aligns with what you were expecting? I'll keep the PR there until the semantic conventions are finalised, and then I can create one in runtime.
Today there are event counters for System.Runtime: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters
Metrics should be added to this area. Advantages:

- Metrics can be consumed in-process via `MeterListener` in `System.Diagnostics.Metrics`.
- Metrics are supported by the wider ecosystem. For example, opentelemetry-net.

What instruments should we have?

Could the `exception-count` counter include the exception type name as a tag? Then tooling can provide a breakdown of not just the total exception count but the exception count grouped by type.
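A sketch of that tagging idea (the meter and instrument names are hypothetical), using the `AppDomain.FirstChanceException` notification to attach the type tag:

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Example.Runtime");
var exceptionCount = meter.CreateCounter<long>("example.exception.count");

// FirstChanceException fires for every thrown exception, before any catch
// handler runs, so each increment can carry the concrete exception type as a tag.
AppDomain.CurrentDomain.FirstChanceException += (_, e) =>
    exceptionCount.Add(1,
        new KeyValuePair<string, object?>("exception.type", e.Exception.GetType().FullName));

try { throw new InvalidOperationException("demo"); }
catch { /* already counted when the throw happened */ }
```

Tooling can then sum across the tag for the total, or group by `exception.type` for the per-type breakdown.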