Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more runtime GC counters #38851

Merged
merged 13 commits into from
Aug 4, 2020
Merged

Add more runtime GC counters #38851

merged 13 commits into from
Aug 4, 2020

Conversation

sywhang
Copy link
Contributor

@sywhang sywhang commented Jul 7, 2020

This adds the following new runtime GC counters:

  • GC Heap Fragmentation %
  • Time Between Last Two GCs
  • Time Between Last Two Gen 0 GCs
  • Time Between Last Two Gen 1 GCs
  • Time Between Last Two Gen 2 GCs
  • Last Gen 0 GC Pause Time
  • Last Gen 1 GC Pause Time
  • Last Gen 2 GC Pause Time

Issues tracking this PR are #1099 and #31951.

internal static extern ulong GetGenerationTimeBetweenGC(int gen);

[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int GetGenerationLastGCDuration(int gen);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this information is important for telemetry, should we consider putting it on GCMemoryInfo?
cc: @maoni

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCMemoryInfo already includes pause info. I don't think we want to include duration because for BGC it's not useful; for non BGC it's basically the same as pause.

the user of the GetGCMemoryInfo API would detect the GC frequency to the precision their sampling interval allows, eg if they call this API once every second and detected 2 gen0 GCs, they don't know how long exactly elapsed between the 2 GCs but they know they have 2 gen0 GCs in 1s.

@Maoni0
Copy link
Member

Maoni0 commented Jul 9, 2020

the counter name says "pause time", the API says "GC duration". for BGCs, these 2 are very different. the time between BGC start and end could be long but only a small portion of it is paused.

the calculation for g_GenerationLastGCDuration for BGC is neither though. there could be gen0/1 GCs happen during a BGC which is explained here: https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/background-gc. you can also see https://devblogs.microsoft.com/dotnet/gc-etw-events-3/ for a sequence of GC events when a BGC happens (GCStart is fired in UpdatePreGCCounters and GCEnd is fired in UpdatePostGCCounters) so you can see how the calculation is done is neither the BGC duration nor its pause time. I've been explaining to folks for a long time that on full framework the % time in GC perf counter is not accurate for BGC. I don't think we should be exposing more stuff with this inaccurate calculation.

for GC counters, since this is part of a public surface, I would strongly encourage you to find out how this will be used to decide whether you want duration or pause time. pause time for BGC is already calculated with the new GetGCMemoryInfo API implementation so if that's what you want, you can get it from there. in fact I would encourage you to see if you should get other GCs' duration from that too - that includes suspension whereas this calculation does not. do we want to expose 2 different kinds of duration?

@noahfalk
Copy link
Member

I think we should back up a bit. These changes have been done under a bug-fix style workflow but this is a new feature with public surface area. There may have been some confusion because past counters were added with little fanfare. I suspect the major difference is that those counters were largely replicating functionality that .NET Framework already had whereas these are novel requests that have never been subjected to much scrutiny.

My suggestion is to write a doc and submit it as a PR in https://github.com/dotnet/runtime/tree/master/docs/design/features. The doc should have:
a) What problem does the counter help customers solve and how does it help them do that? (Ideally this can be written using the same text that we will need to later put on docs.microsoft.com to explain the counters to users)
b) What is the algorithm used to calculate the counter's value.
Once we agree on the design then implementation review will hopefully be straightforward.

Here are some things I expect will come up and hopefully get resolved in that discussion:

  1. I'm not certain what problem Ben aims to solve (I can make an educated guess, but I'd rather not build features on my guess :)
  2. "The last GC" is ambiguous, foreground and background GCs can run in parallel.
  3. All the "last pause duration" and "last inter-gc interval" metrics are samples of size 1 in some interval
    • Is the size 1 sample sufficient? Should we instead be calculating a statistic that doesn't ignore the N-1 preceding GCs?
    • Is the non-random nature of the sampling an issue? (things with uniform timing can align)
  4. Do the metrics need to distinguish between foreground and background GC? They probably have very different pause time characteristics and I doubt they are randomly distributed.
  5. Going the other direction, do we need to distinguish the generations at all to make the metrics worthwhile? If we can get the same amount of diagnostic power out of fewer consolidated metrics customers will probably appreciate any simplifications we can bring to their issue diagnosis tasks.

@sywhang sywhang mentioned this pull request Jul 22, 2020
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@Maoni0 Maoni0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no comments aside from the precision issue Noah already pointed out above.

Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
* Add more GC counters

* Cleanup

* Change size of last GC stats arrays from total_generation_count to max_generation

* add asserts and fix error

* fix typo

* Fix Windows build

* Remove pause time / time between GC counters as discussed in design PR

* revert changes to GC for unused counters

* more unnecessary changes

* build fix

* last bit of undoing changes

* CR feedback
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants