-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix 2 issues when triggering a garbage collection via EventPipe #102612
Conversation
- dotnet#99487: avoid double GC - dotnet#102572: handle client sequence number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to take a PR to fix this, but we need to modify the approach. Suggestions inline. Thanks!
@@ -2403,18 +2403,30 @@ VOID EtwCallbackCommon( | |||
// Special check for the runtime provider's ManagedHeapCollectKeyword. Profilers | |||
// flick this to force a full GC. | |||
if (g_fEEStarted && !g_fEEShutDown && bIsPublicTraceHandle && | |||
((MatchAnyKeyword & CLR_MANAGEDHEAPCOLLECT_KEYWORD) != 0)) | |||
#if !defined(HOST_UNIX) | |||
(ControlCode == EVENT_CONTROL_CODE_ENABLE_PROVIDER) && // only when the provider is enabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unforetunately checking the control code for ENABLE_PROVIDER is only going to give you a partial fix. ETW convention (which EventPipe also follows) is that if there is at least one session still active for a given provider then disable still invokes the callback with ControlCode=ENABLE_PROVIDER.
I think there are two things that would potentially work:
-
For ETW we could bring back filtering for the CAPTURE_STATE+ENABLE control code. It looks like c16fd29 removed all the filtering on GC heap snapshots so after that they would have been triggered by any enable, disable or capture state. We could narrow it back down to what it was originally: enable and capture state only. Enable on ETW is still broader than ideal, but it preserves back compat and ETW profilers can always use CAPTURE_STATE instead if they want to do it precisely.
-
For EventPipe, you can distinguish a session disable from a session enable based on whether the sourceId parameter is NULL (Enable=not-null,Disable=null). That parameter isn't currently exposed in this CallbackCommon function but it is present in EventPipeEtwCallbackDotNETRuntime so you could change the signature on this common method to pass it through. Something like:
enum SessionChange
{
EventPipeSessionEnable,
EventPipeSessionDisable,
EtwSessionChangeUnknown
}
EventPipeEtwCallbackDotNETRuntime(sourceId, ...)
{
SessionChange change = sourceId != null ? EventPipeSessionEnable : EventPipeSessionDisable;
EtwCallbackCommon(change, ...);
}
EtwCallback(...)
{
EtwCallbackCommon(EtwSessionChangeUnknown, ...);
}
if ((FilterData != NULL) && | ||
(FilterData->Type == 1) && | ||
(FilterData->Size == sizeof(l64ClientSequenceNumber))) | ||
if ((FilterData != NULL) && (FilterData->Type == 0) && (FilterData->Size > 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to define a new convention to pass this data which works for EventPipe (thats fine), but breaks the previous convention that was used by ETW profilers. I'm fine to add a new EventPipe-friendly convention but the old one needs to keep working.
clientSequenceNumberAsString += strlen(clientSequenceNumberAsString) + 1; | ||
|
||
// extract the client sequence number from the string | ||
int result = _snscanf_s(clientSequenceNumberAsString, FilterData->Size - (strlen((char*)FilterData->Ptr) + 1), "%I64u", &l64ClientSequenceNumber); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to create a new convention based on keys and values I think it would be good to handle potentially multiple key-value pairs being specified at the same time of which 0,1, or multiple of those keys might have the matching key name. For example we could define the key name as 'GCSeqNumber' and then the parsing code would need to determine to be able to handle inputs like these:
Foo\0Bar\0GCSeqNumber\025\0 -> 25
Foo\0Bar\0Baz\025\0 -> not present, default to 0
Foo\0GCSeqNumber\0Baz\025\0 -> not present, default to 0
Baz\025\0GCSeqNumber\0NotNumber\0 -> invalid number text, default to 0
GCSeqNumber\025\0Baz\0Foo\0GCSeqNumber\019\0 -> 25 (first one wins?)
@@ -673,6 +673,7 @@ disable_helper (EventPipeSessionID id) | |||
|
|||
while (ep_provider_callback_data_queue_try_dequeue (provider_callback_data_queue, &provider_callback_data)) { | |||
ep_rt_prepare_provider_invoke_callback (&provider_callback_data); | |||
provider_callback_data.enabled = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets not modify enabled here. EventPipe is following a convention established by ETW. Modifying when we emit the PROVIDER_ENABLED control code makes ETW/EventPipe behave differently which makes it more difficult for anyone to write logic in a provider callback that works correctly for both.
Hi @chrisnas! Do you plan on moving forward with this PR in the short term, or would it be OK if we moved this to "draft" mode while development is still in progress? |
Hi @chrisnas! I am moving this PR to "draft" mode - please let us know if you plan on addressing feedback in this PR. |
Sorry for the delay @tommcdon! |
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it. |
Fix the following issues:
I've tested with dotnet-gcstats (for the client sequence number + number of GCs) and dotnet-fullgc (to trigger a GC with a custom client sequence number)
@noahfalk / @brianrob: I'm not sure to understand why the enabled/disabled state was based on the number of session being > 0 because this number is always >= 1 when the functions are called.