Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various fixes for generation aware analysis #70764

Merged
merged 5 commits into from
Aug 7, 2022

Conversation

cshung
Copy link
Member

@cshung cshung commented Jun 15, 2022

No description provided.

@cshung cshung self-assigned this Jun 15, 2022
@ghost
Copy link

ghost commented Jun 15, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: cshung
Assignees: cshung
Labels:

area-GC-coreclr

Milestone: -

@Maoni0 Maoni0 requested a review from noahfalk June 15, 2022 04:12
@Maoni0
Copy link
Member

Maoni0 commented Jun 15, 2022

@noahfalk, @davmason this is what I was seeing when it was right after EventPipeAdapter::Initialize. an AV on this stack -

coreclr!ep_stream_writer_write_vcall+0x29 [inlined in coreclr!ep_file_initialize_file+0x58]:
00007ff9`4cc53290 488b01 mov rax,qword ptr [rcx] ds:00000000`00000000=????????????????
------
# Child-SP RetAddr Call Site
00 (Inline Function) --------`-------- coreclr!ep_stream_writer_write_vcall+0x29 [C:\runtime-mvc\src\native\eventpipe\ep-stream.c @ 659]
01 (Inline Function) --------`-------- coreclr!ep_stream_writer_write+0x29 [C:\runtime-mvc\src\native\eventpipe\ep-stream.c @ 672]
02 000000a8`cc77f230 00007ff9`4cc5545f coreclr!ep_file_initialize_file+0x58 [C:\runtime-mvc\src\native\eventpipe\ep-file.c @ 403]
03 000000a8`cc77f290 00007ff9`4ca93bd6 coreclr!ep_session_start_streaming+0x17 [C:\runtime-mvc\src\native\eventpipe\ep-session.c @ 382]
04 000000a8`cc77f2c0 00007ff9`4c998546 coreclr!ep_finish_init+0x955a6 [C:\runtime-mvc\src\native\eventpipe\ep.c @ 1380]
05 (Inline Function) --------`-------- coreclr!EventPipeAdapter::FinishInitialize+0x5 [C:\runtime-mvc\src\coreclr\vm\eventpipeadapter.h @ 173]
06 000000a8`cc77f2f0 00007ff9`4c9d7203 coreclr!EEStartupHelper+0x96e [C:\runtime-mvc\src\coreclr\vm\ceemain.cpp @ 943]
07 000000a8`cc77f530 00007ff9`4c9d71aa coreclr!EEStartup+0x27 [C:\runtime-mvc\src\coreclr\vm\ceemain.cpp @ 1112]
08 000000a8`cc77f580 00007ff9`4c9d70e8 coreclr!EnsureEEStarted+0x92 [C:\runtime-mvc\src\coreclr\vm\ceemain.cpp @ 315]
09 000000a8`cc77f5d0 00007ff9`4c9f387a coreclr!CorHost2::Start+0x58 [C:\runtime-mvc\src\coreclr\vm\corhost.cpp @ 102]
0a 000000a8`cc77f610 00007ff7`de485971 coreclr!coreclr_initialize+0x15a [C:\runtime-mvc\src\coreclr\dlls\mscoree\exports.cpp @ 251]

@noahfalk
Copy link
Member

I'm suspicious that this reordering shouldn't have been necessary and the root of the issue is still lurking. It should be possible to enable a new session before calling FinishInitialize() and it is supposed to automatically defer starting it until FinishInitialize() runs. Based on @Maoni0's callstack the deferal worked but then when we resumed the session state wasn't completely initialized and it immediately triggered the AV. @davmason its probably worthwhile to examine why the session state was bad. Based on the callstack I assume it was an null pointer for the stream_writer member.

@davmason
Copy link
Member

Was there a repro for this? I'm happy to take a look if there is a repro or dump

@Maoni0
Copy link
Member

Maoni0 commented Jun 15, 2022

thanks! I'll share the dump/symbols.

@noahfalk
Copy link
Member

@Maoni0 do you have some steps to reproduce the AV by running an app? I think we'll want to step through the initialization code and figure out why the NULL valued field never got initialized.

@cshung
Copy link
Member Author

cshung commented Jun 15, 2022

My blog post here has a basic example of using the generation aware analysis. Since the AV happened very early during initialization, it is probably unrelated to the actual app being analyzed. There are perhaps some tricky environment variable values that trigger the bug, but I am not sure.

@Maoni0
Copy link
Member

Maoni0 commented Jun 15, 2022

right, all I did was set a few env vars to enable genaware analysis -

OMPlus_GCGenAnalysisBytes=1
COMPlus_GCGenAnalysisGen=1
COMPlus_GCGenAnalysisIndex=2

and start a test with corerun and hit this on initialization.

unfortunately I can no longer repro this even with the change reverted... it started seeming randomly too. maybe something just took a bit longer/shorter to create which caused something in that stream_writer not being ready sometimes.

@cshung
Copy link
Member Author

cshung commented Jun 16, 2022

I have found and pushed another potential fix. In case there is an error creating the session (e.g. the file is not writable), gcGenAnalysisEventPipeSessionId would be 0 but gcGenAnalysisTrace was set to true. Later on, the code here and there would attempt to access the broken session and nothing good will happen.

This fix probably cannot explain why the initialization AV but good to have.

@jakobbotsch
Copy link
Member

jakobbotsch commented Jun 16, 2022

This might be a completely different issue, but I have seen a similar crash with the same stack trace as @Maoni0 when I have used a COMPlus_EventPipeOutputPath pointing to an invalid path, e.g. COMPlus_EventPipeOutputPath=foobar\abc.nettrace where foobar does not exist. So there may be some issue with gracefully handling when the file is inaccessible for various reasons.

@davmason
Copy link
Member

Playing around with it I can repro the same thing @jakobbotsch reports.

The comment for COMPlus_EventPipeOutputPath says "full path excluding file name"

RETAIL_CONFIG_STRING_INFO(INTERNAL_EventPipeOutputPath, W("EventPipeOutputPath"), "The full path excluding file name for the trace file that will be written when COMPlus_EnableEventPipe=1")

But the code does the opposite, it expects the full path including file name, the only modification it does is it will replace {pid} with the pid.

ep_config = ep_rt_config_value_get_config ();
ep_config_output_path = ep_rt_config_value_get_output_path ();
ep_char8_t pidStr[24];
ep_rt_utf8_string_snprintf(pidStr, ARRAY_SIZE (pidStr), "%u", (unsigned)ep_rt_current_process_get_id());
while (true)
{
if (ep_rt_utf8_string_replace(&ep_config_output_path, "{pid}", pidStr))
{
// In case there is a second use of {pid} in the output path
continue;
}
// No more instances of {pid} in the OutputPath
break;
}
ep_circular_mb = ep_rt_config_value_get_circular_mb ();
output_path = NULL;
output_path = ep_config_output_path ? ep_config_output_path : "trace.nettrace";

The file path via environment variable path isn't hardened to a bad path, and will try to create the stream_writer on that path, if it fails then it will set file->stream_writer to null and during the delayed initialization it will AV trying to use stream_writer.

@noahfalk and @lateralusX, the environment varaibles are not considered a customer scenario, correct? We should update the comment in clrconfigvalues.h and do something to gracefully terminate the session, but I am not super concerned about making it a great experience if it's not a customer scenario.

@cshung and @Maoni0 - let me know if this is not the case you are hitting. I am able to set the generation aware analysis as long as I either don't start another session or give a full path including file name. But happy to debug further if there's something else lurking here.

@noahfalk
Copy link
Member

@noahfalk and @lateralusX, the environment varaibles are not considered a customer scenario, correct?

We do tell customers about these environment variables and they are documented:
https://docs.microsoft.com/en-us/dotnet/core/diagnostics/eventpipe#trace-using-environment-variables
I'd suggest we validate the inputs well enough that we don't AV, and ideally we emit a warning that the path was invalid.

@Maoni0
Copy link
Member

Maoni0 commented Jun 17, 2022

this is not what I hit. the sequence of things -

  • I could create the gcgenaware file just fine with setting the genaware env vars;
  • suddenly I couldn't anymore and sometimes I could observe the AV, sometimes it failed silently (that's the other thing Andrew fixed). I did change the GC code but the changes I made there shouldn't have affected this since this is not even to the point where GC starts using the eventing stuff yet. but certainly maybe there's some timing changed.
  • I talked to Andrew and he suggested the 1st fix. I applied that one and didn't observe AV anymore;
  • tried to repro for you guys by reverting that fix, could no longer repro.

so unfortuantely I don't have a repro currently. will let you know if I ever observe this again.

@lateralusX
Copy link
Member

I think we at least should do similar NULL check on file_stream_writer,

file_stream_writer = ep_file_stream_writer_alloc (output_path);
, as we do on ipc_stream_writer
ep_raise_error_if_nok (ipc_stream_writer != NULL);
, since failing to open the path should return a NULL file_stream_writer and that should in turn fail the complete allocation of the session in ep_session_alloc.

@cshung
Copy link
Member Author

cshung commented Aug 4, 2022

@Maoni0, @noahfalk, @lateralusX

Can we take a look at this PR again?

On top of the initial change, I have made a few more changes to make generation-aware analysis more robust:

  • Fail the event pipe session construction if we fail to open the stream, that avoids the AV
  • If we detected the failure to open the session, turn off tracing
  • Append PID to the file names to avoid writing to the same file in case of multiple processes.

I think this PR is ready to go.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo inline comments. We should also make sure the final merged commit gets an appropriate commit message. The original goal of 'initialize later' is only a minority of the total work now.

src/coreclr/utilcode/stresslog.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/finalizerthread.cpp Outdated Show resolved Hide resolved
@noahfalk
Copy link
Member

noahfalk commented Aug 5, 2022

@davmason

@cshung cshung merged commit e43e315 into dotnet:main Aug 7, 2022
@cshung cshung deleted the public/initialize-later branch August 7, 2022 00:36
@cshung cshung changed the title Initialize Generation Aware Analysis later Various fixes for generation aware analysis Aug 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants