Access Violation in jithelpers with Dynamic PGO enabled #87597
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
|
@dotnet/jit-contrib FYI |
Jitted code leading to the crash is
00007ffd`70f4efb3 48b90000bc6dfd7f0000 mov rcx, 7FFD6DBC0000h
00007ffd`70f4efbd baffffffff mov edx, 0FFFFFFFFh
00007ffd`70f4efc2 e87922505c call coreclr!JIT_ClassInitDynamicClass (7ffdcd451240)
Here
I don't know if the value of I can't make sense of the DomainLocalAssembly:
though the first bit of it looks ok:
@mangod9 can somebody from the runtime take a look too? |
@EgorBo feel free to dig in from the jit side in the meantime. |
is this a consistent repro? @janvorli in case this is related to assembly unloading or something. |
@arekpalinski (also cc @redknightlois) have you ever seen a similar assert in RavenDB's unit tests with PGO enabled? |
@mangod9 this reproduces quite consistently on our end. It requires a specific setup where 2 nodes of RavenDB cluster are down (A and B) and the client app is talking to one node that is up (node C). The communication is over TCP, and after establishing some connections server C is throwing exceptions because some of them were supposed to be handled by node A or B (we don't do the failover then because we don't have cluster majority at that time - only 1 node is up). Steps to reproduce:
After some time you'll see the crash of server. You can also reproduce by running the server from the source code in Release mode (debugger can be attached at that time).
@EgorBo we didn't see anything like that during the unit test runs. |
@arekpalinski thanks! I was able to reproduce locally |
I've updated the repro to use the latest daily .NET 8.0 and the crash still reproduces. When I tried to run it with Checked bits I hit:
(PGO is unrelated, tried R2R=0/1, also tried to disable server gc mode) |
Link to the actual assertion |
@AndyAyersMS the value should be a dynamic class ID, so unless we have that many dynamic classes, it is bogus. I am not familiar with the underlying runtime code, but it seems the value is coming from runtime/src/coreclr/vm/class.h Lines 954 to 959 in 1784251
The other calls the following function that can also return FFFFFFFF based on the assert: runtime/src/coreclr/vm/methodtable.inl Lines 1338 to 1354 in 1784251
So it seems that debugging the case when the |
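To illustrate why an ID of FFFFFFFF is worth treating as bogus, here is a minimal sketch of the failure mode. It is not the CoreCLR code: `DynamicClassTable`, `tryGet`, `kInvalidDynamicClassId` and `uncheckedByteOffset` are hypothetical names made up for this example, and the 8-byte entry size used below is an assumption. The point is only that a sentinel ID used as a table index lands far outside any realistic table, which would be consistent with an access violation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel: the same bit pattern as `mov edx, 0FFFFFFFFh`.
constexpr std::uint32_t kInvalidDynamicClassId = 0xFFFFFFFFu;

// Hypothetical stand-in for a table indexed by dynamic class ID.
struct DynamicClassTable {
    std::vector<int> entries;  // placeholder payload, not the real entry type

    // Bounds-checked lookup: returns nullptr instead of reading past the end.
    const int* tryGet(std::uint32_t id) const {
        if (id == kInvalidDynamicClassId || id >= entries.size()) {
            return nullptr;
        }
        return &entries[id];
    }
};

// Byte offset that an unchecked `base + id * entrySize` computation would use.
constexpr std::uint64_t uncheckedByteOffset(std::uint32_t id, std::size_t entrySize) {
    return static_cast<std::uint64_t>(id) * entrySize;
}
```

With an assumed 8-byte entry, `uncheckedByteOffset(kInvalidDynamicClassId, 8)` is `0x7FFFFFFF8`, roughly 32 GiB past the table base, whereas `tryGet` rejects the same ID.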
The GC assert is not related to the issue, but could be something to fix on gc side cc @dotnet/gc - it seems that frequent calls to |
@janvorli I'm debugging the assert right now, here is where it comes from
Stacktrace:
|
It always happens in JIT when JIT compiles |
|
The class with dynamic statics problem is |
Sounds like the canonical instantiation may be somehow causing the trouble here. @davidwrighton can you please try to help here? |
So I guess the problem is that we shouldn't call |
My guess is that this is the speculative inlining where we try to inline generic methods, and if we hit a need for a runtime lookup we immediately give up, but maybe that somehow didn't happen here? |
Looks like maybe the inlining attempt should have failed at the |
we call |
So do we have the wrong context handle here? |
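The "give up when a runtime lookup is needed" rule discussed above can be pictured with a toy model. This is a sketch under stated assumptions, not the JIT's actual logic or data structures: `ContextKind`, `InlineCandidate`, `needsRuntimeLookup` and `decide` are all hypothetical names, and the single condition shown is a simplification of whatever the real inliner checks.

```cpp
#include <cassert>

// Hypothetical: whether the callee's generic context is fully known at JIT time
// or is a shared canonical instantiation resolved only at run time.
enum class ContextKind { Exact, SharedCanonical };

// Hypothetical summary of an inline candidate.
struct InlineCandidate {
    ContextKind context;
    bool accessesGenericStatics;  // callee touches statics of a generic class
};

enum class InlineDecision { Inline, Abort };

// In this toy model, statics of a shared-canonical class cannot be resolved
// to a concrete class at JIT time, so they would need a runtime lookup.
constexpr bool needsRuntimeLookup(const InlineCandidate& c) {
    return c.context == ContextKind::SharedCanonical && c.accessesGenericStatics;
}

// Speculative inlining: abort as soon as a runtime lookup would be required.
constexpr InlineDecision decide(const InlineCandidate& c) {
    return needsRuntimeLookup(c) ? InlineDecision::Abort : InlineDecision::Inline;
}
```

In this model the suspected bug corresponds to a candidate that should have produced `Abort` but was inlined anyway, leaving the JIT to embed an ID it could not actually resolve.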
@EgorBo if you could please share with me a dump with the GC assert, that'd be great. I'll take a look. |
handled offline |
@Maoni0, assume your fix is unrelated to the original issue? Your fix is only for Debug builds? |
Yes, the original issue is definitely non-gc related, in fact, #87847 should fix it (it does locally). The GC assert is just an issue I hit when I used Checked bits to repro the original issue. |
@mangod9 correct, unrelated and only on debug builds. |
Can you please backport the fix to .NET 7.0? This is a blocker for us which prevents us from enabling Dynamic PGO in our product. |
The original fix is a bit intrusive and risky but we might be able to land a simpler fix for .NET 7.0 |
We'd really appreciate that. It is very important for us. |
Do you require Dynamic PGO with .NET 7? Since .NET 8 (which will have it enabled by default) is going to be released soon, can't you just wait and enable PGO then? |
We're in the testing phase of a new version of RavenDB. We're going to release it on .NET 7. The earlier we're able to verify that everything works with Dynamic PGO enabled, the better for us. Since our usage of .NET is quite specific, I believe it could be beneficial for you as well if we verify PGO before the .NET 8 release. |
We have a scenario in which we are getting an access violation exception during a particular state of our RavenDB cluster when Dynamic PGO is enabled.
The setup is 3 servers, 2 of them are down and a client app is constantly trying to connect via TCP to the running server.
The connections are established but then some of them are dropped because they were supposed to be handled by a different node (which is down).
We have a way to reproduce it. The key seems to be that it happens only with Dynamic PGO enabled. With it disabled, no crash occurs.
We have a full dump of the process when the crash happened, available here:
https://drive.google.com/file/d/1EC0Gwz_ljWuCdJX2WaKDxfdGMpoINnba/view?usp=sharing
We're running using:
Exception
Stacktrace
Heap
There is no heap corruption:
RavenDB code
Reference to our code that is shown in the stacktrace:
https://github.com/ravendb/ravendb/blob/9f86bd3d3b08e07f67a5a238bd8dd77933774765/src/Raven.Server/RavenServer.cs#L2491