-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix stack overflow handling on Unix #89799
Conversation
There was a race between cleaning up signal handlers at process abort and multiple threads failing with stack overflow at the same time. When a process exits due to a stack overflow, we should not restore the SIGSEGV handler so that all other threads failing due to the same condition just wait until the process abort completes.
Build errors |
The test is failing on Unix because the output of the test contains the following log from createdump after the stack trace and so the expected last line with "Main" is not found.
I need to update the test to be resilient to this. |
Also make it resilient to extra output after the stack trace like createdump output
I've added disabling of minidumps for the child processes that are supposed to crash with stack overflow. It is not necessary and consuming resources / time on the CI machines. We should probably do that for the other coreclr tests that have intentionally crashing children too. |
@@ -2524,7 +2524,7 @@ PROCAbort(int signal, siginfo_t* siginfo) | |||
|
|||
// Restore all signals; the SIGABORT handler to prevent recursion and | |||
// the others to prevent multiple core dumps from being generated. | |||
SEHCleanupSignals(); | |||
SEHCleanupSignals(false /* isChildProcess */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How high is the risk of reentrancy here? We probably don't want to end in a recursive case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ProcAbort is calling abort() right after this line, so I don't see how it could be reentered on the same thread.
Can the same problem happen with other signals? For example, what happens if there are multiple threads crashing with NullReferenceException? If I am reading the code correctly, the first one is going to unregister the signal handler and the next one is going bring the process down immediately since our signal handler won't be registered anymore. Should we keep the signal handlers registered and instead block in them if a fatal process shutdown is underway? |
This is a targeted fix for an issue specific to the current implementation of stack overflow handling. The reason why I have originally (when implementing the stack overflow handling in the past) added the logic that lets through only the first stack overflow and makes all others wait was that I needed to create an extra thread for the stack trace logging to be reliable in the case of stack overflow and it didn't seem to make sense to have many of them writing the stack trace past each other. In general, I think that it would make sense to not to unregister the hardware exception signal handlers at all at process abort and implement some waiting mechanism in the handlers, but we would need to solve an additional race between the time the abort is called and the time the hardware exception signal handler is called. In fact, thanks to your feedback I have just realized that even with the fix in this PR, we can have a race causing the same problem as before if we had any unhandled exception or anything causing process abort on one thread and stack overflow on another one. So, it seems it would make sense to modify this PR to handle all the cases, including the ones you've mentioned, properly (and move it to .NET 9) |
@janvorli is this ready to merge or more work is required as part of this change? |
@mangod9 please see the end of my comment above: #89799 (comment). Based on JanK's feedback, I've decided making some more changes for other hardware failure signals. I have also found there is a possibility of a different race than the one I've fixed. So it needs some additional work. |
There was a race between cleaning up signal handlers at process abort and multiple threads failing with stack overflow at the same time.
When a process exits due to a stack overflow, we should not restore the SIGSEGV handler so that all other threads failing due to the same condition just wait until the process abort completes.
Close #46175