-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix an edge case in the Tarjan GC bridge that leads to losing xref information #112825
base: main
Are you sure you want to change the base?
Conversation
In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost.
Example code that got broken by the bug: // Run this on UI thread (eg. in MainActivity.OnCreate)
Task.Run(GCLoop); // <== Repeat this line up to 4 times depending on HW/emulator to make the repro more reliable
_ = AsyncStreamWriter();
public static async Task GCLoop()
{
while (true)
{
GC.Collect();
await Task.Delay(10);
}
}
public static async Task AsyncStreamWriter()
{
var bs = new ByteArrayOutputStream();
var osi = new Android.Runtime.OutputStreamInvoker(bs);
try
{
while (true)
await osi.WriteAsync(new byte[2]);
}
catch (ObjectDisposedException ex)
{
// <== Fails here because ByteArrayOutputStream got finalized/disposed
System.Environment.FailFast(ex.ToString());
}
} The task machinery of If a GC occurs at this point and starts the GC bridge machinery we end up with both To understand why the GC bridge reported it incorrectly, we must look at the log (produced with custom
The key part of the log is that there are several cycles in the object graph (or strongly connected component / SCC in terms of how the graph is processed). The inner SCC is processed correctly and you can see that |
For those who are more visual and can bear my handscribbling: Yellow are bridge objects. Blue is the first SCC in the graph, it itself points to a bridge object. Green is the second SCC graph. It points to no bridge objects by itself but it does contain the blue SCC which does. When processing the green SCC it erroneously decided that since there's no bridge objects it doesn't need to be colored and threw away the XREFs calculated from the blue SCC. |
/cc @BrzVlad |
@filipnavara I'm still going through this and trying to understand both the current SCC algorithm and the impact of the above. I'm also going to defer to @BrzVlad for the final sign off as he has more experience with this area than me. |
I'm more than happy to answer any questions in case it helps with this PR or the work on CoreCLR bridge.
Sure. I think this bug was essentially cropping up many times over the years but no one really tracked it down... As a follow-up we can try to go down the history lane to see if it fixes the cases where people previously had to switch to the "new" bridge. |
Ah, that history is interesting. If you could do that or get some indication that the other algorithms handle this correctly and now the tarjan one does too, that would help give us more confidence in this change. |
The sample code above runs correctly under .NET 8/9 when switching to the "new" bridge. Likewise, it runs correctly after the fix but crashes reliably without it. There's a mode of the runtime that runs both bridges at the same time and compares the results. I can re-run the sample under that mode (unless the code bit rotted too much since the mono/mono times). |
Oh wow. I didn't know that. If you could check that out please do. I'm also happy to review a PR if it has bit-rot and a few fixes are needed. |
The good news is that the bridge comparison works in released MonoVM builds, so you can run .NET 8:
.NET 9:
However, I would expect it to bail with different values. I need to figure out the details. The Tarjan bridge needs be setup to run in precise mode for this to work, which is not the default. UPD: With this patch applied it no longer crashes even with the |
With
(top being the primary Tarjan bridge, bottom being the New bridge used for comparison) There's a cut & paste error in the |
What does the DUMP_GRAPH look like with the fix, interested to compare before and after. |
Note that the order of the objects is not stable. Sometimes |
Thanks @filipnavara - your help here is appreciated more than you know. From my perspective, this assuages any concerns I have with this change. I'm still going to defer to someone with more experience in this code base though. If you'd like to put up the fixed up code above - still commented out, feel free. I'm confident enough to sign off on that. |
I am playing with the idea of running the test cases as a standalone executable that includes unmodified sgen-tarjan-bridge.c: The branch has a minimal example with the broken object graph scenario. It can run it on desktop and report back the output of the bridge code. With a bit of love we could probably clean it up and pair it with a fuzzer to track down other failures like #106410. |
@filipnavara I agree this is a good opportunity to start testing the |
Yes, I am aware of it. I still prefer to test the bridge in complete isolation because it allows me to avoid hitting other GC behaviors (ie. conservative stack scan keeping objects alive), overhead, and to run it with specific graphs in a tight loop. It may, however, be good enough for testing specific test cases like the ones in
I am open to this but I would prefer if this PR is not blocked on it. |
I still believe we should come up with better testing strategy but that's a separate concern from fixing the bug. I added the test case to |
@filipnavara Sorry for adding confusion. |
Feel free to ask questions in case you hit something non-obvious. I'll be on Discord and GitHub. Notably, neither this PR nor #113044 requires understanding all the details. Both are relatively isolated issues in the bridge code that don't modify the core of the algorithm. I highly recommend running the standalone sample in https://github.com/filipnavara/runtime/tree/standalone-gc-bridge-test under debugger to see the values that trigger the erroneous conditions. |
@@ -748,7 +748,7 @@ create_scc (ScanData *data) | |||
|
|||
for (i = dyn_array_ptr_size (&loop_stack) - 1; i >= 0; --i) { | |||
ScanData *other = (ScanData *)dyn_array_ptr_get (&loop_stack, i); | |||
found_bridge |= other->is_bridge; | |||
found_bridge |= other->is_bridge || dyn_array_ptr_size (&other->xrefs) > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking something like this would be more clear with the intention of the change.
@@ -745,10 +745,16 @@ create_scc (ScanData *data)
gboolean found = FALSE;
gboolean found_bridge = FALSE;
ColorData *color_data = NULL;
+ gboolean can_reduce_color = TRUE;
for (i = dyn_array_ptr_size (&loop_stack) - 1; i >= 0; --i) {
ScanData *other = (ScanData *)dyn_array_ptr_get (&loop_stack, i);
found_bridge |= other->is_bridge;
+ if (dyn_array_ptr_size (&other->xrefs) > 0) {
+ // This scc will have more xrefs than the ones from the color_merge_array,
+ // we will need to create a new color to store this information.
+ can_reduce_color = FALSE;
+ }
if (found_bridge || other == data)
break;
}
@@ -756,9 +762,12 @@ create_scc (ScanData *data)
if (found_bridge) {
color_data = new_color (TRUE);
++num_colors_with_bridges;
- } else {
+ } else if (can_reduce_color) {
color_data = reduce_color ();
+ } else {
+ color_data = new_color (FALSE);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that should work too and it seems reasonable.
* (ie. the task continuation is hooked through the SynchronizationContext | ||
* implentation and rooted only by Android bridge objects). | ||
*/ | ||
static void NestedCycles () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seem to have a starting point for some passing bridge tests (comparing between new and tarjan). I will add them in a separate PR, maybe also including this test, so we can start incrementally fixing and testing the tarjan bridge.
In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost.
Consider the following object graph:
RunnableImplementor
andByteArrayOutputStream
are bridge objects that are eligible for collection on the .NET side.ByteArrayOutputStream
is assigned a color in the color graph. We start traversing the graph through depth first search fromRunnableImplementor
to find the strongly connected components using the Tarjan algorithm. Once we discover the link from<>c__DisplayClass2_0
toByteArrayOutputStream
we make a note of the color that needs to be recorded in the color graph to produce an xref. We then exit the recursion and find thatInner SCC
forms a strongly connected component and we merge all the recorded colors into[AsyncStateMachineBox'1]->xref
array. Then we proceed to continue with the Tarjan algorithm which eventually finds theOuter SCC
. Once again we want to merge all the colors of the nodes in the outer SCC. The implementation assumes that there could be any links in the color graph only if any objects in the SCC point to a bridge object. However, it failed to consider thatInner SCC
's nodes were already processed and contain the summary information innode->xref
field. If there were any other links to bridge objects it would still allocate a color for theOuter SCC
and behave correctly. However, if there are no other links to bridge objects in the reminder ofOuter SCC
it would fail to allocate a color and silently drop thenode->xref
list on any contained node.Fixes dotnet/android#9039
Ref dotnet/android#9789 for discussion on the root cause