Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix an edge case in the Tarjan GC bridge that leads to losing xref information #112825

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

filipnavara
Copy link
Member

@filipnavara filipnavara commented Feb 23, 2025

In the Tarjan SCC bridge processing there's a color graph used to find out connections between SCCs. There was a rare case which only manifested when a cycle in the object graph points to another cycle that points to a bridge object. We only recognized direct bridge pointers but not pointers to other non-bridge SCCs that in turn point to bridges and where we already calculated the xrefs. These xrefs were then lost.

Consider the following object graph:

flowchart TB
    subgraph InnerSCC["Inner SCC"]
        AsyncStreamWriter["<AsyncStreamWriter>d__2"]
        AsyncStateMachineBox["AsyncStateMachineBox`1"]
    end
    subgraph OuterSCC["Outer SCC"]
        c__DisplayClass2_0["<>c__DisplayClass2_0"]
        Action
        InnerSCC
    end
    RunnableImplementor --> Action
    Action --> c__DisplayClass2_0 & AsyncStateMachineBox
    c__DisplayClass2_0 --> Action
    AsyncStateMachineBox --> Action & AsyncStreamWriter
    AsyncStreamWriter --> AsyncStateMachineBox & ByteArrayOutputStream
    
    style RunnableImplementor stroke-dasharray:2
    style ByteArrayOutputStream stroke-dasharray:2
Loading

RunnableImplementor and ByteArrayOutputStream are bridge objects that are eligible for collection on the .NET side.

ByteArrayOutputStream is assigned a color in the color graph. We start traversing the graph through depth first search from RunnableImplementor to find the strongly connected components using the Tarjan algorithm. Once we discover the link from <>c__DisplayClass2_0 to ByteArrayOutputStream we make a note of the color that needs to be recorded in the color graph to produce an xref. We then exit the recursion and find that Inner SCC forms a strongly connected component and we merge all the recorded colors into [AsyncStateMachineBox'1]->xref array. Then we proceed to continue with the Tarjan algorithm which eventually finds the Outer SCC. Once again we want to merge all the colors of the nodes in the outer SCC. The implementation assumes that there could be any links in the color graph only if any objects in the SCC point to a bridge object. However, it failed to consider that Inner SCC's nodes were already processed and contain the summary information in node->xref field. If there were any other links to bridge objects it would still allocate a color for the Outer SCC and behave correctly. However, if there are no other links to bridge objects in the reminder of Outer SCC it would fail to allocate a color and silently drop the node->xref list on any contained node.

Fixes dotnet/android#9039
Ref dotnet/android#9789 for discussion on the root cause

In the Tarjan SCC bridge processing there's a color graph used to find out
connections between SCCs. There was a rare case which only manifested when
a cycle in the object graph points to another cycle that points to a bridge
object. We only recognized direct bridge pointers but not pointers to other
non-bridge SCCs that in turn point to bridges and where we already calculated
the xrefs. These xrefs were then lost.
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 23, 2025
@filipnavara
Copy link
Member Author

filipnavara commented Feb 23, 2025

Example code that got broken by the bug:

        // Run this on UI thread (eg. in MainActivity.OnCreate)  
        Task.Run(GCLoop); // <== Repeat this line up to 4 times depending on HW/emulator to make the repro more reliable
        _ = AsyncStreamWriter();

        public static async Task GCLoop()
        {
            while (true)
            {
                GC.Collect();
                await Task.Delay(10);
            }
        }

        public static async Task AsyncStreamWriter()
        {
            var bs = new ByteArrayOutputStream();
            var osi = new Android.Runtime.OutputStreamInvoker(bs);
            try
            {
                while (true)
                    await osi.WriteAsync(new byte[2]);
            }
            catch (ObjectDisposedException ex)
            {
                // <== Fails here because ByteArrayOutputStream got finalized/disposed
                System.Environment.FailFast(ex.ToString());
            }
        }

The task machinery of AsyncStreamWriter captures the synchronization context from the UI thread (Android.App.SyncContext). When the continuation of the osi.WriteAsync call is queue through the SyncContext.Post method it creates a RunnableImplementation object (Java bridged) that hold the only reference to the object graph of the task. The RunnableImplementation object itself is not rooted in .NET.

If a GC occurs at this point and starts the GC bridge machinery we end up with both ByteArrayOutputStream and RunnableImplementation seen as unreferenced objects on the .NET side. The GC bridge is responsible for reporting that graph to the Android GC bridge and preserving the cross-reference (xref) between RunnableImplementation and ByteArrayOutputStream. The bug, however, caused the two objects to be reported as independent and leading to ByteArrayOutputStream being finalized while there was still a reference to it from live code rooted by the RunnableImplementation.

To understand why the GC bridge reported it incorrectly, we must look at the log (produced with custom DUMP_GRAPH build):

-----------------
+scanning ByteArrayOutputStream (0x6d21ff42f0) index 0 color 0x0
-finishing ByteArrayOutputStream (0x6d21ff42f0) index 0 low-index 0 color 0x0
-finished ByteArrayOutputStream (0x6d21ff42f0) index 0 low-index 0 color 0x0
|SCC 0x7079a06010 rooted in ByteArrayOutputStream (0x6d21ff42f0) has bridge 1
	loop stack: (0/0)
	member ByteArrayOutputStream (0x6d21ff42f0) index 0 low-index 0 color 0x0 state 2
+scanning RunnableImplementor (0x6db6b821b0) index 1 color 0x0
	= pushing 0x6db6b789b0 Action -> pushed!
+scanning Action (0x6db6b789b0) index 2 color 0x0
	= pushing 0x6db6b649d0 <>c__DisplayClass2_0 -> pushed!
+scanning <>c__DisplayClass2_0 (0x6db6b649d0) index 3 color 0x0
	= pushing 0x6db6b78b30 SendOrPostCallback -> alive
	= pushing 0x6db6b79a30 Action -> pushed!
+scanning Action (0x6db6b79a30) index 4 color 0x0
	= pushing 0x6db6b827a0 AsyncStateMachineBox`1 -> pushed!
+scanning AsyncStateMachineBox`1 (0x6db6b827a0) index 5 color 0x0
	= pushing 0x6db6b79a30 Action -> already marked
	= pushing 0x6d24010cf8 ExecutionContext -> alive
	= pushing 0x6db6b827f0 <AsyncStreamWriter>d__2 -> pushed!
+scanning <AsyncStreamWriter>d__2 (0x6db6b827f0) index 6 color 0x0
	= pushing 0x6db6b827a0 AsyncStateMachineBox`1 -> already marked
	= pushing 0x6d21ff42f0 ByteArrayOutputStream -> already marked
	= pushing 0x6db6b64890 OutputStreamInvoker -> pushed!
	= pushing 0x6d24015e60 ReadWriteTask -> alive
+scanning OutputStreamInvoker (0x6db6b64890) index 7 color 0x0
	= pushing 0x6db6b74830 SemaphoreSlim -> pushed!
	= pushing 0x6d21ff42f0 ByteArrayOutputStream -> already marked
+scanning SemaphoreSlim (0x6db6b74830) index 8 color 0x0
	= pushing 0x6db6b501a8 StrongBox`1 -> opaque
-finishing SemaphoreSlim (0x6db6b74830) index 8 low-index 8 color 0x0
	compute low 0x6db6b74830 ->0x6db6b501a8 (StrongBox`1) 0x0 (-2 / -2, color 0x0)
-finished SemaphoreSlim (0x6db6b74830) index 8 low-index 8 color 0x0
|SCC 0x0 rooted in SemaphoreSlim (0x6db6b74830) has bridge 0
	loop stack: (1/1)(2/2)(3/3)(4/4)(5/5)(6/6)(7/7)(8/8)
	member SemaphoreSlim (0x6db6b74830) index 8 low-index 8 color 0x0 state 2
-finishing OutputStreamInvoker (0x6db6b64890) index 7 low-index 7 color 0x0
	compute low 0x6db6b64890 ->0x6db6b74830 (SemaphoreSlim) 0x7079a041f8 (8 / 8, color 0x0)
	compute low 0x6db6b64890 ->0x6d21ff42f0 (ByteArrayOutputStream) 0x7079a04018 (0 / 0, color 0x7079a06010)
		add color 0x7079a06010 to color_merge_array
-finished OutputStreamInvoker (0x6db6b64890) index 7 low-index 7 color 0x0
|SCC 0x7079a06010 rooted in OutputStreamInvoker (0x6db6b64890) has bridge 0
	loop stack: (1/1)(2/2)(3/3)(4/4)(5/5)(6/6)(7/7)
	member OutputStreamInvoker (0x6db6b64890) index 7 low-index 7 color 0x0 state 2
-finishing <AsyncStreamWriter>d__2 (0x6db6b827f0) index 6 low-index 6 color 0x0
	compute low 0x6db6b827f0 ->0x6db6b827a0 (AsyncStateMachineBox`1) 0x7079a04168 (5 / 5, color 0x0)
	compute low 0x6db6b827f0 ->0x6d21ff42f0 (ByteArrayOutputStream) 0x7079a04018 (0 / 0, color 0x7079a06010)
		add color 0x7079a06010 to color_merge_array
	compute low 0x6db6b827f0 ->0x6db6b64890 (OutputStreamInvoker) 0x7079a041c8 (7 / 7, color 0x7079a06010)
	compute low 0x6db6b827f0 ->0x6d24015e60 (ReadWriteTask) 0x0 (-2 / -2, color 0x0)
-finished <AsyncStreamWriter>d__2 (0x6db6b827f0) index 6 low-index 5 color 0x0
-finishing AsyncStateMachineBox`1 (0x6db6b827a0) index 5 low-index 5 color 0x0
	compute low 0x6db6b827a0 ->0x6db6b79a30 (Action) 0x7079a04138 (4 / 4, color 0x0)
	compute low 0x6db6b827a0 ->0x6d24010cf8 (ExecutionContext) 0x0 (-2 / -2, color 0x0)
	compute low 0x6db6b827a0 ->0x6db6b827f0 (<AsyncStreamWriter>d__2) 0x7079a04198 (6 / 5, color 0x0)
-finished AsyncStateMachineBox`1 (0x6db6b827a0) index 5 low-index 4 color 0x0
-finishing Action (0x6db6b79a30) index 4 low-index 4 color 0x0
	compute low 0x6db6b79a30 ->0x6db6b827a0 (AsyncStateMachineBox`1) 0x7079a04168 (5 / 4, color 0x0)
-finished Action (0x6db6b79a30) index 4 low-index 4 color 0x0
|SCC 0x0 rooted in Action (0x6db6b79a30) has bridge 0
	loop stack: (1/1)(2/2)(3/3)(4/4)(5/4)(6/5)
	member <AsyncStreamWriter>d__2 (0x6db6b827f0) index 6 low-index 5 color 0x0 state 2
	member AsyncStateMachineBox`1 (0x6db6b827a0) index 5 low-index 4 color 0x0 state 2
	member Action (0x6db6b79a30) index 4 low-index 4 color 0x0 state 2
-finishing <>c__DisplayClass2_0 (0x6db6b649d0) index 3 low-index 3 color 0x0
	compute low 0x6db6b649d0 ->0x6db6b78b30 (SendOrPostCallback) 0x0 (-2 / -2, color 0x0)
	compute low 0x6db6b649d0 ->0x6db6b79a30 (Action) 0x7079a04138 (4 / 4, color 0x0)
-finished <>c__DisplayClass2_0 (0x6db6b649d0) index 3 low-index 3 color 0x0
|SCC 0x0 rooted in <>c__DisplayClass2_0 (0x6db6b649d0) has bridge 0
	loop stack: (1/1)(2/2)(3/3)
	member <>c__DisplayClass2_0 (0x6db6b649d0) index 3 low-index 3 color 0x0 state 2
-finishing Action (0x6db6b789b0) index 2 low-index 2 color 0x0
	compute low 0x6db6b789b0 ->0x6db6b649d0 (<>c__DisplayClass2_0) 0x7079a04108 (3 / 3, color 0x0)
-finished Action (0x6db6b789b0) index 2 low-index 2 color 0x0
|SCC 0x0 rooted in Action (0x6db6b789b0) has bridge 0
	loop stack: (1/1)(2/2)
	member Action (0x6db6b789b0) index 2 low-index 2 color 0x0 state 2
-finishing RunnableImplementor (0x6db6b821b0) index 1 low-index 1 color 0x0
	compute low 0x6db6b821b0 ->0x6db6b789b0 (Action) 0x7079a040d8 (2 / 2, color 0x0)
-finished RunnableImplementor (0x6db6b821b0) index 1 low-index 1 color 0x0
|SCC 0x7079a06038 rooted in RunnableImplementor (0x6db6b821b0) has bridge 1
	loop stack: (1/1)
	member RunnableImplementor (0x6db6b821b0) index 1 low-index 1 color 0x0 state 2
+scanning MainActivity (0x6d21ff4280) index 9 color 0x0
-finishing MainActivity (0x6d21ff4280) index 9 low-index 9 color 0x0
-finished MainActivity (0x6d21ff4280) index 9 low-index 9 color 0x0
|SCC 0x7079a06060 rooted in MainActivity (0x6d21ff4280) has bridge 1
	loop stack: (9/9)
	member MainActivity (0x6d21ff4280) index 9 low-index 9 color 0x0 state 2
+scanning Looper (0x6d21ff42b8) index 10 color 0x0
-finishing Looper (0x6d21ff42b8) index 10 low-index 10 color 0x0
-finished Looper (0x6d21ff42b8) index 10 low-index 10 color 0x0
|SCC 0x7079a06088 rooted in Looper (0x6d21ff42b8) has bridge 1
	loop stack: (10/10)
	member Looper (0x6d21ff42b8) index 10 low-index 10 color 0x0 state 2
----summary----
bridges:
	ByteArrayOutputStream (0x6d21ff42f0) index 0 color 0x7079a06010
	RunnableImplementor (0x6db6b821b0) index 1 color 0x7079a06038
	MainActivity (0x6d21ff4280) index 9 color 0x7079a06060
	Looper (0x6d21ff42b8) index 10 color 0x7079a06088
colors after tarjan:
	0:  bridges: 0 
	0:  bridges: 1 
	0:  bridges: 9 
	0:  bridges: 10 
***** API *****
number of SCCs 4
TOTAL XREFS 0
---xrefs:

The key part of the log is that there are several cycles in the object graph (or strongly connected component / SCC in terms of how the graph is processed). The inner SCC is processed correctly and you can see that add color X to color_merge_array messages (Code) are produced that would later lead to producing xrefs in the final graph. These are eventually transferred to node xrefs here. However, once the algorithm analyzes the next SCC in the graph, it should add the collected xrefs there on this line. This never happened because this other SCC didn't have any bridge objects by itself, it had regular .NET object AND a previously processes SCC that pointed to a bridge object. Since the previously processes SCC was not considered a "bridge object", color data for the outer SCC were never allocated and then never merged with the already gathered xrefs, leading to an information loss here (color_data == NULL && dyn_array_ptr_size (&other->xrefs) > 0))...

@filipnavara
Copy link
Member Author

For those who are more visual and can bear my handscribbling:

image

Yellow are bridge objects. Blue is the first SCC in the graph, it itself points to a bridge object. Green is the second SCC graph. It points to no bridge objects by itself but it does contain the blue SCC which does. When processing the green SCC it erroneously decided that since there's no bridge objects it doesn't need to be colored and threw away the XREFs calculated from the blue SCC.

@simonrozsival
Copy link
Member

/cc @BrzVlad

@filipnavara filipnavara changed the title Fix an edge case in the Tarjan SCC that lead to losing xref information Fix an edge case in the Tarjan SCC that leads to losing xref information Feb 23, 2025
@filipnavara filipnavara changed the title Fix an edge case in the Tarjan SCC that leads to losing xref information Fix an edge case in the Tarjan GC bridge that leads to losing xref information Feb 23, 2025
@filipnavara filipnavara marked this pull request as ready for review February 25, 2025 08:26
@AaronRobinsonMSFT
Copy link
Member

@filipnavara I'm still going through this and trying to understand both the current SCC algorithm and the impact of the above. I'm also going to defer to @BrzVlad for the final sign off as he has more experience with this area than me.

@filipnavara
Copy link
Member Author

I'm still going through this and trying to understand both the current SCC algorithm and the impact of the above.

I'm more than happy to answer any questions in case it helps with this PR or the work on CoreCLR bridge.

I'm also going to defer to @BrzVlad for the final sign off as he has more experience with this area than me.

Sure. I think this bug was essentially cropping up many times over the years but no one really tracked it down... As a follow-up we can try to go down the history lane to see if it fixes the cases where people previously had to switch to the "new" bridge.

@AaronRobinsonMSFT
Copy link
Member

As a follow-up we can try to go down the history lane to see if it fixes the cases where people previously had to switch to the "new" bridge.

Ah, that history is interesting. If you could do that or get some indication that the other algorithms handle this correctly and now the tarjan one does too, that would help give us more confidence in this change.

@filipnavara
Copy link
Member Author

If you could do that or get some indication that the other algorithms handle this correctly and now the tarjan one does too...

The sample code above runs correctly under .NET 8/9 when switching to the "new" bridge. Likewise, it runs correctly after the fix but crashes reliably without it.

There's a mode of the runtime that runs both bridges at the same time and compares the results. I can re-run the sample under that mode (unless the code bit rotted too much since the mono/mono times).

@AaronRobinsonMSFT
Copy link
Member

There's a mode of the runtime that runs both bridges at the same time and compares the results. I can re-run the sample under that mode (unless the code bit rotted too much since the mono/mono times).

Oh wow. I didn't know that. If you could check that out please do. I'm also happy to review a PR if it has bit-rot and a few fixes are needed.

@filipnavara
Copy link
Member Author

filipnavara commented Feb 26, 2025

The good news is that the bridge comparison works in released MonoVM builds, so you can run adb shell setprop debug.mono.env "MONO_GC_DEBUG=bridge-compare-to=new" and get this:

.NET 8:

02-26 21:33:13.508 10744 10770 I me.androidgcbug: Explicit concurrent mark compact GC freed 60KB AllocSpace bytes, 0(0B) LOS objects, 49% free, 2012KB/4024KB, paused 127us,593us total 4.874ms
02-26 21:33:13.508 10744 10770 E me.androidgcbug: SCCS count expected 0 but got 1

.NET 9:

02-26 21:37:19.295 15936 15958 I me.androidgcbug: Explicit concurrent mark compact GC freed 1285KB AllocSpace bytes, 3(60KB) LOS objects, 49% free, 2003KB/4007KB, paused 102us,659us total 5.560ms
02-26 21:37:19.295 15936 15958 E me.androidgcbug: SCCS count expected 0 but got 1
02-26 21:37:19.295 15936 15958 F monodroid: SCCS count expected 0 but got 1
02-26 21:37:19.295 15936 15958 F monodroid: Abort at mono-log-adapter.cc:46:3 ('static void xamarin::android::internal::MonodroidRuntime::mono_log_handler(const char *, const char *, const char *, mono_bool, void *)')

However, I would expect it to bail with different values. I need to figure out the details. The Tarjan bridge needs be setup to run in precise mode for this to work, which is not the default.

UPD: With this patch applied it no longer crashes even with the bridge-compare-to mode.

@steveisok steveisok requested a review from lateralusX February 26, 2025 20:56
@filipnavara
Copy link
Member Author

filipnavara commented Feb 26, 2025

However, I would expect it to bail with different values. I need to figure out the details.

With dump_processor_state fixed and enabled (filipnavara@f7b8726) it actually shows the exact difference described above:

02-26 22:25:28.305 25037 25071 V me.androidgcbug: ------
02-26 22:25:28.305 25037 25071 V me.androidgcbug: SCCS 4
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 0:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7c601d0280(Looper)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 1:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7c601d02b8(MainActivity)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 2:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7c601d02f0(ByteArrayOutputStream)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 3:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7cf7331c10(RunnableImplementor)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: XREFS 0
02-26 22:25:28.305 25037 25071 V me.androidgcbug: -------
02-26 22:25:28.305 25037 25071 V me.androidgcbug: ------
02-26 22:25:28.305 25037 25071 V me.androidgcbug: SCCS 4
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 0:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7cf7331c10(RunnableImplementor)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 1:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7c601d02f0(ByteArrayOutputStream)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 2:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7c601d02b8(MainActivity)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	SCC 3:
02-26 22:25:28.305 25037 25071 V me.androidgcbug:  0x7c601d0280(Looper)
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 
02-26 22:25:28.305 25037 25071 V me.androidgcbug: XREFS 1
02-26 22:25:28.305 25037 25071 V me.androidgcbug: 	0 -> 1
02-26 22:25:28.305 25037 25071 V me.androidgcbug: -------

(top being the primary Tarjan bridge, bottom being the New bridge used for comparison)

There's a cut & paste error in the printf here so it says SCCS instead of XREFS.

@lateralusX
Copy link
Member

What does the DUMP_GRAPH look like with the fix, interested to compare before and after.

@filipnavara
Copy link
Member Author

What does the DUMP_GRAPH look like with the fix, interested to compare before and after.

-----------------
+scanning Looper (0x7cc25802b8) index 0 color 0x0
-finishing Looper (0x7cc25802b8) index 0 low-index 0 color 0x0
-finished Looper (0x7cc25802b8) index 0 low-index 0 color 0x0
|SCC 0x7fa32a6010 rooted in Looper (0x7cc25802b8) has bridge 1
	loop stack: (0/0)
	member Looper (0x7cc25802b8) index 0 low-index 0 color 0x0 state 2
+scanning ByteArrayOutputStream (0x7cc25802f0) index 1 color 0x0
-finishing ByteArrayOutputStream (0x7cc25802f0) index 1 low-index 1 color 0x0
-finished ByteArrayOutputStream (0x7cc25802f0) index 1 low-index 1 color 0x0
|SCC 0x7fa32a6038 rooted in ByteArrayOutputStream (0x7cc25802f0) has bridge 1
	loop stack: (1/1)
	member ByteArrayOutputStream (0x7cc25802f0) index 1 low-index 1 color 0x0 state 2
+scanning RunnableImplementor (0x7cf732de40) index 2 color 0x0
	= pushing 0x7cf73597b0 Action -> pushed!
+scanning Action (0x7cf73597b0) index 3 color 0x0
	= pushing 0x7cf7328a30 <>c__DisplayClass2_0 -> pushed!
+scanning <>c__DisplayClass2_0 (0x7cf7328a30) index 4 color 0x0
	= pushing 0x7cf73596b0 SendOrPostCallback -> alive
	= pushing 0x7cf7359bb0 Action -> pushed!
+scanning Action (0x7cf7359bb0) index 5 color 0x0
	= pushing 0x7cf732e7f0 AsyncStateMachineBox`1 -> pushed!
+scanning AsyncStateMachineBox`1 (0x7cf732e7f0) index 6 color 0x0
	= pushing 0x7cf7359bb0 Action -> already marked
	= pushing 0x7c6e00b160 ExecutionContext -> alive
	= pushing 0x7cf732e840 <AsyncStreamWriter>d__2 -> pushed!
+scanning <AsyncStreamWriter>d__2 (0x7cf732e840) index 7 color 0x0
	= pushing 0x7cf732e7f0 AsyncStateMachineBox`1 -> already marked
	= pushing 0x7cc25802f0 ByteArrayOutputStream -> already marked
	= pushing 0x7cf7328870 OutputStreamInvoker -> pushed!
	= pushing 0x7c6e004698 ReadWriteTask -> alive
+scanning OutputStreamInvoker (0x7cf7328870) index 8 color 0x0
	= pushing 0x7cf7354870 SemaphoreSlim -> pushed!
	= pushing 0x7cc25802f0 ByteArrayOutputStream -> already marked
+scanning SemaphoreSlim (0x7cf7354870) index 9 color 0x0
	= pushing 0x7cf733c1a8 StrongBox`1 -> opaque
-finishing SemaphoreSlim (0x7cf7354870) index 9 low-index 9 color 0x0
	compute low 0x7cf7354870 ->0x7cf733c1a8 (StrongBox`1) 0x0 (-2 / -2, color 0x0)
-finished SemaphoreSlim (0x7cf7354870) index 9 low-index 9 color 0x0
|SCC 0x0 rooted in SemaphoreSlim (0x7cf7354870) has bridge 0
	loop stack: (2/2)(3/3)(4/4)(5/5)(6/6)(7/7)(8/8)(9/9)
	member SemaphoreSlim (0x7cf7354870) index 9 low-index 9 color 0x0 state 2
-finishing OutputStreamInvoker (0x7cf7328870) index 8 low-index 8 color 0x0
	compute low 0x7cf7328870 ->0x7cf7354870 (SemaphoreSlim) 0x7fa32a41c8 (9 / 9, color 0x0)
	compute low 0x7cf7328870 ->0x7cc25802f0 (ByteArrayOutputStream) 0x7fa32a4048 (1 / 1, color 0x7fa32a6038)
		add color 0x7fa32a6038 to color_merge_array
-finished OutputStreamInvoker (0x7cf7328870) index 8 low-index 8 color 0x0
|SCC 0x7fa32a6038 rooted in OutputStreamInvoker (0x7cf7328870) has bridge 0
	loop stack: (2/2)(3/3)(4/4)(5/5)(6/6)(7/7)(8/8)
	member OutputStreamInvoker (0x7cf7328870) index 8 low-index 8 color 0x0 state 2
-finishing <AsyncStreamWriter>d__2 (0x7cf732e840) index 7 low-index 7 color 0x0
	compute low 0x7cf732e840 ->0x7cf732e7f0 (AsyncStateMachineBox`1) 0x7fa32a4138 (6 / 6, color 0x0)
	compute low 0x7cf732e840 ->0x7cc25802f0 (ByteArrayOutputStream) 0x7fa32a4048 (1 / 1, color 0x7fa32a6038)
		add color 0x7fa32a6038 to color_merge_array
	compute low 0x7cf732e840 ->0x7cf7328870 (OutputStreamInvoker) 0x7fa32a4198 (8 / 8, color 0x7fa32a6038)
	compute low 0x7cf732e840 ->0x7c6e004698 (ReadWriteTask) 0x0 (-2 / -2, color 0x0)
-finished <AsyncStreamWriter>d__2 (0x7cf732e840) index 7 low-index 6 color 0x0
-finishing AsyncStateMachineBox`1 (0x7cf732e7f0) index 6 low-index 6 color 0x0
	compute low 0x7cf732e7f0 ->0x7cf7359bb0 (Action) 0x7fa32a4108 (5 / 5, color 0x0)
	compute low 0x7cf732e7f0 ->0x7c6e00b160 (ExecutionContext) 0x0 (-2 / -2, color 0x0)
	compute low 0x7cf732e7f0 ->0x7cf732e840 (<AsyncStreamWriter>d__2) 0x7fa32a4168 (7 / 6, color 0x0)
-finished AsyncStateMachineBox`1 (0x7cf732e7f0) index 6 low-index 5 color 0x0
-finishing Action (0x7cf7359bb0) index 5 low-index 5 color 0x0
	compute low 0x7cf7359bb0 ->0x7cf732e7f0 (AsyncStateMachineBox`1) 0x7fa32a4138 (6 / 5, color 0x0)
-finished Action (0x7cf7359bb0) index 5 low-index 5 color 0x0
|SCC 0x7fa32a6060 rooted in Action (0x7cf7359bb0) has bridge 1
	loop stack: (2/2)(3/3)(4/4)(5/5)(6/5)(7/6)
	member <AsyncStreamWriter>d__2 (0x7cf732e840) index 7 low-index 6 color 0x0 state 2
	member AsyncStateMachineBox`1 (0x7cf732e7f0) index 6 low-index 5 color 0x0 state 2
	member Action (0x7cf7359bb0) index 5 low-index 5 color 0x0 state 2
-finishing <>c__DisplayClass2_0 (0x7cf7328a30) index 4 low-index 4 color 0x0
	compute low 0x7cf7328a30 ->0x7cf73596b0 (SendOrPostCallback) 0x0 (-2 / -2, color 0x0)
	compute low 0x7cf7328a30 ->0x7cf7359bb0 (Action) 0x7fa32a4108 (5 / 5, color 0x7fa32a6060)
		add color 0x7fa32a6060 to color_merge_array
-finished <>c__DisplayClass2_0 (0x7cf7328a30) index 4 low-index 4 color 0x0
|SCC 0x7fa32a6060 rooted in <>c__DisplayClass2_0 (0x7cf7328a30) has bridge 0
	loop stack: (2/2)(3/3)(4/4)
	member <>c__DisplayClass2_0 (0x7cf7328a30) index 4 low-index 4 color 0x0 state 2
-finishing Action (0x7cf73597b0) index 3 low-index 3 color 0x0
	compute low 0x7cf73597b0 ->0x7cf7328a30 (<>c__DisplayClass2_0) 0x7fa32a40d8 (4 / 4, color 0x7fa32a6060)
		add color 0x7fa32a6060 to color_merge_array
-finished Action (0x7cf73597b0) index 3 low-index 3 color 0x0
|SCC 0x7fa32a6060 rooted in Action (0x7cf73597b0) has bridge 0
	loop stack: (2/2)(3/3)
	member Action (0x7cf73597b0) index 3 low-index 3 color 0x0 state 2
-finishing RunnableImplementor (0x7cf732de40) index 2 low-index 2 color 0x0
	compute low 0x7cf732de40 ->0x7cf73597b0 (Action) 0x7fa32a40a8 (3 / 3, color 0x7fa32a6060)
		add color 0x7fa32a6060 to color_merge_array
-finished RunnableImplementor (0x7cf732de40) index 2 low-index 2 color 0x0
|SCC 0x7fa32a6088 rooted in RunnableImplementor (0x7cf732de40) has bridge 1
	loop stack: (2/2)
	member RunnableImplementor (0x7cf732de40) index 2 low-index 2 color 0x0 state 2
----summary----
bridges:
	Looper (0x7cc25802b8) index 0 color 0x7fa32a6010
	ByteArrayOutputStream (0x7cc25802f0) index 1 color 0x7fa32a6038
	RunnableImplementor (0x7cf732de40) index 2 color 0x7fa32a6088
colors after tarjan:
	0:  bridges: 0 
	0:  bridges: 1 
	0: 0x7fa32a6038 
	0: 0x7fa32a6060  bridges: 2 
***** API *****
number of SCCs 4
TOTAL XREFS 1
---xrefs:
	2 -> 1

Note that the order of the objects is not stable. Sometimes ByteArrayOutputStream is scanned first, sometimes it comes after RunnableImplementor. Either way it produces the XREFs now.

@AaronRobinsonMSFT
Copy link
Member

With dump_processor_state fixed and enabled (filipnavara@f7b8726) it actually shows the exact difference described above:

Thanks @filipnavara - your help here is appreciated more than you know. From my perspective, this assuages any concerns I have with this change. I'm still going to defer to someone with more experience in this code base though.

If you'd like to put up the fixed up code above - still commented out, feel free. I'm confident enough to sign off on that.

@filipnavara
Copy link
Member Author

I am playing with the idea of running the test cases as a standalone executable that includes unmodified sgen-tarjan-bridge.c:
https://github.com/filipnavara/runtime/tree/standalone-gc-bridge-test

The branch has a minimal example with the broken object graph scenario. It can run it on desktop and report back the output of the bridge code. With a bit of love we could probably clean it up and pair it with a fuzzer to track down other failures like #106410.

@BrzVlad
Copy link
Member

BrzVlad commented Mar 2, 2025

@filipnavara I agree this is a good opportunity to start testing the tarjan bridge more actively. This could easily be achieved by comparing it with the output produced by the new bridge. Are you aware of the MONO_GC_DEBUG="bridge=..." debug option ? It makes a certain type to be considered as a bridge so it is included in the bridge algorithm, without requiring any mono embedder. In mono/mono we used to have such tests, for example https://github.com/mono/mono/blob/main/mono/tests/sgen-bridge-pathologies.cs. What do you think of this approach ? Maybe you could add a test that shows the scenario that your fix addresses, based on this.

@filipnavara
Copy link
Member Author

filipnavara commented Mar 2, 2025

Are you aware of the MONO_GC_DEBUG="bridge=..." debug option ?

Yes, I am aware of it. I still prefer to test the bridge in complete isolation because it allows me to avoid hitting other GC behaviors (ie. conservative stack scan keeping objects alive), overhead, and to run it with specific graphs in a tight loop. It may, however, be good enough for testing specific test cases like the ones in sgen-bridge-pathologies.cs.

Maybe you could add a test that shows the scenario that your fix addresses, based on this.

I am open to this but I would prefer if this PR is not blocked on it.

@filipnavara
Copy link
Member Author

I still believe we should come up with better testing strategy but that's a separate concern from fixing the bug. I added the test case to sgen-bridge-pathologies.cs.

@BrzVlad
Copy link
Member

BrzVlad commented Mar 3, 2025

@filipnavara Sorry for adding confusion. sgen-bridge-pathologies.cs is not really a test, but rather an app showcasing various edge scenarios, where different bridge implementations can show different performance characteristics. It wasn't used as a test in mono/mono. Aside from that, we haven't really ported any bridge tests from mono. We should add a new runtime test for this. I'm planning next week to take a deeper dive into the tarjan bridge algorithm, since I don't remember much of its details, in order to provide a more relevant review for this PR. I can also take care of adding the runtime test.

@filipnavara
Copy link
Member Author

I'm planning next week to take a deeper dive into the tarjan bridge algorithm, since I don't remember much of its details, in order to provide a more relevant review for this PR.

Feel free to ask questions in case you hit something non-obvious. I'll be on Discord and GitHub.

Notably, neither this PR nor #113044 requires understanding all the details. Both are relatively isolated issues in the bridge code that don't modify the core of the algorithm. I highly recommend running the standalone sample in https://github.com/filipnavara/runtime/tree/standalone-gc-bridge-test under debugger to see the values that trigger the erroneous conditions.

@@ -748,7 +748,7 @@ create_scc (ScanData *data)

for (i = dyn_array_ptr_size (&loop_stack) - 1; i >= 0; --i) {
ScanData *other = (ScanData *)dyn_array_ptr_get (&loop_stack, i);
found_bridge |= other->is_bridge;
found_bridge |= other->is_bridge || dyn_array_ptr_size (&other->xrefs) > 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking something like this would be more clear with the intention of the change.

@@ -745,10 +745,16 @@ create_scc (ScanData *data)
        gboolean found = FALSE;
        gboolean found_bridge = FALSE;
        ColorData *color_data = NULL;
+       gboolean can_reduce_color = TRUE;
 
        for (i = dyn_array_ptr_size (&loop_stack) - 1; i >= 0; --i) {
                ScanData *other = (ScanData *)dyn_array_ptr_get (&loop_stack, i);
                found_bridge |= other->is_bridge;
+               if (dyn_array_ptr_size (&other->xrefs) > 0) {
+                       // This scc will have more xrefs than the ones from the color_merge_array,
+                       // we will need to create a new color to store this information.
+                       can_reduce_color = FALSE;
+               }
                if (found_bridge || other == data)
                        break;
        }
@@ -756,9 +762,12 @@ create_scc (ScanData *data)
        if (found_bridge) {
                color_data = new_color (TRUE);
                ++num_colors_with_bridges;
-       } else {
+       } else if (can_reduce_color) {
                color_data = reduce_color ();
+       } else {
+               color_data = new_color (FALSE);
        }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should work too and it seems reasonable.

* (ie. the task continuation is hooked through the SynchronizationContext
* implentation and rooted only by Android bridge objects).
*/
static void NestedCycles ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to have a starting point for some passing bridge tests (comparing between new and tarjan). I will add them in a separate PR, maybe also including this test, so we can start incrementally fixing and testing the tarjan bridge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-GC-mono community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HttpClient ObjectDisposed after SDK upgrade from 34.0.95 -> 34.0.113
6 participants