[NativeAOT] Simplifying access to thread static variables #84566
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details
Fixes: #84373
@@ -41,37 +39,18 @@ LEAF_END RhpStackProbe, _TEXT

LEAF_ENTRY RhpGetThreadStaticBaseForType, _TEXT
The "fast" access got a lot simpler here. Eventually we want RhpGetThreadStaticBaseForType not to be called at all, but inlined by the JIT directly into callers.
We could make this a bit simpler still by removing the indirection through the array of storage instances. That may be somewhat challenging, though.
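The fast path described in this comment boils down to two dependent loads: fetch the per-thread pointer to the array of storage instances, then index it by type. The following is only a minimal sketch of that shape. The names `Storage`, `ThreadStaticsBlock`, `t_block`, `InitThreadStatics` and `GetThreadStaticBaseForType` are hypothetical and do not reflect the runtime's actual layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one type's thread-static storage.
struct Storage { int value = 0; };

// Hypothetical per-thread block: a pointer to an array of storage pointers.
struct ThreadStaticsBlock {
    Storage** storages = nullptr;
    std::size_t count = 0;
};

thread_local ThreadStaticsBlock t_block;
thread_local std::vector<Storage> t_backing;   // owns the storage instances
thread_local std::vector<Storage*> t_array;    // the array the fast path indexes

// Illustrative setup: allocate `count` storage instances for this thread.
inline void InitThreadStatics(std::size_t count) {
    t_backing.assign(count, Storage{});
    t_array.clear();
    for (std::size_t i = 0; i < count; ++i) {
        t_array.push_back(&t_backing[i]);
    }
    t_block.storages = t_array.data();
    t_block.count = count;
}

// Fast path: load the per-thread array pointer, then load the indexed entry.
// The extra load through `storages` is the indirection the comment refers to.
inline Storage* GetThreadStaticBaseForType(std::uint32_t index) {
    assert(index < t_block.count);
    return t_block.storages[index];
}
```

Removing the indirection would mean placing the storage inline in the per-thread block, so that the second load disappears.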
Is there any technical reason why we don't have such an assembly helper for Arm64, apart from it simply not being implemented?
The helper for Arm64 is now implemented
Is there any technical reason why we don't have such an assembly helper for Arm64, apart from it simply not being implemented?
No reason. There is also no reason for the helper to be in assembly.
Are you thinking about doing it as part of this PR?
Yes. It would also make further changes easier: no need to fix multiple helpers.
What is the speedup from just this change?
I think the cost of the call is still there, so this might not be a huge improvement.
I see about a 15% speedup on x64. That is with both the old and new implementations still making a call to the helper. With a simple microbenchmark that calls a method accessing a bunch of thread statics in a loop I see:
The benchmark:
using System.Diagnostics;
using System.Runtime.CompilerServices;

namespace ConsoleApp22
{
    class C00 { [ThreadStatic] public static int t_i00; }
    class C01 { [ThreadStatic] public static int t_i01; }
    class C02 { [ThreadStatic] public static int t_i02; }
    class C03 { [ThreadStatic] public static int t_i03; }
    class C04 { [ThreadStatic] public static int t_i04; }
    class C05 { [ThreadStatic] public static int t_i05; }
    class C06 { [ThreadStatic] public static int t_i06; }
    class C07 { [ThreadStatic] public static int t_i07; }
    class C08 { [ThreadStatic] public static int t_i08; }
    class C09 { [ThreadStatic] public static int t_i09; }
    class C10 { [ThreadStatic] public static int t_i10; }
    class C11 { [ThreadStatic] public static int t_i11; }
    class C12 { [ThreadStatic] public static int t_i12; }
    class C13 { [ThreadStatic] public static int t_i13; }
    class C14 { [ThreadStatic] public static int t_i14; }
    class C15 { [ThreadStatic] public static int t_i15; }
    class C16 { [ThreadStatic] public static int t_i16; }
    class C17 { [ThreadStatic] public static int t_i17; }
    class C18 { [ThreadStatic] public static int t_i18; }
    class C19 { [ThreadStatic] public static int t_i19; }
    class C20 { [ThreadStatic] public static int t_i20; }
    class C21 { [ThreadStatic] public static int t_i21; }
    class C22 { [ThreadStatic] public static int t_i22; }
    class C23 { [ThreadStatic] public static int t_i23; }
    class C24 { [ThreadStatic] public static int t_i24; }
    class C25 { [ThreadStatic] public static int t_i25; }
    class C26 { [ThreadStatic] public static int t_i26; }
    class C27 { [ThreadStatic] public static int t_i27; }
    class C28 { [ThreadStatic] public static int t_i28; }
    class C29 { [ThreadStatic] public static int t_i29; }

    internal class Program
    {
        const int iters = 1000000;

        static void Main(string[] args)
        {
            for (; ; )
            {
                Time(AccessTLS);
            }
        }

        static void Time(Action a)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                a();
            }
            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void AccessTLS()
        {
            for (int i = 0; i < iters; i++)
            {
                OneTLSAccess();
            }
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void OneTLSAccess()
        {
            C00.t_i00 = C00.t_i00 + C01.t_i01 + C02.t_i02 + C03.t_i03 + C04.t_i04 + C05.t_i05 + C06.t_i06 + C07.t_i07 +
                        C08.t_i08 + C09.t_i09 + C10.t_i10 + C11.t_i11 + C12.t_i12 + C13.t_i13 + C14.t_i14 + C15.t_i15 +
                        C16.t_i16 + C17.t_i17 + C18.t_i18 + C19.t_i19 + C20.t_i20 + C21.t_i21 + C22.t_i22 + C23.t_i23 +
                        C24.t_i24 + C25.t_i25 + C26.t_i26 + C27.t_i27 + C28.t_i28 + C29.t_i29;
        }
    }
}
The codegen looks like the following. Before the changes:
After the changes:
TypeManager* pSingleTypeManager = GetRuntimeInstance()->GetSingleTypeManager();
if (pSingleTypeManager != NULL)
{
    InitInlineThreadStatics(pSingleTypeManager);
This will make thread attach a potential source of unhandled OOM that leads to fail fast. This fail fast is impossible for user code to catch or recover from. I am not sure whether it is a good trade-off to make for native AOT. It will make native AOT less suitable for systems-programming-like tasks where uncatchable OOMs are a problem.
One possible solution is to ignore the OOM on thread attach, leave the storage as NULL, and turn the AV on the first use of the thread static into an OOM exception. That could be inconvenient, though, when the access is JIT-inlined.
Or we can do one null check for the whole thing and call the initializer on the first use.
I think I will switch this to an "allocate on first use" pattern. In that case the OOM is most likely to happen while creating the managed Thread, either directly or indirectly by accessing the managed thread ID.
It would still be possible to access a random thread static and get an unexpected OOM, but I think it would always be catchable.
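The "allocate on first use" pattern discussed here can be sketched as a single null check on the fast path plus a slow path that allocates and can fail with an ordinary, catchable exception. This is a hedged illustration only: `InlineThreadStatics`, `t_inlineStatics`, `InitInlineThreadStaticsOnFirstUse` and `GetInlineThreadStatics` are hypothetical names, and `std::bad_alloc` merely stands in for a managed OutOfMemoryException.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical per-thread storage for all inline thread statics.
struct InlineThreadStatics {
    std::vector<int> slots;
};

// Null until this thread first touches a thread static.
thread_local std::unique_ptr<InlineThreadStatics> t_inlineStatics;

// Slow path, reached at most once per thread on success. If allocation fails,
// std::bad_alloc propagates to the caller and the pointer stays null, so a
// later access simply retries instead of failing fast at thread attach.
InlineThreadStatics* InitInlineThreadStaticsOnFirstUse(std::size_t slotCount) {
    auto fresh = std::make_unique<InlineThreadStatics>();
    fresh->slots.assign(slotCount, 0);
    t_inlineStatics = std::move(fresh);
    return t_inlineStatics.get();
}

// Fast path: one null check for the whole block, then use the storage.
inline InlineThreadStatics* GetInlineThreadStatics(std::size_t slotCount) {
    InlineThreadStatics* p = t_inlineStatics.get();
    if (p == nullptr) {
        p = InitInlineThreadStaticsOnFirstUse(slotCount);
    }
    return p;
}
```

The trade-off is an extra compare-and-branch on every access in exchange for moving the allocation failure to a point where user code can handle it.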
@MichalStrehovsky The codegen for the helper on Linux is indeed not very good, unlike on Windows, where it seems better than the hand-written assembly.
System.Collections.Tests`::RhpGetThreadStaticBaseForType(uint32_t):
    0x55555542c040 <+0>:  push rbp
    0x55555542c041 <+1>:  mov rbp, rsp
    0x55555542c044 <+4>:  push rbx
    0x55555542c045 <+5>:  push rax
    0x55555542c046 <+6>:  mov ebx, edi
->  0x55555542c048 <+8>:  mov rax, qword ptr fs:[0x0]
    0x55555542c051 <+17>: lea rax, [rax - 0xf0]
    0x55555542c058 <+24>: mov rax, qword ptr [rax + 0x98]
    0x55555542c05f <+31>: add ebx, 0x2
    0x55555542c062 <+34>: mov rax, qword ptr [rax + 8*rbx]
    0x55555542c066 <+38>: add rsp, 0x8
    0x55555542c06a <+42>: pop rbx
    0x55555542c06b <+43>: pop rbp
    0x55555542c06c <+44>: ret
This looks pretty bad. I wonder what confuses the compiler.
Are we forcing frame pointers somehow?
The frame is there on Linux because this is generated as a call. It gets turned into a mov by linker magic during linking, but the linker magic is not able to mop up the code around it. The linker magic can only be done when we produce an executable, not when we produce a shared library. This was a good article on Linux TLS I read some time ago: https://maskray.me/blog/2021-02-14-all-about-thread-local-storage
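To make the difference between a TLS access compiled as a call and one compiled as a direct segment-relative load more concrete, here is a small sketch using the GCC/Clang `tls_model` attribute. It is an illustration of the general mechanism under stated assumptions, not what this PR does: the variable and function names are hypothetical, and the actual instructions emitted depend on the compiler, flags such as `-fPIC`, and the target.

```cpp
#include <cstdint>

// The tls_model attribute is a GCC/Clang extension; elsewhere it expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define TLS_INITIAL_EXEC __attribute__((tls_model("initial-exec")))
#else
#define TLS_INITIAL_EXEC
#endif

// Default model: in position-independent code the compiler may have to emit a
// call (general dynamic), which the linker can relax only in some link modes.
thread_local std::uint64_t t_defaultCounter = 0;

// Explicit initial-exec model: asks the compiler for a thread-pointer-relative
// access up front, at the cost of restrictions on dynamic loading.
thread_local std::uint64_t t_initialExecCounter TLS_INITIAL_EXEC = 0;

std::uint64_t BumpDefault() { return ++t_defaultCounter; }

std::uint64_t BumpInitialExec() { return ++t_initialExecCounter; }
```

Comparing the disassembly of the two functions when built with `-fPIC -O2` is one way to see whether the access is a call or a plain load on a given toolchain.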
Call to |
Right, we do disable frame pointer optimizations and that is likely causing this kind of code: runtime/eng/native/configurecompiler.cmake Line 407 in 0ac097f
The reason is "to make it easier to profile". Perhaps CoreCLR has other reasons for requiring frame pointers (but why on Unix only?)
Profiling tools on Linux want to have an RBP chain. We can double-check whether that is still the case; I would expect it to be.
We do that on Windows too (look for |
Fedora had a big discussion about it last year (didn't follow where that went but it is likely still an issue if they were having heated discussions about it): https://www.phoronix.com/news/Fedora-37-No-Omit-Frame-Pointer
Not suppressing that optimization results in as good codegen as there could be:
System.Collections.Tests`::RhpGetThreadStaticBaseForType(uint32_t):
    0x55555542be20 <+0>:  push rbx
    0x55555542be21 <+1>:  mov ebx, edi
->  0x55555542be23 <+3>:  mov rax, qword ptr fs:[0x0]
    0x55555542be2c <+12>: lea rax, [rax - 0xf0]
    0x55555542be33 <+19>: mov rax, qword ptr [rax + 0x98]
    0x55555542be3a <+26>: add ebx, 0x2
    0x55555542be3d <+29>: mov rax, qword ptr [rax + 8*rbx]
    0x55555542be41 <+33>: pop rbx
    0x55555542be42 <+34>: ret
perf lets you choose between DWARF, FP chain (a bit more lightweight), or LBR (very lightweight, but Intel-specific).
This would probably be the best option, if possible.
Thinking about this - does RyuJIT support the equivalent of |
On platforms that can use frame chains (all platforms except Win x64), RyuJIT is smart about adding the frame to more complex methods only, so that the trivial methods do not pay for the frame. Also, it does not allocate the frame pointer register so that methods with omitted frame do not break the frame chain. For the frame chain walking profiler, it looks as if there was more method inlining - some methods are omitted in the frame chain, but the frame chain still gives you pretty accurate picture of the callstack. Look for Also, there is |
GCC has multiple ways to do this (attributes, pragmas), but clang has none. There is also |
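Assuming the point here is per-function control over frame pointers, one of the GCC mechanisms is the `optimize` function attribute. The sketch below is illustrative only: the macro `OMIT_FRAME_POINTER`, the array `t_storages` and the function `GetStorage` are hypothetical, the attribute is guarded so that it expands to nothing on Clang and other compilers, and GCC's documentation discourages relying on `optimize` in production builds.

```cpp
#include <cstdint>

// GCC-only: request -fomit-frame-pointer for a single function.
// Clang does not implement the `optimize` attribute, so the macro is empty there.
#if defined(__GNUC__) && !defined(__clang__)
#define OMIT_FRAME_POINTER __attribute__((optimize("omit-frame-pointer")))
#else
#define OMIT_FRAME_POINTER
#endif

// Hypothetical per-thread table standing in for the array of storage instances.
thread_local void* t_storages[32];

// A small leaf helper that would otherwise pay for the frame setup when the
// build disables frame pointer omission globally.
OMIT_FRAME_POINTER void* GetStorage(std::uint32_t index) {
    return t_storages[index];
}
```

Whether the frame is actually dropped can only be confirmed by inspecting the generated code for the specific compiler version and flags.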
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
NativeAOT OSX failures are #85600
Mono failures appear to be unrelated: either device connection, build, or test failures that happen in a no-op test run as well. #85694
Fixes: #84373