
[NativeAOT] Optimize the access pattern to threadstatics. #84373

Closed
VSadov opened this issue Apr 5, 2023 · 12 comments · Fixed by #84566

Comments

VSadov (Member) commented Apr 5, 2023

Prior to implementing JIT inlining of threadstatic access, it makes sense to simplify the access pattern.

There could be some opportunities. For example, we may not need the per-module indirection.
Re: #82973 (comment)
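
For context, a minimal illustration of the kind of access being discussed (the class and field names are made up, and the lowering is simplified):

    using System;

    // Illustrative only: every access to a [ThreadStatic] field is lowered to a
    // lookup of the per-thread, per-type storage (via a helper such as
    // GetThreadStaticBaseForType, discussed later in this thread) before the
    // field itself can be read or written.
    internal static class RequestContext
    {
        [ThreadStatic]
        private static int t_requestCount;

        internal static void Bump() => t_requestCount++;
    }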

VSadov self-assigned this Apr 5, 2023
ghost added the untriaged (New issue has not been triaged by the area owner) label Apr 5, 2023
ghost commented Apr 5, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr
Milestone: -

VSadov (Member, Author) commented Apr 6, 2023

According to @MichalStrehovsky, the multimodule scenario is not commonly used, but it is functional (there are tests), so it may make sense not to break it, if possible.

jkotas (Member) commented Apr 7, 2023

The multimodule scenario leaves performance on the table in general. It is ok to use a slow helper for it if we want to keep it working.

We should focus on optimizing the thread-static access pattern for the single-module case, with the assumption that the JIT can help by inlining it. How far do you think it makes sense to push it? Should we push it to be as efficient as thread statics in C where possible (non-GC non-generic statics)?

VSadov (Member, Author) commented Apr 7, 2023

Should we push it to be as efficient as thread statics in C where possible (non-GC non-generic statics)?

C threadstatic can be very efficient. Here is the access to the current thread in the native code:

    Thread * pThread = ThreadStore::GetCurrentThread();
00007FF7BAB06FD2  mov         edx,dword ptr [_tls_index (07FF7BACB2B48h)]  
00007FF7BAB06FD8  mov         rax,qword ptr gs:[58h]  
00007FF7BAB06FE1  mov         ebx,10h  
00007FF7BAB06FE6  add         rbx,qword ptr [rax+rdx*8]  

This is basically getting the native thread static tls_CurrentThread. There is nothing dynamic about tls_CurrentThread, apart from threads being created/destroyed dynamically; the indirection off gs handles that.
(on other platforms it will be x18, TPIDR_EL0 or similar)

I think we can have this for unmanaged threadstatics only if we emit unmanaged threadstatics in the same way as C and let the C runtime manage them. It would be attractive for threadstatics such as the managed thread ID.
I have no idea if that is feasible; it is beyond my understanding of how these things are described in the obj files and presented to the native linker.

Managed and dynamic threadstatics will need something similar to what we have now, but I think we can reduce the number of indirections. If the above is too hard or fragile, then the unmanaged stuff could be on the same plan as the managed.

VSadov (Member, Author) commented Apr 7, 2023

For the current scheme, I think allocating a storage instance per type is a bit of a waste, at least in the nondynamic case. Allocating just one instance per thread that provides storage for all nondynamic threadstatics could work better.
I think it is uncommon for types to have really a lot of threadstatics, so laziness in allocating storage per type is unlikely to save us any space; it just adds an extra bounds check, null check, and indirection.

Looking through the use of threadstatics in the libraries tests, the per-type array can get up to 512 bytes - that is, up to 64 references to per-type storages.

I think the "combo" instance will rarely be larger than 1K.
Even 4K per-thread would not be that big of a deal. And that would be for apps that really indulge in threadstatics.
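
A rough sketch of that idea, assuming one per-thread object holds all non-dynamic threadstatic storage at offsets known at compile time (all names here are hypothetical):

    // Hypothetical sketch: instead of a per-thread array of per-type storage
    // objects, each thread gets a single "combo" instance sized to hold every
    // non-dynamic threadstatic; each type's slice starts at a build-time-known offset.
    internal sealed class ThreadStaticsCombo
    {
        // Backing block for all non-dynamic threadstatics of this thread.
        // A real implementation would need a GC-aware layout for reference
        // fields; a byte[] is used here only to illustrate the shape.
        internal readonly byte[] Data;

        internal ThreadStaticsCombo(int totalSize) => Data = new byte[totalSize];
    }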

VSadov (Member, Author) commented Apr 7, 2023

In a first approximation, I'd have two slots on the native thread instance:

  • one "static" slot for a combo instance that covers nondynamic module0 threadstatics. Not sure if it needs to be lazy. Perhaps we can just set it up right away on thread attach.
    The JIT would inline the access to this. There is nothing to check or call if we allocate the store eagerly.
    It will be a sequence like in C++, with one additional indirection to get the ref to the per-type store.

  • and another "dynamic" slot that would be on the same plan as it is now - to handle multiple modules and dynamic types. Both seem to be very rare scenarios. This can be as lazy as what we have now. Access to this may not be inlined by the JIT.

The tricky part about the "combo" instance is that GetThreadStaticBaseForType would need to be changed to return a ref to the per-type store.
The location could be computed from typeTlsIndex, since we know the per-type store sizes, but it would be more efficient for the non-dynamic case to operate in terms of something like typeTlsOffset.
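
A sketch of what that could look like (typeTlsOffset and the storage field are hypothetical names, and [ThreadStatic] stands in for the slot on the native thread instance):

    using System;

    internal static class ThreadStaticsHelpers
    {
        // Hypothetical per-thread combo storage, set up eagerly on thread attach,
        // so the fast path has nothing to null-check and nothing to call.
        [ThreadStatic]
        private static byte[]? t_comboStorage;

        // Returns a ref into the combo instance at a build-time-known byte offset,
        // rather than returning a per-type storage object.
        internal static ref byte GetThreadStaticBaseForType(int typeTlsOffset)
            => ref t_comboStorage![typeTlsOffset];
    }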

VSadov (Member, Author) commented Apr 7, 2023

CC: @kunalspathak

VSadov (Member, Author) commented Apr 7, 2023

There is nothing to check or call, if we allocate the store eagerly.

Actually, the code does not know statically whether it is in module0, so we will need one check - either that the module number == 0, or whether there are multiple modules.

jkotas (Member) commented Apr 7, 2023

The compiler knows whether it is emitting multi-module code (--multifile switch).

VSadov (Member, Author) commented Apr 7, 2023

The compiler knows whether it is emitting multi-module code (--multifile switch).

That is good! Then we can tell the JIT to just call the helper.

VSadov (Member, Author) commented Apr 7, 2023

Note that we have two helpers:

  • GetThreadStaticBaseForType (C#)
  • RhpGetThreadStaticBaseForType (asm) - only called in the nongeneric case.

The reason we have RhpGetThreadStaticBaseForType is really just to do INLINE_GETTHREAD, so we do not need to make yet another call to fetch the native TLS.

My plan is to simplify the storage where we can, which should result in a simpler asm helper for the fast path.

As the next step, @kunalspathak will teach the JIT to inline that piece of asm, so that RhpGetThreadStaticBaseForType is no longer needed.
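
In C# terms, the work the asm fast-path helper does today (and that the JIT would eventually inline) is roughly the following; the storage field is a hypothetical stand-in for the per-thread state reachable from the native TLS slot via INLINE_GETTHREAD:

    using System;

    internal static class ThreadStaticsFastPath
    {
        // Stand-in for the per-thread table of per-type storages; one element per
        // type that has non-dynamic threadstatics.
        [ThreadStatic]
        private static object[]? t_threadStaticStorages;

        // Rough shape of the non-generic fast path: fetch the per-thread table and
        // index it by the type's TLS index to get that type's storage object.
        internal static object GetThreadStaticBaseForType(int typeTlsIndex)
            => t_threadStaticStorages![typeTlsIndex];
    }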

ghost added the in-pr (There is an active PR which will close this issue when it is merged) label Apr 10, 2023
VSadov (Member, Author) commented Apr 10, 2023

An initial implementation of simpler access to threadstatics - #84566

It separates the fast "inlineable" case from the multimodule/dynamic case and preinitializes the storage, so that the asm helper does not need to check for anything or call anything.

It does not yet introduce a "combo" instance. It looks like that could require some changes in how the JIT deals with threadstatics - i.e. the helper would return a ref, not an object.
Not sure yet how disruptive such a change could be.

agocke added this to AppModel Apr 20, 2023
VSadov added a commit that referenced this issue May 4, 2023
Fixes: #84373

- [x] separate "fast" inlinable case (used for single-module, non-dynamic cases, when optimizing)
- [x] make the storage for fast threadstatics a single "combo" instance instead of an array of instances.
ghost removed the in-pr and untriaged labels May 4, 2023
ghost locked as resolved and limited conversation to collaborators Jun 3, 2023