
[NativeAOT] Optimize the access pattern to threadstatics. #84373

Closed
VSadov opened this issue Apr 5, 2023 · 12 comments · Fixed by #84566

Comments

VSadov (Member) commented Apr 5, 2023

Prior to implementing JIT inlining of threadstatic access, it makes sense to simplify the access pattern.

There could be some opportunities. For example, we may not need the per-module indirection.
Re: #82973 (comment)
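
For context, a minimal illustration of the kind of access being discussed (the class and field names are made up, and the lowering is simplified):

    using System;

    // Illustrative only: every access to a [ThreadStatic] field is lowered to a
    // lookup of the per-thread, per-type storage (via a helper such as
    // GetThreadStaticBaseForType, discussed later in this thread) before the
    // field itself can be read or written.
    internal static class RequestContext
    {
        [ThreadStatic]
        private static int t_requestCount;

        internal static void Bump() => t_requestCount++;
    }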

VSadov self-assigned this Apr 5, 2023
ghost added the untriaged (New issue has not been triaged by the area owner) label Apr 5, 2023
ghost commented Apr 5, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr
Milestone: -

VSadov (Member, Author) commented Apr 6, 2023

According to @MichalStrehovsky, the multimodule scenario is not commonly used, but it is functional (there are tests), so it may make sense not to break it, if possible.

jkotas (Member) commented Apr 7, 2023

The multimodule scenario leaves performance on the table in general. It is ok to use a slow helper for it if we want to keep it working.

We should focus on optimizing the thread-static access pattern for the single-module case, with the assumption that the JIT can help by inlining it. How far do you think it makes sense to push it? Should we push it to be as efficient as thread statics in C where possible (non-GC non-generic statics)?

VSadov (Member, Author) commented Apr 7, 2023

Should we push it to be as efficient as thread statics in C where possible (non-GC non-generic statics)?

C threadstatic can be very efficient. Here is the access to the current thread in the native code:

    Thread * pThread = ThreadStore::GetCurrentThread();
00007FF7BAB06FD2  mov         edx,dword ptr [_tls_index (07FF7BACB2B48h)]  
00007FF7BAB06FD8  mov         rax,qword ptr gs:[58h]  
00007FF7BAB06FE1  mov         ebx,10h  
00007FF7BAB06FE6  add         rbx,qword ptr [rax+rdx*8]  

This is basically getting the native thread static tls_CurrentThread. There is nothing dynamic about tls_CurrentThread, apart from threads being created/destroyed dynamically; the indirection off gs handles that.
(on other platforms it will be x18, TPIDR_EL0 or similar)

I think we can have this for unmanaged threadstatics only if we emit unmanaged threadstatics in the same way as C and let the C runtime manage them. It would be attractive for threadstatics such as the managed thread ID.
I have no idea if that is feasible; it is beyond my understanding of how these things are described in the obj files and presented to the native linker.

Managed and dynamic threadstatics will need something similar to what we have now, but I think we can reduce the number of indirections. If the above is too hard or fragile, then the unmanaged stuff could be on the same plan as the managed.

VSadov (Member, Author) commented Apr 7, 2023

For the current scheme, I think allocating a storage instance per type is a bit of a waste, at least in the nondynamic case. Allocating just one instance per thread that provides storage for all nondynamic threadstatics could work better.
I think it is uncommon for types to have really a lot of threadstatics, so laziness in allocating storage per type is unlikely to save us any space; it just adds an extra bounds check, null check, and indirection.

Looking through the use of threadstatics in the libraries tests, the per-type array can get up to 512 bytes - that is, up to 64 references to per-type storages.

I think the "combo" instance will rarely be larger than 1K.
Even 4K per-thread would not be that big of a deal. And that would be for apps that really indulge in threadstatics.
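
A rough sketch of that idea, assuming one per-thread object holds all non-dynamic threadstatic storage at offsets known at compile time (all names here are hypothetical):

    // Hypothetical sketch: instead of a per-thread array of per-type storage
    // objects, each thread gets a single "combo" instance sized to hold every
    // non-dynamic threadstatic; each type's slice starts at a build-time-known offset.
    internal sealed class ThreadStaticsCombo
    {
        // Backing block for all non-dynamic threadstatics of this thread.
        // A real implementation would need a GC-aware layout for reference
        // fields; a byte[] is used here only to illustrate the shape.
        internal readonly byte[] Data;

        internal ThreadStaticsCombo(int totalSize) => Data = new byte[totalSize];
    }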

VSadov (Member, Author) commented Apr 7, 2023

In a first approximation, I'd have two slots on the native thread instance:

  • one "static" slot for a combo instance that covers nondynamic module0 threadstatics. Not sure if it needs to be lazy. Perhaps we can just set it up right away on thread attach.
    The JIT would inline the access to this. There is nothing to check or call if we allocate the store eagerly.
    It will be a sequence like in C++, with one additional indirection to get the ref to the per-type store.

  • and another "dynamic" slot that would be on the same plan as it is now - to handle multiple modules and dynamic types. Both seem to be very rare scenarios. This can be as lazy as what we have now. Access to this may not be inlined by the JIT.

The tricky part about the "combo" instance is that GetThreadStaticBaseForType would need to be changed to return a ref to the per-type store.
The location could be computed from typeTlsIndex, since we know the per-type store sizes, but it would be more efficient for the non-dynamic case to operate in terms of something like typeTlsOffset.
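
A sketch of what that could look like (typeTlsOffset and the storage field are hypothetical names, and [ThreadStatic] stands in for the slot on the native thread instance):

    using System;

    internal static class ThreadStaticsHelpers
    {
        // Hypothetical per-thread combo storage, set up eagerly on thread attach,
        // so the fast path has nothing to null-check and nothing to call.
        [ThreadStatic]
        private static byte[]? t_comboStorage;

        // Returns a ref into the combo instance at a build-time-known byte offset,
        // rather than returning a per-type storage object.
        internal static ref byte GetThreadStaticBaseForType(int typeTlsOffset)
            => ref t_comboStorage![typeTlsOffset];
    }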

VSadov (Member, Author) commented Apr 7, 2023

CC: @kunalspathak

VSadov (Member, Author) commented Apr 7, 2023

There is nothing to check or call, if we allocate the store eagerly.

Actually, the code does not know statically whether it is in module0, so we will need one check - either that the module number == 0, or whether there are multiple modules.

jkotas (Member) commented Apr 7, 2023

The compiler knows whether it is emitting multi-module code (--multifile switch).

VSadov (Member, Author) commented Apr 7, 2023

The compiler knows whether it is emitting multi-module code (--multifile switch).

That is good! Then we can tell the JIT to just call the helper.

VSadov (Member, Author) commented Apr 7, 2023

Note that we have two helpers:

  • GetThreadStaticBaseForType (C#)
  • RhpGetThreadStaticBaseForType (asm) - only called in the nongeneric case.

The reason we have RhpGetThreadStaticBaseForType is really just to do INLINE_GETTHREAD, so we do not need to make yet another call to fetch the native TLS.

My plan is to simplify the storage where we can, which should result in a simpler asm helper for the fast path.

As the next step, @kunalspathak will teach the JIT to inline that piece of asm, so that RhpGetThreadStaticBaseForType is no longer needed.
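
In C# terms, the work the asm fast-path helper does today (and that the JIT would eventually inline) is roughly the following; the storage field is a hypothetical stand-in for the per-thread state reachable from the native TLS slot via INLINE_GETTHREAD:

    using System;

    internal static class ThreadStaticsFastPath
    {
        // Stand-in for the per-thread table of per-type storages; one element per
        // type that has non-dynamic threadstatics.
        [ThreadStatic]
        private static object[]? t_threadStaticStorages;

        // Rough shape of the non-generic fast path: fetch the per-thread table and
        // index it by the type's TLS index to get that type's storage object.
        internal static object GetThreadStaticBaseForType(int typeTlsIndex)
            => t_threadStaticStorages![typeTlsIndex];
    }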

ghost added the in-pr (There is an active PR which will close this issue when it is merged) label Apr 10, 2023
VSadov (Member, Author) commented Apr 10, 2023

An initial implementation of simpler access to threadstatics - #84566

It separates the fast "inlineable" case from the multimodule/dynamic case and preinitializes the storage, so that the asm helper does not need to check for anything or call anything.

It does not yet introduce a "combo" instance. It looks like that could require some changes in how the JIT deals with threadstatics - i.e. the helper would return a ref, not an object.
Not sure yet how disruptive such a change could be.

agocke added this to AppModel Apr 20, 2023
VSadov added a commit that referenced this issue May 4, 2023
Fixes: #84373

- [x] separate "fast" inlinable case (used for single-module, non-dynamic cases, when optimizing)
- [x] make the storage for fast threadstatics a single "combo" instance instead of an array of instances.
ghost removed the in-pr and untriaged labels May 4, 2023
ghost locked as resolved and limited conversation to collaborators Jun 3, 2023