Thread-local storage #52
Introduction
We need thread-local memory to provide efficient thread-local storage for Mu programs.
Existing uses
Implementation
Dedicated register
The most efficient way to implement thread-local storage is to reserve a register that always points to a memory region dedicated to the current thread. Every thread-local memory location lies within that region and can be addressed by an offset relative to that register. For example, if we use the FS register, the declarations

    thread_local int a;
    thread_local long b;
    thread_local void* c;

can be implemented as (in C-like pseudo code):

    uintptr_t FS = (uintptr_t)malloc(24); // big enough for a, b, c
    int* a_ptr = (int*)(FS + 0);
    long* b_ptr = (long*)(FS + 8);
    void** c_ptr = (void**)(FS + 16);

The thread-local memory is allocated (probably in the heap) and the value of the register is set to its address.
Different ABIs use different registers. x86-64 Unix uses FS. Some ARM ABIs use a CP15 control register, while others use the r9 general-purpose register.
System (C) dynamic loaders need to relocate the offsets of thread-local variables when shared objects are dynamically loaded, because shared objects can be loaded in different orders. They must also allocate the additional thread-local storage for existing threads in some reserved space, but the room for this flexibility may be limited.
Global map of threadID -> pointer
Alternatively, thread-local storage can be implemented as a global map from thread ID to pointer. POSIX threads (Pthreads) also support such key-value style thread-local storage.
Other approaches
A (fixed) address region can be mapped for each thread to its thread-specific region. This needs the cooperation of the operating system (mmap per thread).
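As an illustration of the key-value style mentioned above, here is a minimal sketch using the standard Pthreads calls `pthread_key_create`, `pthread_getspecific` and `pthread_setspecific`. The helper names (`tls_init`, `my_counter`, `run_worker`) are made up for this example and are not part of Mu or of any proposal in this issue.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t counter_key;   /* one key, shared by all threads */

/* Destructor: called at thread exit for each thread whose value is non-NULL. */
static void free_counter(void *p) { free(p); }

/* Create the key once, before any thread uses it. Returns 0 on success. */
int tls_init(void) { return pthread_key_create(&counter_key, free_counter); }

/* Return the calling thread's private counter, allocating it on first use. */
long *my_counter(void) {
    long *c = pthread_getspecific(counter_key);
    if (c == NULL) {
        c = calloc(1, sizeof *c);
        pthread_setspecific(counter_key, c);
    }
    return c;
}

/* Each worker increments only its own counter and reports the final value. */
static void *worker(void *arg) {
    long n = (long)(intptr_t)arg;
    for (long i = 0; i < n; i++)
        (*my_counter())++;
    return (void *)(intptr_t)*my_counter();
}

/* Hypothetical helper: run one worker on a fresh thread and return its count. */
long run_worker(long n) {
    pthread_t t;
    void *result = NULL;
    pthread_create(&t, NULL, worker, (void *)(intptr_t)n);
    pthread_join(t, &result);
    return (long)(intptr_t)result;
}
```

Every access goes through a function call and a per-thread lookup, which is the cost the dedicated-register approach avoids.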
Proposal 1 (C style)
Changes to the Mu IR and the Mu API
Now the Mu memory consists of the heap, the stacks, the global memory and the thread-local memory. The allocation units in the thread-local memory are called thread-local cells.
The new top-level definition ".threadlocal" defines a thread-local cell image.
After a bundle that contains thread-local cell images is loaded, each newly created thread gets one thread-local cell allocated for each image. The lifetime of a thread-local cell starts at the creation of its thread and ends when the thread dies. If a new bundle is loaded after a thread is created, no new thread-local cells are created for that thread, even if the new bundle contains new thread-local cell images.
The new instruction THREADLOCAL gets an iref of a given thread-local cell of the current thread.
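The allocation rule above can be sketched as a small C simulation. Everything here (`define_threadlocal`, `sim_thread_create`, `threadlocal`, the `SimThread` struct) is a hypothetical stand-in invented for illustration; it is not Mu IR or the Mu API, and it only models the rule that offsets are fixed at definition time and that threads created earlier do not see cells defined later.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical registry of thread-local cell images (not a Mu API). */
#define MAX_CELLS 16
static size_t cell_offset[MAX_CELLS];
static size_t cell_size[MAX_CELLS];
static size_t n_cells = 0;
static size_t block_size = 0;   /* bytes needed by all images defined so far */

/* Define a cell image. Its offset is computed once, at definition time. */
size_t define_threadlocal(size_t size, size_t align) {
    block_size = (block_size + align - 1) / align * align;  /* round up */
    cell_offset[n_cells] = block_size;
    cell_size[n_cells] = size;
    block_size += size;
    return n_cells++;
}

/* Per-thread state: one bulk-allocated block, sized at thread creation. */
typedef struct {
    char  *block;   /* stands in for the reserved register */
    size_t size;
} SimThread;

SimThread sim_thread_create(void) {
    SimThread t;
    t.size = block_size;
    t.block = calloc(1, block_size ? block_size : 1);
    return t;
}

/* THREADLOCAL-like lookup: base plus fixed offset.
 * Returns NULL if the cell image was defined after the thread was created. */
void *threadlocal(const SimThread *t, size_t cell) {
    size_t end = cell_offset[cell] + cell_size[cell];
    return (end <= t->size) ? (void *)(t->block + cell_offset[cell]) : NULL;
}
```

In this model a lookup is a single addition, which is the efficiency argument for the C-style design. The NULL case is the flexibility that is given up.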
The
Design decisions
Thread-local cell images are statically defined, as in C. In this way, all thread-local cells can be bulk-allocated per thread, and thread-local cells can be addressed by an indirection from a reserved register. This is efficient. The offsets into the thread-local block can be calculated at definition time.
Mu does not provide key-value style thread-local storage as POSIX or Java do. Even so, the client can use a thread-local cell to hold a reference to a key-value map that it implements itself.
Newly defined thread-local cell images do not expand the existing thread-local blocks of existing threads. Even in C, the room for expansion is limited for existing threads. Mu could solve the problem by triggering a GC when new thread-local cell images are defined and "moving" the thread-local block, but if any thread-local cell is pinned, the block cannot be moved. So we have to trade flexibility for efficiency.
Potential problems
Proposal 2 (Mu-specific style)
Changes to the Mu IR and the Mu API
No new top-level definitions are introduced. No new kinds of memory are introduced (i.e. we still have only three kinds of memory: heap, stack and global).
Each Mu thread has a single thread-local ref<void>. Two new common instructions are added:
The
When a thread is created through the API, the initial thread-local ref<void> is passed to new_thread:

    MuCtx *ctx = muvm->new_context(muvm);
    MuStackRefValue stack = ctx->new_stack(......);
    MuRefValue thread_local_object = ctx->new_fixed(ctx, ID_OF_SOME_STRUCT);
    MuThreadRefValue thread = ctx->new_thread(ctx,
        stack,                // the stack
        thread_local_object,  // the initial thread-local ref<void>
        ..., ..., ..., NULL); // other arguments go here

Two new API functions are added:

    MuRefValue (*get_threadlocal)(MuCtx *ctx, MuThreadRefValue thread);
    void (*set_threadlocal)(MuCtx *ctx, MuThreadRefValue thread, MuRefValue new_threadlocal);

These two functions can only be called in trap handlers, on the thread that caused the trap. They must not be used outside trap handlers, or on other threads that are not currently being handled (otherwise there will be data races).
Design decisions
There is only a single object reference per thread. Even so, the client can fully customise the structure of the object it refers to, and Mu can use GC to collect the object.
The value is a ref<void>. The client decides what that reference refers to.
If the client needs the flexibility of adding new thread-local fields at run time, it can trap existing programs (using WATCHPOINTs), re-allocate the thread-local object and copy the old contents. It can also implement its own two-level or multi-level indirection tables, depending on how much speed it needs.
Every time the client adds more thread-local structure fields, it needs to know the previous structure. This offloads the "relocation" job to the client. Mu is designed primarily for JIT compiling, so the client always knows the structure of the thread-local struct when it compiles new Mu IR bundles.
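The re-allocate-and-copy step described above can be sketched in plain C. The struct names `TLv1` and `TLv2`, the `ClientThread` stand-in and both functions are hypothetical, invented for this example. In Mu the object would live in the GC heap and the swap would go through `set_threadlocal` inside a trap handler, whereas this sketch uses `calloc` and `free` only to stay self-contained.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical client-defined layouts. TLv2 extends TLv1, so the fields
 * that already existed keep their offsets (TLv1 is a prefix of TLv2). */
typedef struct { int a; long b; } TLv1;
typedef struct { int a; long b; void *c; } TLv2;

/* Stand-in for a Mu thread holding its single thread-local reference. */
typedef struct { void *threadlocal; } ClientThread;

/* What the client might do in a trap handler: allocate the larger
 * object, copy the old fields across, then swap the reference. */
void grow_threadlocal(ClientThread *t) {
    TLv1 *old = t->threadlocal;
    TLv2 *bigger = calloc(1, sizeof *bigger);   /* new field c starts as NULL */
    bigger->a = old->a;
    bigger->b = old->b;
    free(old);   /* in Mu, the GC would reclaim the old object instead */
    t->threadlocal = bigger;
}

/* Code compiled against the old layout only knows TLv1 offsets.
 * It still reads the right field after the object has grown. */
long read_b_with_v1_offset(const void *tl) {
    long b;
    memcpy(&b, (const char *)tl + offsetof(TLv1, b), sizeof b);
    return b;
}
```

Because old fields never move, code compiled before the extension does not need to be recompiled. Only code that uses the new field has to know the new layout.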
Since older struct is a prefix of the next, with proper
Potential problems
Regarding Proposal 2: it reserves only one register, for the thread-local object reference. @steveblackburn suggested that the client may want more thread-local data kept in reserved registers for performance-critical purposes. But there are several challenges:
Add thread-local memory to Mu, in addition to the existing heap, stack and global memory.
Proposal 1: the C-like approach, has known problems
Proposal 2 (preferred): a more aggressive design