This repository has been archived by the owner on Aug 2, 2019. It is now read-only.

Thread-local storage #52

Open
wks opened this issue Apr 21, 2016 · 4 comments

Comments

@wks
Member

wks commented Apr 21, 2016

Add thread-local memory to Mu, in addition to the existing heap, stack and global memory.

Proposal 1: the C-like approach, has known problems
Proposal 2 (preferred): a more aggressive design

@wks
Member Author

wks commented Apr 22, 2016

Introduction

We need thread-local memory to provide efficient thread-local storage for Mu programs.

Existing uses

  • C/C++/Java/C#/Python/Ruby/... all provide thread-local storage to the users.
  • C originally reported errors from library functions through the global variable errno; C11 later made errno thread-local.
  • C++ uses thread-local storage for exception handling. Exception values and metadata are temporarily stored in a thread-local buffer.
  • Efficient memory allocators (such as jemalloc) and garbage collectors usually keep thread-local memory pools.
  • Thread-local storage is a way to implement dynamic scoping.

Implementation

Dedicated register

The most efficient way to implement thread-local storage is to reserve a register which always points to a memory region dedicated to the current thread. Every thread-local memory location lies within that region and can be addressed by an offset relative to that register.

For example, if we use the FS segment register to point to the beginning of the thread-local memory region, all thread-local variables can be addressed relative to FS. This means that the following thread-local variables (in C):

thread_local int a;
thread_local long b;
thread_local void* c;

can be implemented as (in C-like pseudo code):

uintptr_t FS = (uintptr_t)malloc(24); // big enough for a, b and c
int* a_ptr = (int*)(FS + 0);
long* b_ptr = (long*)(FS + 8);
void** c_ptr = (void**)(FS + 16);

The thread-local memory is allocated (probably in the heap) and the value of register FS is assigned when a thread is created.
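
The base-plus-offset scheme above can be sketched in portable C. This is only an illustration under stated assumptions: the `tls_region` struct and the `tls_*` helpers are hypothetical names, the base address is passed explicitly instead of living in a reserved register, and the offsets come from `offsetof` rather than from a linker or loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout of one thread's region, holding a, b and c. */
struct tls_region {
    int   a;
    long  b;
    void *c;
};

/* Allocate a zeroed region for one thread. The returned base address
 * plays the role that the FS register plays in the text. */
uintptr_t tls_region_create(void) {
    return (uintptr_t)calloc(1, sizeof(struct tls_region));
}

/* Release a region when its thread dies. */
void tls_region_destroy(uintptr_t base) {
    free((void *)base);
}

/* Each accessor adds a fixed offset to the per-thread base. */
int *tls_a(uintptr_t base) {
    assert(base != 0);
    return (int *)(base + offsetof(struct tls_region, a));
}

long *tls_b(uintptr_t base) {
    assert(base != 0);
    return (long *)(base + offsetof(struct tls_region, b));
}

void **tls_c(uintptr_t base) {
    assert(base != 0);
    return (void **)(base + offsetof(struct tls_region, c));
}
```

Two regions created with `tls_region_create` stand in for two threads: writing through `tls_a` on one base leaves the other region untouched, which is the property the dedicated register provides without any lookup.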

Different ABIs use different registers. x86-64 Unix uses FS. Some ARM ABIs use a CP15 control register, while others use the r9 general-purpose register.

System (C) dynamic loaders need to relocate the offsets of thread-local variables when shared objects are dynamically loaded, because shared objects can be loaded in different orders. They must also allocate the additional thread-local storage for existing threads in some reserved space, but the room for this flexibility may be limited.

Global map of threadID -> pointer

Alternatively, thread-local storage can be implemented as a global HashMap<ThreadID, ThreadLocal<T>>. This involves a table lookup every time a thread-local is accessed, but has more flexibility than the C-like approach. For example, the hashmap can grow arbitrarily.
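
As a rough sketch of the map-based approach, the following C code keeps one global table keyed by a thread ID. It is not a real implementation: the `tls_map_*` names are hypothetical, thread IDs are plain integers supplied by the caller, the capacity is fixed for brevity (a real map would grow), and the locking that a shared table needs is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TLS_MAP_CAPACITY 64 /* assumed fixed capacity; a real map would grow */

struct tls_entry {
    unsigned long thread_id;
    void         *value;
    int           used;
};

/* One global table shared by all threads (synchronisation omitted). */
static struct tls_entry tls_map[TLS_MAP_CAPACITY];

/* Linear probing: return the slot holding thread_id, or the first empty
 * slot where it would be stored. Returns NULL if the table is full. */
static struct tls_entry *tls_find_slot(unsigned long thread_id) {
    for (size_t probe = 0; probe < TLS_MAP_CAPACITY; probe++) {
        struct tls_entry *e =
            &tls_map[(thread_id + probe) % TLS_MAP_CAPACITY];
        if (!e->used || e->thread_id == thread_id) {
            return e;
        }
    }
    return NULL;
}

/* Bind a value to a thread. Returns 0 on success, -1 if the table is full. */
int tls_map_set(unsigned long thread_id, void *value) {
    assert(value != NULL); /* NULL is reserved to mean "no entry" */
    struct tls_entry *e = tls_find_slot(thread_id);
    if (e == NULL) {
        return -1;
    }
    e->thread_id = thread_id;
    e->value = value;
    e->used = 1;
    return 0;
}

/* Look up the value bound to a thread, or NULL if there is none. */
void *tls_map_get(unsigned long thread_id) {
    struct tls_entry *e = tls_find_slot(thread_id);
    return (e != NULL && e->used) ? e->value : NULL;
}
```

Every access pays for a hash and a probe, which is the per-access cost the paragraph above refers to; in exchange, nothing about the layout has to be fixed when a thread is created.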

POSIX threads (pthreads) also support such key-value style thread-local storage.

Other approaches

A (fixed) address region can be mapped for each thread to its thread-specific region. This needs the cooperation of the operating system (mmap per thread).

@wks
Member Author

wks commented Apr 22, 2016

Proposal 1 (C style)

Changes to the Mu IR and the Mu API

Now the Mu memory consists of the heap, the stacks, the global memory and the thread-local memory. The allocation units in the thread-local memory are called thread-local cells.

The new top-level definition ".threadlocal" defines a thread-local cell image.

.threadlocal @my_error_number <@i64>

After a bundle that contains thread-local cell images is loaded, when a thread is created, a thread-local cell is allocated for that thread for each image. The lifetime of a thread-local cell starts from the creation of its thread, and ends when the thread dies.

If a new bundle is loaded after a thread is created, no new thread-local cells are created for that thread, even if the new bundle contains new thread-local cell images.

The new instruction THREADLOCAL gets an iref of a given thread-local cell of the current thread.

%iref = THREADLOCAL <@i64> @my_error_number    // %iref is an iref<@i64>

// After this instruction, the iref can be used like global cells.
%old_value = LOAD <@i64> %iref
%new_value = ...
STORE <@i64> %iref %new_value

The handle_from_threadlocal API function can get the iref to a thread-local cell:

MuIRefValue (*handle_from_threadlocal)(MuCtx *ctx, MuThreadRefValue thread, MuID thread_local_img);

Design decisions

Thread-local cell images are statically defined, as in C. In this way, all thread-local cells can be bulk-allocated per thread, and thread-local cells can be addressed by an indirection from a reserved register. This is efficient. The offsets into the thread-local block can be calculated when the images are defined.

Mu does not provide key-value style thread-local storage as POSIX or Java does. Even so, the client can use a thread-local cell to hold a reference to a HashMap, where keys are mapped to thread-local values.

Newly-defined thread-local cell images do not expand the thread-local blocks of existing threads. Even in C, the room for expansion is limited for existing threads. Mu could solve the problem by triggering a GC when new thread-local cell images are defined and "moving" the thread-local block, but if any thread-local cell is pinned, the block cannot be moved. So we have to trade flexibility for efficiency.
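
The following C sketch models these decisions: offsets are fixed when an image is defined, each thread gets one bulk-allocated block sized for the images known at its creation, and later images do not expand existing blocks. All names (`define_image`, `create_thread_block`, `cell_address`) are hypothetical, the 8-byte slot alignment is an assumption, and returning NULL for a cell the thread does not have is a modelling choice of this sketch, not behaviour specified by the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Total bytes of all thread-local cell images defined so far. */
static size_t defined_size = 0;

/* Define an image of `size` bytes and return its fixed offset.
 * Each image gets an 8-byte-aligned slot (an assumption of this sketch). */
size_t define_image(size_t size) {
    size_t offset = defined_size;
    defined_size += (size + 7) & ~(size_t)7;
    return offset;
}

/* The block that is bulk-allocated for one thread when it is created. */
struct thread_block {
    size_t         size;  /* bytes covered at thread creation time */
    unsigned char *cells; /* zero-initialised storage */
};

/* Allocate a block that covers every image defined up to now. */
struct thread_block create_thread_block(void) {
    struct thread_block b;
    b.size = defined_size;
    b.cells = calloc(1, defined_size ? defined_size : 1);
    assert(b.cells != NULL);
    return b;
}

/* Address of a cell in a thread's block, or NULL if the image was
 * defined after the thread was created (the block is never expanded). */
void *cell_address(const struct thread_block *b, size_t offset) {
    return offset < b->size ? (void *)(b->cells + offset) : NULL;
}
```

A thread created before a later `define_image` call cannot reach the new cell, while a thread created afterwards can. This is the flexibility given up in exchange for a single add at each access.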

Potential problems

  • There is no way to initialise thread-local cells before a thread is created.
  • Pinning a thread-local cell conflicts with defining new thread-local cell images, because a block that contains a pinned cell cannot be moved to make room.

@wks
Member Author

wks commented Apr 22, 2016

Proposal 2 (Mu-specific style)

Changes to the Mu IR and the Mu API

No new top-level definitions are introduced.

No new kinds of memory are introduced (i.e. we still only have 3 kinds of memory: heap, stack and global).

Each Mu thread has a single thread-local ref<void> value (we need a better name for this value), which can refer to an arbitrary Mu object.

Two new common instructions are added:

@uvm.set_threadlocal(%ref: ref<void>) -> ()
@uvm.get_threadlocal() -> ref<void>

The @uvm.set_threadlocal instruction sets the thread-local ref<void> value to the argument; the @uvm.get_threadlocal instruction gets the previously-set ref<void> value. There is no memory order problem, because it is only ever accessed by the current thread.

// Get the thread-local reference
%tl = COMMINST @uvm.get_threadlocal      // %tl is ref<void>

// Cast it to the appropriate type
%tl_typed = REFCAST<@refvoid @ref_to_some_struct> %tl    // %tl_typed is ref<@some_struct>

// Then use it as usual
%tl_iref = GETIREF <@some_struct> %tl_typed   // %tl_iref is iref<@some_struct>
... = GETFIELDIREF ...
... = LOAD ...

The new_thread API function is modified to take an extra optional ref<void> handle to any Mu object. It will be the initial thread-local ref<void> value of the new thread.

MuCtx *ctx = muvm->new_context(muvm);
MuStackRefValue *stack = ctx->new_stack(......);
MuRef *thread_local_object = ctx->new_fixed(ctx, ID_OF_SOME_STRUCT);

MuThreadRefValue *thread = ctx->new_thread(ctx,
  stack, // the stack
  thread_local_object, // the initial thread-local ref<void>
  ..., ..., ..., NULL); // other arguments go here

Two new API functions get_threadlocal and set_threadlocal are provided so that the client can get/set the thread-local ref<void> values after a thread is created:

MuRefValue (*get_threadlocal)(MuCtx *ctx, MuThreadRefValue thread);
void (*set_threadlocal)(MuCtx *ctx, MuThreadRefValue thread, MuRefValue new_threadlocal);

These two functions can only be called in a trap handler, and only on the thread that caused the trap. Calling them outside a trap handler, or on a thread that is not currently being handled, is not allowed (otherwise there will be data races).

Design decisions

Only one single object reference per thread. Even so, the client can fully customise the structure of the object it refers to. And Mu can use GC to collect the object.

The value is a ref<void> so it fits in a register. If we reserve a dedicated register, it is as cheap as the system's default register-indirected addressing (because GETIREF and GETFIELDIREF are free). If the ref<void> does not occupy a register, but resides in a thread-local memory block (the micro VM will always have some thread-local data, such as GC memory pool, and we just reserve one extra word to hold this user-defined ref<void>), it will need one indirection, which is still not very expensive.

The client decides what that ref<void> refers to, so Mu no longer needs to provide the mechanisms to allocate or expand thread-local blocks.

If the client needs some flexibility of adding new thread-local fields at run time, it can trap existing programs (using WATCHPOINTS) and re-allocate the thread-local object and then copy. It can also implement its own two-level or multi-level indirection tables, depending on the demand of speed.

Every time the client adds more thread-local structure fields, it needs to know the previous structure. This offloads the "relocation" job to the client. Mu is designed primarily for JIT compiling, so the client always knows the structure of the thread-local struct when it compiles new Mu IR bundles.

// bundle1
.typedef @thread_local_struct1 = <@i64 @double @refvoid>

// bundle2, loaded later
.typedef @thread_local_struct2 = <@thread_local_struct1 @extra_field>

// bundle3, loaded after both bundle1 and bundle2
.typedef @thread_local_struct3 = <@thread_local_struct2 @more @extra @fields>

Since each older struct is a prefix of the next, with proper REFCAST and GETFIELDIREF, subsequently compiled Mu IR code can address the appropriate fields.
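
The prefix idea has a direct C analogue, sketched below under stated assumptions. `demo_thread`, `demo_set_threadlocal`, `demo_get_threadlocal`, `tl_v1`, `tl_v2` and `upgrade_to_v2` are hypothetical names and not part of the Mu API; the single pointer in `demo_thread` stands in for the per-thread ref<void>, and `free` stands in for the garbage collector reclaiming the old object.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout known to code compiled with the first bundle. */
struct tl_v1 {
    int64_t counter;
    double  ratio;
    void   *payload;
};

/* Layout known to later code: the old struct is a prefix of the new one. */
struct tl_v2 {
    struct tl_v1 base;
    int32_t      extra_field;
};

/* Stand-in for a thread and its single thread-local reference. */
struct demo_thread {
    void *threadlocal;
};

void demo_set_threadlocal(struct demo_thread *t, void *ref) {
    assert(t != NULL);
    t->threadlocal = ref;
}

void *demo_get_threadlocal(const struct demo_thread *t) {
    assert(t != NULL);
    return t->threadlocal;
}

/* Client-side "relocation": allocate the larger object, copy the old
 * prefix into it, then swap the thread-local reference.
 * Returns 0 on success, -1 if allocation fails. */
int upgrade_to_v2(struct demo_thread *t) {
    struct tl_v1 *old = demo_get_threadlocal(t);
    struct tl_v2 *bigger = calloc(1, sizeof *bigger);
    if (bigger == NULL) {
        return -1;
    }
    if (old != NULL) {
        bigger->base = *old;
    }
    demo_set_threadlocal(t, bigger);
    free(old); /* in Mu, the GC would reclaim the old object instead */
    return 0;
}
```

Code that only knows `struct tl_v1` can keep casting the reference to `struct tl_v1 *` after the upgrade, because a pointer to a C struct also points to its first member. That plays the same role as REFCAST to the older struct type in the Mu IR.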

Potential problems

  • Depending on whether we reserve a dedicated register, it may or may not be faster than the C-style approach.
  • This may be inconvenient for pre-compiled Mu IR code which is shipped independently from other Mu IR modules (RPython on Mu does not do this. It compiles all code into one single Mu IR bundle). The Mu IR loader will need to "patch" the Mu IR code while loading. Whether this is easy to do depends on how the Mu IR is shipped (if it is the text form, some RegExp substitution will work).

@wks
Member Author

wks commented May 3, 2016

Regarding proposal 2, it only reserves one register for the thread-local objref. @steveblackburn suggested that the client may want more thread-local data to be reserved in registers for performance-critical purposes. But there are several challenges:

  • The number of registers varies greatly among architectures. For ARMv8 and POWER, registers are not a problem; x86, on the other hand, has very few general purpose registers. The Mu IR/API, however, needs to provide a uniform API across platforms.
    • One way to work around this is to let some thread-local objrefs use machine registers, while others use memory. Despite the difference in performance, we can still have a uniform API.
  • Thread-locals compete with local variables for registers. If the JIT compiler has used some registers for local variables, it cannot repurpose them for thread-locals without recompiling the affected programs.
  • What if the client wishes to directly access something that is not objref? In this case, maybe proposal 1 is more relevant.
