Coroutines support #362
Hi! thanks for the bump ivmai, I wasn't aware of this request. I just published how we use The TL;DR would be
Feel free to ask questions. |
To my understanding, next-level coroutine support could mean providing wrappers for the relevant platform-specific calls, such as GC_CreateFiber, GC_SwitchToFiber, etc. |
We haven't got around to reimplementing the gc/coroutine integration yet. My understanding is that crystal-lang has found a way to pretend to bdwgc that coroutines do not exist and that all active threads have native stacks. Their solution needs a lot of bookkeeping and locking to uphold this fiction. It seems that an interface like the following would be sufficient to allow bdwgc to handle coroutines correctly.
typedef int GC_extra_stack_handle;
// Register an extra stack that any thread may switch to.
// This saves the stack info for the GC to search through when the sp is not in the native stack.
GC_extra_stack_handle
GC_register_extra_stack(void *lo, void* hi, void *sp);
void
GC_unregister_extra_stack(GC_extra_stack_handle handle);
// Save the current stack pointer to either the native stack or a stack from the extra stacks
// This must be called before switching to a different coroutine, so that the relevant parts of the inactive stack can be scanned
void
GC_save_stack_pointer(void);
// or perhaps equivalently
// void GC_before_stack_switch();
// To my understanding, the following won't be necessary, because bdwgc stop-the-world is already capable of getting all active stack pointers. This could update the extra_stack's saved stack pointer, but that info will be immediately outdated. Maybe it could implement some sanity check assertion? Otherwise, I don't think it's necessary.
// void GC_after_stack_switch();
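To make the bookkeeping behind this sketch concrete, here is a minimal mock of the registry such an interface might keep. Every `demo_` name is hypothetical and nothing here exists in bdwgc; unlike the proposed GC_save_stack_pointer(), the mock takes the stack pointer as an explicit argument so it can be exercised without reading the real sp. It assumes stacks grow down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the interface sketched above.
   None of these names exist in bdwgc. */
#define DEMO_MAX_STACKS 16

typedef int demo_stack_handle;

struct demo_extra_stack {
    char *lo;   /* lowest address of the stack region */
    char *hi;   /* one past the highest address */
    char *sp;   /* last saved stack pointer */
    int in_use;
};

static struct demo_extra_stack demo_stacks[DEMO_MAX_STACKS];

/* Register a coroutine stack; returns -1 if the table is full. */
static demo_stack_handle demo_register_extra_stack(void *lo, void *hi, void *sp)
{
    for (int i = 0; i < DEMO_MAX_STACKS; i++) {
        if (!demo_stacks[i].in_use) {
            demo_stacks[i].lo = lo;
            demo_stacks[i].hi = hi;
            demo_stacks[i].sp = sp;
            demo_stacks[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

static void demo_unregister_extra_stack(demo_stack_handle h)
{
    assert(h >= 0 && h < DEMO_MAX_STACKS);
    demo_stacks[h].in_use = 0;
}

/* Record sp in whichever registered stack contains it.
   Returns that stack's handle, or -1 if sp is not in any extra stack
   (i.e. it presumably belongs to a native thread stack). */
static demo_stack_handle demo_save_stack_pointer(void *sp)
{
    char *p = sp;
    for (int i = 0; i < DEMO_MAX_STACKS; i++) {
        struct demo_extra_stack *s = &demo_stacks[i];
        if (s->in_use && p >= s->lo && p <= s->hi) {
            s->sp = p;
            return i;
        }
    }
    return -1;
}

/* Bytes a collector would scan for a suspended stack: [sp, hi). */
static size_t demo_scan_bytes(demo_stack_handle h)
{
    return (size_t)(demo_stacks[h].hi - demo_stacks[h].sp);
}
```

The point of saving sp before a switch is visible in demo_scan_bytes(): only the live part of the suspended stack needs scanning, not the whole region.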
This would be nice to have, but I'd consider the interface I've sketched to be next-level already. Also, offering examples instead of full-on pre-wrapped coroutine libraries will get bdwgc 99% of the way there without the hassle of maintaining some of that code's details, like additions to the configure script, etc. |
https://github.com/NixOS/nixpkgs/blob/master/pkgs/tools/package-management/nix/patches/boehmgc-coroutine-sp-fallback.patch |
Not to disagree with the above, but what prevents us from implementing wrappers over the system API for manipulating coroutines, as we have for threads (GC_pthread_create, etc.)? |
But first, could you let me know which system primitives are used to manipulate coroutines, and how? Windows has the corresponding API (CreateFiber, ConvertThreadToFiber, SwitchToFiber, etc.); what about macOS, Linux and other Unix platforms? Are there any portable libraries for dealing with coroutines? |
Coroutines aren't standardized on unix-like systems, so I will kindly refer to wikipedia, which does a better job than I could, trying to explain the mess. C++20 does have a coroutine language feature that delegates a large part of the coroutine logic to user types. I don't have any experience with this feature, but it seems like something that bdwgc could support or perhaps even should support. I don't think this will be the only way to integrate bdwgc with C++, because the coroutine language feature is quite invasive, requiring changes to the methods that run in the coroutine.
Nix uses boost coroutine2, specifically the protected_fixedsize implementation with a custom wrapper, to achieve a suboptimal but safe integration on Linux (probably also Darwin soon?).
I'm not familiar with the nitty-gritty details of how the switching is implemented deep down.
I don't know much about other libraries. With coroutine2 we only use the "asymmetric" coroutines; it has some examples in the asymmetric coroutine introduction. Here's how we hook into coroutine creation and destruction with boost coroutine2: https://github.com/NixOS/nix/pull/4944/files#diff-f118e4c6f6e02148b887fdf627352311fca5a3a4eadf0b4a9d9f348e0be464ff I have not found a way to hook into the switching methods. We'll probably want to wrap the coroutine type in a new type that adds the necessary calls. It'd be nice for the wrapped type to implement the exact same interface, but a simpler interface could get the job done. |
Looking at the Win32 fibers documentation (https://docs.microsoft.com/en-us/windows/win32/procthread/fibers), I see the following primitives for creating, deleting and switching between fibers (I omit the Fiber Local Storage (FLS) primitives here):
I suppose that other coroutine providers generally have a similar set of primitives (which we should be aware of when thinking about adding coroutine support to bdwgc). If there are others, please add them. |
If a fiber is created by CreateFiberEx, how do we determine its stack bottom? |
This does not seem reliable to me; I think the approach should be something like: |
Basically we need to consider behavior of bdwgc (and how client should inform bdwgc) during these operations about stackful coroutines (fibers):
|
Note to self: some popular coroutine libraries:
|
I think bdwgc already has the necessary API (as far as I can tell; I have no PoC at present) to deal with coroutines correctly (and efficiently), but certain changes are needed in the thread-related code of gc. How the client could inform gc about coroutines:
gc internals to fix:
Notes:
This approach should not be difficult to try; it would be good to hear opinions about it. |
It seems that inactive has a distinct meaning in bdwgc that I did not intend to reference. I meant the stack of a suspended coroutine. My thinking was that on thread suspension, you could suspend all "OS" threads, giving you the stack pointers of the active coroutines (/ normal threads). The ones suspended (by yield) don't need suspension by stop_world. By saving the stack pointer before leaving, you should have enough information stored to mark those suspended stacks as well. Once coroutines are accepted, stacks aren't 1:1 with threads anymore, so from a design perspective it would make sense to treat them separately. I hope my ideas are helpful, but please take them with a grain of salt, as I don't have nearly your knowledge of the code and the intricacies of the system. |
In order to allow concurrent fiber switches in Crystal, we use a RWLock. Switching fibers acquires a reader lock; collections acquire a writer lock. Hiding the lock behind the API can be a later improvement. But using a single mutex might hurt performance a lot. At least that was the case in Crystal. |
@bcardiff, I agree. There could be 2 solutions:
|
@roberth, thank you for your opinion, I will consider it.
Yes, the initial intention was to prevent the suspend signal from arriving while the client executes some syscall. But essentially it could be treated as save_stack_pointer (indicating that the signal should not be sent because the stack pointer is already saved and the thread is not mutating any GC pointers). I don't see any big difference between stackful coroutines and threads (from the GC's side): both have a stack and a context. |
What happens to coroutines on a process fork? It seems to me that they should survive in the child process. If someone has experience with it, let me know please. |
IIRC in Crystal we do fork to perform an exec only, so we don't need to keep coroutines. |
@roberth, after trying an approach based on GC_do_blocking, I'm more and more convinced that supporting coroutines requires separating "stack information" from the thread: the GC_thread entity should contain only the thread id, the thread-local free lists and a pointer to an entity named, e.g., GC_coroutine (containing the rest of the information), and GC_push_all_stacks should traverse the table containing all GC_coroutine entities, both active (suspended by GC_stop_world) and inactive ones. |
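A rough picture of that separation, as a self-contained mock: the field names stack_end, stack_ptr and crtn echo the refactoring commit in this thread, but the `demo_` types and functions are hypothetical and much simpler than the real structures. Stacks are assumed to grow down.

```c
#include <assert.h>
#include <stddef.h>

/* Stack information, detached from the thread that happens to run it. */
struct demo_stack_context {
    char *stack_end;                  /* cold end (bottom) of the stack */
    char *stack_ptr;                  /* saved sp while not running */
    struct demo_stack_context *next;  /* link in the table of all contexts */
};

/* The thread keeps only its identity and a pointer to its current context. */
struct demo_thread {
    unsigned long id;
    struct demo_stack_context *crtn;
};

static struct demo_stack_context *demo_all_contexts;

static void demo_add_context(struct demo_stack_context *c)
{
    c->next = demo_all_contexts;
    demo_all_contexts = c;
}

/* Walk every context, running or suspended, the way a push-all-stacks
   routine would; here we only total the bytes that would be scanned. */
static size_t demo_total_scan_bytes(void)
{
    size_t total = 0;
    for (struct demo_stack_context *c = demo_all_contexts; c != NULL; c = c->next)
        total += (size_t)(c->stack_end - c->stack_ptr);
    return total;
}

/* A coroutine switch saves the outgoing sp and repoints the thread. */
static void demo_switch(struct demo_thread *t, struct demo_stack_context *to,
                        char *current_sp)
{
    t->crtn->stack_ptr = current_sp;
    t->crtn = to;
}
```

Because the scan loop iterates over contexts rather than threads, a suspended coroutine is covered even though no thread currently points at it.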
I need this too for ecl coroutines support. I've implemented the API but bdwgc collects all the way in and I get segfault. ecl uses setjmp/longjmp to manage non-local exits. |
(refactoring) Issue #362 (bdwgc).
* darwin_stop_world.c (GC_stack_range_for): Define and crtn local variable; use crtn instead of p to access stack_ptr, topOfStack, stack_end, altstack, altstack_size, normstack, normstack_size, backing_store_end, backing_store_ptr.
* pthread_stop_world.c [!NACL] (GC_suspend_handler_inner, GC_push_all_stacks): Likewise.
* pthread_support.c [!GC_NO_FINALIZATION] (GC_reset_finalizer_nested, GC_check_finalizer_nested): Likewise.
* pthread_support.c [!GC_WIN32_THREADS] (GC_register_altstack): Likewise.
* pthread_support.c (do_blocking_enter, do_blocking_leave, GC_set_stackbottom, GC_get_my_stackbottom, GC_call_with_gc_active): Likewise.
* win32_threads.c (GC_push_stack_for, GC_get_next_stack): Likewise.
* darwin_stop_world.c (GC_push_all_stacks): Use p->crtn instead of p to access traced_stack_sect.
* win32_threads.c (GC_suspend, GC_stop_world, GC_start_world): Likewise.
* include/private/pthread_support.h (GC_StackContext_Rep): New struct type (move dummy, stack_end, stack_ptr, last_stack_min, initial_stack_base, topOfStack, backing_store_end, backing_store_ptr, altstack, altstack_size, normstack, normstack_size, finalizer_nested, finalizer_skipped, traced_stack_sect from GC_Thread_Rep).
* include/private/pthread_support.h [!GC_NO_THREADS_DISCOVERY && GC_WIN32_THREADS] (GC_StackContext_Rep.stack_end): Add volatile.
* include/private/pthread_support.h [!GC_NO_FINALIZATION] (GC_StackContext_Rep.fnlz_pad): New field.
* include/private/pthread_support.h (GC_stack_context_t): New type.
* include/private/pthread_support.h (GC_Thread_Rep.crtn, GC_Thread_Rep.flags_pad): New field.
* include/private/pthread_support.h [GC_NO_FINALIZATION] (GC_Thread_Rep.no_fnlz_pad): Remove field.
* include/private/pthread_support.h (GC_threads): Move (and refine) comment from GC_Thread_Rep.
* include/private/pthread_support.h [GC_WIN32_THREADS] (GC_record_stack_base): Change me argument to crtn.
* pthread_stop_world.c [!NACL] (GC_store_stack_ptr): Likewise.
* pthread_support.c (GC_record_stack_base): Likewise.
* pthread_stop_world.c [NACL] (NACL_STORE_REGS, nacl_pre_syscall_hook, __nacl_suspend_thread_if_needed): Use p->crtn instead of p to access stack_ptr.
* pthread_support.c (first_crtn): New static variable.
* pthread_support.c (first_thread): Update comment.
* pthread_support.c (first_thread_used): Remove variable.
* pthread_support.c (GC_push_thread_structures): Push first_thread.crtn symbol.
* pthread_support.c (GC_push_thread_structures): Push first_crtn.backing_store_end instead of that in first_thread.
* pthread_support.c [MPROTECT_VDB && GC_WIN32_THREADS] (GC_win32_unprotect_thread): Call GC_remove_protection() for t->crtn.
* pthread_support.c (GC_new_thread): Use first_thread.crtn!=0 instead of first_thread_used; set first_thread.crtn to &first_crtn; allocate GC_StackContext_Rep object using GC_INTERNAL_MALLOC() and store the pointer to result->crtn.
* pthread_support.c [CPPCHECK] (GC_new_thread): Call GC_noop1() for first_thread.flags_pad, for first_crtn.dummy instead of result->dummy, and for first_crtn.fnlz_pad instead of result->no_fnlz_pad.
* pthread_support.c (GC_delete_thread): Call GC_INTERNAL_FREE(p->crtn) along with that for p.
* pthread_support.c [CAN_HANDLE_FORK && (!THREAD_SANITIZER || !CAN_CALL_ATFORK)] (GC_remove_all_threads_but_me): Likewise.
* pthread_support.c [GC_PTHREADS] (GC_pthread_join, GC_pthread_detach): Likewise.
* pthread_support.c (GC_segment_is_thread_stack, GC_greatest_stack_base_below): Use p->crtn instead of p to access stack_end.
* pthread_support.c (GC_call_with_gc_active): Move assertions about me fields to be after LOCK; add assertion that me->crtn==crtn.
* win32_threads.c (dll_thread_table): Make it static.
* win32_threads.c [!GC_NO_THREADS_DISCOVERY] (dll_crtn_table): New static variable.
* win32_threads.c (GC_register_my_thread_inner): Reformat comment; set me->crtn.
* win32_threads.c (GC_push_stack_for): Define stack_end local variable; immediately return (zero) if stack_end is NULL.
* win32_threads.c (GC_push_all_stacks): Call GC_push_stack_for() (and increment nthreads) even if stack_end is NULL.
* win32_threads.c (GC_get_next_stack): Rename s local variable to stack_end.
Here are the proposed API primitives for coroutine support in bdwgc (the implementation is still in progress; the API may still be subject to change):
|
What does ucp stand for? That coroutine implementation lets us hook into the stack allocation and deallocation methods, so I suppose we'd call It does not have a concept of converting between threads to fibers and back, so I suppose we'd ignore When do we call The |
It is just a custom identifier of a coroutine (e.g. it could be the pointer returned by CreateFiber, or be of type ucontext_t* passed to swapcontext); the only requirements are that it is unique, does not change while the coroutine is alive, and is non-zero. |
Yes.
It is necessary for bdwgc. You can call GC_crtn_resume() only if the caller is itself a coroutine. This is similar in purpose to Win32 ConvertThreadToFiber. It is needed for efficiency inside bdwgc (otherwise we would always need to put every thread into 2 tables: one for the thread and one for the case where it is treated as a coroutine). The good news is that you do not need to call GC_crtn_deinit(0) (to convert a coroutine back to a thread); this is done automatically on thread destruction. |
I should also note that, once created, the entity can be reused multiple times before deinit (provided its ucp_id does not change).
Yes, it is OK to start immediately, but you need to fill in sb yourself. There is no way to set the stack bottom before the coroutine starts running.
|
Let's say, use of GC_call_with_stack_base() is a portable way of filling in sb variable. |
Let's return to this suggestion later if there are no better variants (yes, I see this should be technically possible, as you know the stack base and size in advance). |
Please give me a code sample. GC_crtn_ret() should be placed just before each co_return (and before each return statement of the top-level coroutine function, and as the last statement of the top-level coroutine function). The expression passed to co_return should be simple, e.g.: co_return GC_malloc(1); |
The support of context switching (GC_crtn_resume) is the most difficult part; its implementation is very similar to that of GC_do_blocking(). GC_crtn_resume does:
For convenience resume_fn accepts 2 arguments with client-defined semantics (like koishi_resume). Both symmetric and asymmetric coroutines should be supported. |
@bcardiff, do you push all registers to the stack during context switch (e.g. by GC_with_callee_saves_pushed) or store them to context structure (and later scan them there)? |
@ivmai No, not all. In Crystal we do the context switch mostly as if it were a normal C call, plus tweaking the stack pointer register to change the returning stack. The (assembly) code that depends on the arch can be found at https://github.com/crystal-lang/crystal/blob/master/src/fiber/context/x86_64-sysv.cr#L22-L40 (and other files in the same directory). We preserve the stack pointer in the coroutine structure so we can restore it later, but other than that the registers are preserved on the stack itself according to the calling convention. Does this help? |
@bcardiff, thank you, I see: you have your own swap-context routine which saves the registers to the stack, and then, during push_other_roots, the sp is extracted from the coroutine context. |
Yes. That's right. How we push roots is essentially what you say. There is a bit of more details described in the following blog post https://crystal-lang.org/2022/02/16/bdw-gc-coroutines-support/ |
@roberth, I have modified your patch for upstream master, but still not sure if it is worth upstreaming, still thinking.
|
@roberth, you wrote
Do you mean it is hard to wrap yield() calls in serialize.cc by GC_crtn_resume?
->
and
->
Previously you proposed the API to inform the GC about context switches (GC_before_stack_switch, etc.), the usage looks like this:
This, of course, is easier to use but the fundamental issue with it is to guarantee all registers are saved and to get sp value after the save. |
I think I meant that it would not be possible to integrate it with boost coroutines without modifying the upstream implementation.
Isn't that a fundamental issue in general? Doesn't bdwgc have to solve that problem regardless of where the stack is?
Not really. I think the first name I gave for it better reflects the intent: Whether that saved, backup stack pointer is good enough would depend on the implementation of I must admit that my understanding of the machinery involved isn't as thorough as I would need to have a truly productive conversation about what the interface may look like, so I'll be quite happy to accept what you think the interface has to be. You're doing the work after all, and only an interface that's correctly implementable is useful. |
@roberth, please check commit f7a0708 - it adds API to check and modify sp during push_all_stacks.
If it works for you, I will land the patch in gc-8.2.4 (so you would be able, e.g., to detect the presence of GC_set/get_sp_corrector() by a bdwgc version comparison). PS. This is, of course, a workaround; the work in this issue is to be continued toward proper portable support of coroutines (to be released in a future v8.4.0). |
@ivmai That's much appreciated, but I can't promise that we get around to it soon. Patching the gc is not an obstacle at all for us, as Nix is usually built through Nix, which makes that easy. |
Good, let's work together for the final solution. |
I was thinking that an approach that might work is to focus on the stacks rather than on the coroutines. It would be nice if the GC were aware of the possible stacks and treated them as roots directly, without the user needing to register those roots before each collection. Of course, this is influenced by how the code works in Crystal and by what I imagine would have been nicer/simpler in terms of GC API. The (coroutine) stacks need to be allocated by the user since there might be some logic there. In Crystal, for example, we use a memory pool. The GC would need to know, for a given stack, whether it is running in some thread or not. If it's not running, add it as a root. If it's running, its stack bottom should be the one used for that thread. This is essentially Crystal's before-collect callback: https://github.com/crystal-lang/crystal/blob/9b97e8483528a2c26c3e57afa3704f8ebb429142/src/gc/boehm.cr#L352-L367 To do this we would need a structure to represent the stacks and a way to extract some runtime information:
// A callback that, given a stack_ref, reports in which thread (pthread_id) the stack is being executed, or NULL if it is not being executed.
// If *pthread_id == NULL, then *stack_top != NULL.
// *stack_bottom is always non-NULL.
typedef void (GC_CALLBACK * GC_stack_execution_state_proc)(void *stack_ref, void **pthread_id, void **stack_bottom, void **stack_top);
GC_API void GC_CALL GC_set_stack_execution_state(GC_stack_execution_state_proc);
We need to create/destroy
GC_API void GC_CALL GC_register_stack_ref(void *stack_ref);
GC_API void GC_CALL GC_delete_stack_ref(void *stack_ref);
The user will still need to use How the user's runtime does the context switch will be out of scope for the GC. The only needed bit is to have the information provided by As I mentioned before, in Crystal it was crucial to have a RWLock for this to work. So maybe that should happen before this API. |
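To illustrate how a collector might consume such a callback, here is a small self-contained mock. The `demo_` names, the fixed-size table and the fiber struct are all hypothetical; the real proposal above is only a set of declarations, and an actual implementation would push the ranges as roots rather than count bytes. Stacks are assumed to grow down.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the proposed callback, minus the GC_CALLBACK macro. */
typedef void (*demo_state_proc)(void *stack_ref, void **thread_id,
                                void **stack_bottom, void **stack_top);

#define DEMO_MAX_REFS 8
static void *demo_refs[DEMO_MAX_REFS];
static size_t demo_nrefs;
static demo_state_proc demo_state;

static void demo_set_state_proc(demo_state_proc p)
{
    demo_state = p;
}

/* Returns 0 on success, -1 if the table is full. */
static int demo_register_stack_ref(void *ref)
{
    if (demo_nrefs == DEMO_MAX_REFS)
        return -1;
    demo_refs[demo_nrefs++] = ref;
    return 0;
}

/* What a collector could do before marking: ask the client about every
   registered stack and treat [top, bottom) of each suspended one as a
   root. Running stacks are left to the normal per-thread scan. */
static size_t demo_suspended_root_bytes(void)
{
    size_t total = 0;
    for (size_t i = 0; i < demo_nrefs; i++) {
        void *tid = NULL, *bottom = NULL, *top = NULL;
        demo_state(demo_refs[i], &tid, &bottom, &top);
        if (tid == NULL)
            total += (size_t)((char *)bottom - (char *)top);
    }
    return total;
}

/* Client side: a toy fiber record and the callback that describes it. */
struct demo_fiber {
    void *running_on;  /* thread identifier, or NULL when suspended */
    char *bottom;      /* cold end of the stack */
    char *top;         /* saved sp, meaningful only when suspended */
};

static void demo_fiber_state(void *ref, void **tid, void **bottom, void **top)
{
    struct demo_fiber *f = ref;
    *tid = f->running_on;
    *bottom = f->bottom;
    *top = f->running_on != NULL ? NULL : (void *)f->top;
}
```

The runtime keeps ownership of stack allocation and switching; the collector only needs the three values the callback reports.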
I edited the previous comment. I was missing the stack bottom in the |
Maybe, but for now I think I will return to it (#473) after finishing with this issue. |
Agree, we need to provide create/destroy API functions. Details of this create/destroy API may vary:
A small question here: should bdwgc remove the coroutine stack if there is no reference to it from the client? I think not, because the stack might be native memory that is not managed by bdwgc. Other miscellaneous related questions:
|
Hello, thanks to everyone for their work here 😄 I've been working on coroutines in the V language, and I experienced segfaults due to the GC's incorrect thread stack base.
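/* Needs <pthread.h> (with _GNU_SOURCE defined on Linux) or <windows.h>. */
static void sp_corrector(void** sp_ptr, void* tid) {
  size_t stack_size;
  char* stack_addr; /* lowest address of the thread's stack */
#ifdef __APPLE__
  stack_size = pthread_get_stacksize_np((pthread_t)tid);
  /* On macOS, pthread_get_stackaddr_np() returns the high end of the stack. */
  stack_addr = (char*)pthread_get_stackaddr_np((pthread_t)tid) - stack_size;
#elif defined(_WIN64)
  /* Note: this queries the calling thread's limits, not those of tid. */
  ULONG_PTR stack_low, stack_high;
  GetCurrentThreadStackLimits(&stack_low, &stack_high);
  stack_size = stack_high - stack_low;
  stack_addr = (char*)stack_low;
#elif defined(__linux__)
  pthread_attr_t gattr;
  pthread_getattr_np((pthread_t)tid, &gattr);
  pthread_attr_getstack(&gattr, (void**)&stack_addr, &stack_size);
  pthread_attr_destroy(&gattr);
#else
  /* assert("unsupported platform") would always pass, so fail the build instead. */
#error "unsupported platform"
#endif
  char *sp = (char*)*sp_ptr;
  /* If sp is outside the thread's native stack, scan the whole stack. */
  if (sp <= stack_addr || sp >= stack_addr + stack_size) {
    *sp_ptr = (void*)stack_addr;
  }
}
Thanks for the great help @ivmai 🥂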
What is the benefit of that check, as opposed to just updating the stack pointer every time? I'm just wondering if any performance benefit is negated by the check itself 🤔 |
Hello @joe-conigliaro
Please note that this code works only on platforms where the stack grows down (macOS, Win32 and Linux always have the stack growing down, so it is OK; this is just a note).
No, in my sample above stackaddr is of type void* (thus it should be converted either to char* or to word before adding stack_sz).
This is essential for performance! In most cases the sp is expected to lie between the stack top and bottom (e.g. probably relatively close to the stackaddr+stack_sz value). With this check the collector scans only the space between the current sp and stackaddr+stack_sz; without it, the whole range between stackaddr and stackaddr+stack_sz. |
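The effect of the check on the scanned range can be shown with plain address arithmetic. The helper below is a hypothetical illustration, not part of bdwgc; it uses integer addresses so it can be run with made-up values, and assumes the stack grows down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns how many bytes would be scanned for a thread whose native stack
   occupies [stack_lo, stack_lo + stack_size), given the sp the collector
   observed. An sp outside that range (e.g. inside a coroutine stack) is
   replaced by stack_lo, so the whole native stack gets scanned. */
static size_t demo_corrected_scan_bytes(uintptr_t sp, uintptr_t stack_lo,
                                        size_t stack_size)
{
    uintptr_t stack_hi = stack_lo + stack_size;
    if (sp <= stack_lo || sp >= stack_hi)
        sp = stack_lo;
    return (size_t)(stack_hi - sp);
}
```

For a 4096-byte stack at address 0x7000, an sp of 0x7000 + 4000 leaves 96 bytes to scan, while an sp of 0x100 (outside the stack) falls back to all 4096 bytes. Correcting unconditionally would pay the 4096-byte cost on every collection.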
Thanks for the feedback @ivmai!
I understand that it would be ideal to just update the thread stack base when context switching happens, unfortunately that is not always possible. At least we have a solution for now, until a proper fix is added :)
Ahh I see, oops silly me! 😆 |
First of all a big thank you to ivmai and bcardiff for improving bdwgc wrt coroutine support.
I've been trying to improve a stability issue that the Nix language has encountered when coroutines (used for some I/O) interact with the gc. I wish I'd found some threads mentioned here earlier. Because of the lack of documentation around bdwgc and coroutines, my attempts to fix the problem weren't nearly as effective.
In any case, I completely understand that coroutines are an advanced application of gc that bdwgc is only starting to support. Nonetheless, a number of coroutine integrations exist in the wild that would all benefit from some centralized documentation, including recommendations and/or potential problems.
Coming back to Nix, for now its bdwgc/coroutine integration is a bit crude, scanning the whole area the stack can inhabit whenever a coroutine is involved. I'd like to improve this at some point, but for now I've patched bdwgc to push all available stack memory when the sp is outside the expected GC_thread stack area. Coroutine stacks are added with GC_add_roots.
Besides being inaccurate (scanning stack garbage sometimes), do you see an immediate problem with this approach?
If so, I'll want to prioritize work on the gc integration. Improved documentation would be great for that, although I could work off those same referenced issues.