Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coroutines support #362

Open
roberth opened this issue Jul 1, 2021 · 52 comments
Open

Coroutines support #362

roberth opened this issue Jul 1, 2021 · 52 comments

Comments

@roberth
Copy link
Contributor

roberth commented Jul 1, 2021

First of all a big thank you to ivmai and bcardiff for improving bdwgc wrt coroutine support.
I've been trying to improve a stability issue that the Nix language has encountered when coroutines (used for some I/O) interact with the gc. I wish I'd found some threads mentioned here earlier. Because of the lack of documentation around bdwgc and coroutines, my attempts to fix the problem weren't nearly as effective.
In any case, I completely understand that coroutines are an advanced application of gc that bdwgc is only starting to support. Nonetheless, a number of coroutine integrations exist in the wild that would all benefit from some centralized documentation, including recommendations and/or potential problems.

Coming back to Nix, for now its bdwgc/coroutine integration is a bit crude, scanning the whole area the stack can inhabit whenever a coroutine is involved. I'd like to improve this at some point, but for now I've patched bdwgc to push all available stack memory when the sp is outside the expected GC_thread stack area. Coroutine stacks are added with GC_add_roots.
Besides being inaccurate (scanning stack garbage sometimes), do you see an immediate problem with this approach?
If so, I'll want to prioritize work on the gc integration. Improved documentation would be great for that, although I could work off those same referenced issues.

@ivmai
Copy link
Owner

ivmai commented Jul 2, 2021

Related issues: #150, #274, #277

/cc @bcardiff

@bcardiff
Copy link
Contributor

Hi! thanks for the bump ivmai, I wasn't aware of this request.

I just published how we use GC_set_stackbottom/GC_get_stackbottom to bring coroutines to a multi-thread program in https://crystal-lang.org/2022/02/16/bdw-gc-coroutines-support.html . I hope this helps to clarify how it can be used.

The TL;DR would be

  • We use GC_get_stackbottom when threads starts to track the initial stack of the thread.
  • We use GC_set_stackbottom within a GC_set_push_other_roots.
  • We use a RWLock where coroutine context-switches are readers and GC collections are writers.
  • During GC_set_push_other_roots we also push the stack of all non-running coroutines

Feel free to ask questions.

@ivmai
Copy link
Owner

ivmai commented Mar 11, 2022

To my understanding the next-level support of coroutines could be providing the wrappers of the relevant platform-specific calls like GC_CreateFiber, GC_SwitchToFiber, etc.

@roberth
Copy link
Contributor Author

roberth commented Sep 1, 2022

We haven't got around to reimplementing the gc/coroutine integration yet.

My understanding is that crystal-lang has found a way to pretend to bdwgc that coroutines do not exist, all active threads have native stacks. Their solution needs to do a lot of bookkeeping and locking to uphold this fiction.
It seem that this wouldn't be necessary if it was possible for bdwgc to have information about not just the system stacks, but also all coroutine stacks. Particularly it would remove the need to lock the gc during every coroutine switch; instead relying on stopping the world.

It seems that an interface like the following would be sufficient to allow bdwgc to handle coroutines correctly.
I'm assuming that "extra_stacks" are a class of objects that is separate from existing thread or stack representations. Maybe an existing data structure can already serve this purpose. Consider the interface to be pseudo-code regardless.

typedef int GC_stack_handle;

// Register an extra stack that any thread may switch to.
// This saves the stack info for the GC to search through when the sp is not in the native stack.
GC_extra_stack_handle
GC_register_extra_stack(void *lo, void* hi, void *sp);

void
GC_unregister_extra_stack(GC_extra_stack_handle handle);

// Save the current stack pointer to either the native stack or a stack from the extra stacks
// This must be called before switching to a different coroutine, so that the relevant parts of the inactive stack can be scanned
void
GC_save_stack_pointer();
// or perhaps equivalently 
// void GC_before_stack_switch();

// To my understanding, the following won't be necessary, because bdwgc stop-the-world is already capable of getting all active stack pointers. This could update the extra_stack's saved stack pointer, but that info will be immediately outdated. Maybe it could implement some sanity check assertion? Otherwise, I don't think it's necessary.
// void GC_after_stack_switch();

To my understanding the next-level support of coroutines could be providing the wrappers of the relevant platform-specific calls like GC_CreateFiber, GC_SwitchToFiber, etc.

This would be nice to have, but I'd already consider the interface I've sketched to already be next-level.
Offering a pre-wrapped coroutine library would be next-next-level, but 95% of the problem is already solved by offering the building blocks for a simpler coroutine integration interface.

Also, offering examples instead of full-on pre-wrapped coroutine libraries will get bdwgc 99% there without the hassle of maintaining some of that code's details, like adding to the configure script, etc.

@ivmai
Copy link
Owner

ivmai commented Sep 1, 2022

@ivmai
Copy link
Owner

ivmai commented Sep 1, 2022

We haven't got around to reimplementing the gc/coroutine integration yet.

Not to disagree with the above, but what is preventing us to implement wrappers over system API to manipulate with the coroutines like we have for threads (GC_pthread_create, etc.)?

@ivmai
Copy link
Owner

ivmai commented Sep 1, 2022

But, for the first, could you let me know which system primitives and how are used to manipulate with coroutines? Windows has the corresponding API (CreateFiber, ConvertThreadToFiber, SwitchToFiber, etc.), what's about Mac OS, Linux and other Unix platforms? Are there any portable libraries to deal with coroutines?
Samples (or links to samples) on different platforms would be appreciated.

@roberth
Copy link
Contributor Author

roberth commented Sep 1, 2022

what is preventing us to implement wrappers over system API [...] what's about Mac OS, Linux and other Unix platforms?

Coroutines aren't standardized on unix-like systems, so I will kindly refer to wikipedia, which does a better job than I could, trying to explain the mess.
So supporting all of them directly by wrapping them is not feasible. We might also consider the possibility that bdwgc is not the only library that needs to integrate with coroutines in a codebase. Other candidates may be hints for tools such as valgrind or some sanitization framework, monitoring or whatnot. If all such facilities would wrap the same function, there would be no way to use them together, so it's good to have building blocks as well as convenient wrappers that are ready to go.

C++20 does have a coroutine language feature that delegates a large part of the coroutine logic to user types. I don't have any experience with this feature, but it seems like something that bdwgc could support or perhaps even should support. I don't think this will be the only way to integrate bdwgc with C++, because the coroutine language feature is quite invasive, requiring changes to the methods that run in the coroutine.

Are there any portable libraries

Nix uses boost coroutine2, specifically the protected_fixedsize implementation with a custom wrapper to achieve a suboptimal but safe integration on linux (probably also darwin soon?).
Wikipedia lists other coroutine libraries, many of which support multiple platform configurations.

could you let me know which system primitives and how are used to manipulate with coroutines?

I'm not familiar with the nitty gritty details of how the switching is implemented deep down.

Samples

I don't know much about other libraries. With coroutine2 we only use the "asymmetric" coroutines it has some examples in the asymetric coroutine introduction.

Here's how we hook into coroutine creation and destruction with boost coroutine2 https://github.com/NixOS/nix/pull/4944/files#diff-f118e4c6f6e02148b887fdf627352311fca5a3a4eadf0b4a9d9f348e0be464ff

I have not found a way to hook into the switching methods. We'll probably want to wrap the coroutine type in a new type that adds the necessary calls. It'd be nice for the wrapped type to implement the exact same interface, but a simpler interface could get the job done.

@ivmai
Copy link
Owner

ivmai commented Sep 2, 2022

which system primitives are used to manipulate with coroutines?

Looking into Win fibers documentation (https://docs.microsoft.com/en-us/windows/win32/procthread/fibers), I see such primitives for creating, deletion and switching between fibers (here I omit Fiber Local Storage (FLS) primitives):

void *CreateFiber(size_t dwStackCommitSize, FIBER_START_ROUTINE lpStartAddress, void *lpParameter);
void *CreateFiberEx(size_t dwStackCommitSize, size_t dwStackReserveSize, DWORD dwFlags, FIBER_START_ROUTINE lpStartAddress, void *lpParameter); // flags could be: FIBER_FLAG_FLOAT_SWITCH.
void *GetCurrentFiber(void);
void *ConvertThreadToFiber(void *lpParameter);
void *ConvertThreadToFiberEx(void *lpParameter, DWORD  dwFlags);
BOOL ConvertFiberToThread(void);
void DeleteFiber(void *lpFiber);
void SwitchToFiber(void *lpFiber);

I suppose that generally other coroutine providers have similar set of primitives (which we should be aware when thinking of adding coroutines support to bdwgc). If there are other, please add.

@ivmai
Copy link
Owner

ivmai commented Sep 2, 2022

Related issues: #150, #274, #277

/cc @roloffs, @raviqqe, @nickwanninger (to involve more people to the discussion)

@ivmai
Copy link
Owner

ivmai commented Sep 3, 2022

If fiber is created by CreateFiberEx, how to determine its stack bottom?

@ivmai
Copy link
Owner

ivmai commented Sep 3, 2022

// Save the current stack pointer to either the native stack or a stack from the extra stacks
// This must be called before switching to a different coroutine, so that the relevant parts of the inactive stack can be scanned
void GC_save_stack_pointer();
// or perhaps equivalently void GC_before_stack_switch();

This does not seems to me as reliable, the approach should be I think somewhat like:
GC_do_blocking(yield, <coroutine_to_yield_to>)

@ivmai
Copy link
Owner

ivmai commented Sep 4, 2022

Basically we need to consider behavior of bdwgc (and how client should inform bdwgc) during these operations about stackful coroutines (fibers):

  • create
  • start
  • suspend
  • resume
  • yield: this is suspend followed by resume
  • finish
  • destroy
  • garbage_collection

@ivmai
Copy link
Owner

ivmai commented Sep 4, 2022

Tip for me: Some of popular coroutine libraries:

@ivmai
Copy link
Owner

ivmai commented Sep 4, 2022

I think bdwgc already has the necessary API (it looks to me, I have no PoC at present) to deal with coroutines correctly (and efficiently) but certain changes should be done in the thread-related code of gc.

How client could inform gc about coroutines:

  • create - do nothing
  • start - call GC_call_wtih_stack_base({ GC_register_my_thread(); ; GC_unregister_my_thread(); }) // i.e. treat coroutine as a thread
  • yield - wrap yield (or switchTo) into GC_do_blocking
  • finish - call GC_unregister_my_thread() (already mentioned in start)
  • destroy - do nothing.

gc internals to fix:

  • GC_lookup_thread should skip entries in do-blocking state and return the first found one (there could be many with same thread it)
  • thread suspension and scanning: if thread is in do-blocking state (i.e. blocked) then do not suspend it and use save sp regardless of whether current thread is self or not (Win32 is already fixed but Darwin and Linux should be fixed)

Notes:

  • GC_do_blocking should be also fixed (probably later) to deal with the case when a coroutine could scheduled by multiple threads (not in parallel of course) - the fix is to update stored thread id (and rehash entry in GC_threads table if needed) on return from client code in GC_do_blocking.
  • TLS-related changes are omitted, let's use --disable-thread-local-alloc for PoC for now
  • GC_call_with_gc_active should be fixed too, let's don't call it from gctest for now
  • GC_set_stackbottom should be fixed to allow to be called from thread in do-blocking state (let's do it later)
  • GC_lookup_thread implementation (and GC_threads[] entry) is inefficient in case of hundreds of coroutines scheduled by a single thread (because of linear search in GC_lookup_thread among entries with same thread id)

This approach should not be difficult to try, would be good to hear opinion about it.

@roberth
Copy link
Contributor Author

roberth commented Sep 4, 2022

// Save the current stack pointer to either the native stack or a stack from the extra stacks
// This must be called before switching to a different coroutine, so that the relevant parts of the inactive stack can be scanned
void GC_save_stack_pointer();
// or perhaps equivalently void GC_before_stack_switch();

This does not seems to me as reliable, the approach should be I think somewhat like: GC_do_blocking(yield, <coroutine_to_yield_to>)

It seems that inactive has a distinct meaning in bdwgc that I did not intend to reference. I meant stack of a suspended coroutine.

My thinking was that on thread suspension, you could suspend all "OS" threads, giving you the stack pointers of the active coroutines ( / normal threads). The suspended (by yield) ones don't need suspension by stop_world. By saving the stack pointer before leaving, you should have enough information stored to mark those suspended stacks as well.
This way, you're not pretending that coroutines are full-on threads and you don't need to do any expensive operations around yield.

By accepting coroutines, stacks aren't 1:1 with threads anymore, so from a design perspective it would make sense to treat them separately. A thread_t doesn't identify a stack anymore, whereas a stack pointer does. By pretending that threads and stacks are still 1:1 you're "deceiving" the rest of the code so to speak. That might be a valid choice if you want to change the bdwgc interface as little as possible, but it feels risky.

I hope my ideas are helpful, but please consider them with a grain of salt, as I don't have nearly your knowledge of the code and the intricacies of the system.

@bcardiff
Copy link
Contributor

bcardiff commented Sep 4, 2022

yield - wrap yield (or switchTo) into GC_do_blocking

In order to allow concurrent fiber switches in Crystal we use a RWLock. Switching fibers acquire a reader, collections acquire a writer.

It can be an improvement for later if the API hides whatever lock. But using a single mutex might impact in performance a lot. At least that was the case in Crystal.

@ivmai
Copy link
Owner

ivmai commented Sep 5, 2022

In order to allow concurrent fiber switches in Crystal we use a RWLock. Switching fibers acquire a reader, collections acquire a writer.
It can be an improvement for later if the API hides whatever lock. But using a single mutex might impact in performance a lot. At least that was the case in Crystal.

@bcardiff, I agree. There could be 2 solutions:

  • a quick W/A: it is possible to expose GC_lookup_thread+do_blocking_enter and do_blocking_enter as public - then you will be able to use a reader lock in your own implementation of GC_do_blocking (I think it is better not to do this)
  • a long-term approach: rewrite bdwgc to use RWLock - e.g. introduce READER_[UN}LOCK() and use it for GC_do_blocking and other functions (this could be done incrementally function by function).

@ivmai
Copy link
Owner

ivmai commented Sep 5, 2022

@roberth, thank you for your opinion, I will consider it.

It seems that inactive has a distinct meaning in bdwgc that I did not intend to reference.

Yes, initial intention was to avoid suspend signal to come while client executes some syscall. But essentially it could be treated as save_stack_pointer (and indicate that signal should not be sent because the stack pointer is saved already and the thread is not mutating any GC pointers).

I don't see any big difference between stackful coroutines and threads (from the GC's side) - both have stack and context.
But if adding new API would let to deal specifically with the coroutines more efficiently, let's consider it. Right now I don't fully understand what GC_extra_stack_handle (This saves the stack info for the GC to search through when the sp is not in the native stack) and GC_save_stack_pointer (Save the current stack pointer to either the native stack or a stack from the extra stacks) should do. Would be nice if you could provide some pseudo code explaining the internals.

@ivmai
Copy link
Owner

ivmai commented Sep 30, 2022

What happens to coroutines on a process fork? It seems to me that they should survive in the child process. If someone has experience with it, let me know please.

@bcardiff
Copy link
Contributor

IIRC in Crystal we do fork to perform an exec only, so we don't need to keep coroutines.

@ivmai
Copy link
Owner

ivmai commented Oct 1, 2022

By accepting coroutines, stacks aren't 1:1 with threads anymore, so from a design perspective it would make sense to treat them separately.

@roberth, after trying to go with an approach based on GC_do_blocking, I'm more and more convinced that to support coroutines there's a need to separate "stack information" from thread: GC_thread entity should merely contain only thread id, thread-local free lists and pointer to named as e.g. GC_coroutine entity (containing the rest of the information), and GC_push_all_stacks should traverse the table containing all GC_coroutine entities, both active (suspended by GC_stop_world) and inactive ones.

@evrim
Copy link

evrim commented Nov 16, 2022

I need this too for ecl coroutines support. I've implemented the API but bdwgc collects all the way in and I get segfault. ecl uses setjmp/longjmp to manage non-local exits.

ivmai added a commit that referenced this issue Dec 11, 2022
(refactoring)

Issue #362 (bdwgc).

* darwin_stop_world.c (GC_stack_range_for): Define and crtn local
variable; use crtn instead of p to access stack_ptr, topOfStack,
stack_end, altstack, altstack_size, normstack, normstack_size,
backing_store_end, backing_store_ptr.
* pthread_stop_world.c [!NACL] (GC_suspend_handler_inner,
GC_push_all_stacks): Likewise.
* pthread_support.c [!GC_NO_FINALIZATION] (GC_reset_finalizer_nested,
GC_check_finalizer_nested): Likewise.
* pthread_support.c [!GC_WIN32_THREADS] (GC_register_altstack):
Likewise.
* pthread_support.c (do_blocking_enter, do_blocking_leave,
GC_set_stackbottom, GC_get_my_stackbottom, GC_call_with_gc_active):
Likewise.
* win32_threads.c (GC_push_stack_for, GC_get_next_stack): Likewise.
* darwin_stop_world.c (GC_push_all_stacks): Use p->crtn instead of p
to access traced_stack_sect.
* win32_threads.c (GC_suspend, GC_stop_world, GC_start_world):
Likewise.
* include/private/pthread_support.h (GC_StackContext_Rep): New struct
type (move dummy, stack_end, stack_ptr, last_stack_min,
initial_stack_base, topOfStack, backing_store_end, backing_store_ptr,
altstack, altstack_size, normstack, normstack_size, finalizer_nested,
finalizer_skipped, traced_stack_sect from GC_Thread_Rep).
* include/private/pthread_support.h [!GC_NO_THREADS_DISCOVERY
&& GC_WIN32_THREADS] (GC_StackContext_Rep.stack_end): Add volatile.
* include/private/pthread_support.h [!GC_NO_FINALIZATION]
(GC_StackContext_Rep.fnlz_pad): New field.
* include/private/pthread_support.h (GC_stack_context_t): New type.
* include/private/pthread_support.h (GC_Thread_Rep.crtn,
GC_Thread_Rep.flags_pad): New field.
* include/private/pthread_support.h [GC_NO_FINALIZATION]
(GC_Thread_Rep.no_fnlz_pad): Remove field.
* include/private/pthread_support.h (GC_threads): Move (and refine)
comment from GC_Thread_Rep.
* include/private/pthread_support.h [GC_WIN32_THREADS]
(GC_record_stack_base): Change me argument to crtn.
* pthread_stop_world.c [!NACL] (GC_store_stack_ptr): Likewise.
* pthread_support.c (GC_record_stack_base): Likewise.
* pthread_stop_world.c [NACL] (NACL_STORE_REGS, nacl_pre_syscall_hook,
__nacl_suspend_thread_if_needed): Use p->crtn instead of p to access
stack_ptr.
* pthread_support.c (first_crtn): New static variable.
* pthread_support.c (first_thread): Update comment.
* pthread_support.c (first_thread_used): Remove variable.
* pthread_support.c (GC_push_thread_structures): Push
first_thread.crtn symbol
* pthread_support.c (GC_push_thread_structures): Push
first_crtn.backing_store_end instead of that in first_thread.
* pthread_support.c [MPROTECT_VDB && GC_WIN32_THREADS]
(GC_win32_unprotect_thread): Call GC_remove_protection() for t->crtn.
* pthread_support.c (GC_new_thread): Use first_thread.crtn!=0 instead
of first_thread_used; set first_thread.crtn to &first_crtn; allocate
GC_StackContext_Rep object using GC_INTERNAL_MALLOC() and store the
pointer to result->crtn.
* pthread_support.c [CPPCHECK] (GC_new_thread): Call GC_noop1() for
first_thread.flags_pad, for first_crtn.dummy instead of result->dummy,
and for first_crtn.fnlz_pad instead of result->no_fnlz_pad.
* pthread_support.c (GC_delete_thread): Call GC_INTERNAL_FREE(p->crtn)
along with that for p.
* pthread_support.c [CAN_HANDLE_FORK && (!THREAD_SANITIZER
|| !CAN_CALL_ATFORK)] (GC_remove_all_threads_but_me): Likewise.
* pthread_support.c [GC_PTHREADS] (GC_pthread_join, GC_pthread_detach):
Likewise.
* pthread_support.c (GC_segment_is_thread_stack,
GC_greatest_stack_base_below): Use p->crtn instead of p to access
stack_end.
* pthread_support.c (GC_call_with_gc_active): Move assertions about me
fields to be after LOCK; add assertion that me->crtn==crtn.
* win32_threads.c (dll_thread_table): Make it static.
* win32_threads.c [!GC_NO_THREADS_DISCOVERY] (dll_crtn_table): New
static variable.
* win32_threads.c (GC_register_my_thread_inner): Reformat comment;
set me->crtn.
* win32_threads.c (GC_push_stack_for): Define stack_end local variable;
immediately return (zero) if stack_end is NULL.
* win32_threads.c (GC_push_all_stacks): Call GC_push_stack_for() (and
increment nthreads) even if stack_end is NULL.
* win32_threads.c (GC_get_next_stack): Rename s local variable to
stack_end.
ivmai added a commit that referenced this issue Dec 11, 2022
(fix of commit 7eb49a4)

Issue #362 (bdwgc).

* darwin_stop_world.c (GC_stack_range_for): Do not set crtn local
variable value until p is guaranteed to be non-NULL (i.e., do not set
crtn value unless DARWIN_DONT_PARSE_STACK).
ivmai added a commit that referenced this issue Dec 31, 2022
(fix of commits 7eb49a4, 128c0ec)

Issue #362 (bdwgc).

* darwin_stop_world.c (GC_stack_range_for): Define crtn local variable
only if DARWIN_DONT_PARSE_STACK.
* darwin_stop_world.c [DARWIN_DONT_PARSE_STACK && CPPCHECK]
(GC_stack_range_for): Call ABORT if p is NULL (before crtn = p->crtn).
@ivmai ivmai changed the title Documentation about integrating with coroutines + question Coroutines support Mar 1, 2023
@ivmai
Copy link
Owner

ivmai commented Mar 1, 2023

Here are the proposed API primitives for the coroutines support in bdwgc (implementation is still in progress, API may still subject to change):

// ucp_id is a coroutine/fiber handle provided by the underlying library/OS
// the sample is given for Win32 fibers mostly (similar way should be for make/set/swapcontext())
void GC_crtn_from_self(void *ucp_id); // usage: fiber = ConvertThreadToFiber(arg); GC_crtn_from_self(fiber);
void GC_crtn_create(void *ucp_id); // usage: fiber = CreateFiber(wrapped_fn, arg); GC_crtn_create(fiber);
void GC_crtn_deinit(void *ucp_id); // usage: 1. GC_crtn_deinit(fiber); DeleteFiber(fiber); or 2. GC_crtn_deinit(0); ConvertFiberToThread();
void GC_crtn_start(void *ucp_id, const struct GC_stack_base *sb); // usage: wrapped_fn(arg) { GC_call_with_stack_base({ GC_crtn_start(GetCurrentFiber(), sb); res = fn(arg); GC_crtn_ret(); return res; }); }
void GC_crtn_ret(void); // usage: 1. above; or 2. GC_crtn_ret(); co_return/setcontext();
void *GC_crtn_resume(resume_fn(ucp, cd), void *ucp, void *cd); // usage: GC_crtn_resume(co_await, fn, arg); co_yield or SwitchToFiber, or swapcontext could be also passed as resume_fn (ucp and cd are just custom args).

@roberth
Copy link
Contributor Author

roberth commented Mar 2, 2023

// ucp_id is a coroutine/fiber handle provided by the underlying library/OS

What does ucp stand for?
How do I make one for boost coroutines2? (We use protected_fixedsize_stack)

That coroutine implementation lets us hook into the stack allocation and deallocation methods, so I suppose we'd call GC_crtn_{create,deinit} in those places.

It does not have a concept of converting between threads to fibers and back, so I suppose we'd ignore GC_crtn_from_self and GC_crtn_deinit(0), or is this necessary in order to ensure that the original thread stack is retained in bdwgc's list of stacks?

When do we call GC_crtn_start? We already know the stack base of the coroutine when we allocate it. Should we "start" it immediately?

The GC_crtn_ret and GC_crtn_resume do not appear to be a good match for boost coroutine, as the primitives mentioned appear to be a bit more low level and do not help with the passing of values between the coroutines. At least, boost coroutine doesn't let use override or wrap the low level primitives it uses. If I recall correctly, the need for these two function came from the way bdwgc's internal stack bookkeeping had to be updated around switch time, but if bdwgc knows about all stacks, couldn't it derive the right information from the stack pointer at any given time; before, after and during the stack switch?
If the need for these functions can not be avoided, I think we'll have to move away from boost coroutine, and it won't be possible to integrate with some other fiber/coroutine implementations either, such as GHC's green thread stacks. That may not be a huge loss, because it's a pretty wild use case, but a bit of a loss nonetheless. (I think it's fine for those rare Haskell programs to use explicit GC roots all the time, instead of relying on stack scanning)

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

What does ucp stand for?

It is just some custom identifier of a coroutine (e.g. it could be pointer returned by CreateFiber, or be of type ucontext_t* passed to swapcontext), the only requirement is it should be unique, does not change while coroutine is alive, and should be non-zero)

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

That coroutine implementation lets us hook into the stack allocation and deallocation methods, so I suppose we'd call GC_crtn_{create,deinit} in those places.

Yes.

I suppose we'd ignore GC_crtn_from_self and GC_crtn_deinit(0), or is this necessary in order to ensure that the original thread stack is retained in bdwgc's list of stacks?

It is necessary for bdwgc. You can call GC_crtn_resume() only if the caller is a coroutine itself. This is similar to Win32 ConvertThreadToFiber purpose. This is needed for efficiency inside bdwgc (otherwise we would need to always put every thread into 2 tables - one for the thread and one for the case it is treated as a coroutine). The goods news is you do not need to call GC_crtn_deinit(0) (to convert a coroutine back to thread) - it is to be done automatically on thread destruction.

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

That coroutine implementation lets us hook into the stack allocation and deallocation methods, so I suppose we'd call GC_crtn_{create,deinit} in those places.

I should also note that once created you could reuse the entity multiple time before deinit (provided its ucp_id is not changed).

When do we call GC_crtn_start? We already know the stack base of the coroutine when we allocate it. Should we "start" it immediately?

Yes, OK to start immediately, but you need to fill in sb yourself. There is no way to set the stack bottom before coroutine starts running.

void my_coroutine(arg)
{
  struct GC_stack_base sb;
  sb.mem_base = stack_base + stack_size;
  // if collection occurs here, the stack is not scanned (this means the caller should ensure arg object is visible to bdwgc o/w arg might be GC'ed)
  GC_crtn_start(ucp_id, &sb); // ucp_id should be already known to bdwgc (by calling GC_crtn_create in advance)
  ...
  GC_crtn_resume(...);
  ...
  GC_crtn_ret();
}

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

Yes, OK to start immediately, but you need to fill in sb yourself.

Let's say, use of GC_call_with_stack_base() is a portable way of filling in sb variable.

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

If I recall correctly, the need for these two function came from the way bdwgc's internal stack bookkeeping had to be updated around switch time, but if bdwgc knows about all stacks, couldn't it derive the right information from the stack pointer at any given time; before, after and during the stack switch?

Let's return back to this suggestion later (yes, I see this should be technically possible as you know the stack base and size in advance) if no better variants.

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

The GC_crtn_ret and GC_crtn_resume do not appear to be a good match for boost coroutine, as the primitives mentioned appear to be a bit more low level and do not help with the passing of values between the coroutines. At least, boost coroutine doesn't let use override or wrap the low level primitives it uses.

Please give me a sample of code.

GC_crtn_ret() should be placed just before each co_return (and before each return statement of the top-level coroutine function, and as a last statement of the top-level coroutine function). The expression passed to co_return should be simple ,e.g.:

co_return GC_malloc(1);
->
void *obj = GC_malloc(1);
GC_crtn_ret(); // actually I think I should modify GC_crtn_ret to accept 1 argument (obj in this case).
// If collection occurs here, the stack is not scanned.
co_return obj;

@ivmai
Copy link
Owner

ivmai commented Mar 2, 2023

GC_crtn_resume do not appear to be a good match for boost coroutine

The support of context switching (GC_crtn_resume) is most difficult part, it's implementation is very similar to GC_do_blocking() one, GC_crtn_resume does:

  • save all registers to stack
  • record the stack top
  • disassociate coroutine stack info (GC_stack_context_t) with the current thread (saved_crtn = me->crtn; me->crtn = 0; saved_crtn->running = false)
  • call the underlying context swapper // the execution continues somewhere else until returned back to this point
  • restore coroutine stack info pointer (me->crtn = saved_crtn) and set back running state.

For convenience resume_fn accepts 2 arguments with client-defined semantics (like koishi_resume). Both symmetric and asymmetric coroutines should be supported.

@ivmai
Copy link
Owner

ivmai commented May 17, 2023

@bcardiff, do you push all registers to the stack during context switch (e.g. by GC_with_callee_saves_pushed) or store them to context structure (and later scan them there)?

@bcardiff
Copy link
Contributor

@ivmai No, not all. In crystal we do the context switch as if we would do a normal C call mostly + tweaking the stack pointer register to change the returning stack.

The (assembly) code that depends on the arch can be found at https://github.com/crystal-lang/crystal/blob/master/src/fiber/context/x86_64-sysv.cr#L22-L40 (and other files in the same directory).

We preserve the stack pointer in the co-routine structure so we can restore it later, but other than that the registers are preserved in the stack itself according to the calling convention.

Does this helps?

@ivmai
Copy link
Owner

ivmai commented May 19, 2023

@bcardiff, thank you, I see, you have own swap context routine which saves regs to the stack and then, during push_other_roots, ps is extracted from the coroutine context.

@bcardiff
Copy link
Contributor

Yes. That's right. How we push roots is essentially what you say. There is a bit of more details described in the following blog post https://crystal-lang.org/2022/02/16/bdw-gc-coroutines-support/

@ivmai
Copy link
Owner

ivmai commented May 20, 2023

@roberth, I have modified your patch for upstream master, but still not sure if it is worth upstreaming, still thinking.

diff --git a/include/private/gcconfig.h b/include/private/gcconfig.h
index 16015195..19919595 100644
--- a/include/private/gcconfig.h
+++ b/include/private/gcconfig.h
@@ -3105,6 +3105,17 @@ EXTERN_C_BEGIN
 # define HAVE_NO_FORK
 #endif
 
+#if defined(PTHREAD_STOP_WORLD_IMPL) \
+    && !defined(E2K) && !defined(IA64) && !defined(REDIRECT_MALLOC) \
+    && defined(HAVE_PTHREAD_GETATTR_NP) && defined(GC_LINUX_THREADS)
+  /* Have "crude" (minimalist) support of coroutines (ability to deal   */
+  /* with threads those stack pointer (sp) is lost because of going     */
+  /* into a coroutine).                                                 */
+  /* Note: unimplemented in case of redirection of realloc() because    */
+  /* pthread_getattr_np() may call realloc().                           */
+# define CRUDE_CORO_SUPPORT
+#endif
+
 #if !defined(USE_MARK_BITS) && !defined(USE_MARK_BYTES) \
     && defined(PARALLEL_MARK)
   /* Minimize compare-and-swap usage.   */
diff --git a/pthread_stop_world.c b/pthread_stop_world.c
index f537283f..e50ed6d5 100644
--- a/pthread_stop_world.c
+++ b/pthread_stop_world.c
@@ -865,6 +865,42 @@ GC_INNER void GC_push_all_stacks(void)
             hi = crtn -> altstack + crtn -> altstack_size;
 #         endif
           /* FIXME: Need to scan the normal stack too, but how ? */
+        } else {
+#         ifdef CRUDE_CORO_SUPPORT
+              pthread_attr_t attr;
+              void *stackaddr = NULL;
+              size_t stack_sz = 0;
+
+              /* TODO: Call it once and record stack addr/size in p */
+              if (pthread_getattr_np(p -> id, &attr) != 0)
+                ABORT("pthread_getattr_np failed");
+              if (pthread_attr_getstack(&attr, &stackaddr, &stack_sz) != 0
+                  || NULL == stackaddr || 0 == stack_sz)
+                ABORT("pthread_attr_getstack failed");
+              (void)pthread_attr_destroy(&attr);
+
+              /* When a thread goes into a coroutine, we lose its       */
+              /* original sp until control flow returns to the thread.  */
+              /* While in the coroutine, the sp points outside the      */
+              /* thread stack, so we can detect this and push the       */
+              /* entire thread stack instead, as an approximation.      */
+              /* We assume that all registers were pushed to the thread */
+              /* stack when going into a coroutine.  It is also assumed */
+              /* that the client arranges the stacks of all coroutines  */
+              /* to be pushed entirely (e.g., by allocating them with   */
+              /* GC_malloc or registering them as static data roots).   */
+              /* We assume the entire stack is mapped.                  */
+#             ifdef STACK_GROWS_UP
+                GC_ASSERT((word)stackaddr <= (word)hi);
+                if ((word)lo <= (word)hi
+                    || (word)lo >= (word)stackaddr + stack_sz)
+                  lo = (ptr_t)stackaddr + stack_sz;
+#             else
+                GC_ASSERT((word)hi <= (word)stackaddr + stack_sz);
+                if ((word)lo <= (word)stackaddr || (word)lo >= (word)hi)
+                  lo = (ptr_t)stackaddr;
+#             endif
+#         endif
         }
         GC_push_all_stack_sections(lo, hi, traced_stack_sect);
 #       ifdef STACK_GROWS_UP

@ivmai
Copy link
Owner

ivmai commented May 20, 2023

@roberth, you wrote

The GC_crtn_ret and GC_crtn_resume do not appear to be a good match for boost coroutine, as the primitives mentioned appear to be a bit more low level and do not help with the passing of values between the coroutines. At least, boost coroutine doesn't let use override or wrap the low level primitives it uses

Do you mean it is hard to wrap yield() calls in serialize.cc by GC_crtn_resume?

yield();
if (yield.get()) ...

->

GC_crtn_resume(yield_fn, this, nulptr); // where yield_fn(o, dummy) = { o.yield(); }
if (yield.get()) ...

and

if (!data.empty()) yield(std::string(data));

->

if (!data.empty()) {
  auto s = std::string(data);
  GC_crtn_resume(yield_fn2, this, s); // where yield_fn2(o, s) = { o.yield(s); }
}

Previously you proposed the API to inform the GC about context switches (GC_before_stack_switch, etc.), the usage looks like this:

if (!data.empty()) {
  auto s = std::string(data);
  {void *ctx = GC_crtn_before_resume(); // you named it as GC_before_stack_switch
  yield(s);
  GC_crtn_after_resume(ctx);}
}

This, of course, is easier to use but the fundamental issue with it is to guarantee all registers are saved and to get sp value after the save.

@roberth
Copy link
Contributor Author

roberth commented May 20, 2023

I think I meant that it would not be possible to integrate it with boost coroutines without modifying the upstream implementation.

the fundamental issue with it is to guarantee all registers are saved and to get sp value after the save.

Isn't that a fundamental issue in general? Doesn't bdwgc have to solve that problem regardless of where the stack is?

Previously you proposed the API to inform the GC about context switches (GC_before_stack_switch, etc.), the usage looks like this:

Not really. I think the first name I gave for it better reflects the intent: GC_save_stack_pointer. It would only save the pointer, and it would only be used if there wasn't any better information to go by: only when that stack is not executing. When it's executing, we should always the stack pointer from the existing thread stopping code.

Whether that saved, backup stack pointer is good enough would depend on the implementation of yield, and I can see that it might store relevant pointers beyond the saved stack pointer when saving registers. How does your solution solve that? Doesn't it also have to call yield at some point and hope for the best without any means to figure out where such information was stored? It seems that you're still making an assumption about the calling convention, but maybe your assumption is good enough?

I must admit that my understanding of the machinery involved isn't as thorough as I would need to have a truly productive conversation about what the interface may look like, so I'll be quite happy to accept what you think the interface has to be. You're doing the work after all, and only an interface that's correctly implementable is useful.

@ivmai
Copy link
Owner

ivmai commented May 22, 2023

@roberth, please check commit f7a0708 - it adds API to check and modify sp during push_all_stacks.
Your initialization code might look like:

static void sp_corrector(void** sp_ptr, void* tid)
{
  pthread_attr_t attr; void *stackaddr; size_t stack_sz;
  word sp = (word)*sp_ptr;

  if (pthread_getattr_np((pthread_t)tid, &attr)) abort();
  if (pthread_attr_getstack(&attr, &stackaddr, &stack_sz)) abort();
  pthread_attr_destroy(&attr);
  if (sp <= (word)stackaddr || sp >= (word)stackaddr + stack_sz) *sp_ptr = (void*)stackaddr;
}

int main() {
  ...
  GC_set_sp_corrector(sp_corrector);
  if (!GC_get_sp_corrector()) disable_gc_when_in_coroutine(); // sp correction is not supported
  ...
}

If it works for you, then I land the patch to gc-8.2.4 (so you would be able e.g. detect presence of GC_set/get_sp_corrector() by bdwgc version comparison).

PS. This is, of course, a workaround, the work in this issue is to be continued for the proper portable support of coroutines (it will be released in future v8.4.0).

@roberth
Copy link
Contributor Author

roberth commented May 23, 2023

@ivmai That's much appreciated, but I can't promise that we get around to it soon. Patching the gc is not an obstacle at all for us, as Nix is usually built through Nix, which makes that easy.
The current implementation seems to hold up and I'd like to be conservative with change in this area, as changes are not easy to validate.
I think I'd prefer to invest time in a more final solution, and I wouldn't want to burden you with maintaining a workaround.

@ivmai
Copy link
Owner

ivmai commented May 23, 2023

Good, let's work together for the final solution.

@bcardiff
Copy link
Contributor

bcardiff commented May 24, 2023

I was thinking that maybe an approach that would work is to focus more on the stacks rather than in the coroutines. It would be nice if the GC is aware of the possible stacks and treat them as roots directly without the user needing to register those roots before collect. Of course this is influenced by the code is working in Crystal and what I imagine would've been nicer/simpler in terms of GC API.

The (coroutines) stacks need to be allocated by the user since there might be some logic there. In Crystal for example we use a memory pool.

The GC would need to know for a given stack if that is running in some thread or not. If it's not running, add it as root. If it's running the stack bottom of it should be the one used for that thread.

This is essentially Crystal's before collect callback: https://github.com/crystal-lang/crystal/blob/9b97e8483528a2c26c3e57afa3704f8ebb429142/src/gc/boehm.cr#L352-L367

To do this we would need a structure to represent the stacks and a way to extract some runtime information:

  • Are the stacks running in a coroutine?
    • if so in which thread?
    • if not, what is their stack top?
  • What is the stack bottom of the stack?
// a callback that give a stack_ref will return in which pthread_id the stack is being executed or NULL if it's not being executed.
// if *pthread_id = NULL then the stack_top != NULL
// always stack_bottom != NULL
typedef void (GC_CALLBACK * GC_stack_execution_state_proc(void *stack_ref, void **pthread_id, void **stack_bottom, void **stack_top))

GC_API void GC_CALL GC_set_stack_execution_state(GC_stack_execution_state_proc);

We need to create/destroy stack_ref so the GC is aware of the whole universe of stacks.

GC_API void GC_CALL GC_register_stack_ref(void *stack_ref);
GC_API void GC_CALL GC_delete_stack_ref(void *stack_ref);

The user will still need to use GC_get_my_stackbottom to register a stack_ref that will represent the native thread stack, but I don't think there is much way around it. Some stacks are heap allocated and some stacks are thread native.

How the user's runtime do the context switch will be out of scope for the GC. The only needed bit is to have the information provided by GC_stack_execution_state_proc so a built-in before collect can register the roots and fix the thread's stackbottom.

As I mentioned before in Crystal it was crucial to have a RWLock for this to work. So maybe that should happen before this API.

@bcardiff
Copy link
Contributor

I edited the previous comment. I was missing the stack bottom in the GC_stack_execution_state_proc.

@ivmai
Copy link
Owner

ivmai commented May 24, 2023

As I mentioned before in Crystal it was crucial to have a RWLock for this to work. So maybe that should happen before this API.

Maybe, but for now I think I will return to it (#473) after finishing with this issue.

@ivmai
Copy link
Owner

ivmai commented May 24, 2023

It would be nice if the GC is aware of the possible stacks and treat them as roots directly without the user needing to register those roots before collect.
...
We need to create/destroy stack_ref so the GC is aware of the whole universe of stacks

Agree, we need to provide create/destroy API functions. Details of this create/destroy API may vary:

  • Robert's variant: pass bottom/top to the creation function which returns some pointer to opaque struct
  • mine variant: pass some custom unique id to the creation function which, e.g. could be stack bottom or handle returned by Win32 CreateFiber

A small question here: should bdwgc remove the coroutine stack if there no reference to it from the client? I think No because the stack might be some native memory which is not managed by the bdwgc.

Other misc related questions:

  • Is it OK to acquire GC global lock during create/destroy? I think Yes (because coroutines creation and deletion are not as frequent compared to context switching)
  • Should we support coroutines recycling w/o the need to call destroy and create? Probably, would be nice but not of the first priority
  • Should we be able to destroy incomplete (but not running) coroutine? I think Yes

@joe-conigliaro
Copy link

joe-conigliaro commented Feb 9, 2024

Hello,

Thanks to everyone and their work here 😄

I've been working on coroutines in the V language, and I experienced segfault's due to the GC's incorrect thread stack base.
We are currently using Photon, but this applies to any stackfull coroutines I guess.

GC_set_sp_corrector was exactly what I needed to update the stack pointer, I have added here a version of the sp_corrector function which supports MacOS, Windows, and Linux.

static void sp_corrector(void** sp_ptr, void* tid) {
    size_t stack_size;
    char* stack_addr;
#ifdef __APPLE__
    stack_size = pthread_get_stacksize_np((pthread_t)tid);
    stack_addr = (char*) pthread_get_stackaddr_np((pthread_t)tid);
#elif defined(_WIN64)
    ULONG_PTR stack_low, stack_high;
    GetCurrentThreadStackLimits(&stack_low, &stack_high);
    stack_size = stack_high - stack_low;
    stack_addr = (char*)stack_low;
#elif defined(__linux__)
    pthread_attr_t gattr;
    pthread_getattr_np((pthread_t)tid, &gattr);
    pthread_attr_getstack(&gattr, (void**)&stack_addr, &stack_size);
    pthread_attr_destroy(&gattr);
#else
    assert("unsupported platform");
#endif
    char *sp = (char*)*sp_ptr;
    if(sp <= stack_addr || sp >= stack_addr+stack_size) {
        *sp_ptr = (void*)stack_addr;
    }
}

Thanks for the great help @ivmai 🥂

I think you forgot to add some parenthesis in the example sp_corrector function you posted above.
I'm pretty sure it should be if (sp <= (word)stackaddr || sp >= (word)(stackaddr + stack_sz)) instead of if (sp <= (word)stackaddr || sp >= (word)stackaddr + stack_sz)

What is the benefit of that check, apposed to just updating the stack pointer every time? I'm just wondering if any performance benefit is negated by the check itself 🤔

@ivmai
Copy link
Owner

ivmai commented Feb 10, 2024

Hello @joe-conigliaro
Thank you for the feedback about sp_corrector! It is an intermediate (quick) solution to address a request the coroutine support, I hope the future v8.4.0 release will come with the proper solution.

I have added here a version of the sp_corrector function which supports MacOS, Windows, and Linux.
...
sp_ptr = (void)stack_addr;

Please note that this code works only for platforms with the stack growing down (but MacOs, Win32 and Linux always have stack growing down, so it is OK, this is just a note).

I think you forgot to add some parenthesis in the example sp_corrector function you posted above.
I'm pretty sure it should be:
if (sp <= (word)stackaddr || sp >= (word)(stackaddr + stack_sz))
instead of:
if (sp <= (word)stackaddr || sp >= (word)stackaddr + stack_sz)

No, in my sample above stackaddr is of void* (thus to should be either converted to char* or to word before adding stack_sz).

What is the benefit of that check, apposed to just updating the stack pointer every time? I'm just wondering if any performance benefit is negated by the check itself?

This is essential for performance! In most cases the sp is supposed to be between stack top/bottom (e.g. probably relatively close to stackaddr+stack_sz value). With this check the collector will scan only space between current sp and stackaddr+stack_sz, without this check - berween stackaddr and stackaddr+stack_sz.

@joe-conigliaro
Copy link

joe-conigliaro commented Feb 10, 2024

Thanks for the feedback @ivmai!

Thank you for the feedback about sp_corrector! It is an intermediate (quick) solution to address a request the coroutine support, I hope the future v8.4.0 release will come with the proper solution.

I understand that it would be ideal to just update the thread stack base when context switching happens, unfortunately that is not always possible. At least we have a solution for now, until a proper fix is added :)

No, in my sample above stackaddr is of void* (thus to should be either converted to char* or to word before adding stack_sz).

Ahh I see, oops silly me! 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants