
[prototype] boost fiber instead of asio for async #16699

Closed
wants to merge 18 commits

Conversation

fishbone
Contributor

@fishbone fishbone commented Jun 28, 2021

Why are these changes needed?

  • Add gRPC client-side support for fibers.
  • Use a fiber for GetObjectLocationOwner and update its usage in the code.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone added the do-not-merge Do not merge this PR! label Jun 28, 2021
@ericl
Contributor

ericl commented Jun 29, 2021

What's blocking us from merging this? I guess:

  • Feature parity with event stats for ASIO
  • If we depend on the asio loop for thread safety, we need to port all other callbacks on the same loop to fibers as well

cc @pcmoritz @stephanie-wang for comments

@@ -40,6 +40,14 @@ namespace rpc {
INVOKE_RPC_CALL(SERVICE, METHOD, request, callback, rpc_client); \
}

#define FIBER_RPC_CLIENT_METHOD(SERVICE, METHOD, rpc_client, SPECS) \
Contributor

Will this have reduced performance compared to the RPCs run directly on gRPC threads?

How does thread pooling work for fibers? Do we have a pool of fiber threads?

Contributor Author

In the gRPC thread, it'll move the result out, so it won't reduce performance.
Comparison:

  • Previously, we posted a job to the io_thread pool, i.e. a job was created in the asio pool.
  • Here, we pass the result to a promise with a move. The extra cost is moving the structure, which is OK. For example, this is the pub/sub message's move assignment: if the arenas are the same, it just swaps, and we always use the default arena.
  inline PubMessage& operator=(PubMessage&& from) noexcept {
    if (GetArena() == from.GetArena()) {
      if (this != &from) InternalSwap(&from);
    } else {
      CopyFrom(from);
    }
    return *this;
  }

No, we don't have a pool of fiber threads in this prototype. We just reuse the thread that calls into CoreWorker, which can be optimized later. Ideally we should put everything that is not CPU-intensive onto one thread (the fiber thread) and put CPU-intensive work into a CPU thread pool.

A fiber thread is like an asio thread; the difference is that asio only manages tasks, where a task is a closure, while fiber also manages the context and scheduling of those tasks. With asio we wrap everything into lambda captures and lose the call stack; with fiber the state lives in the fiber's context, which makes tasks resumable.

In the middle of the migration we can share the same thread with asio; if we go this way, we can migrate granularly.
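
To make the mechanism above concrete, here is a minimal sketch of the wrapping pattern (the helper name CallSync and the Reply/AsyncCall parameters are hypothetical, not the actual macro in this PR): the gRPC callback moves the reply into a boost::fibers::promise, and the calling fiber blocks on the matching future without blocking the underlying thread.

#include <boost/fiber/future.hpp>
#include <utility>

// Sketch only: wrap a callback-style async RPC into a call that blocks the
// current fiber. The gRPC thread fulfills the promise by moving the reply in;
// future.get() suspends only this fiber while other fibers keep running.
template <typename Reply, typename AsyncCall>
Reply CallSync(AsyncCall async_call) {
  boost::fibers::promise<Reply> promise;
  boost::fibers::future<Reply> future = promise.get_future();
  // Capturing the local promise by reference is fine here because we block
  // in this scope until the callback has fulfilled it.
  async_call([&promise](Reply reply) { promise.set_value(std::move(reply)); });
  return future.get();
}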

@scv119
Contributor

scv119 commented Jun 29, 2021

My major concern is the team's expertise with fiber.

  • I was under the impression that fiber is a bit tricky to use and debug, and lacks compiler support [1].
  • I know Facebook tried using fiber and moved away from it [2].
  • Another thing to take into consideration is that coroutines have become part of the C++20 standard; they should be very similar to fibers but with language/compiler support.

To summarize my take on the trade-offs of adopting boost::fiber:

  • pros: we should see a big win in terms of code readability and a slight win in performance. My guess is that most of the time we won't have any issues with it.
  • cons: if a weird fiber bug ever happens, it might be very hard to debug (due to limited adoption, heterogeneous runtime environments, and compilers). Also, if the C++ standard eventually moves to coroutines, we might need to migrate to coroutines in the future.

@ericl
Contributor

ericl commented Jun 29, 2021 via email

@fishbone
Contributor Author

@scv119 I read the doc and I don't think its arguments hold up well. Go's goroutines are also stackful; that gives you benefits but can suffer from stack overflow. As for context switches, since we'd like to adopt a better threading model anyway, they should be avoidable.

Besides, for C++20, even current compilers don't support coroutines well; broad adoption will take a long time, maybe another five years, so I don't treat it as a solution here. Also, migrating later from a library-based coroutine to the compiler-supported one is not difficult. I remember you mentioned the code was hard to understand because of all the callbacks, right? And the code flow here is much cleaner compared with the previous one, right?

The performance gain comes from avoiding copying things into lambda captures compared with asio, but that's not the key benefit.

@fishbone
Contributor Author

@ericl promise/future/then can't give you something similar to this. Here, everything can be written in one function directly, but with promise/future/then we basically need to split the big function into parts, chaining functions together. Some patterns look awkward; for example, if you have a for loop and want to decide what to do next based on each result, it's not natural to write with promise/future/then (see the sketch after this comment).

If we were to build promise/future/then support around asio ourselves, I'd prefer something that is already built and has been maintained for a long time.
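
A hedged sketch of that point, using hypothetical Location/TryGetLocation names (not the Ray API): with a fiber, a retry loop that decides its next step from each result reads like ordinary synchronous code, whereas with promise/future/then it has to be rewritten as a chain of continuations that re-posts itself.

#include <boost/fiber/operations.hpp>
#include <chrono>
#include <optional>

struct Location { bool resolved = false; };
// Hypothetical call that blocks only the calling fiber (e.g. via a fiber future).
std::optional<Location> TryGetLocation();

Location ResolveLocation() {
  while (true) {
    if (auto loc = TryGetLocation(); loc && loc->resolved) {
      return *loc;  // the loop decides what to do next directly from the result
    }
    // Back off without blocking the underlying thread; other fibers keep running.
    boost::this_fiber::sleep_for(std::chrono::milliseconds(100));
  }
}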

@ericl ericl removed their assignment Jun 29, 2021
@fishbone
Contributor Author

@neverchanje

neverchanje commented Jul 9, 2021

I've just seen this PR by chance while searching GitHub for issues around boost.fiber. I don't have much knowledge of Ray, but I'm very interested in this discussion.

I recently built a Postgres database prototype using boost.fiber. The backend is multi-threaded with a thread-pool-like design, but with fibers inside, and the network module is based on Boost.Asio. So far I have not hit the issues stated in http://www.openstd.org/jtc1/sc22/wg21/docs/papers/2018/p1364r0.pdf. For instance, the paper says the fiber runtime cannot integrate with tcmalloc, but at least so far my server hasn't crashed because of tcmalloc.

Debuggability really is an issue. In my experience, application bugs are as easy to detect as in a normal program, but my real concern is that if the library isn't widely adopted and stable enough, I may not be able to dig out the library's own bugs. When I stepped through the Boost.Fiber call stack in gdb, the code was quite hard to follow.

I can share my fiber pool implementation if you are interested.

Furthermore, the paper also states that the fiber runtime requires the entire project to be refactored and every blocking API to be wrapped (via promise/future) into a fiber-aware API. This is true, but there is an approach where different runtimes can coexist:

For example, assume we have a project based on the actor model, where a task runs asynchronously and finally delivers its result to a callback. To refactor this task with boost.fiber, you can set up a set of isolated threads dedicated to fibers. The task is then sent to the "fiber-instrumented threads" and executed in a synchronous programming style, benefiting from fibers. When the task ends, it sends the result back to the "actor-instrumented threads" to finally execute the callback.

void RunTask(std::function<void()> task, std::function<void()> task_callback) {
  // Hop onto a fiber-instrumented thread, run the task in synchronous style,
  // then hand the callback back to the actor-instrumented threads.
  FiberAsync([task = std::move(task), task_callback = std::move(task_callback)]() {
    task();
    ActorAsync(task_callback);
  });
}
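
For readers wondering what a "fiber-instrumented thread" might look like, here is a rough, hedged sketch (not neverchanje's actual pool): a single dedicated thread fed through a boost::fibers::buffered_channel, where FiberAsync from the snippet above could simply post into it and each task is spawned as its own fiber.

#include <boost/fiber/all.hpp>
#include <functional>
#include <thread>
#include <utility>

// Sketch of one dedicated fiber thread; FiberAsync could call Post() on a
// long-lived instance of this class.
class FiberThread {
 public:
  FiberThread() : tasks_(64), worker_([this] { Run(); }) {}
  ~FiberThread() {
    tasks_.close();   // unblocks pop() so the worker can drain and exit
    worker_.join();
  }
  void Post(std::function<void()> task) { tasks_.push(std::move(task)); }

 private:
  void Run() {
    std::function<void()> task;
    while (tasks_.pop(task) == boost::fibers::channel_op_status::success) {
      boost::fibers::fiber(std::move(task)).detach();  // one fiber per task
      boost::this_fiber::yield();  // give the new fiber a chance to run
    }
  }
  boost::fibers::buffered_channel<std::function<void()>> tasks_;
  std::thread worker_;
};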

@scv119
Contributor

scv119 commented Jul 9, 2021

I connected with @neverchanje, and he shared with me a lot of positive experiences using fiber. Apparently @neverchanje is a few steps ahead of us in evaluating boost::fiber, and we will exchange more details of the fiber evaluation. I learned from this conversation that boost::fiber might not yet support Apple M1 chips, as boost::fiber uses assembly to implement its usercontext/user-level stacks. This could affect Ray's M1 support. cc @wuisawesome @simon-mo

@simon-mo
Contributor

simon-mo commented Jul 9, 2021

I learned from this conversation that boost::fiber might not yet support Apple M1 chips, as boost::fiber uses assembly to implement its usercontext/user-level stacks.

This is interesting. As a data point, we already use boost::fiber for async actor support: https://github.com/ray-project/ray/blob/master/src/ray/core_worker/fiber.h.

@simon-mo
Contributor

simon-mo commented Jul 9, 2021

boost::fiber might not yet support Apple M1 chips, as boost::fiber uses assembly to implement its usercontext/user-level stacks.

@scv119, actually boost::fiber should work on M1, because boost::fiber has assembly for ARM and M1 is ARM-based. We already use boost::fiber, and Ray already compiles on ARM chips (on Graviton and on M1).

@scv119
Contributor

scv119 commented Jul 9, 2021

That's great to know :)

@neverchanje

I thought it would require some special customization for Mac arm. Good to know it works.

@fishbone
Contributor Author

Thanks @neverchanje for sharing. The only thing that still concerns me is the maturity of the library itself, which we need to evaluate more.

As for debuggability, right now it's hard to list the user fiber stacks since fiber isn't integrated with gdb; only the active ones can be listed. But it won't make things worse: today we use asio, and we can't list its queued tasks either. Btw, we won't have multiple physical threads executing fiber tasks; for simplicity there is only one physical thread for all fiber tasks.

As for refactoring the entire project for adoption, that's not an issue here either, since our program is already written in an async way. We can just add a new future/promise-based API and gradually change the async functions into sync ones. The asynchrony originates on the gRPC side, and it's not difficult to update.

@fishbone
Contributor Author

I'll close this PR for now since we've decided to experiment with GCS services.

@fishbone fishbone closed this Jul 21, 2021
@fishbone fishbone deleted the fiber-cb branch November 23, 2021 05:54
@fishbone fishbone restored the fiber-cb branch November 23, 2021 05:54