[POC][core] GcsClient async binding, aka remove PythonGcsClient. #45289
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ient on it. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
f7b857a to 9438b09
Status Report
I made it to the point where it can replace all of the existing GcsAioClient, and the
Architecture
On completion,
Note on async: the C++ GcsClient requires a boost asio event loop (instrumented_io_context) to work. But the Python API needs GcsClient even without a running Ray worker. So I implemented a
Status Quo
These async methods are supported:
That's all that GcsAioClient needs. These sync methods are supported:
Demo
In this PR, I replaced the GcsAioClient and (part of) the GcsClient implementations with the new binding. I tweaked
Incompatibilities
There are many nuanced API differences between the PythonGcsClient and the C++ GcsClient.
So we need to move all TimeoutError back to RpcError (sad), or we can "update" all RpcError to TimeoutError?
TODOs before merging this:
This PR is a proof of concept that we can do bindings for async C++ APIs. Specifically, we can wrap the C++ ray::gcs::GcsClient async APIs (callback-based) into Python async APIs (async/await).
Why?
On Python gRPC, we now have a very complicated and inconsistent setup. We have:
- grpcio based Client.
- GcsClient -> C++ PythonGcsClient -> C++ grpc::Channel,
- GcsAioClient -> thread pool executor -> PythonGcsClient -> C++ PythonGcsClient -> C++ grpc::Channel,
- Gcs.*Subscriber (sync)
- GcsAio.*Subscriber (async)
All of them talk to the GCS with similar but subtly different APIs. This introduces maintenance overhead, makes debugging harder, and makes it harder to add new features.
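To make the contrast concrete, here is a minimal sketch of the two styles: the executor-based path that GcsAioClient is described as using above, and the callback-to-async/await wrapping this PR proposes. `FakeCallbackClient` and its methods are hypothetical stand-ins written only for illustration; they are not Ray or Cython APIs, and the real binding may look quite different.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeCallbackClient:
    """Hypothetical stand-in for a callback-based C++ client (not a Ray API)."""

    def __init__(self):
        self._next_job_id = 0
        self._lock = threading.Lock()

    def get_next_job_id_sync(self):
        # Blocking call, as a sync binding would expose it.
        with self._lock:
            self._next_job_id += 1
            return self._next_job_id

    def get_next_job_id_async(self, callback):
        # Callback-based call: the result is delivered on another thread,
        # similar to how an io_context thread would deliver it.
        def run():
            callback(self.get_next_job_id_sync())

        threading.Thread(target=run, daemon=True).start()


async def via_executor(client, executor):
    # Executor style: park a blocking call on a worker thread.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, client.get_next_job_id_sync)


async def via_callback(client):
    # Callback style: complete an asyncio future from the callback.
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    def on_done(job_id):
        # Runs on a foreign thread, so hop back onto the event loop thread
        # before touching the future.
        loop.call_soon_threadsafe(fut.set_result, job_id)

    client.get_next_job_id_async(on_done)
    return await fut


async def main():
    client = FakeCallbackClient()
    with ThreadPoolExecutor(max_workers=1) as executor:
        first = await via_executor(client, executor)
    second = await via_callback(client)
    return first, second


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In the callback style no Python thread blocks while the request is in flight, which is the main appeal over the executor path. The sketch ignores error propagation and cancellation, both of which a real binding would need to handle.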
Beyond Python, all these APIs also have slightly different semantics from the C++ GcsClient itself as used by core_worker C++ code. For example,
- In PythonGcsClient::Connect we retry several times, each time recreating a GcsChannel. This is supposed to make reconnection faster by not waiting in the grpc-internal backoff. But this is not applied in the C++ GcsClient or the Python subscribers. In fact, in the C++ GcsClient, we don't manage the channel at all. We use the Ray-wide GcsRpcClient to manage it. Indeed, if we want to "fast recreate" channels, we may want it to be consistently applied to all clients.
- self._gcs_node_info_stub.GetAllNodeInfo call in node_head.py, because they want the full reply whereas the Python GcsClient method only returns partial data the original caller wanted.
What's blocking us?
Async. Cython is not known to be good at binding async APIs. We have a few challenges:
What's in this PR?
A simple "MyGcsClient" that wraps the C++ GcsClient. It has only 1 method to asynchronously return a "next job id". See this:
What's next?
In P2 (not urgent), we need to evaluate whether this is worth proceeding with. We need to answer the following (with my current answers):
Q: In the endgame, what's the benefit?
A: Removal of all the bindings listed in the "Why?" section above, replaced by a single API consistent with the C++ GcsClient. One notable exception: we probably don't want to bind async servers (needs more experiments, risky).
Q: User visible?
A: No. This is a refactor.
Q: Risk?
A: Types and perf costs. I think the async game is derisked.
Q: Effort, large or small?
A: Large. The binding itself is OK-ish, but there are so many callsites to change.