Add brpc support. #10804

Closed
7 of 16 tasks
gongweibao opened this issue May 21, 2018 · 4 comments

@gongweibao
Contributor

gongweibao commented May 21, 2018

  • Add rpc_server and rpc_client interfaces to decouple RPC communication and operators
    • Move sync_mode_ and scope_ from the grpc server to the operators, which are the ones that should care about them.
    • Add some comments about the RPC client interface.
    • Clean up the source directory.
    • Omit the argument ctx in functions such as AsyncSendVariable?
    • Clean up deserialization.
  • Add brpc support
    • Add the brpc framework.
    • Be compatible with both grpc::ByteBuffer and brpc::IOBuf.
    • Add a brpc implementation of rpc_server and rpc_client.
    • Add brpc unit tests.
    • Use fewer channels and more stubs.
    • Work out how to prevent many clients from accessing one server at the same time.
    • Clean up var_is_not_stable.
    • Change the rdma switch from a macro to an if condition.
  • Check tensor size to avoid transferring small pieces of data.
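
As a rough illustration of the first task group, the sketch below shows one way operator code could depend only on an abstract client, so that a grpc or brpc backend can be swapped in behind it. Apart from AsyncSendVariable, which is named in the list above, every name here (RPCClient, Wait, InMemoryRPCClient, SendAll) and every signature is an assumption made for illustration, not the actual PaddlePaddle API. A plain float vector stands in for a tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical transport-agnostic client interface. Names and signatures
// are illustrative only; they are not taken from the PaddlePaddle source.
class RPCClient {
 public:
  virtual ~RPCClient() = default;
  // Queue one variable for sending. Note there is no ctx argument here,
  // following the open question in the task list.
  virtual bool AsyncSendVariable(const std::string& endpoint,
                                 const std::string& var_name,
                                 const std::vector<float>& data) = 0;
  // Block until every queued request has completed.
  virtual bool Wait() = 0;
};

// Stand-in backend that only records calls in memory. A grpc or brpc
// backend would implement the same interface with real network I/O.
class InMemoryRPCClient : public RPCClient {
 public:
  bool AsyncSendVariable(const std::string& endpoint,
                         const std::string& var_name,
                         const std::vector<float>& data) override {
    pending_.emplace_back(endpoint + "/" + var_name, data);
    return true;
  }
  bool Wait() override {
    // "Deliver" everything that was queued, then clear the queue.
    for (const auto& item : pending_) delivered_[item.first] = item.second;
    pending_.clear();
    return true;
  }
  std::size_t PendingCount() const { return pending_.size(); }
  std::size_t DeliveredCount() const { return delivered_.size(); }

 private:
  std::vector<std::pair<std::string, std::vector<float>>> pending_;
  std::map<std::string, std::vector<float>> delivered_;
};

// Operator-side helper: it sees only the interface, never the transport.
bool SendAll(RPCClient* client, const std::string& endpoint,
             const std::map<std::string, std::vector<float>>& vars) {
  assert(client != nullptr);
  for (const auto& kv : vars) {
    if (!client->AsyncSendVariable(endpoint, kv.first, kv.second)) return false;
  }
  return client->Wait();
}
```

With this shape, a unit test can pass an InMemoryRPCClient to SendAll, while production code would pass whichever transport-backed client is configured.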
@typhoonzero
Contributor

"to split data transformation and rpc logic." => to decouple RPC communication and operators

@panyx0718
Contributor

panyx0718 commented May 23, 2018

Overall, I have no objection to trying out brpc. (I'm still not sure whether we should do it now or after 7.5.)

However, given our hard deadline to have good distributed training by 6.20, we need to be very careful about task assignment and timing.

I suggest we allocate 1 RD to trying out brpc, while the others keep optimizing the common distributed training code and the existing grpc.

There are still many things to improve before the 6.20 deadline:
data reading, threading, caching, scheduling, rdma, failure recovery, checkpointing, code robustness, documentation, and testing.

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month. We will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure, and thank you for your support of PaddlePaddle!

@paddle-bot-old

Since you haven't replied for more than a year, we have closed this issue/PR.
If the problem is not solved or you have a follow-up question, please reopen it at any time and we will continue to follow up.
