-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix grpc bugs #7435
Fix grpc bugs #7435
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
|
||
GPR_ASSERT(ok); | ||
PADDLE_ENFORCE(tag); | ||
|
||
// TODO(gongwb): add more retries. | ||
ClientBase* c = static_cast<ClientBase*>(tag); | ||
if (!c->status_.ok()) { | ||
LOG(ERROR) << "proc param error:" << c->var_h_.String(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only log one time for the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
paddle/operators/send_op.cc
Outdated
client_.wait(); | ||
if (!client_.wait()) { | ||
LOG(ERROR) << "send op exit"; | ||
exit(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use exit
in operators, use PADDLE_ENFORCE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -0,0 +1,169 @@ | |||
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please split the fix and the book dist sample in two PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -184,15 +183,20 @@ void AsyncGRPCServer::HandleRequest(bool wait, grpc::ServerCompletionQueue* cq, | |||
break; | |||
} | |||
|
|||
assert(tag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use PADDLE_ENFORCE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
paddle/operators/recv_op.cc
Outdated
@@ -100,6 +100,8 @@ class RecvOp : public framework::OperatorBase { | |||
// TODO(gognwb): simply this loop. | |||
// Get from multiple trainers, we don't care about order in which | |||
// the gradient arrives, just add suffix 0~n then average the gradient. | |||
VLOG(4) << "param_count:" << param_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduce VLOG
appearances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -28,12 +28,15 @@ class RequestBase { | |||
public: | |||
explicit RequestBase(sendrecv::SendRecvService::AsyncService* service, | |||
grpc::ServerCompletionQueue* cq) | |||
: service_(service), cq_(cq), status_(PROCESS) {} | |||
: service_(service), cq_(cq), status_(PROCESS) { | |||
assert(cq_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PADDLE_ENFORCE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
TryToRegisterNewOne(); | ||
delete base; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No other places to release this memory then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's a grpc bug or it's our application bug.
When delete base often, I met an error
VLOG(4) << cq_name << " recv no regular event"; | ||
LOG(WARNING) << cq_name << " recv no regular event:argument name" | ||
<< base->GetReqName(); | ||
// FIXME(gongwb): delete the old one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment does not make things clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't get more context when ok != true
.
paddle/operators/send_op.cc
Outdated
@@ -48,7 +48,10 @@ class SendOp : public framework::OperatorBase { | |||
client_.AsyncGetVariable(epmap[i], ctx, scope, outs[i]); | |||
} | |||
|
|||
client_.wait(); | |||
if (!client_.wait()) { | |||
LOG(ERROR) << "send op exit"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log is too simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detail logs had been logged in functions it calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
Fix #7520 (comment)
Fix grpc/grpc#13983