fix: gRPC segfault due to Low Request Cancellation Timeout #7840
@@ -169,28 +170,61 @@ def test_grpc_async_infer_cancellation_at_step_start(self):
        with open(server_log_name, "r") as f:
            server_log = f.read()

        cancel_at_start_count = len(
Move the counting for "Cancellation notification received" to test.sh.
Expect one "Cancellation notification received" in log per cancellation after the change to grpc infer_handler.
cc @oandreeva-nv
Could you please briefly explain why this move is needed?
After the change, the message "Cancellation notification received" is logged once, which is consistent with other tests. There is already a check for this message in test.sh, but it does not check the number of occurrences, so I modified the check in test.sh to also verify the number of occurrences.
server/qa/L0_request_cancellation/test.sh
Lines 117 to 125 in cd92b05
if [ $count == 0 ]; then
    echo -e "\n***\n*** Cancellation not received by server on $TEST_CASE\n***"
    cat $SERVER_LOG
    RET=1
elif [ $count -ne 1 ]; then
    echo -e "\n***\n*** Unexpected cancellation received by server on $TEST_CASE. Expected 1 but received $count.\n***"
    cat $SERVER_LOG
    RET=1
fi
src/grpc/infer_handler.h
Outdated
@@ -1076,6 +1081,21 @@ class InferHandlerState {
    if (pstr != nullptr) {
      delay_process_ms_ = atoi(pstr);
    }
    const char* nstr = getenv("TRITONSERVER_DELAY_GRPC_NOTIFICATION");
Will refactor variable names, e.g. nstr, in another gRPC improvement PR.
Would it be better to rename them now? Then there's no need to go back and refactor.
I kept the naming style to avoid distracting changes to the lines above. Another PR is incoming in which I will add more logging to the gRPC handler for easier debugging, and rename the variables.
This part is refactored.
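The hunk under discussion reads a test-only delay from the environment with getenv and atoi. A minimal sketch of that pattern follows, assuming two hypothetical helpers, ParseDelayMs and ReadDelayMs, that are not part of the PR. It uses strtol instead of atoi so that malformed, negative, or out-of-range values fall back to 0 rather than being silently accepted.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper (not in the PR): convert a decimal string to a
// millisecond delay. Returns 0 for null, empty, malformed, negative, or
// out-of-range input.
inline int ParseDelayMs(const char* value)
{
  if (value == nullptr || *value == '\0') {
    return 0;
  }
  errno = 0;
  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  const bool malformed = (end == value) || (*end != '\0');
  if (errno != 0 || malformed || parsed < 0 || parsed > INT_MAX) {
    return 0;
  }
  return static_cast<int>(parsed);
}

// Hypothetical helper (not in the PR): read the delay from the named
// environment variable, treating an unset variable as "no delay".
inline int ReadDelayMs(const char* name)
{
  assert(name != nullptr);
  return ParseDelayMs(std::getenv(name));
}
```

A caller could then write something like `ReadDelayMs("TRITONSERVER_DELAY_GRPC_NOTIFICATION")` once per variable, which keeps the parsing rules in one place regardless of how the local variables end up being named.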
src/grpc/infer_handler.cc
Outdated
What does the PR do?
Handles multiple corner cases under Low Request Cancellation Timeout.
Please go into more detail in the PR description on what the corner cases were, and how they were addressed by this PR.
+1
I can give it a try... They are not easy to describe, though, but the two new test cases should explain them.
perplexity might help to clarify your thoughts =)
Description updated.
@@ -1529,7 +1568,7 @@ class ModelInferHandler

  protected:
   void StartNewRequest() override;
-  bool Process(State* state, bool rpc_ok) override;
+  bool Process(State* state, bool rpc_ok, bool is_notification) override;
For parity with stream_infer_handler, let's set is_notification to false by default.
Process is called at one place in infer_handler.h, where is_notification is always passed. The default parameter is never used in Process in infer_handler.h. Do you still think we should make this change?
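One C++ detail bears on this exchange: default arguments on virtual functions are chosen from the static type of the call expression, not from the override that actually runs. The sketch below uses hypothetical stand-in types (HandlerBase and ModelHandler are not Triton's classes) to show the effect. If a default were added, it would need to be identical on the base declaration and every override to avoid surprises.

```cpp
#include <cassert>

// Hypothetical stand-ins for the handler classes; not Triton's real types.
struct HandlerBase {
  virtual ~HandlerBase() = default;
  // Returns the value of is_notification that the callee actually received.
  virtual bool Process(bool rpc_ok, bool is_notification = false)
  {
    (void)rpc_ok;
    return is_notification;
  }
};

struct ModelHandler : HandlerBase {
  // A different default here is legal C++, but it only applies when the call
  // is made through a ModelHandler-typed expression.
  bool Process(bool rpc_ok, bool is_notification = true) override
  {
    (void)rpc_ok;
    return is_notification;
  }
};

// Calls through the base type, as a shared event loop might.
inline bool CallThroughBase(HandlerBase& handler)
{
  return handler.Process(/*rpc_ok=*/true);
}

inline void RunDefaultArgumentExample()
{
  ModelHandler handler;
  // Dynamic dispatch selects ModelHandler::Process, but the default argument
  // comes from HandlerBase, the static type used at the call site.
  assert(CallThroughBase(handler) == false);
  // Calling through the derived type picks up the derived default instead.
  assert(handler.Process(/*rpc_ok=*/true) == true);
}
```

When every call site passes is_notification explicitly, as described in the reply above, this mismatch cannot arise, which is one argument for leaving the parameter without a default.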
… yinggeh-DLIS-7059-grpc-cancellation-segfault
What does the PR do?
Handles multiple corner cases under Low Request Cancellation Timeout.
Case 1:
- […] ISSUED.
- […] Process function, the inference request finishes and invokes InferResponseComplete. Since the callback thread locks the mutex first, the cancellation thread is blocked at the entry to the Process function.
- In InferResponseComplete, the function checks whether there is a cancellation. If so, the status step is changed to CANCELLED. The state is put back on the gRPC completion queue to be released later.
- The InferResponseComplete callback finishes and the cancellation thread can enter the Process function. Since the context is canceled, Process returns false to release the state and set context_ = nullptr.
- […] context_->IsCancelled in state->IsGrpcContextCancelled(), which causes the segfault.
Solution: Handle this case in HandleCancellation specifically. See here.
Case 2:
- […] InferResponseComplete callback after finishing processing.
- […] Finish, a cancellation is received. The callback does not catch the cancellation and puts the status into the completion queue for steps COMPLETE and FINISH.
- […] context_ = nullptr.
- […] context_->IsCancelled in state->IsGrpcContextCancelled(), which causes the segfault.
Solution: This corner case was found while fixing the first one. Since the previous logic did not run into this case, I reverted some of the changes made for the case 1 fix, and case 2 now passes.
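Both cases end the same way: a state that has already been released, with context_ set to nullptr, is later asked whether its context was cancelled. The sketch below illustrates that failure mode and one defensive check. It is not the PR's fix, which handles the case in HandleCancellation. FakeContext and FakeState are hypothetical stand-ins for the gRPC server context and the handler state, and the mutex and completion queue are left out to keep the example deterministic.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the gRPC server context.
struct FakeContext {
  bool cancelled = false;
  bool IsCancelled() const { return cancelled; }
};

// Hypothetical stand-in for the handler state; not Triton's InferHandlerState.
class FakeState {
 public:
  FakeState() : context_(std::make_unique<FakeContext>()) {}

  // Simulates the client cancelling the request.
  void CancelContext()
  {
    if (context_ != nullptr) {
      context_->cancelled = true;
    }
  }

  // Simulates the release path, which sets context_ = nullptr.
  void Release() { context_.reset(); }

  // Without the null check, a call after Release() would dereference a null
  // pointer, which is the crash described in both cases.
  bool IsGrpcContextCancelled() const
  {
    return context_ != nullptr && context_->IsCancelled();
  }

 private:
  std::unique_ptr<FakeContext> context_;
};

inline void RunReleasedStateExample()
{
  FakeState state;
  state.CancelContext();
  assert(state.IsGrpcContextCancelled());
  // The state is released first; a later event from the queue still asks
  // about cancellation and must not touch the freed context.
  state.Release();
  assert(!state.IsGrpcContextCancelled());
}
```

A guard like this only hides the ordering problem. Deciding which thread owns the release, as the description above does for each case, is what removes it.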
Checklist
<commit_type>: <Title>
Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
Where should the reviewer start?
Test plan:
L0_request_cancellation--base
L0_grpc_state_cleanup--base
L0_grpc_error_state_cleanup--base
21060113
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)