Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: gRPC segfault due to Low Request Cancellation Timeout #7840

Merged
merged 8 commits into from
Dec 9, 2024

Conversation

yinggeh
Copy link
Contributor

@yinggeh yinggeh commented Nov 27, 2024

What does the PR do?

Handles multiple corner cases under Low Request Cancellation Timeout.
Case 1:

  1. Request received by the server and backend processes as usual. Current status step is ISSUED.
  2. A cancellation is received by server. Before the cancellation thread entering Process function, the inference request finishes and invokes InferResponseComplete. Since the callback thread locks the mutex first, the cancellation thread is blocked at the entry to Process function.
  3. In InferResponseComplete, the function checks if there is a cancellation. If yes, status step is changed to CANCELLED. The state is put back to gRPC completion queue to release later.
  4. Now InferResponseComplete callback is done and the cancellation thread can enter the Process function. Since the context is canceled, Process returns false to release the state and set context_ = nullptr.
  5. Finally, the state we put back to the gRPC completion queue is processed and calls context_->IsCancelled in state->IsGrpcContextCancelled(), which causes the segfault.

Solution: Handle this case in HandleCancellation specifically. See here.

Case 2:

  1. Request received by the server and invoked InferResponseComplete callback after finishing processing.
  2. During the callback after checking cancellation and before Finish, a cancellation is received. The callback does not catch the cancellation and put the status into completion queue for step COMPLETE and FINISH.
  3. The cancellation thread releases the state and set context_ = nullptr.
  4. Finally, the state from step 3 is processed and calls context_->IsCancelled in state->IsGrpcContextCancelled(), which causes the segfault.

Solution: This corner case was found during the fix to the first one. Since the previous logic did not run into this case, I reverted some changes in case 1 fix and now case 2 passes.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • fix

Related PRs:

Where should the reviewer start?

Test plan:

L0_request_cancellation--base
L0_grpc_state_cleanup--base
L0_grpc_error_state_cleanup--base

  • CI Pipeline ID:
    21060113

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

@yinggeh yinggeh added the PR: fix A bug fix label Nov 27, 2024
@yinggeh yinggeh requested a review from tanmayv25 November 27, 2024 21:31
@yinggeh yinggeh self-assigned this Nov 27, 2024
@yinggeh yinggeh requested a review from oandreeva-nv November 27, 2024 21:37
@@ -169,28 +170,61 @@ def test_grpc_async_infer_cancellation_at_step_start(self):
with open(server_log_name, "r") as f:
server_log = f.read()

cancel_at_start_count = len(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the counting for "Cancellation notification received" to test.sh.
Expect one "Cancellation notification received" in log per cancellation after the change to grpc infer_handler.
cc @oandreeva-nv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ould you please briefly explain why this move is needed?

Copy link
Contributor Author

@yinggeh yinggeh Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the change, the message "Cancellation notification received" is logged once, which is consistent with other tests. There is an existing check for this message in test.sh already but it does not check the number of occurances. So I just modify the check in test.sh to also check for number of occurances.

if [ $count == 0 ]; then
echo -e "\n***\n*** Cancellation not received by server on $TEST_CASE\n***"
cat $SERVER_LOG
RET=1
elif [ $count -ne 1 ]; then
echo -e "\n***\n*** Unexpected cancellation received by server on $TEST_CASE. Expected 1 but received $count.\n***"
cat $SERVER_LOG
RET=1
fi

@@ -1076,6 +1081,21 @@ class InferHandlerState {
if (pstr != nullptr) {
delay_process_ms_ = atoi(pstr);
}
const char* nstr = getenv("TRITONSERVER_DELAY_GRPC_NOTIFICATION");
Copy link
Contributor Author

@yinggeh yinggeh Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will refactor variable names, e.g. nstr, in another gRPC improvement PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this be better to re-name them now? then there's no need to go back with refactor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the naming style to avoid changes to lines above which look distracting. There will be another PR incoming which I will add more loggings to the gRPC handler that allows easier debugging, as well as variables renaming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is refactored.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the PR do?

Handles multiple corner cases under Low Request Cancellation Timeout.

Please go into more detail in the PR description on what the corner cases were, and how they were addressed by this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give a try... They are not easy to describle though but two new test cases should explain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perplexity might help to clarify your thoughts =)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description updated.

@@ -1529,7 +1568,7 @@ class ModelInferHandler

protected:
void StartNewRequest() override;
bool Process(State* state, bool rpc_ok) override;
bool Process(State* state, bool rpc_ok, bool is_notification) override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for parity with stream_infer_handler, let's set is_notification by default to false

Copy link
Contributor Author

@yinggeh yinggeh Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Process is called at one place in infer_handler.h where is_notification is always passed. The default parameter is never used in Process in infer_handler.h. Do you still think we should make this change?

src/grpc/infer_handler.h Outdated Show resolved Hide resolved
@yinggeh yinggeh requested a review from tanmayv25 December 6, 2024 23:57
… yinggeh-DLIS-7059-grpc-cancellation-segfault
src/grpc/infer_handler.h Outdated Show resolved Hide resolved
src/grpc/infer_handler.h Show resolved Hide resolved
src/grpc/infer_handler.h Outdated Show resolved Hide resolved
src/grpc/infer_handler.h Outdated Show resolved Hide resolved
@yinggeh yinggeh requested a review from tanmayv25 December 7, 2024 01:31
@yinggeh yinggeh merged commit f5e4f69 into main Dec 9, 2024
3 checks passed
@yinggeh yinggeh deleted the yinggeh-DLIS-7059-grpc-cancellation-segfault branch December 9, 2024 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: fix A bug fix
Development

Successfully merging this pull request may close these issues.

4 participants