
Fix error handling for GPU tensors #249

Merged 8 commits into main from imant-fix-gpu-bug on Jun 6, 2023

Conversation

@Tabrizian Tabrizian (Member) commented May 25, 2023

Bug

The problem was that the inference_response->Send method did not properly handle the error case. Calling GUARDED_RESPOND_IF_ERROR after Send would cause a double free, because the response was sent twice.

Another issue existed in the GPU tensor error handling, where the error was not properly caught and returned to the stub process.

Fix

For the first issue, the fix was to simplify the error handling and send errors only through the inference response.

For the second issue, new data structures were added to properly catch the error.

triton-inference-server/server#5871

@Tabrizian Tabrizian force-pushed the imant-fix-gpu-bug branch 2 times, most recently from 00df006 to b4946bd Compare May 29, 2023 22:38
@Tabrizian Tabrizian marked this pull request as ready for review May 29, 2023 22:40
@rmccorm4 (Contributor)

Can you summarize the bug and fix in the description?

@Tabrizian (Member, Author)

@rmccorm4 added.

@Tabrizian Tabrizian force-pushed the imant-fix-gpu-bug branch from b4946bd to 4e25872 Compare May 30, 2023 13:57
src/pb_stub.cc (resolved)
src/gpu_buffers.cc (resolved)
src/infer_request.cc (resolved)
src/python_be.cc (resolved)
src/response_sender.cc (resolved)
@Tabrizian Tabrizian requested review from rmccorm4 and krishung5 June 1, 2023 14:28
@Tabrizian Tabrizian force-pushed the imant-fix-gpu-bug branch from f68d168 to fd3df73 Compare June 1, 2023 14:56
@Tabrizian Tabrizian force-pushed the imant-fix-gpu-bug branch from b3c3d9f to 73e902f Compare June 1, 2023 19:09
uint32_t buffer_count;
};

class GPUBufferTransporter {
Contributor:

Is the term Transporter common - not familiar - maybe a short one or two line class description would help

Member Author:

Added comments.

void
GPUBufferTransporter::Complete(std::unique_ptr<SharedMemoryManager>& shm_pool)
{
if (completed_) {
Contributor:

Should this be an error case as above with adding a buffer to a completed transaction?

src/pb_stub.cc (resolved)
src/pb_utils.h Outdated
@@ -212,23 +212,17 @@ struct ResponseSenderBase {
struct ResponseSendMessage : ResponseSenderBase {
bi::managed_external_buffer::handle_t response;

// GPU Buffers handle
// A pointer to GPUBuffersShm object.
Contributor:

Seems like we could keep this comment the same if the structure was renamed to GPUBuffers - is it still a handle, or is that separate from a pointer?

Member Author:

By pointer I meant a handle to a GPUBuffersShm. Will update the comment.

src/python_be.cc Outdated
@@ -661,46 +660,33 @@ ModelInstanceState::ExecuteBLSRequest(
lbackend_memory.reset(backend_memory);
input_tensor->SetMemory(std::move(PbMemory::Create(
Stub()->ShmPool(), std::move(lbackend_memory))));
gpu_buffer_transporter.AddBuffer(
Contributor:

Instead of transporter, would response work? GPUBuffers and GPUBuffersResponse instead of GPUBuffersShm and GPUBufferTransporter?

Member Author:

I like transporter more, as I think "response" may imply that it can only be used with the backend responses, which is not true. Let me know if you have any other suggestions.

src/python_be.cc (resolved)
@nnshah1 (Contributor) commented Jun 2, 2023

I don't quite follow the logic of when we use the GPUBufferTransporter and when we use the GPUBuffersShm structure directly - but I think that's from my lack of familiarity with the code.

@rmccorm4 (Contributor) left a comment:

LGTM other than minor comment

@Tabrizian Tabrizian force-pushed the imant-fix-gpu-bug branch from e661853 to 3779acb Compare June 5, 2023 17:41
nnshah1
nnshah1 previously approved these changes Jun 5, 2023
krishung5
krishung5 previously approved these changes Jun 5, 2023
@krishung5 (Contributor) left a comment:

LGTM!

@Tabrizian Tabrizian dismissed stale reviews from krishung5 and nnshah1 via e835269 June 5, 2023 21:35
@Tabrizian Tabrizian merged commit 0a54e59 into main Jun 6, 2023
@Tabrizian Tabrizian deleted the imant-fix-gpu-bug branch August 10, 2023 19:40
4 participants